Last update: 2025-09-06
About me
I’m a researcher in computational linguistics based in Slovenia. My main research areas are corpus linguistics and natural language processing. I apply quantitative and statistical methods to linguistic data and compile language resources and datasets for Slovene and other languages.
So far, I’ve worked in several different subfields, including computational morphology, digital lexicography, research on spoken language and computer-mediated communication, computational dialectometry, quantitative syntax analysis, computational phraseology, and large language model development (particularly through the compilation of instruction-tuning datasets).
I’m strongly in favor of empirical, data-driven research that produces open-access and directly applicable results for the benefit of everyone.
Main Research Interests
- corpus linguistics
- computational linguistics
- natural language processing
Education
- PhD in Translation Studies (2021), Faculty of Arts, University of Ljubljana (Supervisor: Darja Fišer, Co-supervisor: Vojko Gorjanc)
- MA in Translation (2014), Faculty of Arts, University of Ljubljana (Supervisor: Silvana Orel Kos)
- BA in Interlingual Communication (2009), Faculty of Arts, University of Ljubljana (Supervisor: Silvana Orel Kos)
Coding
- Python
- R
- I’ve also used some JavaScript on occasion =)
Workshops and Summer Schools
- 29th European Summer School in Logic, Language and Information (ESSLLI) 2017, Toulouse, France [LINK]
- ReLDI Seminar on Experimental and Corpus Methods in Linguistic Research, 2016, Belgrade, Serbia [LINK]
- 28th European Summer School in Logic, Language and Information (ESSLLI) 2016, Bolzano, Italy [LINK]
- Quantitative Methods in Humanistic Studies, 2016, Copenhagen, Denmark
- Lancaster Summer Schools in Corpus Linguistics, 2016, Lancaster, United Kingdom [LINK]
- 27th European Summer School in Logic, Language and Information (ESSLLI) 2015, Barcelona, Spain [LINK]
Affiliations
2017–present | Centre for Language Resources and Technologies, University of Ljubljana, Slovenia
2014–present | Faculty of Arts, University of Ljubljana, Slovenia
2017–present | Faculty of Computer and Information Science, University of Ljubljana, Slovenia
2017–present | Jožef Stefan Institute, Ljubljana, Slovenia
2017–2018 | Trojina, Institute for Applied Slovene Studies, Ljubljana, Slovenia
National projects
Projects funded by the Slovenian Research and Innovation Agency (ARIS)
Duration | Title | Website
2024–present | LLM4DH – Large Language Models for Digital Humanities (GC-0002) | 🔗
2022–present | MEZZANINE – Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642) | 🔗
2017–2020 | NSSSS – New Grammar of Contemporary Standard Slovene: Sources and Methods (J6-8256) | 🔗
2017–2020 | KOLOS – Collocations as a Basis for Language Description: Semantic and Temporal Perspectives (J6-8255) | 🔗
2017–2020 | FRENK – Resources, Methods, and Tools for the Understanding, Identification, and Classification of Various Forms of Socially Unacceptable Discourse in the Information Society (J7-8280) | 🔗
2016–2017 | KAS – Slovene Scientific Texts: Resources and Description (J6-7094) | 🔗
2014–2018 | JANES – Resources, Tools and Methods for the Research of Nonstandard Internet Slovene (J6-6842) | 🔗
Projects funded by the Ministry of Culture of the Republic of Slovenia
Duration | Title | Website
2023–2023 | SEMTEH – Interconnected Knowledge Databases for Semantic Technologies | -
2022–2023 | SLOKIT – CLARIN.SI Upgrade: Corpus Presenter and Text Analyzer | 🔗
2020–2023 | RSDO – Development of Slovene in a Digital Environment | 🔗
2021–2022 | SLED – Monitor Corpus of Slovene and Monitoring Language Resources | 🔗
2017–2018 | Propomenke – The Thesaurus of Modern Slovene: By the Community for the Community | 🔗
Projects funded by the Ministry of Higher Education, Science and Innovation of the Republic of Slovenia
Duration | Title | Website
2023–present | PoVeJMo – Adaptive Natural Language Processing with Large Language Models | 🔗
Projects funded by the University of Ljubljana
Duration | Title | Website
2025–present | Improvement of Accessibility of Slovene Scientific and Expert Terminology with a Targeted Technologically Supported Campaign and Inclusion in University Education | -
2021–2024 | ON – Online Notes: Upgrading a Machine Translation System for Learning Communities (and Special Needs Students) | 🔗
CLARIN.SI Projects
Duration | Title | Website
2022 | Compiling the Slovene SI-NLI Dataset for Natural Language Inference | 🔗
2018 | Accentuation of Sloleks | 🔗
2018 | LIST – A Tool for an Effective Analysis of Slovene Corpora | 🔗
International Projects
Horizon 2020 Projects
Duration | Title | Website
2018–2022 | ELEXIS – European Lexicographic Infrastructure | 🔗
2016–2018 | MIME – Mobility and Inclusion in Multilingual Europe | 🔗
Bilateral Projects
Duration | Title | Website
2023–2025 | SI-HR: Automatic Identification of Semantic Relations in Figurative Context in Croatian and Slovene | -
2021–2023 | SI-IN: Cross-lingual Knowledge Transfer for Social Media Analysis in Less-Resourced Languages During COVID-19 | -
COST Actions
Duration | Title | Website
2023–present | UniDive – Universality, diversity and idiosyncrasy in language technology (COST Action CA21167) | 🔗
2017–2021 | EnetCollect – The European Network for Combining Language Learning with Crowdsourcing Techniques (COST Action CA16105) | 🔗
2017 | PARSEME – Parsing and Multi-Word Expressions. Towards Linguistic Precision and Computational Efficiency in Natural Language Processing (COST Action IC1207) | 🔗
Other Research Activities
Selected Contributions to Slovene Language Resources
Resource | Role | Link
Sloleks Morphological Lexicon of Slovene | Lead Editor | 🔗
Digital Dictionary Database of Slovene | Co-editor | 🔗
The Thesaurus of Modern Slovene | Co-editor | 🔗
Collocations Dictionary of Modern Slovene | Co-editor | 🔗
Gigafida Corpus of Written Standard Slovene | Co-author | 🔗
JANES Corpus of Internet Slovene | Co-author | 🔗
Trendi Monitor Corpus of Slovene | Co-author | 🔗
Datasets
A selected list of datasets I’ve authored or contributed to is shown below. Please check CLARIN.SI for more.
Dataset | Link
Open Slovene WordNet OSWN 1.0 | 🔗
Parallel sense-annotated corpus ELEXIS-WSD 1.3 | 🔗
Slovene Natural Language Inference Dataset SI-NLI | 🔗
English translation of the Slovene Natural Language Inference Dataset SI-NLI-en 1.0 | 🔗
“Choice of plausible alternatives” datasets in South Slavic dialects DIALECT-COPA | 🔗
Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0 | 🔗
List of word relations from the Sloleks 2.0 lexicon 1.1 | 🔗
Dataset of Slovene word formation trees ArboSloleks 1.0 | 🔗
Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0 | 🔗
Lists of Slovene accentuated units SNES 1.0 | 🔗
Reviewer
Publications
For my complete and up-to-date bibliography, please check COBISS.