Jaka Čibej

Last update: 2025-09-06

About Me

I’m a researcher in computational linguistics based in Slovenia. My main research areas are corpus linguistics and natural language processing. I focus on applying quantitative and statistical methods to linguistic data, and I compile language resources and datasets for Slovene and other languages.

So far, I’ve worked in several different subfields, including computational morphology, digital lexicography, research on spoken language and computer-mediated communication, computational dialectometry, quantitative syntax analysis, computational phraseology, and large language model development (particularly through the compilation of instruction-tuning datasets).

I’m strongly in favor of empirical, data-driven research that produces open-access, directly applicable results for the benefit of everyone.

Main Research Interests

corpus linguistics
computational linguistics
natural language processing

Education

PhD in Translation Studies (2021), Faculty of Arts, University of Ljubljana (Supervisor: Darja Fišer, Co-supervisor: Vojko Gorjanc)
MA in Translation (2014), Faculty of Arts, University of Ljubljana (Supervisor: Silvana Orel Kos)
BA in Interlingual Communication (2009), Faculty of Arts, University of Ljubljana (Supervisor: Silvana Orel Kos)

Coding

Python
R
I’ve also used some JavaScript on occasion =)

Workshops and Summer Schools

29th European Summer School in Logic, Language and Information (ESSLLI) 2017, Toulouse, France [LINK]
ReLDI Seminar on Experimental and Corpus Methods in Linguistic Research, 2016, Belgrade, Serbia [LINK]
28th European Summer School in Logic, Language and Information (ESSLLI) 2016, Bolzano, Italy [LINK]
Quantitative Methods in Humanistic Studies, 2016, Copenhagen, Denmark
Lancaster Summer Schools in Corpus Linguistics, 2016, Lancaster, United Kingdom [LINK]
27th European Summer School in Logic, Language and Information (ESSLLI) 2015, Barcelona, Spain [LINK]

Affiliations

2017–present	Centre for Language Resources and Technologies, University of Ljubljana, Slovenia
2014–present	Faculty of Arts, University of Ljubljana, Slovenia
2017–present	Faculty of Computer and Information Science, University of Ljubljana, Slovenia
2017–present	Jožef Stefan Institute, Ljubljana, Slovenia
2017–2018	Trojina, Institute for Applied Slovene Studies, Ljubljana, Slovenia

National Projects

Projects funded by the Slovenian Research and Innovation Agency (ARIS)

Duration	Title	Website
2024–present	LLM4DH – Large Language Models for Digital Humanities (GC-0002)	🔗
2022–present	MEZZANINE – Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642)	🔗
2017–2020	NSSSS – New Grammar of Contemporary Standard Slovene: Sources and Methods (J6-8256)	🔗
2017–2020	KOLOS – Collocations as a Basis for Language Description: Semantic and Temporal Perspectives (J6-8255)	🔗
2017–2020	FRENK – Resources, Methods, and Tools for the Understanding, Identification, and Classification of Various Forms of Socially Unacceptable Discourse in the Information Society (J7-8280)	🔗
2016–2017	KAS – Slovene Scientific Texts: Resources and Description (J6-7094)	🔗
2014–2018	JANES – Resources, Tools and Methods for the Research of Nonstandard Internet Slovene (J6-6842)	🔗

Projects funded by the Ministry of Culture of the Republic of Slovenia

Duration	Title	Website
2023	SEMTEH – Interconnected Knowledge Databases for Semantic Technologies	-
2022–2023	SLOKIT – CLARIN.SI Upgrade: Corpus Presenter and Text Analyzer	🔗
2020–2023	RSDO – Development of Slovene in a Digital Environment	🔗
2021–2022	SLED – Monitor Corpus of Slovene and Monitoring Language Resources	🔗
2017–2018	Propomenke – The Thesaurus of Modern Slovene: By the Community for the Community	🔗

Projects funded by the Ministry of Higher Education, Science and Innovation of the Republic of Slovenia

Duration	Title	Website
2023–present	PoVeJMo – Adaptive Natural Language Processing with Large Language Models	🔗

Projects funded by the University of Ljubljana

Duration	Title	Website
2025–present	Improvement of Accessibility of Slovene Scientific and Expert Terminology with a Targeted Technologically Supported Campaign and Inclusion in University Education	-
2021–2024	ON – Online Notes: Upgrading a Machine Translation System for Learning Communities (and Special Needs Students)	🔗

CLARIN.SI Projects

Duration	Title	Website
2022	Compiling the Slovene SI-NLI Dataset for Natural Language Inference	🔗
2018	Accentuation of Sloleks	🔗
2018	LIST – A Tool for an Effective Analysis of Slovene Corpora	🔗

International Projects

Horizon 2020 Projects

Duration	Title	Website
2018–2022	ELEXIS – European Lexicographic Infrastructure	🔗
2016–2018	MIME – Mobility and Inclusion in Multilingual Europe	🔗

Bilateral Projects

Duration	Title	Website
2023–2025	SI-HR: Automatic Identification of Semantic Relations in Figurative Context in Croatian and Slovene	-
2021–2023	SI-IN: Cross-lingual Knowledge Transfer for Social Media Analysis in Less-Resourced Languages During COVID-19	-

COST Actions

Duration	Title	Website
2023–present	UniDive – Universality, diversity and idiosyncrasy in language technology (COST Action CA21167)	🔗
2017–2021	EnetCollect – The European Network for Combining Language Learning with Crowdsourcing Techniques (COST Action CA16105)	🔗
2017	PARSEME – Parsing and Multi-Word Expressions. Towards Linguistic Precision and Computational Efficiency in Natural Language Processing (COST Action IC1207)	🔗

Other Research Activities

Selected Contributions to Slovene Language Resources

Resource	Role	Link
Sloleks Morphological Lexicon of Slovene	Lead Editor	🔗
Digital Dictionary Database of Slovene	Co-editor	🔗
The Thesaurus of Modern Slovene	Co-editor	🔗
Collocations Dictionary of Modern Slovene	Co-editor	🔗
Gigafida Corpus of Written Standard Slovene	Co-author	🔗
JANES Corpus of Internet Slovene	Co-author	🔗
Trendi Monitor Corpus of Slovene	Co-author	🔗

Datasets

A selected list of datasets I’ve authored or contributed to is shown below. Please check CLARIN.SI for more.

Dataset	Link
Open Slovene WordNet OSWN 1.0	🔗
Parallel sense-annotated corpus ELEXIS-WSD 1.3	🔗
Slovene Natural Language Inference Dataset SI-NLI	🔗
English translation of the Slovene Natural Language Inference Dataset SI-NLI-en 1.0	🔗
“Choice of plausible alternatives” datasets in South Slavic dialects DIALECT-COPA	🔗
Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0	🔗
List of word relations from the Sloleks 2.0 lexicon 1.1	🔗
Dataset of Slovene word formation trees ArboSloleks 1.0	🔗
Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0	🔗
Lists of Slovene accentuated units SNES 1.0	🔗

Reviewer

International Journal of Lexicography
Slovenščina 2.0
Jezikoslovni zapiski
STRIDON: Journal of Studies in Translation and Interpreting
Language Technologies and Digital Humanities Conference (JT-DH) - 2022, 2024
Language Resources and Evaluation Conference (LREC) - 2024
Digital Inclusivity (DIGIN) - 2024, 2025
1st UniDive Training School (Technical University of Moldova, Chișinău, Moldova)
2nd UniDive Training School (Yerevan State University, Yerevan, Armenia)

Publications

For my complete and up-to-date bibliography, please check COBISS.