Jaka Čibej

Researcher in Computational Linguistics

SICRIS | Google Scholar | ORCID | LinkedIn | ResearchGate

Contact:
jaka.cibej@ff.uni-lj.si

Pronunciation:
IPA: /ˈjaːka ʧiˈbɛːɪ/
English Phonetic Spelling: YA-ka chi-BAY

Last update: 2025-09-06

About me

I’m a researcher in computational linguistics based in Slovenia. My main research areas are corpus linguistics and natural language processing. I focus on applying quantitative and statistical methods to linguistic data, and I compile language resources and datasets for Slovene and other languages.

So far, I’ve worked in several different subfields, including computational morphology, digital lexicography, research on spoken language and computer-mediated communication, computational dialectometry, quantitative syntax analysis, computational phraseology, and large language model development (particularly through the compilation of instruction-tuning datasets).

I’m strongly in favor of empirical, data-driven research that yields open-access, directly applicable results for the benefit of everyone.

Main Research Interests

Education

Coding

Workshops and Summer Schools

Affiliations

2017–present Centre for Language Resources and Technologies, University of Ljubljana, Slovenia
2014–present Faculty of Arts, University of Ljubljana, Slovenia
2017–present Faculty of Computer and Information Science, University of Ljubljana, Slovenia
2017–present Jožef Stefan Institute, Ljubljana, Slovenia
2017–2018 Trojina, Institute for Applied Slovene Studies, Ljubljana, Slovenia

National Projects

Projects funded by the Slovenian Research and Innovation Agency (ARIS)

Duration Title Website
2024–present LLM4DH – Large Language Models for Digital Humanities (GC-0002) 🔗
2022–present MEZZANINE – Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language (J7-4642) 🔗
2017–2020 NSSSS – New Grammar of Contemporary Standard Slovene: Sources and Methods (J6-8256) 🔗
2017–2020 KOLOS – Collocations as a Basis for Language Description: Semantic and Temporal Perspectives (J6-8255) 🔗
2017–2020 FRENK – Resources, Methods, and Tools for the Understanding, Identification, and Classification of Various Forms of Socially Unacceptable Discourse in the Information Society (J7-8280) 🔗
2016–2017 KAS – Slovene Scientific Texts: Resources and Description (J6-7094) 🔗
2014–2018 JANES – Resources, Tools and Methods for the Research of Nonstandard Internet Slovene (J6-6842) 🔗

Projects funded by the Ministry of Culture of the Republic of Slovenia

Duration Title Website
2023 SEMTEH – Interconnected Knowledge Databases for Semantic Technologies -
2022–2023 SLOKIT – CLARIN.SI Upgrade: Corpus Presenter and Text Analyzer 🔗
2020–2023 RSDO – Development of Slovene in a Digital Environment 🔗
2021–2022 SLED – Monitor Corpus of Slovene and Monitoring Language Resources 🔗
2017–2018 Propomenke – The Thesaurus of Modern Slovene: By the Community for the Community 🔗

Projects funded by the Ministry of Higher Education, Science and Innovation of the Republic of Slovenia

Duration Title Website
2023–present PoVeJMo – Adaptive Natural Language Processing with Large Language Models 🔗

Projects funded by the University of Ljubljana

Duration Title Website
2025–present Improvement of Accessibility of Slovene Scientific and Expert Terminology with a Targeted Technologically Supported Campaign and Inclusion in University Education -
2021–2024 ON – Online Notes: Upgrading a Machine Translation System for Learning Communities (and Special Needs Students) 🔗

CLARIN.SI Projects

Duration Title Website
2022 Compiling the Slovene SI-NLI Dataset for Natural Language Inference 🔗
2018 Accentuation of Sloleks 🔗
2018 LIST – A Tool for an Effective Analysis of Slovene Corpora 🔗

International Projects

Horizon 2020 Projects

Duration Title Website
2018–2022 ELEXIS – European Lexicographic Infrastructure 🔗
2016–2018 MIME – Mobility and Inclusion in Multilingual Europe 🔗

Bilateral Projects

Duration Title Website
2023–2025 SI-HR: Automatic Identification of Semantic Relations in Figurative Context in Croatian and Slovene -
2021–2023 SI-IN: Cross-lingual Knowledge Transfer for Social Media Analysis in Less-Resourced Languages During COVID-19 -

COST Actions

Duration Title Website
2023–present UniDive – Universality, diversity and idiosyncrasy in language technology (COST Action CA21167) 🔗
2017–2021 EnetCollect – The European Network for Combining Language Learning with Crowdsourcing Techniques (COST Action CA16105) 🔗
2017 PARSEME – Parsing and Multi-Word Expressions. Towards Linguistic Precision and Computational Efficiency in Natural Language Processing (COST Action IC1207) 🔗

Other Research Activities

Selected Contributions to Slovene Language Resources

Resource Role Link
Sloleks Morphological Lexicon of Slovene Lead Editor 🔗
Digital Dictionary Database of Slovene Co-editor 🔗
The Thesaurus of Modern Slovene Co-editor 🔗
Collocations Dictionary of Modern Slovene Co-editor 🔗
Gigafida Corpus of Written Standard Slovene Co-author 🔗
JANES Corpus of Internet Slovene Co-author 🔗
Trendi Monitor Corpus of Slovene Co-author 🔗

Datasets

A selection of the datasets I’ve authored or contributed to is listed below. Please check CLARIN.SI for more.

Dataset Link
Open Slovene WordNet OSWN 1.0 🔗
Parallel sense-annotated corpus ELEXIS-WSD 1.3 🔗
Slovene Natural Language Inference Dataset SI-NLI 🔗
English translation of the Slovene Natural Language Inference Dataset SI-NLI-en 1.0 🔗
“Choice of plausible alternatives” datasets in South Slavic dialects DIALECT-COPA 🔗
Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0 🔗
List of word relations from the Sloleks 2.0 lexicon 1.1 🔗
Dataset of Slovene word formation trees ArboSloleks 1.0 🔗
Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0 🔗
Lists of Slovene accentuated units SNES 1.0 🔗

Reviewer

Publications

For my complete and up-to-date bibliography, please check COBISS.