Parameters
General-Purpose Text Embeddings Learning for Ukrainian Language
Publication type:
Article
Issue date:
2024
Author(s):
Bocharova, Maiia
Malakhov, Eugene
IEEE, National Technical University "Kharkiv Polytechnic Institute": Kharkiv, Education, National Technical University of Ukraine Kiev Polytechnic Institute, Odessa I. I. Mechnikov National University, Odessa National Polytechnic University
Language of the main text:
English
eKNUTSHIR URL :
Journal:
Advanced Information Technology
Issue:
1(3)
ISSN:
2788-6603
First page:
6
Last page:
12
Citation:
Bocharova M., Malakhov E. General-Purpose Text Embeddings Learning for Ukrainian Language. Advanced Information Technology. 2024. № 1(3). P. 6-12.
Background. Learning high-quality text embeddings typically requires large corpora of labeled data, which can be challenging to obtain for many languages and domains. This study proposes a novel adaptation of cross-lingual knowledge transfer that employs a cosine similarity-based loss calculation to enhance the alignment of learned representations.
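The cosine similarity-based distillation loss described above can be contrasted with a plain mean-squared-error loss. The sketch below is a hypothetical illustration (function names and the NumPy formulation are assumptions, not the authors' implementation): the cosine loss penalizes only directional mismatch between paired teacher and student embeddings, while MSE also penalizes differences in magnitude.

```python
import numpy as np

def cosine_distillation_loss(teacher_vecs, student_vecs):
    # 1 - mean cosine similarity over paired teacher/student embeddings:
    # zero when each student vector points in its teacher vector's direction.
    t = teacher_vecs / np.linalg.norm(teacher_vecs, axis=1, keepdims=True)
    s = student_vecs / np.linalg.norm(student_vecs, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(t * s, axis=1)))

def mse_distillation_loss(teacher_vecs, student_vecs):
    # Mean squared error between raw embedding coordinates:
    # sensitive to both direction and magnitude.
    return float(np.mean((teacher_vecs - student_vecs) ** 2))

# A student that matches the teacher's directions but not magnitudes
# incurs zero cosine loss yet a nonzero MSE loss.
teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
student = 2.0 * teacher
```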
Methods. The impact of teacher model selection on the quality of learned text representations is investigated. Specifically, the correlation between cosine similarity scores among vectors of randomly selected sentences and the transferability of representations into another language is explored. Additionally, recognizing the need for effective evaluation methodologies and the limited availability of Ukrainian resources within existing benchmarks, a comprehensive general-purpose benchmark for assessing Ukrainian text representation learning is curated.
Results. A cosine similarity-based loss calculation leads to a 14.2% absolute improvement in the Normalized Mutual Information (NMI) score compared to a mean squared error loss when distilling knowledge from an English-language teacher model into a Ukrainian student model. The findings demonstrate a strong correlation between the distribution of cosine similarities of the teacher model's representations of random sentences and the quality of the learned text embeddings. Pearson's correlation between the "90th percentile of the cosine similarity score distribution" and the "average NMI score" is -0.96, a strong negative correlation.
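The diagnostic used in the results above, the 90th percentile of cosine similarities between embeddings of randomly paired sentences, can be estimated with a short script. This is a minimal sketch under assumptions (the function name and sampling scheme are illustrative, not taken from the paper): a high 90th percentile indicates an anisotropic embedding space where even unrelated sentences look similar, which the reported -0.96 Pearson correlation associates with worse transfer.

```python
import numpy as np

def cosine_similarity_p90(embeddings, n_pairs=10_000, seed=0):
    """Estimate the 90th percentile of cosine similarities
    between embeddings of randomly paired sentences."""
    rng = np.random.default_rng(seed)
    # L2-normalize so the dot product of a pair equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    i = rng.integers(0, len(normed), n_pairs)
    j = rng.integers(0, len(normed), n_pairs)
    sims = np.sum(normed[i] * normed[j], axis=1)
    return float(np.percentile(sims, 90))

# Isotropic random vectors in high dimension are nearly orthogonal,
# so their p90 is low; adding a large shared component (anisotropy)
# inflates similarity between arbitrary pairs.
rng = np.random.default_rng(1)
isotropic = rng.normal(size=(500, 256))
anisotropic = isotropic + 4.0
```

Under the paper's finding, a teacher whose random-pair similarity distribution is wide (low 90th percentile) would be the preferable choice for distillation.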
Conclusions. This research advances cross-lingual knowledge distillation, illustrating that cosine similarity-based loss functions are superior in performance. It underscores the importance of selecting a teacher model with a wide distribution of cosine similarity scores. Furthermore, a broad-scale benchmark covering five distinct domains for Ukrainian text representation learning is introduced. The source code, pretrained model, and the newly created Ukrainian text embeddings benchmark are publicly available at https://github.com/maiiabocharova/UkrTEB.
Fields of knowledge and specialties:
122 Computer Science
Fields of science and technology (FOS):
Computer Science
Collection type:
Publication
File(s):
Format:
Adobe PDF
Size:
693.16 KB
Checksum (MD5):
d5061547670ff952d96cf3c0836717d2
This work is distributed under the terms of a Creative Commons CC BY-NC license.
DOI: 10.17721/AIT.2024.1.01