Parameters
General-Purpose Text Embeddings Learning for Ukrainian Language
Publication type:
Article
Issue date:
2024
Author(s):
Bocharova, Maiia
Malakhov, Eugene
IEEE, National Technical University "Kharkiv Polytechnic Institute": Kharkiv, Education, National Technical University of Ukraine Kiev Polytechnic Institute, Odessa I. I. Mechnikov National University, Odessa National Polytechnic University
Language of the main text:
English
eKNUTSHIR URL :
Journal:
Advanced Information Technology
Issue:
1(3)
ISSN:
2788-6603
First page:
6
Last page:
12
Citation:
Bocharova M., Malakhov E. General-Purpose Text Embeddings Learning for Ukrainian Language. Advanced Information Technology. 2024. № 1(3). P. 6-12.
Background. Learning high-quality text embeddings typically requires large corpora of labeled data, which can be challenging to obtain for many languages and domains. This study proposes a novel adaptation of cross-lingual knowledge transfer that employs a cosine similarity-based loss calculation to enhance the alignment of learned representations.
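The cosine similarity-based distillation loss described above can be contrasted with a plain mean-squared-error loss. The sketch below is a hypothetical illustration (function names and the NumPy formulation are assumptions, not the authors' implementation): the cosine loss penalizes only directional mismatch between paired teacher and student embeddings, while MSE also penalizes differences in magnitude.

```python
import numpy as np

def cosine_distillation_loss(teacher_vecs, student_vecs):
    # 1 - mean cosine similarity over paired teacher/student embeddings:
    # zero when each student vector points in its teacher vector's direction.
    t = teacher_vecs / np.linalg.norm(teacher_vecs, axis=1, keepdims=True)
    s = student_vecs / np.linalg.norm(student_vecs, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(t * s, axis=1)))

def mse_distillation_loss(teacher_vecs, student_vecs):
    # Mean squared error between raw embedding coordinates:
    # sensitive to both direction and magnitude.
    return float(np.mean((teacher_vecs - student_vecs) ** 2))

# A student that matches the teacher's directions but not magnitudes
# incurs zero cosine loss yet a nonzero MSE loss.
teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
student = 2.0 * teacher
```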
Methods. The impact of teacher model selection on the quality of learned text representations is investigated. Specifically, the correlation between cosine similarity scores among vectors of randomly selected sentences and the transferability of representations into another language is explored. Additionally, recognizing the need for effective evaluation methodologies and the limited availability of Ukrainian resources within existing benchmarks, a comprehensive general-purpose benchmark for assessing Ukrainian text representation learning is curated.
Results. A cosine similarity-based loss calculation leads to a 14.2% absolute improvement in the Normalized Mutual Information (NMI) score compared to a mean squared error loss when distilling knowledge from an English-language teacher model into a Ukrainian student model. The findings demonstrate a strong correlation between the distribution of cosine similarities of the teacher model's representations of random sentences and the quality of the learned text embeddings. Pearson's correlation between the "90th percentile of the cosine similarity score distribution" and the "average NMI score" is -0.96, a strong negative correlation.
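The diagnostic used in the results above, the 90th percentile of cosine similarities between embeddings of randomly paired sentences, can be estimated with a short script. This is a minimal sketch under assumptions (the function name and sampling scheme are illustrative, not taken from the paper): a high 90th percentile indicates an anisotropic embedding space where even unrelated sentences look similar, which the reported -0.96 Pearson correlation associates with worse transfer.

```python
import numpy as np

def cosine_similarity_p90(embeddings, n_pairs=10_000, seed=0):
    """Estimate the 90th percentile of cosine similarities
    between embeddings of randomly paired sentences."""
    rng = np.random.default_rng(seed)
    # L2-normalize so the dot product of a pair equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    i = rng.integers(0, len(normed), n_pairs)
    j = rng.integers(0, len(normed), n_pairs)
    sims = np.sum(normed[i] * normed[j], axis=1)
    return float(np.percentile(sims, 90))

# Isotropic random vectors in high dimension are nearly orthogonal,
# so their p90 is low; adding a large shared component (anisotropy)
# inflates similarity between arbitrary pairs.
rng = np.random.default_rng(1)
isotropic = rng.normal(size=(500, 256))
anisotropic = isotropic + 4.0
```

Under the paper's finding, a teacher whose random-pair similarity distribution is wide (low 90th percentile) would be the preferable choice for distillation.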
Conclusions. This research advances cross-lingual knowledge distillation, illustrating that cosine similarity-based loss functions are superior in performance. It underscores the importance of selecting a teacher model with a wide distribution of cosine similarity scores. Furthermore, a broad-scale benchmark covering five distinct domains for Ukrainian text representation learning is introduced. The source code, pretrained model, and the newly created Ukrainian text embeddings benchmark are publicly available at https://github.com/maiiabocharova/UkrTEB.
Fields of knowledge and specialties:
122 Computer Science
Fields of science and technology (FOS):
Computer Science
Collection type:
Publication
File(s):
Format:
Adobe PDF
Size:
693.16 KB
Checksum (MD5):
d5061547670ff952d96cf3c0836717d2
This work is distributed under the terms of a Creative Commons CC BY-NC license.
DOI: 10.17721/AIT.2024.1.01