Token-Level Post-Editing Dataset (EN–MT–Human): English-Ukrainian Translation Edit Log (Education-Legal) v. 1.2.

Skrylnyk, Serhii

doi:10.5281/zenodo.18742235

Title:

Token-Level Post-Editing Dataset (EN–MT–Human): English-Ukrainian Translation Edit Log (Education-Legal) v. 1.2.

Date of issue:

2026-02-23

Creator:

Skrylnyk, Serhii

Abstract:

This dataset provides a structured token-level log of post-editing operations extracted from a triple-layer translation corpus consisting of: (1) English source text, (2) DeepL-generated Ukrainian MT output, and (3) human-edited Ukrainian translation.
The dataset contains 6,207 non-equal token-level operations (replace / insert / delete), recorded after monotonic block alignment of MT and Human sentence sequences using a banded dynamic-programming procedure. Alignment permits 1–1, 1–2, 2–1, and 2–2 transitions and combines anchor-word similarity with length-based penalties. Token-level differences are extracted within aligned blocks using sequence-based comparison.
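As an illustration of the token-level extraction step, the following is a minimal sketch that assumes "sequence-based comparison" can be approximated with Python's standard difflib.SequenceMatcher. The function name token_edit_ops, the dictionary keys, and the toy English tokens (standing in for Ukrainian MT and Human tokens) are hypothetical and are not taken from the dataset or its README.

```python
from difflib import SequenceMatcher


def token_edit_ops(mt_tokens, human_tokens):
    """Return the non-equal edit operations that turn mt_tokens into human_tokens.

    Hypothetical helper: the key names below are illustrative and do not
    mirror the column names of the published CSV.
    """
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = SequenceMatcher(a=mt_tokens, b=human_tokens, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # only replace / insert / delete are logged
        ops.append({
            "op": tag,
            "mt_start": i1, "mt_end": i2,
            "human_start": j1, "human_end": j2,
            "mt_tokens": mt_tokens[i1:i2],
            "human_tokens": human_tokens[j1:j2],
        })
    return ops


# Toy aligned block (English placeholders for Ukrainian tokens).
mt = "the school must provide a report".split()
human = "the school shall provide an annual report".split()
for op in token_edit_ops(mt, human):
    print(op["op"], op["mt_tokens"], "->", op["human_tokens"])
```

Token positions here are half-open ranges within the aligned block; whether the dataset uses the same convention is not stated in this record.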
English page references are derived via length-based (Gale–Church-style) dynamic alignment between English and MT sentence sequences. Page ranges are propagated to aligned MT–Human blocks.
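The length-based alignment idea can be sketched as a small banded dynamic program. This is a simplified illustration only: it replaces the probabilistic Gale–Church cost with a normalized length difference, borrows the 1–1, 1–2, 2–1, and 2–2 transitions described above, and uses made-up penalty values. The function name align_by_length and its parameters are hypothetical and do not reproduce the dataset's actual procedure.

```python
def align_by_length(src_lens, tgt_lens, band=10):
    """Monotonic block alignment of two sentence-length sequences.

    Returns a list of ((src_start, src_end), (tgt_start, tgt_end)) half-open
    ranges. Assumed cost: normalized length difference plus a small penalty
    for merged blocks (illustrative values, not from the dataset).
    """
    n, m = len(src_lens), len(tgt_lens)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    moves = {(1, 1): 0.0, (1, 2): 0.1, (2, 1): 0.1, (2, 2): 0.2}
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in moves.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Band constraint: stay close to the rescaled diagonal.
                if abs(ni - nj * n / m) > band:
                    continue
                s = sum(src_lens[i:ni])
                t = sum(tgt_lens[j:nj])
                step = abs(s - t) / max(s + t, 1) + penalty
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no monotonic alignment exists within the band")
    blocks, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        blocks.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return blocks[::-1]


# Sentence lengths in characters (toy values).
print(align_by_length([10, 20, 30], [10, 8, 12, 30]))
```

Once blocks are known, a page range for a block can be taken as the minimum and maximum page of the source sentences it covers, which is one plausible reading of how page ranges are propagated.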
Each row represents a single edit operation and includes block identifiers, page ranges, sentence ranges, operation type, token positions, and rule-based change classification (lexical/stylistic, addition, omission, orthography, number/formatting, punctuation, capitalization).
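A rule-based change classifier over the seven listed categories might look like the sketch below. The rule order, the 0.8 similarity threshold for orthography, and the mapping of insert to addition and delete to omission (taking MT as the starting side) are assumptions made for illustration; the dataset's actual rules are described in its README and methodology, not here.

```python
import unicodedata
from difflib import SequenceMatcher


def _is_punct(text):
    """True if text is non-empty and consists only of Unicode punctuation."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("P") for ch in text
    )


def classify_edit(op, mt_text="", human_text=""):
    """Assign one change category to an edit (hypothetical rule set)."""
    if op == "insert":
        return "addition"   # assumed: token present only on the Human side
    if op == "delete":
        return "omission"   # assumed: token present only on the MT side
    if _is_punct(mt_text) and _is_punct(human_text):
        return "punctuation"
    if any(ch.isdigit() for ch in mt_text + human_text):
        return "number/formatting"
    if mt_text.casefold() == human_text.casefold():
        return "capitalization"
    # Near-identical strings are treated as spelling-level changes.
    if SequenceMatcher(a=mt_text, b=human_text).ratio() >= 0.8:
        return "orthography"
    return "lexical/stylistic"


print(classify_edit("replace", "міністерство", "Міністерство"))
print(classify_edit("replace", "школа", "заклад"))
```

The checks run from the most specific category to the most general, so lexical/stylistic acts as the fallback for any replacement that no narrower rule captures.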
This release is a SAFE version. It does not include full textual contexts of the source, MT, or Human translations. Only the token-level edit log and metadata necessary for reproducibility are distributed. Full aligned texts are not publicly shared due to copyright and licensing considerations but may be provided upon justified academic request.
The dataset is suitable for research in:

post-editing analysis

MT quality diagnostics

cognitive translation studies

token-level error modeling

edit-distance–based evaluation research

Primary file: CSV (UTF-8)
Documentation: README and full methodology description
License: CC BY 4.0

DOI:

10.5281/zenodo.18742235

eKNUTSHIR URL:

https://doi.org/10.5281/zenodo.18742235

https://ir.library.knu.ua/handle/15071834/11866

https://creativecommons.org/licenses/by/4.0/