Parameters
Title:
Token-Level Post-Editing Dataset (EN–MT–Human): English-Ukrainian Translation Edit Log (Education-Legal) v. 1.2.
Release date:
2026-02-23
Creator:
Skrylnyk, Serhii
Abstract:
This dataset provides a structured token-level log of post-editing operations extracted from a triple-layer translation corpus consisting of: (1) English source text, (2) DeepL-generated Ukrainian MT output, and (3) Human-edited Ukrainian translation.
The dataset contains 6,207 non-equal token-level operations (replace / insert / delete), recorded after monotonic block alignment of MT and Human sentence sequences using a banded dynamic-programming procedure. Alignment permits 1–1, 1–2, 2–1, and 2–2 transitions and combines anchor-word similarity with length-based penalties. Token-level differences are extracted within aligned blocks using sequence-based comparison.
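The token-level extraction step described above can be illustrated with a minimal sketch. It assumes Python's standard difflib.SequenceMatcher as the sequence-based comparison; the record does not name the tool actually used. The function name token_edit_ops, the output field names, and the example sentence pair are hypothetical and are not taken from the dataset.

```python
from difflib import SequenceMatcher


def token_edit_ops(mt_tokens, human_tokens):
    """Return the non-equal edit operations between two token lists.

    Positions are end-exclusive token indices within the aligned block.
    """
    # autojunk=False keeps frequent tokens from being treated as noise.
    matcher = SequenceMatcher(a=mt_tokens, b=human_tokens, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # only replace / insert / delete are logged
        ops.append({
            "op": tag,
            "mt_start": i1, "mt_end": i2,
            "human_start": j1, "human_end": j2,
        })
    return ops


# Invented example pair, for illustration only.
mt = "Студент повинен подати заяву до деканату".split()
human = "Студент має подати письмову заяву до деканату".split()

for op in token_edit_ops(mt, human):
    print(op)
```

For this invented pair the sketch reports one replace (token 1) and one insert (before MT token 3), which mirrors the one-row-per-operation layout of the edit log.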
English page references are derived via length-based (Gale–Church-style) dynamic alignment between English and MT sentence sequences. Page ranges are propagated to aligned MT–Human blocks.
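A length-based dynamic alignment of this kind might look like the sketch below. It is a simplified illustration, not the author's procedure: the cost is a plain normalized length difference rather than the probabilistic Gale-Church cost, the table is not banded, and the transition set, the MERGE_PENALTY values, and the sentence lengths in the example are all assumptions.

```python
import math

# Allowed bead shapes (source sentences, target sentences); an assumption here.
TRANSITIONS = [(1, 1), (1, 2), (2, 1), (2, 2)]
# Hypothetical flat penalties that make merged beads less attractive than 1-1.
MERGE_PENALTY = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 1.0, (2, 2): 1.5}


def length_cost(len_a, len_b):
    """Simplified length-mismatch cost (not the Gale-Church probability model)."""
    return abs(len_a - len_b) / math.sqrt(len_a + len_b + 1)


def align_by_length(src_lens, tgt_lens):
    """Monotonic DP alignment over sentence lengths.

    Returns beads ((src_start, src_end), (tgt_start, tgt_end)), end-exclusive.
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # cell not reachable with the allowed transitions
            for di, dj in TRANSITIONS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                total = cost[i][j] + step + MERGE_PENALTY[(di, dj)]
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment with the allowed transitions")
    # Trace back from the final cell to recover the bead sequence.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return beads[::-1]


# Invented character lengths: four English sentences, three MT sentences.
beads = align_by_length([40, 20, 25, 60], [42, 47, 58])
print(beads)
```

With these invented lengths the second and third source sentences are merged into one 2-1 bead. Once each English sentence carries a page number, the page range of a bead is simply the minimum and maximum page of its source sentences, which is how a range could be propagated to the aligned blocks.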
Each row represents a single edit operation and includes block identifiers, page ranges, sentence ranges, operation type, token positions, and rule-based change classification (lexical/stylistic, addition, omission, orthography, number/formatting, punctuation, capitalization).
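To make the idea of rule-based change classification concrete, here is a hedged sketch that maps one edit operation to one of the listed categories. The actual rules are documented in the methodology description and are not reproduced in this record, so the rule order, the 0.8 similarity threshold, and the function names below are hypothetical. The sketch also needs the token strings, which the SAFE release does not distribute.

```python
import string
from difflib import SequenceMatcher

# ASCII punctuation plus a few typographic marks common in Ukrainian text.
PUNCT = set(string.punctuation) | set("«»…")


def is_punct(token):
    """True if the token is non-empty and consists only of punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


def classify_change(op, mt_tok="", human_tok=""):
    """Assign a change category to a single-token edit (hypothetical rules)."""
    if op == "insert":
        return "punctuation" if is_punct(human_tok) else "addition"
    if op == "delete":
        return "punctuation" if is_punct(mt_tok) else "omission"
    # Remaining case: replace.
    if is_punct(mt_tok) and is_punct(human_tok):
        return "punctuation"
    if mt_tok.lower() == human_tok.lower():
        return "capitalization"
    if any(ch.isdigit() for ch in mt_tok + human_tok):
        return "number/formatting"
    # Near-identical spellings are treated as orthographic corrections;
    # the 0.8 threshold is an arbitrary illustrative value.
    ratio = SequenceMatcher(a=mt_tok.lower(), b=human_tok.lower()).ratio()
    if ratio >= 0.8:
        return "orthography"
    return "lexical/stylistic"


print(classify_change("replace", "Закон", "закон"))
print(classify_change("replace", "обовязок", "обов'язок"))
print(classify_change("replace", "повинен", "має"))
```

The rules are checked from the most specific to the most general, so lexical/stylistic acts as the fallback category for any replacement that no narrower rule claims.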
This release is a SAFE version. It does not include full textual contexts of the source, MT, or Human translations. Only the token-level edit log and metadata necessary for reproducibility are distributed. Full aligned texts are not publicly shared due to copyright and licensing considerations but may be provided upon justified academic request.
The dataset is suitable for research in:
post-editing analysis
MT quality diagnostics
cognitive translation studies
token-level error modeling
edit-distance–based evaluation research
Primary file: CSV (UTF-8)
Documentation: README and full methodology description
License: CC BY 4.0
This work is distributed under the terms of the Creative Commons CC BY license.
10.5281/zenodo.18742235