Token-Level Post-Editing Dataset (EN–MT–Human): English-Ukrainian Translation Edit Log (Education-Legal) v. 1.2.

Author: Skrylnyk, Serhii
Institution: My University
Year: 2026
Dates on record: 2026-03-05; 2026-02-23
Language: uk
Type: dataset
DOI: 10.5281/zenodo.18742235
DOI URL: https://doi.org/10.5281/zenodo.18742235
Repository handle: https://ir.library.knu.ua/handle/15071834/11866
License: Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/)

Citation [APA 7]: Skrylnyk, S. (2026). Token-Level Post-Editing Dataset (EN–MT–Human): English-Ukrainian Translation Edit Log (Education-Legal) v. 1.2. https://doi.org/10.5281/zenodo.18742235
Citation [ДСТУ]: Skrylnyk S. Token-Level Post-Editing Dataset (EN–MT–Human): English-Ukrainian Translation Edit Log (Education-Legal) v. 1.2. Kyiv, 2026. DOI: 10.5281/zenodo.18742235 (date of access: 18.07.2026).

This dataset provides a structured token-level log of post-editing operations extracted from a triple-layer translation corpus consisting of: (1) English source text, (2) DeepL-generated Ukrainian MT output, and (3) Human-edited Ukrainian translation.

The dataset contains 6,207 non-equal token-level operations (replace / insert / delete), recorded after monotonic block alignment of MT and Human sentence sequences using a banded dynamic-programming procedure. Alignment permits 1–1, 1–2, 2–1, and 2–2 transitions and combines anchor-word similarity with length-based penalties. Token-level differences are extracted within aligned blocks using sequence-based comparison.

English page references are derived via length-based (Gale–Church-style) dynamic alignment between English and MT sentence sequences. Page ranges are propagated to aligned MT–Human blocks.

Each row represents a single edit operation and includes block identifiers, page ranges, sentence ranges, operation type, token positions, and rule-based change classification (lexical/stylistic, addition, omission, orthography, number/formatting, punctuation, capitalization).

This release is a SAFE version.
It does not include full textual contexts of the source, MT, or Human translations. Only the token-level edit log and metadata necessary for reproducibility are distributed. Full aligned texts are not publicly shared due to copyright and licensing considerations but may be provided upon justified academic request.

The dataset is suitable for research in post-editing analysis, MT quality diagnostics, cognitive translation studies, token-level error modeling, and edit-distance-based evaluation research.

Primary file: CSV (UTF-8)
Documentation: README and full methodology description
License: CC BY 4.0
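
The description states that token-level differences are extracted within aligned blocks using sequence-based comparison, keeping only non-equal operations (replace / insert / delete). This record does not name the tool or the tokenizer used, so the following is only a minimal sketch of that kind of extraction, assuming Python's standard difflib.SequenceMatcher and plain whitespace tokenization. The function name token_edit_ops and the dictionary keys are hypothetical and are not the column names of the released CSV; the two sample strings are invented toy text, not dataset content.

```python
from difflib import SequenceMatcher


def token_edit_ops(mt_text, human_text):
    """Return the non-equal token-level operations between two text blocks.

    Assumptions (not confirmed by the dataset record):
    - tokens are obtained by whitespace splitting;
    - difflib.SequenceMatcher stands in for the "sequence-based comparison".
    """
    mt_tokens = mt_text.split()
    human_tokens = human_text.split()

    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # items in long sequences, so every token takes part in the matching.
    matcher = SequenceMatcher(a=mt_tokens, b=human_tokens, autojunk=False)

    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # only replace / insert / delete are logged
        ops.append({
            "op": tag,                 # "replace", "insert", or "delete"
            "mt_start": i1,            # token positions in the MT block
            "mt_end": i2,
            "human_start": j1,         # token positions in the Human block
            "human_end": j2,
            "mt_tokens": mt_tokens[i1:i2],
            "human_tokens": human_tokens[j1:j2],
        })
    return ops


if __name__ == "__main__":
    # Invented toy blocks, used only to show the three operation types.
    mt = "the student must submit the completed form"
    human = "the student shall submit the form today"
    for op in token_edit_ops(mt, human):
        print(op["op"], op["mt_tokens"], "->", op["human_tokens"])
    # replace ['must'] -> ['shall']
    # delete ['completed'] -> []
    # insert [] -> ['today']
```

In this sketch a delete has an empty Human span and an insert has an empty MT span, which is one way token positions can be recorded for every operation type without storing the surrounding text.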