Code to train and evaluate models for detecting critical errors in machine translations using only the original source text and the machine-translated text, as described in Knight et al. (2025).
- Background
- Approaches
- Structure of this repository
- Getting started
- Useful links and files
- Development
## Background

The goal of critical error detection (CED) is to identify translated text that deviates in meaning from the original text. CED was introduced at the Conference on Machine Translation (WMT) 2021 quality estimation (QE) subtask (Specia et al., 2021), which also released a unique dataset of authentic critical error annotations in translations of Wikipedia comments. See also Knight et al. (2024) for a literature review on machine translation quality estimation (MTQE), including CED.
## Approaches

We used COMETKiwi-22 (Rei et al., 2022), which outputs quality scores between 0 and 1 (1 = perfect translation).
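For illustration, scoring source–translation pairs with COMETKiwi-22 looks roughly like this using recent versions of the `unbabel-comet` package (the example sentences are placeholders):

```python
from comet import download_model, load_from_checkpoint

# Download the COMETKiwi-22 checkpoint from HuggingFace (requires a logged-in
# account that has accepted the model license, see "Getting started" below)
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# COMETKiwi is reference-free: each sample needs only source and translation
data = [
    {"src": "Der Hund bellt laut.", "mt": "The dog barks loudly."},
    {"src": "Der Hund bellt laut.", "mt": "The cat sleeps quietly."},
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one quality score in [0, 1] per pair
```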
For the baseline, we picked a binarisation threshold using the WMT dev data and used it to binarise COMETKiwi-22 predictions on the test data.
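The threshold selection could look something like the following sketch. This is a simplified illustration, not the repository's exact code: it assumes dev-set scores and binary gold labels are already in NumPy arrays, and uses the Matthews correlation coefficient (MCC), the headline metric of the WMT CED task, to pick the threshold:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def pick_threshold(dev_scores: np.ndarray, dev_labels: np.ndarray) -> float:
    """Grid-search the binarisation threshold that maximises MCC on dev data.

    Assumes the convention 1 = critical error, 0 = no error, so a LOW
    COMETKiwi quality score maps to the positive (error) class.
    """
    best_threshold, best_mcc = 0.5, -1.0
    for threshold in np.arange(0.01, 1.0, 0.01):
        preds = (dev_scores < threshold).astype(int)
        mcc = matthews_corrcoef(dev_labels, preds)
        if mcc > best_mcc:
            best_threshold, best_mcc = threshold, mcc
    return best_threshold

# threshold = pick_threshold(dev_scores, dev_labels)
# test_preds = (test_scores < threshold).astype(int)
```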
We also adapted COMETKiwi-22 for binary classification in the `CEDModel` class. Broadly, we tried two main training strategies (a minimal sketch of the classification idea follows this list):

- Fine-tune `CEDModel` with the WMT-released authentic training data
- Pre-train `CEDModel` with synthetic data from the DEMETR dataset (Karpinska et al., 2022) and then fine-tune with the WMT authentic data
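To show what "adapting a QE model for binary classification" means in principle, here is a minimal, self-contained sketch. It is not the actual `CEDModel` code; the pooled embedding, hidden size, and classification head are all simplifying assumptions:

```python
import torch
import torch.nn as nn

class ToyCEDHead(nn.Module):
    """Illustrative binary critical-error classifier over a pooled
    sentence-pair embedding, such as one produced by a QE encoder."""

    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)  # one logit per pair
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, pooled: torch.Tensor, labels: torch.Tensor):
        logits = self.classifier(pooled).squeeze(-1)
        loss = self.loss_fn(logits, labels.float())
        return loss, torch.sigmoid(logits)  # probability of a critical error

# Random tensors stand in for encoder output here:
head = ToyCEDHead()
loss, probs = head(torch.randn(4, 1024), torch.tensor([0, 1, 0, 1]))
```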
See the notes/ directory for an overview of the different training strategies, and the scripts/README file for instructions on how to train models.
We tried three LLM prompts (an example call follows this list):

- A basic prompt that asks whether the translation has the same meaning as the original text
- The GEMBA-MQM prompt from Kocmi and Federmann (2024)
- A prompt based on the original WMT annotator guidelines from Specia et al. (2021)
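As an example of the first approach, a basic prompt could be sent to the OpenAI API roughly as follows. This is a sketch only: the prompt wording, model name, and answer handling are assumptions, not the repository's exact implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def basic_ced_prompt(src: str, mt: str) -> str:
    # Hypothetical wording of the "basic" prompt
    return (
        "Does the translation have the same meaning as the original text? "
        "Answer YES or NO.\n"
        f"Original: {src}\nTranslation: {mt}"
    )

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": basic_ced_prompt("Der Hund bellt.", "The cat sleeps.")}
    ],
)
answer = response.choices[0].message.content.strip()
print(answer)  # e.g. "NO" -> flag as a potential critical error
```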
## Structure of this repository

```
├── configs/ -- configs used for training experiments
│   ├── ...
├── notes/ -- includes overview of training strategies
│   ├── ...
├── notebooks/ -- plots and tables of results
│   ├── ...
├── predictions/ced_data/ -- predictions on the test (and dev) data
│   ├── ...
├── scripts/ -- training, prediction and evaluation code
│   ├── ...
├── src/ -- model and prompt implementations
│   ├── ...
```
## Getting started

Clone this repository and change the current working directory:

```bash
git clone https://github.com/alan-turing-institute/ARC-MTQE.git
cd ARC-MTQE
```
Install dependencies and pre-commit hooks with Poetry:
```bash
make setup
```
Download and preprocess datasets:
```bash
make data
```
This adds the following directories:
```
├── data/
│   ├── ... -- downloaded data files
│   ├── preprocessed/ -- preprocessed data used in experiments
```
See the notes/ directory for an overview of the datasets that will be downloaded when this command is run.
To use COMETKiwi, you need a HuggingFace account and access token (found under https://huggingface.co/settings/tokens in your account settings). Log in to the HuggingFace CLI, which will request the token:

```bash
poetry run huggingface-cli login
```
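Alternatively, you can log in from Python with the `huggingface_hub` package; calling `login()` with no arguments prompts for the token:

```python
from huggingface_hub import login

login()  # prompts for your HuggingFace access token
```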
To use any of the COMET models, you must also acknowledge their license on the respective HuggingFace model page (for COMETKiwi-22 this is https://huggingface.co/Unbabel/wmt22-cometkiwi-da).
We use WandB to track experiments. You need to log in first (you should only need to do this once). The code below will prompt you for an API key, which you can find in your WandB User Settings:

```python
import wandb

wandb.login()
```
To make predictions using GPT, you need an OpenAI API key saved as an environment variable named `OPENAI_API_KEY`. To do this in a Mac terminal:

```bash
export OPENAI_API_KEY="your_api_key"
```
Follow the instructions in the scripts/README.
## Useful links and files

- Overview of available COMET models.
- Notes on the COMET codebase that our trained `CEDModel` inherits from.
- Instructions for using the Baskerville Tier 2 HPC service to train models.
## Development

The code base could be updated to use models other than COMETKiwi-22. This would require updating the `load_model_from_file` function, which is currently hard-coded to download COMETKiwi-22:

```python
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
```
This could be updated to allow the pre-trained QE model to be changed to, for example, COMETKiwi-23-XL or COMETKiwi-23-XXL. This would also require updating the encoder-related hyperparameters in the config file (e.g., `encoder_model: XLM-RoBERTa-XL`). A possible shape for that refactor is sketched below.
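This sketch assumes that only the model identifier needs to change; the function name and signature here are illustrative, not the current code:

```python
from comet import download_model, load_from_checkpoint

# Hypothetical refactor: take the HuggingFace model name as a parameter
# instead of hard-coding "Unbabel/wmt22-cometkiwi-da"
def load_qe_model(model_name: str = "Unbabel/wmt22-cometkiwi-da"):
    model_path = download_model(model_name)
    return load_from_checkpoint(model_path)

# For example (assuming the HuggingFace identifiers for the larger models):
# model = load_qe_model("Unbabel/wmt23-cometkiwi-da-xl")   # COMETKiwi-23-XL
# model = load_qe_model("Unbabel/wmt23-cometkiwi-da-xxl")  # COMETKiwi-23-XXL
```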