T5 PyTorch Chinese-English Translation
Introduction
This project fine-tunes the T5 pre-trained model to implement neural machine translation from Chinese to English. T5 (Text-To-Text Transfer Transformer) is a unified text-to-text pre-trained model proposed by Google; by casting diverse NLP tasks as text generation, it achieves strong performance across many of them. In this project, we fine-tune the T5-base model on Chinese-English parallel corpora to achieve high-quality Chinese-to-English translation.
Because the model checkpoint is too large to upload to GitHub, it is hosted on Hugging Face: https://huggingface.co/DarcyCheng/T5-finetune-model
If you want to download the T5 pretrained model, it is available at: https://huggingface.co/google-t5/t5-base/tree/main
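For reference, loading the pretrained checkpoint with the Hugging Face transformers API looks like the minimal sketch below (this is illustrative, not one of the repo's scripts):

```python
# Load the T5-base checkpoint from the Hugging Face Hub.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-base")
```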
Data Preparation
The data archive contains four JSONL files, corresponding to the large-scale training set, small-scale training set, validation set, and test set, with 100k, 10k, 500, and 200 samples respectively. Each line in each JSONL file contains one parallel sentence pair. Final model performance is evaluated on the test set.
Data file structure:
- train_100k.jsonl: Large-scale training set (100,000 entries)
- train_10k.jsonl: Small-scale training set (10,000 entries)
- val.jsonl: Validation set (500 entries)
- test.jsonl: Test set (200 entries)
Format of each JSONL file:
{"zh": "Chinese sentence", "en": "English sentence"}
Environment
System Requirements
- Python: 3.9.25
- PyTorch: 2.0.1+cu118
- CUDA: 11.8 (if using GPU)
Dependency Installation
pip install -r requirements.txt
Key dependencies include:
torch>=2.0.1
transformers
hydra-core
omegaconf
nltk
jieba
sentencepiece
numpy
tqdm
Training, Evaluation and Inference
1. Train SPM Model (Optional)
Before training the T5 model, you can first train the SentencePiece (SPM) model to achieve better tokenization performance:
conda activate llm_course
python train_spm_models.py
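The script's internals are not shown here; as a rough idea, SentencePiece training with the sentencepiece package looks like the sketch below (the file name, vocabulary size, and model type are assumptions, not the script's actual settings):

```python
# Hypothetical SPM training call; train_spm_models.py may differ.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_zh.txt",   # assumed: raw Chinese sentences, one per line
    model_prefix="spm_zh",   # writes spm_zh.model and spm_zh.vocab
    vocab_size=32000,
    model_type="unigram",
)
```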
2. Train T5 Model
Train the T5 model for Chinese-to-English translation (SPM is used automatically):
python train.py
Training configurations are stored in configs/train.yaml, with key parameters including:
- NUM_EPOCHS: Number of training epochs (default: 50)
- BATCH_SIZE: Batch size (default: 64)
- LEARNING_RATE: Learning rate (default: 1e-5)
- DATA_DIR: Path to the data directory
A sketch of a single fine-tuning step under these defaults is shown below.
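As a hedged illustration of what one optimization step might look like (the task prefix and the toy example are assumptions; train.py's actual loop may differ):

```python
# One fine-tuning step for T5 on a toy zh->en pair.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-base")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # LEARNING_RATE default

batch = tokenizer(["translate Chinese to English: 你好"], return_tensors="pt")
labels = tokenizer(["Hello"], return_tensors="pt").input_ids

loss = model(input_ids=batch.input_ids,
             attention_mask=batch.attention_mask,
             labels=labels).loss  # cross-entropy over target tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```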
3. Evaluate the Model
Evaluate the model performance on the test set:
python evaluation.py
The evaluation script calculates BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, reported as percentages (0 to 100).
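For intuition, corpus-level BLEU can be computed with NLTK as sketched below (evaluation.py's exact tokenization and smoothing may differ):

```python
# Corpus BLEU-4 as a percentage; use weights=(1, 0, 0, 0) for BLEU-1, etc.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["the", "cat", "sat"]]]  # one list of reference token lists per sample
hypotheses = [["the", "cat", "sat"]]    # tokenized model outputs
smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth) * 100
print(f"BLEU-4: {bleu4:.2f}")
```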
4. Inference
Single Sentence Translation
python inference.py input_text="你好"
Batch Translation
python inference.py input_file=test.txt output_file=translations.txt
Interactive Translation
python inference.py interactive=True
Evaluation with Reference Translations
python inference.py input_file=test.txt reference_file=reference.txt
Inference configurations are stored in configs/inference.yaml, with key options including:
- model_path: Path to the trained model checkpoint
- max_length: Maximum generation length
- num_beams: Beam search size
A generation sketch using these options follows.
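Roughly, the script's generation step corresponds to the following sketch (the Hub model id and task prefix are assumptions based on this README):

```python
# Beam-search generation with the fine-tuned checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_path = "DarcyCheng/T5-finetune-model"  # or a local checkpoint directory
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

ids = tokenizer("translate Chinese to English: 你好",
                return_tensors="pt").input_ids
outputs = model.generate(ids, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```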
Acknowledgement
We would like to acknowledge the following repositories: