ChineseFileTranslator


Translate Chinese text (Simplified, Traditional, Cantonese, Classical) inside .txt and .md files to English. Preserves full Markdown syntax. Supports Google Translate, Microsoft Translator, and a fully offline Helsinki-NLP MarianMT backend with vectorized batching.


Key Features

  • 'Never Miss' Global Surgical Translation: Unique strategy to capture ALL Chinese while protecting structure.
  • Inclusive CJK Detection: Covers the full CJK Unicode range, including supplementary-plane blocks (Basic, Extensions A-E, Symbols, Punctuation).
  • Proactive Markdown Protection: Frontmatter, code blocks, links, and HTML are safely tokenized.
  • Robust Placeholder Restoration: Space-lenient, case-insensitive restoration handles engine mangling.
  • Unstoppable Backend Resilience: Explicit failure detection with automatic retries and non-crashing fallbacks.
  • Offline First Option: Fully local Helsinki-NLP MarianMT backend with vectorized batching.
  • Bilingual Mode: Optional side-by-side Chinese and English output.
  • Batch Processing: Translate entire directories with recursive discovery and persistent configuration.
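The inclusive CJK detection above can be sketched as follows. This is an illustrative approximation, not the tool's actual code; the exact ranges and function names are assumptions.

```python
# Illustrative sketch of inclusive CJK detection (not the tool's actual code).
# Supplementary-plane blocks (Extension B and beyond) need full code points,
# which Python's ord() provides directly.
CJK_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2EBEF),  # Extensions C-F (collapsed to one span for brevity)
    (0x3000, 0x303F),    # CJK Symbols and Punctuation
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
]

def is_cjk(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def contains_chinese(text: str) -> bool:
    return any(is_cjk(ch) for ch in text)
```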

Project Structure

ChineseFileTranslator/
├── chinese_file_translator.py   # Main script (single-file, no extra modules)
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── .gitattributes               # Git line-ending and LFS rules
├── .gitignore                   # Ignored paths
└── LICENSE                      # MIT License

Quickstart

1. Clone the repository

git clone https://github.com/algorembrant/ChineseFileTranslator.git
cd ChineseFileTranslator

2. Create and activate a virtual environment (recommended)

python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate

3. Install core dependencies

pip install -r requirements.txt

4. (Optional) Install offline translation backend

Choose the correct PyTorch build for your system:

# CPU only
pip install torch --index-url https://download.pytorch.org/whl/cpu

# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Then install Transformers stack
pip install transformers sentencepiece sacremoses

The Helsinki-NLP/opus-mt-zh-en model (~300 MB) downloads automatically on first use.


Usage

Command Reference

| Command | Description |
| --- | --- |
| `python chinese_file_translator.py input.txt` | Translate a plain-text file (Google backend) |
| `python chinese_file_translator.py input.md` | Translate a Markdown file, preserve structure |
| `python chinese_file_translator.py input.txt -o out.txt` | Set explicit output path |
| `python chinese_file_translator.py input.txt --backend offline` | Use offline MarianMT model |
| `python chinese_file_translator.py input.txt --backend microsoft` | Use Microsoft Translator |
| `python chinese_file_translator.py input.txt --offline --gpu` | Offline + GPU (CUDA) |
| `python chinese_file_translator.py input.txt --lang simplified` | Force Simplified Chinese |
| `python chinese_file_translator.py input.txt --lang traditional` | Force Traditional Chinese |
| `python chinese_file_translator.py input.txt --bilingual` | Keep Chinese + show English |
| `python chinese_file_translator.py input.txt --extract-only` | Extract Chinese lines only |
| `python chinese_file_translator.py input.txt --stdout` | Print output to terminal |
| `python chinese_file_translator.py --batch ./docs/` | Batch translate a directory |
| `python chinese_file_translator.py --batch ./in/ --batch-out ./out/` | Batch with output dir |
| `python chinese_file_translator.py input.txt --chunk-size 2000` | Custom chunk size |
| `python chinese_file_translator.py input.txt --export-history h.json` | Export history |
| `python chinese_file_translator.py input.txt --verbose` | Debug logging |
| `python chinese_file_translator.py --version` | Print version |
| `python chinese_file_translator.py --help` | Full help |

Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `input` | positional | — | Path to `.txt` or `.md` file |
| `-o` / `--output` | string | `<name>_translated.<ext>` | Output file path |
| `--batch DIR` | string | — | Directory to batch translate |
| `--batch-out DIR` | string | same as `--batch` | Output directory for batch |
| `--backend` | choice | `google` | `google`, `microsoft`, `offline` |
| `--offline` | flag | false | Shorthand for `--backend offline` |
| `--lang` | choice | `auto` | `auto`, `simplified`, `traditional` |
| `--gpu` | flag | false | Use CUDA for offline model |
| `--confidence` | float | 0.05 | Min Chinese character ratio for detection |
| `--chunk-size` | int | 4000 | Max chars per translation request |
| `--bilingual` | flag | false | Output both Chinese and English |
| `--extract-only` | flag | false | Save only the detected Chinese lines |
| `--stdout` | flag | false | Print result to stdout |
| `--export-history` | string | — | Save session history to JSON |
| `--verbose` | flag | false | Enable DEBUG logging |
| `--version` | flag | — | Show version and exit |
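The `--confidence` threshold can be read as a simple character-ratio check. The sketch below is a hypothetical interpretation of that behavior; the function names and the BMP-only range are assumptions made for brevity.

```python
# Hypothetical sketch of the --confidence check: a line counts as Chinese when
# the ratio of CJK characters to total characters meets the threshold.
def chinese_ratio(text: str) -> float:
    if not text:
        return 0.0
    # Only BMP CJK Unified Ideographs here, for brevity.
    cjk = sum(1 for ch in text if 0x4E00 <= ord(ch) <= 0x9FFF)
    return cjk / len(text)

def needs_translation(line: str, confidence: float = 0.05) -> bool:
    return chinese_ratio(line) >= confidence
```

With the default threshold of 0.05, even a single Chinese character in a short line triggers translation.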

Configuration

The tool writes a JSON config file on first run:

~/.chinese_file_translator/config.json

Example config.json:

{
  "backend": "google",
  "lang": "auto",
  "use_gpu": false,
  "chunk_size": 4000,
  "batch_size": 10,
  "bilingual": false,
  "microsoft_api_key": "YOUR_KEY_HERE",
  "microsoft_region": "eastus",
  "offline_model_dir": "~/.chinese_file_translator/models",
  "output_suffix": "_translated",
  "retry_attempts": 3,
  "retry_delay_seconds": 1.5,
  "max_history": 1000
}

Supported Chinese Variants

| Variant | Notes |
| --- | --- |
| Simplified Chinese | Mandarin, mainland China standard |
| Traditional Chinese | Taiwan, Hong Kong, Macau standard |
| Cantonese / Yue | Detected via CJK Unicode ranges |
| Classical Chinese | Treated as Traditional for translation |
| Mixed Chinese-English | Code-switching text handled transparently |

Translation Backends

| Backend | Requires | Speed | Quality | Internet |
| --- | --- | --- | --- | --- |
| Google Translate | `deep-translator` | Fast | High | Yes |
| Microsoft Translator | Azure API key + `deep-translator` | Fast | High | Yes |
| Helsinki-NLP MarianMT | `transformers`, `torch` | Medium | Good | No (after download) |

Google Translate is the default. If it fails, the tool falls back to the offline model automatically.
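The retry-then-fallback behavior could look like the sketch below. Function names and signatures are hypothetical; only the described behavior (explicit failure detection, retries, non-crashing fallback) comes from this README.

```python
import logging
import time

# Hypothetical sketch of backend resilience: each backend gets a few attempts,
# and when the primary keeps failing the next backend takes over instead of
# crashing the run.
def translate_resilient(text, backends, attempts=3, delay=0.0):
    for backend in backends:
        for attempt in range(1, attempts + 1):
            try:
                return backend(text)
            except Exception as exc:
                logging.warning("backend %s attempt %d failed: %s",
                                getattr(backend, "__name__", backend), attempt, exc)
                time.sleep(delay)
    raise RuntimeError("all translation backends failed")
```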



Technical Strategy: 'Never Miss' Logic

The tool employs a sophisticated "Global Surgical" approach to ensure no Chinese fragment is overlooked, regardless of its depth in JSON, HTML, or complex Markdown.

1. Surgical Block Extraction

Instead of line-by-line translation, the script identifies every continuous block of CJK characters (including ideographic symbols and punctuation) across the entire document. This ensures that contextually related characters are translated together for better accuracy.
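The block-extraction step can be approximated with a single regex pass over the whole document. The character class below is a simplified stand-in for the tool's full range set:

```python
import re

# Sketch of "surgical" block extraction: one pass that pulls every continuous
# run of CJK characters (ideographs plus some fullwidth punctuation) out of
# the document, so related characters stay together. Ranges are illustrative.
CJK_BLOCK = re.compile(
    r"[\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002ebef\u3000-\u303f"
    r"\uff0c\uff1f\uff01]+"
)

def extract_blocks(text: str) -> list:
    return CJK_BLOCK.findall(text)
```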

2. Structural Protection

Markdown and metadata structures are tokenized using unique, collision-resistant placeholders (`___MY_PROTECT_PH_{idx}___`).

  • YAML/TOML: Frontmatter is protected globally.
  • Code Fences: Backticks and language identifiers are protected; Chinese content inside comments or strings remains translatable.
  • Links & HTML: URLs and tag names are guarded, while display text is surgically translated.
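A minimal sketch of the tokenization step, showing fenced and inline code being swapped for the placeholder format above (the regex and helper names are assumptions; the real tool protects many more structures):

```python
import re

# Sketch of structural protection: protected regions (here only fenced blocks
# and inline code) are replaced with collision-resistant placeholders before
# translation; the originals are kept for later restoration.
def protect(text: str):
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return f"___MY_PROTECT_PH_{len(saved) - 1}___"
    protected = re.sub(r"```.*?```|`[^`\n]+`", stash, text, flags=re.DOTALL)
    return protected, saved
```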

3. Verification & Restoration

  • Longest-First Replacement: Translated segments are restored starting from the longest strings to prevent partial match overwrites.
  • Fuzzy Restoration: The restoration logic is space-lenient and case-insensitive to handle cases where online translation engines mangle the placeholder tokens.
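The lenient matching can be sketched like this; the exact tolerance pattern is an assumption, and longest-first ordering of translated-segment restoration is handled separately:

```python
import re

# Sketch of fuzzy restoration: placeholders are matched case-insensitively
# and with optional spaces or dropped underscores, since online engines
# sometimes mangle the tokens (e.g. "___ my_protect_ph_0 ___").
def restore(text: str, saved: list) -> str:
    for idx, original in enumerate(saved):
        pattern = re.compile(
            r"_\s*_\s*_\s*MY\s*_?\s*PROTECT\s*_?\s*PH\s*_?\s*%d\s*_\s*_\s*_" % idx,
            re.IGNORECASE,
        )
        # A callable replacement avoids backslash escaping issues in `original`.
        text = pattern.sub(lambda _m: original, text)
    return text
```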

Markdown Preservation

The following elements are meticulously protected:

| Element | Example | Protection Method |
| --- | --- | --- |
| Front Matter | `---\ntitle: ...\n---` | Full Tokenization |
| Fenced Code | `` ```python ... ``` `` | Boundary Tokenization |
| Inline Code | `` `code` `` | Full Tokenization |
| Links / Images | `[text](url)` | URL Tokenization |
| HTML Tags | `<div class="...">` | Tag Tokenization |
| Symbols | `&copy;`, `&#x...;` | Entity Tokenization |

Microsoft Translator Setup

  1. Go to Azure Cognitive Services
  2. Create a Translator resource (Free tier: 2M chars/month)
  3. Copy your API key and region
  4. Add them to ~/.chinese_file_translator/config.json:
{
  "microsoft_api_key": "abc123...",
  "microsoft_region": "eastus"
}

Then run:

python chinese_file_translator.py input.txt --backend microsoft

Files Generated

| Path | Description |
| --- | --- |
| `~/.chinese_file_translator/config.json` | Persistent settings |
| `~/.chinese_file_translator/history.json` | Session history log |
| `~/.chinese_file_translator/app.log` | Application log file |
| `~/.chinese_file_translator/models/` | Offline model cache (if used) |

Author

algorembrant


License

MIT License. See LICENSE for details.
