ChineseFileTranslator


Translate Chinese text (Simplified, Traditional, Cantonese, Classical) inside .txt and .md files to English. Preserves full Markdown syntax. Supports Google Translate, Microsoft Translator, and a fully offline Helsinki-NLP MarianMT backend with vectorized batching.


Key Features

  • 'Never Miss' Global Surgical Translation: Unique strategy to capture ALL Chinese while protecting structure.
  • Inclusive CJK Detection: Covers the full CJK Unicode range, including supplementary-plane blocks (Basic, Extensions A-E, Symbols, Punctuation).
  • Proactive Markdown Protection: Frontmatter, code blocks, links, and HTML are safely tokenized.
  • Robust Placeholder Restoration: Space-lenient, case-insensitive restoration handles engine mangling.
  • Unstoppable Backend Resilience: Explicit failure detection with automatic retries and non-crashing fallbacks.
  • Offline First Option: Fully local Helsinki-NLP MarianMT backend with vectorized batching.
  • Bilingual Mode: Optional side-by-side Chinese and English output.
  • Batch Processing: Translate entire directories with recursive discovery and persistent configuration.
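The inclusive CJK detection above can be sketched as follows. This is an illustrative approximation, not the tool's actual code; the exact ranges and function names are assumptions.

```python
# Illustrative sketch of inclusive CJK detection (not the tool's actual code).
# Supplementary-plane blocks (Extension B and beyond) need full code points,
# which Python's ord() provides directly.
CJK_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2EBEF),  # Extensions C-F (collapsed to one span for brevity)
    (0x3000, 0x303F),    # CJK Symbols and Punctuation
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
]

def is_cjk(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def contains_chinese(text: str) -> bool:
    return any(is_cjk(ch) for ch in text)
```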

Project Structure

ChineseFileTranslator/
├── chinese_file_translator.py   # Main script (single-file, no extra modules)
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── .gitattributes               # Git line-ending and LFS rules
├── .gitignore                   # Ignored paths
└── LICENSE                      # MIT License

Quickstart

1. Clone the repository

git clone https://github.com/algorembrant/ChineseFileTranslator.git
cd ChineseFileTranslator

2. Create and activate a virtual environment (recommended)

python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate

3. Install core dependencies

pip install -r requirements.txt

4. (Optional) Install offline translation backend

Choose the correct PyTorch build for your system:

# CPU only
pip install torch --index-url https://download.pytorch.org/whl/cpu

# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Then install Transformers stack
pip install transformers sentencepiece sacremoses

The Helsinki-NLP/opus-mt-zh-en model (~300 MB) downloads automatically on first use.


Usage

Command Reference

| Command | Description |
| --- | --- |
| `python chinese_file_translator.py input.txt` | Translate a plain-text file (Google backend) |
| `python chinese_file_translator.py input.md` | Translate a Markdown file, preserve structure |
| `python chinese_file_translator.py input.txt -o out.txt` | Set explicit output path |
| `python chinese_file_translator.py input.txt --backend offline` | Use offline MarianMT model |
| `python chinese_file_translator.py input.txt --backend microsoft` | Use Microsoft Translator |
| `python chinese_file_translator.py input.txt --offline --gpu` | Offline + GPU (CUDA) |
| `python chinese_file_translator.py input.txt --lang simplified` | Force Simplified Chinese |
| `python chinese_file_translator.py input.txt --lang traditional` | Force Traditional Chinese |
| `python chinese_file_translator.py input.txt --bilingual` | Keep Chinese + show English |
| `python chinese_file_translator.py input.txt --extract-only` | Extract Chinese lines only |
| `python chinese_file_translator.py input.txt --stdout` | Print output to terminal |
| `python chinese_file_translator.py --batch ./docs/` | Batch translate a directory |
| `python chinese_file_translator.py --batch ./in/ --batch-out ./out/` | Batch with output dir |
| `python chinese_file_translator.py input.txt --chunk-size 2000` | Custom chunk size |
| `python chinese_file_translator.py input.txt --export-history h.json` | Export history |
| `python chinese_file_translator.py input.txt --verbose` | Debug logging |
| `python chinese_file_translator.py --version` | Print version |
| `python chinese_file_translator.py --help` | Full help |

Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `input` | positional | — | Path to `.txt` or `.md` file |
| `-o` / `--output` | string | `<name>_translated.<ext>` | Output file path |
| `--batch DIR` | string | — | Directory to batch translate |
| `--batch-out DIR` | string | same as `--batch` | Output directory for batch |
| `--backend` | choice | `google` | `google`, `microsoft`, `offline` |
| `--offline` | flag | false | Shorthand for `--backend offline` |
| `--lang` | choice | `auto` | `auto`, `simplified`, `traditional` |
| `--gpu` | flag | false | Use CUDA for offline model |
| `--confidence` | float | 0.05 | Min Chinese character ratio for detection |
| `--chunk-size` | int | 4000 | Max chars per translation request |
| `--bilingual` | flag | false | Output both Chinese and English |
| `--extract-only` | flag | false | Save only the detected Chinese lines |
| `--stdout` | flag | false | Print result to stdout |
| `--export-history` | string | — | Save session history to JSON |
| `--verbose` | flag | false | Enable DEBUG logging |
| `--version` | flag | — | Show version and exit |
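The `--confidence` threshold can be read as a simple character-ratio check. The sketch below is a hypothetical interpretation of that behavior; the function names and the BMP-only range are assumptions made for brevity.

```python
# Hypothetical sketch of the --confidence check: a line counts as Chinese when
# the ratio of CJK characters to total characters meets the threshold.
def chinese_ratio(text: str) -> float:
    if not text:
        return 0.0
    # Only BMP CJK Unified Ideographs here, for brevity.
    cjk = sum(1 for ch in text if 0x4E00 <= ord(ch) <= 0x9FFF)
    return cjk / len(text)

def needs_translation(line: str, confidence: float = 0.05) -> bool:
    return chinese_ratio(line) >= confidence
```

With the default threshold of 0.05, even a single Chinese character in a short line triggers translation.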

Configuration

The tool writes a JSON config file on first run:

~/.chinese_file_translator/config.json

Example config.json:

{
  "backend": "google",
  "lang": "auto",
  "use_gpu": false,
  "chunk_size": 4000,
  "batch_size": 10,
  "bilingual": false,
  "microsoft_api_key": "YOUR_KEY_HERE",
  "microsoft_region": "eastus",
  "offline_model_dir": "~/.chinese_file_translator/models",
  "output_suffix": "_translated",
  "retry_attempts": 3,
  "retry_delay_seconds": 1.5,
  "max_history": 1000
}

Supported Chinese Variants

| Variant | Notes |
| --- | --- |
| Simplified Chinese | Mandarin, mainland China standard |
| Traditional Chinese | Taiwan, Hong Kong, Macau standard |
| Cantonese / Yue | Detected via CJK Unicode ranges |
| Classical Chinese | Treated as Traditional for translation |
| Mixed Chinese-English | Code-switching text handled transparently |

Translation Backends

| Backend | Requires | Speed | Quality | Internet |
| --- | --- | --- | --- | --- |
| Google Translate | `deep-translator` | Fast | High | Yes |
| Microsoft Translator | Azure API key + `deep-translator` | Fast | High | Yes |
| Helsinki-NLP MarianMT | `transformers`, `torch` | Medium | Good | No (after download) |

Google Translate is the default. If it fails, the tool falls back to the offline model automatically.
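The retry-then-fallback behavior could look like the sketch below. Function names and signatures are hypothetical; only the described behavior (explicit failure detection, retries, non-crashing fallback) comes from this README.

```python
import logging
import time

# Hypothetical sketch of backend resilience: each backend gets a few attempts,
# and when the primary keeps failing the next backend takes over instead of
# crashing the run.
def translate_resilient(text, backends, attempts=3, delay=0.0):
    for backend in backends:
        for attempt in range(1, attempts + 1):
            try:
                return backend(text)
            except Exception as exc:
                logging.warning("backend %s attempt %d failed: %s",
                                getattr(backend, "__name__", backend), attempt, exc)
                time.sleep(delay)
    raise RuntimeError("all translation backends failed")
```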



Technical Strategy: 'Never Miss' Logic

The tool employs a sophisticated "Global Surgical" approach to ensure no Chinese fragment is overlooked, regardless of its depth in JSON, HTML, or complex Markdown.

1. Surgical Block Extraction

Instead of line-by-line translation, the script identifies every continuous block of CJK characters (including ideographic symbols and punctuation) across the entire document. This ensures that contextually related characters are translated together for better accuracy.
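The block-extraction step can be approximated with a single regex pass over the whole document. The character class below is a simplified stand-in for the tool's full range set:

```python
import re

# Sketch of "surgical" block extraction: one pass that pulls every continuous
# run of CJK characters (ideographs plus some fullwidth punctuation) out of
# the document, so related characters stay together. Ranges are illustrative.
CJK_BLOCK = re.compile(
    r"[\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002ebef\u3000-\u303f"
    r"\uff0c\uff1f\uff01]+"
)

def extract_blocks(text: str) -> list:
    return CJK_BLOCK.findall(text)
```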

2. Structural Protection

Markdown and metadata structures are tokenized using unique, collision-resistant placeholders (`___MY_PROTECT_PH_{idx}___`).

  • YAML/TOML: Frontmatter is protected globally.
  • Code Fences: Backticks and language identifiers are protected; Chinese content inside comments or strings remains translatable.
  • Links & HTML: URLs and tag names are guarded, while display text is surgically translated.
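A minimal sketch of the tokenization step, showing fenced and inline code being swapped for the placeholder format above (the regex and helper names are assumptions; the real tool protects many more structures):

```python
import re

# Sketch of structural protection: protected regions (here only fenced blocks
# and inline code) are replaced with collision-resistant placeholders before
# translation; the originals are kept for later restoration.
def protect(text: str):
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return f"___MY_PROTECT_PH_{len(saved) - 1}___"
    protected = re.sub(r"```.*?```|`[^`\n]+`", stash, text, flags=re.DOTALL)
    return protected, saved
```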

3. Verification & Restoration

  • Longest-First Replacement: Translated segments are restored starting from the longest strings to prevent partial match overwrites.
  • Fuzzy Restoration: The restoration logic is space-lenient and case-insensitive to handle cases where online translation engines mangle the placeholder tokens.
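The lenient matching can be sketched like this; the exact tolerance pattern is an assumption, and longest-first ordering of translated-segment restoration is handled separately:

```python
import re

# Sketch of fuzzy restoration: placeholders are matched case-insensitively
# and with optional spaces or dropped underscores, since online engines
# sometimes mangle the tokens (e.g. "___ my_protect_ph_0 ___").
def restore(text: str, saved: list) -> str:
    for idx, original in enumerate(saved):
        pattern = re.compile(
            r"_\s*_\s*_\s*MY\s*_?\s*PROTECT\s*_?\s*PH\s*_?\s*%d\s*_\s*_\s*_" % idx,
            re.IGNORECASE,
        )
        # A callable replacement avoids backslash escaping issues in `original`.
        text = pattern.sub(lambda _m: original, text)
    return text
```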

Markdown Preservation

The following elements are meticulously protected:

| Element | Example | Protection Method |
| --- | --- | --- |
| Front Matter | `---\ntitle: ...\n---` | Full Tokenization |
| Fenced Code | `` ```python ... ``` `` | Boundary Tokenization |
| Inline Code | `` `code` `` | Full Tokenization |
| Links / Images | `[text](url)` | URL Tokenization |
| HTML Tags | `<div class="...">` | Tag Tokenization |
| Symbols | `&copy;`, `&#x...;` | Entity Tokenization |

Microsoft Translator Setup

  1. Go to Azure Cognitive Services
  2. Create a Translator resource (Free tier: 2M chars/month)
  3. Copy your API key and region
  4. Add them to ~/.chinese_file_translator/config.json:
{
  "microsoft_api_key": "abc123...",
  "microsoft_region": "eastus"
}

Then run:

python chinese_file_translator.py input.txt --backend microsoft

Files Generated

| Path | Description |
| --- | --- |
| `~/.chinese_file_translator/config.json` | Persistent settings |
| `~/.chinese_file_translator/history.json` | Session history log |
| `~/.chinese_file_translator/app.log` | Application log file |
| `~/.chinese_file_translator/models/` | Offline model cache (if used) |

Author

algorembrant


License

MIT License. See LICENSE for details.
