Anime Japanese Difficulty Predictor

This project implements an XGBoost regression model that predicts the Japanese-language difficulty of anime series. The model assigns a score on a 0-50 scale based on linguistic statistics extracted from subtitles, vocabulary coverage, and semantic content.

Dataset and Ground Truth

The model was trained on a dataset of approximately 1,100 anime series and movies.

  • Source: Difficulty ratings were sourced from Natively (learnnatively.com) using the platform's "Data Download" feature.
  • Scale: 0 to 50 (User-generated ratings).
  • Distribution: The dataset is approximately normally distributed and heavily concentrated in the 15-35 range, which covers the standard difficulty of most broadcast anime.

Data Collection

Subtitle data was aggregated using jimaku-downloader, a custom tool that interfaces with the Jimaku.cc API.

  • Extraction: The tool utilizes regex-based parsing to identify and map episodes to metadata.
  • Selection Logic: Priority was given to official Web-DL sources and plain-text SRT files over OCR-derived subtitles and styled ASS files.
  • Potential Noise: Because Jimaku relies on user and group uploads, and episode mapping is automated via regex, the dataset carries a margin of error in subtitle timing accuracy and version matching.
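The regex-based episode mapping above can be sketched as follows. This is an illustrative example only: jimaku-downloader's actual patterns are not documented here, so the regex and filenames below are hypothetical stand-ins.

```python
import re

# Hypothetical pattern mapping a subtitle filename to an episode number;
# matches forms like "E07", "Ep.7", or the Japanese "第12話".
EPISODE_RE = re.compile(r"(?:[Ee][Pp]?\.?\s*|第)(\d{1,3})(?:話)?")

def guess_episode(filename: str):
    match = EPISODE_RE.search(filename)
    return int(match.group(1)) if match else None

print(guess_episode("Series.Name.S01E07.WEB-DL.srt"))  # 7
print(guess_episode("シリーズ名 第12話.srt"))            # 12
print(guess_episode("extras_opening.srt"))             # None
```

Files that fail to match any pattern (the None case) are exactly where the mismatched-subtitle noise described above can enter.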

Feature Engineering

The model utilizes a combination of hard statistical features and semantic embeddings.

1. Statistical Features

  • Density Metrics: Characters per minute (CPM), Kanji density, Type-Token Ratio (TTR).
  • Vocabulary Coverage: Percentage of words appearing in common frequency lists (Top 1k, 2k, 5k, 10k).
  • Comprehension Thresholds: Number of unique words required to reach 90%, 95%, and 98% text coverage.
  • JLPT Distribution: Proportion of vocabulary corresponding to JLPT levels N5 through N1.
  • Part-of-Speech: Distribution of word types (nouns, verbs, particles, etc.).
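The density and coverage metrics above can be sketched on a toy example. This is a hedged sketch: the project's real tokenizer (likely a morphological analyzer) and its frequency lists are not reproduced here, so the token list and top-1k set below are hypothetical.

```python
import re
from collections import Counter

def subtitle_stats(tokens, text, minutes, top1k):
    counts = Counter(tokens)
    cpm = len(text) / minutes                         # characters per minute
    kanji = len(re.findall(r"[\u4e00-\u9fff]", text))
    kanji_density = kanji / len(text)                 # share of kanji characters
    ttr = len(counts) / len(tokens)                   # type-token ratio
    top1k_cov = sum(c for t, c in counts.items() if t in top1k) / len(tokens)
    # unique words needed, by corpus frequency rank, to reach 95% coverage
    needed, covered = 0, 0
    for _, c in counts.most_common():
        needed += 1
        covered += c
        if covered / len(tokens) >= 0.95:
            break
    return cpm, kanji_density, ttr, top1k_cov, needed

tokens = ["今日", "は", "いい", "天気", "だ", "ね", "今日", "は"]
text = "".join(tokens)
print(subtitle_stats(tokens, text, minutes=0.5, top1k={"今日", "は", "だ"}))
# (24.0, 0.5, 0.75, 0.625, 6)
```

The same cumulative-coverage loop, stopped at 0.90 and 0.98, yields the other two comprehension thresholds.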

2. Semantic Features

  • Text Inputs:
    • Series description.
    • "Lexical Signature": A concatenation of the top 200 most frequent content words (excluding stopwords) extracted from the subtitles.
  • Encoding: Text is encoded using paraphrase-multilingual-MiniLM-L12-v2.
  • Dimensionality Reduction: High-dimensional embeddings are reduced to 30 components using PCA.
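Building the "Lexical Signature" input can be sketched as below. The stopword set and tokenizer here are hypothetical stand-ins for the project's own; only the mechanism (rank content words by frequency, keep the top N, join into one string) reflects the description above.

```python
from collections import Counter

# Hypothetical stopword list; the real one is larger.
STOPWORDS = {"は", "が", "の", "に", "を", "だ", "ね"}

def lexical_signature(tokens, top_n=200):
    content = [t for t in tokens if t not in STOPWORDS]
    ranked = [word for word, _ in Counter(content).most_common(top_n)]
    return " ".join(ranked)

tokens = ["魔法", "の", "少女", "が", "魔法", "を", "使う", "ね"]
print(lexical_signature(tokens))  # 魔法 少女 使う
```

The resulting string, like the series description, is then encoded with paraphrase-multilingual-MiniLM-L12-v2 and PCA-reduced as described above.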

Model Architecture

The inference pipeline follows this structure:

  1. Preprocessing:
    • Numeric features are normalized using StandardScaler.
    • Text inputs are vectorized via SentenceTransformer and reduced via PCA.
  2. Estimator:
    • Algorithm: XGBoost Regressor.
    • Optimization: Hyperparameters tuned via Optuna (50 trials) minimizing RMSE.
    • Validation: 5-Fold Cross-Validation.
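The two-branch preprocessing can be sketched with hypothetical values: numeric features are z-scored (what a fitted StandardScaler applies), text embeddings arrive already PCA-reduced to 30 components, and the two blocks are concatenated into the row fed to the regressor.

```python
# Hypothetical means/stds and feature values for illustration only.
def zscore(values, means, stds):
    return [(v - m) / s for v, m, s in zip(values, means, stds)]

numeric = zscore([310.0, 0.28], means=[250.0, 0.25], stds=[60.0, 0.05])
text_pca = [0.0] * 30                  # stand-in for the 30 PCA components
feature_vector = numeric + text_pca    # input row for the XGBoost regressor
print(len(feature_vector))             # 32
```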

Performance

Evaluated on a held-out test set (20% split):

Metric   Value
RMSE     2.3633
MAE      1.8670
R²       0.5813

Limitations

  • Subtitle Quality: Reliance on user-uploaded subtitles introduces potential variance in transcription accuracy and timing.
  • Ground Truth Subjectivity: Natively ratings are based on user perception of difficulty rather than a standardized linguistic index.
  • Parsing Errors: The automated episode detection in the data collection phase may have resulted in mismatched subtitles for a small fraction of the training data.

Artifacts

The trained model is serialized as anime_difficulty_model.pkl. This file contains a dictionary with the following keys:

  • model: The trained XGBoost regressor.
  • scaler: Fitted StandardScaler for numeric features.
  • pca: Fitted PCA object for text embeddings.
  • feature_cols: List of numeric column names expected by the pipeline.

Note: The SentenceTransformer model is not pickled due to size; it must be re-initialized during inference.
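The artifact's dictionary layout can be illustrated with a stub round-trip. The real file stores fitted XGBoost/sklearn objects, and the feature-column names below are hypothetical; only the key structure matches the description above.

```python
import io
import pickle

# Stub mirroring the documented layout of anime_difficulty_model.pkl.
artifact = {
    "model": None,                        # trained XGBoost regressor
    "scaler": None,                       # fitted StandardScaler
    "pca": None,                          # fitted PCA for text embeddings
    "feature_cols": ["chars_per_minute", "kanji_density"],  # hypothetical names
}

buf = io.BytesIO()
pickle.dump(artifact, buf)
buf.seek(0)
bundle = pickle.load(buf)
print(sorted(bundle))  # ['feature_cols', 'model', 'pca', 'scaler']
```

At inference time the SentenceTransformer must be constructed separately, e.g. SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2"), since it is not part of the pickle.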

Acknowledgements

  • Natively: For providing the difficulty rating dataset.
  • Jimaku.cc: For providing access to the subtitle repository.