Model Card for ResNet-50 Document Classifier

This model is a ResNet-50 Convolutional Neural Network (CNN) fine-tuned to classify scanned document images into 16 categories (e.g., Emails, Invoices, Resumes, Scientific Reports). It achieves 88.46% overall accuracy on the RVL-CDIP test set and 95.62% Top-3 accuracy, making it well suited to automated document triage and organization pipelines.

Model Details

Model Description

This model utilizes the standard ResNet-50 architecture designed for image classification. Instead of "reading" the text like an OCR system, it analyzes the visual layout, structure, and low-level texture features of a whole document page to determine its category (e.g., recognizing the block layout of a resume versus the dense, two-column text of a scientific report).

It was trained using transfer learning: starting from weights pre-trained on ImageNet, the backbone was fine-tuned while a new classification head was trained for the 16 document classes.
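The exact training code is not published with this card, but the setup described above corresponds roughly to the following sketch (which pretrained weight variant was used is an assumption; torchvision's default is shown):

import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights (transfer learning)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Swap the 1000-class ImageNet head for a 16-class document head
model.fc = nn.Linear(model.fc.in_features, 16)  # in_features == 2048

# Per the description above, the backbone is then fine-tuned while the
# new head is trained from random initialization.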

  • Developed by: Arpit (@arpit-gour02)
  • Model type: Computer Vision (Image Classification / CNN)
  • Language(s) (NLP): English (Implicitly, via the text present in the RVL-CDIP dataset images)
  • License: MIT

Why ResNet-50?

Model        Approximate Parameters   Year Released   Layers
VGG-16       138.4 million            2014            16
AlexNet      61.1 million             2012            8
ResNet-50    25.6 million             2015            50

Model        FLOPs (Billions)   Efficiency Score
AlexNet      0.7 GFLOPs         Low Cost / Low Acc
ResNet-50    3.8 GFLOPs         High Efficiency
VGG-16       15.5 GFLOPs        Terribly Inefficient

Uses

Direct Use

This model is specifically designed for Document Triage and Automation Pipelines. It is best used as an initial sorting mechanism (a routing sketch follows the list):

  1. Office Automation: Automatically routing incoming scans to the correct department folder (e.g., sending "Invoices" to Accounting, "Resumes" to HR).
  2. Archive Digitization: Rapidly tagging metadata for large legacy paper archives.
  3. Preprocessing Filter: Acting as a cheap, fast gatekeeper before sending documents to expensive, specialized downstream systems (e.g., only sending confirmed "Forms" to a dedicated form-extraction model).
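As an illustration of the gatekeeper pattern in point 3, here is a minimal routing sketch; the confidence threshold and the "human_review" fallback are hypothetical design choices, not part of the model:

import torch

CONFIDENCE_THRESHOLD = 0.90  # hypothetical cutoff for automatic routing

def triage(model, input_tensor, class_names):
    """Return (destination, confidence); uncertain scans go to human review."""
    with torch.no_grad():
        probs = torch.softmax(model(input_tensor), dim=1)
    conf, idx = probs.max(dim=1)
    if conf.item() < CONFIDENCE_THRESHOLD:
        return "human_review", conf.item()  # avoid misrouting uncertain scans
    return class_names[idx.item()], conf.item()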

Out-of-Scope Use

This model is not suitable for:

  • Text Extraction (OCR): It classifies the type of document; it does not output the text written on it.
  • Handwriting Recognition: While it has a class for "Handwritten" documents, it only detects the presence of handwriting; it cannot read what is written.
  • Non-Document Images: The model will perform poorly on natural images (photos of objects, people, landscapes).

Bias, Risks, and Limitations

Users should be aware of the following technical limitations based on evaluation analysis:

  • Resolution Sensitivity (The "Blur" Problem): Model inputs are resized to 224x224 pixels. At this low resolution, dense text pages look like blurry gray blocks, which causes significant confusion between text-heavy classes, most notably Scientific Reports versus generic File Folders.
  • Visual Similarity: The model sometimes struggles to differentiate between Forms and Questionnaires, as they share very similar visual structures (checkboxes, lines, header fields).
  • Dataset Bias: The model was trained on the RVL-CDIP dataset, which consists primarily of older, grayscale, lower-quality scans. It may have lower accuracy on modern, born-digital, color PDF documents.

How to Get Started with the Model

Use the code block below to load the model architecture, load your trained weights, preprocess an image, and run inference.

import torch
from torchvision import models, transforms
from PIL import Image

# --- Setup ---
# 1. Define the 16 distinct classes
class_names = [
    'advertisement', 'budget', 'email', 'file folder', 'form', 'handwritten', 
    'invoice', 'letter', 'memo', 'news article', 'presentation', 'questionnaire', 
    'resume', 'scientific publication', 'scientific report', 'specification'
]

# 2. Define the preprocessing transformation (Must match training!)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    # Standard ImageNet normalization
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# --- Model Loading ---
# 3. Load the ResNet-50 architecture and replace the final layer
model = models.resnet50(weights=None)  # no ImageNet weights; the fine-tuned checkpoint is loaded below
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, len(class_names))

# 4. Load your trained weights (ensure path is correct)
# Note: map_location='cpu' ensures it loads even without a GPU
checkpoint = torch.load("resnet50_epoch_4.pth", map_location=torch.device('cpu'))
# Handle potential differences in how state_dict was saved
state_dict = checkpoint['state_dict'] if 'state_dict' in checkpoint else checkpoint
model.load_state_dict(state_dict)

model.eval() # Set to evaluation mode

# --- Inference ---
# 5. Load and preprocess an image
image_path = "path_to_your_test_document.jpg" 
image = Image.open(image_path).convert('RGB') # Ensure 3 channels
input_tensor = transform(image).unsqueeze(0) # Add batch dimension (B, C, H, W)

# 6. Predict
with torch.no_grad():
    outputs = model(input_tensor)
    probabilities = torch.nn.functional.softmax(outputs, dim=1)
    top_prob, top_catid = torch.topk(probabilities, 1)
    
    print(f"Prediction: {class_names[top_catid.item()]}")
    print(f"Confidence: {top_prob.item()*100:.2f}%")

Training Details

Training Data

The model was trained on the RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset.

  • Total Size: 400,000 grayscale images.
  • Classes: 16 perfectly balanced classes.
  • Split: The standard split is 320k Train, 40k Validation, 40k Test. This model used that split: 20k training images per class, with 2.5k images per class each for validation and test.
  • Data Handling: Original grayscale images were converted to 3-channel RGB to match the input expectations of the pre-trained ResNet backbone.

Training Procedure

Preprocessing

All images were resized to 224x224 pixels and normalized using standard ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).

Training Hyperparameters

The training used standard, stable hyperparameters for fine-tuning CNNs (a training-loop sketch follows the list):

  • Optimizer: SGD (Stochastic Gradient Descent)
  • Learning Rate: 0.01
  • Momentum: 0.9
  • Batch Size: 64
  • Epochs: 5
  • Training Regime: Mixed precision (PyTorch Automatic Mixed Precision, for speed and memory efficiency).
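The training script itself is not included with the card; the following is a minimal loop consistent with the hyperparameters above (train_loader is an assumed DataLoader with batch_size=64, and a CUDA GPU is assumed for AMP):

import torch

device = torch.device("cuda")
model = model.to(device)  # the ResNet-50 with a 16-class head, as above

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

for epoch in range(5):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda"):  # mixed-precision forward pass
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()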

Evaluation

Testing Data, Factors & Metrics

The model was evaluated on the standard, unseen RVL-CDIP Test Split containing 40,000 images.

Metrics used (a computation sketch follows the list):

  • Accuracy: The percentage of predictions that exactly matched the ground truth.
  • Top-3 Accuracy: The percentage of times the correct label appeared in the model's top three highest-probability predictions. This is often the most relevant metric for human-in-the-loop triage systems.
  • Precision/Recall/F1-Score: Evaluated on a per-class basis to identify specific strengths and weaknesses in the model's performance.
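These metrics can be reproduced with scikit-learn; a sketch, assuming y_true is an array of ground-truth class ids and y_prob an (N, 16) array of softmax outputs collected over the test set:

from sklearn.metrics import accuracy_score, classification_report, top_k_accuracy_score

y_pred = y_prob.argmax(axis=1)  # highest-probability class per image
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Top-3 accuracy:", top_k_accuracy_score(y_true, y_prob, k=3))
# Per-class precision/recall/F1, using the class_names list from the code above
print(classification_report(y_true, y_pred, target_names=class_names))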

Results

Metric             Result   Notes
Overall Accuracy   88.46%   Solid baseline performance.
Top-3 Accuracy     95.62%   Excellent reliability for triage tasks.

Loss and Accuracy Curves

Confusion Matrix

Detailed Classification Report

Detailed Performance Analysis (The "Traffic Light" Report)

An analysis of per-class F1-scores reveals distinct tiers of performance:

  • 🟢 Excellent (>90% F1): Email, Resume, Memo, Handwritten, Specification. The model is highly reliable for core administrative documents with distinct visual structures.
  • 🟡 Reliable (~85-89% F1): Invoice, Advertisement, News Article, Budget.
  • 🔴 Challenging (<75% Precision): Scientific Report, Form, File Folder. The major weakness is misclassifying Scientific Reports as File Folders due to resolution constraints blurring the dense text.

Environmental Impact

The training was conducted locally on consumer-grade hardware, resulting in negligible environmental impact compared to large-scale language model training.

  • Hardware Type: Apple M-Series Chip / single NVIDIA GPU
  • Hours used: Approximately 5 hours (1 hour per epoch)
  • Carbon Emitted: Negligible (local, small-scale training run).

Technical Specifications

Model Architecture and Objective

The model consists of the ResNet-50 backbone (a 50-layer deep Convolutional Neural Network using residual connections and bottleneck blocks) followed by a custom classification head. A shape-check sketch follows the list below.

  • Input Shape: (Batch_Size, 3, 224, 224) (RGB Images)
  • Backbone Output: 2048 feature maps of size 7x7.
  • Pooling: Global Average Pooling reduces dimensions to (Batch_Size, 2048).
  • Classification Head: A single fully connected linear layer mapping 2048 features to 16 class logits.
  • Objective: Minimize Cross-Entropy Loss between predicted logits and ground truth class labels.
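These shapes can be sanity-checked with a dummy forward pass; a quick sketch, assuming the model built in the getting-started code above:

import torch

dummy = torch.randn(1, 3, 224, 224)  # (Batch_Size, 3, 224, 224) input

# Everything except the final avgpool and fc acts as the feature backbone
backbone = torch.nn.Sequential(*list(model.children())[:-2])
print(backbone(dummy).shape)  # torch.Size([1, 2048, 7, 7])

print(model(dummy).shape)     # torch.Size([1, 16]) class logits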

Citation

If you use this model or the RVL-CDIP dataset, please cite the original paper:

BibTeX:

@inproceedings{harley2015icdar,
  title = {Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval},
  author = {Adam W. Harley and Alex Ufkes and Konstantinos G. Derpanis},
  booktitle = {International Conference on Document Analysis and Recognition (ICDAR)},
  year = {2015}
}