# A0l-8B-PRUNE

Pruned version of schneewolflabs/A0l-12B

This is a depth-pruned variant of the A0l-12B model, reduced from 12.25B to 8.70B parameters using structured layer removal.

## Model Details
- Base Model: schneewolflabs/A0l-12B
- Architecture: Mistral-Nemo derivative
- Pruning Method: Depth pruning (layer removal)
- Original Parameters: 12.25B
- Pruned Parameters: 8.70B
- Reduction: 28.9%
- Layers: 40 → 27 (removed 13 middle layers)
## Pruning Details

### What Changed

- Removed: 13 transformer layers (layers 13-25); see the sketch after this list
- Kept: Early layers (feature extraction) + late layers (task-specific)
- Preserved:
  - ✅ Vocabulary size (128k tokens)
  - ✅ Hidden dimensions (5120)
  - ✅ FFN dimensions (14336)
  - ✅ Attention structure (32 heads, 8 KV heads)
  - ✅ SwiGLU activation
  - ✅ Same tokenizer
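For illustration, removing decoder blocks from a Mistral-style checkpoint in transformers looks roughly like the sketch below. This is a hypothetical reconstruction, not the actual script used for this release (which used Torch-Pruning); `model.model.layers` is the standard attribute path for Mistral-family models.

```python
# Minimal depth-pruning sketch (hypothetical; the release was produced with Torch-Pruning).
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A0l-12B", torch_dtype=torch.bfloat16
)

# Keep everything except decoder blocks 13-25 (13 layers removed).
keep = [i for i in range(model.config.num_hidden_layers) if not 13 <= i <= 25]
model.model.layers = nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(keep)  # 40 -> 27

model.save_pretrained("A0l-8B-PRUNE")
```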
### What's Maintained
This model maintains full compatibility with the Mistral-Nemo architecture:
- Same vocabulary and tokenizer
- Same hidden size and FFN dimensions
- Grouped Query Attention (GQA) with 4:1 ratio
- Rotary Position Embeddings (RoPE) with theta=1M
- BFloat16 precision
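These invariants can be sanity-checked from the published config; a small sketch using the standard Mistral config field names in transformers:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("nbeerbower/A0l-8B-PRUNE")

assert config.num_hidden_layers == 27
assert config.hidden_size == 5120
assert config.intermediate_size == 14336
assert config.num_attention_heads == 32
assert config.num_key_value_heads == 8   # GQA: 32 / 8 = 4:1
assert config.rope_theta == 1_000_000    # RoPE theta = 1M
assert config.vocab_size == 131072
```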
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("nbeerbower/A0l-8B-PRUNE")

# Generate text
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Performance

### Model Size
- Original: ~24GB (12.25B parameters)
- Pruned: ~17GB (8.70B parameters)
- Savings: ~29% smaller
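These sizes follow directly from bfloat16 storage at 2 bytes per parameter:

```python
# Weight size in bfloat16 = parameter count x 2 bytes.
for name, params in [("original", 12_247_782_400), ("pruned", 8_703_462_400)]:
    print(f"{name}: {params * 2 / 1e9:.1f} GB")
# original: 24.5 GB
# pruned:   17.4 GB
```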
### Expected Benefits
- Inference Speed: ~1.4x faster (27 vs. 40 layers per forward pass)
- Memory: ~29% less VRAM required
- Compatibility: Works with standard Hugging Face tooling and common serving stacks (vLLM, TGI, etc.); see the vLLM sketch below
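As an example of that compatibility, serving with vLLM should require nothing model-specific. An untested sketch using vLLM's offline-inference API:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="nbeerbower/A0l-8B-PRUNE", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=100)

outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)
```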
## Quality Considerations

⚠️ **Important:** This model was pruned without knowledge distillation or retraining. Output quality is degraded compared to the original A0l-12B:
- Text may be less coherent
- May produce grammatical errors or artifacts
- Suitable for applications where speed/size matter more than perfect quality
**For best results:** Consider this a "base" pruned model that should be fine-tuned or distilled on your target task/dataset.
## Recommended Use Cases

**Good for:**
- Resource-constrained deployments
- Experimentation and research
- Base model for further fine-tuning
- Applications where speed > quality
**Not recommended for:**
- Production chatbots without further training
- High-stakes text generation
- Tasks requiring perfect coherence
## How to Improve Quality

- Knowledge Distillation: Use the original A0l-12B as teacher and train for 1-2 epochs (see the loss sketch below)
- Fine-tuning: Train on your specific task/domain
- Try Conservative Pruning: Remove fewer layers (8 instead of 13)
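For the distillation route, this card doesn't prescribe a recipe; a common starting point is a temperature-scaled KL loss between teacher (A0l-12B) and student (this model) logits. A minimal sketch, assuming both models share the tokenizer (they do, per the table below):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor keeps gradient magnitudes comparable
    across temperatures (Hinton et al., 2015).
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Per training step (teacher frozen, student = this pruned model):
#   with torch.no_grad():
#       teacher_logits = teacher(**batch).logits
#   loss = distillation_loss(student(**batch).logits, teacher_logits)
```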
## Technical Details

### Pruning Configuration
```json
{
  "original_model": "schneewolflabs/A0l-12B",
  "original_params": 12247782400,
  "pruned_params": 8703462400,
  "reduction_percent": 28.93,
  "pruning_type": "depth",
  "layers_removed": 13,
  "removed_layer_indices": "13-25"
}
```
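The parameter counts can be verified directly against the loaded weights:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nbeerbower/A0l-8B-PRUNE", torch_dtype=torch.bfloat16
)
total = sum(p.numel() for p in model.parameters())
print(total)  # expected: 8703462400 (~28.9% below the original 12247782400)
```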
### Architecture Comparison
| Component | Original | Pruned | Status |
|---|---|---|---|
| Layers | 40 | 27 | ⚠️ Changed |
| Hidden Size | 5120 | 5120 | ✅ Same |
| Intermediate Size | 14336 | 14336 | ✅ Same |
| Attention Heads | 32 | 32 | ✅ Same |
| KV Heads | 8 | 8 | ✅ Same |
| Vocab Size | 131072 | 131072 | ✅ Same |
### Pruning Methodology

This model was pruned using structured depth pruning:

1. Identified redundant middle layers (13-25); one common heuristic is sketched below
2. Removed the corresponding transformer blocks in their entirety
3. Preserved architectural integrity
4. Applied no retraining (zero-shot pruning)

Tools used: Torch-Pruning
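The card does not document how the middle layers were scored as redundant. One common heuristic in depth pruning is to measure, on calibration text, how little each block changes the residual stream: blocks whose output is nearly identical to their input (high cosine similarity) are removal candidates. A hedged sketch of that idea, not the documented procedure:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical heuristic only; the actual selection criterion is not documented.
model_id = "schneewolflabs/A0l-12B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("Some representative calibration text.", return_tensors="pt")
with torch.no_grad():
    # hidden_states[i] is the input to block i; hidden_states[i + 1] is its output.
    hidden = model(**inputs, output_hidden_states=True).hidden_states

for i in range(model.config.num_hidden_layers):
    sim = F.cosine_similarity(
        hidden[i].float().flatten(), hidden[i + 1].float().flatten(), dim=0
    )
    print(f"block {i:2d}: cos(input, output) = {sim.item():.4f}")  # ~1.0 => near-identity
```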
## Limitations
- Output quality degraded vs. original (no distillation applied)
- May produce incoherent text on complex prompts
- Not suitable for production without further training
- Intended as a research/development artifact
## Citation

If you use this model, please cite the original A0l-12B:

- Original model: schneewolflabs/A0l-12B
- Base architecture: Mistral-Nemo-12B
- Pruning method: Depth pruning (layer removal)
## License
Follows the same license as the original A0l-12B model.
## Acknowledgments
- Original model: schneewolflabs/A0l-12B
- Base architecture: Mistral-Nemo-12B
- Pruning library: Torch-Pruning