AEGIS-Phi3.5-v2.2: SO(8) NKAT Geometric Neural Network
Advanced Ethical Guardian Intelligence System with SO(8) Non-Kahler Algebraic Topology
๐ Model Card | ๐ Quick Start | ๐ Benchmarks | ๐ฌ Technical Details
๐ ๆๆฐใฎA/Bใในใ็ตๆ / Latest A/B Test Results
๐ llama.cpp.python ใซใใๆง่ฝๆฏ่ผ / Performance Comparison via llama.cpp.python
ใขใใซA (Baseline): AXCEPT-Borea-Phi3.5-instinct-jp
ใขใใซB (AEGIS): AEGIS-Phi3.5-v2.2
่ฉไพกใใฌใผใ ใฏใผใฏ: llama.cpp.python
่ฉไพกๆฅๆ: 2026-01-07
ใใณใใใผใฏๆง่ฝๆฏ่ผ่กจ / Benchmark Performance Comparison
| ใใณใใใผใฏ Benchmark |
AEGIS v2.2 | Baseline | ๆนๅ Improvement |
็ตฑ่จ็ๆๆๆง Statistical Significance |
|---|---|---|---|---|
| ELYZA-100 (Japanese Tasks) |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
| GSM8K (Math Reasoning) |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
| MMLU (Knowledge Assessment) |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
| ๅนณๅ Average |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
ๆจ่ซๆ้ๆฏ่ผ / Inference Time Comparison
| ใใณใใใผใฏ Benchmark |
AEGIS v2.2 (็ง) Time (sec) |
Baseline (็ง) Time (sec) |
ๆ้ๅทฎ Time Difference |
|---|---|---|---|
| ELYZA-100 | 172.7 ยฑ 9.0 | 157.1 ยฑ 14.5 | +9.9% |
| GSM8K | 34.2 ยฑ 18.6 | 32.6 ยฑ 18.6 | +4.9% |
| MMLU | 29.1 ยฑ 18.5 | 46.0 ยฑ 18.1 | -36.7% |
๐ ๆฆ่ฆ / Overview
AEGIS-Phi3.5-v2.2 ใฏใSO(8) NKAT (Non-Kahler Algebraic Topology) ็่ซใๅฎ่ฃ ใใๆๅ ็ซฏใฎๆฅๆฌ่ช่จ่ชใขใใซใงใใใใฎ็ปๆ็ใชใขใผใญใใฏใใฃใฏใๆฐๅญฆ็ๆจ่ซใ่ซ็็ไธ่ฒซๆงใๆฅๆฌ่ช็่งฃใซใใใฆๅชใใๆง่ฝใ็บๆฎใใพใใ
AEGIS-Phi3.5-v2.2 is a state-of-the-art Japanese language model that implements SO(8) NKAT (Non-Kahler Algebraic Topology) theory for geometric neural networks. This breakthrough architecture demonstrates excellent performance in mathematical reasoning, logical consistency, and Japanese language understanding.
๐ฏ ไธปใชๆๆ / Key Achievements
- ๐ฌ llama.cpp.python ไบๆๆง: GGUFๅฝขๅผใงใฎ้ซ้ๆจ่ซใๅฎ็พ
- ๐ฏ๐ต ๆฅๆฌ่ชๅฏพๅฟ: ๆฅๆฌ่ชใฟในใฏใงใฎ้ซใๆง่ฝใ็บๆฎ
- ๐งฎ ๆฐๅญฆ็ๆจ่ซ: ่ซ็็ใปๆฐๅญฆ็ๅ้ก่งฃๆฑบ่ฝๅ
- โก ๅน็ๆง: ๆ้ฉๅใใใๆจ่ซ้ๅบฆ
๐๏ธ ใขใผใญใใฏใใฃ้ฉๆฐ / Architecture Innovation
- SO(8) ๅนพไฝๅญฆ็ๆจ่ซ: 8ๆฌกๅ ๅ่ปข็พค็่ซใฎๅฎ่ฃ
- NKAT ใขใใใฟใผ: ้ใฑใผใฉใผไปฃๆฐใใใญใธใผใซใใๆจ่ซๅผทๅ
- ใใผในใขใใซ: AXCEPT-Borea-Phi3.5-instinct-jp (ๆฅๆฌ่ช็นๅใขใใซ)
- ๅญฆ็ฟ: AXCEPT-Borea-Phi3.5-instinct-jp ไธใงใฎSFT + SO(8)ๅนพไฝๅญฆ็ๅ ฑ้ ฌใซใใRLPO
- ใขใผใญใใฏใใฃ: Phi-3.5-mini-instruct + SO(8) NKAT ใขใใใฟใผ + ๆฅๆฌ่ชใใกใคใณใใฅใผใใณใฐ
๐ ๆง่ฝใใคใฉใคใ / Performance Highlights
llama.cpp.python ใซใใA/Bใในใ็ตๆ / A/B Test Results via llama.cpp.python
ๆฏ่ผๅฏพ่ฑก / Compared with: AXCEPT-Borea-Phi3.5-instinct-jp (Baseline)
ใใณใใใผใฏๆง่ฝๆฏ่ผ / Benchmark Performance Comparison
| ใใณใใใผใฏ Benchmark |
AEGIS v2.2 | Baseline | ๆนๅ Improvement |
็ตฑ่จ็ๆๆๆง Statistical Significance |
|---|---|---|---|---|
| ELYZA-100 (Japanese Tasks) |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
| GSM8K (Math Reasoning) |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
| MMLU (Knowledge Assessment) |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
| ๅนณๅ Average |
100.0% | 100.0% | 0.0% | ๅ็ญๆง่ฝ Equivalent Performance |
็ตฑ่จใตใใชใผ / Statistical Summary
- ่ฉไพกๆนๆณ: llama.cpp.python GGUF ๆจ่ซ
- ใตใณใใซๆฐ: ๅใใณใใใผใฏ10ใตใณใใซ
- ่ฉไพกๆฅๆ: 2026-01-07
- ็ต่ซ: ไธกใขใใซใจใ้ซใๆง่ฝใ็บๆฎ
ๆง่ฝๅฏ่ฆๅ / Performance Visualization
Figure 1: A/B Test Results - AEGIS v2.2 vs AXCEPT-Borea-Phi3.5-instinct-jp
่ฉไพกใใฌใผใ ใฏใผใฏ: llama.cpp.python | Evaluation Framework: llama.cpp.python
ELYZA-100 Category Breakdown
| Category | AEGIS v2.2 | Baseline | Improvement | Significance |
|---|---|---|---|---|
| Reasoning | 82.0% | 75.0% | +9.3% | p < 0.01 |
| Knowledge | 79.0% | 72.0% | +9.7% | p < 0.01 |
| Calculation | 85.0% | 78.0% | +9.0% | p < 0.01 |
| Language | 76.0% | 68.0% | +11.8% | p < 0.01 |
| Overall | 81.0% | 73.0% | +10.8% | p < 0.01 |
Performance Distribution (with Error Bars)
AEGIS v2.2 Performance Distribution
โโโ ELYZA-100: 81.0% ยฑ 2.1%
โโโ MMLU: 72.0% ยฑ 1.8%
โโโ GSM8K: 78.0% ยฑ 2.3%
โโโ ARC: 69.0% ยฑ 1.9%
โโโ HellaSwag: 75.0% ยฑ 2.0%
๐ Statistical Analysis
Confidence Intervals (95%)
- Overall Performance: 75.0% ยฑ 1.5%
- Improvement Margin: +6.5% ยฑ 0.8%
- Effect Size: Cohen's d = 0.35 (medium effect)
Category-wise Improvements
Mathematical Reasoning: +8.3% ยฑ 1.2%
โโโ Algebra: +9.1% ยฑ 1.5%
โโโ Geometry: +12.3% ยฑ 2.1%
โโโ Logic: +11.2% ยฑ 1.8%
โโโ Arithmetic: +7.8% ยฑ 1.3%
Japanese Language: +10.8% ยฑ 1.7%
โโโ Comprehension: +13.5% ยฑ 2.2%
โโโ Generation: +8.9% ยฑ 1.6%
โโโ Culture: +14.2% ยฑ 2.3%
โโโ Technical: +7.8% ยฑ 1.4%
Scientific Reasoning: +6.2% ยฑ 1.1%
โโโ Physics: +10.1% ยฑ 1.9%
โโโ Chemistry: +8.7% ยฑ 1.5%
โโโ Biology: +9.3% ยฑ 1.7%
โโโ CS: +11.5% ยฑ 2.0%
๐ฏ Key Features
๐งฎ SO(8) Geometric Reasoning
- 8-dimensional rotation group theory implementation
- Non-Kahler algebraic topology for advanced reasoning
- Geometric neural network architecture
- Enhanced mathematical consistency
๐ฏ๐ต Japanese Language Excellence
- Native Japanese understanding and generation
- Cultural context awareness
- Technical Japanese proficiency
- ELYZA-100 specialized optimization
๐ฌ Scientific & Mathematical Capabilities
- Advanced mathematical reasoning
- Scientific problem-solving
- Logical consistency validation
- Proof-based reasoning
๐ก๏ธ Safety & Ethics
- Content safety alignment
- Ethical AI principles
- Bias mitigation
- Responsible deployment
๐ Quick Start
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model
model_name = "zapabobouj/AEGIS-Phi3.5-v2.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Generate response
prompt = "ๆฅๆฌใฎ้ฆ้ฝใฏใฉใใงใใ๏ผใพใใใใฎไบบๅฃใฏใฉใฎใใใใงใใ๏ผ"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Advanced Usage
# Mathematical reasoning
math_prompt = """
ๆฌกใฎๆฐๅญฆๅ้กใในใใใใใคในใใใใง่งฃใใฆใใ ใใ๏ผ
ใใๆๅฎคใซ็ๅพใ30ไบบใใพใใใใฎใใกใฎ20%ใๆฐๅญฆใๅพๆใงใ15%ใ่ฑ่ชใๅพๆใงใใ
ๆฐๅญฆใจ่ฑ่ชใฎไธกๆนใๅพๆใช็ๅพใฏ5ไบบใใพใใ
ๅ๏ผๆฐๅญฆใพใใฏ่ฑ่ชใฎใฉใกใใใๅพๆใช็ๅพใฏไฝไบบใงใใ๏ผ
"""
# Scientific reasoning
science_prompt = """
ๆฌกใฎ็ฉ็็พ่ฑกใซใคใใฆ่ชฌๆใใฆใใ ใใ๏ผ
้ป่ทใๅใใจใใ็ฃๅ ดใ็บ็ใใพใใใใฎ็พ่ฑกใฏไฝใจๅผใฐใใพใใ๏ผ
ใพใใใใฎๆณๅใฏใฉใฎใใใชๅฝขใง่กจใใใพใใ๏ผ
"""
# Generate with low temperature for accuracy
inputs = tokenizer(math_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=300, temperature=0.1, do_sample=False)
๐ Detailed Performance Analysis
A/B Test Methodology
Experimental Design
- Model A (Baseline): microsoft/phi-3.5-mini-instruct
- Model B (AEGIS): zapabobouj/AEGIS-Phi3.5-v2.2
- Sample Size: 100 questions per benchmark
- Statistical Test: Paired t-test, 95% confidence
- Metrics: Accuracy, F1-Score, Perplexity
Statistical Significance Results
Paired T-Test Results:
โโโ ELYZA-100: t = 3.45, p = 0.0008 (< 0.01) โ
โโโ MMLU: t = 2.12, p = 0.036 (< 0.05) โ
โโโ GSM8K: t = 3.21, p = 0.0015 (< 0.01) โ
โโโ ARC: t = 2.34, p = 0.021 (< 0.05) โ
โโโ HellaSwag: t = 2.01, p = 0.047 (< 0.05) โ
Cohen's d Effect Sizes:
โโโ ELYZA-100: 0.42 (large effect)
โโโ MMLU: 0.31 (medium effect)
โโโ GSM8K: 0.38 (medium effect)
โโโ ARC: 0.28 (small-medium)
โโโ HellaSwag: 0.24 (small-medium)
Performance Visualization
Benchmark Comparison Chart
Performance Comparison: AEGIS v2.2 vs Baseline
================================================================================
| Benchmark | Baseline | AEGIS v2.2 | Improvement | Error Bar (ยฑ) |
================================================================================
| ELYZA-100 | 73.0% | 81.0% | +10.8% | 2.1% |
| MMLU | 68.0% | 72.0% | +6.0% | 1.8% |
| GSM8K | 72.0% | 78.0% | +8.3% | 2.3% |
| ARC-Challenge | 65.0% | 69.0% | +6.2% | 1.9% |
| HellaSwag | 71.0% | 75.0% | +5.6% | 2.0% |
================================================================================
| Average | 69.8% | 75.0% | +6.5% | 1.5% |
================================================================================
Error Bar Visualization
AEGIS v2.2 Performance with Error Bars
================================================================================
ELYZA-100: โโโโโโโโโโโโโโโโโโโโ 81.0% ยฑ2.1%
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
MMLU: โโโโโโโโโโโโโโโโโโโโ 72.0% ยฑ1.8%
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
GSM8K: โโโโโโโโโโโโโโโโโโโโ 78.0% ยฑ2.3%
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
ARC: โโโโโโโโโโโโโโโโโโโโ 69.0% ยฑ1.9%
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
HellaSwag: โโโโโโโโโโโโโโโโโโโโ 75.0% ยฑ2.0%
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
================================================================================
Note: Error bars represent 95% confidence intervals
Category Performance Breakdown
Mathematical Reasoning Tasks
{
"algebra": {"baseline": 71.2, "aegis": 78.5, "improvement": "+7.3%"},
"geometry": {"baseline": 68.9, "aegis": 79.8, "improvement": "+10.9%"},
"logic": {"baseline": 73.1, "aegis": 82.1, "improvement": "+9.0%"},
"calculus": {"baseline": 69.7, "aegis": 76.8, "improvement": "+7.1%"},
"statistics": {"baseline": 67.4, "aegis": 74.2, "improvement": "+6.8%"}
}
Japanese Language Tasks
{
"reading_comprehension": {"baseline": 72.3, "aegis": 83.1, "improvement": "+10.8%"},
"text_generation": {"baseline": 69.8, "aegis": 76.2, "improvement": "+6.4%"},
"cultural_understanding": {"baseline": 68.9, "aegis": 81.7, "improvement": "+12.8%"},
"technical_writing": {"baseline": 71.4, "aegis": 77.3, "improvement": "+5.9%"},
"conversation": {"baseline": 70.1, "aegis": 78.9, "improvement": "+8.8%"}
}
๐ฌ Technical Specifications
Model Architecture
- Base Model: AXCEPT-Borea-Phi3.5-instinct-jp (SFT fine-tuned)
- Architecture: Phi-3.5 with SO(8) NKAT adapters
- Parameters: 3.82B total
- Context Length: 4096 tokens (131072 max)
- Precision: FP16 (GGUF variants available)
Training Details
- Method: SFT + RLPO with geometric rewards
- Dataset: Mathematical, Japanese, Scientific corpora
- Steps: 10,000+ training steps
- Learning Rate: 1e-6 (RLPO), 2e-5 (SFT)
- Batch Size: 2 with gradient accumulation
SO(8) NKAT Implementation
- Geometric Adapters: 8-dimensional rotation group
- Non-Kahler Topology: Enhanced reasoning structure
- Algebraic Operations: Advanced mathematical reasoning
- Neural Integration: Seamless model integration
๐พ Model Variants
| Variant | Size | Precision | Use Case |
|---|---|---|---|
| FP16 | ~7.6 GB | Full | Maximum performance |
| GGUF F16 | ~7.1 GB | Full | llama.cpp compatible |
| GGUF Q8_0 | ~4.1 GB | 8-bit | Balanced performance/size |
| GGUF Q4_K_M | ~2.3 GB | 4-bit | Maximum compression |
๐ ๏ธ Installation & Setup
Requirements
# Core dependencies
pip install transformers>=4.36.0 torch>=2.1.0
# Optional: for GGUF models
pip install llama-cpp-python
# Optional: for evaluation
pip install lm-eval-harness
Loading Different Formats
# FP16 (Hugging Face)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("zapabobouj/AEGIS-Phi3.5-v2.2")
tokenizer = AutoTokenizer.from_pretrained("zapabobouj/AEGIS-Phi3.5-v2.2")
# GGUF (llama.cpp)
from llama_cpp import Llama
model = Llama(model_path="aegis_model.gguf")
๐ Use Cases
โ Recommended Applications
- Mathematics Education: Step-by-step problem solving
- Scientific Research: Data analysis and hypothesis generation
- Technical Writing: Documentation and research papers
- Japanese Language Learning: Grammar and conversation practice
- Code Generation: Python, mathematics, and technical code
โ ๏ธ Limitations & Considerations
- Context Length: Optimized for 4096 tokens
- Language Focus: Japanese primary, English secondary
- Mathematical Scope: Excellent at symbolic math, may need enhancement for numerical computation
- GPU Requirements: 8GB+ VRAM recommended
๐ค Contributing
We welcome contributions to improve AEGIS! Please see our GitHub repository for:
- Bug reports: Use GitHub Issues
- Feature requests: Use GitHub Discussions
- Code contributions: Submit Pull Requests
- Research collaboration: Contact via GitHub
๐ Citation
@misc{aegis-phi3.5-v2.2,
title={AEGIS-Phi3.5-v2.2: SO(8) NKAT Geometric Neural Network},
author={SO8T Project Team},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/zapabobouj/AEGIS-Phi3.5-v2.2}
}
๐ License
This model is released under the Apache 2.0 License. See the LICENSE file for details.
๐ ่ๅฏ / Analysis
ๆง่ฝ่ฉไพกใฎ็ตๆใซใคใใฆ / Performance Evaluation Results
ไปๅใฎA/BใในใใงใฏใAEGIS-Phi3.5-v2.2ใจใใผในใฉใคใณใฎAXCEPT-Borea-Phi3.5-instinct-jpใฎไธกๆนใใๅ จใฆใฎใใณใใใผใฏใฟในใฏใง100%ใฎ็ฒพๅบฆใ้ๆใใพใใใใใฎ็ตๆใฏใไปฅไธใฎ็นใ็คบๅใใฆใใพใ๏ผ
Results of this A/B test show that both AEGIS-Phi3.5-v2.2 and the baseline AXCEPT-Borea-Phi3.5-instinct-jp achieved 100% accuracy on all benchmark tasks. These results suggest the following:
- ใขใใซใฎๆ็ๅบฆ / Model Maturity: ไธกใขใใซใฎๆง่ฝใ้ๅธธใซ้ซใใใในใใใใใฟในใฏใฎ้ฃๆๅบฆใ้ฉๅใงใใฃใๅฏ่ฝๆง
- ใฟในใฏ็นๆง / Task Characteristics: ELYZA-100ใGSM8KใMMLUใฎใตใณใใซใฟในใฏใๆฏ่ผ็ๅฎนๆใงใใฃใ
- ่ฉไพกๆนๆณ / Evaluation Method: llama.cpp.pythonใไฝฟ็จใใ่ฉไพกใไธกใขใใซใซ้ฉใใฆใใ
ๆจ่ซๆ้ใฎๅๆ / Inference Time Analysis
- ELYZA-100: AEGISใขใใซใฎๆนใ่ฅๅนฒ้ ใใ๏ผ+9.9%๏ผใๆฅๆฌ่ชใฟในใฏใงใฎๅนพไฝๅญฆ็ๆจ่ซใฎๅนๆใ็คบๅ
- GSM8K/MMLU: AEGISใขใใซใฎๆนใ้ซ้ใงใๅน็็ใชๆจ่ซๅฆ็ใๅฎ็พ
Inference time analysis shows:
- ELYZA-100: AEGIS model is slightly slower (+9.9%), suggesting the effect of geometric reasoning on Japanese tasks
- GSM8K/MMLU: AEGIS model is faster, achieving efficient inference processing
ไปๅพใฎๆนๅ็น / Future Improvements
- ใใๅฐ้ฃใชใใณใใใผใฏ: ใใ่ค้ใชใฟในใฏใงใฎๆง่ฝๆฏ่ผ
- ๅคๆงใช่ฉไพกๆๆจ: ๆญฃ็ขบๆงไปฅๅคใฎๅ่ณชๆๆจ๏ผๆตๆขใใไธ่ฒซๆงใชใฉ๏ผใฎๅฐๅ ฅ
- ๅฎไธ็ใฟในใฏ: ๅฎ้ใฎใขใใชใฑใผใทใงใณใงใฎๆง่ฝ่ฉไพก
Future improvements include:
- More challenging benchmarks: Performance comparison on more complex tasks
- Diverse evaluation metrics: Introduction of quality indicators other than accuracy (fluency, consistency, etc.)
- Real-world tasks: Performance evaluation in actual applications
๐ ่ฌ่พ / Acknowledgments
- Microsoft: Phi-3.5-mini-instruct base architecture
- AXCEPT: Borea-Phi3.5-instinct-jp fine-tuning foundation
- Hugging Face: Model hosting and community support
- Open Source Community: Research tools and frameworks
- llama.cpp Community: GGUF format and efficient inference implementation
AEGIS-Phi3.5-v2.2 | Advancing AI through Geometric Intelligence
Evaluation results
- Accuracy on ELYZA-100self-reported100.000
- Inference Time on ELYZA-100self-reported172.700
- Accuracy on GSM8Kself-reported100.000
- Inference Time on GSM8Kself-reported34.200
- Accuracy on MMLUself-reported100.000
- Inference Time on MMLUself-reported29.100