Update README2.md
README2.md CHANGED (+21 -78)
@@ -55,102 +55,45 @@ integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
  # File Structure Overview:
  Spacy_Model_creator/
  │
- ├──
  │   └── ner_model_05_3      # Pretrained spaCy model directory for resume parsing
  │
  ├── templates/
- │   ├──
- │   ├── result.html
  │
- ├──
  │
  ├── utils/
- │   ├──
- │   ├──
- │   ├──
- │   ├──
  │
  ├── venv/                   # Virtual environment
  │
  ├── .env                    # Environment variables file (contains Hugging Face token)
  │
- ├──
  │
  └── requirements.txt        # Dependencies required for the project

-
- # Program Overview:
-
- # Mistral Integration (utils/mistral.py)
- - Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
- - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- - Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
-
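The fallback mechanism described in the removed section above can be sketched roughly as follows. This is only an illustration, not the project's code: `parse_with_fallback`, `failing_mistral`, and `stub_spacy` are hypothetical names, and the two stubs stand in for the real Mistral call and the spaCy NER model.

```python
from typing import Callable, Dict


def parse_with_fallback(
    text: str,
    primary: Callable[[str], Dict],
    fallback: Callable[[str], Dict],
) -> Dict:
    """Try the primary parser; use the fallback if it raises or returns nothing."""
    try:
        result = primary(text)
        if result:
            return {"source": "primary", "data": result}
    except Exception as exc:  # e.g. network failure or malformed JSON
        print(f"Primary parser failed ({exc!r}); using fallback")
    return {"source": "fallback", "data": fallback(text)}


def failing_mistral(text: str) -> Dict:
    """Hypothetical stand-in for the Mistral API call; always fails."""
    raise RuntimeError("API unavailable")


def stub_spacy(text: str) -> Dict:
    """Hypothetical stand-in for the spaCy NER parser."""
    return {"name": text.split()[0]}


outcome = parse_with_fallback("Jane Doe, Python developer", failing_mistral, stub_spacy)
print(outcome["source"])
```

Passing the two parsers in as callables keeps the fallback logic testable without a network connection or a trained model.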
- # SpaCy Integration (utils/spacy.py)
- - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- - Validation: Includes validation for extracted emails and contacts.
-
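The README does not say how the extracted emails and contacts are validated. One plausible approach, shown here as a sketch under that assumption, is a regular-expression check for emails and a digit-count check for phone numbers. The function names, the pattern, and the 7 to 15 digit bounds are illustrative choices, not taken from the project.

```python
import re
from typing import Optional

# Deliberately simple pattern: local part, "@", domain, dot, 2+ letter TLD.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def valid_email(value: str) -> bool:
    """Return True if the value looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def normalize_contact(value: str, min_digits: int = 7, max_digits: int = 15) -> Optional[str]:
    """Strip formatting from a phone number; return None if the digit count is implausible."""
    digits = re.sub(r"\D", "", value)
    if min_digits <= len(digits) <= max_digits:
        prefix = "+" if value.strip().startswith("+") else ""
        return prefix + digits
    return None


print(valid_email("jane.doe@example.com"))
print(normalize_contact("+1 (555) 010-9999"))
```

Running such checks on NER output helps discard entities the model mislabeled, such as a date tagged as a contact number.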
- # File Conversion (utils/fileTotext.py)
- - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
- - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
- - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
- - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
- - RTF Files: Reads plain text from RTF files.
- - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
-   Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
-
-
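A common way to handle several input formats like those listed above is to dispatch on the file extension. The sketch below assumes that design and is not the project's implementation: `extractor_for` and the three extractor functions are hypothetical names, ODT and RTF handling is omitted for brevity, and the third-party imports (PyMuPDF, python-docx, pytesseract, Pillow) are deferred so the dispatch logic can be exercised without them installed.

```python
from pathlib import Path
from typing import Callable


def pdf_to_text(path: str) -> str:
    """Extract the text layer of a PDF with PyMuPDF."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def docx_to_text(path: str) -> str:
    """Join the paragraph text of a Word document with python-docx."""
    import docx  # python-docx

    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def image_to_text(path: str) -> str:
    """Run Tesseract OCR on an image (requires a local Tesseract install)."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


# Extension-to-extractor table; ODT and RTF are left out of this sketch.
EXTRACTORS = {
    ".pdf": pdf_to_text,
    ".docx": docx_to_text,
    ".png": image_to_text,
    ".jpg": image_to_text,
    ".jpeg": image_to_text,
}


def extractor_for(filename: str) -> Callable[[str], str]:
    """Pick an extractor by case-insensitive file extension."""
    ext = Path(filename).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"Unsupported resume format: {ext or '(none)'}")
    return EXTRACTORS[ext]
```

A call such as `extractor_for("resume.pdf")("resume.pdf")` would then return the extracted text, while an unsupported extension fails early with a clear error.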
- # Error Handling (utils/error.py)
- - Handles API response errors and file format issues, and ensures smooth fallbacks without crashing the app.
-
- # Flask API (main.py)
- Endpoints:
- - /upload for uploading resumes.
- - Displays parsed results in JSON format on the results page.
- - UI: Simple interface for uploading resumes and viewing the parsing results.
-
-
- # Tree map of program:
-
- main.py
- ├── Handles API side
- ├── File upload/remove
- ├── Process resumes
- └── Show result
- utils
- ├── fileTotext.py
- │   └── Converts files to text
- │       ├── PDF
- │       ├── DOCX
- │       ├── RTF
- │       ├── ODT
- │       ├── PNG
- │       ├── JPG
- │       └── JPEG
- ├── mistral.py
- │   ├── Mistral API Calls
- │   │   └── Uses Mistral-Nemo-Instruct-2407 model
- │   ├── Personal and Professional Extraction
- │   │   ├── Extracts personal information
- │   │   └── Extracts professional information
- │   └── Fallback Mechanism
- │       └── Uses spaCy NER model if Mistral fails
- └── spacy.py
-     ├── Custom Trained Model
-     │   └── Uses spaCy model (ner_model_05_3)
-     ├── Named Entity Recognition
-     │   └── Extracts key information (Name, Email, Contact, etc.)
-     └── Validation
-         └── Validates emails and contacts
-
-
  # References:

  - [Flask Documentation](https://flask.palletsprojects.com/)
  - [spaCy Documentation](https://spacy.io/usage)
- - [Mistral Documentation](https://docs.mistral.ai/)
  - [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
  - [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
  - [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
  # File Structure Overview:
  Spacy_Model_creator/
  │
+ ├── Models/
  │   └── ner_model_05_3      # Pretrained spaCy model directory for resume parsing
+ │
+ ├── data/
+ │   ├── Json_data.json
+ │   ├── resume_text.txt
+ │   └── Spacy_data.spacy
  │
  ├── templates/
+ │   ├── anoter.html
+ │   ├── result.html
+ │   ├── guide.html
+ │   ├── savejson.html
+ │   ├── savespacy.html
+ │   ├── text.html
+ │   ├── upload.html
+ │   └── data_files.html
  │
+ ├── JSON/
+ │   └── Json_data.json
  │
  ├── utils/
+ │   ├── model.py            # Code for calling Mistral API and handling responses
+ │   ├── json_to_spacy.py    # spaCy fallback model for parsing resumes
+ │   ├── anoter_to_json.py   # Error handling utilities
+ │   └── file_To_text.py     # Functions to extract text from different file formats (PDF, DOCX, etc.)
  │
  ├── venv/                   # Virtual environment
  │
  ├── .env                    # Environment variables file (contains Hugging Face token)
  │
+ ├── app.py                  # Flask app handling API routes for uploading and processing resumes
  │
  └── requirements.txt        # Dependencies required for the project

  # References:

  - [Flask Documentation](https://flask.palletsprojects.com/)
  - [spaCy Documentation](https://spacy.io/usage)
  - [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
  - [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
  - [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)