maartenvs committed on
Commit
b0851dd
·
verified ·
1 Parent(s): cfb03d2

Upload 8 files

Browse files
README.md ADDED
@@ -0,0 +1,134 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ library_name: gliner
+ datasets:
+ - nvidia/Nemotron-PII
+ pipeline_tag: token-classification
+ tags:
+ - PII
+ - PHI
+ - GLiNER
+ - information extraction
+ - encoder
+ - entity recognition
+ - privacy
+ ---
+
+ # GLiNER-PII: Fine-Tuned Model for PII/PHI Detection
+
+ GLiNER-PII is a fine-tuned successor to the Gretel GLiNER PII models. Built on the GLiNER bi-large base (`knowledgator/gliner-bi-large-v1.0`), it detects and classifies a broad range of Personally Identifiable Information (PII) and Protected Health Information (PHI) in **English text**. The model handles both structured and unstructured text and is non-generative, producing span-level entity annotations with confidence scores across 55+ categories.
+
+ This model is intended for privacy-preserving NLP workflows such as de-identification, redaction, and compliance checks in healthcare, finance, legal, and enterprise data pipelines.
+
+ For more information about the base GLiNER model, including its architecture and general capabilities, refer to the [GLiNER Model Card](https://huggingface.co/knowledgator/gliner-bi-large-v1.0).
+
+ ## Training Data
+
+ The model was fine-tuned on the `nvidia/Nemotron-PII` dataset, a synthetic, persona-grounded dataset containing 100,000 records across 50+ industries with span-level annotations for 55+ PII/PHI categories. The dataset was generated with NVIDIA NeMo Data Designer using synthetic personas grounded in U.S. Census data to ensure demographic realism and contextual consistency.
+
+ **Dataset Details:**
+ - **Size:** 100,000 records (50k train / 50k test)
+ - **Domains:** 50+ industries (healthcare, finance, cybersecurity, etc.)
+ - **Entity Types:** 55+ PII/PHI categories
+ - **Locale Coverage:** US and international formats
+ - **Content Types:** Both structured (forms, invoices) and unstructured (emails, notes) documents
+
+ For detailed statistics on the dataset, visit the [dataset documentation on Hugging Face](https://huggingface.co/datasets/nvidia/Nemotron-PII).
+
+ ## Use Cases
+
+ GLiNER-PII supports detection and redaction of sensitive information across regulated and enterprise scenarios:
+
+ - **Healthcare**: Redact PHI in clinical notes, reports, and medical documents
+ - **Finance**: Identify account numbers, SSNs, and transaction details in banking and insurance documents
+ - **Legal**: Protect client information in contracts, filings, and discovery materials
+ - **Enterprise Data Governance**: Scan documents, emails, and data stores for sensitive information
+ - **Data Privacy Compliance**: Support GDPR, HIPAA, and CCPA workflows across varied document types
+ - **Cybersecurity**: Detect sensitive data in logs, security reports, and incident records
+ - **Content Moderation**: Flag personal information in user-generated content
+
+ Note: Performance varies by domain, format, and threshold, so validation and human review are recommended for high-stakes deployments.
+
+ ## Installation & Usage
+
+ Ensure you have Python installed, then install or update the `gliner` package with `pip install -U gliner`:
+
+ ```python
+ import json
+ from gliner import GLiNER
+
+ # Load the fine-tuned GLiNER model
+ model = GLiNER.from_pretrained("nvidia/gliner-pii")
+
+ # Sample text containing PII/PHI entities (illustrative synthetic data)
+ text = """
+ Patient Jane Doe (DOB: 03/14/1985) can be reached at jane.doe@example.com
+ or (555) 123-4567. SSN on file: 123-45-6789.
+ """
+
+ # Define the labels for PII/PHI entities
+ labels = [
+     "certificate_license_number",
+     "first_name",
+     "date_of_birth",
+     "ssn",
+     "medical_record_number",
+     "password",
+     "unique_id",
+     "phone_number",
+     "national_id",
+     "swift_bic",
+     "company_name",
+     "country",
+     "license_plate",
+     "tax_id",
+     "employee_id",
+     "pin",
+     "state",
+     "email",
+     "date_time",
+     "api_key",
+     "biometric_identifier",
+     "credit_debit_card",
+     "coordinate",
+     "device_identifier",
+     "city",
+     "postcode",
+     "bank_routing_number",
+     "vehicle_identifier",
+     "health_plan_beneficiary_number",
+     "url",
+     "ipv4",
+     "last_name",
+     "cvv",
+     "customer_id",
+     "date",
+     "user_name",
+     "street_address",
+     "ipv6",
+     "account_number",
+     "time",
+     "age",
+     "fax_number",
+     "county",
+     "gender",
+     "sexuality",
+     "political_view",
+     "race_ethnicity",
+     "religious_belief",
+     "language",
+     "blood_type",
+     "mac_address",
+     "http_cookie",
+     "employment_status",
+     "education_level",
+     "occupation"
+ ]
+
+ # Predict entities with a confidence threshold of 0.3 (tune per use case)
+ entities = model.predict_entities(text, labels, threshold=0.3)
+
+ # Display the detected entities
+ print(json.dumps(entities, indent=2))
+ ```
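The entity dictionaries returned by `predict_entities` carry character offsets, which makes redaction straightforward. A minimal sketch (assuming the usual GLiNER output fields `start`, `end`, `label`, and `score`; the `redact` helper and the sample entities below are illustrative, not part of the library):

```python
def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is a list of dicts with "start", "end", "label", and
    "score" keys. Spans are applied right to left so that earlier
    character offsets remain valid as the string is rewritten.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] >= min_score:
            placeholder = "[" + ent["label"].upper() + "]"
            text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text

# Illustrative synthetic example (no model call needed):
sample = "Contact Jane at jane@example.com."
ents = [
    {"start": 8, "end": 12, "label": "first_name", "score": 0.90},
    {"start": 16, "end": 32, "label": "email", "score": 0.95},
]
print(redact(sample, ents))  # Contact [FIRST_NAME] at [EMAIL].
```

In practice you would pass the `entities` list returned by `model.predict_entities` directly, choosing `min_score` to trade recall against false redactions.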
added_tokens.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "<<ENT>>": 128002,
+   "<<SEP>>": 128003,
+   "[FLERT]": 128001,
+   "[MASK]": 128000
+ }
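These added tokens are how GLiNER presents candidate labels to the encoder: each label is prefixed with `<<ENT>>` and the label list is separated from the input text by `<<SEP>>`. The library builds this prompt internally; the sketch below (with a hypothetical `build_gliner_prompt` helper) is only to illustrate why these tokens appear in `added_tokens.json`:

```python
def build_gliner_prompt(labels, text):
    # Prefix each candidate label with <<ENT>>, then close the label
    # list with <<SEP>> before the input text. This mirrors the prompt
    # layout GLiNER constructs internally (illustrative sketch only).
    parts = []
    for label in labels:
        parts.extend(["<<ENT>>", label])
    parts.append("<<SEP>>")
    return " ".join(parts) + " " + text

print(build_gliner_prompt(["email", "ssn"], "Reach me at a@b.co"))
# <<ENT>> email <<ENT>> ssn <<SEP>> Reach me at a@b.co
```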
gliner_config.json ADDED
@@ -0,0 +1,124 @@
+ {
+   "class_token_index": 128002,
+   "dropout": 0.4,
+   "embed_ent_token": true,
+   "encoder_config": {
+     "_name_or_path": "microsoft/deberta-v3-large",
+     "add_cross_attention": false,
+     "architectures": null,
+     "attention_probs_dropout_prob": 0.1,
+     "bad_words_ids": null,
+     "begin_suppress_tokens": null,
+     "bos_token_id": null,
+     "chunk_size_feed_forward": 0,
+     "cross_attention_hidden_size": null,
+     "decoder_start_token_id": null,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "early_stopping": false,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": null,
+     "exponential_decay_length_penalty": null,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "hidden_act": "gelu",
+     "hidden_dropout_prob": 0.1,
+     "hidden_size": 1024,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "initializer_range": 0.02,
+     "intermediate_size": 4096,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_norm_eps": 1e-07,
+     "length_penalty": 1.0,
+     "max_length": 20,
+     "max_position_embeddings": 512,
+     "max_relative_positions": -1,
+     "min_length": 0,
+     "model_type": "deberta-v2",
+     "no_repeat_ngram_size": 0,
+     "norm_rel_ebd": "layer_norm",
+     "num_attention_heads": 16,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_hidden_layers": 24,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": 0,
+     "pooler_dropout": 0,
+     "pooler_hidden_act": "gelu",
+     "pooler_hidden_size": 1024,
+     "pos_att_type": [
+       "p2c",
+       "c2p"
+     ],
+     "position_biased_input": false,
+     "position_buckets": 256,
+     "prefix": null,
+     "problem_type": null,
+     "pruned_heads": {},
+     "relative_attention": true,
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "sep_token_id": null,
+     "share_att_key": true,
+     "suppress_tokens": null,
+     "task_specific_params": null,
+     "temperature": 1.0,
+     "tf_legacy_loss": false,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": null,
+     "torchscript": false,
+     "type_vocab_size": 0,
+     "typical_p": 1.0,
+     "use_bfloat16": false,
+     "vocab_size": 128004
+   },
+   "ent_token": "<<ENT>>",
+   "eval_every": 5000,
+   "fine_tune": true,
+   "fuse_layers": false,
+   "has_rnn": true,
+   "hidden_size": 512,
+   "labels_encoder": null,
+   "labels_encoder_config": null,
+   "lr_encoder": "1e-5",
+   "lr_others": "5e-5",
+   "max_len": 384,
+   "max_neg_type_ratio": 1,
+   "max_types": 25,
+   "max_width": 12,
+   "model_name": "microsoft/deberta-v3-large",
+   "model_type": "gliner",
+   "name": "correct",
+   "num_post_fusion_layers": 1,
+   "num_steps": 30000,
+   "post_fusion_schema": "",
+   "random_drop": true,
+   "sep_token": "<<SEP>>",
+   "shuffle_types": true,
+   "size_sup": -1,
+   "span_mode": "markerV0",
+   "subtoken_pooling": "first",
+   "train_batch_size": 8,
+   "transformers_version": "4.45.2",
+   "vocab_size": 128004,
+   "warmup_ratio": 3000,
+   "words_splitter_type": "whitespace"
+ }
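Two fields in this config are worth noting at inference time: `max_types` (25) caps how many labels were scored together during training, and `max_width` (12) caps candidate span length in words, so entities longer than 12 words are unlikely to be detected. If you pass a label set larger than `max_types`, one common workaround (an assumption on my part, not a documented library requirement) is to split it into chunks and merge the results. A minimal sketch of the chunking step:

```python
def chunk_labels(labels, max_types=25):
    # Split a long label list into groups of at most max_types entries,
    # matching the "max_types": 25 setting in gliner_config.json.
    return [labels[i:i + max_types] for i in range(0, len(labels), max_types)]

# 55 labels (as in the README example) -> three groups
groups = chunk_labels(["label_%d" % i for i in range(55)])
print([len(g) for g in groups])  # [25, 25, 5]
```

Each group would then be passed to `predict_entities` separately and the resulting entity lists concatenated (deduplicating overlapping spans if needed).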
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a4dfd0dcbd718acc86dca65fa99f1097d2796c8d7a681e1bc42f40f946c03802
+ size 1782000995
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c679fbf93643d19aab7ee10c0b99e460bdbc02fedf34b92b05af343b4af586fd
+ size 2464616
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,86 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "128000": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "128001": {
+       "content": "[FLERT]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "128002": {
+       "content": "<<ENT>>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "128003": {
+       "content": "<<SEP>>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "max_length": null,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "sp_model_kwargs": {},
+   "split_by_punct": false,
+   "tokenizer_class": "DebertaV2Tokenizer",
+   "unk_token": "[UNK]",
+   "vocab_type": "spm"
+ }