---
library_name: gliner
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/

pipeline_tag: token-classification
language:
  - en
tags:
  - nvidia
  - pytorch
  - PII
  - PHI
  - GLiNER
  - information extraction
  - entity recognition
  - privacy
---

# GLiNER-PII Model Overview

### Description:
GLiNER-PII is inspired by the Gretel GLiNER PII/PHI models. Built on the GLiNER large-v2.1 base, it detects and classifies a broad range of Personally Identifiable Information (PII) and Protected Health Information (PHI) in structured and unstructured text. It is non-generative and produces span-level entity annotations with confidence scores across 55+ categories. This model was developed by NVIDIA.

This model is ready for commercial/non-commercial use. <br>

### License/Terms of Use
Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). <br>

### Deployment Geography:
Global <br>

### Use Case: <br>
GLiNER-PII supports detection and redaction of sensitive information across regulated and enterprise scenarios. 

- **Healthcare**: Redact PHI in clinical notes, reports, and medical documents.
- **Finance**: Identify account numbers, SSNs, and transaction details in banking and insurance documents.
- **Legal**: Protect client information in contracts, filings, and discovery materials.
- **Enterprise Data Governance**: Scan documents, emails, and data stores for sensitive information.
- **Data Privacy Compliance**: Support GDPR, HIPAA, and CCPA workflows across varied document types.
- **Cybersecurity**: Detect sensitive data in logs, security reports, and incident records.
- **Content Moderation**: Flag personal information in user-generated content.

Note: performance varies by domain, format, and threshold, so validation and human review are recommended for high‑stakes deployments. <br>
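
One way to act on this note is to split predictions by confidence, so that only high-scoring spans are handled automatically and borderline spans go to a reviewer. The sketch below assumes entity dictionaries shaped like the model output described in this card (`text`, `label`, `start`, `end`, `score`). The function name `triage_entities` and the cutoffs `0.8` and `0.3` are illustrative assumptions, not values recommended by NVIDIA, and should be tuned on your own validation data.

```python
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]


def triage_entities(
    entities: List[Entity],
    accept_at: float = 0.8,  # assumed cutoff for automatic handling
    review_at: float = 0.3,  # assumed cutoff below which spans are dropped
) -> Tuple[List[Entity], List[Entity]]:
    """Split entities into (auto-accepted, needs-human-review) by score."""
    accepted: List[Entity] = []
    needs_review: List[Entity] = []
    for entity in entities:
        score = float(entity["score"])
        if score >= accept_at:
            accepted.append(entity)
        elif score >= review_at:
            needs_review.append(entity)
        # Anything below review_at is discarded as too uncertain.
    return accepted, needs_review
```

Running the model with a low `threshold` and triaging afterwards keeps borderline spans visible to reviewers instead of silently dropping them.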

### Release Date:  <br>
Hugging Face 10/28/2025 via https://huggingface.co/nvidia/gliner-pii <br> 

## References:
- GLiNER base (Hugging Face): https://huggingface.co/urchade/gliner_large-v2.1
- Gretel GLiNER PII/PHI models: https://huggingface.co/gretelai/gretel-gliner-bi-large-v1.0
- Training dataset: https://huggingface.co/datasets/nvidia/nemotron-pii
- GLiNER library: https://pypi.org/project/gliner/

## Model Architecture:
**Architecture Type:** Transformer <br>

**Network Architecture:** GLiNER <br>

**This model was developed based on urchade/gliner_large-v2.1** <br> 
**Number of model parameters: 5.7 × 10^8** <br>

## Input: <br>
**Input Type(s):** Text <br>
**Input Format:** UTF-8 string(s) <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** supports structured and unstructured text <br>

## Output: <br>
**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Output:** List of dictionaries with keys {text, label, start, end, score} <br>
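
As an illustration of this output shape, the sketch below checks that each entity's offsets point at its `text` and groups the detected strings by label. It assumes `start` and `end` are character offsets with an exclusive `end` (Python slice semantics), which is consistent with the sample output in the usage section of this card. The helper names and the example entity are hand-written for illustration and are not part of the GLiNER library.

```python
from collections import defaultdict
from typing import Any, Dict, List

Entity = Dict[str, Any]


def find_offset_mismatches(text: str, entities: List[Entity]) -> List[Entity]:
    """Return entities whose text[start:end] slice differs from their 'text' field."""
    return [e for e in entities if text[e["start"]:e["end"]] != e["text"]]


def group_by_label(entities: List[Entity]) -> Dict[str, List[str]]:
    """Collect the detected strings under each label, preserving input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hand-written example entity in the documented format (not real model output).
text = "Mail johnd@example.com now"
entities = [
    {"text": "johnd@example.com", "label": "email", "start": 5, "end": 22, "score": 0.99},
]

mismatches = find_offset_mismatches(text, entities)  # empty list when offsets line up
grouped = group_by_label(entities)
```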

## Software Integration:
**Runtime Engine(s):** 
* PyTorch, GLiNER Python library <br> 

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere <br>
* NVIDIA Blackwell <br>
* NVIDIA Hopper <br>
* NVIDIA Lovelace <br>
* NVIDIA Pascal <br>
* NVIDIA Turing <br>
* NVIDIA Volta <br>
* CPU (x86_64) <br>

**Preferred/Supported Operating System(s):**
* Linux <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):
- nvidia/gliner-pii
- Version: v1.0

## Training and Evaluation Datasets:

### Training Dataset

**Link:** [nvidia/nemotron-pii](https://huggingface.co/datasets/nvidia/nemotron-pii) <br>
**Data Modality:** Text <br>
**Text Training Data Size:** \~100k records (\~10^5, <1B tokens) <br>
**Data Collection Method:** Synthetic <br>
**Labeling Method:** Synthetic <br>

**Properties:**
Synthetic persona-grounded dataset generated with NVIDIA NeMo Data Designer, spanning 50+ industries and 55+ entity types (U.S. and international formats). It includes both structured and unstructured records. Labels are injected automatically during generation.

### Evaluation Datasets

* [Argilla PII](https://huggingface.co/argilla)
* [AI4Privacy](https://huggingface.co/ai4privacy)
* [Gretel PII Dataset V1/V2](https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1)

**Data Collection Method:** Hybrid: Automated, Human <br>
**Labeling Method:** Hybrid: Automated, Human <br>

**Evaluation Results** <br>
From the combined evaluation across Argilla, AI4Privacy, and Gretel PII datasets:

| Benchmark           |   Strict F1  |
| --------------------| -----------: |
| Argilla PII         |         0.70 |
| AI4Privacy          |         0.64 |
| nvidia/Nemotron-PII |         0.87 |

---

We evaluated the model using `threshold=0.3`. <br>
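
This card does not spell out how Strict F1 is computed. A common convention, assumed here, is that a prediction counts as correct only when its start offset, end offset, and label all match a gold annotation exactly. The sketch below scores a single document under that assumption; the function name `strict_f1` is illustrative, and the benchmark numbers above may have been aggregated differently (for example, across a whole corpus).

```python
from typing import Any, Dict, List, Set, Tuple

Entity = Dict[str, Any]
SpanKey = Tuple[int, int, str]


def _span_keys(entities: List[Entity]) -> Set[SpanKey]:
    """Reduce entities to (start, end, label) triples for exact matching."""
    return {(e["start"], e["end"], e["label"]) for e in entities}


def strict_f1(gold: List[Entity], predicted: List[Entity]) -> float:
    """Exact-match F1: boundaries and label must all agree to count as a hit."""
    gold_keys = _span_keys(gold)
    pred_keys = _span_keys(predicted)
    true_positives = len(gold_keys & pred_keys)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_keys)
    recall = true_positives / len(gold_keys)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a span that is off by a single character, or has the right boundaries but the wrong label, contributes nothing to the score.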

# Inference:
**Acceleration Engine:** PyTorch (via Hugging Face Transformers) <br>
**Test Hardware:** NVIDIA A100 (Ampere, PCIe/SXM) <br>

# Usage Recommendation

First, make sure you have the gliner library installed:

```
pip install gliner
```
Now, let's try to find a username, phone number, and email address in a messy block of text.

```
from gliner import GLiNER

# 1. Define the text to scan
text = "Hi support, I can't log in! My account username is 'johndoe88'. Every time I try, it says 'invalid credentials'. Please reset my password. You can reach me at (555) 123-4567 or johnd@example.com"

# 2. Define the labels we're hunting for
labels = ["email", "phone_number", "user_name"]

# 3. Load the PII model
model = GLiNER.from_pretrained("nvidia/gliner-pii")

# 4. Run the prediction; only entities scoring at or above the threshold are returned
entities = model.predict_entities(text, labels, threshold=0.5)

# 5. Inspect the detected entities
for entity in entities:
    print(entity)
```

Sample output:
```
[
  {
    "start": 52, 
    "end": 61, 
    "text": "johndoe88", 
    "label": "user_name",
    "score": 0.99
  },
  {
    "start": 159, 
    "end": 173, 
    "text": "(555) 123-4567", 
    "label": "phone_number", 
    "score": 0.99
  },
  {
    "start": 177, 
    "end": 194, 
    "text": "johnd@example.com", 
    "label": "email", 
    "score": 0.99
  }
]
```
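
For the redaction use cases listed earlier, the `start` and `end` offsets can be used to replace each detected span with a placeholder. The following is a minimal sketch that assumes non-overlapping spans and end-exclusive character offsets, as in the sample output above. The `redact` helper and the `[LABEL]` placeholder style are illustrative choices, not part of the GLiNER library.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def redact(text: str, entities: List[Entity]) -> str:
    """Replace each entity span with an upper-cased [LABEL] placeholder."""
    redacted = text
    # Work from the end of the string backwards so that earlier offsets
    # remain valid after each replacement changes the string length.
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{entity['label'].upper()}]"
        redacted = redacted[: entity["start"]] + placeholder + redacted[entity["end"]:]
    return redacted


# Hand-written entities in the documented output format (not real model output).
example_text = "Call (555) 123-4567 or johnd@example.com"
example_entities = [
    {"start": 5, "end": 19, "text": "(555) 123-4567", "label": "phone_number", "score": 0.99},
    {"start": 23, "end": 40, "text": "johnd@example.com", "label": "email", "score": 0.99},
]

print(redact(example_text, example_entities))
# prints: Call [PHONE_NUMBER] or [EMAIL]
```

If the model returns overlapping spans for your label set, merge or deduplicate them before redacting, since this sketch does not handle overlaps.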

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. <br> 

For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards. <br> 

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).  <br>