Model Card for query_intent model
The query_intent model is a fine-tuned BERT model intended for query augmented generation (QAG) on the Genomic Data Commons (GDC). The model accepts a natural language query as input and outputs a label, or "intent". During QAG, the label maps the query to a GDC API endpoint.
Model Details
Model Description
Model Name (model_id): uc-ctds/query_intent
Model Description:
The model is fine-tuned from the bert-base-uncased base model on synthetically generated pairs of queries and labels. Details of synthetic data generation will be presented in the upcoming paper. All training data is open-source, with gene, mutation and cancer information obtained from the /ssm GDC API endpoint.
This model is used in the GDC QAG web app running on HuggingFace Spaces.
- Developed by: Center for Translational Data Science
- Model type: BERT
- Language(s) (NLP): English
- License: apache-2.0
- Finetuned from model:
google-bert/bert-base-uncased
Model Parameters: 109M
Model Sources
- Repository:
https://huggingface.co/uc-ctds/query_intent
- Paper: coming soon
- Demo:
https://huggingface.co/spaces/uc-ctds/GDC-QAG
Uses
The model is intended to be used for cancer research, on select precision oncology use cases. The model accepts a natural language query as input and outputs an intent label.
Example Input
What is the co-occurence frequency of somatic homozygous deletions in CDKN2A and CDKN2B in the mesothelioma project TCGA-MESO in the genomic data commons?
Example Output
freq_cnv_loss_or_gain
The model is trained on queries concerning frequencies of simple somatic mutations, frequencies of copy number variant losses or gains, frequencies of microsatellite instability, or frequencies of combination variants. In QAG, this model classifies each query into one of the labels listed below.
Query use cases supported, labels
frequencies of simple somatic mutations, ssm_frequency
frequencies of copy number variant losses or gains, freq_cnv_loss_or_gain
frequency of microsatellite instability, msi_h_frequency
frequency of copy number variants and/or simple somatic mutations, cnv_and_ssm
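The label list above can be kept in code as a small lookup table, which also gives a simple way to reject labels outside the supported use cases. This is a minimal sketch: the names `SUPPORTED_INTENTS` and `describe_intent` are hypothetical and not part of the model or the GDC QAG code; only the label strings and descriptions come from this card.

```python
# Supported intent labels and the query use case each one covers,
# copied from the list in this card.
SUPPORTED_INTENTS = {
    "ssm_frequency": "frequencies of simple somatic mutations",
    "freq_cnv_loss_or_gain": "frequencies of copy number variant losses or gains",
    "msi_h_frequency": "frequency of microsatellite instability",
    "cnv_and_ssm": "frequency of copy number variants and/or simple somatic mutations",
}


def describe_intent(label):
    """Return the use case for a label, or raise ValueError if unsupported."""
    try:
        return SUPPORTED_INTENTS[label]
    except KeyError:
        raise ValueError(f"unsupported intent label: {label!r}") from None


print(describe_intent("freq_cnv_loss_or_gain"))
```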
Direct Use
The primary use is QAG, where the model output serves as an intermediate step towards the final result.
Out-of-Scope Use
The model is trained on a limited set of use cases and will not work well outside of them. Please see the supported query use cases under Uses.
How to Get Started with the Model
Use the code below to get started with the model.
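import os

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Optional Hugging Face access token, read from the environment (None if unset)
AUTH_TOKEN = os.environ.get("HF_TOKEN")

query = 'What is the co-occurence frequency of somatic homozygous deletions in CDKN2A and CDKN2B in the mesothelioma project TCGA-MESO in the genomic data commons?'
model_id = 'uc-ctds/query_intent'

intent_tok = AutoTokenizer.from_pretrained(
    model_id, trust_remote_code=True,
    token=AUTH_TOKEN
)
intent_model = BertForSequenceClassification.from_pretrained(model_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
intent_model.to(device)


def predict_intent(query, intent_labels):
    # intent_labels maps each label name to its class index.
    # The mapping is not defined in this card and must be supplied by the caller.
    inputs = intent_tok(query, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():  # inference only, no gradients needed
        outputs = intent_model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    predicted_label = torch.argmax(probs, dim=1).item()
    for k, v in intent_labels.items():
        if v == predicted_label:
            return k
    return None  # no label name matches the predicted class index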
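The last step of the snippet above turns the model's logits into a label name. That step can be sketched without the model, using only the standard library. The index order in `ID2LABEL` and the logit values are made up for illustration; the real class-index mapping ships with the model and is not stated in this card.

```python
import math

# ASSUMPTION: hypothetical index order, for illustration only.
ID2LABEL = {
    0: "ssm_frequency",
    1: "freq_cnv_loss_or_gain",
    2: "msi_h_frequency",
    3: "cnv_and_ssm",
}


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_intent(logits, id2label=ID2LABEL):
    """Return the most probable label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits in which class index 1 scores highest.
label, confidence = decode_intent([0.2, 3.1, -0.5, 0.4])
print(label, round(confidence, 3))
```

Building the index-to-name dictionary once avoids looping over the label mapping on every prediction, and returning the probability lets a caller decide how to handle low-confidence queries.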
Training Details
Training Procedure
A paired dataset of N=63756 synthetic questions and labels was generated using template questions and open-source data from the GDC API for genes and mutations observed in different cancer types. The dataset was split 70:30 into training and evaluation sets using sklearn, and the model was trained for two epochs.
Please refer to our GitHub repo for training details.
Training was performed on an A100 GPU with 40GB RAM in an on-prem GPU cluster.
The dataset used for training is available here (https://huggingface.co/datasets/uc-ctds/query_intent_dataset).
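To make the 70:30 split concrete, the sketch below works out the resulting set sizes for N=63756 with a plain shuffled index split. It uses only the standard library and is not the sklearn call used in training; the seed and the choice to round the test count up are assumptions, and any stratification used in the actual training run is not described in this card.

```python
import math
import random

N_SAMPLES = 63756      # dataset size stated in this card
TEST_FRACTION = 0.30   # 70:30 train/test split


def split_indices(n_samples, test_fraction, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    # ASSUMPTION: the test count is rounded up to a whole sample.
    n_test = math.ceil(n_samples * test_fraction)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(N_SAMPLES, TEST_FRACTION)
print(len(train_idx), len(test_idx))  # 44629 19127
```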
Compute Infrastructure
The model is intended to be used for QAG and can be run on a V100 GPU with 16GB of GPU RAM.
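As a rough sanity check on the hardware figure above, the memory needed for the weights alone can be estimated from the 109M parameter count. This is back-of-the-envelope arithmetic, not a measurement: it ignores activations, the tokenizer, and CUDA runtime overhead, and the helper name `weight_memory_gib` is hypothetical.

```python
NUM_PARAMS = 109_000_000  # parameter count stated in this card

# Bytes per parameter for common floating-point formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype="float32"):
    """Estimate the memory taken by model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(dtype, round(weight_memory_gib(NUM_PARAMS, dtype), 2), "GiB")
```

Under these assumptions the weights take well under 1 GiB in either format, leaving most of a 16GB card free for activations and batching.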
Citation
Coming soon
BibTeX: