Model Card for query_intent model

The query_intent model is a trained BERT model intended to be used for query augmented generation (QAG) on the GDC. The model accepts a natural language query as input and outputs a label or "intent". The label maps the query to a GDC API endpoint during query augmented generation.

Model Details

Model Description

Model Name (model_id): uc-ctds/query_intent

Model Description:

The model is fine-tuned from the bert-base-uncased base model on synthetically generated paired queries and labels. Details of the synthetic data generation will be presented in the upcoming paper. All training data is open-source, with gene, mutation and cancer information obtained from the /ssm GDC API endpoint.

This model is used in the GDC QAG web app running on HuggingFace Spaces.

  • Developed by: Center for Translational Data Science
  • Model type: BERT
  • Language(s) (NLP): English
  • License: apache-2.0
  • Finetuned from model: google-bert/bert-base-uncased

Model Parameters: 109M

Model Sources

  • Repository: https://huggingface.co/uc-ctds/query_intent
  • Paper: coming soon
  • Demo: https://huggingface.co/spaces/uc-ctds/GDC-QAG

Uses

The model is intended to be used for cancer research, on select precision oncology use cases. The model accepts a natural language query as input and outputs an intent label.

Example Input

What is the co-occurence frequency of somatic homozygous deletions in CDKN2A and CDKN2B in the mesothelioma project TCGA-MESO in the genomic data commons?

Example Output

freq_cnv_loss_or_gain

The model is trained on queries concerning frequencies of simple somatic mutations, frequencies of copy number variant losses or gains, frequencies of microsatellite instability, or frequencies of combination variants. In QAG, this model classifies each query into one of the labels listed below.

| Query use cases supported | Label |
|---|---|
| frequencies of simple somatic mutations | ssm_frequency |
| frequencies of copy number variant losses or gains | freq_cnv_loss_or_gain |
| frequency of microsatellite instability | msi_h_frequency |
| frequency of copy number variants and/or simple somatic mutations | cnv_and_ssm |
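As an illustration of how downstream code might consume these labels, the sketch below stores the four label/use-case pairs listed above in a dictionary and rejects anything else. The names `INTENT_USE_CASES` and `describe_intent` are hypothetical and are not part of the model or the GDC QAG code.

```python
# Labels and use cases copied from the list above.
# INTENT_USE_CASES and describe_intent are illustrative names only.
INTENT_USE_CASES = {
    "ssm_frequency": "frequencies of simple somatic mutations",
    "freq_cnv_loss_or_gain": "frequencies of copy number variant losses or gains",
    "msi_h_frequency": "frequency of microsatellite instability",
    "cnv_and_ssm": "frequency of copy number variants and/or simple somatic mutations",
}


def describe_intent(label):
    """Return the supported use case for a label, or raise for unknown labels."""
    try:
        return INTENT_USE_CASES[label]
    except KeyError:
        raise ValueError(f"unsupported intent label: {label!r}") from None


print(describe_intent("freq_cnv_loss_or_gain"))
```

Failing loudly on an unknown label is one reasonable design choice here, since the model is only trained on these four use cases.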

Direct Use

The primary use is QAG, where the model output serves as an intermediate step towards the final result.

Out-of-Scope Use

The model is trained on a limited set of use cases and will not work well for any use case outside of them. Please see the supported query use cases under Uses.

How to Get Started with the Model

Use the code below to get started with the model.

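import os

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

query = 'What is the co-occurence frequency of somatic homozygous deletions in CDKN2A and CDKN2B in the mesothelioma project TCGA-MESO in the genomic data commons?'
model_id = 'uc-ctds/query_intent'
# Hugging Face access token, read here from the HF_TOKEN environment variable
AUTH_TOKEN = os.environ.get("HF_TOKEN")

intent_tok = AutoTokenizer.from_pretrained(
    model_id, trust_remote_code=True, token=AUTH_TOKEN
)
intent_model = BertForSequenceClassification.from_pretrained(model_id, token=AUTH_TOKEN)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
intent_model.to(device)
intent_model.eval()  # inference mode: disables dropout


def infer_intent(query, intent_labels):
    # intent_labels maps each label name to its class index (not defined in this card)
    inputs = intent_tok(query, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():  # no gradients are needed for prediction
        outputs = intent_model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=1)
    predicted_label = torch.argmax(probs, dim=1).item()
    # Reverse lookup: find the label name whose index was predicted
    for k, v in intent_labels.items():
        if v == predicted_label:
            return k
    return None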
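The last steps of the snippet above (softmax, argmax, then a reverse lookup in `intent_labels`) can be tried without downloading the model. The following is a minimal pure-Python sketch of that post-processing. The index assigned to each label and the logits are made up for illustration, since the real label-to-index mapping is not given in this card.

```python
import math

# Hypothetical name -> class index mapping; the real index order is an
# assumption and is not documented in this card.
intent_labels = {
    "ssm_frequency": 0,
    "freq_cnv_loss_or_gain": 1,
    "msi_h_frequency": 2,
    "cnv_and_ssm": 3,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, labels):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    # Invert the name -> index mapping once instead of scanning it
    index_to_name = {index: name for name, index in labels.items()}
    return index_to_name[predicted], probs[predicted]


made_up_logits = [-1.2, 3.4, 0.1, 0.5]  # illustrative values, not model output
name, confidence = label_from_logits(made_up_logits, intent_labels)
print(name, round(confidence, 3))
```

Returning the probability alongside the label lets a caller apply a confidence threshold, which may help flag queries outside the supported use cases.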

Training Details

Training Procedure

A paired dataset of N=63756 synthetic questions and labels was generated using template questions and open-source data from the GDC API for genes and mutations observed in different cancer types. A 70:30 train-test split, made with sklearn, produced the training and evaluation datasets, and the model was trained for two epochs. Please refer to our GitHub repo for training details. Training was performed on an A100 GPU with 40GB RAM in an on-prem GPU cluster.

The dataset used for training is available here: https://huggingface.co/datasets/uc-ctds/query_intent_dataset
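To make the 70:30 split concrete, here is a minimal standard-library sketch of a shuffled split. It is a stand-in for illustration only: the authors used sklearn, and the seed, the shuffling, and the convention of rounding the test size up are all assumptions rather than details from the training code.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle items and split them 70:30 into (train, test) lists.

    The seed and the round-up of the test size are assumptions made for
    this sketch, not settings taken from the actual training run.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer ceiling of 30% of the data, avoiding floating-point rounding
    n_test = (len(shuffled) * 30 + 99) // 100
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder integers stand in for the N=63756 question/label pairs
train, test = split_70_30(range(63756))
print(len(train), len(test))
```

Under these assumptions, the 63756 pairs divide into 44629 training and 19127 evaluation examples.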

Compute Infrastructure

The model is intended to be used for QAG, and can be run on a V100 GPU with 16GB of GPU RAM.

Citation

Coming soon

BibTeX:
