Model Card: GraphCodeBERT for AI-Generated Code Detection

This model is a binary classifier fine-tuned to detect whether a given piece of source code was written by a human or generated by a Large Language Model (LLM) such as ChatGPT.


1. Model Details

  • Model Name: GraphCodeBERT Binary Classifier for GenAI Code Detection
  • Base Model: microsoft/graphcodebert-base
  • Architecture: GraphCodeBERTForClassification (a base GraphCodeBERT encoder coupled with a custom PyTorch linear classification head).
  • Developers/Authors: Pachanitha Saeheng
  • Model Type: Text/Code Classification (Binary: Human vs. AI-Generated).
  • Language(s): Source Code (focused primarily on Java).

2. Intended Use

  • Primary Use Case: To classify a piece of source code and determine its origin (Human-written = Class 0, AI-generated = Class 1).

3. Training Data & Preprocessing

  • Dataset Used: The model was fine-tuned on a custom, extended dataset: code collected for the original GPTSniffer research, augmented with additional source code samples from the CodeNet dataset. The result is a dataset of paired samples, each pairing human-written code with its AI-generated counterpart.
  • Preprocessing: Input code must be tokenized using the standard Hugging Face AutoTokenizer configured for microsoft/graphcodebert-base.
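For illustration, the padding and truncation the tokenizer performs can be sketched with a toy whitespace tokenizer. This is not the real preprocessing: the actual AutoTokenizer uses GraphCodeBERT's subword vocabulary and a 512-token limit, and the tiny vocabulary and max_length below are made up for the example.

```python
# Toy sketch of the padding/truncation behavior of the real tokenizer.
# The whitespace split, vocabulary, and max_length=16 are illustrative only.

def toy_tokenize(code, vocab, max_length=16, pad_id=0, unk_id=1):
    """Map a code string to fixed-length input_ids and attention_mask."""
    ids = [vocab.get(tok, unk_id) for tok in code.split()]
    ids = ids[:max_length]                      # truncate long inputs
    mask = [1] * len(ids)                       # 1 = real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding                   # pad short inputs
    mask += [0] * padding                       # 0 = padding
    return ids, mask

vocab = {"public": 2, "static": 3, "void": 4, "main": 5}
ids, mask = toy_tokenize("public static void main ( String [] args )", vocab)
print(len(ids), sum(mask))  # fixed length; mask counts the real tokens
```

With the real tokenizer the equivalent call would be `tokenizer(code, truncation=True, padding="max_length", max_length=512, return_tensors="pt")`.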

4. Model Architecture & Hyperparameters

  • Encoder: AutoModel.from_pretrained("microsoft/graphcodebert-base") is used to capture the semantic representation and structural data-flow of the code.
  • Classifier Head: A custom PyTorch nn.Linear layer that maps the base model's hidden_size to num_labels = 2.
  • Optimizer: AdamW optimizer with a learning rate of 5e-5.
  • Batch Size: 8 (during training and testing).
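The optimization setup above can be sketched as a minimal fine-tuning loop. The tiny linear stand-in model and random batches below are placeholders only; real training feeds tokenized code batches through the GraphCodeBERT classifier, but the optimizer, learning rate, batch size, and loss match the hyperparameters listed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for GraphCodeBERTForClassification: any module producing
# (batch, 2) logits. 768 is graphcodebert-base's hidden_size.
stand_in = nn.Linear(768, 2)
optimizer = torch.optim.AdamW(stand_in.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):                   # a few illustrative steps
    features = torch.randn(8, 768)      # batch size 8, as in training
    labels = torch.randint(0, 2, (8,))  # 0 = human, 1 = AI-generated
    optimizer.zero_grad()
    logits = stand_in(features)
    loss = loss_fn(logits, labels)
    loss.backward()                     # compute gradients
    optimizer.step()                    # AdamW parameter update
```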

5. How to Load and Use the Model

Because this model uses a custom PyTorch class wrapper, you must define the class before loading the .pth weights.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# 1. Define the custom architecture
class GraphCodeBERTForClassification(nn.Module):
    def __init__(self, model):
        super(GraphCodeBERTForClassification, self).__init__()
        self.model = model
        self.classifier = nn.Linear(self.model.config.hidden_size, 2) # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :] # Extract [CLS] token representation
        logits = self.classifier(cls_output)
        return logits

# 2. Load the base model and tokenizer
base_model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

# 3. Initialize the classification model and load weights
model = GraphCodeBERTForClassification(base_model)
model.load_state_dict(torch.load("Detect_AI.pth", map_location=torch.device('cpu')))
model.eval()
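Prediction then amounts to taking the argmax over the two logits the model returns. The mapping from logits to a label can be shown in isolation; the logit values and the interpret_logits helper below are illustrative, not part of the released code.

```python
import math

def interpret_logits(logits):
    """Map the classifier's two raw logits to a label and confidence."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]       # softmax over 2 classes
    label_map = {0: "Human-written", 1: "AI-generated"}
    pred = max(range(2), key=lambda i: probs[i])
    return label_map[pred], probs[pred]

# Fabricated logits standing in for model(input_ids, attention_mask)[0]
label, confidence = interpret_logits([-1.2, 2.3])
print(label, confidence)
```

With the model loaded as above, the same step in PyTorch is `torch.argmax(logits, dim=1)` on the tensor returned by the forward pass.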

📚 Citation

If you use this code or system in your research, please cite our paper:

@conference{icaart26,
  author={Pachanitha Saeheng and Napat Boongaree and Chutweeraya Sriwilailak and Chaiyong Ragkhitwetsagul and Teeradaj Racharak and Ekapol Chuangsuwanich},
  title={NPC: Automated Tool for Detecting and Explaining ChatGPT-Generated Programs},
  booktitle={Proceedings of the 18th International Conference on Agents and Artificial Intelligence - Volume 5: ICAART},
  year={2026},
  pages={4714-4719},
  publisher={SciTePress},
  organization={INSTICC},
  doi={10.5220/0014485500004052},
  isbn={978-989-758-796-2},
  issn={2184-433X},
}