---
language:
- code
extra_gated_prompt: >-
  ## Model License Agreement

  Please read the BigCode [OpenRAIL-M
  license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)
  agreement before accepting it.

extra_gated_fields:
  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
---
# StarEncoder

## Table of Contents

1. [Model Summary](#model-summary)
2. [Training](#training)
3. [Use](#use)
4. [Limitations](#limitations)
5. [License](#license)

## Model Summary

StarEncoder is an encoder-only model (i.e., a bi-directionally self-attentive Transformer) trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.

- **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
- **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
- **Languages:** 80+ programming languages

During pre-training, we leveraged two objectives from [BERT](https://arxiv.org/abs/1810.04805):
- Masked Language Modelling (MLM): predicting masked-out tokens from an input sentence.
- Next Sentence Prediction (NSP): predicting whether a pair of sentences occur as neighbors in a document.
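
As a rough illustration of the MLM objective at inference time, the sketch below loads the checkpoint as a BERT-style masked language model and recovers a masked-out token. The model class and the availability of a pre-configured mask token are assumptions here, not part of this card; the special token string `<mask>` used as a fallback is likewise assumed and may need to be adjusted for this checkpoint.

```python
# Minimal sketch (not an official usage snippet): masked-token prediction
# with a BERT-style encoder. Assumes `AutoModelForMaskedLM` can load the
# checkpoint; the fallback mask token string below is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "bigcode/starencoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# If the tokenizer ships without a mask token configured, register one.
if tokenizer.mask_token is None:
    tokenizer.add_special_tokens({"mask_token": "<mask>"})

code = f"def add(a, b):\n    return a {tokenizer.mask_token} b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring token at the masked position.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```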

## Training

We train for 100,000 steps with a global batch size of 4,096 sequences of a maximum length of 1,024, so that approximately 400B tokens are observed. This takes roughly two days using 64 NVIDIA A100 GPUs.
Details about the model architecture are reported in the table below.

| Hyperparameter           | Value      |
|--------------------------|------------|
| Hidden size              | 768        |
| Intermediate size        | 3072       |
| Max. position embeddings | 1024       |
| Num. of attention heads  | 12         |
| Num. of hidden layers    | 12         |
| Attention                | Multi-head |
| Num. of parameters       | ≈125M      |
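
For orientation, these values map onto a standard BERT-style configuration roughly as sketched below. This is an illustration, not the configuration file distributed with the checkpoint; in particular, the vocabulary size is a placeholder.

```python
from transformers import BertConfig, BertForMaskedLM

# Illustrative BERT-style configuration mirroring the table above.
# vocab_size is a placeholder; the released checkpoint defines its own.
config = BertConfig(
    vocab_size=49152,  # placeholder, not taken from the model card
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=1024,
    num_attention_heads=12,
    num_hidden_layers=12,
)

model = BertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```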

## Use

This model is trained on 86 programming languages from GitHub code, including GitHub issues and Git commits, and can be efficiently fine-tuned for both code- and text-related tasks.
We fine-tuned it on a token classification task to detect PII and have released the [StarPII](https://huggingface.co/bigcode/starpii) model.
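
As an illustration of that fine-tuning path, a token-classification setup with `transformers` could look like the sketch below. The label set, output directory, and training hyperparameters are placeholders and are not those used to train StarPII.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder label set for illustration; the StarPII label set differs.
labels = ["O", "B-EMAIL", "I-EMAIL", "B-NAME", "I-NAME", "B-KEY", "I-KEY"]

checkpoint = "bigcode/starencoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

training_args = TrainingArguments(
    output_dir="starencoder-pii",   # hypothetical output directory
    learning_rate=2e-5,             # placeholder hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# `train_dataset` is assumed to be a token-classification dataset whose
# labels are already aligned with the tokenizer's sub-token boundaries.
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```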

## Limitations

There are limitations to consider when using StarEncoder. It is an encoder-only model, which limits its flexibility in certain code generation or completion tasks, and it was trained on data containing PII, which could pose privacy concerns. Performance may vary across the 80+ supported programming languages, particularly for less common ones, and the model may struggle to understand domains outside of programming.

## License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).