---
language:
- code
extra_gated_prompt: >-
  ## Model License Agreement

  Please read the BigCode [OpenRAIL-M
  license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)
  agreement before accepting it.

extra_gated_fields:
  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
---
# StarEncoder


## Table of Contents


1. [Model Summary](#model-summary)
2. [Training](#training)
3. [Use](#use)
4. [Limitations](#limitations)
5. [License](#license)


## Model Summary


StarEncoder is an encoder-only model (i.e., a bidirectionally self-attentive Transformer) trained on [The Stack](https://huggingface.co/datasets/bigcode/the-stack) dataset.


- **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
- **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
- **Languages:** 80+ programming languages


We leveraged two training objectives from [BERT](https://arxiv.org/abs/1810.04805):
- Masked Language Modelling (MLM): the model predicts masked-out tokens from an input sentence (a minimal inference sketch follows this list).
- Next Sentence Prediction (NSP): the model predicts whether a pair of sentences occur as neighbors in a document.
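
To make the MLM objective concrete, here is a minimal inference sketch: it masks one token in a code snippet and asks the model to recover it. It assumes the checkpoint is published on the Hugging Face Hub as `bigcode/starencoder` and that the released tokenizer defines a mask token; verify both against the actual files.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumption: the model is hosted on the Hub as "bigcode/starencoder".
checkpoint = "bigcode/starencoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Mask the function name in a small Python snippet.
code = f"def {tokenizer.mask_token}(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```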
## Training


We train for 100,000 steps with a global batch size of 4,096 sequences of a maximum length of 1,024, so that approximately 400B tokens are observed (100,000 × 4,096 × 1,024 ≈ 4.2 × 10¹¹). This takes roughly two days using 64 NVIDIA A100 GPUs.
Details about the model architecture are reported in the table below.


| Hyperparameter           | Value      |
|--------------------------|------------|
| Hidden size              | 768        |
| Intermediate size        | 3072       |
| Max. position embeddings | 1024       |
| Num. of attention heads  | 12         |
| Num. of hidden layers    | 12         |
| Attention                | Multi-head |
| Num. of parameters       | ≈125M      |
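
As a sanity check on the ≈125M figure, the sketch below instantiates a BERT-style model matching the table and counts its parameters. The vocabulary size is not listed in the table, so the 49,152 used here is an assumption (check the released tokenizer); it is what lifts the total above BERT-base's ≈110M.

```python
from transformers import BertConfig, BertModel

# Hyperparameters copied from the table above; vocab_size is an
# assumption (not in the table) -- verify against the released tokenizer.
config = BertConfig(
    vocab_size=49_152,            # assumed; drives the embedding-table size
    hidden_size=768,
    intermediate_size=3072,
    max_position_embeddings=1024,
    num_attention_heads=12,
    num_hidden_layers=12,
)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ≈124M, i.e. the ≈125M in the table
```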
## Use


This model is trained on 86 programming languages of GitHub code, including GitHub issues and Git commits, and can be efficiently fine-tuned for both code- and text-related tasks.
We fine-tuned it on a token classification task to detect PII and released the resulting model as [StarPII](https://huggingface.co/bigcode/starpii); a minimal version of that fine-tuning setup is sketched below.
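
The sketch below shows how such a fine-tuning run could be set up. It again assumes the `bigcode/starencoder` Hub id, and the two-label scheme (`O` vs. `PII`) is a deliberately simplified stand-in for the richer label set actually used for StarPII.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bigcode/starencoder"  # assumed Hub id for this model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Attach a fresh token-classification head; the simplified two-label
# scheme here is illustrative, not the label set used for StarPII.
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "O", 1: "PII"},
    label2id={"O": 0, "PII": 1},
)

# From here, train on token-labelled code, e.g. with transformers' Trainer.
```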

## Limitations

There are limitations to consider when using StarEncoder. It is an encoder-only model, which limits its flexibility in certain code generation or completion tasks,
and it was trained on data containing PII, which could pose privacy concerns. Performance may vary across the 80+ supported programming languages,
particularly the less common ones, and the model may struggle to understand domains outside of programming languages.


## License


The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).