# TinyStack Tokenizer
A byte-level BPE tokenizer trained on the fhswf/tiny-stack dataset.
## Usage
```python
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# Load the trained vocabulary and merge rules.
tokenizer = ByteLevelBPETokenizer("./vocab.json", "./merges.txt")

# Wrap each encoded sequence with the <s> ... </s> special tokens.
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
```
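Continuing from the snippet above, a quick sanity check is to encode a short string and inspect the resulting tokens and ids; the post-processor should wrap the sequence in `<s>` and `</s>` (the sample text is purely illustrative):

```python
# Encode a sample snippet using the tokenizer configured above.
encoding = tokenizer.encode("def add(a, b):\n    return a + b")
print(encoding.tokens)  # byte-level BPE tokens, including <s> and </s>
print(encoding.ids)     # corresponding token ids
```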
Vocab size: 52000