Paul Weber committed
Commit 1227d41 · 1 Parent(s): 52b5bec

Using BPE Tokenizer

Files changed (7)
  1. README.md +19 -3
  2. config.json +17 -0
  3. merges.txt +0 -0
  4. special_tokens_map.json +0 -6
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +0 -20
  7. vocab.json +0 -0
README.md CHANGED
@@ -1,3 +1,19 @@
- ---
- license: mit
- ---
+ # TinyStack Tokenizer
+
+ A ByteLevel BPE tokenizer trained on the fhswf/tiny-stack dataset.
+
+ ## Usage
+
+ ```python
+ from tokenizers.implementations import ByteLevelBPETokenizer
+ from tokenizers.processors import BertProcessing
+
+ tokenizer = ByteLevelBPETokenizer("./vocab.json", "./merges.txt")
+ tokenizer._tokenizer.post_processor = BertProcessing(
+     ("</s>", tokenizer.token_to_id("</s>")),
+     ("<s>", tokenizer.token_to_id("<s>")),
+ )
+ tokenizer.enable_truncation(max_length=512)
+ ```
+
+ Vocab size: 52000
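As a quick sanity check on the usage snippet above, here is a minimal sketch of encoding a string with the loaded tokenizer (the sample input and printed fields are illustrative, not part of the commit):

```python
# Minimal sketch: assumes vocab.json and merges.txt sit in the working
# directory, as in the README snippet above.
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer("./vocab.json", "./merges.txt")
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

# Hypothetical input; <code>...</code> is one of the special-token pairs
# declared in config.json below.
encoding = tokenizer.encode("<code>print('hello world')</code>")
print(encoding.tokens)  # token strings, wrapped in <s> ... </s> by the post-processor
print(encoding.ids)     # corresponding vocabulary ids
```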
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+     "vocab_size": 52000,
+     "min_frequency": 2,
+     "special_tokens": [
+         "<s>",
+         "<pad>",
+         "</s>",
+         "<unk>",
+         "<mask>",
+         "<code>",
+         "</code>",
+         "<error_message>",
+         "</error_message>",
+         "<description>",
+         "</description>"
+     ]
+ }
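The new config.json records the training parameters. For orientation, a minimal sketch of how these values map onto the tokenizers training API (the training file name is a placeholder; no training script is included in this commit):

```python
from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Values taken from config.json above; "tiny-stack.txt" stands in for a
# local text export of fhswf/tiny-stack and is not part of the commit.
tokenizer.train(
    files=["tiny-stack.txt"],
    vocab_size=52000,
    min_frequency=2,
    special_tokens=[
        "<s>", "<pad>", "</s>", "<unk>", "<mask>",
        "<code>", "</code>",
        "<error_message>", "</error_message>",
        "<description>", "</description>",
    ],
)

# Writes the vocab.json and merges.txt shipped in this repository.
tokenizer.save_model(".")
```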
merges.txt CHANGED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json DELETED
@@ -1,6 +0,0 @@
- {
-     "bos_token": "<s>",
-     "eos_token": "</s>",
-     "pad_token": "<pad>",
-     "unk_token": "<unk>"
- }
tokenizer.json DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json DELETED
@@ -1,20 +0,0 @@
- {
-     "add_bos_token": false,
-     "add_prefix_space": false,
-     "bos_token": "<s>",
-     "eos_token": "</s>",
-     "unk_token": "<unk>",
-     "pad_token": "<pad>",
-     "mask_token": "<mask>",
-     "additional_special_tokens": [
-         "<code>",
-         "</code>",
-         "<error_message>",
-         "</error_message>",
-         "<description>",
-         "</description>"
-     ],
-     "clean_up_tokenization_spaces": true,
-     "model_max_length": 2048,
-     "tokenizer_class": "GPT2Tokenizer"
- }
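With tokenizer.json, tokenizer_config.json, and special_tokens_map.json removed in this commit, only the raw BPE files (vocab.json, merges.txt) remain. As an assumption rather than part of the commit, one way to keep using the same vocabulary through transformers is to wrap those files in a GPT-2 style fast tokenizer, mirroring the settings of the deleted config:

```python
# Assumption: not part of this commit. Wraps the remaining vocab.json /
# merges.txt in a transformers fast tokenizer, reproducing the special
# tokens and length limit from the removed tokenizer_config.json.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast(
    vocab_file="vocab.json",
    merges_file="merges.txt",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
    additional_special_tokens=[
        "<code>", "</code>",
        "<error_message>", "</error_message>",
        "<description>", "</description>",
    ],
    model_max_length=2048,
)
```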
vocab.json CHANGED
The diff for this file is too large to render. See raw diff