regex pattern

#84
by pandora-s - opened
Mistral AI org
No description provided.

Fixing Mistral-Common vs Hugging Face Tokenizer Mismatch

The current pre-tokenizer in the tokenizer.json is as follows:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

However, the regex pattern (?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+ is wrong; the correct one is: [^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+
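To illustrate the difference, here is a small sketch using the third-party regex module (unlike the stdlib re, it supports \p{...} Unicode property classes). The pattern strings are the two patterns above with the JSON escaping removed; note how the old pattern's case-insensitive contraction branch (?i:'t) swallows 'T as if it were a contraction:

```python
import regex  # third-party `regex` package; stdlib `re` lacks \p{...} support

# Pattern currently shipped in tokenizer.json
OLD = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# Pattern from tekken.json (the mistral-common ground truth)
NEW = (
    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+"
    r"|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*"
    r"|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

text = "'The'"
print(regex.findall(OLD, text))  # ["'T", 'he', "'"]  -- (?i:'t) matches 'T as a contraction
print(regex.findall(NEW, text))  # ["'The", "'"]      -- apostrophe stays a separate piece
```

This pre-tokenization split is exactly what produces the diverging token ids in the repro below.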

The ground truth for the regex pattern can be found in the tekken.json file for our officially supported tokenizer using mistral-common:

    "config": {
        "pattern": "[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
        "num_vocab_tokens": 150000,
        "default_vocab_size": 131072,
        "default_num_special_tokens": 1000,
        "version": "v11"
    },

Also, just to add an easy repro:

# CORRECT
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
tok_mc = MistralTokenizer.from_hf_hub("mistralai/Mistral-Small-3.1-24B-Instruct-2503")

tok_mc.instruct_tokenizer.tokenizer.encode("'The'", True, False)
# [1, 1039, 1784, 1039]

# INCORRECT
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")

tok.encode("'The'")
# [1, 1039, 1084, 1268, 1039]
# Incorrectly tokenizes to ["'", 'T', 'he', "'"]
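Until the fix lands in tokenizer.json, the pre-tokenizer can also be rebuilt locally; this is a sketch, assuming the Python tokenizers package, that constructs the corrected Split step and checks how it splits the repro string:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import ByteLevel, Sequence, Split

# Pattern from tekken.json (the mistral-common ground truth)
CORRECT = (
    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+"
    r"|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*"
    r"|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# Same Split/ByteLevel sequence as in tokenizer.json, but with the fixed pattern
split = Split(Regex(CORRECT), behavior="isolated", invert=False)
fixed = Sequence([split, ByteLevel(add_prefix_space=False, use_regex=False)])

# The Split step alone now keeps the apostrophe off the word:
print(split.pre_tokenize_str("'The'"))
```

As a sketch of the workaround, `fixed` could then be assigned to `tok.backend_tokenizer.pre_tokenizer` on the AutoTokenizer from the repro above, after which both tokenizers should agree.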

Can this be merged? Any blockers?

