# Qwen3 8M Model with Falcon-H1-0.5B-Instruct Tokenizer

## Model Description

This model combines the Qwen3 architecture, scaled down to 2,183,552 parameters, with the Falcon-H1-0.5B-Instruct tokenizer (32K vocabulary).

- **Architecture**: Qwen3 (Grouped Query Attention, RMS Normalization, Q/K Normalization, RoPE)
- **Tokenizer**: Falcon-H1-0.5B-Instruct (32K vocab)
- **Parameters**: 2,183,552
- **Precision**: BF16
- **Format**: SafeTensors
- **Vocabulary Size**: 32768

## Configuration

- vocab_size: 32768
- hidden_size: 64
- num_attention_heads: 4
- num_key_value_heads: 2
- num_hidden_layers: 2
- intermediate_size: 160
- head_dim: 16
- max_position_embeddings: 4096

## Special Tokens

- BOS: `<|begin_of_text|>` (id: 17)
- EOS: `<|end_of_text|>` (id: 11)
- PAD: `<|pad|>` (id: 0)

## Usage

```python
import torch
from transformers import Qwen3ForCausalLM, AutoTokenizer

model = Qwen3ForCausalLM.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")
tokenizer = AutoTokenizer.from_pretrained("./workspace/qwen3-8m-falcon-tokenizer")

# Generate text
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Batch processing (start small)
tokenizer.padding_side = "left"  # decoder-only models should be left-padded for generation
texts = ["Hello", "How are you", "Good morning"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

## Important Notes

- The model uses the Qwen3 architecture with the Falcon tokenizer (32K vocabulary).
- All token IDs must be < 32768; out-of-range IDs trigger CUDA device-side asserts (see the check below).
- Start with small batch sizes (1-4) and increase gradually.
- Use proper padding (left-sided for generation) to prevent dimension mismatches.
- The model is initialized with random weights and requires fine-tuning before it produces useful output (see the training sketch below).
- Compatible with Qwen3 APIs but uses the Falcon vocabulary.
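
## Rebuilding the Model from the Configuration

The hyperparameters above are enough to reconstruct the architecture. The sketch below uses the `Qwen3Config` keyword names from `transformers`; `tie_word_embeddings=True` is an assumption rather than a value read from the shipped config.json, though it is the setting under which the parameter count works out to exactly 2,183,552.

```python
# Sketch: rebuild this model from the "Configuration" and "Special Tokens"
# sections above. tie_word_embeddings is an assumption (not read from the
# shipped config.json), chosen because it reproduces the stated parameter count.
import torch
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    vocab_size=32768,
    hidden_size=64,
    num_attention_heads=4,
    num_key_value_heads=2,   # grouped query attention: 2 KV heads serve 4 query heads
    num_hidden_layers=2,
    intermediate_size=160,
    head_dim=16,
    max_position_embeddings=4096,
    bos_token_id=17,
    eos_token_id=11,
    pad_token_id=0,
    tie_word_embeddings=True,  # assumption; verify against config.json
)

model = Qwen3ForCausalLM(config).to(torch.bfloat16)  # BF16, per the model card
print(sum(p.numel() for p in model.parameters()))    # 2,183,552
```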
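
## Verifying Special Token IDs

Because this checkpoint pairs a Qwen3 architecture with a Falcon tokenizer, it is worth confirming the special-token IDs against the loaded tokenizer. A quick check, reusing `tokenizer` from the usage example:

```python
# Confirm the special-token IDs listed under "Special Tokens".
print(tokenizer.convert_tokens_to_ids("<|begin_of_text|>"))  # expected: 17
print(tokenizer.convert_tokens_to_ids("<|end_of_text|>"))    # expected: 11
print(tokenizer.convert_tokens_to_ids("<|pad|>"))            # expected: 0
```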
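
## Checking Token IDs Before Generation

An ID at or above `vocab_size` fails inside the embedding lookup, which on GPU surfaces as an opaque device-side CUDA assert; it is cheaper to catch it on the CPU first. A minimal guard, reusing `model` and `tokenizer` from the usage example:

```python
# Guard: reject out-of-range token IDs before they reach the GPU embedding
# lookup, where they would raise a device-side CUDA assert.
inputs = tokenizer("Hello, world!", return_tensors="pt")
max_id = int(inputs["input_ids"].max())
assert max_id < model.config.vocab_size, (
    f"token id {max_id} out of range for vocab_size={model.config.vocab_size}"
)
outputs = model.generate(**inputs, max_new_tokens=50)
```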
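
## Minimal Fine-Tuning Loop

Because the weights are random, the model emits noise until it is trained. The loop below is an illustrative sketch only: the toy corpus, learning rate, step count, and output path are assumptions, not values shipped with this model.

```python
# Sketch: a minimal causal-LM training loop on toy data. Everything here
# (corpus, lr, steps, output path) is illustrative, not part of the model.
import torch
from transformers import Qwen3ForCausalLM, AutoTokenizer

path = "./workspace/qwen3-8m-falcon-tokenizer"
model = Qwen3ForCausalLM.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)

texts = ["Hello, world!", "Good morning.", "How are you?"]  # toy corpus
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for step in range(100):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # no loss on padding
    # transformers shifts labels internally, so passing (masked) input_ids
    # as labels computes standard next-token cross-entropy.
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("./workspace/qwen3-8m-finetuned")  # hypothetical path
tokenizer.save_pretrained("./workspace/qwen3-8m-finetuned")
```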