
---
title: "Porting nanochat to Transformers: an AI modeling history lesson"
subtitle: "There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
description: "tldr: There is a lot to learn about ML from nanochat, and even more to learn about the history of the transformer architecture."
authors:
  - name: "Ben Burtenshaw"
    url: "https://huggingface.co/burtenshaw"
    affiliations: [1]
  - name: "Sergio Paniego"
    url: "https://huggingface.co/sergiopaniego"
    affiliations: [1]
  - name: "Anton Vlasjuk"
    url: "https://huggingface.co/AntonV"
    affiliations: [1]
  - name: "Pedro Cuenca"
    url: "https://huggingface.co/pcuenq"
    affiliations: [1]
  - name: "Aritra Roy Gosthipaty"
    url: "https://huggingface.co/ariG23498"
    affiliations: [1]
affiliations:
  - name: "Hugging Face"
    url: "https://huggingface.co"
published: "Dec. 01, 2025"
doi: 10.1234/abcd.efgh
licence: >
  Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
  Figures reused from other sources are excluded and marked in their captions ("Figure from …").
tags:
  - research
  - template
tableOfContentsAutoCollapse: true
pdfProOnly: false
showPdf: true
---

	import Sidenote from '../components/Sidenote.astro'

	import GRPO from "./chapters/grpo.mdx";
	import SFT from "./chapters/sft.mdx";
	import Inference from "./chapters/inference.mdx";

	<Sidenote>

	The [nanochat-students](https://huggingface.co/nanochat-students) organization on Hugging Face hosts community models and discussions.

	</Sidenote>

Recently, we were helping students of the nanochat project share their models and discuss their learning on Hugging Face. In the process, we thought it would be useful if the model was integrated into the `transformers` library. This would allow others to use their nanochat models in loads of downstream libraries like [vLLM](https://docs.vllm.ai/) for inference or [TRL](https://huggingface.co/docs/trl/index) for post-training.

	<Sidenote>

	[vLLM](https://docs.vllm.ai/) provides high-throughput inference, while [TRL](https://huggingface.co/docs/trl/index) offers tools for reinforcement learning from human feedback (RLHF) and other post-training methods.

	</Sidenote>

You can now use nanochat models in `transformers` and tap into all those educational gains across the ecosystem. But along the way, we uncovered a further treasure trove of education about how canonical models relate to each other and the components they share. We received the lesson from a simple teacher: class inheritance and the modular philosophy of `transformers`.

	<Sidenote>

	Learn more about how transformers achieves modularity in the [modular transformers guide](https://huggingface.co/docs/transformers/v4.48.0/modular_transformers).

	</Sidenote>

Now, let's tuck into this deep dive on how nanochat relates to the lineage of transformer architectures.

	## What is nanochat?

	<Sidenote>

	See Karpathy's [original announcement](https://x.com/karpathy/status/1977755427569111362) and the [nanochat repository](https://github.com/karpathy/nanochat) on GitHub.

	</Sidenote>

On October 13th 2025, Andrej Karpathy unceremoniously dropped the nanochat repo into the unsuspecting AI world. To hype seekers, this was just a small and pretty average LLM. To ML devotees, this was nirvana: a raw, unadulterated chance to tinker, fiddle, and play with a transformer model defined in pure PyTorch. Nothing was hidden away in fancy `torch` methods or inherited from complex class structures. It was all there in a simple file.

	![image1](./assets/image/tweet.png)

	<Sidenote>

	The core libraries Karpathy avoided: [transformers](https://huggingface.co/docs/transformers/index), [tokenizers](https://huggingface.co/docs/tokenizers/index), [datasets](https://huggingface.co/docs/datasets/index), [trl](https://huggingface.co/docs/trl/index), and many dependencies. All for the sake of our learning!

	</Sidenote>

Karpathy had painstakingly implemented an end-to-end build of an LLM system without the use of most major libraries, even though in real-world situations most of us rely on transformers, tokenizers, datasets, trl, etc. This back-to-basics approach gives us the chance to genuinely learn and understand something from the ground up.

	Personally, I found the process to be one of the most educational I can remember.

	## What is transformers?

	<Sidenote>

	The [transformers documentation](https://huggingface.co/docs/transformers/index) covers everything from quickstart guides to advanced model internals.

	</Sidenote>

Most of us know the `transformers` library as the backbone of modern machine learning, but if we dig a little deeper, it's also a powerful educational resource.

If you don't know, `transformers` is the de facto implementation of the modern AI models that share its name: 'transformer' models like those in the GPT, DeepSeek, and Claude series. `transformers` is a special project because it contains the implementation of all major open model architectures, and those architectures are modularized to reuse functionality from each other.

	<Sidenote>

	Explore the [model hub](https://huggingface.co/models) to see thousands of models built on these shared architectures.

	</Sidenote>

	In general, scientists at AI research labs design, implement, and train their models in their framework of choice, be that torch, JAX, etc. When they come to share their open model with the community, they will open a PR on transformers and refactor their code to use relevant modules.

Because `transformers` contains most major model implementations, researchers have to inherit model architecture attributes from other canonical models. This makes the library, in every sense, a 'single source of truth'.

	<Sidenote>

	See nanochat's [RMSNorm implementation](https://github.com/huggingface/transformers/blob/9f5b2d1b8995daa539b757e28c337e36408055e6/src/transformers/models/nanochat/modular_nanochat.py#L44) in the transformers codebase.

	</Sidenote>

This practical feature of the library has an amazingly educational quality to it. We can read a model implementation as a series of references to other usages of those architectural features. When one model uses a certain type of RMSNorm, we can plainly see that it is the same implementation as another model's because it inherits that class entirely. For example, check out nanochat's RMSNorm:

```py
class NanoChatRMSNorm(Llama4TextL2Norm):
    pass
```

	The `transformers` library then converts the `modular_` implementation into a `modeling_` implementation, which contains the complete `torch` native implementation:

```py
class NanoChatRMSNorm(torch.nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        return self._norm(x.float()).type_as(x)

    def extra_repr(self):
        return f"eps={self.eps}"
```
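To see what this module does numerically, here is a small sketch (not part of the library) that re-declares the same computation and checks that every output row ends up with a root-mean-square close to 1, and that the module carries no learnable parameters. The tensor shape and the input scale are arbitrary choices for illustration.

```python
import torch


class NanoChatRMSNorm(torch.nn.Module):
    """Parameter-free RMS normalization, same computation as the generated code above."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        x32 = x.float()
        normed = x32 * torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + self.eps)
        return normed.type_as(x)


torch.manual_seed(0)
norm = NanoChatRMSNorm()
x = torch.randn(2, 4, 16) * 5.0  # arbitrary scale; the norm should remove it
y = norm(x)

# Root-mean-square over the last dimension, one value per token
rms = y.pow(2).mean(-1).sqrt()
num_params = sum(p.numel() for p in norm.parameters())
print(rms.min().item(), rms.max().item(), num_params)
```

Because there is no learnable gain, the layer only rescales each vector; this is why the modular file can inherit it from an L2-norm-style class without adding any weights.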

	If we review a model in `transformers`, we can review both sides and learn from the math and literature of the model's implementation. Due to the educational nature of nanochat, I thought that it was a perfect opportunity to explore this aspect of transformers and share what I learnt with students.

	## Why do we need nanochat in transformers?

	It might seem counterintuitive to support an educational model like nanochat in a production grade library like `transformers`. After all, we can see from nanochat's benchmark scores that it does not rival state of the art models like [Qwen3](https://huggingface.co/collections/Qwen/qwen3), [SmolLM3](https://huggingface.co/collections/HuggingFaceTB/smollm3), [Gemma3](https://huggingface.co/collections/google/gemma-3-release), or [Olmo3](https://huggingface.co/allenai/Olmo-3-32B-Think). In fact, that's the reason we think nanochat should be in `transformers`. Here's what the community gains from its inclusion:

- `transformers` as a single source of truth teaches us about `nanochat`'s lineage.
- We can use the `nanochat` model in other libraries.
- We save money by reusing nanochat checkpoints for fine-tuning.
- We can compare the nanochat fine-tuning implementation with other open model checkpoints.

Firstly, as mentioned above, `transformers` teaches us about the modeling conventions that Karpathy borrows from other canonical implementations.

	Secondly, because transformers is a standard within the ecosystem, it unlocks more downstream learning in post training libraries, quantisation tools, inference libraries, and device integrations. In practical terms, here are some examples nanochat students could learn on top of `transformers`:

	<Sidenote>

	Learn about [model quantization](https://huggingface.co/docs/transformers/en/quantization/overview) to reduce model size and memory usage.

	</Sidenote>

- Quantize models in llama.cpp (\$0)
- Integrate models into the browser and WebGPU (\$0)
- SFT training in TRL/torch on Google Colab (\$0)
- RL training in TRL/torch on Google Colab (\$0 - \$9)
- Agentic RL in TRL on Google Colab (\$0 - \$9)


Finally, training AI models is expensive. Running the nanochat `speedrun.sh` costs between \$200 and \$2k depending on the model size we use. That is little compared to the millions of dollars invested by frontier labs, but it is still a significant sum for students, who always learn best by taking a few chances to fail and build experience.

	<Sidenote>

	The [speedrun.sh](https://github.com/karpathy/nanochat/blob/master/speedrun.sh) script in nanochat benchmarks training costs across different configurations.

	</Sidenote>

	In short, let's unlock more opportunities for education!

	## The nanochat architecture

	<Sidenote>

	The original [gpt.py](https://github.com/karpathy/nanochat/blob/master/nanochat/gpt.py) implementation is just 291 lines of pure PyTorch.

	</Sidenote>

	As described by Karpathy, nanochat uses an archetypal architecture that is common across the field, which makes it an excellent choice for an educational resource because folk get to learn from what works. The core model implementation demonstrates modern transformer architecture, with every design decision documented and justified.

The configuration uses a single complexity slider: depth. Set `--depth=20` and everything else automatically adjusts. Model dimension equals depth × 64 (20 layers → 1,280 dimensions). Number of attention heads equals depth ÷ 2 (10 heads). Head dimension is fixed at 128. This "aspect ratio philosophy" simplifies scaling. So if you want a more capable model or have a bigger budget, just increase depth to 26 (\$300 budget) or 30 (\$1,000 budget).
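The arithmetic behind the depth slider can be sketched in a few lines. `config_from_depth` is a hypothetical helper written for this article, not a function from nanochat; it only encodes the ratios quoted in the paragraph above (dimension = depth × 64, head dimension fixed at 128).

```python
def config_from_depth(depth: int, head_dim: int = 128) -> dict:
    """Derive model sizes from depth using the ratios quoted above (hypothetical helper)."""
    model_dim = depth * 64             # model dimension = depth x 64
    num_heads = model_dim // head_dim  # equals depth / 2 when head_dim is 128
    return {"depth": depth, "model_dim": model_dim, "num_heads": num_heads, "head_dim": head_dim}


for depth in (20, 26, 30):
    print(config_from_depth(depth))
```

With these ratios, depth 20 gives 1,280 dimensions and 10 heads, depth 26 gives 1,664 dimensions and 13 heads, and depth 30 gives 1,920 dimensions and 15 heads.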

The architecture incorporates five key improvements over vanilla transformers. Let's work through the components of this architecture and compare them across implementations:

### Forward pass based on the Llama Architecture

	<Sidenote>

	See the [Llama model documentation](https://huggingface.co/docs/transformers/en/model_doc/llama) for the full architecture details.

	</Sidenote>

The forward pass in nanochat handles both training and generation. Reading the code, the input `x` is embedded, updated by each layer in turn, normalized, and passed to the head. During training, a loss is calculated and returned instead of the logits themselves.

```py
def forward(self, x, targets=None, loss_reduction='mean'):
    x = self.token_emb(x)
    for layer in self.layers:
        x = layer(x)
    x = self.ln_f(x)
    logits = self.lm_head(x)

    if targets is not None:
        loss = F.cross_entropy(
            logits.view(-1, self.vocab_size),
            targets.view(-1),
            ignore_index=-1,
            reduction=loss_reduction
        )
        return loss
    return logits
```

	By returning loss directly when targets are provided, the training loop becomes trivial. No separate loss computation, no manual masking logic—just `loss = model(inputs, targets)` followed by `loss.backward()`.
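As a rough illustration of how short that loop gets, here is a sketch with a toy model. `TinyLM`, its sizes, the random batch, and the optimizer settings are all made up for this example; only the `forward(inputs, targets)` contract mirrors the snippet above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Hypothetical stand-in model: embedding + linear head, same forward contract as above."""

    def __init__(self, vocab_size: int = 32, dim: int = 16):
        super().__init__()
        self.vocab_size = vocab_size
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, x, targets=None):
        logits = self.lm_head(self.token_emb(x))
        if targets is not None:
            # Positions labelled -1 are skipped, mirroring ignore_index=-1 above
            return F.cross_entropy(
                logits.reshape(-1, self.vocab_size), targets.reshape(-1), ignore_index=-1
            )
        return logits


torch.manual_seed(0)
model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

tokens = torch.randint(0, 32, (4, 9))            # random toy batch
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # next-token prediction

losses = []
for _ in range(20):
    loss = model(inputs, targets)  # the loss comes straight out of forward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Calling `model(inputs)` without targets returns logits of shape `(batch, sequence, vocab)`, which is the path used at generation time.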

	<Sidenote>

The [BaseModelOutputWithPast](https://huggingface.co/docs/transformers/en/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPast) class standardizes model outputs across the ecosystem. Base models return hidden states; the logits and the loss are computed by wrapper classes like `ForCausalLM`. You'll often see `if labels is not None: loss = self.loss_function(...)` rather than a direct call to `F.cross_entropy`. This seemingly roundabout approach exists because of [gradient accumulation bugs](https://unsloth.ai/blog/gradient) that forced a rethink of how loss is computed depending on the trainer context.

	</Sidenote>

	`transformers` has to make things a bit more complex to facilitate the downstream ecosystem that uses logits in a broad spectrum of ways. Therefore, loss calculation is dealt with in training-specific code, and the `forward` function returns `BaseModelOutputWithPast`.

```py
class NanoChatModel(LlamaModel):
    def __init__(self, config: NanoChatConfig):
        super().__init__(config)

        self.initial_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
        self.norm = NanoChatRMSNorm(eps=config.rms_norm_eps)

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Cache] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        cache_position: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> BaseModelOutputWithPast:
        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

        if inputs_embeds is None:
            inputs_embeds: torch.Tensor = self.embed_tokens(input_ids)

        if use_cache and past_key_values is None:
            past_key_values = DynamicCache(config=self.config)

        if cache_position is None:
            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
            cache_position: torch.Tensor = torch.arange(
                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
            )

        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)

        causal_mask = create_causal_mask(
            config=self.config,
            input_embeds=inputs_embeds,
            attention_mask=attention_mask,
            cache_position=cache_position,
            past_key_values=past_key_values,
            position_ids=position_ids,
        )

        hidden_states = inputs_embeds
        position_embeddings = self.rotary_emb(hidden_states, position_ids=position_ids)

        hidden_states = self.initial_norm(hidden_states)  # Additional norm before the layers
        for decoder_layer in self.layers[: self.config.num_hidden_layers]:
            hidden_states = decoder_layer(
                hidden_states,
                attention_mask=causal_mask,
                position_embeddings=position_embeddings,
                position_ids=position_ids,
                past_key_values=past_key_values,
                cache_position=cache_position,
                **kwargs,
            )

        hidden_states = self.norm(hidden_states)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=past_key_values,
        )
```

### Rotary Position Embeddings (RoPE)

	<Sidenote>

	The [RoFormer paper](https://arxiv.org/abs/2104.09864) introduced RoPE, now used in Llama, Mistral, and many other modern LLMs.

	</Sidenote>

	Rotary Position Embeddings (RoPE) replace learned positional encodings by rotating query and key vectors using precomputed sin/cos frequencies:

```py
def apply_rope(x, cos, sin):
    x1, x2 = x[..., ::2], x[..., 1::2]
    y1 = x1 * cos - x2 * sin
    y2 = x1 * sin + x2 * cos
    return torch.stack([y1, y2], dim=-1).flatten(-2)
```
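Two properties make this rotation useful as a positional encoding: it leaves vector lengths unchanged, and the dot product between a rotated query and a rotated key depends only on the distance between their positions. The sketch below checks both. `rope_tables` and its `base` of 10000 are assumptions made for this illustration, not values taken from nanochat.

```python
import torch


def apply_rope(x, cos, sin):
    # Same interleaved-pair rotation as the snippet above
    x1, x2 = x[..., ::2], x[..., 1::2]
    y1 = x1 * cos - x2 * sin
    y2 = x1 * sin + x2 * cos
    return torch.stack([y1, y2], dim=-1).flatten(-2)


def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    """Precompute cos/sin of shape (seq_len, head_dim // 2). The base value is an assumption."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float64) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float64), inv_freq)
    return angles.cos(), angles.sin()


torch.manual_seed(0)
seq_len, head_dim = 8, 16
cos, sin = rope_tables(seq_len, head_dim)

# Place the same query vector (and the same key vector) at every position
q = torch.randn(head_dim, dtype=torch.float64).expand(seq_len, head_dim)
k = torch.randn(head_dim, dtype=torch.float64).expand(seq_len, head_dim)
q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)

# Rotations preserve vector length
norms_match = torch.allclose(q_rot.norm(dim=-1), q.norm(dim=-1))

# Scores depend only on the offset: positions (3, 1) and (6, 4) are both 2 apart
score_a = q_rot[3] @ k_rot[1]
score_b = q_rot[6] @ k_rot[4]
print(norms_match, score_a.item(), score_b.item())
```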

	In transformers, the rotary embeddings are implemented like so:

```py
from ..llama.modeling_llama import (
    LlamaDecoderLayer,
    LlamaModel,
    LlamaPreTrainedModel,
    LlamaRotaryEmbedding,
    apply_rotary_pos_emb,
    eager_attention_forward,
)


class NanoChatRotaryEmbedding(LlamaRotaryEmbedding):
    pass


def rotate_half(x):
    """Rotates half the hidden dims of the input with flipped signs for NanoChat."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((x2, -x1), dim=-1)
```

	`NanoChatRotaryEmbedding` almost entirely inherits from the original Llama series, except for a sign inversion in `rotate_half`.
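To make the sign inversion concrete, this sketch puts the two conventions side by side. The function names `rotate_half_llama` and `rotate_half_nanochat` are labels chosen for this example; the Llama variant returns `(-x2, x1)` while the NanoChat variant shown above returns `(x2, -x1)`.

```python
import torch


def rotate_half_llama(x):
    # Llama convention: (-x2, x1)
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def rotate_half_nanochat(x):
    # NanoChat variant from the snippet above: (x2, -x1)
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((x2, -x1), dim=-1)


x = torch.arange(1.0, 9.0)  # values 1..8
llama_out = rotate_half_llama(x)        # values: -5, -6, -7, -8, 1, 2, 3, 4
nanochat_out = rotate_half_nanochat(x)  # values: 5, 6, 7, 8, -1, -2, -3, -4
print(llama_out.tolist(), nanochat_out.tolist())
```

The two outputs are exact negatives of each other. Since the rotary formula is `x * cos + rotate_half(x) * sin`, flipping this sign is the same as negating `sin`, so the vectors rotate in the opposite direction.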

	### QK Normalization

	<Sidenote>

	QK normalization was popularized by [Llama 4](https://huggingface.co/docs/transformers/en/model_doc/llama4) and helps stabilize attention scores during training.

	</Sidenote>

	NanoChat applies RMSNorm to queries and keys before computing attention to stabilize training.

	In the original gpt.py, this is achieved via a functional norm helper applied directly inside the attention forward pass:

```py
def norm(x):
    # Purely functional rmsnorm with no learnable params
    return F.rms_norm(x, (x.size(-1),))

class CausalSelfAttention(nn.Module):
    ...
    def forward(self, x, cos_sin, kv_cache):
        B, T, C = x.size()

        # Project the input to get queries, keys, and values
        q = self.c_q(x).view(B, T, self.n_head, self.head_dim)
        k = self.c_k(x).view(B, T, self.n_kv_head, self.head_dim)
        v = self.c_v(x).view(B, T, self.n_kv_head, self.head_dim)

        # Apply Rotary Embeddings to queries and keys to get relative positional encoding
        cos, sin = cos_sin
        q, k = apply_rotary_emb(q, cos, sin), apply_rotary_emb(k, cos, sin) # QK rotary embedding
        q, k = norm(q), norm(k) # QK norm
        q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2) # make head be batch dim, i.e. (B, T, H, D) -> (B, H, T, D)
        ...
```
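One way to see why this stabilizes training: after a parameter-free RMS norm, every query and key has length close to `sqrt(head_dim)`, so the scaled attention logit can never exceed `sqrt(head_dim)` in magnitude, however large the raw activations are. The sketch below uses a hand-written `rms_norm`, an exaggerated input scale, and the common `1 / sqrt(head_dim)` scaling; all three are assumptions for illustration.

```python
import torch


def rms_norm(x, eps: float = 1e-6):
    """Parameter-free RMS norm over the last dimension (hand-written for this sketch)."""
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


torch.manual_seed(0)
head_dim = 128
# Deliberately large activations, as might appear in an unstable training run
q = torch.randn(4, 10, head_dim) * 50.0
k = torch.randn(4, 10, head_dim) * 50.0

scale = head_dim ** -0.5
raw_logits = (q @ k.transpose(-1, -2)) * scale
normed_logits = (rms_norm(q) @ rms_norm(k).transpose(-1, -2)) * scale

# After the norm every vector has length ~sqrt(head_dim), so |q . k| <= head_dim
# and the scaled logit is bounded by sqrt(head_dim) regardless of the input scale.
bound = head_dim ** 0.5
print(raw_logits.abs().max().item(), normed_logits.abs().max().item(), bound)
```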

	<Sidenote>

	[Qwen3](https://huggingface.co/docs/transformers/en/model_doc/qwen3) provides a robust attention implementation that nanochat extends with QK normalization.

	</Sidenote>

	In the modular transformers implementation, we see a fascinating mix of lineages. The `NanoChatRMSNorm` inherits directly from `Llama4TextL2Norm`, while the attention mechanism inherits from `Qwen3Attention`. We simply inject the QK normalization into the Qwen3 logic:

```py
class NanoChatRMSNorm(Llama4TextL2Norm):
    pass

class NanoChatAttention(Qwen3Attention):
    def __init__(self, config: NanoChatConfig, layer_idx: int):
        super().__init__(config, layer_idx)
        del self.sliding_window
        del self.layer_type

        self.q_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)
        self.k_norm = NanoChatRMSNorm(eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
        attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[Cache] = None,
        cache_position: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
        input_shape = hidden_states.shape[:-1]
        hidden_shape = (*input_shape, -1, self.head_dim)

        query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

        cos, sin = position_embeddings
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        # RoPE -> Norm (instead of usual Norm -> RoPE)
        query_states = self.q_norm(query_states)
        key_states = self.k_norm(key_states)

        if past_key_values is not None:
            # sin and cos are specific to RoPE models; cache_position needed for the static cache
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_values.update(key_states, value_states, self.layer_idx, cache_kwargs)

        attention_interface: Callable = eager_attention_forward
        if self.config._attn_implementation != "eager":
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            attention_mask,
            dropout=0.0 if not self.training else self.attention_dropout,
            scaling=self.scaling,
            **kwargs,
        )

        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
        attn_output = self.o_proj(attn_output)
        return attn_output, attn_weights
```

	### Untied Weights

	<Sidenote>

	Weight tying between embeddings and the LM head is common but [research shows](https://arxiv.org/abs/1608.05859) untying can improve performance.

	</Sidenote>

	Karpathy's implementation deliberately unties the weights between the token embedding and the language model head to provide the model with more flexibility. In gpt.py, these are initialized as two completely separate modules:

```py
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict({
            "wte": nn.Embedding(config.vocab_size, config.n_embd),
            "h": nn.ModuleList([Block(config, layer_idx) for layer_idx in range(config.n_layer)]),
        })
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # ... (rest of init)
```

	<Sidenote>

	[Gemma 2](https://huggingface.co/docs/transformers/en/model_doc/gemma2) supports both tied and untied weight configurations via the model config.

	</Sidenote>

In the modular implementation, we inherit from [`Gemma2ForCausalLM`](https://huggingface.co/docs/transformers/en/model_doc/gemma2). By simply inheriting the class, we pull in all the necessary machinery for causal generation, while the configuration object (defined elsewhere) ensures the weights remain untied. Gemma 2 itself ties weights by default, so we inherit primarily for code structure alignment and softcapping support. The `tie_word_embeddings` config flag controls the behavior, with `_tied_weights_keys` defining the mapping if tying is applied:

```py
class NanoChatForCausalLM(Gemma2ForCausalLM):
    def forward(self, **super_kwargs) -> CausalLMOutputWithPast:
        super().forward(**super_kwargs)
```
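The difference between tied and untied weights is easy to see in plain PyTorch. This is a minimal sketch with toy sizes, independent of the `transformers` config machinery: tying means the LM head and the embedding share one `Parameter`, so untying doubles the parameter count for those two layers.

```python
import torch.nn as nn

vocab_size, hidden = 100, 16  # toy sizes chosen for illustration

# Untied: embedding and LM head own separate weight matrices
emb_untied = nn.Embedding(vocab_size, hidden)
head_untied = nn.Linear(hidden, vocab_size, bias=False)

# Tied: the LM head reuses the embedding matrix (same Parameter object)
emb_tied = nn.Embedding(vocab_size, hidden)
head_tied = nn.Linear(hidden, vocab_size, bias=False)
head_tied.weight = emb_tied.weight


def count_unique_params(*modules):
    """Count parameters once even if several modules share them."""
    seen = {id(p): p.numel() for m in modules for p in m.parameters()}
    return sum(seen.values())


untied_params = count_unique_params(emb_untied, head_untied)
tied_params = count_unique_params(emb_tied, head_tied)
print(untied_params, tied_params)  # 3200 1600
```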

	### ReLU² Activation

	<Sidenote>

	The [Primer paper](https://arxiv.org/abs/2109.08668) showed squared ReLU can match or exceed GELU performance with lower compute.

	</Sidenote>

	The original implementation replaces the standard GELU activation with ReLU², which is simply ReLU squared. This provides a faster alternative without performance loss. In gpt.py, this is hardcoded into the MLP block:

```py
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)

    def forward(self, x):
        x = self.c_fc(x)
        x = F.relu(x).square()
        x = self.c_proj(x)
        return x
```
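For a feel of what the activation does on its own, here is a tiny sketch that evaluates ReLU² and its gradient on a handful of hand-picked values. The inputs are arbitrary; the point is that negative inputs are zeroed, positive inputs are squared, and the gradient is `2 * x` on the positive side.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
relu2 = F.relu(x).square()
print(relu2.tolist())  # [0.0, 0.0, 0.0, 1.0, 4.0, 9.0]

# Gradient is 2 * x for positive inputs and 0 elsewhere
x_grad = x.clone().requires_grad_(True)
F.relu(x_grad).square().sum().backward()
grad = x_grad.grad
print(grad.tolist())
```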

	<Sidenote>

	[CLIP](https://huggingface.co/docs/transformers/en/model_doc/clip) provides a clean MLP structure that nanochat extends with the ReLU² activation.

	</Sidenote>

In the modular file, we see another surprising inheritance: `CLIPMLP`. The CLIP MLP has the two-layer structure we need, so we inherit the structural definition from CLIP, redefine both linear layers without biases, and let the configuration drive the activation function (ReLU²):

```py
class NanoChatMLP(CLIPMLP):
    def __init__(self, config):
        super().__init__(config)
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
```

	### Multi-Query Attention (MQA)

	<Sidenote>

	The [GQA paper](https://arxiv.org/abs/2305.13245) explains how grouped-query attention reduces memory while maintaining quality.

	</Sidenote>

	NanoChat uses Multi-Query Attention (MQA) to reduce the memory footprint of the KV cache, using 6 query heads but only 1 key/value head (in the default config). This is a common configuration for smaller models like nanochat.
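A back-of-the-envelope calculation shows where the saving comes from. `kv_cache_bytes` is a hypothetical helper, and the layer count, head dimension, sequence length, and 2-byte dtype are illustrative assumptions rather than measured nanochat numbers; only the 6-versus-1 head comparison comes from the paragraph above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Illustrative numbers only: 20 layers, head_dim 128, 2048 tokens, bfloat16 (2 bytes)
mha = kv_cache_bytes(num_layers=20, num_kv_heads=6, head_dim=128, seq_len=2048)
mqa = kv_cache_bytes(num_layers=20, num_kv_heads=1, head_dim=128, seq_len=2048)
print(mha / 2**20, mqa / 2**20, mha // mqa)  # 120.0 20.0 6
```

The cache shrinks in direct proportion to the number of key/value heads, so going from 6 heads to 1 cuts it by a factor of 6 under these assumptions.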

	<Sidenote>

	PyTorch's [scaled_dot_product_attention](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) handles GQA broadcasting automatically via `enable_gqa`.

	</Sidenote>

	In gpt.py, this logic is handled by passing distinct head counts and relying on PyTorch's functional attention to handle the broadcasting (or explicitly handling it during inference):

```py
class CausalSelfAttention(nn.Module):
    # ...
    def forward(self, x, cos_sin, kv_cache):
        # ...
        # Attention: queries attend to keys/values autoregressively. A few cases to handle:
        enable_gqa = self.n_head != self.n_kv_head # Group Query Attention (GQA): duplicate key/value heads to match query heads if desired
        if kv_cache is None or Tq == Tk:
            # During training (no KV cache), attend as usual with causal attention
            # And even if there is KV cache, we can still use this simple version when Tq == Tk
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=enable_gqa)
        elif Tq == 1:
            # During inference but with a single query in this forward pass:
            # The query has to attend to all the keys/values in the cache
            y = F.scaled_dot_product_attention(q, k, v, is_causal=False, enable_gqa=enable_gqa)
        else:
            # During inference AND we have a chunk of queries in this forward pass:
            # First, each query attends to all the cached keys/values (i.e. full prefix)
            attn_mask = torch.zeros((Tq, Tk), dtype=torch.bool, device=q.device) # True = keep, False = mask
            prefix_len = Tk - Tq
            if prefix_len > 0: # can't be negative but could be zero
                attn_mask[:, :prefix_len] = True
            # Then, causal attention within this chunk
            attn_mask[:, prefix_len:] = torch.tril(torch.ones((Tq, Tq), dtype=torch.bool, device=q.device))
            y = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, enable_gqa=enable_gqa)
        # ...
```
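The third branch is the least obvious one, so here is the mask construction pulled out into a standalone sketch. `chunked_causal_mask` is a name invented for this example; the body follows the logic of the `else` branch above, with a small `Tq=3`, `Tk=5` case chosen so the result is easy to read.

```python
import torch


def chunked_causal_mask(Tq: int, Tk: int) -> torch.Tensor:
    """Boolean mask (True = keep) for Tq new queries attending to Tk total keys."""
    attn_mask = torch.zeros((Tq, Tk), dtype=torch.bool)
    prefix_len = Tk - Tq
    if prefix_len > 0:
        attn_mask[:, :prefix_len] = True  # every query sees the whole cached prefix
    # Causal (lower-triangular) attention within the new chunk
    attn_mask[:, prefix_len:] = torch.tril(torch.ones((Tq, Tq), dtype=torch.bool))
    return attn_mask


mask = chunked_causal_mask(Tq=3, Tk=5)
print(mask.int())
```

With two cached tokens and three new ones, the first new query sees three keys, the second sees four, and the last sees all five. When `Tq == Tk` the prefix is empty and the mask reduces to the ordinary lower-triangular causal mask.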

In `modular_nanochat.py`, we don't need to write this logic at all. As seen in the QK Normalization section above, `NanoChatAttention` inherits from `Qwen3Attention`. The Qwen3 implementation is robust and fully supports GQA/MQA out of the box. By using this parent class, we get a production-grade attention implementation "for free," allowing us to focus solely on the unique normalizations required by NanoChat.

	## Conclusion

Andrej Karpathy's implementation clearly offers far more to learn from at first read than the `transformers` version, which inherits almost entirely from existing models or features. That said, we can still take more away from the inherited modular modeling implementation. Models like Llama, Llama4, Gemma2, Qwen3, and CLIP are all reused to create a genuinely canonical implementation of a transformer.

	Ok. Let's cut the philosophy and see what we can do with `nanochat` in transformers.

	<Inference />

	<SFT />