Hi @sirluk, thanks for the great post. Do you know if the above masking technique works with some attention implementations but is incompatible with others? For example, would the above masking work with SDPA, flash_attention_2, and eager (each of these implementations is handled a bit differently in https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L666, for example)?
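For concreteness, here is a minimal sketch of the kind of mask I have in mind: a dense 4D additive mask of the sort SDPA and eager accept (assuming `position_ids` that restart at 0 for each packed document; the helper name `packed_causal_mask` is just for illustration, not from the post):

```python
import torch

def packed_causal_mask(position_ids: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Build a 4D additive attention mask of shape (batch, 1, seq, seq) for packed sequences.

    Assumes position_ids restart at 0 for each packed document, e.g.
    [0, 1, 2, 0, 1] for two documents of lengths 3 and 2 packed into one row.
    A token may only attend to earlier tokens of the same document.
    """
    # Every position that resets to 0 starts a new document.
    doc_ids = (position_ids == 0).long().cumsum(dim=-1)              # (batch, seq)

    # Block-diagonal part: query and key must belong to the same document.
    same_doc = doc_ids.unsqueeze(-1) == doc_ids.unsqueeze(-2)        # (batch, seq, seq)

    # Causal part: a query may only look at keys at the same or earlier positions.
    seq_len = position_ids.shape[-1]
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=position_ids.device)
    )

    allowed = same_doc & causal                                      # broadcasts over batch

    # Additive mask: 0 where attention is allowed, a large negative value where blocked.
    mask = torch.zeros(allowed.shape, dtype=dtype, device=position_ids.device)
    mask = mask.masked_fill(~allowed, torch.finfo(dtype).min)
    return mask.unsqueeze(1)                                         # (batch, 1, seq, seq)


# Example: two documents of lengths 3 and 2 packed into one sequence of length 5.
position_ids = torch.tensor([[0, 1, 2, 0, 1]])
mask = packed_causal_mask(position_ids)
print(mask[0, 0])  # 5x5 block-diagonal causal mask (0 = attend, min value = blocked)
```

As far as I can tell, flash_attention_2 in Transformers does not consume a dense 4D mask like this, which is why I am wondering whether the technique carries over.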
Article: Efficient LLM Pretraining: Packed Sequences and Masked Attention (Oct 7, 2024)