Collections
Collections including paper arxiv:2404.02258
-
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 107
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 111
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
Paper • 2403.09977 • Published • 11
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series
Paper • 2403.15360 • Published • 13
-
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Paper • 2404.07413 • Published • 38
Rho-1: Not All Tokens Are What You Need
Paper • 2404.07965 • Published • 93
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 111
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 107
-
Jamba: A Hybrid Transformer-Mamba Language Model
Paper • 2403.19887 • Published • 111
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Paper • 2404.00399 • Published • 42
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 107
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Paper • 2404.08801 • Published • 66
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 628
BitNet: Scaling 1-bit Transformers for Large Language Models
Paper • 2310.11453 • Published • 105
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 107
TransformerFAM: Feedback attention is working memory
Paper • 2404.09173 • Published • 43
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 59
APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
Paper • 2401.06761 • Published • 1
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Paper • 2401.02669 • Published • 16
MambaByte: Token-free Selective State Space Model
Paper • 2401.13660 • Published • 60