vLLM v0.11.1 seems to work, but v0.11.2 fails
I first tried to run it with the latest pip install method (using v0.11.2.dev282+g0353d2e16) and got a:
TypeError: MLAModules.__init__() missing 1 required positional argument: 'indexer_rotary_emb'
After trying to reproduce a working version with your recommendations (which ended in dependency hell), I just tried the previous version v0.11.1 - their standard Docker container - which works out of the box. May be worth a note on the model card.
Could you share your Docker config @stelterlab? I'm having a very hard time running it as well, even with 0.11.1 <3
vllm serve cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit --trust-remote-code --max-model-len 131072 --gpu-memory-utilization 0.95 --max-num-seqs 8 --enable-auto-tool-choice --tool-call-parser kimi_k2
using an H100 NVL (94 GB). Not yet optimized. I was just happy yesterday that it started up at the end of the day. :-D I copied the line from the other discussion in here and tweaked it just a little. But without limiting max-num-seqs it runs out of memory, so I still need to play with that value. A vLLM VRAM calculator would be nice, one that considers all the arguments that affect VRAM consumption and the model chosen. Too many knobs to turn.
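As for the Docker config question above: nothing special, just the official vllm/vllm-openai image with the serve arguments passed to the container. Something along these lines should work; the v0.11.1 tag, the cache mount, and passing the model via --model are assumptions based on the standard docs-style invocation, so adjust to your environment (newer images may expect the model as a positional argument instead):

docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.11.1 \
  --model cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
  --trust-remote-code --max-model-len 131072 \
  --gpu-memory-utilization 0.95 --max-num-seqs 8 \
  --enable-auto-tool-choice --tool-call-parser kimi_k2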
Ah, and it may be noteworthy that I get the following warnings:
(EngineCore_DP0 pid=1122) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (28) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=1122) return fn(*contiguous_args, **contiguous_kwargs)
But the model does work.
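For a quick sanity check once the server is up, a minimal request against the OpenAI-compatible endpoint does the job (the served model name defaults to the repo id used in the serve command; adjust host/port if yours differ):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'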
Hi @stelterlab, thank you for letting me know. At the time of writing my model card, the model only worked with the then-latest commits. vLLM v0.11.1 should be the recommended version for the model now.
Regarding the input tensor shape mismatch, I am also investigating to understand the root cause. Were you running the model with pipeline parallelism when the warning appeared?
Thanks again for using the model and notifying me.
I got the same warnings, and the tokens/s on a single Pro 6000 is terrible... 60 t/s.
I used GLM-4.5 Air before, which got me 180 t/s.
I was hoping for a speedup on a 48B / 3B-active MoE model, but I guess it's layered quite differently from the standard... I should probably read that paper :D
Thank you for the quants. I just used one H100 without any parallelism.
GLM-4.5 Air has speculative decoding (MTP), so it's not really fair to directly compare those numbers.
Hi, I'm also getting: ERROR 12-05 13:00:10 [multiproc_executor.py:750] TypeError: MLAModules.__init__() missing 1 required positional argument: 'indexer_rotary_emb'
This is with vLLM 0.12.0, and tensor parallelism of 2.
P.S.: It seems that vLLM does not support this in any release:
https://github.com/vllm-project/vllm/issues/30092
https://github.com/vllm-project/vllm/pull/30093
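Until that PR lands in a release, falling back to the v0.11.1 container reported working earlier in this thread looks like the practical workaround (the exact image tag is an assumption based on the posts above):

docker pull vllm/vllm-openai:v0.11.1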