vLLM v0.11.1 seems to work, but v0.11.2 fails
I first tried to run it with the latest pip install method (using v0.11.2.dev282+g0353d2e16) and got a:
TypeError: MLAModules.__init__() missing 1 required positional argument: 'indexer_rotary_emb'
After trying to reproduce a working version with your recommendations (which ended in dependency hell), I just tried the previous version v0.11.1 - their standard Docker container - which works out of the box. May be worth a note on the model card.
Could you share your Docker config @stelterlab? I'm having a very hard time running it as well, even with 0.11.1 <3
vllm serve cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit --trust-remote-code --max-model-len 131072 --gpu-memory-utilization 0.95 --max-num-seqs 8 --enable-auto-tool-choice --tool-call-parser kimi_k2
using an H100 NVL (94 GB). Not yet optimized. I was just happy yesterday that it started up at the end of the day. :-D I copied the line from the other discussion in here and tweaked it just a little. But without limiting max-num-seqs it runs out of memory, so I still need to play with that value. A vLLM VRAM calculator would be nice, one that considers all the arguments that affect VRAM consumption and the model chosen. Too many knobs to turn.
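As for the Docker config question above: nothing special, just the official vllm/vllm-openai image with the serve arguments passed to the container. Something along these lines should work; the v0.11.1 tag, the cache mount, and passing the model via --model are assumptions based on the standard docs-style invocation, so adjust to your environment (newer images may expect the model as a positional argument instead):

docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.11.1 \
  --model cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit \
  --trust-remote-code --max-model-len 131072 \
  --gpu-memory-utilization 0.95 --max-num-seqs 8 \
  --enable-auto-tool-choice --tool-call-parser kimi_k2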
Ah, and it may be noteworthy that I get the following warnings:
(EngineCore_DP0 pid=1122) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (28) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=1122) return fn(*contiguous_args, **contiguous_kwargs)
But the model does work.
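For a quick sanity check once the server is up, a minimal request against the OpenAI-compatible endpoint does the job (the served model name defaults to the repo id used in the serve command; adjust host/port if yours differ):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'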
Hi @stelterlab, thank you for letting me know. At the time of writing my model card, the model only worked with the then-latest commits. vLLM v0.11.1 should be the recommended version for the model now.
Regarding the input tensor shape mismatch, I am also investigating to understand the root cause. Were you running the model with pipeline parallelism when the warning appeared?
Thanks again for using the model and notifying me.
I got the same warnings, and the tokens/s on a single Pro 6000 is terrible... 60 t/s.
I used GLM-4.5 Air before, which got me 180 t/s.
I was hoping for a speedup on a 48B / 3B-active MoE model, but I guess it's layered quite differently from the standard... I should probably read that paper :D
Thank you for the quants. I just used one H100 without any parallelism.
GLM-4.5 Air has speculative decoding (MTP), so it's not really fair to directly compare those numbers.
Hi, I'm also getting: ERROR 12-05 13:00:10 [multiproc_executor.py:750] TypeError: MLAModules.__init__() missing 1 required positional argument: 'indexer_rotary_emb'
This is with vLLM 0.12.0, and tensor parallelism of 2.
P.S.: It seems that vLLM does not support this in any release:
https://github.com/vllm-project/vllm/issues/30092
https://github.com/vllm-project/vllm/pull/30093
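Until that PR lands in a release, falling back to the v0.11.1 container reported working earlier in this thread looks like the practical workaround (the exact image tag is an assumption based on the posts above):

docker pull vllm/vllm-openai:v0.11.1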