---
base_model: deepseek-ai/DeepSeek-V2-Lite-Chat
tags:
- text-generation-inference
- transformers
- unsloth
- deepseek_v2
license: apache-2.0
language:
- en
---

# DeepZirel-V2

An experimental fine-tune of deepseek-ai/DeepSeek-V2-Lite-Chat using novel training approaches aimed at improving older model architectures.

## Model Details

- **Base Model:** deepseek-ai/DeepSeek-V2-Lite-Chat
- **Fine-tuned by:** Daemontatox
- **Purpose:** Architecture improvement research
- **Training:** Experimental data and methodology targeting legacy architecture enhancement
- **Language:** Multilingual

## Training Approach

This model explores new training techniques designed to enhance the performance of older model architectures. The experimental approach focuses on:

- Novel fine-tuning strategies for legacy architectures
- Custom training data optimization
- Architecture-specific improvements

## Inference

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/DeepZirel-V2",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Daemontatox/DeepZirel-V2", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Daemontatox/DeepZirel-V2",
    tensor_parallel_size=2,
    dtype="auto",
    trust_remote_code=True
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### vLLM OpenAI-Compatible Server

```bash
vllm serve Daemontatox/DeepZirel-V2 \
  --tensor-parallel-size 2 \
  --dtype auto \
  --trust-remote-code \
  --max-model-len 4096
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"
)

response = client.chat.completions.create(
    model="Daemontatox/DeepZirel-V2",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```

### TensorRT-LLM

```bash
# Convert to TensorRT-LLM checkpoint format
python convert_checkpoint.py \
  --model_dir Daemontatox/DeepZirel-V2 \
  --output_dir ./trt_ckpt \
  --dtype float16 \
  --tp_size 2

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./trt_ckpt \
  --output_dir ./trt_engine \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 512
```

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="./trt_engine")

sampling_params = SamplingParams(max_tokens=512)

prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Modular MAX

```bash
# Serve with MAX Engine
max serve Daemontatox/DeepZirel-V2 \
  --port 8000 \
  --tensor-parallel-size 2
```

```python
from max import engine

# Load the model with MAX
model = engine.InferenceSession(
    "Daemontatox/DeepZirel-V2",
    device="cuda",
    tensor_parallel=2
)

# Run inference
prompt = "Hello, how are you?"
output = model.generate(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(output.text)
```

```python
# Using MAX with the Python pipeline API
from max.pipelines import pipeline

# Create pipeline
pipe = pipeline(
    "text-generation",
    model="Daemontatox/DeepZirel-V2",
    device="cuda",
    tensor_parallel=2
)

# Generate
result = pipe(
    "Hello, how are you?",
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9
)

print(result[0]["generated_text"])
```

## Limitations

This is an experimental model using novel training approaches on legacy architectures. Results may vary and should be thoroughly tested before production deployment.
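As a minimal starting point for that testing, the sketch below reuses the Transformers setup from above to run the same prompts through both this fine-tune and its base model so the replies can be compared side by side. The prompt list, greedy decoding, and 256-token limit are illustrative placeholders, not an official evaluation protocol.

```python
# Minimal side-by-side smoke test: generate replies from the fine-tune and
# the base model for the same prompts and print both for manual inspection.
# The prompts and decoding settings are placeholders; swap in data from your
# own use case before drawing any conclusions.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a Python function that reverses a string.",
]

def run(model_id, prompts):
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
    )
    replies = []
    for prompt in prompts:
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        # Greedy decoding keeps the comparison deterministic
        output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
        replies.append(
            tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
        )
    return replies

for name, model_id in [
    ("fine-tune", "Daemontatox/DeepZirel-V2"),
    ("base", "deepseek-ai/DeepSeek-V2-Lite-Chat"),
]:
    for prompt, reply in zip(PROMPTS, run(model_id, PROMPTS)):
        print(f"[{name}] {prompt}\n{reply}\n")
```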