---
base_model: deepseek-ai/DeepSeek-V2-Lite-Chat
tags:
- text-generation-inference
- transformers
- unsloth
- deepseek_v2
license: apache-2.0
language:
- en
---

# DeepZirel-V2

An experimental fine-tune of deepseek-ai/DeepSeek-V2-Lite-Chat using novel training approaches aimed at improving older model architectures.

## Model Details

- **Base Model:** deepseek-ai/DeepSeek-V2-Lite-Chat
- **Fine-tuned by:** Daemontatox
- **Purpose:** Architecture improvement research
- **Training:** Experimental data and methodology targeting legacy architecture enhancement
- **Language:** Multilingual

## Training Approach

This model explores new training techniques designed to enhance the performance of older model architectures. The experimental approach focuses on:

- Novel fine-tuning strategies for legacy architectures
- Custom training data optimization
- Architecture-specific improvements

## Inference

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/DeepZirel-V2",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Daemontatox/DeepZirel-V2", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Daemontatox/DeepZirel-V2",
    tensor_parallel_size=2,
    dtype="auto",
    trust_remote_code=True
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### vLLM OpenAI-Compatible Server

```bash
vllm serve Daemontatox/DeepZirel-V2 \
  --tensor-parallel-size 2 \
  --dtype auto \
  --trust-remote-code \
  --max-model-len 4096
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123"
)

response = client.chat.completions.create(
    model="Daemontatox/DeepZirel-V2",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```

### TensorRT-LLM

```bash
# Convert to TensorRT-LLM checkpoint format
python convert_checkpoint.py \
  --model_dir Daemontatox/DeepZirel-V2 \
  --output_dir ./trt_ckpt \
  --dtype float16 \
  --tp_size 2

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./trt_ckpt \
  --output_dir ./trt_engine \
  --gemm_plugin float16 \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 512
```

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="./trt_engine")

sampling_params = SamplingParams(max_tokens=512)

prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### Modular MAX

```bash
# Serve with MAX Engine
max serve Daemontatox/DeepZirel-V2 \
  --port 8000 \
  --tensor-parallel-size 2
```

```python
from max import engine

# Load the model with MAX
model = engine.InferenceSession(
    "Daemontatox/DeepZirel-V2",
    device="cuda",
    tensor_parallel=2
)

# Run inference
prompt = "Hello, how are you?"
output = model.generate(
    prompt,
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(output.text)
```

```python
# Using MAX with the Python pipeline API
from max.pipelines import pipeline

# Create pipeline
pipe = pipeline(
    "text-generation",
    model="Daemontatox/DeepZirel-V2",
    device="cuda",
    tensor_parallel=2
)

# Generate
result = pipe(
    "Hello, how are you?",
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9
)

print(result[0]["generated_text"])
```

## Limitations

This is an experimental model using novel training approaches on legacy architectures. Results may vary and should be thoroughly tested before production deployment.
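As a minimal starting point for that testing, the sketch below reuses the Transformers setup from above to run the same prompts through both this fine-tune and its base model so the replies can be compared side by side. The prompt list, greedy decoding, and 256-token limit are illustrative placeholders, not an official evaluation protocol.

```python
# Minimal side-by-side smoke test: generate replies from the fine-tune and
# the base model for the same prompts and print both for manual inspection.
# The prompts and decoding settings are placeholders; swap in data from your
# own use case before drawing any conclusions.
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPTS = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a Python function that reverses a string.",
]

def run(model_id, prompts):
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
    )
    replies = []
    for prompt in prompts:
        input_ids = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)
        # Greedy decoding keeps the comparison deterministic
        output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
        replies.append(
            tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
        )
    return replies

for name, model_id in [
    ("fine-tune", "Daemontatox/DeepZirel-V2"),
    ("base", "deepseek-ai/DeepSeek-V2-Lite-Chat"),
]:
    for prompt, reply in zip(PROMPTS, run(model_id, PROMPTS)):
        print(f"[{name}] {prompt}\n{reply}\n")
```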