Instructions to use Goekdeniz-Guelmez/Josiefied-Gemma-4-12B-DLPO-ORPO with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use Goekdeniz-Guelmez/Josiefied-Gemma-4-12B-DLPO-ORPO with MLX:
# Make sure mlx-lm is installed # pip install --upgrade mlx-lm # if on a CUDA device, also pip install mlx[cuda] # Generate text with mlx-lm from mlx_lm import load, generate model, tokenizer = load("Goekdeniz-Guelmez/Josiefied-Gemma-4-12B-DLPO-ORPO") prompt = "Once upon a time in" text = generate(model, tokenizer, prompt=prompt, verbose=True) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- MLX LM
How to use Goekdeniz-Guelmez/Josiefied-Gemma-4-12B-DLPO-ORPO with MLX LM:
Generate or start a chat session
# Install MLX LM uv tool install mlx-lm # Generate some text mlx_lm.generate --model "Goekdeniz-Guelmez/Josiefied-Gemma-4-12B-DLPO-ORPO" --prompt "Once upon a time"
Josiefied Gemma 4 12B DLPO-ORPO
This repo contains the customized JOSIE-style preference model built on the MLX 4-bit conversion of Google's Gemma 4 12B IT.
This model is trianed using my extension of the pre-exising orpo prefernce algorythm.
What This Is
JOSIE Gemma 4 12B DLPO-ORPO is a LoRA preference-tuned assistant model. It is meant to be a no-BS, direct, creative, practical, and less padded than a normal instruction model, while still staying useful and grounded.
The training objective is dlpo-orpo:
ORPOteaches the model to prefer the chosen answer over the rejected answer without needing a separate frozen reference model.DLPO: (Directional Latent Preference Optimization)adds a latent preference signal. Instead of only pushing token probabilities around, it also nudges the hidden-state geometry so chosen answers move in a more consistent preference direction than rejected answers.
In plain English: ORPO trains the visible answer. DLPO also trains the model's internal sense of what a better answer feels like.
Dataset
This model has been trained using a custom dataset with 12K preference pairs. The chosen responses carry the target JOSIE style: clear, candid, imaginative, and useful without performative softness. The rejected responses are the contrast set: weaker, flatter, over-refusing, padded, evasive, or less aligned with the intended assistant personality.
Training Run Stats:
- algorithm:
dlpo-orpo - max examples:
12000 - validation:
256 - batch size:
1 - epochs:
1 - learning rate:
2e-5 - learning rate scheduler:
cosine - max sequence:
1536 - ORPO alpha:
0.1 - latent weight:
0.08 - latent variant:
both - pooling:
answer_mean - layer:
late - LoRA layers:
24 - LoRA rank:
16 - LoRA scale:
32 - LoRA dropout:
0.05
This training run can be recreated using the latest version 2.2.0 of MLX-LM-LoRA.
Research Paper & Benchmarks
A research paper introducing DLPO: Directional Latent Preference Optimization will be released later as well.
Benchmarks for this specific JOSIE Gemma 4 E2B DLPO-ORPO run will also be published after this release, comparing it against the base Gemma 4 12B IT model.
Safety
Unlike the sibling models this one is not uncensored.
