Update README.md
This model was converted to GGUF format from [`SicariusSicariiStuff/Oni_Mitsubishi_12B`](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) for more details on the model.

---
It happened. The long-awaited Gemma-3 is here, and not only are the model sizes really good (1B, 4B, 12B, and 27B), but the 128k context (except for the 1B, which gets 32k) was exactly what the open-source community wanted and asked for. My only issue with Gemma models in general is the VRAM requirement for tuning them, but that's a "me problem." End users will probably be very happy with Gemma-3 in terms of the VRAM requirement for running it.

On the 12th of March, the Gemma-3 family of models was released, so I decided to go full superstitious and took this omen as a divine calling to finetune the 12B model first. This is how Oni_Mitsubishi_12B was born.

Before starting the actual training run, I used the following command, which I believe helped the model converge "better":

```bash
for i in {1..666}; do nvidia-smi; done
```
Gemma is known for its "Gemma knowledge": fandom and/or other obscure knowledge that even larger LLMs often do not possess. It gets even better, as this time we also got a vision model embedded into all the Gemma-3 models except for the 1B. I wonder what the possibilities are for the vision part if the text layers are uncensored?

I have used brand-new long-context markdown data, some deslopped instruct data (very lightly deslopped; it's very time-consuming to get right), and more than 50% highly curated and filtered organic human data, meticulously cleaned and parsed into obedience. A new stack of organic and data-engineered text was used for the first time for Oni_Mitsubishi_12B. I truly hope creating it was worth the effort.

At NO POINT was ChatGPT used for data generation. However, the new Claude 3.7 Sonnet was used VERY sparingly for the specific task of creating a small number of humorous datasets (very human-like, done with a decent amount of prompt engineering). I've meticulously checked them for slop, and it is minimal. The goal of said data was to imitate human text using the 4chan vernacular.

Speaking of which, I've published a highly curated, SFT-ready 4chan dataset here: UBW_Tapestries. Naturally, I have included it in the dataset used for this model as well.

## Technical details
I've used the "ancient" Alpaca chat template because the Gemma-3 chat template was behaving funkily, and I didn't want to waste precious time; I wanted to give the community a more uncensored finetune to play with as fast as possible (I saw this requested a lot on both Reddit and Discord, understandably). In my opinion, it's silly to let the perfect be the enemy of the good. Anyway, I had to use both bleeding-edge Transformers and Axolotl, and modify stuff that wasn't even supposed to work (like the model's config.json).

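For reference, the conventional Alpaca prompt layout looks like this; the `{instruction}` slot is a placeholder, and whether this particular finetune expects the introductory system line is not stated above, so treat it as the standard format rather than a verified template:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```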
Since it's a hybrid model, training its text-only part is a bit problematic, so I hacked a config.json that gaslights the model into thinking it's only a text model, and got some warnings like:
```
'vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight', 'vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias'}
- This IS expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
```
Then I saw it trains.

The absolute state when you can train a model before you can actually inference it.

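For anyone curious what that kind of config.json surgery can look like, here is a rough sketch of the general idea; the field names, the use of `jq`, and the output filename are illustrative assumptions, not the actual edit used for this model:

```bash
# Hypothetical sketch: promote the nested text_config to the top level and point
# the architecture at the text-only class, so the checkpoint presents itself as a
# plain Gemma3ForCausalLM model (the vision_config is simply dropped).
jq '.text_config + {"architectures": ["Gemma3ForCausalLM"], "model_type": "gemma3_text"}' \
  config.json > config.text_only.json
```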
---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux).

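The commands below sketch the usual workflow; the quantized filename is a placeholder, so check this repo's file list for the actual `.gguf` name:

```bash
# Install llama.cpp (the Homebrew formula works on macOS and Linux)
brew install llama.cpp

# Run the model from a locally downloaded GGUF file; replace the filename with
# whichever quantization you grab from this repo
llama-cli -m ./oni_mitsubishi_12b-q4_k_m.gguf -p "The meaning to life and the universe is" -n 128
```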