
Libraries

How to use JackFram/llama-68m with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="JackFram/llama-68m")

# Load model directly (llama-68m is a text-only causal language model)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("JackFram/llama-68m")
model = AutoModelForCausalLM.from_pretrained("JackFram/llama-68m")
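
Loading the checkpoint is only the first step of a typical workflow. The following is a minimal sketch of generating a continuation with the standard `generate` API. The helper names `build_generation_kwargs` and `generate_text` are hypothetical and introduced only for illustration, and treating a temperature of 0 as greedy decoding is an assumption of this sketch, not something the model card specifies.

```python
def build_generation_kwargs(max_new_tokens=64, temperature=0.5):
    """Translate two common knobs into keyword arguments for model.generate()."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    kwargs = {"max_new_tokens": max_new_tokens}
    if temperature > 0:
        # Sampling is only enabled for a positive temperature.
        kwargs.update(do_sample=True, temperature=temperature)
    else:
        # Assumption of this sketch: temperature 0 means greedy decoding.
        kwargs["do_sample"] = False
    return kwargs


def generate_text(prompt, model_id="JackFram/llama-68m", **knobs):
    """Load the checkpoint and return the prompt plus its generated continuation."""
    # Imported lazily so the helper above stays usable without transformers/torch.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **build_generation_kwargs(**knobs))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_text("Once upon a time,", max_new_tokens=32)` downloads the weights on first use. With only 68M parameters, the continuation should be expected to be rough.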


vLLM

How to use JackFram/llama-68m with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "JackFram/llama-68m"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JackFram/llama-68m",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
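
The curl call can also be issued from Python. The sketch below uses only the standard library and assumes the server is reachable at the address used in the curl example. `build_completion_request` and `complete` are hypothetical helper names, and the response is assumed to follow the OpenAI completions layout, with the generated text under `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="JackFram/llama-68m",
                             max_tokens=512, temperature=0.5):
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first generated completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed OpenAI-compatible layout: {"choices": [{"text": ...}, ...]}
    return body["choices"][0]["text"]
```

Because the base URL is a parameter, the same helper should work against any OpenAI-compatible server on another port by passing a different `base_url`.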


SGLang

How to use JackFram/llama-68m with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "JackFram/llama-68m" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JackFram/llama-68m",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "JackFram/llama-68m" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "JackFram/llama-68m",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use JackFram/llama-68m with Docker Model Runner:
```
docker model run hf.co/JackFram/llama-68m
```

Model description

This is a LLaMA-like model with only 68M parameters trained on Wikipedia and part of the C4-en and C4-realnewslike datasets.

No evaluation has been conducted yet, so use it with care.

The model was developed mainly as a base Small Speculative Model for the SpecInfer paper.
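
To illustrate what a small speculative model is for, here is a toy sketch of greedy draft-then-verify decoding: a cheap draft model proposes a few tokens, and the larger target model keeps the longest prefix it agrees with. This is a simplified linear variant written for illustration, not SpecInfer's token tree verification. The function name and the callable-based model interface are assumptions, and a real system would verify all drafted positions in one batched forward pass rather than one call per token.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, lookahead=4):
    """Greedy draft-then-verify decoding over token lists.

    target_next and draft_next are callables that map a token sequence to the
    next token that model would pick greedily. Returns the final token list and
    the number of drafted tokens the target accepted.
    """
    tokens = list(prompt)
    goal = len(tokens) + max_new_tokens
    accepted = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to `lookahead` tokens autoregressively.
        draft = []
        for _ in range(min(lookahead, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the proposals in order.
        for proposed in draft:
            expected = target_next(tokens)
            if proposed != expected:
                # First disagreement: keep the target's token, drop the rest.
                tokens.append(expected)
                break
            tokens.append(proposed)
            accepted += 1
    return tokens, accepted
```

Under greedy decoding, the output matches what the target model would produce on its own. The draft model only changes how many target steps could be verified together, which is why a tiny model such as this one can be useful even with weak standalone accuracy.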

Evaluations (contributed by Akshit, huge thanks!)

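| Category | Benchmark | Metric | Score / Value | Status |
| --- | --- | --- | --- | --- |
| Linguistics & Grammar | BLiMP | Accuracy | 70.57% | Success |
| Commonsense & Reasoning | PIQA | Normalized Accuracy | 59.25% | Success |
| Commonsense & Reasoning | BoolQ | Accuracy | 57.71% | Success |
| Commonsense & Reasoning | COPA | Accuracy | 53.00% | Success |
| Commonsense & Reasoning | WinoGrande | Accuracy | 50.59% | Success |
| Commonsense & Reasoning | HellaSwag | Normalized Accuracy | 29.04% | Success |
| Commonsense & Reasoning | RACE | Accuracy | 25.36% | Success |
| Commonsense & Reasoning | CommonsenseQA | Accuracy | 19.82% | Success |
| Academic & Knowledge | SciQ | Normalized Accuracy | 57.80% | Success |
| Academic & Knowledge | ARC-Easy | Normalized Accuracy | 35.98% | Success |
| Academic & Knowledge | OpenBookQA | Normalized Accuracy | 25.60% | Success |
| Academic & Knowledge | MMLU | Accuracy | 22.96% | Success |
| Academic & Knowledge | ARC-Challenge | Normalized Accuracy | 22.87% | Success |
| Language Modeling | TriviaQA | Accuracy | TriviaQA Standard | Success |
| Language Modeling | LAMBADA | Accuracy | 13.24% | Success |
| Language Modeling | C4-Perplexity | Word Perplexity | 205.79 | Success |
| Language Modeling | WikiText-2 | Word Perplexity | 306.79 | Success |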

Notes on Failed Tasks: The Arithmetic and SocialIQA benchmarks failed during execution due to runtime pipeline incompatibilities, yielding no score. Total evaluation runtime was 44.74 minutes.
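
For readers unfamiliar with the word perplexity metric in the table, the following is a minimal sketch assuming the conventional definition: the exponential of the total negative natural-log likelihood divided by the number of words. The evaluation tooling behind the table is not stated here, so its exact normalization may differ. `word_perplexity` is a hypothetical helper name.

```python
import math


def word_perplexity(token_log_probs, num_words):
    """Return exp(-sum of natural-log token probabilities / number of words).

    token_log_probs: log-probabilities the model assigned to each token.
    num_words: number of words in the evaluated text, which is usually smaller
    than the number of tokens because words can split into several tokens.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return math.exp(-sum(token_log_probs) / num_words)
```

Lower is better. Because the total is normalized by words rather than tokens, values computed this way are comparable across tokenizers, but not with per-token perplexities.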

Citation

To cite the model, please use:

@misc{miao2023specinfer,
      title={SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification}, 
      author={Xupeng Miao and Gabriele Oliaro and Zhihao Zhang and Xinhao Cheng and Zeyu Wang and Rae Ying Yee Wong and Zhuoming Chen and Daiyaan Arfeen and Reyna Abhyankar and Zhihao Jia},
      year={2023},
      eprint={2305.09781},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

