
Benchmark Models

Step 3 — Measure speed, latency, and performance.

Before building your agent, it’s important to understand how fast your models run. Benchmarking helps you choose the right model for your hardware — and for the tasks your agent will perform.

Benchmark Your Models

Every model behaves differently on your hardware. Some are lightning‑fast but lightweight, while others deliver deeper reasoning at the cost of memory and speed. Benchmarking gives you a clear, data‑driven understanding of how each model performs on your system — a crucial step before building agents that need to think, plan, and act efficiently.

Ollama doesn’t ship a dedicated benchmark command, but its --verbose flag prints real‑world performance stats after every run. Together with ollama ps and your system monitor, you can measure several key metrics:

  • Tokens per second — how fast the model generates text
  • Latency — how long it takes to produce the first token
  • Memory usage — how much RAM the model consumes
  • CPU/GPU utilization — how well your hardware is being used

These numbers help you choose the right model for your agent — whether you’re optimizing for speed, reasoning, or resource efficiency.
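To make these metrics concrete, here is the plain arithmetic behind the first two (a minimal sketch, not tied to any Ollama API; the example numbers are illustrative):

```python
def tokens_per_second(token_count: int, eval_seconds: float) -> float:
    """Generation speed: tokens produced divided by generation time."""
    return token_count / eval_seconds

def first_token_latency(load_seconds: float, prompt_eval_seconds: float) -> float:
    """Rough time-to-first-token: model load time plus prompt processing time."""
    return load_seconds + prompt_eval_seconds

# Example: 256 tokens generated over 6.08 s of generation time
print(round(tokens_per_second(256, 6.08), 1))        # → 42.1
print(round(first_token_latency(0.12, 0.26), 2))     # → 0.38
```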


Run a Benchmark

Benchmark any installed model by adding the --verbose flag to a run, which prints timing statistics after the response. For example, to test Llama 3:

ollama run llama3 --verbose "Why is the sky blue?"

Try benchmarking a few different models to compare performance:

ollama run qwen --verbose "Why is the sky blue?"
ollama run mistral --verbose "Why is the sky blue?"
ollama run phi --verbose "Why is the sky blue?"
ollama run deepseek --verbose "Why is the sky blue?"

Using the same prompt for every model keeps the comparison fair: after each response, Ollama reports timing metrics for that run on your hardware.
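The loop above can be automated with a small wrapper script (a minimal sketch; it assumes the ollama CLI is on your PATH and each model has already been pulled):

```python
import subprocess

MODELS = ["llama3", "qwen", "mistral", "phi"]
PROMPT = "Why is the sky blue?"

def build_command(model: str, prompt: str = PROMPT) -> list[str]:
    # --verbose makes ollama print timing statistics after the response
    return ["ollama", "run", model, prompt, "--verbose"]

def bench_all(models: list[str] = MODELS) -> None:
    for model in models:
        print(f"=== {model} ===")
        subprocess.run(build_command(model), check=True)

# bench_all()  # uncomment to run; requires ollama installed locally
```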


Example Output

model:             llama3
tokens per second: 42.1
latency:           0.38 s
memory:            6.2 GB
cpu:               78%
gpu:               0% (CPU fallback)

Tokens per second and latency come from the --verbose stats; memory and processor usage come from ollama ps or your system monitor.

Your results will vary depending on your CPU, GPU, RAM, and background processes. If you want to compare your numbers with the broader community, the Open LLM Leaderboard provides helpful context for model capabilities.
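If you want to capture these numbers programmatically, the stat lines can be parsed with a small helper (a sketch; the exact labels and units may vary between Ollama versions, so treat the sample text below as an assumption):

```python
import re

SAMPLE = """\
prompt eval duration: 380ms
eval count:           256 token(s)
eval duration:        6.08s
eval rate:            42.10 tokens/s
"""

def parse_stats(text: str) -> dict[str, float]:
    """Pull the leading numeric value out of each 'label: value' stat line."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"([a-z ]+):\s+([\d.]+)", line)
        if match:
            stats[match.group(1).strip()] = float(match.group(2))
    return stats

print(parse_stats(SAMPLE)["eval rate"])  # → 42.1
```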


Record Your Results

Keeping a simple benchmark table helps you understand which models feel best for your workflow. Fill in your results below:

Model      Tokens/sec   Latency   Memory
llama3     ________     ________  ________
qwen       ________     ________  ________
mistral    ________     ________  ________
phi        ________     ________  ________

As a rule of thumb: faster models are better for interactive tasks, while larger models excel at reasoning, planning, and multi‑step problem solving.
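If you prefer a machine-readable record over a hand-filled table, results can be rendered as CSV with the standard library (a minimal sketch; the column names are arbitrary choices, and the row shown reuses the example numbers from above):

```python
import csv
import io

FIELDS = ["model", "tokens_per_sec", "latency_s", "memory_gb"]

def results_csv(rows: list[dict]) -> str:
    """Render benchmark rows as CSV text (write it to a file to persist)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(results_csv([
    {"model": "llama3", "tokens_per_sec": 42.1, "latency_s": 0.38, "memory_gb": 6.2},
]))
```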


Choosing the Right Model

Your benchmark results will guide you, but here’s a quick reference based on common goals:

If you want speed:

  • phi — extremely fast and lightweight
  • qwen — efficient and great for tool‑use

If you want balanced performance:

  • mistral — strong reasoning with good speed
  • llama3 (small) — versatile and reliable

If you want maximum reasoning:

  • llama3 (full) — excellent general intelligence
  • deepseek — high‑performance reasoning

There’s no single “best” model — only the best model for your hardware and your use case.
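The guidance above can be condensed into a tiny lookup (an illustrative mapping of the lists in this section, not an official recommendation; always confirm with your own benchmarks):

```python
RECOMMENDATIONS = {
    "speed": ["phi", "qwen"],
    "balanced": ["mistral", "llama3 (small)"],
    "reasoning": ["llama3 (full)", "deepseek"],
}

def suggest(goal: str) -> list[str]:
    """Return candidate models for a goal; verify against your own numbers."""
    return RECOMMENDATIONS.get(goal, [])

print(suggest("speed"))  # → ['phi', 'qwen']
```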


Troubleshooting

Benchmark feels slow

  • Close other apps to free CPU/GPU resources
  • Try a smaller model like phi or qwen

High memory usage

  • Use smaller models
  • Ensure swap is enabled (Linux)

GPU not used

  • Ollama may fall back to CPU if your GPU isn’t supported
  • Run ollama ps to see whether a loaded model is on the CPU or GPU
  • GPU support varies by OS and hardware
  • Check the Ollama GitHub issues for compatibility notes

Next Step
Install Node.js →