
Benchmark Models

Step 3 — Measure speed, latency, and performance.

Before building your agent, it’s important to understand how fast your models run. Benchmarking helps you choose the right model for your hardware — and for the tasks your agent will perform.

Benchmark Your Models

Every model behaves differently on your hardware. Some are lightning‑fast but lightweight, while others deliver deeper reasoning at the cost of memory and speed. Benchmarking gives you a clear, data‑driven understanding of how each model performs on your system — a crucial step before building agents that need to think, plan, and act efficiently.

Ollama doesn’t ship a dedicated benchmark command, but its --verbose flag prints real‑world performance stats after every run. Together with ollama ps and your system monitor, you can measure several key metrics:

  • Tokens per second — how fast the model generates text
  • Latency — how long it takes to produce the first token
  • Memory usage — how much RAM the model consumes
  • CPU/GPU utilization — how well your hardware is being used

These numbers help you choose the right model for your agent — whether you’re optimizing for speed, reasoning, or resource efficiency.
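To make these metrics concrete, here is the plain arithmetic behind the first two (a minimal sketch, not tied to any Ollama API; the example numbers are illustrative):

```python
def tokens_per_second(token_count: int, eval_seconds: float) -> float:
    """Generation speed: tokens produced divided by generation time."""
    return token_count / eval_seconds

def first_token_latency(load_seconds: float, prompt_eval_seconds: float) -> float:
    """Rough time-to-first-token: model load time plus prompt processing time."""
    return load_seconds + prompt_eval_seconds

# Example: 256 tokens generated over 6.08 s of generation time
print(round(tokens_per_second(256, 6.08), 1))        # → 42.1
print(round(first_token_latency(0.12, 0.26), 2))     # → 0.38
```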


Run a Benchmark

Benchmark any installed model by adding the --verbose flag to a run, which prints timing statistics after the response. For example, to test Llama 3:

ollama run llama3 --verbose "Why is the sky blue?"

Try benchmarking a few different models to compare performance:

ollama run qwen --verbose "Why is the sky blue?"
ollama run mistral --verbose "Why is the sky blue?"
ollama run phi --verbose "Why is the sky blue?"
ollama run deepseek --verbose "Why is the sky blue?"

Using the same prompt for every model keeps the comparison fair: after each response, Ollama reports timing metrics for that run on your hardware.
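The loop above can be automated with a small wrapper script (a minimal sketch; it assumes the ollama CLI is on your PATH and each model has already been pulled):

```python
import subprocess

MODELS = ["llama3", "qwen", "mistral", "phi"]
PROMPT = "Why is the sky blue?"

def build_command(model: str, prompt: str = PROMPT) -> list[str]:
    # --verbose makes ollama print timing statistics after the response
    return ["ollama", "run", model, prompt, "--verbose"]

def bench_all(models: list[str] = MODELS) -> None:
    for model in models:
        print(f"=== {model} ===")
        subprocess.run(build_command(model), check=True)

# bench_all()  # uncomment to run; requires ollama installed locally
```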


Example Output

model:             llama3
tokens per second: 42.1
latency:           0.38 s
memory:            6.2 GB
cpu:               78%
gpu:               0% (CPU fallback)

Tokens per second and latency come from the --verbose stats; memory and processor usage come from ollama ps or your system monitor.

Your results will vary depending on your CPU, GPU, RAM, and background processes. If you want to compare your numbers with the broader community, the Open LLM Leaderboard provides helpful context for model capabilities.
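If you want to capture these numbers programmatically, the stat lines can be parsed with a small helper (a sketch; the exact labels and units may vary between Ollama versions, so treat the sample text below as an assumption):

```python
import re

SAMPLE = """\
prompt eval duration: 380ms
eval count:           256 token(s)
eval duration:        6.08s
eval rate:            42.10 tokens/s
"""

def parse_stats(text: str) -> dict[str, float]:
    """Pull the leading numeric value out of each 'label: value' stat line."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"([a-z ]+):\s+([\d.]+)", line)
        if match:
            stats[match.group(1).strip()] = float(match.group(2))
    return stats

print(parse_stats(SAMPLE)["eval rate"])  # → 42.1
```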


Record Your Results

Keeping a simple benchmark table helps you understand which models feel best for your workflow. Fill in your results below:

Model      Tokens/sec   Latency   Memory
llama3     ________     ________  ________
qwen       ________     ________  ________
mistral    ________     ________  ________
phi        ________     ________  ________

As a rule of thumb: faster models are better for interactive tasks, while larger models excel at reasoning, planning, and multi‑step problem solving.
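If you prefer a machine-readable record over a hand-filled table, results can be rendered as CSV with the standard library (a minimal sketch; the column names are arbitrary choices, and the row shown reuses the example numbers from above):

```python
import csv
import io

FIELDS = ["model", "tokens_per_sec", "latency_s", "memory_gb"]

def results_csv(rows: list[dict]) -> str:
    """Render benchmark rows as CSV text (write it to a file to persist)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(results_csv([
    {"model": "llama3", "tokens_per_sec": 42.1, "latency_s": 0.38, "memory_gb": 6.2},
]))
```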


Choosing the Right Model

Your benchmark results will guide you, but here’s a quick reference based on common goals:

If you want speed:

  • phi — extremely fast and lightweight
  • qwen — efficient and great for tool‑use

If you want balanced performance:

  • mistral — strong reasoning with good speed
  • llama3 (small) — versatile and reliable

If you want maximum reasoning:

  • llama3 (full) — excellent general intelligence
  • deepseek — high‑performance reasoning

There’s no single “best” model — only the best model for your hardware and your use case.
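The guidance above can be condensed into a tiny lookup (an illustrative mapping of the lists in this section, not an official recommendation; always confirm with your own benchmarks):

```python
RECOMMENDATIONS = {
    "speed": ["phi", "qwen"],
    "balanced": ["mistral", "llama3 (small)"],
    "reasoning": ["llama3 (full)", "deepseek"],
}

def suggest(goal: str) -> list[str]:
    """Return candidate models for a goal; verify against your own numbers."""
    return RECOMMENDATIONS.get(goal, [])

print(suggest("speed"))  # → ['phi', 'qwen']
```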


Troubleshooting

Benchmark feels slow

  • Close other apps to free CPU/GPU resources
  • Try a smaller model like phi or qwen

High memory usage

  • Use smaller models
  • Ensure swap is enabled (Linux)

GPU not used

  • Ollama may fall back to CPU if your GPU isn’t supported
  • Run ollama ps to see whether a loaded model is on the CPU or GPU
  • GPU support varies by OS and hardware
  • Check the Ollama GitHub issues for compatibility notes

Next Step
Install Node.js →