Ollama GPU Acceleration: Complete Guide for CUDA, ROCm & Metal
Watching the text crawl across my terminal line by line, I couldn’t help but check the time—47 seconds. That’s how long it took to run Llama 3 8B on my old laptop. CPU maxed out, fans screaming, roughly 5 tokens per second. Honestly, the experience was pretty discouraging. I wanted to use it for coding assistance, but waiting for a response took longer than Googling the answer myself.
Later, I ran the same model with the same parameters on my desktop, this time with an NVIDIA GPU and CUDA drivers installed—3 seconds.
No exaggeration, just from nearly a minute down to a few seconds. That “I asked, now I want the answer” feeling finally came back.
This article will help you set up Ollama GPU acceleration. Whether you’re on NVIDIA, AMD, or Apple Silicon, I’ll walk you through configuration, verification, and troubleshooting. Save those waiting hours for something more interesting.
How Good Is GPU Acceleration: The Real Gap from 47 Seconds to 3 Seconds
Let’s cut to the chase: running local LLMs on GPU delivers 10-20x speed improvement. This isn’t marketing hype—it’s my real-world testing data and the consensus across the community.
You might ask: can’t CPU run these models? Why bother with GPU setup?
It can run, but the experience is completely different. A CPU running a 7B model manages 3-8 tokens per second, meaning a response a few hundred tokens long takes anywhere from 20 seconds to over a minute. Switch to GPU, and the same model hits 40-80 tokens per second, done in a few seconds. This gap isn’t just “faster”—it’s the difference between “barely usable” and “actually usable.”
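Back-of-envelope math with those numbers (illustrative, not a benchmark):
# A ~300-token response:
# CPU at 5 tok/s:  300 / 5  = 60 seconds
# GPU at 60 tok/s: 300 / 60 = 5 seconds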
Does Your GPU Support It?
Ollama supports three GPU platforms, each with its own requirements:
NVIDIA GPUs: The most mainstream and hassle-free option. Official requirement is Compute Capability 5.0 or higher, which basically means GTX 900 series and later. GTX 1060, RTX 3060, RTX 4090—all good. I have an RTX 3060 12GB that handles models under 14B parameters without breaking a sweat.
AMD GPUs: Slightly more involved to set up, but they run just as well. Linux requires ROCm v7; Windows currently only has a ROCm v6.1 preview. The list of supported GPU models is also limited—RX 6000 and RX 7000 series are the safest bets, while older cards need some extra configuration.
Apple Silicon: M1/M2/M3/M4 are all supported, and it’s automatic. Mac users basically don’t need any configuration—install Ollama and Metal acceleration kicks in. As of 2026, there’s also the MLX backend option, pushing performance even higher.
Is Your VRAM Enough?
This is something many people overlook. GPU model inference has one hard requirement: VRAM.
Let’s do some quick math: a 7B model with 4-bit quantization needs roughly 5-6GB of VRAM, 14B needs 10-12GB, and 70B requires 40GB+. Your GPU’s VRAM directly determines what model size you can run. My RTX 3060 12GB runs Llama 3 8B comfortably, but Mixtral 8x7B doesn’t fit—I have to offload a chunk of its layers to the CPU, and the speed drop is noticeable.
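The rule of thumb behind those figures, rough, and assuming 4-bit quantization with a default context window:
# weights ≈ parameter count × 0.5 bytes (4 bits per weight)
#  7B × 0.5 ≈ 3.5 GB  → ~5-6 GB once KV cache and runtime overhead are added
# 14B × 0.5 ≈ 7 GB    → ~10-12 GB
# 70B × 0.5 ≈ 35 GB   → 40 GB+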
So before configuring GPU, know your card’s model and VRAM capacity. It sets your expectations.
NVIDIA CUDA: The Most Hassle-Free Solution (With a Few Caveats)
If you’re using NVIDIA, congrats—your setup might be the simplest of the three platforms.
First, Check If Drivers Are Installed
Open your terminal and run:
nvidia-smi
If you see a table with GPU model, VRAM size, and driver version, your drivers are good to go. Ollama will automatically detect and use CUDA—no need to install CUDA Toolkit separately. Yep, you read that right. Ollama bundles the necessary CUDA libraries, saving you a step.
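If you want to double-check what Ollama detected at startup, the server log is the quickest place to look. A minimal check, assuming the standard Linux install that registers a systemd service:
journalctl -u ollama --no-pager | grep -iE 'gpu|cuda' | tail -n 20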
If the command isn’t found, you’ll need to install drivers first. Ubuntu users can run:
sudo apt install nvidia-driver-535 # or newer version
Reboot after installation, then verify with nvidia-smi again.
What About Multiple GPUs?
If your machine has multiple GPUs (say, two RTX 3090s), Ollama by default spreads the model across all cards. But sometimes you want to specify which card to use—maybe one card runs the model while another handles something else.
Set an environment variable:
# Use only GPU 0
export CUDA_VISIBLE_DEVICES=0
# Use GPU 0 and GPU 2
export CUDA_VISIBLE_DEVICES=0,2
Add this line to ~/.bashrc or your systemd service config for persistence.
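If you run Ollama as a systemd service, a drop-in override is the cleaner way to persist it—this mirrors the approach in Ollama’s official FAQ (adjust the device list to your setup):
# Open a drop-in override for the service
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0"
# Then apply the change:
sudo systemctl daemon-reload
sudo systemctl restart ollama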
Running Ollama in Docker?
Some folks like putting all their services in containers—Ollama works there too. But note: Docker containers can’t access the host GPU by default; it takes extra configuration.
Use NVIDIA’s official nvidia-container-toolkit (on Ubuntu you may first need NVIDIA’s apt repository configured—see their install docs):
# Install toolkit
sudo apt install nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then when starting the Ollama container, add the --gpus all flag:
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
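Once it’s running, a quick sanity check that the container actually sees the GPU (this relies on the --name ollama flag above):
docker exec -it ollama nvidia-smi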
How to Confirm GPU Is Actually Being Used?
A simple verification: while running a model, open another terminal and run:
ollama ps
The output shows the currently running model and, in the PROCESSOR column, how the load is split between CPU and GPU. If you see 100% GPU, acceleration is working.
You can also use nvidia-smi -l 1 for real-time VRAM monitoring—VRAM should spike noticeably when the model runs.
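Want a number instead of a feeling? ollama run has a --verbose flag that prints timing stats after each response, including the eval rate in tokens per second—handy for before/after comparisons:
ollama run --verbose llama3 "Write one sentence about GPUs."
# Look for the "eval rate" line in the stats printed afterward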
AMD ROCm: Slightly More Setup, Runs Just As Fast
I feel the AMD struggle—most tutorials online are NVIDIA-focused, and AMD documentation is scattered. But good news: Ollama’s ROCm support has stabilized. Just a few more steps.
The Right Path for Linux Users: ROCm v7
If you’re on Ubuntu 22.04 or newer, ROCm installation isn’t too painful:
# With AMD's official repository configured (see AMD's ROCm install docs):
sudo apt update
sudo apt install amdgpu-install
sudo amdgpu-install --usecase=rocm
# Add yourself to render group
sudo usermod -aG render,video $USER
# Reboot to apply
sudo reboot
After reboot, verify with the rocminfo command. If you see your GPU info, Ollama is ready to install. The AMD build of Ollama auto-detects the ROCm environment—no extra config needed.
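If you prefer containers, there’s an official ROCm image too—it needs the kernel devices passed through instead of --gpus all (this is the pattern from Ollama’s Docker docs):
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm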
Windows Users: Still Preview Territory
Honestly, Windows ROCm support isn’t mature yet. Official ROCm v6.1 preview exists but supports limited GPU models and isn’t as stable as Linux. If you mainly work on Windows with an AMD card, my suggestion:
Prioritize WSL2 + Ubuntu. Running Ollama in the Linux subsystem delivers much better performance and stability than the native Windows preview.
Older GPUs? HSA_OVERRIDE to the Rescue
AMD GPU architecture codenames get complicated. ROCm officially supports a limited set of architectures (gfx900, gfx1030, gfx1100, and so on). If your card isn’t on the list, like the RX 6700 XT (gfx1031) or the older RX 580 (gfx803), ROCm won’t recognize it by default.
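Not sure what architecture your card reports? Check before overriding:
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u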
Use an environment variable to force override:
export HSA_OVERRIDE_GFX_VERSION=10.3.0 # Report the card to ROCm as gfx1030
This spoofs the architecture your card reports, so it works best when the real instruction set is close to the target: RDNA2 variants like the RX 6700 XT (gfx1031) masquerading as gfx1030 is the classic case. For much older cards like the RX 580, community results are mixed. Give it a try—if it fails, you’re back to CPU.
Multi-GPU Configuration
Similar to NVIDIA, AMD also supports specifying which GPUs to use:
export ROCR_VISIBLE_DEVICES=0,1 # Use GPU 0 and GPU 1
Check GPU numbers with rocm-smi command.
Common AMD GPU Architecture Codenames
Here are common models and their architectures for troubleshooting:
| GPU Model | Architecture | ROCm Support |
|---|---|---|
| RX 7900 XTX | gfx1100 | Native |
| RX 6800 XT | gfx1030 | Native |
| RX 6700 XT | gfx1031 | Needs HSA_OVERRIDE |
| RX 5700 XT | gfx1010 | Needs HSA_OVERRIDE |
| RX 580 | gfx803 | HSA_OVERRIDE (mixed results) |
| Vega 56/64 | gfx900 | Native |
Overall, AMD setup has a few more pitfalls than NVIDIA, but once it works, performance is comparable.
Apple Metal: Hidden Bonus for Mac Users
If you’re using Apple Silicon Mac (M1/M2/M3/M4), here’s the good news: you don’t need to configure anything.
Really, nothing at all. Install Ollama, run a model, GPU acceleration activates automatically. Apple’s Metal framework is built into Ollama—the system automatically loads models onto GPU.
M-Series Chip Performance
Based on community testing, Mac local LLM performance is actually quite decent:
- M1/M2 8GB: Running 7B models, around 15-20 tok/s
- M2 Pro 16GB: Running 14B models, hits 25-30 tok/s
- M3 Max 36GB: Running 30B+ models, maintains 30+ tok/s
Compared to older CPU-only machines, this speed is practical. Not quite RTX 4090 “instant response” level, but perfectly fine for coding assistance, translation, and polishing.
2026 Bonus: MLX Backend
If you’re on M-series chips with 32GB+ unified memory (like M3 Max, M4 Pro), you can also enable the MLX backend—Apple’s machine learning framework optimized for their silicon.
According to developer community data, MLX backend boosts inference speed by 93%. What does that mean? Running Llama 3 8B at 57.8 tokens per second becomes 111.4 tok/s with MLX. That’s the difference from “pretty smooth” to “actually fast.”
Enabling it is simple—just add a parameter:
ollama run llama3 --backend mlx
Note that MLX currently has high memory requirements: below 32GB it can be unstable. Also, only some models support MLX, mainly those published in MLX format.
How to Confirm GPU Is Working?
Open Activity Monitor, switch to GPU History tab. When running a model, you should see GPU usage spike. If only CPU moves and GPU stays flat, Metal might not be enabled—rare, but reinstalling Ollama usually fixes it.
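If you prefer the terminal over Activity Monitor, macOS also ships powermetrics, which can sample GPU load—needs sudo, and flags may vary slightly across macOS versions:
sudo powermetrics --samplers gpu_power -i 1000 -n 5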
GPU Detection Failed: Common Troubleshooting
Hardware configuration issues are almost inevitable. Here are problems I’ve encountered and their solutions—hopefully saves you some detours.
Problem 1: no compatible GPUs were discovered
Most common error—Ollama can’t find a usable GPU.
Possible causes:
- Drivers not installed or too old
- GPU model unsupported (like GTX 700 series)
- Docker container lacks GPU access permissions
Troubleshooting steps:
# NVIDIA: verify driver
nvidia-smi
# AMD: verify ROCm
rocminfo
# If commands fail, install drivers first
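If the host commands pass but the error only shows up in Docker, test whether containers can see the GPU at all—this is the standard smoke test from NVIDIA’s container toolkit docs:
docker run --rm --gpus all ubuntu nvidia-smi
If that fails, revisit the nvidia-container-toolkit setup from the CUDA section.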
Problem 2: Not compiled with GPU offload support
This error means your downloaded Ollama version lacks GPU support.
Solution: Re-download the correct version from the official site. AMD users note: Ollama has a dedicated ROCm version with a different download link than CUDA. Don’t download the wrong one.
Problem 3: NVIDIA Driver Version Too Old
Ollama requires NVIDIA driver version 450 or higher. If your system is still running an older 4xx-series driver, CUDA won’t work.
# Check current driver version
nvidia-smi | grep "Driver Version"
# If too old, update driver
sudo apt install nvidia-driver-535
Problem 4: AMD amdgpu Driver Missing
Linux AMD GPUs need amdgpu driver for ROCm. Some systems default to older radeon driver, which doesn’t support ROCm.
# Check currently loaded driver
lsmod | grep amdgpu
# If no output, install manually
sudo apt install amdgpu-dkms
Problem 5: SELinux Blocking Container GPU Access
Ran into this on CentOS/RHEL systems. SELinux default policy blocks container access to GPU devices.
Quick fix:
sudo setenforce 0 # Temporarily disable SELinux
Permanent fix requires adjusting SELinux policy—complex, check Red Hat official docs. Or just switch to Ubuntu, simpler.
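That said, one commonly cited middle ground is the SELinux boolean that governs container device access—verify it against the Red Hat docs for your release before relying on it:
# Allow containers to use host devices (persists across reboots with -P)
sudo setsebool -P container_use_devices on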
Verification Commands Summary
Here’s a quick command checklist for troubleshooting:
# 1. Check if GPU recognized by system
nvidia-smi # NVIDIA
rocminfo # AMD
system_profiler SPDisplaysDataType # macOS
# 2. Check Ollama process status
ollama ps
# 3. Real-time GPU monitoring (while running model)
watch -n 1 nvidia-smi # NVIDIA
rocm-smi -a # AMD
# 4. Check environment variables
echo $CUDA_VISIBLE_DEVICES
echo $ROCR_VISIBLE_DEVICES
Most issues can be pinpointed with these commands. If still stuck, search Ollama GitHub Issues—plenty of people have been down these rabbit holes.
Final Thoughts
After all this, here’s a quick reference table:
| Platform | Prerequisites | Setup Difficulty | Recommendation |
|---|---|---|---|
| NVIDIA | Driver 450+ | Easy (basically zero config) | First choice |
| AMD (Linux) | ROCm v7 | Medium (a few commands) | Second choice |
| AMD (Windows) | ROCm v6.1 Preview | Harder (suggest WSL2) | Average |
| Apple Silicon | None required | Simplest | First choice for Mac |
GPU acceleration—configure once, benefit long-term. The gap between CPU “barely runs” and GPU “actually works” is massive. What platform is your GPU on? Any issues during setup? Share in the comments, I’ll reply when I can.
Ollama GPU Acceleration Setup
Complete GPU acceleration configuration for three platforms
⏱️ Estimated time: 30 min

Step 1: Verify GPU Model and Drivers
Choose a verification method based on your GPU type:
• NVIDIA: Run nvidia-smi to view GPU info
• AMD: Run rocminfo to confirm ROCm detection
• macOS: No verification needed; Metal auto-enables

Step 2: NVIDIA CUDA Configuration
The simplest option—just install the driver:
1. Install driver: sudo apt install nvidia-driver-535
2. Reboot the system
3. Verify: nvidia-smi should show GPU info
4. Ollama auto-detects CUDA; no extra config needed

Step 3: AMD ROCm Configuration (Linux)
Requires a ROCm v7 installation:
1. Install: sudo apt install amdgpu-install
2. Configure: sudo amdgpu-install --usecase=rocm
3. Permissions: sudo usermod -aG render,video $USER
4. Reboot and verify: rocminfo
5. Older GPUs may need: export HSA_OVERRIDE_GFX_VERSION=10.3.0

Step 4: Verify GPU Acceleration Is Active
Confirm the GPU is working while a model runs:
• Run ollama ps to check GPU usage
• NVIDIA: nvidia-smi -l 1 for real-time monitoring
• AMD: rocm-smi -a for real-time monitoring
• macOS: Activity Monitor GPU History

Step 5: Multi-GPU Environment Setup
Specify which GPUs to use:
• NVIDIA: export CUDA_VISIBLE_DEVICES=0,2
• AMD: export ROCR_VISIBLE_DEVICES=0,1
• Add the environment variable to ~/.bashrc for persistence