Ollama GPU Acceleration Configuration: CUDA, ROCm, and Metal Platform Guide
When I first ran a 7B model locally, I used pure CPU. The experience? Less than two characters per second - I could finish half a cup of coffee waiting for it to complete a sentence. Later, I got an RTX 3080, and with the same model and parameters, the speed jumped to over 40 tokens per second - roughly a 50x difference.
That’s not all. Larger models, longer contexts, multi-turn conversations - CPU basically can’t handle these. GPU acceleration isn’t just nice to have, it’s the difference between usable and unusable.
If your computer has a graphics card - whether NVIDIA, AMD, or Apple Silicon - there’s a good chance it can accelerate Ollama. But how do you configure it? Each platform has different pitfalls. NVIDIA users have it easy - just install drivers. AMD users on Linux need to deal with ROCm, and AMD users on Windows fall back to Vulkan. Mac users have it easiest - nothing to configure.
This article will cover configuration methods, common pitfalls, and troubleshooting approaches for all three platforms in one go.
Why GPU Acceleration Matters
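Let’s start with data. The ranges below are rough, and actual speed depends on the specific card, the quantization, and the context length, but the inference speed difference for 7B models across different hardware is substantial: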
| Acceleration Method | Typical Performance (7B Model) | Use Case |
|---|---|---|
| CPU-only inference | 0.5-2 tokens/sec | Testing, debugging |
| NVIDIA CUDA | 30-80 tokens/sec | Daily use, production |
| Apple Metal | 20-50 tokens/sec | Mac users |
| AMD ROCm | 25-60 tokens/sec | Linux AMD users |
Why such a huge gap? Simply put, GPUs excel at “repetitive work.” Large model inference is essentially matrix multiplication - billions of multiply-add operations for every generated token. A CPU doing this is like having a PhD student work through math problems one by one - accurate but slow. A GPU? Thousands of workers doing it together, each handling a small piece. Individually they’re not as smart, but there’s strength in numbers.
Then there’s memory bandwidth. How fast inference runs largely depends on how quickly data can be fed to the compute units. GPU memory bandwidth is typically an order of magnitude higher than system memory - the RTX 3080 reaches 760 GB/s (10GB version) or 912 GB/s (12GB version), while typical dual-channel DDR4 is only around 50 GB/s. If the data is stuck in traffic, fast computation is useless.
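To make the bandwidth argument concrete, here is a back-of-envelope sketch. It assumes that generating one token means reading roughly all of the model weights from memory once, so bandwidth divided by model size gives a ceiling on tokens per second. That is a simplification (it ignores compute time and the KV cache), and `max_tokens_per_sec` is a hypothetical helper name, not part of Ollama. The 4 GB size stands in for a 4-bit-quantized 7B model.

```shell
# Back-of-envelope ceiling on generation speed (an assumption, not a benchmark):
# every generated token reads roughly all model weights from memory once,
# so tokens/sec cannot exceed bandwidth / model size.
# max_tokens_per_sec is a hypothetical helper name.
max_tokens_per_sec() {
  # $1 = memory bandwidth in GB/s, $2 = model size in GB
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

max_tokens_per_sec 912 4   # GPU bandwidth figure quoted above
max_tokens_per_sec 50 4    # DDR4 bandwidth figure quoted above
```

Real throughput lands well below both ceilings, but the ratio between them is the point: the memory system alone already puts the GPU far ahead of the CPU.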
So when do you need a GPU? Basically, for models of 7B and up. Chat, coding, long text generation - without a GPU, the experience will be terrible. If you’re just occasionally playing around or debugging a small model, CPU might suffice.
NVIDIA CUDA Configuration Guide
NVIDIA is the most hassle-free choice. Mature ecosystem, comprehensive documentation, abundant community experience - people have already stepped on all the pitfalls for you.
Hardware and Driver Requirements
Not all NVIDIA graphics cards work. Ollama requires Compute Capability 5.0 or higher. What does that mean? Check this table:
| Compute Capability | Representative Cards | Works? |
|---|---|---|
| 8.9 | RTX 4090/4080/4070 | Perfect |
| 8.6 | RTX 3090/3080/3070 | Perfect |
| 7.5 | RTX 2080 Ti/2080 | Perfect |
| 6.1 | GTX 1080 Ti/1080 | Works |
| 5.2 | GTX 980 Ti/980 | Works |
| Below 5.0 | Most GTX 7xx (Kepler) and older | Not supported |
The driver version matters too. The official requirement is 531+ (Windows) or 535+ (Linux). With an older driver, CUDA won’t run.
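If you’d rather script the driver check than eyeball it, something like the following could work. It is only a sketch: `driver_ok` is a hypothetical helper, and the thresholds are the ones quoted in this article. The live query only runs when `nvidia-smi` is actually installed.

```shell
# Sketch: compare the installed driver's major version with a minimum.
# driver_ok is a hypothetical helper; the thresholds come from this article.
driver_ok() {
  # $1 = driver version string such as "550.54.14", $2 = minimum major version
  major=${1%%.*}
  case $major in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as not OK
  esac
  [ "$major" -ge "$2" ]
}

# Only query the GPU when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  if driver_ok "$version" 535; then
    echo "driver $version is new enough"
  else
    echo "driver '$version' is missing or too old, update it"
  fi
fi
```

On Windows the minimum would be 531 instead of 535.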
Verification and Installation Steps
First, confirm your graphics card is recognized by the system. Run this in terminal:
nvidia-smi
If you can see graphics card information, driver version, and CUDA version, you’re good. If it says “command not found”, the driver isn’t installed or the path is wrong.
Ollama automatically detects CUDA after installation. No extra configuration needed, just make sure the driver is working. Run a model to test:
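# Terminal 1: load the model and start a chat session
ollama run llama3.2
# Terminal 2: while the model is loaded, check where it is running
ollama ps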
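You should see GPU in the PROCESSOR column of the ollama ps output, like this (values are illustrative):
NAME ID SIZE PROCESSOR UNTIL
llama3.2:latest abc123 4.7 GB 100% GPU 2 minutes from now
If it shows 100% CPU instead, there’s a problem. A split such as 48%/52% CPU/GPU means the model didn’t fit entirely in VRAM and part of it is running on the CPU.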
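For scripts or health checks, the same verdict can be pulled out of the output automatically. The sketch below assumes the PROCESSOR column reads "100% GPU", "100% CPU", or a split like "48%/52% CPU/GPU"; `gpu_status` is a hypothetical helper name. It prints the first column of each row followed by a one-word verdict.

```shell
# Sketch: classify each loaded model from `ollama ps` output read on stdin.
# Assumes the PROCESSOR column reads "100% GPU", "100% CPU", or a split such
# as "48%/52% CPU/GPU"; gpu_status is a hypothetical helper name.
gpu_status() {
  awk 'NR > 1 {
    if ($0 ~ /100% GPU/)      state = "gpu"
    else if ($0 ~ /CPU\/GPU/) state = "split"
    else if ($0 ~ /100% CPU/) state = "cpu"
    else                      state = "unknown"
    print $1, state
  }'
}

# Live check, only when the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
  ollama ps 2>/dev/null | gpu_status
fi
```

Anything other than "gpu" is a hint to look at drivers or VRAM before blaming the model.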
Common Pitfalls
Wrong driver version. Download the latest driver from NVIDIA’s website. Linux users should be careful not to install the wrong version - some distribution default drivers are too old.
Missing CUDA Toolkit. Actually, Ollama doesn’t need the full CUDA Toolkit - it bundles the CUDA runtime libraries it needs. But on some unusual system configurations you might have to install the CUDA runtime manually. On Linux:
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit
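Running Ollama in containers. Docker users need the NVIDIA Container Toolkit installed on the host, plus the --gpus=all flag so the container can access the GPU:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama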
AMD ROCm Configuration Guide
AMD users have more work to do. ROCm (AMD’s CUDA alternative) isn’t as mature as CUDA, but has improved significantly in the past two years. Linux configuration is relatively smooth, but Windows requires some workarounds.
Which AMD Cards Work?
ROCm support is best on recent RDNA architectures:
| Architecture | Series | Support Level |
|---|---|---|
| RDNA3 | RX 7900 XTX/XT, RX 7800/7700 | Best |
| RDNA2 | RX 6800/6700/6600 | Good |
| RDNA1 | RX 5700/5600/5500 | Usable |
| GCN | RX Vega, RX 500/400 | Not officially guaranteed |
Basically, RX 7000 and 6000 series are fine, 5000 series work okay, and older cards shouldn’t be relied upon.
Linux ROCm Installation
Ubuntu/Debian users follow these steps:
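# Package names vary by ROCm release; these are the ones used in this guide.
# Recent releases are installed through AMD's amdgpu-install tool instead
# (sudo amdgpu-install --usecase=rocm).
sudo apt update
# Install ROCm core
sudo apt install rocm-dkms rocm-dev rocm-libs
# Install HIP runtime
sudo apt install hip-runtime-amd
# Allow your user to access the GPU device nodes
sudo usermod -aG render,video $USER
# Reboot so the kernel module loads and the group change applies, then verify
rocminfo
If rocminfo shows graphics card information after the reboot, you’re set.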
Ollama automatically detects ROCm after installation. Like CUDA, no extra configuration needed.
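rocminfo prints a lot, and the part that matters most is the architecture name (the gfx target) of each GPU. A small filter can pull that out. This is a sketch that assumes rocminfo prints agent lines of the form "Name: gfx1100"; `list_gfx_targets` is a hypothetical helper name.

```shell
# Sketch: list the GPU architecture names (gfx targets) found in rocminfo output.
# Assumes rocminfo prints agent lines of the form "Name: gfx1100";
# list_gfx_targets is a hypothetical helper name.
list_gfx_targets() {
  awk '$1 == "Name:" && $2 ~ /^gfx[0-9a-f]+$/ { print $2 }' | sort -u
}

# Live check, only when rocminfo is installed.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | list_gfx_targets
fi
```

An RX 7900 series card reports gfx1100, for example. If the list comes back empty, ROCm isn’t seeing the GPU at all and Ollama won’t either.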
What About Windows Users?
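ROCm’s Windows support is still maturing and covers fewer cards. But there’s an alternative - Vulkan. Set an environment variable before starting the Ollama server. Quit any running Ollama instance first, because the variable is read by the server, not by ollama run:
# Windows PowerShell, terminal 1
$env:OLLAMA_VULKAN = "1"
ollama serve
# Terminal 2
ollama run llama3.2
Vulkan performance isn’t as good as ROCm, but it works. Real-world testing shows about 70-80% of ROCm speed.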
Multi-GPU Selection
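If you have multiple AMD GPUs, you can specify which ones to use. Set the variable in the environment of the Ollama server, not just in the shell where you run the client: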
# Use only first GPU
export ROCR_VISIBLE_DEVICES=0
# Use first and third GPUs
export ROCR_VISIBLE_DEVICES=0,2
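One catch with export: it only affects processes started from that shell. If Ollama runs as a systemd service on Linux, the variable has to go into the service’s environment instead. The sketch below prints a drop-in section for that purpose; `make_dropin` is a hypothetical helper, and the service name "ollama" is assumed from the restart command used elsewhere in this guide.

```shell
# Sketch: print a systemd drop-in that sets environment variables for the
# Ollama service. make_dropin is a hypothetical helper name.
make_dropin() {
  # Each argument is one NAME=value pair.
  echo "[Service]"
  for pair in "$@"; do
    printf 'Environment="%s"\n' "$pair"
  done
}

make_dropin ROCR_VISIBLE_DEVICES=0,2
```

Paste the printed lines into the editor opened by sudo systemctl edit ollama.service, then run sudo systemctl daemon-reload and sudo systemctl restart ollama. The same approach applies to other server-side variables such as OLLAMA_SCHED_SPREAD.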
Performance Comparison
Numbers from AMD and from community benchmarks give a rough picture. The RX 7900 XTX (AMD’s flagship) runs 7B models at about 35-45 tokens/sec, while the RTX 4090 (NVIDIA’s flagship) reaches 50-70 tokens/sec. There’s a gap, but the price difference is even larger - the 7900 XTX is about 40% cheaper.
From a price-performance perspective, AMD users should take the time to set up ROCm.
Apple Metal Zero-Configuration Experience
Mac users have it easiest. Ollama’s support for Apple Silicon is zero-configuration - install Ollama, run it, GPU acceleration automatically kicks in.
Which Macs Work?
All Apple Silicon Macs are supported:
- M1 / M1 Pro / M1 Max / M1 Ultra
- M2 / M2 Pro / M2 Max / M2 Ultra
- M3 / M3 Pro / M3 Max
- M4 series
On Intel Macs, Ollama doesn’t use GPU acceleration and runs on the CPU only. But Intel Macs are about ready for retirement anyway.
Automatic Detection Mechanism
Ollama automatically detects Metal at startup. No configuration files, environment variables, or driver installations needed - Apple has deeply integrated Metal into the system.
Verify it (run ollama ps in a second terminal while the model is loaded):
ollama run llama3.2
ollama ps
The PROCESSOR column should show GPU, like:
100% GPU
If you see CPU, there’s a problem. But honestly, this is rare on Mac.
What’s the Performance Like?
Base M2 runs 7B models at about 25-35 tokens/sec. Pro/Max versions are faster because they have more GPU cores. Testing shows M2 Max can reach around 45 tokens/sec, comparable to mid-range NVIDIA cards.
One detail: Apple Silicon uses a unified memory architecture - the GPU and CPU share system memory. The benefit is that there’s no separate, fixed pool of VRAM, so the GPU can use a large share of system RAM. The downside is that large models compete with everything else for that memory. An 8GB M2 can run 7B models okay, 14B is pushing it, and 70B is out of the question.
Common Misconceptions
Many people think Mac needs Metal configuration - it doesn’t at all. Ollama’s official code already has Metal detection logic, automatically enabled after installation.
Others ask about installing ROCm or CUDA - Mac doesn’t use these at all. Metal is Apple’s own technology, built into the system.
Multi-GPU and VRAM Management
If you have multiple GPUs, or insufficient VRAM, this section is crucial.
Layer Distribution Mechanism
Large models don’t have to run entirely on the GPU. They’re split into many “layers” - some on GPU, the rest on CPU. The ratio is calculated dynamically - Ollama automatically decides how many layers go on the GPU based on available VRAM.
For example: a 7B model in the Llama family has 32 transformer layers. If only three-quarters of them fit in free VRAM, 24 layers run on the GPU and the remaining 8 on the CPU. The less VRAM is available, the more layers overflow to system memory.
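The idea behind the split can be sketched with simple arithmetic. This is a rough illustration and not Ollama’s actual scheduler: it assumes all layers are the same size and just counts how many fit in free VRAM after setting some headroom aside. `layers_on_gpu` and all the numbers are hypothetical.

```shell
# Rough sketch of the layer split, NOT Ollama's actual scheduler: assume all
# layers are the same size and count how many fit in free VRAM after
# reserving some headroom. layers_on_gpu is a hypothetical helper name.
layers_on_gpu() {
  # $1 = model size in MB, $2 = layer count, $3 = free VRAM in MB, $4 = reserved MB
  awk -v size="$1" -v n="$2" -v vram="$3" -v reserve="$4" 'BEGIN {
    per_layer = size / n
    usable = vram - reserve
    if (usable < 0) usable = 0
    fit = int(usable / per_layer)
    if (fit > n) fit = n
    print fit
  }'
}

layers_on_gpu 4000 32 3500 500    # 3000 MB usable, 125 MB per layer: prints 24
layers_on_gpu 4000 32 8000 500    # everything fits: prints 32
```

The real calculation also has to account for the KV cache and per-GPU overhead, which is why the actual split can differ from this kind of estimate.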
Pack vs Spread Mode
Multi-GPU environments have two strategies:
- Pack Mode (default): Try to fit the model into one GPU, overflow to another. Good when GPU performance differs significantly.
- Spread Mode: Distribute evenly across all GPUs. Good when GPU performance is similar.
Enable Spread mode (the variable must be set in the environment of the Ollama server):
export OLLAMA_SCHED_SPREAD=1
Honestly, most people can use the default Pack mode. Spread mainly has advantages in VRAM utilization but is more complex to configure and requires experience to tune.
What If VRAM Is Insufficient?
Running large models is most problematic when VRAM isn’t enough. Several solutions:
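1. Use quantized models. Q4_K_M quantization can shrink a 7B model’s VRAM usage from about 14GB (FP16) to about 4GB, with only about 5-10% accuracy loss. Very worthwhile.
# Pull a quantized version (tag names differ per model and llama3.2 has no 7B variant; check the model's tag list on ollama.com)
ollama pull llama3.2:3b-instruct-q4_K_M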
2. Reduce context length. Long conversations, large documents occupy lots of VRAM. If it’s just simple Q&A, shorter context is fine.
3. Multi-GPU distribution. Two 8GB cards together can hold a model that neither could fit alone. Expect a single 16GB card to be at least as fast, though, since splitting layers across cards adds transfer overhead.
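The first two options above can be put into numbers. The sketch below estimates weight size from parameter count and bits per weight, and KV cache size from context length. The 4.5 bits-per-weight figure for a 4-bit quantization and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 cache) are illustrative assumptions, and both helper names are hypothetical. Sizes are in decimal GB.

```shell
# Back-of-envelope VRAM estimate. The bits-per-weight figures and the model
# shape used below are illustrative assumptions, not values read from Ollama.
weights_gb() {
  # $1 = parameters in billions, $2 = bits per weight
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'
}

kv_cache_gb() {
  # $1 = layers, $2 = KV heads, $3 = head dimension, $4 = context tokens
  # The first factor 2 covers keys plus values; the last one is bytes per FP16 element.
  awk -v l="$1" -v h="$2" -v d="$3" -v ctx="$4" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * ctx * 2 / 1e9 }'
}

weights_gb 7 16               # FP16 weights
weights_gb 7 4.5              # roughly 4-bit weights
kv_cache_gb 32 32 128 4096    # cache at 4096 tokens of context
kv_cache_gb 32 32 128 2048    # cache at 2048 tokens of context
```

Under these assumptions, halving the context halves the cache. Models that use grouped-query attention have fewer KV heads and a correspondingly smaller cache. Inside an ollama run session, the context length can be lowered with /set parameter num_ctx 2048.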
Dynamic Allocation Logic
Ollama manages this automatically, so there’s no need to specify the layer count by hand. But if you want to force adjustments, you can set the num_gpu model parameter, which controls how many layers are sent to the GPU (advanced usage, most people don’t need it).
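For the curious, one way to pin the layer count is a custom Modelfile. The sketch below only prints the file contents. `write_modelfile` is a hypothetical helper, the value 20 is arbitrary, and whether a given Ollama release honors the num_gpu parameter exactly this way is an assumption you should verify on your version.

```shell
# Sketch: print a Modelfile that pins the number of GPU layers through the
# num_gpu parameter. write_modelfile is a hypothetical helper name; whether a
# given Ollama release honors num_gpu this way is an assumption.
write_modelfile() {
  # $1 = base model, $2 = number of layers to place on the GPU
  printf 'FROM %s\nPARAMETER num_gpu %s\n' "$1" "$2"
}

write_modelfile llama3.2 20
```

Save the output as Modelfile, build a variant with ollama create followed by a name of your choice and -f Modelfile, then run that variant and check ollama ps to see whether the split changed.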
Troubleshooting Guide
You’ll always encounter issues when configuring GPU acceleration. Here’s a compilation of common troubleshooting approaches.
GPU Detection Issue Checklist
Check in order:
1. Confirm driver installation
# NVIDIA
nvidia-smi
# AMD
rocminfo
If there’s an error, install drivers first.
2. Confirm Ollama version
ollama --version
Very old versions might not support certain GPUs. Update:
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS/Windows
# Download the latest installer from the official website
3. Check CUDA/ROCm version
# NVIDIA CUDA version (nvcc is only available if the CUDA Toolkit is installed)
nvcc --version
# ROCm version
rocm-smi
Ollama requires CUDA 12.3+ or ROCm 6.0+. Upgrade if the version is wrong.
4. Restart service
# Linux
sudo systemctl restart ollama
# macOS/Windows
# Quit the Ollama app and start it again
Some configuration changes need a restart to take effect.
GPU Disappears After Sleep
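Mac and Windows can both hit this problem, and Linux occasionally does too - GPU acceleration fails after waking from sleep.
Solutions:
- Mac: Restart the Ollama service, or restart the computer
- Windows: Check that the driver is working, and reload it if necessary
- Linux: Less common, but with NVIDIA cards Ollama can fall back to CPU after a suspend/resume cycle. Reloading the UVM driver with sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm usually brings the GPU back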
Container GPU Permission Issues
Linux users running Ollama in Docker might encounter SELinux permission issues, where the container is denied access to the GPU devices.
Solution:
# Preferred: allow containers to use host devices
sudo setsebool container_use_devices=1
# Or, for a quick test only, temporarily switch SELinux to permissive mode (not recommended for long-term use)
sudo setenforce 0
Other Common Issues
“out of memory” error: Model is too large, not enough VRAM. Use quantized version or switch to smaller model.
Inference speed didn’t improve: Confirm ollama ps shows GPU. If it shows CPU, troubleshoot the issues above.
AMD GPU not working: First confirm ROCm is installed correctly. Windows users try Vulkan mode.
Summary
After all this, how do you choose?
| Your Hardware | Recommended Solution | Configuration Difficulty |
|---|---|---|
| NVIDIA GPU | CUDA auto-enable | Low, just install drivers |
| AMD GPU + Linux | ROCm | Medium, requires manual installation |
| AMD GPU + Windows | Vulkan | Low, set environment variable |
| Apple Silicon | Metal auto-enable | Very low, zero configuration |
| Intel Mac or no GPU | Pure CPU | No configuration needed, but very slow |
Simply put: NVIDIA users have it easiest, Mac users are happiest, AMD users on Linux are fine but Windows requires workarounds, and those without GPU… better find a way to get one.
GPU acceleration isn’t optional optimization, it’s a basic requirement for running LLMs locally. Once configured, the experience difference is a qualitative leap.
NVIDIA CUDA GPU Acceleration Configuration
Configure Ollama GPU acceleration on NVIDIA graphics cards for high-speed large model inference
Estimated time: 10 min
1. Verify graphics card and driver. Run the `nvidia-smi` command to view graphics card information, driver version, and CUDA version. If there's an error, the driver isn't installed or there's a path configuration issue.
2. Install or update driver. Download the latest driver from NVIDIA's website. Linux users should note that distribution default drivers may be too old. Windows requires driver 531+, Linux requires 535+.
3. Start Ollama and test. Run `ollama run llama3.2` to start the model, then execute `ollama ps` to check processor status. If it shows a GPU percentage, acceleration is working.
4. Troubleshoot issues (if needed). If it shows CPU, check whether the CUDA runtime is missing (Linux users can install nvidia-cuda-toolkit), make sure Docker containers are started with the --gpus all flag, or restart the Ollama service.
9 min read · Published on: May 16, 2026 · Modified on: May 17, 2026