Getting Started with Ollama: Your First Step to Running LLMs Locally
Last month, my OpenAI API bill hit $300. Honestly, I was pretty frustrated—I was just testing features during development, didn’t expect the costs to pile up that fast. Worse, I was dealing with documents containing company data, and uploading everything to the cloud just didn’t feel right. That’s when I decided to find a local alternative. After a week of trying different tools, I discovered Ollama was the most hassle-free option.
You might be facing similar challenges: cloud AI services getting expensive, data privacy concerns, or simply needing AI in offline environments. Ollama solves these problems. It lets you run large language models on your own computer—setup takes minutes, it’s completely free, and your data never leaves your machine. Let me share what I’ve learned so you can avoid the mistakes I made.
What is Ollama?
The name can be a bit confusing. Ollama isn’t a model—it’s a tool for running models. Just like Docker made containerization simple, Ollama makes running large language models locally accessible.
Simply put, you use Ollama to download a model (like Llama 3.2), then chat with it, have it write code, do translations—all inference happens on your machine.
Why bother with local models when cloud services like ChatGPT and Claude are so convenient?
Cost. API calls charge by token, and frequent testing during development can lead to surprisingly high bills. I’ve seen people spend hundreds or even thousands in a month during development phases.
Privacy. All conversations get uploaded to cloud servers. If you’re handling company documents, customer data, or sensitive information, this might not meet compliance requirements.
Network dependency. You need to be online. Unstable connections mean poor experience, and completely offline scenarios are impossible.
Rate limits. Cloud APIs have all kinds of restrictions—rate limits, quotas, feature limits… sometimes you just want to test something and get blocked, which is frustrating.
Ollama addresses all these pain points. Once the model is downloaded to your computer, inference happens entirely locally. Works offline, data never leaves your machine, no API fees, no usage limits.
For developers, Ollama offers extra value—it provides a complete API that’s compatible with OpenAI’s interface format. This means you can test with local models during development to save money, then switch to cloud APIs for production. Very flexible.
Installing Ollama: Three Simple Steps
Installation is straightforward. Different platforms have slight variations, but you’ll be up and running in minutes.
Linux (Most Straightforward)
If you’re on Linux, one command does it all:
curl -fsSL https://ollama.com/install.sh | sh
Run this, and Ollama is installed with the service automatically started.
Honestly, Linux is my recommended platform. Server deployment, containerization—Linux handles it all perfectly. If you’re planning to use Ollama in production, Linux is the way to go.
Manual installation is possible, but I don’t recommend it for beginners. If you really need it, you can download the binary and configure systemd services yourself. I covered the steps in my research notes, but won’t go into detail here.
macOS (User-Friendly GUI)
Two options on macOS.
Using Homebrew:
brew install ollama
After installation, run ollama serve directly, or use brew services start ollama to run it in the background.
Don’t want Homebrew? Download the .pkg installer from the website, double-click, and you’re done. Mac users are familiar with this installation method.
A quick tip: Apple Silicon (M1/M2/M3) Macs automatically enable Metal GPU acceleration, and performance is quite good. My M2 MacBook Pro runs 8B models smoothly, and 13B models work too, just with higher memory usage.
Windows (winget is Easiest)
Windows users can use winget:
winget install --id=Ollama.Ollama -e
Or download the .exe installer directly from the website. After installation, Ollama starts automatically, and you’ll see it in the system tray.
Verify installation:
ollama --version
If you see a version number, you’re good to go. Simple, right?
Running Your First Model
With Ollama installed, let’s run a model.
First, pick a model. For beginners, I recommend llama3.2 or qwen:8b. The former is Meta’s latest model with solid overall capabilities; the latter is Qwen, excellent for Chinese language understanding.
Run it:
ollama run llama3.2
This command automatically downloads the model (takes a few minutes the first time), then starts an interactive interface.
You’ll see:
>>> Send a message (/? for help)
Now you can chat. Try:
>>> Hello, introduce yourself
The model responds in real-time, like having a conversation. Exit with /bye or Ctrl+D.
Understanding Model Sizes
You might notice model names contain numbers like 3b or 8b. This represents parameter count—basically the model’s “brain size.”
3B models: 3 billion parameters, runs on lightweight laptops, only needs 4GB RAM. Fast, but relatively basic capabilities. Good for simple conversations and tool commands.
8B models: 8 billion parameters, the “sweet spot” for most laptops and desktops. Needs 8GB RAM, handles daily conversations and simple coding assistance well. This is what I use most often.
13B models: 13 billion parameters, needs 16GB RAM. Better response quality and longer context, suitable for machines with mid-to-high-end graphics cards.
70B models: 70 billion parameters, needs 64GB RAM or more. Professional server-grade, strongest capabilities but high hardware requirements. Honestly, most individual users don’t need models this large.
For beginners, 3B to 8B is sufficient. These run smoothly on most modern laptops. Don’t chase large models at first—start small to get familiar, then upgrade as needed.
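If you want a quick way to sanity-check whether a model fits your machine, the sizes above follow a rough rule of thumb: a 4-bit quantized model takes about 0.5 bytes per parameter on disk, and you want roughly twice that in RAM for comfortable headroom. This is my own back-of-the-envelope estimate, not an official Ollama formula:

```python
def q4_model_size_gb(params_billion: float) -> float:
    """Rough on-disk size of a 4-bit quantized model: ~0.5 bytes per parameter."""
    return params_billion * 0.5

def recommended_ram_gb(params_billion: float) -> float:
    # Rule of thumb: about 2x the model size, leaving room for context and OS.
    return 2 * q4_model_size_gb(params_billion)
```

For an 8B model this gives a ~4GB download and ~8GB of RAM, which matches the guidance above.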
Common Model Management Commands
Ollama provides simple commands for managing models. Just remember a few key ones.
Download models:
ollama pull llama3.2
ollama pull qwen:8b
The pull command downloads from the model library to local storage. Once cached, you don’t need to redownload.
List installed models:
ollama list
Shows all your downloaded models, including name, size, and modification time.
Delete models:
ollama rm llama3.2
Remove models you no longer need to free up space.
View model details:
ollama show llama3.2
Shows technical details like parameter count and quantization method.
Other commands like cp (copy) and push (upload) aren’t used often, so I won’t detail them. Check the documentation when needed.
Recommended Models
Different use cases call for different models:
Daily conversation:
- llama3.2: Meta’s latest, solid overall performance
- qwen:8b: Qwen, excellent for Chinese
- mistral: European open-source model, good performance
Coding assistance:
- codellama: Optimized specifically for code generation
- deepseek-coder: DeepSeek’s code model, works well
Multimodal (can understand images):
- llava: Supports image understanding
- llama3.2-vision: Vision version of Llama 3.2
Not sure which to choose? Try llama3.2:3b or qwen:8b first. Both are great for beginners and run smoothly.
GPU Acceleration: The Speed Difference is Huge
This is really important. CPU can run models too, but the speed difference is massive—GPU can be 10 to 20 times faster.
The first time I used it, I didn’t pay attention to GPU configuration, and the model ran painfully slow. Then I realized GPU acceleration wasn’t enabled. After configuring it, speed immediately improved.
Ollama supports three GPU types: NVIDIA, AMD, and Apple Silicon.
NVIDIA Graphics Cards (RTX Series)
NVIDIA configuration is simplest. Basically automatic.
You need:
- Latest NVIDIA drivers (version 531 or newer)
- CUDA installed
Check if GPU is available:
nvidia-smi
If you see graphics card info, you’re good. Ollama automatically uses GPU when running models.
Specify which GPUs to use:
export CUDA_VISIBLE_DEVICES=0,1
This uses only the first and second GPUs. Sometimes needed on multi-GPU machines.
Using GPU in Docker:
If running Ollama in Docker, you need NVIDIA Container Toolkit:
docker run --gpus all ollama/ollama
This lets the container access the host’s GPU.
AMD Graphics Cards
AMD is slightly more complex, requiring ROCm installation.
AMD GPUs supporting ROCm include:
- Radeon Instinct series (MI100, MI210, MI250, MI300X, etc.)
- Radeon RX series (5700 XT, 5500 XT, 7600, 9070, etc.)
- Radeon PRO series
After installing ROCm v7 or newer, Ollama automatically detects and uses AMD GPUs.
For unsupported GPU architectures, try overriding the environment variable:
export HSA_OVERRIDE_GFX_VERSION=10.3.0
Additionally, AMD has experimental Vulkan support:
export OLLAMA_VULKAN=1
Apple Silicon (M1/M2/M3)
Mac users have it easiest. Apple Silicon automatically enables Metal GPU acceleration—no additional configuration needed.
16GB RAM M-series chips run 8B models smoothly, 32GB can handle 13B or even 30B. Apple Silicon uses unified memory architecture where GPU and CPU share memory, so RAM size directly determines what model sizes you can run.
Performance Tips
Beyond GPU acceleration, several techniques can improve speed:
Quantized models. Default is Q4_K_M quantization, balancing quality and speed. If memory is tight, try Q4_K_S—slightly worse quality but smaller footprint.
Flash Attention. An optimization technique that significantly speeds up large context scenarios:
export OLLAMA_FLASH_ATTENTION=1
Context length. The default context is 2048 tokens. A smaller context saves memory and improves speed, so raise it only when you need the model to handle longer input:
export OLLAMA_CONTEXT_LENGTH=4096
Here 4096 doubles the default to fit longer documents, at the cost of extra memory.
Keep model loaded. By default, models unload after 5 minutes of idle time. If you use it frequently, extend this:
export OLLAMA_KEEP_ALIVE=30m
Or even -1 to never unload (as long as Ollama service is running). This means immediate responses every time without waiting for model reload.
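The keep-alive setting can also be passed per request instead of globally: Ollama's `/api/chat` and `/api/generate` endpoints accept a `keep_alive` field in the JSON body. A small sketch that builds such a body (the helper name is mine, not part of any library):

```python
import json

def chat_payload(model: str, content: str, keep_alive: str = "30m") -> str:
    """Build a /api/chat request body with a per-request keep_alive."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "keep_alive": keep_alive,  # e.g. "30m", or -1 to keep loaded indefinitely
    })
```

This is handy when one application wants the model kept warm but you don't want to change the service-wide default.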
API Integration: Connecting Ollama to Your Applications
This is where Ollama’s real value lies—it provides a complete API for integration into your applications.
REST API Basics
After starting, Ollama provides API services on local port 11434.
Two most commonly used endpoints:
Text generation:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?"
}'
Chat completion:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Default response is streaming, returning results bit by bit. If you want complete results, set "stream": false.
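A non-streaming request from Python looks like the sketch below. It assumes Ollama is running on the default port 11434; the helper only builds the JSON body, and the actual network call is shown in the comment so the snippet works even without a running server:

```python
import json

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> str:
    """Build the JSON body for a /api/generate request, non-streaming by default."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

# To send it for real (requires a running Ollama server):
#   import urllib.request
#   req = urllib.request.Request(
#       OLLAMA_GENERATE_URL,
#       data=build_generate_payload("llama3.2", "Why is the sky blue?").encode(),
#       headers={"Content-Type": "application/json"})
#   print(json.loads(urllib.request.urlopen(req).read())["response"])
```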
OpenAI-Compatible API—Most Valuable Feature
This is incredibly practical. Ollama provides OpenAI-compatible interfaces, meaning your existing OpenAI code barely needs changes to switch to local models.
OpenAI-compatible endpoint:
http://localhost:11434/v1/chat/completions
Python example:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Doesn't matter what you put here
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "user", "content": "Say this is a test"}
]
)
print(response.choices[0].message.content)
JavaScript example:
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama'
});
const response = await client.chat.completions.create({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
This way, you can test with local models during development to save money, then switch to cloud APIs for production stability. Or go fully local for cost control and data privacy.
My approach: configure an environment variable to automatically switch between local and cloud based on environment. Development defaults to local, only switching to cloud when explicitly needed.
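A minimal sketch of that switch, assuming an environment variable I'm calling LLM_BACKEND (the name and default models are my own choices, not a convention):

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    api_key: str
    model: str

def llm_config() -> LLMConfig:
    """Default to local Ollama; use the cloud only when explicitly requested."""
    if os.environ.get("LLM_BACKEND") == "cloud":
        return LLMConfig(
            base_url="https://api.openai.com/v1",
            api_key=os.environ["OPENAI_API_KEY"],
            model="gpt-4o-mini",
        )
    return LLMConfig(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # Ollama ignores the key, but the client requires one
        model="llama3.2",
    )
```

Pass the fields straight into the client: `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)`. The rest of your code never needs to know which backend it's talking to.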
Ollama’s Python Library
Besides the OpenAI library, Ollama has its own Python library that’s more direct:
pip install ollama
Basic usage:
import ollama
response = ollama.chat(model='llama3.2', messages=[
{'role': 'user', 'content': 'Why is the sky blue?'},
])
print(response['message']['content'])
Streaming mode:
import ollama
stream = ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Tool Calling
Ollama supports tool calling, enabling models to call external functions or APIs. This is useful for building complex AI applications.
Example:
import ollama
response = ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'What is the weather in Toronto?'}],
tools=[{
'type': 'function',
'function': {
'name': 'get_current_weather',
'description': 'Get the current weather for a city',
'parameters': {
'type': 'object',
'properties': {
'city': {'type': 'string', 'description': 'The name of the city'},
},
'required': ['city'],
},
},
}],
)
print(response['message']['tool_calls'])
The model decides whether tool calling is needed based on the question, then returns call parameters. Your code then actually calls the tool and returns results to the model.
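The "your code actually calls the tool" step can be sketched like this. The dispatcher works on the tool-call structure the Ollama library returns (each call carrying a function name and already-parsed arguments); get_current_weather is a stub standing in for a real weather API:

```python
def get_current_weather(city: str) -> str:
    # Stub: a real implementation would query a weather service.
    return f"18C and sunny in {city}"

# Map tool names the model may request to the functions that implement them.
TOOLS = {"get_current_weather": get_current_weather}

def run_tool_calls(tool_calls):
    """Execute each requested tool call and collect 'tool' role messages
    to append to the conversation for the model's next turn."""
    results = []
    for call in tool_calls or []:
        fn = call["function"]
        output = TOOLS[fn["name"]](**fn["arguments"])
        results.append({"role": "tool", "content": output})
    return results
```

Append the returned messages to the conversation and call `ollama.chat` again so the model can phrase its final answer using the tool results.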
Ollama vs LM Studio: Which to Choose?
You might have heard of LM Studio, another tool for running LLMs locally. How do you choose?
Honestly, both have their strengths—it depends on your needs.
Ollama:
- Command-line focused, developer-friendly
- Simple installation, one command
- Automatic API service startup
- Perfect for server and container deployment
- Ideal for building applications, automation integration
LM Studio:
- Graphical interface, intuitive
- Built-in model browser, easy downloads
- Manual API service startup required
- Not suitable for server deployment
- Good for exploring models, learning and testing
Simply put:
- Building AI applications, server deployment, script automation → Ollama
- Quickly experiencing various models, not comfortable with command line, just exploring → LM Studio
Actually, they complement each other. I discover and test new models in LM Studio, then integrate them formally in Ollama. Both are built on llama.cpp, so models are fully compatible.
Practical Use Cases
Let me share some scenarios where I’ve actually used it.
Local Code Assistant
Integrate Ollama into VS Code or Cursor to help write, explain, and refactor code. I recommend codellama or deepseek-coder.
Tools like Continue.dev and Aider can directly connect to Ollama’s API. Once configured, you have a free local code assistant.
Document Q&A System
If you have many documents for retrieval and Q&A, combine Ollama with a vector database to build a private knowledge base. The process:
- Use Ollama’s embeddings interface to generate document vectors
- Store in a vector database (Chroma, Qdrant both work)
- When users ask questions, first retrieve relevant document fragments
- Use fragments as context for model responses
This is the classic RAG (Retrieval-Augmented Generation) architecture. Completely local, data never leaves your network.
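The retrieval step of that pipeline can be sketched as follows. To keep the example runnable offline, embed() here is a toy bag-of-letters stand-in; in a real system you would generate vectors with an embedding model through Ollama's embeddings interface and store them in Chroma or Qdrant:

```python
import math

def embed(text: str) -> list[float]:
    # Toy character-frequency "embedding" -- replace with a real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k document fragments most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

The fragments that retrieve() returns become the context you prepend to the user's question before sending it to the model.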
Private Knowledge Base
Company documents, technical materials, customer information… sensitive data that shouldn’t be uploaded to the cloud. Ollama processes everything locally, meeting privacy compliance requirements.
Offline Scenarios
Business trips, field work, unstable network environments… cloud AI won’t work, but Ollama runs completely offline. As long as your computer has power, your AI assistant is available.
Development Testing Environment
When developing AI applications, frequent cloud API calls are expensive. Using Ollama for development testing saves money and is convenient. Test thoroughly, then switch to cloud APIs for production.
Common Issues
I’ve hit a few pitfalls—here are the solutions.
Model download failed? Usually a network hiccup. Run ollama pull again; downloads resume from where they stopped.
GPU not recognized? Check the drivers first (nvidia-smi for NVIDIA, the ROCm install for AMD), then restart the Ollama service.
Out of memory? Switch to a smaller model (8B instead of 13B) or a smaller quantization like Q4_K_S.
Too slow? Confirm GPU acceleration is actually in use, and try reducing the context length.
Model unloads too quickly? Extend OLLAMA_KEEP_ALIVE, as described in the performance tips above.
Final Thoughts
I’ve shared all this to make local LLMs simpler. Ollama really delivers—no complex configuration, no deep AI background needed, up and running in minutes. Completely free, works offline, data privacy guaranteed.
What to try next:
- Test different models to find what suits your needs
- Integrate Ollama into your development workflow
- Explore RAG architecture to build private knowledge bases
- Deploy on servers to set up team-shared AI services
The local AI era is here. With Ollama, you don’t need to depend on the cloud—you can run powerful language models on your own machine. Now, start your local AI journey.
Author: Easton (Tech blogger focused on AI development and local deployment)
12 min read · Published on: Apr 1, 2026 · Modified on: Apr 1, 2026