Getting Started with Ollama: Your First Step to Running LLMs Locally
Last month, my OpenAI API bill hit $300. Honestly, I was pretty frustrated—I was just testing features during development, didn’t expect the costs to pile up that fast. Worse, I was dealing with documents containing company data, and uploading everything to the cloud just didn’t feel right. That’s when I decided to find a local alternative. After a week of trying different tools, I discovered Ollama was the most hassle-free option.
You might be facing similar challenges: cloud AI services getting expensive, data privacy concerns, or simply needing AI in offline environments. Ollama solves these problems. It lets you run large language models on your own computer—setup takes minutes, it’s completely free, and your data never leaves your machine. Let me share what I’ve learned so you can avoid the mistakes I made.
What is Ollama?
The name can be a bit confusing. Ollama isn’t a model—it’s a tool for running models. Just like Docker made containerization simple, Ollama makes running large language models locally accessible.
Simply put, you use Ollama to download a model (like Llama 3.2), then chat with it, have it write code, do translations—all inference happens on your machine.
Why bother with local models when cloud services like ChatGPT and Claude are so convenient?
Cost. API calls charge by token, and frequent testing during development can lead to surprisingly high bills. I’ve seen people spend hundreds or even thousands in a month during development phases.
Privacy. All conversations get uploaded to cloud servers. If you’re handling company documents, customer data, or sensitive information, this might not meet compliance requirements.
Network dependency. You need to be online. Unstable connections mean poor experience, and completely offline scenarios are impossible.
Rate limits. Cloud APIs have all kinds of restrictions—rate limits, quotas, feature limits… sometimes you just want to test something and get blocked, which is frustrating.
Ollama addresses all these pain points. Once the model is downloaded to your computer, inference happens entirely locally. Works offline, data never leaves your machine, no API fees, no usage limits.
For developers, Ollama offers extra value—it provides a complete API that’s compatible with OpenAI’s interface format. This means you can test with local models during development to save money, then switch to cloud APIs for production. Very flexible.
Installing Ollama: Three Simple Steps
Installation is straightforward. Different platforms have slight variations, but you’ll be up and running in minutes.
Linux (Most Straightforward)
If you’re on Linux, one command does it all:
curl -fsSL https://ollama.com/install.sh | sh
Run this, and Ollama is installed with the service automatically started.
Honestly, Linux is my recommended platform. Server deployment, containerization—Linux handles it all perfectly. If you’re planning to use Ollama in production, Linux is the way to go.
Manual installation is possible, but I don’t recommend it for beginners. If you really need it, you can download the binary and configure systemd services yourself. I covered the steps in my research notes, but won’t go into detail here.
macOS (User-Friendly GUI)
Two options on macOS.
Using Homebrew:
brew install ollama
After installation, run ollama serve directly, or use brew services start ollama to run it in the background.
Don’t want Homebrew? Download the .pkg installer from the website, double-click, and you’re done. Mac users are familiar with this installation method.
A quick tip: Apple Silicon (M1/M2/M3) Macs automatically enable Metal GPU acceleration, and performance is quite good. My M2 MacBook Pro runs 8B models smoothly, and 13B models work too, just with higher memory usage.
Windows (winget is Easiest)
Windows users can use winget:
winget install --id=Ollama.Ollama -e
Or download the .exe installer directly from the website. After installation, Ollama starts automatically, and you’ll see it in the system tray.
Verify installation:
ollama --version
If you see a version number, you’re good to go. Simple, right?
Running Your First Model
With Ollama installed, let’s run a model.
First, pick a model. For beginners, I recommend llama3.2 or qwen:8b. The former is Meta’s latest model with solid overall capabilities; the latter is Qwen, excellent for Chinese language understanding.
Run it:
ollama run llama3.2
This command automatically downloads the model (takes a few minutes the first time), then starts an interactive interface.
You’ll see:
>>> Send a message (/? for help)
Now you can chat. Try:
>>> Hello, introduce yourself
The model responds in real-time, like having a conversation. Exit with /bye or Ctrl+D.
Understanding Model Sizes
You might notice model names contain numbers like 3b or 8b. This represents parameter count—basically the model’s “brain size.”
3B models: 3 billion parameters, runs on lightweight laptops, only needs 4GB RAM. Fast, but relatively basic capabilities. Good for simple conversations and tool commands.
8B models: 8 billion parameters, the “sweet spot” for most laptops and desktops. Needs 8GB RAM, handles daily conversations and simple coding assistance well. This is what I use most often.
13B models: 13 billion parameters, needs 16GB RAM. Better response quality and longer context, suitable for machines with mid-to-high-end graphics cards.
70B models: 70 billion parameters, needs 64GB RAM or more. Professional server-grade, strongest capabilities but high hardware requirements. Honestly, most individual users don’t need models this large.
For beginners, 3B to 8B is sufficient. These run smoothly on most modern laptops. Don’t chase large models at first—start small to get familiar, then upgrade as needed.
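If you want a quick way to sanity-check whether a model fits your machine, the sizes above follow a rough rule of thumb: a 4-bit quantized model takes about 0.5 bytes per parameter on disk, and you want roughly twice that in RAM for comfortable headroom. This is my own back-of-the-envelope estimate, not an official Ollama formula:

```python
def q4_model_size_gb(params_billion: float) -> float:
    """Rough on-disk size of a 4-bit quantized model: ~0.5 bytes per parameter."""
    return params_billion * 0.5

def recommended_ram_gb(params_billion: float) -> float:
    # Rule of thumb: about 2x the model size, leaving room for context and OS.
    return 2 * q4_model_size_gb(params_billion)
```

For an 8B model this gives a ~4GB download and ~8GB of RAM, which matches the guidance above.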
Common Model Management Commands
Ollama provides simple commands for managing models. Just remember a few key ones.
Download models:
ollama pull llama3.2
ollama pull qwen:8b
The pull command downloads from the model library to local storage. Once cached, you don’t need to redownload.
List installed models:
ollama list
Shows all your downloaded models, including name, size, and modification time.
Delete models:
ollama rm llama3.2
Remove models you no longer need to free up space.
View model details:
ollama show llama3.2
Shows technical details like parameter count and quantization method.
Other commands like cp (copy) and push (upload) aren’t used often, so I won’t detail them. Check the documentation when needed.
Recommended Models
Different use cases call for different models:
Daily conversation:
- llama3.2: Meta’s latest, solid overall performance
- qwen:8b: Qwen, excellent for Chinese
- mistral: European open-source model, good performance
Coding assistance:
- codellama: Optimized specifically for code generation
- deepseek-coder: DeepSeek’s code model, works well
Multimodal (can understand images):
- llava: Supports image understanding
- llama3.2-vision: Vision version of Llama 3.2
Not sure which to choose? Try llama3.2:3b or qwen:8b first. Both are great for beginners and run smoothly.
GPU Acceleration: The Speed Difference is Huge
This is really important. CPU can run models too, but the speed difference is massive—GPU can be 10 to 20 times faster.
The first time I used it, I didn’t pay attention to GPU configuration, and the model ran painfully slow. Then I realized GPU acceleration wasn’t enabled. After configuring it, speed immediately improved.
Ollama supports three GPU types: NVIDIA, AMD, and Apple Silicon.
NVIDIA Graphics Cards (RTX Series)
NVIDIA configuration is simplest. Basically automatic.
You need:
- Latest NVIDIA drivers (version 531 or newer)
- CUDA installed
Check if GPU is available:
nvidia-smi
If you see graphics card info, you’re good. Ollama automatically uses GPU when running models.
Specify which GPUs to use:
export CUDA_VISIBLE_DEVICES=0,1
This uses only the first and second GPUs. Sometimes needed on multi-GPU machines.
Using GPU in Docker:
If running Ollama in Docker, you need NVIDIA Container Toolkit:
docker run --gpus all ollama/ollama
This lets the container access the host’s GPU.
AMD Graphics Cards
AMD is slightly more complex, requiring ROCm installation.
AMD GPUs supporting ROCm include:
- Radeon Instinct series (MI100, MI210, MI250, MI300X, etc.)
- Radeon RX series (5700 XT, 5500 XT, 7600, 9070, etc.)
- Radeon PRO series
After installing ROCm v7 or newer, Ollama automatically detects and uses AMD GPUs.
For unsupported GPU architectures, try overriding the environment variable:
export HSA_OVERRIDE_GFX_VERSION=10.3.0
Additionally, AMD has experimental Vulkan support:
export OLLAMA_VULKAN=1
Apple Silicon (M1/M2/M3)
Mac users have it easiest. Apple Silicon automatically enables Metal GPU acceleration—no additional configuration needed.
16GB RAM M-series chips run 8B models smoothly, 32GB can handle 13B or even 30B. Apple Silicon uses unified memory architecture where GPU and CPU share memory, so RAM size directly determines what model sizes you can run.
Performance Tips
Beyond GPU acceleration, several techniques can improve speed:
Quantized models. Default is Q4_K_M quantization, balancing quality and speed. If memory is tight, try Q4_K_S—slightly worse quality but smaller footprint.
Flash Attention. An optimization technique that significantly speeds up large context scenarios:
export OLLAMA_FLASH_ATTENTION=1
Context length. The default context is 2048 tokens. A smaller context saves memory and improves speed, so raise it only when you need the model to handle longer input:
export OLLAMA_CONTEXT_LENGTH=4096
Here 4096 doubles the default to fit longer documents, at the cost of extra memory.
Keep model loaded. By default, models unload after 5 minutes of idle time. If you use it frequently, extend this:
export OLLAMA_KEEP_ALIVE=30m
Or even -1 to never unload (as long as Ollama service is running). This means immediate responses every time without waiting for model reload.
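The keep-alive setting can also be passed per request instead of globally: Ollama's `/api/chat` and `/api/generate` endpoints accept a `keep_alive` field in the JSON body. A small sketch that builds such a body (the helper name is mine, not part of any library):

```python
import json

def chat_payload(model: str, content: str, keep_alive: str = "30m") -> str:
    """Build a /api/chat request body with a per-request keep_alive."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "keep_alive": keep_alive,  # e.g. "30m", or -1 to keep loaded indefinitely
    })
```

This is handy when one application wants the model kept warm but you don't want to change the service-wide default.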
API Integration: Connecting Ollama to Your Applications
This is where Ollama’s real value lies—it provides a complete API for integration into your applications.
REST API Basics
After starting, Ollama provides API services on local port 11434.
Two most commonly used endpoints:
Text generation:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?"
}'
Chat completion:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Default response is streaming, returning results bit by bit. If you want complete results, set "stream": false.
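A non-streaming request from Python looks like the sketch below. It assumes Ollama is running on the default port 11434; the helper only builds the JSON body, and the actual network call is shown in the comment so the snippet works even without a running server:

```python
import json

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> str:
    """Build the JSON body for a /api/generate request, non-streaming by default."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

# To send it for real (requires a running Ollama server):
#   import urllib.request
#   req = urllib.request.Request(
#       OLLAMA_GENERATE_URL,
#       data=build_generate_payload("llama3.2", "Why is the sky blue?").encode(),
#       headers={"Content-Type": "application/json"})
#   print(json.loads(urllib.request.urlopen(req).read())["response"])
```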
OpenAI-Compatible API—Most Valuable Feature
This is incredibly practical. Ollama provides OpenAI-compatible interfaces, meaning your existing OpenAI code barely needs changes to switch to local models.
OpenAI-compatible endpoint:
http://localhost:11434/v1/chat/completions
Python example:
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Doesn't matter what you put here
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "user", "content": "Say this is a test"}
]
)
print(response.choices[0].message.content)
JavaScript example:
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama'
});
const response = await client.chat.completions.create({
model: 'llama3.2',
messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.choices[0].message.content);
This way, you can test with local models during development to save money, then switch to cloud APIs for production stability. Or go fully local for cost control and data privacy.
My approach: configure an environment variable to automatically switch between local and cloud based on environment. Development defaults to local, only switching to cloud when explicitly needed.
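A minimal sketch of that switch, assuming an environment variable I'm calling LLM_BACKEND (the name and default models are my own choices, not a convention):

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    api_key: str
    model: str

def llm_config() -> LLMConfig:
    """Default to local Ollama; use the cloud only when explicitly requested."""
    if os.environ.get("LLM_BACKEND") == "cloud":
        return LLMConfig(
            base_url="https://api.openai.com/v1",
            api_key=os.environ["OPENAI_API_KEY"],
            model="gpt-4o-mini",
        )
    return LLMConfig(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # Ollama ignores the key, but the client requires one
        model="llama3.2",
    )
```

Pass the fields straight into the client: `OpenAI(base_url=cfg.base_url, api_key=cfg.api_key)`. The rest of your code never needs to know which backend it's talking to.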
Ollama’s Python Library
Besides the OpenAI library, Ollama has its own Python library that’s more direct:
pip install ollama
Basic usage:
import ollama
response = ollama.chat(model='llama3.2', messages=[
{'role': 'user', 'content': 'Why is the sky blue?'},
])
print(response['message']['content'])
Streaming mode:
import ollama
stream = ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Tool Calling
Ollama supports tool calling, enabling models to call external functions or APIs. This is useful for building complex AI applications.
Example:
import ollama
response = ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'What is the weather in Toronto?'}],
tools=[{
'type': 'function',
'function': {
'name': 'get_current_weather',
'description': 'Get the current weather for a city',
'parameters': {
'type': 'object',
'properties': {
'city': {'type': 'string', 'description': 'The name of the city'},
},
'required': ['city'],
},
},
}],
)
print(response['message']['tool_calls'])
The model decides whether tool calling is needed based on the question, then returns call parameters. Your code then actually calls the tool and returns results to the model.
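The "your code actually calls the tool" step can be sketched like this. The dispatcher works on the tool-call structure the Ollama library returns (each call carrying a function name and already-parsed arguments); get_current_weather is a stub standing in for a real weather API:

```python
def get_current_weather(city: str) -> str:
    # Stub: a real implementation would query a weather service.
    return f"18C and sunny in {city}"

# Map tool names the model may request to the functions that implement them.
TOOLS = {"get_current_weather": get_current_weather}

def run_tool_calls(tool_calls):
    """Execute each requested tool call and collect 'tool' role messages
    to append to the conversation for the model's next turn."""
    results = []
    for call in tool_calls or []:
        fn = call["function"]
        output = TOOLS[fn["name"]](**fn["arguments"])
        results.append({"role": "tool", "content": output})
    return results
```

Append the returned messages to the conversation and call `ollama.chat` again so the model can phrase its final answer using the tool results.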
Ollama vs LM Studio: Which to Choose?
You might have heard of LM Studio, another tool for running LLMs locally. How do you choose?
Honestly, both have their strengths—it depends on your needs.
Ollama:
- Command-line focused, developer-friendly
- Simple installation, one command
- Automatic API service startup
- Perfect for server and container deployment
- Ideal for building applications, automation integration
LM Studio:
- Graphical interface, intuitive
- Built-in model browser, easy downloads
- Manual API service startup required
- Not suitable for server deployment
- Good for exploring models, learning and testing
Simply put:
- Building AI applications, server deployment, script automation → Ollama
- Quickly experiencing various models, not comfortable with command line, just exploring → LM Studio
Actually, they complement each other. I discover and test new models in LM Studio, then integrate them formally in Ollama. Both are built on llama.cpp, so models are fully compatible.
Practical Use Cases
Let me share some scenarios where I’ve actually used it.
Local Code Assistant
Integrate Ollama into VS Code or Cursor to help write, explain, and refactor code. I recommend codellama or deepseek-coder.
Tools like Continue.dev and Aider can directly connect to Ollama’s API. Once configured, you have a free local code assistant.
Document Q&A System
If you have many documents for retrieval and Q&A, combine Ollama with a vector database to build a private knowledge base. The process:
- Use Ollama’s embeddings interface to generate document vectors
- Store in a vector database (Chroma, Qdrant both work)
- When users ask questions, first retrieve relevant document fragments
- Use fragments as context for model responses
This is the classic RAG (Retrieval-Augmented Generation) architecture. Completely local, data never leaves your network.
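The retrieval step of that pipeline can be sketched as follows. To keep the example runnable offline, embed() here is a toy bag-of-letters stand-in; in a real system you would generate vectors with an embedding model through Ollama's embeddings interface and store them in Chroma or Qdrant:

```python
import math

def embed(text: str) -> list[float]:
    # Toy character-frequency "embedding" -- replace with a real embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k document fragments most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

The fragments that retrieve() returns become the context you prepend to the user's question before sending it to the model.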
Private Knowledge Base
Company documents, technical materials, customer information… sensitive data that shouldn’t be uploaded to the cloud. Ollama processes everything locally, meeting privacy compliance requirements.
Offline Scenarios
Business trips, field work, unstable network environments… cloud AI won’t work, but Ollama runs completely offline. As long as your computer has power, your AI assistant is available.
Development Testing Environment
When developing AI applications, frequent cloud API calls are expensive. Using Ollama for development testing saves money and is convenient. Test thoroughly, then switch to cloud APIs for production.
Common Issues
I’ve hit a few pitfalls—here are the solutions.
Model download failed? Usually a network hiccup. Run ollama pull again; downloads resume from where they stopped.
GPU not recognized? Check the drivers first (nvidia-smi for NVIDIA, the ROCm install for AMD), then restart the Ollama service.
Out of memory? Switch to a smaller model (8B instead of 13B) or a smaller quantization like Q4_K_S.
Too slow? Confirm GPU acceleration is actually in use, and try reducing the context length.
Model unloads too quickly? Extend OLLAMA_KEEP_ALIVE, as described in the performance tips above.
Final Thoughts
I’ve shared all this to make local LLMs simpler. Ollama really delivers—no complex configuration, no deep AI background needed, up and running in minutes. Completely free, works offline, data privacy guaranteed.
What to try next:
- Test different models to find what suits your needs
- Integrate Ollama into your development workflow
- Explore RAG architecture to build private knowledge bases
- Deploy on servers to set up team-shared AI services
The local AI era is here. With Ollama, you don’t need to depend on the cloud—you can run powerful language models on your own machine. Now, start your local AI journey.
Author: Easton (Tech blogger focused on AI development and local deployment)
12 min read · Published on: Apr 1, 2026 · Modified on: Apr 1, 2026