
Ollama Multi-Model Deployment: Running Qwen, Llama, and DeepSeek in Parallel

At 3 AM, I stared at the logs scrolling in my terminal, switching models for the seventeenth time. ollama run deepseek-coder for code, ollama run qwen2.5 for translation, ollama run llama3.2 for general questions. Each switch meant waiting over ten seconds, the GPU fans whirring like they were protesting my antics.

After more than half a month of this, I finally realized: since Ollama 0.2, multiple models can run simultaneously. No more switching back and forth. No more reloading every time. One service, three models, called on demand.

This article is about how to deploy Qwen, Llama, and DeepSeek on a single machine, letting each handle what it’s best at, without maxing out your GPU memory. I’ll share specific configuration methods, the strengths of each model, and the pitfalls I’ve encountered.

Basic Configuration for Ollama Multi-Model Parallel Execution

Good news first: since Ollama 0.2, multi-model parallel execution is natively supported. No extra plugins, no complex configuration files. A few environment variables and multiple models will be at your service.

Bad news: if your GPU VRAM isn’t enough, these models might eat up all your system memory, and your computer will feel like an old ox pulling a cart—slow enough to make you question your life choices.

Three Key Environment Variables

Open your terminal and look at these three variables:

# Maximum number of models to load simultaneously
export OLLAMA_MAX_LOADED_MODELS=3

# Maximum concurrent requests per model
export OLLAMA_NUM_PARALLEL=2

# Maximum queue length (rejects new requests when exceeded)
export OLLAMA_MAX_QUEUE=512

OLLAMA_MAX_LOADED_MODELS is the core. Set it to 3, and you can run Qwen, Llama, and DeepSeek simultaneously. But don’t rush to set it high—this value is limited by your hardware. I’ll cover memory requirements in detail later.

OLLAMA_NUM_PARALLEL controls the concurrency of a single model. If your service receives multiple requests simultaneously, you might want to set this higher. Honestly, for personal use, the default value is sufficient.
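Once these variables are in place, you can sanity-check the setup. The snippet below assumes the default install, where ollama serve picks up environment variables from the shell; ollama ps lists the models currently held in memory, so loading two models back to back should show both instead of the second evicting the first:

```shell
# Apply the limits for this shell session, then (re)start the server
export OLLAMA_MAX_LOADED_MODELS=3
export OLLAMA_NUM_PARALLEL=2
ollama serve &

# Warm up two models, then verify both stay loaded
ollama run qwen2.5:7b "hi" > /dev/null
ollama run llama3.2:3b "hi" > /dev/null
ollama ps   # both models should appear in the list
```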

System Service Configuration (Linux)

If you’re using systemd to manage the Ollama service (the default for most Linux distributions), configuration is straightforward:

sudo systemctl edit ollama.service

Then add the following in the editor:

[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_KEEP_ALIVE=30m"

Save and exit, then restart the service:

sudo systemctl restart ollama

The OLLAMA_KEEP_ALIVE variable is interesting. It controls how long models stay “on standby” in memory. Set it to 30m, and loaded models will stay in memory for 30 minutes, so the next call won’t need to reload. If you want them to stay indefinitely, set it to -1.

Is Your Memory Enough? A Quick Check

Configuration done, but is your memory sufficient?

A rough estimate: a 7B model needs about 4-5GB of VRAM with Q4 quantization, and a 14B model needs 8-10GB. To load three 7B models simultaneously, you want at least 16GB of VRAM on your GPU.

My RTX 3060 has only 12GB VRAM. Running two 7B models is smooth, but the third model has to borrow system memory—speed drops significantly. If you’re using a Mac with Apple Silicon, system memory and GPU memory are shared; 32GB unified memory can handle three models easily.
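You can sanity-check these numbers yourself. The sketch below is my own back-of-envelope formula (bytes per weight times parameter count, plus a roughly 20% allowance for KV cache and runtime overhead; the 1.2 factor is an assumption, not an Ollama constant):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: raw weight size plus ~20% for KV cache and activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3

print(round(estimate_vram_gb(7, 4), 1))      # 7B at Q4 (~4 bits/weight) -> 3.9
print(round(estimate_vram_gb(7, 16), 1))     # 7B at FP16 -> 15.6
print(round(3 * estimate_vram_gb(7, 4), 1))  # three 7B models at Q4 -> 11.7
```

Real usage also grows with context length, so treat these numbers as a floor rather than a guarantee.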

Three Models: Characteristics and Selection Strategy

Alright, configuration is done. Let’s talk about how to choose between these three models. Honestly, I was pretty confused at first—I downloaded all three, but every time I needed to use one, I didn’t know which to call. After some exploration, I’ve summarized a few patterns.

Qwen: The King of Chinese and Multi-language

Alibaba’s open-source Qwen series has really strong Chinese capabilities. I’ve used it to write quite a few Chinese documents and translate several English technical articles with good results. According to official data, Qwen supports over 100 languages—not just Chinese and English, but Japanese, Korean, French, Spanish, and more.

If you frequently work with Chinese content or do multi-language translation, Qwen is your first choice. I used qwen2.5:7b to translate a 5,000-word technical document, and the translation quality was better than many online translation tools. Especially for technical terms—it won’t translate “API endpoint” into something weird like “API 终点”.

Llama: The Most Balanced All-Rounder

Meta’s Llama series is the “big brother” among open-source models. If you’re not sure about the task type, just use Llama—you probably won’t go wrong.

It has a major advantage: very permissive commercial licensing. As long as your product has fewer than 700 million monthly active users, you can use it for free commercially. This is an important consideration for many independent developers and small teams.

Another advantage is the large context window. Llama 3.2 can handle up to 128K tokens—roughly enough to fit an entire novel. If you need to process very long documents, this advantage is obvious.

DeepSeek: A Powerhouse for Coding and Reasoning

DeepSeek is one I only started using recently, but I fell in love with it quickly. Especially for coding tasks, its performance surprised me—the generated code quality is high, and it actively explains the code logic.

According to Premai’s comparison report, DeepSeek’s reasoning cost is 95% lower than comparable models. For those who need lots of reasoning tasks, this is a real advantage. Lower cost means you can run more tasks on limited hardware.

Which to Choose? A Quick Reference Table

| Task Type | Recommended Model | Reason |
| --- | --- | --- |
| Chinese writing, translation | Qwen | Strongest Chinese capability, 100+ language support |
| Multi-language content | Qwen | Leading multi-language processing capability |
| Code generation, debugging | DeepSeek | Strong coding ability, clear explanations |
| Technical reasoning, analysis | DeepSeek | Lowest reasoning cost, high efficiency |
| General Q&A, chat | Llama | Balanced capabilities, won't go wrong |
| Long document processing | Llama | 128K context window |
| Commercial projects (<700M MAU) | Llama | Most permissive commercial license |

My personal habit: DeepSeek for coding, Qwen for Chinese documents, Llama for English content or uncertain tasks. The three models have clear divisions of labor, and I rarely need to struggle with which one to use.

Model Switching and Memory Management

Honestly, when I first started using multiple models, my biggest pain point was slow switching. Every call to a different model meant waiting while it reloaded, GPU fans whirring, and the wait got old fast. Later I learned this delay can be optimized.

What Causes Switching Delay?

Model switching delay ranges from about 10 to 30 seconds. The exact time depends on model size, disk read/write speed, and your memory state. Loading a 7B model from disk to GPU takes about 10-15 seconds; a 14B model might take 20-30 seconds.

This delay is really annoying in practice. Especially when debugging code—you just used DeepSeek to generate a function, then want to use Qwen to translate comments, and you have to wait forever.

keep_alive: Keep Models “On Standby”

The solution is the OLLAMA_KEEP_ALIVE mentioned earlier.

The principle is simple: after a model loads once, keep it in memory without unloading. Next time you call the same model, use it directly without reloading.

You can set it globally:

export OLLAMA_KEEP_ALIVE=30m  # Keep models for 30 minutes after loading
# or
export OLLAMA_KEEP_ALIVE=-1   # Keep indefinitely until manually unloaded

Or override this setting in a single request:

curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "", "keep_alive": "60m"}'

Sending a request with an empty prompt loads the model and keeps it in memory. This is a handy way to preload: warm up your most-used models in advance, and subsequent calls respond instantly.

Manually Unloading Models

If memory is tight, you can manually unload a specific model:

curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "keep_alive": 0}'

Setting keep_alive to 0 will immediately unload the model from memory. This operation doesn’t delete the model file, it just releases the occupied memory.

Memory Requirements by Model Size

I compiled this table over time, for reference only (data from community feedback and my own testing):

| Model Size | FP16 | Q4 Quantization | Recommended GPU |
| --- | --- | --- | --- |
| 7B | 14-16GB | 4-5GB | RTX 3060 (12GB) or better |
| 14B | 28-32GB | 8-10GB | RTX 4070 (12GB) or better |
| 32B | 64-70GB | 18-20GB | RTX 4090 (24GB) or better |

My RTX 3060 has 12GB VRAM. With Q4 quantization, I can run one 14B and one 7B simultaneously, or three 7B models. The third model will borrow system memory—slower, but won’t crash.

If your GPU VRAM isn’t enough but you have plenty of system memory (32GB+), you can also use CPU inference. Much slower, but it works. Sometimes working is enough.
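If you want to force a particular request onto the CPU, Ollama's generate API accepts a num_gpu option (the number of layers offloaded to the GPU); setting it to 0 should keep the whole model on the CPU. A minimal sketch, assuming the default server on port 11434:

```shell
# Run one request CPU-only by offloading zero layers to the GPU
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Hello", "options": {"num_gpu": 0}, "stream": false}'
```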

A Small Tip: Prioritize High-Frequency Models

I have a habit: load the most frequently used model to GPU first, let other models borrow system memory.

For example, in my workflow, I use DeepSeek most, so I keep it in GPU on standby. Qwen and Llama are used less often, loaded only when called, possibly borrowing system memory.

This arrangement gives the fastest response for high-frequency tasks, slightly slower for low-frequency ones, creating a better overall experience. You can adjust this order based on your own usage habits.
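As a sketch, a small warm-up script run at login can encode this priority order. The model names and standby times below are just my own setup; adjust them to yours:

```shell
# Pin the high-frequency model in memory until it is manually unloaded
curl http://localhost:11434/api/generate \
  -d '{"model": "deepseek-coder:6.7b", "prompt": "", "keep_alive": -1}'

# Give a lower-frequency model a shorter standby window
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "", "keep_alive": "10m"}'
```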

Practical Application Scenarios

Enough theory; let’s talk about practical usage. Here are a few scenarios I use regularly, and maybe they’ll give you some ideas.

Coding Assistant: Code + Documentation

This is my most-used scenario. When writing code, two models work together:

  • DeepSeek generates code, explains the logic, and helps with debugging
  • Qwen translates code comments, generates Chinese documentation

For example, when I’m writing an API interface, I first use DeepSeek to generate a code skeleton:

curl http://localhost:11434/api/generate \
  -d '{"model": "deepseek-coder:6.7b", "prompt": "Write an Express.js RESTful API handling user registration and login"}'

After code generation, use Qwen to translate comments to Chinese:

curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "prompt": "Translate the English comments in this code to Chinese:\n[code content]"}'

Splitting the work between two models like this is much more efficient than leaning on just one. DeepSeek’s code quality is reliable, and Qwen’s translations are natural and smooth.

Smart Customer Service: Chinese-English Bilingual

If you’re building a customer service system for international users, you can arrange it like this:

  • Llama handles English user inquiries
  • Qwen handles Chinese user inquiries

Implementation isn’t complex either. Detect the language of user input, then call the corresponding model:

import requests

def get_response(user_input, language):
    model = "llama3.2:3b" if language == "en" else "qwen2.5:7b"

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": user_input, "stream": False}
    )

    return response.json()["response"]

This way each user gets natural responses in their own language, a much better experience than a single-language model can offer.
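The language check itself can be a one-line heuristic for a Chinese/English split: if the input contains any CJK character, route it to Qwen. This is a naive sketch (a dedicated library such as langdetect is more robust once you support many languages):

```python
def detect_language(text: str) -> str:
    """Naive routing heuristic: any CJK character means the Chinese model."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "en"

print(detect_language("How do I reset my password?"))  # -> en
print(detect_language("怎么重置密码？"))                 # -> zh
```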

Knowledge Q&A: Reasoning + Chat

Sometimes user question types differ, requiring different handling approaches:

  • DeepSeek handles questions requiring reasoning (like “why”, “how to”)
  • Llama handles open-ended Q&A (like “what do you think”, “your opinion”)

Reasoning questions need logical analysis; DeepSeek’s reasoning capability is stronger, answers are more organized. Open-ended Q&A doesn’t need rigorous reasoning; Llama’s responses are more natural, more conversational.

You can judge the type based on question keywords. If the question contains “why”, “reason”, “principle”, use DeepSeek; if it contains “what do you think”, “your opinion”, “let’s discuss”, use Llama.
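That keyword check can be sketched in a few lines. The keyword list below is illustrative, not exhaustive:

```python
REASONING_KEYWORDS = ("why", "reason", "principle", "how to")

def pick_model(question: str) -> str:
    """Route reasoning-style questions to DeepSeek; everything else goes to Llama."""
    q = question.lower()
    if any(keyword in q for keyword in REASONING_KEYWORDS):
        return "deepseek-coder:6.7b"
    return "llama3.2:3b"  # open-ended and general questions suit the all-rounder

print(pick_model("Why does TCP need a three-way handshake?"))  # -> deepseek-coder:6.7b
print(pick_model("What do you think about remote work?"))      # -> llama3.2:3b
```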

A Complete Example

Here’s a simple Python script that automatically selects the model based on task type:

import requests

def call_ollama(task_type, prompt):
    """Select appropriate model based on task type"""

    model_map = {
        "code": "deepseek-coder:6.7b",
        "chinese": "qwen2.5:7b",
        "english": "llama3.2:3b",
        "reasoning": "deepseek-coder:6.7b",
        "general": "llama3.2:3b"
    }

    model = model_map.get(task_type, "llama3.2:3b")

    # Preload model (optional)
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": "", "keep_alive": "30m"}
    )

    # Actual call
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    )

    return response.json()["response"]

# Example call
result = call_ollama("code", "Write a Python function to calculate the Nth Fibonacci number")
print(result)

This script is simple but already implements basic multi-model scheduling. You can extend it as needed, adding more task types, more complex selection logic, or combining responses from multiple models.

By the way, if you want to learn more about Ollama basics, check out the first two articles in this series: “Ollama Introduction: Your First Step Running LLMs Locally” and “Ollama Modelfile Parameters: A Complete Guide to Creating Custom Models”. This article continues the series, covering this advanced topic of multi-model deployment.

Configuration Quick Reference

| Variable | Recommended Value | Purpose |
| --- | --- | --- |
| OLLAMA_MAX_LOADED_MODELS | 2-3 | Number of models loaded simultaneously |
| OLLAMA_NUM_PARALLEL | 2 | Concurrent requests per model |
| OLLAMA_KEEP_ALIVE | 30m or -1 | Model standby time |
| OLLAMA_MAX_QUEUE | 512 | Request queue length |

Summary

After all this talk, the core comes down to three points:

  1. Configure three environment variables, and multi-model parallel execution is ready. OLLAMA_MAX_LOADED_MODELS controls quantity, OLLAMA_KEEP_ALIVE controls standby time.

  2. Choose models by task. DeepSeek for code, Qwen for Chinese, Llama for general use. Don’t overthink it—each has its strengths.

  3. Memory management matters. Preload high-frequency models, load low-frequency ones on demand. Not enough GPU memory? Borrow system memory—it works.

If you have a GPU with 16GB+ VRAM, or a Mac with 32GB+ unified memory, give three-model parallel deployment a try. There might be a learning curve at first, but once configured, the experience is far better than with a single model. No frequent switching, no waiting for loading; that smooth feeling is pretty satisfying.

Any questions or pitfalls you’ve encountered, feel free to leave a comment. I’m learning too—maybe I have the answer to your problem.

11 min read · Published on: Apr 6, 2026 · Modified on: Apr 8, 2026
