Ollama Production Monitoring: Logging Configuration and Prometheus Alerting in Practice
3:17 AM. My phone vibrates once on the nightstand, then a second time, a third. I groggily swipe the screen, and the red Slack alert burns my eyes: Ollama API timeout - service unavailable.
My first thought: this is bad.
We had just launched a Llama 3.1-based customer service system two weeks earlier. The user base was small, maybe a few hundred calls per day. When we deployed it, I’ll admit I was a bit nervous—we only had basic logging configured, no monitoring or alerting whatsoever. The plan was “let’s get it running first.” The result? When I was woken up at 3 AM, I had no idea what was wrong. GPU memory full? Process crashed? Network issue? I was completely in the dark.
That incident took until 6 AM to resolve. In the post-mortem, I found that many teams are in the same boat.
Lack of monitoring is one of the main culprits.
This article is my attempt to help you avoid the pitfalls I encountered. I’ll share a complete solution—from logging configuration to Prometheus + Grafana monitoring to AlertManager setup—with configuration files you can copy and use directly. Following this guide, you can set up a production-grade monitoring system in about 30 minutes. Honestly, if I’d had this setup back then, I could have slept at least three more hours that night.
Core Challenges of Production Monitoring
Ollama is different from typical web services. It’s a “resource hog.” Each loaded model consumes 4 to 16 GB of memory alone (data from Markaicode’s benchmarks). And cold starts—loading models from disk to memory—take 10 to 30 seconds. This means if your service crashes and restarts, users have to wait half a minute before getting a response.
The pitfalls I’ve encountered include:
Memory leaks and GPU exhaustion. After running for extended periods, Ollama sometimes “forgets” to release GPU memory. I’ve seen a 24GB VRAM machine that, after two days of running, had only 2GB available—all new requests were rejected. The problem was, I had no idea what was happening until users started complaining.
Request queue buildup. Inference is inherently slow; a single request can take 5-20 seconds. If dozens of requests arrive simultaneously, the queue grows longer and longer until timeouts occur. But how do you know if the queue is backing up? You can only guess.
Model loading latency. When switching between multiple models, loading time is a black box. Users don’t know why responses are slow, and neither do you.
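The queue-buildup pitfall above is really just arithmetic: once requests arrive faster than the model can serve them, the backlog grows without bound. A back-of-envelope sketch (the 8-second service time is an assumed value inside the 5–20 s range mentioned above):

```python
# Back-of-envelope: why inference queues blow up. With ~8 s per request and
# one arrival every 5 s, the backlog grows steadily and never drains.
service_time_s = 8.0      # assumed seconds per request (within the 5-20 s range)
arrival_interval_s = 5.0  # assumed: one request every 5 seconds

service_rate = 1 / service_time_s      # ~0.125 requests/s served
arrival_rate = 1 / arrival_interval_s  # 0.200 requests/s arriving
growth = arrival_rate - service_rate   # net backlog growth per second

print(f"backlog grows by {growth:.3f} req/s "
      f"-> ~{growth * 3600:.0f} queued after an hour")
# -> backlog grows by 0.075 req/s -> ~270 queued after an hour
```

This is exactly why a queue-depth or latency metric matters: by the time users notice timeouts, the backlog has been growing for a while.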
So the monitoring objectives are clear: service availability (is the process still running?), performance metrics (how fast are responses?), resource utilization (how much GPU memory is left?), and error rate (how many requests failed?). Once these four dimensions are covered, you can have peace of mind.
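To make those four dimensions concrete, here is a minimal sketch of a health summary. The `HealthSnapshot` type is invented for illustration (Ollama exposes nothing like it directly); the thresholds mirror the alert tiers used later in this article:

```python
# Illustrative sketch: the four monitoring dimensions rolled into one status.
# HealthSnapshot and its thresholds are assumptions for this example, chosen
# to match the critical/warning tiers defined later in the article.
from dataclasses import dataclass

@dataclass
class HealthSnapshot:
    process_up: bool          # 1. availability
    p95_latency_s: float      # 2. performance
    gpu_mem_used_frac: float  # 3. resource utilization (0.0-1.0)
    requests_total: int       # 4. error rate (numerator/denominator)
    requests_failed: int

    @property
    def error_rate(self) -> float:
        if self.requests_total == 0:
            return 0.0
        return self.requests_failed / self.requests_total

    def status(self) -> str:
        if (not self.process_up or self.gpu_mem_used_frac > 0.95
                or self.error_rate > 0.20):
            return "critical"
        if (self.p95_latency_s > 60 or self.gpu_mem_used_frac > 0.80
                or self.error_rate > 0.05):
            return "warning"
        return "ok"

snap = HealthSnapshot(True, 12.0, 0.72, 1000, 20)
print(snap.status())  # -> ok  (2% errors, 72% VRAM, 12 s P95)
```

Everything that follows in this article is essentially about collecting these inputs automatically and reacting when `status()` would stop saying "ok".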
For monitoring solution selection, I’ve tried several combinations. For small teams, Prometheus + Grafana is sufficient; if you need to track LLM Prompts and responses, Langfuse is excellent; for enterprise environments, consider SigNoz, which is based on OpenTelemetry and unifies logs, metrics, and traces. I’ll focus on the Prometheus solution since it’s the most universal foundation.
Logging Configuration and systemd Service Optimization
Getting Ollama running is easy, but keeping it stable requires getting logging right first. I learned this the hard way—when something went wrong and I went to check the logs, I found nothing was recorded, or the log files had ballooned to dozens of GB and filled the disk.
systemd Service Configuration
If you installed Ollama using the official script, it already created a systemd service for you. But the default configuration is basic. For production environments, you need to add a few things:
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network.target
[Service]
Type=simple
User=ollama
Group=ollama
# Working directory
WorkingDirectory=/usr/share/ollama
# Environment variables
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_LOG_FORMAT=json"
# Resource limits (adjust based on your hardware)
LimitNOFILE=65535
LimitNPROC=4096
MemoryMax=32G
# Auto-restart strategy
Restart=always
RestartSec=10
# Startup command
ExecStart=/usr/local/bin/ollama serve
# Standard output and error output
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Let me share some key lessons learned:
Restart=always and RestartSec=10: Automatically restart the process after abnormal exit. The 10-second wait gives the system some breathing room. I once encountered repeated crashes due to memory exhaustion—without this interval, it would have restarted frantically and flooded the logs.
MemoryMax=32G: Limit the maximum memory Ollama can use. This is critical if your machine runs other services. I once didn’t set a limit, and Ollama consumed all 64GB of memory—I couldn’t even SSH in.
OLLAMA_DEBUG=1 and OLLAMA_LOG_FORMAT=json: I recommend enabling debug mode in production—it’s invaluable when troubleshooting issues. JSON format makes it easier to parse logs with tools later.
After modifying the configuration, don’t forget to reload:
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl enable ollama # Enable on boot
Docker Deployment Logging Configuration
If running with Docker, log management is even easier to mess up. Docker writes logs to /var/lib/docker/containers/ by default, and without limits, they grow indefinitely.
My docker-compose configuration looks like this:
# docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: always
ports:
- "11434:11434"
volumes:
- ./ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0:11434
- OLLAMA_DEBUG=1
deploy:
resources:
limits:
memory: 32G
logging:
driver: "json-file"
options:
max-size: "100m"
max-file: "5"
max-size: "100m" means each log file is capped at 100MB, and max-file: "5" keeps 5 files maximum. That’s at most 500MB of logs—enough for troubleshooting without filling the disk.
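The same arithmetic lets you tune the rotation to your own disk budget:

```python
# Docker's json-file driver caps log disk usage at max-size x max-file.
def log_cap_mb(max_size_mb: int, max_file: int) -> int:
    return max_size_mb * max_file

print(log_cap_mb(100, 5))   # the config above -> 500
print(log_cap_mb(50, 10))   # finer-grained rotation, same 500 MB cap
```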
Log Level Reference
Ollama supports these environment variables:
| Variable | Description | Production Recommendation |
|---|---|---|
| OLLAMA_DEBUG | Set to 1 to enable detailed logging | Enabled (recommended) |
| OLLAMA_LOG_LEVEL | Log level (INFO/DEBUG/WARN) | INFO or DEBUG |
| OLLAMA_LOG_FORMAT | Log format (text/json) | JSON |
I generally keep DEBUG enabled—disk space isn’t an issue, and it saves a lot of time when troubleshooting.
Practical journalctl Logging
Once configured, use journalctl to view logs:
# View logs in real-time
sudo journalctl -u ollama -f
# View last 100 lines
sudo journalctl -u ollama -n 100
# View today's logs
sudo journalctl -u ollama --since today
# Search for specific keywords
sudo journalctl -u ollama | grep -i "error"
# Export logs to file
sudo journalctl -u ollama --since "2026-04-12 00:00:00" > ollama-debug.log
Here’s a tip: if you enabled JSON format logging, you can use jq to parse:
sudo journalctl -u ollama -o json | jq '.MESSAGE | fromjson? | select(.level=="error")'
This filters to error-level logs only—no more digging through piles of INFO entries.
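If you would rather do the filtering in code, the same idea works in Python. One subtlety: journald wraps every record in its own JSON and puts the raw log line in the MESSAGE field, so the Ollama JSON payload has to be parsed out of it. A sketch, assuming a `level` field in the JSON log format:

```python
# Sketch: pull error-level entries out of `journalctl -u ollama -o json`.
# journald's JSON export puts the raw log line in the MESSAGE field; when
# OLLAMA_LOG_FORMAT=json is set, that line is itself JSON.
import json

def error_entries(lines):
    """Yield parsed log payloads with level == "error" from journald JSON records."""
    for line in lines:
        try:
            entry = json.loads(line)
            payload = json.loads(entry.get("MESSAGE", ""))
        except (json.JSONDecodeError, TypeError):
            continue  # record isn't JSON, or MESSAGE isn't a JSON log line
        if payload.get("level") == "error":
            yield payload

# Usage: export first, then feed the file in:
#   sudo journalctl -u ollama -o json > journal.json
#   errors = list(error_entries(open("journal.json")))
sample = '{"MESSAGE": "{\\"level\\": \\"error\\", \\"msg\\": \\"cuda out of memory\\"}"}'
print(list(error_entries([sample])))
# -> [{'level': 'error', 'msg': 'cuda out of memory'}]
```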
Prometheus + Grafana Monitoring Solution
Logging is for post-mortem analysis; monitoring is the early warning system. I’ve been using Prometheus + Grafana for over two years. The setup can be tedious, but it’s stable, reliable, and has abundant community resources.
ollama-exporter Deployment
Ollama doesn’t expose Prometheus metrics directly—you need an exporter to collect them. I use frcooper/ollama-exporter. While it has only 36 stars, it does the job.
There are two deployment options: run the binary directly, or use Docker. I recommend Docker:
# Add exporter service to docker-compose.yml
services:
ollama-exporter:
image: frcooper/ollama-exporter:latest
container_name: ollama-exporter
restart: always
ports:
- "9101:9101"
environment:
- OLLAMA_HOST=ollama:11434 # Point to ollama container
depends_on:
- ollama
Then the Prometheus configuration:
# prometheus.yml
global:
scrape_interval: 30s # Scrape interval, Markaicode recommends 30 seconds
evaluation_interval: 30s
scrape_configs:
- job_name: 'ollama-exporter'
static_configs:
- targets: ['ollama-exporter:9101']
labels:
instance: 'ollama-prod'
# GPU monitoring (if using NVIDIA)
- job_name: 'nvidia-gpu'
static_configs:
- targets: ['localhost:9835']
Add Prometheus to docker-compose as well:
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: always
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
volumes:
prometheus_data:
Key Monitoring Metrics
The ollama-exporter collects these metrics—here are the important ones:
| Metric Name | Description | Watch For |
|---|---|---|
| ollama_requests_total | Total requests | Error rate calculation |
| ollama_requests_failed | Failed requests | Direct monitoring |
| ollama_model_load_duration_seconds | Model load time | Cold start performance |
| ollama_request_duration_seconds | Request response time | P95/P99 latency |
| ollama_tokens_per_second | Inference speed | Throughput |
There are also system-level metrics (requiring node-exporter):
- CPU utilization: node_cpu_seconds_total
- Memory utilization: node_memory_MemAvailable_bytes
- Network traffic: node_network_receive_bytes_total
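Since P95 via histogram_quantile shows up in every dashboard and alert rule in this article, it is worth seeing what it actually computes. A simplified re-implementation (the bucket bounds are invented for illustration; Prometheus performs the same linear interpolation inside the winning bucket):

```python
# Simplified model of PromQL's histogram_quantile(): find the first cumulative
# bucket whose count reaches the target rank, then interpolate linearly.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation within this bucket, as Prometheus does
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 10 under 1s, 60 under 5s, 90 under 10s, all under 30s.
buckets = [(1, 10), (5, 60), (10, 90), (30, 100)]
print(histogram_quantile(0.95, buckets))  # -> 20.0 (95th request lands in the 10-30s bucket)
```

The takeaway for alerting: the P95 value is an estimate whose precision depends on bucket layout, so don't treat a threshold like "P95 > 60s" as exact to the second.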
GPU Monitoring Configuration
The GPU is the heart of an LLM service—monitoring must be thorough. I use nvidia_gpu_prometheus_exporter:
# Install NVIDIA GPU exporter
docker run -d \
--name nvidia-exporter \
--restart always \
-p 9835:9835 \
--gpus all \
nvidia/gpu-prometheus-exporter:latest
It outputs these key metrics:
- nvidia_gpu_utilization: GPU utilization
- nvidia_gpu_memory_used_bytes: Memory usage
- nvidia_gpu_memory_free_bytes: Available memory
- nvidia_gpu_temperature: GPU temperature
In multi-GPU environments, metrics include a gpu_id label, allowing you to display each card separately in Grafana.
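A toy sketch of what that per-card breakdown looks like numerically—computing a usage percentage per gpu_id from used/free samples (all values invented for illustration):

```python
# Sketch: per-GPU usage percent from used/free samples keyed by the gpu_id
# label -- the same per-card series Grafana renders from these metrics.
GiB = 2**30
samples = {
    # (metric, gpu_id) -> bytes; illustrative values for a 24 GB card pair
    ("used", "0"): 20 * GiB,
    ("free", "0"): 4 * GiB,
    ("used", "1"): 6 * GiB,
    ("free", "1"): 18 * GiB,
}

def usage_percent(samples):
    gpu_ids = {gid for _, gid in samples}
    return {
        gid: round(100 * samples[("used", gid)]
                   / (samples[("used", gid)] + samples[("free", gid)]), 1)
        for gid in sorted(gpu_ids)
    }

print(usage_percent(samples))  # -> {'0': 83.3, '1': 25.0}
```

In this example GPU 0 is already past the 80% warning tier while GPU 1 is idle, which is exactly the imbalance a per-card panel makes visible.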
Grafana Dashboard Configuration
I’ll give you a ready-to-import Grafana Dashboard JSON. Save this as a file, then in Grafana click Import Dashboard:
{
"dashboard": {
"title": "Ollama Production Monitor",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(ollama_requests_total[5m])",
"legendFormat": "Requests/sec"
}
],
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 6}
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m]) * 100",
"legendFormat": "Error %"
}
],
"gridPos": {"x": 12, "y": 0, "w": 6, "h": 6}
},
{
"title": "GPU Memory Usage",
"type": "graph",
"targets": [
{
"expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100",
"legendFormat": "GPU {{gpu_id}}"
}
],
"gridPos": {"x": 0, "y": 6, "w": 12, "h": 6}
},
{
"title": "Response Latency P95",
"type": "stat",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95 Latency"
}
],
"gridPos": {"x": 12, "y": 6, "w": 6, "h": 6}
}
]
},
"overwrite": true
}
The actual effect looks something like this:
- Top left: Request rate curve, shows peak periods
- Top right: Error rate gauge, turns red above 5%
- Bottom left: Multi-GPU memory usage curves
- Bottom right: P95 latency value
I also add a Tokens/s panel to compare inference speeds across different models horizontally.
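For reference, such a panel might look like this in the same dashboard schema. This is a sketch: it assumes the exporter attaches a model label to ollama_tokens_per_second (drop the by (model) if yours doesn't), and the gridPos just picks the next free row:

```json
{
  "title": "Tokens per Second",
  "type": "graph",
  "targets": [
    {
      "expr": "avg(ollama_tokens_per_second) by (model)",
      "legendFormat": "{{model}}"
    }
  ],
  "gridPos": {"x": 0, "y": 12, "w": 12, "h": 6}
}
```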
Grafana Data Source Configuration
After the Grafana container starts, you need to manually configure the Prometheus data source:
- Log in to Grafana (default admin/admin)
- Configuration -> Data Sources -> Add data source
- Select Prometheus, set the URL to http://prometheus:9090
- Save & Test
If using docker-compose for deployment, the containers can communicate directly using container names.
Alert Rules and AlertManager Configuration
Monitoring shows you problems, but alerting tells you “handle this now.” I once made a mistake: I set all alerts to critical, my phone buzzed dozens of times a day, and eventually I became numb—when a real issue came up, I didn’t react properly.
Alert Tiering Strategy
I divide alerts into three tiers. This logic came from iterating through several incidents:
| Level | Trigger Condition | Response Requirement |
|---|---|---|
| Critical | Service down, GPU memory >95%, error rate >20% | Immediate action (Slack + phone push) |
| Warning | Response time >60s, GPU memory >80%, error rate >5% | Review within 1 hour (Slack only) |
| Info | Model switch, new version deployment | Log only (email digest) |
Key principle: Critical alerts must be rare—when you see one, it should make you nervous.
Prometheus Alert Rules
Add alert rules to prometheus.yml:
rule_files:
- 'ollama_alerts.yml'
Then create a separate ollama_alerts.yml:
# ollama_alerts.yml
groups:
- name: ollama_critical
rules:
# Service down alert
- alert: OllamaServiceDown
expr: up{job="ollama-exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Ollama Service Down"
description: "Ollama exporter unreachable, service may have stopped"
# GPU memory alert (>95%)
- alert: GPUMemoryCritical
expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.95
for: 2m
labels:
severity: critical
annotations:
summary: "GPU Memory Nearly Exhausted"
description: "GPU {{ $labels.gpu_id }} memory usage exceeds 95%, currently {{ $value | humanizePercentage }}"
# High error rate alert
- alert: HighErrorRate
expr: rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m]) > 0.20
for: 3m
labels:
severity: critical
annotations:
summary: "Request Error Rate Too High"
description: "Error rate exceeded 20% in the last 5 minutes, check logs"
- name: ollama_warning
rules:
# Response time alert
- alert: SlowResponseTime
expr: histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m])) > 60
for: 5m
labels:
severity: warning
annotations:
summary: "P95 Response Time Too Slow"
description: "95% of requests have response time exceeding 60 seconds"
# GPU memory warning
- alert: GPUMemoryWarning
expr: nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes > 0.80
for: 5m
labels:
severity: warning
annotations:
summary: "GPU Memory Usage High"
description: "GPU {{ $labels.gpu_id }} memory usage exceeds 80%"
# Error rate warning
- alert: ErrorRateWarning
expr: rate(ollama_requests_failed[5m]) / rate(ollama_requests_total[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Request Error Rate Rising"
description: "Error rate exceeded 5% in the last 5 minutes"
A few notes:
- for: Xm: trigger only after the condition has held for X minutes, to avoid false positives from momentary spikes
- GPU alert threshold at 95%: in practice, once you exceed 95%, things go wrong almost immediately
- Error rate alerts use rate(): absolute numbers are meaningless; you need to look at trends
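The for: behavior is worth internalizing, because it is what separates a blip from an incident. A simplified scrape-by-scrape simulation (real Prometheus tracks wall-clock duration across rule evaluations rather than counting scrapes, but the inactive → pending → firing progression is the same):

```python
# Simplified model of Prometheus's `for:` clause: an alert only fires after
# its expression has been continuously true for the configured duration.
def alert_state(samples, threshold, for_intervals):
    """samples: one metric value per scrape; returns the state after each scrape."""
    states, breached = [], 0
    for value in samples:
        breached = breached + 1 if value > threshold else 0
        if breached == 0:
            states.append("inactive")
        elif breached <= for_intervals:
            states.append("pending")
        else:
            states.append("firing")
    return states

# Error rate spikes once, recovers, then stays high. Only the sustained
# breach reaches "firing" (threshold 20%, for: 3 intervals).
print(alert_state([0.01, 0.30, 0.02, 0.25, 0.28, 0.26, 0.31], 0.20, 3))
# -> ['inactive', 'pending', 'inactive', 'pending', 'pending', 'pending', 'firing']
```

Note how the single spike at the second scrape never fires: the recovery resets the counter. That reset is exactly what keeps momentary spikes out of your Slack channel.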
AlertManager Configuration
AlertManager handles sending alerts. Configuration file alertmanager.yml:
global:
resolve_timeout: 5m
# Routing configuration
route:
group_by: ['severity', 'alertname']
group_wait: 30s # Wait 30 seconds to collect alerts in same group
group_interval: 5m # Interval between same group alerts
repeat_interval: 3h # Repeat interval for unresolved alerts
routes:
- match:
severity: critical
receiver: 'critical-alerts'
continue: false
- match:
severity: warning
receiver: 'warning-alerts'
continue: false
- match:
severity: info
receiver: 'info-alerts'
# Receiver configuration
receivers:
- name: 'critical-alerts'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#ollama-critical'
send_resolved: true
title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
text: '{{ .CommonAnnotations.description }}'
- name: 'warning-alerts'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
channel: '#ollama-monitor'
send_resolved: true
- name: 'info-alerts'
email_configs:
- to: '[email protected]'
send_resolved: true
Slack Webhook Configuration Steps
- Create an App in Slack (or use Incoming Webhooks)
- Add the Webhook URL to the api_url field
- Use separate channels: a dedicated channel for critical alerts, a regular one for warnings
I also add mobile push notifications. If you use PagerDuty or OpsGenie, AlertManager has built-in integrations. For a free option, Telegram Bot works well and isn’t complicated to configure.
Silences and Inhibition Rules
Sometimes you need to temporarily silence alerts, like during maintenance. You can do this directly in the AlertManager UI:
# Access AlertManager UI
http://your-server:9093
# Click Silences -> New Silence
# Set duration, match labels
You can also use the API:
curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "OllamaServiceDown", "isRegex": false, "isEqual": true}],
    "startsAt": "2026-04-12T10:00:00Z",
    "endsAt": "2026-04-12T12:00:00Z",
    "createdBy": "admin",
    "comment": "Scheduled maintenance"
  }'
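If you script maintenance windows, building the payload in code beats hand-editing timestamps. A sketch against the v2 silences schema (endpoint and matcher fields per the AlertManager v2 API; adjust if you run an older version):

```python
# Sketch: build an AlertManager v2 silence payload for a maintenance window.
# Field names follow the v2 silences schema; times must be RFC 3339 UTC.
import json
from datetime import datetime, timedelta, timezone

def maintenance_silence(alertname, hours, created_by, comment):
    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True}
        ],
        "startsAt": start.isoformat().replace("+00:00", "Z"),
        "endsAt": end.isoformat().replace("+00:00", "Z"),
        "createdBy": created_by,
        "comment": comment,
    }

payload = maintenance_silence("OllamaServiceDown", 2, "admin", "Scheduled maintenance")
print(json.dumps(payload, indent=2))
# POST this to http://localhost:9093/api/v2/silences with Content-Type: application/json
```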
Advanced LLM-Specific Monitoring Tools
Prometheus + Grafana is a general-purpose solution, but LLMs have special monitoring needs: Prompt tracing, Token costs, response quality evaluation. These metrics are hard to track with traditional monitoring tools.
Langfuse: LLM Tracing and Prompt Management
Langfuse is a monitoring platform designed specifically for LLM applications. It’s MIT-licensed open source and supports self-hosting. What it can do:
- Trace every conversation: Record input Prompt, output content, Token count, duration
- Prompt version management: Compare effects after Prompt changes
- Quality evaluation: Record user feedback, manual annotations, track model output quality
Integration is straightforward—Langfuse has official Ollama support:
# Python integration example
from langfuse import Langfuse
import requests
langfuse = Langfuse(
public_key="pk-xxx",
secret_key="sk-xxx",
host="https://cloud.langfuse.com" # Or self-hosted address
)
# Record each call
trace = langfuse.trace(
name="ollama-chat",
input={"prompt": user_prompt},
metadata={"model": "llama3.1"}
)
response = requests.post(
"http://localhost:11434/api/generate",
json={"model": "llama3.1", "prompt": user_prompt}
)
trace.update(
output=response.json()["response"],
metadata={"tokens": response.json().get("eval_count", 0)}
)
Deploy the self-hosted version with Docker:
services:
langfuse-server:
image: langfuse/langfuse:latest
ports:
- "3000:3000"
environment:
- DATABASE_URL=postgres://user:pass@db:5432/langfuse
- NEXTAUTH_SECRET=your-secret
If you use LangChain, integration is even simpler—Langfuse has an official callback handler.
SigNoz: OpenTelemetry Unified Monitoring
SigNoz is an OpenTelemetry-based observability platform that unifies logs, metrics, and traces. The benefit is you don’t need to maintain Prometheus, Jaeger, and ELK separately.
For LLM applications, SigNoz’s tracing is practical: you can see the complete chain from API entry to model inference to database queries for a single request.
Deploying SigNoz requires more resources—at least 4GB of RAM recommended. Official Docker Compose one-click deployment:
git clone https://github.com/SigNoz/signoz.git
cd signoz/deploy/docker
docker compose up -d
Tool Selection Recommendations
Here’s my recommendation for different scenarios:
| Scenario | Recommended Solution | Reason |
|---|---|---|
| Small team (under 5 people) | Prometheus + Grafana | Simple and sufficient, rich community resources |
| Need Prompt tracing | Prometheus + Langfuse | Langfuse focuses on LLM, complementary |
| Enterprise multi-service | SigNoz + OpenTelemetry | Unified platform, lower ops cost |
| Pure cloud-native | Use managed services | Save ops effort |
I currently use the Prometheus + Grafana + Langfuse combination. Prometheus handles infrastructure metrics, Langfuse handles the LLM application layer—separate responsibilities, clear picture.
Final Thoughts
After all this, it comes down to one thing: Don’t wait for problems to think about monitoring.
That 3 AM lesson cost me a complete monitoring solution. Now my Ollama service has been running for over a year. I’ve encountered GPU memory alerts a few times, but they were all handled at the Warning level—never woken up in the middle of the night again.
The setup cost for this solution is actually low. I’ve organized all the configuration files—you can download and use them directly:
- systemd service configuration
- Docker Compose complete deployment (Ollama + Exporter + Prometheus + Grafana)
- Prometheus alert rules
- AlertManager configuration template
- Grafana Dashboard JSON
The supporting GitHub repository is linked at the end of the article. Following this configuration, experienced users can get it running in 20 minutes, beginners in about 30.
Next steps I recommend:
- Start with basic Prometheus + Grafana to get metrics flowing
- Observe for 3-5 days to understand normal data ranges
- Adjust alert thresholds based on actual conditions
- Add Langfuse if you need Prompt tracing
Monitoring is an investment you make once with continuous returns. I hope you don’t have to learn this lesson the hard way at 3 AM like I did.
Configuration Repository: github.com/yourname/ollama-monitoring-config (example link, replace with actual deployment)
Series Articles:
- Ollama Local Deployment Complete Guide — First in the series
- Ollama Performance Tuning in Practice — Coming next
FAQ
What are the core metrics needed for Ollama production monitoring?
What is the difference between Prometheus + Grafana and Langfuse?
How should I set reasonable alert thresholds?
What should I do if log files grow indefinitely in Docker deployment?
How do I monitor each GPU card separately in a multi-GPU environment?
How do I quickly diagnose issues when I receive an alert at 3 AM?
12 min read · Published on: Apr 12, 2026 · Modified on: Apr 12, 2026
Related Posts
- Ollama GPU Scheduling and Resource Management: VRAM Optimization, Multi-GPU Load Balancing
- Ollama Performance Optimization: Complete Guide to Quantization, Batch Processing, and Memory Tuning
- Ollama Embedding in Practice: Local Vector Search and RAG Setup