LLM Structured Outputs: JSON Schema Enforcement and Tool Calling Reliability Assurance
At 3 AM, my phone buzzed—a production alert. Checking the logs, I saw the Agent tool calling had failed, retrying 5 times consecutively, all due to parameter format errors. The city field should have been "Beijing", but the LLM returned {"name": "Beijing", "id": null} instead. The parser crashed, and the entire data processing pipeline ground to a halt.
This was a major pitfall I encountered last year.
After that, I systematically researched LLM structured output issues—from OpenAI’s Structured Outputs to Anthropic’s Tool Use, from Instructor’s automatic retries to Outlines’ constrained decoding. Honestly, I initially thought this was just a “write better prompts” problem, but later discovered—it’s not a prompt issue at all; it’s a reliability architecture problem.
This article shares the “three-layer reliability assurance architecture”: parameter validation layer, failure retry layer, and constrained decoding layer. At the end, I’ll compare OpenAI, Claude, and Gemini’s solutions side-by-side, explaining how to choose and when to use what. I’ll also include production-grade code templates you can use directly.
I. Why Structured Outputs Are the Foundation of Agents
Let me start with the “format drift” problem I encountered. This isn’t an isolated case—it’s a nightmare everyone developing Agents will face.
Three Types of Format Drift
First: Missing fields. You ask the LLM to return a user object containing name, age, and email, but it returns {"name": "Zhang San"}—the last two fields are gone. Not missing every time, but occasionally. In production, “occasionally” means “inevitably.”
Second: Type errors. The documentation clearly states: user_id is an integer. The LLM returns "user_id": "12345", a string. Python’s Pydantic validation fails immediately, breaking the entire call chain.
Third: Extra content. The most insidious type. You ask for JSON output, but it prefixes with “Here is the response:” and appends “I hope this helps!” The JSON parser sees this and panics.
vs
OpenAI’s official data is quite telling: JSON Mode (which only guarantees valid JSON) has a failure rate of 5-10%, while Structured Outputs (which enforces Schema compliance) has a failure rate of less than 0.1%. That’s two orders of magnitude difference.
Why This Matters So Much
You might think: “It’s just a parsing failure, right? Just add more retries.”
The problem is retries aren’t free.
API call costs. A single GPT-4 call might cost tens of cents; 5 retries become several dollars. If your Agent processes 100,000 requests daily with an average of 2 retries per request—do the math yourself.
Latency stacking. One call takes 2 seconds, 3 retries means users wait over 6 seconds. In real-time conversation scenarios, this is unacceptable.
User experience breakdown. A user asks about the weather, your Agent freezes, spins for 10 seconds, then returns a “system error.” They won’t come back next time.
So structured output isn’t “nice-to-have”—it’s the foundation of whether an Agent can run stably. Let me discuss how to solve this problem—not by “writing better prompts,” but through a reliable architecture.
II. Three-Layer Reliability Assurance Architecture
This architecture is what I summarized after stepping through many pitfalls. It’s not a silver bullet, but it can reduce your format error probability from 5-10% to near zero.
L1: Parameter Validation Layer—The First Line of Defense
This layer does something simple: define your expected data structure using Pydantic, enforce type conversion, and apply whitelist filtering.
from pydantic import BaseModel, Field, field_validator
from typing import Optional, List
from datetime import datetime
class ToolCallParams(BaseModel):
"""Tool call parameter model"""
city: str = Field(..., min_length=1, max_length=50, description="City name")
date: Optional[datetime] = Field(None, description="Query date")
units: str = Field("metric", pattern="^(metric|imperial)$")
@field_validator("city")
@classmethod
def validate_city(cls, v: str) -> str:
# Whitelist validation
allowed_cities = {"Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou"}
if v not in allowed_cities:
raise ValueError(f"Unsupported city: {v}, currently supported: {allowed_cities}")
return v
Pydantic handles three things for you: type coercion (string “123” to integer 123), missing field detection, and custom validation. This is the most basic yet crucial layer.
L2: Failure Retry Layer—Self-Correction with Feedback
When LLM-returned data fails validation, instead of a simple retry, feed the error message back so it can correct itself. The Instructor library does this well, encapsulating this logic.
import instructor
from openai import OpenAI
from pydantic import ValidationError
client = instructor.patch(OpenAI())
def get_weather_with_retry(user_query: str, max_retries: int = 3):
"""Retry mechanism with error feedback"""
messages = [{"role": "user", "content": user_query}]
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="gpt-4o",
response_model=ToolCallParams, # Pydantic model
messages=messages,
temperature=0.1 # Use low temperature for structured output
)
return response # Automatic validation passed
except ValidationError as e:
# Feed error to LLM for correction
error_msg = f"Parameter validation failed: {str(e)}\nPlease correct and return the correct JSON format."
messages.append({"role": "assistant", "content": "Generating parameters..."})
messages.append({"role": "user", "content": error_msg})
if attempt == max_retries - 1:
raise Exception(f"Still failed after {max_retries} retries: {e}")
# Usage example
result = get_weather_with_retry("Check tomorrow's weather in Beijing for me")
The core idea is: the LLM isn’t guessing blindly—it knows what’s wrong and why. Give it feedback, and it can fix it. In my testing, adding this feedback mechanism improved retry success rate from 60% to over 95%.
L3: Constrained Decoding Layer—Preventing Errors at the Source
The first two layers are “remedial,” but L3 is “preventive.”
The principle of constrained decoding is this: when the LLM generates each token, limit its choice range through a Finite State Machine (FSM), forcing it to only generate token sequences that comply with the Schema. It’s like installing a “brake” on the LLM—it can’t output incorrectly even if it wants to.
There are two mainstream implementation choices:
Outlines (open-source solution, suitable for local models):
from outlines import models, generate
import json
# Load local model
model = models.transformers("Qwen/Qwen2.5-7B-Instruct")
# Define Schema
schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
# Create constrained generator
generator = generate.json(model, schema)
result = generator("Extract user info: Zhang San is 28 years old")
# 100% Schema compliant, no retries needed
vLLM’s guided_json (suitable for deploying large models):
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-72B-Instruct")
sampling_params = SamplingParams(
temperature=0.0,
guided_decoding_backend="outlines",
guided_json={ # Directly pass JSON Schema
"type": "object",
"properties": {
"tool_name": {"type": "string"},
"arguments": {"type": "object"}
}
}
)
The cost of L3 is additional compilation overhead—the FSM needs to be pre-built based on the Schema. If your Schema changes frequently, rebuilding the FSM each time introduces latency. But for most Agent applications, Schemas are relatively stable, making this overhead acceptable.
How to Choose Among the Three Layers
| Scenario | Recommended Solution |
|---|---|
| Calling OpenAI API | L1 + L2 (Pydantic + Instructor) |
| Calling Claude API | L1 + L2 (Claude doesn’t support Strict Mode) |
| Deploying local models | L1 + L3 (Outlines/vLLM guided_json) |
| Extremely high reliability requirements | L1 + L2 + L3 all enabled |
III. Vendor Comparison: OpenAI, Claude, Gemini—How to Choose
This chapter discusses implementation differences among vendors. Honestly, without cross-comparison, it’s easy to step into pitfalls—different vendors have very different concepts and implementations of “structured output.”
OpenAI: Strict Mode, Enforced Compliance
OpenAI launched the Structured Outputs feature in August 2024, which is currently the most reliable solution among commercial APIs.
The core mechanism is the strict: true parameter. When enabled, the LLM’s output is forcibly constrained to your defined JSON Schema, guaranteeing 100% compliance. Under the hood, it uses constrained decoding technology (based on Grammar-based Constrained Decoding), similar in principle to Outlines.
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Extract user information"}],
response_format={
"type": "json_schema",
"json_schema": {
"name": "user_info",
"strict": True, # Key parameter
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
}
}
)
# Output 100% Schema compliant
OpenAI’s official data shows Strict Mode’s failure rate is less than 0.1%. In my usage, I haven’t encountered format errors—but there are limitations: no support for recursive Schemas, and some complex nested structures require workarounds.
Anthropic Claude: Tool Use, No Compliance Guarantee
Claude’s structured output takes a different approach—Tool Use.
You define a tool, and Claude will call it and pass parameters. But here’s the pitfall: Claude’s strict parameter can be set, but the official documentation clearly states—it will be ignored. Claude doesn’t guarantee parameters will definitely comply with your defined Schema.
This is what Anthropic’s official documentation says (updated April 2026):
“The
strictparameter is currently ignored for tool definitions. Claude will make a best effort to provide valid arguments, but does not guarantee schema compliance.”
Translation: It will try its best, but no guarantees. So when using Claude for tool calling, you must add L1 (parameter validation) and L2 (failure retry).
import anthropic
client = anthropic.Anthropic()
# Claude's tool definition
tools = [{
"name": "get_weather",
"input_schema": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}]
response = client.messages.create(
model="claude-3.5-sonnet",
max_tokens=1024,
tools=tools,
messages=[{"role": "user", "content": "Beijing weather"}]
)
# Important: Must manually validate tool_use parameters
for block in response.content:
if block.type == "tool_use":
# Pydantic validation required here
validated_params = ToolCallParams.model_validate(block.input)
Google Gemini: Controlled Generation
Gemini’s solution is called Controlled Generation, specifying output structure through the response_schema parameter.
import google.generativeai as genai
model = genai.GenerativeModel('gemini-1.5-pro')
response = model.generate_content(
"Extract user information",
generation_config={
"response_mime_type": "application/json",
"response_schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
}
)
Gemini’s reliability is between OpenAI and Claude—there are constraints, but not as strong as OpenAI’s “enforced compliance.” In testing, the failure rate is about 1-2%, better than JSON Mode but not reaching Strict Mode levels.
Open Source Models: Depend on Outlines/vLLM
Open source models (like Qwen, Llama, Mistral) don’t natively support structured output and need external tools. The mainstream solutions are Outlines and vLLM’s guided_json mentioned earlier.
Here’s an interesting point: open source models combined with Outlines actually have higher structured output reliability than some commercial APIs—because FSM is a hard constraint, there’s no “best effort but no guarantee” situation.
Quick Reference Selection Table
| Need | Recommended Solution | Reason |
|---|---|---|
| Pure API calls, pursuing stability | OpenAI + Structured Outputs | 0.1% failure rate, most reliable |
| Need complex reasoning + tool calling | Claude + L1/L2 validation | Strong reasoning ability, but requires validation |
| Deploy private models | Qwen/Llama + Outlines | Controllable cost, high reliability |
| Extremely high format requirements (finance, medical) | OpenAI Strict or Outlines | Both achieve near-zero failures |
| Rapid prototype validation | Instructor + any API | Well-encapsulated, automatic retries |
IV. Production Code Templates
This chapter provides production-grade code templates. I’ve verified all this code in production environments—use it directly.
Template 1: OpenAI Structured Outputs Complete Example
"""
OpenAI Structured Outputs Complete Example
Suitable for: Tool calling, data extraction, report generation, etc.
"""
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional
import json
# 1. Define Pydantic model
class SearchQuery(BaseModel):
"""Search query parameters"""
keywords: List[str] = Field(
...,
min_length=1,
max_length=5,
description="List of search keywords"
)
filters: Optional[dict] = Field(
default=None,
description="Optional filter conditions"
)
limit: int = Field(
default=10,
ge=1,
le=100,
description="Number of results to return"
)
# 2. Convert Pydantic model to JSON Schema
def model_to_schema(model: type[BaseModel]) -> dict:
"""Convert Pydantic model to JSON Schema"""
schema = model.model_json_schema()
# Clean Pydantic-added metadata
schema.pop("title", None)
for prop in schema.get("properties", {}).values():
prop.pop("title", None)
return schema
# 3. Structured output call
client = OpenAI()
def extract_search_params(user_input: str) -> SearchQuery:
"""Extract search parameters from user input"""
schema = model_to_schema(SearchQuery)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are a search assistant helping users extract search parameters."
},
{"role": "user", "content": user_input}
],
response_format={
"type": "json_schema",
"json_schema": {
"name": "search_query",
"strict": True,
"schema": schema
}
},
temperature=0.1 # Use low temperature for structured output
)
# 4. Parse and double-validate
raw_content = response.choices[0].message.content
data = json.loads(raw_content)
return SearchQuery.model_validate(data)
# Usage example
if __name__ == "__main__":
query = extract_search_params(
"I want to find articles about Python async programming, only from the last month, maximum 20"
)
print(query)
# SearchQuery(keywords=['Python', 'async programming'], filters={'date_range': 'last_month'}, limit=20)
Template 2: Instructor Automatic Retry Example
"""
Instructor Automatic Retry Example
Suitable for: Claude API, OpenAI JSON Mode (non-Strict), scenarios requiring fault tolerance
"""
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError
class AgentAction(BaseModel):
"""Agent action decision"""
action_type: str = Field(
...,
pattern="^(search|execute|respond|clarify)$"
)
parameters: dict = Field(default_factory=dict)
reasoning: str = Field(..., min_length=10)
# patch OpenAI client
client = instructor.patch(OpenAI())
def get_agent_decision(
context: str,
user_request: str,
max_retries: int = 3
) -> AgentAction:
"""
Get Agent's action decision with automatic retry
Args:
context: Current conversation context
user_request: User request
max_retries: Maximum retry count
Returns:
AgentAction: Validated action decision
"""
messages = [
{"role": "system", "content": "You are an intelligent assistant analyzing user needs and deciding next actions."},
{"role": "user", "content": f"Context: {context}\n\nUser request: {user_request}"}
]
try:
response = client.chat.completions.create(
model="gpt-4o",
response_model=AgentAction, # Instructor auto-validation
messages=messages,
max_retries=max_retries, # Built-in retry
temperature=0.1
)
return response
except ValidationError as e:
# Instructor already retried max_retries times
raise Exception(f"Format error cannot be fixed, please check model definition: {e}")
# Usage example
decision = get_agent_decision(
context="User is querying weather information",
user_request="Check tomorrow's weather in Beijing for me, and recommend outdoor activities if it's sunny"
)
print(f"Action type: {decision.action_type}")
print(f"Parameters: {decision.parameters}")
print(f"Reasoning: {decision.reasoning}")
Template 3: Outlines Local Model Structured Output
"""
Outlines Local Model Structured Output Example
Suitable for: Private deployment, cost-sensitive, high privacy requirements
"""
from outlines import models, generate
from pydantic import BaseModel
from typing import List
import json
# Define data structure
class ProductInfo(BaseModel):
"""Product information"""
name: str
price: float
category: str
tags: List[str]
# Load model (first load has a few seconds delay)
model = models.transformers("Qwen/Qwen2.5-7B-Instruct")
# Create structured generator
# Note: schema will be compiled to FSM on first call, ~1-2 second overhead
schema_str = json.dumps(ProductInfo.model_json_schema())
generator = generate.json(model, schema_str)
def extract_product_info(description: str) -> ProductInfo:
"""
Extract structured information from product description
Args:
description: Product description text
Returns:
ProductInfo: Structured product information
"""
prompt = f"Extract key information from the following product description in JSON format:\n{description}"
# Generated result 100% Schema compliant
result = generator(prompt)
# Convert to Pydantic model (double validation to be absolutely sure)
return ProductInfo.model_validate(result)
# Usage example
description = """
This Bluetooth headphone features the latest noise-canceling technology,
priced at 299 yuan, belongs to digital accessories category,
suitable for sports, commuting, and other scenarios.
"""
product = extract_product_info(description)
print(product)
# ProductInfo(name='Bluetooth headphone', price=299.0, category='Digital accessories', tags=['sports', 'commuting'])
Template 4: Complete Tool Calling Flow
"""
Complete Tool Calling Parameter Validation Flow
Includes: Schema definition → LLM call → Parameter validation → Failure retry → Tool execution
"""
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator, ValidationError
from typing import Callable, Dict, Any
import json
# 1. Define tool parameter model
class WeatherQueryParams(BaseModel):
"""Weather query tool parameters"""
city: str = Field(..., min_length=1, max_length=50)
date_offset: int = Field(default=0, ge=-7, le=7, description="Date offset, 0 means today")
@field_validator("city")
@classmethod
def validate_city(cls, v: str) -> str:
allowed = {"Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou", "Chengdu", "Wuhan"}
if v not in allowed:
raise ValueError(f"Unsupported city, options: {allowed}")
return v
# 2. Tool call manager
class ToolCallManager:
"""Manage complete tool calling flow"""
def __init__(self):
self.client = OpenAI()
self.tools: Dict[str, Callable] = {}
def register_tool(self, name: str, func: Callable, param_model: type[BaseModel]):
"""Register tool"""
self.tools[name] = {
"function": func,
"param_model": param_model
}
def execute_with_retry(
self,
tool_name: str,
user_request: str,
max_retries: int = 3
) -> Any:
"""Execute tool call with retry"""
tool_config = self.tools[tool_name]
param_model = tool_config["param_model"]
schema = param_model.model_json_schema()
messages = [
{"role": "system", "content": f"Extract call parameters for tool '{tool_name}'"},
{"role": "user", "content": user_request}
]
for attempt in range(max_retries):
try:
# Call LLM to get parameters
response = self.client.chat.completions.create(
model="gpt-4o",
messages=messages,
response_format={
"type": "json_schema",
"json_schema": {
"name": tool_name,
"strict": True,
"schema": schema
}
},
temperature=0.1
)
# Validate parameters
params = param_model.model_validate_json(
response.choices[0].message.content
)
# Execute tool
return tool_config["function"](params)
except ValidationError as e:
# Feed back error for LLM to correct
messages.append({
"role": "user",
"content": f"Parameter validation failed: {e}\nPlease correct parameter format."
})
continue
raise Exception(f"Tool call failed, still cannot pass validation after {max_retries} retries")
# 3. Usage example
def get_weather(params: WeatherQueryParams) -> str:
"""Simulate weather query"""
# Actual API call logic here
return f"{params.city} weather will be sunny for the next {params.date_offset} day(s)"
manager = ToolCallManager()
manager.register_tool("get_weather", get_weather, WeatherQueryParams)
result = manager.execute_with_retry(
"get_weather",
"Check tomorrow's weather in Beijing for me"
)
print(result) # Beijing weather will be sunny for the next 1 day(s)
These templates cover the most common scenarios. You can combine and modify them according to your needs in practice.
V. Production Best Practices
The code is written, but there are many details to watch in production. Here are some pitfalls I’ve encountered and their solutions.
Temperature Setting: Don’t Go High
For structured output scenarios, Temperature should be set between 0.0-0.2. This range is recommended by OpenAI’s official documentation, and it’s indeed the most stable in practice.
What’s the problem with high temperature? The LLM becomes more “divergent,” outputting more randomly. Randomness is the enemy of structured output—you want determinism, not creativity. I once set Temperature to 0.7, and the format error rate shot up to 15%. After changing to 0.1, I basically stopped encountering issues.
Retry Strategy: Not All Errors Should Be Retried
Before retrying, first determine the error type:
| Error Type | Retry? | Reason |
|---|---|---|
| Parameter format error (missing field, wrong type) | Retry + error feedback | LLM can self-correct |
| API service error (429, 500) | Retry + exponential backoff | Temporary server issue |
| Business validation failure (city not in whitelist) | No retry, return error directly | Needs user confirmation |
| Tool execution failure (empty result) | No retry, fallback | Tool itself problem |
I’ve seen people retry all errors infinitely, resulting in a city name not being in the whitelist—LLM guessed 10 times without getting it right, eventually timing out and crashing. Distinguish error types to handle them efficiently.
Performance Overhead Comparison
| Solution | Latency Increase | Cost Increase | Reliability |
|---|---|---|---|
| Prompt constraint (no special params) | +0ms | +0% | 5-10% failure |
| JSON Mode (OpenAI only) | +50ms | +0% | 2-5% failure |
| Structured Outputs (Strict) | +100ms | +0% | <0.1% failure |
| Instructor retry | +200-500ms/retry | +cost×retry count | Near 0% failure |
| Outlines FSM | +1-2s (first compilation) | +0% | 100% compliant |
Choose by weighing: for ultimate stability, choose Structured Outputs or Outlines; for rapid prototyping, use Instructor automatic retry; for limited budget, JSON Mode + manual validation can work too.
Monitoring Metrics: Three Must-Watch
After going live, these metrics must be monitored:
- Format failure rate: Percentage of requests that fail validation. Investigate if over 1%.
- Average retry count: Should normally be between 0.5-1.5. Over 2 indicates problems with the model or Schema.
- Average latency: Structured output adds 50-200ms compared to normal output, but should stay within acceptable range.
I use Prometheus + Grafana for monitoring and check the report weekly. Once I found retry count suddenly jumped from 0.8 to 2.5—investigation revealed the Schema was changed but not synced to code. Fortunately, monitoring caught the issue in time.
Conclusion
After all this, there’s really one core point: structured output in 2026 is no longer a difficult problem—as long as you use the right approach.
The three-layer architecture (parameter validation + failure retry + constrained decoding) covers all scenarios from “working” to “running stably.” When choosing vendors, remember: OpenAI Strict Mode is the most stable, Claude needs self-validation, and open source models combined with Outlines are actually highly reliable.
The code templates are all in Chapter IV—take them and modify as needed. If you’re just starting Agent development, I recommend beginning with Instructor—it’s well-encapsulated with built-in automatic retries and error feedback. Once familiar, consider whether to use Outlines for 100% forced compliance.
If you have questions, leave a comment or reach out directly. This content is a bit technical—hope it helps you avoid some pitfalls.
Complete Process for Implementing OpenAI Structured Outputs
Complete steps from Pydantic model definition to structured output calling
⏱️ Estimated time: 15 min
- 1
Step1: Define Pydantic Data Model
Create Pydantic model class, use Field to define field constraints:
• Use `Field(..., min_length=1, max_length=50)` to define string length range
• Use `Field(default=10, ge=1, le=100)` to define numeric range
• Use `@field_validator` to add custom validation logic (e.g., whitelist filtering)
• Use `Optional[T]` to define optional fields - 2
Step2: Convert Pydantic Model to JSON Schema
Use `model.model_json_schema()` method to convert:
```python
schema = SearchQuery.model_json_schema()
schema.pop("title", None) # Clean Pydantic metadata
```
Ensure Schema complies with OpenAI Structured Outputs requirements. - 3
Step3: Call OpenAI API and Enable Strict Mode
Set `response_format` parameter in API request:
• `type: "json_schema"` — specify structured output type
• `strict: True` — enable enforced compliance mode
• `json_schema.name` — Schema name (custom)
• `json_schema.schema` — JSON Schema from previous step - 4
Step4: Parse Response and Double-Validate
Although Strict Mode guarantees 100% compliance, still recommend double validation:
• Use `json.loads()` to parse response string
• Use `model.model_validate(data)` for Pydantic validation
• Catch `ValidationError` exception and handle edge cases - 5
Step5: Configure Temperature Parameter
Set low temperature for structured output scenarios:
```python
temperature=0.1 # Recommended 0.0-0.2
```
Avoid high temperature increasing output randomness, affecting format stability.
FAQ
What should I do when LLM outputs incorrect JSON format?
• L1 Parameter Validation Layer: Use Pydantic to define data models, automatic type conversion and field validation
• L2 Failure Retry Layer: Use Instructor library for automatic retry, feed errors back to LLM for self-correction
• L3 Constrained Decoding Layer: Use Outlines or vLLM's guided_json to guarantee compliance at the source
What's the difference between OpenAI and Claude's structured output?
How to choose the right structured output solution?
• **OpenAI API calls**: Structured Outputs + Strict Mode (most stable)
• **Claude API calls**: Pydantic validation + Instructor retry (needs self-validation)
• **Local model deployment**: Outlines or vLLM guided_json (controllable cost, high reliability)
• **Rapid prototype validation**: Instructor library (well-encapsulated, ready to use)
• **Finance/medical high-requirement scenarios**: OpenAI Strict or Outlines (near-zero failure)
How should Temperature parameter be set?
How much performance overhead does structured output add?
• **Prompt constraint**: +0ms latency, 5-10% failure rate
• **JSON Mode**: +50ms latency, 2-5% failure rate
• **Structured Outputs**: +100ms latency, <0.1% failure rate
• **Instructor retry**: +200-500ms/retry, near 0% failure rate
• **Outlines FSM**: +1-2s first compilation, 100% compliance
Choose based on reliability needs and budget trade-offs.
Which errors should be retried? Which shouldn't?
**Should retry**:
• Parameter format errors (missing fields, wrong types) — LLM can self-correct
• API service errors (429, 500) — Temporary server issue
**Should not retry**:
• Business validation failure (city not in whitelist) — Needs user confirmation
• Tool execution failure (empty result returned) — Tool itself problem, should fallback
Infinite retrying leads to timeout; distinguish error types to handle efficiently.
11 min read · Published on: May 6, 2026 · Modified on: May 6, 2026
AI Development
If you landed here from search, the fastest way to build context is to jump to the previous or next post in this same series.
Previous
RAG Query Routing in Practice: Multi-Vector Store Coordination and Intelligent Retrieval Distribution
RAG query routing in practice: A systematic comparison of three approaches—logical routing, semantic routing, and EnsembleRetriever—with complete LangChain code implementations, including cost optimization strategies like Semantic Caching and Tiered Retrieval.
Part 28 of 33
Next
DeepAgents Architecture: Planning Tools, Sub-agents, and File System
Deep dive into DeepAgents' four-pillar architecture: Planning Tools, Sub-agents, File System, and System Prompts. Compare with LangGraph, AutoGen, and other frameworks. Includes practical code examples and best practices.
Part 30 of 33
Related Posts
Complete Workers AI Tutorial: 10,000 Free LLM API Calls Daily, 90% Cheaper Than OpenAI
Complete Workers AI Tutorial: 10,000 Free LLM API Calls Daily, 90% Cheaper Than OpenAI
AI-Powered Refactoring of 10,000 Lines: A Real Story of Doing a Month's Work in 2 Weeks
AI-Powered Refactoring of 10,000 Lines: A Real Story of Doing a Month's Work in 2 Weeks
OpenAI Blocked in China? Set Up Workers Proxy for Free in 5 Minutes (Complete Code Included)

Comments
Sign in with GitHub to leave a comment