Parallel tool calling is one of the most impactful performance features in modern LLM APIs. Instead of waiting for one tool call to finish before starting the next, the model can request multiple tool executions in a single response - and your agent runtime can execute them all at once.
This guide covers how parallel tool calling works across OpenAI, Anthropic Claude, and Google Gemini APIs, with working code examples in both Python and TypeScript. You will also learn error handling patterns, when to choose sequential over parallel execution, and how real-world AI coding agents use this to speed up code review and analysis.
What Is Parallel Tool Calling?
In a traditional LLM agent loop, tool use follows a strict sequential pattern:
- The model requests a single tool call
- Your code executes that tool
- You send the result back to the model
- The model requests the next tool call
- Repeat until done
This works fine when each step depends on the previous result. But many agent tasks involve independent operations that have no dependency on each other. Reading multiple files, querying several APIs, or running different checks on the same codebase - these can all happen at the same time.
Parallel tool calling lets the model express this. In a single response, the model returns multiple tool call requests. Your agent runtime sees them all at once and can execute them concurrently.
The execution flow looks like this:
Sequential (slow):

```
Model -> Tool A (200ms) -> Model -> Tool B (200ms) -> Model -> Tool C (200ms)
Total: ~600ms + 3 model round trips
```

Parallel (fast):

```
Model -> [Tool A, Tool B, Tool C] (all execute at once, 200ms) -> Model
Total: ~200ms + 1 model round trip
```
The performance gain comes from two sources. First, the tool executions overlap in time. Second, you eliminate extra round trips to the model API, which often add 1-3 seconds each.
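The tool-execution half of that gain is easy to demonstrate with plain asyncio, independent of any LLM API. This sketch simulates three 200ms tools and times both strategies:

```python
import asyncio
import time

async def call_tool(name: str, delay: float = 0.2) -> str:
    # Stand-in for a real tool call (file read, API query, linter run)
    await asyncio.sleep(delay)
    return f"{name} done"

async def sequential() -> float:
    # One tool at a time: total time is the sum of all delays
    start = time.perf_counter()
    for name in ("tool_a", "tool_b", "tool_c"):
        await call_tool(name)
    return time.perf_counter() - start

async def parallel() -> float:
    # All tools at once: total time is roughly the slowest delay
    start = time.perf_counter()
    await asyncio.gather(*(call_tool(n) for n in ("tool_a", "tool_b", "tool_c")))
    return time.perf_counter() - start

seq = asyncio.run(sequential())  # ~0.6s
par = asyncio.run(parallel())    # ~0.2s
```

The round-trip savings come on top of this, since the three results go back to the model in a single request.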
How Parallel Tool Calling Works in OpenAI API
OpenAI introduced parallel function calling with GPT-4 Turbo in late 2023 and it has been a core feature since. When the model determines that multiple tool calls are independent, it returns all of them in a single response.
Python Example - OpenAI Parallel Tool Calls
```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"}
                },
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_linter",
            "description": "Run a linter on a file and return issues",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                    "linter": {"type": "string", "description": "Linter name"},
                },
                "required": ["path", "linter"],
            },
        },
    },
]

async def execute_tool(name: str, arguments: dict) -> str:
    """Execute a single tool and return the result."""
    if name == "read_file":
        # Simulate file read
        return f"Contents of {arguments['path']}: ..."
    elif name == "run_linter":
        return f"Linter {arguments['linter']} found 0 issues in {arguments['path']}"
    return "Unknown tool"

async def run_agent():
    messages = [
        {"role": "system", "content": "You are a code review agent."},
        {"role": "user", "content": "Review src/auth.py and src/api.py for issues."},
    ]
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        # parallel_tool_calls=True is the default
    )
    message = response.choices[0].message
    if message.tool_calls:
        print(f"Model requested {len(message.tool_calls)} tool calls in parallel")
        # Execute all tool calls concurrently
        tasks = []
        for tool_call in message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            tasks.append(execute_tool(tool_call.function.name, args))
        results = await asyncio.gather(*tasks)
        # Send all results back in one request
        messages.append(message)
        for tool_call, result in zip(message.tool_calls, results):
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })
        # Get the model's final response
        final = await client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        print(final.choices[0].message.content)

asyncio.run(run_agent())
```
The key detail here is that message.tool_calls is a list. When the model decides to call multiple tools in parallel, this list contains more than one entry. You execute them all, then send all the results back in the same message sequence.
TypeScript Example - OpenAI Parallel Tool Calls
```typescript
import OpenAI from "openai";

const client = new OpenAI();

const tools: OpenAI.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "read_file",
      description: "Read the contents of a file",
      parameters: {
        type: "object",
        properties: {
          path: { type: "string", description: "File path" },
        },
        required: ["path"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "run_linter",
      description: "Run a linter on a file and return issues",
      parameters: {
        type: "object",
        properties: {
          path: { type: "string", description: "File path" },
          linter: { type: "string", description: "Linter name" },
        },
        required: ["path", "linter"],
      },
    },
  },
];

async function executeTool(name: string, args: Record<string, string>): Promise<string> {
  if (name === "read_file") {
    return `Contents of ${args.path}: ...`;
  }
  if (name === "run_linter") {
    return `Linter ${args.linter} found 0 issues in ${args.path}`;
  }
  return "Unknown tool";
}

async function runAgent() {
  const messages: OpenAI.ChatCompletionMessageParam[] = [
    { role: "system", content: "You are a code review agent." },
    { role: "user", content: "Review src/auth.py and src/api.py for issues." },
  ];
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
    // parallel_tool_calls defaults to true
  });
  const message = response.choices[0].message;
  if (message.tool_calls && message.tool_calls.length > 0) {
    console.log(`Model requested ${message.tool_calls.length} parallel tool calls`);
    // Execute all tool calls concurrently with Promise.all
    const results = await Promise.all(
      message.tool_calls.map((tc) =>
        executeTool(tc.function.name, JSON.parse(tc.function.arguments))
      )
    );
    // Append assistant message and all tool results
    messages.push(message);
    message.tool_calls.forEach((tc, i) => {
      messages.push({
        role: "tool",
        tool_call_id: tc.id,
        content: results[i],
      });
    });
    const final = await client.chat.completions.create({
      model: "gpt-4o",
      messages,
      tools,
    });
    console.log(final.choices[0].message.content);
  }
}

runAgent();
```
Disabling Parallel Tool Calls in OpenAI
If you need sequential execution - for example when the second tool call depends on the first result - you can disable parallel calling:
```python
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    parallel_tool_calls=False,  # Force one tool call at a time
)
```
How Parallel Tool Calling Works in Anthropic Claude API
Anthropic Claude supports parallel tool use across Claude 3.5 Sonnet, Claude 3 Opus, and Claude 4 models. The implementation differs from OpenAI because Claude uses a content block model where a single response can contain multiple tool_use blocks.
Python Example - Claude Parallel Tool Use
```python
import asyncio

import anthropic

client = anthropic.AsyncAnthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file from the repository",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path"}
            },
            "required": ["path"],
        },
    },
    {
        "name": "search_code",
        "description": "Search for a pattern across the codebase",
        "input_schema": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string", "description": "Search pattern"},
                "file_glob": {"type": "string", "description": "File glob filter"},
            },
            "required": ["pattern"],
        },
    },
]

async def execute_tool(name: str, tool_input: dict) -> str:
    if name == "read_file":
        return f"File contents of {tool_input['path']}: def main(): pass"
    elif name == "search_code":
        return f"Found 3 matches for '{tool_input['pattern']}'"
    return "Unknown tool"

async def run_agent():
    messages = [
        {
            "role": "user",
            "content": "Read auth.py and search for all SQL query patterns in the codebase.",
        }
    ]
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    # Collect all tool_use blocks from the response
    tool_use_blocks = [
        block for block in response.content
        if block.type == "tool_use"
    ]
    if tool_use_blocks:
        print(f"Claude requested {len(tool_use_blocks)} tool calls in parallel")
        # Execute all tool calls concurrently
        tasks = [
            execute_tool(block.name, block.input)
            for block in tool_use_blocks
        ]
        results = await asyncio.gather(*tasks)
        # Build tool results - one for each tool_use block
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            }
            for block, result in zip(tool_use_blocks, results)
        ]
        # Send results back
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        final = await client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        # Extract text from the final response
        for block in final.content:
            if block.type == "text":
                print(block.text)

asyncio.run(run_agent())
```
TypeScript Example - Claude Parallel Tool Use
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "read_file",
    description: "Read the contents of a file from the repository",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string", description: "File path" },
      },
      required: ["path"],
    },
  },
  {
    name: "search_code",
    description: "Search for a pattern across the codebase",
    input_schema: {
      type: "object",
      properties: {
        pattern: { type: "string", description: "Search pattern" },
        file_glob: { type: "string", description: "File glob filter" },
      },
      required: ["pattern"],
    },
  },
];

async function executeTool(name: string, input: Record<string, string>): Promise<string> {
  if (name === "read_file") return `Contents of ${input.path}: ...`;
  if (name === "search_code") return `Found 3 matches for '${input.pattern}'`;
  return "Unknown tool";
}

async function runAgent() {
  const messages: Anthropic.MessageParam[] = [
    {
      role: "user",
      content: "Read auth.py and search for all SQL query patterns.",
    },
  ];
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    tools,
    messages,
  });
  const toolUseBlocks = response.content.filter(
    (block): block is Anthropic.ToolUseBlock => block.type === "tool_use"
  );
  if (toolUseBlocks.length > 0) {
    console.log(`Claude requested ${toolUseBlocks.length} parallel tool calls`);
    // Execute concurrently
    const results = await Promise.all(
      toolUseBlocks.map((block) =>
        executeTool(block.name, block.input as Record<string, string>)
      )
    );
    const toolResults: Anthropic.ToolResultBlockParam[] = toolUseBlocks.map(
      (block, i) => ({
        type: "tool_result",
        tool_use_id: block.id,
        content: results[i],
      })
    );
    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: toolResults });
    const final = await client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4096,
      tools,
      messages,
    });
    for (const block of final.content) {
      if (block.type === "text") console.log(block.text);
    }
  }
}

runAgent();
```
The main structural difference from OpenAI is that Claude uses a content block array. A single assistant message can contain a mix of text and tool_use blocks. You filter for tool_use blocks, execute them all, then return the results as tool_result blocks.
How Parallel Tool Calling Works in Google Gemini API
Google Gemini supports parallel function calling with Gemini 1.5 Pro and Gemini 2.0 models. The API uses a function_call part structure where multiple function calls can appear in a single response.
Python Example - Gemini Parallel Function Calling
```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client()

read_file_tool = types.FunctionDeclaration(
    name="read_file",
    description="Read the contents of a file",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "path": types.Schema(type=types.Type.STRING, description="File path"),
        },
        required=["path"],
    ),
)

run_test_tool = types.FunctionDeclaration(
    name="run_test",
    description="Run a test file and return results",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={
            "test_path": types.Schema(type=types.Type.STRING, description="Test file path"),
        },
        required=["test_path"],
    ),
)

gemini_tools = types.Tool(function_declarations=[read_file_tool, run_test_tool])

async def execute_tool(name: str, args: dict) -> dict:
    if name == "read_file":
        return {"content": f"Contents of {args['path']}: ..."}
    elif name == "run_test":
        return {"passed": True, "tests_run": 5}
    return {"error": "Unknown tool"}

async def run_agent():
    model = "gemini-2.0-flash"
    prompt = "Read src/main.py and run the tests in tests/test_main.py"
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(
            tools=[gemini_tools],
        ),
    )
    # Collect all function calls from the response
    function_calls = []
    for part in response.candidates[0].content.parts:
        if part.function_call:
            function_calls.append(part.function_call)
    if function_calls:
        print(f"Gemini requested {len(function_calls)} parallel function calls")
        # Execute all concurrently
        tasks = [
            execute_tool(fc.name, dict(fc.args))
            for fc in function_calls
        ]
        results = await asyncio.gather(*tasks)
        # Build function response parts
        function_responses = []
        for fc, result in zip(function_calls, results):
            function_responses.append(
                types.Part.from_function_response(
                    name=fc.name,
                    response=result,
                )
            )
        # Send results back for final response
        final = client.models.generate_content(
            model=model,
            contents=[
                types.Content(
                    role="user",
                    parts=[types.Part.from_text(text=prompt)],
                ),
                response.candidates[0].content,
                types.Content(
                    role="user",
                    parts=function_responses,
                ),
            ],
            config=types.GenerateContentConfig(
                tools=[gemini_tools],
            ),
        )
        print(final.text)

asyncio.run(run_agent())
```
Performance Benefits of Parallel Tool Calling
The real-world performance gains from parallel tool calling are substantial. Here is a breakdown of where the time savings come from.
Latency Reduction
Consider an AI code review agent that needs to analyze a pull request with 8 changed files. For each file, it reads the content, checks the git blame, and runs a linter. That is 24 tool calls.
Sequential approach:
- 24 tool calls x ~150ms average = 3,600ms of tool execution
- 24 model round trips x ~1,500ms average = 36,000ms of API latency
- Total: ~40 seconds
Parallel approach (batches of 8):
- 3 batches x ~150ms (parallel execution) = 450ms of tool execution
- 3 model round trips x ~1,500ms average = 4,500ms of API latency
- Total: ~5 seconds
That is an 8x improvement. The dominant factor is eliminating model round trips, not just the tool execution overlap.
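Spelled out as arithmetic (using the illustrative latencies above, not measurements):

```python
# Illustrative averages from the scenario above
TOOL_MS = 150        # average tool execution time
ROUND_TRIP_MS = 1500  # average model API round trip
CALLS = 24            # 8 files x 3 tools each
BATCH = 8             # parallel calls per turn

# Sequential: every call pays both the tool time and a round trip
sequential_ms = CALLS * TOOL_MS + CALLS * ROUND_TRIP_MS

# Parallel: each batch pays one (overlapped) tool time and one round trip
batches = CALLS // BATCH
parallel_ms = batches * TOOL_MS + batches * ROUND_TRIP_MS

print(sequential_ms)                 # 39600
print(parallel_ms)                   # 4950
print(sequential_ms / parallel_ms)   # 8.0
```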
Token Efficiency
Each model round trip carries the full conversation context. With sequential tool calling, you pay for the input tokens of the full conversation 24 times. With parallel calling, you pay for it 3 times. For long conversations with large tool results, this can mean significant cost savings on API usage.
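With hypothetical numbers (a 10,000-token conversation context at an assumed $2.50 per million input tokens; only the ratio matters, not the price), the context re-billing looks like this:

```python
# Hypothetical figures for illustration - not real pricing
CONTEXT_TOKENS = 10_000   # tokens re-sent on every round trip
PRICE_PER_M = 2.50        # dollars per million input tokens

def input_cost(round_trips: int) -> float:
    # Each round trip re-sends the full conversation context
    return round_trips * CONTEXT_TOKENS * PRICE_PER_M / 1_000_000

sequential_cost = input_cost(24)  # 24 round trips -> 0.6 dollars
parallel_cost = input_cost(3)     # 3 round trips  -> 0.075 dollars
```

The same 8x factor that cut latency also cuts the input-token bill for those turns.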
When Parallel Calling Hurts
Parallel tool calling is not always better. It degrades performance in these situations:
- Dependent operations - If tool B needs the result of tool A, they must run sequentially
- Rate-limited APIs - Firing 20 parallel requests at an API with a rate limit of 5/second will cause failures
- Resource-intensive tools - Running 10 heavy database queries in parallel can overload your database
- Non-deterministic ordering - If the order of results matters for the model's reasoning, sequential is safer
Error Handling Patterns for Parallel Tool Calls
Robust error handling is critical when executing tools in parallel. A single failure should not crash your entire agent loop.
Python - Resilient Parallel Execution
```python
import asyncio
import json
from dataclasses import dataclass

@dataclass
class ToolResult:
    tool_call_id: str
    name: str
    success: bool
    content: str

async def execute_tool_safe(tool_call_id: str, name: str, args: dict) -> ToolResult:
    """Execute a tool with error handling."""
    try:
        # execute_tool is the dispatcher from the earlier examples
        result = await execute_tool(name, args)
        return ToolResult(
            tool_call_id=tool_call_id,
            name=name,
            success=True,
            content=result,
        )
    except TimeoutError:
        return ToolResult(
            tool_call_id=tool_call_id,
            name=name,
            success=False,
            content=f"Error: Tool '{name}' timed out after 30 seconds",
        )
    except Exception as e:
        return ToolResult(
            tool_call_id=tool_call_id,
            name=name,
            success=False,
            content=f"Error: Tool '{name}' failed with: {e}",
        )

async def execute_parallel_tools(tool_calls: list) -> list[ToolResult]:
    """Execute multiple tool calls in parallel with error isolation."""
    tasks = [
        execute_tool_safe(
            tool_call_id=tc.id,
            name=tc.function.name,
            args=json.loads(tc.function.arguments),
        )
        for tc in tool_calls
    ]
    return await asyncio.gather(*tasks)
```
TypeScript - Resilient Parallel Execution
```typescript
import OpenAI from "openai";

interface ToolResult {
  toolCallId: string;
  name: string;
  success: boolean;
  content: string;
}

async function executeToolSafe(
  toolCallId: string,
  name: string,
  args: Record<string, string>
): Promise<ToolResult> {
  try {
    // executeTool is the dispatcher from the earlier examples
    const result = await executeTool(name, args);
    return { toolCallId, name, success: true, content: result };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      toolCallId,
      name,
      success: false,
      content: `Error: Tool '${name}' failed with: ${message}`,
    };
  }
}

async function executeParallelTools(
  toolCalls: OpenAI.ChatCompletionMessageToolCall[]
): Promise<ToolResult[]> {
  // Use Promise.allSettled for maximum resilience
  const settled = await Promise.allSettled(
    toolCalls.map((tc) =>
      executeToolSafe(
        tc.id,
        tc.function.name,
        JSON.parse(tc.function.arguments)
      )
    )
  );
  return settled.map((result, i) => {
    if (result.status === "fulfilled") return result.value;
    return {
      toolCallId: toolCalls[i].id,
      name: toolCalls[i].function.name,
      success: false,
      content: `Error: Unexpected failure - ${result.reason}`,
    };
  });
}
```
The critical pattern here is to always return a result for every tool call. If you skip a tool result, the model will not know what happened and may hallucinate the missing result or get stuck in a loop.
Sequential vs Parallel - Decision Framework
Choosing between sequential and parallel tool calling depends on the dependency graph of your operations. Here is a practical decision framework.
Use Parallel When:
- Reading multiple independent files - No dependencies between reads
- Running multiple linters or checks - Each check is independent
- Querying multiple APIs - External data fetches rarely depend on each other
- Searching across different directories - Multiple grep or search operations
- Fetching metadata for multiple items - Git blame, file stats, etc.
Use Sequential When:
- Results feed into the next operation - Read a config file, then use values from it to query a database
- Conditional branching - Check if a file exists before reading it
- Ordered mutations - Creating a branch, committing files, then opening a PR
- Pagination - Fetching page 2 requires knowing the cursor from page 1
- Authentication flows - Get a token before making authenticated requests
Hybrid Pattern - The Agent Loop
Most real-world agents use a hybrid approach. They let the model decide what can be parallelized on each turn, then loop until the task is complete.
```python
async def agent_loop(messages: list, tools: list, max_turns: int = 20):
    """Generic agent loop with automatic parallel tool execution."""
    for turn in range(max_turns):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        message = response.choices[0].message
        messages.append(message)
        # If no tool calls, the agent is done
        if not message.tool_calls:
            return message.content
        # Execute all tool calls from this turn in parallel
        # (execute_parallel_tools from the error-handling section above)
        results = await execute_parallel_tools(message.tool_calls)
        # Append all results
        for result in results:
            messages.append({
                "role": "tool",
                "tool_call_id": result.tool_call_id,
                "content": result.content,
            })
    return "Max turns reached"
```
This loop naturally handles both parallel and sequential patterns. When the model needs parallel execution, it emits multiple tool calls in one turn. When it needs sequential execution, it emits one tool call per turn and waits for the result before deciding the next step.
Real-World Examples in AI Code Review
Parallel tool calling is a core feature of modern AI code review tools and coding agents. Here is how some of the most popular tools use it.
AI Code Review Agents
When an AI code review tool like CodeRabbit or CodeAnt AI reviews a pull request, the agent typically needs to:
- Fetch the PR diff
- Read every changed file in full
- Read related files that import or are imported by the changed files
- Check the repository's linting and style rules
- Look up past review comments on similar code
Steps 2 through 5 can all happen in parallel once the diff is available from step 1. A well-designed code review agent will make one sequential call to get the diff, then fan out into parallel calls for everything else.
```
Turn 1: Model requests get_pr_diff(pr_number=42)
Turn 2: Model receives diff, requests in parallel:
        - read_file("src/auth.py")
        - read_file("src/api.py")
        - read_file("src/models/user.py")
        - search_code("import auth")
        - get_lint_config(".eslintrc.json")
Turn 3: Model has all context, generates review comments
```
Without parallel calling, turn 2 would take 5 separate round trips instead of 1. For a PR with 20 changed files, that difference can mean 30+ seconds saved per review.
Coding Agents
Tools like Claude Code, GitHub Copilot, and Cursor use parallel tool calling extensively for code exploration. When you ask a coding agent to "find all usages of the deprecated authenticate() function and update them," the agent might:
- Search for the function definition (sequential - need this first)
- Search for all call sites in parallel across multiple directories
- Read each file containing a call site in parallel
- Apply edits (sequential if order matters, parallel if independent files)
The ability to read 10 files simultaneously instead of one at a time is what makes modern coding agents feel responsive even on large codebases.
Agentic Testing Tools
AI testing tools like Qodo and other test generation platforms use parallel tool calling when generating tests for multiple functions. The agent can read all target functions in parallel, then generate test cases for each function in parallel, then write all test files in parallel. This three-phase parallel approach is dramatically faster than generating tests one function at a time.
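Under the assumption of hypothetical `read_function`, `generate_tests`, and `write_test_file` helpers (stand-ins for whatever tools a real testing agent exposes), the three-phase pattern can be sketched as:

```python
import asyncio

# Hypothetical tool stand-ins - a real agent would call actual tools here
async def read_function(name: str) -> str:
    await asyncio.sleep(0.01)
    return f"def {name}(): ..."

async def generate_tests(source: str) -> str:
    await asyncio.sleep(0.01)
    return f"# tests for: {source}"

async def write_test_file(path: str, content: str) -> str:
    await asyncio.sleep(0.01)
    return path

async def generate_all_tests(functions: list[str]) -> list[str]:
    # Phase 1: read all target functions in parallel
    sources = await asyncio.gather(*(read_function(f) for f in functions))
    # Phase 2: generate test cases for each function in parallel
    tests = await asyncio.gather(*(generate_tests(s) for s in sources))
    # Phase 3: write all test files in parallel
    return await asyncio.gather(
        *(write_test_file(f"tests/test_{f}.py", t)
          for f, t in zip(functions, tests))
    )

paths = asyncio.run(generate_all_tests(["login", "logout"]))
print(paths)  # ['tests/test_login.py', 'tests/test_logout.py']
```

Each phase is a barrier: everything inside it runs concurrently, but phase 2 waits for all of phase 1, mirroring the dependency structure of the task.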
Implementation Best Practices
1. Set Timeouts on Individual Tool Calls
Do not let a single slow tool call block the entire batch. Set per-tool timeouts:
```python
import asyncio

async def execute_with_timeout(coro, timeout_seconds: float = 30):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return "Error: Operation timed out"
```
2. Limit Concurrency
If your tools hit external APIs, use a semaphore to avoid rate limit errors:
```python
semaphore = asyncio.Semaphore(5)  # Max 5 concurrent tool executions

async def execute_with_limit(name: str, args: dict):
    async with semaphore:
        return await execute_tool(name, args)
```
3. Log Tool Call Batches for Debugging
When debugging agent behavior, log each batch of parallel calls so you can trace the execution flow:
```python
import logging
from uuid import uuid4

logger = logging.getLogger("agent")

async def execute_parallel_tools_logged(tool_calls):
    batch_id = uuid4().hex[:8]
    names = [tc.function.name for tc in tool_calls]
    logger.info(f"Batch {batch_id}: executing {names} in parallel")
    results = await execute_parallel_tools(tool_calls)
    failed = [r for r in results if not r.success]
    if failed:
        logger.warning(
            f"Batch {batch_id}: {len(failed)} tools failed: {[f.name for f in failed]}"
        )
    return results
```
4. Return Structured Error Messages
When a tool fails, return the error as structured text that helps the model recover:
```python
# Bad - the model cannot reason about what went wrong
return "Error"

# Good - the model knows the specific failure and can retry or work around it
return (
    "Error: File 'src/auth.py' not found. The file may have been deleted or "
    "renamed. Try searching for files matching 'auth' to find the correct path."
)
```
5. Consider Tool Result Size
When running many tools in parallel, the combined results can be very large. If 10 file reads each return 500 lines of code, you are sending 5,000 lines of code back to the model in a single turn. This consumes a lot of context window and can degrade model performance.
Strategies to manage this:
- Truncate large file reads to the most relevant sections
- Summarize tool results when full detail is not needed
- Use a two-phase approach: first read file metadata (size, language), then read only the files that matter
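A minimal truncation helper for the first strategy might look like this (the line thresholds are arbitrary defaults, not recommendations):

```python
def truncate_result(text: str, max_lines: int = 200, keep_head: int = 150) -> str:
    """Clip an oversized tool result, telling the model what was cut."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    keep_tail = max_lines - keep_head
    omitted = len(lines) - keep_head - keep_tail
    return "\n".join(
        lines[:keep_head]
        # Explicit marker so the model knows content is missing and can ask for more
        + [f"... [{omitted} lines omitted; request a narrower range if needed] ..."]
        + lines[-keep_tail:]
    )

big = "\n".join(f"line {i}" for i in range(1000))
clipped = truncate_result(big)
print(len(clipped.splitlines()))  # 201 (200 kept + 1 marker line)
```

The marker line matters: silently truncated results lead the model to assume it saw the whole file.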
API Comparison Table
Here is a quick comparison of how parallel tool calling works across the three major LLM APIs:
| Feature | OpenAI | Anthropic Claude | Google Gemini |
|---|---|---|---|
| Parallel by default | Yes | Yes | Yes |
| Disable parallel | `parallel_tool_calls=false` | `tool_choice` with `disable_parallel_tool_use` | Via system prompt |
| Tool call container | `message.tool_calls[]` | `content[]` `tool_use` blocks | `parts[]` `function_call` |
| Result format | `role: "tool"` messages | `tool_result` blocks | `function_response` parts |
| Max parallel calls | No hard limit | No hard limit | No hard limit |
| Streaming support | Yes | Yes | Yes |
| Models supported | GPT-4o, GPT-4 Turbo+ | Claude 3.5+, Claude 4 | Gemini 1.5 Pro, 2.0 |
Conclusion
Parallel tool calling is not just a performance optimization - it is a fundamental capability that makes LLM agents practical for real-world tasks. Without it, a code review agent that needs to read 15 files would take over a minute just on model round trips. With parallel calling, that drops to seconds.
The implementation patterns are straightforward across all three major APIs. The model decides what can run in parallel, your runtime executes the calls concurrently, and you return all results at once. The most important details are proper error handling (never drop a result) and concurrency management (use semaphores for rate-limited APIs).
If you are building AI agents for code review, testing, or development workflows, parallel tool calling should be one of the first optimizations you implement. The combination of reduced latency, fewer model round trips, and lower token costs makes it a clear win for any agent that touches more than a couple of tools per task.
For a deeper look at how AI code review tools use these patterns in practice, see our guides on best AI code review tools and how to automate code review.
Frequently Asked Questions
What is parallel tool calling in LLM agents?
Parallel tool calling is a feature where a large language model requests multiple tool or function calls in a single response instead of making them one at a time. This allows the agent runtime to execute all the calls simultaneously, reducing total latency and making AI agents significantly faster for tasks that involve independent operations like fetching files, querying APIs, or running checks in parallel.
Which LLM APIs support parallel tool calling?
OpenAI supports parallel function calling with GPT-4o, GPT-4 Turbo, and newer models. Anthropic Claude supports parallel tool use with Claude 3.5 Sonnet, Claude 3 Opus, and Claude 4 models. Google Gemini supports parallel function calling with Gemini 1.5 Pro and Gemini 2.0 models. All three APIs return multiple tool call requests in a single assistant message when appropriate.
How much faster is parallel tool calling compared to sequential?
Parallel tool calling can reduce total execution time by 50-80% depending on the number of independent tool calls and individual call latency. For example, if an agent needs to fetch 5 files that each take 200ms, sequential execution takes about 1 second while parallel execution completes in roughly 200ms. The speedup scales with the number of independent calls.
Can I disable parallel tool calling in the OpenAI API?
Yes. In the OpenAI API you can set parallel_tool_calls to false in your request to force the model to make only one tool call per response. This is useful when tool calls have dependencies on each other or when you need strict ordering of operations. Anthropic offers a comparable control by setting disable_parallel_tool_use to true inside the tool_choice parameter, while Gemini has no explicit toggle and must be guided through system prompts.
How should I handle errors in parallel tool calls?
Use Promise.allSettled in TypeScript or asyncio.gather with return_exceptions=True in Python so that one failing call does not cancel the others. Return the error message as the tool result for the failed call so the LLM can decide how to recover. Never silently drop failed results because the model expects a result for every tool call it requested.
What is the difference between parallel tool calling and multi-turn tool use?
Parallel tool calling happens within a single turn where the model requests multiple tools at once and they all execute simultaneously. Multi-turn tool use is when the model makes a tool call, receives the result, then makes another tool call in the next turn based on that result. Most real-world agents combine both patterns since some operations depend on previous results while others can run in parallel.
Originally published at aicodereview.cc