
Why Local AI Models Fail at Agent Tasks: A Benchmark Analysis

Posted by u/Yogawife · 2026-05-16 14:55:26

Agent tasks—such as calling tools, following protocols, and chaining actions—demand more than just strong code generation. In a recent experiment, I tested six local language models on a custom agent readiness benchmark. The results were eye-opening: models that scored over 90% on code quality often scored below 20% on agent tasks. Even the best performer, SmolLM3-3B, managed only 50%. This analysis uncovers the disconnect between code benchmarks and true agent capability, and offers practical guidance for developers working with local models.

What was the initial assumption about code quality and agent capability?

Many in the AI community assume that high performance on code-quality benchmarks automatically translates to strong agent behavior. The reasoning seems logical: if a model can generate correct Python, read files, and fix bugs with over 93% accuracy, it should be able to call a single function when instructed. In my experiment, that assumption crumbled within the first two minutes of testing. The models that excelled at isolated code tasks often failed to interpret simple tool-calling instructions. They could produce flawless code from a prompt, but struggled to work within a structured protocol where they had to choose among multiple tools, obey a tool_choice: required directive, or remain silent when no tools were available. This reveals a fundamental difference between producing outputs from a prompt and following a dynamic, multi-step process.
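
To make the protocol concrete, here is roughly the shape of request an agent framework sends. This is a minimal sketch, not my actual benchmark harness; the base URL, model name, and the read_file tool are placeholders for whatever your local server exposes.

```python
# Minimal sketch of a tool-calling request against an OpenAI-compatible local
# server. The endpoint, model name, and tool schema below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file and return its contents",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Read config.yaml and summarize it."}],
    tools=tools,
    tool_choice="required",  # the directive several models simply ignored
)

# A protocol-aware model answers here with structured tool calls, not prose
print(response.choices[0].message.tool_calls)
```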

How was the agent readiness benchmark designed?

To measure real agent capability, I built a benchmark with six pass/fail dimensions: calling a single tool when explicitly told, selecting the correct tool from a set of three, obeying a tool_choice: required instruction, staying silent when no tools exist, chaining tool calls across multiple conversation turns, and passing the correct arguments in each call. To make the tests feasible for local models, I also created a 100-line translation proxy. Many local models output tool calls as plain text—<tool_call> blocks, JSON, or Python syntax—but OpenAI-compatible agent frameworks expect the native tool_calls format. Without this proxy, most models would score 0% simply due to format mismatch. The benchmark was run on a laptop using purely open-weight models under 15 billion parameters.
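
The real proxy lives at github.com/vystartasv/toolcall-proxy; the snippet below is only a simplified sketch of the core idea, with my own regex and helper names rather than the repo's code. It shows what "translation" means here: pull the text-based <tool_call> block out of the model's reply and re-emit it as a native tool_calls entry.

```python
# Simplified illustration of what a translation proxy has to do: lift a
# text-based <tool_call> block out of the reply and rewrite it as an
# OpenAI-style tool_calls entry. Names and structure are illustrative only.
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def text_to_tool_calls(content: str):
    """Convert '<tool_call>{...}</tool_call>' text into native tool_calls."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content):
        payload = json.loads(match.group(1))  # e.g. {"name": ..., "arguments": {...}}
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls

reply = '<tool_call>{"name": "write_file", "arguments": {"path": "out.txt", "content": "hi"}}</tool_call>'
print(text_to_tool_calls(reply))
```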

What were the key results for different models?

The results exposed a stark gap. SmolLM3-3B, which scored 93.3% on the code quality benchmark, managed only 50% on agent tasks. It could call single tools correctly and write files with proper arguments, but when given three tools and asked to choose, it froze. It also failed at chaining two calls across turns. Phi-4-mini, with a 90% code score, landed at just 17% on agents—the only dimension it passed was “no false positives,” meaning it stayed quiet instead of hallucinating. That became the ceiling. Qwen2.5-Coder-14B, weighing in at 7.7 gigabytes, scored 85% on code but couldn’t call a single tool. Llama 3.1-8B followed the same pattern: larger model, zero agent capability. The highest scoring model for agents was SmolLM3-3B at 50%, proving that code benchmarks alone are misleading.

Why is there such a big gap between code and agent performance?

Code benchmarks test a model’s ability to generate correct output from a static prompt—essentially pattern completion. Agent tasks, on the other hand, require following a protocol: receive a set of tools, reason about which to use, make the call, receive the result, and decide the next step. A model that writes perfect Python can still fail to understand that search_files is the right tool when a user says “find files.” A drop from over 90% on code to as low as 17% on agent tasks isn’t a fluke; it reveals a capability that many open-weight models under 15 billion parameters simply lack. These models haven’t been trained to handle the multi-turn reasoning and protocol adherence that agent workflows demand. Their training data focuses more on static text generation than on interactive tool use.
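
A bare-bones agent loop makes the difference obvious. This is a generic sketch of the protocol, not the benchmark's code; the endpoint, model name, tool schemas, and stubbed tool implementations are all placeholders.

```python
# Sketch of the protocol an agent task demands, as opposed to one-shot
# generation: the model must pick a tool, see its result on the next turn,
# and decide what to do next. Everything here is illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

TOOLS = [
    {"type": "function", "function": {
        "name": "search_files",
        "description": "Search for files matching a query",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file by path",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
]

def search_files(query: str):
    return ["notes/todo.md"]              # stub result, just for illustration

def read_file(path: str):
    return "- [ ] publish the benchmark"  # stub result, just for illustration

DISPATCH = {"search_files": search_files, "read_file": read_file}

messages = [{"role": "user", "content": "Find my todo list and read it back to me."}]

for _ in range(4):  # hard cap on turns; a real agent needs a proper stop rule
    reply = client.chat.completions.create(
        model="local-model", messages=messages, tools=TOOLS,
    ).choices[0].message
    if not reply.tool_calls:              # the model answered in prose: we're done
        print(reply.content)
        break
    messages.append(reply)                # keep the assistant's tool-call turn in context
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        result = DISPATCH[call.function.name](**args)
        messages.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": json.dumps(result)})
```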

What role does architecture play versus model size?

Architecture matters more than parameter count. Qwen2.5-Coder-14B, with 14 billion parameters and a large 7.7 GB footprint, couldn’t call a single tool. SmolLM3-3B, at only about 1.8 GB, managed a 50% agent score. This counterintuitive result shows that parameter count tells you nothing about agent readiness. The internal design—how the model handles instruction following, memory, and tool integration—seems crucial. Some smaller models are specifically fine-tuned for chat or agentic behaviors, while larger models may be optimized for code generation without the necessary protocol awareness. When building local agents, relying on size alone is a recipe for disappointment.

What practical advice does this give for building local AI agents?

First, test tool calling separately from code quality—the correlation is weak, so a 90% code model might be useless as an agent. Second, use a translation proxy to convert text-based tool calls into the JSON format that agent frameworks expect. My 100-line proxy (available at github.com/vystartasv/toolcall-proxy) alone raised some models from 0% to 17%. Third, don’t assume bigger means better for agent tasks; architecture beats parameter count. Fourth, benchmark your model on real tasks before building your system. I built the proxy first, but should have tested the models earlier. The benchmark and results are available at benchmarks.workswithagents.dev. Remember: something else will break tomorrow—it always does.
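
The first point is easy to wire up as a standalone smoke test before you build anything on a model. Again, the endpoint and model name below are placeholders, and the pass criteria only mirror the first two benchmark dimensions (single tool call, correct arguments), not the full suite.

```python
# Quick pass/fail smoke test for tool calling, separate from code quality.
# Endpoint and model name are placeholders; adjust for your local server.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

WRITE_FILE = {"type": "function", "function": {
    "name": "write_file",
    "description": "Write text to a file",
    "parameters": {"type": "object",
                   "properties": {"path": {"type": "string"},
                                  "content": {"type": "string"}},
                   "required": ["path", "content"]}}}

def can_call_single_tool(model: str) -> bool:
    msg = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Use the write_file tool to save 'hello' to hello.txt."}],
        tools=[WRITE_FILE],
    ).choices[0].message
    if not msg.tool_calls:  # fail: the model answered in prose or stayed silent
        return False
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    return call.function.name == "write_file" and args.get("path") == "hello.txt"

print("single tool call:", "PASS" if can_call_single_tool("local-model") else "FAIL")
```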

What tools or resources were developed from this experiment?

Two key resources came from this work. First, the translation proxy (github.com/vystartasv/toolcall-proxy) converts local models’ text-based tool calls into the native tool_calls format used by OpenAI-compatible agent frameworks. It’s only 100 lines of code, but it bridges a format mismatch that can cripple agent performance. Second, the agent readiness benchmark itself is open and hosted at benchmarks.workswithagents.dev. It tests six pass/fail dimensions and is designed to run on a laptop. The benchmark helped reveal which models can actually function as agents, and which fail despite stellar code scores. These resources let others quickly evaluate models before committing to building around them. The goal is to save developers from the same two minutes of false hope I experienced.