Multimodal Agents Sound Great. Deployment Is Brutal.
GPT-4o and Claude can see, hear, and reason. Getting them to do useful work in production is harder than the capabilities suggest.
The models are genuinely multimodal now. GPT-4o processes images, audio, and text natively. Claude's vision capabilities handle screenshots, documents, and diagrams. You can build agents that see a screen, hear a conversation, read a document, and respond coherently.
The capability demos are impressive. Agents that read receipts and file expenses. Agents that watch video meetings and summarize action items. Agents that interpret charts and answer questions about them.
Production deployment tells a different story.
The first problem is latency. Vision processing is slow. An agent that analyzes a screenshot takes seconds, not milliseconds. For interactive use cases—agents that respond to what's on your screen in real time—this latency kills the experience.
The workaround is preprocessing. Capture screenshots periodically and analyze them in the background. But this trades latency for staleness. By the time the agent understands what's on screen, the screen has changed. You end up building complex state management to track what the agent knows versus what's actually happening.
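The knows-versus-happening gap can be made explicit. A minimal sketch, assuming a hypothetical `analyze()` stand-in for the slow vision call: tag each interpretation with the frame it came from, then treat it as stale when the frame hash or the clock disagrees.

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass
class ScreenState:
    """What the agent believes about the screen, and when it learned it."""
    summary: str        # model's interpretation of the analyzed frame
    frame_hash: str     # hash of the frame that was actually analyzed
    captured_at: float  # when that frame was captured

def frame_hash(frame: bytes) -> str:
    return hashlib.sha256(frame).hexdigest()

def is_stale(state: ScreenState, current_frame: bytes,
             max_age_s: float = 5.0) -> bool:
    """Knowledge is stale if the screen changed or too much time passed."""
    if frame_hash(current_frame) != state.frame_hash:
        return True
    return (time.time() - state.captured_at) > max_age_s

# Hypothetical stand-in for a multi-second vision-model call.
def analyze(frame: bytes) -> str:
    return f"screen with {len(frame)} bytes"

frame_a = b"login form"
state = ScreenState(analyze(frame_a), frame_hash(frame_a), time.time())

assert not is_stale(state, frame_a)        # same frame, still fresh
assert is_stale(state, b"dashboard view")  # screen changed under the agent
```

This is the seed of the "complex state management" the article describes: real systems end up layering region diffing, event hooks, and re-analysis queues on top of exactly this check.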
The second problem is cost. Vision tokens are expensive. A single screenshot can consume thousands of tokens. If your agent needs continuous visual awareness, the API bills add up fast. The economics work for occasional high-value tasks. They don't work for persistent visual monitoring.
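The "thousands of tokens" figure is easy to sanity-check. A back-of-envelope sketch, using the tiling scheme OpenAI has documented for GPT-4o-class high-detail images (85 base tokens plus 170 per 512x512 tile after rescaling); treat the constants as assumptions and check current docs and pricing before relying on them:

```python
import math

def image_tokens(width: int, height: int,
                 base: int = 85, per_tile: int = 170) -> int:
    """Rough input-token estimate for one high-detail image."""
    # Rescale to fit within 2048x2048, then shortest side to 768.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return base + per_tile * tiles

# A 1920x1080 screenshot rescales to ~1365x768: six tiles.
tokens = image_tokens(1920, 1080)   # 85 + 170 * 6 = 1105 tokens per frame

# "Continuous visual awareness": one frame every 2s over an 8-hour day.
frames_per_day = 8 * 3600 // 2
daily_tokens = tokens * frames_per_day  # roughly 16M input tokens per day
```

At any plausible per-token price, persistent monitoring at that volume dwarfs the cost of the occasional high-value task, which is the economic cliff the article describes.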
The third problem is reliability. Vision models hallucinate, and the hallucinations look plausible. An agent reads a document and confidently extracts a number that isn't there. It interprets a chart and reports a trend that the data doesn't show. The errors aren't random—they're systematic misreadings that require ground truth to catch.
For text-based agents, you can often verify outputs against source material. For vision-based agents, the source material is an image, and the only way to verify is to manually look at the image yourself. The automation savings disappear if a human needs to check every visual interpretation.
The fourth problem is scope. Vision models understand images. They don't understand applications. An agent can see that a button says "Submit" but can't predict what clicking it will do. It can read a form but doesn't know which fields are required or what validation will occur.
This matters for agents that need to take actions, not just interpret displays. Computer Use and similar approaches add action capability to vision, but they inherit all the vision problems plus the additional complexity of controlling interfaces.
Where multimodal agents actually work in production: batch processing of visual documents.
Invoice processing. Receipt extraction. Document classification. Tasks where images come in batches, processing time isn't user-facing, costs are predictable per document, and outputs can be verified against business rules.
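"Verified against business rules" is what separates these use cases from the interactive ones. A minimal sketch with hypothetical field names (not drawn from any particular OCR or vision API): mechanical checks that catch systematic misreadings, like a hallucinated total, without a human re-reading the image.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical extracted record; field names are illustrative.
@dataclass
class ExtractedInvoice:
    invoice_date: str
    line_items: list[float]
    total: float

def validate(inv: ExtractedInvoice, tolerance: float = 0.01) -> list[str]:
    """Business-rule checks on a vision model's extraction output."""
    errors = []
    try:
        date.fromisoformat(inv.invoice_date)
    except ValueError:
        errors.append(f"unparseable date: {inv.invoice_date!r}")
    if any(x <= 0 for x in inv.line_items):
        errors.append("non-positive line item")
    if abs(sum(inv.line_items) - inv.total) > tolerance:
        errors.append(f"line items sum to {sum(inv.line_items):.2f}, "
                      f"but total reads {inv.total:.2f}")
    return errors

good = ExtractedInvoice("2024-05-01", [19.99, 5.00], 24.99)
bad = ExtractedInvoice("2024-05-01", [19.99, 5.00], 29.99)  # misread total
assert validate(good) == []
assert any("total reads" in e for e in validate(bad))
```

Records that fail validation go to a human review queue; records that pass flow straight through. That routing is what makes the per-document economics work.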
For these use cases, multimodal capabilities are transformative. A system that was impossible without vision is now tractable. The latency and cost are acceptable because they're budgeted per document, not per interaction.
The pattern I keep seeing: teams start with the dream of an interactive vision agent, hit the latency and cost walls, and retreat to batch processing where the economics work. The capability exists for real-time multimodal agents. The infrastructure to make them practical doesn't.
I expect this to improve. Latency will drop. Costs will fall. Reliability will increase. The trajectory is clear even if the timeline isn't.
For now, build for batch visual processing if you need multimodal. Save the real-time agent demos for fundraising decks.