Building the proactive, multi-agent architecture powering Scouts
By the Yutori team on December 22, 2025
Scouts are proactive, always-on AI agents that monitor the web — for news, price drops, reservations, tickets, leads — anything you care about.
To get timely, comprehensive coverage of information across the long tail of the web, an average Scout run invokes 76 agents across a wide variety of tools, and processes ~1M tokens. It is available via our API at 35 cents per run. The Claude Code query to extract this stat from our database cost us 64 cents (!).
Besides the per-run costs, the proactive, always-on nature of Scouts creates design challenges that don't exist in reactive, one-off agents. Scouts must be effective at managing large contexts: they scan the entire web for information — for days, weeks, sometimes months on end — and must then check new findings against all previous reports to avoid surfacing stale or duplicate content. They must also adapt to a dynamic web and to changing user preferences.
In this post, we share the principles we arrived at as we took Scouts from research to production — the architectural decisions that make continuous monitoring practical, the pitfalls we encountered, and how we balanced quality with cost.
Powering Scouts is a multi-agent architecture that required careful engineering of context, tools, and infrastructure — combined with a state-of-the-art web navigation agent to access the heavy tail of the web that doesn't have APIs.
Table of Contents
- Distribute the work across multiple specialized subagents
- Design for coverage and persistence
- Tell your agents what tools they can access and when to use them
- Don't assume credibility; teach your model to fact check
- Let the agent extract feedback from every user interaction
- Prepending past reports in context leads to drift towards them
- Putting it all together
Distribute the work across multiple specialized subagents
The first and simplest version we built used a single agent to handle the entire pipeline — from understanding user intent to calling tools, verifying results, checking for duplicates against past reports, and drafting new reports — following the "get out of the model's way" principle.
Scouting for rental properties in Seattle.
However, cramming all tokens into the context of a single model quickly runs into issues:
- Not all tool results fit into the model's context
- Performance degrades as more tool responses are added to the context
- Even if everything fit in context, the costs for simple tasks make real-world usage prohibitive
We found that a modular architecture with specialized subagents, each with its own tools tailored to a specific subtask, and an orchestrator with access to the full task context directing them, simply scales better.
This architecture enables:
- Spending more tokens without blowing up context. Each subagent tackles a focused subtask with only the relevant context; collectively, the subagents spend far more tokens than any single agent's context could carry.
- Separation of concerns. Having distinct subagents for each role enables us to optimize each independently (similar to microservices in software 1.0). Subagents prioritize recall, calling multiple tools and checking numerous sources for coverage, while the orchestrator focuses on precision, filtering for relevance. The result is a system where you get both breadth and accuracy.
- Managing intelligence intelligently. A multi-agent system enables splitting context not just by task, but also by the level of intelligence needed for each task. E.g. fetching the top posts from a subreddit or extracting pricing from a product listing page doesn't require sophisticated reasoning, so these simpler operations can be routed to faster, more cost-effective models.
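As a concrete illustration of the last point, the routing decision can be sketched as a simple tier picker. The model names and the `Subtask` shape below are hypothetical, not our production stack:

```python
# Sketch of routing subtasks to model tiers by required intelligence.
# Model names and the Subtask shape are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    needs_reasoning: bool  # does this step require multi-step reasoning?

def pick_model(task: Subtask) -> str:
    # Simple fetching/extraction goes to a fast, cheap model;
    # planning and verification go to a stronger one.
    return "strong-reasoning-model" if task.needs_reasoning else "fast-cheap-model"

tasks = [
    Subtask("Fetch top posts from a subreddit", needs_reasoning=False),
    Subtask("Extract pricing from a product listing page", needs_reasoning=False),
    Subtask("Cross-check claims across sources and draft the report", needs_reasoning=True),
]

routed = {t.description: pick_model(t) for t in tasks}
```

In practice the routing signal can come from the orchestrator itself, which knows the nature of each subtask when it dispatches it.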
We note that coordinating across subagents does introduce some overhead. Agents need to communicate, passing relevant context to each other for their respective subtasks. At the same time, with every round of message passing, information degrades (akin to the telephone game) — context gets lost, facts become misrepresented, unsubstantiated claims leak into the final report. Balancing communication overhead with context management will vary depending on your use case.
A popular approach to context engineering is context compression. It, too, introduces trade-offs (e.g., cache misses), and getting it right is non-trivial. Once we adopted the multi-agent architecture and carefully optimized the context, compression was no longer necessary.
Design for coverage and persistence
Consider the following Scouts query:
"Monitor new technical blog posts, research papers and technical discussions on social media relevant to computer and browser use models."
Fan-out search with subagents
The orchestrator breaks this down and assigns it to separate subagents — one for searching social media, another for searching academic journals, etc. Each subagent explores in parallel with its own tooling. Fan-out search improves coverage substantially by expanding the number of tools and sources explored in a Scout run.
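Conceptually, fan-out looks like the following sketch, where `run_subagent()` is a placeholder for a real model call with source-specific tools:

```python
# Minimal sketch of fan-out search: the orchestrator launches one subagent
# per source in parallel and merges their results. run_subagent() stands in
# for a real model/tool invocation.

import asyncio

async def run_subagent(source: str, query: str) -> list[str]:
    # In production this would call a model equipped with source-specific tools.
    await asyncio.sleep(0)  # stand-in for network/tool latency
    return [f"{source}: result for '{query}'"]

async def fan_out(query: str, sources: list[str]) -> list[str]:
    batches = await asyncio.gather(*(run_subagent(s, query) for s in sources))
    return [item for batch in batches for item in batch]  # flatten

findings = asyncio.run(fan_out(
    "computer and browser use models",
    ["web_search", "arxiv", "social_media"],
))
```

Each subagent explores independently, so adding a new source is a matter of adding an entry to the list rather than growing a single agent's context.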
Enrich the intent
The orchestrator also translates the user's query into imperative step-by-step instructions for the subagents, enriched with time bounds, content categories, quality constraints, and concrete deliverables.
In instances where the user's query may be ambiguous, the orchestrator may invoke multiple subagents (to span multiple modes), each with its own enriched context.
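Mechanically, enrichment can be sketched as a small prompt builder. The function and field names below are purely illustrative, not our actual implementation:

```python
# Hypothetical sketch of intent enrichment: turning a terse user query into
# imperative, time-bounded instructions for a subagent.

from datetime import date, timedelta

def enrich(query: str, lookback_days: int, sources: list[str], todos: list[str]) -> str:
    cutoff = date.today() - timedelta(days=lookback_days)
    lines = [
        f"{query} published in the past {lookback_days} days (on or after {cutoff}).",
        f"CRITICAL: Only include information published on or after {cutoff}. Discard anything older.",
        "Sources to check:",
        *[f"- {s}" for s in sources],
        "Todo list:",
        *[f"{i}. {t}" for i, t in enumerate(todos, 1)],
    ]
    return "\n".join(lines)

prompt = enrich(
    "Search for recent technical blog posts and research papers about computer use models",
    7,
    ["Google search with date filter", "ArXiv for recent research papers"],
    ['Search for "computer use models" with date filter'],
)
```

The point is not the template itself but that the orchestrator, which holds the full task context, is the right place to inject time bounds and deliverable formats the user never spelled out.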
For the example above, the orchestrator provides the following context to the web-search subagent:
Query sent to web search agent
Search for recent technical blog posts and research papers about computer use models and browser use models published in the past 7 days (on or after December 12, 2025).
**CRITICAL:** Only include information published on or after December 12, 2025. Discard anything older.
**Sources to check:**
- Google search with date filter for past 7 days
- ArXiv for recent research papers
- Medium, Dev.to, and technical blogs
- Company technical blogs (Google, Mozilla, Microsoft, etc.)
**Todo list:**
1. Search for "computer use models" OR "browser use models" with date filter (past 7 days)
2. Search for "AI browser automation" OR "browser agents" with date filter
3. Search ArXiv for papers on browser use, computer use, or agent models
4. Search for "LLM browser interaction" OR "multimodal agents" with date filter
**Deliverables:**
- For each blog post/article: Title, publication date (YYYY-MM-DD), source/publication, URL, brief summary of what it discusses
- For each research paper: Title, authors, publication date, ArXiv link, brief summary
- Only include items published on or after December 12, 2025
- Format as a structured list with clear dates and sources
The resulting context for the subagent is longer, more structured, and operationally explicit — which yields higher-quality responses. In practice, the difference is stark:
- Without intent enrichment: The response is unstructured, with missing citations and publication dates, and low recall (6 results found, most of which are outdated or only tangentially relevant).
- With intent enrichment: The response is structured, with complete citations, publication dates, and summaries, uncovering 14 results — most of which are highly relevant and recent; the rest are filtered out by the orchestrator.
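The orchestrator's precision pass can be as simple as a date-window filter over the subagents' recall-heavy findings. This is a hedged sketch; the field names are illustrative:

```python
# Sketch of the orchestrator's precision pass: drop subagent findings that
# fall outside the requested date window before drafting the report.

from datetime import date

def filter_recent(findings: list[dict], cutoff: date) -> list[dict]:
    kept = []
    for f in findings:
        published = date.fromisoformat(f["publication_date"])
        if published >= cutoff:
            kept.append(f)
    return kept

findings = [
    {"title": "Fara-7B announcement", "publication_date": "2025-11-24"},
    {"title": "Computer Use Models", "publication_date": "2025-12-18"},
]
recent = filter_recent(findings, date(2025, 12, 12))
# only the 2025-12-18 item survives the cutoff
```

Relevance filtering is fuzzier than date filtering and is done by the orchestrator model itself, but the division of labor is the same: subagents over-collect, the orchestrator prunes.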
Query and response of web search agent without intent enrichment
>>> Subagent query
Monitor new technical blog posts, research papers and technical discussions on social media relevant to computer and browser use models. Only include information published on or after December 12, 2025.
>>> Subagent response
### Comprehensive Report on Technical Developments in Computer and Browser Use Models (Published on or after December 12, 2025)
This report details verified technical developments in computer use models and browser use models published on or after December 12, 2025, based on exhaustive web research. All information is sourced from primary publications and official documentation.
#### 1. Google's Gemini 2.5 Computer Use Model (Official Documentation)
The most significant and verified development is the official documentation for the Gemini 2.5 Computer Use model, which was last updated on December 18, 2025.
* **Model ID:** `gemini-2.5-computer-use-preview-10-2025`
* **Core Functionality:** This model enables AI agents to interact with computer interfaces by analyzing screenshots and generating precise UI actions. It can "see" a screen and "act" by performing actions like clicking, typing, scrolling, and navigating web browsers.
* **Supported Actions:** The model can request 13 specific UI actions via its API, including `click_at`, `type_text_at`, `navigate`, `scroll_document`, `drag_and_drop`, and `open_web_browser`. Developers must implement client-side logic to execute these actions.
* **Technical Specifications:** The model supports an input token limit of 128,000 and an output token limit of 64,000. It accepts image and text inputs and generates text outputs.
* **Availability:** The model is available to developers through the Gemini API via Google AI Studio and Vertex AI.
* **Critical Requirements and Limitations:** As a preview model, it is explicitly labeled as potentially prone to errors and security vulnerabilities. Google mandates that agents be run in secure, sandboxed environments. Developers must implement robust safety measures, including user confirmation for high-risk actions (e.g., purchases), input sanitization, and content guardrails. The model is primarily optimized for web browsers and is not yet designed for full desktop OS-level control. A recommended screen resolution of (1440, 900) is advised for optimal performance.
#### 2. Microsoft's Fara-7B: An Efficient Agentic Model for Computer Use
Microsoft Research published a detailed blog post on November 24, 2025, introducing Fara-7B, an open-weight, 7-billion parameter model designed for computer use.
* **Architecture:** Fara-7B is built on the Qwen2.5-VL-7B base model. It operates as a single, native agent that perceives the computer environment solely through screenshots of the browser window, without relying on accessibility trees or other metadata.
* **Training Methodology:** The model was trained on synthetic data generated by a novel system called FaraGen, which creates diverse, verified multi-step web task trajectories at a low cost.
* **Capabilities:** Fara-7B can automate everyday web tasks by visually perceiving a webpage and executing actions like scrolling, typing, and clicking on predicted coordinates. It uses standard browser automation tools (e.g., Playwright) for these actions.
* **Performance:** Fara-7B achieves state-of-the-art performance within its size class (7B parameters) and is competitive with larger models. On the WebVoyager benchmark, it achieved a 73.5% task success rate, outperforming UI-TARS-1.5-7B (66.4%) and GPT-4o (65.1%). On a new benchmark, WebTailBench, it achieved a 38.4% success rate, surpassing GPT-4o (30.0%).
* **Availability:** The model is open-weight and available under an MIT license on Microsoft Foundry and Hugging Face.
#### 3. Browser-Use: Open-Source AI Agent for Web Automation
The Browser-Use project, an open-source framework for AI-driven web automation, has been actively developed, with its latest release occurring on December 17, 2025.
* **Core Functionality:** Browser-Use is a Python library that enables AI agents to control web browsers using Large Language Models (LLMs). It combines visual recognition (via screenshots) with browser automation tools like Playwright to perform tasks like form filling, data extraction, and multi-step workflows.
* **Supported LLMs:** The project supports a variety of LLM providers, including OpenAI, Anthropic, Google, and its own optimized model, `ChatBrowserUse`. It also supports local models via Ollama.
* **Key Features:** It offers a cloud service for scalable, stealth-enabled browser automation, handles authentication via real browser profiles, and provides ready-to-run templates for quick deployment. The project emphasizes making websites accessible for AI agents.
* **Status:** The project is actively maintained, with its latest code commit and release occurring on December 17, 2025, confirming its relevance to the specified date range.
#### 4. Other Relevant Technical Developments
* **BRAID: Bounded Reasoning for Autonomous Inference and Decisions:** A paper published on arXiv on December 17, 2025, introduces BRAID, a framework for structured prompting of LLMs using instruction graphs (Mermaid-based). While not a browser use model, it is a significant advancement in LLM reasoning efficiency that can be applied to improve the performance and reliability of computer use agents by enabling more structured, cost-efficient, and accurate decision-making.
* **CSS Grid Lanes:** A blog post from WebKit on December 19, 2025, introduced CSS Grid Lanes, a new CSS feature for creating masonry-style layouts. While not an AI model, this is a critical advancement in web development technology that directly impacts the structure and layout of web pages that browser use models must interact with, potentially simplifying the task of element detection and interaction for these agents.
* **GotaTun:** A blog post from Mullvad VPN on December 19, 2025, announced GotaTun, a new implementation of the WireGuard protocol written in Rust. This is a significant development in network security and privacy infrastructure, which is foundational for secure and reliable remote operation of AI agents, especially in cloud-based deployments.
#### 5. Excluded Information
* **Gemini 2.5 Computer Use Model (Google Blog):** The initial announcement by Google on October 7, 2025, was published before the required cutoff date and is therefore excluded from this report.
* **Anthropic's Claude 3.5 Sonnet:** The introduction of its "Computer Use" capability on October 22, 2024, is also before the cutoff date and excluded.
* **OpenAI's Computer-Using Agent (Operator):** The announcement on January 23, 2025, is excluded for the same reason.
* **Reddit and Social Media Discussions:** While numerous discussions about these models exist on platforms like Reddit and LinkedIn, these are secondary sources and were not used as primary evidence for factual claims in this report. The official documentation and research papers are the authoritative sources.
Query and response of web search agent with intent enrichment
>>> Subagent query
Search for recent technical blog posts and research papers about computer use models and browser use models published in the past 7 days (on or after December 12, 2025).
CRITICAL: Only include information published on or after December 12, 2025. Discard anything older.
Sources to check:
- Google search with date filter for past 7 days
- ArXiv for recent research papers
- Medium, Dev.to, and technical blogs
- Company technical blogs (Google, Mozilla, Microsoft, etc.)
Todo list:
1. Search for "computer use models" OR "browser use models" with date filter (past 7 days)
2. Search for "AI browser automation" OR "browser agents" with date filter
3. Search ArXiv for papers on browser use, computer use, or agent models
4. Search for "LLM browser interaction" OR "multimodal agents" with date filter
Deliverables:
- For each blog post/article: Title, publication date (YYYY-MM-DD), source/publication, URL, brief summary of what it discusses
- For each research paper: Title, authors, publication date, ArXiv link, brief summary
- Only include items published on or after December 12, 2025
- Format as a structured list with clear dates and sources
>>> Subagent response
### Recent Technical Blog Posts and Research Papers on Computer Use Models and Browser Use Models (Published on or after December 12, 2025)
#### **Blog Posts and Articles**
1. **Title:** Computer Use Models
**Publication:** the singularity is nearer
**Publication Date:** 2025-12-18
**URL:** https://geohot.github.io//blog/jekyll/update/2025/12/18/computer-use-models.html
**Summary:** The post explores the capabilities of Claude Code as an elite AI hacker, envisioning a future where a local AI agent on a phone handles all tasks—bypassing ads, navigating apps, and executing complex prompts. The author describes this as a shift toward "computer use models," where AI agents integrate perception and action to automate digital workflows without custom APIs. The post explicitly references "computer use models" and implies browser-like interactions through app navigation and UI interaction.
2. **Title:** The human role has shifted from operator → supervisor
**Publication:** LinkedIn (Browserbase)
**Publication Date:** 2025-12-18
**URL:** https://www.linkedin.com/posts/browserbasehq_the-human-role-has-shifted-from-operator-activity-7407535410379161600-ZnxF
**Summary:** This LinkedIn post by Browserbase defines "computer-use models" as systems that integrate perception, reasoning, and action into a single loop, eliminating the need for custom APIs. It argues that agents capable of using computers will automate repetitive tasks like navigating internal tools and filling forms, shifting the human role from operator to supervisor. The post links to an external article at https://stageh.link/computer-use, which is inaccessible, but the core definition of computer-use models is clearly stated.
3. **Title:** Simular's computer use agent outperforms humans
**Publication:** Simular
**Publication Date:** 2025-12-16
**URL:** https://www.simular.ai/articles/simulars-computer-use-agent-outperforms-humans
**Summary:** Simular announces that its open agentic framework, Agent S, achieved a 72.6% success rate on the OSWorld benchmark for multimodal agents performing real computer tasks, surpassing the human baseline of 72.36%. The breakthrough is attributed to the Behavior Best-of-N (bBoN) method. The article explicitly discusses "computer use models" in the context of AI agents interacting with operating systems and software interfaces, positioning this as a historic milestone in AI automation.
4. **Title:** Build production-ready browser-based AI agents!
**Publication:** LinkedIn (Sumanth P)
**Publication Date:** 2025-12-18
**URL:** https://www.linkedin.com/posts/sumanth077_build-production-ready-browser-based-ai-agents-activity-7407360044851269632-cZHt
**Summary:** This post introduces Notte, an all-in-one platform for building browser-based AI agents for production use. It emphasizes features like agent identities (unique emails/phone numbers for authentication), secure vaults, live browser viewing, and converting natural language or click-through demonstrations into editable code. The platform is designed to move browser agents from prototypes to reliable infrastructure, directly addressing browser use models through autonomous web interaction.
5. **Title:** Letting AI Browse The Web For You Sounds Great Until It Goes Wrong
**Publication:** Forbes
**Publication Date:** 2025-12-17
**URL:** https://www.forbes.com/sites/bernardmarr/2025/12/17/letting-ai-browse-the-web-for-you-sounds-great-until-it-goes-wrong/
**Summary:** This article discusses AI-powered browser agents as transformative tools for searching, shopping, and working online. It highlights their potential to act on behalf of users but also warns of risks such as unintended actions, security vulnerabilities, and the complexity of navigating dynamic web environments. The piece frames browser agents as a key component of modern AI automation, directly relating to browser use models.
6. **Title:** Even Google and Replit struggle to deploy AI agents reliably — here's why
**Publication:** VentureBeat
**Publication Date:** 2025-12-19
**URL:** https://venturebeat.com/orchestration/even-google-and-replit-struggle-to-deploy-ai-agents-reliably-heres-why
**Summary:** Leaders from Google Cloud and Replit acknowledge that "computer use models" are still in their infancy, describing them as expensive, slow, and unreliable. The article details challenges in enterprise deployment, including fragmented data, immature governance, and the need to rework workflows around probabilistic agents. It confirms that computer use models are a central focus of current AI development, with Replit's next-gen agent achieving up to 200 minutes of autonomous operation.
7. **Title:** Built a preprocessor that makes browser agents actually work on complex sites
**Publication:** Reddit (r/AI_Agents)
**Publication Date:** 2025-12-13
**URL:** https://www.reddit.com/r/AI_Agents/comments/1plisyv/built_a_preprocessor_that_makes_browser_agents/
**Summary:** A user describes developing a preprocessor that converts live DOM into structured JSON, focusing on visible and clickable elements and their spatial relationships. This reduces token count from 1.3M to ~100K, enabling LLMs like Claude and GPT to reason about UI structure without hallucinating. The tool directly addresses browser use models by improving the input quality for agents navigating complex websites.
8. **Title:** Browserbase vs Firecrawl vs Skyvern: Which is Better for Workflow Automation?
**Publication:** Skyvern Blog
**Publication Date:** 2025-12-13
**URL:** https://www.skyvern.com/blog/
**Summary:** This blog post compares three automation tools, positioning Skyvern as superior for complex tasks like logging into portals and submitting orders, while Browserbase manages headless browsers and Firecrawl extracts content. The article discusses AI browser automation and browser agents as tools for end-to-end workflow automation, emphasizing the need for agents that perform actions beyond data extraction.
9. **Title:** How to Access GLM-4.6V and Build Reliable Multimodal Agents?
**Publication:** Novita AI
**Publication Date:** 2025-12-18
**URL:** https://blogs.novita.ai/how-to-access-glm-4-6v-and-build-reliable-multimodal-agents/
**Summary:** This guide details how to access GLM-4.6V, a multimodal model capable of understanding images, documents, and UI screenshots. It describes the model's role as a reasoning and coordination layer in agent workflows, enabling tasks like frontend replication from UI screenshots and document-aligned reasoning. While not focused on browser use models per se, it provides a framework for building multimodal agents that can interact with graphical interfaces, a core component of advanced browser automation.
10. **Title:** A Communicative Multimodal Multi-Agent Benchmark
**Publication:** arXiv
**Publication Date:** 2025-12-16
**URL:** https://deeplearn.org/arxiv/673536/comma:-a-communicative-multimodal-multi-agent-benchmark
**Summary:** This paper introduces COMMA, a benchmark for evaluating communication between multimodal multi-agent systems. It reveals weaknesses in state-of-the-art models like GPT-4o and LLaVA-CoT in collaborative tasks requiring language-based coordination. While not focused on browser or computer use, it advances the field of multimodal agents, which underpin the next generation of autonomous systems interacting with digital environments.
#### **Research Papers**
1. **Title:** BRAID: Bounded Reasoning for Autonomous Inference and Decisions
**Authors:** Armağan Amcalar, Eyup Cinar
**Publication Date:** 2025-12-17
**ArXiv Link:** https://arxiv.org/abs/2512.15959
**Summary:** This paper introduces BRAID, a framework that uses structured, machine-readable prompts (Mermaid-based instruction graphs) to enhance LLM reasoning. It demonstrates increased accuracy and cost efficiency on benchmarks like AdvancedIF and GSM-Hard. The research directly contributes to the development of autonomous agents capable of reliable decision-making, a foundational requirement for computer use models and browser automation systems.
2. **Title:** The Unreasonable Effectiveness of Scaling Agents for Computer Use
**Authors:** Simular Team
**Publication Date:** 2025-10-02 (preprint)
**ArXiv Link:** https://arxiv.org/abs/2510.02250
**Summary:** This research paper, referenced in Simular's announcement, details the scaling effects behind Agent S's success on the OSWorld benchmark. It provides the technical foundation for how large-scale agent training improves reliability in computer use tasks, offering empirical evidence for the viability of computer use models.
3. **Title:** CELLMATE: Sandboxing Browser AI Agents
**Authors:** Teemu Maatta (via GitHub repository)
**Publication Date:** 2025-12-19 (last commit)
**URL:** https://github.com/tmgthb/Autonomous-Agents
**Summary:** This paper, listed in a curated repository of autonomous agent research, introduces CELLMATE, a framework for sandboxing browser AI agents to enhance security and reliability. It directly addresses browser use models by proposing methods to isolate and control agent interactions with web environments, mitigating risks associated with autonomous web navigation.
4. **Title:** WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment
**Authors:** Teemu Maatta (via GitHub repository)
**Publication Date:** 2025-12-19 (last commit)
**URL:** https://github.com/tmgthb/Autonomous-Agents
**Summary:** This paper presents WebOperator, a tree search algorithm designed for autonomous agents navigating web environments. It enables agents to plan and execute complex sequences of actions on websites, directly contributing to the advancement of browser use models by improving task completion accuracy and robustness.
NOTE: the subagent response is designed to be recall-heavy; the orchestrator is responsible for filtering out irrelevant and outdated content for higher precision.
Ask agents to be persistent
Many tool-calling errors and dead ends can be overcome by trying different tools or queries. We bake persistence into our agents by biasing them to iteratively reformulate queries and try new tools upon failure, which results in higher coverage. Take the social media subagent, for instance:
- Without persistence: The response is mediocre, with references to 2 X posts, 1 good HackerNews post, and some other outdated content. Notably, the agent tried calling the `x_search_tweets` tool with 6 different queries in the first step, encountered an error for 4 of them, and never retried the tool.
- With persistence: The response is comprehensive and high-coverage, with references to 9 very relevant X posts and some other outdated content. The agent persisted through its first-step `x_search_tweets` failure with a successful retry in the second step. Both times the agent made a single call to `x_search_tweets`, using a smarter query that combined the different search terms with the `OR` operator.
In both cases, outdated content is filtered out by the orchestrator.
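The persistence bias can be approximated with a retry loop that reformulates the query before giving up. `call_tool` and the error string below are placeholders standing in for the real tool interface:

```python
# Sketch of the persistence bias: on tool failure, retry with a reformulated
# query (e.g. combining terms with OR) instead of abandoning the tool.

def search_with_persistence(call_tool, terms: list[str], max_attempts: int = 3):
    # Try individual terms first, then a combined OR query as a fallback.
    queries = terms + [" OR ".join(f'"{t}"' for t in terms)]
    attempts = 0
    for q in queries:
        if attempts >= max_attempts:
            break
        attempts += 1
        result = call_tool(q)
        if result != "Internal error, try again later.":
            return result
    return None  # all attempts exhausted

# Usage with a fake tool that fails once, then succeeds:
responses = iter(["Internal error, try again later.", "2 relevant posts"])
result = search_with_persistence(
    lambda q: next(responses),
    ["browser agents", "computer use models"],
)
```

In our agents this behavior is induced through prompting rather than hard-coded loops, but the effect is the same: a transient tool error does not end the search.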
Query, tool calls and response of social media subagent without persistence
>>> Subagent query
Search for recent technical discussions about computer use models and browser use models on social media platforms, published in the past 7 days (on or after December 12, 2025).
CRITICAL: Only include discussions/posts published on or after December 12, 2025. Discard anything older.
Platforms to check:
- Reddit (r/MachineLearning, r/LanguageModels, r/programming, r/webdev, r/artificial)
- Twitter/X (search for technical discussions, not just retweets)
- HackerNews (recent discussions and submissions)
- YouTube (recent technical videos about browser agents or computer use models)
Search terms to use:
- "computer use models"
- "browser use models"
- "browser agents"
- "AI browser automation"
- "multimodal agents"
- "LLM browser interaction"
Deliverables:
- For each discussion/post: Platform, author/username, post date (YYYY-MM-DD), title/topic, URL, brief summary of discussion
- Only include substantive technical discussions (not casual mentions)
- Only include items published on or after December 12, 2025
- Format as a structured list with clear dates and sources
>>> Subagent tool calls
>>> Step 1:
{
"name": "x_search_tweets",
"input": {
"keywords": ""computer use models" lang:en min_retweets:5 since:2025-12-12",
"max_results": 10
},
"output": {
"content": "tweet_text:Computer Use Models https://t.co/eEHAwc457l
url:https://x.com/geohotarchive/status/2001690745787335082
---
tweet_text:We got early access to Gemini 3 Flash and it blew us away,
url:https://x.com/pk_iv/status/2001326377665245672"
}
},
{
"name": "x_search_tweets",
"input": {
"keywords": ""browser use models" lang:en min_retweets:5 since:2025-12-12",
"max_results": 10
},
"output": {
"content": "No results found"
},
},
{
"name": "x_search_tweets",
"input": {
"keywords": ""browser agents" lang:en min_retweets:5 since:2025-12-12",
"max_results": 10
},
"output": {
"content": "Internal error, try again later."
},
},
{
"name": "x_search_tweets",
"input": {
"keywords": ""AI browser automation" lang:en min_retweets:5 since:2025-12-12",
"max_results": 10
},
"output": {
"content": "Internal error, try again later."
},
},
{
"name": "x_search_tweets",
"input": {
"keywords": ""multimodal agents" lang:en min_retweets:5 since:2025-12-12",
"max_results": 10
},
"output": {
"content": "Internal error, try again later."
},
},
{
"name": "x_search_tweets",
"input": {
"keywords": ""LLM browser interaction" lang:en min_retweets:5 since:2025-12-12",
"max_results": 10
},
"output": {
"content": "Internal error, try again later."
},
}
... reddit, search and youtube tools
>>> Steps 2 through 10:
No subsequent calls to x_search_tweets
>>> Subagent response
### Social Media Findings: Technical Discussions on Computer Use Models and Browser Use Models (Published on or after December 12, 2025)
#### **Twitter/X**
1. **Post:**
*Text:* "Computer Use Models"
*Author:* geohotarchive (@geohotarchive)
*Date:* December 18, 2025
*URL:* https://x.com/geohotarchive/status/2001690745787335082
*Engagement:* 361 likes, 22 retweets, 6 replies, 38,116 views
*Summary:* This post shares a link titled "Computer Use Models," likely referencing emerging AI agent frameworks. The account is an archive of George Hotz-related content, suggesting the post is a curated link rather than an original technical analysis.
2. **Post:**
*Text:* "We got early access to Gemini 3 Flash and it blew us away, With Stagehand Agent, it nearly matched accuracy with Gemini 3 Pro while being cheaper and faster. Computer use models are slow, but with a good harness like Stagehand, you can get comparable accuracy at lightspeed."
*Author:* Paul Klein IV (@pk_iv), founder of Browserbase
*Date:* December 17, 2025
*URL:* https://x.com/pk_iv/status/2001326377665245672
*Engagement:* 140 likes, 10 retweets, 19 replies, 16,426 views
*Summary:* A substantive technical comparison highlighting the performance trade-offs between native computer use models (e.g., Gemini 3 Pro) and agent frameworks like Stagehand. The author asserts that Stagehand, as a "harness," enables near-parity in accuracy with significantly improved speed and cost-efficiency, positioning it as a practical optimization layer for AI agents. This reflects a key industry trend: decoupling reasoning from execution to enhance agent performance.
#### **Reddit**
1. **Post:**
*Title:* Microsoft to move away from C/C++ to Rust using AI assisted coding
*Subreddit:* r/programming
*Author:* ishammohamed
*Date:* December 19, 2025
*URL:* https://www.reddit.com/r/programming/comments/1pqulm0/microsoft_to_move_away_from_cc_to_rust_using_ai/
*Engagement:* 374 upvotes, 172 comments
*Summary:* This post discusses Microsoft's strategic initiative to transition core codebases from C/C++ to Rust, leveraging AI-assisted coding tools. The discussion reveals a broader technical context where AI agents are being used not just for automation, but for large-scale, system-level code refactoring. This implies that "computer use models" are evolving into sophisticated code generation and refactoring agents capable of understanding and modifying complex legacy systems.
2. **Post:**
*Title:* AI helps ship faster but it produces 1.7× more bugs
*Subreddit:* r/webdev
*Author:* thewritingwallah
*Date:* December 19, 2025
*URL:* https://www.reddit.com/r/webdev/comments/1pqfwjq/ai_helps_ship_faster_but_it_produces_17_more_bugs/
*Engagement:* 261 upvotes, 37 comments
*Summary:* This post cites a report from Coderabbit.ai, quantifying a critical trade-off in AI-assisted development: while AI accelerates delivery speed, it significantly increases bug density (1.7x). This directly informs the discussion on browser use models, as it underscores the necessity for robust validation, testing, and human oversight in any agent-based automation system. The findings suggest that the reliability of "browser use models" is a primary concern for adoption in production environments.
3. **Post:**
*Title:* Gemini Flash hallucinates 91% times, if it does not know answer
*Subreddit:* r/artificial
*Author:* msaussieandmrravana
*Date:* December 19, 2025
*URL:* https://www.reddit.com/r/artificial/comments/1pqgofe/gemini_flash_hallucinates_91_times_if_it_does_not/
*Engagement:* 64 upvotes, 17 comments
*Summary:* A detailed benchmark analysis reveals that Gemini 3 Flash has a 91% hallucination rate on the AA-Omniscience Hallucination Rate benchmark. This data point is crucial for evaluating browser use models, as it indicates that even state-of-the-art models struggle with uncertainty calibration. For an agent that must interact with live web interfaces, a high rate of confident but incorrect actions (e.g., clicking the wrong button, submitting false data) represents a critical safety and reliability flaw.
4. **Post:**
*Title:* LG Will Let TV Owners Delete Microsoft Copilot After Customer Outcry
*Subreddit:* r/artificial
*Author:* JonSpartan29
*Date:* December 18, 2025
*URL:* https://www.reddit.com/r/artificial/comments/1pq0ycl/lg_will_let_tv_owners_delete_microsoft_copilot/
*Engagement:* 258 upvotes, 58 comments
*Summary:* This post highlights a consumer backlash against forced AI integration. The discussion reflects a philosophical divide: users demand control and choice, rejecting the notion that AI features should be mandatory. This sentiment directly impacts the adoption of browser use models, as users may resist any browser or application that embeds AI agents without a clear, easy opt-out mechanism.
#### **YouTube**
1. **Video:**
*Title:* Gemini 2.5 Computer Use: Google's FULLY FREE Browser Use AI Agent! Automate ANYTHING! (Ranked #1)
*Channel:* WorldofAI
*Date:* October 8, 2025
*URL:* https://www.youtube.com/watch?v=aSXZ4IImPlE
*Engagement:* 43,923 views, 864 likes
*Summary:* This video provides a comprehensive overview of Google's Gemini 2.5 Computer Use model, positioning it as a free, high-performance agent for web automation. It details its capabilities in controlling web UIs, its performance advantages over OpenAI and Anthropic's agents, and its integration with the Browserbase platform. The video is a primary source for understanding the technical specifications and real-world potential of a leading "browser use model."
2. **Video:**
*Title:* Google Gemini 2.5 Computer Use Testing – The FASTEST Agent Yet?
*Channel:* Bijan Bowen
*Date:* October 9, 2025
*URL:* https://www.youtube.com/watch?v=-qClMpXRu_s
*Engagement:* 13,809 views, 336 likes
*Summary:* This video presents a rigorous, real-world test of the Gemini 2.5 Computer Use model. The creator conducts autonomous tasks including customer service inquiries, chess games, and car purchases, evaluating speed, accuracy, and reliability. The testing methodology provides empirical evidence of the model's strengths (e.g., speed) and weaknesses (e.g., struggles with complex, multi-step tasks), offering a balanced technical assessment.
3. **Video:**
*Title:* ChatGPT Operator is expensive....use this instead (FREE + Open Source)
*Channel:* NetworkChuck
*Date:* February 21, 2025
*URL:* https://www.youtube.com/watch?v=sxTNACldK3Y
*Engagement:* 652,424 views, 19,202 likes
*Summary:* Although published earlier, this video remains highly relevant as it popularized the concept of free, open-source alternatives to proprietary agents like OpenAI's Operator. It demonstrates the Browser Use framework, a foundational open-source project for building browser agents. The video's massive popularity underscores a strong community-driven movement toward accessible, transparent, and customizable browser automation tools, which are the building blocks of modern "browser use models."
#### **Hacker News**
1. **Post:**
*Title:* Show HN: Operational web infra (Why we moved away from pure Computer Use agents)
*URL:* https://news.ycombinator.com/item?id=46281208
*Date:* December 16, 2025
*Summary:* This post introduces Mino, a new API for web agents that moves away from the "continuous inference" approach of models like Gemini and Claude. The author argues that the current paradigm of taking screenshots and reasoning step-by-step is too slow and expensive for production use. Mino proposes a solution where AI learns a workflow and then converts it into deterministic, executable code, promising speed, low cost, and reliability. This represents a significant technical evolution in the field, suggesting that the future of browser use models lies in hybrid architectures that combine AI learning with deterministic execution.
2. **Post:**
*Title:* A2UI: A Protocol for Agent-Driven Interfaces
*URL:* https://news.ycombinator.com/item?id=46286407
*Date:* December 17, 2025
*Summary:* This post discusses A2UI, a protocol designed to allow AI agents to send declarative UI component descriptions to clients for rendering. The discussion reveals a deep technical debate: proponents see it as a universal language for generative UIs, enabling agents to create interfaces on the fly. Critics argue it's a reinvention of server-side rendering and raises significant security and complexity concerns. This highlights a critical frontier in the field: how to standardize the interaction between an agent's reasoning engine and the user interface it controls, a core challenge for any "computer use model."
Query, tool calls and response of social media subagent with persistence
>>> Subagent query
Search for recent technical discussions about computer use models and browser use models on social media platforms, published in the past 7 days (on or after December 12, 2025).
CRITICAL: Only include discussions/posts published on or after December 12, 2025. Discard anything older.
Platforms to check:
- Reddit (r/MachineLearning, r/LanguageModels, r/programming, r/webdev, r/artificial)
- Twitter/X (search for technical discussions, not just retweets)
- HackerNews (recent discussions and submissions)
- YouTube (recent technical videos about browser agents or computer use models)
Search terms to use:
- "computer use models"
- "browser use models"
- "browser agents"
- "AI browser automation"
- "multimodal agents"
- "LLM browser interaction"
Deliverables:
- For each discussion/post: Platform, author/username, post date (YYYY-MM-DD), title/topic, URL, brief summary of discussion
- Only include substantive technical discussions (not casual mentions)
- Only include items published on or after December 12, 2025
- Format as a structured list with clear dates and sources
>>> Subagent tool calls
Step 1:
{
"name": "x_search_tweets",
"input": {
"keywords": ""computer use models" OR "browser use models" OR "browser agents" OR "AI browser automation" OR "multimodal agents" OR "LLM browser interaction" min_retweets:5 lang:en",
"max_results": 20
},
"output": {
"content": "Internal error, try again later.",
}
},
... search, reddit and youtube tools
Step 2:
{
"name": "x_search_tweets",
"input": {
"keywords": ""computer use models" OR "browser use models" OR "browser agents" OR "AI browser automation" OR "multimodal agents" OR "LLM browser interaction" min_retweets:5 lang:en since:2025-12-12 until:2025-12-19",
"max_results": 20
},
"output": {
"content": "tweet_text:Computer Use Models https://t.co/eEHAwc457l
url:https://x.com/geohotarchive/status/2001690745787335082
tweet_text:Build production-ready browser-based AI agents! Notte...
url:https://x.com/Sumanth_077/status/2001644218268495975
tweet_text:🚀🚀🚀 Gemini 3 Flash is live...
url:https://x.com/RohanLikesAI/status/2001341853657178412
tweet_text:Meteor is now the most intelligent AI browser....
url:https://x.com/moggrammer/status/2001340709727842506
tweet_text:We got early access to Gemini 3 Flash and it blew us away,...
url:https://x.com/pk_iv/status/2001326377665245672
tweet_text:Notte introduces Automation Studio - a dedicated code editor, live browser, and execution logs environment...
url:https://x.com/GithubProjects/status/2000974299767955692
... more results"
}
},
... search, reddit and youtube tools
Steps 3 through 10:
Calls to reddit, youtube, search, x_get_users_info and read webpage tools (X and youtube pages)
>>> Subagent response
### Social Media Monitoring Report: Technical Discussions on Computer Use Models and Browser Use Models (Published on or after December 12, 2025)
#### **Twitter/X Findings**
1. **Post: "Computer Use Models"**
- **Platform:** X (Twitter)
- **Author:** @geohotarchive
- **Date:** December 18, 2025
- **URL:** https://x.com/geohotarchive/status/2001690745787335082
- **Summary:** The post contains the single phrase "Computer Use Models" with no additional context or technical detail. It does not elaborate on browser agents, AI automation, or LLM interaction.
- **Citation:** [X/@geohotarchive, Dec 18, 2025, https://x.com/geohotarchive/status/2001690745787335082]
2. **Post: Notte Platform for Browser Agents**
- **Platform:** X (Twitter)
- **Author:** @Sumanth_077
- **Date:** December 18, 2025
- **URL:** https://x.com/Sumanth_077/status/2001644218268495975
- **Summary:** This post details Notte, an all-in-one platform for building production-ready browser agents. It highlights features such as agent identities for authentication, an automation studio with a live browser and execution logs, and modes for converting natural language or user demonstrations into executable code. The post explicitly discusses browser agents and AI browser automation.
- **Citation:** [X/@Sumanth_077, Dec 18, 2025, https://x.com/Sumanth_077/status/2001644218268495975]
3. **Post: Gemini 3 Flash and Multimodal Agents**
- **Platform:** X (Twitter)
- **Author:** @RohanLikesAI
- **Date:** December 17, 2025
- **URL:** https://x.com/RohanLikesAI/status/2001341853657178412
- **Summary:** This post announces the launch of Gemini 3 Flash, describing it as a multimodal agent with Pro-level intelligence that is cheaper and faster. It claims the model unlocks a new generation of multimodal agents used by companies like Figma and Bridgewater. The term "multimodal agents" is explicitly mentioned.
- **Citation:** [X/@RohanLikesAI, Dec 17, 2025, https://x.com/RohanLikesAI/status/2001341853657178412]
4. **Post: Meteor AI Browser**
- **Platform:** X (Twitter)
- **Author:** @moggrammer
- **Date:** December 17, 2025
- **URL:** https://x.com/moggrammer/status/2001340709727842506
- **Summary:** The post promotes "Meteor" as the most intelligent AI browser, capable of setting up monitors and having agents act on events. It claims browser agents are becoming more human-like and robust, and that Meteor checks sources like X, Reddit, and forums before acting. The post explicitly discusses browser agents and AI browser automation.
- **Citation:** [X/@moggrammer, Dec 17, 2025, https://x.com/moggrammer/status/2001340709727842506]
5. **Post: Computer Use Models are Slow**
- **Platform:** X (Twitter)
- **Author:** @pk_iv
- **Date:** December 17, 2025
- **URL:** https://x.com/pk_iv/status/2001326377665245672
- **Summary:** The post states that "Computer use models are slow," but with a good harness like Stagehand Agent, comparable accuracy to Gemini 3 Pro can be achieved at "lightspeed." This is a direct technical comparison between general computer use models and a specific agent framework.
- **Citation:** [X/@pk_iv, Dec 17, 2025, https://x.com/pk_iv/status/2001326377665245672]
6. **Post: Notte Automation Studio**
- **Platform:** X (Twitter)
- **Author:** @GithubProjects
- **Date:** December 16, 2025
- **URL:** https://x.com/GithubProjects/status/2000974299767955692
- **Summary:** This post announces Notte's Automation Studio, a unified environment with a code editor, live browser, and execution logs for crafting browser agents and automation scripts. It explicitly mentions "browser agents" and "automation scripts."
- **Citation:** [X/@GithubProjects, Dec 16, 2025, https://x.com/GithubProjects/status/2000974299767955692]
7. **Post: Notte Automation Studio (Follow-up)**
- **Platform:** X (Twitter)
- **Author:** @nottecore
- **Date:** December 16, 2025
- **URL:** https://x.com/nottecore/status/2000966751933137265
- **Summary:** This post reiterates the launch of Automation Studio as a dedicated IDE for crafting, debugging, and deploying browser agents and automations end-to-end. It explicitly mentions "browser agents."
- **Citation:** [X/@nottecore, Dec 16, 2025, https://x.com/nottecore/status/2000966751933137265]
8. **Post: Browser Agents and Login Issues**
- **Platform:** X (Twitter)
- **Author:** @GithubProjects
- **Date:** December 15, 2025
- **URL:** https://x.com/GithubProjects/status/2000611905061392396
- **Summary:** This post identifies a key technical challenge for browser agents: breaking at login. It proposes a solution involving "real identity primitives" such as email inboxes, phone numbers, automated 2FA, and credential vaults to enable autonomous handling of email verification and SMS 2FA. This is a substantive technical discussion on browser agent functionality.
- **Citation:** [X/@GithubProjects, Dec 15, 2025, https://x.com/GithubProjects/status/2000611905061392396]
9. **Post: Launch Week for Browser Agents**
- **Platform:** X (Twitter)
- **Author:** @nuitdotfun
- **Date:** December 12, 2025
- **URL:** https://x.com/nuitdotfun/status/1999543486391955671
- **Summary:** This post announces "Launch Week Volume I" for Notte, highlighting "Five primitives for building and scaling browser agents." It explicitly mentions "browser agents" as the core subject.
- **Citation:** [X/@nuitdotfun, Dec 12, 2025, https://x.com/nuitdotfun/status/1999543486391955671]
#### **YouTube Findings**
1. **Video: Gemini 2.5 Computer Use: Google's FULLY FREE Browser Use AI Agent!**
- **Platform:** YouTube
- **Author:** WorldofAI
- **Date:** October 8, 2025
- **URL:** https://www.youtube.com/watch?v=aSXZ4IImPlE
- **Summary:** This video introduces Google's Gemini 2.5 Computer Use model, a free AI agent designed to interact with web browsers like a human. It highlights its performance on benchmarks, its ability to automate tasks without coding, and its superiority over OpenAI and Anthropic alternatives. The video explicitly discusses "Computer Use," "Browser Use," and "AI agents."
- **Citation:** [YouTube/WorldofAI, Oct 8, 2025, https://www.youtube.com/watch?v=aSXZ4IImPlE]
2. **Video: Browser Use: This New AI Agent Can Do Anything (Full AI Scraping Tutorial)**
- **Platform:** YouTube
- **Author:** Tech With Tim
- **Date:** March 23, 2025
- **URL:** https://www.youtube.com/watch?v=zGkVKix_CRU
- **Summary:** This tutorial demonstrates how to build an AI agent using the "Browser Use" framework to control a browser for tasks like purchasing items, booking flights, and parsing information. It explicitly mentions "Browser Use," "AI agents," and "AI browser automation."
- **Citation:** [YouTube/Tech With Tim, Mar 23, 2025, https://www.youtube.com/watch?v=zGkVKix_CRU]
3. **Video: How to create your own Browser AI Agent using any LLM Model + Playwright + Browser-Use + Web-UI**
- **Platform:** YouTube
- **Author:** Naveen AutomationLabs
- **Date:** February 4, 2025
- **URL:** https://www.youtube.com/watch?v=AK9mRsXdr4w
- **Summary:** This tutorial provides a step-by-step guide to creating a browser AI agent using an LLM, Playwright, and the Browser-Use framework. It explicitly discusses "Browser AI Agent," "LLM Model," "Playwright," and "Browser-Use."
- **Citation:** [YouTube/Naveen AutomationLabs, Feb 4, 2025, https://www.youtube.com/watch?v=AK9mRsXdr4w]
4. **Video: Playwright MCP: How AI Agents Can Control Your Browser**
- **Platform:** YouTube
- **Author:** Playwright
- **Date:** April 20, 2024
- **URL:** https://www.youtube.com/watch?v=2716IUeCIQo
- **Summary:** This video introduces Microsoft Playwright MCP, a tool that enables AI agents to interact with websites using structured accessibility data instead of screenshots. It demonstrates an agent logging into GitHub and starring a repository. The video explicitly discusses "AI agents" and "LLM" (Large Language Models).
- **Citation:** [YouTube/Playwright, Apr 20, 2024, https://www.youtube.com/watch?v=2716IUeCIQo]
5. **Video: AI Agent Engineering: Lessons from Browser Use & Computer Use**
- **Platform:** YouTube
- **Author:** deepsense
- **Date:** June 5, 2025
- **URL:** https://www.youtube.com/watch?v=y6omO_oXNpE
- **Summary:** This session explores the transition of LLMs from text-only tools to agents capable of navigating browsers and desktop environments. It details two projects: "Browser Use" for web automation and "Computer Use" for desktop GUI control. It explicitly mentions "Browser Use," "Computer Use," "AI agents," and "LLM."
- **Citation:** [YouTube/deepsense, Jun 5, 2025, https://www.youtube.com/watch?v=y6omO_oXNpE]
6. **Video: Browser Use - AI Agent with the Browser**
- **Platform:** YouTube
- **Author:** Ozgur Ozer
- **Date:** January 5, 2025
- **URL:** https://www.youtube.com/watch?v=grXfjufSleY
- **Summary:** This short video demonstrates an AI agent using the "Browser Use" framework to find the cheapest flight from Istanbul to New York, navigating Google Flights and selecting the optimal date. It explicitly mentions "Browser Use" and "AI Agent."
- **Citation:** [YouTube/Ozgur Ozer, Jan 5, 2025, https://www.youtube.com/watch?v=grXfjufSleY]
#### **Reddit and HackerNews Findings**
- **Reddit (r/MachineLearning, r/LanguageModels, r/programming, r/webdev, r/artificial):** No posts published on or after December 12, 2025, were found that contained substantive technical discussions on "computer use models," "browser use models," "browser agents," "AI browser automation," "multimodal agents," or "LLM browser interaction." Posts on these subreddits were either unrelated, promotional, or dated before the required cutoff.
- **HackerNews:** No relevant submissions or discussions were found on HackerNews for the specified search terms within the past seven days.
#### **Summary**
Substantive technical discussions on the requested topics were found exclusively on Twitter/X and YouTube. The most consistent themes are the development and deployment of browser agents using frameworks like Notte, Browser Use, and Playwright MCP, with a strong focus on automating web interactions. The term "multimodal agents" is specifically tied to Google's Gemini 3 Flash announcement. Discussions on "computer use models" are limited to a single tweet comparing their performance to agent-based systems. No substantive content was found on Reddit or HackerNews within the specified date range.
Tell your agents what tools they can access and when to use them
Not all tools are created equal; they differ in speed, reliability, cost, and quality. For example:
- A web search returns results in milliseconds and rarely fails, but its coverage of information — which is central to a monitoring use case like Scouts — can be hit-or-miss.
- Browser automation with a navigator tool gives you high-fidelity access to any webpage, but it is 1000x slower than an API call, more expensive, and prone to webpage timeouts or rendering issues.
- API calls to structured databases are fast and reliable, but they can be expensive and only work for specific domains.
- Even within similar tool types, quality varies. Some search APIs return comprehensive metadata — article titles, full content, authors, publication dates — while others return only a link and a snippet, forcing follow-up tool calls to retrieve the missing details.
When we manually browse the web, we're implicitly aware of these differences and follow a pattern — start broad to gather context, then narrow down to the right tool. Use fast, cost-effective options first, and only escalate to complex, expensive tools when absolutely necessary.
Your agents can do this too — but only if you give them the right context about tool properties and teach them how to sequence their work. We encode explicit guidance into Scouts about which tools work for reconnaissance versus deep dives, and which operations can be done in parallel. Tool descriptions from off-the-shelf MCP servers don't capture these nuances.
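As an illustration, this kind of guidance can live alongside the tool definitions themselves. The sketch below is a minimal, hypothetical example (the tool names, fields, and wording are ours, not the actual Scouts implementation) of rendering tool properties into a prompt so the planner can sequence cheap tools before expensive ones:

```python
# Hypothetical sketch: annotate tools with the properties the planner needs.
# All names and fields here are illustrative assumptions, not a real API.
from dataclasses import dataclass

@dataclass
class ToolProfile:
    name: str
    latency: str      # rough latency class, e.g. "ms", "seconds", "minutes"
    cost: str         # relative cost class
    reliability: str  # how often the call fails outright
    use_for: str      # guidance injected into the tool description

TOOLS = [
    ToolProfile("web_search", "ms", "low", "high",
                "Reconnaissance: broad coverage, may miss long-tail pages."),
    ToolProfile("read_webpage", "seconds", "low", "medium",
                "Static pages only; dynamic pages (date pickers, size "
                "filters) can return misleading results."),
    ToolProfile("navigator", "minutes", "high", "medium",
                "Full browser automation; escalate here only when cheaper "
                "tools are insufficient."),
]

def render_tool_guidance(tools):
    """Render tool properties into the system prompt so the agent can plan."""
    lines = ["Available tools, cheapest first. Escalate only when needed:"]
    for t in tools:
        lines.append(f"- {t.name} (latency: {t.latency}, cost: {t.cost}, "
                     f"reliability: {t.reliability}): {t.use_for}")
    return "\n".join(lines)
```

The point of the sketch is that the hierarchy is explicit text in context, not something the model is left to infer from bare tool names.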
Example: Finding concert tickets
For example, consider a Scout looking for Taylor Swift concert tickets under $100. The naive approach is to send Navigator agents directly to Ticketmaster, Vivid Seats, and SeatGeek in parallel — expensive, slow, and checking similar inventory three times over.
Instead, we teach agents to:
- Start with a single Google search to identify resellers and get a sense of pricing,
- Invoke a read_webpage tool on promising sites in parallel to extract basic availability,
- Send 1-2 Navigator agents to the best options with optimized starting URLs.
This initial search can also reveal when to short-circuit entirely — if tickets aren't on sale yet or the minimum price is way above the user's threshold, you can stop without proceeding to expensive browser automation.
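A minimal sketch of this tiered escalation, with hypothetical search, read_webpage, and navigate stand-ins (the result shapes, the price-cap short-circuit, and the "top 2" cutoff are assumptions for illustration):

```python
# Tiered escalation sketch: cheap search -> parallel page reads -> at most
# two Navigator agents. Tool functions are injected stand-ins, not a real API.
from concurrent.futures import ThreadPoolExecutor

def find_tickets(query, price_cap, search, read_webpage, navigate):
    # Tier 1: one cheap search call to map out resellers and rough pricing.
    results = search(query)
    if not results or min(r["min_price"] for r in results) > price_cap:
        return []  # short-circuit: nothing on sale, or everything over budget

    # Tier 2: read promising pages in parallel to extract basic availability.
    promising = [r for r in results if r["min_price"] <= price_cap]
    with ThreadPoolExecutor() as pool:
        pages = list(pool.map(lambda r: read_webpage(r["url"]), promising))

    # Tier 3: escalate only the best 1-2 candidates to full browser automation.
    candidates = sorted(
        (p for p in pages if p["available"]), key=lambda p: p["min_price"]
    )[:2]
    return [navigate(p["url"]) for p in candidates]
```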
But there's a catch — the agent also needs to know when fast tools will mislead or fail outright. Booking a restaurant reservation requires selecting dates and times. Checking product availability requires selecting sizes and filters. Using a read_webpage tool on these dynamic pages may return misleading results — appointments showing as available when they're not, products appearing in stock when the right size isn't selected. Teach your model the hierarchy, then teach it when to ignore it.
Don't assume credibility; teach your model to fact check
The tools themselves are only half the problem; the information they surface presents another layer of complexity.
- Search snippets can be misleading when truncated: a query for "startups funded by OpenAI" might return a result titled "Startup X raises $10B" with a snippet reading "X raises $10B...OpenAI," triggering confirmation bias — the model sees the keyword it's looking for and assumes OpenAI is an investor without verifying the full context.
- Scouting across the entire internet means venturing into corners where even humans struggle to judge credibility at first glance. While modern LLMs have some sense of authoritative versus dubious sources, continuous monitoring surfaces content from obscure blogs, niche forums, and fringe publications that test these internal models. Humans instinctively cross-check questionable sources; teach your agents to do the same.
- Social media compounds the issue — a post claiming "Google's new product kills this startup" doesn't mean the startup is dead, just that Google launched a competitor, but models can misinterpret the dramatic language.
Continuous monitoring is unforgiving and creates an asymmetric risk — good work often means finding nothing new (invisible to users), while mistakes get surfaced immediately. Even if the model gets judgment calls right 90% of the time, the 10% error rate is enough to surface incorrect findings to users, degrading the signal-to-noise ratio over time.
We had to teach Scouts to evaluate source credibility actively — to reason about who's saying what and why, cross-check important claims across multiple independent sources, and escalate to more authoritative data before turning those claims into recommendations. The goal isn't just to find information, but for users to be able to trust it and make decisions they can defend.
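One way to operationalize the cross-checking rule is to require corroboration from multiple independent domains before a claim reaches a report. A toy sketch, where the claim shape and the two-source threshold are our assumptions:

```python
# Toy corroboration check: keep a claim only if enough *independent* sources
# (distinct domains) support it. Shapes and thresholds are illustrative.
from urllib.parse import urlparse

def independent_domains(urls):
    return {urlparse(u).netloc for u in urls}

def credible_enough(claim, min_sources=2):
    """True when >= min_sources distinct domains back the claim."""
    return len(independent_domains(claim["source_urls"])) >= min_sources

claims = [
    {"text": "Startup X raised $10M Series A",
     "source_urls": ["https://techcrunch.com/x", "https://x.com/founder/1"]},
    {"text": "Google's new product kills this startup",
     "source_urls": ["https://x.com/rando/456"]},  # single social post: hold
]
reported = [c for c in claims if credible_enough(c)]
```

In this sketch the lone social-media claim is held back for further verification rather than surfaced to the user.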
Let the agent extract feedback from every user interaction
As is the nature of long-horizon research tasks, the questions of interest often start broad and then iteratively become crisper and narrower based on the gathered information. Scouts need to be easily personalizable and co-evolve through this process.
A user might start with a broad query like, "Find startups who have recently raised funding". After initial findings, users reframe their query with additional constraints:
- Only Pre-seed, Seed, and Series A
- Only tech companies
- Exclude defense tech entirely
In addition, the world is evolving — new companies appear, categories shift, and what counts as "relevant" changes over time.
Scouts must keep evolving too. We support multiple ways to steer a Scout over time — users can comment on a single line in a report (line-level), the report as a whole (report-level), or the Scout's overall behavior across reports (Scout-level), providing corrections and clarification of intent via a chat.
In all cases, the user's feedback is translated into inspectable, editable guidelines¹ for future Scout runs. The system adapts to the user, instead of forcing the user to adapt to the system.
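A minimal sketch of what such guideline storage might look like, assuming one plain-text guideline per feedback scope (the data shapes are illustrative, not the actual Scouts schema):

```python
# Illustrative sketch: feedback at three scopes (line, report, scout) becomes
# plain-text guidelines that are injected into future runs and remain
# user-editable. Shapes are assumptions, not the real Scouts schema.
from dataclasses import dataclass, field

@dataclass
class Guideline:
    scope: str        # "line" | "report" | "scout"
    instruction: str  # plain-text rule applied on future runs

@dataclass
class ScoutConfig:
    query: str
    guidelines: list = field(default_factory=list)

    def add_feedback(self, scope, instruction):
        # Guidelines stay plain text so users can inspect and edit them later.
        self.guidelines.append(Guideline(scope, instruction))

    def render_for_prompt(self):
        return "\n".join(f"[{g.scope}] {g.instruction}" for g in self.guidelines)

scout = ScoutConfig("Find startups who have recently raised funding")
scout.add_feedback("scout", "Only Pre-seed, Seed, and Series A rounds.")
scout.add_feedback("scout", "Exclude defense tech entirely.")
scout.add_feedback("report", "Group findings by funding stage.")
```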
Prepending past reports in context leads to drift towards them
Scouts track changes over time, building up a compounding knowledge system.
To do this well, agents need to have memory of past reports in order to appropriately contextualize new findings and provide succinct deltas ("what changed since last week").
The trap: "Just paste the old report at the top"
Our early implementation was straightforward — if a Scout needs memory of what's recently happened so it can report updated information, then just include the latest reports at the top of its context. In practice, this leads to a subtle but nasty behavior drift.
Say the user's query is "Track updates on AI regulations worldwide". One of the past reports covered a high-profile AI law that passed in a particular state. If we include that report in context, then in the subsequent run the agent may over-index its search on that one law — its articles, the people involved, community sentiment, and lawsuits citing it — instead of giving a balanced view of other regional frameworks, executive orders, industry standards, etc., relevant to the original query.
As an analogy, prepending past reports into the context makes an agent prone to availability bias.
Use iterative keyword search on the report archive
To avoid drift, we build up a report archive that is searchable by every agent.
The common approach to do this today would be to:
- Convert every past report (or chunks of reports) into vectors,
- Retrieve the most semantically similar chunks,
- Include the top-k results in the agent's context.
Embeddings are powerful, but we found that simple keyword search plus metadata filters is sufficient and surprisingly effective for Scouts to search through past reports.
This works well because reports are already compressed and information-dense — precisely what keyword search can latch onto, e.g.
- Proper nouns & entities: company names, products, people, projects, competitors
- Exact phrases: "SOC 2 Type II", "price increase", "go-to-market"
- Numbers & dates: "$20M", "Q2 2025", "Dec 3", "30-day trial", "99.9% uptime"
- Hard constraints and negation: "exclude X", "only Y", "not Z"
Our report searcher performs this keyword search in an iterative loop where it refines both keywords and conditions (e.g., negation) based on the retrieved results:
- Start with a narrow query (entity + intent),
- Inspect results quickly (titles/snippets/metadata),
- Expand with aliases, exclude irrelevant content, refine constraints,
- Run follow-up queries for any missing pieces (dates, numbers, decision rationale),
- Include just the relevant excerpts into its working context.
This loop keeps recall high without polluting the context window, and empirically, this simple approach leads to surprisingly few failure modes.
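The loop above can be sketched roughly as follows; a real implementation would sit on a search index such as BM25, so the naive substring matcher and the alias-based refinement policy here are stand-ins:

```python
# Sketch of the iterative keyword-search loop over the report archive.
# A naive substring matcher stands in for a real index (e.g. BM25), and the
# refinement policy (swap in aliases on misses) is an illustrative assumption.
def search_archive(archive, keywords, exclude=()):
    """Return (report_id, excerpt) pairs matching all keywords, none excluded."""
    hits = []
    for report_id, text in archive.items():
        lowered = text.lower()
        if all(k.lower() in lowered for k in keywords) and \
           not any(x.lower() in lowered for x in exclude):
            hits.append((report_id, text[:200]))  # only excerpts enter context
    return hits

def iterative_search(archive, entity, intent, aliases=(), exclude=(),
                     max_rounds=3):
    keywords = [entity, intent]
    for _ in range(max_rounds):
        hits = search_archive(archive, keywords, exclude)
        if hits:
            return hits
        # No hits: refine by swapping in an alias before giving up.
        if aliases:
            keywords = [aliases[0], intent]
            aliases = aliases[1:]
        else:
            break
    return []

archive = {
    "2025-12-08": "Anthropic raised a new round; Claude usage grew in Q4.",
    "2025-12-15": "OpenAI shipped a browser agent; no funding news this week.",
}
hits = iterative_search(archive, "Anthropic", "raised", aliases=("Claude",))
```

Only the matching excerpts (not whole past reports) are placed in the agent's working context, which is what keeps the availability-bias problem from the previous section at bay.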
Putting it all together
- Modularize and distribute work into specialized subagents
- Design for coverage and persistence
- Recognize that tools are heterogeneous: design for trade-offs between speed, cost, and reliability
- Let agents evolve and learn from user feedback
- Build long-term external memory for agents, so knowledge compounds over time
For our users, the payoff is clear — an always-on partner that watches the web for what matters to them and adapts to their preferences.
These principles are the backbone of the architecture that powers Scouts today, but we continue to iterate on design. If you're building your own agent system, we hope these are helpful!
Footnotes
1. Coming soon to the Scouts UI. ↩