How to design a Deep Research Agent (that’ll actually work)

Here's an unfashionable opinion in 2026: most of the work of building a good agent has nothing to do with the agent. It's data curation, eval design, workflow mapping, and arguing about schemas. The coding bit is maybe 20% of the task. This blog covers a good chunk of the remaining 80%. 

At Veratai, we follow a five-stage methodology for developing deep research agents: 

  1. Envision: understand the problem the agent will solve and how it will integrate into the business 

  2. Explore: design the agent and agree how we’ll know it works as intended 

  3. Execute: build the agent in rapid iterations; prove that it works 

  4. Engage: observe the agent during its early lifetime and compile a wish-list for improvements 

  5. Enhance: optimise all aspects of the agent’s performance and build out its “safety net” 

This blog covers the Explore phase. 

Where are we in the process? 

Imagine you've been engaged to build a deep research agent (DRA) - a system that will accept a brief, gather information, synthesise it and produce a structured output. It could be a sales briefing, a competitive intelligence report, or a due diligence pack.  

The Envision stage has furnished us with a good understanding of the task. Now we need to translate these findings into a detailed design for the agent. This is also the best time to start hunting down (or curating) the data we’ll need to feed, evaluate or improve the agent. The key questions at this stage are: 

  • Data: what data will the agent need and how will it acquire it? 

  • Deviations: will there be any deviations from our DRA reference design which will require development? 

  • Models: what data models will our agent use: what goes in and what comes out? 

  • Evals: how will we know the agent works? 

  • Workflow: what process should the agent follow and what tools will it need? 

  • Feedback: how can we collect feedback and what learning paradigm should we use to ensure that the agent will get better over time? 

I’ll cover each of these, in turn, below. 

Data 

Every research process generates collateral: past research briefings, final reports, populated templates, sources and working documents. These artifacts are gold dust because they make concrete the implicit knowledge and thought processes of the humans who produced them. It’s exactly the kind of knowledge we need to transfer to our agent. 

The first task is to assemble a collection of these documentary artifacts. In Envision, we built a picture of the research process as it actually happens. Now we’re looking for the evidence trail it leaves behind. What will be useful is case-dependent, but examples include: 

  • Inputs: past research briefings, requests, intake forms. 

  • Outputs: the final deliverables. Reports, summaries, slide packs. 

  • Intermediate artifacts: notes, sources, documents, datasets, search results. These are often the most revealing, because they show the process of research, not just the result. 

Once assembled, review what you have and - critically - identify the gaps. Which parts of the research process are well documented; which are opaque? Are there categories of artifact missing? Is the data representative and diverse enough? 

Don't be surprised if the gaps are substantial. In our experience, the intermediate steps of a research workflow are rarely well-documented. People don't tend to write down why they chose one source over another, or how they decided which information was relevant. But these are precisely the decisions we need our agent to learn. 

Why are we doing this? 

  • The inputs tell us what sources we’ll want the agent to be able to access and what formats it’ll need to work with. 

  • The outputs give us something to design for and evaluate against. 

  • The intermediate artifacts guide us through the decision-points and sub-tasks that occur during the research process. In some cases, they provide evaluations that we can use to optimise sub-steps within the agentic research process. 

Data Curation 

I said there would be gaps - areas where we’ll be missing data that is crucial to evaluating or prompting the agent. We’ll now need to fill them. This is a pragmatic exercise and there's no single right approach. We'll use a mix of methods depending on what's missing and what's feasible. 

Human data curation and labelling is the most reliable route for capturing expert judgments. If we need to label source quality, annotate relevancy decisions, or create gold-standard examples of intermediate reasoning steps, then we'll need human experts to do this. Depending on the scale, a dedicated data curation platform can help manage the process - but for many engagements, a well-structured spreadsheet and a clear annotation guide will do the job. 

Data gathering from internal or external sources fills a different kind of gap. Perhaps we need to assemble a corpus of the sources that researchers typically consult or pull together historical data from internal systems that wasn't previously accessible. This might involve API integrations, web scraping, database exports or simply asking someone to dig out the files from a shared drive that nobody's looked at in two years. 

Synthetic data generation is also an option. If inputs and resource examples are thin on the ground, we can use a strong model to generate variations of research briefs or other collateral, or to produce “good” example outputs that a human expert can then quickly correct and refine. Ideally, synthetic data supplements human-generated data rather than replacing it: it can boost the diversity of our evaluation dataset, but a purely synthetic dataset will never be as strong as a human-curated one. 
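As a concrete sketch: the prompt-construction side of this is plain string templating, with the actual model call left to whichever LLM client you use (the `llm.complete` call in the comment is hypothetical, not a real API):

```python
from textwrap import dedent

def variation_prompt(seed_brief: str, n: int = 5) -> str:
    """Build a prompt asking a strong model to produce n varied research
    briefs in the style of a genuine example."""
    return dedent(f"""\
        You are helping to expand an evaluation dataset for a research agent.
        Below is a genuine research brief. Produce {n} new briefs that vary
        the industry, geography and scope, but keep the same level of detail
        and the same implicit quality bar.

        BRIEF:
        {seed_brief}
        """)

# Usage (hypothetical client - swap in your own):
# drafts = llm.complete(variation_prompt(real_brief, n=10))
# ...then have a human expert review and correct the drafts before use.
```

The human-review step at the end is the important part: the model supplies diversity, the expert supplies the quality bar.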

Models 

With our data assembled and curated, we need to impose some structure on it. This means developing logical data models for the inputs and outputs of the research process. 

A common mistake here is to model only the bookends: the initial briefing that kicks off the research and the final report that comes out the other end. The intermediate steps matter just as much. We need models that capture the structure of every significant input and output during the research process - source evaluations, search queries, extraction results, synthesis steps, citation records, confidence assessments. 

Why does this matter? Because data models also serve as a contract between the agent and the orchestration system that will run it. At Veratai, we split development - with the agent being developed separately to the orchestration layer - as follows: 

  • research workflow → agent 

  • data storage, logging & observability, task management, imports & exports, batch runs, human-in-the-loop facilities, report interactions and user feedback → orchestration layer 

With a good contract in place, the two workstreams can run in parallel. 

Evals 

Before we build anything, we’ll need to agree how to measure quality. The evaluation method is one of the areas where agent developers tend to cut corners – and usually regret it later on. 

Be clear on the distinction between the following three types of metric: 

  • Ultimate metrics describe what we actually care about. For a research agent, this might be "the quality and completeness of the research output as judged by a senior analyst" or "the time saved compared to a fully manual process." These are the metrics that matter to the business, but they're often expensive or slow to measure. 

  • Measurable metrics are proxies that we can compute more cheaply and frequently. Factual accuracy against known sources, citation precision and recall, structural adherence to output templates, coverage of required topics. These should correlate well with the ultimate metrics but will never be a perfect substitute. 

  • Optimisable metrics are the ones we can actually use in a training or prompt-optimisation loop. These need to be computable automatically - or at least semi-automatically - and differentiable enough to drive meaningful improvement. LLM-as-judge rubric scores, ROUGE variants against reference outputs, retrieval quality metrics, tool-use accuracy are all examples. 

We’ll start with the ultimate metrics, then figure out the measurable and optimisable ones to use during development. Equally important is to think carefully about how we’ll collect the ultimate metrics. If we’re expecting our end users to help us review outputs, how will we enable them to do this in a structured, comprehensive and useful way? (Clue: we’ll probably need to produce a specialised end-user app for this.) 
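As an example of a measurable metric that's cheap to compute, citation precision and recall reduce to simple set arithmetic over the agent's citations and an analyst's reference list:

```python
def citation_precision_recall(cited: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision: share of the agent's citations that appear in the
    reference set. Recall: share of the reference citations the agent found."""
    if not cited or not gold:
        return (0.0, 0.0)
    hits = len(cited & gold)
    return (hits / len(cited), hits / len(gold))

# Agent cited 4 sources; 3 match the analyst's reference list of 6.
p, r = citation_precision_recall(
    {"a", "b", "c", "d"}, {"a", "b", "c", "e", "f", "g"}
)
# p == 0.75, r == 0.5
```

Metrics like this won't capture output quality on their own, but they're fast enough to run on every iteration, which is exactly what a development loop needs.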

Deviations 

At Veratai, we have a “reference design” for our orchestration application. It saves reinventing the wheel each time. Whilst 80% of the non-agent code required will be the same each time, the remaining 20% will be specific to the use case we’re building for. Therefore, we figure out how we need to adapt the reference application for each use case and plan out a list of changes. 

To put this into context: a common deviation is for “human-in-the-loop” needs. In our experience, most agents will need to pause under certain conditions and accept human input or validation. This, in turn, requires us to develop some sort of customised view or form in the orchestration application. 

Feedback 

A research agent is a junior assistant. It won’t get better unless someone takes the time to provide feedback - so encourage your users to do so. 

Really, this is a UX design consideration. We need to think of all the dimensions along which our agent’s outputs could be criticised, then think about how our users can tell us when this happens. 

Think about the kinds of feedback that users will provide. The most obvious is their assessment of the final output - was it good, bad, missing key information, well-structured, off-topic? But there are richer signals available if we plan for them. Did the user accept or override a source recommendation? Did they edit a particular section heavily? Did they ask a follow-up question that suggests the agent missed something? 

Let’s be realistic: most people will not want to provide detailed feedback, most of the time. Simple mechanisms like 5-point scoring scales, thumbs up/down buttons or feedback pop-ups are best. 

These richer signals are examples of process reward signals - feedback on the intermediate steps of the research process, not just the final result. They're invaluable for targeted improvement, because they tell us where in the process things went right or wrong, not just that the overall output was unsatisfactory. 
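One way to make these signals actionable is to record every piece of feedback against the process step that produced the artefact being judged, then bucket by step so the weak spots stand out. A minimal sketch (the field names are illustrative):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class FeedbackEvent:
    """A single user signal, tied to the step of the research process
    that produced the artefact being judged."""
    run_id: str
    step: str   # e.g. "source_selection", "final_report"
    kind: Literal["rating", "thumbs", "edit", "override"]
    value: str  # "4/5", "up", a diff summary, the source chosen instead...

def group_by_step(events: list[FeedbackEvent]) -> dict[str, list[FeedbackEvent]]:
    """Bucket feedback by process step so per-step problems become visible."""
    buckets: dict[str, list[FeedbackEvent]] = {}
    for e in events:
        buckets.setdefault(e.step, []).append(e)
    return buckets
```

Even this crude grouping turns "users are unhappy" into "users keep overriding the source selection step", which is something we can actually fix.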

If we know what feedback we’ll receive and can estimate how much of it there might be, we can figure out what learning mechanisms we might be able to use to improve the agent over time. For further details, we wrote a blog about this. 

Workflow 

Finally. This is where we get into the specifics of how the agent will reason and act. Note how this is only one step in the design process - there is so much more to consider in a successful project! 

Here are the specific aspects we should now be able to design: 

The agent’s state graph. The state graph is the backbone of the agent. It defines the nodes (the things the agent does), the edges (the transitions between them) and the state (the data that flows through the graph). This is where our data models pay dividends: each node receives and produces structured data according to the contracts we've already defined. 
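Frameworks such as LangGraph provide state graphs out of the box, but the core idea fits in a few lines. A toy sketch, with hypothetical node and edge functions standing in for real research steps:

```python
from typing import Callable

State = dict  # in practice, a typed model per the data-model contracts

def run_graph(
    nodes: dict[str, Callable[[State], State]],
    edges: dict[str, Callable[[State], str]],
    start: str,
    state: State,
) -> State:
    """Walk the graph: run each node, then ask its edge function where
    to go next, until a node routes to 'END'."""
    current = start
    while current != "END":
        state = nodes[current](state)
        current = edges[current](state)
    return state

# Toy research loop: plan, search until we have enough sources, then write.
nodes = {
    "plan":   lambda s: {**s, "queries": ["q1", "q2"]},
    "search": lambda s: {**s, "sources": s.get("sources", []) + ["src"]},
    "write":  lambda s: {**s, "report": f"{len(s['sources'])} sources used"},
}
edges = {
    "plan":   lambda s: "search",
    "search": lambda s: "write" if len(s["sources"]) >= 2 else "search",
    "write":  lambda s: "END",
}
result = run_graph(nodes, edges, "plan", {})
# result["report"] == "2 sources used"
```

The conditional edge out of "search" is where the design work lives: those routing decisions are exactly the implicit judgments we mined from the intermediate artifacts earlier.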

The tool registry. Research agents need tools: search APIs, document retrievers, code interpreters, etc. Having done our due diligence understanding the research process and tracking down the various data artefacts, we’ll be in a prime position to specify the tools our agent needs. 
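A tool registry can be as simple as a decorator that records each tool alongside the description an LLM will use to choose between them. A minimal sketch with stubbed tool bodies (real implementations would call the actual APIs):

```python
from typing import Callable

TOOLS: dict[str, dict] = {}

def register(name: str, description: str) -> Callable:
    """Decorator that records a tool with the metadata the model
    needs in order to select it."""
    def wrap(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return wrap

@register("web_search", "Search the web and return the top result snippets.")
def web_search(query: str) -> list[str]:
    return [f"stub result for {query!r}"]  # real search API in production

@register("fetch_doc", "Retrieve a document from the internal corpus by id.")
def fetch_doc(doc_id: str) -> str:
    return f"stub document {doc_id}"       # real retriever in production
```

Keeping name, description and implementation together in one place makes it trivial to render the tool list into a prompt or a function-calling schema.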

Data indexes. Our agent may need a “memory”. Typically, this will be a vector store, graph database or other index. Now’s the time to select one that will give our agent what it needs. 

Initial prompt design. Prompts contain our best attempt to define steps in the research process. We need to design them carefully - as if we were writing a tutorial for a junior colleague on their first day.  That said, we mustn’t agonise over perfection at this stage. If possible, we should get a stakeholder to review and correct them. The initial prompts should establish a solid baseline - something that works well enough to begin with. We’ll look at prompt optimisation strategies in the forthcoming Enhance blog. 

Concluding the Explore phase 

The Explore phase is a lot of upfront work before any code gets written - but don’t be tempted to skip ahead. The reason we developed and invested in the Explore method is because it enables us to build better agents - agents we can properly evaluate and, in the future, improve. 

In the next blog in this series, we'll cover the Execute, Engage and Enhance phases all together - explaining how we actually build and what our goals are at each stage. 
