How to create a self-improving AI agent

If you’re developing an AI agent and you want it to improve itself with usage, this post is for you.

Over the last two years, AI researchers and builders have developed a plethora of “improvement mechanisms”, each of which works well in some cases and poorly in others. The aim of this post is to provide some guidance: what is likely to work well for your agent?

Planning for self-improvement

You should be thinking about the improvement mechanism early on - ideally when designing the agent and its integration - because the choice affects what feedback you need to retain and how you’ll collect it. It may shape the design of any end-user application, and it will certainly be constrained by the quality and quantity of feedback that users are willing to provide.

Ask yourself: what opportunities exist for collecting outputs (including intermediate outputs, tool calls and reasoning chains) and for evaluating those outputs? Roughly speaking, there are four ways you can evaluate outputs:

  1. Human feedback. Get end users to edit or score the agent’s work.

  2. LLM-as-a-judge. Get another LLM to evaluate the agent’s work against set criteria.

  3. Verifiable criteria. Look for programmatic ways to evaluate behaviour. Does a piece of code execute? Is a tool call valid? Is a summary less than 50 words? (A sketch of such checks follows this list.)

  4. Gold standard outputs. Can you curate a “gold standard” dataset of inputs, intermediate outputs and final outputs? If you are automating a process that is performed frequently by your users, this data may be readily available. If not, can you acquire the time from experts to help you produce such a dataset?
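To make option 3 concrete, here is a minimal sketch of the kind of programmatic checks you might run against an agent’s outputs. The function names, the tool registry and the example output are illustrative placeholders rather than part of any particular framework.

```python
import json

def summary_under_word_limit(summary: str, max_words: int = 50) -> bool:
    """Verifiable criterion: is the summary at most `max_words` words?"""
    return len(summary.split()) <= max_words

def tool_call_is_valid(raw_call: str, known_tools: set[str]) -> bool:
    """Verifiable criterion: is the tool call well-formed JSON naming a real tool?"""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    return call.get("tool") in known_tools and isinstance(call.get("args"), dict)

def code_compiles(source: str) -> bool:
    """Verifiable criterion: does a Python snippet at least parse?"""
    try:
        compile(source, "<agent_output>", "exec")
        return True
    except SyntaxError:
        return False

# Hypothetical agent output, checked against the criteria above.
output = {
    "summary": "Quarterly revenue rose 4% on strong subscription growth.",
    "tool_call": '{"tool": "search_crm", "args": {"query": "Q3 renewals"}}',
}
checks = {
    "summary_length": summary_under_word_limit(output["summary"]),
    "tool_call_valid": tool_call_is_valid(output["tool_call"], {"search_crm", "send_email"}),
}
print(checks)
```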

Pre-existing “gold standard” datasets are a great place to start, but take care to understand the coverage of the data. Do you have examples reflecting all the likely permutations of user request? If not, at least some of your agent’s behaviour will fall outside your evaluation scope and remain undefined. Either curate examples to cover the gaps or introduce guard-rails into your initial deployment to flag these situations.
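A guard-rail can be as simple as flagging requests that do not resemble anything in your gold standard set and routing them to a human. The sketch below uses crude lexical similarity purely for illustration; in practice you would more likely use an embedding index, and the 0.6 threshold is an assumption you would tune.

```python
from difflib import SequenceMatcher

def max_similarity(request: str, gold_inputs: list[str]) -> float:
    """Crude lexical similarity between a new request and the gold-standard inputs."""
    return max((SequenceMatcher(None, request.lower(), g.lower()).ratio()
                for g in gold_inputs), default=0.0)

def in_scope(request: str, gold_inputs: list[str], threshold: float = 0.6) -> bool:
    """Guard-rail: only let the agent handle requests similar to something we've evaluated."""
    return max_similarity(request, gold_inputs) >= threshold

gold_inputs = ["Summarise this contract", "Draft a renewal email for client X"]
request = "Negotiate a discount with the supplier"
if not in_scope(request, gold_inputs):
    print("Out of evaluation scope: route to a human and log for future curation.")
```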

Human-corrected outputs and feedback are also valuable, but you need to be careful about how you collect them. Design mechanisms that make it easy for users to provide high-quality, aspect-specific feedback, and be aware that if users are forced to provide it they will often rush and give it little thought. This introduces “label noise” into your feedback dataset, lowering its utility significantly.

The other thing you should do is perform some rough calculations in advance. How often will your agent be executed? How much feedback will you realistically gather? Knowing this will help you select your “improvement mechanism”.
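The arithmetic is simple enough to sketch. Every number below is an assumption you would replace with your own traffic and feedback figures.

```python
# Back-of-envelope estimate of how much labelled feedback you can expect to collect.
runs_per_day = 200          # assumed agent executions per day
feedback_rate = 0.05        # assumed fraction of runs with usable user feedback
usable_after_noise = 0.7    # assumed fraction that survives label-noise filtering
weeks = 12

labelled_examples = runs_per_day * 7 * weeks * feedback_rate * usable_after_noise
print(f"~{labelled_examples:.0f} labelled examples after {weeks} weeks")
# ~50+ examples: prompt optimisation is feasible; ~500+: fine-tuning comes into range.
```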

Now, let’s introduce five of the most useful “improvement mechanisms”.

Prompt Optimisation

Well-written instructions and carefully chosen examples can work wonders in guiding LLMs to complete tasks correctly. It is possible to improve performance by analysing failures and hand-crafting the prompts, but you will often find yourself playing whack-a-mole, breaking one eval each time you fix another failure. Far better to use gradient-free optimisation procedures such as MIPROv2 and GEPA. This is the easiest way to improve an agent’s work-steps; the automated methods start to show results with as few as ~50 labelled examples.
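For illustration, here is a rough sketch of prompt optimisation with DSPy’s MIPROv2. The model name, signature, metric and training examples are all placeholders, and DSPy’s exact arguments vary between releases, so treat this as a starting point rather than a recipe.

```python
import dspy

# Configure the LLM used by this work-step (model name is illustrative).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A single work-step: classify a support ticket (signature is a placeholder).
classify = dspy.Predict("ticket_text -> category")

# ~50+ labelled examples gathered via one of the evaluation routes above.
trainset = [
    dspy.Example(ticket_text="Invoice charged twice",
                 category="billing").with_inputs("ticket_text"),
    # ... more labelled examples ...
]

def exact_match(example, prediction, trace=None):
    """Metric: did the optimised prompt reproduce the gold-standard label?"""
    return example.category == prediction.category

# Gradient-free optimisation: MIPROv2 searches over instructions and few-shot demos.
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_classify = optimizer.compile(classify, trainset=trainset)
```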

Consider automatic prompt optimisation when:

  • Your agentic workflow breaks down into steps and at least some of these steps involve simple calls to an LLM.

  • You can gather “ground truth” labels for the outputs from these steps using one of the methods above.

Limitations:

  • Does not scale well to learning from large feedback datasets: it’s a computationally expensive (or human-intensive) way to optimise. When you start to exceed 500 examples, look at fine-tuning instead.

Case-based reasoning (CBR)

The key insight here is that past successful tasks can be reused to improve future task completion. At its simplest, CBR involves nothing more than saving successful past outputs and making them retrievable by querying an index of past inputs. For example, if an agent needs to produce a SQL query to solve a task, the output would be the successful query and the input would be the request that generated it. (So far this is, of course, a simple RAG system.)

CBR extends simple RAG by allowing reuse and adaptation of past solutions, often across multiple iterations. The process of querying, adaptation and evaluation iterates until either a token budget is hit or a solution is successful.

The other key innovation over simple RAG is that the results of previous attempts to solve a problem (in previous iterations) are used to modify both the query and the way the results are used. In this way, an agent can explore and adapt the way it applies past outputs until it successfully solves its current task.
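Here is a minimal sketch of that retrieve-adapt-evaluate loop. The retriever uses crude lexical similarity for illustration (an embedding index would be the usual choice), and the adapt and evaluate hooks are placeholders you would wire to your own agent and success checks.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

@dataclass
class Case:
    task: str          # the original request, e.g. "monthly revenue by region"
    solution: str      # the successful output, e.g. the SQL query that worked

@dataclass
class CaseLibrary:
    cases: list[Case] = field(default_factory=list)

    def add(self, case: Case) -> None:
        self.cases.append(case)

    def retrieve(self, task: str, k: int = 3) -> list[Case]:
        """Return the k most similar past cases (lexical similarity for illustration)."""
        ranked = sorted(self.cases,
                        key=lambda c: SequenceMatcher(None, task, c.task).ratio(),
                        reverse=True)
        return ranked[:k]

def solve_with_cbr(task, library, adapt, evaluate, budget=3):
    """Retrieve similar cases, adapt them, and iterate until success or budget exhausted."""
    feedback = None
    for _ in range(budget):
        cases = library.retrieve(task)
        candidate = adapt(task, cases, feedback)   # e.g. an LLM call that edits a past solution
        success, feedback = evaluate(candidate)    # e.g. run the SQL and check the result
        if success:
            library.add(Case(task=task, solution=candidate))  # grow the library
            return candidate
    return None
```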

Useful when:

  • You can easily label outputs (and intermediate outputs) as successful or not.

  • Your agent (or work-steps within it) is responsible for planning or complex tool use.

  • You expect future tasks to look similar to past tasks.

Supervised Fine-Tuning (SFT)

Modern LLMs are surprisingly literate across myriad domains of human expertise, but they will not arrive honed to speak the language and follow the practices of your organisation and its workflows. SFT is a sample-efficient way to bring an LLM “close to” the right kind of behaviour for your operational context: teaching it how to use tools effectively, how to format outputs and what kinds of terminology and language to use.

You can get excellent results from SFT with fewer than 500 high-quality examples of “good outputs”. The challenge, of course, is assembling them. If you can prepare gold standard outputs from pre-existing data, you can use some of these for SFT. This will help in situations where the model is not following complex instructions correctly, is formatting results incorrectly or isn’t using the right tools at the right time.
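If you are fine-tuning a chat model, most SFT tooling expects something like the JSONL chat format sketched below. The file name, system prompt and example content are invented for illustration.

```python
import json

# Each gold-standard example becomes one chat-formatted training record.
examples = [
    {
        "request": "Summarise this escalation ticket for the weekly ops review.",
        "gold_output": "SEV-2: checkout latency spike; mitigated 14:10 UTC; root cause pending.",
    },
    # ... a few hundred more ...
]

with open("sft_train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You are the ops-review summariser. Use house style."},
                {"role": "user", "content": ex["request"]},
                {"role": "assistant", "content": ex["gold_output"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```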

SFT is not so good at correcting errors in reasoning, and it may not teach a model how to handle out-of-distribution tasks it has never seen before. For these more subtle failure modes, you may need to turn to DPO.

Direct Preference Optimisation (DPO)

What happens when a user doesn’t like the agent’s outputs? One option is to re-run the agent - perhaps with some modifications to its instructions. If your solution works like this, you have the chance to collect preference data: multiple outputs per task, where at least one is incorrect and one is satisfactory.

DPO is a lightweight (compared to fully fledged RL) post-training algorithm that teaches a model to produce the preferred outputs. (Clearly, you’ll need access to the model’s weights or the provider must make a DPO facility available to you.) DPO works well with shorter (single work-step or shallow multi-step) traces where the agent is directionally right but not quite matching your users’ requirements.
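The raw material is a set of preference pairs. The sketch below assembles them into the prompt / chosen / rejected layout that common DPO tooling (for example, Hugging Face TRL’s DPOTrainer) expects; the examples themselves are invented.

```python
import json

# One record per task: the output the user accepted and one they rejected.
preference_pairs = [
    {
        "prompt": "Draft a status update for the Q3 migration project.",
        "chosen": "Migration is 80% complete; remaining workloads cut over this Friday.",
        "rejected": "The project is going fine and should be done at some point soon.",
    },
    # ... more pairs, ideally on-policy (both outputs generated by the model) ...
]

with open("dpo_pairs.jsonl", "w") as f:
    for pair in preference_pairs:
        f.write(json.dumps(pair) + "\n")
```

From there, training is a call to your DPO trainer of choice, or to your provider’s preference-tuning endpoint if you don’t hold the weights.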

The challenge with DPO is that quality is everything. Relatively small preference datasets (upwards of 500 examples) can be effective, but only if they constitute a diverse, challenging and non-redundant set of examples.

N.B. Be wary of using human-corrected outputs in your preference pairs. DPO is observed to work well with “on-policy” generations (outputs produced by the model itself) but less well when the outputs are dissimilar to those the model would produce.

RL with explicit reward modelling

High-quality, diverse preference data can be tricky to obtain. An alternative is to explicitly score each output using an automated method - a “reward model”. This can be done in a number of ways, for instance:

  • Software tests that can be run against code outputs.

  • Evaluation metrics that can be directly computed.

  • Sets of business rules that can be run against the outputs.

  • Using an LLM to judge outputs against a “constitution” or specification.

Combining these signals into an overall reward model is something of a black art: none of the factors is sufficient on its own, and settling on a weighted combination of them is tricky. Furthermore, RL algorithms are notorious for learning to “hack” rewards, so you need to guard against degenerate outputs and inspect your RL checkpoints to verify that agents act in the spirit of the task as well as maximising the reward.
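As a sketch, a composite reward might look like the following. Every component and weight is an illustrative stand-in: in a real system the checks would be your actual test suite, metrics, business rules and judge calls, and the weights would be tuned as you inspect checkpoints for reward hacking.

```python
def reward_code_compiles(output: str) -> float:
    """Stand-in for running real software tests against a code output."""
    try:
        compile(output, "<agent_output>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def reward_business_rules(output: str) -> float:
    """Stand-in for business rules, e.g. no forbidden phrases in the output."""
    forbidden = ["guaranteed returns", "as an ai language model"]
    return 0.0 if any(p in output.lower() for p in forbidden) else 1.0

def reward_length(output: str, max_words: int = 200) -> float:
    """Penalise padded, degenerate outputs (a common reward-hacking symptom)."""
    return 1.0 if len(output.split()) <= max_words else 0.0

WEIGHTS = {"compiles": 0.5, "rules": 0.3, "length": 0.2}  # assumed weights, tune empirically

def composite_reward(output: str) -> float:
    scores = {
        "compiles": reward_code_compiles(output),
        "rules": reward_business_rules(output),
        "length": reward_length(output),
    }
    return sum(WEIGHTS[k] * v for k, v in scores.items())
```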

With a reward model in place, you can use RL to fine-tune the agent’s behaviours. (In practice, you’ll probably want to focus on optimising the agent’s planning and action-selection steps - rather than using it to simply improve text generation.)
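If you hold the model weights, one route is an RL fine-tuning library such as Hugging Face TRL. The sketch below outlines its GRPO trainer driven by the composite reward from the previous sketch; the model name and dataset file are placeholders, and TRL’s API changes between releases, so check the current documentation before relying on it.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_fn(completions, **kwargs):
    """Adapt the composite reward (from the earlier sketch) to TRL's list-of-floats interface."""
    return [composite_reward(c) for c in completions]

# Placeholder prompt dataset: a JSONL file with a "prompt" column of task requests.
train_dataset = load_dataset("json", data_files="agent_prompts.jsonl", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",   # placeholder open-weights model
    reward_funcs=reward_fn,
    args=GRPOConfig(output_dir="agent-grpo", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```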

RL is hard, but it’s often appropriate when:

  • Your agent takes multiple reasoning / action steps in one go - and you want to optimise the entire chain of actions.

  • You have multiple - or compound - objectives to satisfy.

  • You don’t have a suitable dataset of “gold standard” outputs that covers the spectrum of inputs your agent may receive.

Choosing an improvement mechanism

Let’s summarise the five methods above into a handy table:

| | Prompt optimisation | Case-based reasoning (CBR) | Supervised fine-tuning (SFT) | DPO | RL w/ reward model |
| --- | --- | --- | --- | --- | --- |
| Data | 50+ gold standard exemplars | Library of successful past tasks | 500+ high-quality gold standard outputs | 500+ preference pairs (on-policy generations) | Scorable outputs via a reward model |
| Single vs multi-step optimisation | Single steps or shallow workflows | Multi-step workflows and tool use | Single steps (or collapsed workflow trajectories) | Single step or shallow multi-step workflows | Explicitly multi-step workflows |
| What improvements do you seek? | Instruction following | Planning, effective tool use, reuse of procedural artifacts | Style, conventions, tool usage, output structure | Aligning outputs with user preferences | Shaping overall behaviour and decision-making |
| Sensitivity to data coverage | High: sensitive to gaps | Medium: new use cases must be similar to previous ones | High: sensitive to gaps | High: quality and diversity are key | Medium: depends on the quality of the reward model |
| Complexity | Low | Medium / High | Medium | Medium / High | High |

My “rule of thumb”

One of the challenges of building agents in 2026 is that there are so many choices for the designer to consider: LLMs, orchestration frameworks, agent patterns and improvement mechanisms, to name a handful. Hitting the required quality metrics often involves some experimentation, and it is difficult to offer performance guarantees in advance.

Hopefully this post has simplified one of these choices a little bit, guiding you towards the improvement method that is likely to work best for each situation. If you remember one thing, try this little rule of thumb:

  • Prompt optimisation fixes instruction following.

  • CBR improves planning.

  • SFT teaches local conventions, tools and style.

  • DPO aligns the agent’s preferences with the users’ preferences.

  • RL fixes policies and long-horizon sequences.

About Veratai

At Veratai, we build Deep Research Agents to automate the tiresome parts of knowledge work. If you want to talk further about self-improving agents, hit the Schedule a Call button above, and we’ll get in touch.
