Why Building AI Agents is Actually a Path Exploration Problem

I’m working on a project to use LLMs for mental health counseling. Super hot topic, I know. Very crowded space, I know. I’m loving every minute of it nonetheless.

Now, building a conversational AI agent that is meant to be deeply relational is an incredibly hard problem. Today, LLMs excel in domains where GPT-flavored answers are acceptable, like customer service, or in use-cases where the bar for value is so low that you can get vulnerable pre-teens hooked quickly (looking at you in judgement, character.ai).

Their tendency to be sycophantic is fundamentally at odds with what effective therapy needs. Instead, they should stand as a neutral third party, helping you navigate your own thoughts, emotions, and values and guiding you towards your unlock so you leave the session feeling more hopeful, optimistic, and empowered to create lasting positive change.

At least, that’s what I’ve noticed my best therapists do. My friend William (a trained psychiatrist) agrees.

Thus, every single word is of critical importance and can break the therapeutic alliance (or rapport, as we call it in the normal world) before you get to the “aha” moment.

For a given user input, we can represent the range of all possible model generations like so:

\[ f(i) = \{D_1, D_2, \ldots, D_n\} \]

Theoretically, n is almost infinite. Practically, it isn’t. There is an upper bound.

If we consider all the paths a model could take on the next turn and map them onto a distribution curve of how well each is represented in the training data, we will find that most of the time the model lands in the middle. This is not news to seasoned AI builders, but it’s something to grapple with daily.

Standard Distribution

In some cases that’s fine, but in many others it is undesirable. The “mean” of the entire internet is useful for factual information but terribly bland, generic, and tasteless for a conversation. We want surgical precision here, not a blunt hammer.

So what do we do? We resort to context engineering. We attempt to move the “Eye of Sauron” of the LLM and direct its attention away from the center of mass and towards the margins. To take the path less traveled, so to speak.

Standard Distribution Shifted

But, What is a Good Session?

Understanding the mark is key to determining whether we hit or miss it. In my case, what the LLM needs to do is make a good strategic decision and deliver that intent well over the entire length of a session.


Good Strategy + Good Delivery = Good Session


Let’s imagine someone comes in with some issues at work.

Work Issues

After their last message, which direction should the LLM go in? Lots of things come into play here: how far along are we in this session? What do we know and not know?

In this example, we’re early into the session and we’ve been uncovering what’s bothering the user. We do not yet have explicit buy-in from them on the core issue they’d like to work through and the end state they wish to reach. This helps make the right decision a bit more clear.

Work Issues, Open Chosen

Other decisions are less obvious, but you can start to get a sense of what we’re up against. At every turn of the conversation the model must decide where to go next based on context \(\mathcal{C}\). We can represent it in this way:

Decision Tree
The mighty decision tree
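If you prefer code to diagrams, here is one way to picture a node in that tree. This is only a sketch to fix ideas; the field names (strategy, intents) are my shorthand for the concepts discussed below, not an actual schema from the project.

```python
from dataclasses import dataclass, field

@dataclass
class DecisionNode:
    """One candidate next move for the agent, given the context so far.

    Illustrative only: field names are shorthand, not a real schema.
    """
    strategy: str                     # e.g. "explore", "reframe", "close"
    intents: list[str]                # one or two intent labels (more on these below)
    children: list["DecisionNode"] = field(default_factory=list)

@dataclass
class Turn:
    speaker: str                      # "user" or "agent"
    text: str

# Context C is, at minimum, the running transcript.
Context = list[Turn]
```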

Reframing the Problem As Path Exploration

This representation is exactly where we want to end up! The one thing I learned from my calculus courses in university is that many difficult problems become solvable if you can find a clever way to reframe the question into one where you can apply proven, existing theorems.

The Bitter Lesson by Sutton argues that over the long haul, breakout improvements in AI come from search and learning. Representing the problem as a path exploration problem (i.e., search) allows us to focus our improvement efforts on work that benefits from future model upgrades, not work that is rendered redundant by them.

Shrinking the Size of the “D” Set

Something I’ve found worked well for me (your mileage may vary) is to make the decision space explicit. Basically when the user says:

“My mom doesn’t support my decision to drop out of grad school to become a Youtuber”

Rather than relying on the LLM to compute in its latent space what it should do and coming back with an answer that may completely miss the mark, we force it to choose amongst a pre-defined list of class labels (or intents). For example:

Turn Instructions

I’ve settled on about 50 class labels which is good enough for my use-case.

\[ L = \{l_1, \ldots, l_{50}\} \]
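To give a flavor of what that looks like in practice, here is a minimal sketch. The label names, the prompt wording, and the `call_llm` helper are all assumptions for illustration; the real list has ~50 labels.

```python
# A small slice of the intent label set (the real one has ~50 entries).
INTENT_LABELS = [
    "validate_emotion",
    "ask_clarifying_question",
    "explore_underlying_belief",
    "challenge_thinking",
    "summarize_and_confirm",
    "encourage_hope",
]

def choose_intent(conversation: str, call_llm) -> str:
    """Force the model to pick from a closed set instead of free-forming.

    `call_llm(prompt) -> str` is a placeholder for whatever completion
    function you use; structured-output / JSON modes work even better.
    """
    prompt = (
        "Given the conversation so far, pick the single best next intent.\n"
        f"Conversation:\n{conversation}\n\n"
        "Respond with exactly one label from this list:\n"
        + "\n".join(f"- {label}" for label in INTENT_LABELS)
    )
    answer = call_llm(prompt).strip()
    # Guard against the model drifting outside the label set.
    return answer if answer in INTENT_LABELS else "ask_clarifying_question"
```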

This significantly shrinks the size of our D set, and the approach brings many benefits with it.

The cool thing is you can always change the number of labels based on your emerging needs.

In conclusion, using class labels allows us to break the LLM’s myopic tendencies, causing the attention mechanism to consider disparate paths across the entire distribution curve and weight them against each other. Huge win!

Hard Numbers on the Theoretical Ceiling

With that said, how many ways could our AI model really respond and how many paths can exist?

\[ \begin{gather} \text{Let us define the range of possibilities as:}\\ f(i) = \{D_j\}_{j=1}^{n} \end{gather} \]

Let us assume that 90% of sessions are less than 30 turns, which means roughly 13 agent messages and thus 13 decisions. At every turn, the agent must decide:

  1. What strategy to use
  2. How to deliver that strategic intent well

We will assume, for simplicity’s sake, that at every turn there are 8 viable strategies to choose from. This isn’t an enforced cap per se but a rough guess that matches my experience testing the agent.

We can also assume we won’t consider every intent at every turn. There is an aspect of phase-appropriateness here where using the connect_welcome intent as the session is coming to a close or using encourage_hope three messages into the session simply doesn’t make sense.

\[ \begin{align} &\text{Let }\mathcal{L}_1 \text{ be all viable intents per turn.}\\ &\mathcal{L}_1 \subseteq L,\ |\mathcal{L}_1| \leq 25 \\ \\ &\text{Let }\mathcal{L}_2 \text{ be all unique combinations.}\\ & \mathcal{L}_2 = \mathcal{L}_1 \cup \{\{l_i, l_j\} : l_i, l_j \in \mathcal{L}_1,\ i < j\}\\ &|\mathcal{L}_2| = |\mathcal{L}_1| + \binom{|\mathcal{L}_1|}{2} \leq 325\\ \\ &\text{Let }\mathcal{S}\text{ be strategic intents per turn.}\\ &\mathcal{S}=\{s_1, \ldots, s_8\}\text{ such that }|\mathcal{S}|=8\\ \\ &\text{We define the decision space }\mathcal{D}\text{ as:}\\ &\mathcal{D}=\mathcal{S}\times\mathcal{L}_2\text{ such that }|\mathcal{D}| = |\mathcal{S}|\,|\mathcal{L}_2| \leq 2600 \end{align} \]

In reality there is some segmentation whereby every strategy naturally captures a specific subset of \(\mathcal{L}_2\) but let’s just make our math simple. This gets us back to some concrete numbers.

\[ f(i) = \{D_j\}_{j=1}^{2600} \]

About 2,600 nodes per level. We said earlier most sessions require about 13 agent decisions, so over the course of an entire session, what does that look like?

\[ \begin{align} &\text{Let total nodes be:}\\ &\sum_{t=1}^{T} |\mathcal{D}| = T|\mathcal{D}|\\ &\sum_{t=1}^{13} 2600 = 2600 \times 13 = 33800\\ \\ \\ &\text{Let total paths be:}\\ &\prod_{t=1}^{T} |\mathcal{D}| = |\mathcal{D}|^{T}\\ &\prod_{t=1}^{13} 2600 = 2600^{13} = 2.48 \times 10^{44} \end{align} \]
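If you want to sanity-check these numbers, the arithmetic is small enough to script. The 25, 8, and 13 below are the assumptions stated above:

```python
from math import comb

viable_intents = 25          # |L1|: viable intent labels per turn
strategies = 8               # |S|: viable strategies per turn
decisions = 13               # T: agent decisions in a typical session

intent_combos = viable_intents + comb(viable_intents, 2)  # |L2| = 25 + 300 = 325
decision_space = strategies * intent_combos               # |D| = 8 * 325 = 2600

total_nodes = decisions * decision_space                  # 33,800
total_paths = decision_space ** decisions                 # ~2.48e44

print(intent_combos, decision_space, total_nodes, f"{total_paths:.2e}")
```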

Our magnificent decision tree ends up with ~34K nodes and an astronomical number of unique paths. Yet, we know from our earlier examples not all paths are created equal. Like wanting to “challenge thinking” when the user is still just explaining their situation.

Decision Tree, Highlighted
Some paths lead to bad outcomes, others to good outcomes

The name of the game then becomes aggressive, strategic tree pruning, and we have two primary levers we can use to that end:

Engineering: Algorithmic Gains

One strange phenomenon I noticed while testing this AI agent is that it had a tendency to be overly validating.

Over Validation
Example of repeated validation

This is a serious problem. Having your conversation partner say “I understand how that must feel” every single message gets old QUICK! We needed to disincentivize this pattern where the model abuses certain intents.

I created a domain scoring algorithm to do that. It works by accruing a scaling heat penalty for repeated uses of a specific intent with a cooldown rate for every turn it’s not used.

Domain Scoring Paths
Domain scoring algorithm

Intents that are “hot” are ranked lower, and those even hotter are completely banned from re-use until they cool down a bit. This forces more diversity out of the LLM than it inherently wants to provide. You can see a snippet of the algorithm below:

Domain Scoring Snippet
Overview of the domain scoring algorithm
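Since the snippet above is an image, here is a minimal sketch of the heat/cooldown idea in plain Python. The class name, constants, and linear penalty are illustrative assumptions, not the project’s actual implementation.

```python
class IntentHeatTracker:
    """Penalize intents that get reused too often; let them cool off over turns."""

    HEAT_PER_USE = 1.0       # heat added each time an intent is chosen
    COOLDOWN_PER_TURN = 0.5  # heat removed each turn an intent is NOT chosen
    BAN_THRESHOLD = 2.5      # above this, the intent is temporarily off the menu

    def __init__(self) -> None:
        self.heat: dict[str, float] = {}

    def record_turn(self, chosen_intent: str) -> None:
        # Every intent that was not chosen cools down a bit...
        for intent in self.heat:
            if intent != chosen_intent:
                self.heat[intent] = max(0.0, self.heat[intent] - self.COOLDOWN_PER_TURN)
        # ...while the chosen intent accrues heat.
        self.heat[chosen_intent] = self.heat.get(chosen_intent, 0.0) + self.HEAT_PER_USE

    def score(self, intent: str) -> float:
        """Lower is better; hot intents rank worse."""
        return self.heat.get(intent, 0.0)

    def is_banned(self, intent: str) -> bool:
        return self.heat.get(intent, 0.0) >= self.BAN_THRESHOLD


# Usage: rank candidate intents before asking the LLM to choose among them.
tracker = IntentHeatTracker()
tracker.record_turn("validate_emotion")
tracker.record_turn("validate_emotion")
candidates = ["validate_emotion", "explore_underlying_belief"]
ranked = sorted(
    (c for c in candidates if not tracker.is_banned(c)),
    key=tracker.score,
)  # "validate_emotion" now ranks behind the cooler intent
```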

This was very effective. Depending on your use case, there may be other techniques to discover but this is where you can create lasting value as an engineer. Examine the unique ways your path exploration sucks and make improvements.

Data Augmentation: Monkey-see, Monkey-do

Despite those efforts, the model can still struggle with effective path exploration when the decision criteria are not clear. Take for example the exploration phase. We define it as:

Exploration Phase: Digging deeper into thoughts, feelings, beliefs, and values to identify the root of the problem

In this phase of the session, how should the agent know when to progress? The decision criteria are rather easy. For instance:

User reached their “aha” moment, connecting their current issue to some deeper underlying concept. Identifiable by statements like:
- “I guess it’s due to [some childhood memory]”
- “I realize that [ … ]” 
- etc…

This is mostly black and white. An LLM with a good system prompt can easily determine when that part is done. The “discovery” phase, however, is another story. We define it as:

Discovery Phase: Gathering background context & hard facts about the situation, problem identification and exploration

This phase is particularly hairy. We tell the LLM in its prompt:

When you have enough context and hard facts about the situation, move on to the exploration phase.

but what the heck counts as enough? It’s a very nuanced ask, and in most cases the model got stuck, never feeling satisfied enough to move on. Other times, it jumped the gun and we entered deeper territory under-equipped with the relevant information to be effective.

Solving for Nuance

Based on a given context \(\mathcal{C}\), all decision nodes ultimately fall into 3 buckets:

  1. Bad (30%)
  2. Decent (60%)
  3. Great (10%)

There are many obviously bad ways to react that we can prune out, but the trickier issue is sifting out the few seeds of Great in a large sea of Decent. Knowing the difference and consistently choosing well is a task beyond today’s models’ capabilities. In this case, data can come to our rescue! It’s much easier when you can show the model examples of good decisions it can pattern itself after.

Going back to the size of our decision tree:

\[ \begin{align} &\text{Let total nodes be:}\\ &\sum_{t=1}^{T} |\mathcal{D}| = T|\mathcal{D}| = 3.38 \times 10^4\\ \\ &\text{Let total paths be:}\\ &\prod_{t=1}^{T} |\mathcal{D}| = |\mathcal{D}|^{T} = 2.48 \times 10^{44}\\ \end{align} \]

Where do we start? First, we don’t need to cover all the paths (we can’t anyway), but we do want enough examples to cover the decision nodes. We also want the examples to be diverse and to represent various preceding contexts \(\mathcal{C}\), so let’s say 10 examples per node[1]. That’s about 338K examples in total for 99%+ coverage.
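For what it’s worth, the way those examples get surfaced at runtime (per footnote [1]) is a plain semantic search: summarize context \(\mathcal{C}\) into a short query, embed it, and pull the nearest stored examples. A rough sketch, where the embeddings come from whatever embedding model you already use and the 0.6 cutoff is the one from the footnote:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_examples(query_embedding: np.ndarray,
                      example_bank: list[tuple[np.ndarray, str]],
                      k: int = 10,
                      threshold: float = 0.6) -> list[tuple[float, str]]:
    """Return up to k stored examples whose similarity clears the threshold.

    `example_bank` holds (embedding, example_text) pairs produced offline.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), text)
        for emb, text in example_bank
    ]
    scored = [pair for pair in scored if pair[0] >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```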

Thinking Hard, Doing Math
Calculating where I’m gonna get all this data

If you have OpenAI’s balance sheet, that’s easy. If you’re one guy with a Claude Code subscription (hi, that’s me!), that is a ridiculously large amount of data to gather. Let’s figure out how we can scope it down.

A quick prompt to AI reveals the following breakdown for counseling session topics[2]:

  1. 30-40% are anxiety & depression disorders
  2. 15-25% relationship problems & various traumas
  3. 5-10% others

I’m pretty surprised by this! I would’ve swapped the order of the first two. Anyhow… if we just focus on the anxiety use-case, we can cut our data requirement to roughly a third of its original size, bringing it down to ~112K records.

Furthermore, we don’t realistically need 100% coverage. While vanilla LLMs default to generic responses, they can learn to navigate specific regions of the decision space when given targeted, high-quality examples. Thus, we can apply the Pareto Principle (80/20 rule) and conclude that ~22K sufficiently broad records can capture most of what’s needed to guide good decisions in anxiety-related sessions.
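Putting the scoping arithmetic in one place (the one-third and 20% factors are the assumptions above):

\[ \begin{align} 33{,}800 \text{ nodes} \times 10 \text{ examples} &\approx 338\text{K} \\ 338\text{K} \times \tfrac{1}{3} \text{ (anxiety only)} &\approx 112\text{K} \\ 112\text{K} \times 0.2 \text{ (Pareto)} &\approx 22\text{K} \end{align} \]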

Now, creating 20K records is doable over a weekend with synthetic data. This would have been totally impossible with GPT-3.5, but today’s SOTA models can perform the task reasonably well with a high-quality seed.

I can write 200 examples by hand, then write a Python script to expand that list.
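A rough sketch of what that expansion script could look like. The `generate` function stands in for whichever model API you call, and the prompt and expansion factor are illustrative; 200 seeds at ~100 variations each is roughly where the 20K figure lands.

```python
import json

def expand_seeds(seed_examples: list[dict], generate, variations_per_seed: int = 100) -> list[dict]:
    """Expand hand-written seed examples into a larger synthetic set.

    `generate(prompt) -> str` is a placeholder for your LLM call of choice.
    Each seed is a dict like {"context": ..., "decision": ..., "response": ...}.
    """
    synthetic: list[dict] = []
    for seed in seed_examples:
        prompt = (
            "Here is an example of a good counseling-agent decision:\n"
            f"{json.dumps(seed, indent=2)}\n\n"
            f"Write {variations_per_seed} variations as a JSON list. Keep the same "
            "decision quality, but vary the user's situation, wording, and tone."
        )
        raw = generate(prompt)
        try:
            synthetic.extend(json.loads(raw))
        except json.JSONDecodeError:
            # Models sometimes return malformed JSON; skip and retry this seed later.
            continue
    return synthetic
```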

Risks and Considerations

Now, if you’ve been paying attention, you will notice we made a lot of assumptions in modeling our approach. This is necessary because it’s impossible to know exactly what’s needed, and it’s helpful to establish a baseline. It’s important to acknowledge this doesn’t get you to the finish line, just to the starting one. You get something you can test and iterate from, but you should beware of false confidence and YOLOing your approach into the market.

Haven’t you read the scriptures?

“Foolish is the man who develops agentic AI systems and does not have exhaustive eval tests in his CI/CD pipeline before every production release. He will face great ruin and embarrassment.”

Proverbs 53:4. I think. Probably.

Conclusion

LLMs can feel magical, and in some ways they are, but fundamentally[3] they are extremely elaborate machine learning algorithms. As such, principles of engineering and math can be applied in interesting ways to improve performance.

If you can twist the frame of the picture, your impossibly abstract task can boil down to a mathematical and engineering problem: how do I constrain my output space to minimize false positives, and how do I optimize my path exploration?

Fun stuff!


[1]: This isn’t exhaustive, but if you use an LLM to summarize context \(\mathcal{C}\) into a search query string of < 240 chars for a semantic search, then 10 core examples give 80% search coverage of one or more hits above 0.6 cosine similarity.

[2]: It should be noted that most patients come in with an overlap of 2-3 issues, and the one they first present with, such as “stress management”, is rarely the core issue they truly have.

[3]: I almost said “from first principles thinking” here to virtue-signal how smart and hip I am.