Saturday, November 2, 2024

User Agent Agents

Charlie Sweeting

@csweeting_

When you opened this page your device sent a network request. Given you’re reading these words, it successfully told the computer on the other end who was calling, what you were asking for and where to send the response.

Usually that response is the same static content for everyone, but without a reference point we rarely notice the subtle ways our experiences differ.

Your HTTP headers tell websites what language you want a response in, your browser hints at your local time (imagine scheduling without it) and your user agent likely changes the kinds of ads you’re shown.
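
As a rough sketch (the handler and port here are made up, but the header names are standard), a server can read those hints straight off the incoming request:

```typescript
import { createServer } from "node:http";

// Minimal sketch: a server reading the hints a browser sends with every request.
// The header names are standard; the handler and port are illustrative only.
createServer((req, res) => {
  const language = req.headers["accept-language"]; // e.g. "en-GB,en;q=0.9"
  const userAgent = req.headers["user-agent"];     // device, OS and browser details
  res.end(`You asked for ${req.url} in ${language}, from ${userAgent}`);
}).listen(3000);
```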

It also happens not so subtly.

Whenever we authenticate (log in) on the web we see radically different content from the next person. Bearer tokens and cookies identify you between sessions, personalising what you’re shown and the recommendations you’re given.
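
In concrete terms, that identity rides along with every request. A minimal sketch, with a placeholder endpoint and token:

```typescript
// Minimal sketch of identity riding along with a request.
// The endpoint and token are placeholders, not a real API.
const response = await fetch("https://example.com/api/feed", {
  // A bearer token identifies the session that logged in earlier;
  // cookies do the same job and the browser attaches them automatically.
  headers: { Authorization: "Bearer <session-token>" },
  credentials: "include", // send cookies with the request
});
const personalisedFeed = await response.json();
```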

Much of this is actively beneficial. You don’t want “coffee shops near me” to give you coffee shops in another country or a “For You” page to show you videos of Mongolian throat singing (unless you’re really into Mongolian throat singing), but it does come with risks.

When we hand over our data, we’re no longer in control of it. Nobody wants their payment details or search history leaked to strangers on the internet, and no rational company wants to expose itself to fines when data invariably leaks, but both become possibilities.

So we’ve struck an equilibrium: we share small slices of our digital footprint with companies, they model us and build better products on the basis of our trust, and governments step in when something goes wrong in the value exchange or a bad actor tries to derail the system.

I don’t think that’s a compromise anyone is happy with.

The boundary between what runs on our devices and what we transfer to servers is like a property boundary. We’ve closed the doors to many of the rooms in our house, but our collective front doors are unlocked, and even if nothing gets stolen, how do we stop service providers wildly misquoting us because they’ve worked out we’re desperate to get some work done? We can’t; we either have to let them in or not engage with them at all.

So where do agents come in? What changes about shifting data from client to server with the introduction of LLMs?

Local models can access the full scope of our data without moving it off device, reason about the world and be primed to represent our interests. They’re a private, nuanced and directable interface. They can sit at the property line and describe our homes in detail without letting someone look around and, best of all, they can lie about it if they’re not convinced visitors have good intentions.

World knowledge, reasoning and local context add up to a technology shift that respects the client-server boundary. Now each TCP connection can be a rich negotiation that finds balance for both parties, an almost perfect reflection of the relationships between people that the digital world mirrors. A user agent string tells a company what kind of device is calling; a User Agent can say so much more. It resets the power balance between those who create products and the people who consume them: two models, one on the server and one on the client, negotiating a nuanced, custom arrangement that serves both sets of interests in the time it takes for a few round trips between device and server.
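
Nothing like this negotiation exists as a protocol today, but as a purely hypothetical sketch, the exchange might look less like a static header and more like a short structured conversation between the two models:

```typescript
// Hypothetical sketch only: no such protocol exists today.
// A local model answers a server's questions on the user's behalf,
// deciding per question how much on-device context to reveal.

interface NegotiationQuestion {
  from: "server";
  question: string; // e.g. "What price range suits this user?"
}

interface NegotiationAnswer {
  from: "client";
  answer: string;        // a natural-language reply, not raw data
  contextUsed: string[]; // which slices of local context informed it
}

async function answerOnBehalfOfUser(
  q: NegotiationQuestion,
  askLocalModel: (prompt: string) => Promise<string>
): Promise<NegotiationAnswer> {
  // The local model sits at the property line: it consults full user
  // context but only returns a summary it judges safe to share.
  const answer = await askLocalModel(
    `Answer on the user's behalf without revealing raw personal data: ${q.question}`
  );
  return { from: "client", answer, contextUsed: ["stated preferences"] };
}
```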

What about businesses, do they do worse in this paradigm? The opposite. They get a rich consumer interface that tells them everything they need to know without the effort of capturing and modelling a snapshot of a customer over time. If market forces push companies towards cheaper, faster and better, then an agentic interface between users and products ticks all three boxes. There’s less need to store and model data over time, conversations happen in hundreds of milliseconds, and the personalisation we get from few-shot queries with full user context is a better consumer experience than the low-context environment most products operate in. Companies with strong existing user data moats suffer, but that’s a positive. This is a technology shift where utility pushes the equilibrium back towards respecting a user’s privacy and preferences.

When I saw the release of Gemini Nano in the Canary release of Chrome and the original Apple Intelligence announcement, it felt like we were taking baby steps towards this future where local models represent users and negotiate with server-based models representing products. When Apple Intelligence was revealed to have access to a swappable slice of user context, it felt like we were taking leaps towards that full-context local agent. When I read the Apple docs and realised we were limited to rewriting text in a different tone of voice and creating custom emojis, it felt far less substantial, although recent demos of Apple Intelligence prompt injection do suggest a higher level of unrestrained potential.

Stepping stones with strong guard rails make sense. We’re still understanding what LLMs look like as a security problem, and learning in production is a terrible idea when sensitive data hangs in the balance, but I think we’ll work out those kinks in the next few years, and when we do, other interfaces are going to change.

The default assumption currently seems to be that agents will engage with the internet on our behalf, browsing sites and showing us the most relevant information that reflects our intent. Given the internet is currently geared towards more rigid user flows, that seems sensible, but this isn’t a static system; interfaces can adapt too. Experiments with generative UI are already showing it. UI snippets served in response to natural language are a more flexible alternative to the user paths we currently use to guide people through products. When you match a dynamic UI with the context you can query from a model representing a user, you can shorten a path to three elements: a confirmation, a “try again” button and a text field to nudge the interface in another direction.
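
As a hypothetical sketch (the types are mine, not from any real framework), a generated snippet could collapse a whole flow into exactly those three elements:

```typescript
// Hypothetical sketch: a generated UI snippet reduced to three elements.
// None of these type or field names come from a real framework.
type GeneratedUi = {
  summary: string;                // what the server thinks you asked for
  confirm: { label: string };     // accept the proposal as-is
  retry: { label: string };       // regenerate the whole snippet
  nudge: { placeholder: string }; // free-text field to steer the next attempt
};

const bookingSnippet: GeneratedUi = {
  summary: "Table for two at 19:30 on Friday, near your office",
  confirm: { label: "Book it" },
  retry: { label: "Try again" },
  nudge: { placeholder: "e.g. somewhere quieter, closer to home" },
};
```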

Generative UI in response to a client-side model acting as a user’s agent is just one example of an API mediated by models on both sides, but I think it’s already outside the current zeitgeist of likely model applications.

The buzz over local models is currently geared inwards: no network requests means faster inference, no moving data off your device and full user context. All of these are exciting, but I think the real excitement is in the new API this creates with the networks outside your device. It seems like both our request and response objects are going to start looking more like natural language, and our logs more like conversations. The implications of that are enormous, because the expressiveness of the format is orders of magnitude greater than the highly structured way we currently communicate through APIs.
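
To make the contrast concrete (both payloads are invented for illustration), the same intent might shift from a rigid schema to a sentence:

```typescript
// Illustrative contrast only; neither payload belongs to a real API.

// Today: a rigid, structured request the client must already know how to build.
const structuredRequest = {
  endpoint: "/v1/search",
  params: { category: "coffee", radius_km: 2, open_now: true, sort: "rating" },
};

// Tomorrow: request, response and the log between them read like a conversation.
const conversationalRequest = {
  message: "Find a quiet coffee shop within walking distance that's open now",
};
const conversationalResponse = {
  message: "Three places fit; the closest is a six-minute walk away.",
};
```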

I want to explore many of these branches individually in future writing, but I think the overall trend is likely the same: a more human client-server interface is going to mechanically reflect human interactions. Reputation, trust, gossip and a host of other dynamics that surround software but aren’t integrated into it will start to play a part as systems become more intimate, highly networked and high-bandwidth.