There is a clear correlation between the types of data available on the internet and the types of tasks that large language models excel at.
LLMs are trained on the text of the internet, which contains most human world-knowledge, and as a consequence models can engage in dialogue on niche topics. LLMs are trained on GitHub, which contains most of the world's code, and code was naturally the first domain where models found true product-market fit: GitHub Copilot shipped in June 2021, a year before ChatGPT, and today half of Anthropic's API usage is code or code-adjacent. LLMs are trained on mathematical puzzles, which exist in large quantities on the web, and unsurprisingly models match or exceed the best humans at competition mathematics.
However, LLMs frequently perform poorly in domains that at first glance seem far easier than software engineering or competition math. TauBench is a benchmark that measures a model's ability to handle customer-service tasks in the airline and retail domains: think requests like "I want to refund my order, except the water bottle" or "push my flight back by a day and a half". The best LLMs achieve around 60% accuracy on this test, far below what would be tolerable from a human operator.
We contend that this underperformance stems from the absence of training data for a certain flavour of medium-complexity business-logic processing task, one that is omnipresent in enterprise settings but rarely found on the public internet.
Our aim is to fill this lacuna and build the business intelligence data corpus.
Our customers give us high-complexity, undifferentiated tasks; we perform them quickly and affordably using a combination of LLMs and human supervision. We record in detail the steps taken to perform each task, and in so doing begin building the corpus of medium-complexity business-logic processing that current LLMs lack.