Asana's LLM testing playbook: our analysis of Claude 3.5 Sonnet

Bradley Portnoy
July 1, 2024
8 minute read

Many special thanks to Kelvin Liu, Daniel Hudson, Val Kharitonov, Nik Greenberg, Poom Chiarawongse, Rob Aga, and Eric Pelz, who did the work behind this analysis and have been key to the development of our QA process over the past year.

At Asana, AI is more than a tool. It’s an integrated teammate that helps advise on your daily priorities, power workflows that can intelligently triage bugs or field marketing requests, and provide insight into sales trends and customer feedback. To stay at the forefront of AI, it’s essential that we keep up-to-date on the leading frontier large language models (LLMs). Over the past year we’ve upgraded our systems and prompts regularly as our partners have offered more advanced models. And we’ve invested significant time and effort in building out our LLM QA capabilities to the point that we’re now able to provide an initial assessment of new models in under an hour.

We’ve partnered with Anthropic in testing a few of their pre-release models, and were early testers for Claude 3.5 Sonnet. We spent a quick sprint putting the model through its paces, testing everything from pre-production performance to agentic reasoning and writing quality.

The results? Claude 3.5 Sonnet is 67% faster than its predecessor, and is simply the best model at writing and finding insight in our data that we’ve tested. It has also leapt forward in its ability to act as an agent, scoring 90% on our tool use benchmark and successfully executing longer and more complex workflows. And Claude 3.5 Sonnet produced twice as many perfect scores in our qualitative answer assessment as Claude 3 Sonnet, passing 78% of our LLM unit tests – the highest score for accuracy in long contexts we’ve recorded, on par with Claude 3 Opus.

Read on to find out more of what we learned about Claude 3.5 Sonnet, and how we approach LLM QA at Asana.

Our methodology

Publicly available model evaluations are great indicators of how frontier models perform. But ensuring that our LLM-powered features leverage Asana’s Work Graph and provide reliable insights to our largest enterprise customers means that we need to rely on our own testing and evaluation. In addition to the extensive safety, bias, and quality testing run by Anthropic, we’ve invested significantly in building out a high-quality and high-throughput QA process that reflects our customers’ real-world collaborative work management use cases.

When evaluating a new LLM, our testing process can make use of any of the following methods – and more.

Unit testing

Our LLM Foundations team (we’re hiring!) built an in-house unit testing framework that enables our engineers to test LLM responses during development similarly to traditional unit testing. Because the output of LLMs can be slightly different each time even given the same input data, we use calls to LLMs themselves to test whether our assertions are true. For example, we may wish to assert that the model captures key details of a task, such as the launch date:

```javascript
llm question: "Does the input say that the feature needs to launch by the end of 2022?"
```

Other examples of tests include the Asana version of a needle-in-a-haystack test (asking the model to find the correct data in a very large project), as well as a test of whether the model can synthesize the correct answer to a query from retrieved data.

While this is more accurate than other methods that could be used to test LLMs (such as using a RegExp to extract a date), we typically run each test multiple times to get the final result (often using best-of-3) as there is still an element of randomness. We expect this to improve over time as models become more advanced.
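As a rough illustration – not our actual framework, and helper names like `completeWithLLM` are hypothetical stand-ins – an LLM-backed assertion with a best-of-3 vote might look something like this:

```javascript
// Illustrative sketch: ask an LLM judge the assertion question several times
// and take a majority vote. `completeWithLLM` is a hypothetical stand-in for
// whatever LLM client you use.
async function llmAssert(question, input, { runs = 3 } = {}) {
  let yesVotes = 0;
  for (let i = 0; i < runs; i++) {
    const verdict = await completeWithLLM(
      `Answer strictly "yes" or "no". ${question}\n\nInput:\n${input}`
    );
    if (verdict.trim().toLowerCase().startsWith("yes")) yesVotes++;
  }
  // Best-of-3 by default: the assertion passes if a majority of runs say "yes".
  return yesVotes > runs / 2;
}

// Example assertion from above:
// await llmAssert(
//   'Does the input say that the feature needs to launch by the end of 2022?',
//   taskDescription
// );
```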

For this type of testing, we target test coverage similar to traditional unit tests, and similar performance as well—these tests should run quickly enough for our developers to use them as we iterate in our sandboxes. We’re looking forward to sharing more about our LLM unit testing framework in an upcoming post.

Integration testing

Many of our AI-powered features require chaining multiple prompts together, including tools selected agentically. We use our LLM testing framework to test these chains together before features are released. This allows us to determine whether we’re able to both retrieve the necessary data and write an accurate user-facing response based on that data.
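To make that concrete, here’s a simplified sketch of what an integration test over a two-step chain might look like. The function names (`runRetrievalAgent`, `writeAnswer`) are hypothetical stand-ins, and `llmAssert` is the helper sketched above:

```javascript
// Illustrative integration test: run retrieval and answer-writing together,
// then check both the retrieved data and the final user-facing response.
async function testSearchThenAnswerChain() {
  const query = "When is the mobile redesign launching?";

  // Step 1: agentic tool selection + data retrieval (hypothetical helper).
  const retrieved = await runRetrievalAgent(query);
  if (!retrieved.tasks.some((t) => t.name.includes("Mobile redesign"))) {
    throw new Error("Retrieval step missed the relevant task");
  }

  // Step 2: write the user-facing answer from the retrieved data and verify it.
  const answer = await writeAnswer(query, retrieved);
  const ok = await llmAssert(
    "Does the answer state a launch date for the mobile redesign?",
    answer
  );
  if (!ok) throw new Error("Answer step did not surface the launch date");
}
```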

End-to-end tests

Our e2e testing uses realistic data in sandboxed test instances of Asana. This is closest to the experience our customers will have with our features, but comes at the expense of evaluation time; rather than using automated evaluation, we run these tests less frequently, and they’re graded by one of our product managers.

We’ve found that the cost in PM time is more than worth the investment. Our end-to-end testing process has discovered a number of surprising quality and formatting changes when rolling out new models, well before they reached production. This process also allows us to assess other aspects of “intelligence” that are harder to isolate and quantify – ranging from style and tone to deeper understanding and “connecting the dots.” Automated unit and integration tests have been great time-savers during development, but we’ve learned from experience that they’re not a replacement for a real human evaluating the output of a model.

Additional tests for new models

When we work with partners to test pre-production models, we sometimes bring in other assessments. We have a collection of scripts that allow us to rapidly generate performance statistics, specifically measuring time-to-first-token (TTFT) and tokens-per-second (TPS). And to test the agentic capabilities of Claude 3.5 Sonnet, we added two more tests: a quantitative benchmark on our custom tool-use extractor, and qualitative testing using our internal multi-agent prototyping platform.
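For a sense of what those performance scripts measure, here’s a minimal sketch. `streamCompletion` is a hypothetical stand-in for any streaming LLM API call that yields tokens as they arrive:

```javascript
// Illustrative probe: stream a completion and record time-to-first-token
// (TTFT) and output tokens-per-second (TPS).
async function measureLatency(streamCompletion, prompt) {
  const start = performance.now();
  let firstTokenAt = null;
  let tokenCount = 0;

  for await (const token of streamCompletion(prompt)) {
    if (firstTokenAt === null) firstTokenAt = performance.now();
    tokenCount++;
  }
  const end = performance.now();

  return {
    ttftMs: firstTokenAt - start,
    tps: tokenCount / ((end - firstTokenAt) / 1000),
  };
}
```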

What we learned about Claude 3.5 Sonnet

Performance¹

No one likes a teammate who doesn’t respond promptly. Providing a speedy experience for our users has been a major focus as we’ve developed our AI teammate chat experience, so the first tests we ran on 3.5 Sonnet were our performance suite. Our primary metric here is time to first token, which indicates how quickly we can stream a response to the user.

Claude 3.5 Sonnet performance chart

Claude 3.5 Sonnet made a leap in performance as compared to Claude 3 Sonnet & Opus, resulting in a TTFT that’s competitive with the lowest-latency frontier models in our testing set. In real terms, we’re seeing approximately 67% lower TTFT with Claude 3.5 Sonnet as compared to Claude 3 Sonnet.

Our measurements of output TPS (tokens per second) show that Claude 3.5 Sonnet remains approximately equivalent to its predecessor, Claude 3 Sonnet.

Claude 3.5 Sonnet Tokens Per Second chart

Agentic reasoning

Agents are one of the hottest areas of development surrounding frontier models, and at Asana we’re hard at work giving our AI teammates the ability to undertake chains of tasks, both with our AI workflows and teammate chat features. Taking advantage of the power of AI to act as an agent, our AI workflows already allow us to triage incoming requests for sales, marketing, and R&D, prioritize personal work, and eliminate painful administrative tasks. As we see improved agentic reasoning from frontier large language models, we anticipate soon being able to automate more complex flows like project planning and resource allocation.

In order to test Claude 3.5 Sonnet’s capabilities as an agent, we dropped the model into our multi-agent prototyping platform and ran our tool use extractor benchmark on the available models. Quantitatively, our tool use benchmark showed huge improvements as compared to Claude 3 Sonnet, rising from a 76% to 90% success rate.
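While the benchmark itself is internal, the scoring is conceptually simple – something along these lines, where `selectToolCall` is a hypothetical function that asks the model to choose a tool and arguments for a given prompt:

```javascript
// Illustrative scoring loop: a case passes when the model picks the expected
// tool with the expected arguments. (JSON.stringify is a crude deep-equality
// check that's good enough for small, consistently ordered argument objects.)
async function scoreToolUse(cases, selectToolCall) {
  let passed = 0;
  for (const testCase of cases) {
    const call = await selectToolCall(testCase.prompt, testCase.availableTools);
    if (
      call.tool === testCase.expected.tool &&
      JSON.stringify(call.args) === JSON.stringify(testCase.expected.args)
    ) {
      passed++;
    }
  }
  return passed / cases.length; // e.g. 0.90 for a 90% success rate
}
```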

For qualitative testing, we updated our internal agent prototype framework to leverage Claude 3.5 Sonnet, using Claude 3 Opus (Anthropic’s previous most capable model) as the comparison. We presented the agent with objectives that we expected to take several reasoning and tool use steps to complete – many more than we expect from our typical AI teammates. When using Claude 3 Opus with our prototype agent, we’d sometimes find that the model would shortcut a complex workflow or even refuse to take on a task. It felt like Claude 3 Opus had room for growth in terms of agentic reasoning.

This changed when we tested Claude 3.5 Sonnet. Not only did we get results much more quickly than with Claude 3 Opus, but Claude 3.5 Sonnet also performed like it was a true agent, and followed our objectives through to completion. We see more effective decision-making, as well as the ability to follow longer chains of reasoning. Combined with the performance we’ve observed and the model’s cost profile, we’ve already switched our default AI workflows agent to the new model.

Answer quality and precision

We were especially curious to learn how Claude 3.5 Sonnet would perform on our trickiest writing test: the suite for our smart chat feature. This is the tool that Asana’s AI teammate chat uses to answer questions about information you have access to from across the Asana organization, and we’re planning to make it available to more product features in the coming months. It gives AI teammates the ability to provide assistance and insight at any level of your organization, from projects and tasks to company-wide goals. This often requires filling the models’ context windows, so it’s especially important that the models are able to attend to the most relevant information and extract key insights from long contexts.

While we often find that any model change requires at least some prompt engineering, we were pleasantly surprised at the quality of responses provided by Claude 3.5 Sonnet before we made any tweaks at all! Due to time constraints, we proceeded with evaluating against these unoptimized prompts – but we will of course optimize our prompts for Claude 3.5 Sonnet before putting them into production, as we would with any model change.

The first responses to come in were from our internal LLM unit testing frameworks. These focus on needle-in-a-haystack² tests as well as drawing correct conclusions based on the provided data. Claude 3.5 Sonnet scored the highest on this benchmark of any model we’ve ever tested, passing the tests 78% of the time on average, matching Claude 3 Opus. It handily beat out Claude 3 Sonnet, which scored 59%.

Claude 3.5 Sonnet LLM Unit Testing Framework Pass Rate chart

Before discussing qualitative results, we want to add a few caveats: our PM did this grading by hand, unblinded. And in most cases, Claude 3.5 Sonnet failed to include @-mention link references in its answers, which is disappointing but not surprising – we’ve typically had to do some prompt engineering to ensure new models link to references appropriately.

Each test in our suite represents a real-world question & answer using data from our own Asana, Inc. organization, and is assigned a letter grade from A to F, with A being the best. For each edit that we assess would need to be made for a human to publish the answer, we subtract one letter grade. Our PMs take into account accuracy and insight, writing style, formatting, and overall quality. Claude 3.5 Sonnet quite simply set a new standard for answer quality and reasoning on this test suite.
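As an aside, the rubric itself is easy to express in code – a hypothetical sketch, assuming the grader simply counts the edits a human would need to make:

```javascript
// Start at "A" and subtract one letter grade per edit needed before a human
// could publish the answer, bottoming out at "F".
function letterGrade(editsNeeded) {
  const grades = ["A", "B", "C", "D", "F"];
  return grades[Math.min(editsNeeded, grades.length - 1)];
}

// letterGrade(0) === "A", letterGrade(2) === "C", letterGrade(5) === "F"
```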

Claude 3.5 Sonnet excelled at clearly articulating insights about complex topics. One of the first examples we graded explained something that I’d personally been confused about just a few days before, so I was impressed (and grateful!) when 3.5 Sonnet was able to explain it to me correctly.

Where prior models were only able to call out risks that had been specifically labeled in the provided data with the word “risk,” the new model seemed to reason about what might actually be a risk to the work being discussed, resulting in more successful risk identification and reporting. And Claude 3.5 Sonnet identified key decisions in projects that no other model was able to find.

For those curious as to why Claude 3 Opus received a lower score than Claude 3 Sonnet: this prompt was written for a specific model, and we didn’t make any changes for testing. We found that Claude 3 Opus sometimes left out key facts in an effort to be concise, which resulted in a lower overall score on some questions.

Claude 3.5 Sonnet Smart Chat Qualitative Model Performance chart

Takeaways

Anthropic’s latest model, Claude 3.5 Sonnet, is a leap forward in the company’s offerings, pairing performance increases that put it on par with the best frontier models with enhanced reasoning and the best writing we’ve yet seen from a model. While we do have more QA to do before deploying Claude 3.5 Sonnet in our production features, we’re continuing to test and expect to roll it out to some features within the coming weeks.

The pace of model development means we’ll soon have new models to test from Anthropic and our other LLM vendors, and Asana is eager to test new models as they become available. Our investment in robust QA has given us the capability to evaluate frontier models within hours. Of course, results may vary depending on your techniques and use cases.

The frontier of large language models continues to evolve at a blistering pace, and at Asana we’re committed to using the latest and most powerful models to power our AI teammates. We’re enabling our customers to leverage the power of AI to build complex AI workflows, helping supercharge the way work gets done.

P.S. — If this work interests you, we currently have a number of openings in our AI org at Asana. Check out our careers page for more info.


¹ We’ve updated this performance data with production statistics since Claude 3.5 Sonnet’s public release.

 ² For us, this means picking out data in the right task in a large project or Asana organization.
