09 Mar 2026

The Largest Challenge in AI

"Bluejays are high-altitude birds that live in mountainous climates." - A voice agent not tested by Bluejay. Photo by Marek Piwnicki.

Back when Rohan and I were building voice agents for restaurants, manually testing every menu item our agent had to handle for a store was tedious and inconsistent. Each store had around 150 different menu combinations, reservation possibilities, and general FAQs to cover. Each call was ~3 minutes long, and checking whether the agent had processed our order correctly after the call ended took another 2 minutes.

That's 5 minutes per call, and 150 calls to make. A single E2E test run for our simple restaurant voice AI agent would have taken us 12.5 hours.
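For anyone who wants to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. The constants come straight from this post; the function name manual_run_hours is just an illustrative choice, and the sketch assumes calls are made one after another with no breaks.

```python
# Figures taken from the story above.
NUM_TEST_CALLS = 150   # menu combinations, reservations, and FAQs to cover
CALL_MINUTES = 3       # approximate length of each test call
VERIFY_MINUTES = 2     # time spent checking the order after the call


def manual_run_hours(num_calls, call_minutes, verify_minutes):
    """Hours needed for one sequential manual pass over every test case."""
    return num_calls * (call_minutes + verify_minutes) / 60


hours = manual_run_hours(NUM_TEST_CALLS, CALL_MINUTES, VERIFY_MINUTES)
print(f"One full manual run: {hours} hours")
```

Swapping in a larger case count or a longer call shows how quickly a manual regression cycle grows.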

Each time we changed our system prompt or model infrastructure, the only way to be sure the change had not caused regressions in unintended scenarios was to re-test every case.

This problem quickly got out of hand. How were we meant to iterate when catching regressions was a 12.5-hour cycle (and that assumes no breaks and no human error)? How was anyone supposed to solve this problem? This problem is the largest challenge in AI: repeatable, quality evals.

We built Bluejay to solve the largest problem in AI. Now, teams can iterate on large-scale conversational AI projects without spending hundreds of hours on manual QA. The best part?

With Bluejay, a solo developer has access to the same testing power as an enterprise.

Bluejay is intentionally democratizing testing. A QA engineer or conversational designer should not need to make thousands of manual phone calls to trust that their agent works.

We think that as voice AI agents become the preferred way for businesses to interact with customers, voice AI testing should receive an upgrade, too.

Announcements

Here's what happened at Bluejay last week:

Bluejay is hosting a Voice AI Infrastructure Meetup on March 25th in San Francisco. Our announcement will drop on all platforms this week.
Bluejay crossed the 80-billion-token threshold for building agentic evals!
Last week, the team pushed 39,756 lines of code to make Conversational AI reliable.

Coming Soon

Upcoming Features: We are releasing a redesigned Workflow builder! Soon, you will be able to design your customer-agent interaction as a diagram, and Bluejay will intelligently create Digital Humans that cover every flow. Also, Bluejay is undergoing a brand/logo redesign!

That's all for this week. I'll see you next time!

Faraz Siddiqi
Co-Founder & CTO @ Bluejay
