
Businesses are losing money every day because of issues including manual, repetitive tasks, siloed knowledge, slow responses, and an inability to keep up with a competitive market. Thankfully, artificial intelligence tools — particularly large language models (LLMs) and other forms of generative AI and machine learning technology — have emerged that allow businesses to boost employee productivity and supercharge their growth. It is these enterprise artificial intelligence solutions that promise to bring about a paradigm shift in the world of business.
As the founder and CEO of Krazimo, a company that specializes in providing and implementing enterprise AI solutions, technological maverick Akhil Verghese has seen firsthand how this technology can be a transformative tool for businesses. However, success in AI strategy is not simply about implementing new technologies willy-nilly; it’s about deploying these applications creatively to alleviate pain points that are costing your business time and money.
How to effectively evaluate enterprise AI solutions
That creative deployment is where effective evaluation of LLMs becomes critical. Businesses must evaluate an AI initiative before, during, and after a task’s completion to ensure that they are leveraging AI to the benefit of their organization.
“My approach to LLM evaluation varies a lot based on the problem being solved and the nature of the validation process,” says Verghese. “In general, the approach matches how you'd evaluate a human doing a similar task.”
The first approach Verghese suggests for evaluating LLMs applies to tasks he deems “easy to evaluate,” using examples such as categorizing customer service queries or composing basic follow-up emails. For these tasks, he suggests measuring performance by the number of edits required by the human validator per output.
“It's usually fairly easy to measure edit time versus generation time to determine if the LLM is adding real ROI,” Verghese adds.
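As a rough illustration of that comparison, the sketch below totals generation time plus edit time and sets it against an estimate of how long a person would need to do the same work unaided. The `ReviewedOutput` record, its field names, and the sample timings are all hypothetical, and the manual baseline is an assumption added here to turn the comparison into a single number; it is not a method described by Verghese.

```python
from dataclasses import dataclass


@dataclass
class ReviewedOutput:
    """One LLM output that a human validator has checked (hypothetical record)."""
    manual_seconds: float      # assumed time for a person to do the task unaided
    generation_seconds: float  # time spent waiting for the LLM output
    edit_seconds: float        # time the validator spent correcting the output


def time_saved_fraction(outputs):
    """Return the share of manual time saved by the generate-then-edit workflow."""
    manual = sum(o.manual_seconds for o in outputs)
    if manual <= 0:
        raise ValueError("manual baseline must be positive")
    assisted = sum(o.generation_seconds + o.edit_seconds for o in outputs)
    return (manual - assisted) / manual


# Made-up timings for three follow-up emails, in seconds.
outputs = [
    ReviewedOutput(manual_seconds=300, generation_seconds=5, edit_seconds=60),
    ReviewedOutput(manual_seconds=300, generation_seconds=5, edit_seconds=200),
    ReviewedOutput(manual_seconds=300, generation_seconds=5, edit_seconds=20),
]
saved = time_saved_fraction(outputs)
print(f"Share of manual time saved: {saved:.0%}")
```

A result near zero, or a negative one, would suggest that editing takes about as long as doing the task by hand, which is the signal that the LLM is not adding real ROI.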
However, this approach is only effective for this specific subset of AI applications. Other tasks, such as validating the performance of an AI sales associate on a customer call, take as long to validate as they do to perform. And for tasks that are more subjective, this method of evaluation could prove challenging.
For these harder cases, Verghese offers two possible approaches to evaluating the effectiveness of an LLM:
1. Sampling: When the volume of outputs is too large to review in full, Verghese suggests using sampling as a method of evaluation. “Evaluating a statistically significant sample set of all calls,” he says, “can save the user time while still providing the level of insight needed to understand the effectiveness of a given enterprise AI technology.”
2. Defining a new metric: Other times, evaluation can be as simple as defining a new metric that reflects the performance of a model. “In the case of sales calls, this could simply be the closure rate,” Verghese explains.
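The two approaches can be combined. The sketch below reviews a random sample of calls and estimates the closure rate with a 95% confidence interval. It is only one possible reading of "statistically significant": the normal-approximation interval, the sample size of 400, the function name, and the synthetic call data are all assumptions made for illustration.

```python
import math
import random


def sample_closure_rate(call_outcomes, sample_size, seed=0):
    """Estimate the closure rate from a random sample of calls.

    call_outcomes is a list of booleans (True means the call closed).
    Returns (rate, low, high), where low and high bound an approximate
    95% confidence interval based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(call_outcomes, sample_size)
    rate = sum(sample) / sample_size
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, max(0.0, rate - margin), min(1.0, rate + margin)


# Synthetic data: 10,000 calls, of which 500 closed.
calls = [True] * 500 + [False] * 9500
rate, low, high = sample_closure_rate(calls, sample_size=400)
print(f"Estimated closure rate: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

A wider interval means the sample is too small to support a decision; increasing `sample_size` narrows it at the cost of more review time.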
How to effectively implement enterprise AI platforms into your AI strategy
Verghese adds that key performance indicators are not the only metrics to consider when evaluating enterprise AI platforms. Chief among the others is ease of scaling. As a business’s use cases for its AI systems expand, the cost of training, deploying, and maintaining these models rises in tandem. These costs are usually small compared with hiring new people, but it is still crucial to weigh the benefits against the investment.
“Even if AI doesn’t perform better than a human at a task like sales calls, they may still be the better approach for low success rate operations like outbound calls because you can scale them up or down, and their pricing is trending downwards. For example, let's assume a human sales agent and an AI agent have closure rates of 60% and 50% respectively on inbound calls, and 6% and 3% respectively on outbound calls,” Verghese proposes. “It might seem like the gap is narrower for inbound sales, but if there are currently no inbound sales calls that are dropped due to a lack of human resources, it might not make sense to use AI agents for inbound calls and suffer the lower closure rate. Meanwhile, if the company can only call 20% of potential outbound leads, the AI agent may close 2.5 times the number of clients, despite the fact that its closure rate is 50% lower.”
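The outbound arithmetic in that example can be checked with a few lines. The closure rates and the 20% reach figure come from the quote; the lead count of 10,000 and the assumption that AI agents can reach every lead are added here for illustration.

```python
def expected_closes(leads, reach_fraction, closure_rate):
    """Expected closed deals given how many leads are reached and the closure rate."""
    return leads * reach_fraction * closure_rate


leads = 10_000  # hypothetical pool of outbound leads

# Human team: reaches only 20% of leads, closes 6% of the calls it makes.
human_closes = expected_closes(leads, reach_fraction=0.20, closure_rate=0.06)

# AI agents: assumed to reach every lead, closing 3% of calls.
ai_closes = expected_closes(leads, reach_fraction=1.00, closure_rate=0.03)

ratio = ai_closes / human_closes
print(f"Human closes: {human_closes:.0f}, AI closes: {ai_closes:.0f}, ratio: {ratio:.1f}")
```

The ratio does not depend on the size of the lead pool: it is simply (1.00 × 0.03) / (0.20 × 0.06), which works out to 2.5.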
The other key consideration in evaluating the effectiveness of AI technologies is the need for ongoing evaluation. Because AI technology keeps evolving and models need continuous training to keep their outputs relevant and timely, it is important to adopt the “human in the loop” approach suggested by Verghese and maintain the oversight necessary for success.
“Continue to look at the same north star metrics identified as key performance indicators,” Verghese suggests. “It can also be helpful to use standard rollout techniques like A/B testing, backed by the same indicators, to deploy new versions of the LLMs.”
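One common way to read the results of such a test is a two-proportion z-test on the closure rates of the current and candidate versions. The sketch below is an assumption about how that comparison might be run, not a procedure Verghese describes; the function name and the call counts are invented.

```python
import math


def two_proportion_z_test(closes_a, calls_a, closes_b, calls_b):
    """Compare closure rates of version A (current) and version B (candidate).

    Returns (z, p_value) for a two-sided test using the pooled standard error.
    """
    rate_a = closes_a / calls_a
    rate_b = closes_b / calls_b
    pooled = (closes_a + closes_b) / (calls_a + calls_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / calls_a + 1 / calls_b))
    if std_err == 0:
        raise ValueError("closure rates of 0% or 100% in both arms cannot be tested")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented results: each version handled 1,000 outbound calls.
z, p_value = two_proportion_z_test(closes_a=30, calls_a=1000, closes_b=45, calls_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference in closure rate is unlikely to be noise. Whether that justifies rolling out the new version still depends on the cost and scaling considerations discussed above.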
AI tools, including LLMs, have shown tremendous potential and success in business use cases, but to reap the benefits of AI, businesses must effectively and continuously evaluate their LLMs. By adopting an evaluation method specifically designed to measure the ROI of the particular use case, leaders can understand the real impacts that artificial intelligence is having on their business operations.