Study

Wednesday, April 24, 2024

SOP-Bench Study Preview [PRIVATE – Not published]

The Benchmark Study

SOP-Bench was created because there is a lack of tools for measuring how well AI handles workflows, also known as Standard Operating Procedures (SOPs). It aims to gather 1200 of the most commonly used SOPs across a range of white-collar industries and to establish a benchmarking procedure for accurately and fairly measuring the ability of LLM-powered platforms to automate and perform complex workflows.

The first SOP-Bench report, discussed in this brief blog post, evaluates major LLM-powered platforms including VISS.AI, AgentGPT, Claude 3 (Opus), Gemini (1.0 Pro), and ChatGPT (GPT-4). This benchmark study employed a novel SOP dataset designed to reflect the complex nature of real-world business operations across several industries.

The full research, including the methodology, analysis, discussion, and the subset of 438 SOPs used, is available here.

» Link to study

Standout Performer: VISS.AI

Our findings show that VISS.AI outshines its counterparts in managing complex workflows that involve multiple systems, as well as SOPs with a larger number of steps. The platform demonstrated notable superiority by directly interacting with diverse external systems, making it a formidable tool for business process automation.

Our study reveals that VISS.AI is approximately three times more effective at handling complex SOPs than GPT-4, and the gap is even larger for the other LLM-powered platforms. The analysis section of the study dives deeper into the capabilities, limitations, and differences between the compared platforms, and a tech breakdown of VISS.AI has been published here.

Sources and Evaluation

The study assesses how these platforms handle an array of SOPs drawn from the newly created SOP dataset, which spans various sectors including finance, healthcare, HR, marketing, and customer service, among others. The SOPs were collected and created from six different information sources: interviews, observations, job descriptions, surveys, scraped secondary data, and official SOP documents from companies.

Results and Insights

The results indicate that selecting the right platform can lead to significant improvements in operational efficiency, cost reduction, and even revenue generation through enhanced customer interactions.

  • VISS.AI: Excelled in tasks requiring complex, multi-step processing and interaction with multiple external systems.

  • ChatGPT and Claude 3: Showed strengths in creative problem-solving and sophisticated language tasks but fell behind in SOPs that required direct system interactions.

  • ChatGPT and Gemini Pro: Were the only platforms able to solve any SOPs that included images or other media.

  • AgentGPT: Was noted for its decision-making transparency, helpful in scenarios requiring clear audit trails.

Recommendations for Future Research

Future studies should expand and diversify the SOP dataset to improve coverage of underrepresented industries and to accommodate the non-deterministic nature of Large Language Models (LLMs). The objective would be to increase the dataset size from 438 to at least 1200 SOPs, broadening the scope to include varied cultural and geographic business practices, thereby improving the global applicability of LLM benchmarks.

Furthermore, it is imperative to refine the metrics and evaluation criteria used to assess LLM performance to provide a more accurate measure of their reliability and effectiveness. As LLMs gain prevalence in business operations, establishing robust ethical frameworks and compliance models is crucial to ensure their responsible deployment in line with industry standards and legal requirements.

These advancements will significantly enhance our understanding of LLM capabilities and ensure their ethical integration into diverse operational environments.
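
As one illustration of how the non-deterministic behavior mentioned above could be folded into a metric, here is a minimal Python sketch that runs each SOP several times and averages the per-SOP pass rate. This is not the scoring used in the study; the function names, SOP identifiers, and trial outcomes are all hypothetical and serve only to show the idea.

```python
from statistics import mean


def pass_rate(trials):
    """Fraction of repeated trials in which the SOP was completed.

    `trials` is a non-empty list of booleans, one per run of the same SOP.
    """
    if not trials:
        raise ValueError("each SOP needs at least one trial")
    return sum(trials) / len(trials)


def platform_score(results):
    """Mean per-SOP pass rate for one platform.

    `results` maps a (hypothetical) SOP id to its list of trial outcomes.
    Averaging per SOP first keeps an SOP with many trials from dominating.
    """
    return mean(pass_rate(trials) for trials in results.values())


# Hypothetical outcomes for illustration only, not data from the study.
results = {
    "sop-001": [True, True, False],
    "sop-002": [False, False, False],
    "sop-003": [True, True, True],
}

score = platform_score(results)
print(f"platform score: {score:.3f}")
```

Reporting a pass rate over several runs, rather than a single pass or fail, gives a steadier picture of a platform whose output varies from run to run.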

Conclusion

The adoption of LLM-powered platforms in business operations is about transforming SOPs into more efficient, cost-effective processes. VISS.AI, with its robust performance in our benchmark study, represents a leap towards realizing this potential, promising significant impacts on productivity and operational efficiency across industries.

The long-term impact of this research extends beyond the immediate improvements it suggests for LLM-powered platforms. As LLM technology continues to evolve, its potential to revolutionize business process automation grows, with substantial effects on productivity, efficiency, and decision-making within organizations. This vision, supported by ongoing research and innovation, promises to redefine the boundaries of what AI can achieve in the business world and beyond.

Businesses looking to stay ahead in the digital era would do well to consider how best to leverage these powerful tools. For those eager to delve deeper into the specifics of our findings or explore the dataset used, please refer to the resources linked below for the complete SOP dataset, and a more detailed breakdown of the methodologies and results.

» Dataset is available here

» Report is available here

» Tech breakdown of VISS.AI


© 2024 VISSAI AB. All rights reserved.