Sunday, July 02, 2023

AI Benchmarks

 AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence

Ben Dickson @BenDee983 in VentureBeat

June 13, 2023 6:00 AM

In recent years, artificial intelligence (AI) has made remarkable progress in performing complex tasks that were once considered the domain of human intelligence. From passing the bar exam and acing the SAT to mastering language proficiency and diagnosing medical images, AI systems such as GPT-4 and PaLM 2 have surpassed human performance on various benchmarks.

Benchmarks are essentially standardized tests that measure the performance of AI systems on specific tasks and goals. They’re widely used by researchers and developers to compare and improve different models and algorithms; however, a new paper published in Science challenges the validity and usefulness of many existing benchmarks for evaluating AI systems.

The paper argues that benchmarks often fail to capture the real capabilities and limitations of AI systems, and can lead to false or misleading conclusions about their safety and reliability. For example, benchmarks may not account for how AI systems handle uncertainty, ambiguity, or adversarial inputs. They may also not reflect how AI systems interact with humans or other systems in complex and dynamic environments.

This poses a major challenge when making informed decisions about where these systems are safe to use. And given the growing pressure on enterprises to use advanced AI systems in their products, the community needs to rethink its approach to evaluating new models.

The problem with aggregate metrics

To develop AI systems that are safe and fair, researchers and developers must make sure they understand what a system is capable of and where it fails.

“To build that understanding, we need a research culture that is serious about both robustness and transparency,” Ryan Burnell, AI researcher at the University of Cambridge and lead author of the paper, told VentureBeat. “But we think the research culture has been lacking on both fronts at the moment.”

One of the key problems that Burnell and his co-authors point out is the use of aggregate metrics that summarize an AI system’s overall performance on a category of tasks such as math, reasoning or image classification. Aggregate metrics are convenient because of their simplicity, but that convenience comes at the cost of transparency: they obscure the nuances of the system’s performance on critical tasks.

“If you have data from dozens of tasks and maybe thousands of individual instances of each task, it’s not always easy to interpret and communicate those data. Aggregate metrics allow you to communicate the results in a simple, intuitive way that readers, reviewers or — as we’re seeing now — customers can quickly understand,” Burnell said. “The problem is that this simplification can hide really important patterns in the data that could indicate potential biases, safety concerns, or just help us learn more about how the system works, because we can’t tell where a system is failing.”
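To make that pattern concrete, here is a minimal Python sketch of the kind of disaggregated evaluation Burnell is describing. The group names, labels and predictions are purely hypothetical, but they show how a respectable aggregate accuracy can hide a subgroup on which the system fails half the time.

```python
# Aggregate vs. disaggregated evaluation -- illustrative sketch only.
# In practice the rows would come from a model's predictions plus a
# metadata column (e.g., task type or demographic group).
from collections import defaultdict

records = [
    # (group, true_label, predicted_label) -- hypothetical values
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 1, 1),
]

def accuracy(rows):
    return sum(y == yhat for _, y, yhat in rows) / len(rows)

# Aggregate metric: looks acceptable on its own (0.75 here).
print(f"overall accuracy: {accuracy(records):.2f}")

# Disaggregated view: the same predictions, broken out per group,
# reveal that one subgroup is served far worse than the other.
by_group = defaultdict(list)
for row in records:
    by_group[row[0]].append(row)
for group, rows in sorted(by_group.items()):
    print(f"{group} accuracy: {accuracy(rows):.2f}")
```

The aggregate number alone would pass review; only the per-group breakdown exposes where the system is failing.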

There are many ways aggregate benchmarks can go wrong. For example, a model might have acceptable overall performance on an aggregate benchmark but perform poorly on a subset of tasks. A study of commercial facial recognition systems found that models with very high overall accuracy performed poorly on darker-skinned faces. In other cases, the model might learn the wrong patterns, such as detecting objects based on their backgrounds, watermarks or other artifacts that are not related to the main task. Large language models (LLMs) can make things even more complicated.  ...
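The second failure mode, a model latching onto an artifact rather than the real signal, can also be sketched in a few lines. The data, feature layout and numbers below are invented for illustration (they are not from the paper or the facial recognition study): a spurious "artifact" feature tracks the label almost perfectly on the benchmark, so the model leans on it and scores well, then degrades once that correlation is broken.

```python
# Shortcut learning sketch: a classifier that scores well on a benchmark
# because a spurious artifact correlates with the label, then fails when
# the artifact stops being informative. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

def make_split(p_artifact_matches_label):
    """Column 0: weak 'real' signal; column 1: spurious artifact."""
    y = rng.integers(0, 2, n)
    real = y + rng.normal(0.0, 2.0, n)                     # noisy true feature
    matches = rng.random(n) < p_artifact_matches_label      # how often artifact tracks label
    artifact = np.where(matches, y, 1 - y) + rng.normal(0.0, 0.1, n)
    return np.column_stack([real, artifact]), y

X_train, y_train = make_split(0.95)    # benchmark: artifact almost always matches the label
X_iid, y_iid = make_split(0.95)        # held-out test drawn the same way
X_shift, y_shift = make_split(0.50)    # deployment-like data: artifact is uninformative

model = LogisticRegression().fit(X_train, y_train)
print("benchmark-style test accuracy:", round(model.score(X_iid, y_iid), 2))
print("accuracy with the shortcut removed:", round(model.score(X_shift, y_shift), 2))
```

On the benchmark-style split the model looks strong; on the shifted split its accuracy collapses toward chance, which is exactly the gap an aggregate leaderboard number cannot reveal.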

