Showing posts with label Metrics. Show all posts

Wednesday, January 25, 2023

Ranking Models: NDCG: in Towards Data Science

Used this metric in practice ... good to see this overview of its use.

Demystifying NDCG        in Towards Data Science by Aparna Dhinakaran   

How to best use this important metric for monitoring ranking models

Ranking models underpin many aspects of modern digital life, from search results to music recommendations. Anyone who has built a recommendation system understands the many challenges that come from developing and evaluating ranking models to serve their customers.

While these challenges start in data preparation and model training and continue through model development and model deployment, often what tends to give data scientists and machine learning engineers the most trouble is maintaining their ranking models in production. It is notoriously difficult to maintain models in production because of how these models are constantly changing as they adapt to dynamic environments.

In order to break down how to monitor normalized discounted cumulative gain (NDCG) for ranking models in production, this post covers:

What is NDCG and where is it used?

The intuition behind NDCG

What is NDCG@K?

How does NDCG compare to other metrics?

How is NDCG used in model monitoring?

After tackling these main questions, your team will be able to achieve real-time monitoring and root cause analysis using NDCG for ranking models in production.

What Is NDCG and Where Is It Used?

Normalized discounted cumulative gain is a measure of ranking quality. ML teams often use NDCG to evaluate the performance of a search engine, recommendation, or other information retrieval system. Search engines are popular for companies that have applications which directly interact with customers, like Alphabet, Amazon, Etsy, Netflix, and Spotify — just to name a few.

The value of NDCG is determined by comparing the relevance of the items returned by the search engine to the relevance of the item that a hypothetical “ideal” search engine would return. For example, if you search “Hero” on a popular music streaming app, you might get 10+ results with the word “Hero” in either the song, artist, or album.

The relevance of each song or artist is represented by a score (also known as a “grade”) that is assigned to the search query. The scores of these recommendations are then discounted based on their position in the search results — did they get recommended first or last? The discounted scores are then cumulated and divided by the maximum possible discounted score, which is the discounted score that would be obtained if the search engine returned the documents in the order of their true relevance.

If a user wants the song “My Hero” by Foo Fighters, for example, the closer that song is to the top of the recommendations, the better the search experience will be for that user. Ultimately, the relative order of returned results or recommendations is important for customer satisfaction. ... '   (more at the link)
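The computation described above, including the NDCG@K variant, can be sketched in a few lines of Python. The graded-relevance scores below are made up for illustration; real systems assign grades per query.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each grade is discounted by the
    log2 of its (1-indexed) rank, so later positions count for less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@K: DCG of the returned ordering divided by the DCG of the
    ideal ordering (grades sorted from most to least relevant)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the results in the order the model returned them:
returned = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(returned, 6), 4))  # ≈ 0.9608
```

A perfectly ordered list scores exactly 1.0, which is what makes NDCG comparable across queries with different numbers and grades of relevant items.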

Wednesday, January 06, 2021

Monitoring in-production ML models

Useful and detailed look at real-world problems with SageMaker Model Monitor. Good graphical views. Somewhat technical, but practically so.

Monitoring in-production ML models at large scale using Amazon SageMaker Model Monitor  | AWS Machine Learning Blog

by Sireesha Muppala, Archana Padmasenan, and David Nigenda | on 17 DEC 2020 | in Amazon SageMaker, Artificial Intelligence  

Machine learning (ML) models are impacting business decisions of organizations around the globe, from retail and financial services to autonomous vehicles and space exploration. For these organizations, training and deploying ML models into production is only one step towards achieving business goals. Model performance may degrade over time for several reasons, such as changing consumer purchase patterns in the retail industry and changing economic conditions in the financial industry. Degrading model quality has a negative impact on business outcomes. To proactively address this problem, monitoring the performance of a deployed model is a critical process. Continuous monitoring of production models allows you to identify the right time and frequency to retrain and update the model. Although retraining too frequently can be too expensive, not retraining enough could result in less-than-optimal predictions from your model.

Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at any scale. After you train an ML model, you can deploy it on SageMaker endpoints that are fully managed and can serve inferences in real time with low latency. After you deploy your model, you can use Amazon SageMaker Model Monitor to continuously monitor the quality of your ML model in real time. You can also configure alerts to notify and trigger actions if any drift in model performance is observed. Early and proactive detection of these deviations enables you to take corrective actions, such as collecting new ground truth training data, retraining models, and auditing upstream systems, without having to manually monitor models or build additional tooling.

In this post, we discuss monitoring the quality of a classification model through classification metrics like accuracy, precision, and more.  ... " 
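The kind of statistical drift check that Model Monitor automates can be sketched in plain Python. This is an illustrative simplification, not the SageMaker API; the feature names and the z-score threshold are made up.

```python
import statistics

def drift_alerts(baseline, live, threshold=3.0):
    """Flag features whose live mean deviates from the baseline mean by
    more than `threshold` baseline standard deviations (a z-score check)."""
    alerts = []
    for feature, base_values in baseline.items():
        mu = statistics.mean(base_values)
        sigma = statistics.stdev(base_values) or 1e-9  # guard zero variance
        z = abs(statistics.mean(live[feature]) - mu) / sigma
        if z > threshold:
            alerts.append((feature, round(z, 2)))
    return alerts

# Hypothetical retail features: training-time baseline vs. live traffic.
baseline = {"basket_size": [2, 3, 3, 4, 2, 3], "price": [10, 12, 11, 9, 10, 11]}
live     = {"basket_size": [9, 10, 8, 11, 9, 10], "price": [10, 11, 12, 10, 9, 11]}
print(drift_alerts(baseline, live))  # → [('basket_size', 8.86)]
```

A production system would run such checks on a schedule against captured inference traffic and raise an alarm, which is essentially what the managed service does with richer statistics.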

Tuesday, September 15, 2020

Inconsistent Benchmarking Found

Important finding. Further classification of the forms of inconsistency would also be useful for pre-checking new papers later.

Researchers find ‘inconsistent’ benchmarking across 3,867 AI research papers   By Kyle Wiggers in VentureBeat

The metrics used to benchmark AI and machine learning models often inadequately reflect those models’ true performances. That’s according to a preprint study from researchers at the Institute for Artificial Intelligence and Decision Support in Vienna, which analyzed data in over 3,000 model performance results from the open source web-based platform Papers with Code. They claim that alternative, more appropriate metrics are rarely used in benchmarking and that the reporting of metrics is inconsistent and unspecific, leading to ambiguities.

Benchmarking is an important driver of progress in AI research. A task (or tasks) and the metrics associated with it (or them) can be perceived as an abstraction of a problem the scientific community aims to solve. Benchmark data sets are conceptualized as fixed representative samples for tasks to be solved by a model. But while benchmarks covering a range of tasks including machine translation, object detection, or question-answering have been established, the coauthors of the paper claim some — like accuracy (i.e., the ratio of correctly predicted samples to the total number of samples) — emphasize certain aspects of performance at the expense of others. ... "
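The accuracy pitfall the paper highlights is easy to reproduce: on an imbalanced dataset, a model that always predicts the majority class scores well on accuracy while being useless on the minority class. A minimal sketch, with a made-up 95/5 class split:

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly predicted samples to total samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives the model caught."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return tp / actual_pos if actual_pos else 0.0

# 95 negatives, 5 positives; the model predicts "negative" for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(accuracy(y_true, y_pred))  # 0.95 -- looks strong
print(recall(y_true, y_pred))    # 0.0  -- misses every positive case
```

Reporting only the first number is exactly the kind of emphasis on one aspect of performance, at the expense of others, that the study flags.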

Tuesday, March 03, 2020

Google Fairness Gym

A considerable effort, reported on here, to experiment with the broad idea of fairness in machine learning via the notion of a 'gym' for exercising choices and results with varying data. The article below has quite a bit of detail on what this is trying to be.

ML-fairness-gym: A Tool for Exploring Long-Term Impacts of Machine Learning Systems
Wednesday, February 5, 2020
Posted by Hansa Srinivasan, Software Engineer, Google Research

Machine learning systems have been increasingly deployed to aid in high-impact decision-making, such as determining criminal sentencing, child welfare assessments, who receives medical attention and many other settings. Understanding whether such systems are fair is crucial, and requires an understanding of models’ short- and long-term effects. Common methods for assessing the fairness of machine learning systems involve evaluating disparities in error metrics on static datasets for various inputs to the system. Indeed, many existing ML fairness toolkits (e.g., AIF360, fairlearn, fairness-indicators, fairness-comparison) provide tools for performing such error-metric based analysis on existing datasets. While this sort of analysis may work for systems in simple environments, there are cases (e.g., systems with active data collection or significant feedback loops) where the context in which the algorithm operates is critical for understanding its impact. In these cases, the fairness of algorithmic decisions ideally would be analyzed with greater consideration for the environmental and temporal context than error metric-based techniques allow. ....  " 
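One of the feedback effects described above can be shown with a toy simulation (entirely hypothetical, not taken from the ML-fairness-gym codebase): a policy that only observes outcomes for the cases it accepts produces a biased estimate of the very metric it tracks, which static error-metric analysis on a fixed dataset would never reveal.

```python
import random

random.seed(0)

def simulate(rounds=20, threshold=0.5):
    """A policy accepts applicants whose score exceeds `threshold`, then
    re-estimates the population success rate from accepted applicants only.
    Because rejected applicants are never observed, the estimate is biased."""
    estimates = []
    for _ in range(rounds):
        applicants = [random.random() for _ in range(1000)]  # true success prob.
        accepted = [a for a in applicants if a > threshold]
        estimates.append(sum(accepted) / len(accepted))  # only accepted observed
    true_rate = 0.5  # uniform scores: the population mean success prob is 0.5
    return sum(estimates) / len(estimates), true_rate

observed, true = simulate()
print(round(observed, 2), true)  # observed estimate ~0.75, well above 0.5
```

This is the kind of active-data-collection loop for which, as the post argues, simulation gives a truer picture of long-term impact than error metrics on a static dataset.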

Thursday, October 31, 2019

Numbers Being Transformations

Success is sometimes claimed, sometimes not achieved; rarely are the actual numbers discussed. Metrics are important.

The numbers behind successful transformations   By Kevin Laczkowski, Tao Tan, and Matthias Winter

“What gets measured,” Peter Drucker famously observed, “gets managed.” One might add a corollary that what goes unmeasured—or gets measured only superficially—risks being mismanaged or, at least, undermanaged.

So it is with transformations. As we’ve noted before, the term “transformation” can be vague, and it too often refers only to minor or isolated initiatives. What should define a transformation is in fact the opposite: an intense, well-managed, organization-wide program to enhance performance and to boost organizational health. And the results should always be measured.

Transformatics: Inside the metrics of transformation

As part of an analysis we term “transformatics,” we’ve built the capability to measure the data set we’ve assembled of more than 200 large transformations stretching back nearly a decade. More recently, we isolated the 82 public companies that had undertaken a full-scale transformation and had an observable 18-month transformation track record to see what we could learn from a statistical analysis of their experiences. The research highlighted four indicators that showed a statistically significant correlation with top-quartile financial performance during the 18-month test period (for more about the methodology, see sidebar “Transformatics: Inside the metrics of transformation”). Taken together, the four indicators suggest some potential lessons for senior managers seeking to maximize the odds of a successful transformation. Let’s look at each in turn.  ... " 

Friday, September 27, 2019

Measures for AI

Essential to get these right; sometimes quite simple, often not. How do they link to goals?

The problem with metrics is a big problem for AI in Fast.AI

Written: 24 Sep 2019 by Rachel Thomas

Goodhart’s Law states that “When a measure becomes a target, it ceases to be a good measure.” At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so.

This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok (such as Google’s algorithm contributing to radicalizing people into white supremacy, teachers being fired by an algorithm, or essay grading software that rewards sophisticated garbage) all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI. ... "