Showing posts with label Anomalies.

Tuesday, February 28, 2023

Anomaly Detection, Supervised or Unsupervised: A Space Where We Worked with Key Data

This kind of data is everywhere.  

Unsupervised and semi-supervised anomaly detection with data-centric ML

February 08, 2023, in the Google AI Blog

Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Google Research, Cloud AI Team

Anomaly detection (AD), the task of distinguishing anomalies from normal data, plays a vital role in many real-world applications, such as detecting faulty products from vision sensors in manufacturing, fraudulent behaviors in financial transactions, or network security threats. Depending on the availability of the type of data — negative (normal) vs. positive (anomalous) and the availability of their labels — the task of AD involves different challenges.

Figure: (a) fully supervised anomaly detection; (b) normal-only anomaly detection; (c, d, e) semi-supervised anomaly detection; (f) unsupervised anomaly detection.

While most previous works were shown to be effective for cases with fully-labeled data (either (a) or (b) in the above figure), such settings are less common in practice because labels are particularly tedious to obtain. In most scenarios users have a limited labeling budget, and sometimes there aren’t even any labeled samples during training. Furthermore, even when labeled data are available, there could be biases in the way samples are labeled, causing distribution differences. Such real-world data challenges limit the achievable accuracy of prior methods in detecting anomalies.

This post covers two of our recent papers on AD, published in Transactions on Machine Learning Research (TMLR), that address the above challenges in unsupervised and semi-supervised settings. Using data-centric approaches, we show state-of-the-art results in both. In “Self-supervised, Refine, Repeat: Improving Unsupervised Anomaly Detection”, we propose a novel unsupervised AD framework that relies on the principles of self-supervised learning without labels and iterative data refinement based on the agreement of one-class classifier (OCC) outputs. In “SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch”, we propose a novel semi-supervised AD framework that yields robust performance even under distribution mismatch with limited labeled samples. ... "
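
The papers' full training recipes are not reproduced in the post, but the core "refine, repeat" loop of the first paper can be sketched with off-the-shelf one-class classifiers: fit an ensemble on the unlabeled data, keep only the samples the ensemble agrees look normal, and refit. A minimal sketch, with scikit-learn's IsolationForest standing in for the paper's self-supervised OCCs and all names illustrative:

```python
# Minimal sketch of an agreement-based "refine, repeat" loop.
# IsolationForest stands in for the paper's self-supervised OCCs.
import numpy as np
from sklearn.ensemble import IsolationForest

def refine_by_agreement(X, n_members=5, keep_ratio=0.9, n_rounds=3, seed=0):
    """Iteratively drop the samples an OCC ensemble agrees are anomalous."""
    rng = np.random.default_rng(seed)
    X_clean = X
    for _ in range(n_rounds):
        # Fit an ensemble of one-class models on the current training set.
        members = [
            IsolationForest(random_state=int(rng.integers(1 << 31))).fit(X_clean)
            for _ in range(n_members)
        ]
        # Higher score_samples = more normal; average across the ensemble.
        scores = np.mean([m.score_samples(X_clean) for m in members], axis=0)
        # Keep the keep_ratio fraction the ensemble agrees looks most normal.
        X_clean = X_clean[scores >= np.quantile(scores, 1.0 - keep_ratio)]
    # Final detector, fit on the refined (mostly normal) data.
    return IsolationForest(random_state=seed).fit(X_clean)
```

The agreement-based filtering is the data-centric step: each round the training set gets cleaner, so the final detector is fit on mostly normal data.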

Sunday, May 02, 2021

Simple Examples of Anomaly/Outlier Detection

A common need, nicely and simply put here in KDnuggets. Very classic, and something that should be done with almost every dataset you are seriously working with. More technical detail and code at the link:

Four Techniques for Outlier Detection

Tags: DBSCAN, Knime, Outliers, Python

There are many techniques to detect and optionally remove outliers from a dataset. In this blog post, we show an implementation in KNIME Analytics Platform of four of the most frequently used - traditional and novel - techniques for outlier detection.   By Maarit Widmann, Moritz Heine, Rosaria Silipo, Data Scientists at KNIME

Anomalies, or outliers, can be a serious issue when training machine learning algorithms or applying statistical techniques. They are often the result of errors in measurements or exceptional system conditions and therefore do not describe the common functioning of the underlying system. Indeed, the best practice is to implement an outlier removal phase before proceeding with further analysis.

But hold on there! In some cases, outliers can give us information about localized anomalies in the whole system; so the detection of outliers is a valuable process because of the additional information they can provide about your dataset. ... "

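Two frequently used techniques of this kind, the IQR rule ("numeric outlier") and DBSCAN (one of the post's tags), translate to a few lines of Python. A sketch on invented data:

```python
# Two classic outlier checks: the IQR rule and DBSCAN noise points.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def iqr_outliers(x, k=1.5):
    """Numeric outlier rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def dbscan_outliers(X, eps=0.5, min_samples=5):
    """DBSCAN assigns label -1 to points in no cluster: those are the outliers."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        StandardScaler().fit_transform(X))
    return labels == -1

readings = np.array([10.0, 11.2, 9.8, 10.5, 42.0, 10.1])
print(iqr_outliers(readings))  # only the 42.0 reading is flagged
```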

Thursday, April 15, 2021

Detecting Anomalies with AI

A classic aspect of using AI in smart manufacturing. We had interacted with Fujitsu in the past.

Fujitsu develops AI to detect product abnormalities during manufacturing

By Ryan Daws | March 29, 2021 | TechForge Media (Manufacturing)

Fujitsu has developed an AI which can highlight abnormalities in the appearance of products to help detect issues earlier.

Catching problems during production enables intervention before materials are wasted—incurring direct and environmental costs. It also saves on the reputational damage and costs associated with returns/recalls after a defective product is shipped to customers.

The solution uses an AI model trained on images of products with abnormalities. These defects are simulated so images of actual products with issues pulled from a production line aren’t necessary. ... " 
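
The article does not describe how Fujitsu simulates the defects. Purely as an illustration of the general idea, one common approach is to paste perturbed patches onto images of normal products, in the spirit of CutPaste-style augmentation; this is not Fujitsu's method:

```python
# Illustrative only: one generic way to simulate surface defects by
# pasting a darkened patch onto a normal product image. This is NOT
# Fujitsu's method, which the article does not detail.
import numpy as np

def paste_defect(image, rng, patch_size=16):
    """Copy a random patch, darken it, and paste it elsewhere on the image."""
    h, w = image.shape[:2]
    ys, xs = rng.integers(0, h - patch_size), rng.integers(0, w - patch_size)
    yd, xd = rng.integers(0, h - patch_size), rng.integers(0, w - patch_size)
    patch = image[ys:ys + patch_size, xs:xs + patch_size]
    defective = image.copy()
    # Darken the pasted patch so it reads like a scratch or stain.
    defective[yd:yd + patch_size, xd:xd + patch_size] = (patch * 0.5).astype(image.dtype)
    return defective

rng = np.random.default_rng(0)
normal = np.full((64, 64), 200, dtype=np.uint8)   # stand-in "good product" image
synthetic_defect = paste_defect(normal, rng)      # labeled anomalous for training
```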

Thursday, March 11, 2021

Amazon Lookout spots Defects and Anomalies in Visuals

Had not heard of this particular AWS service; it could have been useful in our industrial manufacturing and supply chain areas. Just as we use our own vision to scan for anomalies in representations of systems or data.

Amazon’s Lookout for Vision spots defects and anomalies in visual representations  By Duncan Riley in SiliconAngle

Amazon Web Services Inc. today announced the general availability of Amazon Lookout for Vision, a cloud service that uses machine learning to spot defects and anomalies in visual representation using computer vision.

Designed with manufacturing companies in mind, the service can identify differences in images of objects at large scale, delivering the ability to identify manufacturing and production defects such as cracks, dents, incorrect colors and irregular shapes. The technology uses a technique called “few-shot learning” so it can train a model for a customer using as few as 30 baseline images.

Amazon Lookout for Vision can process thousands of images an hour to spot defects and anomalies with no machine learning experience required. Customers send camera images to Amazon Lookout for Vision in real-time to identify anomalies, such as damage to a product’s surface, missing components and other irregularities in production lines. In addition to enabling the service to detect anomalies without large amounts of training data, this capability allows the service to adapt to a wide range of inspection tasks within industrial settings.

Upon analyzing data, Amazon Lookout for Vision reports images that differ from the baseline via the service dashboard or the “DetectAnomalies” real-time application programming interface so that appropriate action can be taken. ... "
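
The "DetectAnomalies" API is callable from the AWS SDKs. A minimal boto3 sketch, with the project and model names as placeholders and the response fields worth double-checking against current AWS documentation:

```python
# A minimal sketch of calling the DetectAnomalies API via boto3.
# Project/model names are placeholders; check current AWS docs for
# the exact request/response shape.
import boto3

client = boto3.client("lookoutvision", region_name="us-east-1")

with open("camera_frame.jpg", "rb") as f:
    response = client.detect_anomalies(
        ProjectName="widget-inspection",   # hypothetical project name
        ModelVersion="1",
        Body=f.read(),
        ContentType="image/jpeg",
    )

result = response["DetectAnomalyResult"]
if result["IsAnomalous"]:
    print(f"Defect suspected (confidence {result['Confidence']:.2f})")
```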

Monday, January 25, 2021

Anomaly Detection with Lacework

Brought to my attention: 

Lacework covers topics and issues around threat defense, intrusion detection, cloud containers, workloads, accounts, devops, and more.

Anomaly detection and behavioral analytics focus on user and application behavior, and on how it changes over time.

Identify and Analyze Anomalies in Cloud and Container Environments

Public clouds enable enterprises to implement infrastructure-as-code and allow them to rapidly develop, test, and deploy services at scale. In this environment, network resources are in constant flux, providing ample opportunities for attackers. Unfortunately, legacy security solutions are ill-equipped to handle these conditions and leave organizations vulnerable. IT security teams need solutions that leverage anomaly detection to safeguard cloud data.

Employ Big Data to Do Security

Traditional security solutions rely on signatures, or rule-based approaches, where rules are readily understandable – but the drawbacks are that these rules are manually entered and do not catch new attack profiles. To reduce false-positive rates, the rules are often written for very well-defined threat scenarios, limiting their effectiveness in production environments. ... " 
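
To make the rules-versus-anomaly contrast concrete, a toy behavioral baseline, each user's own rolling mean and deviation of event counts, flags activity that no static signature anticipates. A sketch of the general idea with invented column names, not Lacework's implementation:

```python
# Toy contrast to a static rule: learn each user's own baseline of
# hourly event counts and flag large deviations. A sketch of the
# behavioral-analytics idea, not Lacework's actual implementation.
import pandas as pd

def flag_behavioral_anomalies(events: pd.DataFrame, window=24, k=4.0):
    """events: columns ['user', 'hour', 'count']. Flag counts far from
    each user's own rolling baseline."""
    flagged = []
    for _, g in events.sort_values("hour").groupby("user"):
        baseline = g["count"].rolling(window, min_periods=6).mean()
        spread = g["count"].rolling(window, min_periods=6).std()
        flagged.append(g[(g["count"] - baseline).abs() > k * spread])
    return pd.concat(flagged) if flagged else events.iloc[0:0]
```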

Thursday, March 05, 2020

Responding to the Unexpected

A place where we often need assistance, and often an important aspect of process design and optimization. Technical.

Cognitive Work of Hypothesis Exploration During Anomaly Response
A look at how we respond to the unexpected
Marisa R. Grayson

Web-production software systems operate at an unprecedented scale today, requiring extensive automation to develop and maintain services. The systems are designed to adapt regularly to dynamic load to avoid the consequences of overloading portions of the network. As the software systems scale and complexity grows, it becomes more difficult to observe, model, and track how the systems function and malfunction. Anomalies inevitably arise, challenging incident responders to recognize and understand unusual behaviors as they plan and execute interventions to mitigate or resolve the threat of service outage. This is anomaly response. [1]

The cognitive work of anomaly response has been studied in energy systems, space systems, and anesthetic management during surgery. [9,10] Recently, it has been recognized as an essential part of managing web-production software systems. Web operations also provide the potential for new insights because all data about an incident response in a purely digital system is available, in principle, to support detailed analysis. More importantly, the scale, autonomous capabilities, and complexity of web operations go well beyond the settings previously studied. [7,8]

Four incidents from web-based software companies reveal important aspects of anomaly response processes when incidents arise in web operations, two of which are discussed in this article. One particular cognitive function examined in detail is hypothesis generation and exploration, given the impact of obscure automation on engineers' development of coherent models of the systems they manage. Each case was analyzed using the techniques and concepts of cognitive systems engineering. [9,10] The set of cases provides a window into the cognitive work "above the line" (see "Above the Line, Below the Line" by Richard Cook in this issue) in incident management of complex web-operation systems (cf. Grayson, 2018). ... "

Monday, December 03, 2018

AI Is Watching Employee Expenses

A classic approach to looking for anomalies in streams of data.

AI Is Watching Employee Expenses   in Bloomberg  By Olivia Carville

AppZen has developed an artificial intelligence program that can identify dubious work expense claims and educate employees about travel and expense policies. The company, which touts Amazon, IBM, Salesforce.com, and Comcast as users, estimated that it has saved its clients $40 million in fraudulent expenses. AppZen can audit 100% of claims in real time by running receipts through an algorithm that looks for duplication, discrepancies, or inflated expenses. The program reimburses legitimate employee expenses on the same day and kicks back any suspicious claims to human auditors for further investigation. In addition, the algorithm can compare the average cost of a flight from New York to Chicago against the amount expensed, and flag it if the price seems out of line for other similar flights that day.  ... " 
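
The flight-price check in the last sentence is, at bottom, a peer-group outlier test. A toy pandas version, with column names invented for illustration:

```python
# Toy version of the route-price check described above: compare each
# expensed fare to the average for the same route and date, and flag
# large deviations. Column names are invented for illustration.
import pandas as pd

def flag_inflated_fares(expenses: pd.DataFrame, tolerance=1.5):
    """expenses: columns ['route', 'date', 'amount']."""
    peer_mean = expenses.groupby(["route", "date"])["amount"].transform("mean")
    return expenses[expenses["amount"] > tolerance * peer_mean]

claims = pd.DataFrame({
    "route": ["NYC-CHI"] * 4,
    "date": ["2018-12-01"] * 4,
    "amount": [210.0, 195.0, 205.0, 640.0],
})
print(flag_inflated_fares(claims))  # the $640 fare stands out
```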

Thursday, November 15, 2018

Benford's Law and Data Science

Used it from the very beginning in enterprise data science. Well worth understanding, especially for anomaly cases in finance or research fraud. Even in finance we found relatively few people who had heard of it or knew how to use it. Good, mostly non-technical overview.

What is Benford’s Law and why is it important for data science?

By Tirthajyoti Sarkar
Sr. Principal Engineer | Ph.D. in EE (U. of Illinois) | AI/ML certification (Stanford, MIT) | Data science author | Open-source contributor | AI in Simulations

We discuss a little-known gem for data analytics — Benford’s law, which tells us about expected distribution of significant digits in a diverse set of naturally occurring datasets and how this can be used for anomaly or fraud detection in scientific or technical publications.

Introduction
We all know about the Normal distribution and its ubiquity in all kinds of natural phenomena or observations. But there is another law of numbers which does not get much attention but pops up everywhere — from nations’ population to stock market volumes to the domain of universal physical constants.

It is called “Benford’s Law”. In this article, we will discuss what it is, and why it is important for data science. 

What is Benford’s law?

Benford’s Law, also known as the Law of First Digits or the Phenomenon of Significant Digits, is the finding that the first digits (or numerals to be exact) of the numbers found in series of records of the most varied sources do not display a uniform distribution, but rather are arranged in such a way that the digit “1” is the most frequent, followed by “2”, “3”, and so in a successively decreasing manner down to “9”.  ... " 
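
Concretely, Benford's Law predicts a first-digit frequency of P(d) = log10(1 + 1/d), about 30.1% for "1" falling to 4.6% for "9". Testing a dataset against it takes a few lines; a sketch using a chi-square goodness-of-fit test:

```python
# First-digit test against Benford's expected P(d) = log10(1 + 1/d).
import numpy as np
from scipy.stats import chisquare

def benford_test(values):
    """Chi-square test of first significant digits against Benford's law."""
    # Naive first-digit extraction; fine for typical positive magnitudes.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    observed = np.bincount(digits, minlength=10)[1:]           # counts of 1..9
    expected = np.log10(1 + 1 / np.arange(1, 10)) * observed.sum()
    return chisquare(observed, expected)
```

A large statistic (small p-value) on, say, a ledger of expense amounts is a cue for a closer look, not proof of fraud.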

Monday, October 15, 2018

Patents for Anomaly Detection

There are many ways to do this kind of anomaly detection; we have done them for years. Anomaly detection alone is not new or patentable, but perhaps it is patentable as part of a larger process?

Anodot Gains Patents for Anomaly Detection By George Leopold in Datanami

Anodot, which focuses on using machine learning techniques to spot anomalies in time-series data, announced a pair of U.S. patent awards this week covering its autonomous analytics framework.

The analytics vendor said Thursday (Oct. 11) it has been granted two U.S. patents for algorithms that allow users to apply machine learning-based anomaly detection. The algorithms are designed specifically to quickly identify the source of anomalies in large data sets, then perform root-cause analysis. The approach is promoted as faster than traditional business intelligence tools or dashboards.

The first patent award covers a method for identifying and analyzing data anomalies by comparing them with previous incidents to “determine their sensitivity,” the company said. Anodot trains its machine learning algorithms based on human behavior rather than using statistical analysis tools.

“By leveraging machine learning and artificial intelligence capabilities, we’re able to tap into human perception and identify business incidents that other BI tools would never find,” claimed Ira Cohen, Anodot’s co-founder and chief data scientist.

The second patent award is for an algorithm used to identify “seasonal trends,” including daily and weekly patterns that could be used to improve detection of data anomalies. Anodot said the technology can be used to provide autonomous analytics alerts to business customers as incidents are detected. ... "
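
Anodot's patented algorithms are not public, so nothing below is theirs; but the general idea of folding seasonal patterns into detection can be sketched by removing a fitted trend and seasonal component and thresholding the residual, here with statsmodels:

```python
# Generic seasonality-aware detection, sketched with statsmodels;
# Anodot's patented algorithms are not public and are not shown here.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_anomalies(series: pd.Series, period=24, k=3.0):
    """Flag points whose residual (after trend + seasonality) exceeds k sigma.
    The series needs at least two full periods of observations."""
    result = seasonal_decompose(series, period=period, model="additive")
    resid = result.resid.dropna()
    return resid[np.abs(resid - resid.mean()) > k * resid.std()]
```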

Tuesday, June 05, 2018

GigaOm Interview with Ira Cohen

Another good interview and podcast.

Voices in AI – Episode 47: A Conversation with Ira Cohen
Byron Reese Jun 5, 2018 - 7:00 AM CDT

In this episode, Byron and Ira discuss transfer learning and AI ethics.
Byron Reese: This is Voices in AI, brought to you by GigaOm, and I’m Byron Reese. Today our guest is Ira Cohen, he is the cofounder and chief data scientist at Anodot, which has created an AI-based anomaly detection system. Before that he was chief data scientist over at HP. He has a BS in electrical engineering and computer engineering, as well as an MS and a PhD in the same disciplines from The University of Illinois. Welcome to the show, Ira.

Ira Cohen: Thank you very much for having me.

So I’d love to start with the simple question, what is artificial intelligence?

Well there is the definition of artificial intelligence of machines being able to perform cognitive tasks, that we as humans can do very easily. What I like to think about in artificial intelligence, is machines taking on tasks for us that do require intelligence, but leave us time to do more thinking and more imagination, in the real world. So autonomous cars, I would love to have one, that requires artificial intelligence, and I hate driving, I hate the fact that I have to drive for 30 minutes to an hour every day, and waste a lot of time, my cognitive time, thinking about the road. So when I think about AI, I think how it improves my life to give me more time to think about even higher level things. ... "

Wednesday, March 21, 2018

Defining Normal

Useful idea. The example shows a very specific context; at what space or time scales does it apply?

Researchers at Bethel University are studying how to teach computers to define "normal" data and then detect anomalies.

The team used mathematical models and real-world data to determine ways to detect needle-in-the-haystack anomalies and report them in real time, using far less computational power than conventional systems.

Their algorithm is based on recognizing a sudden increase of distance between vectors in a high-dimensional vector space.

The researchers tested the algorithm by installing a webcam in an office window to pick up a feed of outdoor foot traffic. Each quadrant in the field has its own anomaly detector attached to it, and if something enters into that box previously unseen by the system, an alert is sent, says Bethel's Brian Turnquist.  ... " 
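
The report includes no code, but the distance-based idea it describes, alerting when a new vector is far from everything seen before, fits in a few lines. A generic sketch, not the Bethel team's algorithm:

```python
# A sketch of the distance-based idea described above: flag a new
# feature vector when it is far from everything seen so far. This is
# a generic novelty detector, not the Bethel team's algorithm.
import numpy as np

class NoveltyDetector:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.seen = []  # previously observed feature vectors

    def observe(self, v: np.ndarray) -> bool:
        """Return True (alert) if v is far from all prior vectors."""
        if self.seen:
            dists = np.linalg.norm(np.array(self.seen) - v, axis=1)
            is_novel = dists.min() > self.threshold
        else:
            is_novel = False  # nothing to compare against yet
        self.seen.append(v)
        return is_novel
```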

Saturday, December 09, 2017

Unsupervised Decision Trees

Nicely done piece. I'm a big supporter of decision trees in general, since they have a basic element of transparency.

Have You Heard About Unsupervised Decision Trees

By William Vorhies in DSC

Summary: Unless you’re involved in anomaly detection you may never have heard of Unsupervised Decision Trees.  It’s a very interesting approach to decision trees that on the surface doesn’t sound possible but in practice is the backbone of modern intrusion detection.

I was at a presentation recently that focused on stream processing but the use case presented was about anomaly detection.  When they started talking about unsupervised decision trees my antenna went up.  What do you mean unsupervised decision trees?  What would they split on?
It turns out that if you’re in the anomaly detection world unsupervised decision trees are pretty common.  Since I’m not in that world and I suspect few of us are, I thought I’d share what I found. .... "
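
The answer to "what would they split on?" turns out to be simple in the isolation setting: a random feature and a random threshold, no labels involved; anomalies end up isolated in fewer splits. A toy single-tree sketch (in practice the depth is averaged over many such trees):

```python
# Toy answer to "what would they split on?": a random feature and a
# random threshold. Anomalies get isolated at shallower depths.
import numpy as np

def isolation_depth(X, point, rng, max_depth=20):
    """Depth at which random axis-aligned splits isolate `point`."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        f = rng.integers(X.shape[1])            # random feature
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:
            break
        split = rng.uniform(lo, hi)             # random threshold
        # Follow the partition that contains the point of interest.
        X = X[X[:, f] <= split] if point[f] <= split else X[X[:, f] > split]
        depth += 1
    return depth  # low depth ~ easy to isolate ~ anomalous
```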



Friday, September 22, 2017

Anomaly Detection

A number of analytics systems we worked with were essentially anomaly detection, so this is close to home. In particular, note that in almost all cases the systems need to be recalibrated and rerun over time.

In O'Reilly, Video:

" ... What may work for anomaly detection today may not work tomorrow. Master statistician Arun Kejariwal helps you understand why in this fascinating walk-through of modern anomaly detection systems - how the definition of “normal” changes as applications, platforms, infrastructure, and algorithms evolve; as well as recognizing the effect of context in what defines an anomaly.

Learn how you, your data, and your decision-making can keep from getting skewed in master statistician Arun Kejariwal's course from Safari on what works - and doesn't work - when building anomaly detection systems. ... "
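
One way to act on that recalibration point is a detector whose notion of "normal" updates continuously rather than being frozen at training time. A minimal sketch using an exponentially weighted baseline:

```python
# Minimal sketch of a detector whose "normal" adapts over time: an
# exponentially weighted mean/variance baseline instead of a frozen
# threshold, so yesterday's calibration slowly ages out.
class AdaptiveDetector:
    def __init__(self, alpha=0.05, k=4.0):
        self.alpha, self.k = alpha, k
        self.mean, self.var = None, 1.0

    def update(self, x: float) -> bool:
        """Return True if x is anomalous vs the current adaptive baseline."""
        if self.mean is None:
            self.mean = x
            return False
        is_anomaly = abs(x - self.mean) > self.k * self.var ** 0.5
        # Recalibrate: fold the new observation into the baseline.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta ** 2)
        return is_anomaly
```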

Friday, April 28, 2017

Isolation Forests for Anomaly Detection

New to me, but it seems worthwhile to run, at least in parallel with other techniques for now. Potentially good for smaller datasets.

 Anomaly Detection Using Isolation Forests

One of the newest techniques to detect anomalies is called Isolation Forests. The algorithm is based on the fact that anomalies are data points that are few and different. As a result of these properties, anomalies are susceptible to a mechanism called isolation.

This method is highly useful and is fundamentally different from all existing methods. It introduces the use of isolation as a more effective and efficient means to detect anomalies than the commonly used basic distance and density measures. Moreover, this method is an algorithm with a low linear time complexity and a small memory requirement. It builds a good performing model with a small number of trees using small sub-samples of fixed size, regardless of the size of a data set.

Typical machine learning methods tend to work better when the patterns they try to learn are balanced, meaning equal numbers of good and bad behaviors are present in the dataset. ... "
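
scikit-learn ships an implementation, so running it in parallel with other techniques, as suggested above, takes only a few lines:

```python
# Running Isolation Forest alongside other detectors, using
# scikit-learn's implementation on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(500, 2))
X_outliers = rng.uniform(-6, 6, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = forest.fit_predict(X)   # -1 = anomaly, 1 = normal
print((labels == -1).sum(), "points flagged as anomalous")
```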

Monday, January 18, 2016

Data Mining and Reporting Blog

Brought to my attention: The Data Mining Reporting Blog, by Rosaria Silipo, Principal Data Scientist at KNIME.com AG. The blog covers data analytics in general, not only KNIME. I have added it to my reading list; it looks to have some excellent technology value. See the latest post, "Anomaly Detection for Predictive Maintenance with Time Series Analysis," which is of particular interest.

Wednesday, August 19, 2015

An AI System Captions Cartoons

In the CACM: I note that this is in part solved by crowdsourcing, not by a computer understanding humor. But the integration of the crowdsourcing into the solution process is interesting. We used the Mechanical Turk system in this way. Use it for choosing advertising copy/captions? For interacting with consumers to tag their sentiments?

" ... Microsoft researchers aim to teach artificial intelligence (AI) software how humor works by training it on an archive of New Yorker cartoons and entries into the magazine's cartoon caption contest. Researcher Dafna Shahaf fed the cartoons and contest entries to the software and taught it to select the funniest choices among captions that make similar jokes, relying partly on crowdsourced input from contract workers via Amazon.com's Mechanical Turk. Ranking jokes was the next step, requiring the researchers to manually describe what was happening in each cartoon, and to categorize its context and anomalies.   ... " 

Tuesday, March 17, 2015

Data Visualization Playbook

Revisiting the basics. Nicely done. I would add: always consider visualization as a form of analytics, and the one that should always be used first. Your eye is the best anomaly detector. Keep what you use as understandable and simple as possible. Display everything well labeled and proportional. Let the user interact with the data. This article is a good introduction and includes links to a number of easy-to-use applications for data visualization.
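
In that spirit, a minimal matplotlib sketch: plot the series first, label everything, and let the eye make the first pass:

```python
# "Your eye is the best anomaly detector": plot the series, label it,
# and look, before reaching for a model. Data here is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
y = rng.normal(100, 5, 200)
y[150] = 160  # one injected anomaly for the eye to find

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(y, lw=1)
ax.set_xlabel("observation")
ax.set_ylabel("reading")
ax.set_title("Daily sensor readings (one obvious outlier)")
plt.tight_layout()
plt.show()
```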

Saturday, March 14, 2015

Data Science Books for Key Value Topics

Detailed piece on two new books on favorite topics: time series and anomaly detection. Based on my long experience in the enterprise, these were the most important topics connecting with decision science. They were important well before the emergence of 'Big Data', and they remain essential today. Also mentioned: key connections to the Internet of Things (IoT). If you are interested in producing value with data science, they are a great place to start.

Friday, August 08, 2014

Proposing a Metadata of Expertise

A conversation brought up a topic we called the 'Metadata of Expertise': data that relates directly to the description of expertise. A simplistic example might be a trend line extracted from sales data, which would be used to derive a prediction rule. Or a categorization of an anomaly that would lead to a selection of applicable rules. Like metadata of the usual kind, this needs to be managed across its lifecycle.

'Metadata of Expertise'. At this point it is broadly described as a means of making sure you have, can deliver, and continue to update needed expertise. The expertise can be human resources, data, decision processes, or expertise that is embedded in systems.

Like metadata in a database, it has a lifecycle that must be managed to keep it credible and current. In our enterprise experiments we found that establishing and maintaining credibility was one of the most difficult things to achieve. My particular interest now is how you use data analytics to make that happen.

Tuesday, July 29, 2014

Visualizing Dynamic Networks

In IEEE Transactions, an example showing dynamic changes in networks:

Dynamic Network Visualization with Extended Massive Sequence Views
Networks are present in many fields such as finance, sociology, and transportation. Often these networks are dynamic: they have a structural as well as a temporal aspect. In addition to relations occurring over time, node information is frequently present such as hierarchical structure or time-series data. We present a technique that extends the Massive Sequence View (MSV) for the analysis of temporal and structural aspects of dynamic networks. Using features in the data as well as Gestalt principles in the visualization such as closure, proximity, and similarity, we developed node reordering strategies for the MSV to make these features stand out that optionally take the hierarchical node structure into account.

This enables users to find temporal properties such as trends, counter trends, periodicity, temporal shifts, and anomalies in the network as well as structural properties such as communities and stars. We introduce the circular MSV that further reduces visual clutter. In addition, the (circular) MSV is extended to also convey time-series data associated with the nodes. This enables users to analyze complex correlations between edge occurrence and node attribute changes. We show the effectiveness of the reordering methods on both synthetic and a rich real-world dynamic network data set. ... "