
Tuesday, June 02, 2020

Leveraging Unlabeled Data

The first step in using data is to make sure we know what the data is. Surprisingly, this can often be an issue; I have seen it a number of times in the real world. How was the data gathered, protected, updated, maintained, shared, preprocessed ... ? If we don't know precisely how it was identified, we don't know what it is. Taking it further: what other data do we need to make this data useful? What is the metadata, and how was it found? Has the data been usefully labeled?

This article takes it further yet. Efforts are underway to construct synthetic data for further and future use; examples from robot control, speech recognition and analysis, healthcare learning, explainability, and causality are brought up. Which made me think: all those efforts need careful labeling too, to make their use feasible.

Leveraging Unlabeled Data
By Chris Edwards
Communications of the ACM, June 2020, Vol. 63 No. 6, Pages 13-14
DOI: 10.1145/3392496

Despite the rapid advances it has made over the past decade, deep learning presents many industrial users with problems when they try to implement the technology, issues that the Internet giants have worked around through brute force.

"The challenge that today's systems face is the amount of data they need for training," says Tim Ensor, head of artificial intelligence (AI) at U.K.-based technology company Cambridge Consultants. "On top of that, it needs to be structured data."

Most of the commercial applications and algorithm benchmarks used to test deep neural networks (DNNs) consume copious quantities of labeled data; for example, images or pieces of text that have already been tagged in some way by a human to indicate what the sample represents.
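For concreteness, a minimal sketch (my illustration, not the article's) of that distinction in Python: a labeled sample pairs the raw input with a human-assigned tag, while an unlabeled sample is the raw input alone.

```python
# Minimal illustration of labeled vs. unlabeled data (hypothetical file names).
# Supervised deep learning consumes (input, label) pairs.

labeled_images = [
    ("img_0001.jpg", "cat"),   # a human looked at this image and tagged it
    ("img_0002.jpg", "dog"),
]

unlabeled_images = [
    "img_0003.jpg",            # raw data straight from collection, no tag
    "img_0004.jpg",
]

# Everything in unlabeled_images is invisible to a supervised learner
# until someone, or something, adds the tags.
```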

The Internet giants, who have collected the most data for use in training deep learning systems, have often resorted to crowdsourcing measures such as asking people to prove they are human during logins by identifying objects in a collection of images, or simply buying manual labor through services such as Amazon's Mechanical Turk. However, this is not an approach that works outside a few select domains, such as image recognition.

Holger Hoos, professor of machine learning at Leiden University in the Netherlands, says, "Often we don't know what the data is about. We have a lot of data that isn't labeled, and it can be very expensive to label. There is a long way to go before we can make good use of a lot of the data that we have."

To attack a wider range of applications beyond image classification and speech recognition and push deep learning into medicine, industrial control, and sensor analysis, users want to be able to use what Facebook's chief AI scientist Yann LeCun has tagged the "dark matter of AI": unlabeled data.

"The problem I see now is that supervising with high-level concepts like 'door' or 'airplane' before the computer even knows what an object is simply invites disaster."

In parallel with those working in academia, technology companies such as Cambridge Consultants have investigated a number of approaches to the problem. Ensor sees the use of synthetic data as fruitful. As one example, he cites a system built by his company to design bridges and control robot arms that is trained on simulations of the real world: calculations made by the modeling software identify strong and weak structures as the DNN makes design choices. ...
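The article gives no implementation details, but the shape of that synthetic-data idea is easy to sketch. In the toy version below (my illustration; simulate_strength is a hypothetical stand-in for real structural-modeling software), a simulator labels randomly sampled candidate designs, so the training set is generated rather than hand-annotated.

```python
# Toy sketch: generating a labeled training set from a simulator.
import numpy as np

rng = np.random.default_rng(42)

def simulate_strength(design):
    """Hypothetical stand-in for a structural solver: maps a design's
    parameters (say, three member thicknesses) to a strength score."""
    return design @ np.array([0.5, 1.5, -0.8]) + rng.normal(scale=0.1)

# Sample random candidate designs and let the simulator label each one
# as strong or weak. No human annotation is involved.
designs = rng.uniform(0.0, 1.0, size=(10_000, 3))
labels = np.array([simulate_strength(d) > 0.6 for d in designs])

# (designs, labels) is now a fully labeled set a DNN could train on.
print(f"{labels.mean():.0%} of sampled designs scored as strong")
```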
