
Monday, August 22, 2022

Competition Makes Big Datasets Winners

For a number of reasons, big datasets are better. I have used ImageNet myself; it is a good example and very useful. I have also used Mechanical Turk.

Competition Makes Big Datasets the Winners   By Chris Edwards

Communications of the ACM, September 2022, Vol. 65 No. 9, Pages 11-13, DOI: 10.1145/3546955

If there is one dataset that has become practically synonymous with deep learning, it is ImageNet. So much so that dataset creators routinely tout their offerings as "the ImageNet of …" for everything from chunks of software source code, as in IBM's Project CodeNet, to MusicNet, the University of Washington's collection of labeled music recordings.

The main aim of the team at Stanford University that created ImageNet was scale. The researchers recognized the tendency of machine learning models at the time to overfit relatively small training datasets, limiting their ability to handle real-world inputs well. Crowdsourcing the labeling job to casual workers recruited through Amazon's Mechanical Turk website delivered a much larger dataset. At its launch at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR), ImageNet contained more than three million categorized and labeled images, a figure that rapidly expanded to almost 15 million.
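To make the overfitting concern concrete, here is a minimal sketch (synthetic data and scikit-learn, both my choices rather than anything from the article) showing how the gap between training and test accuracy shrinks as the training set grows:

```python
# Minimal sketch of overfitting on small training sets (synthetic data,
# not ImageNet). Assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def accuracy_gap(n_train):
    # Synthetic 20-dimensional binary classification task; the label
    # depends on the first five features plus noise.
    X = rng.normal(size=(n_train + 5000, 20))
    y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=len(X)) > 0).astype(int)
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    # A large gap means the model memorized the training set
    # rather than learning the underlying pattern.
    return model.score(X_tr, y_tr) - model.score(X_te, y_te)

for n in (50, 500, 5000, 50000):
    print(f"train size {n:>6}: train/test accuracy gap = {accuracy_gap(n):.3f}")
```

Run on these settings, the gap falls steadily as the training set grows, which is the effect a dataset the size of ImageNet buys.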

The huge number of labeled images proved fundamental to the success of AlexNet, a model based on deep neural networks (DNNs) developed by a team led by Geoffrey Hinton, professor of computer science at the University of Toronto. In 2012, AlexNet won the third annual competition built around a subset of the ImageNet dataset, easily surpassing the results of traditional artificial intelligence (AI) models. Since then, the development of increasingly accurate DNNs and the growth of large-scale datasets have gone hand in hand.
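For readers unfamiliar with the architecture class, the sketch below (in PyTorch, my choice; the class name and layer sizes are invented) illustrates the convolution/pooling/dense pattern that AlexNet popularized. It is an illustration only, not the original network:

```python
# Scaled-down sketch of the convolution/pooling/dense pattern AlexNet
# popularized; this is NOT the original architecture. Assumes PyTorch.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),  # learn local filters
            nn.ReLU(inplace=True),                                 # the nonlinearity AlexNet favored
            nn.MaxPool2d(kernel_size=2),                           # spatial downsampling
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                         # regularization AlexNet relied on
            nn.Linear(64 * 28 * 28, num_classes),    # one class score per ImageNet label
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One forward pass on a batch of two ImageNet-sized (224x224) RGB images.
logits = TinyConvNet()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```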

Teams around the world have collected thousands of datasets designed for developing and assessing AI models, and released them to the academic world or the wider public. The Machine Learning Repository at the University of California, Irvine, for example, hosts more than 600 datasets that range from abalone descriptions to wine quality. Google's Dataset Search indexes some 25 million open datasets developed for general scientific use, not just machine learning. However, few of the datasets released into the wild achieve widespread use.
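As an example of how easily such hosted datasets fit into a workflow, the snippet below loads the UCI wine quality data mentioned above with pandas. The URL follows the repository's long-standing layout, but could change if the archive is reorganized:

```python
# Sketch: pulling one of the UCI repository's datasets (wine quality,
# one of the examples named above) with pandas.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")

wine = pd.read_csv(URL, sep=";")   # this particular file is semicolon-delimited
print(wine.shape)                  # (1599, 12): 11 features plus a quality score
print(wine["quality"].value_counts().sort_index())
```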

Bernard Koch, a graduate student at the University of California, Los Angeles, teamed with Emily Denton, a senior research scientist at Google, and two other researchers from the University of California. In work presented at the Conference on Neural Information Processing Systems (NeurIPS) last year, the team found a long tail of rarely used sources headed by a very small group of highly popular datasets. To work out how much certain datasets predominated, they analyzed five years of submissions to the Papers With Code website, which collates academic papers on machine learning together with their source data and software. Just eight datasets, including ImageNet, each appeared more than 500 times in the collected papers; most datasets were cited in fewer than 10 papers.
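The shape of that finding is easy to reproduce on toy numbers. The sketch below uses made-up usage counts (not the actual Papers With Code data) to show the head-versus-tail concentration the study measured:

```python
# Toy illustration of the long-tail analysis described above, using
# made-up usage counts rather than the actual Papers With Code data.
from collections import Counter

# Hypothetical (dataset, paper-count) tallies: a few giants, many minnows.
usage = Counter({
    "ImageNet": 5200, "COCO": 2100, "CIFAR-10": 1800, "SQuAD": 900,
    **{f"niche-dataset-{i}": c for i, c in enumerate([9, 7, 5, 4, 3, 3, 2, 2, 1, 1])},
})

total = sum(usage.values())
head = [name for name, n in usage.items() if n > 500]
head_share = sum(usage[name] for name in head) / total

print(f"{len(head)} 'head' datasets account for {head_share:.0%} of all usages")
print(f"{sum(1 for n in usage.values() if n < 10)} datasets appear in fewer than 10 papers")
```

With these invented counts, a handful of head datasets capture well over 99% of usage, the same qualitative picture the study reports.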

Much of the focus on the most popular datasets revolves around competitions, which have contributed to machine learning's rapid advancement, Koch says. "You make it easy for everybody to understand how far we've advanced on a problem."

Groups release datasets in concert with competitions in the hope that the pairing will lead to more attention on their field. An example is the Open Catalyst Project (OCP), a joint endeavor between Carnegie Mellon University and Facebook AI Research that is trying to use machine learning to speed up the process of identifying materials that can work as chemical catalysts. It can take days to simulate their behavior, even using approximations derived from quantum mechanics formulas. AI models have been shown to be much faster, but work is needed to improve their accuracy.

Using simulation results for a variety of elements and alloys, the OCP team built a dataset that underpinned a competition that debuted at NeurIPS 2021. Microsoft Asia won this round with a model that borrows techniques from the transformers used in natural language processing (NLP) research, rather than the graph neural networks (GNNs) that had been the favored approach for AI models in this area.
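For readers unfamiliar with the distinction, the sketch below shows a single message-passing step of the kind a graph neural network applies to an atomic structure. The features, connectivity, and layer sizes are invented for illustration and do not come from any OCP entry:

```python
# Bare-bones sketch of one graph-neural-network message-passing step
# over an atomic structure, the approach that had dominated this area.
# Features and connectivity are made up; this is not an OCP model.
import torch

num_atoms, feat_dim = 5, 8
h = torch.randn(num_atoms, feat_dim)            # per-atom feature vectors

# Edges as (source, target) pairs: which atoms are within bonding range.
edges = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 1], [3, 4], [4, 3]])

W_msg = torch.nn.Linear(feat_dim, feat_dim)     # learned message transform
W_upd = torch.nn.Linear(2 * feat_dim, feat_dim) # learned update transform

# Each atom sums transformed features from its neighbors...
messages = torch.zeros_like(h)
messages.index_add_(0, edges[:, 1], W_msg(h[edges[:, 0]]))

# ...then updates its own state from (old state, aggregated messages).
h_next = torch.relu(W_upd(torch.cat([h, messages], dim=1)))
print(h_next.shape)  # torch.Size([5, 8])
```

A transformer-style model, by contrast, lets every atom attend to every other atom rather than only to its graph neighbors, which is the design choice the winning entry exploited.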

"One of the reasons that I am so excited about this area right now is precisely that machine learning model improvements are necessary," says Zachary Ulissi, a professor of chemical engineering at CMU who sees the competition format as one that can help drive this innovation. "I really hope to see more developments both in new types of models, maybe even outside GNNs and transformers, and incorporating known physics into these models." ... '
