The Eponymous Pickle: Amazon Releases 51-Language Dataset for Understanding

Wednesday, April 27, 2022

Amazon Releases 51-Language Dataset for Understanding

An impressive release of key data for language understanding. The breadth of languages is good too.

Amazon Releases 51-Language Dataset for Language Understanding by 7wData, April 26, 2022

Imagine that all people around the world could use voice AI systems such as Alexa in their native tongues.

One promising approach to realizing this vision is massively multilingual natural-language understanding (MMNLU), a paradigm in which a single machine learning model can parse and understand inputs from many typologically diverse languages. By learning a shared data representation that spans languages, the model can transfer knowledge from languages with abundant training data to those in which training data is scarce.

Today we are pleased to make three announcements related to MMNLU.

First, we are releasing a new dataset called MASSIVE, which is composed of one million labeled utterances spanning 51 languages, along with open-source code that provides examples of how to perform massively multilingual NLU modeling and allows practitioners to re-create the baseline results presented in our paper.

Second, we are launching a new competition using the MASSIVE dataset called Massively Multilingual NLU 2022 (MMNLU-22).

And third, we will cohost a workshop at EMNLP 2022 in Abu Dhabi and online, also called Massively Multilingual NLU 2022, which will highlight the results from the competition and include presentations from invited speakers and oral and poster sessions from submitted papers on multilingual natural-language processing (NLP).

“We are very excited to share this large multilingual dataset with the worldwide language research community,” says Prem Natarajan, vice president of Alexa AI Natural Understanding. “We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”

MASSIVE is a parallel dataset, meaning that every utterance is given in all 51 languages. This enables models to learn shared representations of utterances with the same intents, regardless of language, facilitating cross-linguistic training on natural-language-understanding (NLU) tasks. It also allows for adaptation to other NLP tasks such as machine translation, multilingual paraphrasing, new linguistic analyses of imperative morphologies, and more. ...
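To make the "parallel" property concrete, here is a minimal Python sketch. It assumes each record carries an utterance ID, a locale code, the utterance text, and an intent label. The field names (`id`, `locale`, `utt`, `intent`), the helper functions, and the toy records are all invented for illustration and are not taken from the article. The sketch checks that every utterance ID appears in every locale, then pairs up utterances across two locales, which is the kind of reuse for machine translation that the excerpt mentions.

```python
from collections import defaultdict

# Toy records invented for illustration only; the field names are assumptions.
records = [
    {"id": "1", "locale": "en-US", "utt": "set an alarm for seven am", "intent": "alarm_set"},
    {"id": "1", "locale": "de-DE", "utt": "stelle einen wecker auf sieben uhr", "intent": "alarm_set"},
    {"id": "1", "locale": "es-ES", "utt": "pon una alarma para las siete", "intent": "alarm_set"},
    {"id": "2", "locale": "en-US", "utt": "play some jazz", "intent": "play_music"},
    {"id": "2", "locale": "de-DE", "utt": "spiele etwas jazz", "intent": "play_music"},
    {"id": "2", "locale": "es-ES", "utt": "pon algo de jazz", "intent": "play_music"},
]


def group_parallel(rows):
    """Map each utterance id to a {locale: utterance text} dictionary."""
    groups = defaultdict(dict)
    for row in rows:
        groups[row["id"]][row["locale"]] = row["utt"]
    return dict(groups)


def is_fully_parallel(rows):
    """Return True if every utterance id appears in every locale seen in rows."""
    locales = {row["locale"] for row in rows}
    return all(set(group) == locales for group in group_parallel(rows).values())


def translation_pairs(rows, src, tgt):
    """Pair utterances that share an id across a source and a target locale."""
    groups = group_parallel(rows)
    return [(g[src], g[tgt]) for g in groups.values() if src in g and tgt in g]


if __name__ == "__main__":
    print("fully parallel:", is_fully_parallel(records))
    for source, target in translation_pairs(records, "en-US", "de-DE"):
        print(f"{source!r} -> {target!r}")
```

Because the intent label is shared by every row with the same ID, the same grouping could also supply cross-lingual training examples for intent classification.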
