Certainly interesting, but how good and how secure is the resulting translation?
IBM's CodeNet dataset can teach AI to translate computer languages By A. Tarantola in Engadget
AI and machine learning systems have become increasingly competent in recent years, capable of not just understanding the written word but writing it as well. But while these artificial intelligences have nearly mastered the English language, they have yet to become fluent in the language of computers — that is, until now. IBM announced during its Think 2021 conference on Monday that its researchers have crafted a Rosetta Stone for programming code.
Over the past decade, advancements in AI have mainly been “driven by deep neural networks, and even that, it was driven by three major factors: data with the availability of large data sets for training, innovations in new algorithms, and the massive acceleration of faster and faster compute hardware driven by GPUs,” Ruchir Puri, IBM Fellow and Chief Scientist at IBM Research, said during his Think 2021 presentation, likening the new data set to the venerated ImageNet, which has spawned the recent computer vision land rush.
“Software is eating the world,” Marc Andreessen wrote in 2011. “And if software is eating the world, AI is eating software,” Puri remarked to Engadget. “It is this relationship between the visual tasks and the language tasks, when common algorithms could be used across them, that has led to the revolution in breakthroughs in natural language processing, starting with the advent of Watson Jeopardy, way back in 2012,” he continued.
In effect, we’ve taught computers how to speak human, so why not also teach computers to speak more computer? That’s what IBM’s Project CodeNet seeks to accomplish. “We need our ImageNet, which can snowball the innovation and can unleash this innovation in algorithms,” Puri said. CodeNet is essentially the ImageNet of computers. It’s an expansive dataset designed to teach AI/ML systems how to translate code and consists of some 14 million snippets and 500 million lines spread across more than 55 legacy and active languages, from COBOL and FORTRAN to Java, C++, and Python.
“Since the data set itself contains 50 different languages, it can actually enable algorithms for many pairwise combinations,” Puri explained. “Having said that, there has been work done in human language areas, like neural machine translation which, rather than doing pairwise, actually becomes more language-independent and can derive an intermediate abstraction through which it translates into many different languages.” In short, the dataset is constructed in a manner that enables bidirectional translation. That is, you can take some legacy COBOL code — which, terrifyingly, still constitutes a significant amount of this country’s banking and federal government infrastructure — and translate it into Java as easily as you could take a snippet of Java and regress it back into COBOL.
It is the ImageNet of code. .... "
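To put Puri's remark about "pairwise combinations" versus an "intermediate abstraction" into rough numbers, here is a minimal sketch. The function names are hypothetical, and the "one encoder plus one decoder per language" model is an assumed simplification of the intermediate-abstraction idea, not something IBM describes. The counts are plain combinatorics, not figures from the article.

```python
def pairwise_translators(n_languages: int) -> int:
    """Directed source->target pairs if every language pair gets its own translator."""
    return n_languages * (n_languages - 1)


def pivot_components(n_languages: int) -> int:
    """Assumed simplification: one encoder into, and one decoder out of,
    a shared intermediate representation per language."""
    return 2 * n_languages


if __name__ == "__main__":
    # 50 is the figure in Puri's quote; 55 is the figure in the article text.
    for n in (50, 55):
        print(f"{n} languages: {pairwise_translators(n)} pairwise translators, "
              f"{pivot_components(n)} encoder/decoder components")
```

Under these assumptions the pairwise count grows quadratically with the number of languages while the shared-representation count grows linearly, which is one reason the language-independent approach is attractive. It says nothing about how good or secure the translated code is.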