The Eponymous Pickle: Organizing Unstructured Data with Machine Learning

Friday, June 16, 2023

Organizing Unstructured Data with Machine Learning

Considering all the data and its uses.

By Esther Shein

Commissioned by CACM Staff, June 8, 2023

Vector databases are efficient for conducting similarity searches, and they are scalable and flexible, but high-dimensional vectors can be computationally expensive, according to Apple's Huaping Gu.

Vector database company Pinecone in April secured $100 million in venture capital (VC) funding at a $750-million valuation. Other vector database startups have also recently raised millions from VCs, including Chroma, Weaviate, and Qdrant. This raises the question: what exactly are vector databases, and why are they generating buzz now?

Some 80% to 90% of any organization's data is unstructured, according to analysts' estimates. Databases have gone through many iterations, from Structured Query Language/SQL databases (in which data is structured in a collection of tables) and relational databases (which focus on the relationship between stored data elements) to NoSQL databases (in which data is stored and retrieved in different structures without using rows and columns). NoSQL was triggered by the advent of Web 2.0 in the early 2000s.

Those traditional databases were not adequately equipped to analyze unstructured data, especially in real time. Now, with artificial intelligence (AI) gaining momentum, vector databases have emerged for use in machine learning applications. A vector is a high-dimensional array of data in which each dimension is a number.

Explains Charles Xie, CEO and founder of vector database company Zilliz and the Linux Foundation's Milvus Project, "Vectors are important because when you're talking about pictures or images or video, they are the numerical representation of unstructured data that can be easily processed by a machine."

This is where the use of machine learning models to turn unstructured data into floating point values, or vector embeddings, is key. In contrast, those unstructured images, pictures, and videos are time-consuming and a challenge to classify manually in relational databases. As an example, it took 25,000 people (curators) to label the now-famous ImageNet dataset, Xie says.

Once the data is in a machine-readable format, relational databases store and search across structured table-based data, Xie says. However, unlike structured data, there is no easy way to store and efficiently search large amounts of unstructured data within a relational database.

For example, quickly searching for similar shoes, given a collection of shoe pictures from various angles, would be impossible in a relational database since understanding shoe size, style, heel type, color, etc., purely from the image's raw pixel values is difficult, observes Chris Churilo, vice president of marketing at Zilliz. "So we want to turn to a machine to do that for us," using models "that are going to spit out a numerical representation of this content" that are embeddings or vectors, she says. "The cool thing about having this numerical representation is, now I can ask the machine to find [something] that's similar by basically comparing these numbers against each other."

The machine can do that pretty accurately, Churilo says. ...
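
The idea of "comparing these numbers against each other" can be illustrated with a minimal sketch in Python. It is not taken from the article or from any vector database's API: the four-dimensional vectors, the item names, and the helper functions `cosine_similarity` and `most_similar` are all hypothetical. Real embeddings come from a machine learning model and typically have hundreds or thousands of dimensions, and cosine similarity is only one common way to compare them.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, collection, k=2):
    """Brute-force search: score every stored vector, return the k best."""
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in collection.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, made up for illustration only.
shoe_embeddings = {
    "red_sneaker_side": [0.9, 0.1, 0.0, 0.2],
    "red_sneaker_top": [0.8, 0.2, 0.1, 0.3],
    "black_boot": [0.1, 0.9, 0.7, 0.0],
}

# Hypothetical embedding of a new photo of a red sneaker.
query = [0.85, 0.15, 0.05, 0.25]

results = most_similar(query, shoe_embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

This sketch scores every stored vector against the query, so its cost grows with both the number of items and the number of dimensions, which is one way to see why high-dimensional vectors can be computationally expensive at scale.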
