The Eponymous Pickle: Improving Machine Learning

Tuesday, November 23, 2021

Improving Machine Learning

Integrating knowledge graphs with derived knowledge. Useful but rarely done. Our early links with Stanford addressed this.

Improving Machine Learning: How Knowledge Graphs Bring Deeper Meaning to Data

Posted by Kendall Clark on November 22, 2021

Enterprise machine learning deployments are limited by two consequences of outdated data management practices widely used today. The first is the protracted time-to-insight that stems from antiquated data replication approaches. The second is the lack of unified, contextualized data that spans the organization horizontally.

Excessive data replication and the resulting "second-order effects" are creating enormous inefficiencies and waste for data scientists in most organizations. According to IDC, over 60 zettabytes of data were produced last year, and this is forecast to increase at a CAGR of 23 percent until 2025. Worse, the ratio of unique to replicated data is 1:10, which implies that most organizations’ data management methods are based on copying data.
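To make the forecast concrete, the compound growth can be worked through in a few lines. This is only a back-of-the-envelope sketch: it assumes "last year" means 2020, takes 60 zettabytes as the baseline, and reads the 1:10 ratio as one unique unit for every ten replicated units. The function names are illustrative and do not come from IDC.

```python
def project_volume(base_zb: float, cagr: float, years: int) -> float:
    """Compound a starting volume forward at a constant annual growth rate."""
    return base_zb * (1 + cagr) ** years


def unique_share(unique_parts: int, replicated_parts: int) -> float:
    """Fraction of all data that is unique, given a unique:replicated ratio."""
    return unique_parts / (unique_parts + replicated_parts)


if __name__ == "__main__":
    # Assumed baseline: 60 ZB in 2020, growing 23 percent per year through 2025.
    for offset in range(6):
        print(2020 + offset, round(project_volume(60, 0.23, offset), 1), "ZB")
    print("unique share:", round(unique_share(1, 10), 3))
```

Under these assumptions the volume reaches roughly 169 zettabytes by 2025, and only about 9 percent of all data is unique.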

When creating machine learning models, firms usually section off relevant data by replicating them from different sources. Models are typically trained on 20 percent of this data, while the other 80 percent remain for testing. The rigors of data cleansing, feature engineering, and model evaluation can take six months or more, making data stale during this process while delaying time-to-insight and compromising findings.

The second repercussion of traditional, outdated data management approaches is the reduced quality of insights. This effect is not only attributed to building models with stale data, but also to the inadequate relationship awareness, disconnected vertical data silos, poor contextualization, and schema limitations of relational data management techniques.

Properly implementing knowledge graphs in a modern data fabric corrects these data management issues while increasing machine learning’s value. Deploying data virtualization within a knowledge graph-empowered data fabric enables data scientists to bring machine learning to their data instead of the opposite, which wastes time and resources.

Moreover, the inherent flexibility of graph models and their ability to leverage inter-connected relationships make preparing data for machine learning much easier as they provide capabilities like improved feature engineering, root cause analysis, and graph analytics. This functionality is also key to helping knowledge graphs transition to be the dominant data management construct for the next 20 years as data management and AI converge. In short, knowledge graphs will help AI as much as AI will help knowledge graphs.

Data Scientists Need Strategic Data Management

The growing volumes and varieties of data organizations are dealing with prolong machine learning deployments. Varying data formats, schemas, and terminologies across silos or data lakes delay machine learning initiatives requiring this training data. The lack of context and semantic annotations makes it difficult to understand data’s meaning and use for specific models. Even when data is sufficiently contextualized, this information rarely persists, so organizations must start over for subsequent projects. The months of training required when replicating this varied data are made even more difficult by fast-moving data, such as information collected by IoT devices. Organizations are forced to deal with this obstacle by replicating fresh data again, restarting this time-consuming process that impairs models’ functionality.

A far better approach is to train models at the data fabric layer instead of replicating data into silos. Organizations can easily create training and testing datasets without moving data. They can even specify, for example, a randomized 20 percent sample of their data with a query that extracts features and delivers a training dataset via this data virtualization approach underpinned by knowledge graphs. This methodology illustrates the connection between data management and machine learning to accelerate time-to-insight with the added benefit of training models on more current data.
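The idea of sampling in place can be sketched in Python. This is a minimal illustration, not the article's or any vendor's implementation: the record fields (`id`, `spend`, `orders`), the derived feature, and both function names are hypothetical, and a hash of the record ID stands in for the randomized selection so that the sample is reproducible. Rows are yielded lazily from the source, so nothing is copied into a separate store.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def in_sample(record_id: str, fraction: float = 0.20) -> bool:
    """Assign a record to the sample by hashing its ID into the range [0, 1)."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def training_rows(source: Iterable[Dict], fraction: float = 0.20) -> Iterator[Dict]:
    """Lazily yield feature rows for sampled records without copying the source."""
    for rec in source:
        if in_sample(rec["id"], fraction):
            # Hypothetical feature: average spend per order.
            yield {
                "id": rec["id"],
                "spend_per_order": rec["spend"] / max(rec["orders"], 1),
            }


if __name__ == "__main__":
    # Stand-in for a virtualized source; a real data fabric would stream query results.
    source = ({"id": f"cust-{i}", "spend": 100.0 + i, "orders": 1 + i % 5}
              for i in range(1000))
    rows = list(training_rows(source))
    print(len(rows), "training rows out of 1000 records")
```

Because membership depends only on the record ID, rerunning the query against fresher data keeps the same records in the training set while picking up their current values.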

Achieving Quality Machine Learning Insights

Knowledge graphs provide a richer, superior foundation for understanding enterprise data compared with relational or other approaches. They offer contextualized understanding and relationship detection across the nodes and edges in which graphs store data. This capability is significantly enhanced by semantic graph data models that standardize business-specific terminology as a hierarchical set of vocabularies or taxonomies. Thus, data scientists can innately understand data’s meaning and relation to any use case, such as machine learning. Semantic graph data models also align data at the schema level, provide intelligent inferences about concepts or business categories, and eschew conventional problems with terminology or synonyms while delivering a complete view of enterprise data. ...
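A toy example may help show how a taxonomy and a synonym table support this kind of inference. It is a sketch under stated assumptions only: the class names, synonyms, and triples are invented for illustration, and a production knowledge graph would use a graph database and a standard vocabulary language rather than Python dictionaries.

```python
# Hypothetical vocabulary: business terms mapped to one canonical class name.
SYNONYMS = {"client": "Customer", "customer": "Customer", "buyer": "Customer"}

# Hypothetical taxonomy: each class points to its direct parent class.
SUBCLASS_OF = {"PremiumCustomer": "Customer", "Customer": "Party"}

# Hypothetical data stored as (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "type", "PremiumCustomer"),
    ("alice", "placed", "order42"),
    ("bob", "type", "Customer"),
]


def ancestors(cls: str) -> list:
    """Walk up the taxonomy and collect every superclass of cls."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def instances_of(term: str) -> list:
    """Return subjects whose type is the requested class or any subclass of it."""
    target = SYNONYMS.get(term.lower(), term)
    return sorted(
        s for s, p, o in TRIPLES
        if p == "type" and (o == target or target in ancestors(o))
    )


if __name__ == "__main__":
    print(instances_of("client"))
    print(instances_of("PremiumCustomer"))
```

Here a query for "client" resolves to the canonical class Customer and also returns alice, who is recorded only as a PremiumCustomer, because the taxonomy says every PremiumCustomer is a Customer.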
