/* ---- Google Analytics Code Below */

Monday, April 02, 2018

Its all About having the Right Data and Metadata

Spent quite a lot of time figuring this out for analytics in the enterprise.

Data governance and the death of schema on read  in O'Reily

Comcast’s system of storing schemas and metadata enables data scientists to find, understand, and join data of interest.

By Barbara Eckman

In the olden days of data science, one of the rallying cries was the democratization of data. No longer were data owners at the mercy of enterprise data warehouses (EDWs) and extract, transform, load (ETL) jobs, where data had to be transformed into a specific schema (“schema on write”) before it could be stored in the enterprise data warehouse and made available for use in reporting and analytics. This data was often most naturally expressed as nested structures (e.g., a base record with two array-typed attributes), but warehouses were usually based on the relational model. Thus, the data needed to be pulled apart and “normalized" into flat relational tables in first normal form. Once stored in the warehouse, recovering the data’s natural structure required several expensive relational joins. Or, for the most common or business-critical applications, the data was “de-normalized,” in which formerly nested structures were reunited, but in a flat relational form with a lot of redundancy.

This is the context in which big data and the data lake arose. No single schema was imposed. Anyone could store their data in the data lake, in any structure (or no consistent structure). Naturally nested data was no longer stripped apart into artificially flat structures. Data owners no longer had to wait for the IT department to write ETL jobs before they could access and query their data. In place of the tyranny of schema on write, schema on read was born. Users could store their data in any schema, which would be discovered at the time of reading the data. Data storage was no longer the exclusive provenance of the DBAs and the IT departments. Data from multiple previously siloed teams could be stored in the same repository. ..... "         (More)

No comments: