Good talk on the issues, though perhaps not enough on protecting the data and defining its ownership responsibly.
Doing our part to share open data responsibly
Daphne Luong, Director, Software Engineering
Charina Chou, Global Policy Lead, Emerging Technologies
This past weekend marked Open Data Day, an annual celebration of making data freely available to everyone. Communities around the world organized events, and we’re taking a moment here at Google to share our own perspective on the importance of open data. More accessible data can meaningfully help people and organizations, and we’re doing our part by opening datasets, providing access to APIs and aggregated product data, and developing tools to make data more accessible and useful.
Responsibly opening datasets
Sharing datasets is increasingly important as more people adopt machine learning through open frameworks like TensorFlow. We’ve released over 50 open datasets for other developers and researchers to use. These include YouTube 8M, a corpus of annotated videos used externally for video understanding; the HDR+ Burst Photography dataset, which helps others experiment with the technology that powers Pixel features like Portrait Mode; and Open Images, along with the Open Images Extended dataset, which increases photo diversity.
Just because data is open doesn’t mean it will be useful, however. First, a dataset needs to be cleaned so that any insights developed from it are based on well-structured and accurate examples. Cleaning a large dataset is no small feat; before opening up our own, we spend hundreds of hours standardizing data and validating quality. Second, a dataset should be shared in a machine-readable format that’s easy for others to use, such as JSON rather than PDF. Finally, consider whether the dataset is representative of the intended content. Even if data is usable and representative of some situations, it may not be appropriate for every application. For instance, if a dataset contains mostly North American animal images, it may help you classify a deer, but not a giraffe. Tools like Facets can help you analyze the makeup of a dataset and evaluate the best ways to put it to use. We’re also working to build more representative datasets through interfaces like the Crowdsource application. To guide others’ use of your own dataset, consider publishing a data card which denotes authorship, composition and suggested use cases (here’s an example from our Open Images Extended release).
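As a rough illustration of the representativeness check and data card described above, here is a minimal Python sketch. It assumes the labels are available as a plain list. The function names, the 10% threshold and the card fields are hypothetical choices made for this example; they are not the Facets API or the format of the Open Images Extended data card.

```python
import json
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction of all examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(shares, expected_labels, min_share=0.1):
    """List expected labels that are missing or fall below min_share.

    min_share is an arbitrary illustrative threshold, not a recommended value.
    """
    return sorted(
        label for label in expected_labels if shares.get(label, 0.0) < min_share
    )


def build_data_card(name, authors, shares, suggested_uses):
    """Assemble a minimal data card (hypothetical fields) as a JSON string."""
    card = {
        "name": name,
        "authorship": authors,
        "composition": shares,
        "suggested_use_cases": suggested_uses,
    }
    return json.dumps(card, indent=2, sort_keys=True)


# Toy labels echoing the example in the text: mostly North American animals.
labels = ["deer"] * 12 + ["raccoon"] * 7 + ["giraffe"] * 1
shares = label_shares(labels)

# Labels the intended application needs to recognize (an assumption).
gaps = underrepresented(shares, ["deer", "giraffe", "zebra"])

card_json = build_data_card(
    name="toy-animal-images",
    authors=["example author"],
    shares=shares,
    suggested_uses=["classifying common North American animals"],
)
print(gaps)
print(card_json)
```

Writing the card as JSON keeps it machine-readable, in line with the point about preferring JSON to PDF. A real card would carry far more detail than these four fields.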
Making data findable and useful ...
Tuesday, March 05, 2019