A long, insightful piece, somewhat technical but useful. A favorite topic, and one we have experimented with since the very early days.
Automating Data Science
By Tijl De Bie, Luc De Raedt, José Hernández-Orallo, Holger H. Hoos, Padhraic Smyth, Christopher K. I. Williams
Communications of the ACM, March 2022, Vol. 65 No. 3, Pages 76-87, DOI: 10.1145/3495256
Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process.
Key insights:
• Automation in data science aims to facilitate and transform the work of data scientists, not to replace them.
• Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction.
• Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.
Introduction
Data science covers the full spectrum of deriving insight from data, from initial data gathering and interpretation, via processing and engineering of data, and exploration and modeling, to eventually producing novel insights and decision support systems. Data science can be viewed as overlapping with, or broader in scope than, other data-analytic methodological disciplines, such as statistics, machine learning, databases, or visualization.
To illustrate the breadth of data science, consider, for example, the problem of recommending items (movies, books, or other products) to customers. While the core of these applications can consist of algorithmic techniques such as matrix factorization, a deployed system will involve a much wider range of technological and human considerations. These range from scalable back-end transaction systems that retrieve customer and product data in real time, experimental design for evaluating system changes, and causal analysis for understanding the effect of interventions, to the human factors and psychology that underlie how customers react to visual information displays and make decisions.
As another example, in areas such as astronomy, particle physics, and climate science, there is a rich tradition of building computational pipelines to support data-driven discovery and hypothesis testing. For instance, geoscientists use monthly global landcover maps based on satellite imagery at sub-kilometer resolutions to better understand how the earth’s surface is changing over time [50]. These maps are interactive and browsable, and they are the result of a complex data-processing pipeline, in which terabytes to petabytes of raw sensor and image data are transformed into databases of automatically detected and annotated objects and information. This type of pipeline involves many steps, in which human decisions and insight are critical, such as instrument calibration, removal of outliers, and classification of pixels.
The breadth and complexity of these and many other data science scenarios mean that the modern data scientist requires broad knowledge and experience across a multitude of topics. Together with an increasing demand for data analysis skills, this has led to a shortage of trained data scientists with appropriate background and experience, and significant market competition for limited expertise. Considering this bottleneck, it is not surprising that there is increasing interest in automating parts, if not all, of the data science process. This desire and potential for automation is the focus of this article.
As illustrated in the examples above, data science is a complex process, driven by the character of the data being analyzed and by the questions being asked, and is often highly exploratory and iterative in nature. Domain context can play a key role in these exploratory steps, even in relatively well-defined processes such as predictive modeling (e.g., as characterized by CRISP-DM [5]) where, for example, human expertise in defining relevant predictor variables can be critical. ...