Data Science

Summary

Data is no longer scarce. In fact, businesses have an abundance of data and it's growing. This has given rise to the term Big Data. Data science enables businesses to discover valuable insights from data and apply them profitably. Data science is therefore complementary to Big Data.

Historically, statisticians had a mathematical focus. They evolved into data analysts who applied their expertise to solving business problems. They did this by visualizing data and searching for patterns. As the amount of data grew vast, it became necessary to apply Machine Learning algorithms and programming. This is where a data scientist comes in.

A data scientist is really a first-class scientist who's curious, asks questions and makes hypotheses that can be tested with data.

Milestones

1962

John W. Tukey publishes "The Future of Data Analysis". He explains that statistics has mostly been about making inferences but his interest is in data analysis, which has more to do with science than mathematics. The availability of computers makes data analysis possible. His influential paper is today sometimes referred to as FoDA.

1974

The term data science is used for the first time, by Peter Naur in his "Concise Survey of Computer Methods". He defines it as the "science of dealing with data". His definition does not consider data semantics (domain knowledge). Thus, it's different from the modern definition of the term. An alternative term, datalogy, is also used.

1976

John Chambers at Bell Labs creates the programming language S. This lays the basis for statistical computing and quantitative programming environments (QPE) that use scripts and workflows. In the 1990s, S inspires the creation of an open source language called R, which is today the dominant QPE.

1977

The International Association for Statistical Computing (IASC) is formed. This underscores the increasing use of computing in statistical work "to convert data into information and knowledge". The same year, John Tukey publishes "Exploratory Data Analysis", where he states that we should use data to form hypotheses to test. Exploratory Data Analysis and Confirmatory Data Analysis should both be used.

1989

The first Knowledge Discovery in Databases (KDD) workshop is held. By the mid-1990s, this evolves into the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Researchers clarify that "KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data".

1997

Professor C. F. Jeff Wu calls for statistics to be renamed data science and statisticians to be renamed data scientists. The same year, the journal Data Mining and Knowledge Discovery is launched.

2001

William S. Cleveland publishes "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". He proposes the term Data Science in its modern sense. To Cleveland, a data analyst is good at programming but has limited knowledge of statistics. A data scientist, on the other hand, comes from a statistics background but has to work more closely with computer specialists.

2001

Leo Breiman publishes "Statistical Modeling: The Two Cultures". He compares Generative Modeling with Predictive Modeling. The former is dominant among statisticians. Breiman calls on them to adopt predictive modeling and algorithms, which have been developed in other fields.

2008

Data Science as a term gains wider interest thanks to the work of D. J. Patil (LinkedIn) and Jeff Hammerbacher (Facebook).

2009

Google's chief economist, Hal Varian, states that data is plentiful but there's a scarcity of experts who can extract value from it. He says that the sexy job for the next decade will be that of the statistician. He states his expectation of a data scientist:

The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades.

2010

By the start of the 2010s, researchers and writers attempt to explain data science to the public. Data scientist is claimed to be the sexiest job of the 21st century. The decade also sees shifting terminology. Data Mining is now referred to as Machine Learning. The work of a data analyst is called Business Intelligence, but when the analyst uses big data it's called Big Data Analytics.

