Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The ultimate goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the "knowledge discovery in databases" process, also known as KDD.
The goal of data mining is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. The term is also frequently applied, loosely, to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence, machine learning, and business intelligence. Often the more general terms (large-scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which a decision support system can then use to obtain more accurate predictions. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall KDD process as additional steps.
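To make the idea of a dependency pattern concrete, here is a minimal sketch of association rule mining restricted to pairs of items. The function name `pair_rules`, the thresholds, and the toy shopping baskets are all assumptions made for illustration; real systems typically use algorithms such as Apriori or FP-growth that handle larger itemsets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) rules for item pairs.

    support    = fraction of transactions containing both items
    confidence = fraction of transactions with the antecedent
                 that also contain the consequent
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical toy data: each inner list is one transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data the only rule that passes both thresholds is "butter -> bread": every basket containing butter also contains bread, and the pair appears in half of all baskets.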
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are, or may be, too small for reliable statistical inferences to be made about the validity of any patterns discovered. However, these methods can be used in creating new hypotheses to test against the larger data populations.
Data science is an interdisciplinary field concerned with processes and systems for extracting knowledge or insights from data in various forms, whether structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, and is similar to Knowledge Discovery in Databases (KDD).
"Data Scientist" has become a popular occupation with Harvard Business Review dubbing it "The Sexiest Job of the 21st Century".
Data science employs techniques and theories drawn from many fields within computer science, information science, mathematics, and statistics, including:
- Artificial intelligence;
- Computer programming;
- Data compression;
- Data engineering;
- Data mining;
- Data warehousing;
- High performance computing;
- Pattern recognition and learning;
- Machine learning;
- Probability models;
- Predictive analytics;
- Statistical learning;
- Signal processing;
- Uncertainty modeling.
Methods that scale to big data are of particular interest in data science, although the discipline is not generally considered to be restricted to such big data, and big data solutions are often focused on organizing and preprocessing the data instead of analysis. The development of machine learning has enhanced the growth and importance of data science.
Data science affects academic and applied research in many domains, including:
- Biological sciences;
- Digital economy;
- Health care;
- Machine translation;
- Medical informatics;
- Search engines;
- Social sciences and the humanities;
- Speech recognition.
Data science heavily influences economics, business and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.
Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed."
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.
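As a small illustration of building a model from example inputs, the sketch below fits a straight line to example (input, output) pairs with ordinary least squares and then uses it to predict an unseen input. The helper names `fit_line` and `predict` and the sample data are hypothetical; this is one of the simplest possible learners, not a general recipe.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from examples using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training examples that follow y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print(model)               # learned (slope, intercept)
print(predict(model, 5.0)) # prediction for an input not in the training set
```

Nothing in the code states the rule "multiply by two and add one". The parameters are recovered from the examples alone, which is the sense in which the program is not following static instructions.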
Machine learning is closely related to and often overlaps with computational statistics, a discipline that also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory, and application domains to the field. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible.
Machine learning applications include:
- Computer vision;
- Optical character recognition (OCR);
- Search engines;
- Spam filtering.
Machine learning is sometimes conflated with data mining, although data mining focuses more on exploratory data analysis, an approach that corresponds largely to what machine learning calls unsupervised learning.
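Unsupervised learning looks for structure in data that carries no labels. As a minimal sketch, the following one-dimensional k-means routine groups numbers around a fixed number of centers. The function name `kmeans_1d`, the starting centers, and the sample values are assumptions chosen for illustration; production code would normally rely on a tested library and a more careful initialization.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest center."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            sum(g) / len(g) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


# Hypothetical unlabeled data with two visible clumps.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(values, centers=(0.0, 5.0))
print(centers)
print(groups)
```

The routine is never told which values belong together. The two groups emerge from the distances between the values, which is what makes the approach exploratory.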
In data analytics, machine learning is a method used to devise complex models and algorithms for prediction. These analytical models allow analysts, data scientists, engineers, and researchers to "produce reliable, repeatable decisions and results" and to uncover "hidden insights" by learning from historical relationships and trends in the data.