Data mining is a technique for discovering “knowledge”, such as regularities in combinations of purchased products, from large amounts of data. Why has data mining been attracting so much attention in business in recent years? In this article, we explain the definition of data mining, why it matters for business, how it differs from data science, specific data mining methods, and points to keep in mind.
What is data mining?
First, let's go over the definition and basic concepts of data mining.
Definition of data mining
Data mining literally means “mining” data: it is a technology for discovering “knowledge” from large amounts of data by making full use of analysis methods such as statistics and artificial intelligence.
The field was originally known as Knowledge Discovery in Databases (KDD); in 1996 a definition was proposed that tied knowledge discovery in databases to data mining, and the two terms are now often used interchangeably.
As computers have become faster and storage capacities have grown, data mining has become able to handle the huge volumes of data known as big data. It now encompasses preprocessing, the discovery steps themselves, post-processing of results, integration into operational systems, and more.
There are various knowledge discovery processes, but they share a common premise: humans and computers interact iteratively, as needed, at each stage of the process in order to discover knowledge from data, and the technologies involved are researched and developed from that standpoint.
Why Data Mining Matters
In recent years, with the development of networks, communication functions have been built into all kinds of things, such as smartphones and sensors, and the data they generate can now be collected over the Internet. This is known as IoT (Internet of Things) technology. In addition, advances in data storage technology have enabled companies to collect huge amounts of data and store it as big data.
Companies are using this big data to work out solutions to their marketing challenges. Solving marketing challenges means clarifying the following 4Ps for the company's market.
- Product: what product or service to provide to the target
- Price: at what price to provide the service to the target
- Place: how to deliver the service to the target
- Promotion: how to convey the features and appeal of the service to the target
Companies are turning to data mining as one of the means to clarify these 4Ps.
★What is data mining?
→ A technology for discovering “knowledge” from large amounts of data by making full use of analysis methods such as statistics and artificial intelligence
・The term is often used interchangeably with knowledge discovery in databases
・Data mining covers data acquisition, cleansing, preprocessing, the discovery steps, post-processing of results, and integration into operational systems
・Companies are turning to data mining as one means of clarifying their marketing challenges
Differences from data science
Data science, like data mining, is about deriving knowledge from large amounts of data, but the scope of the process is different.
Data science covers the following four stages:
- 1. Understanding issues and formulating hypotheses
- 2. Data collection
- 3. Data analysis and visualization
- 4. Utilization of the insights obtained from the analysis
Data mining mainly corresponds to step 3, “Data analysis and visualization”, in the data science process. In other words, the scope of data mining ranges from understanding the state and issues of the collected data, through data cleansing and other preprocessing such as identifying and checking the data, to modeling, verifying the analysis results, and integrating them into an easy-to-use operational system.
Data mining methods
The data mining and knowledge discovery process can start by testing hypotheses if some already exist, but if there are no hypotheses at all, it has to start by generating them. The process can therefore be broadly divided into two approaches: hypothesis discovery and hypothesis testing.
These two approaches correspond roughly to approaches in machine learning, which automatically generates rules from large amounts of data through statistical analysis: supervised learning is used for hypothesis testing and unsupervised learning for hypothesis discovery.
Below, we describe specific methods, treating supervised learning as hypothesis testing and unsupervised learning as hypothesis discovery.
Hypothesis testing: supervised learning
In supervised learning, pairs of input data and the correct output are given to the computer in advance. When new data arrives, the model makes a decision by comparing it against what it learned from those correct answers.
Supervised learning consists of “classification”, which determines how to assign data to categories of the objective variable, and “regression”, which predicts the objective variable from tendencies in the explanatory variables given in advance.
Identifying an individual from an image is an example of classification: individuals can be distinguished based on information such as facial images, fingerprints, and voice. Examples of regression include predicting the probability of passing the school of your choice from test scores, or predicting store sales from regional and climate data.
Here are some well-known supervised learning algorithms.
(1) Linear regression analysis
Linear regression is a type of regression analysis that predicts the value of the objective variable from the values of explanatory variables. For example, when predicting a cake shop's sales from the number of its posts on SNS, the explanatory variable is the number of SNS posts and the objective variable is the cake sales. When there is a single explanatory variable like this, it is called simple regression analysis; when there are several, such as the number of SNS posts and the number of leaflets distributed, it is called multiple regression analysis.
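As a minimal sketch, a simple regression like the cake-shop example could be fitted with scikit-learn; the SNS-post counts and sales figures below are invented purely for illustration.

```python
# A minimal sketch of simple regression analysis with scikit-learn.
# The SNS-post counts and cake sales below are made-up illustrative numbers.
import numpy as np
from sklearn.linear_model import LinearRegression

sns_posts = np.array([[5], [10], [15], [20], [25]])   # explanatory variable
cake_sales = np.array([52, 61, 75, 83, 94])           # objective variable (thousand yen)

model = LinearRegression().fit(sns_posts, cake_sales)
print("coefficient:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales for 30 posts:", model.predict([[30]])[0])

# Multiple regression simply adds more explanatory variables,
# e.g. columns for both SNS posts and leaflets distributed.
```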
Related methods include logistic regression, which uses categories or proportions as the objective variable, and discriminant analysis, which assumes two populations and infers which population a given sample belongs to.
(2) Support Vector Machine (SVM)
A support vector machine is a technique that models how two classes differ by finding a straight line (function) that separates them; for example, a function that distinguishes glasses that break often from those that do not. This function is called the discriminant plane, and the data points closest to it (the support vectors) are used to determine it. Support vector machines are considered easy to use because they work even when there are many explanatory variables.
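As a rough illustration, a linear support vector machine could be trained with scikit-learn as below; the synthetic data simply stands in for two classes such as “breaks often” and “does not break”.

```python
# A minimal sketch of a support vector machine classifier (scikit-learn).
# The two-class data is synthetic and purely illustrative.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear")   # a linear kernel means a straight (hyper)plane separates the classes
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)
```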
(3) Decision tree
A decision tree is an analysis and prediction method that builds a tree structure of rules for classification. For example, among supermarket customers, prospective customers might be characterized by attributes such as holding a point card or spending X yen or more, and these rules can be used for future predictions. The disadvantage is that, because of over-learning known as “overfitting”, prediction accuracy tends to drop for unknown data that deviates even slightly from the training data.
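A minimal sketch of such a tree, using scikit-learn and a made-up customer table (point card held, purchase amount), might look like this; the learned rules can be printed as text.

```python
# A minimal sketch of a decision tree classifier (scikit-learn).
# Customer attributes and the "prospective customer" label are invented.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [has point card (0/1), purchase amount in yen]
X = np.array([[1, 12000], [1, 3000], [0, 15000], [0, 2000],
              [1, 9000], [0, 5000], [1, 20000], [0, 1000]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = prospective customer

tree = DecisionTreeClassifier(max_depth=3)  # limiting depth helps reduce overfitting
tree.fit(X, y)
print(export_text(tree, feature_names=["point_card", "purchase_amount"]))
```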
(4) Random forest
Combining multiple learning models such as logistic regression, support vector machines, and decision trees is called ensemble learning. Random forest is a machine learning algorithm that performs classification, regression, and clustering using many decision trees: portions of the observation data are sampled at random, multiple trees are trained in parallel on those samples, and their outputs are combined. Using many decision trees overcomes the overfitting of a single tree, and the method is said to be simple, easy to understand, and highly accurate.
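A minimal sketch with scikit-learn shows the idea: a forest of 100 trees is trained and evaluated on synthetic data (the data is invented for illustration).

```python
# A minimal sketch of a random forest classifier (scikit-learn),
# which trains many decision trees on random subsamples and combines their votes.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```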
(5) Neural network
“Neural” refers to neurons, and a neural network is a machine learning method that artificially reproduces the mechanism of the brain's nerve cells in a computer program.
In the brain, neurons receive signals from other neurons and, depending on the amount of signal received, pass signals on to further neurons, thereby processing information. A neural network reproduces this mechanism with functions: the rough picture is that once a node has accumulated a certain amount of information it outputs “information” to the next layer, and once that layer accumulates enough, it outputs “knowledge”.
Deep learning stacks many intermediate (hidden) layers to increase the number of neurons, which makes more accurate decisions possible.
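As one possible sketch, a small neural network with two hidden layers can be trained with scikit-learn's MLPClassifier on a built-in handwritten-digit dataset; this is just one of several libraries that could be used.

```python
# A minimal sketch of a neural network with hidden layers (scikit-learn).
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # 8x8 handwritten digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; stacking more layers/neurons is the idea behind deep learning.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```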
Hypothesis discovery: unsupervised learning
Unlike supervised learning, unsupervised learning has no correct answers in the data, because its goal is to discover hypotheses. Instead, unsupervised machine learning groups similar data from some perspective. This grouping itself is the main goal of unsupervised learning, and humans must interpret the meaning of the resulting groups.
Classifying a large amount of data into groups can help discover features that humans would not otherwise notice, as well as anomalous data.
The main techniques of unsupervised learning are dimensionality reduction and clustering .
Dimensionality reduction is a technique for reducing the number of explanatory variables by extracting the features of the data; it is also called dimensionality compression. The resulting training data carries less information, which makes it easier to understand. Clustering is a technique for grouping data with similar features.
Here are some well-known unsupervised learning algorithms.
(1) Principal component analysis (PCA)
Principal component analysis is a representative dimensionality reduction technique: it summarizes many variables (explanatory variables) into a smaller number of indicators (composite variables). Calculating BMI (Body Mass Index), which indicates the degree of obesity, from height and weight data is a familiar example of reducing the number of indices while retaining the information in the data.
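As a rough sketch, PCA in scikit-learn can compress several correlated body-measurement-style variables into two components; the data below is randomly generated for illustration.

```python
# A minimal sketch of principal component analysis (scikit-learn).
# Four correlated, invented body-measurement variables are reduced to two components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)
weight = 0.9 * height - 90 + rng.normal(0, 5, 100)
waist = 0.5 * weight + 30 + rng.normal(0, 3, 100)
chest = 0.6 * weight + 40 + rng.normal(0, 3, 100)
X = np.column_stack([height, weight, waist, chest])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("first few component scores:\n", scores[:3])
```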
(2) Factor analysis
Factor analysis is also a dimensionality reduction technique. It reveals a small number of latent factors that give rise to the values of the observed variables.
Factor analysis was originally devised as a method to explain the structure of intelligence. When explaining why a student's test scores are good in Japanese but poor in social studies, it is easier if we can identify the “factors” behind the scores, such as reading comprehension and reasoning ability.
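The test-score idea can be sketched with scikit-learn's FactorAnalysis; the score matrix below is randomly generated under the assumption that two latent factors (reading and reasoning) drive the subject scores.

```python
# A minimal sketch of factor analysis (scikit-learn) on invented test scores.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
reading = rng.normal(0, 1, n)      # latent "reading comprehension"
reasoning = rng.normal(0, 1, n)    # latent "reasoning ability"
scores = np.column_stack([
    60 + 10 * reading + rng.normal(0, 3, n),                 # Japanese
    60 + 8 * reading + 4 * reasoning + rng.normal(0, 3, n),  # social studies
    60 + 10 * reasoning + rng.normal(0, 3, n),               # mathematics
    60 + 3 * reading + 7 * reasoning + rng.normal(0, 3, n),  # science
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(scores)
print("factor loadings (subjects x factors):\n", fa.components_.T)
```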
(3) Cluster analysis
“Cluster” means a flock or mass in English. Cluster analysis is an analysis method that groups data with similar characteristics in order to understand those characteristics.
A group of data with similar characteristics is called a “cluster”, and the process of creating these clusters is called “clustering”.
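A minimal clustering sketch with k-means in scikit-learn looks like this; the synthetic data stands in for, say, customer attributes, and interpreting what each cluster means is still left to a human.

```python
# A minimal sketch of clustering with k-means (scikit-learn) on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("cluster sizes:", [list(labels).count(c) for c in range(3)])
print("cluster centers:\n", kmeans.cluster_centers_)
```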
(4) Association analysis
Association analysis is a technique for extracting patterns and relationships from data. A typical example is market basket analysis, which is useful for finding relationships and co-occurrences among products. Association analysis is used for the “recommendations” of related products you see while browsing items on e-commerce sites.
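The core idea can be sketched in plain Python by counting how often item pairs appear together (support) and how often one item leads to another (confidence); the baskets below are made up.

```python
# A minimal sketch of market basket analysis: support and confidence for item pairs.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cookies"},
    {"bread", "milk"},
    {"bread", "butter", "cookies"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

n = len(baskets)
for pair, count in pair_counts.most_common(3):
    a, b = sorted(pair)
    support = count / n                  # share of baskets containing both items
    confidence = count / item_counts[a]  # P(b is bought | a is bought)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```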
(5) Self-organizing map (SOM)
A self-organizing map is a type of neural network and a visualization method that automatically identifies trends and correlations in huge amounts of information and makes them visually understandable. It takes its name from the fact that it was originally derived from a self-organizing model of functional maps in the visual cortex of the brain.
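As a rough hand-rolled sketch in NumPy (real projects would often use a dedicated library), each input is assigned to its best-matching unit on a small grid and nearby units are pulled toward that input, so similar inputs end up mapped to nearby nodes.

```python
# A minimal sketch of a self-organizing map in NumPy on invented 3-dimensional data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))                 # e.g. 200 RGB-like samples
grid_h, grid_w = 10, 10
weights = rng.random((grid_h, grid_w, 3))   # one weight vector per map node

ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)                 # learning rate decays over time
    radius = max(1.0, 5.0 * (1 - epoch / 20))   # neighborhood shrinks over time
    for x in data:
        # Find the best-matching unit (node with the closest weight vector).
        dists = np.linalg.norm(weights - x, axis=2)
        by, bx = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighborhood: nodes near the BMU are updated more strongly.
        grid_dist2 = (ys - by) ** 2 + (xs - bx) ** 2
        influence = np.exp(-grid_dist2 / (2 * radius ** 2))
        weights += lr * influence[..., None] * (x - weights)

print("trained map shape:", weights.shape)  # similar inputs now map to nearby nodes
```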
Notes on data mining
Because big data is so large, it cannot be fed into machine learning as it is. The data must be processed in advance and its trends and characteristics understood. It is also necessary to evaluate the reliability and bias of the acquired data when interpreting the results of machine learning.
Data visualization and feature selection
Since a huge amount of data cannot be used as it is, the data must be checked in advance to decide how to preprocess it and which data to feed into the computer. The preprocessing that removes unnecessary information, missing data, and abnormal values is called cleansing. After that, the “features”, the data actually read into the computer, are chosen. These decisions should be made after visualizing the data with graphs, tables, and so on.
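A minimal sketch of this checking and cleansing step with pandas and matplotlib might look like the following; the CSV file name and column names are assumptions made purely for illustration.

```python
# A minimal sketch of cleansing and visualizing data before mining.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_data.csv")           # hypothetical raw data file

print(df.describe())                         # overall tendencies of numeric columns
print(df.isna().sum())                       # how much data is missing per column

# Simple cleansing: drop rows with missing values and obvious outliers.
df = df.dropna()
df = df[df["purchase_amount"] < df["purchase_amount"].quantile(0.99)]

# Visualize a candidate feature before deciding whether to feed it to the model.
df["purchase_amount"].hist(bins=30)
plt.xlabel("purchase_amount")
plt.show()
```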
How to determine valid data
Even if machine learning is performed and analysis results are obtained, those results will be inaccurate if the original data is unreliable or the data collection targets are biased. Evaluate the “reliability”, “quantity”, and “bias” of the acquired data: was it obtained by a trustworthy institution, is there enough of it, and were the acquisition method and subjects appropriate?
Summary: Data mining is a problem-solving method in which humans and computers interact
We have explained the definition of data mining and its specific methods along with machine learning techniques. The data mining and knowledge discovery process is an interactive, iterative collaboration between humans and computers that efficiently discovers useful knowledge.
Digital transformation (DX), which improves people's lives through the spread of digital technology, likewise enriches business and everyday life through interaction between people and computers.