Process of The Data Mining
Data mining is a promising and relatively new technology. Data mining is defined as a process of discovering hidden valuable knowledge by analyzing large amounts of data, which is stored in databases or data warehouse, using various data mining techniques such as machine learning, artificial intelligence(AI) and statistical.
Many organizations in various industries are taking advantages of data mining including manufacturing, marketing, chemical, aerospace… etc, to increase their business efficiency. Therefore, the needs for a standard data mining process increased dramatically. A data mining process must be reliable and it must be repeatable by business people with little or no knowledge of data mining background. As the result, in 1990, a cross-industry standard process for data mining (CRISP-DM) first published after going through a lot of workshops, and contributions from over 300 organizations.
The data mining process involves much hard work, including perhaps building data warehouse if the enterprise does not have one. A typical data mining process is likely to include the following steps:
Requirements analysis: The enterprise decision makers need to formulate goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for, since the technique to be used and the data that is required are likely to be different for different goals. Furthermore, if the objectives have been clearly defined, it is easier to evaluate the results of the project. Once the goals have been agreed upon, the following further steps are needed.
Data selection and collection: This step may include finding the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. If the data is not available in the warehouse or the enterprise does not have a warehouse, the source OLTP (On-line Transaction Processing) systems need to be identified and the required information extracted and stored in some temporary system. In some cases, only a sample of the data available may be required.
Cleaning and preparing data: This may not be an onerous task if a data warehouse containing the required . data exists, since most of this must have already been done when data was loaded in the warehouse. Otherwise this task can be very resource intensive and sometimes more than 50% of effort in a data mining project is spent on this step. Essentially a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.
Data mining exploration and validation: Once appropriate data has been collected and cleaned, it is possible to start data mining exploration. Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the enterprise’s needs. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted. This is likely to be an iterative process which should lead to selection of one or more techniques that are suitable for further exploration, testing, and validation.
Implementing, evaluating, and monitoring: Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports, or for results visualization and explanation for managers. It may be that more than one technique is available for the given data mining task. It is then important to evaluate the results and choose the best technique. Evaluation may involve checking the accuracy and effectiveness of the technique. Furthermore, there is a need for regular monitoring of the performance of the techniques that have been implemented. It is essential that use of the tools by the managers be monitored and results evaluated regularly. Every enterprise evolves with time and so must the data mining system. Therefore, monitoring is likely to lead from time to time to refinement of tools and techniques that have been implemented.
Results visualization: Explaining the results of data mining to the decision makers is an important step of the data mining process. Most commercial data mining tools include data visualization modules. These tools are often vital in communicating the data mining results to the managers, although a problem dealing with a number of dimensions must be visualized using a two dimensional computer screen or printout. Clever data visualization tools are being developed to display results that deal with more than two dimensions. The visualization tools available should be tried and used if found effective for the given problem.