Data Mining Notes

Overview of Data Mining

  • Advancements in internet connectivity and data storage have made it possible to collect and store vast amounts of data at low cost.
  • This data contains hidden knowledge that can be vital for company growth and scientific discoveries.
  • Much of this data remains unexamined, leading to a 'data-rich but knowledge-poor' phenomenon.
  • Knowledge discovery is the extraction of previously unknown and potentially useful information from databases.
  • Data Mining is the central element of the knowledge discovery process.

1. What is Data Mining?

  • Data Mining is a process to discover patterns and relationships in data.
  • It involves various techniques from multiple disciplines, including database systems, machine learning, information science, visualization, and statistical methods.
1.1. The Origins of Data Mining
  • There is no universal agreement on the definition of data mining.

  • Data mining problems and solutions have roots in classical data analysis.

  • Two essential disciplines for data mining are statistics and machine learning.

  • Statistics:

    • Origins in mathematics.
    • Focus on mathematical rigor and theoretical grounds.
    • Driven by the notion of a model: a hypothesized conceptual structure that could have led to the data.
  • Machine Learning:

    • Origins in computer practice.
    • Practical orientation: experimentation without formal proof.
    • Emphasizes algorithms: the problem-solving logic is embodied implicitly in the algorithm's code rather than in an explicit model.
  • Control Theory:

    • Basic modelling principles also have origins in control theory, applied mainly to engineering systems and industrial processes.
    • Determining a mathematical model for an unknown system by observing its input-output data is commonly known as system identification.
    • An essential purpose is to predict a system's behaviour and explain the interrelationships between the attributes or variables.
    • Structure and parameter identification must be performed repeatedly until a satisfactory model is found.
    • New techniques developed for parameter identification now form part of the spectrum of Data Mining techniques (a minimal least-squares sketch follows this list).
  • Model vs Pattern

    • A model is a large-scale structure representing relationships over many or all cases of the data.
    • A pattern is a local structure that holds for only a few cases or in a small region of the data space.
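
  As a minimal illustration of parameter identification (all numbers here are invented), suppose the unknown system is hypothesized to be linear, y = a*u + b; the parameters can then be estimated from observed input-output data by least squares:

    import numpy as np

    # Hypothetical input-output observations of an unknown system.
    u = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # observed inputs
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])   # observed outputs

    # Least-squares parameter identification for the assumed structure y = a*u + b.
    A = np.column_stack([u, np.ones_like(u)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(f"identified model: y = {a:.2f}*u + {b:.2f}")

  If the residuals are large, the assumed structure is revised (e.g., a quadratic term is added) and identification is repeated, which is the repeat-until-satisfactory loop described above.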
1.2. What is NOT Data Mining?
  • Not all data-analysis methods are considered data mining processes. Data mining is usually applied to large data sets and follows a standardized process of data exploration, pre-processing, modelling, and evaluation for knowledge discovery.
  • Computing Algorithm or Program: Writing or using a computing program to derive data patterns or prediction results is not Data Mining.
  • Descriptive Statistics: Computing the mean, standard deviation, and other descriptive statistics is vital to understanding any data set but is not considered Data Mining. However, such statistics are computed during the exploration phase of a Data Mining process.
  • Hypothesis testing: While data mining can develop and test many hypotheses, hypothesis testing by itself is not data mining but rather one of the data-analysis methods used in the modelling phase of Data Mining.
  • Queries: Data query techniques are used in several phases of a data-mining process, but querying itself is not Data Mining.
  • Exploratory visualization: Describing data with a visualization application is essential in the data understanding and pre-processing phases of the Data Mining process but is not Data Mining itself.
  • Machine Learning Technique: Performing data analysis with a supervised or unsupervised machine learning method is not, by itself, Data Mining.
  • Dimensional slicing: OLAP applications deliver data analysis via dimensional slicing, filtering, and pivoting, which is considered information retrieval but not Data Mining.
1.3. Fallacies of What Data Mining Can Accomplish
  • Fallacy 1: There are data mining tools that can automate the processing of data sets and find answers to problems.
    • Reality: There are no automatic data mining tools that will mechanically solve problems without human intervention. Instead, data mining is a process.
  • Fallacy 2: The data mining process is autonomous, requiring little or no human oversight.
    • Reality: Without skilled human oversight, arbitrary use of data mining tools will likely generate incorrect solutions. Further, erroneous analysis does more harm than no analysis, since it leads to deployment or policy decisions that will probably be expensive failures. Human analysts must perform continuous quality monitoring and other assessment measures.
  • Fallacy 3: Data mining can recognize the causes of business or research problems.
    • Reality: The knowledge discovery process will help businesses or researchers to find patterns of data behaviour. However, it depends on humans to determine the causes based on the patterns identified.
  • Fallacy 4: Data mining will automatically clean up and process a dirty data set.
    • Reality: Data pre-processing and clean-up are not performed automatically. Organizations often face data of too poor a quality for analysis, requiring considerable fixing and updating.
  • Fallacy 5: Data mining always provides positive and useful results for application deployments or policy recommendations.
    • Reality: There is no guarantee of positive and useful results when mining data for actionable knowledge. Data mining can certainly provide actionable and fruitful results when used appropriately by those who understand the problem domain, the models involved, the data requirements, and the overall project objectives.
1.4. The Case for Data Mining
  • Conventional data analysis techniques can only bring users limited knowledge discovery.
  • There is a need for a paradigm to:
    • Handle a massive volume of data and explore patterns and interrelationships among thousands of attributes or variables.
    • Deploy data modelling techniques to derive useful insights from the data sets.
    • Provide a standardized process, tools, and techniques to assist humans in processing data and extracting useful information systematically and intelligently.
1.4.1. Data Volume
  • The amount of data captured by organizations is exponentially increasing.
  • A rapid growth in the volume of data exposes the limits of current data analysis methodologies.
  • Data volume is a significant factor in determining development and deployment time.
1.4.2. Dimensions of Data
  • One of the primary characteristics of the Big Data phenomenon is high variety.
  • Variety of data relates to multiple data types (numerical, categorical), data formats (audio files, video files), and data categories (demographics, location coordinates, graph data).
  • Each attribute or variable is a dimension in the data space.
  • As the dimension of the data increases, there is a need for an adaptable methodology that can work well with multiple data types and multiple attributes.
1.4.3. Complex Questions
  • Conventional analyses such as hypothesis testing do not scale to finding the natural groupings in a data set with hundreds of dimensions.
  • A more automated approach, such as machine-learning algorithms, is needed to automate data exploration in the vast search space.
  • Conventional statistical analysis approaches a problem using a theoretical or stochastic model to predict a target variable based on input variables.
  • Linear regression and logistic regression are classic examples of this technique, where the model's parameters are estimated from the data (a small sketch contrasting the two approaches follows this list).
  • Machine learning approaches the problem of modelling by attempting to find and select a model that can better characterize data or predict the output from input variables.
  • Machine learning techniques are usually iterative: in each cycle they evaluate the output and 'learn' from the errors of the previous step.
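
  As a hedged sketch on synthetic data, the two approaches can be contrasted on the same problem: a closed-form statistical fit of a hypothesized linear model versus an iterative machine-learning loop that adjusts parameters using the errors of the previous step:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 50)
    y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)   # synthetic data with a hidden linear relationship

    # Statistical approach: estimate the hypothesized model y = w*x + b in closed form.
    w_stat, b_stat = np.polyfit(x, y, deg=1)

    # Machine-learning approach: start from an arbitrary model, then repeatedly
    # 'learn' from the previous step's prediction errors (gradient descent).
    w = b = 0.0
    lr = 0.01
    for _ in range(2000):
        err = (w * x + b) - y          # errors of the current model
        w -= lr * (err * x).mean()     # nudge parameters to shrink the errors
        b -= lr * err.mean()

    print(f"closed-form fit: w={w_stat:.2f}, b={b_stat:.2f}")
    print(f"learned fit:     w={w:.2f}, b={b:.2f}")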
1.4.4. The Needed Paradigm
  • Deploy an extensive methodology that can iteratively utilize statistical and machine learning techniques to generate useful data patterns and relationships from data sets of large volume and dimensionality.
  • Data Mining is one such paradigm that can manage large data sets with many attributes and deploy complex modelling techniques to explore patterns from the data sets for complex questions.

2. The Concept of Data Mining

  • Data Mining is a process to discover patterns and relationships in data that involves various techniques from multiple disciplines.
  • Data Mining is not merely an application of statistical and machine learning methods. Instead, it is a thoughtfully planned and considered process.
  • The general experimental process adapted to data-mining problems involves the following phases:
    1. Problem Understanding
    2. Data Understanding
    3. Data Pre-processing
    4. Data Modelling
    5. Model Evaluation
  • Each phase, and the entire Data Mining process as a whole, is highly iterative.
  • A clear and adequate understanding of the whole process is vital for any successful data-mining application.
  • A data-mining process begins by developing a clear understanding of what problem(s) need to be solved, in collaboration with experts in the application domain and in data mining (e.g., data analysts).
  • Subsequently, data analysts understand available data and prepare the data (evidence and hypotheses) for analysis.
  • The data understanding phase involves examining descriptive information in metadata, basic data statistics (counts, ranges, distributions, visualizations), and data quality.
  • Based on the descriptive information, data are corrected if necessary and transformed for analysis.
  • Then, appropriate data processing methods and applications are applied in the data pre-processing phase to extract and enhance features for data modelling.
  • In the data modelling phase, appropriate machine learning and/or statistical modelling techniques are selected.
  • The models built are then evaluated and optimized in light of the goals of the problem formulation, and hypotheses and plans are possibly adjusted.
  • Finally, results are interpreted after determining the best model for deployment in an application.
2.1. Problem Understanding
  • Domain-specific knowledge and experience are vital to developing an applicable and meaningful problem statement.
  • In the problem understanding phase, data-mining practitioners usually identify a set of attributes or variables for the unknown dependency and, if possible, a general form of that dependency as an initial hypothesis to be tested.
  • This first phase of Data Mining requires the combined expertise of an application-domain expert and a data-mining practitioner.
2.2. Data Understanding
  • Data understanding, also known as Exploratory Data Analysis (EDA) or data exploration, provides methods to understand the data.
  • Basic understanding approaches involve computing descriptive statistics and visualization of data.
  • Descriptive statistics like mean, median, mode, standard deviation, and range for each attribute or variable summarize the characteristics of the distribution of the data.
  • Visual plots condense all the data points into one chart, conveying the shape of the data at a glance.
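
  A minimal EDA sketch with pandas; the file name customers.csv and its columns are hypothetical stand-ins for whatever data set is being explored:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("customers.csv")      # hypothetical data set

    print(df.describe())                   # mean, std, min/max, quartiles per numeric attribute
    print(df.median(numeric_only=True))    # robust centre per numeric attribute
    print(df["segment"].mode())            # most frequent category

    df["income"].hist(bins=30)             # one chart condensing a whole attribute
    plt.show()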
2.3. Data Pre-processing
  • Preparing the data sets to suit a data-mining process is the most time-consuming phase.
  • Data rarely are available in the form required by the data modelling techniques.
  • Data often contain missing values, misclassifications, duplicates, or abnormal distributions.
  • Data analysts also need to pre-process the data into the types that suit the requirements of different data modelling techniques.
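
  A minimal pre-processing sketch with pandas; the column names are hypothetical, and the right fixes always depend on the data and the chosen modelling technique:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("customers.csv")                           # hypothetical data set

    df = df.drop_duplicates()                                   # duplicated records
    df["income"] = df["income"].fillna(df["income"].median())   # missing values
    df["segment"] = df["segment"].astype("category")            # type a model may require
    df["income_log"] = np.log1p(df["income"])                   # tame a skewed distribution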
2.4. Data Modelling
  • A model is the abstract representation of the data and its relationships in a given data set.
  • It is essential to understand an algorithm before applying it. Specifically, the data mining practitioners must know how it works and decide what parameters need to be configured.
  • Data mining models built in the data modelling phase can be classified into the following categories: Predictive (a.k.a Supervised learning) and Descriptive (a.k.a. Unsupervised learning).
  • Predictive modelling algorithms require a known prior data set from which to 'learn' the model. Descriptive modelling techniques have no target variable to predict; hence there is no labelled training or test data set.
  • Usually, the implementation involves repetitive experiments with different parameters to generate several models, and selecting the best one is an additional task.
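
  A minimal sketch of the two model categories, assuming scikit-learn and one of its toy data sets; the parameter choices are arbitrary illustrations, not recommendations:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)

    # Predictive (supervised): learns from a known prior data set (X with labels y).
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Descriptive (unsupervised): no target variable, only the data itself.
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)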
2.5. Model Evaluation
  • Selected data-mining models should help in decision-making. Hence, such models need to be interpretable for actionable deployment.
  • A model's accuracy and its interpretability are somewhat contradictory goals.
  • Simple models are more interpretable, but they may be less accurate.
  • Identifying accurate and useful models to select the best model for deployment is crucial.
  • Interpreting the models is essential for non-data-mining practitioners because they are unlikely to understand and interpret hundreds of pages of numerical results and use them for successful decision-making.
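
  A minimal sketch of comparing candidate models before deployment, illustrating the accuracy-versus-interpretability trade-off; the models and data set are arbitrary scikit-learn examples:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    candidates = {
        "shallow tree (readable as rules)": DecisionTreeClassifier(max_depth=2),
        "random forest (accurate, opaque)": RandomForestClassifier(n_estimators=200),
    }
    for name, model in candidates.items():
        score = cross_val_score(model, X, y, cv=5).mean()   # 5-fold cross-validation
        print(f"{name}: mean CV accuracy = {score:.3f}")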

3. Applications of Data Mining

  • A comprehensible decision structure is a critical element in the successful adoption of a data-mining application.
3.1. Web Mining
  • Search engine companies analyse the webpage hyperlinks to develop a measure to distinguish and rank each website and page. For example, Google uses the PageRank metric to measure the characteristics of a web page.
  • Another way search engines tackle the problem of ranking web pages is to use machine learning trained on example queries, the terms they contain, and human judgments of how relevant particular web pages are to each query.
  • Online product or service merchants mine the purchasing databases to develop recommendations.
  • Social networks and other personal data provide a massive volume of data for data-mining applications.
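
  A minimal PageRank-style sketch, power iteration on a tiny invented 4-page link graph; this illustrates the idea behind the metric, not Google's production algorithm:

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> pages it links to
    n, d = 4, 0.85                                    # page count, damping factor

    # Column-stochastic matrix: M[j, i] = 1/outdegree(i) when i links to j.
    M = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            M[j, i] = 1.0 / len(outs)

    rank = np.full(n, 1.0 / n)
    for _ in range(50):                               # iterate until the scores settle
        rank = (1 - d) / n + d * M @ rank

    print(rank)                                       # higher score = more important page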
3.2. Loan Default
  • Loan companies use information to decide whether to give a loan to the applicant.
  • Decisions involve applying learning methods to decide 'reject' and 'accept' cases.
  • A machine learning technique can be used in the data-mining process to generate a set of classification rules that predict whether borderline cases are likely to default (a minimal decision-tree sketch follows).
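
  A minimal sketch of rule generation for loan decisions via a decision tree; the applicant features and the six training records are entirely made up:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # features: [income in k$, debt ratio]; label: 1 = defaulted, 0 = repaid
    X = [[20, 0.9], [35, 0.7], [50, 0.4], [80, 0.2], [25, 0.8], [60, 0.3]]
    y = [1, 1, 0, 0, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["income", "debt_ratio"]))

  The printed if/then structure is the kind of rule set an analyst can review before applying it to borderline applications.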
3.3. Screening Images
  • Environmental scientists have been attempting to identify oil slicks from satellite images to provide early warning of ecological disasters and prevent illegal dumping.
  • A hazard detection system can be deployed to screen images using a data-mining approach for subsequent manual processing.
3.4. Load Forecasting
  • In the electricity supply industry, it is crucial to estimate future power demand as far in advance as possible.
  • The predictive modelling technique used in the data-mining approach is considered a supplementary correction to the static load model.
  • The resulting system adopting a data-mining approach performed comparably to trained human forecasters but was far faster.
3.5. Sales and Marketing
  • The banking industry was an early adopter of the data mining approach because of its successes in using machine learning techniques for credit assessment.
  • Market basket analysis uses association techniques in a data mining approach to discover items that are likely to occur concurrently in transactions.
  • Direct marketing is another favourite application domain for data mining.
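
  A minimal market-basket sketch computing support and confidence for one candidate rule over invented transactions; a full analysis would use an association-mining algorithm such as Apriori:

    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"milk", "butter", "bread"},
        {"milk", "eggs"},
    ]

    n = len(transactions)
    both = sum(1 for t in transactions if {"bread", "milk"} <= t)   # bread and milk together
    bread = sum(1 for t in transactions if "bread" in t)            # bread at all

    # support = P(bread and milk); confidence = P(milk | bread)
    print(f"bread -> milk: support={both / n:.2f}, confidence={both / bread:.2f}")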
3.6. Other Applications
  • British Petroleum used data mining to create rules for establishing the parameters in advanced manufacturing processes.
  • Westinghouse faced difficulties manufacturing nuclear fuel pellets and used a data-mining approach to develop rules to control the process.
  • Bell Atlantic used a data-mining approach to decide when to dispatch technicians according to the generated rules, saving more than $10 million per year.
  • In biomedicine, data mining approaches predict drug activity by analysing drugs' chemical properties.
  • Finally, cybersecurity is a primary concern in today's vulnerable networked computer systems; data-mining approaches allow the detection of intrusions by identifying unusual operation patterns.

4. Standard Process Methodology of Data Mining

  • CRISP-DM (Cross Industry Standard Process for Data Mining)
  • KDD (Knowledge Discovery in Databases)
  • SEMMA (Sample, Explore, Modify, Model, Assess)