Introduction To Data Mining And Machine Learning

ADM 3308: Business Data Mining - Telfer School Of Management, University Of Ottawa


Table of Contents

  • Introduction

  • Machine Learning vs. Statistical Analysis

  • Definition of Data Mining and Machine Learning

  • Data Mining and Ethical Issues

  • Data Mining Models


Introduction

  • Overview of the course focused on Data Mining and Machine Learning concepts applicable in business contexts.


Machine Learning vs. Statistical Analysis

  • Machine Learning and Statistical Analysis are often compared and contrasted.

    • Machine Learning: Focuses on developing algorithms that allow computers to learn from and make predictions based on data.

    • Statistical Analysis: Involves analyzing data to derive conclusions based on a set of assumptions, often employing hypothesis testing.


Definition of Data Mining and Machine Learning

  • Data Mining: The exploration and analysis of a large quantity of data to discover meaningful patterns and rules.

  • Knowledge Discovery in Data (KDD): The non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.

    • Key Features of KDD:

      • Exploratory data analysis

      • Data-driven discovery

      • Inductive learning (general patterns are inferred from specific observations in the data)


Data Mining and Ethical Issues

  • Discussion on ethical implications associated with data mining practices, including:

    • Privacy concerns surrounding user data.

    • Challenges of anonymizing data effectively.

    • A widely cited estimate that about 85% of Americans can be uniquely identified from just their zip code, birth date, and gender, even when names are removed.

    • Risk of discriminatory practices in data mining, such as in loan approvals, if using sensitive attributes like gender or race.
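The anonymization challenge above can be made concrete by counting how many records are unique on a few seemingly harmless attributes. The following is a minimal sketch on hypothetical toy records (the values and the `unique_fraction` helper are invented for illustration, not taken from the course material):

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birth_date, gender). Not real data.
records = [
    ("02139", "1990-04-12", "F"),
    ("02139", "1990-04-12", "F"),
    ("02139", "1985-11-02", "M"),
    ("10001", "1990-04-12", "F"),
    ("10001", "1978-01-30", "M"),
]

def unique_fraction(rows):
    """Fraction of rows whose attribute combination appears exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on (zip, birth date, gender)")
```

A record that is unique on these three attributes can potentially be linked to a named person through another dataset that carries the same attributes, which is why removing names alone is not enough.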


Data Mining Models

  • Different data mining models and techniques include:

    • Classification: Grouping data into pre-defined classes (e.g., grades A, B, C, D).

      • Example: Approving loans based on attributes like age, region, and income.

    • Prediction/Estimation: Estimating an unknown or future numeric value, such as forecasting stock prices.

    • Clustering: Similar to classification but without pre-defined classes (e.g., clustering customers based on age or income).

    • Association Rules (Affinity Grouping): Identifying patterns like: if people buy product X, they often buy product Y with a given confidence percentage (Z%).
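The confidence of an association rule "if X then Y" can be computed directly from transaction counts. The sketch below assumes a handful of hypothetical shopping baskets; the item names and helper functions are illustrative only:

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Share of all baskets that contain the itemset."""
    return count_containing(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the share that also contain the consequent."""
    both = count_containing(antecedent | consequent, baskets)
    return both / count_containing(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

Here bread appears in 4 of 5 baskets and bread together with milk in 3, so the rule "bread implies milk" has a confidence of 3/4.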


Data Structures

  • Various sources and structures of data explored, including:

    • Business & Commerce: Corporate sales, stock transactions, etc.

    • Humanities and Social Sciences: Scanned books, historical documents, etc.

    • Entertainment: Internet images, streaming media.

    • Scientific Data: Data from astronomy, biology, etc.

    • Internet of Things (IoT): Data generated by machines and sensors.


What To Do With These Data?

  • The central question is how to derive actionable insights from the vast amounts of data generated across industries.


Data Mining Uncovers Hidden Information

  • The process of data mining involves identifying and extracting hidden patterns and information within a dataset.


Multi-Disciplinary Field of Science and Technology

  • Data Mining is recognized as a multi-disciplinary field, bridging various domains including computer science, statistics, and domain-specific areas.


Machine Learning

  • Involves learning patterns from available data (training data) so that the resulting model generalizes to new instances (unseen data).

  • Applications of machine learning in business analytics include predicting outcomes for new cases and improving existing models as more data becomes available.
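The distinction between training data and unseen data can be illustrated with a very small model. This sketch assumes hypothetical (income, repaid) pairs and a one-feature threshold rule; neither the data nor the helper names come from the course material:

```python
# Hypothetical (income in thousands, repaid loan: 1 or 0) pairs.
training_data = [(20, 0), (25, 0), (30, 0), (45, 1), (50, 1), (60, 1)]
unseen_data = [(22, 0), (40, 1), (48, 1), (70, 1)]

def accuracy(threshold, rows):
    """Share of rows where 'income >= threshold' matches the actual outcome."""
    correct = sum(1 for income, repaid in rows if (income >= threshold) == bool(repaid))
    return correct / len(rows)

def fit_threshold(rows):
    """Choose the income cut-off that is most accurate on the given rows."""
    candidates = sorted({income for income, _ in rows})
    return max(candidates, key=lambda t: accuracy(t, rows))

threshold = fit_threshold(training_data)          # learned from training data only
train_accuracy = accuracy(threshold, training_data)
unseen_accuracy = accuracy(threshold, unseen_data)
print(threshold, train_accuracy, unseen_accuracy)
```

The rule is perfect on the data it was fitted to but misclassifies one of the four unseen cases, which is why a model should be judged on data it has not seen.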


Classification Example: Approving Loans

  • Inputs for classification include attributes such as age, region, income, etc., leading to outcomes of either approved or not approved for loans.
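As one possible illustration of this loan example, the sketch below classifies a new applicant by majority vote among the three most similar past applicants (k-nearest neighbours). The technique, the data, and the use of only age and income are assumptions made for the example; the notes do not prescribe a specific algorithm:

```python
import math
from collections import Counter

# Hypothetical past applicants: ((age, income in thousands), decision).
past_applicants = [
    ((25, 30), "not approved"),
    ((30, 35), "not approved"),
    ((22, 28), "not approved"),
    ((45, 80), "approved"),
    ((50, 95), "approved"),
    ((38, 70), "approved"),
]

def classify(applicant, labelled_rows, k=3):
    """Return the most common label among the k nearest labelled rows."""
    by_distance = sorted(labelled_rows, key=lambda row: math.dist(applicant, row[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

decision_a = classify((40, 75), past_applicants)
decision_b = classify((24, 29), past_applicants)
print(decision_a, "|", decision_b)
```

In practice the attributes would be rescaled so that no single one dominates the distance, and a categorical attribute such as region would need its own encoding.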


Data Mining Models and Techniques

  • Data mining includes various models focusing on:

    • Descriptive Models: Describing data characteristics.

    • Predictive Models: Making predictions based on historical data.
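Clustering is a typical descriptive model: it summarizes the structure of the data without predicting a target. The following is a minimal one-dimensional k-means sketch on hypothetical customer ages, with starting centers chosen by hand for the example:

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

ages = [22, 25, 27, 24, 58, 61, 63, 60]  # hypothetical customer ages
centers, clusters = k_means_1d(ages, centers=[20.0, 70.0])
print(centers, clusters)
```

The two resulting groups (customers in their twenties and customers around sixty) describe the data; no pre-defined classes were supplied.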


Statistical Concepts

  • Definitions in statistics include:

    • Population: The complete set of items under consideration.

    • Sample: A subset of the population used for analysis.

    • Statistic: A summary measure that describes a characteristic of the sample data.
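The three definitions above can be tied together with a short simulation. This sketch generates a hypothetical population of purchase amounts, draws a random sample, and compares the sample mean (a statistic) with the population mean; all numbers are synthetic:

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical population: 10,000 purchase amounts.
population = [rng.gauss(100, 15) for _ in range(10_000)]

# Sample: a random subset of the population, drawn without replacement.
sample = rng.sample(population, 200)

# Statistic: a summary measure computed from the sample.
sample_mean = statistics.mean(sample)
population_mean = statistics.mean(population)

print(f"population mean = {population_mean:.2f}, sample mean = {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean; quantifying that gap is the job of the hypothesis-testing concepts in the next section.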


Null Hypothesis, P-Values, and Q-Values

  • Null Hypothesis: States that no effect or difference exists.

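  • P-value: The probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true. A small p-value is evidence against the null hypothesis.

  • Q-value: As used in these notes, the confidence level associated with the p-value, computed as its complement (q = 1 − p). For example, a p-value of 0.05 corresponds to a confidence of 95%.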
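A small worked example may help. Suppose, hypothetically, that 16 of 20 customers prefer offer B over offer A, and the null hypothesis is that customers have no preference (each choice is a 50/50 coin flip). The sketch below computes a one-sided exact binomial p-value and the corresponding confidence, taking the q-value in the sense used in these notes (1 minus the p-value):

```python
import math

trials = 20      # hypothetical number of customers surveyed
observed = 16    # hypothetical number who preferred offer B

# Under the null hypothesis, each customer picks B with probability 0.5.
# One-sided p-value: probability of seeing 16 or more B choices out of 20.
p_value = sum(math.comb(trials, k) for k in range(observed, trials + 1)) / 2**trials

# Confidence ("q-value" as defined in these notes): complement of the p-value.
q_value = 1 - p_value

print(f"p-value = {p_value:.4f}, confidence = {q_value:.4f}")
```

The p-value is well below 0.05, so under these assumptions the "no preference" hypothesis would be rejected at the 95% confidence level.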


Machine Learning vs. Statistics

  • Shared Techniques: Both domains utilize similar algorithms and methodologies.

  • Differences:

    • Machine Learning often handles larger datasets compared to traditional statistical analysis.

    • Machine Learning emphasizes predictive performance on new data, whereas statistical analysis emphasizes hypothesis testing and the validity of model assumptions.

    • Machine Learning models are typically updated as new data arrives, whereas traditional statistical analysis usually rests on assumptions fixed in advance.


Data Mining and Knowledge Discovery

  • Bridges statistical theory with machine learning models that improve their performance by learning from data.

  • The full process of data mining includes:

    • Data cleaning

    • Learning

    • Validation

    • Integration

    • Visualization of results


CRISP-DM Data Mining Process Model

  • CRISP-DM (Cross-Industry Standard Process for Data Mining) divides a data mining project into six phases:

    • Business Understanding

    • Data Understanding

    • Data Preparation

    • Modeling

    • Evaluation

    • Deployment


Ethical Issues in Data Mining

  • Critical considerations in data mining include questions related to the following:

    • What conclusions can be legitimately drawn from the data?

    • Are resources and results put to good use?

    • Is the data biased?

    • Is the model explainable?

  • These questions have driven the emergence of Responsible AI and Responsible Data Science as areas of discussion.
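One simple way to start answering "Is the data biased?" is to compare outcome rates across groups. The sketch below uses hypothetical loan decisions and a hypothetical `approval_rates` helper; a gap between groups does not prove discrimination on its own, but it flags something to investigate:

```python
from collections import defaultdict

# Hypothetical loan decisions: (group, approved) where approved is 1 or 0.
decisions = [
    ("F", 1), ("F", 0), ("F", 0), ("F", 1),
    ("M", 1), ("M", 1), ("M", 0), ("M", 1),
]

def approval_rates(rows):
    """Approval rate per group."""
    totals = defaultdict(int)
    approvals = defaultdict(int)
    for group, approved in rows:
        totals[group] += 1
        approvals[group] += approved
    return {group: approvals[group] / totals[group] for group in totals}

rates = approval_rates(decisions)

# Ratio of the lowest to the highest approval rate (1.0 means equal rates).
rate_ratio = min(rates.values()) / max(rates.values())
print(rates, f"ratio = {rate_ratio:.2f}")
```

Which ratio counts as acceptable is a policy and legal question, not something the code can decide.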


References

  • Linoff, G., & Berry, M. (2011). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, 3rd Edition. John Wiley.

  • Shmueli, G., Bruce, P.C., Deokar, A.V., & Patel, N.R. (2023). Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner. John Wiley.