Introduction to Data Mining and Machine Learning

  • Course Information

    • ADM 3308: Business Data Mining

    • Telfer School of Management, University of Ottawa

Table of Contents

  • Introduction

  • Machine Learning vs. Statistical Analysis

  • Definition of Data Mining and Machine Learning

  • Data Mining and Ethical Issues

  • Data Mining Models

Introduction

  • Data mining and machine learning are increasingly relevant and widely applied in domains such as retail, healthcare, and finance.

Machine Learning vs. Statistical Analysis

  • Overview of the relationship and distinctions between machine learning and traditional statistical analysis.

Definition of Data Mining and Machine Learning

  • Data Mining:

    • Exploration and analysis of large quantities of data to discover meaningful patterns and rules.

    • Involves finding hidden patterns in a database.

  • Machine Learning:

    • A set of techniques in which algorithms learn from data in order to make predictions or decisions about new data instances.

Examples in Data Mining

  • Retail Industry Example: Market Basket Analysis

    • Questions Posed:

      • How do the demographics of the neighborhood affect what customers buy?

      • Are bananas purchased when milk is purchased?

      • What should be in the basket but is not?

      • Is soda typically purchased with bananas?

      • Does the brand name make a difference?

    • Case Study:

      • In a shopping basket, the shopper purchases:

        • Quart of milk

        • Bananas

        • Dish detergent

        • Window cleaner

        • Bottle of soda

    • Outcome:

      • Discover customer groups and utilize them for targeted marketing.
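
A question such as "Are bananas purchased when milk is purchased?" can be explored by counting how often pairs of items share a basket. The following is a minimal sketch, not a full market basket analysis; only the first basket comes from the case study above, and the other baskets, along with the name `pair_counts`, are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Only the first basket is from the case study; the rest are hypothetical.
baskets = [
    {"milk", "bananas", "dish detergent", "window cleaner", "soda"},
    {"milk", "bananas", "bread"},
    {"soda", "chips"},
    {"milk", "bread"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a single canonical ordering
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
milk_baskets = sum(1 for b in baskets if "milk" in b)
milk_and_bananas = counts[("bananas", "milk")]
print(f"{milk_and_bananas} of {milk_baskets} milk baskets also contain bananas")
```

With real transaction data, the same counts would be computed over many thousands of baskets, and frequently co-occurring pairs would be candidates for further analysis.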

The Data Flood

  • Context:

    • Industries generate an overwhelming amount of data, yet actionable information remains scarce.

  • Sources of Data:

    • Retail industry

    • Banking and business transactions

    • Telecommunications and mobile services

    • Web and clickstream data

    • Social media interactions

    • E-commerce activities

    • Healthcare records and scientific data (astronomy, biology, etc.)

Data Everywhere

  • Categories of Data Sources:

    • Business & Commerce:

      • Corporate sales

      • Stock market transactions

      • Census data

      • Airline traffic

    • Humanities and Social Sciences:

      • Scanned books

      • Historical documents

      • Social interaction data

    • Entertainment:

      • Internet images

      • Streaming media

      • MP3 files

    • Sensors and IoT:

      • Sensor networks

      • Wearable devices

      • Machine-generated data from various devices

    • Transportation:

      • GPS data and transportation logs

    • Social Media:

      • Data from platforms like Facebook and LinkedIn

    • Medicine:

      • MRI/CT scans

      • Patient records

    • Science:

      • Databases from astronomy, genomics, environmental studies

Data Mining: Uncover Hidden Information

  • Definition:

    • Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

  • Knowledge Discovery in Data (KDD):

    • Non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.

    • Other names include exploratory data analysis, data-driven discovery, inductive learning, and knowledge discovery.

    • Reference: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, "Advances in Knowledge Discovery and Data Mining", AAAI/MIT Press 1996.

Multi-disciplinary Field of Science and Technology

  • Data mining combines techniques from various fields including statistics, machine learning, database technology, and pattern recognition.

Core Concepts in Machine Learning

  • Training Data vs. New Instances of Data:

    • Training data is used to build models that can predict or classify new data instances.

  • Models:

    • Representation of the relationship between input attributes and target outcomes.

Classification Example: Approving Loan, Credit Assessment

  • Input Attributes:

    • Age

    • Region

    • Income

    • Other relevant factors

  • Classes: Approved, Not Approved.
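
To make the distinction between training data and new instances concrete, here is a minimal sketch of a nearest-neighbor classifier for the loan example. This is one possible technique, chosen for brevity; the training records, the use of only age and income, and the function name `classify` are all assumptions made for illustration.

```python
import math

# Hypothetical training data: (age, annual income in $1000s) -> class label.
training_data = [
    ((25, 30), "Not Approved"),
    ((40, 85), "Approved"),
    ((35, 60), "Approved"),
    ((22, 20), "Not Approved"),
    ((50, 95), "Approved"),
    ((30, 28), "Not Approved"),
]


def classify(instance, training_data):
    """Return the class label of the training record closest to `instance`."""
    # Euclidean distance between the new instance and each training record
    _, nearest_label = min(
        training_data, key=lambda record: math.dist(record[0], instance)
    )
    return nearest_label


# New instances: applicants that were not part of the training data.
print(classify((45, 90), training_data))
print(classify((24, 25), training_data))
```

Here the "model" is simply the stored training data plus a distance rule. In practice, attributes on different scales would usually be normalized first, and categorical attributes such as region would need an encoding.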

Data Mining Models and Techniques

Models:

  • Classification:

    • Grouping data into predefined classes (e.g., grades: A, B, C, D).

  • Prediction/Estimation:

    • Estimating an unknown or future numeric value (e.g., predicting stock price trends or economic indicators).

  • Clustering:

    • Similar to classification but without predefined classes; used to group customers based on attributes such as age, income, or preferences.

  • Association Rules (Affinity Grouping):

    • Example: Identifying that customers who buy product X also buy product Y with a certain confidence percentage (e.g., confidence = Z%).

    • Classification is typically predictive while association can be both predictive and descriptive.
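
The confidence of an association rule "X implies Y" is the fraction of transactions containing X that also contain Y. The sketch below computes support and confidence under that standard definition; the five transactions and the function names are hypothetical.

```python
# Hypothetical transactions over products V, W, X, Y.
transactions = [
    {"X", "Y", "W"},
    {"X", "Y"},
    {"X", "W"},
    {"Y", "W"},
    {"X", "Y", "V"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


print(f"support(X and Y) = {support({'X', 'Y'}, transactions):.2f}")
print(f"confidence(X -> Y) = {confidence({'X'}, {'Y'}, transactions):.2f}")
```

In this toy data, X appears in four transactions and three of those also contain Y, so the rule "customers who buy X also buy Y" holds with a confidence of 3/4, or 75%.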

Machine Learning vs. Statistical Analysis

Key Differences:

  • Data vs. Samples:

    • Machine learning deals with large data sets while statistics often works with samples of populations.

    • Statistical models may not be feasible for large databases with numerous attributes.

  • Fuzziness vs. Accuracy:

    • Machine learning can accommodate imprecision in data (fuzziness), while traditional statistics focuses on achieving accuracy.

  • Data-Driven vs. Hypothesis Testing:

    • Machine learning emphasizes letting data inform discoveries, unlike statistics which focuses on validating hypotheses.

  • Dynamic vs. Static:

    • Machine learning often uses dynamic datasets that evolve over time, while statistical methods typically work with static, unchanging samples.

Integration of Statistics, Machine Learning, and Data Mining

  • Statistics:

    • Theory-driven focus intended primarily for hypothesis testing.

  • Machine Learning:

    • Heuristic-driven focus aimed at enhancing the performance of predictive models.

  • Data Mining & Knowledge Discovery:

    • Integrates both theoretical and heuristic approaches to cover the whole process including data cleaning, model learning, validation, integration, and visualization of results.

Data Mining as a Process

CRISP-DM Data Mining Process Model:

  • Steps:

    • Business Understanding

    • Data Understanding

    • Data Preparation

    • Modeling

    • Evaluation

    • Deployment

Data Mining and Ethical Issues

Key Ethical Considerations:

  • Privacy Concerns:

    • Anonymizing data is harder than it looks: removing names is not enough. For example, it has been estimated that about 85% of Americans can be identified using just a five-digit zip code, birth date, and gender.

  • Problematic Information in Attributes:

    • Certain attributes may correlate with sensitive categories such as race or gender, leading to potential biases and discrimination.

  • Ethical Implications of Data Mining:

    • Application of data mining techniques can lead to unethical discrimination (e.g., in loan applications).

    • Reference: Medium article on algorithmic bias in data.
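
The privacy concern above can be illustrated by checking how many records in a dataset are unique on the combination of zip code, birth date, and gender. This is a minimal sketch with made-up records; the field names and the helper `unique_fraction` are assumptions, not part of any particular tool.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "10001", "birth_date": "1985-03-12", "gender": "F"},
    {"zip": "10001", "birth_date": "1985-03-12", "gender": "F"},
    {"zip": "10001", "birth_date": "1990-07-01", "gender": "M"},
    {"zip": "20002", "birth_date": "1972-11-30", "gender": "F"},
    {"zip": "20002", "birth_date": "1972-11-30", "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def unique_fraction(records, keys=QUASI_IDENTIFIERS):
    """Fraction of records whose combination of `keys` appears only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


print(f"{unique_fraction(records):.0%} of records are unique on {QUASI_IDENTIFIERS}")
```

Any record that is unique on these attributes could be re-identified by linking it to an outside source that lists the same three attributes alongside a name.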

Important Questions for Ethical Data Mining:

  • What types of conclusions can legitimately be drawn from the data?

  • Are resources effectively utilized?

  • Is the data inherently biased?

  • Is the resulting model explainable to stakeholders?

  • Discussion of concepts such as Responsible AI and Responsible Data Science.
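
One simple way to start probing the question "Is the data inherently biased?" is to compare outcome rates across groups. The sketch below does this for hypothetical loan decisions; the groups "A" and "B", the decisions, and the function name `approval_rates` are invented for illustration. A gap in rates is a signal to investigate, not proof of discrimination on its own.

```python
from collections import defaultdict

# Hypothetical loan decisions: (group, was_approved).
# "group" stands in for a sensitive attribute or a proxy for one.
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]


def approval_rates(decisions):
    """Return the fraction of approved applications for each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / totals[group] for group in totals}


rates = approval_rates(decisions)
# Ratio of the lowest to the highest approval rate; 1.0 means equal rates.
ratio = min(rates.values()) / max(rates.values())
print(rates)
print(f"lowest-to-highest approval rate ratio: {ratio:.2f}")
```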

References

  • Linoff, G., & Berry, M. (2011). "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management", Chapter 1. John Wiley.

  • Shmueli, G., Bruce, P. C., Deokar, A. V., & Patel, N. R. (2023). "Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner", Chapter 1. John Wiley.