Introduction to Data Mining and Machine Learning

Course Information
- ADM 3308: Business Data Mining
- Telfer School of Management, University of Ottawa

Outline
- Introduction
- Machine Learning vs. Statistical Analysis
- Definition of Data Mining and Machine Learning
- Data Mining and Ethical Issues
- Data Mining Models

Introduction

Data mining and machine learning are increasingly relevant and widely applied in domains such as retail, healthcare, and finance.

Machine Learning vs. Statistical Analysis

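Machine learning and traditional statistical analysis are closely related, but they differ in the data they work with, their tolerance for imprecision, and whether they start from the data or from a hypothesis.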

Definition of Data Mining and Machine Learning

Data Mining:
- Exploration and analysis of large quantities of data to discover meaningful patterns and rules.
- Involves finding hidden patterns in a database.
Machine Learning:
- A technique that uses algorithms to analyze data and learn from it to make predictions or decisions based on new data instances.

Examples in Data Mining

Retail Industry Example: Market Basket Analysis
- Questions Posed:
  - How do the demographics of the neighborhood affect what customers buy?
  - Are bananas purchased when milk is purchased?
  - What should be in the basket but is not?
  - Is soda typically purchased with bananas?
  - Does the brand name make a difference?
- Case Study:
  - In a shopping basket, the shopper purchases:
    - Quart of milk
    - Bananas
    - Dish detergent
    - Window cleaner
    - Bottle of soda
- Outcome:
  - Discover customer groups and use them for targeted marketing.
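
Questions such as "Are bananas purchased when milk is purchased?" come down to counting how often items appear together across many baskets. The following is a minimal sketch of that counting step, not a full market basket analysis. Only the first basket comes from the case study above; the other baskets and the function name `pair_counts` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; only the first one is the case-study basket.
baskets = [
    {"milk", "bananas", "dish detergent", "window cleaner", "soda"},
    {"milk", "bananas"},
    {"soda", "bananas"},
    {"milk", "dish detergent"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair one canonical order, so (a, b) and (b, a)
        # are counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)
print("bananas + milk:", counts[("bananas", "milk")])
print("bananas + soda:", counts[("bananas", "soda")])
```

Pairs with high counts are candidates for the kind of rules discussed under association rules; with real data the counts would be compared against how often each item sells on its own.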

The Data Flood

Context:
- Industries generate an overwhelming amount of data, yet actionable information drawn from that data remains scarce.
Sources of Data:
- Retail industry
- Banking and business transactions
- Telecommunications and mobile services
- Web and clickstream data
- Social media interactions
- E-commerce activities
- Healthcare records and scientific data (astronomy, biology, etc.)

Data Everywhere

Categories of Data Sources:
- Business & Commerce:
  - Corporate sales
  - Stock market transactions
  - Census data
  - Airline traffic
- Humanities and Social Sciences:
  - Scanned books
  - Historical documents
  - Social interaction data
- Entertainment:
  - Internet images
  - Streaming media
  - MP3 files
- Sensors and IoT:
  - Sensor networks
  - Wearable devices
  - Machine-generated data from various devices
- Transportation:
  - GPS data and transportation logs
- Social Media:
  - Data from platforms like Facebook and LinkedIn
- Medicine:
  - MRI/CT scans
  - Patient records
- Science:
  - Databases from astronomy, genomics, environmental studies

Data Mining: Uncover Hidden Information

Definition:
- Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
Knowledge Discovery in Data (KDD):
- Non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.
- Other names include exploratory data analysis, data-driven discovery, deductive learning, and knowledge discovery.
- Reference: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, "Advances in Knowledge Discovery and Data Mining", AAAI/MIT Press 1996.

Multi-disciplinary Field of Science and Technology

Data mining combines techniques from various fields including statistics, machine learning, database technology, and pattern recognition.

Core Concepts in Machine Learning

Training Data vs. New Instances of Data:
- Training data is used to build models that can predict or classify new data instances.
Models:
- Representation of the relationship between input attributes and target outcomes.

Classification Example: Approving Loan, Credit Assessment

Input Attributes:
- Age
- Region
- Income
- Other relevant factors
Classes:
- Approved
- Not Approved
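
To make the distinction between training data and new instances concrete, here is a minimal sketch of a classifier for the loan example. It assumes a k-nearest-neighbour approach, which the notes do not prescribe. The training records, the scaling constants, and the names `distance` and `classify` are all hypothetical.

```python
from collections import Counter

# Hypothetical training data: (age, region, income in $1000s) -> class.
training_data = [
    ((25, "east", 30), "Not Approved"),
    ((45, "west", 90), "Approved"),
    ((35, "east", 70), "Approved"),
    ((22, "west", 20), "Not Approved"),
    ((50, "east", 110), "Approved"),
    ((30, "west", 35), "Not Approved"),
]

def distance(a, b):
    """Dissimilarity: scaled numeric gaps plus a region-mismatch penalty."""
    age_gap = abs(a[0] - b[0]) / 50       # assumed scale for age
    income_gap = abs(a[2] - b[2]) / 100   # assumed scale for income
    region_gap = 0 if a[1] == b[1] else 1
    return age_gap + income_gap + region_gap

def classify(instance, training_data, k=3):
    """Predict the class by majority vote among the k most similar records."""
    nearest = sorted(training_data, key=lambda rec: distance(instance, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# A new instance that was not part of the training data.
new_applicant = (40, "east", 85)
print(classify(new_applicant, training_data))
```

The "model" here is simply the stored training data plus the distance rule. Other techniques, such as decision trees, would instead summarize the relationship between the input attributes and the classes as explicit rules.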

Data Mining Models and Techniques

Models:

Classification:
- Assigning records to predefined classes (e.g., grades: A, B, C, D).
Prediction/Estimation:
- Estimating an unknown or future numeric value, e.g., predicting stock price trends or economic indicators.
Clustering:
- Similar to classification but without predefined classes; used to group customers based on attributes such as age, income, or preferences.
Association Rules (Affinity Grouping):
- Example: Identifying that customers who buy product X also buy product Y with a certain confidence percentage (e.g., confidence = Z%).
- Classification is typically predictive while association can be both predictive and descriptive.
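
The confidence of a rule "X implies Y" is usually taken to be the share of transactions containing X that also contain Y, and its support is the share of all transactions containing both. The sketch below computes both under that standard definition. The transactions and the helper names (`count_containing`, `support`, `confidence`) are made up for illustration.

```python
# Hypothetical transactions; each is the set of products in one purchase.
transactions = [
    {"X", "Y"},
    {"X", "Y", "W"},
    {"X"},
    {"Y"},
    {"X", "Y"},
]

def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    return count_containing(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = count_containing(transactions, antecedent)
    if with_antecedent == 0:
        return 0.0
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / with_antecedent

print("support(X, Y)    =", support(transactions, {"X", "Y"}))
print("confidence(X->Y) =", confidence(transactions, {"X"}, {"Y"}))
```

In this toy data, X appears in four transactions and three of those also contain Y, so the rule "X implies Y" holds with 75% confidence.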

Machine Learning vs. Statistical Analysis

Key Differences:

Data vs. Samples:
- Machine learning deals with large data sets while statistics often works with samples of populations.
- Statistical models may not be feasible for large databases with numerous attributes.
Fuzziness vs. Accuracy:
- Machine learning can accommodate imprecision in data (fuzziness), while traditional statistics focuses on achieving accuracy.
Data-Driven vs. Hypothesis Testing:
- Machine learning emphasizes letting data inform discoveries, unlike statistics which focuses on validating hypotheses.
Dynamics vs. Static:
- Machine learning often uses dynamic datasets that evolve over time, while statistical methods typically engage with static, unchanging samples.

Integration of Statistics, Machine Learning, and Data Mining

Statistics:
- Theory-driven focus intended primarily for hypothesis testing.
Machine Learning:
- Heuristic-driven focus aimed at enhancing the performance of predictive models.
Data Mining & Knowledge Discovery:
- Integrates both theoretical and heuristic approaches to cover the whole process including data cleaning, model learning, validation, integration, and visualization of results.

Data Mining as a Process

CRISP-DM Data Mining Process Model:

Steps:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

Data Mining and Ethical Issues

Key Ethical Considerations:

Privacy Concerns:
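- Anonymizing data is difficult because seemingly harmless attributes can be combined to single out individuals; for example, an often-cited estimate is that about 85% of Americans can be uniquely identified from just a ZIP code, birth date, and gender.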
Problematic Information in Attributes:
- Certain attributes may correlate with sensitive categories such as race or gender, leading to potential biases and discrimination.
Ethical Implications of Data Mining:
- Application of data mining techniques can lead to unethical discrimination (e.g., in loan applications).
- Reference: Medium article on algorithmic bias in data.
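
The privacy point above can be checked on any dataset by asking how many records are unique on a combination of attributes. The following is a minimal sketch of that check. The records are invented, and the function name `unique_fraction` is an assumption for illustration.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, other attributes kept.
records = [
    {"zip": "10001", "birth_date": "1980-01-15", "gender": "F"},
    {"zip": "10001", "birth_date": "1980-01-15", "gender": "F"},
    {"zip": "10001", "birth_date": "1975-06-02", "gender": "M"},
    {"zip": "20002", "birth_date": "1990-11-30", "gender": "F"},
    {"zip": "20002", "birth_date": "1962-03-08", "gender": "M"},
]

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

print(unique_fraction(records, ["zip"]))
print(unique_fraction(records, ["zip", "birth_date", "gender"]))
```

In this toy data no record is unique on ZIP code alone, but three of the five become unique once birth date and gender are added, which is why removing names is not enough to anonymize a dataset.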

Important Questions for Ethical Data Mining:

- What types of conclusions can legitimately be drawn from the data?
- Are resources effectively utilized?
- Is the data inherently biased?
- Is the model developed explainable to stakeholders?

Related concepts for discussion: Responsible AI and Responsible Data Science.

References

Linoff, G., & Berry, M. (2011). "Data Mining Techniques for Marketing, Sales, and Customer Relationship Management", Chapter 1. John Wiley.
Shmueli, G., Bruce, P. C., Deokar, A. V., & Patel, N. R. (2023). "Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner", Chapter 1. John Wiley.
