Introduction To Data Mining And Machine Learning

ADM 3308: Business Data Mining - Telfer School Of Management, University Of Ottawa

Introduction
Machine Learning vs. Statistical Analysis
Definition of Data Mining and Machine Learning
Data Mining and Ethical Issues
Data Mining Models

Introduction

This course gives an overview of Data Mining and Machine Learning concepts as they apply in business contexts.

Machine Learning vs. Statistical Analysis

Machine Learning and Statistical Analysis are often compared and contrasted.
- Machine Learning: Focuses on developing algorithms that allow computers to learn from and make predictions based on data.
- Statistical Analysis: Involves analyzing data to derive conclusions based on a set of assumptions, often employing hypothesis testing.

Definition of Data Mining and Machine Learning

Data Mining: The exploration and analysis of a large quantity of data to discover meaningful patterns and rules.
Knowledge Discovery in Data (KDD): The non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.
Key features of KDD:
- Exploratory data analysis
- Data-driven discovery
- Inductive learning (generalizing from observed examples to patterns and rules)

Data Mining and Ethical Issues

Data mining practices raise ethical concerns, including:
- Privacy concerns surrounding user data.
- The difficulty of anonymizing data effectively: by one frequently cited estimate, about 85% of Americans can be uniquely identified from just ZIP code, birth date, and gender.
- Risk of discriminatory outcomes, for example in loan approvals, when models use sensitive attributes such as gender or race.
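
The re-identification risk can be illustrated with a minimal sketch. It assumes a small, entirely hypothetical table from which names have been removed but ZIP code, birth date, and gender remain; the field names and the helper `unique_fraction` are illustrative, not part of the course material. It counts how many records are the only one with their combination of these three attributes.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "10001", "birth_date": "1985-03-12", "gender": "F"},
    {"zip": "10001", "birth_date": "1985-03-12", "gender": "F"},
    {"zip": "10001", "birth_date": "1990-07-30", "gender": "M"},
    {"zip": "94105", "birth_date": "1978-11-02", "gender": "F"},
    {"zip": "94105", "birth_date": "1978-11-02", "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the fraction of rows whose quasi-identifier combination is unique."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique_rows = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique_rows / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on {QUASI_IDENTIFIERS}")
```

A record that is unique on these attributes could be linked to a named person by anyone holding another dataset with the same three fields, which is why removing names alone may not anonymize the data.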

Data Mining Models

Different data mining models and techniques include:
- Classification: Assigning records to pre-defined classes (e.g., grades A, B, C, D). Example: approving or rejecting loans based on attributes such as age, region, and income.
- Prediction/Estimation: Estimating a numeric value, such as forecasting stock prices.
- Clustering: Grouping similar records when no classes are defined in advance (e.g., segmenting customers by age or income).
- Association Rules (Affinity Grouping): Identifying rules of the form "if people buy product X, they also buy product Y", each with a confidence of Z%.
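
The confidence of an association rule can be computed directly from transaction data. The sketch below assumes a handful of hypothetical market baskets; the product names and the helper functions `support` and `confidence` are illustrative. Confidence for "X implies Y" is taken as the share of baskets containing X that also contain Y.

```python
# Hypothetical market baskets; product names are illustrative only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) as support(X and Y) / support(X)."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    both = support(set(antecedent) | set(consequent), baskets)
    return both / antecedent_support


z = confidence({"bread"}, {"butter"}, transactions)
print(f"bread -> butter holds with confidence {z:.0%}")
```

Here bread appears in four of the five baskets and bread with butter in three, so the rule holds with confidence 3/4.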

Data Structures

Data come from many sources and in many structures, including:
- Business & Commerce: Corporate sales, stock transactions, etc.
- Humanities and Social Sciences: Scanned books, historical documents, etc.
- Entertainment: Internet images, streaming media.
- Scientific Data: Data from astronomy, biology, etc.
- Internet of Things (IoT): Data generated by machines and sensors.

What To Do With These Data?

The central question is how to derive actionable insights from the vast amounts of data generated across industries.

Data Mining Uncovers Hidden Information

The process of data mining involves identifying and extracting hidden patterns and information within a dataset.

Multi-Disciplinary Field of Science and Technology

Data Mining is recognized as a multi-disciplinary field, bridging various domains including computer science, statistics, and domain-specific areas.

Machine Learning

Machine learning involves learning from available data (training data) so that the resulting model performs well on new instances (unseen data).
Applications in business analytics include predictive use cases and the ongoing improvement of models as more data become available.

Classification Example: Approving Loans

The inputs are applicant attributes such as age, region, and income; the output is one of two classes: approved or not approved.
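
As a rough illustration of such a classifier, the sketch below uses k-nearest neighbours, a technique chosen here for brevity and not necessarily the one used in the course. The training records, the scaling of age and income, and the region penalty are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical past applications: (age, income in thousands, region) -> decision.
training = [
    ((25, 30, "urban"), "not approved"),
    ((45, 90, "urban"), "approved"),
    ((35, 70, "rural"), "approved"),
    ((22, 20, "rural"), "not approved"),
    ((50, 40, "urban"), "not approved"),
    ((40, 85, "rural"), "approved"),
]


def distance(a, b):
    """Euclidean distance on scaled age and income, plus a penalty if regions differ."""
    age_gap = (a[0] - b[0]) / 10      # assumed scaling: 10 years = 1 unit
    income_gap = (a[1] - b[1]) / 10   # assumed scaling: 10 thousand = 1 unit
    region_gap = 0 if a[2] == b[2] else 1
    return math.sqrt(age_gap**2 + income_gap**2 + region_gap**2)


def classify(applicant, k=3):
    """Predict the class by majority vote among the k nearest training examples."""
    nearest = sorted(training, key=lambda example: distance(applicant, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((42, 80, "urban")))
print(classify((23, 25, "urban")))
```

The first applicant resembles past approved applications and the second resembles past rejections, so the two calls return different classes. The classes are pre-defined, which is what distinguishes this from clustering.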

Data Mining Models and Techniques

Data mining includes various models focusing on:
- Descriptive Models: Describing data characteristics.
- Predictive Models: Making predictions based on historical data.

Statistical Concepts

Definitions in statistics include:
- Population: The complete set of items under consideration.
- Sample: A subset of the population used for analysis.
- Statistic: A summary measure that describes a characteristic of the sample data.
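
These three terms can be made concrete with a short sketch. The population below is simulated (hypothetical order values drawn from a normal distribution), and the sample size of 200 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Population: hypothetical order values for all 10,000 customers.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Sample: a random subset of the population that is actually analyzed.
sample = random.sample(population, 200)

# Statistic: a summary measure computed from the sample.
sample_mean = statistics.mean(sample)

# In practice the population mean is usually unknown; it is available here
# only because the population was simulated.
population_mean = statistics.mean(population)

print(f"sample mean={sample_mean:.2f}, population mean={population_mean:.2f}")
```

The sample mean is used as an estimate of the population mean, and the two are typically close but not identical.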

Null Hypothesis, P-Values, and Q-Values

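Null Hypothesis: The default assumption that no effect or difference exists; a statistical test asks whether the data provide enough evidence to reject it.
P-value: The probability of observing a result at least as extreme as the one actually observed, assuming that the null hypothesis is true. It is not the probability that the null hypothesis is true.
Q-value: The confidence level associated with the p-value, computed as its complement (q-value = 1 - p-value). For example, a p-value of 5% corresponds to a confidence of 95%.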
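
A minimal sketch of how a p-value can be computed, assuming a hypothetical experiment in which 9 of 10 customers prefer a new page layout. The null hypothesis is that customers have no preference, so each picks the new layout with probability 0.5. The q-value defined above is interpreted here as 1 minus the p-value; the function name `binomial_p_value` is illustrative.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Probability of 9 or more preferences out of 10 if there is truly no preference.
p_value = binomial_p_value(9, 10)

# Confidence in the sense used in these notes: the complement of the p-value.
q_value = 1 - p_value

print(f"p-value={p_value:.4f}, q-value={q_value:.4f}")
```

Under the null hypothesis, 9 or 10 preferences can arise in 11 of the 1,024 equally likely outcomes, so the p-value is 11/1024, or about 1.1%. A result this unlikely under "no preference" would usually lead to rejecting the null hypothesis.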

Machine Learning vs. Statistics

Shared Techniques: Both domains utilize similar algorithms and methodologies.
Differences:
- Machine Learning often handles larger datasets compared to traditional statistical analysis.
- Machine Learning focuses on data behavior and prediction capabilities, whereas statistical analysis emphasizes hypothesis testing and model validity.
- The dynamic nature of Machine Learning contrasts with the static assumptions in traditional statistics.

Data Mining and Knowledge Discovery

Data mining bridges statistical theory and machine learning models, that is, models whose performance improves as they learn from data.
The full process of data mining includes:
- Data cleaning
- Learning
- Validation
- Integration
- Visualization of results
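
The first three steps can be sketched end to end on a toy dataset; integration and visualization are left out. Everything here is an assumption made for illustration: the records are hypothetical, the split is a simple deterministic holdout, and the model is a single income threshold.

```python
# Hypothetical raw records: (income in thousands, defaulted?), some values missing.
raw = [
    (20, True), (25, True), (None, False), (30, True), (35, True),
    (60, False), (65, False), (70, None), (80, False), (90, False),
    (28, True), (75, False),
]

# Data cleaning: drop records with a missing field.
clean = [record for record in raw if None not in record]

# Hold out every third clean record for validation (a simple deterministic split).
validation = clean[::3]
training = [record for i, record in enumerate(clean) if i % 3 != 0]


def predict(income, threshold):
    """Predict a default when income falls below the threshold."""
    return income < threshold


def accuracy(rows, threshold):
    """Share of rows whose predicted label matches the recorded one."""
    return sum(predict(income, threshold) == label for income, label in rows) / len(rows)


# Learning: choose the candidate threshold with the best training accuracy.
candidates = sorted(income for income, _ in training)
threshold = max(candidates, key=lambda t: accuracy(training, t))

# Validation: measure performance on records not used for learning.
validation_accuracy = accuracy(validation, threshold)

print(f"threshold={threshold}, validation accuracy={validation_accuracy:.0%}")
```

Measuring accuracy on held-out records, not on the training records, is what gives an honest estimate of how the model would behave on unseen data.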

CRISP-DM Data Mining Process Model

CRISP-DM (Cross-Industry Standard Process for Data Mining) organizes a data mining project into six phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

Ethical Issues in Data Mining

Critical considerations in data mining include questions related to the following:
- What conclusions can be legitimately drawn from the data?
- Are resources and results put to good use?
- Is the data biased?
- Is the model explainable?
These questions have given rise to the current discussions of Responsible AI and Responsible Data Science.

References

Linoff, G., & Berry, M. (2011). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, 3rd Edition. John Wiley.
Shmueli, G., Bruce, P.C., Deokar, A.V., & Patel, N.R. (2023). Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner. John Wiley.
