Introduction To Data Mining And Machine Learning
Course Information
ADM 3308: Business Data Mining
Telfer School of Management, University of Ottawa
Table of Contents
Introduction
Machine Learning vs. Statistical Analysis
Definition of Data Mining and Machine Learning
Data Mining and Ethical Issues
Data Mining Models
Introduction
Discussion on the increasing relevance and application of data mining and machine learning in various domains such as retail, healthcare, and finance.
Machine Learning vs. Statistical Analysis
Overview of the relationship and distinctions between machine learning and traditional statistical analysis.
Definition of Data Mining and Machine Learning
Data Mining:
Exploration and analysis of large quantities of data to discover meaningful patterns and rules.
Involves finding hidden patterns in a database.
Machine Learning:
A technique that uses algorithms to analyze data and learn from it to make predictions or decisions based on new data instances.
Examples in Data Mining
Retail Industry Example: Market Basket Analysis
Questions Posed:
How do the demographics of the neighborhood affect what customers buy?
Are bananas purchased when milk is purchased?
What should be in the basket but is not?
Is soda typically purchased with bananas?
Does the brand name make a difference?
Case Study:
In a shopping basket, the shopper purchases:
Quart of milk
Bananas
Dish detergent
Window cleaner
Bottle of soda
Outcome:
Discover customer groups and utilize them for targeted marketing.
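Questions such as "Are bananas purchased when milk is purchased?" come down to counting how often items appear together. The following is a minimal sketch of that counting step. The baskets are invented for illustration (only the first one matches the case study above), and the helper name pair_counts is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
# The first basket is the one from the case study.
baskets = [
    {"milk", "bananas", "dish detergent", "window cleaner", "soda"},
    {"milk", "bananas"},
    {"soda", "bananas"},
    {"milk", "dish detergent"},
    {"milk", "bananas", "soda"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a single canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
print("milk + bananas:", counts[("bananas", "milk")])
print("soda + bananas:", counts[("bananas", "soda")])
print("most common pairs:", counts.most_common(3))
```

Raw co-occurrence counts are only a starting point. A real analysis would run on many more baskets and would normalize the counts, for example as support and confidence.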
The Data Flood
Context:
Industries generate an overwhelming amount of data, yet actionable information remains scarce.
Sources of Data:
Retail industry
Banking and business transactions
Telecommunications and mobile services
Web and clickstream data
Social media interactions
E-commerce activities
Healthcare records and scientific data (astronomy, biology, etc.)
Data Everywhere
Categories of Data Sources:
Business & Commerce:
Corporate sales
Stock market transactions
Census data
Airline traffic
Humanities and Social Sciences:
Scanned books
Historical documents
Social interaction data
Entertainment:
Internet images
Streaming media
MP3 files
Sensors and IoT:
Sensor networks
Wearable devices
Machine-generated data from various devices
Transportation:
GPS data and transportation logs
Social Media:
Data from platforms like Facebook and LinkedIn
Medicine:
MRI/CT scans
Patient records
Science:
Databases from astronomy, genomics, environmental studies
Data Mining: Uncover Hidden Information
Definition:
Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
Knowledge Discovery in Data (KDD):
Non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.
Other names include exploratory data analysis, data-driven discovery, deductive learning, and knowledge discovery.
Reference: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, "Advances in Knowledge Discovery and Data Mining", AAAI/MIT Press 1996.
Multi-disciplinary Field of Science and Technology
Data mining combines techniques from various fields including statistics, machine learning, database technology, and pattern recognition.
Core Concepts in Machine Learning
Training Data vs. New Instances of Data:
Training data is used to build models that can predict or classify new data instances.
Models:
Representation of the relationship between input attributes and target outcomes.
Classification Example: Approving Loan, Credit Assessment
Input Attributes:
Age
Region
Income
Other relevant factors
Classes: Approved, Not Approved.
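To make the split between training data and new instances concrete, here is a minimal sketch of a very simple classifier for the loan example. It is a one-attribute threshold rule (a "decision stump") on income, chosen only because it is easy to follow; it is not the course's prescribed method. All applicant records, the income units, and the function names are hypothetical, and age and region are carried along but deliberately ignored.

```python
# Hypothetical training data: (age, region, income in $1000s, class label).
training_data = [
    (25, "urban", 30, "Not Approved"),
    (40, "rural", 45, "Not Approved"),
    (35, "urban", 60, "Approved"),
    (50, "urban", 85, "Approved"),
    (29, "rural", 38, "Not Approved"),
    (45, "rural", 72, "Approved"),
]


def train_income_stump(rows):
    """Learn the income threshold that misclassifies the fewest training rows."""
    incomes = sorted({income for _, _, income, _ in rows})
    # Candidate thresholds: midpoints between consecutive distinct incomes.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        # A row is an error when the rule's prediction disagrees with its label.
        return sum(
            (income >= threshold) != (label == "Approved")
            for _, _, income, label in rows
        )

    return min(candidates, key=errors)


def classify(threshold, income):
    """Apply the learned model to one new instance."""
    return "Approved" if income >= threshold else "Not Approved"


# Build the model from training data, then apply it to new instances.
threshold = train_income_stump(training_data)
new_applicants = [(33, "urban", 70), (52, "rural", 35)]  # no labels yet
predictions = [classify(threshold, income) for _, _, income in new_applicants]
print(threshold, predictions)
```

The learned threshold is the model: it represents the relationship between an input attribute (income) and the target class. Practical models combine several attributes rather than relying on one.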
Data Mining Models and Techniques
Models:
Classification:
Grouping data into predefined classes (e.g., grades: A, B, C, D).
Prediction/Estimation:
Example: predicting stock price trends or economic indicators.
Clustering:
Similar to classification but without predefined classes; used to group customers based on attributes such as age, income, or preferences.
Association Rules (Affinity Grouping):
Example: Identifying that customers who buy product X also buy product Y with a certain confidence percentage (e.g., confidence = Z%).
Classification is typically predictive while association can be both predictive and descriptive.
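The confidence of an association rule "X implies Y" is commonly computed as the number of transactions containing both X and Y divided by the number containing X. The sketch below shows that calculation together with support, the fraction of all transactions containing the items. The transactions and function names are hypothetical.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    n_both = count_containing(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent


# Hypothetical transactions over products X, Y, and W.
transactions = [
    {"X", "Y"},
    {"X", "Y", "W"},
    {"X"},
    {"Y"},
    {"X", "Y"},
]

rule_support = support(transactions, {"X", "Y"})
rule_confidence = confidence(transactions, {"X"}, {"Y"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In these invented transactions, X appears in four baskets and Y accompanies it in three of them, so the rule "X implies Y" has a confidence of 3/4.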
Machine Learning vs. Statistical Analysis
Key Differences:
Data vs. Samples:
Machine learning deals with large data sets while statistics often works with samples of populations.
Statistical models may not be feasible for large databases with numerous attributes.
Fuzziness vs. Accuracy:
Machine learning can accommodate imprecision in data (fuzziness), while traditional statistics focuses on achieving accuracy.
Data-Driven vs. Hypothesis Testing:
Machine learning emphasizes letting the data inform discoveries, unlike statistics, which focuses on validating hypotheses.
Dynamic vs. Static:
Machine learning often uses dynamic datasets that evolve over time, while statistical methods typically engage with static, unchanging samples.
Integration of Statistics, Machine Learning, and Data Mining
Statistics:
Theory-driven focus intended primarily for hypothesis testing.
Machine Learning:
Heuristic-driven focus aimed at enhancing the performance of predictive models.
Data Mining & Knowledge Discovery:
Integrates both theoretical and heuristic approaches to cover the whole process including data cleaning, model learning, validation, integration, and visualization of results.
Data Mining as a Process
CRISP-DM Data Mining Process Model:
Steps:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Data Mining and Ethical Issues
Key Ethical Considerations:
Privacy Concerns:
Anonymizing data is difficult: by widely cited estimates, more than 85% of Americans can be uniquely identified from just a ZIP code, birth date, and gender, so removing names alone does not protect privacy.
Problematic Information in Attributes:
Certain attributes may correlate with sensitive categories such as race or gender, leading to potential biases and discrimination.
Ethical Implications of Data Mining:
Application of data mining techniques can lead to unethical discrimination (e.g., in loan applications).
Reference: Medium article on algorithmic bias in data.
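The privacy concern above can be checked on a dataset before it is released. The following is a minimal sketch that measures what fraction of records are unique on a set of quasi-identifiers. The records, field names, and the function unique_fraction are all hypothetical.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "10001", "birth_date": "1980-03-14", "gender": "F"},
    {"zip": "10001", "birth_date": "1980-03-14", "gender": "F"},
    {"zip": "10001", "birth_date": "1975-11-02", "gender": "M"},
    {"zip": "10002", "birth_date": "1990-07-21", "gender": "F"},
    {"zip": "10003", "birth_date": "1990-07-21", "gender": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")


def unique_fraction(records, keys=QUASI_IDENTIFIERS):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


print("unique on all three attributes:", unique_fraction(records))
print("unique on zip alone:", unique_fraction(records, keys=("zip",)))
```

A record that is unique on these attributes could be linked to a named person through any outside source that lists the same three fields. Each attribute looks harmless alone; it is the combination that identifies.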
Important Questions for Ethical Data Mining:
What types of conclusions can legitimately be drawn from the data?
Are resources effectively utilized?
Is the data inherently biased?
Is the model developed explainable to stakeholders?
Discussion of concepts such as Responsible AI and Responsible Data Science.
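One simple first check for the question "Is the data inherently biased?" is to compare outcome rates across groups. The sketch below does this for loan decisions. The decisions, the group labels "A" and "B", and the function names are hypothetical, and a gap in rates is a prompt for further investigation, not proof of discrimination.

```python
from collections import defaultdict

# Hypothetical loan decisions: (group label, approved?).
decisions = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]


def approval_rates(decisions):
    """Approval rate per group: approved count divided by total count."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Lowest group rate divided by highest; 1.0 means equal rates."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group, so rates are equal
    return min(rates.values()) / highest


rates = approval_rates(decisions)
print(rates, round(rate_ratio(rates), 2))
```

The same comparison can be run on the historical labels in the training data and on a model's predictions, which helps separate bias already present in the data from bias introduced by the model.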
References
Linoff, G., & Berry, M. (2011). "Data Mining Techniques for Marketing, Sales, and Customer Relationship Management", Chapter 1. John Wiley.
Shmueli, G., Bruce, P. C., Deokar, A. V., & Patel, N. R. (2023). "Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner", Chapter 1. John Wiley.