Introduction To Data Mining And Machine Learning
ADM 3308: Business Data Mining - Telfer School Of Management, University Of Ottawa
Table of Contents
Introduction
Machine Learning vs. Statistical Analysis
Definition of Data Mining and Machine Learning
Data Mining and Ethical Issues
Data Mining Models
Introduction
Overview of the course focused on Data Mining and Machine Learning concepts applicable in business contexts.
Machine Learning vs. Statistical Analysis
Machine Learning and Statistical Analysis are often compared and contrasted.
Machine Learning: Focuses on developing algorithms that allow computers to learn from and make predictions based on data.
Statistical Analysis: Involves analyzing data to derive conclusions based on a set of assumptions, often employing hypothesis testing.
Definition of Data Mining and Machine Learning
Data Mining: The exploration and analysis of a large quantity of data to discover meaningful patterns and rules.
Knowledge Discovery in Data (KDD): The non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data.
Key Features of KDD:
Exploratory data analysis
Data-driven discovery
Inductive learning (general patterns are inferred from specific observations in the data)
Data Mining and Ethical Issues
Discussion on ethical implications associated with data mining practices, including:
Privacy concerns surrounding user data.
Challenges of anonymizing data effectively.
A widely cited estimate that about 85% of Americans can be uniquely identified using just zip code, birth date, and gender.
Risk of discriminatory practices in data mining, such as in loan approvals, if using sensitive attributes like gender or race.
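The re-identification risk above can be illustrated with a minimal sketch. It counts how many records are unique on the combination of zip code, birth date, and gender. The records and the helper name unique_fraction are hypothetical, made up for illustration only.

```python
from collections import Counter

# Hypothetical "anonymized" records: (zip_code, birth_date, gender).
# Names were removed, but these three attributes remain. All values are made up.
records = [
    ("02139", "1990-04-12", "F"),
    ("02139", "1990-04-12", "F"),
    ("02139", "1985-11-03", "M"),
    ("90210", "1972-06-30", "F"),
    ("90210", "2001-01-15", "M"),
]


def unique_fraction(rows):
    """Return the fraction of rows whose attribute combination occurs exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


fraction = unique_fraction(records)
print(f"{fraction:.0%} of records are unique on (zip, birth date, gender)")
```

A record that is unique on these attributes could be matched against another dataset that carries names, which is why removing names alone may not anonymize data.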
Data Mining Models
Different data mining models and techniques include:
Classification: Grouping data into pre-defined classes (e.g., grades A, B, C, D).
Example: Approving loans based on attributes like age, region, and income.
Prediction/Estimation: Estimating an unknown or future numeric value (e.g., forecasting stock prices).
Clustering: Similar to classification but without pre-defined classes (e.g., clustering customers based on age or income).
Association Rules (Affinity Grouping): Identifying patterns like: if people buy product X, they often buy product Y with a given confidence percentage (Z%).
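As a rough illustration of the association rule idea, the sketch below computes the support and confidence of the rule "if bread, then butter" on a handful of made-up baskets. The baskets, product names, and function names are assumptions for illustration and do not come from the course material.

```python
# Hypothetical market baskets; each set is one customer's purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Share of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the share that also has the consequent."""
    both = count(antecedent | consequent, transactions)
    return both / count(antecedent, transactions)


rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
# prints: support = 0.60, confidence = 0.75
```

Here the confidence of 75% plays the role of Z%: of the four baskets containing bread, three also contain butter.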
Data Structures
Various sources and structures of data explored, including:
Business & Commerce: Corporate sales, stock transactions, etc.
Humanities and Social Sciences: Scanned books, historical documents, etc.
Entertainment: Internet images, streaming media.
Scientific Data: Data from astronomy, biology, etc.
Internet of Things (IoT): Data generated by machines and sensors.
What To Do With These Data?
Addressing the question of deriving actionable insights from vast amounts of data generated across industries.
Data Mining Uncovers Hidden Information
The process of data mining involves identifying and extracting hidden patterns and information within a dataset.
Multi-Disciplinary Field of Science and Technology
Data Mining is recognized as a multi-disciplinary field, bridging various domains including computer science, statistics, and domain-specific areas.
Machine Learning
Involves learning patterns from available data (training data) and applying them to new instances of data (unseen data).
Applications of machine learning in business analytics include predictive use cases and improving data models.
Classification Example: Approving Loans
Inputs for classification include attributes such as age, region, income, etc., leading to outcomes of either approved or not approved for loans.
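One way to sketch the loan example is a k-nearest-neighbours classifier: a new applicant gets the majority label of the most similar past applicants. This is only one of many possible classification techniques, chosen here for brevity. The training records, the use of age and income only (region is left out), and the function name classify are all assumptions. A real model would also rescale the attributes first.

```python
from collections import Counter
from math import dist

# Hypothetical training data: ((age, income in $1000s), loan decision).
training = [
    ((25, 30), "not approved"),
    ((30, 40), "not approved"),
    ((22, 25), "not approved"),
    ((45, 85), "approved"),
    ((50, 95), "approved"),
    ((35, 70), "approved"),
]


def classify(applicant, examples, k=3):
    """Label an applicant by majority vote among the k closest training examples."""
    nearest = sorted(examples, key=lambda ex: dist(ex[0], applicant))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Unseen applicants that were not part of the training data.
print(classify((40, 80), training))  # approved
print(classify((24, 28), training))  # not approved
```

The training list is the available data; the two applicants passed to classify are unseen instances.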
Data Mining Models and Techniques
Data mining includes various models focusing on:
Descriptive Models: Describing data characteristics.
Predictive Models: Making predictions based on historical data.
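The difference between the two kinds of model can be sketched on one small dataset. The descriptive part summarizes past sales. The predictive part fits a straight line by least squares and forecasts the next month. The sales figures are made up, and a linear trend is only an assumption for illustration.

```python
from statistics import mean, median

# Hypothetical monthly sales (in units) for months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [10, 12, 14, 16, 18]

# Descriptive: summarize what the historical data looks like.
print(f"mean = {mean(sales)}, median = {median(sales)}, range = {max(sales) - min(sales)}")

# Predictive: fit sales = intercept + slope * month by ordinary least squares.
mean_x, mean_y = mean(months), mean(sales)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
s_xx = sum((x - mean_x) ** 2 for x in months)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Use the fitted line to forecast a month that is not in the data.
forecast = intercept + slope * 6
print(f"forecast for month 6 = {forecast}")
```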
Statistical Concepts
Definitions in statistics include:
Population: The complete set of items under consideration.
Sample: A subset of the population used for analysis.
Statistic: A summary measure that describes a characteristic of the sample data.
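The three terms above can be shown in a short sketch. A simulated population is created, a random sample is drawn from it, and the sample mean is the statistic. The population values, its size, and the sample size of 50 are arbitrary assumptions.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is repeatable

# Population: the complete set of items (here, 1000 simulated customer ages).
population = [random.gauss(50, 10) for _ in range(1000)]

# Sample: a subset of the population drawn at random, without replacement.
sample = random.sample(population, 50)

# Statistic: a summary measure computed from the sample only.
sample_mean = mean(sample)

print(f"population mean = {mean(population):.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean.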
Null Hypothesis, P-Values, and Q-Values
Null Hypothesis: States that no effect or difference exists.
P-value: The probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.
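Q-value: Described here as the confidence level associated with the p-value, computed as its reverse (read as the complement, 1 - p). In the wider statistics literature, "q-value" more commonly denotes a p-value adjusted for the false discovery rate when many hypotheses are tested.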
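A small worked example may make the p-value concrete. Suppose a coin is flipped 10 times and lands heads 9 times, and the null hypothesis is that the coin is fair. The sketch below computes the one-sided p-value as the probability of 9 or more heads under that null hypothesis. The scenario and the function name binomial_p_value are assumptions for illustration.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_value = binomial_p_value(9, 10)
print(round(p_value, 4))  # 0.0107
```

A p-value of about 0.011 means that a fair coin would produce a result this extreme in roughly 1% of such experiments, so the null hypothesis would be rejected at the common 5% significance level.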
Machine Learning vs. Statistics
Shared Techniques: Both domains utilize similar algorithms and methodologies.
Differences:
Machine Learning often handles larger datasets compared to traditional statistical analysis.
Machine Learning focuses on data behavior and prediction capabilities, whereas statistical analysis emphasizes hypothesis testing and model validity.
The dynamic nature of Machine Learning contrasts with the static assumptions in traditional statistics.
Data Mining and Knowledge Discovery
Bridges statistical theory and learning models whose performance improves as they are trained on data.
The full process of data mining includes:
Data cleaning
Learning
Validation
Integration
Visualization of results
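The first three steps of the process above can be sketched end to end on a toy dataset. Records with missing values are dropped (data cleaning), a simple income threshold is fitted on part of the data (learning), and the rule is checked on held-out records (validation). Integration and visualization are not covered. The data, the 75/25 split, and the threshold rule are all assumptions chosen to keep the sketch short.

```python
# Hypothetical raw records: (income in $1000s, loan approved?). None marks a missing value.
raw = [
    (30, "no"), (85, "yes"), (None, "yes"), (40, "no"), (95, "yes"),
    (25, "no"), (70, "yes"), (35, None), (90, "yes"), (45, "no"),
]

# 1. Data cleaning: drop records with a missing income or label.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Hold out the last quarter of the cleaned records for validation.
split = int(len(clean) * 0.75)
train, holdout = clean[:split], clean[split:]


def mean(values):
    return sum(values) / len(values)


# 2. Learning: place a threshold midway between the two class means.
yes_mean = mean([x for x, y in train if y == "yes"])
no_mean = mean([x for x, y in train if y == "no"])
threshold = (yes_mean + no_mean) / 2


def predict(income):
    """Predict approval when income is at or above the learned threshold."""
    return "yes" if income >= threshold else "no"


# 3. Validation: measure accuracy on records the rule has not seen.
accuracy = sum(predict(x) == y for x, y in holdout) / len(holdout)
print(f"threshold = {threshold}, holdout accuracy = {accuracy:.0%}")
```

Validating on held-out records rather than on the training records gives a fairer view of how the rule would behave on new data.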
CRISP-DM Data Mining Process Model
Outline of the Data Mining process includes:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Ethical Issues in Data Mining
Critical considerations in data mining include questions related to the following:
What conclusions can be legitimately drawn from the data?
Are resources and results put to good use?
Is the data biased?
Is the model explainable?
Emergence of Responsible AI and Responsible Data Science discussions.
References
Linoff, G., & Berry, M. (2011). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, 3rd Edition. John Wiley.
Shmueli, G., Bruce, P.C., Deokar, A.V., & Patel, N.R. (2023). Machine Learning for Business Analytics: Concepts, Techniques and Applications in RapidMiner. John Wiley.