CS 422: Data Mining
Instructor: Vijay K. Gurbani, Ph.D., Illinois Institute of Technology
Lecture Focus: Components of Learning, Decision Trees
Components of Learning
Overview
Most data mining/machine learning algorithms operate on matrices.
Matrix Layout:
Columns: Attributes, features, predictors, dimensions
Rows: Observations, data points
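A minimal Python sketch of this layout, using plain lists. The attribute names follow the matrix example in these notes; the numeric values are made up for illustration.

```python
# Hypothetical data: three observations (rows) of three attributes (columns).
attributes = ["distance", "load", "thickness"]
X = [
    [10.0, 5.0, 1.0],
    [12.0, 6.0, 2.0],
    [16.0, 7.0, 3.0],
]

n_rows = len(X)      # number of observations (data points)
n_cols = len(X[0])   # number of attributes (features, dimensions)

# A single observation is one row; a single attribute is one column.
first_observation = X[0]
load_column = [row[attributes.index("load")] for row in X]
```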
Data Layout Examples
Matrix Data Layout
Example Data:
Example values: 12.65, 6.25, 16.22, 2.2, 1.1
Attributes include distance, load, and thickness.
Document Data Layout
Example Documents:
Document 1: season, time, lost, winning games, score, ball, play, coach, team
Documents are tokenized into terms, and each document is then represented by its term counts.
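One way to turn tokenized documents into a matrix is a document-term count table. The sketch below assumes a very simple whitespace tokenizer; Document 1 reuses the terms listed above, while Document 2 is hypothetical and added only so the matrix has more than one row.

```python
from collections import Counter

# Document 1 uses the example terms; Document 2 is a made-up second document.
documents = {
    "Document 1": "season time lost winning games score ball play coach team",
    "Document 2": "team coach play play score",
}

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
vocabulary = sorted({term for c in counts.values() for term in c})

# One row per document, one column per vocabulary term.
doc_term_matrix = [
    [counts[name][term] for term in vocabulary] for name in documents
]
```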
Transaction Data Layout
Transaction ID (TID) and associated items:
Example Transactions:
TID 1: Bread, Coke, Milk
TID 2: Beer, Bread
TID 3: Beer, Coke, Diaper, Milk
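The three transactions above can be converted to a binary item matrix, with one row per TID and one column per item. This is a sketch of one common encoding, not necessarily the one used in the lecture.

```python
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
}

# Column order: all distinct items, sorted alphabetically.
items = sorted({item for basket in transactions.values() for item in basket})

# 1 if the item appears in the transaction, 0 otherwise.
binary_matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}
```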
Graph Data Layout
Graphs can be represented as matrices.
Example Matrix:
5 2 1
2 5
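As an illustration of representing a graph as a matrix, the sketch below builds a weighted adjacency matrix. The three-node graph and its edge weights are hypothetical and are not meant to reproduce the example matrix above.

```python
# Hypothetical undirected graph: (node, node, weight) triples.
nodes = ["A", "B", "C"]
edges = [("A", "B", 5), ("A", "C", 2), ("B", "C", 1)]

index = {name: i for i, name in enumerate(nodes)}
adjacency = [[0] * len(nodes) for _ in nodes]

for u, v, weight in edges:
    # Undirected graph: store the weight in both directions.
    adjacency[index[u]][index[v]] = weight
    adjacency[index[v]][index[u]] = weight
```

Entry `adjacency[i][j]` holds the weight of the edge between node i and node j, with 0 meaning no edge.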
Formalism in Learning
Input/Output Structure
Input: A matrix of attributes (n-dimensional, n >= 1)
Output: Response vector
Target Function: Relationship between input data and desired output
Training Data: Pairs (x1, y1), (x2, y2), …, (xn, yn)
Hypothesis: Constructed model based on training examples
Learning Hypothesis Algorithm
Unknown Target Function: f: X → Y (e.g., the ideal credit approval function)
Training Examples: (x1, y1), …, (xn, yn) (e.g., historical records of credit customers)
Learning Algorithm A: Uses the training examples to select the final hypothesis g ≈ f (the final credit approval formula) from the hypothesis set.
Hypothesis Set H: Set of candidate formulas.
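The pieces of this formalism can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the income values, the labels, the choice of threshold rules as the hypothesis set, and the use of training error as the selection criterion.

```python
# Hypothetical training examples: (annual income in $1000s, approved?) pairs.
training = [(20, 0), (35, 0), (50, 1), (65, 1), (80, 1)]

def make_hypothesis(threshold):
    """A candidate formula h: approve when income is at least `threshold`."""
    return lambda x: 1 if x >= threshold else 0

# Hypothesis set H: one candidate formula per threshold.
thresholds = [10, 30, 45, 70]
H = {t: make_hypothesis(t) for t in thresholds}

def training_error(h, examples):
    """Fraction of training examples that h labels incorrectly."""
    return sum(h(x) != y for x, y in examples) / len(examples)

def learning_algorithm(hypotheses, examples):
    """Pick the hypothesis with the lowest training error."""
    best = min(hypotheses, key=lambda t: training_error(hypotheses[t], examples))
    return best, hypotheses[best]

best_threshold, g = learning_algorithm(H, training)
```

The target function f never appears in the code: the learner only sees the training pairs, and g is its best guess at f within H.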
Terminology in Learning
Learner: Takes input and produces a classifier.
Classifier: Takes input and produces output (predictions).
Model: Artifact created by a learner, used by a classifier for predictions.
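The distinction between the three terms can be made concrete with a deliberately trivial baseline: a learner that memorizes the majority label. The function names and the example data are hypothetical.

```python
from collections import Counter

def learner(examples):
    """Take labeled training examples and return a model (here, the majority label)."""
    labels = [y for _, y in examples]
    majority_label = Counter(labels).most_common(1)[0][0]
    return {"majority_label": majority_label}

def classifier(model, x):
    """Use the model to predict a label for x (this baseline ignores x entirely)."""
    return model["majority_label"]

# Hypothetical training examples: (attribute vector, class label).
examples = [([1.0, 2.0], "yes"), ([0.5, 1.5], "yes"), ([3.0, 0.1], "no")]

model = learner(examples)                    # the artifact produced by learning
prediction = classifier(model, [2.0, 2.0])   # prediction for an unseen input
```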
Machine Learning Types
Supervised Learning: Algorithms trained with labeled data.
Unsupervised Learning: Algorithms find patterns in unlabeled data.
Reinforcement Learning: An agent learns by taking actions that maximize rewards.
Semi-Supervised Learning: Uses a small labeled dataset alongside a large unlabeled dataset.
Decision Trees: One method for supervised learning.
Generalization in Learning
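A learned model must generalize to unseen cases, not just fit the training data.
Challenges include the "curse of dimensionality": as the number of attributes grows, the data become increasingly sparse in the attribute space.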
Exploration of data scientist's time allocation and efforts.
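To make the curse of dimensionality mentioned above concrete, consider data spread uniformly over a unit hypercube. A sub-cube that contains a fixed fraction of the volume needs an edge of length fraction^(1/d), which approaches 1 as the dimension d grows. The function name below is hypothetical; the arithmetic is standard.

```python
def edge_length_for_fraction(fraction, dimensions):
    """Edge length of a sub-cube covering `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dimensions)

# Edge length needed to capture 10% of the volume in 1, 2, 10, and 100 dimensions.
for d in (1, 2, 10, 100):
    print(d, round(edge_length_for_fraction(0.10, d), 3))
```

In 10 dimensions the sub-cube must span roughly 79% of each axis to hold 10% of the volume, so a "local" neighborhood is no longer local.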
Workflow of Data Mining
Key Steps
Preprocessing:
Data transformation, feature selection, normalization, and subsetting
Knowledge Transformation: Extracting patterns from preprocessed data.
Evaluation: Interpretation of outputs and models derived from mining.
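Two of the preprocessing steps listed above, feature selection and normalization, can be sketched as follows. The rows are hypothetical, and min-max scaling is only one of several normalization schemes.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical rows with attributes (distance, load, thickness).
rows = [(10.0, 5.0, 1.0), (20.0, 6.0, 2.0), (30.0, 9.0, 3.0)]

# Feature selection: keep only distance (index 0) and load (index 1).
selected = [0, 1]
columns = [[row[i] for row in rows] for i in selected]

# Normalization: rescale each kept column, then rebuild the rows.
normalized_columns = [min_max_normalize(column) for column in columns]
normalized_rows = [list(row) for row in zip(*normalized_columns)]
```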
Data Types in R
Numeric: Integer and floating-point values.
Factor: Enumerated data type with a fixed set of possible values, called levels (e.g., {"blue", "green"}).
Ordinal (ordered factor): Order matters (e.g., {"small", "medium", "large"}).
Nominal (unordered factor): Order does not matter (e.g., {"blue", "green", "red"}).
Character: String or single character representation.
Algorithm Preferences
Some algorithms require particular data types (e.g., classification requires a categorical response, while regression requires a continuous one).
Transform the data to match the algorithm's requirements before training.
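As an example of matching data to an algorithm's requirements, the sketch below encodes the ordinal and nominal examples from the data types list as numbers. It is written in Python rather than R, and the function names are hypothetical. Ordinal values map to integers that preserve order; nominal values use one-hot vectors so that no artificial order is implied.

```python
SIZE_ORDER = ["small", "medium", "large"]   # ordinal: order matters
COLORS = ["blue", "green", "red"]           # nominal: order does not matter

def encode_ordinal(value, order=SIZE_ORDER):
    """Map an ordinal value to its rank, so small < medium < large is preserved."""
    return order.index(value)

def encode_one_hot(value, categories=COLORS):
    """Map a nominal value to a 0/1 vector with a single 1 at its category."""
    return [1 if value == category else 0 for category in categories]

size_code = encode_ordinal("medium")
color_code = encode_one_hot("green")
```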
Decision Trees
Description: The first classification algorithm covered in the course.
Classification Purpose: Learn a target function mapping attributes to class labels.
Case Study: A bank decides whether to approve loan applicants based on specific criteria.
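A decision tree for the loan scenario can be read as a sequence of attribute tests ending in a class label. The sketch below is a hand-written tree, not a learned one, and the attributes (credit history, income, collateral) and the income cutoff are assumptions made for illustration.

```python
def approve_loan(applicant):
    """Walk a small hand-written decision tree and return a class label."""
    # Root node: test credit history first.
    if applicant["credit_history"] == "bad":
        return "reject"
    # Internal node: test income against a hypothetical cutoff.
    if applicant["income"] >= 50_000:
        return "approve"
    # Internal node: lower income can still be approved with collateral.
    if applicant["has_collateral"]:
        return "approve"
    return "reject"

# Hypothetical applicants.
alice = {"credit_history": "good", "income": 72_000, "has_collateral": False}
bob = {"credit_history": "good", "income": 30_000, "has_collateral": False}
carol = {"credit_history": "bad", "income": 90_000, "has_collateral": True}

decisions = [approve_loan(person) for person in (alice, bob, carol)]
```

A decision tree learner would choose these tests and cutoffs from training data instead of having them written by hand.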