CS 422: Data Mining

Instructor: Vijay K. Gurbani, Ph.D., Illinois Institute of Technology
Lecture Focus: Components of Learning, Decision Trees

Components of Learning

Overview

Most data mining/machine learning algorithms operate on matrices.
Matrix Layout:
- Columns: Attributes, features, predictors, dimensions
- Rows: Observations, data points
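
As a minimal sketch of this layout, a data matrix can be held in Python as a list of rows with a separate list of column names. The values here are illustrative only, and the variable names (`columns`, `data`) are assumptions made for this example.

```python
# Attribute names (columns); the values below are illustrative only.
columns = ["distance", "load", "thickness"]

# Each inner list is one observation (row).
data = [
    [15.22, 2.7, 1.2],
    [16.22, 2.2, 1.1],
    [14.10, 3.0, 0.9],
]

n_rows = len(data)      # number of observations
n_cols = len(columns)   # number of attributes

# Pull out a single attribute (column) by name.
load_index = columns.index("load")
load_values = [row[load_index] for row in data]

print(n_rows, n_cols)
print(load_values)
```

Reading across a row gives one observation; reading down a column gives one attribute across all observations.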

Data Layout Examples

Matrix Data Layout

Example Data:
- Sample values: 12.65, 6.25, 16.22, 2.2, 1.1
- Attributes include distance, load, and thickness.

Document Data Layout

Example Documents:
- Document 1: season, time, lost, winning games, score, ball, play, coach, team
- Documents are tokenized into terms before analysis.
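
One possible way to turn documents into a matrix is to tokenize each document and count how often each term occurs. The sketch below assumes a very simple tokenizer (lowercase alphabetic words only) and two made-up documents; neither comes from the lecture.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical documents, made up for illustration.
documents = [
    "The team lost the game; the coach called a timeout.",
    "Score! The team will play another game this season.",
]

# Term counts per document.
counts = [Counter(tokenize(doc)) for doc in documents]

# Vocabulary: every distinct term across all documents (the columns).
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term.
doc_term_matrix = [[c[term] for term in vocabulary] for c in counts]

for row in doc_term_matrix:
    print(row)
```

Each row now has the same length, so the documents fit the rows-by-columns layout described earlier.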

Transaction Data Layout

Transaction ID (TID) and associated items:
- Example Transactions:
  - TID 1: Bread, Coke, Milk
  - TID 2: Beer, Bread
  - TID 3: Beer, Coke, Diaper, Milk
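
The three transactions above can be sketched in Python as a mapping from TID to a set of items. The `support` helper is an assumption added for illustration: it computes the fraction of transactions containing an itemset, a quantity commonly used in association analysis but not defined in these notes.

```python
from collections import Counter

# Transactions from the example above, keyed by transaction ID (TID).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(
    item for items in transactions.values() for item in items
)

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(item_counts["Bread"])
print(support({"Coke", "Milk"}))
```

Sets are used because the order of items within a transaction does not matter.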

Graph Data Layout

Graphs can be represented as matrices.
Example Matrix:
- 5 2 1
- 2 5
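
To illustrate how a graph maps to a matrix, the sketch below builds an adjacency matrix from a list of weighted edges. The three-node graph and its weights are hypothetical and are not a reconstruction of the example above.

```python
# Hypothetical weighted, undirected edges: (node_a, node_b, weight).
edges = [(0, 1, 5), (0, 2, 2), (1, 2, 1)]
n_nodes = 3

# Start with an all-zero n_nodes x n_nodes matrix (0 means "no edge").
adjacency = [[0] * n_nodes for _ in range(n_nodes)]

# An undirected edge appears in both (a, b) and (b, a).
for a, b, weight in edges:
    adjacency[a][b] = weight
    adjacency[b][a] = weight

for row in adjacency:
    print(row)
```

For an undirected graph the resulting matrix is symmetric; a directed graph would fill only `adjacency[a][b]`.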

Formalism in Learning

Input/Output Structure

Input: A matrix of attributes (n-dimensional, n >= 1)
Output: Response vector
Target Function: Relationship between input data and desired output
Training Data: Pairs (x1, y1), (x2, y2), …, (xN, yN), where N is the number of training examples
Hypothesis: Constructed model based on training examples

Learning Hypothesis Algorithm

Unknown Target Function: f: X → Y (e.g., the ideal credit approval function)
Training Examples: Historical credit customer records (e.g., (x1, y1))
Learning Algorithm A: Uses the training examples to select a final hypothesis g ≈ f (the final credit approval formula).
Hypothesis Set H: Set of candidate formulas.
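
The pieces above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training examples, the income-threshold form of the candidate formulas, and the function names (`make_hypothesis`, `training_error`, `learn`) are all hypothetical, and choosing the candidate with the lowest training error is just one simple learning strategy.

```python
# Hypothetical training examples: (annual income in $1000s, approved?).
training = [(20, 0), (35, 0), (50, 1), (65, 1), (80, 1)]

def make_hypothesis(threshold):
    """Candidate formula: approve (1) when income is at least threshold."""
    return lambda x: 1 if x >= threshold else 0

# Hypothesis set H: one candidate formula per threshold.
H = {t: make_hypothesis(t) for t in (10, 30, 45, 70)}

def training_error(h):
    """Fraction of training examples that hypothesis h gets wrong."""
    return sum(1 for x, y in training if h(x) != y) / len(training)

def learn(hypothesis_set):
    """Pick the candidate with the lowest training error."""
    best = min(hypothesis_set, key=lambda t: training_error(hypothesis_set[t]))
    return best, hypothesis_set[best]

best_threshold, final_hypothesis = learn(H)
print(best_threshold, training_error(final_hypothesis))
```

The target function never appears in the code: the learner sees only the training examples and the set of candidates.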

Terminology in Learning

Learner: Takes input and produces a classifier.
Classifier: Takes input and produces output (predictions).
Model: Artifact created by a learner, used by a classifier for predictions.
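
The distinction between the three terms can be sketched as follows. The nearest-class-mean rule used here is a stand-in chosen for brevity, not a method from the lecture, and the data and function names are hypothetical.

```python
def learner(training_data):
    """Learner: takes labeled examples and returns a model.

    The model here is a dict mapping each label to the mean of its inputs.
    """
    sums, counts = {}, {}
    for x, label in training_data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classifier(model, x):
    """Classifier: predicts the label whose mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

# Hypothetical one-dimensional training data.
training_data = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

model = learner(training_data)   # the artifact created by the learner
print(model)
print(classifier(model, 3.0), classifier(model, 7.0))
```

The learner runs once on the training data; the classifier can then be applied to any number of new inputs using the stored model.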

Machine Learning Types

Supervised Learning: Algorithms trained with labeled data.
Unsupervised Learning: Algorithms find patterns in unlabeled data.
Reinforcement Learning: Learning based on actions maximizing rewards.
Semi-Supervised Learning: Uses a small labeled dataset alongside a large unlabeled dataset.
Decision trees, covered later in this lecture, are one method for supervised learning.

Generalization in Learning

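The goal of learning is to generalize to unseen cases, not merely to fit the training data.
Challenges include the "curse of dimensionality": as the number of attributes grows, the data become sparse in the attribute space, so more observations are needed to generalize well.
The lecture also looks at how data scientists allocate their time and effort.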
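
One way to get a feel for the curse of dimensionality is a small calculation, sketched here under the assumption that data are spread uniformly over a unit hypercube. To enclose a given fraction of the volume, a sub-cube needs an edge length of fraction^(1/d) in d dimensions. The function name is hypothetical.

```python
def edge_length_for_fraction(fraction, dims):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dims)

# How wide must a "local" neighborhood be to capture 10% of the data?
for dims in (1, 2, 10, 100):
    print(dims, round(edge_length_for_fraction(0.10, dims), 3))
```

In one dimension the neighborhood spans 10% of the attribute's range; in 10 dimensions it must span about 79% of every attribute's range, and in 100 dimensions about 98%, so it is no longer local in any useful sense.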

Workflow of Data Mining

Key Steps

Preprocessing: Data transformation, feature selection, normalization, and subsetting.

Knowledge Transformation: Extracting patterns from preprocessed data.
Evaluation: Interpretation of outputs and models derived from mining.
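
Two of the preprocessing operations named above, feature selection and normalization, can be sketched as follows. Min-max scaling is only one of several normalization schemes, and the records and attribute names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw records.
rows = [
    {"id": 101, "income": 20.0, "age": 25.0},
    {"id": 102, "income": 50.0, "age": 35.0},
    {"id": 103, "income": 80.0, "age": 65.0},
]

# Feature selection: keep the predictive attributes, drop the identifier.
selected = ["income", "age"]

# Normalization: rescale each selected attribute independently.
normalized = {
    name: min_max_normalize([row[name] for row in rows]) for name in selected
}

print(normalized["income"])
print(normalized["age"])
```

After scaling, attributes measured in different units contribute on a comparable scale, which matters for distance-based algorithms.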

Data Types in R

Numeric: Includes integer and floating-point values.
Factor: Enumeration data type with specific possible values (e.g., {"blue", "green"}).
- Ordinal: Order matters (e.g., {"small", "medium", "large"}).
- Nominal: Order does not matter (e.g., {"blue", "green", "red"}).
Character: String or single character representation.

Algorithm Preferences

Some algorithms require particular data types (e.g., a categorical response for classification, a continuous response for regression).
Ensuring data transformation aligns with algorithm requirements is crucial.
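
As a rough illustration of transforming data to suit an algorithm that expects numeric input, the sketch below encodes an ordinal attribute as integer ranks and a nominal attribute as indicator columns. It is written in Python rather than R, reuses the levels from the examples above, and its helper names (`one_hot`, `encode`) are assumptions.

```python
# Ordinal factor: order matters, so map levels to increasing integers.
size_levels = ["small", "medium", "large"]
size_rank = {level: rank for rank, level in enumerate(size_levels)}

# Nominal factor: order does not matter, so use one indicator per level.
color_levels = ["blue", "green", "red"]

def one_hot(value, levels):
    """Return a 0/1 indicator list with a single 1 at the value's level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels]

def encode(record):
    """Turn a (size, color) record into a numeric feature vector."""
    size, color = record
    return [size_rank[size]] + one_hot(color, color_levels)

print(encode(("medium", "red")))
print(encode(("large", "blue")))
```

Encoding a nominal attribute as integers would impose an ordering that does not exist, which is why the indicator form is used for it.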

Decision Trees

Description: The first classification algorithm covered in this course.
Classification Purpose: Learn a target function mapping attributes to class labels.
Case Study: A bank decides whether to approve loan applicants based on specific criteria.
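
A decision tree for a scenario like this can be read as nested tests on attributes that end in a class label. The sketch below is hand-written rather than learned from data, and its criteria (income threshold, default history, collateral) are hypothetical, since the notes do not list the bank's actual criteria.

```python
def approve_loan(applicant):
    """Hand-written decision tree: each `if` is an internal node,
    each `return` is a leaf carrying a class label."""
    if applicant["income"] >= 50:          # root node: income in $1000s
        if applicant["has_defaulted"]:     # second-level test
            return "reject"
        return "approve"
    if applicant["has_collateral"]:        # low-income branch
        return "approve"
    return "reject"

# Hypothetical applicants.
applicants = [
    {"income": 80, "has_defaulted": False, "has_collateral": False},
    {"income": 80, "has_defaulted": True, "has_collateral": True},
    {"income": 30, "has_defaulted": False, "has_collateral": True},
    {"income": 30, "has_defaulted": False, "has_collateral": False},
]

for applicant in applicants:
    print(approve_loan(applicant))
```

A decision tree learner would choose these tests and thresholds automatically from labeled training records instead of having them written by hand.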