CS 422: Data Mining

  • Instructor: Vijay K. Gurbani, Ph.D., Illinois Institute of Technology

  • Lecture Focus: Components of Learning, Decision Trees

Components of Learning

Overview

  • Most data mining/machine learning algorithms operate on matrices.

  • Matrix Layout:

    • Columns: Attributes, features, predictors, dimensions

    • Rows: Observations, data points
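
A minimal sketch of this layout, using plain Python lists: each inner list is one observation (row) and each position is one attribute (column). The attribute names and values are hypothetical and only illustrate the shape of the data.

```python
# Rows are observations; columns are attributes.
# Attribute names and values are hypothetical, chosen only for illustration.
attributes = ["height_cm", "weight_kg", "age_years"]
data = [
    [170.0, 65.0, 30.0],  # observation 0
    [182.0, 80.0, 45.0],  # observation 1
    [158.0, 52.0, 23.0],  # observation 2
    [175.0, 72.0, 38.0],  # observation 3
]

n_rows = len(data)      # number of observations
n_cols = len(data[0])   # number of attributes (dimensions)

# Selecting a row gives one observation; selecting a column gives one attribute.
second_observation = data[1]
weight_column = [row[attributes.index("weight_kg")] for row in data]

print(n_rows, n_cols)
print(second_observation)
print(weight_column)
```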

Data Layout Examples

Matrix Data Layout

  • Example Data:

    • Sample row of values: 12.65, 6.25, 16.22, 2.2, 1.1

    • Attributes include distance, load, and thickness.

Document Data Layout

  • Example Documents:

    • Document 1: season, time, lost, winning games, score, ball, play, coach, team

    • Each document is tokenized into terms, and term occurrences are counted for analysis.
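
One possible way to turn tokenized documents into a matrix is a term-count table, sketched below. Both document strings are made up (they reuse a few terms from the list above), and the whitespace tokenizer is a deliberately simple assumption.

```python
from collections import Counter

# Hypothetical documents built from a few of the example terms.
documents = [
    "team coach play ball score team",
    "season lost game score season",
]

# Tokenize: lowercase and split on whitespace (a deliberately simple tokenizer).
tokens = [doc.lower().split() for doc in documents]

# The vocabulary (sorted set of all terms) defines the matrix columns.
vocabulary = sorted({term for doc in tokens for term in doc})

# Each row holds the count of every vocabulary term in one document.
term_matrix = []
for doc in tokens:
    counts = Counter(doc)
    term_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in term_matrix:
    print(row)
```

Each row is a document and each column is a term, so the result has the same rows-by-columns shape as any other data matrix.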

Transaction Data Layout

  • Transaction ID (TID) and associated items:

    • Example Transactions:

      • TID 1: Bread, Coke, Milk

      • TID 2: Beer, Bread

      • TID 3: Beer, Coke, Diaper, Milk
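
The three transactions above can also be viewed as a binary matrix with one column per item. The sketch below is one way to build it; the variable names are illustrative.

```python
# The three example transactions, keyed by TID.
transactions = {
    1: ["Bread", "Coke", "Milk"],
    2: ["Beer", "Bread"],
    3: ["Beer", "Coke", "Diaper", "Milk"],
}

# Columns: every distinct item, in sorted order.
items = sorted({item for basket in transactions.values() for item in basket})

# Rows: 1 if the transaction contains the item, 0 otherwise.
binary_matrix = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in binary_matrix.items():
    print(tid, row)
```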

Graph Data Layout

  • Graphs can be represented as matrices.

  • Example Matrix:

    • 5 2 1

    • 2 5
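
As a rough illustration of how a graph becomes a matrix, the sketch below builds an adjacency matrix from an edge list. The graph (three nodes, undirected, weighted) is hypothetical and is not a reconstruction of the example matrix above.

```python
# Hypothetical undirected, weighted edges: (node_a, node_b, weight).
edges = [(0, 1, 2), (0, 2, 1), (1, 2, 5)]
n_nodes = 3

# Start from an all-zero n_nodes x n_nodes matrix.
adjacency = [[0] * n_nodes for _ in range(n_nodes)]

# Entry [a][b] holds the weight of the edge between a and b.
for a, b, weight in edges:
    adjacency[a][b] = weight
    adjacency[b][a] = weight  # undirected graph, so the matrix is symmetric

for row in adjacency:
    print(row)
```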

Formalism in Learning

Input/Output Structure

  • Input: A matrix of attributes (n-dimensional, n >= 1)

  • Output: Response vector

  • Target Function: Relationship between input data and desired output

  • Training Data: Pairs (x1, y1), (x2, y2), …, (xN, yN), where N is the number of training examples

  • Hypothesis: Constructed model based on training examples

Learning Hypothesis Algorithm

  • Unknown Target Function: f: X → Y (e.g., ideal credit approval function)

  • Training Examples: Historical credit customer records (e.g., (x1, y1))

  • Learning Algorithm A: Uses the training examples to select a final hypothesis g from H; g is the final approval formula and should approximate f.

  • Hypothesis Set H: Set of candidate formulas.
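
The pieces above can be sketched in a few lines. This is a toy setup under stated assumptions: the training examples, the single income attribute, the threshold-style hypothesis set, and the "lowest training error" selection rule are all hypothetical choices made for illustration.

```python
# Hypothetical training examples: (annual income in $1000s, approved? 1 or 0).
training_data = [(20, 0), (35, 0), (50, 1), (65, 1), (80, 1)]

# Hypothesis set H: "approve if income >= t" for a few candidate thresholds t.
thresholds = [10, 30, 40, 70]


def make_hypothesis(t):
    """Return the candidate formula h(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


def training_error(h, data):
    """Fraction of training examples that h labels incorrectly."""
    return sum(h(x) != y for x, y in data) / len(data)


# Learning algorithm A (one simple choice): pick the hypothesis in H
# with the lowest error on the training examples.
best_t = min(
    thresholds,
    key=lambda t: training_error(make_hypothesis(t), training_data),
)
g = make_hypothesis(best_t)  # final hypothesis

print(best_t, training_error(g, training_data))
print(g(45), g(25))
```

The unknown target function f never appears in the code; only its outputs on the training examples do, which is why g can only approximate f.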

Terminology in Learning

  • Learner: Takes input and produces a classifier.

  • Classifier: Takes input and produces output (predictions).

  • Model: Artifact created by a learner, used by a classifier for predictions.
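
One way to read these three terms is sketched below with a deliberately trivial baseline: the learner returns a model that remembers the most common training label, and the classifier uses that model to predict. The function names, labels, and input record are hypothetical.

```python
from collections import Counter


def learner(training_labels):
    """Learner: takes training data and returns a model (here, the majority label)."""
    most_common_label, _ = Counter(training_labels).most_common(1)[0]
    return {"majority_label": most_common_label}


def classifier(model, x):
    """Classifier: uses the model to predict a label for x.

    This baseline ignores x and always predicts the majority label.
    """
    return model["majority_label"]


labels = ["approve", "reject", "approve", "approve"]
model = learner(labels)                          # the artifact the learner creates
prediction = classifier(model, {"income": 42})   # the classifier's output

print(model)
print(prediction)
```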

Machine Learning Types

  • Supervised Learning: Algorithms trained with labeled data.

  • Unsupervised Learning: Algorithms find patterns in unlabeled data.

  • Reinforcement Learning: Algorithms learn which actions to take in order to maximize rewards.

  • Decision Trees: One method for supervised learning.

  • Semi-Supervised Learning: Uses a small labeled dataset alongside a large unlabeled dataset.

Generalization in Learning

  • A learned model must generalize to unseen cases, not just fit the training data.

  • Challenges include the "curse of dimensionality."

  • A look at how data scientists allocate their time and effort.
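
A common way to illustrate the curse of dimensionality, assuming data spread uniformly over a unit hypercube: to capture a fixed fraction of the data, a sub-cube needs an edge length of fraction^(1/d), which approaches 1 as the number of dimensions d grows. The sketch below computes that edge length; the uniform-data assumption and the 10% fraction are illustrative choices.

```python
def edge_length(fraction, dims):
    """Edge length of a sub-cube covering `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dims)


# Edge length needed to cover 10% of the volume as dimensionality grows.
for d in (1, 2, 10, 100):
    print(d, round(edge_length(0.10, d), 3))
```

In high dimensions a "local" neighborhood has to span almost the full range of every attribute, so nearby training points become scarce and generalization gets harder.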

Workflow of Data Mining

Key Steps

  1. Preprocessing: Data transformation, feature selection, normalization, and subsetting.

  2. Knowledge Transformation: Extracting patterns from preprocessed data.

  3. Evaluation: Interpretation of outputs and models derived from mining.
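
As an example of one preprocessing operation, the sketch below applies min-max normalization, which rescales an attribute to the range [0, 1]. Min-max is only one of several normalization schemes, and the input values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [2.0, 4.0, 6.0, 10.0]  # hypothetical attribute values
normalized = min_max_normalize(raw)

print(normalized)
```

Rescaling like this keeps attributes measured on large scales from dominating attributes measured on small ones in later steps.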

Data Types in R

  • Numeric: Includes integer and floating-point values.

  • Factor: Enumeration data type with specific possible values (e.g., {"blue", "green"}).

    • Ordinal: Order matters (e.g., {"small", "medium", "large"}).

    • Nominal: Order does not matter (e.g., {"blue", "green", "red"}).

  • Character: String or single character representation.

Algorithm Preferences

  • Some algorithms require certain data types (e.g., regression needs a continuous response, while classification needs a categorical one).

  • Ensuring data transformation aligns with algorithm requirements is crucial.

Decision Trees

  • Description: The first classification algorithm covered in the course.

  • Classification Purpose: Learn a target function mapping attributes to class labels.

  • Case Study: A bank decides whether to approve loan applicants based on specific criteria.
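
To make the case study concrete, a decision tree can be read as a sequence of nested tests on attributes that ends in a class label. The tree below is written by hand rather than learned from data, and its attributes and thresholds are hypothetical; they are not the bank's criteria from the lecture.

```python
def approve_loan(applicant):
    """Hand-written decision tree; attributes and thresholds are hypothetical."""
    # Root node: test for a past default first.
    if applicant["has_defaulted"]:
        return "reject"
    # Second level: a high enough income leads to approval.
    if applicant["annual_income"] >= 50_000:
        return "approve"
    # Third level: lower income can be offset by long employment.
    if applicant["years_employed"] >= 5:
        return "approve"
    return "reject"


applicants = [
    {"has_defaulted": False, "annual_income": 72_000, "years_employed": 2},
    {"has_defaulted": False, "annual_income": 38_000, "years_employed": 7},
    {"has_defaulted": False, "annual_income": 30_000, "years_employed": 1},
    {"has_defaulted": True, "annual_income": 90_000, "years_employed": 10},
]
decisions = [approve_loan(a) for a in applicants]

print(decisions)
```

A decision tree learner would choose the attributes, their order, and the thresholds from labeled training examples instead of having them written by hand.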