Data Mining and Analytics - Theory
Lecture 1: Introduction
Data mining is only one step in the process of knowledge discovery, but it is an essential one since it uncovers hidden patterns for evaluation.
Architecture of a typical data mining system:
Information Repository: database, data warehouse, World Wide Web
Database or Data Warehouse Server
Data Mining Engine
Pattern Evaluation Module
User Interface
Data warehousing is the entire process of data extraction, transformation, and loading data to the warehouse and the access of the data by end users and applications.
Potential Applications of Data Mining:
Data analysis and decision support.
Market analysis and management: target marketing, customer relationship management, market basket analysis.
Risk analysis and management.
Forecasting, customer retention, quality control, competitive analysis.
Fraud and unusual pattern detection.
Text, stream and web mining.
Bioinformatics and bio-data analysis.
Data transfer from the operational database to the warehouse is an ongoing process.
When is data mining useful?
Problem can be clearly defined.
Potentially meaningful data exists.
Data contains hidden knowledge and isn’t only useful for reporting purposes.
Cost of data processing is less than the likely increase in profit.
Lecture 2: Clustering
Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures. In clustering, no predictions are made, and it is a form of learning by observation.
| Supervised Classification | Unsupervised Clustering |
|---|---|
| Known number of classes | Unknown number of classes |
| Based on a training set | No prior knowledge |
| Used to classify future observations | Used to understand and explore data |
K-Means
Clusters numerical data.
Input: numerical.
Output: centers of each cluster, and cluster assignment.
Distance metric defined over the variable space: Euclidean distance.
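The loop above (assign points to the nearest center, then recompute centers as cluster means) can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: initialization is a simple random pick and the empty-cluster case is not handled.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        # (note: an empty cluster would yield NaN; handling it is omitted here)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The output matches the description above: the final cluster centers plus the cluster assignment of each point.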
K-Modes
Variant of k-means to cluster nominal data, replacing means of clusters with modes.
Uses new dissimilarity measures and frequency-based methods to update modes of clusters.
Can be integrated with k-means to cluster mixed (numeric + nominal) data.
K-Medoids
A general version of k-means that works with any distance measure.
Computationally more expensive, effective for small datasets but doesn’t scale well for large datasets.
More robust than k-means when there's noise and outliers, since these have less influence on a medoid than on a mean.
A medoid is always a member of the dataset itself.
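Since a medoid must be an actual member of the dataset, the medoid of a cluster can be found by picking the point with the smallest total distance to all other points. A small sketch (Euclidean distance assumed, though any distance measure works):

```python
import numpy as np

def medoid(X):
    """Return the point of X minimizing the total distance to all other points."""
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distance matrix
    return X[dists.sum(axis=1).argmin()]                  # row with smallest total
```

Unlike a mean, the result is guaranteed to be one of the input points, which is why a distant outlier shifts it far less than it would shift a centroid.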
Lecture 3: Hierarchical Clustering
Its goal is to group objects into a tree structure in which the root is the whole input, the leaves are the individual elements, and the internal nodes are the union of their children.
Either agglomerative (bottom-up) or divisive (top-down).
Distance measures between two clusters:
Single-link: smallest distance between any two points.
Complete-link: largest distance between any two points.
Average-link: average distance between all pairs.
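The three linkage rules differ only in how they reduce the matrix of pairwise point distances between two clusters. A minimal NumPy sketch:

```python
import numpy as np

def cluster_distance(A, B, linkage="single"):
    """Distance between clusters A and B (arrays of points) under a linkage rule."""
    d = np.linalg.norm(A[:, None] - B[None], axis=2)  # all pairwise distances
    if linkage == "single":
        return d.min()       # single-link: smallest pairwise distance
    if linkage == "complete":
        return d.max()       # complete-link: largest pairwise distance
    return d.mean()          # average-link: mean over all pairs
```

Single-link tends to produce elongated "chained" clusters, while complete-link favors compact ones; average-link sits between the two.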
| Advantages | Disadvantages |
|---|---|
| Dendrograms are great for visualization | Not easy to define levels for clusters |
| Provides hierarchical relations between clusters | Experiments showed that other clustering techniques outperform it |
| Able to capture concentric clusters | |
Lecture 4: Fuzzy Clustering
Uses soft clustering, in which clusters may overlap and an element can belong to more than one cluster, as opposed to hard clustering, where each element belongs to exactly one cluster.
Fuzzy logic relies on degrees of membership, first introduced by Dr. Lotfi Zadeh of UC Berkeley in 1965 to model the uncertainty of natural language.
Uses mathematical theory of fuzzy sets.
Simulates the process of normal human reasoning.
Allows the computer to behave less precisely.
Decision making that involves gray areas.
There are different forms of membership function such as triangular, trapezoidal, or Gaussian. The type is context dependent, and the choice is often somewhat arbitrary.
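As an illustration, a triangular membership function is fully described by three parameters: the two feet `a` and `c` where membership is zero, and the peak `b` where membership is one.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 outside [a, c], rising to 1 at x == b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)   # rising edge
    return (c - x) / (c - b)       # falling edge
```

For example, with `a=0, b=5, c=10` the value 5 has full membership, 2.5 has membership 0.5, and anything outside [0, 10] has membership 0.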
Advantages of Fuzzy Logic:
More natural to construct.
Easy to understand, as it frees the imagination.
More forgiving and provides flexibility.
Uses less expensive hardware and shortens development time.
Increases system maintainability.
Handles decision-making problems not easily defined by mathematical models.
Fuzzy Inference Systems:
Maps an input space to an output space using fuzzy logic.
Uses rules that are in the form of if-then statements.
Tools generally allow more than one conclusion per rule.
Fuzzification
Crisp inputs → Membership Functions → Fuzzy inputs
If multiple parts in antecedent, resolve using operators:
AND: take minimum value
OR: take maximum value
NOT: take 1 - value
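The three operators translate directly into code operating on membership degrees in [0, 1]:

```python
def fuzzy_and(a, b):
    return min(a, b)   # AND: minimum of the membership degrees

def fuzzy_or(a, b):
    return max(a, b)   # OR: maximum of the membership degrees

def fuzzy_not(a):
    return 1 - a       # NOT: complement of the membership degree
```

So for an antecedent like "temperature is high AND humidity is low" with degrees 0.7 and 0.3, the resolved premise truth value is min(0.7, 0.3) = 0.3.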
Fuzzy Inferencing
Computes truth value for the premise of each rule and applies it to the conclusion.
Results in a fuzzy set for each output variable for each rule
Output Aggregation
Combines outputs of all rules into a single fuzzy set
Input: list of truncated output functions
Output: one fuzzy set for each output variable
Common aggregation operators:
Maximum: point-wise maximum
Sum: point-wise sum
Probabilistic sum
Defuzzification
Fuzzy Output → Crisp Number
Some common methods to find the crisp number:
Centroid: finding the center of gravity of the membership function for the fuzzy value.
Maximum: choosing the variable value at which the fuzzy set has its maximum truth value as the output variable.
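The centroid method computes a weighted average of the output variable's values, weighted by the membership degrees. A minimal sketch assuming the fuzzy output is sampled at discrete points `xs` with membership values `mu`:

```python
import numpy as np

def centroid_defuzzify(xs, mu):
    """Center of gravity of a sampled membership function mu over points xs."""
    xs, mu = np.asarray(xs, float), np.asarray(mu, float)
    return (xs * mu).sum() / mu.sum()   # weighted average of xs, weighted by mu
```

For instance, sample points [0, 10] with memberships [1, 3] defuzzify to (0·1 + 10·3) / 4 = 7.5, pulled toward the point with higher membership.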
→ Fuzzy rules follow common sense behavior of the system and are written in terms of the membership function linguistic labels.
Fuzzy Applications:
Manufacturing and management:
Space shuttle vehicle orbiting
Regulation of water temperature in shower heads
Selection of stocks to purchase
Inspection of beverage cans for printing defects
Matching of golf clubs to customers' swings
Risk assessment and project selection
Consumer products (air conditioners, cameras, dishwashers)
Business:
Strategic planning
Real estate appraisals and valuation
Bond evaluation and portfolio design
Lecture 5: Genetic Algorithms
Search and optimization techniques based on Darwin’s principle of natural selection.
Each candidate solution is a chromosome, which is a string of genes. Each can copy, mate and mutate via genetic operators.
A typical algorithm requires two things to be defined:
Genetic representation of the solution domain: bits, real numbers, rules, program elements, etc.
Fitness function: a way to evaluate the solution domain defined over the genetic representation.
The reason the genetic representations are convenient is that their parts are easily aligned due to their fixed size, facilitating simple crossovers. However, variable length may be used, although this makes crossover more complex.
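The fixed-size bit-string representation with simple crossover can be sketched end to end. This is a toy example, not any standard library API: the fitness function is OneMax (count of 1-bits), and the operator choices (tournament selection, one-point crossover, per-bit mutation) are common defaults, not the only options.

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30, generations=60,
                      mutation_rate=0.01, seed=1):
    """Bit-string GA sketch: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():                              # tournament selection of size 2
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, length)         # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [g ^ 1 if rng.random() < mutation_rate else g
                     for g in child]               # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# OneMax: fitness is simply the number of 1-bits in the chromosome
best = genetic_algorithm(sum)
```

Because the chromosomes have a fixed length, the crossover cut point aligns trivially, which is exactly the convenience the text describes.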
Tree-like genetic representations are explored in Genetic programming.
The fitness function is always problem dependent.
Genetic algorithms can be developed for both supervised and unsupervised data mining.
Genetic learning operators apply an evolutionary approach to inductive learning.
→ Limitations of Genetic Algorithms:
Doesn’t guarantee an optimal solution and often settles into a local minimum.
Not all problems can be put into a genetic algorithm formulation.
Development and interpretation of genetic algorithm solutions requires both programming and statistical skills.
Relies heavily on random number generators.
Locating good variables for a particular problem and obtaining data for these variables is difficult.
Selecting methods to evolve the system requires experimentation and experience.
→ Genetic Algorithms Applications:
Dynamic process control
Optimization of induction rules
Discovery of new connectivity topologies (neural networks)
Simulation of biological models of behavior
Complex design of engineering structures
Pattern Recognition
Scheduling, routing and transportation
Layout and circuit design
Telecommunication and graph-based problems.
Lecture 6: Data Warehouses
Why use a data warehouse?
Data is scattered over the network.
Many versions of the same data exist, with subtle differences.
Need an expert to access the data.
Available data is poorly documented.
Data needs to be transformed from one form to another.
What is needed?
Data should be integrated.
Summary data has value to the organization.
Historical data holds the key to understanding data over time.
What-if capabilities are required.
Necessary Skills
Successful technician
Flexible
Team player
Good balance of business and technical understanding
Characteristics of Data Warehouse
Subject-oriented:
Organized around subjects, based on how the users refer to them.
Focuses on modeling and analysis for decision makers.
Provides a simple and concise view around particular subject issues by eliminating data that isn’t useful in the decision support process.
Integration:
Integrating multiple heterogeneous sources.
Applies data preprocessing to ensure consistency.
Removes inconsistencies regarding naming convention and value representations.
Time-variant:
Has a longer time horizon (5-10 years) than operational systems (current value).
Every key structure contains an element of time either explicitly or implicitly, while operational data key may or may not contain a time element.
Nonvolatile:
Data cannot be updated and is stored in read-only format.
Data accessing requires only two operations: initial loading and access of data.
Components of Data Warehouse
Data Store:
Most common component.
Feeds the warehouse with data.
Generally subject-oriented and volatile.
Data Mart
Contains only a subset of corporate-wide data of value to a specific group of users, the scope of which is confined to specific subjects. Its data also tends to be summarized.
Low-cost, with an implementation cycle of weeks rather than months or years.
Metadata
Information about the warehouse. Legacy systems generally don’t keep a record of the characteristics of the data.
Abstractions — high-level data concisely describing lower-level data.
Has two types:
Technical Metadata: information about the warehouse - useful for admins and designers.
Business Metadata: information about the data - useful for end-users.
Has several components:
Transformation maps
Extraction and relationship history
Algorithms for summarization
Data ownership
Patterns of access
Warehouse Schemas
Star Schema
Single, large and central fact table and one table for each dimension.
Every fact points to one tuple in each of the dimension tables.
Does not capture hierarchies directly.
Benefits: easy to understand and define, reduces number of physical joins
Excellent for ad-hoc queries but bad for OLTP.
Snowflake Schema
Variant of star schema model with a single fact table and one or more tables for each dimension.
Dimension tables are normalized.
Drawbacks: time consuming joins, report generation is slow.
Galaxy Schema
Multiple fact tables share dimension tables.
Viewed as a collection of stars — hence the name fact constellation or galaxy schema.
Lecture 7: Building Data Warehouses
Data in warehouse is stored in form of fact tables and dimension tables.
Building data warehouses goes through several processes:
Data Selection
Data Preprocessing
Data Transformation and Integration
Data Loading
Data Warehouse Architectures
→ All of which include some form of ETL (extraction, transformation, and loading)
Generic Two-Level Architecture
→ Periodic extraction: data is not completely current.
→ One, company-wide warehouse to which data is loaded.
Independent Data Mart Architecture
→ Several data marts: mini warehouses, limited in scope.
→ Separate ETL for each independent data mart.
→ Higher data access complexity.
Dependent Data Mart with Operational Data Store Architecture
→ The operational data store is the data staging area and provides an option for obtaining current data.
→ Single ETL for a large, enterprise data warehouse.
→ Dependent data marts are loaded from the enterprise data warehouse.
Logical Data Mart and Real Time Architecture
→ An operational data store as the staging area which feeds data into a real-time data warehouse.
→ The warehouse includes a transformation layer, and several data marts.
→ The data marts are not separate databases, but rather logical views of the warehouse.
ETL Process
Data extraction: gathering data from multiple sources.
Data cleaning: detecting and rectifying errors in the data.
Data transformation: converts data from legacy format to warehouse format
Load: sorts, summarizes, checks integrity, etc.
Refresh: propagates updates from data sources to the warehouse.
Data Cube Model
→ Data warehouses are based on a multidimensional data model that views data in form of a cube. The data cube allows data to be modeled and viewed in multiple dimensions.
→ Defined by dimensions and facts:
Dimensions are the perspectives or entities the organization wants to keep records on, and each may have a table associated with it known as the dimension table.
Fact tables contain measures and keys to each of the dimensions.
OLAP
Data warehouse systems serve users in the role of data analysis and decision making. These systems, known as online analytical processing systems (OLAP), organize and present data in various formats to accommodate the needs of different users.
Defined as a set of functionalities that facilitates multidimensional analysis and allows users to analyze data in ways natural to them.
Enable analysts, managers and executives to gain insight into data through fast, consistent and interactive access to a variety of views of information transformed from raw data.
Operations:
Drill/Roll Down: from higher level summary to lower level, providing more detailed data or introducing new dimensions.
Drill/Roll Up: summarizing data by climbing hierarchies or through dimensionality reduction.
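The two operations can be illustrated with pandas on a toy fact table. The table, its dimensions (year, quarter, city), and its measure (amount) are invented for this sketch; the point is only how aggregation level changes.

```python
import pandas as pd

# Toy fact table: sales measured over the dimensions (year, quarter, city)
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "city":    ["Cairo", "Giza", "Cairo", "Cairo"],
    "amount":  [100, 50, 70, 120],
})

# Roll up: climb the hierarchy by dropping dimensions, summarizing to year level
by_year = sales.groupby("year")["amount"].sum()

# Drill down: reintroduce the quarter dimension for more detailed data
by_year_quarter = sales.groupby(["year", "quarter"])["amount"].sum()
```

Rolling up reduces dimensionality (here to year only), while drilling down adds a dimension back and exposes finer-grained totals.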
→ Data warehousing includes:
Building the warehouse
OLAP
Presentation
OLTP - Online Transaction Processing Systems
Online operational database systems whose task is to perform online transactions and query processing. They cover most day-to-day operations such as purchasing, inventory, manufacturing, banking, payroll, registration and accounting.
Data Warehouse Design and Usage
Business Analysis Framework
There are four views regarding warehouse design to be considered.
→ Top-down: allows selection of relevant data.
→ Data source: presents the information being managed by the operational system.
→ Data warehouse: includes the fact and dimension tables. Represents the information stored inside the warehouse.
→ Business query: data from the viewpoint of the end-user.
Design Process
There are three approaches to consider.
→ Top-down: starts with overall design and planning. Useful when the technology is mature, and the business problem is clear and well-understood.
→ Bottom-up: starts with experiments and prototypes. Useful in early stages of business modeling and technology development. Less expensive, and less commitments.
→ Combined: exploits strategic nature of top-down, and the rapid implementation of bottom-up.
Usage for Information Processing
A data warehouse serves as part of a closed loop “plan-execute-assess” feedback system for enterprise management. Warehouses are widely used in:
Financial services
Banking services
Consumer goods
Retail sector
Controlled manufacturing
Lecture 8: Web Mining
The world wide web is huge, widely distributed and global, and, therefore, constitutes a rich source for data mining. Web data includes:
Web documents
Hyperlinks between documents
Usage logs of websites.
Issues with web mining:
Web data tends to be very large, from tens to hundreds of terabytes.
Needs large farms of servers.
Proper organization of hardware and software is needed.
Difficulty in finding relevant information.
Extracting new knowledge from the web.
Web Content Mining
Related to but different from data mining and text mining. Used to extract useful information from web page contents. The data is mainly semi-structured and/or unstructured.
Includes:
Discovery of useful information.
Information retrieval view.
Database view.
Web pages → Preprocessing → Vector Creation → Structured Information
Feature Extraction:
Using bag of words where each word is a feature.
Features can be either Boolean (occurrence of a word or not), or frequency.
Variations include removing case, punctuation, infrequent words, stop words.
Reducing features using:
Stemming: reducing words to their morphological roots.
Information gain, mutual information, cross entropy.
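A minimal bag-of-words extractor covering the variations listed above (lowercasing, stripping punctuation, removing stop words, Boolean vs. frequency features); the tiny stop-word list here is purely illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}  # tiny illustrative list

def bag_of_words(text, binary=False):
    """Bag-of-words features: lowercase, strip punctuation, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if binary:
        return {w: 1 for w in counts}   # Boolean features: word occurs or not
    return dict(counts)                 # frequency features: occurrence counts
```

Stemming and information-theoretic feature selection would then be applied on top of these raw features to reduce dimensionality.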
Agent-based Approaches:
Intelligent search agents: search for characteristics to organize and interpret the discovered information.
Information filtering: using information retrieval techniques to filter and categorize, such as HyPursuit and BO.
Developing sophisticated AI systems acting autonomously on behalf of users.
Database Approaches:
Used to transform unstructured data into a high-level collection such as a relational database, then using standard querying and data mining techniques for analysis.
Multi-level databases:
Lowest level: keeps semi-structured information.
High level: generalizations from lower levels are organized into relations and objects.
Web-query systems: query systems and languages such as SQL, and NLP for extracting data.
Web Structure Mining
Web pages are nodes, and hyperlinks are edges.
Performed at the document level (intra-page) or at the hyperlink level (inter-page).
PageRank:
Prioritizes pages returned from search by looking at the web structure. The importance of pages is calculated based on the number of pages which point to it, aka backlinks.
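The backlink idea can be sketched as the standard power-iteration form of PageRank. For simplicity this sketch assumes every page has at least one outgoing link (dangling pages are not handled), and 0.85 is the conventional damping factor.

```python
def pagerank(links, damping=0.85, n_iter=50):
    """Power-iteration PageRank on a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}              # start with uniform rank
    for _ in range(n_iter):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            for q in outs:                        # p shares its rank among its outlinks
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank
```

A page's rank rises with the number and rank of its backlinks: in a graph where both `a` and `b` link only to `c`, page `c` ends up with the highest rank.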
HITS (Hyperlink Induced Topic Search):
Authorities: highly important pages — the best source for the requested information.
Hubs: contain links to authorities.
HITS is an iterative algorithm for mining the web graph to identify the topic hubs and authorities.
→ Hubbiness of a page is the sum of the authorities of all the pages it links to.
→ Authority of a page is the sum of hubbiness of all pages that link to it.
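The two mutually recursive updates above translate directly into an iterative sketch (with normalization each round so the scores stay bounded):

```python
def hits(links, n_iter=50):
    """HITS sketch: iteratively update hub and authority scores on a link graph."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        # authority of a page = sum of hub scores of the pages linking to it
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        # hubbiness of a page = sum of authority scores of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # normalize so the scores do not grow without bound
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

Pages pointed to by many good hubs emerge as authorities, and pages pointing to many good authorities emerge as hubs.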
Applications:
Information retrieval in social networks.
Finding the relevance of each web page.
Measuring the completeness of websites.
Search engines.
Web Usage Mining
Also known as web log mining.
Its goal is to analyze the behavioral patterns and profiles of users interacting with a website. Those patterns are represented as collections of frequently accessed resources by groups of users with common interests.
Data:
Web server logs
Site contents
Data about the visitors
Sources:
Server access, referrer, agent logs.
Cookies
User profiles, metadata, page attributes, content attributes, usage data
Phases:
Data Preparation
Data cleaning: removing irrelevant log entries (e.g., image file requests) by checking the suffix of the URL.
User identification: using heuristics to identify users.
Transaction identification: all page references made during a single visit.
Pattern Discovery
Clustering and classification: clustering users to provide personalized web content and clustering pages to aid in search engines.
Sequential patterns: extract frequently occurring intersession patterns to predict future visit patterns to place ads or recommendations.
Association rules: correlations among pages accessed together by a client to help restructure a website or develop marketing strategies.
Pattern Analysis
Validation: removing irrelevant rules or patterns.
Interpretation: output is usually in mathematical form and not suitable for direct human interpretation.
Design:
Web log is filtered to generate a relational database.
A data cube is generated from that database.
OLAP is used to drill-down and roll-up in the cube.