Data Science 2

Session 2

Data Terminology

• A dataset (example set) is a collection of data with a defined structure.

• A data point (record, object or example) is a single instance in the dataset. Each row in the table is a data point. Each instance contains the same structure as the dataset.

• An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Each column in the table is an attribute. Attributes can be numeric, categorical, date-time, text, or Boolean data types.

• A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.

• Identifiers are special attributes that are used for locating or providing context to individual records. They carry no information that is suitable for building data science models and should therefore be excluded from the actual modeling step.
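The terms above can be illustrated with a small sketch in plain Python. The dataset, the attribute names (customer_id, age, eye_color, churned) and the values are hypothetical and chosen only for illustration.

```python
# Hypothetical toy dataset: each dict is one data point (one row of the table).
dataset = [
    {"customer_id": 101, "age": 34, "eye_color": "brown", "churned": False},
    {"customer_id": 102, "age": 51, "eye_color": "blue", "churned": True},
    {"customer_id": 103, "age": 27, "eye_color": "green", "churned": False},
]

IDENTIFIER = "customer_id"  # locates a record, not used for modeling
LABEL = "churned"           # the special attribute to be predicted

# Input attributes are all columns except the identifier and the label.
input_attributes = [name for name in dataset[0] if name not in (IDENTIFIER, LABEL)]

# Split every data point into model inputs (X) and the label (y).
X = [[row[name] for name in input_attributes] for row in dataset]
y = [row[LABEL] for row in dataset]

print(input_attributes)
print(X)
print(y)
```

Note that the identifier is dropped before modeling, while the label is kept apart from the input attributes.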

Types of Variables

Type of Measurements: Examples

• Nominal

• Eye color, zip codes

• Ordinal

• rankings (e.g., taste of beers on a scale from 1-10), grades

• Integer

• # of orders, # of children in a family

• Real

• Bank account balance, income

From Business Problems to Data Science Solutions

• Decompose a data analytics problem into pieces such that each piece matches a known task for which tools are available

• There is a large number of data mining algorithms available, but only a limited number of data mining tasks

Data Science Methods and Examples

• Descriptive Methods

• Goal: Find patterns in the data.

• Example: Which products are often bought together?

• Predictive Methods

• Goal: Predict unknown values of a variable given observations (e.g., from the past)

• Example: Will a person click on an online advertisement, given her browsing history?

• Machine Learning Terminology

• descriptive = unsupervised

• predictive = supervised

Data Science Tasks

1. Classification [Predictive]

2. Regression [Predictive]

3. Clustering [Descriptive]

4. Association Analysis [Descriptive]

Classification: Definition

− Goal: Previously unseen records should be assigned a class from a given set of classes as accurately as possible.

− Approach: Given a collection of records (training set)

• each record contains a set of attributes

• one of the attributes is the class (label) that should be predicted.

− Find a model for the class attribute as a function of the values of other attributes.
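As a minimal sketch of this definition, the example below uses a 1-nearest-neighbour rule as the model. This is only one of many possible classifiers, chosen here because it is short; the training records and the class values "yes"/"no" are hypothetical.

```python
import math

# Hypothetical training set: (numeric attributes, class label).
training_set = [
    ((1.0, 1.0), "no"),
    ((1.5, 2.0), "no"),
    ((5.0, 5.0), "yes"),
    ((6.0, 4.5), "yes"),
]

def predict(record):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda pair: math.dist(pair[0], record))
    return nearest[1]

# Previously unseen records are assigned one class from the given set of classes.
print(predict((1.2, 1.4)))
print(predict((5.5, 5.0)))
```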

Regression

• Predict a value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.

• Extensively studied in statistics and in the neural network field.

• Examples:

• Predicting the sales amount of a new product based on advertising expenditure.

• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.

• Predicting the realizable price of a house or car.

− Difference from classification: In regression the attribute to be predicted is continuous, whereas classification is used for nominal class attributes (e.g., yes/no).
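A minimal sketch of regression with a linear model of dependency, using ordinary least squares on a single input variable. The advertising and sales figures are hypothetical and were constructed to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)

# Predict a continuous value for a previously unseen expenditure.
predicted_sales = slope * 5.0 + intercept
print(slope, intercept, predicted_sales)
```

Unlike the classifier, the output here is a number on a continuous scale rather than one of a fixed set of classes.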

Clustering: Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that

• data points in one cluster are more similar to one another

• data points in separate clusters are less similar to one another

• Similarity Measures

• Euclidean distance if attributes are continuous

• Other problem-specific similarity measures

• Goals

• Intra-cluster distances are minimized

• Inter-cluster distances are maximized

• Result

• A descriptive grouping of data points
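One common way to obtain such a grouping is k-means, sketched below with Euclidean distance. The slides do not name a specific algorithm, so k-means, the fixed starting centroids and the sample points are assumptions made for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical two-dimensional data points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

# Starting centroids are picked by hand here to keep the sketch deterministic.
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

Points within a cluster end up close to their own centroid (small intra-cluster distances), while the two centroids lie far apart (large inter-cluster distance).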

Clustering: Application 1

− Application area: Market segmentation

− Goal: Divide a market into distinct subsets of customers

• where any subset may be conceived as a marketing target to be reached with a distinct marketing mix

− Approach:

1. Collect information about customers

2. Find clusters of similar customers

3. Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters

Clustering: Application 2

• Application area: Document Clustering

• Goal: Find groups of documents that are similar to each other based on terms appearing in them.

• Approach

• Identify frequently occurring terms in each document.

• Form a similarity measure based on the frequencies of different terms.

• Application Example: Grouping of articles in Google News
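The document clustering approach can be sketched as follows: count term frequencies per document, then compare documents with cosine similarity. Cosine similarity is one common choice of frequency-based measure and is an assumption here; the three example documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each term occurs in a document (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Similarity of two term-frequency vectors: 1.0 = same direction, 0.0 = no shared terms."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
sports_1 = term_frequencies("the team won the match")
sports_2 = term_frequencies("the team lost the match")
finance = term_frequencies("stocks fell as markets closed")

similar = cosine_similarity(sports_1, sports_2)
dissimilar = cosine_similarity(sports_1, finance)
print(similar, dissimilar)
```

A clustering algorithm would then group documents whose pairwise similarity is high. In practice, very frequent words such as "the" are usually down-weighted or removed first.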

Association Analysis: Definition

• Given a set of records, each of which contains some number of items from a given collection,

• produce dependency rules that predict the occurrence of an item based on the occurrences of other items
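Dependency rules are commonly scored with support and confidence. These two measures are not defined on the slide and are introduced here as an assumption, and the transactions are hypothetical. A minimal sketch:

```python
# Hypothetical records: each set holds the items of one transaction.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent occurs in transactions that contain the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Candidate rule: {diapers, milk} -> {beer}
rule_support = support({"diapers", "milk", "beer"})
rule_confidence = confidence({"diapers", "milk"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst.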

Association Analysis: Application

• Application area: Supermarket shelf management.

• Goal: To identify items that are bought together by sufficiently many customers.

• Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

• A classic rule and its implications:

• If a customer buys diapers and milk, then he is likely to buy beer as well.

• So, don’t be surprised if you find six-packs stacked next to diapers!

• Promote diapers to boost beer sales.

• If selling diapers is discontinued, this will affect beer sales as well.

• Application area: Sales Promotion

Practice: Answering business questions with these techniques

1. To which segment will this recent customer belong?

2. In summer, what are the products customers frequently purchase together with steaks?

3. Who are the clients that we risk losing during the move of this branch?

4. Will a given new customer be profitable? How much revenue should I expect this customer to generate?

5. How many segments can we use to serve our customers effectively? What do these segments look like?

Data Science is a Process

Business Understanding

• Understand the problem to be solved!

• Designing the solution is an iterative process of discovery

• Analyst’s creativity plays an important role

• Structure the problem such that one or more sub-problems involve building models for classification, regression, ...

Business Understanding - Prior Knowledge

• A clear and well-defined business objective

• Knowledge of the subject matter, the context and the business process generating the data

• Understanding how the data is collected, stored, transformed and reported

• Correlation versus Causality

Data Understanding

• Collect data

– List the datasets acquired (locations, methods used to acquire, problems encountered and solutions achieved).

• Describe data

– Check data volume and examine its gross properties.

– Accessibility and availability of attributes. Attribute types, range, correlations, the identities.

– Understand the meaning of each attribute and attribute value in business terms.

– For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness).
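Basic statistics for a numeric attribute can be computed with Python's standard statistics module, as in the sketch below. The income values are hypothetical.

```python
import statistics

# Hypothetical values of one numeric attribute (income in thousands).
incomes = [30, 40, 40, 50, 90]

summary = {
    "min": min(incomes),
    "max": max(incomes),
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "mode": statistics.mode(incomes),
    "variance": statistics.variance(incomes),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(incomes),        # sample standard deviation
}

for name, value in summary.items():
    print(name, value)
```

A mean that is clearly larger than the median, as here, is a simple hint that the distribution is skewed to the right.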

Data Understanding

• Explore data

– Analyze properties of interesting attributes in detail.

• Distribution, relations between pairs or small numbers of attributes, properties of significant sub-populations, simple statistical analyses.

• Verify data quality

– Identify special values and catalogue their meaning.

– Does it cover all the cases required? Does it contain errors and how common are they?

– Identify missing attributes and blank fields. Meaning of missing data.

– Do the meanings of attributes and contained values fit together?

– Check spelling of values (e.g., same value but sometimes beginning with a lowercase letter, sometimes with an uppercase letter).

– Check the plausibility of values (e.g., flag attributes where all fields have the same or nearly the same value).
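Two of these quality checks, counting blank fields and finding values that differ only in letter case, can be sketched as follows. The records and the attribute names (city, marital_status) are hypothetical.

```python
# Hypothetical records with a blank field, a missing value and inconsistent casing.
records = [
    {"city": "Paris", "marital_status": "single"},
    {"city": "paris", "marital_status": ""},
    {"city": "Lyon", "marital_status": None},
    {"city": "Paris", "marital_status": "married"},
]

def missing_count(rows, attribute):
    """Number of rows where the attribute is blank or missing."""
    return sum(1 for row in rows if row[attribute] in (None, ""))

def casing_variants(rows, attribute):
    """Group spellings that differ only by letter case; return groups with more than one spelling."""
    variants = {}
    for row in rows:
        value = row[attribute]
        if value:
            variants.setdefault(value.lower(), set()).add(value)
    return {key: spellings for key, spellings in variants.items() if len(spellings) > 1}

print(missing_count(records, "marital_status"))
print(casing_variants(records, "city"))
```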

Data Preparation

• Select data

– Reconsider data selection criteria.

– Decide which dataset will be used.

– Collect appropriate additional data (internal or external).

– Consider use of sampling techniques.

– Explain why certain data was included or excluded.

• Clean data

– Correct, remove or ignore noise.

– Decide how to deal with special values and their meaning (e.g., 99 for marital status).

– Aggregation level, missing values, etc.

– Outliers?
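A minimal sketch of handling a special value: the code 99 is first turned into an explicit missing value and then, as one option among several, imputed with the most frequent known value. The codes 1 and 2 and the choice of mode imputation are assumptions for illustration.

```python
from statistics import mode

MISSING_CODE = 99  # assumed to mean "unknown" in this hypothetical encoding

# Hypothetical coded values of the attribute "marital status".
marital_status = [1, 2, 99, 1, 99, 1]

# Step 1: make the special value explicit as missing data.
cleaned = [None if value == MISSING_CODE else value for value in marital_status]

# Step 2 (one possible choice): impute missing entries with the most frequent known value.
known_values = [value for value in cleaned if value is not None]
most_frequent = mode(known_values)
imputed = [most_frequent if value is None else value for value in cleaned]

print(cleaned)
print(imputed)
```

Whether to impute, remove or keep such records depends on what the missing value means in business terms.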

Data Preparation

• Construct data

– Derived attributes.

– Background knowledge.

– How can missing attributes be constructed or imputed?

• Integrate data

– Integrate sources and store result (new tables and records).

• Format Data

– Rearranging attributes (some tools have requirements on the order of the attributes).

– Reordering records (perhaps the modelling tool requires that the records be sorted according to the value of the outcome attribute).

– Reformat within values (remove illegal characters, unify uppercase and lowercase).

Modeling

− Input: Preprocessed Data

− Output: Model / Patterns

1. Apply data science method

2. Evaluate resulting model / patterns

3. Iterate

• experiment with different parameter settings

• experiment with multiple alternative methods

• improve preprocessing

• increase amount or quality of training data

Evaluation

• Assess the Data Mining results rigorously

• Gain confidence that results are valid and reliable

• Ensure that the model satisfies the original business goals (support decision making!)

• Ensure comprehensibility of the model to stakeholders

• Design experiments for tests in live systems

• Example: fraud detection

– A fraud detection model may be extremely accurate

– Evaluation shows that it produces too many false alarms

– What is the cost of dealing with false alarms?
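The fraud detection example can be made concrete with a confusion matrix. All counts and the cost per false alarm below are hypothetical.

```python
# Hypothetical evaluation on 1000 transactions, 10 of which are fraudulent.
true_positives = 9     # fraud correctly flagged
false_positives = 51   # legitimate transactions flagged (false alarms)
false_negatives = 1    # fraud that was missed
true_negatives = 939   # legitimate transactions correctly passed

total = true_positives + false_positives + false_negatives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# Assumed cost of manually investigating one false alarm.
COST_PER_FALSE_ALARM = 20.0
false_alarm_cost = false_positives * COST_PER_FALSE_ALARM

print(accuracy, precision, recall, false_alarm_cost)
```

The model is about 95% accurate, yet only 15% of its alarms are real fraud, so accuracy alone does not show whether the business goal is met.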

Deployment

• Production readiness determines the critical qualities required for the deployment objectives

– A consumer credit approval process

– A customer segmentation process

• Technical Integration

– Data science automation in R, Python, PMML, …

• Response Time

– Trade-off between production responsiveness and model building time

• Model Refresh

– when and how often

• Assimilation of the knowledge in the organisation

Value Types in RapidMiner

Value types define how data is treated

• Numeric data has an order (2 is closer to 1 than to 5)

• Nominal data has no order (red is as different from green as from blue)
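The difference between the two value types can be sketched with two distance functions. This is a plain Python illustration, not RapidMiner code, and the function names are hypothetical.

```python
def numeric_distance(a, b):
    """Numeric values are ordered, so the size of the difference is meaningful."""
    return abs(a - b)

def nominal_distance(a, b):
    """Nominal values are only equal or different; there is no 'closer' or 'further'."""
    return 0 if a == b else 1

print(numeric_distance(2, 1), numeric_distance(2, 5))
print(nominal_distance("red", "green"), nominal_distance("red", "blue"))
```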

Roles in RapidMiner

Roles define how the attribute is treated by the operators

The Repository

• This is where you store your data and processes

• Stores data and its meta data (!)

• RapidMiner can show you which attributes exist only if you load the data from the repository

• Add data via the “Add Data” button or the “Store” operator

• Load data via drag ‘n’ drop or the “Retrieve” operator

RapidMiner Operators: Pre-Processing

• Type and Role Conversions

• “Type A to Type B”: Change the type

• “Set Role”: Change the role

• Attribute Set Transformation

• “Select Attributes”: Remove attributes

• “Generate Attributes”: Create new attributes

• Value Transformation

• “Normalize”: transform all values to a certain range

• Filtering

• “Filter Examples”: Remove examples

• Aggregation

• “Aggregate”: count, sum
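To show what three of these operators compute, here is a rough plain Python analogue of “Normalize”, “Filter Examples” and “Aggregate”. It is not RapidMiner code; the order data and the min-max scaling to the range 0 to 1 are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical example set.
orders = [
    {"region": "north", "amount": 10.0},
    {"region": "south", "amount": 30.0},
    {"region": "north", "amount": 20.0},
    {"region": "south", "amount": 50.0},
]

# Analogue of "Normalize": rescale amounts to the range [0, 1] (min-max scaling).
lowest = min(order["amount"] for order in orders)
highest = max(order["amount"] for order in orders)
normalized = [(order["amount"] - lowest) / (highest - lowest) for order in orders]

# Analogue of "Filter Examples": keep only the rows that satisfy a condition.
large_orders = [order for order in orders if order["amount"] >= 20.0]

# Analogue of "Aggregate": count and sum per group.
totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
for order in orders:
    totals[order["region"]]["count"] += 1
    totals[order["region"]]["sum"] += order["amount"]

print(normalized)
print(len(large_orders))
print(dict(totals))
```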

How to find Operators?

• The Operators Panel lets you browse all available operators

• You can search for operators by typing in the search bar

• You add operators by double clicking or by dragging them on to the process view

Design and Results Views in RapidMiner

• Use the “Design View” to create your Process

• See your current Process – “Process”

• Access your data and processes – “Repository”

• Add operators to the process – “Operators”

• Configure the operators – “Parameters”

• Learn about operators – “Help”

• Use the “Results View” to inspect the output

• The “Data View” shows your example set

• The “Statistics View” contains meta data and statistics

• The “Visualizations View” allows you to visualize the data