Business Intelligence (BI)
An umbrella term that includes the applications, infrastructure, tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance
What is Machine Learning
The process of solving practical problems by 1. gathering a dataset and 2. algorithmically building a statistical model based on the dataset
What is Machine Learning pt 2
The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data
What is Machine Learning pt 3
A branch of artificial intelligence and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving its accuracy
Why use Machine Learning Now?
Big data: unprecedented amount of data
Competitive landscape: rapidly changing and competitive market
Storage: availability of processing and storage space
Data Mining
process of discovering interesting and meaningful patterns in data
- data driven process not user driven process
How is Predictive Analytics different
- BI is more user driven, PA is a subset
- Statistics - more model (theory) than data, taught not to data mine
- Data mining: more connotation about privacy
Machine Learning Definition
Emphasis on machine recognizing patterns
- less emphasis on reporting and visualization
Statisticians:
Rigor
- make sure everything follows every assumption
Machine Learning:
Performance
- Get the best outcome variables regardless of the context
Data Analyst:
Storytelling
- Explain something that makes sense
Two Types of Data Mining
Descriptive
Predictive
* Both use Present data
Descriptive Data Mining
Describes the data
- who were the best customers
- what did customers buy together?
Predictive Data Mining
Makes predictions based on past data
- who will be the best customers
- what will customers buy together
Why do we need Machine Learning
Often state the obvious
- what is the best way to predict
- how well does it predict
Data explosion
- large increase of available data
Big Data- The 3 V's
Volume
Variety
Velocity
Outcome: Value- what can you do with the data, how does it add value
Volume
Can you find the information you are looking for
Variety
Is a picture worth a thousand words in 70 languages?
Is your information balanced
Velocity
Information gains momentum and crises and opportunities evolve in real time. How is the outlook for today?
Who does what
End User: Decision Making
Business Analyst: Data Presentation/ visualization
Data Analyst: Data mining (information discovery), Data Exploration (statistical analysis, querying, reporting)
Database Administrator: Data Warehouses/Data Marts (OLAP),
Data sources (paper, files, data, OLTP)
Needed skills of a Data Analyst
- Statistical awareness
- Expertise with analytical tools
- Domain Knowledge
* know what results are useful
* know how to apply results
The Knowledge Discovery in Databases Process (KDD)
Data --> Selection
Target Data --> Processing
Processed Data --> Transformation
Transformed Data --> Data Mining
Patterns --> Interpretation and Evaluation
Knowledge
Potential Application Areas
Marketing
- Database marketing
- Target marketing
- Customer relationship Management (CRM)
Finance
- Credit Scoring
- Fraud Detection
Management
- Health informatics
Rexer Analytics Survey Results
survey of 1200 data scientists in 72 countries
from multiple sectors
three key words: Data, Scientist, Computer
1/3 of respondents have seen difficulties when do-it-yourself tools or services are used
What do Data Scientists do?
improving understanding of customers
Retaining customers
Improving customer experiences
selling products/services to existing customers
market research/ survey analysis
acquiring customers
improving direct marketing programs
sales forecasting
fraud detection or prevention
risk management / credit scoring
How do Data Scientists do it?
Regression, Decision Trees, and Cluster Analysis remain the most commonly used algorithms
Biggest Difficulty for Data Scientists
Deployment
Challenges to Analytics
Deployment
- must be actionable, integrated within the company
Management-Communication
- Have to have trust in the model
Data
- Data must be accessible
- must be accurate
Modeling
- Risk of overfitting the model
What is SAS
A suite of business solutions and technologies to help organizations solve problems
Analytical software system used by many businesses worldwide
- regulation and resources
The base system for all other SAS products
- JMP
- Enterprise Miner
Why use SAS
- Access and manage data across multiple sources
- perform analyses and deliver information across your organization
- Access, Manage, Analyze, Present
Why not use Excel or SPSS
- Can use huge datasets
- saves a program of your steps
SAS, R, or Python
34% use SAS overall
Data Scientists: Those working primarily with unstructured or streaming data
Predictive Analytics Pros: Primarily work with structured data
SAS Programs
A SAS program is a sequence of one or more steps
- Data steps typically create SAS data sets
- Proc steps typically process SAS data sets to generate reports and graphs and to manage data
SAS Program Steps
A step is a sequence of SAS statements.
Step Boundaries
SAS steps begin with one of the following
- A data statement
- A proc statement
SAS detects the end of a step when it encounters one of the following:
- A run statement (for most steps)
- a Quit statement (for some procedures)
- The beginning of another step (Data statement or Proc statement)
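A minimal sketch of the three kinds of step boundary. It assumes the sample table sashelp.class is available; the data set name work.teens is made up for illustration.

```sas
/* DATA step: begins at the DATA statement and ends at RUN */
data work.teens;
   set sashelp.class;
   where age >= 13;
run;

/* PROC step: begins at the PROC statement and ends at RUN */
proc print data=work.teens;
run;

/* Some procedures, such as PROC DATASETS, stay active until QUIT */
proc datasets library=work nolist;
   delete teens;
quit;
```

If a RUN statement is left out, the next DATA or PROC statement still closes the previous step, but an explicit RUN or QUIT keeps the log easier to read.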
Data Steps
Used for reading, manipulating, and processing data
- any changes to the data file are made here
SAS datasets have the extension .sas7bdat
- with a datastep you can read in other formats e.g. Excel
Proc Steps
Use for analyzing the data
- Any interpretation of the data file is done here
- proc contents (shows contents of SAS file, metadata)
- proc sort (sorts sas datasets)
- proc print (prints sas datasets)
- proc import (imports data from excel)
- proc export (exports data to excel)
SAS interface
Three Primary tabs or windows
Editor: can enter, edit, submit, and save a SAS program
Log: Browse notes, warnings, and errors relating to a submitted SAS program
Results: browse output from reporting procedures
SAS Syntax Rules: Statements
Usually begin with an identifying keyword
always end with a semicolon
Recommended Formatting
Begin each statement on a new line
use white space to separate words and steps
indent statements within a step
indent continued lines in multiline statements
Add comments to the code
Can be used for:
documenting
taking notes
trial and error
Syntax Errors
A syntax error is an error in the spelling or grammar of a SAS statement. SAS finds syntax errors as it compiles each SAS statement, before execution begins
examples of syntax errors:
misspelled keywords
unmatched quotation marks
missing semicolons
invalid options
Syntax Errors pt 2
When SAS encounters a syntax problem, it writes a warning or error message to the log
you should always check the log to make sure that the program ran successfully even if output is generated
Data Mining Methodologies
CRISP-DM
SEMMA
viewed as implementations of KDD
CRISP-DM
Cross-Industry Standard Process for Data Mining
- developed in 1996 by DaimlerChrysler, SPSS, NCR (POS services)
- De facto industry standard
- 6 steps
- cyclical process
CRISP-DM 6 steps
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Defining Business Objectives
- What core business objectives should be addressed
- How can they be quantified
- What data is available
- What methods can be used for this
- How can the model be assessed
- How can the models be deployed
Defining the target variable
- The variable to be estimated or predicted based on the business objectives
- Can't include the future when building a model
- must have data for all predictors
* may need 30-60 days in the past
Defining measures of Success
Classification
- Percent correct classification (PCC)
- how many errors are made
- Confusion Matrix
- how errors are made
Subset of the Population
- Lift
- ROC
- Area under the curve
Estimation
- R^2
- Average Error
- Mean Squared Error
Business Measures
- ROI
- Parsimony
- Explainability
Data Understanding
Examine the data
- initial collection of the data
- describe and explore the data
Identify Problems
- verify data quality
Defining data
Data must be 2D
- each row is a unit of analysis, a record
- Each column is a variable
Data is rectangular
- each record has the same number of columns
Normalize
Defining the Unit of Analysis
Including a column for each visit is complicated
- Not every customer has the same number of visits
should the unit of analysis be
- Each visit to a store
- A customer
- A household
All variables must be on the same level
Data Preparation
Fix problems in the data
- Data Cleaning
- Data Transformation
Create Derived Variables
- Data formatting
Modeling
Build predictive or descriptive models
- Regression
- Logistic Regression
- Decision Trees
- Artificial Neural Networks
- Cluster Analysis
Evaluation
Assess Models
- Do they meet the business needs (not just the stats)
- might result in identification of other needs
Report the expected effects of the models
Deployment
Plan for use of models
- Apply to business operations
- Monitor for changes in the operating conditions
- Documentation
Modeling out of Order
Caution
- Can misguide the Analysis
- Can rule out variables that are useful
Building Models First
- Get a sense for the variables
- Get a baseline for how the model might work
- Determine if the model predicts too well
* confounding variable
Early Deployment
- Usually know main predictors quickly
- Determine obstacles in real world
Modeling Process
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
SAS Modeling Method
Developed by SAS
- Logical organization of the functional toolset of SAS Enterprise Miner
- For carrying out core tasks of data mining
- Does not emphasize business understanding
Sample, Explore, Modify, Model, Assess
Sample
Input data
Partition Data
Create Multiple Data Sets
- Training: used for model fitting
- Validation: To assess the model
- Test: Determine how well model generalizes
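The three-way split can be sketched in a plain DATA step. This is only an illustration outside Enterprise Miner: the 60/20/20 proportions, the seed, the output data set names, and the choice of the sample table sashelp.heart are all assumptions.

```sas
/* Assign each row of a sample table to one of three partitions */
data work.train work.validate work.test;
   set sashelp.heart;
   if _n_ = 1 then call streaminit(12345);  /* fixed seed so the split is repeatable */
   u = rand('uniform');                     /* random number between 0 and 1 */
   if u < 0.6 then output work.train;       /* about 60%: model fitting */
   else if u < 0.8 then output work.validate; /* about 20%: assess the model */
   else output work.test;                   /* about 20%: check generalization */
   drop u;
run;
```

Because the assignment is random, the partition sizes are approximate rather than exact.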
Explore
Gain Understanding of the data
Look for patterns that exist
- Factor analysis
- clustering
Modify
Create, select, and transform variables
- Group customers
- Alter dates
look for outliers
reduce the number of variables
Model
Use tools to predict target variable
- Neural networks
- Decision Trees
- Regression
Assess
- Determine how useful
- Determine how reliable
- Use the test data
Modern Workflow
Assess and View
Interact
Analyze and Discover
Share
Promote and Govern
What is a SAS Data Set
A SAS data set is a specially structured data file that SAS creates and that only SAS can read. A SAS data set is a table that contains observations and variables
File Formats
- .sas7bdat
- SAS can also import other file formats
- .xlsx
- .csv
SAS Data Set Terminology
SAS Data Set --> Table
Observation --> Row
Variable --> Column
Browsing the Descriptor Portion of SAS
Use Proc Contents to display the descriptor portion of a SAS data set
Proc contents data = libname.dataset;
run;
Descriptor Portion
The descriptor portion contains the following metadata
- general properties (such as data set name and number of observations)
- Variable properties (such as name, type, and length)
Proc Contents
- Add a proc contents step to display the metadata
Data Portion
The data portion of a SAS data set contains the data values, which are either character or numeric
Browsing the Data Portion
Using Proc print to display the data portion of a SAS data set
proc print data = libname.dataset;
run;
SAS Variable Names
- Can be 1-32 characters long
- Must start with a letter or underscore. Subsequent characters can be letters, digits, or underscores
- can be uppercase, lowercase, or mixed case
- are not case sensitive
Invalid Variable Names
5monthsdata
data#5
five months data
Missing Data Values
Missing values are valid values in a SAS data set
- a blank represents a missing character value
- a period represents a missing numeric value
A value must exist for every variable in every observation
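A small sketch of how missing values look in a program. The data set name work.missing_demo and its two variables are invented for illustration.

```sas
data work.missing_demo;
   length name $ 8;
   name = 'Ann'; score = 90; output;
   name = ' ';   score = .;  output;   /* blank = missing character, period = missing numeric */
run;

/* The second row prints with a blank name and a period for score */
proc print data=work.missing_demo;
run;
```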
Write a Program to Display the Current Date
data date;
currentdate = today();
format currentdate date9.; /* show the SAS date value as a readable date */
run;
proc print data=work.date;
run;
Any Date
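data _NULL_;
/* replace DDMMMYY with an actual date; a date constant is already numeric */
date = 'DDMMMYY'd;
put date; /* writes the SAS date value to the log */
run;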
SAS Libraries
SAS datasets are stored in SAS libraries. A SAS library is a collection of SAS files that are referenced and stored as a unit. Files can be stored in a temporary or permanent library
How SAS Libraries are Defined
When a SAS session starts, SAS creates one temporary and one permanent SAS library. These libraries are open and ready to be used. You refer to a SAS library by a logical name called a library reference name, or libref
Temporary Library
work is a temporary library where you can store and access SAS data sets for the duration of the SAS session. It is the default library
SAS deletes the work library and its contents when the session terminates
Permanent Libraries
SAShelp is a permanent library that contains sample SAS data sets you can access during your SAS session
Accessing SAS Data Sets
All SAS data sets have a two level name that consists of the libref and the data set name, separated by a period
libref.datasetname
When a data set is in the temporary work library, you can use a one level name
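A short sketch of one-level and two-level names, assuming the default setup where one-level names resolve to the work library. The data set name class_copy is made up.

```sas
/* Two-level name: libref sashelp, data set class */
proc print data=sashelp.class;
run;

/* One-level name: stored in the temporary work library */
data class_copy;              /* same as work.class_copy */
   set sashelp.class;
run;

/* The two-level name refers to the same data set */
proc print data=work.class_copy;
run;
```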
User Defined Libraries
A user defined library
- is created by the user
- is permanent. Data sets are stored until the user deletes them
- is not automatically available in a SAS session
- is implemented within the operating environment's file system
Libname Statement
The SAS libname statement is a global statement
libname libref "sas-library" <options>;
- it is not required to be in a data step or a proc step
- it does not require a run statement
- it executes immediately
- it remains in effect until changed or cancelled, or until the session ends
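A sketch of a user-defined library in use. The libref mylib and the folder path are hypothetical; the path must point to a folder that already exists on your system.

```sas
/* Assign the libref; no RUN statement is needed */
libname mylib "/home/student/mydata";

/* Stored permanently as class.sas7bdat in that folder */
data mylib.class;
   set sashelp.class;
run;

/* Cancel the libref; the files stay on disk */
libname mylib clear;
```

In a later session the same LIBNAME statement has to be submitted again before mylib.class can be referenced.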
Browsing a Library Programmatically
using proc contents with the _ALL_ keyword to generate a list of all SAS files in a library
PROC CONTENTS DATA=libref._ALL_ NODS;
RUN;
- _ALL_ requests all of the files in the library
- The NODS option suppresses the individual data set descriptor information
- NODS can be used only with the keyword _ALL_
Importing an Excel Data File
- Multiple excel data file extensions
- specify the DBMS you want to use
* XLS
* XLSX
* EXCEL (reads all types of excel files)
- looks at less data when determining import strategy
proc import datafile="filename" dbms=identifier out=libref.dataset;
run;
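A hedged example of the import step filled in. The file path, sheet name, and output data set name are hypothetical, and reading .xlsx files this way assumes SAS/ACCESS Interface to PC Files is licensed.

```sas
proc import datafile="/home/student/sales.xlsx"
            dbms=xlsx
            out=work.sales
            replace;          /* overwrite work.sales if it already exists */
   sheet="Sheet1";            /* worksheet to read */
   getnames=yes;              /* use the first row as variable names */
run;

/* Check what was created */
proc contents data=work.sales;
run;
```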
Print Procedure
By default, proc print displays all observations, all variables, and an observation column on the left side
statements and options can be added to the print procedure to modify the default behavior
Variable Statements
The VAR statement selects variables to include in the report and specifies their order
proc print data=libname.dataset;
var variable(s);
run;
Proc print with the obs= option
- select a limited number of observations to display
- obs= is a data set option, not a variable
proc print data = libname.dataset (obs=#);
run;
Sum Statement
The sum statement calculates and displays report totals for the requested numeric variables
sum variables;
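For example, using the sample table sashelp.class (chosen here only for illustration), the VAR and SUM statements can be combined:

```sas
proc print data=sashelp.class noobs;
   var name sex age height weight;   /* columns to show, in this order */
   sum height weight;                /* totals appear in the last row of the report */
run;
```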
Where Statement
The where statement selects observations that meet the criteria specified in the where expression
proc print data=libname.dataset;
var variable(s);
where where-expression;
run;
Where Statement pt 2
The where expression defines the condition (or conditions) for selecting observations
operands:
- character constants
- numeric constants
- date constants
- character variables
- numeric variables
operators:
- symbols that represent a comparison, calculation, or logical operation
- +, (), >, <, -, *
- SAS functions
- special where operators
Suppressing the obs column
Use the NOOBS option in the proc print statement to suppress the observation column
proc print data = libname.dataset NOOBS;
run;
Operands
Constants are fixed
- characters are enclosed in quotation marks and are case sensitive
- numeric values do not use quotation marks or special characters
Variables must exist in the input data set
ex:
where sex = 'M';
where salary > 50000;
SAS Date Constant
A SAS date constant is a date written in the following form: 'ddmmm<yy>yy'd
SAS automatically converts a date constant to a SAS date value
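A sketch of a date constant in a where expression. It assumes the sample table sashelp.stocks, which has a numeric date variable; the cutoff date is arbitrary.

```sas
proc print data=sashelp.stocks (obs=10);
   var stock date close;
   where date >= '01JAN2005'd;   /* date constant compared with a SAS date value */
   format date date9.;           /* display the stored number as a date */
run;
```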
Comparison Operators
Comparison operators compare a variable with a value or with another variable
= Equal to (EQ)
^= ¬= ~= Not Equal to (NE)
> Greater Than (GT)
< Less Than (LT)
>= Greater than or equal to (GE)
<= less than or equal to (LE)
IN ('value-1', 'value-2') Equal to one of a list (IN)
Logical Operators
Logical operators combine or modify where expressions
WHERE WHERE-expression-1 AND | OR WHERE-expression-n;
Logical Operator Priority
The operators can be written as symbols or mnemonics, and parentheses can be added to modify the order of evaluation
^ ¬ ~ (NOT) priority: 1
& (AND) priority: 2
| (OR) priority: 3
The NOT operator modifies a condition by finding the complement of the specified criteria
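The effect of operator priority can be seen by running the same conditions with and without parentheses. The sample table sashelp.class is used here only for illustration.

```sas
/* AND is evaluated before OR, so this selects every 12-year-old
   plus the males who are 15 */
proc print data=sashelp.class;
   where age = 12 or age = 15 and sex = 'M';
run;

/* Parentheses change the order: only males who are 12 or 15 */
proc print data=sashelp.class;
   where (age = 12 or age = 15) and sex = 'M';
run;
```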
Special Where Operators
Special WHERE operators are operators that can be used only in WHERE expressions.
Contains: Includes a substring, can be used only with characters
Between-And: An inclusive range, can be used with characters and numbers
Is Null: A missing value, can be used with characters and numbers
Like: Matches a pattern, can be used with characters only
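One sketch per special operator, again against the sample table sashelp.class (an assumption; any table with character and numeric variables works). Each where statement sits in its own step because a second where statement in the same step would replace the first.

```sas
proc print data=sashelp.class;
   where name contains 'an';      /* substring match, case sensitive */
run;

proc print data=sashelp.class;
   where age between 12 and 14;   /* inclusive range: 12, 13, and 14 */
run;

proc print data=sashelp.class;
   where weight is null;          /* rows where weight is missing */
run;

proc print data=sashelp.class;
   where name like 'J%';          /* % matches any number of characters */
run;
```

In a LIKE pattern, an underscore matches exactly one character.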