Characteristics of high-quality predictive modeling problems
Clearly identified business issue, questions are well defined, good and useful data is available for answering the questions, predictions will drive actions and increase understanding, predictive analytics can likely improve the approach, and the model can be updated and monitored over time
Producing meaningful problem definition
Get to the root cause of the business issue, develop a hypothesis, use KPIs, consider the availability of good data and implementation issues, and make the problem specific enough to solve
What should be true for data to be considered relevant?
Data needs to be unbiased and representative of the population and time frame of interest
Sampling
The process of taking a subset of observations from the population to generate the dataset
Random sampling
Randomly draw observations from the population without replacement. Each observation is equally likely to be sampled
Stratified Sampling
Divide the population into non-overlapping strata and randomly sample a set number of observations from each stratum
Systematic sampling
This is a special case of stratified sampling. Draw observations according to a set repeatable pattern.
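The three sampling schemes above can be sketched in Python; the population and its two strata here are made-up examples:

```python
import random

# Hypothetical population: 10 observations, each labeled by a stratum ("A" or "B").
population = [("A", i) for i in range(6)] + [("B", i) for i in range(4)]

# Random sampling: draw without replacement, each observation equally likely.
random.seed(0)
simple = random.sample(population, k=4)

# Stratified sampling: randomly sample a set number from each non-overlapping stratum.
strata = {"A": [p for p in population if p[0] == "A"],
          "B": [p for p in population if p[0] == "B"]}
stratified = [obs for group in strata.values() for obs in random.sample(group, k=2)]

# Systematic sampling: draw according to a set, repeatable pattern (every 3rd record).
systematic = population[::3]

print(len(simple), len(stratified), len(systematic))  # 4 4 4
```

Note that stratified sampling guarantees representation from every stratum, while simple random sampling may miss a small stratum entirely.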
What makes data high quality?
Data values should be reasonable in the business context, follow consistent units and rules, and be sufficiently documented
Structured data
This is data that can fit into a tabular arrangement
Structured data pros and cons
Structured data is easier to manipulate but cannot represent data that can’t naturally fit in a tabular arrangement
Unstructured Data
Data that cannot fit into a tabular arrangement, such as audio, images, etc.
Unstructured data pros and cons
This data is more flexible but harder to access and include in a predictive model
Personally Identifiable Information
This is information that can be used to trace to an individual’s identity
How can personally identifiable information be handled?
De-identify the data to remove personally identifiable information, ensure the data is sufficiently encrypted, and conduct model operations in accordance with the terms of use
Sensitive variables
These are variables that may be significant but using them in a model may lead to legal, fairness, and discrimination concerns
Target leakage
This occurs when a predictor in a model leaks information about the target variable that will not be available at the time the model is used to make predictions
How can you detect target leakage?
The predictor variable is observed at the same time as or after the target variable, the target variable is included in a predictor, the target variable is used to construct principal components, or the target variable is used to form clusters
Goals of Exploratory Data Analysis
Validate data and generate insights for modeling
What does it mean to validate data
Clean the data of inappropriate values, errors, and outliers so that it is ready for analysis
What does it mean to generate insights for modeling
Understand the characteristics of variables, identify potentially useful predictors, generate new features, and decide which type of model is best
Pros and cons of summary statistics
Summary statistics are objective and comparable across variables, but can only capture certain aspects of distributions and are easily affected by outliers
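The outlier sensitivity mentioned above is easy to see with a toy sample (the values are hypothetical):

```python
# Hypothetical sample: five typical values plus one extreme outlier.
values = [10, 12, 11, 13, 12, 200]

# The mean is pulled far toward the outlier...
mean = sum(values) / len(values)

# ...while the median barely notices it.
ordered = sorted(values)
mid = len(ordered) // 2
median = (ordered[mid - 1] + ordered[mid]) / 2

print(mean, median)  # 43.0 12.0
```

This is why robust summary statistics (median, interquartile range) are often reported alongside the mean during exploratory data analysis.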
Pros and cons of graphical displays
Graphical displays provide a visual impression of a distribution and can reveal information not easily captured by summary statistics, but they are less precise than summary statistics and less comparable across variables
Three types of bivariate graphical displays for categorical variables
Stacked bar charts, dodged bar charts, and filled bar charts
When to use bar charts
To display counts of levels in a categorical variable
When to use filled bar chart
When you want to display the proportions of levels in a categorical variable. This loses the information about the count of each level
When to use a dodged bar chart
To compare counts of the levels of one categorical variable across the levels of another
Issue with correlated numeric predictors
Difficult to separate effects of individual predictors on a target variable, coefficient estimates become unstable, and for decision trees collinearity can dilute variable importance scores
Issue with skewness due to outliers in a variable distribution
Extreme values can overly influence fitted values and distort graphical displays
Solutions to skew due to outliers
Remove outliers if they are due to data entry errors or make up a small proportion of the dataset, or apply a log or square root transformation if the data is positive
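The effect of a log transformation on skewed positive data can be sketched with made-up claim amounts:

```python
import math

# Hypothetical right-skewed positive data (e.g., claim amounts) with one outlier.
claims = [120, 150, 180, 200, 250, 300, 9000]

# On the raw scale the largest value is 75x the smallest;
# on the log scale it is only about 1.9x, so the outlier's
# influence on fitted values and plots is greatly compressed.
logged = [math.log(x) for x in claims]

ratio_raw = max(claims) / min(claims)
ratio_log = max(logged) / min(logged)
print(round(ratio_raw, 1), round(ratio_log, 1))  # 75.0 1.9
```

The square root transformation works similarly but compresses less aggressively, and unlike the log it also handles zeros.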
When should a numerical variable be converted into a factor?
A numerical variable should be converted into a factor if it has a small number of distinct values, its values are merely labels with no numerical meaning, or it has a complex relationship with the target variable.
For GLMs, why is it better to convert a numerical variable into a factor when it has a complex relationship with the target variable?
Factor conversion gives GLMs more flexibility to capture the relationship
When should numerical variables not be converted into factors
The variable has a large number of distinct values, its values have a numerical order that could be useful for predicting the target variable, it has a simple relationship with the target variable, or future observations will have new variable values
Why are sparse levels bad for a categorical predictor?
Sparse levels reduce the robustness of the model and can cause overfitting
Sensitivity
This is the proportion of positive observations correctly classified as positive or TP/(TP+FN)
Specificity
This is the proportion of negative observations correctly classified as negative, or TN/(TN+FP)
Precision
This is the proportion of positive predictions that are actually positive, or TP/(TP+FP)
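The three classification metrics above follow directly from the confusion-matrix counts; the counts here are a hypothetical example:

```python
# Confusion-matrix counts from a hypothetical binary classifier.
TP, FN, TN, FP = 40, 10, 35, 15

sensitivity = TP / (TP + FN)  # positives correctly classified as positive
specificity = TN / (TN + FP)  # negatives correctly classified as negative
precision = TP / (TP + FP)    # positive predictions that are actually positive

print(sensitivity, specificity, round(precision, 3))  # 0.8 0.7 0.727
```

Sensitivity and precision share the same numerator (TP) but different denominators: sensitivity divides by the actual positives, precision by the predicted positives.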
Solutions to unbalanced data
Oversampling and Undersampling
What is Undersampling and what are its cons?
This is when all observations from the smaller class are kept, and fewer observations from the larger class are selected. This can cause the model to become less robust and the classifier to become more prone to overfitting due to the smaller amount of data
Oversampling
This is when all observations from the larger class are kept and additional observations are sampled (with replacement) from the smaller class. It must be done only on the training data, after the train/test split, so duplicated observations do not leak into the test set. It is also more computationally expensive because it enlarges the training set
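Both resampling schemes can be sketched on a made-up unbalanced training set:

```python
import random

random.seed(1)
# Hypothetical unbalanced training split: 8 negatives, 2 positives.
negatives = [("neg", i) for i in range(8)]
positives = [("pos", i) for i in range(2)]

# Undersampling: keep all of the smaller class, draw fewer from the larger class.
under = positives + random.sample(negatives, k=2)

# Oversampling: keep all of the larger class, resample with replacement from
# the smaller class. Apply only to the training split, never before splitting,
# so duplicated observations cannot leak into the test set.
over = negatives + random.choices(positives, k=8)

print(len(under), len(over))  # 4 16
```

Undersampling shrinks the data (risking a less robust model), while oversampling grows it (raising the computational cost); both produce a balanced training set.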
Feature Generation
This is the act of creating new variables from the original variables. It attempts to transform the information in the original variables into a more useful form and to make the model easier to interpret.
Dimensionality vs Granularity
Dimensionality refers to the number of model inputs a variable generates (e.g., the number of dummy variables created from a categorical variable), while granularity refers to how specifically the information is recorded
Model validation
This checks that the model has no obvious deficiencies and that its assumptions hold
Offset term
An offset term is an included variable whose coefficient is fixed at 1 and is used to adjust for exposure. This allows the expected target variable to grow proportionally with exposure
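The proportional-growth property of an offset can be verified numerically; the coefficients below are made up for illustration, assuming a Poisson-style model with a log link:

```python
import math

# Hypothetical rate model: log(mu) = b0 + b1*x + 1*log(exposure).
# The offset log(exposure) enters with its coefficient fixed at 1.
b0, b1 = -2.0, 0.5

def expected_count(x, exposure):
    # Expected target: mu = exposure * exp(b0 + b1*x).
    return math.exp(b0 + b1 * x + math.log(exposure))

# Doubling exposure doubles the expected count, holding x fixed.
mu1 = expected_count(x=1.0, exposure=100)
mu2 = expected_count(x=1.0, exposure=200)
print(round(mu2 / mu1, 6))  # 2.0
```

If the exposure coefficient were estimated rather than fixed at 1, the expected target would no longer be forced to scale proportionally with exposure.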
Weight term
A weight term is not a predictor; it tells the model how much influence each observation should have when the coefficients are estimated.
Assumptions of a GLM
The response variable follows a distribution from the exponential family, observations are independent, the mean of the response is related to the predictors through a link function, and the variance of the response is determined by its mean