BUSN 5000

30 Terms

1

How do we define the variance of a random variable and why is it interesting?

The variance of a random variable X measures how far X tends to deviate from its expected value (mean): Var(X) = E[(X - E[X])^2].

Quantifies the spread of data points around the mean.

High variance = widely dispersed.

Low variance = clustered around the mean.

Helps in risk assessment, model evaluation and statistical inference.
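
A minimal sketch in base R of the idea above, using made-up numbers (the two vectors and the helper names `pop_var` and `sample_var` are illustrative assumptions, not course material). Note that R's built-in `var()` divides by n - 1, so it gives the sample variance rather than the plain average squared deviation.

```r
# Hypothetical data: two samples with the same mean (50) but different spread
clustered <- c(48, 49, 50, 51, 52)
dispersed <- c(10, 30, 50, 70, 90)

# Average squared deviation from the mean (divides by n)
pop_var <- function(x) mean((x - mean(x))^2)

# Sample variance computed by hand (divides by n - 1, like var())
sample_var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

pop_var(clustered)     # low variance: values cluster around the mean
pop_var(dispersed)     # high variance: values are widely dispersed
sample_var(dispersed)  # hand-computed sample variance
var(dispersed)         # built-in sample variance, same value as the line above
```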

2

What does it mean to impute a missing value?

Imputation: Replace missing values using observed data.

Methods: Mean imputation (Replace missing values with the mean of observed values)

Regression imputation (Predict missing values using a model)

Multiple imputation (Generate multiple plausible values for each missing case, analyze each completed dataset separately, then combine the results).

EX) If income is missing for some survey respondents, we can predict it based on education and occupation
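
A rough sketch of the first two methods in base R. The `survey` data frame is invented for illustration, and the regression uses education only (the card's example also mentions occupation, which is left out here to keep the sketch short).

```r
# Hypothetical survey data: income (in $1,000s) is missing for two respondents
survey <- data.frame(
  education = c(12, 16, 12, 18, 16, 14),
  income    = c(30, 50, NA, 65, NA, 42)
)

# Mean imputation: fill every gap with the mean of the observed incomes
mean_imputed <- ifelse(is.na(survey$income),
                       mean(survey$income, na.rm = TRUE),
                       survey$income)

# Regression imputation: fit a model on the complete cases
# (lm() drops rows with a missing outcome by default), then predict the gaps
fit <- lm(income ~ education, data = survey)
reg_imputed <- ifelse(is.na(survey$income),
                      predict(fit, newdata = survey),
                      survey$income)

mean_imputed
reg_imputed
```

Mean imputation gives both missing respondents the same value, while regression imputation lets the filled-in value vary with education.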

3

What is the effect of sample selection that is based on the values of the outcome variable?

Endogenous sample selection: happens when inclusion in the sample depends on the outcome variable.

Leads to selection bias, meaning the observed data does not represent the full population.

EX) If a study excludes people earning more than $400,000, then E[wages | sample] is not the same as E[wages] for the population.

Not the same as censoring (where high wages are top-coded but still included).
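
The contrast between selection and censoring can be sketched with simulated data. The lognormal wage distribution and its parameters are assumptions chosen only for illustration; the $400,000 cutoff comes from the example above.

```r
set.seed(1)
# Hypothetical population of wages (lognormal is an illustrative assumption)
wages <- rlnorm(100000, meanlog = 11, sdlog = 0.8)

# Endogenous selection: drop everyone earning more than $400,000
selected <- wages[wages <= 400000]

# Censoring (top-coding): keep everyone, but record high wages as $400,000
top_coded <- pmin(wages, 400000)

mean(wages)      # population mean
mean(selected)   # mean in the selected sample: too low
mean(top_coded)  # mean with top-coding: also too low, but closer to the truth
```

Both approaches understate the mean, but selection loses the high earners entirely while top-coding keeps them in the sample at a capped value.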

4

How might an assumption of MAR be violated?

MAR is violated when missingness depends on unobserved factors, even after conditioning on the observed data.

EX) High-income earners systematically do not respond, so the probability of missingness depends on income even after conditioning on education.

EX2) In medical trials, patients with severe side effects drop out, so missingness is correlated with unobserved severity.

When MAR is violated, the data are missing not at random (MNAR), which leads to biased estimates.

5

What is the difference between data that are MCAR and those that are just MAR?

MCAR = missing completely at random

The probability of missingness does not depend on either observed or unobserved data

EX) A power outage deletes random survey responses

MAR = missing at random

The probability of missingness depends on observed data, not on unobserved data

EX) People with higher incomes are less likely to report earnings, but this pattern is fully accounted for by education level: within each education level, nonresponse does not depend on income

- MCAR: safe to analyze complete cases without bias

- MAR: adjust for missingness using observed covariates

6

How does classification error differ from classical measurement error?

Classification error: a categorical variable is misclassified

EX) mislabeling a person's education level or job status

- Unlike classical measurement error (mean-zero random noise), classification error can be systematic and is not necessarily mean zero

Implication: can lead to misestimated treatment effects, particularly in randomized controlled trials (RCTs)

7

What is the effect of classical measurement error on the estimated correlation between two variables?

Classical measurement error = the variable X is observed with random noise (X = X* + e), where e is mean zero and unrelated to X* and Y

- This biases the correlation between X and Y toward zero (attenuation bias)

- The variance of X increases due to the added noise, while the covariance between X and Y remains unchanged

- The estimated correlation understates the true relationship: |corr(X,Y)| < |corr(X*,Y)|

Note: X* is the true variable; X is the variable observed with error
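
Attenuation can be illustrated with a small simulation. Everything here (sample size, the normal distributions, the variable names) is an assumption made for the sketch.

```r
set.seed(42)
n      <- 10000
x_star <- rnorm(n)            # true variable X*
y      <- x_star + rnorm(n)   # outcome depends on the true variable
e      <- rnorm(n)            # classical error: mean zero, independent of X* and Y
x      <- x_star + e          # what we actually observe

cor(x_star, y)   # correlation with the true variable
cor(x, y)        # attenuated correlation with the mismeasured variable

var(x_star); var(x)          # variance rises with the added noise
cov(x_star, y); cov(x, y)    # covariance is roughly unchanged
```

Because the covariance stays about the same while the variance of X grows, the ratio that defines the correlation shrinks toward zero.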

8

What is the simplest way to estimate the CEF and how would you do it in R?

- E[Y|X] = f(X)

Simple estimation method:

1) Group observations by X

2) Compute the average Y for each value of X
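
One way to carry out the two steps in R, sketched with base R's `aggregate()` on invented data (the `df` data frame and its column names are assumptions). A dplyr equivalent is shown in comments.

```r
# Hypothetical data: earnings (Y, in $1,000s) by years of education (X)
df <- data.frame(
  educ     = c(12, 12, 16, 16, 16, 18),
  earnings = c(30, 34, 50, 54, 52, 70)
)

# Steps 1 and 2 in one call: group by educ, then average earnings within each group
cef_hat <- aggregate(earnings ~ educ, data = df, FUN = mean)
cef_hat

# Equivalent with dplyr (commented out so the sketch runs without extra packages):
# library(dplyr)
# df %>% group_by(educ) %>% summarize(mean_earnings = mean(earnings))
```

Each row of `cef_hat` is the estimated value of E[earnings | educ] at one observed education level.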

9

What are the advantages of using a model to estimate the CEF?

- Smoothing: gives smooth estimates rather than noisy group averages

- Prediction for values of X that are not in the dataset

- Insight into relationships, allowing us to separate effects

EX) controlling for education

- Generalization: captures trends rather than just specific data points

EX) A linear model assumes a constant increase in earnings per year

EX) A quadratic model captures an earnings peak followed by a decline

10

What is the difference between a linear and a quadratic model of the relationship between Y and X?

- Linear model = assumes a constant rate of change (y=B0 + B1x)

EX) Earnings increase by a fixed amount each year

Problem) Ignores plateaus and declines

- Quadratic model = allows for a curved relationship (y = B0 + B1x + B2x^2)

- Captures nonlinear trends

EX) Earnings rise early, peak, then decline
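
The two models can be compared with `lm()` on simulated data. The data-generating numbers below are assumptions chosen so that earnings peak around 25 years of experience; they are not estimates from any real dataset.

```r
set.seed(7)
# Hypothetical data: earnings rise with experience, peak, then decline
exper    <- runif(500, 0, 40)
earnings <- 20 + 3 * exper - 0.06 * exper^2 + rnorm(500, sd = 3)

linear    <- lm(earnings ~ exper)                # y = B0 + B1*x
quadratic <- lm(earnings ~ exper + I(exper^2))   # y = B0 + B1*x + B2*x^2

coef(linear)
coef(quadratic)   # a negative B2 means the curve bends downward

# Implied peak of the quadratic: x = -B1 / (2 * B2)
b    <- coef(quadratic)
peak <- -b["exper"] / (2 * b["I(exper^2)"])
peak
```

The linear fit can only report one slope for the whole range, while the quadratic fit recovers the rise, the peak and the decline.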

11

What are the two factors required to call a data analysis reproducible?

1) Same inputs = same results: if you re-run the analysis with the original data, the outcome should be identical.

2) Inputs are available and documented: anyone should be able to access the original data and the exact steps taken.

12

Why is reproducibility important?

- For your future self and others who will use your work

- For transparency, to guard against error and fraud related to confirmation and publication biases

- It matters as a response to the replication crisis in social / behavioral sciences research

13

What are some reasons estimates might be biased?

1) Censoring: high earnings may be top-coded, leading to underestimation of the mean.

2) Sample selection: if the dataset excludes certain groups, it may not reflect the entire population.

3) Measurement error: if earnings are reported incorrectly, the estimate won't reflect true earnings.

14

What is a data set and how is it different from a data table?

A data set is a collection of data used for analysis; it may contain one or more data tables. A data table is a single component of a data set, with data organized in rows and columns.

15

What are the three criteria for "tidy" data?

1) Each variable corresponds to a column

2) Each observation corresponds to a row

3) Every entry (cell) corresponds to a single value

16

What are the four stages of the data value chain?

1) Acquisition (acquire) - collect data (surveys, databases, sensors, APIs)

2) Transformation (transform) - clean, organize and prepare data

3) Analysis - apply statistical or machine learning models to extract insights, e.g. conditional expectation functions (CEFs)

4) Communication - present findings via reports, visualizations or business decisions.

17

What does the law of large numbers say about sample averages?

The LLN states that as the sample size (N) increases, the sample average (mean) converges to the true population average.
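
A quick simulation makes the convergence visible. The fair die is an illustrative assumption; its true mean is 3.5.

```r
set.seed(123)
# Rolling a fair six-sided die: the true mean is 3.5
rolls <- sample(1:6, size = 100000, replace = TRUE)

# Sample average using only the first n rolls, for every n
running_mean <- cumsum(rolls) / seq_along(rolls)

# Sample averages at n = 10, 100, 1000 and 100000
running_mean[c(10, 100, 1000, 100000)]
```

The averages based on few rolls can be far from 3.5, while the average over all 100,000 rolls should sit very close to it.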

18

What do we mean when we talk about the expected value of a random variable?

The expected value (E[X]) of a random variable is its long-run average value over many repetitions of an experiment.

19

How do we define the covariance between two random variables and what does covariance tell you?

The covariance between two random variables X and Y measures how they co-move: Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

- Cov(X, Y) > 0: X and Y move together

EX) higher education --> higher earnings

- Cov(X, Y) < 0: X and Y move in opposite directions

EX) higher age --> lower earnings in later years

- Cov(X, Y) = 0: there is no linear relationship

20

What is the distinction between the estimand, its estimator and an estimate?

Estimand: The true quantity we want to learn about

EX) the true proportion of high earners in the population

Estimator: A statistical model or formula used to estimate the estimand

EX) calculating sample mean or using a lognormal model

Estimate: the actual numerical value produced by applying the estimator to the data.

21

How would you use the plug in principle to estimate the CEF?

Plug-in principle = estimate unknown population quantities by their sample analogs (e.g. replace expectations with sample averages).

1) Start with the CEF, E[Y|X] = f(X)

2) Estimate the function using sample data

- Replace the true expectation with the sample mean within each subgroup defined by X

EX) If X = education level, calculate the mean earnings for each education group

22

What does "data provenance" mean?

Provenance = source

Data provenance = a record of where the data come from and any transformations applied to them

23

How does R Markdown aid reproducibility?

- Combines the code used to analyze the data with its output and explanatory text in one document that renders to formats such as HTML or PDF

- Helps document each step of the analysis

24

What are the main elements of a data schema?

DEF = a representation of the data structure, comprising all the attributes of the data and their types

Elements:

- The number of data tables and the name of each

- For each table: the number of observations, the list of variables, and the format of the information in each variable

- Units of observation (year, person)

- Key variables

25

How does a frequentist think about learning from data?

A frequentist interprets probability as the long-run frequency of an event occurring in a repeated experiment

- Random samples represent the population

- Probabilities stabilize as the number of trials increases

- Inference is based on observable data, rather than prior beliefs.

26

What does the law of iterated expectations say?

The law of iterated expectations (LIE) states that the expected value of Y can be written as the expectation of its conditional expectation given X: E[Y] = E[E[Y|X]]
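
The identity can be checked numerically on a small made-up sample (the `df` data frame and its values are assumptions): averaging the group means, weighted by each group's share, reproduces the overall mean.

```r
# Hypothetical data: earnings (Y, in $1,000s) by education group (X)
df <- data.frame(
  educ     = c("HS", "HS", "HS", "BA", "BA", "MA"),
  earnings = c(30, 34, 32, 50, 54, 70)
)

group_means  <- tapply(df$earnings, df$educ, mean)   # E[Y | X = x] for each x
group_shares <- table(df$educ) / nrow(df)            # P(X = x)

# E[E[Y | X]]: average the conditional means, weighting by group shares
# (both objects are ordered by the sorted levels of educ)
lie_mean <- sum(as.numeric(group_means) * as.numeric(group_shares))

lie_mean            # weighted average of the group means
mean(df$earnings)   # overall mean: the two agree
```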

27

What is the correlation coefficient and how does it differ from covariance?

The correlation coefficient standardizes covariance, making it easier to interpret: corr(X, Y) = Cov(X, Y) / (SD(X) * SD(Y))

How is corr different from cov?

- Corr is standardized to lie between -1 and 1

- It measures the strength and direction of a linear relationship

- Corr is unit free, allowing comparisons across data sets

Interpretations:

- 1 --> perfect positive relationship, 0 --> no linear relationship, -1 --> perfect negative relationship

EX) A corr of .13 between age and earnings suggests a weak positive relationship.
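
A short sketch of why correlation is unit free while covariance is not. The simulated age and earnings data are assumptions for illustration and are not meant to reproduce the .13 figure above.

```r
set.seed(2024)
# Hypothetical data: age in years and earnings in dollars
age      <- runif(1000, 25, 65)
earnings <- 30000 + 200 * age + rnorm(1000, sd = 20000)

# Correlation is covariance divided by the product of the standard deviations
manual_corr <- cov(age, earnings) / (sd(age) * sd(earnings))
manual_corr
cor(age, earnings)   # built-in, same value

# Changing units rescales the covariance but leaves the correlation unchanged
cov(age, earnings / 1000)   # earnings in $1,000s: covariance is 1000 times smaller
cor(age, earnings / 1000)   # same correlation as before
```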

28

What are the key aspects of data quality?

- Content: The variable measures the correct concept

- Validity: The data truly represents what it claims to measure

- Reliability: The data is consistent across repeated measurements

- Comparability: Data collection is uniform across groups / time periods

- Coverage: The dataset includes all relevant subjects

- Selection bias: Whether the sample is representative of the broader population

29

What is the conditional expectation function (CEF) and why do we call it the "workhorse" of data science?

The CEF represents the expected value of a dependent variable (Y) given an independent variable (X)

- E[Y|X] = f(X)

The CEF summarizes relationships in the data by capturing how the average value of Y changes with X

- Used in both prediction models and causal inference

- Helps reduce uncertainty by focusing on the expected outcome given conditions

EX) If Y = earnings and X = years of education, E[earnings | education] gives the average earnings for people with a given level of education

30

Code Terms

- dplyr = R package with many functions for data manipulation and analysis

- filter = selects observations (rows) based on conditions

- select = chooses specific variables (columns) to keep or drop

- mutate = adds new variables or modifies existing ones

- arrange = sorts rows by values

- summarize = creates summary statistics

- group_by = groups data by a variable

- datasummary = puts statistics in a formatted table (from the modelsummary package)

- ggplot = creates different types of charts, depending on the layers added, to present data visually

- geom_smooth = adds a smoothed or fitted line to a plot, visualizing a model of the CEF

- kable = formats output as a clean table (from the knitr package)
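
The dplyr verbs above are usually chained together. The pipeline below is only a sketch: it assumes the dplyr package is installed and uses R's built-in `mtcars` data, which has nothing to do with the course data.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                           # keep cars with more than 100 horsepower
  select(mpg, cyl, hp) %>%                       # keep only three variables
  mutate(hp_per_mpg = hp / mpg) %>%              # add a new variable
  group_by(cyl) %>%                              # group rows by number of cylinders
  summarize(mean_mpg        = mean(mpg),         # one row of summary statistics per group
            mean_hp_per_mpg = mean(hp_per_mpg),
            n_cars          = n()) %>%
  arrange(desc(mean_mpg))                        # sort groups from highest to lowest mean mpg

result
```

Grouping followed by `summarize()` is the same group-then-average pattern used to estimate a CEF from group means.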
