Mutate
Tidyverse function for creating new variables.
Filter
Tidyverse function for keeping only the rows that match a given condition
Select
Tidyverse function for keeping/dropping variables
Group_By
Tidyverse function for grouping rows by one or more variables so that later operations run within each group
Summarise
Tidyverse function for collapsing data (often grouped) into summary statistics
Population Average
The average over the entire population; the quantity we are trying to learn
Sample average
The average computed from the sample we actually have.
records
A data set is made up of _____ that contain information on a specific entity.
fields
Each record is made of _____ that contain measurements of known types.
Panel Data
Data collected over time on multiple entities, such as individuals, firms, or countries.
Acquire, Transform, Analyze, Communicate
The 4 stages of analysis
Cross Section
Many units observed at a particular time
Time Series
A single unit observed over multiple time periods
Data Set
Multiple data tables structured for a particular analysis
Database
A collection of tables where each table has some known and meaningful relationship to the other tables.
Volume
A word to describe the literal size and scale of data
Velocity
A word to describe the speed of generation, collection, and storage of data
Variety
A word to describe the complexity of sources and forms of data
Veracity
A word to describe the degree of consistency and completeness of data.
Content
What a variable measures
Validity
Whether a variable measures what it's supposed to measure
Reliability
Whether repeated measurements return the same value
Comparability
Whether a variable is measured the same way across units
Coverage
Whether all units intended for inclusion are included
Selection
Whether selected units are representative of those not covered
Data Schema
A representation of the data structure that comprises all the attributes of the data and their data types
Vector
A sequence of data elements of the same type
Matrix
A two-dimensional array of data elements of the same type
Data Frame
A tabular data structure whose columns can each hold a different data type
List
An ordered collection of objects, which may be of different types
Factor
A vector that can contain only predefined values, and is used to store categorical data
Array
A multidimensional collection of same-type data elements.
Frequentism
The approach that defines the probability of an event as its long-run relative frequency: the number of times it happens divided by the number of random trials, as the trials are repeated many times
Law of Large Numbers
The idea that as the number of random trials (the sample size) grows, the sample average gets closer to the population average.
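As a rough illustration of the Law of Large Numbers, the sketch below simulates fair coin flips and compares the share of heads to the true probability of 0.5 at several sample sizes. The function name, seed, and sample sizes are arbitrary choices made for this example.

```python
import random

def share_of_heads(n, seed=0):
    """Simulate n fair coin flips and return the fraction that came up heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(rng.random() < 0.5 for _ in range(n)) / n

# Distance between the sample average and the population average (0.5)
errors = {n: abs(share_of_heads(n) - 0.5) for n in (10, 1_000, 100_000)}

for n, err in errors.items():
    print(f"n = {n:>7}: |sample average - 0.5| = {err:.4f}")
```

The error at any single sample size is random, so it need not shrink at every step, but it tends toward zero as n grows.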
Estimand
The thing we want to estimate
Estimator
The formula or procedure we apply to data to make an estimate
Estimate
The value an estimator produces from a particular sample; our best guess for the estimand, which differs from it by bias and sampling error.
Measurement error
When a variable’s empirical measurement does not accurately capture the thing we are interested in
CEF
Conditional expectation function: the workhorse of data science. It gives the expected value (average) of one variable given the value of another, E(Y|X).
Law of Iterated Expectations
The law that states the unconditional expectation is equal to the weighted average of the conditional expectations, with weights given by the distribution of X. E(Y) = E[E(Y|X)]
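A small check of the Law of Iterated Expectations on made-up data: the overall average of Y should equal the group averages of Y weighted by each group's share of the data. The groups "a" and "b" and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical (x, y) records: x is a group label, y is an outcome
data = [("a", 10), ("a", 20), ("b", 30), ("b", 40), ("b", 50), ("b", 60)]

# Unconditional expectation E(Y)
overall_mean = sum(y for _, y in data) / len(data)

# Collect the outcomes for each value of X
by_group = defaultdict(list)
for x, y in data:
    by_group[x].append(y)

# Conditional expectations E(Y | X = x) and weights P(X = x)
cond_mean = {x: sum(ys) / len(ys) for x, ys in by_group.items()}
weight = {x: len(ys) / len(data) for x, ys in by_group.items()}

# Weighted average of the conditional expectations, E[E(Y | X)]
iterated_mean = sum(weight[x] * cond_mean[x] for x in by_group)

print(overall_mean, iterated_mean)
```

Here E(Y|X=a) = 15 with weight 1/3 and E(Y|X=b) = 45 with weight 2/3, so both calculations give 35.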
Covariance
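Indicates the direction of the linear relationship between two variables (positive or negative). Its size depends on the units of measurement, so strength is usually judged with correlation instead.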
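A minimal sketch of sample covariance and correlation computed by hand, using made-up numbers. It rescales x by 100 (for example, meters to centimeters) to show that covariance changes with units while correlation does not. The helper names are illustrative.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * v for v in x]  # same variable in different units

cov = sample_covariance(x, y)
cov_rescaled = sample_covariance(x_rescaled, y)
corr = sample_correlation(x, y)
corr_rescaled = sample_correlation(x_rescaled, y)

print(cov, cov_rescaled)    # covariance grows by the rescaling factor
print(corr, corr_rescaled)  # correlation is unchanged
```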
Human Capital Theory
Models education as an investment, much like any other capital asset; predicts that the age-earnings profile will be concave.
Consistency
The bias and sampling error approach zero as sample size increases
Central Limit Theorem
The theorem that under random sampling, given a large enough sample, the distribution of the sample average approaches a normal distribution, whatever the distribution of the underlying variable.
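To see the central limit idea in action, the sketch below draws many samples from a strongly skewed (exponential) distribution and looks at how the sample averages are spread. If those averages are roughly normal, about 95% of them should fall within 1.96 standard errors of the true mean. The seed, sample size, and number of repetitions are arbitrary choices for this example.

```python
import math
import random
import statistics

rng = random.Random(42)
n, reps = 50, 2000               # observations per sample, number of samples
true_mean, true_sd = 1.0, 1.0    # mean and sd of an exponential with rate 1

# One sample average per simulated sample
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

standard_error = true_sd / math.sqrt(n)

# Share of sample averages within 1.96 standard errors of the true mean
share_within = sum(
    abs(m - true_mean) <= 1.96 * standard_error for m in sample_means
) / reps

print(f"share within 1.96 SE: {share_within:.3f}")
```

The individual draws are far from normal, yet the share should land close to 0.95, which is what a normal distribution of sample averages would imply.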
Confidence Interval
A range of values around an estimate, built so that across repeated samples it contains the population target a stated share of the time (for example, 95%)
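The repeated-sampling reading of a confidence interval can be sketched with a simulation: draw many samples from a population whose mean is known, build an approximate 95% interval from each, and count how often the interval contains the true mean. The population parameters, seed, and the use of the normal critical value 1.96 are assumptions made for this example.

```python
import math
import random
import statistics

rng = random.Random(7)
true_mean, true_sd = 10.0, 2.0   # hypothetical population
n, reps = 40, 1000               # sample size, number of repeated samples

covered = 0
for _ in range(reps):
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    center = statistics.fmean(sample)
    # Half-width: 1.96 estimated standard errors of the sample average
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
    if center - half_width <= true_mean <= center + half_width:
        covered += 1

coverage = covered / reps
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

Any single interval either contains the true mean or does not; the 95% describes how the procedure behaves across many samples.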
Item Nonresponse
When a respondent takes part but particular answers are missing because they refused or failed to provide them
Unit Nonresponse
When an entire unit is missing because no data was collected from a person who was supposed to be in the sample
Missing Completely at Random
Whether a value is missing is completely independent of X and Y.
Missing at Random
Selection into a dataset depends on X, but not other unobserved factors.
Exogenous
Whatever went wrong with sampling comes from outside the relationship being studied, so it is unrelated to the outcome.
Endogenous
Whatever went wrong with sampling comes from inside the relationship being studied, so it is related to the outcome.
Imputation
The process of filling in the missing values based on data you observe
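One very simple form of imputation is to replace each missing value with the average of the observed values. The sketch below uses a made-up list in which `None` marks a missing entry. Mean imputation is only one of many approaches and is shown here purely as an illustration.

```python
# Hypothetical measurements; None marks a missing value
values = [4.0, None, 6.0, None, 8.0]

# Average of the values we actually observe
observed = [v for v in values if v is not None]
observed_mean = sum(observed) / len(observed)

# Fill each gap with the observed average, leaving observed values untouched
imputed = [observed_mean if v is None else v for v in values]

print(imputed)
```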
Simpson’s Paradox
The idea that a relationship seen in aggregated data can shrink, disappear, or reverse within subgroups, because a lurking third variable affects the correlation
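The reversal behind Simpson's Paradox is easy to reproduce with counts. In the sketch below, treatment A has the higher success rate within each severity group, yet treatment B looks better once the groups are pooled, because A was used mostly on the harder cases. The counts and labels are illustrative, not results from this document.

```python
# Illustrative (successes, patients) counts by treatment and case severity
counts = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Success rate within each severity group
within = {
    treatment: {group: rate(*cell) for group, cell in groups.items()}
    for treatment, groups in counts.items()
}

# Success rate after pooling the groups together
pooled = {
    treatment: rate(
        sum(s for s, _ in groups.values()),
        sum(t for _, t in groups.values()),
    )
    for treatment, groups in counts.items()
}

for treatment in counts:
    print(treatment, within[treatment], round(pooled[treatment], 3))
```

Severity is the lurking variable here: it affects both which treatment a patient gets and how likely the treatment is to succeed.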
Bayes’ Rule
A mathematical formula used to update the probability of a hypothesis based on new evidence or information. It calculates the probability of an event occurring given prior knowledge and new data.
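A worked example of Bayes' Rule, P(H|E) = P(E|H) P(H) / P(E), using a hypothetical medical test. The prior, sensitivity, and false positive rate below are invented numbers chosen only to show the update.

```python
# Hypothetical inputs
prior = 0.01                 # P(disease) before seeing the test result
sensitivity = 0.95           # P(positive test | disease)
false_positive_rate = 0.05   # P(positive test | no disease)

# Total probability of a positive test, P(positive)
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' Rule: updated probability of disease given a positive test
posterior = sensitivity * prior / p_positive

print(f"P(disease | positive test) = {posterior:.3f}")
```

Even with a fairly accurate test, the posterior stays well below one half because the disease is rare to begin with, so most positive results come from the much larger healthy group.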