selection bias
occurs when data is chosen in a way that does not reflect the real-world data distribution
survivorship bias
the tendency to draw conclusions based on things that have survived some selection process and to ignore things that did not survive. It is a cognitive bias and a form of selection bias.
mean
average; also sometimes called the "expected value"
median
the "middle" or the "50th percentile" of the data
mode
the most commonly occurring value
percentile
a measure of relative position: the value below which a given percentage of observations fall
What is an example of percentile?
the median is the 50th percentile
interquartile range
difference between the 75th percentile and the 25th percentile (the middle 50% of the data)
variance
the average of the squared differences from the mean (denoted as σ^2)
standard deviation
the square root of the variance (denoted as σ)
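A minimal R sketch of the statistics defined so far, using a made-up vector `x` (an assumption for illustration only). Note that base R's `var()` and `sd()` divide by n - 1 (the sample versions), so they differ slightly from the "average of squared differences" definition on the card.

```r
# Hypothetical sample data, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)

mean_x   <- mean(x)               # average
median_x <- median(x)             # middle value (50th percentile)
# Base R has no built-in statistical mode; tabulate and take the most frequent value
mode_x   <- as.numeric(names(which.max(table(x))))
p75      <- quantile(x, 0.75)     # 75th percentile
iqr_x    <- IQR(x)                # 75th percentile minus 25th percentile

# Variance as defined on the card: average squared difference from the mean
pop_var  <- mean((x - mean(x))^2)
pop_sd   <- sqrt(pop_var)

# var() and sd() use n - 1 in the denominator (sample variance / sd)
samp_var <- var(x)
samp_sd  <- sd(x)
```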
purpose of sampling
1. don't have the resources to collect data from the entire population
2. don't have access to the entire population
3. when evaluated properly, the sample can be used to make inferences about the entire population
central limit theorem
tells us that the means of sufficiently large samples are approximately normally distributed, even when the population itself is not
law of large numbers
tells us that if we repeat something many times (or take a large sample) then we will get closer to the true average outcome
central limit theorem and the law of large numbers together
allows us to infer that our sample tells us something about the population
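Both ideas can be illustrated with a small simulation. This is only a sketch: the exponential population, the seed, and the sample sizes are arbitrary assumptions, not part of the original cards.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical skewed population: exponential with true mean 1 / rate = 2
population_mean <- 2
draw <- function(n) rexp(n, rate = 1 / population_mean)

# Law of large numbers: the mean of a larger sample tends to land closer to 2
small_sample_mean <- mean(draw(10))
large_sample_mean <- mean(draw(100000))

# Central limit theorem: the means of many samples (n = 50 each) are roughly
# bell-shaped even though the population itself is skewed
sample_means <- replicate(5000, mean(draw(50)))
hist(sample_means, breaks = 40, main = "Means of 5,000 samples (n = 50)")
```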
structured data
generic term for data that is organized for some purpose. relatively easy to work with.
What are some examples of structured data?
1. calendar
2. grade book
3. diamonds dataset
"tidy" data
structured data that is ready to analyze
columns
variables
what are examples of columns?
net income, assets, revenues
rows
observations or units
what are some examples of rows?
companies
data "tidying" or "wrangling" often takes up ____________ of the time spent on a data analysis project
most
unstructured data
data that is not organized or well curated for analysis
cross-sectional data
data for many subjects at a single point in time
what is an example of cross-sectional data?
financial data from all public companies in 2022
time-series data
data for one subject over time
what are some examples of time-series data?
Tesla's financial data from 2010-2020
panel data
data for many subjects over time
what are some examples of panel data?
financial data from all public companies for 2010-2020
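The three data shapes can be sketched with one small data frame. The companies and revenue figures below are invented for illustration.

```r
# Hypothetical panel: two made-up companies observed over three years
panel <- data.frame(
  company = rep(c("A", "B"), each = 3),
  year    = rep(2020:2022, times = 2),
  revenue = c(10, 12, 15, 7, 8, 9)
)

# Cross-sectional: many subjects at one point in time
cross_section <- panel[panel$year == 2022, ]

# Time series: one subject over time
time_series <- panel[panel$company == "A", ]
```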
factor or categorical variable
take a limited number of values
what are the two types of categorical variables?
1. ordinal
2. nominal
what are some examples of ordinal variables?
education level, grades, bond ratings
what are some examples of nominal variables?
geographic location, industry classification, colors
discrete number variables
take specific values like whole numbers or integers
what are some examples of discrete number variables?
inventory counts and population
indicator, Boolean, and logical variables
take binary values
what are some examples of indicator, Boolean and logical variables?
True/False, Win/Lose, Heads/Tails
continuous numeric variables
take (in theory) an infinite number of values (often stored as numeric)
what are some examples of continuous numeric variables?
earnings per share, distances, time
character variables
capture text or "strings"
what are some examples of character variables?
names, phrases, sentences
date variable
capture dates and times
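One way these variable types might appear in R, as a rough sketch. All values and variable names here are hypothetical.

```r
# Hypothetical values, for illustration only
rating <- factor(c("AA", "BBB", "AAA"),
                 levels = c("BBB", "AA", "AAA"),
                 ordered = TRUE)                          # ordinal categorical
industry <- factor(c("tech", "retail", "tech"))           # nominal categorical
units_in_stock <- c(10L, 0L, 25L)                         # discrete (integer)
is_profitable  <- c(TRUE, FALSE, TRUE)                    # indicator / logical
eps  <- c(1.25, -0.40, 3.10)                              # continuous numeric
name <- c("Alpha Co", "Beta Inc", "Gamma LLC")            # character
fiscal_year_end <- as.Date(c("2022-12-31", "2022-06-30", "2022-09-30"))  # date

# Ordered factors support comparisons; nominal factors do not
rating[3] > rating[1]

class(rating)
class(units_in_stock)
class(fiscal_year_end)
```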
snake case
all characters in lower case and underscores represent spaces
camel case
characters are in lower case except the first letter of each new word, which is capitalized
filter()
extract a "subset" of observations or rows in a dataset
select()
extract variables or columns in a dataset
arrange()
sort data
mutate()
create or replace variables
group_by()
work with data in "groups"
ungroup()
to undo group_by()
summarize()
aggregate and summarize data
distinct()
identify distinct observations and remove duplicate observations
duplicated()
identify duplicated observations
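The verbs above are usually chained together with the pipe. The sketch below assumes the dplyr package is installed and uses an invented `firms` table; the column names are hypothetical.

```r
library(dplyr)

# Hypothetical company-level data (rows = companies, columns = variables)
firms <- tibble(
  company    = c("A", "B", "C", "D", "D"),
  industry   = c("tech", "tech", "retail", "retail", "retail"),
  revenue    = c(100, 250, 80, 120, 120),
  net_income = c(10, 40, -5, 12, 12)
)

result <- firms %>%
  distinct() %>%                               # drop the repeated row for D
  filter(revenue > 90) %>%                     # keep rows meeting a condition
  mutate(margin = net_income / revenue) %>%    # create a new variable
  select(company, industry, margin) %>%        # keep only some columns
  group_by(industry) %>%                       # work within each industry
  summarize(avg_margin = mean(margin), n = n()) %>%
  ungroup() %>%                                # make sure no grouping remains
  arrange(desc(avg_margin))                    # sort, highest margin first

# Base R: flags rows that repeat an earlier row
dup_flags <- duplicated(firms)
```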
lubridate
working with dates and times
in lubridate y means
year
in lubridate m means (in the date part, e.g. ymd)
month
in lubridate d means
day
in lubridate h means
hour
in lubridate m means (in the time part, e.g. hms)
minute
in lubridate s means
second
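The letters combine into parser names such as `ymd()` or `ymd_hms()`, where the order of the letters matches the order of the components in the input. A short sketch, assuming the lubridate package is installed; the dates are arbitrary.

```r
library(lubridate)

d1 <- ymd("2022-03-15")               # year, month, day
d2 <- mdy("03/15/2022")               # month, day, year
d3 <- dmy("15-03-2022")               # day, month, year
t1 <- ymd_hms("2022-03-15 14:30:05")  # date plus hour, minute, second (UTC by default)

# Accessor functions pull individual components back out
year(d1); month(d1); day(d1)
hour(t1); minute(t1); second(t1)
```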
tidyr
organizing and reshaping data
unite()
combine columns
separate()
separate columns
pivot_longer()
reshape datasets from wide to long by converting columns into rows
pivot_wider()
reshape datasets from long to wide by converting rows into columns
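A small round trip shows the direction of each reshape. This sketch assumes the tidyr package is installed (tibble comes with it); the `wide` table is invented.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- tibble::tibble(
  company = c("A", "B"),
  `2021`  = c(10, 7),
  `2022`  = c(12, 8)
)

# Wide to long: the year columns become rows
long <- pivot_longer(wide, cols = c(`2021`, `2022`),
                     names_to = "year", values_to = "revenue")

# Long to wide: the rows become year columns again
wide_again <- pivot_wider(long, names_from = year, values_from = revenue)

# unite() pastes columns together; separate() splits them apart
united    <- unite(long, "id", company, year, sep = "_")
separated <- separate(united, id, into = c("company", "year"), sep = "_")
```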
str_remove()
remove certain characters from a string
str_replace()
replace certain characters from strings
str_to_lower()
convert all characters to lower case
str_to_upper()
convert all characters to upper case
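These string helpers come from the stringr package, where the case functions are named `str_to_lower()` and `str_to_upper()`. A brief sketch on a hypothetical column label, ending with a snake case name:

```r
library(stringr)

x <- "Net Income (USD)"   # hypothetical column label

no_units  <- str_remove(x, " \\(USD\\)")   # remove the first match of a pattern
first_gap <- str_replace(x, " ", "_")      # replace the first match only
all_gaps  <- str_replace_all(x, " ", "_")  # replace every match
lower     <- str_to_lower(x)
upper     <- str_to_upper(x)

# Combine the steps to build a snake case name
snake <- str_to_lower(str_replace_all(no_units, " ", "_"))
```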
left_join()
joins the left (L) and right (R) datasets keeping all rows in the L dataset
right_join()
joins the right (R) and left (L) datasets keeping all rows in the R dataset
inner_join()
joins the left (L) and right (R) datasets keeping only rows that match
full_join()
joins the left (L) and right (R) datasets keeping all rows from both datasets
semi_join()
retain rows in the left (L) dataset that match the right (R) dataset (similar to an inner join but without adding the variables from R)
anti_join()
retain rows in the left (L) dataset that do not match to the right (R) dataset
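The six joins can be compared side by side on two tiny tables. The tables `L` and `R` and the key column `company` are assumptions for illustration; the sketch requires dplyr.

```r
library(dplyr)

# Hypothetical left (L) and right (R) tables sharing the key `company`
L <- tibble(company = c("A", "B", "C"), revenue   = c(100, 250, 80))
R <- tibble(company = c("B", "C", "D"), employees = c(40, 15, 60))

left  <- left_join(L, R, by = "company")    # all rows of L: A, B, C
right <- right_join(L, R, by = "company")   # all rows of R: B, C, D
inner <- inner_join(L, R, by = "company")   # matches only: B, C
full  <- full_join(L, R, by = "company")    # everything: A, B, C, D
semi  <- semi_join(L, R, by = "company")    # rows of L with a match, L's columns only
anti  <- anti_join(L, R, by = "company")    # rows of L without a match: A
```

Unmatched rows in the left, right, and full joins get `NA` in the columns that come from the other table.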
what are the 7 components of the grammar of graphics?
1. Data
2. Aesthetics
3. Geometries
4. Facets
5. Statistics
6. Coordinates
7. Themes
aesthetics
x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type
geometries
point, line, histogram, bar, box plot
facets
columns, rows
statistics
binning, smoothing, descriptive, inferential
coordinates
cartesian, fixed, polar, limits
themes
non-data ink
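In ggplot2 each component maps onto a layer added with `+`. The sketch below is one possible arrangement, using the `diamonds` dataset that ships with ggplot2; the particular aesthetics and settings are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(data = diamonds,                                  # 1. data
            aes(x = carat, y = price, color = cut)) +         # 2. aesthetics
  geom_point(alpha = 0.3) +                                   # 3. geometry
  facet_wrap(~ cut) +                                         # 4. facets
  stat_smooth(method = "lm", se = FALSE, color = "black") +   # 5. statistics
  coord_cartesian(xlim = c(0, 3)) +                           # 6. coordinates
  theme_minimal()                                             # 7. theme

# Printing p (for example by typing p at the console) draws the plot
```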
what are the 5 common visualizations?
1. Bar/Pie Charts
2. Histograms
3. Line Charts
4. Scatterplots
5. Boxplots
pie chart
allows us to visualize relative values (means, counts, sums) by categories or groups
bar charts
allow us to visualize relative values (means, counts, sums) by categories or groups; generally more useful than pie charts
histograms
allows us to visualize the distribution of a variable using the frequency of observations inside bins
box plots
allow us to visualize the distribution of the data by displaying the median, interquartile range, and extreme values
line charts
depicts trends in variables. great for time series analysis
scatterplots
visualize the relationship between variables using data points. Observations are plotted on the Y and X axes; expand beyond two dimensions with shapes, colors, sizes, and other features
overplotting
when many points overlap and hide one another; use 'jitter', 'transparency', and sampling to alleviate overplotting
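A rough sketch of the three remedies together, again using ggplot2's built-in `diamonds` data. The sample size, jitter width, and alpha value are arbitrary assumptions.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the same rows are sampled each run

# Sampling: plot a random subset instead of all rows
sampled <- diamonds[sample(nrow(diamonds), 2000), ]

p_overplot <- ggplot(sampled, aes(x = cut, y = price)) +
  geom_jitter(width = 0.25,   # jitter: add small random horizontal noise
              alpha = 0.2)    # transparency: dense regions appear darker
```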
rm(list=ls())
clear the environment
install.packages()
install packages
library()
load installed packages (libraries)
setwd()
set your working directory
getwd()
check your working directory
str()
check the structure of an object
summary()
summarize data
unique() OR distinct()
list the unique values of a variable
table()
count of observations for each unique value
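A short sketch of the inspection functions on a made-up data frame. The `install.packages()` call is left commented out so that running the sketch has no side effects.

```r
# install.packages("dplyr")   # one-time install; run manually if needed

# Hypothetical data frame, for illustration only
df <- data.frame(
  industry = c("tech", "retail", "tech"),
  revenue  = c(100, 80, 250)
)

str(df)                          # structure: column types and first values
summary(df$revenue)              # min, quartiles, mean, max

industries <- unique(df$industry)   # unique values, in order of first appearance
counts     <- table(df$industry)    # number of observations per unique value

getwd()                          # current working directory
```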
what are the 4 statistics of location?
1. mean
2. median
3. mode
4. percentile