population
the complete set of units (people, firms, etc.) we want
to study
sample
a subset of the population examined to learn about the
population.
representative sample
a sample that mirrors the population on
relevant characteristics
sampling bias
systematic under- or over-representation of some
population members.
statistic
either a function applicable to data or the result of that
function, i.e. a number
parameter
a numerical characteristic of a population that a
statistic aims to estimate
Qualitative (categorical) data
the result of categorising or describing attributes of a population
Quantitative (numerical) data
the result of counting or measuring attributes of a population
variable
a characteristic of a unit being observed that may take more than one of a set of values across the members of the population
numerical variable
takes on values with equal units such as petals per flower
categorical variable
places a person or thing into a category, e.g. colour of flower
data
the observed value of the variable(s)
quantitative discrete variables
take on only certain numerical values, e.g. calls per week
quantitative continuous variables
take on all values in a defined range, e.g. length, weight, time
median
middle value separating the greater and lesser halves of a data set
mode
most frequent value in a data set
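A small sketch of the median and mode, using Python's standard `statistics` module. The data (`calls_per_week`) is made up for illustration.

```python
import statistics

# Hypothetical data: calls received per week over nine weeks.
calls_per_week = [3, 5, 2, 2, 8, 2, 4, 6, 7]

# Median: the middle value once the data are sorted
# (sorted: 2, 2, 2, 3, 4, 5, 6, 7, 8).
median_calls = statistics.median(calls_per_week)

# Mode: the value that occurs most often (2 appears three times).
mode_calls = statistics.mode(calls_per_week)

print("median:", median_calls)
print("mode:", mode_calls)
```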
function
a rule that assigns to each input exactly one output. It comes with a domain (allowed inputs) and a codomain (possible outputs). The set of outputs actually attained is the range.
Domain and range
a function maps every element in the domain to exactly one element in the range. Although each input is sent to only one output, two different inputs can be sent to the same output
statistical functions
when we aggregate data, we take a high-dimensional domain (many observations) and map it to a low-dimensional range (one or a few numbers).
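One way to picture aggregation as a function, sketched in Python. The data (`petals`) is invented; the point is only that many observations go in and a single number comes out, and that different inputs can share the same output.

```python
import statistics

# Hypothetical sample: petals per flower for five flowers.
petals = [5, 8, 5, 13, 8]

# The input has one coordinate per flower (here: 5 dimensions) ...
n_inputs = len(petals)

# ... and each aggregate maps it to a single number (1 dimension).
summary = {
    "mean": statistics.mean(petals),
    "min": min(petals),
    "max": max(petals),
}

# Two different inputs can be sent to the same output:
same_output = statistics.mean([1, 3]) == statistics.mean([2, 2])

print(n_inputs, summary, same_output)
```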
bar graph
the length of the bar for each category is proportional to the number or percentage of individuals in that category. Bars may be vertical or horizontal. Include zero on the axis of the bar chart
simple random sampling
picking individuals out of the population with equal chance
the problem, sampling bias
some members of the population are not as likely to be chosen as others and we do not account for it
common types of sampling bias
self-selection, exclusion, survivorship
simple random sample
any group of individuals is as likely to be chosen as any other group of the same size
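A minimal sketch of drawing a simple random sample with Python's `random` module. The population of numbered units and the seed are assumptions made for the example.

```python
import random

# Hypothetical population of 100 units, identified by number.
population = list(range(1, 101))

# Fixed seed so the draw can be reproduced.
rng = random.Random(42)

# random.sample draws without replacement; every subset of size 10
# has the same chance of being selected.
sample = rng.sample(population, k=10)

print(sorted(sample))
```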
proportionate stratified sample
divide the population into groups called strata and then take a proportionate number from each stratum. Advantage: the sample is representative along the characteristic used for stratification
disproportionate stratified sample
over-sample (pick individuals with a higher chance from) groups with large variance, e.g. smaller groups. Leads to biased results if the over-sampling is not adjusted for
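A rough sketch of a proportionate stratified sample. The helper `proportionate_stratified_sample` and the toy population are hypothetical; the idea is simply to draw the same fraction from every stratum.

```python
import random
from collections import defaultdict


def proportionate_stratified_sample(units, stratum_of, fraction, rng):
    """Draw (approximately) the same fraction from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportionate share
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 units in stratum "A", 20 in stratum "B".
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]

sample = proportionate_stratified_sample(
    population,
    stratum_of=lambda unit: unit[0],
    fraction=0.1,
    rng=random.Random(0),
)

# The sample keeps the 80:20 split of the population.
counts = {s: sum(1 for unit in sample if unit[0] == s) for s in ("A", "B")}
print(counts)
```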
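A sketch of why over-sampling needs an adjustment. The strata and values are invented, and weighting each stratum mean by its population share is one common adjustment assumed here, not necessarily the one intended by the card.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: a large stratum A and a small stratum B.
strata = {
    "A": [10] * 900,  # 900 units, each with value 10
    "B": [50] * 100,  # 100 units, each with value 50
}
population_size = sum(len(values) for values in strata.values())
true_mean = sum(sum(values) for values in strata.values()) / population_size

# Disproportionate design: 50 units from each stratum, so B is over-sampled.
samples = {name: rng.sample(values, 50) for name, values in strata.items()}

# Unadjusted: pool everything and average. Stratum B counts for too much.
naive_mean = statistics.mean(samples["A"] + samples["B"])

# Adjusted: weight each stratum mean by that stratum's population share.
weighted_mean = sum(
    statistics.mean(samples[name]) * len(strata[name]) / population_size
    for name in strata
)

print(true_mean, naive_mean, weighted_mean)
```

In this toy setup the unadjusted mean is far from the population mean, while the weighted mean recovers it.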
cluster sample
divide population into clusters (groups) and then randomly select some of the clusters. Include all the members from these clusters
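The two steps of a cluster sample, sketched with made-up clusters (school classes and pupil names are purely illustrative).

```python
import random

# Hypothetical population grouped into clusters, e.g. pupils by class.
clusters = {
    "class_1": ["Ana", "Ben", "Cleo"],
    "class_2": ["Dev", "Eli"],
    "class_3": ["Fay", "Gus", "Hana", "Ivo"],
    "class_4": ["Jo", "Kim", "Lea"],
}

rng = random.Random(7)

# Step 1: randomly select some of the clusters.
chosen = rng.sample(sorted(clusters), k=2)

# Step 2: include every member of each selected cluster.
sample = [member for name in chosen for member in clusters[name]]

print(chosen, sample)
```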
convenience sample
use results that are readily available (already collected). Cheaper, but might be biased
distribution
a description of how often each outcome occurs
empirical cumulative distribution function (ECDF)
a standard representation for an empirical (observed) distribution: for each value x, it gives the proportion of observations less than or equal to x
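A minimal sketch of an ECDF, assuming the usual definition that F(x) is the proportion of observations less than or equal to x. The helper name `ecdf` and the data are illustrative.

```python
import bisect


def ecdf(data):
    """Return a function F where F(x) is the share of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F


observations = [2, 4, 4, 5, 7]
F = ecdf(observations)

print(F(1), F(4), F(7))
```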
Histograms
divide the span of our data into non-overlapping bins of the same size. Then, for each bin, we count the number of values that fall into that interval. The histogram plots these counts as bars, with the base of each bar defined by the interval. Histograms are preferred over ECDFs as they are easier to interpret
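The binning and counting step behind a histogram can be sketched as follows. `histogram_counts` is a hypothetical helper; it assumes the data are not all identical (otherwise the bin width would be zero) and puts the maximum value into the last bin.

```python
def histogram_counts(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for value in data:
        # The largest value would land one past the end, so clamp it.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements (e.g. lengths).
lengths = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 4.0, 5.0]

# Bins: [1, 2), [2, 3), [3, 4), [4, 5]
counts = histogram_counts(lengths, n_bins=4)
print(counts)
```

Plotting these counts as bars over their intervals gives the histogram.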
Smooth density plots
basically smoothing out the edges of a histogram
Advantage of smooth density plot
prettier, and easier to compare several distributions as it is less messy
Disadvantage of smooth density plot
interpretation is slightly more difficult; the form depends on the underlying smoothing; it is never a good idea to use methods we don't understand well
what does it mean when data is not pretty?
it means, for instance, that the distribution is asymmetric or contains outliers
percentiles
the values at or below which a proportion p = 0.01, 0.02, …, 0.99 of the data fall, respectively
median (percentile)
the most often used percentile is the 50th percentile, called the median
quartiles
these are the percentiles at p = 0.25, 0.5, 0.75
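A sketch of computing quartiles with `statistics.quantiles` from the Python standard library. Several conventions exist for interpolating percentiles; the `"inclusive"` method is chosen here only for illustration, and the data are made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four parts and returns the three cut points,
# i.e. the percentiles at p = 0.25, 0.5 and 0.75.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)

# The middle quartile is the median.
print(q2 == statistics.median(data))
```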
range
the difference between the largest value and smallest value
box plot
provides a five-number summary of the data, composed of the range (minimum and maximum) along with the quartiles
stratification
dividing observations into groups based on the values of one or more variables associated with these observations
variance
a measure of variation in the population. It is defined as the sum of squared deviations from the mean divided by the number of units
standard deviation
the square root of the variance
what is the function of standard deviation?
it provides a numerical measure of the overall amount of variation in a data set. It is always positive or zero. It is small when the data are all concentrated close to the mean, exhibiting little variation or spread. It can also be used to determine whether a particular data value is close to or far from the mean
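A short worked example of variance and standard deviation as defined on these cards (dividing by the number of units), with made-up data. The last step, expressing a value's distance from the mean in standard deviations, is one way to judge whether it is close to or far from the mean.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Variance: sum of squared deviations from the mean,
# divided by the number of units.
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

# The standard library's population versions use the same definition.
assert variance == statistics.pvariance(data)
assert std_dev == statistics.pstdev(data)

# How far is the value 9 from the mean, in standard deviations?
distance = (9 - mean) / std_dev

print(mean, variance, std_dev, distance)
```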