UNIT 1: Describing Data
Statistics - study of methods to describe and measure aspects of nature from samples
Provides tools to quantify the uncertainty of results
Revolves around estimation and inference
Estimation - process of inferring an unknown quantity of a population using sample data
Assesses differences among groups and relationships between variables
Ex: the effects of certain medical drugs on the possibility of recovery from illness
Parameters - quantities that describe populations
Ex: averages, proportions, measures of variation
Statistical hypothesis - specific claim regarding a population parameter
Uses data to evaluate evidence for or against statistical hypotheses
Population - all the individuals of interest
Ex: all the genes in a human genome
Sample - subset of individuals taken from a population
Ex: subset of 20 genes from human genome
Sampling error - chance difference between an estimate and the population parameter being estimated, caused by sampling
Bias - systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population characteristic
The goal of sampling is to minimize sampling error and bias
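The difference between sampling error and bias can be seen in a small simulation. This is only a sketch: the population (10,000 normally distributed values), the sample size of 25, and the "upper half only" sampling scheme are all made-up assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population with a known mean (the parameter being estimated).
population = [random.gauss(50, 10) for _ in range(10_000)]
parameter = statistics.mean(population)

# Hypothetical biased scheme: only the larger half of the values can be sampled.
upper_half = sorted(population)[len(population) // 2:]

def estimate_mean(pool, n):
    """Estimate the population mean from n units drawn at random from pool."""
    return statistics.mean(random.sample(pool, n))

random_estimates = [estimate_mean(population, 25) for _ in range(500)]
biased_estimates = [estimate_mean(upper_half, 25) for _ in range(500)]

# Sampling error: individual estimates scatter by chance around the parameter.
spread = statistics.stdev(random_estimates)
# Bias: the estimates are off from the parameter on average, not just by chance.
random_offset = statistics.mean(random_estimates) - parameter
biased_offset = statistics.mean(biased_estimates) - parameter

print(f"parameter:                      {parameter:.2f}")
print(f"spread of random estimates:     {spread:.2f}")
print(f"average offset, random samples: {random_offset:.2f}")
print(f"average offset, biased samples: {biased_offset:.2f}")
```

Under these assumptions the random samples miss the parameter in both directions and average out close to it, while the biased samples miss in the same direction every time.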
Random sampling - each member of a population has an equal and independent chance of being selected
Minimizes bias and makes it possible to measure the amount of sampling error
Create a list of every unit in the population of interest; assign each unit a number
Decide on the number of units to be sampled
Use a random number generator to generate n random integers between 1 and the population size
Sample the units whose numbers match those produced by the random generator
In certain cases it may be easier to use groups, rather than individuals, as the sampling units
Ex: assessing microbes
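The four steps of random sampling can be sketched in a few lines. The function name `simple_random_sample` and the 100-gene population are hypothetical, and the sketch assumes sampling without replacement (no unit is picked twice), which the notes do not state explicitly.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n distinct units, each with the same chance of being selected."""
    rng = random.Random(seed)
    # Step 1: list every unit and assign each a number from 1 to the population size.
    numbered = {number: unit for number, unit in enumerate(units, start=1)}
    # Steps 2-3: n is the chosen sample size; draw n distinct random integers.
    chosen_numbers = rng.sample(range(1, len(numbered) + 1), n)
    # Step 4: sample the units whose numbers were drawn.
    return [numbered[number] for number in chosen_numbers]

# Hypothetical population: 100 genes labeled gene_1 ... gene_100.
genes = [f"gene_{i}" for i in range(1, 101)]
sample = simple_random_sample(genes, n=20, seed=42)
print(sample)
```

Passing a seed only makes the draw repeatable for checking; leaving it as `None` gives a different sample each run.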
Sample of convenience - collection of individuals that are easily available to the researcher
Ex: if you interview the first 200 people who walk into a building
Violates the assumption of independence
Volunteer bias - bias resulting from a systematic difference between the pool of volunteers and the population to which they belong
Certain volunteers may not be representative of rest of population
May be:
Healthier
Low-income (if paid for participating)
More proactive
Less prudish
Variable - characteristic that differs among individuals or other sampling units
Data - measurements of one or more variables made on a sample of individuals
Categorical variables - describe qualitative characteristics of individuals that do not correspond to a degree of difference on a numerical scale
Qualitative variables
Ex: survival, sex chromosome
nominal=name; ordinal=ordered
Numerical variable - quantitative measurements that have magnitude on a numerical scale
Ex: core body temp., age of death
Continuous data can take any real-number value within a range; in practice they are rounded to a predetermined number of digits set for convenience
Discrete data comes in indivisible units
Ex: number of amino acids in a protein
Explanatory variable - variable used to predict or explain changes in the other variable
Response variable - affected variable
In an experiment, the manipulated variable is explanatory and the measure of its effects is the response; same as independent and dependent
Frequency - number of observations having a particular value
Frequency distribution - number of times each value of a variable occurs in a sample
Reveals distribution of a variable, trends/patterns
Probability distribution - distribution of a variable in the whole population
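Researchers often approximate it with a theoretical probability distribution, such as the normal distribution
Normal distribution - theoretical probability distribution used to approximate the distribution of many variables in a population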
Resembles a bell curve
Experimental study - researcher assigns different treatments randomly to individuals
Observational study - researcher has no control over which individuals receive which treatment
Randomly assigning participants minimizes the influence of confounding variables, allowing researchers to isolate the effect of the explanatory variable
Experiments reveal cause-and-effect; observational studies reveal associations
Confounding variable - masks or distorts the causal relationship between measured variables in a study
Random assignment doesn't necessarily eliminate a confounding variable; it simply gives each group an equal chance of experiencing it
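Random assignment in an experimental study can be sketched as a shuffle followed by dealing participants into groups. The participant IDs, the treatment names, and the helper `randomly_assign` are hypothetical; equal group sizes are an assumption of this sketch, not a requirement stated in the notes.

```python
import random

def randomly_assign(participants, treatments, seed=None):
    """Shuffle participants, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for position, person in enumerate(shuffled):
        treatment = treatments[position % len(treatments)]
        groups[treatment].append(person)
    return groups

# Hypothetical study: 20 participants, two treatments.
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participants, ["drug", "placebo"], seed=7)
for treatment, members in groups.items():
    print(treatment, sorted(members))
```

Because chance alone decides who lands in each group, any confounding variable has the same opportunity to show up in either group.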
Describing Data
We need to understand data: how it is collected and how it is summarized
Population - entire group under a study (e.g. all patients with diabetes in a country)
Sample - subset of the population actually studied, intended to represent the whole
Impractical to study entire populations
Statistics is scientific; it allows us to take what we learn from the sample and generalize it to an entire population
Qualitative (Categorical)
Represent qualities or categories
Nominal - no natural order to categories (e.g. sex, birthplace)
Ordinal - categories with an order (e.g. social class, grade level)
Qualitative data have no numerical meaning
Even if numbers are present, they are only labels (e.g. coding 0=female, 1=male)
Binary (dichotomous) - only two categories (e.g. alive/dead)
Quantitative (Numerical)
Represents measurable quantities (e.g. age, height)
Discrete - countable values, usually integers
Continuous - any value within a range (e.g. height, time)
PQ 1:
Variable: received vaccination against measles
Qualitative
Nominal
Dichotomous
PQ 2:
Variable: cancer stage
Qualitative
Ordinal
Not dichotomous
PQ 3:
Variable: number of visits
Quantitative
Discrete
Not dichotomous
PQ 4:
Variable: exam score
Quantitative
Continuous
Not applicable
Qualitative Data Presentation
Tables:
Organizes frequencies and percentages in a clear way
Sum of frequency column is total number of data points
Bar Chart: visualize categories with bar heights proportional to frequencies
Spaces between categories to differentiate
Pie Chart: represent proportions of categories as slices of a circle
Frequency Distribution: tabulates data into intervals
Can be discrete or continuous
Proportion = frequency of a category / total number of observations
Cumulative frequency keeps a running total of all the frequencies
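A frequency table with proportions and cumulative frequencies can be built as below. The 12 cancer-stage observations are invented for illustration, and the category order is supplied by hand because the stages are ordinal.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: cancer stage recorded for 12 patients.
stages = ["I", "II", "II", "III", "I", "IV", "II", "III", "II", "I", "III", "II"]
order = ["I", "II", "III", "IV"]  # ordinal categories, lowest to highest

counts = Counter(stages)
frequencies = [counts[stage] for stage in order]
total = sum(frequencies)                       # total number of data points
proportions = [f / total for f in frequencies]  # frequency of a category / total
cumulative = list(accumulate(frequencies))      # running total of the frequencies

print(f"{'Stage':<6}{'Freq':>5}{'Prop':>8}{'Cum':>5}")
for stage, f, p, c in zip(order, frequencies, proportions, cumulative):
    print(f"{stage:<6}{f:>5}{p:>8.3f}{c:>5}")
```

The last cumulative frequency equals the total, which is a quick check that no observation was dropped.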
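Apparent Limits vs. Real Limits
Ex: interval with apparent limits 1.76 to 2.00
Lower real limit: 1.76 - 0.005 = 1.755
Upper real limit: 2.00 + 0.005 = 2.005
No gaps in continuous data
Subtract 0.005 from each lower limit; add 0.005 to each upper limit (half the unit of measurement, here 0.01)
The upper real limit of one interval is the lower real limit of the next
Repeat for all tabulated limits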
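The conversion from apparent to real limits can be sketched as follows. Only the 1.76 to 2.00 interval comes from the notes; the neighboring intervals and the helper `real_limits` are hypothetical. `Decimal` is used so the results print as exact decimals rather than rounded floats.

```python
from decimal import Decimal

def real_limits(lower, upper, unit="0.01"):
    """Widen apparent limits by half the measurement unit on each side."""
    half_unit = Decimal(unit) / 2
    return Decimal(lower) - half_unit, Decimal(upper) + half_unit

# Apparent limits; the first and last intervals are assumed for illustration.
apparent = [("1.51", "1.75"), ("1.76", "2.00"), ("2.01", "2.25")]
real = [real_limits(lower, upper) for lower, upper in apparent]

for (a_low, a_high), (r_low, r_high) in zip(apparent, real):
    print(f"apparent {a_low} to {a_high}  ->  real {r_low} to {r_high}")
```

With real limits, each interval ends exactly where the next one starts, so the continuous scale has no gaps.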
Quantitative Data Presentation
Histogram - graphical bars showing distribution of quantitative variables
Frequency Curves - line graph connecting frequencies or representing smoothed distributions
Stem-Leaf plot - show actual values in structured rows
Data: 14,19,22,24,28,30,36,44,49,57
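The stem-and-leaf plot for the data above can be generated with a short script. This sketch assumes non-negative two-digit values, with the tens digit as the stem and the units digit as the leaf; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem), keeping units digits as leaves."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    return dict(rows)

data = [14, 19, 22, 24, 28, 30, 36, 44, 49, 57]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
# 1 | 4 9
# 2 | 2 4 8
# 3 | 0 6
# 4 | 4 9
# 5 | 7
```

Each row keeps the actual values readable (stem 2 with leaves 2, 4, 8 is 22, 24, 28) while the row lengths show the shape of the distribution.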
Dot plot - each data point shown as a dot along a number line
Cumulative Frequency Polygons - show how many observations fall below certain values
Used, for example, to track baby height/weight
Key phrase for cumulative frequency: "and below"
Data can be symmetric, skewed, or multimodal
Bell-shaped, skewed left/right
The longer tail points in the direction of the skew (long right tail = skewed right)
Measures of Central Tendency
Mean (arithmetic) - sum of data points / number of data points
Median - middle value, 50th percentile
Mode - most frequent value
Weighted mean - accounts for unequal importance of values
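The four measures can be computed with Python's standard `statistics` module plus a small helper. The exam scores, the course components, and their weights are made-up values, and `weighted_mean` is a hypothetical helper written for this sketch.

```python
import statistics

# Hypothetical exam scores for five students.
scores = [70, 85, 85, 90, 100]

mean = statistics.mean(scores)      # 430 / 5 = 86
median = statistics.median(scores)  # middle of the sorted values = 85
mode = statistics.mode(scores)      # most frequent value = 85

def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical course: homework, midterm, final weighted 2 : 3 : 5.
component_scores = [90, 80, 70]
weights = [2, 3, 5]
course_grade = weighted_mean(component_scores, weights)  # (180 + 240 + 350) / 10 = 77.0

print(mean, median, mode, course_grade)
```

The unweighted mean of the three component scores would be 80; the weighted mean is lower because the lowest score carries the largest weight.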