UNIT 1: Describing Data

Statistics - study of methods to describe and measure aspects of nature from samples

Provides tools to quantify the uncertainty of results
Revolves around estimation and inference

Estimation - process of inferring an unknown quantity of a population using sample data

Assesses differences among groups and relationships between variables
- Ex: the effects of certain medical drugs on the possibility of recovery from illness

Parameters - quantities that describe populations

Ex: averages, proportions, measures of variation

Statistical hypothesis - specific claim regarding a population parameter

Uses data to evaluate evidence for or against statistical hypotheses

Population - all the individuals of interest

Ex: all the genes in a human genome

Sample - subset of individuals taken from a population

Ex: subset of 20 genes from human genome

Sampling error - chance difference between an estimate and the population parameter being estimated, caused by sampling

Bias - systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population parameter

The goal of sampling is to minimize sampling error and bias

Random sampling - each member of a population has an equal and independent chance of being selected

Minimizes bias and makes it possible to measure the amount of sampling error
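A small simulation can make the difference between sampling error and bias concrete. The sketch below assumes a made-up population of 10,000 normally distributed measurements and samples of 25 units; all numbers and variable names are illustrative, not from the notes.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed only so the example is repeatable

# Hypothetical population of 10,000 measurements (illustrative only)
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)  # the population parameter

# Draw many random samples and record each sample mean (an estimate of true_mean)
estimates = [statistics.mean(rng.sample(population, 25)) for _ in range(1000)]

# Sampling error: individual estimates scatter around the parameter by chance
spread = statistics.stdev(estimates)

# Bias: with random sampling, the average estimate should sit close to the parameter
average_error = statistics.mean(estimates) - true_mean

print("spread of estimates:", round(spread, 2))
print("average estimate minus parameter:", round(average_error, 2))
```

The spread of the estimates reflects sampling error, while an average error near zero suggests that random sampling is not systematically off in either direction.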

Create a list of every unit in the population of interest; assign each unit a number
Decide on the number of units to be sampled
Use a random number generator to generate n random integers between 1 and the total number of units in the population
Sample the units whose numbers match those produced by the random generator
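The four steps above can be sketched in Python. This is a minimal illustration, assuming a hypothetical population of 500 numbered units and a sample of 20; the function name `draw_random_sample` and the seed are made up for the example. It draws unit numbers without repeats, which is one reasonable reading of the procedure.

```python
import random

def draw_random_sample(population_size, n, seed=None):
    """Return n distinct unit numbers chosen at random from 1..population_size."""
    rng = random.Random(seed)  # seed is only there to make the example repeatable
    # Step 1: every unit in the population is numbered 1..population_size
    unit_numbers = range(1, population_size + 1)
    # Steps 2-3: n is the chosen sample size; draw n random unit numbers, no repeats
    return sorted(rng.sample(unit_numbers, n))

# Step 4: the units whose numbers were drawn form the sample
chosen = draw_random_sample(population_size=500, n=20, seed=1)
print(chosen)
```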

In some cases it may be easier to treat groups, rather than individuals, as the sampling units
- Ex: assessing microbes

Sample of convenience - collection of individuals that are easily available to the researcher

Ex: if you interview the first 200 people who walk into a building
Violates the assumption of independence

Volunteer bias - bias resulting from a systematic difference between the pool of volunteers and the population to which they belong

Certain volunteers may not be representative of rest of population
- May be:
  - Healthier
  - Low-income (if paid for participating)
  - More proactive
  - Less prudish

Variable - characteristic that differs among individuals or other sampling units

Data - measurements of one or more variables made on a sample of individuals

Categorical variables - describe qualitative characteristics of individuals that do not correspond to a degree of difference on a numerical scale

Also called qualitative variables
- Ex: survival, sex chromosome
Nominal = named categories with no inherent order; ordinal = categories that can be ordered

Numerical variable - quantitative measurements that have magnitude on a numerical scale

Ex: core body temp., age at death
Continuous data can take any real value within a range; in practice they are rounded to a predetermined number of digits set for convenience
Discrete data come in indivisible units
- Ex: number of amino acids in a protein

Explanatory variable - the variable used to predict or explain changes in the other variable

Response variable - affected variable

In an experiment, the manipulated variable is explanatory and the measure of its effects is the response; these correspond to the independent and dependent variables

Frequency - number of observations having a particular value

Frequency distribution - number of times each value of a variable occurs in a sample

Reveals distribution of a variable, trends/patterns
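As a small illustration, a frequency distribution can be tallied with `collections.Counter` from the Python standard library. The survival values below are made-up example data, not from the notes.

```python
from collections import Counter

# Hypothetical sample: survival status of 8 individuals (illustrative data only)
survival = ["alive", "dead", "alive", "alive", "dead", "alive", "alive", "dead"]

# Frequency distribution: how many times each value occurs in the sample
freq = Counter(survival)

for value, count in freq.items():
    print(value, count)
```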

Probability distribution - distribution of a variable in the whole population

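Researchers often approximate it with a theoretical probability distribution such as the normal distribution

Normal distribution - theoretical probability distribution often used to approximate the distribution of a variable in a population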

Resembles a bell curve

Experimental study - researcher assigns different treatments randomly to individuals

Observational study - researcher has no control over which group obtains which treatment

Randomly assigning participants minimizes the influence of confounding variables, allowing researchers to focus on explanatory variables
Experiments can reveal cause-and-effect; observational studies reveal only associations

Confounding variable - masks or distorts the causal relationship between measured variables in a study

Random assignment doesn’t necessarily eliminate a confounding variable; it simply gives each group an equal chance of experiencing it
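Random assignment itself is easy to sketch. The example below assumes ten hypothetical participant IDs and two groups of equal size; the function name `assign_treatments` is made up for illustration.

```python
import random

def assign_treatments(individuals, seed=None):
    """Randomly split individuals into a treatment group and a control group."""
    rng = random.Random(seed)      # seed only makes the example repeatable
    shuffled = list(individuals)   # copy so the input list is not modified
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

participants = [f"P{i}" for i in range(1, 11)]  # hypothetical participant IDs
treatment, control = assign_treatments(participants, seed=42)
print("treatment:", treatment)
print("control:", control)
```

Because the split depends only on the shuffle, any confounding characteristic has the same chance of landing in either group.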

Describing Data

We need to understand data: how it is collected and how it is summarized

Population - entire group under a study (e.g. all patients with diabetes in a country)

Sample - subset of the population actually studied, intended to represent the whole

Impractical to study entire populations
Statistics is scientific; it allows us to take what we learn from the sample and generalize it to an entire population

Qualitative (Categorical)

Represent qualities or categories

Nominal - no natural order to categories (e.g. sex, birthplace)

Ordinal - categories with an order (e.g. social class, grade level)

Qualitative data have no numerical meaning
- Even when categories are coded with numbers (e.g. 0=female, 1=male), the codes are only labels
Binary (dichotomous) variables have only two categories (e.g. alive/dead)

Quantitative (Numerical)

Represents measurable quantities (e.g. age, height)

Discrete - countable values, usually integers

Continuous - any value within a range (e.g. height, time)

PQ:

Variable: received vaccination against measles
Qualitative
Nominal
Dichotomous

PQ 2:

Variable: cancer stage
Qualitative
Ordinal
Not dichotomous

PQ 3:

Variable: number of visits
Quantitative
Discrete
Not dichotomous

PQ 4:

Variable: exam score
Quantitative
Continuous
Not applicable

Qualitative Data Presentation

Tables:

Organizes frequencies and percentages in a clear way

Satisfaction	Frequency
Very satisfied	59
Satisfied	42
Neutral	12
Dissatisfied	8
Very dissatisfied	5

Sum of frequency column is total number of data points

Bar Chart: visualize categories with bar heights proportional to frequencies

Bars are separated by spaces to distinguish the categories

Pie Chart: represent proportions of categories as slices of a circle

Frequency Distribution: tabulates data into intervals

Can be discrete or continuous
Proportion = frequency of a category / total number of observations
Cumulative frequency keeps a running total of all the frequencies
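Proportions and cumulative frequencies can be computed directly from a frequency table. The sketch below reuses the satisfaction frequencies from the table earlier in these notes; the variable names are chosen for the example.

```python
from itertools import accumulate

categories = ["Very satisfied", "Satisfied", "Neutral",
              "Dissatisfied", "Very dissatisfied"]
frequencies = [59, 42, 12, 8, 5]

total = sum(frequencies)                          # total number of data points
proportions = [f / total for f in frequencies]    # frequency of a category / total
cumulative = list(accumulate(frequencies))        # running total of the frequencies

for cat, f, p, c in zip(categories, frequencies, proportions, cumulative):
    print(f"{cat:<18} {f:>4} {p:>7.3f} {c:>5}")
```

The last cumulative frequency always equals the total, which is a quick check that no category was missed.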

Apparent Limits, Real Limits

Apparent limits: 1.76 to 2.00
Lower real limit: 1.76 - 0.005 = 1.755
Upper real limit: 2.00 + 0.005 = 2.005

No gaps in continuous data
Subtract 0.005 from the lower apparent limit; add 0.005 to the upper apparent limit (half the unit of measurement, here 0.01)
- As one ends, next one starts
- Do for all tabulated limits
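The conversion from apparent to real limits can be sketched as a small function. It uses the 1.76 to 2.00 interval from the notes; the second interval (2.01 to 2.25) is a hypothetical neighbour added only to show that adjacent real limits meet. The function name `real_limits` is made up for the example.

```python
def real_limits(lower, upper, unit=0.01):
    """Return (lower, upper) real limits for an interval's apparent limits.

    unit is the measurement precision (0.01 for data recorded to two decimals),
    so half a unit is subtracted from the lower limit and added to the upper.
    """
    half = unit / 2
    # round() only cleans up floating-point noise such as 1.7549999999999999
    return round(lower - half, 10), round(upper + half, 10)

first = real_limits(1.76, 2.00)   # interval from the notes
second = real_limits(2.01, 2.25)  # hypothetical next interval
print(first, second)
```

The upper real limit of the first interval equals the lower real limit of the second, so there are no gaps between intervals.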

Quantitative Data Presentation

Histogram - graphical bars showing distribution of quantitative variables

Frequency Curves - line graph connecting frequencies or representing smoothed distributions

Stem-and-leaf plot - shows actual values in structured rows

Data: 14,19,22,24,28,30,36,44,49,57

STEM	LEAF
1	4 9
2	2 4 8
3	0 6
4	4 9
5	7
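The stem-and-leaf table above can be reproduced with a few lines of Python. This is a minimal sketch that assumes two-digit values, so the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is made up for the example.

```python
from collections import defaultdict

data = [14, 19, 22, 24, 28, 30, 36, 44, 49, 57]

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) and list units digits (leaves)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 24 -> stem 2, leaf 4
        rows[stem].append(leaf)
    return dict(rows)

plot = stem_and_leaf(data)
print("STEM\tLEAF")
for stem in sorted(plot):
    print(f"{stem}\t{' '.join(str(leaf) for leaf in plot[stem])}")
```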

Dot plot - each observation is drawn as a dot above its value on a number line

Cumulative Frequency Polygons - show how many observations fall below certain values

Ex: tracking baby height/weight
Key words for cumulative frequency: “and below”

Data can be symmetric, skewed, or multimodal
- Bell-shaped, skewed left/right
The longer tail gives the direction of the skew (e.g. a long right tail means skewed right)

Measures of Central Tendency

Mean - arithmetic mean = sum of data points/number of data points

Median - middle value, 50th percentile

Mode - most frequent value

Weighted mean - accounts for unequal importance of values
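The four measures can be computed with the Python standard library. The sketch below reuses the ten values from the stem-and-leaf example for the mean and median; the data for the mode and the scores and weights for the weighted mean are made up for illustration.

```python
import statistics

data = [14, 19, 22, 24, 28, 30, 36, 44, 49, 57]

# Mean: sum of data points / number of data points
mean = sum(data) / len(data)

# Median: middle value; with an even count it is the average of the two middle values
median = statistics.median(data)

# Mode: most frequent value(s); a separate made-up data set with repeats is used
modes = statistics.multimode([2, 3, 3, 5, 7, 3, 5])

# Weighted mean: hypothetical scores, each with a hypothetical weight
scores = [80, 90, 70]
weights = [2, 1, 1]
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

print(mean, median, modes, weighted_mean)
```

Here the first score counts twice as much as the others, so the weighted mean lands at that score rather than at the middle value of the three scores.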