AC

UNIT 1: Describing Data

Statistics - study of methods to describe and measure aspects of nature from samples

  • Provides us tools to quantify the uncertainty of results

  • Revolves around estimation and inference

Estimation - process of inferring an unknown quantity of a population using sample data

  • Assesses differences amongst groups and relationships b/w variables

    • Ex: the effects of certain medical drugs on the possibility of recovery from illness

Parameters - quantities that describe populations

  • Ex: averages, proportions, measures of variation

Statistical hypothesis - specific claim regarding a population parameter

  • Hypothesis testing uses data to evaluate evidence for or against statistical hypotheses

Population - all the individuals of interest

  • Ex: all the genes in a human genome

Sample - subset of individuals taken from a population

  • Ex: subset of 20 genes from human genome

Sampling Error - chance difference between an estimate and the population parameter being estimated, caused by sampling

Bias - systematic discrepancy between the estimates we would obtain, if we could sample a population again and again, and the true population parameter

  • The goal of sampling is to minimize sampling error and bias
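The difference between sampling error and bias can be illustrated with a small simulation. This is only a sketch: the population values, the sample size of 20, and the "top half only" sampling scheme are all made-up assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 measurements (values are invented).
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# A deliberately flawed sampling frame: only the largest half of the population.
top_half = sorted(population)[len(population) // 2:]

def sample_mean(units, n):
    """Mean of n units drawn at random (without replacement) from `units`."""
    return statistics.mean(rng.sample(units, n))

# "Sample the population again and again" under each scheme.
random_estimates = [sample_mean(population, 20) for _ in range(500)]
biased_estimates = [sample_mean(top_half, 20) for _ in range(500)]

# Sampling error: individual estimates scatter around the true mean by chance.
spread = statistics.stdev(random_estimates)
# Random sampling: on average the estimates land close to the true mean.
random_offset = statistics.mean(random_estimates) - true_mean
# Bias: these estimates stay systematically too high however often we resample.
biased_offset = statistics.mean(biased_estimates) - true_mean

print(f"spread of random estimates: {spread:.2f}")
print(f"average offset (random sampling): {random_offset:.2f}")
print(f"average offset (biased sampling): {biased_offset:.2f}")
```

Taking larger samples would shrink the spread, but it would not remove the offset of the biased scheme.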

Random sampling - each member of a population has an equal and independent chance of being selected

  • Minimizes bias and makes it possible to measure the amount of sampling error

  1. Create a list of every unit in the population of interest; assign each unit a number

  2. Decide on the number of units to be sampled

  3. Use a random number generator to generate n random integers between 1 and the total number of units in the population

  4. Sample the units whose numbers match those produced by the random generator

  • It may be easier to make units groups rather than individuals in certain cases

    • Ex: assessing microbes
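The four random-sampling steps above can be sketched in Python. The function name `simple_random_sample` and the list of 100 gene labels are hypothetical, chosen only to make the example runnable; the draw is made without replacement so no unit is picked twice.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Pick n units so that every unit has the same chance of selection."""
    rng = random.Random(seed)
    # Step 1: list every unit and assign each a number from 1 to N.
    numbered = {i: unit for i, unit in enumerate(units, start=1)}
    # Steps 2-3: n is the chosen sample size; draw n distinct random
    # integers between 1 and N.
    chosen_numbers = rng.sample(range(1, len(numbered) + 1), n)
    # Step 4: sample the units whose numbers were drawn.
    return [numbered[i] for i in chosen_numbers]

# Hypothetical population of 100 genes; sample 20 of them.
genes = [f"gene_{i}" for i in range(1, 101)]
sample = simple_random_sample(genes, 20, seed=1)
print(sample)
```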

Sample of convenience - collection of individuals that are easily available to the researcher

  • Ex: if you interview the first 200 people who walk into a building

  • Violates the assumption of independence

Volunteer bias - bias resulting from a systematic difference between the pool of volunteers and the population to which they belong

  • Volunteers may not be representative of the rest of the population

    • May be:

      • Healthier

      • Low-income (if paid for participating)

      • More proactive

      • Less prudish

Variable - characteristic that differs among individuals or other sampling units

Data - measurements of one or more variables made on a sample of individuals

Categorical variables - describe qualitative characteristics of individuals that do not correspond to a degree of difference on a numerical scale

  • Qualitative variables

    • Ex: survival, sex chromosome

  • nominal=name; ordinal=ordered

Numerical variable - quantitative measurements that have magnitude on a numerical scale

  • Ex: core body temp., age of death

  • Continuous data can take any real-number value within a range; measurements are rounded to a predetermined number of digits for convenience

  • Discrete data come in indivisible units

    • Ex: number of amino acids in a protein

Explanatory variable - variable used to predict or explain changes in the other variable

Response variable - affected variable

  • In an experiment, the manipulated variable is explanatory and the measure of its effects is the response; same as independent and dependent

Frequency - number of observations having a particular value

Frequency distribution - number of times each value of a variable occurs in a sample

  • Reveals distribution of a variable, trends/patterns
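A frequency distribution can be tallied from raw data in a few lines. This is a minimal sketch; the visit counts below are invented for illustration.

```python
from collections import Counter

# Hypothetical sample: number of clinic visits recorded for 12 patients.
visits = [0, 2, 1, 2, 3, 1, 2, 0, 1, 2, 4, 2]

# Frequency of each value = number of observations having that value.
frequency = Counter(visits)

# Print the frequency distribution in order of value.
for value in sorted(frequency):
    print(value, frequency[value])
```

The frequencies always add up to the number of observations in the sample.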

Probability distribution - distribution of a variable in the whole population

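  • Researchers often approximate it with a theoretical probability distribution called the normal distribution

Normal distribution - theoretical probability distribution often used to approximate the distribution of a variable in a population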

  • Resembles a bell curve

Experimental study - researcher assigns different treatments randomly to individuals

Observational study - researcher has no control over which group obtains which treatment

  • Randomly assigning participants minimizes the influence of confounding variables, allowing researchers to isolate the effect of the explanatory variable

  • Experiments reveal cause-and-effect; observational studies reveal associations

Confounding variable - masks or distorts the causal relationship between measured variables in a study

  • Random assignment doesn’t necessarily eliminate a confounding variable; it simply gives each group an equal chance of experiencing it
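Random assignment in an experimental study might be sketched as follows. The function name `randomly_assign`, the participant labels, and the two treatment names are hypothetical; real studies often use more elaborate designs.

```python
import random

def randomly_assign(participants, treatments, seed=None):
    """Shuffle participants, then deal them out to treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)           # chance alone decides the order
    groups = {treatment: [] for treatment in treatments}
    for i, person in enumerate(shuffled):
        # Dealing in rotation keeps the group sizes as equal as possible.
        groups[treatments[i % len(treatments)]].append(person)
    return groups

# Hypothetical study: 20 participants, two treatments.
participants = [f"P{i}" for i in range(1, 21)]
groups = randomly_assign(participants, ["drug", "placebo"], seed=42)
print(groups)
```

Because chance alone decides who lands in which group, any confounding trait (age, health, income) has the same chance of showing up in either group.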

Describing Data

  • We need to understand data: how it’s collected and how it’s summarized

Population - entire group under a study (e.g. all patients with diabetes in a country)

Sample - subset of the population actually studied, intended to represent the whole

  • Impractical to study entire populations

  • Statistics is scientific; it allows us to take what we learn from the sample and generalize it to an entire population

Qualitative (Categorical)

  • Represent qualities or categories

Nominal - no natural order to categories (e.g. sex, birthplace)

Ordinal - categories with an order (e.g. social class, grade level)

  • Qualitative data has no numerical meaning

    • Even if numeric codes are used as labels (e.g. coding 0=female, 1=male)

  • Binary only has two categories (e.g. alive/dead)

Quantitative (Numerical)

  • Represents measurable quantities (e.g. age, height)

Discrete - countable values, usually integers

Continuous - any value within a range (e.g. height, time)

PQ:

  • Variable: received vaccination against measles

  • Qualitative 

  • Nominal

  • Dichotomous

PQ 2:

  • Variable: cancer stage

  • Qualitative

  • Ordinal

  • Not dichotomous

PQ 3:

  • Variable: number of visits

  • Quantitative

  • Discrete

  • Not dichotomous

PQ 4:

  • Variable: exam score

  • Quantitative 

  • Continuous

  • Not applicable

Qualitative Data Presentation

Tables:

  • Organizes frequencies and percentages in a clear way

Satisfaction         Frequency
Very satisfied       59
Satisfied            42
Neutral              12
Dissatisfied         8
Very dissatisfied    5

  • Sum of frequency column is total number of data points

Bar Chart: visualize categories with bar heights proportional to frequencies

  • Spaces between categories to differentiate

Pie Chart: represent proportions of categories as slices of a circle

Frequency Distribution: tabulates data into intervals

  • Can be discrete or continuous

  • Proportion = frequency (of a category) / total

  • Cumulative frequency keeps a running total of all the frequencies
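Proportions and cumulative frequencies can be computed directly from a frequency column. A minimal sketch using the satisfaction frequencies from the table above:

```python
from itertools import accumulate

categories = ["Very satisfied", "Satisfied", "Neutral",
              "Dissatisfied", "Very dissatisfied"]
frequencies = [59, 42, 12, 8, 5]

# Sum of the frequency column = total number of data points.
total = sum(frequencies)

# Proportion = frequency of a category / total.
proportions = [f / total for f in frequencies]

# Cumulative frequency = running total of the frequencies.
cumulative = list(accumulate(frequencies))

for name, f, p, c in zip(categories, frequencies, proportions, cumulative):
    print(f"{name:<18} {f:>3} {p:>6.1%} {c:>4}")
```

The last cumulative frequency always equals the total (126 here), and the proportions add up to 1.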

Apparent Limits, Real Limits

              Lower limit    Upper limit
Apparent      1.76           2.00
Adjustment    - .005         + .005
Real          1.755          2.005

  • No gaps in continuous data

  • Subtract .005 from the lower limit; add .005 to the upper limit (half the unit of measurement, here .01)

    • As one ends, next one starts

    • Do for all tabulated limits
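The adjustment from apparent to real limits can be sketched as a small helper. The function name `real_limits` is hypothetical, and the default unit of .01 is an assumption that matches the example above; `Decimal` is used so the results come out exactly as 1.755 and 2.005.

```python
from decimal import Decimal

def real_limits(lower, upper, unit="0.01"):
    """Widen apparent class limits by half the unit of measurement."""
    half_unit = Decimal(unit) / 2          # .01 / 2 = .005
    return Decimal(lower) - half_unit, Decimal(upper) + half_unit

# Apparent limits 1.76 to 2.00, measured to the nearest .01.
low, high = real_limits("1.76", "2.00")
print(low, high)
```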

Quantitative Data Presentation

Histogram - graphical bars showing distribution of quantitative variables

Frequency Curves - line graph connecting frequencies or representing smoothed distributions

Stem-Leaf plot - show actual values in structured rows

Data: 14,19,22,24,28,30,36,44,49,57

STEM | LEAF
1    | 4 9
2    | 2 4 8
3    | 0 6
4    | 4 9
5    | 7
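A stem-leaf plot like the one above can be built by splitting each value into its tens digit (stem) and units digit (leaf). This is a minimal sketch that assumes non-negative whole-number data; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem); collect units digits (leaves)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 28 -> stem 2, leaf 8
        rows[stem].append(leaf)
    return dict(rows)

data = [14, 19, 22, 24, 28, 30, 36, 44, 49, 57]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```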

Dot plot - each observation plotted as a dot along a number line

Cumulative Frequency Polygons - show how many observations fall below certain values

  • Ex: growth charts for baby height/weight

  • Key words for cumulative frequency “and below”

  • Distributions can be symmetric, skewed, or multimodal

    • Ex: bell-shaped (symmetric), skewed left/right

  • The long tail tells you the direction of the skew (tail on the right = skewed right)

Measures of Central Tendency

Mean - arithmetic mean = sum of data points/number of data points

Median - middle value, 50th percentile

Mode - most frequent value

Weighted mean - accounts for unequal importance of values
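The four measures can be computed with Python's standard library. This is a sketch only: `data` reuses the values from the stem-leaf example, while `ratings`, `scores`, and `weights` are invented for illustration.

```python
import statistics

data = [14, 19, 22, 24, 28, 30, 36, 44, 49, 57]

# Mean = sum of data points / number of data points.
mean = sum(data) / len(data)             # 323 / 10

# Median = middle value; with an even count, the average of the two middle values.
median = statistics.median(data)         # (28 + 30) / 2

# Mode = most frequent value (hypothetical ratings, where 3 occurs most often).
ratings = [2, 3, 3, 5, 3, 4]
mode = statistics.mode(ratings)

# Weighted mean = sum(weight * value) / sum(weights).
# Hypothetical scores on homework, midterm, and final, weighted 2 : 3 : 5.
scores = [80, 90, 70]
weights = [2, 3, 5]
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)

print(mean, median, mode, weighted_mean)
```

With these weights the final counts most, so the weighted mean (78) sits below the unweighted mean of the three scores (80).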