INTRODUCTORY STATISTICS
COURSE UNIT SUMMARY:
Unit 1- Background: data collection, types of Statistical Studies (Chapter 1)
Unit 2- Display of data (graphs) (Ch.2-3)
Unit 3- Probability: a tool for estimation and prediction; Gaussian distribution (Ch.5, 7)
Unit 4- Gaussian sampling distributions: their use and importance; C.I.s (Ch. 8)
Unit 5- Gaussian Hypothesis Testing: for population mean or proportion (Ch. 9)
Unit 6- Bivariate Hypothesis Testing and data analysis: Chi-square, Linear Regression (Ch.4, 11)
--------------------------------------------------------------
UNIT 1: Background
1) Individuals
In life, in art and in science we distinguish individuals
Individuals are people, animals, objects, or groups of items of one kind
Individuals may be college students, squirrels, ballpoint pens, or “groups” such as elementary schools or bags of potato chips
Statistics is a science that studies groups instead of individuals.
2) Attributes or characteristics of individuals are called variables: they “vary” between individuals, i.e. are not the same for all
For example, some individual pencils are yellow and some are red: color is one example of a “variable”
Variables don’t have to be numbers (compare “Do you have a dog?” with “How many dogs do you have?”).
Variables may be measured on individuals by a measuring device, such as a scale or an fMRI (which measures blood flow in the brain), or assessed by one or more of the senses: touch, sight, hearing, taste and smell
For example, the variable “Weight” is measured by a scale
Some possible values of “Weight” for a coin may be 4.2 g, or 2.7 g
Some possible values of “Weight” for a baby may be 6.2 lbs. or 7.7 lbs.
The appropriate units and possible values of the variable “Weight” , or of any other variable, depend on the type of individuals in the study
You can feel the weight of an apple, but to get a precise measurement, you would use a scale
The variable “Presence of a noise” is usually assessed by hearing
Possible values of the variable are “There is noise”, or “There is no noise”
The variable “Opinion on the proposed healthcare bill” is assessed by a survey
Possible values could be “Favor”, “Oppose”, “Favor with corrections”
A constant is a special case of a variable: the number of eyes in a cat is almost always = 2
The individual could be a school, and the variable you have to measure is the number of people in it.
Some variables, like the weight of an apple, can be assessed with your own senses as well as measured with a measuring tool, such as a scale.
Values of a variable, or of variables measured or assessed on individuals are collectively called “data” (singular “datum”)
In a class, we collect data for each person: the value of the variable “age” and of the variable “gender” (what is really being asked for is “sex”: male/female/other). The data below are made up, but consistent with actual data over the years.
Recording the data as bivariate (a set of 2 variables, in this case age and sex):
{(24,F), (20,M), (17,M), (24,F), (19,M), (17,F), (24,F), (20,M), (27,F), (34,F), (20,F), (16,M), (20,F), (20,M), (37,M), (28,F), (20,F), (47,F), (24,F), (20,F), (37,M), (44,F), (42,M), (57,F), (52,F), (20,M), (37,M), (28,F), (20,F), (27,F), (24,F), (25,F), (32,M), (42,M), (32,M), (27,F), (22,F), (21,M), (23,M), (28,F), (19,F)} - the data set for the population of people in class
Symbols: (24,F) = (X,Y)
Age: A or X Sex: S or Y
A volunteer picks 5 slips out of a bag (n is the symbol for sample size; here n=5). They pick {24,20,22,42,19} and {F,M,F,M,F}, or {24F,20M,22F,42M,19F}
Replace those slips and have a second volunteer pick another sample of n=5: {24F,57F,22F,47F,19F}
Statistics are numbers calculated using sample data in order to summarize the values of the variables in the sample. Here we have two variables (bivariate data): age, which is numeric, and sex, which is categorical. Suggest one statistic to summarize the values of the variable age. Students decided to calculate the mean age (arithmetic average). Suggest a statistic to summarize the variable sex in the samples. Usually, % of M or % of F is suggested, but professional statisticians prefer to calculate the statistic called “sample proportion”, which in mathematics would be called “ratio of a part to the whole”. It is a percentage, but in decimal form, and should contain at least 2 decimal places.
If you write 0.6, it won’t be respected. Write 0.60 (60%)
Calculating: {24F, 20M, 22F, 42M, 19F} and {24F, 57F, 22F, 47F, 19F}
Mean age: xbar (or abar): (xbar1 = 25.4 years) (xbar2 = 33.8 years)
Add up all the values and divide by the number of values
Proportion of F: phat = (# of F)/n (phat1 = 3/5 = 0.60) (phat2 = 5/5 = 1.00)
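The two calculations above can be sketched in Python. This is only a minimal illustration using the two samples of n=5 listed above; the function names sample_mean and sample_proportion are made up for this sketch and are not part of the course material.

```python
# The two samples of size n=5, recorded as (age, sex) pairs.
sample1 = [(24, "F"), (20, "M"), (22, "F"), (42, "M"), (19, "F")]
sample2 = [(24, "F"), (57, "F"), (22, "F"), (47, "F"), (19, "F")]


def sample_mean(sample):
    """xbar: add up all the ages and divide by the number of ages."""
    ages = [age for age, _ in sample]
    return sum(ages) / len(ages)


def sample_proportion(sample, category="F"):
    """phat: (# of individuals in the category) / n."""
    count = sum(1 for _, sex in sample if sex == category)
    return count / len(sample)


for name, s in [("sample 1", sample1), ("sample 2", sample2)]:
    # The proportion is reported with 2 decimal places (0.60, not 0.6).
    print(f"{name}: xbar = {sample_mean(s):.1f} years, "
          f"phat_F = {sample_proportion(s):.2f}")
```

The two samples come from the same class but give quite different statistics, which is why a single sample's estimate should be treated with some caution.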
The purpose of calculating sample statistics is, among other things, to estimate the corresponding population parameters: xbar (sample mean) estimates mu (μ, population mean), and phat (sample proportion) estimates pi (π, population proportion)
By agreement, statistics are represented by Latin (English) letters and parameters are represented by Greek letters.
Usually, there is no access to the data from an entire population, and no way to check how good the estimates are. In general, if we are honest and our statistical technique is good, the estimates are pretty good. Once in a while, though, we are way off.
Calculate the pop parameters estimated by statistics:
Pop mean age: 282+ pop proportion
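Because the whole class is the population here, the parameters can be calculated directly rather than estimated. The following is a minimal sketch, assuming the (age, sex) pairs exactly as they are listed in the class data set above; the variable names population, mu and pi_f are illustrative only.

```python
# Class data as listed above: one (age, sex) pair per person.
population = [
    (24, "F"), (20, "M"), (17, "M"), (24, "F"), (19, "M"), (17, "F"),
    (24, "F"), (20, "M"), (27, "F"), (34, "F"), (20, "F"), (16, "M"),
    (20, "F"), (20, "M"), (37, "M"), (28, "F"), (20, "F"), (47, "F"),
    (24, "F"), (20, "F"), (37, "M"), (44, "F"), (42, "M"), (57, "F"),
    (52, "F"), (20, "M"), (37, "M"), (28, "F"), (20, "F"), (27, "F"),
    (24, "F"), (25, "F"), (32, "M"), (42, "M"), (32, "M"), (27, "F"),
    (22, "F"), (21, "M"), (23, "M"), (28, "F"), (19, "F"),
]

N = len(population)                                   # population size
mu = sum(age for age, _ in population) / N            # population mean age
pi_f = sum(1 for _, sex in population if sex == "F") / N  # population proportion of F

print(f"N = {N}")
print(f"mu = {mu:.2f} years")
print(f"pi_F = {pi_f:.2f}")
```

Comparing these values with xbar and phat from the two samples of n=5 shows how close (or how far off) each sample estimate was.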
3) Subjects of statistical studies
Population
The set of all individuals of a given kind about whom information is desired is called a population.
Some examples of a population may be:
All taxpayers in the Twin Cities
All bags of potato chips delivered to Cub stores in California
All private colleges in England
All squirrels in Sharon Woods Park in Westerville, Ohio
A population consists of individuals, not variables.
Parameters
Parameters are “characteristics of a population”. In general, they are numbers calculated or estimated (usually estimated) to summarize data in the population
If data pertain to the entire population of interest, parameters can be directly calculated - Like we did for the Age and sex data for the class
In general, the data from the entire population would take too long and cost too much to collect
For that reason, parameters are usually not calculated, but estimated or predicted from the statistics calculated based on data from a sample.
Sample (samples of at least 30 people are usually fairly representative of the population)
The sample is the subset of individuals in a population that is actually studied
It is hoped that the sample’s characteristics are similar to those of the population. Besides “hope”, we will learn good sampling technique
For example, the population of interest may be all the tea in a cup, and the sample a teaspoon of tea from that cup
It is likely that if the tea in the teaspoon is sweet, so is the tea in the cup
Sampling techniques (one technique used in different ways) used to “pick” samples cannot ensure that a given sample is “representative of the population”. However, good sampling technique (which we will study in a little while) makes extremely unrepresentative samples infrequent (5% or less, rare)
The numbers calculated to summarize the data in a sample are called “statistics” (singular is “statistic”), for example xbar for Age, or phat for F
They are so important that they gave their name to the science of statistics
They are used to estimate or to predict the population parameters, or to test hypotheses about population parameters.
In general, the science of statistics concerns itself with populations, rather than with individuals, or with samples. Samples are merely a tool to assess the population.
____________________________________________________________________
4) The Science of Statistics
The two branches of (the science of) statistics are
Descriptive statistics: concerns itself with describing the population by calculating parameters based on the population data
For example, using U.S. census data, descriptive statistics can determine the average number of people in a household, or the average size of a dwelling
Problems with conclusions of descriptive statistics include the incompleteness of population data (for example, undocumented immigrants are not counted) and recording errors
This branch is infrequently used because of the time and expense required, and because it is not always possible to have access to the entire population (for ex: the population of earthworms, or the population of people under a dictatorship)
Inferential statistics: concerns itself with estimating or with predicting the population parameters by using sample statistics
Most statistical analyses are of this kind because data from samples are easier to obtain in a timely and relatively inexpensive manner
Inferring = “making educated guesses” based on incomplete information. We usually infer from sample statistics to population parameters.
5) Important Parameters and Statistics
Measure center or variability (more in Unit 2)
Sample mean (xbar) or sample proportion (phat, also written pi-hat) estimate measures of center for the corresponding population
xbar estimates μ: the statistic sample mean estimates the parameter population mean
phat estimates π: the statistic sample proportion estimates the parameter population proportion
Estimates of variability - in Unit 2
One measure is range = high - low, but the range of a sample is not a reliable estimate of the population range, so range is not usually used in inference.
-------------------------------------------------------------------------------------------------------------------------
6) Research Questions in Statistics
A research question applies to a group of subjects “of one kind” (population)
May concern current properties of the pop., or change in properties over time, or change as a result of a “treatment”
May concern possible causes of current conditions, for ex: what are probable causes of high prevalence of mesothelioma among construction workers?
-----------------------------------------------------------------------------------------------------------
7) Types of Statistical Studies
Experiments: Statistical studies in which a treatment is applied to some subjects. Experiments need an untreated (control) group for comparison, to see if the treatment actually works.
Cross-sectional: study the effects of the treatment right now
Longitudinal (prospective): study the effects of the treatment from now on
Ex: testing medication: usually done more or less in a cross-sectional manner (for Phases 2 and 3 of trials), but in Phase 4 trials the side effects are assessed longitudinally, over long periods of time.
Observational: No treatment is applied by the researchers (for example, no medication is given); they only observe, for example how new programs for students are working
Cross sectional: happening ‘right now’
Longitudinal (prospective): observed over time (months or years)
Case-Control (retrospective): a sample of individuals with a condition of interest is picked, along with a sample of individuals without this condition, and the past histories of the two groups are compared.
Ex: observing the effects of a new reading program:
Meta-analyses
Definition: Meta-analysis is a type of study that combines and analyzes data from multiple individual studies to draw broader conclusions.
Challenges:
Different studies may define variables in different ways.
The populations in each study might not be the same.
The methods used in each study can vary.
8) Variable Roles in Statistical Studies
Variables in experiments are clearly defined (e.g., treatment vs. control).
Variables in observational studies may not be as clear-cut.
Key variable roles:
Explanatory (or independent) variable: The factor that you think might be causing the change in the outcome.
Response (or dependent) variable: The outcome you are trying to explain or predict.
Lurking variable: A variable that is not included in the analysis but affects both the explanatory and response variables.
Confounding variable: A lurking variable that is related to both the explanatory and response variables, making it hard to determine the true relationship between them.
9) Variable Types
Variables are categorized based on the type of data they represent.
Numeric (Quantitative):
Continuous: Can take any value within a range (e.g., height, weight).
Discrete: Takes specific, separate values (e.g., number of children).
Categorical (Qualitative):
Ordinal: Categories that have a meaningful order (e.g., education level: high school, bachelor's, master's).
Nominal: Categories without an inherent order (e.g., gender, eye color).
10) Distribution
Definition: A distribution shows how values of a variable are spread or arranged. It’s essential because it tells you how data behaves and helps in making predictions.
Importance: If you know the distribution, you can make better predictions about future data based on your sample.
11) Concepts Important to Statistical Studies
I) Measurement Quality:
A) Validity: How well a test or measure measures what it is supposed to measure.
Predictive validity: The ability of a measurement to predict future outcomes (e.g., SAT scores predicting college success).
B) Reliability: Consistency of a measure. If you repeat the measurement, will you get the same result?
II) Bias & Confounding:
A) Bias: Systematic error that skews results in a particular direction.
B) Confounding: When a third variable is linked to both the independent and dependent variables, making it hard to separate their effects.
III) Accuracy & Precision:
A) Accuracy: How close a measured value is to the true value.
B) Precision: How consistent results are when measurements are repeated.
12) What Statistics Does
Point estimation: Estimating a parameter (e.g., average) from sample data.
Confidence Intervals (C.I.): Giving a range within which the true parameter likely lies.
Hypothesis testing: Testing a claim or hypothesis about a population parameter.
Models for Estimation & Prediction: Creating models to make predictions and study relationships between variables.
Sampling: Sampling is crucial, and the sample needs to be representative of the population to make valid conclusions.
13) Sampling Methods
Random Sampling: Every individual in the population has an equal chance of being selected.
Examples:
Simple Random Sample (SRS): Every possible sample has the same chance of being selected.
Stratified Sampling: Population is divided into subgroups, and random samples are taken from each subgroup.
Cluster Sampling: Entire groups or clusters are selected at random, and all individuals in the chosen clusters are surveyed.
Systematic Sampling: Every k-th individual is selected from a list of the population.
Non-random Sampling: Not all individuals have an equal chance of being selected.
Convenience Sampling: Samples are chosen based on what’s easiest or most convenient.
Voluntary Response Sampling: Individuals choose to participate, often leading to biased results.
SRS (Simple Random Sample)
Advantages:
Simple to execute.
Best for ensuring that each individual has an equal chance of selection.
Disadvantages:
May not represent smaller subgroups well (especially in a heterogeneous population).
Process:
Identify the population.
List all individuals.
Randomly select from the list.
Example: Drawing names from a hat to select participants for a study.
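The SRS process above can be sketched in Python with the standard library. This is a minimal illustration, assuming a made-up list of 41 names (person_1 ... person_41) stands in for the list of all individuals; the seed is fixed only so the draw can be repeated.

```python
import random

# Step 1-2: identify the population and list all individuals.
# (Hypothetical list; in practice this is the real roster.)
population = [f"person_{i}" for i in range(1, 42)]

# Step 3: randomly select from the list, without replacement.
rng = random.Random(2024)          # fixed seed so the example is repeatable
n = 5                              # sample size
srs = rng.sample(population, n)    # every set of n names is equally likely

print(srs)
```

random.sample plays the role of the hat: each name can be drawn at most once, and no name is favored over another.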
Stratified Sampling
Advantages:
More precise estimates by ensuring representation from each subgroup.
Can lead to a more accurate overall result.
Disadvantages:
More complex to organize.
Requires knowledge of the strata beforehand.
Process:
Divide the population into subgroups (strata).
Randomly sample from each subgroup.
Example: Dividing a population by age groups (e.g., 18-25, 26-35) and randomly sampling from each.
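A minimal Python sketch of the stratified process above, assuming a made-up population of 180 people aged 18 to 35 and the two age groups from the example. The names stratum_of and n_per_stratum are illustrative, and taking the same number from each stratum is only one possible allocation.

```python
import random
from collections import defaultdict

# Hypothetical population: (id, age) pairs, ages cycling through 18..35.
population = [(i, 18 + i % 18) for i in range(180)]


def stratum_of(age):
    """Assign an age to one of the two strata from the example."""
    return "18-25" if age <= 25 else "26-35"


# Step 1: divide the population into subgroups (strata).
strata = defaultdict(list)
for person in population:
    strata[stratum_of(person[1])].append(person)

# Step 2: randomly sample from each subgroup.
rng = random.Random(1)             # fixed seed so the example is repeatable
n_per_stratum = 10
sample = {label: rng.sample(members, n_per_stratum)
          for label, members in strata.items()}

for label in sorted(sample):
    print(label, [age for _, age in sample[label]])
```

Every stratum is guaranteed to appear in the sample, which an SRS of the same total size cannot promise.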
Cluster Sampling
Advantages:
Cost-effective, especially when the population is spread out geographically.
Disadvantages:
Less precise, as individuals within a cluster tend to be similar to one another, and all individuals within a chosen cluster are included.
Process:
Divide the population into clusters.
Randomly select clusters.
Survey everyone in the selected clusters.
Example: U.S. Unemployment Survey where 729 geographic areas (clusters) are surveyed.
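A minimal Python sketch of the cluster process above, assuming 20 made-up geographic areas with 30 households each (these numbers are for illustration only and are not taken from any real survey).

```python
import random

# Step 1: divide the population into clusters (hypothetical areas).
clusters = {
    f"area_{a}": [f"area_{a}_household_{h}" for h in range(30)]
    for a in range(20)
}

# Step 2: randomly select clusters.
rng = random.Random(7)             # fixed seed so the example is repeatable
chosen_areas = rng.sample(sorted(clusters), 4)

# Step 3: survey everyone in the selected clusters.
surveyed = [unit for area in chosen_areas for unit in clusters[area]]

print(chosen_areas)
print(len(surveyed), "households surveyed")
```

Note that the randomness is applied to the areas, not to the households: once an area is chosen, all of its households are included.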
Systematic Sampling
Advantages:
Simple and quick to implement.
Useful when a list of individuals is already available.
Disadvantages:
If there's a pattern in the list, it could introduce bias.
Process:
Select a starting point at random.
Choose every k-th individual after that.
Example: Picking every 10th person from a list of names.
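A minimal Python sketch of the systematic process above, assuming a made-up list of 100 names and k = 10 as in the example.

```python
import random

# Hypothetical list of the population, in the order it is stored.
names = [f"name_{i}" for i in range(1, 101)]

k = 10                             # take every 10th individual
rng = random.Random(3)             # fixed seed so the example is repeatable

# Select a starting point at random among the first k positions...
start = rng.randrange(k)
# ...then choose every k-th individual after that.
sample = names[start::k]

print(start, sample)
```

If the list itself repeats with a period related to k (for example, every 10th name is a team captain), the sample inherits that pattern, which is the bias mentioned above.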
Non-random Sampling
Convenience Sampling:
Advantages: Quick and easy.
Disadvantages: Not representative, leading to biased results.
Example: Surveying people in a mall.
Voluntary Response Sampling:
Advantages: People who are interested are more likely to respond.
Disadvantages: Not representative; those who feel strongly are more likely to participate.
Example: Online surveys where anyone can choose to participate.
14) Graphing
Why Graphs Are Important:
Graphs help to visually display data, making trends and patterns easier to spot.
They help decide the best statistical analysis methods.
Types of Graphs:
Categorical Variables: Bar Graphs.
Numeric Variables:
Dotplots
Stem-and-Leaf Plots
Histograms
Boxplots
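As a small illustration of one of the numeric graphs listed above, a stem-and-leaf plot can be built as plain text. This is a minimal sketch, assuming two-digit ages, with the tens digit as the stem and the ones digit as the leaf; the ages are the ten values from the two samples of n=5 in Unit 1, and the function name stem_and_leaf is illustrative.

```python
from collections import defaultdict

# Ages from the two samples of n=5 drawn in Unit 1.
ages = [24, 20, 22, 42, 19, 24, 57, 22, 47, 19]


def stem_and_leaf(values):
    """Return the lines of a text stem-and-leaf plot (stem = tens, leaf = ones)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


for line in stem_and_leaf(ages):
    print(line)
```

The plot keeps every original value readable while still showing the shape of the distribution, including the empty row for the thirties.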