Charts-
Bar charts- (horizontal) best for category comparison (try to use with fewer than 7 categories)
Vertical bar chart- best for RANKINGS, like election votes from most to least
Histogram- Distribution of a continuous variable (time, age, weight). Great for understanding things like household income distribution and population distribution.
Each bar is called a class. The beginning of a class (bar) is called the lower limit and the end is called the upper limit. To convert to a line graph, plot the midpoint of each class.
Pie chart- Bad at showing multiple data points. If you have to use one, never use it for more than 5 data points, and order the slices from biggest to smallest, starting with the biggest in the top right-hand corner, with labels.
Scatterplot- Good for correlation, or how things relate to each other. Shows clustering and trends, and helps spot outliers. This one has 2 variables (x, y).
Dot plot- Only one variable (x only). Each value is shown as a dot above a number line.
Box plot- Has minimum, Q1, Q2, Q3, and maximum. Q2 is the median. The mean is not shown in this plot.
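A rough sketch of the five numbers behind a box plot. The data is made up, and the quartiles use the "median of each half" convention, which is an assumption (other textbooks and calculators may give slightly different Q1 and Q3).

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]        # values below the median position
    upper_half = data[(n + 1) // 2 :]  # values above the median position
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )

# Made-up data for illustration only.
data = [3, 5, 7, 8, 9, 11, 13, 15, 20]
summary = five_number_summary(data)
print(summary)                # (3, 6.0, 9, 14.0, 20)
print(statistics.mean(data))  # the mean is computed separately; the box plot does not show it
```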
Line chart (graph)- Shows how something changes over time. Ex. stock market price or visitors to your website.
Skew right (AKA positive)- Higher on the left, lower on the right. Tail on the right-hand side.
Skew left (AKA negative)- Higher on the right, lower on the left. Tail on the left-hand side.
On a skew right, the mode is the highest point, then the median, and then the mean, from left to right (mean is bigger than median, which is bigger than mode). Opposite order for a skew left.
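A quick check of the mode, median, mean order on a right-skewed dataset. The numbers are made up for illustration. Most values are small and a few large ones form the tail on the right.

```python
import statistics

# Made-up right-skewed data: a cluster of small values and a long right tail.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mode = statistics.mode(data)      # most common value
median = statistics.median(data)  # middle value
mean = statistics.mean(data)      # pulled toward the tail

print(mode, median, mean)         # 2 3.0 4.6
print(mode < median < mean)       # True for this skew-right example
```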
Stem-and-leaf plot- Divides numbers into a stem (the leading digits) and leaves (the last digit).
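A small sketch of how a stem-and-leaf plot can be built, assuming two-digit whole numbers where the tens digit is the stem and the ones digit is the leaf. The scores and the function name `stem_and_leaf` are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Made-up test scores.
scores = [23, 12, 38, 21, 15, 30, 23, 34]
for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 1 | 2 5
# 2 | 1 3 3
# 3 | 0 4 8
```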
Definitions and notes-
Variable- A characteristic or condition that can change or take on different values. Most research begins with a question about the relationship between 2 variables.
Population- The entire group of individuals.
Sample - A subset of the population selected for analysis to draw conclusions about the entire group.
Types of variables-
Discrete variables- (such as class size) consist of indivisible categories.
Continuous variables- (such as time or weight) are infinitely divisible into whatever units a researcher may choose.
Real limits- To define the units for a continuous variable, a researcher must use real limits which are boundaries located exactly half-way between adjacent categories.
Measuring Variables-
The process of measuring a variable requires a set of categories called a scale of measurement and a process that classifies each individual into one category.
4 types of Measurement Scales-
Nominal scale- unordered set of categories identified only by name. Only allows you to determine whether 2 individuals are the same or different.
Ordinal scale- ordered set of categories. Tells you the direction of difference between 2 individuals.
Interval scale- ordered series of equal-sized categories. Identifies the direction and magnitude of a difference. The zero point is located arbitrarily on an interval scale.
Ratio scale- An interval scale where a value of zero indicates none of the variable. Ratio measurements identify the direction and magnitude of differences and allow ratio comparisons of measurements.
Correlational studies- The goal of a correlational study is to determine whether there is a relationship between 2 variables and to describe the relationship.
Sample Space: all the possible outcomes of an experiment.
Coin = 2 (H, T)
Die = 6 (1,2,3,4,5,6)
2 dice = 36 = 6×6
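The sample space sizes above can be checked by listing every outcome. A minimal sketch using only the standard library:

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every ordered pair (first die, second die).
two_dice = list(product(die, repeat=2))

print(len(coin), len(die), len(two_dice))  # 2 6 36
print(two_dice[:3])                        # [(1, 1), (1, 2), (1, 3)]
```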
Probability of any event is between 0 and 1. Impossible (0) -> certain (1)
Combination vs. permutation
Combination: order does not matter. Permutation: order does matter.
nCr where n is total elements, r is how many elements are selected
Equation = n!/((n-r)! r!)
(The number of combinations is never bigger than the number of permutations.)
Permutation = nPr
Equation = n!/(n-r)!
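A quick check of both formulas with n = 5 and r = 2 (values chosen just for illustration). `math.comb` and `math.perm` are in the standard library from Python 3.8 on.

```python
from math import comb, factorial, perm

n, r = 5, 2

n_c_r = factorial(n) // (factorial(n - r) * factorial(r))  # n!/((n-r)! r!)
n_p_r = factorial(n) // factorial(n - r)                   # n!/(n-r)!

print(n_c_r, comb(n, r))  # 10 10
print(n_p_r, perm(n, r))  # 20 20
print(n_c_r <= n_p_r)     # True: combinations never outnumber permutations
```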
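P(AnB) = P(A) * P(B) -> only when A and B are independent (IND)
P(AnB) = P(A|B) P(B) = P(A) P(B|A) -> always true
Independent = whatever happens with one event doesn't affect the other's probability. Conditional = the probability of one event given that the other has already occurred.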
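A small sketch of the multiplication rule on two dice, counting outcomes directly. The events chosen here are made up for illustration. The first pair is independent, the second pair is not.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two dice: 36 equally likely outcomes.
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) = favorable outcomes / all outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def first_is_six(o):
    return o[0] == 6

def second_is_six(o):
    return o[1] == 6

def sum_is_eight(o):
    return o[0] + o[1] == 8

def first_is_even(o):
    return o[0] % 2 == 0

# Independent events: P(AnB) = P(A) * P(B)
p_both_six = prob(lambda o: first_is_six(o) and second_is_six(o))
print(p_both_six == prob(first_is_six) * prob(second_is_six))  # True

# Dependent events: P(AnB) = P(A|B) * P(B), so P(A|B) = P(AnB) / P(B)
p_a_and_b = prob(lambda o: sum_is_eight(o) and first_is_even(o))
p_a_given_b = p_a_and_b / prob(first_is_even)
print(p_a_given_b)         # 1/6
print(prob(sum_is_eight))  # 5/36, so knowing B changed the probability of A
```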
Notes and summary-
Independent (multiplication rule): and/n (upside-down U)
P(AnB) = P(A) P(B) if independent. In general, P(AnB) = P(A|B) P(B) = P(A) P(B|A)
Conditional Events: P(A|B) = the probability of event A given that event B has already occurred
If independent: P(A|B) = P(A) and P(B|A) = P(B)
P(A|B) = P(AnB)/P(B)
Complement: P(A) + P(not A) =1
P(A) = 1-P(NOT A)
General formula (Addition Rule): or/U. If the events are not mutually exclusive, you can't just add the two probabilities. You must subtract the overlap.
P(AUB)= P(A)+P(B) -P(AnB)
Mutually Exclusive (Disjoint) : No intersection
P(AUB) = P(A) + P(B) where P(AnB) =0
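The rules in this summary can be checked on one roll of a die by treating events as sets. This is only a sketch with made-up events. `&` is "and" (n) and `|` is "or" (U).

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) for one fair die roll."""
    return Fraction(len(event), len(die))

even = {2, 4, 6}
over_four = {5, 6}

# Addition rule: P(AUB) = P(A) + P(B) - P(AnB)
p_union = prob(even) + prob(over_four) - prob(even & over_four)
print(p_union, prob(even | over_four))  # 2/3 2/3

# Mutually exclusive: no intersection, so P(AUB) = P(A) + P(B)
one, two = {1}, {2}
print(prob(one & two))                  # 0
print(prob(one) + prob(two))            # 1/3

# Conditional: P(A|B) = P(AnB)/P(B)
p_even_given_over_four = prob(even & over_four) / prob(over_four)
print(p_even_given_over_four)           # 1/2

# Complement: P(A) = 1 - P(not A)
print(1 - prob(die - even))             # 1/2
```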
*1.2
Statistics is the study of collecting, analyzing, interpreting, and presenting data.
A dataset is a structured collection of data. Each dataset consists of:
Individuals (Rows): The entities being studied (e.g., students in a class).
Variables (Columns): The characteristics of individuals that can change.
Variables describe different attributes of individuals in a dataset.
A. Categorical Variables (Qualitative)
These variables represent group labels or categories, not numerical values.
Examples:
Hair color (Black, Brown, Blonde)
Blood type (A, B, AB, O)
Location (New York, California, Texas)
Grade level (Freshman, Sophomore, Junior, Senior)
Key Identifiers:
Can’t be mathematically measured.
Used for grouping or classification.
B. Quantitative Variables (Numerical)
These variables represent measurable quantities and have numerical values.
Examples:
Height (e.g., 5’8”, 170 cm)
Test Score (e.g., 85%, 92%)
Age (e.g., 16 years, 21 years)
Weight (e.g., 150 lbs, 68 kg)
Key Identifiers:
Can be mathematically analyzed (mean, median, range).
Represents actual numerical data.
Categorical Variable Example:
Student Name: A label that does not have numerical value.
Class Level: A category (Freshman, Sophomore, etc.).
Quantitative Variable Example:
GPA: A numerical value that can be measured and analyzed.
Dataset: Collection of data organized in rows (individuals) and columns (variables).
Variable: A characteristic that differs between individuals in a dataset.
Categorical Variable: Represents categories or labels (not numerical).
Quantitative Variable: Represents numerical values that can be measured.
*1.3
Categorical Variables represent data in group labels rather than numerical values.
Data for categorical variables is often summarized using tables.
Frequency Table:
Displays the count (frequency) of observations in each category.
Example: A table showing the number of films in different genres.
Relative Frequency Table:
Displays the proportion (percentage) of observations in each category.
Calculated by dividing each category’s frequency by the total number of observations.
Useful for comparing proportions rather than absolute counts.
From Frequency to Relative Frequency:
Divide each category's frequency by the total count.
From Relative Frequency to Frequency:
Multiply each relative frequency by the total count.
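Both conversions in a short sketch. The film counts are made up for illustration.

```python
# Made-up frequency table: number of films per genre.
counts = {"action": 20, "drama": 50, "comedy": 30}
total = sum(counts.values())

# Frequency -> relative frequency: divide by the total count.
relative = {genre: count / total for genre, count in counts.items()}
print(relative)        # {'action': 0.2, 'drama': 0.5, 'comedy': 0.3}

# Relative frequency -> frequency: multiply by the total count.
back = {genre: round(share * total) for genre, share in relative.items()}
print(back == counts)  # True
```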
Tables help identify trends and make comparisons.
Example: If the relative frequency of premium olive oils is 0.55, it means 55% of sales, more than half, come from premium types.
A distribution lists possible values a variable can take and how often they occur.
✔ Frequency Table – A table showing counts of observations in each category.
✔ Relative Frequency Table – A table showing proportions (percentages) of observations in each category.
✔ Distribution – A list of all possible values of a variable and their occurrences.
Concept | Definition | Example |
---|---|---|
Categorical Variables | Group labels (not numerical). | Film genres, laptop brands. |
Frequency Table | Shows count of observations per category. | Number of action, drama, or comedy films. |
Relative Frequency Table | Shows proportion (%) of observations per category. | Percentage of films in each genre. |
Converting Tables | Can switch between frequency and relative frequency using division/multiplication. | Convert film counts to film percentages. |
Distribution | How data values are spread across different categories. | Olive oil sales distribution among different grades. |
*1.4
Bar charts visually represent categorical data using bars to show frequency (count) or relative frequency (percentage).
They help compare different categories effectively.
Step 1: Choose the axes:
The x-axis represents categories (e.g., days of the week, food items).
The y-axis represents frequency (count) or relative frequency (percentage).
Step 2: Label axes appropriately.
Step 3: Draw bars for each category, ensuring heights correspond to their frequencies.
Step 4: Compare and analyze the data presented.
The tallest bar represents the most frequent category.
If no bar exceeds 50% relative frequency, no single category accounts for a majority of the observations.
Example:
In a survey about volunteer day preferences, Friday was most preferred, while Thursday was least preferred.
In a restaurant order analysis, Fajitas were most popular at Location 1, while Tacos were most popular at Location 2.
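A text-only sketch of the volunteer day bar chart, so it runs without a plotting library. Only the Friday (50) and Monday (30) counts come from these notes. The other three counts are made up so that Thursday is the smallest.

```python
# Friday and Monday counts are from the notes; the rest are hypothetical.
votes = {"Monday": 30, "Tuesday": 15, "Wednesday": 20, "Thursday": 10, "Friday": 50}
total = sum(votes.values())

def text_bar_chart(freqs, scale=2):
    """One line per category; bar length is proportional to the count."""
    return [f"{category:<10}{'#' * (count // scale)} {count}"
            for category, count in freqs.items()]

for line in text_bar_chart(votes):
    print(line)

most = max(votes, key=votes.get)    # tallest bar
least = min(votes, key=votes.get)   # shortest bar
share = votes[most] / total         # relative frequency of the tallest bar
has_majority = any(count / total > 0.5 for count in votes.values())

print(most, least, share, has_majority)  # Friday Thursday 0.4 False
```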
✔ Bar Chart (Graph) – A graph that uses bar height/length to display the frequency (or relative frequency) of categorical variables.
Concept | Definition | Example |
---|---|---|
Bar Chart | A graphical representation of categorical data using bars. | Graph showing employee preferences for a volunteer day. |
Frequency (Count) | The number of times a category appears. | 50 employees chose Friday, 30 chose Monday. |
Relative Frequency (%) | The proportion of total observations in each category. | 40% of employees chose Friday. |
Axis Representation | X-axis → Categories, Y-axis → Frequency/Percentage. | Days of the week on x-axis, count of employees on y-axis. |
Interpretation | Tallest bar = most frequent category. | Friday is the most popular day for the trip. |
Notes:
Z = (x - M)/sigma where x represents the value of interest, M is the mean of the dataset, and sigma denotes the standard deviation.
Z measures position (Standard Normal). Converting to z-scores turns a normal distribution into the standard normal bell curve.
Percentile % = Part Prior / Whole (number of values below yours divided by the total)
If you are in the 90th percentile, it means 90 percent of people are behind you.
If using invNorm, you put in the percentile (area to the left) and you will get the Z score.
Every time you convert to Z, the distribution becomes standard.
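A sketch of the z-score formula and the two directions of lookup, using `statistics.NormalDist` from the standard library. Here `cdf` gives the area to the left of a z-score and `inv_cdf` plays the role of invNorm. The test score numbers are made up.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Z = (x - M) / sigma"""
    return (x - mean) / sd

standard = NormalDist(mu=0, sigma=1)  # standard normal: M = 0, sigma = 1

# Made-up example: a score of 85 when the mean is 70 and sigma is 10.
z = z_score(85, 70, 10)
percentile = standard.cdf(z)          # area to the left of z
print(z, round(percentile, 4))        # 1.5 0.9332

# Going the other way (like invNorm): percentile in, z-score out.
z_90 = standard.inv_cdf(0.90)
print(round(z_90, 2))                 # 1.28
```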
Normal- uses big numbers (the real data values)
bell shaped curve
Area under curve = 1 (100%) and P+Q=1
Symmetrical
P(X <= value): always find the area to the left
To find the area to the right, subtract the left area from 1
Use real numbers (the actual data values)
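A sketch of left and right areas on a normal curve using real data values instead of z-scores. The mean of 100 and sigma of 15 are made-up numbers.

```python
from statistics import NormalDist

# Made-up normal distribution: M = 100, sigma = 15.
scores = NormalDist(mu=100, sigma=15)

left = scores.cdf(115)   # P(X <= 115): area to the left
right = 1 - left         # P(X > 115): subtract the left area from 1
between = scores.cdf(115) - scores.cdf(85)  # area between two values

print(round(left, 4), round(right, 4), round(between, 4))  # 0.8413 0.1587 0.6827
```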
Standard Normal- asks for a z-score, which you have to find with the equation
Standard uses M = 0 and sigma = 1
A standard (z) value is a position relative to the mean rather than a raw number
Central Limit Theorem (CLT)
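The distribution of sample means will be approximately normal if the sample size is large enough. This is true regardless of the distribution of the original population.
Skinnier graph is more accurate (a bigger sample size gives a narrower sampling distribution)
Z = (X - M)/sigma (for a single value)
Z = (X(bar) - M)/(sigma/sqrt(n)) (for a sample mean, n = sample size)
sampling distribution = CLT
The sampling distribution of the mean is approximately normal whether the population is normal or not, as long as the sample size is large enough.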
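A small simulation sketch of the CLT. The population here is an exponential distribution (strongly skewed, with mean 1 and sigma 1), which is an assumption made just for illustration. The sample means should center near 1 with a spread near sigma/sqrt(n).

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed population (exponential, M = 1, sigma = 1)."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

n = 40
means = [sample_mean(n) for _ in range(2000)]

center = statistics.mean(means)   # should be close to the population mean, 1
spread = statistics.stdev(means)  # should be close to sigma / sqrt(n)

print(round(center, 2), round(spread, 2), round(1 / sqrt(n), 2))
```

A bigger n makes sigma/sqrt(n) smaller, which is the "skinnier graph".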
Chapter 6
Confidence Interval (CI): a range built to capture the true population parameter at a particular confidence level (e.g. 95%).
CL = 1 - alpha (alpha is what is outside of the confidence level).
Ex. (1 - 0.95)/2 = 0.025 (because there are two sides, you divide by 2)
(ONLY USE THE POSITIVE CRITICAL VALUE ON THESE)
3 formulas:
1. Mean, sigma is given: x(bar) +- E, where E is the margin of error. Use when you don't know the population mean.
x(bar) +- Zc * sigma/sqrt(n) (use Z-critical)
2. Mean, sigma is unknown: use t-critical (calculator is dist 4). Degrees of freedom = n-1
x(bar) +- tc * Sx/sqrt(n)
3. Proportion %
p(hat) +- Zc * sqrt(p(hat)*q(hat)/n)
(calc use dist 3)
Half of alpha (alpha/2 in each tail) is also known as the reject zone.
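A sketch of the three interval formulas. All sample numbers are made up. The standard library has no t-distribution, so the t-critical value below is typed in from a t-table (95%, df = 24) as an assumption. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Positive Z-critical: leaves alpha/2 in each tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_interval(x_bar, sd, n, critical):
    """x(bar) +- critical * sd / sqrt(n)"""
    margin = critical * sd / sqrt(n)
    return x_bar - margin, x_bar + margin

def proportion_interval(p_hat, n, critical):
    """p(hat) +- critical * sqrt(p(hat) * q(hat) / n)"""
    q_hat = 1 - p_hat
    margin = critical * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

z_c = z_critical(0.95)                    # about 1.96
print(mean_interval(50, 10, 25, z_c))     # sigma given: roughly (46.08, 53.92)

t_c = 2.064                               # t-table value for 95%, df = 25 - 1 = 24
print(mean_interval(50, 10, 25, t_c))     # sigma unknown, using Sx: roughly (45.87, 54.13)

print(proportion_interval(0.6, 100, z_c)) # roughly (0.504, 0.696)
```

The t interval comes out a little wider than the z interval for the same numbers, because t-critical is bigger than Z-critical.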
Z test statistic- (X(bar) - M)/(sigma/sqrt(n)) (n = sample size)
T test statistic- (X(bar) - M)/(Sx/sqrt(n)) (Sx = sample standard deviation)