Data
Scores on variables or information expressed as numbers (quantitatively)
Variables
traits that can change values from case to case (gender, age, religious affiliation)
Cases
entities from which data are gathered (people, businesses, cities, countries)
Descriptive statistics
describing variables
inferential statistics
Making inferences from information about variables; generalizing from a sample to a population.
univariate - bivariate - multivariate
summarizing variables (one variable, the relationship between two variables or the relationship between three or more variables)
univariate
summarize or describe the distribution of a single variable. The goal is data reduction - summarize many numbers with one or a few numbers. This can look like percentages, averages, charts and graphs.
Bivariate
describing the strength and direction of the relationship between two variables (measures of association and scatterplots)
multivariate descriptive statistics
relationships between three or more variables —> multiple regression, partial correlation
independent & dependent variables
cause (independent) —> effect (dependent)
independent variable
An independent variable is the variable that is manipulated or changed in an experiment to test its effects on the dependent variable. It is called "independent" because its variation does not depend on other variables in the study.
dependent variable
A dependent variable is the variable that is measured or tested in an experiment. Its value depends on changes in the independent variable
discrete variable
Measured in units that cannot be subdivided, so values must be whole numbers. Example: the number of people in a household.
continuous variable
Measured in a unit that can be subdivided infinitely. Examples: age, years married.
What is an interval-ratio variable?
Interval-ratio variables are numerical variables with meaningful intervals between values and a true zero point. Examples include height, weight, and the number of years a couple has been married.
What is an ordinal variable?
Ordinal variables are categorical variables with a natural order or ranking, but the intervals between categories are not necessarily equal or known. Scores can be ranked from high to low or from more to less, but they represent only position relative to other scores. Examples include levels of satisfaction and socioeconomic status; survey items measuring opinions and attitudes are typically ordinal.
What is a nominal variable?
Nominal variables are categorical variables that do not have any intrinsic order or ranking. Examples include gender, eye color, and types of fruit.
categories must be
1. Mutually exclusive = There must be one and only one category for each case. 2. Exhaustive = A category must exist for every possible score that might be found. 3. Homogeneous = Categories should include cases that are comparable.
When are addition and subtraction completely justified for variables?
When variables are interval-ratio.
Interval-ratio variables are numerical with meaningful intervals between values and a true zero point, allowing for arithmetic operations like addition and subtraction to be completely justified.
How is a proportion calculated?
A proportion is calculated by dividing the number of cases in a category by the total number of cases in all categories.
Formula: Proportion = (Number of cases in a category) / (Total number of cases in all categories)
How do you get the numerical value of a probability from a proportion?
Answer: Do nothing to the value.
Explanation: A proportion is already a numerical value between 0 and 1, representing the probability. Therefore, the proportion itself is the probability.
What is probability?
Probability is a measure of the likelihood that a particular event will occur. It is expressed as a number between 0 and 1, where 0 indicates that the event will not occur and 1 indicates that the event will certainly occur. Probability rests on an imagined procedure and corresponds to words like likely, unlikely, risky, confidence, and doubt.
Formula:
\( P(A) = \frac{F}{N} \)
The probability of an event \( A \) is calculated by dividing the number of favorable outcomes (\( F \)) by the total number of possible outcomes (\( N \)).
What is a proportion?
A proportion is a fraction or percentage that represents the part of the total that falls into a specific category. It is a numerical value between 0 and 1.
Formula:
A proportion is calculated by dividing the number of cases in a specific category by the total number of cases in all categories. It represents the fraction of the total that falls into the specific category.
How do you calculate the number of men in a class if the ratio of men to women is 3.3:1 and there are 100 women in the class?
The number of men is calculated by multiplying the ratio (3.3) by the number of women (100).
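Formula:
\( \text{Number of men} = \text{ratio} \times \text{number of women} \)
Example Calculation:
\( 3.3 \times 100 = 330 \) men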
What should we find in a table for which percentages and cumulative percentages are appropriate and have been properly calculated?
Answer: The cumulative percentages are always equal to, or greater than, the corresponding percentage.
Explanation: Cumulative percentages are calculated by adding the percentage of each category to the sum of the percentages of all previous categories. Therefore, the cumulative percentage for any category will always be equal to or greater than the corresponding percentage for that category alone.
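The calculation described above can be sketched in a few lines of Python. The category names and counts below are invented for illustration and do not come from any table in these notes; the categories are assumed to be ordered, since cumulative percentages only make sense for ordinal or interval-ratio variables.

```python
# Hypothetical frequency table (counts invented for illustration).
frequencies = {"Low": 12, "Medium": 30, "High": 18}

n = sum(frequencies.values())  # total number of cases

# Percentage for each category: (f / n) * 100
percentages = {cat: 100 * f / n for cat, f in frequencies.items()}

# Cumulative percentage: this category's percentage plus all previous ones
cumulative = {}
running_total = 0.0
for cat, pct in percentages.items():
    running_total += pct
    cumulative[cat] = running_total

for cat in frequencies:
    print(f"{cat:<8}{frequencies[cat]:>4}{percentages[cat]:>8.1f}{cumulative[cat]:>8.1f}")
```

In every row the cumulative percentage is at least as large as the row's own percentage, and the last cumulative value comes to 100.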
How does the median define "central tendency"?
The median defines "central tendency" as the value or score of the case having as many cases above it as below it.
The median is the middle value in an ordered dataset, separating it into two equal halves. It represents the value or score of the case that has as many cases above it as below it.
How do you calculate the percent of Blue preferences?
Answer: The percent of Blue preferences is calculated by dividing the number of Blue preferences by the total number of cases and then multiplying by 100.
Formula:
\( \text{Percent Blue} = \frac{f_{\text{Blue}}}{n} \times 100 \)
Example Calculation:
How would you calculate the mean preference from a table of political party preferences?
Answer: Calculating the mean preference from a table of political party preferences would be a mistake.
Explanation: Political party preference is a nominal variable. Its categories are labels rather than numerical scores, and the values in the table are frequencies. Calculating a mean preference would therefore be inappropriate and misleading.
How do you calculate the probability of randomly meeting a person who is politically Red?
Answer: The probability is calculated by dividing the number of Red preferences by the total number of cases.
Formula:
\( P(\text{Red}) = \frac{f_{\text{Red}}}{n} \)
Explanation: The probability of randomly meeting a person who is politically Red is 0.19, or 19%.
What is the mean in statistics?
The mean, also known as the arithmetic mean, is the average of a set of values. It is calculated by summing all the values and dividing by the total number of values.
Formula:
\( \bar{X} = \frac{\sum X_i}{n} \)
The mean is the central value of a dataset. It is sensitive to outliers and skewed data, which can affect its accuracy as a measure of central tendency.
What is the median in statistics?
The median is the middle value of a dataset when it is ordered from smallest to largest. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.
The median is less sensitive to outliers and skewed data than the mean. It represents the value that separates the lowest 50% from the highest 50% of the dataset.
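A small sketch of how an outlier affects the two measures, using Python's built-in `statistics` module. The scores, including the outlier of 100, are hypothetical.

```python
import statistics

scores = [2, 4, 6, 8, 10]      # hypothetical scores
with_outlier = scores + [100]  # same scores plus one extreme value

# Without the outlier, the mean and median agree.
print(statistics.mean(scores), statistics.median(scores))

# With the outlier, the mean is pulled upward far more than the median.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The mean moves from 6 to more than 21, while the median only moves from 6 to 7.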
What is the mode in statistics?
Answer: The mode is the value that appears most frequently in a dataset. A dataset can have no mode, one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). It is most often used with nominal-level variables
Explanation: The mode is useful for identifying the most common value in a dataset. It is not affected by outliers and can be used with both numerical and categorical data.
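As an illustration, `statistics.multimode` returns every value tied for the highest frequency, so it works for unimodal and bimodal data alike. The data below are made up.

```python
import statistics

# Hypothetical nominal data: one mode.
party = ["Red", "Blue", "Blue", "Green", "Blue", "Red"]
print(statistics.multimode(party))

# Hypothetical numerical data: two modes (bimodal).
household_sizes = [1, 1, 2, 2, 3]
print(statistics.multimode(household_sizes))
```

When every value occurs equally often, `multimode` returns all of the values, which corresponds to the "no mode" case described above.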
What does it indicate if a distribution has a mean of 70 and a median of 80?
the distribution has a negative skew.
When the mean is less than the median, the distribution is negatively skewed (left-skewed). This means that the tail on the left side of the distribution is longer or fatter than the right side.
What is a negative skew (left skew)?
A negative skew occurs when the tail on the left side of the distribution is longer or fatter than the right side. The mean is less than the median, and the median is less than the mode.
Key Characteristics:
- Tail on the left
- Mean < Median < Mode
- More frequent higher values
What is a positive skew (right skew)?
A positive skew occurs when the tail on the right side of the distribution is longer or fatter than the left side. The mean is greater than the median, and the median is greater than the mode.
Key Characteristics:
- Tail on the right
- Mean > Median > Mode
- More frequent lower values
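The mean-versus-median comparison used in these cards can be sketched as a small helper. The function name `skew_direction` and the example data are hypothetical, and the comparison is a rule of thumb rather than a formal test of skewness.

```python
import statistics

def skew_direction(values):
    """Compare the mean with the median to suggest the direction of skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean < median:
        return "negative (left) skew"
    if mean > median:
        return "positive (right) skew"
    return "no skew indicated"

# Hypothetical data with a long left tail, then a long right tail.
print(skew_direction([1, 8, 9, 9, 10]))
print(skew_direction([1, 2, 2, 3, 10]))
```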
What happens when the mean is subtracted from each value or score in a distribution of an interval-level variable, and these differences are squared and then summed?
Answer: The sum will depend on the dispersion of the scores.
Explanation: The sum of squared deviations (sum of squares) depends on the dispersion (spread) of the scores in the distribution. The greater the dispersion, the larger the sum of squared deviations.
What does the expression \( \sum (X_i - \bar{X})^2 \) instruct us to do?
The expression instructs us to sum the squared deviations of the scores from the mean and provides a basis out of which we can create a measure of dispersion.
The expression is the sum of squared deviations, which is used to calculate measures of dispersion such as variance and standard deviation.
What does the expression \( \sum (X_i - \bar{X}) \) instruct us about?
Answer: The expression can instruct us about one of the properties of the mean and would be zero even if there was some dispersion in the distribution.
Explanation: The expression represents the sum of the deviations of each score from the mean. This sum is always zero because the positive and negative deviations cancel each other out. This property of the mean is one of its defining characteristics, and it holds true regardless of the dispersion in the distribution.
Why do the deviations from the mean always sum to zero?
The deviations from the mean always sum to zero because the mean is the balance point of the dataset. The positive deviations (scores above the mean) and the negative deviations (scores below the mean) cancel each other out.
Example Calculation:
Scores: 2, 4, 6, 8, 10
Mean: 6
Deviations: -4, -2, 0, 2, 4
Sum of Deviations: (-4) + (-2) + 0 + 2 + 4 = 0
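The same check can be run in Python. The first list repeats the scores from the example above; the second, more spread-out list is invented to show that the sum of deviations stays at zero while the sum of squared deviations grows with dispersion. The helper name `deviation_summary` is hypothetical.

```python
def deviation_summary(scores):
    """Return (sum of deviations, sum of squared deviations) about the mean."""
    mean = sum(scores) / len(scores)
    deviations = [x - mean for x in scores]
    return sum(deviations), sum(d ** 2 for d in deviations)

example_scores = [2, 4, 6, 8, 10]       # scores from the example above
wider_scores = [-10, -2, 6, 14, 22]     # hypothetical, same mean, more spread

print(deviation_summary(example_scores))
print(deviation_summary(wider_scores))
```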
What percentage of the area under a normal curve is contained within ±1 standard deviation of the mean?
Answer: 68%
Explanation: In a normal distribution, approximately 68% of the data falls within ±1 standard deviation of the mean. This is part of the 68-95-99.7 rule, which states that about 68% of the data falls within ±1 standard deviation, 95% within ±2 standard deviations, and 99.7% within ±3 standard deviations.
How do we get 68% from the area between the mean and z being 0.3413?
Answer: The area between the mean and +1 standard deviation (z = 1) is 0.3413 (34.13%). Since the normal distribution is symmetric, the area between the mean and -1 standard deviation is also 0.3413. Adding these areas gives us the total area within ±1 standard deviation:
Total Area = 0.3413 + 0.3413 = 0.6826 (68.26%)
What is the proportion of the area under a normal curve that is below +1 standard deviation above the mean?
The same as the proportion of the area above -1 standard deviation below the mean.
Explanation: The normal distribution is symmetric around the mean. Therefore, the proportion of the area below +1 standard deviation above the mean is the same as the proportion of the area above -1 standard deviation below the mean, which is approximately 84%.
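These areas can be checked without a printed table by using `statistics.NormalDist` (available in Python 3.8 and later). This is only a sketch for checking the figures quoted in these cards; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area between the mean and z = +1 (the 0.3413 found in a Z table)
mean_to_plus_one = std_normal.cdf(1) - std_normal.cdf(0)

# Area within plus or minus 1 standard deviation of the mean
within_one_sd = std_normal.cdf(1) - std_normal.cdf(-1)

# Area below z = +1, and area above z = -1 (equal by symmetry)
below_plus_one = std_normal.cdf(1)
above_minus_one = 1 - std_normal.cdf(-1)

print(round(mean_to_plus_one, 4), round(within_one_sd, 4))
print(round(below_plus_one, 4), round(above_minus_one, 4))
```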
What is the Z scale in statistics?
Answer: The Z scale measures how far each data value is from the mean using a standardized scale. Z-scores re-express raw data on a scale with a mean of 0 and a standard deviation of 1. When the raw scores are normally distributed, their Z-scores follow the standard normal distribution.
Formula:
\( Z = \frac{X - \mu}{\sigma} \)
Explanation:
- Z-score (Z): The number of standard deviations a data point (X) is from the mean (μ).
- X: The raw score or data point.
- μ (mu): The mean of the population.
- σ (sigma): The standard deviation of the population.
Key Points:
- Positive Z-score: Indicates the data point is above the mean.
- Negative Z-score: Indicates the data point is below the mean.
- Z-score of 0: Indicates the data point is exactly at the mean.
- Standard Normal Distribution: The distribution of Z-scores has a mean of 0 and a standard deviation of 1.
Uses of Z-scores:
- Comparing Scores: Allows comparison of scores from different distributions.
- Identifying Outliers: Helps identify data points that are significantly different from the mean.
- Calculating Probabilities: Used to calculate probabilities and percentiles in a normal distribution.
Example Calculation:
If a test score (X) is 85, the mean (μ) is 70, and the standard deviation (σ) is 10, the Z-score is calculated as:
\( Z = \frac{85 - 70}{10} = 1.5 \)
This means the test score is 1.5 standard deviations above the mean.
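The same calculation as a minimal Python sketch. The function name `z_score` is just an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Values from the example above: X = 85, mean = 70, standard deviation = 10
print(z_score(85, 70, 10))
```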
Given that male and female university students study an average of 15 hours per week, with standard deviations of 5.4 and 2.3 respectively, what can we infer about the probabilities of studying different hours per week?
- The probability of a male studying more than 15 hours per week is the same as the probability of a female studying more than 15 hours per week.
- The probability of a male studying only a few hours per week is higher than the probability of a female studying only a few hours per week.
- The probability of a male studying very many hours per week is higher than the probability of a female studying very many hours per week.
The larger standard deviation for males means their study hours are more spread out around the shared mean of 15, so males are more likely than females to fall at either extreme. These comparisons assume both distributions are roughly symmetrical, as in a normal curve.
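A rough numerical check of these three statements, under the assumption (not stated in the question) that both distributions are normal. The cut-offs of 5 hours for "only a few" and 25 hours for "very many" are hypothetical choices made for illustration.

```python
from statistics import NormalDist

# Assumed normal distributions with the means and standard deviations given.
males = NormalDist(mu=15, sigma=5.4)
females = NormalDist(mu=15, sigma=2.3)

FEW, MANY = 5, 25  # hypothetical cut-offs, in hours per week

# (male probability, female probability) for each statement
p_more_than_mean = (1 - males.cdf(15), 1 - females.cdf(15))
p_few = (males.cdf(FEW), females.cdf(FEW))
p_many = (1 - males.cdf(MANY), 1 - females.cdf(MANY))

print(p_more_than_mean)
print(p_few)
print(p_many)
```

Under these assumptions both groups have a 0.5 probability of studying more than 15 hours, while the male probabilities are larger in both tails.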
What is a frequency polygon and how is it constructed?
A frequency polygon is a type of line graph used in statistics to represent the distribution of a dataset. It is similar to a histogram but uses points connected by straight lines instead of bars.
Key Points:
- Definition: A graphical representation of the frequencies of different classes in a dataset.
- Construction:
1. Mark the class intervals on the x-axis.
2. Calculate the midpoints of each class interval.
3. Locate each midpoint on the x-axis.
4. Above each midpoint, plot a point at the height of that class's frequency on the y-axis.
5. Connect the points with straight lines.
- Formula for Midpoint: \( \text{Midpoint} = \frac{\text{lower limit} + \text{upper limit}}{2} \)
- Comparison with Histogram: Uses points connected by lines instead of bars. Useful for comparing multiple datasets.
Example:
If the class intervals are 10-20, 20-30, and 30-40 with frequencies 5, 10, and 15 respectively, the midpoints are 15, 25, and 35. Plot the points (15, 5), (25, 10), and (35, 15) and connect them with straight lines.
Frequency polygons are great for visualizing data and understanding its distribution, especially when comparing multiple datasets.
What does the formula 𝑓𝑖⁄𝑛 tell us if we are given information on the frequency of the values for a variable?
The formula 𝑓𝑖⁄𝑛 tells us how to calculate the proportion.
Explanation:
- 𝑓𝑖: Frequency of a specific value or category.
- 𝑛: Total number of observations or the sum of all frequencies.
- Proportion: The proportion is calculated by dividing the frequency of a specific value by the total number of observations.
Example:
If the frequency of a specific value is 10 and the total number of observations is 50, the proportion is:
\( \frac{f_i}{n} = \frac{10}{50} = 0.20 \)
Calculating the Median
When n is odd:
- Find the middle case by adding 1 to n and then dividing that sum by 2.
- Example: With an n of 7, the median is the score associated with the (7 + 1)/2, or fourth, case.
- Example: If n had been 21, the median would be the score associated with the (21 + 1)/2, or 11th, case.
When n is even:
- There is no single middle case.
- Example: With an n of 8, the ordered distribution of scores is 10, 10, 8, 7, 5, 4, 2, 1.
- Any value between 7 and 5 would technically satisfy the definition of a median.
- The median is defined as the average of the scores of the two middle cases.
- Example: The median would be defined as (7 + 5)/2, or 6.
Summary:
- To identify the two middle cases when n is an even number, divide n by 2 to find the first middle case and then increase that number by 1 to find the second middle case.
- Example: With eight cases, the first middle case would be the fourth case and the second middle case would be the (8/2) + 1, or fifth, case.
- Example: If n had been 142, the first middle case would have been the 71st case and the second middle case would have been the 72nd case.
- Remember that the median is defined as the average of the scores associated with the two middle cases.
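The odd and even rules above can be sketched as one Python function that works with case positions in the same way. The function name `median_by_position` and the odd-n example are hypothetical; the even-n example reuses the eight scores from the card.

```python
def median_by_position(scores):
    """Find the median by locating the middle case(s) in the ordered scores."""
    ordered = sorted(scores, reverse=True)  # high to low, as in the card
    n = len(ordered)
    if n % 2 == 1:
        middle_case = (n + 1) // 2          # e.g. n = 7 gives the 4th case
        return ordered[middle_case - 1]     # lists are 0-indexed
    first_middle = n // 2                   # e.g. n = 8 gives the 4th case
    second_middle = first_middle + 1        # ... and the 5th case
    return (ordered[first_middle - 1] + ordered[second_middle - 1]) / 2

print(median_by_position([10, 10, 8, 7, 5, 4, 2, 1]))  # even n, from the card
print(median_by_position([3, 9, 4, 7, 1, 8, 6]))       # odd n, hypothetical
```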
Normal Distribution
- A normal distribution is a continuous probability distribution that is symmetrical and bell-shaped.
Characteristics:
- Symmetrical: The left and right sides of the distribution are mirror images of each other.
- Mean, Median, and Mode: All three measures of central tendency are equal and located at the center of the distribution.
- Bell-shaped Curve: The highest point on the curve is at the mean, and the curve decreases as you move away from the mean.
- Asymptotic: The tails of the distribution approach, but never touch, the horizontal axis.
- Empirical Rule: Approximately 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
Standard Normal Distribution:
- A special case of the normal distribution with a mean of 0 and a standard deviation of 1.
Standardizing a Distribution
- Standardizing a distribution involves converting the original values of a variable to \( z \)-scores.
- This process allows for comparison across different distributions and simplifies the interpretation of data.
Steps to Standardize a Distribution:
1. Calculate the Mean and Standard Deviation of the Empirical Distribution:
- The mean is the average of all the values.
- The standard deviation measures the spread or dispersion of the values.
2. Convert Each Original Value x to a z -Score:
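- The formula for calculating a z -score is: \( z = \frac{x - \bar{x}}{s} \), where \( \bar{x} \) is the mean and \( s \) is the standard deviation of the distribution.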
- z -scores represent the number of standard deviations a value is from the mean.
3. Interpret the z -Scores:
- A z -score of 0 indicates the value is equal to the mean.
- Positive z -scores indicate values above the mean.
- Negative z -scores indicate values below the mean.
Example:
- Suppose we have an empirical distribution with a mean of 50 and a standard deviation of 10.
- To standardize a value of 60: \( z = \frac{60 - 50}{10} = 1 \)
- The z -score of 1 indicates that 60 is 1 standard deviation above the mean.
Characteristics of the Standardized Distribution:
- The standardized distribution will have a mean of 0 and a standard deviation of 1.
- The shape of the distribution remains the same, but the scale is adjusted.
- Some z -scores will be negative if the original values are below the mean.
Conclusion:
- Standardizing a distribution allows for easier comparison and interpretation of data.
- It is a useful technique in statistical analysis and data normalization.
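As a sketch of the whole procedure, the function below standardizes every value in a small made-up dataset. The data were chosen so that the mean is 50 and the standard deviation is 10, matching the example above. The name `standardize` is hypothetical, and the standard deviation is computed by dividing by n (`statistics.pstdev`).

```python
import statistics

def standardize(values):
    """Convert each value to a z-score using the distribution's own mean and SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # divides by n; must be non-zero
    return [(x - mean) / sd for x in values]

values = [40, 40, 60, 60]  # hypothetical data: mean 50, standard deviation 10
z_scores = standardize(values)

print(z_scores)
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```

The standardized values have a mean of 0 and a standard deviation of 1, and the value 60 becomes a z-score of 1.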
Normal Distribution and Variance
- Variance is a measure of the spread or dispersion of a set of values.
- For a normal distribution, the variance cannot be zero because that would imply that all the values are identical and there is no spread in the data.
- If the variance were zero, the standard deviation (which is the square root of the variance) would also be zero, and the distribution would collapse to a single point, which is not a normal distribution.
- Therefore, for a variable to be accurately described as having a normal distribution, it must have a non-zero variance, ensuring that there is some spread in the data.
Calculating a Mark Based on Standard Deviations
A friend tells you they wrote a statistics exam for which the mean mark was 76 and the variance was 4. The friend would not tell what their mark was, but they did say it was three standard deviations above the mean. Presuming your friend is telling the truth, what was their mark?
Steps to Solve:
1. Identify the Given Information:
- Mean (\( \mu \)) = 76
- Variance (\( \sigma^2 \)) = 4
- The mark is three standard deviations above the mean.
2. Calculate the Standard Deviation:
- The standard deviation (\( \sigma \)) is the square root of the variance: \( \sigma = \sqrt{4} = 2 \).
3. Determine the Number of Standard Deviations Above the Mean:
- The friend’s mark is three standard deviations above the mean.
4. Calculate the Friend’s Mark:
- The formula to calculate the mark is: \( X = \mu + z\sigma \)
- Here, \( z \) is the number of standard deviations above the mean.
- Plugging in the values: \( X = 76 + 3(2) = 82 \)
Answer:
- The friend’s mark is 82.
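The same steps as a short Python sketch. The function name `score_from_z` is an illustrative choice.

```python
import math

def score_from_z(mean, variance, z):
    """Recover a raw score from its z-score, given the mean and the variance."""
    sd = math.sqrt(variance)  # standard deviation is the square root of variance
    return mean + z * sd

# Values from the problem: mean 76, variance 4, three SDs above the mean
print(score_from_z(76, 4, 3))
```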
Index of Qualitative Variation (IQV)
- The only measure of dispersion available for nominal-level variables (can be used for ordinal-level variables as well).
- IQV varies from 0.00 to 1.00. It is zero if all the cases fall in a single category (have the same single value) and 1.00 if the cases are spread evenly throughout all the variable’s categories/values.
- Ratio of the amount of variation actually observed in a distribution of scores to the maximum variation that could exist in that distribution.
IQV Formula:
\( \text{IQV} = \frac{k \left( 1 - \sum \left(\frac{f}{n}\right)^2 \right)}{k - 1} \)
- \( k \) is the number of categories or values the variable could take on.
- \( \sum \left(\frac{f}{n}\right)^2 \) is the sum of the squared proportion of cases appearing in each category or value of the variable.
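A minimal sketch of the IQV in its proportion form, \( k(1 - \sum (f/n)^2) / (k - 1) \), which gives 0 when all cases fall in one category and 1 when cases are spread evenly. The function name `iqv` and all the counts are hypothetical. The list of frequencies should contain one entry for every category the variable could take, including categories with zero cases, so that k is counted correctly.

```python
def iqv(frequencies):
    """Index of qualitative variation from a list of category frequencies."""
    k = len(frequencies)  # number of categories (must be at least 2)
    n = sum(frequencies)  # total number of cases
    sum_squared_proportions = sum((f / n) ** 2 for f in frequencies)
    return k * (1 - sum_squared_proportions) / (k - 1)

print(iqv([20, 0, 0, 0]))  # all cases in one category
print(iqv([5, 5, 5, 5]))   # cases spread evenly
print(iqv([10, 6, 4]))     # something in between
```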
The expression (𝑋𝑖 − 𝑋̅) is used in several statistical contexts. What does this expression report?
It represents the deviation of an individual value \( X_i \) from the mean \( \bar{X} \). It shows how far each data point is from the average value of the dataset.
How does \( \sum (X_i - \bar{X}) \) connect to properties of the mean?
The sum of the deviations from the mean is always zero. This is because the mean is the balance point of the data, and the positive and negative deviations cancel each other out. Mathematically:
\( \sum (X_i - \bar{X}) = 0 \)
How does \( \sum (X_i - \bar{X})^2 \) connect to the properties of the mean?
The sum of the squared deviations from the mean, \( \sum (X_i - \bar{X})^2 \), is used to calculate the variance and standard deviation. This sum is never negative and provides a way to quantify the spread of the data around the mean. The formula for variance is:
\( s^2 = \frac{\sum (X_i - \bar{X})^2}{n} \)
How does \( \sum (X_i - \bar{X})^2 \) connect to dispersion?
The sum of the squared deviations from the mean is directly related to the dispersion or spread of the data. Dispersion measures how much the data points vary from the mean. A larger sum of squared deviations indicates greater dispersion, meaning the data points are more spread out. Conversely, a smaller sum indicates less dispersion, meaning the data points are closer to the mean.
Variance is the average of these squared deviations and provides a measure of the overall spread of the data.
Standard Deviation is the square root of the variance and provides a measure of spread in the same units as the original data.
The sum of squared deviations is a fundamental component in calculating variance and standard deviation, which are key measures of dispersion in a dataset. It quantifies how much the data points deviate from the mean, providing insight into the variability of the data.
How do you turn the “bars” from a histogram into the “broken line” of a frequency polygon, and how do you attach the ends of a broken-line polygon to the horizontal axis?
Identify the midpoints: For each bar in the histogram, identify the midpoint of the interval it represents. The midpoint is calculated as the average of the lower and upper boundaries of the interval.
Plot the midpoints: On the same graph, plot a point at the height of each bar (frequency) at the corresponding midpoint on the horizontal axis.
Connect the points: Draw straight lines to connect the plotted points in sequence. This forms the “broken line” of the frequency polygon.
Attach the ends: Extend the line from the first midpoint to the horizontal axis at the midpoint of the interval before the first interval. Similarly, extend the line from the last midpoint to the horizontal axis at the midpoint of the interval after the last interval
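The four steps can be sketched as a function that returns the points of the polygon, including the two closing points on the horizontal axis. The function name `polygon_points` is hypothetical, and the sketch assumes equal-width intervals. The example reuses the 10-20, 20-30, 30-40 intervals with frequencies 5, 10, and 15 from the earlier frequency polygon card.

```python
def polygon_points(intervals, frequencies):
    """Return (x, y) points for a frequency polygon, closed at both ends."""
    width = intervals[0][1] - intervals[0][0]  # assumes equal-width intervals
    midpoints = [(low + high) / 2 for low, high in intervals]
    points = list(zip(midpoints, frequencies))

    # Attach the ends: drop to zero at the midpoints of the adjacent
    # (empty) intervals just before the first and just after the last.
    start = (midpoints[0] - width, 0)
    end = (midpoints[-1] + width, 0)
    return [start] + points + [end]

intervals = [(10, 20), (20, 30), (30, 40)]
frequencies = [5, 10, 15]
print(polygon_points(intervals, frequencies))
```

Connecting the returned points in order with straight lines gives the "broken line" and anchors it to the axis at 5 and 45.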
Dispersion
variety, diversity, and the amount of variation between scores
Can variance be negative?
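No. Variance is never negative, because each deviation from the mean is squared before the deviations are summed, and a squared number cannot be negative. Squaring also stops the deviations below the mean from cancelling out those above it. Variance is zero only when every score is identical.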
What is the expected value of a random variable?
Another name for its mean: the long-run average value of the random variable.
point of inflection
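The point where a curve changes the direction of its curvature, from bending downward to bending upward or the reverse. On a normal curve, the points of inflection lie one standard deviation above and below the mean.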
cumulative frequency
An optional column in a frequency distribution that displays the number of cases within an interval and all preceding intervals
Frequency polygon
A graphic display device for interval-ratio variables. Intervals are represented by dots placed over the midpoints, the height of each corresponding to the number (or percentage) of cases in the interval. All dots are connected by straight lines, and the line is dropped to the horizontal axis at the midpoint of the adjacent interval at the ends.
how do you get a rate?
Divide the number of actual occurrences by the number of possible occurrences (the reverse order gives the wrong result).
Finding the Index of Qualitative Variation (IQV)
1. Ensure your frequency distribution table includes a valid percentage column.
2. Add a squared percentage column, then square the valid percentage values and enter them into this column.
3. Sum the squared percentages (ΣPct²).
4. Count the number of valid variable response categories (k).
5. Enter the k and ΣPct² values into the IQV formula, and compute the IQV.
Finding the Median
1. Array the scores in order from high score to low score.
2. Count the number of scores to see if n is odd or even.
If n is odd:
3. The median will be the score of the middle case.
4. To find the middle case, add 1 to n and divide by 2.
5. The value you calculated in step 4 is the number of the middle case. The median is the score of this case. For example, if n = 13, the median will be the score of the (13 + 1)/2, or seventh, case.
If n is even:
3. The median is halfway between the scores of the two middle cases.
4. To find the first middle case, divide n by 2.
5. To find the second middle case, increase the value you computed in step 4 by 1.
6. Find the scores of the two middle cases. Add the scores together and divide by 2. The result is the median. For example, if n = 14, the median is the score halfway between the scores of the seventh and eighth cases.
7. If the middle cases have the same score, that score is defined as the median.
Finding the Interquartile Range (Q)
1. Array the scores in order from low to high scores. For example, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24.
2. Find the median of the data (see the One Step at a Time: Finding the Median box). Continuing with the example data in step 1, Md = (12 + 14)/2 = 13.
3. Divide the ordered data into two equal parts at the median. Continuing this example, the lower half = 2, 4, 6, 8, 10, 12 and the upper half = 14, 16, 18, 20, 22, 24.
4. Find the median of the lower half of the data. (This value is equal to Q1.) Continuing this example, Md of 2, 4, 6, 8, 10, 12 = (6 + 8)/2 = 7.
5. Find the median of the upper half of the data. (This value is equal to Q3.) Continuing this example, Md of 14, 16, 18, 20, 22, 24 = (18 + 20)/2 = 19.
6. Subtract Q1 from Q3. (This value is equal to Q.) Finishing the example, Q = 19 − 7 = 12.
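The steps above can be sketched in Python using the same twelve example scores. The function name `interquartile_range` is hypothetical. The steps only cover an even number of scores; for an odd n this sketch leaves the middle score out of both halves, which is an assumption rather than something the steps specify. At least two scores are needed.

```python
import statistics

def interquartile_range(scores):
    """Return (Q1, Q3, Q) using the median-of-halves method."""
    ordered = sorted(scores)             # low to high
    half = len(ordered) // 2
    lower_half = ordered[:half]          # scores below the median
    upper_half = ordered[-half:]         # scores above the median
    q1 = statistics.median(lower_half)   # median of the lower half
    q3 = statistics.median(upper_half)   # median of the upper half
    return q1, q3, q3 - q1

example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]
print(interquartile_range(example))
```

Library routines such as `statistics.quantiles` use interpolation rules that can give slightly different quartiles, so results may not match this hand method exactly.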
Finding the Standard Deviation (s) and the Variance (s²) of a Sample
To Begin
1. Construct a computing table like Table 3.8, with columns for the scores (\( X_i \)), the deviations (\( X_i - \bar{X} \)), and the deviations squared (\( (X_i - \bar{X})^2 \)).
2. List the scores (\( X_i \)) in the left-hand column. Add up the scores and divide by n to find the mean. As a rule, state the mean in two places of accuracy, or two digits to the right of the decimal point.
To Find the Values Needed to Solve Formula 3.7
1. Find the deviations (\( X_i - \bar{X} \)) by subtracting the mean from each score, one at a time. List the deviations in the second column. Generally speaking, you should state the deviations at the same level of accuracy (two places to the right of the decimal point) as the mean.
2. Add up the deviations. The sum must equal zero (within rounding error). If the sum of the deviations does not equal zero, you have made a computational error and need to repeat step 1, perhaps at a higher level of accuracy.
3. Square each deviation and list the result in the third column.
4. Add up the squared deviations listed in the third column.
To Solve Formula 3.7
1. Transfer the sum of the squared deviations column to the numerator in Formula 3.7.
2. Divide the sum of the squared deviations (the numerator of the formula) by n.
3. Take the square root of the quantity you computed in the previous step. This is the standard deviation.
To Find the Variance (s²)
1. Square the value of the standard deviation (s).
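A sketch of the computing table in Python, following the steps as written, including dividing by n. The function name `computing_table` is hypothetical and the scores are the 2, 4, 6, 8, 10 example used elsewhere in these notes. Note that Python's `statistics.stdev` divides by n − 1 instead, so `statistics.pstdev` is the built-in that matches these steps.

```python
import math

def computing_table(scores):
    """Build the three-column table and return its summary values."""
    n = len(scores)
    mean = sum(scores) / n
    # One row per score: (score, deviation, squared deviation)
    rows = [(x, x - mean, (x - mean) ** 2) for x in scores]
    sum_of_deviations = sum(row[1] for row in rows)  # should be zero
    sum_of_squares = sum(row[2] for row in rows)
    s = math.sqrt(sum_of_squares / n)                # standard deviation
    variance = s ** 2                                # variance
    return rows, sum_of_deviations, sum_of_squares, s, variance

rows, sum_dev, sum_sq, s, variance = computing_table([2, 4, 6, 8, 10])
for row in rows:
    print(row)
print(sum_dev, sum_sq, round(s, 2), round(variance, 2))
```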
Index of qualitative variation
The ratio of the amount of variation actually observed in a distribution of nominal- or ordinal-level variable scores to the maximum variation that could exist in that distribution
Finding Z Scores
1. Subtract the value of the mean (\( \bar{X} \)) from the value of the score (\( X_i \)).
2. Divide the quantity found in step 1 by the value of the standard deviation (s). The result is the Z-score equivalent for this raw score.
Finding Areas Between Z Scores
1. Compute the Z scores for both raw scores. Note whether the scores are positive or negative.
2. Find the areas between each score and the mean in column “b.”
3. If the scores are on the same side of the mean: subtract the smaller area from the larger area. Multiply this value by 100 to express it as a percentage.
3. If the scores are on opposite sides of the mean: add the two areas together to get the total area between the scores. Multiply this value by 100 to express it as a percentage.
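A sketch of the same procedure in Python, with `statistics.NormalDist` standing in for the printed table. The helper `area_mean_to_z` plays the role of column "b". Both function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def area_mean_to_z(z):
    """Area between the mean and a Z score (column "b" of a Z table)."""
    return abs(STD_NORMAL.cdf(z) - 0.5)

def area_between(z1, z2):
    """Area under the normal curve between two Z scores."""
    a1, a2 = area_mean_to_z(z1), area_mean_to_z(z2)
    if z1 * z2 > 0:          # same side of the mean: subtract the areas
        return abs(a1 - a2)
    return a1 + a2           # opposite sides of the mean: add the areas

print(round(area_between(-1, 1) * 100, 2))  # opposite sides, as a percentage
print(round(area_between(1, 2) * 100, 2))   # same side, as a percentage
```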
Finding Probabilities
1. Compute the Z score (or scores). Note whether the score is positive or negative.
2. Find the Z score (or scores) in column “a” of the standard normal curve table (Appendix A).
3. Find the area above or below the score (or between the scores) as you would normally (see the three previous One Step at a Time boxes in this chapter) and express the result as a proportion. Typically, probabilities are expressed as a value between 0.00 and 1.00 rounded to two digits beyond the decimal point.