Basic Statistics in Psychology - Comprehensive Notes
Unit I: Relevance of Statistics
Lesson 1: Relevance of Statistics
1.1 Learning Objectives:
Understand the role of research in psychology
Identify different types of research
Examine different scales of measurement
Understand how data can be represented
1.2 Introduction:
1.2.1 Psychological Research:
Systematic investigation by psychologists to understand factors influencing individual or group behavior.
Aims to provide deeper understanding, but results aren't always definitive; methodology is crucial.
Methodology is broader, encompassing the entire research process (assumptions, ethics, sociocultural context).
Methods are specific techniques for data collection, analysis, and reporting.
Qualities of researchers:
Persistence.
Tolerance for ambiguity.
Ethical conduct.
Logical and rational thinking.
Openness to changing one's mind.
Planning and organization skills.
Effective communication skills.
1.2.2 Why do Psychologists Carry Out Research?
Exploration: Initial investigation when little is known; addresses "what" questions.
Description: Detailed examination of a phenomenon within its context; addresses "who" and "how" questions.
Explanation: Finding the reasons behind a phenomenon; focuses on the "why" question.
Prediction: Determining when a phenomenon is likely to occur; focuses on the "when" question.
Control: Influencing behavior to improve quality of life; involves constructive changes.
1.2.3 Different Types of Research:
Basic vs. Applied Research
Basic (pure) research: Advances understanding of psychological phenomena, with long-term impact, acting as a benchmark for future studies (e.g., Francis Galton's research on intelligence).
Applied research: Seeks solutions to practical problems, with short-term effect, utilizing established theories (e.g., a manager using tests for hiring).
Laboratory vs. Field Research
Laboratory research: Conducted in a controlled environment (e.g., Bandura’s Bobo doll experiment).
Field research: Conducted in natural settings, using observations and surveys; results are more generalizable.
Quantitative vs. Qualitative Research
Quantitative research: Uses numbers for data collection and analysis (e.g., questionnaires, statistical measures).
Qualitative research: Uses words and images (e.g., case studies, interviews) and is less generalizable due to smaller sample sizes.
Cross-sectional vs. Longitudinal Research
Cross-sectional research: Data collected at one point in time (economical).
Longitudinal research: Data collected over an extended period to track change (costly and time-consuming).
1.3 Relevance of Statistics in Psychological Research:
Statistics involves data collection, presentation, and analysis.
Helps identify trends and patterns, predicting likelihood of events.
Aids in diagnosing patients and improving organizational efficiency.
Two types: Descriptive and Inferential Statistics.
1.4 Descriptive and Inferential Statistics:
Descriptive Statistics
Describes and summarizes data.
Examples: mean, median, mode, standard deviation, range, correlation coefficients.
Inferential Statistics
Draws conclusions about a population from a sample.
Involves hypothesis testing using techniques like t-tests, z-tests, ANOVA (Analysis of Variance), chi-square test.
1.5 Levels of Measurement:
Variables include gender, height, IQ, motivation, etc.
S.S. Stevens (1946) identified four scales:
Nominal Scale: Qualitative categories that are mutually exclusive and exhaustive (e.g., gender, pass/fail).
Ordinal Scale: Categories with ranks that are mutually exclusive and exhaustive (e.g., ratings like 1, 2, 3).
Interval Scale: Equal intervals between points, but zero is arbitrary (e.g., Celsius temperature scale).
Ratio Scale: Includes all interval scale characteristics plus a true zero point (e.g., Kelvin scale for temperature).
1.6 Grouped Frequency Distribution: (Excluded from current syllabus, for personal understanding only)
1.6.1 Frequency Distribution: Organizes data, showing number of observations for each category (e.g., majors selected by students).
1.6.2 Grouped Frequency Distribution: Combines scores into class intervals for data visualization.
Guidelines for creating intervals:
Mutually exclusive intervals.
Continuous intervals (even if frequency is 0).
Highest score at the top.
Equal interval widths.
Convenient interval widths (e.g., 2, 5, 10).
Appropriate number of intervals.
Lower score as a multiple of interval width.
1.6.3 Steps involved in creating a grouped frequency distribution.
Find the highest and lowest scores.
Find the range (highest score - the lowest score).
Divide the range by 10 and by 20 to find the largest and smallest workable interval widths, then select a convenient width between these values.
Find the score at which the lowest interval should begin; it should be a multiple of the interval width.
List the class intervals with the highest values at the top, making continuous intervals of equal width.
Use a tally system to count the number of scores within each interval, then convert the tallies into frequencies.
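The steps above can be sketched in code; the scores and the interval width of 5 are hypothetical examples, not from the lesson:

```python
# Sketch of the grouped-frequency steps in 1.6.3: the lowest interval starts at a
# multiple of the width, intervals are continuous and equal, highest listed first.
def grouped_frequency(scores, width):
    low, high = min(scores), max(scores)
    start = (low // width) * width          # lower score as a multiple of the width
    intervals = []
    top = start
    while top <= high:
        count = sum(top <= s < top + width for s in scores)
        intervals.append(((top, top + width - 1), count))
        top += width
    return list(reversed(intervals))        # highest interval at the top

scores = [12, 15, 17, 17, 21, 24, 25, 29, 31, 34]
for (lo, hi), f in grouped_frequency(scores, 5):
    print(f"{lo}-{hi}: {f}")
```

Each tuple pairs the apparent limits of an interval with its frequency; the tally step is replaced here by a direct count.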
1.6.4 Real Limits vs. Apparent Limits:
Real limits: Extend half a unit below and above the apparent limits (e.g., apparent limit of 70-71 becomes real limits of 69.5-71.5).
Apparent limits: Smallest to largest unit of measurement in the interval.
1.6.5 Relative Frequency Distribution: Shows proportion or percentage of scores within each interval. Useful for comparing groups of unequal sizes.
1.6.6 The Cumulative Frequency: Answers questions about how many scores fall below the upper real limit of each interval.
1.7 Graphical Representation of Data:
Graphs make data interpretation easier and identify patterns.
Includes a horizontal (x-axis/abscissa) and vertical axis (y-axis/ordinate).
Generally, the score or categories are represented on the x-axis while frequency is represented on the y-axis along with their respective labels.
Y-axis is typically ¾ the length of the x-axis.
Types of graphs:
1.7.1 Histogram: Rectangles with vertical sides on real limits; height represents frequency.
1.7.2 Frequency Polygon: Connects dots (midpoints of class intervals); x-axis represents scores, y-axis represents frequency.
1.7.3 Cumulative Percentage Curve: Based on cumulative percentage distribution; represents the percentage of scores below the upper real limit.
Usually S-shaped (ogive).
Helps determine percentile points or ranks.
1.8 Solved Illustrations: Calculation of P50.
Step 1: Find the class interval within which P50 falls.
Step 2: Determine how many cases into that interval P50 lies.
Step 3: Assume that the scores within the class interval are evenly distributed.
Step 4: Add the corresponding number of score units to the lower real limit of the interval to obtain the percentile point.
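The four steps can be sketched as a small interpolation function; the grouped data below are invented for illustration:

```python
# Interpolated percentile point from a grouped distribution (steps 1-4 above).
# Real limits extend half a unit below the apparent lower limit.
def percentile_point(intervals, p):
    """intervals: list of (lower_apparent_limit, width, frequency), lowest first."""
    n = sum(f for _, _, f in intervals)
    target = p / 100 * n                    # number of cases below the percentile
    cum = 0
    for lower, width, f in intervals:
        if cum + f >= target:               # step 1: the interval containing P_p
            real_lower = lower - 0.5
            # steps 2-4: interpolate within the interval, assuming even spread
            return real_lower + (target - cum) / f * width
        cum += f
    raise ValueError("p out of range")

# 20 hypothetical scores: 10-14 (f=3), 15-19 (f=7), 20-24 (f=6), 25-29 (f=4)
data = [(10, 5, 3), (15, 5, 7), (20, 5, 6), (25, 5, 4)]
print(percentile_point(data, 50))
```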
1.9 Summary:
Main points of the lesson.
1.10 Answers to In-Text Questions:
Answers to questions given in the lesson.
1.11 Glossary:
Definitions of key terms.
1.12 Self-Assessment Questions:
Questions for practice.
1.13 References:
List of references.
1.14 Suggested Readings:
List of suggested readings.
Lesson 2: Central Tendency
2.1 Learning Objectives:
Familiarize students with statistical methods in psychological research
Understand descriptive statistics for quantitative research
Teach application in the field of psychology
Understand properties and computation of measures of central tendency and variability
2.2 Introduction:
Statistics is "the science of learning from data, and of measuring, controlling, and communicating uncertainty" (American Statistical Association).
Plays a crucial role in analysing and interpreting data, and in drawing valid conclusions and predictions from data collected for psychological research.
Descriptive Statistics involves summarizing and describing the features of a dataset, calculating measures of central tendency (mean, median, mode) and variability (range, variance, standard deviation).
Inferential Statistics: Involves using sample data to make inferences about a population and test hypotheses. Requires in-depth knowledge of statistical theory and critical analysis of the research question before addressing it.
2.3 Measures of Central Tendency: Definition, Properties, and Comparison
Central tendency (central location measures): statistical values that indicate the central location of a dataset in a distribution.
2.3.1 Mean, Median, Mode:
Mean (Average): Sum of all values divided by the number of values; represented by \mu. The most common measure, used for both continuous and discrete variables, though more often for continuous data. \mu = \frac{\Sigma x}{N}
Properties: Sensitive to outliers, good for normal distributions, useful with equal large/small values, rigidly defined and easy to calculate.
Median: The middle value when data are arranged in ascending order; a positional average. With an odd number of values, the median is the middle value; with an even number, it is the average of the two middle values. Its position in the ordered data is \frac{N+1}{2}.
Properties: Not sensitive to outliers, clear indicator of central tendency with many middle values, easy to calculate.
Mode: The most frequently occurring value; a dataset can have one, multiple, or no mode. A set with two modes is bimodal; with more than two, multimodal. It is useful for datasets with many repetitions, as it gives a clear indication of the most common value.
Properties: Not affected by outliers, useful for datasets with repetition.
2.3.2 Comparison of Mean, Median, and Mode:
Mean is best for normal distributions (sensitive to outliers).
Median is best for skewed distributions (accounts for outliers).
Mode is best for datasets with many repetitions.
Choosing the correct central tendency tool depends on the characteristics of the dataset.
2.4 Calculation of Mode, Median, and Mean from Raw Scores:
Mean (Average): \mu = (\Sigma x) / N
Median: Arrange in ascending order, find the middle value (or average of two middle values if even number of values).
Mode: Most frequent value.
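A minimal sketch of these three computations; the score list is hypothetical, and the mode function returns all modal values to handle multimodal data:

```python
# Mean, median, and mode from raw scores, following section 2.4.
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)                          # arrange in ascending order
    n, mid = len(s), len(s) // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def mode(xs):
    counts = Counter(xs)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]   # may be multimodal

scores = [4, 8, 6, 5, 3, 8, 9]
print(mean(scores), median(scores), mode(scores))
```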
2.5 Effects of Linear Score Transformations on Measures of Central Tendency:
Linear transformations are mathematical operations performed on a set of scores or data values to change their distribution, using the formula Y = aX + b.
Mean: Affected by both the scaling factor a and the shift factor b (new mean = a\mu + b).
Median: Transformed in the same way as the raw scores.
Mode: The modal score is transformed like any other individual score (new mode = a × old mode + b).
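The effect described in 2.5 can be checked numerically; the scores and the factors a = 2, b = 10 are arbitrary choices for illustration:

```python
# Y = aX + b shifts and scales the mean and the median in the same way.
scores = [2, 4, 4, 6, 9]
a, b = 2, 10
transformed = [a * x + b for x in scores]

mean_x = sum(scores) / len(scores)
mean_y = sum(transformed) / len(transformed)
assert mean_y == a * mean_x + b          # the mean follows the transformation exactly

median_x = sorted(scores)[len(scores) // 2]
median_y = sorted(transformed)[len(transformed) // 2]
assert median_y == a * median_x + b      # so does the median
print(mean_y, median_y)
```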
2.6 Measures of Variability: Range; Semi-Interquartile Range; Variance; Standard Deviation (Properties and Comparison)
Variability: the spread, or dispersion, of a group of scores.
Range: Difference between largest and smallest values. \textrm{Range} = HS - LS
Semi-Interquartile Range: Half the difference between the 75th percentile (upper quartile, Q3) and the 25th percentile (lower quartile, Q1) of the data.
Variance ($\sigma^2$): The mean of the squared deviations of scores from the mean. V = SD^2
Calculated from all observations (accurate).
Algebraic calculations can be performed on the variance.
Standard Deviation (SD): The square root of the variance. Introduced by Karl Pearson in 1894, it indicates the average distance of scores from the mean; a smaller SD indicates greater homogeneity of the data.
The best and most widely used measure of variation.
Gives an accurate estimate of the population parameter.
Quartile Deviation: The quartiles divide the data set into four equal parts; the quartile deviation is half the interquartile range.
Q = \frac{Q_3 - Q_1}{2}
2.7 Calculation of Variance and Standard Deviation:
Variance: \textrm{Variance} = \frac{\Sigma (x - \bar{x})^2}{n}
Standard Deviation: \textrm{Standard Deviation} = \sqrt{\textrm{Variance}}
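These two formulas can be sketched directly; the score list is a made-up example:

```python
# Population variance and standard deviation, as defined in 2.7 (divide by n).
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(scores), std_dev(scores))
```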
2.8 Effects of Linear Score Transformations on Measures of Variability:
Linear score transformations are mathematical operations performed on a set of scores or data values to change their distribution.
The most common linear transformation is standardization, which subtracts the mean from each raw score and divides the result by the standard deviation.
For Y = aX + b (read |a| where a is negative):
Range: The range of the transformed scores equals the original range multiplied by the scaling factor a.
Semi-Interquartile Range: Equals the original semi-interquartile range multiplied by a.
Variance: Equals the original variance multiplied by the square of a.
Standard Deviation: Equals the original standard deviation multiplied by a.
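A quick numerical check of the variance and SD rules; the scores and the factors a = 3, b = 7 are arbitrary choices:

```python
# With Y = aX + b, variance scales by a**2 and the SD by |a|; b has no effect.
import math

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

scores = [1, 2, 3, 4, 5]
a, b = 3, 7
transformed = [a * x + b for x in scores]

assert var(transformed) == a ** 2 * var(scores)                     # 18.0 == 9 * 2.0
assert math.isclose(math.sqrt(var(transformed)), abs(a) * math.sqrt(var(scores)))
print(var(scores), var(transformed))
```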
2.9 Summary:
Descriptive Statistics summarizes and describes data features.
Inferential Statistics uses sample data to infer about a population.
Central tendency measures indicate the central location of data.
Mean is the average.
Median is the middle value.
Mode is the most frequent value.
Linear score transformations shift and scale the data.
The mean and the median are transformed in the same way as the raw scores.
2.10 Answers to In-Text Questions:
Answers to questions added in the context.
2.11 Glossary:
Glossary of the terms used in the context.
2.12 Self-Assessment Questions
Self evaluation questions for the context.
2.13 References.
All the resources used in making the context.
2.14 Suggested Readings
Suggested readings are resources from which further information can be extracted.
Unit II: Standard Scores
Lesson 3: Standard Scores
3.1 Learning Objectives:
Understand standard scores
Understand the various types of standard scores
Learn about the applications of standard scores
Learn about the nature and applications of the normal probability distribution
3.2 Introduction to Standard (z) Scores:
Standard score (z-score): the deviation of a score from the mean expressed in standard deviation units; it tells how far the score is from the mean.
A positive z-score means the value is greater than the mean; a negative z-score means it is smaller.
Allows score comparisons between different datasets.
For standard scores, the mean is always zero and the standard deviation is always one.
3.3 Properties of z-scores:
Mean is always 0
Standard deviation is always 1.
A positive z-score (above 0) means the score is larger than the mean.
A negative z-score (below 0) means the score is smaller than the mean.
The shape of the z-score distribution is the same as that of the original distribution.
The sum of the squared z-scores equals the total number of scores (N).
Advantages:
Converting raw scores into z-scores does not alter the characteristics of the distribution.
Helps analyse and compare scores from two different distributions.
Disadvantages:
Uses plus and minus signs, which can be confusing and misleading.
Decimals can create confusion.
3.4 Transforming Raw Scores into z-scores:
Convert raw score into z-score using the formula: Z = \frac{X- \mu}{\sigma}
Where:
X= raw Score,
\mu= mean of the scores
\sigma = standard deviation of the scores
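A small sketch of the transformation, which also verifies the properties listed in 3.3; the raw scores are hypothetical:

```python
# Converting raw scores to z-scores with Z = (X - mu) / sigma.
import math

def z_scores(xs):
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return [(x - mu) / sigma for x in xs]

raw = [50, 60, 70, 80, 90]
zs = z_scores(raw)
print([round(z, 3) for z in zs])
# their mean is 0, their SD is 1, and the sum of the squared z-scores equals N
```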
3.5 Determining Raw Scores from z-scores:
Use the formula: X = \mu + (z \times \sigma)
Where: \mu = the mean, z = the z-score, and \sigma = the standard deviation.
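A one-line application of the formula; the mean and SD used here (100 and 15, an IQ-like scale) are assumed for illustration:

```python
# Recovering a raw score from a z-score with X = mu + z * sigma.
mu, sigma = 100, 15     # hypothetical IQ-like scale
z = 1.5
x = mu + z * sigma
print(x)                # 122.5
```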
3.6 Some Common Standard Scores:
T-score
Stanine-score
STEN-score
Used to overcome limits of the z-score
3.6.1 T-score:
Overcomes the decimals and sign confusion of z-scores, providing a more convenient and reliable score for analysing data.
The T-score was introduced by William A. McCall and named in honour of Thorndike and Terman. It has a mean of 50 and a standard deviation of 10. Formula: T = 10z + 50
3.6.2 Stanine-score:
The stanine (standard nine) score is a scale consisting of nine categories, with a mean of 5 and a standard deviation of 2.
It can be used to convert any score onto a nine-point scale, assigning a number to a test score relative to all the other test scores.
3.6.3 STEN-score:
If a scale is divided into 10 parts or units, it is called a STEN (standard ten) score; since the number of units is even, there is no single middle unit.
Formula:
STEN = z(SD) + M
Where: z = the z-score
M = the mean = 5.5
SD = the standard deviation = 2
Easy to understand and comprehend, with no negative values, but the units are not equally distributed.
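The derived-score formulas above can be sketched together; the clamped rounding used for the stanine below is a common approximation assumed here, not taken from this lesson:

```python
# Derived standard scores computed from a z-score.
def t_score(z):
    return 10 * z + 50                            # mean 50, SD 10

def stanine(z):
    return max(1, min(9, round(2 * z + 5)))       # nine categories, mean 5, SD 2

def sten_score(z):
    return 2 * z + 5.5                            # ten units, mean 5.5, SD 2

z = 1.0
print(t_score(z), stanine(z), sten_score(z))      # 60.0 7 7.5
```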
3.7 Computations of Percentiles and Percentile Ranks from Grouped Data:
Percentile (from "per hundred") describes the position of a candidate in a cumulative percentage frequency distribution.
Two concepts: percentile points and percentile ranks.
A percentile point (commonly referred to simply as a percentile) represents a point on the measurement scale below which a specific percentage of cases fall.
A percentile rank, on the other hand, is the percentage of cases that fall below a given point on the measurement scale. A subscript indicates the rank (e.g., P50).
PR = \frac{B + ((X - L) / I) \times n}{N} \times 100
Where:
PR: Percentile Rank
B: Number of scores falling below the class interval containing X.
X: Raw score for which the percentile rank is to be computed.
L: Lower real limit of the interval in which X lies.
n: Number of cases within the interval that contains the raw score X.
I: Interval size; the number of score units forming the width of the class interval.
N: Total number of scores in the distribution.
Percentile points can be calculated from grouped data in a similar way.
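The PR formula can be applied directly; every number in the example below is hypothetical:

```python
# Direct application of the percentile-rank formula from 3.7.
def percentile_rank(B, X, L, n, I, N):
    """B: cases below the interval, X: raw score, L: lower real limit,
    n: cases in the interval, I: interval width, N: total cases."""
    return (B + (X - L) / I * n) / N * 100

# 40 scores in all; X = 22 lies in an interval with real lower limit 19.5,
# width 5 and frequency 6, with 18 scores falling below that interval.
print(percentile_rank(B=18, X=22, L=19.5, n=6, I=5, N=40))
```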
3.8 Comparison of z-scores and Percentile Ranks:
A percentile rank locates a test score by the percentage of scores falling below it, whereas a z-score locates it by its distance from the mean in standard deviation units.
3.9 The Normal Probability Distribution: Nature, Properties and Applications:
The normal probability distribution, also known as the Gaussian distribution, is a continuous probability distribution that is widely used in statistics, engineering, and the natural sciences.
Its shape is determined by two parameters: the mean (\mu) and the standard deviation (\sigma).
3.10 Normal Curve and Standard Scores:
Finding Areas when the Score is known:
Identify the area for a z-score using a standard normal distribution table: find the row corresponding to the z-score's first digits and the column corresponding to its second decimal digit. The corresponding cell in the table gives the area to the left of the z-score.
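Where no printed table is at hand, the same left-tail area can be computed from the error function (a standard identity, not part of the lesson's table method):

```python
# Area to the left of a z-score under the standard normal curve, via math.erf.
import math

def area_left_of(z):
    """Cumulative area to the left of z under the standard normal curve."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(area_left_of(1.0), 4))   # matches the table value 0.8413
```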
Finding Scores when the Area is known
Using the same standard normal distribution table, find the area in the table and look up the corresponding z-score. The table typically gives the area to the left of the z-score, so to find the z-score for an area to the right of the mean, subtract the area from 1 before looking up the corresponding z-score.
3.11 Summary:
Converting between z-scores and derived scores such as the T-score involves different steps depending on which parameters are known and what needs to be computed.
3.12 Answers to In-Text Questions:
Answers to questions given in the lesson.
3.13 Glossary: Definitions of Key Terms
3.14 Self-Assessment Questions:
3.15 References.
3.16 Suggested Readings:
List of suggested readings.
Unit III: Analysis of Relationship
Lesson 4: Analysis of Relationships
4.1 Learning Objectives:
Explain the meaning of Correlation.
Construct scatter diagrams showing the relationship between pairs of variables.
Understand that correlation does not mean causation.
Understand how to calculate Pearson's product-moment correlation.
4.2 Introduction:
Charles Darwin was a renowned scientist who worked on the concept of the evolution of species through natural selection. His work found that variations among species adapt them to the environment, thus ensuring their survival.
This inspired Francis Galton (Darwin's cousin) to carry out research on individual differences, which led to the formulation of the bivariate distribution.
A bivariate distribution is a distribution that shows the relationship between two variables; it was formalized by Karl Pearson around 1890.
Correlation: a statistical technique used to understand the level of association between variables, which can, for example, support conclusions used to screen students in an admission process. The prerequisite for effective prediction is a high level of correlation between the variables.
4.3 Understanding Correlation:
4.3.1 Scatter Diagram:
A scatter plot represents information about the relationship between two variables: each dot marks the intersection of a pair of scores, e.g., IQ (X) and CGPA (Y), as in Figure 1.
If a straight line can be drawn through the dots, the relationship between the two variables is linear in nature.
If a curved line is needed to connect the dots, the relationship is curvilinear. A scatter diagram is constructed using the following steps:
1. Assign the labels X and Y to the variables.
2. Plot the values by assigning an axis to each variable, with an appropriate scale on each.
3. After plotting the values, mark the intersection for each pair of numbers.
4. Name each axis and add the title of the graph.
4.3.2 Components of Correlation: Direction and Magnitude:
The correlation between variables is calculated with Pearson's correlation coefficient, symbolized by r (r_{XY}), formulated by Pearson in 1896.
The correlation coefficient can take values between -1 and +1.
-1 implies a perfect negative correlation, +1 a perfect positive correlation, and 0 no correlation. In a positive correlation, the scatter runs upward from left to right: as one variable increases, so does the other.
4.3.3 The Meaning of Correlation:
One of the most important aspects of correlation is that a correlation between two or more variables only means that the variables are associated with each other; it does not mean that one causes the other, i.e., correlation does not imply causation.
Comparison can be made using the percentage of cases above or below the median on each variable: for example, of the cases above the median on IQ, 79.3% are above the median on CGPA and 20.7% are below it.
4.4 Calculating Pearson's Correlation:
One of the most widely accepted and used correlation coefficients is Pearson's product-moment correlation coefficient.
It can be calculated using two methods: the standard-score (z-score) formula and the deviation-score formula.
1) Standard-score (z-score) formula: Convert the raw scores of both variables into z-scores, then calculate the sum of the products of each pair of z-scores (the cross-products) and divide by the number of pairs of scores: r = \frac{\Sigma Z_x Z_y}{n}
2) Deviation-score formula: Pearson's correlation coefficient can be computed directly from raw scores by first calculating the sum of the products of the deviation scores of each pair, then dividing the result by the product of the number of pairs of scores and the standard deviations of the two variables.
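A sketch of the z-score method, using population SDs (dividing by n); the data pairs are invented:

```python
# Pearson's r via the z-score (cross-product) method: r = sum(Zx * Zy) / n.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    cross = ((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys))
    return sum(cross) / n

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]        # perfectly linear, so r should come out at 1
print(pearson_r(x, y))
```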
4.5 Correlation and Causation:
Correlation acts as a starting point for further study: the variables need not influence each other directly, but an observed correlation can be an initial lead for finding more information.
Correlation assesses the association between variables and supports prediction; whether there is a cause-and-effect relation is a separate question that depends on the variables involved.
4.6 Effects of Linear Score Transformations:
A linear transformation changes each raw score by adding or subtracting a constant, or by multiplying or dividing by one. This does not influence the value of the correlation coefficient: for example, if every CGPA value is increased by 10, the mean changes but the correlation coefficient stays the same.
4.7 Factors Influencing Correlation:
Sample size: A small sample yields an unstable correlation; a larger sample is more reliable.
Nature of the sample: The correlation between two variables is not fixed; it depends on the sample of data used.
Linearity of the relationship: Pearson's r assumes a linear relationship; the more clearly the dots form a straight line, the stronger the correlation.
Variability of scores: Low variability (a restricted range) tends to reduce the correlation, while high variability tends to increase it.
Discontinuity in scores: Missing ranges of scores can produce misleadingly high or low correlations.
4.8 Spearman Rank Correlation Method:
A statistical measure of the strength and direction of association between two variables. Unlike Pearson's r, Spearman's coefficient is based on the ranks of the data, which makes it useful for ordinal data; it can also handle non-normally distributed data. The formula for the Spearman rank correlation is: \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
Where: d_i = the difference between the ranks of each pair of values; n = the number of paired observations.
The coefficient ranges from -1 to 1, where:
\rho = 1 indicates a perfect positive monotonic relationship (as one variable increases, the other also increases).
\rho = -1 indicates a perfect negative monotonic relationship (as one variable increases, the other decreases).
\rho = 0 indicates no monotonic relationship between the variables.
The Spearman rank test involves the following steps:
1. Ranking the data: The data for both variables are ranked independently from lowest to highest. If there are ties (identical values), the ranks are averaged. Let X and Y be two variables at the ordinal level; each value of X is paired with a value of Y, and each is assigned its rank within its own variable.
2. Calculating the differences in ranks: The difference between the ranks of each paired observation is calculated. These differences represent deviations from a perfect correlation.
3. Squaring the differences: The differences are squared to eliminate negative signs and to give more weight to larger differences.
4. Summing the squared differences: The squared differences are summed across all pairs of observations.
5. Calculating the coefficient: Finally, the Spearman correlation coefficient (denoted \rho) is calculated using the formula above, which incorporates the sum of squared differences and the sample size.
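The five steps can be sketched as follows, assuming no tied ranks (ties would need averaged ranks); the data pairs are invented:

```python
# Spearman's rho via the rank-difference formula: rho = 1 - 6*sum(d^2)/(n(n^2-1)).
def spearman_rho(xs, ys):
    def ranks(vals):                          # step 1: rank each variable
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # steps 2-4
    return 1 - 6 * d2 / (n * (n ** 2 - 1))           # step 5

x = [35, 23, 47, 17, 10]
y = [30, 33, 45, 23, 8]
print(spearman_rho(x, y))
```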
4.9 Linear Regression Analysis / Simple Regression:
Linear regression models the linear relationship between a dependent variable and an independent variable: Y = \beta_0 + \beta_1 X + \epsilon
Where: Y is the dependent variable, X is the independent variable, \beta_0 is the intercept, \beta_1 is the slope, and \epsilon is the error term.
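A minimal least-squares sketch of the model; the estimators shown are the standard ones, and the data are invented:

```python
# Least-squares estimates for Y = b0 + b1*X.
def simple_regression(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))          # slope
    b0 = my - b1 * mx                                # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]        # exactly y = 1 + 2x, so the fit recovers b0=1, b1=2
b0, b1 = simple_regression(x, y)
print(b0, b1)
```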
4.10 Summary:
Correlation measures the degree of relationship between variables; scatter diagrams, Pearson's r, Spearman's \rho, and simple linear regression are the main tools covered in this lesson.
4.11 Answers to In-Text Questions:
Answers to questions given in the lesson.
4.12 Glossary: Definitions of Key Terms
4.13 Self-Assessment Questions:
4.14 References.
4.15 Suggested Readings:
List of suggested readings.