GCSE Statistics vocabulary
Page 1: Vocabulary You Must Know
Data and Sampling
Hypothesis
Definition: A statement that may or may not be true.
Distinction: Not to be confused with a question.
Purpose: The starting point of a statistical investigation; data are collected to find evidence that supports or contradicts the hypothesis.
Population
Definition: All items or people being studied.
Examples: All students in Year 10, all fireworks produced by a factory.
Sample Frame
Definition: A list of all members of the population (e.g., a register or database).
Random Sample
Definition: Every item in the population has an equal chance of selection.
Stratified (Random) Sampling
Process: Population divided into strata (e.g., gender, school year).
Matching: Proportions in the sample reflect those in the population.
Selection: Members within each stratum are chosen at random.
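The three steps above can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `stratified_sample`, the year-group data, and the choice to round each stratum's share to the nearest whole number are all assumptions made for the example.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """strata maps each stratum name to the list of its members."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional share of the sample, rounded to a whole number.
        # Rounding can make the total differ slightly from sample_size.
        k = round(sample_size * len(members) / population_size)
        # Members within the stratum are chosen at random.
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population: 120 students in Year 10 and 80 in Year 11.
year_groups = {
    "Year 10": [f"Y10-{i}" for i in range(120)],
    "Year 11": [f"Y11-{i}" for i in range(80)],
}
chosen = stratified_sample(year_groups, sample_size=50, seed=1)
print({name: len(members) for name, members in chosen.items()})
# {'Year 10': 30, 'Year 11': 20}
```

Year 10 makes up 120 out of 200 of the population, so it receives 120/200 of the 50 places, which is 30.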
Judgement Sampling
Definition: Non-random sampling in which the researcher uses their own judgement or specific criteria to choose the sample (e.g., the first 20 items/people).
Cluster Sampling
Definition: Non-random sampling where all members of randomly selected clusters are included (e.g., all students in randomly chosen tutor groups).
Quota Sampling
Definition: Non-random sampling where a predetermined number of individuals from different categories (e.g., age groups or genders) are selected by the interviewer.
Systematic Sampling
Definition: Non-random sampling starting from a random point and selecting at fixed intervals.
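A short Python sketch of the idea, assuming the sample frame is an ordered list and the interval is the frame size divided by the sample size (rounded down). The function name `systematic_sample` and the numbered frame are illustrative only.

```python
import random

def systematic_sample(sample_frame, sample_size, seed=None):
    rng = random.Random(seed)
    # Fixed interval between selected items.
    interval = len(sample_frame) // sample_size
    # Random starting point within the first interval.
    start = rng.randrange(interval)
    return [sample_frame[start + i * interval] for i in range(sample_size)]

# Hypothetical sample frame: items numbered 1 to 100.
frame = list(range(1, 101))
picked = systematic_sample(frame, sample_size=10, seed=3)
print(picked)
```

With 100 items and a sample of 10, every 10th item is taken after the random start.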
Cleaning Data
Importance: Ensures data reliability and usability in statistical analysis.
Methods: Dealing with outliers, missing data, standardizing formats, removing unnecessary symbols.
Anomaly
Definition: A value that does not fit with the rest of the data (e.g., significant deviation from the line of best fit in scatter diagrams).
Outlier
Definition: A suspiciously extreme value.
Boundaries: Calculated using mean ± 3 × standard deviation or 1.5 × interquartile range (IQR) above the upper quartile/below the lower quartile.
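Both boundary rules can be written as small Python functions. This is a sketch under stated assumptions: the standard deviation is taken as the population version (dividing by n), and the quartiles are passed in directly so that no particular quartile method is assumed. The function names and data are illustrative.

```python
import statistics

def outlier_bounds_sd(values):
    """Boundaries at mean - 3 x SD and mean + 3 x SD."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD (divide by n)
    return mean - 3 * sd, mean + 3 * sd

def outlier_bounds_iqr(lower_quartile, upper_quartile):
    """Boundaries at LQ - 1.5 x IQR and UQ + 1.5 x IQR."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr

low, high = outlier_bounds_iqr(lower_quartile=20, upper_quartile=30)
print(low, high)  # 5.0 45.0

sd_low, sd_high = outlier_bounds_sd([2, 4, 4, 4, 5, 5, 7, 9])
print(sd_low, sd_high)
```

Any value below the lower boundary or above the upper boundary would be treated as an outlier.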
Variables
Variables
Definition: Characteristics whose values vary among members of the population; types include discrete, continuous, and qualitative.
Multivariate Problems: Involve more than one linked variable (e.g., analyzing how driving test performance varies by gender and time of day).
Categorical Data: Fits into distinct categories (e.g., gender, voting intention).
Ordinal Data: Indicates rank order (e.g., race positions).
Distribution
Definition: The set of a variable's values along with their frequencies or probabilities.
Extraneous Variables
Definition: Variables not of interest that may impact the results.
Example: Time of day may affect reaction time comparisons.
Control Groups & Matched Pairs
Control Group
Definition: Used alongside a test group for comparison.
Matched Pairs
Definition: Pairs of similar individuals, one placed in each group, so that other variables are controlled (e.g., one member of each pair is in the test group receiving a drug and the other is in the control group receiving a placebo).
Questions
Closed/Open Questions
Closed Questions: Require a choice from the answers provided; easy to analyze.
Open Questions: No answer restrictions, harder to analyze (best avoided).
Pilot Survey / Pre-test
Purpose: Trial a questionnaire on a small scale to identify necessary changes prior to wider use.
Checks include: understanding of questions, sufficiency of return rates, adequacy of response options.
Page 2: Data Display and Reliability
Random Response
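Definition: A technique for sensitive questions in which a random event decides how each respondent answers, so individual answers cannot be identified but the overall proportion can still be estimated; this encourages honest responses.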
Example: Using a die or coin flip to determine if a subject should answer a question.
Reliability & Validity
Reliability
Definition: The degree to which repeating a study yields similar results.
Impact: Small sample sizes may lead to unreliable outcomes.
Validity
Definition: The extent to which a study measures what it intends to measure.
Example: Year 7 opinions on food may lack validity for whole school views.
Displaying & Comparing Data
Choropleth Map
A shaded map in which regions are shaded according to their values; darker shades indicate higher numbers.
Frequency Density
Used for histogram bar heights, so that the area of each bar represents frequency. Frequency density = frequency ÷ class width.
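Since bar area (height × width) must equal frequency, the bar height is the frequency divided by the class width. A minimal Python sketch, with a hypothetical function name and made-up class boundaries:

```python
def frequency_density(frequency, lower_bound, upper_bound):
    """Height of a histogram bar whose area equals the frequency."""
    class_width = upper_bound - lower_bound
    return frequency / class_width

# Hypothetical class 10 to 20 containing 12 values.
print(frequency_density(12, 10, 20))  # 1.2
```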
Central Tendency
Definition: The average, including mean, median, and mode.
Dispersion
Definition: The spread of data, including range, IQR, standard deviation, and variance.
Variance
Definition: The square of standard deviation.
Interpercentile Range (IPR), Interdecile Range
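Definition: Ranges that capture a central portion of the data and ignore extreme values; e.g., the interdecile range runs from the 10th to the 90th percentile (the middle 80% of the data). The IQR is similar but covers only the middle 50%, so it discards more data.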
Standardised Score
Definition: Indicates how many standard deviations a value is from the mean. The formula must be memorized: standardised score = (value - mean) ÷ standard deviation.
Use: Comparison across different distributions; positive scores are above the mean, negative below.
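A brief Python sketch of the comparison described above. The function name and the two sets of marks are invented for illustration.

```python
def standardised_score(value, mean, standard_deviation):
    """Number of standard deviations the value lies from the mean."""
    return (value - mean) / standard_deviation

# Hypothetical marks for one student in two subjects.
maths = standardised_score(value=70, mean=60, standard_deviation=5)
english = standardised_score(value=75, mean=65, standard_deviation=10)
print(maths, english)  # 2.0 1.0
```

Both marks are 10 above the subject mean, but the maths mark is the better performance relative to the rest of the group because its standardised score is higher.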
Scatter Diagrams
Bivariate Data
Definition: Paired data points (e.g., scores in two subjects).
Association and Correlation
Association: Relationship between two variables (e.g., height and gender).
Correlation: Relationship between two numerical variables (e.g., height and hand span).
Explanatory and Response Variables
Explanatory: Independent variable on x-axis, causes change in y.
Response: Dependent variable on y-axis, reacts to changes in x.
Regression
Regression Line/Equation
A calculated line of best fit with an equation of the form y = a + bx; the gradient b is interpreted as the rate of change of y for each unit increase in x.
Causation vs. Correlation
Causation: Change in y due to change in x.
Spurious Correlation: Apparent correlation without a causal relationship; often due to a third variable.
Correlation Coefficients
Spearman’s Rank Correlation Coefficient (SRCC)
Formula provided; measures the strength of agreement between two sets of rankings, so it is useful even when the relationship is non-linear.
Scale: -1 (perfect negative) to +1 (perfect positive), 0 indicates no correlation.
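As a sketch, the standard formula 1 - 6Σd² ÷ (n(n² - 1)), where d is the difference between the two ranks of each item and n is the number of items, can be coded as follows. The function name is illustrative, and the sketch assumes the data have already been ranked with no tied ranks.

```python
def spearman_rank(ranks_x, ranks_y):
    """Spearman's rank correlation coefficient for untied ranks."""
    n = len(ranks_x)
    # Sum of squared differences between paired ranks.
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(spearman_rank([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(spearman_rank([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```

Identical rankings give +1 and exactly reversed rankings give -1, matching the ends of the scale.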
Pearson’s Product Moment Correlation Coefficient (PMCC)
Not expected to calculate; measures the strength of a linear relationship between two variables.
Scale is the same as SRCC; non-linear correlations may show a stronger SRCC.
Page 3: Time Series and Data Distribution
Time Series
Trend
Defined as long-term changes over time (rising, falling, or level).
Seasonal Variation
Recurring patterns at regular intervals (e.g., high sales in the same quarter every year).
Mean Seasonal Effect
Average of the seasonal effects (differences from the trend line) for the same point in each cycle, e.g., every fourth quarter.
Seasonal effect = observed value - trend line value.
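A small Python sketch of the calculation, assuming the observed values and the trend line values have been read off for the same season in several cycles. The function name and the sales figures are invented for illustration.

```python
def mean_seasonal_effect(observed_values, trend_values):
    """Average of (observed value - trend line value) for one season."""
    effects = [obs - trend for obs, trend in zip(observed_values, trend_values)]
    return sum(effects) / len(effects)

# Hypothetical fourth-quarter sales over three years.
observed = [120, 130, 140]
trend = [100, 108, 116]
print(mean_seasonal_effect(observed, trend))  # 22.0
```

The individual seasonal effects are 20, 22 and 24, so the mean seasonal effect is 22. A prediction for a future fourth quarter would add this to the trend line value.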
Index Numbers
Base Year
Reference year for percentage comparisons; base year index is always set to 100.
Weighted Index
Average of index numbers weighted for different items, reflecting specific contributions to a total.
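A weighted index can be sketched as the sum of (index number × weight) divided by the sum of the weights. The function name, index numbers and weights below are hypothetical.

```python
def weighted_index(index_numbers, weights):
    """Weighted mean of index numbers."""
    weighted_total = sum(i * w for i, w in zip(index_numbers, weights))
    return weighted_total / sum(weights)

# Hypothetical: food index 110 with weight 3, transport index 120 with weight 1.
print(weighted_index([110, 120], [3, 1]))  # 112.5
```

The result sits closer to 110 than to 120 because the first item carries three times the weight.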
RPI (Retail Price Index)
A weighted average of the price changes of everyday goods and services, used to measure inflation.
CPI (Consumer Price Index)
Similar to RPI but excludes mortgage payments; also measures inflation.
GDP (Gross Domestic Product)
Total value of all goods and services produced in a country over a period; changes in GDP indicate economic growth or recession.
Chain Base Index Number
Uses the previous year as the base for index calculations.
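In code, each chain base index number is the current value divided by the previous year's value, multiplied by 100. A minimal sketch with an illustrative function name and made-up prices:

```python
def chain_base_indices(values):
    """Index for each year using the previous year as the base (= 100)."""
    return [100 * current / previous
            for previous, current in zip(values, values[1:])]

# Hypothetical prices in three consecutive years.
prices = [200, 220, 231]
print(chain_base_indices(prices))
```

Here the price rises by 10% and then by 5%, so the two chain base index numbers are 110 and 105.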
Geometric Mean
nth root of the product of numbers, useful for finding average percentage changes in index numbers.
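A short Python sketch of the nth root of a product, using `math.prod` (available from Python 3.8). The function name and the multipliers are illustrative; the multipliers 1.10 and 1.05 correspond to hypothetical chain base index numbers of 110 and 105.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

# Average yearly multiplier for rises of 10% and then 5%.
average_multiplier = geometric_mean([1.10, 1.05])
print(round(average_multiplier, 4))
```

The result is a little under 1.075, so the average yearly increase is slightly less than the 7.5% that an ordinary mean of 10% and 5% would suggest.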
Distribution of Sample Means
Definition: Different samples from the same population give different estimates of the mean, but the sample means are less spread out than the original values.
Quality Assurance
Quality Assurance Process
Ensures item production is regulated and under control.
Control Chart
Plots sample results over time, including action and warning lines.
Action Lines
Often defined as mean ± 3 standard deviations; a value beyond these lines indicates that the process should be stopped and reset.
Warning Lines
Typically at mean ± 2 standard deviations; if exceeded, immediate further sampling is required.
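The action and warning lines described above can be sketched in Python. The function names, the returned labels and the target values are assumptions for illustration, and the standard deviation passed in should be whichever one the chart is based on.

```python
def control_lines(target_mean, standard_deviation):
    """Warning lines at mean +/- 2 SD, action lines at mean +/- 3 SD."""
    return {
        "lower_action": target_mean - 3 * standard_deviation,
        "lower_warning": target_mean - 2 * standard_deviation,
        "upper_warning": target_mean + 2 * standard_deviation,
        "upper_action": target_mean + 3 * standard_deviation,
    }

def classify_sample(sample_value, lines):
    """Decide what a plotted sample value means for the process."""
    if sample_value < lines["lower_action"] or sample_value > lines["upper_action"]:
        return "action: stop and reset the process"
    if sample_value < lines["lower_warning"] or sample_value > lines["upper_warning"]:
        return "warning: take another sample immediately"
    return "in control"

# Hypothetical process with target mean 50 and standard deviation 2.
lines = control_lines(target_mean=50, standard_deviation=2)
for value in (50, 55, 57):
    print(value, classify_sample(value, lines))
```

With these figures the warning lines are at 46 and 54 and the action lines are at 44 and 56.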
Crude Rates and Standardization
Crude Rate
Expresses the number of events (e.g., births or deaths) per thousand of the population, e.g., crude death rate = number of deaths ÷ total population × 1000.
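A minimal Python sketch of a crude rate, taken as the number of events per thousand of the population. The function name and the figures for the town are invented for illustration.

```python
def crude_rate(number_of_events, total_population):
    """Events per thousand of the population."""
    return 1000 * number_of_events / total_population

# Hypothetical town: 450 deaths in a population of 50,000.
print(crude_rate(450, 50_000))  # 9.0
```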
Standardised Rates
Adjusts for age distribution to compare rates accurately across different populations; based on a standard population reference.
Page 4: Probability Concepts
Sample Space Diagram
Definition: Illustrates all potential outcomes (e.g., summing two dice).
Mutually Exclusive Events
Definition: Events that cannot occur simultaneously; for mutually exclusive events, P(A or B) = P(A) + P(B).
Exhaustive Events
Definition: A set of events that covers all possible outcomes; if the events are also mutually exclusive, their probabilities sum to one.
Relative Frequency
Definition: The experimental probability of an event, found as the number of times the event occurs ÷ the total number of trials; used to estimate probabilities.
Conditional Probability
Represents the likelihood of event A occurring given that B has occurred; denoted as P(A|B).
Independence
Definition: If two events are independent, P(A|B) = P(A), thus P(A) × P(B) = P(A and B).
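The independence rule can be checked numerically. The sketch below uses P(A|B) = P(A and B) ÷ P(B); the function names, the tolerance used for comparing decimals, and the probabilities are assumptions made for the example.

```python
def conditional_probability(p_a_and_b, p_b):
    """P(A|B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b

def are_independent(p_a, p_b, p_a_and_b, tolerance=1e-9):
    """True when P(A) x P(B) equals P(A and B), within a small tolerance."""
    return abs(p_a * p_b - p_a_and_b) <= tolerance

# Hypothetical probabilities: P(A) = 0.5, P(B) = 0.25, P(A and B) = 0.125.
print(conditional_probability(0.125, 0.25))  # 0.5
print(are_independent(0.5, 0.25, 0.125))     # True
```

Here P(A|B) equals P(A), so knowing that B has occurred does not change the probability of A.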
Risk Concepts
Absolute Risk
The direct chance of an event occurring.
Relative Risk
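Compares the risk of an event in one group with the risk in another group: relative risk = risk in one group ÷ risk in the other group. It is not confined to a 0-1 scale (e.g., a relative risk of 2 means the event is twice as likely in the first group).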
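A brief Python sketch, assuming absolute risk is measured as the number of cases divided by the group size and relative risk is the ratio of two absolute risks. The function names and the counts are hypothetical.

```python
def absolute_risk(cases, group_size):
    """Proportion of the group in which the event occurs."""
    return cases / group_size

def relative_risk(risk_in_group, risk_in_comparison_group):
    """How many times more likely the event is in one group than the other."""
    return risk_in_group / risk_in_comparison_group

# Hypothetical: 30 cases per 1000 in one group, 15 per 1000 in the other.
risk_a = absolute_risk(30, 1000)
risk_b = absolute_risk(15, 1000)
print(relative_risk(risk_a, risk_b))
```

The relative risk here is 2, so the event is twice as likely in the first group, even though both absolute risks are small.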
Binomial Distribution (B(n, p))
Definition: Models the number of successes in n trials, where each trial has only two possible outcomes (e.g., heads or tails) and p is the probability of success on each trial.
Key Characteristics:
Fixed number of trials.
Independent outcomes.
Constant success probability.
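When these conditions hold, the probability of exactly k successes is (number of ways of choosing k trials from n) × p^k × (1 - p)^(n - k). A minimal Python sketch using `math.comb` (Python 3.8 or later); the function name and the coin example are illustrative.

```python
from math import comb

def binomial_probability(n, p, k):
    """P(X = k) when X follows B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 3 tosses of a fair coin.
print(binomial_probability(n=3, p=0.5, k=2))  # 0.375
```

Adding the probabilities for k = 0, 1, 2 and 3 gives 1, since those outcomes are mutually exclusive and exhaustive.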
Normal Distribution (N(μ, σ²))
A bell-shaped curve with mean μ and variance σ², where almost all values fall within 3 standard deviations of the mean.
Key Properties:
Continuous variables.
Symmetry around the mean, where mean = median = mode.
Approximately 99.8% of values fall within 3 standard deviations of the mean; 95% within 2 standard deviations.