GCSE Statistics vocabulary

Data and Sampling

Hypothesis
- A statement which may or may not be true.
- A statistical investigation is carried out to test whether it is true.
Population
- All items/people under investigation (e.g., all students in Year 10).
Sample Frame
- A comprehensive list of all members of the population (could also be a database or register).
Random Sample
- Every item in the population has an equal chance of being included in the sample.
Stratified (Random) Sampling
- Population divided into strata (e.g., gender, school year).
- Proportions in the sample match those in the population.
- Members in each stratum chosen randomly.
Judgement Sampling
- Non-random sampling in which the researcher uses their own judgement to choose the items/people they consider representative of the population.
Cluster Sampling
- Non-random sampling using all members from randomly selected clusters (e.g., all pupils in 3 random tutor groups).
Quota Sampling
- Non-random sampling where an interviewer selects a predetermined number of people across different categories (age, gender).
Systematic Sampling
- Non-random sampling starting from a random point and selecting at fixed intervals.
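
For stratified sampling, the number taken from each stratum can be worked out as (stratum size ÷ population size) × sample size. The sketch below illustrates this; the function name and the school numbers are invented for illustration only.

```python
def stratified_sizes(strata_sizes, sample_size):
    """Share a sample between strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    return {
        name: round(size * sample_size / total)
        for name, size in strata_sizes.items()
    }


# Hypothetical school population (numbers invented for illustration).
year_groups = {"Year 7": 200, "Year 8": 150, "Year 9": 150}
sizes = stratified_sizes(year_groups, 50)
print(sizes)  # {'Year 7': 20, 'Year 8': 15, 'Year 9': 15}
```

Rounding can make the total differ slightly from the intended sample size when the proportions do not divide exactly. The members within each stratum would still be chosen at random.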

Cleaning Data

Cleaning may be required to enhance reliability and usability.
Tasks may include:
- Addressing outliers or missing data.
- Standardizing formats/units.
- Removing unnecessary symbols.
Anomaly
- A value that does not fit with the rest of the data (e.g., far from line of best fit).
Outlier
- A value that is suspiciously high or low. Boundaries are identified using either:
  - Mean ± 3 × standard deviation (s.d.).
  - Lower quartile − 1.5 × IQR and upper quartile + 1.5 × IQR.
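
The two outlier rules can be sketched in code as follows. This is a minimal illustration, assuming the population standard deviation (dividing by n) and assuming the quartiles have already been found; the function names and data are made up.

```python
import statistics


def sd_boundaries(data):
    """Outlier boundaries at mean - 3 s.d. and mean + 3 s.d."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (assumption)
    return mean - 3 * sd, mean + 3 * sd


def iqr_boundaries(lower_quartile, upper_quartile):
    """Outlier boundaries at LQ - 1.5 x IQR and UQ + 1.5 x IQR."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr


# Illustrative data: mean 5, standard deviation 2.
low_sd, high_sd = sd_boundaries([2, 4, 4, 4, 5, 5, 7, 9])
print(low_sd, high_sd)  # -1.0 11.0

# Illustrative quartiles: LQ = 20, UQ = 30, so IQR = 10.
low_iqr, high_iqr = iqr_boundaries(20, 30)
print(low_iqr, high_iqr)  # 5.0 45.0
```

Any value outside the chosen pair of boundaries would be treated as an outlier.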

Variables

Variables
- Values being investigated, which can differ across the population.
- Types include quantitative (discrete or continuous) and qualitative.
Multivariate Problems
- Problems in which more than one linked variable is analyzed (e.g., driving test performance by gender and time).
Categorical Data
- Data that can be sorted into named categories (e.g., gender).
Ordinal Data
- Data reflecting a rank order (e.g., race positions).
Distribution
- A set of values of a variable alongside their frequencies or probabilities.

Extraneous Variables

Variables not under investigation that may influence results.
Efforts are made to limit their effect (e.g. time of day while comparing reaction times).

Control Groups & Matched Pairs

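Control Group: a group that does not receive the treatment, used alongside a test group for comparison.
- Example: test group receives a new drug, control group receives a placebo.
Matched Pairs: each member of one group is paired with a similar member of the other group (e.g., same age and gender) to reduce the effects of extraneous variables.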

Question Types

Closed Questions
- Require a choice from stated answers, facilitating easier analysis.
Open Questions
- No restrictions on answers; harder to analyze (generally best to avoid).

Pilot Survey / Pre-test

Trial questionnaire on a small scale to assess:
- Clarity of questions.
- Sufficiency of data collected.
- Response rates.
- Coverage of response options.

Random Response

Method used to estimate answers to sensitive questions.
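A random event (e.g., a coin toss or dice roll) decides whether each person answers the sensitive question truthfully or gives a fixed answer. Individual answers stay private, so responses are likely to be more honest.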
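
One way the estimate might be calculated is sketched below. It assumes a particular design that is not specified above: each person tosses a coin in private, answers "yes" if it lands heads, and answers truthfully if it lands tails. The function name and the survey numbers are hypothetical.

```python
def estimate_true_proportion(yes_count, respondents, p_forced_yes=0.5):
    """Estimate the proportion who would truthfully answer 'yes'.

    Assumed design: with probability p_forced_yes a person must say 'yes';
    otherwise they answer the sensitive question truthfully.
    """
    expected_forced = respondents * p_forced_yes
    truthful_respondents = respondents - expected_forced
    truthful_yes = yes_count - expected_forced
    return truthful_yes / truthful_respondents


# Hypothetical survey: 100 people, 62 'yes' answers.
# About 50 were forced 'yes', leaving 12 truthful 'yes' out of about 50.
estimate = estimate_true_proportion(62, 100)
print(round(estimate, 2))  # 0.24
```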

Reliability & Validity

Reliability:
- Consistency of results upon repeat testing (e.g., small sample may yield unreliable results).
Validity:
- The degree to which a process measures what it intends to measure (e.g., surveying Year 7 about Year 10's food opinions could have poor validity).

Displaying & Comparing Data

Choropleth Map:
- A map shaded by region, with darker shading indicating higher values.
Frequency Density:
- The height of a bar on a histogram: frequency density = frequency ÷ class width, so the area of each bar represents its frequency.
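
Since the area of a bar equals its frequency, the bar height can be found as frequency ÷ class width. A short sketch with invented class boundaries and frequencies:

```python
def frequency_density(frequency, lower_bound, upper_bound):
    """Bar height for a histogram class: frequency divided by class width."""
    return frequency / (upper_bound - lower_bound)


# Illustrative classes: (lower bound, upper bound, frequency).
classes = [(0, 10, 5), (10, 30, 12), (30, 40, 8)]
densities = [frequency_density(f, low, high) for low, high, f in classes]
print(densities)  # [0.5, 0.6, 0.8]
```

The middle class has the largest frequency but not the tallest bar, because it is twice as wide as the others.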

Central Tendency

Refers to averages (mean, median, mode).

Dispersion

Indicates spread of data (range, IQR, standard deviation, variance).
- Variance: square of standard deviation.

Interpercentile Range (IPR) & Interdecile Range

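The difference between two percentiles (or deciles), covering a central part of the distribution (e.g., the 10th to 90th percentile covers the middle 80%, and the 2nd to 8th decile covers the middle 60%).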

Standardised Score

Measures how many standard deviations a value lies above or below the mean: (value − mean) ÷ standard deviation. Used to compare values from different distributions.
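
A standardised score is calculated as (value − mean) ÷ standard deviation. The sketch below compares two made-up exam results; the marks, means and standard deviations are invented for illustration.

```python
def standardised_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Hypothetical results: Maths 70 (mean 60, s.d. 10), English 65 (mean 50, s.d. 20).
maths = standardised_score(70, 60, 10)
english = standardised_score(65, 50, 20)
print(maths, english)  # 1.0 0.75
```

Here the Maths result is the better one relative to its group, even though the English mark is further above its mean in raw marks.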

Scatter Diagrams

Bivariate Data: paired data (e.g., scores in Math and English).

Association

Relationship between two variables.

Correlation

Relationship between two numerical variables.

Explanatory & Response Variables

Explanatory (Independent) goes on the x-axis.
Response (Dependent) goes on the y-axis.

Regression Line / Regression Equation

Line of best fit calculated using statistical software, with equation y = a + bx; the gradient b gives the change in y for each unit increase in x.

Causation

A change in one variable directly causes a change in another.

Spurious Correlation

Correlation between two variables where neither causes the other (e.g., both may increase due to a third factor).

Spearman’s Rank Correlation Coefficient (SRCC)

Measures how closely two sets of rankings agree, on a scale from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).
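
The coefficient is usually calculated with the formula 1 − 6Σd² ÷ (n(n² − 1)), where d is the difference between each pair of ranks and n is the number of pairs. A minimal sketch, assuming there are no tied ranks (the judges' rankings are made up):

```python
def spearman_rank(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient for two lists of ranks (no ties)."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical rankings of five contestants by two judges.
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 4, 3, 5]
coefficient = spearman_rank(judge_1, judge_2)
print(round(coefficient, 2))  # 0.8
```

A value of 0.8 suggests strong agreement between the two judges.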

Pearson’s Product Moment Correlation Coefficient (PMCC)

Measures the strength of linear correlation between two numerical variables, on a scale from -1 to +1; you are not required to calculate it in the exam.

Time Series & Changes Over Time

Trend
- The long-term direction of the data (rising, falling, or level), ignoring short-term fluctuations.

Seasonal Variation

Patterns that repeat regularly (e.g., sales peaks).
Mean Seasonal Effect: the average of the differences between the actual values and the trend line at the same point in each cycle (e.g., every first quarter).
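
The mean seasonal effect can be sketched as the average of (actual value − trend value) for the same season in each cycle. The figures below are invented first-quarter sales and trend-line readings.

```python
def mean_seasonal_effect(actual_values, trend_values):
    """Average difference between actual values and trend values for one season."""
    differences = [a - t for a, t in zip(actual_values, trend_values)]
    return sum(differences) / len(differences)


# Hypothetical first-quarter sales for three years, with trend-line values.
q1_actual = [120, 130, 140]
q1_trend = [110, 118, 129]
effect = mean_seasonal_effect(q1_actual, q1_trend)
print(effect)  # 11.0
```

A prediction for a future first quarter would then be the trend value read from the trend line plus this mean seasonal effect.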

Index Number & Base Year

Compares a value with its value in a base year, as a percentage: index number = (value ÷ base-year value) × 100.
The base year index is defined as 100.
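
An index number is found as (value ÷ base-year value) × 100. A short sketch with made-up prices:

```python
def index_number(value, base_value):
    """Value expressed as a percentage of the base-year value."""
    return value / base_value * 100


# Hypothetical prices: 2.00 in the base year, 2.50 in a later year.
base_index = index_number(2.00, 2.00)
later_index = index_number(2.50, 2.00)
print(base_index, later_index)  # 100.0 125.0
```

An index of 125 means the price has risen by 25% since the base year.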

Weighted Index

A weighted average of the index numbers of different items, where the weights reflect each item's relative importance: Σ(index × weight) ÷ Σ(weights).
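
A weighted index is commonly calculated as Σ(index × weight) ÷ Σ(weights). The sketch below uses invented index numbers and weights for three items.

```python
def weighted_index(index_numbers, weights):
    """Weighted mean of index numbers."""
    total = sum(i * w for i, w in zip(index_numbers, weights))
    return total / sum(weights)


# Hypothetical items: food, transport, clothing (indices and weights invented).
indices = [110, 120, 95]
weights = [5, 3, 2]
overall = weighted_index(indices, weights)
print(overall)  # 110.0
```

The item with the largest weight pulls the overall index most strongly towards its own index number.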

RPI & CPI

RPI (Retail Prices Index): a weighted index of common living costs, used as a measure of inflation.
CPI (Consumer Prices Index): a similar measure of inflation that excludes mortgage payments.

GDP

Gross Domestic Product: the total value of goods and services produced by a country in a given period (e.g., a year); changes in GDP indicate economic growth or recession.

Chain Base Index Number

Compares each value with the previous year's value, so the base changes every year; the geometric mean of the chain base index numbers gives the average annual percentage change.
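
A sketch of chain base index numbers and their geometric mean, using invented prices. It relies on `statistics.geometric_mean`, which is available in Python 3.8 and later.

```python
import statistics


def chain_base_indices(values):
    """Index of each value using the previous value as the base (= 100)."""
    return [
        current * 100 / previous
        for previous, current in zip(values, values[1:])
    ]


# Hypothetical prices in three consecutive years.
prices = [100, 125, 100]
indices = chain_base_indices(prices)
print(indices)  # [125.0, 80.0]

# Geometric mean of the chain base index numbers, then the average yearly change.
average_index = statistics.geometric_mean(indices)
average_change = average_index - 100
print(round(average_change, 2))
```

The price ends where it started, and the geometric mean reflects that with an average change of about 0% per year. The arithmetic mean of 125 and 80 would wrongly suggest an average rise of 2.5%.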

Distribution of Sample Means

Different samples give different estimates of the population mean; the sample means are less spread out than the original values.

Quality Assurance & Control Charts

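Quality assurance monitors a production process by taking regular samples. A control chart plots a sample statistic (e.g., the sample mean) against a target value, with warning lines and action lines showing when the process should be checked or stopped.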

Crude & Standardised Rates

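Crude Rate: the number of births/deaths/unemployed per thousand of the population: (number of events ÷ total population) × 1000.
Standardised Rate: a rate adjusted for differences in age distribution, so that populations can be compared fairly.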
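
The sketch below shows one common way these rates are calculated. It assumes the standardised rate is found by multiplying each age group's crude rate by that group's share of a standard population of 1000, then adding the results. The town data and standard population figures are invented.

```python
def crude_rate(events, population):
    """Number of events per thousand of the population."""
    return events * 1000 / population


def standardised_rate(age_groups):
    """Sum of (crude rate x standard population / 1000) over all age groups.

    Each age group is (events, population, standard population per 1000).
    """
    return sum(
        crude_rate(events, population) * standard_population / 1000
        for events, population, standard_population in age_groups
    )


# Hypothetical town: (deaths, population, standard population) per age group.
town = [(10, 5000, 600), (30, 3000, 300), (100, 2000, 100)]
overall_crude = crude_rate(140, 10000)
standardised = standardised_rate(town)
print(overall_crude, round(standardised, 1))  # 14.0 9.2
```

The standardised rate is lower than the crude rate here because this made-up town has a larger share of older people than the standard population.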

Probability

Sample Space Diagram: Represents all possible outcomes typically in a table format.

Mutually Exclusive Events

Events that cannot occur simultaneously; P(A) + P(B) = P(A or B).

Exhaustive Outcomes

The outcomes cover every possibility; if they are also mutually exclusive, their probabilities sum to 1.

Relative Frequency & Conditional Probability

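Relative Frequency: number of times an outcome occurs ÷ total number of trials; used as an estimate of probability from an experiment.
Conditional Probability: the probability of an event given that another event has already occurred.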

Independence

If events are independent, occurrence of one does not influence the occurrence of the other (P(A) × P(B) = P(A and B)).

Absolute vs. Relative Risk

Absolute Risk: the probability of an event happening (e.g., being late for work).
Relative Risk: how many times more likely an event is for one group than for another: risk for one group ÷ risk for the other group.
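
A short sketch of both measures, assuming relative risk is calculated as the absolute risk for one group divided by the absolute risk for another. The groups and counts are hypothetical.

```python
def absolute_risk(events, group_size):
    """Probability of the event within one group."""
    return events / group_size


def relative_risk(risk_group_a, risk_group_b):
    """How many times more likely the event is in group A than in group B."""
    return risk_group_a / risk_group_b


# Hypothetical data: 12 of 200 drivers and 4 of 200 cyclists were late for work.
risk_drivers = absolute_risk(12, 200)
risk_cyclists = absolute_risk(4, 200)
ratio = relative_risk(risk_drivers, risk_cyclists)
print(risk_drivers, risk_cyclists, round(ratio, 2))  # 0.06 0.02 3.0
```

In this made-up example drivers are three times as likely to be late as cyclists, although both absolute risks are small.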

Binomial Distribution B(n, p)

A binomial distribution requires:
- Two possible outcomes (success/failure).
- A fixed number of trials, n.
- Independent trials.
- The same probability of success, p, in every trial.
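
For a binomial distribution B(n, p), the probability of exactly r successes is given by the standard formula C(n, r) × p^r × (1 − p)^(n − r). A minimal sketch using `math.comb` (Python 3.8 and later), with a coin-toss example chosen for illustration:

```python
from math import comb


def binomial_probability(n, p, r):
    """P(X = r) for X ~ B(n, p): exactly r successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


# Probability of exactly 2 heads in 4 tosses of a fair coin.
p_two_heads = binomial_probability(4, 0.5, 2)
print(p_two_heads)  # 0.375

# The probabilities of 0, 1, 2, 3 and 4 heads together cover every outcome.
total = sum(binomial_probability(4, 0.5, r) for r in range(5))
print(total)  # 1.0
```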

Normal Distribution N(μ, σ²)

Defined by its mean μ and variance σ² (σ is the standard deviation); it follows a symmetrical bell-shaped curve centred on the mean.