GCSE Statistics vocabulary

Page 1: Vocabulary You Must Know

Data and Sampling

Hypothesis
- Definition: A statement that may or may not be true.
- Distinction: Not to be confused with a question.
- Purpose: Tested in a statistical investigation by gathering evidence that supports or contradicts it.
Population
- Definition: All items or people being studied.
- Examples: All students in Year 10, all fireworks produced by a factory.
Sample Frame
- Definition: A list of all members of the population (e.g., a register or database).
Random Sample
- Definition: Every item in the population has an equal chance of selection.
Stratified (Random) Sampling
- Process: Population divided into strata (e.g., gender, school year).
- Matching: Proportions in the sample reflect those in the population.
- Selection: Members within each stratum are chosen at random.
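
The three steps above can be sketched in a few lines. This is only an illustration: the function name `stratified_sample`, the school numbers, and the student labels are all made up for the example.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Pick a random sample whose strata proportions mirror the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Share of the sample for this stratum, rounded to a whole number.
        k = round(sample_size * len(members) / population_size)
        # Members inside the stratum are chosen at random.
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical school: 120 students in Year 10 and 80 in Year 11.
strata = {
    "Year 10": [f"Y10-{i}" for i in range(120)],
    "Year 11": [f"Y11-{i}" for i in range(80)],
}
sample = stratified_sample(strata, sample_size=20, seed=1)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)  # {'Year 10': 12, 'Year 11': 8}
```

When the proportions do not give whole numbers, the rounded stratum sizes may not add up to the intended sample size exactly and need adjusting by hand.
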
Judgement Sampling
- Definition: Non-random sampling based on specific criteria (e.g., the first 20 items/people).
Cluster Sampling
- Definition: Non-random sampling where all members of randomly selected clusters are included (e.g., all students in randomly chosen tutor groups).
Quota Sampling
- Definition: Non-random sampling where a predetermined number of individuals from different categories (e.g., age groups or genders) are selected by the interviewer.
Systematic Sampling
- Definition: Non-random sampling starting from a random point and selecting at fixed intervals.
Cleaning Data
- Importance: Ensures data reliability and usability in statistical analysis.
- Methods: Dealing with outliers, missing data, standardizing formats, removing unnecessary symbols.
Anomaly
- Definition: A value that does not fit with the rest of the data (e.g., significant deviation from the line of best fit in scatter diagrams).
Outlier
- Definition: A suspiciously extreme value.
- Boundaries: Calculated using mean ± 3 × standard deviation or 1.5 × interquartile range (IQR) above the upper quartile/below the lower quartile.
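
As a rough illustration of the two boundary rules, the sketch below computes both sets of limits. The function names, the quartiles (15 and 19), and the data list are hypothetical values chosen for the example.

```python
def iqr_bounds(lower_quartile, upper_quartile):
    """Outlier limits: 1.5 x IQR below the LQ and above the UQ."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr

def sd_bounds(mean, sd):
    """Outlier limits: mean minus/plus 3 standard deviations."""
    return mean - 3 * sd, mean + 3 * sd

# Hypothetical data with quartiles assumed to be LQ = 15 and UQ = 19.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]
low, high = iqr_bounds(15, 19)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)  # 9.0 25.0 [45]
```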

Variables

Variables
- Definition: Characteristics that vary among members of the population; they may be quantitative (discrete or continuous) or qualitative.
- Multivariate Problems: Involve more than one linked variable (e.g., analyzing how driving test performance varies by gender and time of day).
- Categorical Data: Fits into distinct categories (e.g., gender, voting intention).
- Ordinal Data: Indicates rank order (e.g., race positions).
Distribution
- Definition: The set of a variable's values along with their frequencies or probabilities.
Extraneous Variables
- Definition: Variables not of interest that may impact the results.
- Example: Time of day may affect reaction time comparisons.

Control Groups & Matched Pairs

Control Group
- Definition: Used alongside a test group for comparison.
Matched Pairs
- Definition: Two similar individuals (one in each group) to control for variables (e.g., test group receiving a drug, control group receiving a placebo).

Questions

Closed/Open Questions
- Closed Questions: Require a choice from provided answers; easy to analyse.
- Open Questions: No restrictions on the answer; harder to analyse (best avoided).
Pilot Survey / Pre-test
- Purpose: Trial a questionnaire on a small scale to identify necessary changes prior to wider use.
- Checks include: understanding of questions, sufficiency of return rates, adequacy of response options.

Page 2: Data Display and Reliability

Random Response

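Definition: A technique for estimating the proportion of truthful answers to a sensitive question; a random device hides each individual's answer, so respondents are more likely to be honest.
Example: Each subject secretly flips a coin or rolls a die, and the result determines whether they answer the sensitive question truthfully or give a fixed response.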
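
One way the estimate can work is sketched below. The design is an assumption made for illustration, not something stated in these notes: each person flips a fair coin in private, says "yes" on heads whatever the truth, and answers truthfully on tails. The function name and the survey numbers are hypothetical.

```python
def estimate_true_yes_rate(yes_count, respondents, forced_yes_chance=0.5):
    """Estimate the share who would truthfully answer 'yes'.

    Assumed design: heads -> say 'yes' regardless; tails -> answer truthfully.
    """
    observed_yes_rate = yes_count / respondents
    # observed = forced + (1 - forced) * true, rearranged for the true rate.
    return (observed_yes_rate - forced_yes_chance) / (1 - forced_yes_chance)

# Hypothetical survey: 130 'yes' responses from 200 people.
estimate = estimate_true_yes_rate(yes_count=130, respondents=200)
print(round(estimate, 2))  # 0.3
```

Under these assumptions, about half of the 200 people were forced to say "yes", so the 30 extra "yes" answers among the other 100 suggest a true rate of roughly 30%.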

Reliability & Validity

Reliability
- Definition: The degree to which repeating a study yields similar results.
- Impact: Small sample sizes may lead to unreliable outcomes.
Validity
- Definition: The degree to which a study measures what it intends to measure.
- Example: Year 7 opinions on food may lack validity for whole school views.

Displaying & Comparing Data

Choropleth Map
- A shaded map representing varying values; darker shades indicate higher numbers.
Frequency Density
- Frequency ÷ class width; used as the bar height in a histogram so that the area of each bar represents frequency.
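
A minimal sketch of the calculation, using a made-up grouped frequency table (the class boundaries and frequencies are hypothetical):

```python
def frequency_density(frequency, lower_bound, upper_bound):
    """Bar height for a histogram class: frequency divided by class width."""
    return frequency / (upper_bound - lower_bound)

# Hypothetical classes: (lower bound, upper bound, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 50, 3)]
densities = [frequency_density(f, lo, hi) for lo, hi, f in classes]
print(densities)  # [0.5, 1.2, 0.8, 0.3]
```

Note that the widest class (20 to 40) has the highest frequency but not the tallest bar, because its frequency is spread over a width of 20.
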
Central Tendency
- Definition: The average, including mean, median, and mode.
Dispersion
- Definition: The spread of data, including range, IQR, standard deviation, and variance.
Variance
- Definition: The square of standard deviation.

Interpercentile Range (IPR), Interdecile Range

Definition: Ranges that capture a central portion of the data; e.g., the interdecile range runs from the 10th to the 90th percentile. It is similar to the IQR but discards less data (the outer 20% rather than the outer 50%).

Standardised Score

Definition: Indicates how many standard deviations a value is from the mean: standardised score = (value - mean) ÷ standard deviation. The formula must be memorised.
Use: Comparison across different distributions; positive scores are above the mean, negative below.
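
To illustrate comparison across distributions, the sketch below standardises two test marks. The subjects, marks, means, and standard deviations are invented for the example.

```python
def standardised_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical marks: maths 72 (mean 60, sd 8), English 65 (mean 55, sd 5).
maths = standardised_score(72, 60, 8)
english = standardised_score(65, 55, 5)
print(maths, english)  # 1.5 2.0
```

Although the maths mark is higher, the English mark is the better performance relative to the rest of the group because its standardised score is larger.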

Scatter Diagrams

Bivariate Data
- Definition: Paired data points (e.g., scores in two subjects).
Association and Correlation
- Association: Relationship between two variables (e.g., height and gender).
- Correlation: Relationship between two numerical variables (e.g., height and hand span).
Explanatory and Response Variables
- Explanatory: Independent variable on x-axis, causes change in y.
- Response: Dependent variable on y-axis, reacts to changes in x.

Regression

Regression Line/Equation
- A calculated line of best fit, interpreted to understand rates of change.
Causation vs. Correlation
- Causation: Change in y due to change in x.
- Spurious Correlation: Apparent correlation without a causal relationship; often due to a third variable.

Correlation Coefficients

Spearman’s Rank Correlation Coefficient (SRCC)
- Formula provided; measures the strength and direction of agreement between two sets of ranks, so it is useful even when the relationship is non-linear.
- Scale: -1 (perfect negative) to +1 (perfect positive), 0 indicates no correlation.
Pearson’s Product Moment Correlation Coefficient (PMCC)
- Not expected to calculate; measures the strength and direction of a linear relationship.
- Scale is the same as SRCC; non-linear correlations may show a stronger SRCC.
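
The notes do not reproduce the SRCC formula; the sketch below assumes the standard version, 1 - 6Σd² ÷ (n(n² - 1)), where d is the difference between each pair of ranks and there are no tied ranks. The judges' rankings are hypothetical.

```python
def spearman_rank(ranks_x, ranks_y):
    """Spearman's rank correlation coefficient (assumes no tied ranks)."""
    n = len(ranks_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical rankings of five contestants by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 3, 5, 4]
srcc = spearman_rank(judge_a, judge_b)
print(round(srcc, 2))  # 0.8
```

A value of 0.8 suggests the judges largely agree; identical rankings would give +1 and exactly reversed rankings would give -1.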

Page 3: Time Series and Data Distribution

Time Series

Trend
- Defined as long-term changes over time (rising, falling, or level).
Seasonal Variation
- Recurring patterns at regular intervals (e.g., high sales in the fourth quarter of every year).
Mean Seasonal Effect
- The average of the seasonal effects for the same season (e.g., every fourth quarter) across the period studied.
- Seasonal effect = observed value - trend line value.
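
A small worked sketch of the calculation, using invented fourth-quarter sales figures and trend line readings for three years:

```python
from statistics import mean

# Hypothetical fourth-quarter figures for three consecutive years.
observed_q4 = [130, 142, 151]   # actual sales
trend_q4 = [110, 120, 130]      # values read from the trend line

# Seasonal effect = observed value - trend line value.
effects = [o - t for o, t in zip(observed_q4, trend_q4)]
mean_seasonal_effect = mean(effects)
print(effects, mean_seasonal_effect)  # [20, 22, 21] 21
```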

Index Numbers

Base Year
- Reference year for percentage comparisons; base year index is always set to 100.
Weighted Index
- Average of index numbers weighted for different items, reflecting specific contributions to a total.
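
As an illustration, the sketch below computes simple index numbers against a base year and then combines them into a weighted index. The item names, prices, and weights are hypothetical.

```python
def index_number(value, base_value):
    """Value as a percentage of the base year value (base year = 100)."""
    return 100 * value / base_value

def weighted_index(indices, weights):
    """Weighted mean of several index numbers."""
    return sum(i * w for i, w in zip(indices, weights)) / sum(weights)

# Hypothetical prices in pence: base year price, then current price.
bread = index_number(66, 60)     # 110.0
fuel = index_number(150, 120)    # 125.0

# Assumed weights: bread counts three times as much as fuel.
combined = weighted_index([bread, fuel], [3, 1])
print(bread, fuel, combined)  # 110.0 125.0 113.75
```
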
RPI (Retail Price Index)
- A weighted index of the prices of everyday goods and services, used to measure inflation.
CPI (Consumer Price Index)
- Similar to RPI but excludes mortgage payments; also measures inflation.
GDP (Gross Domestic Product)
- Total value of all produced goods/services, reflecting economic growth or recession status.
Chain Base Index Number
- Uses the previous year as the base for index calculations.
Geometric Mean
- nth root of the product of numbers, useful for finding average percentage changes in index numbers.
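
The two ideas above can be combined in a short sketch: chain base index numbers are calculated year on year, and their geometric mean gives the average change. The price series is invented for the example.

```python
from math import prod

def chain_base_indices(values):
    """Index each value against the previous value (previous year = 100)."""
    return [100 * current / previous
            for previous, current in zip(values, values[1:])]

def geometric_mean(numbers):
    """nth root of the product of n numbers."""
    return prod(numbers) ** (1 / len(numbers))

# Hypothetical prices over three years: up 25%, then down 20%.
prices = [200, 250, 200]
indices = chain_base_indices(prices)
average_index = geometric_mean(indices)
print(indices, round(average_index, 2))  # [125.0, 80.0] 100.0
```

The geometric mean of 100 correctly shows no overall change, whereas the ordinary mean of 125 and 80 (102.5) would wrongly suggest an average rise.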

Distribution of Sample Means

Definition: Different samples from the same population give different estimates of the mean, but the sample means are less spread out than the original values.

Quality Assurance

Quality Assurance Process
- Ensures item production is regulated and under control.
Control Chart
- Plots sample results over time, including action and warning lines.
Action Lines
- Often set at mean ± 3 standard deviations; a value beyond these lines means the process should be stopped and reset.
Warning Lines
- Typically at mean ± 2 standard deviations; if exceeded, immediate further sampling is required.
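
A minimal sketch of how a plotted sample result might be checked against the warning and action lines. The function name, the target mean of 500, and the standard deviation of 2 are assumptions for the example; `sd` stands for whichever standard deviation was used to draw the lines.

```python
def check_sample(sample_mean, target_mean, sd):
    """Classify a sample result against warning (2 sd) and action (3 sd) lines."""
    distance = abs(sample_mean - target_mean)
    if distance > 3 * sd:
        return "action"    # beyond an action line: stop and reset the process
    if distance > 2 * sd:
        return "warning"   # beyond a warning line: take another sample now
    return "ok"

# Hypothetical process: target mean 500 g, standard deviation 2 g.
for result in (503, 505, 507):
    print(result, check_sample(result, target_mean=500, sd=2))
# 503 ok
# 505 warning
# 507 action
```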

Crude Rates and Standardization

Crude Rate
- The number of events per thousand of the population (e.g., crude death rate = number of deaths ÷ total population × 1000).
Standardised Rates
- Adjusts for age distribution to compare rates accurately across different populations; based on a standard population reference.
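
The sketch below shows a crude rate and one common textbook form of a standardised rate: for each age group, (crude rate ÷ 1000) × standard population, summed over the groups. That formula, the town's figures, and a standard population totalling 1000 are all assumptions made for illustration.

```python
def crude_rate(events, population):
    """Number of events per thousand of the population."""
    return 1000 * events / population

def standardised_rate(age_groups):
    """Sum of (crude rate / 1000) x standard population over the age groups."""
    total = 0.0
    for events, population, standard_population in age_groups:
        total += crude_rate(events, population) * standard_population / 1000
    return total

# Hypothetical town: (deaths, population, standard population) per age group.
town = [(4, 2000, 300), (10, 5000, 500), (60, 3000, 200)]

deaths = sum(group[0] for group in town)
people = sum(group[1] for group in town)
print(crude_rate(deaths, people))            # 7.4
print(round(standardised_rate(town), 1))     # 5.6
```

The standardised rate is lower than the crude rate here because the hypothetical town has a larger share of older residents than the standard population.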

Page 4: Probability Concepts

Sample Space Diagram

Definition: Illustrates all potential outcomes (e.g., summing two dice).

Mutually Exclusive Events

Definition: Events that cannot occur simultaneously, so P(A or B) = P(A) + P(B); their probabilities sum to one only if the events are also exhaustive.

Exhaustive Events

Definition: All possible outcomes included; total probability equals one.

Relative Frequency

Definition: The number of times an outcome occurs ÷ the total number of trials; this experimental probability is used to estimate the true probability.
Conditional Probability
- Represents the likelihood of event A occurring given that B has occurred; denoted as P(A|B).

Independence

Definition: If two events are independent, P(A|B) = P(A), thus P(A) × P(B) = P(A and B).
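
As a check on the definitions of conditional probability and independence, the sketch below lists the sample space for two fair dice and tests one pair of events. The choice of events (A: the total is 7, B: the first die shows 6) is just an example.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair dice: 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event as favourable outcomes / total outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

def total_is_seven(outcome):   # event A
    return sum(outcome) == 7

def first_is_six(outcome):     # event B
    return outcome[0] == 6

p_a = prob(total_is_seven)
p_b = prob(first_is_six)
p_a_and_b = prob(lambda o: total_is_seven(o) and first_is_six(o))
p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(A and B) / P(B)

print(p_a, p_a_given_b, p_a * p_b == p_a_and_b)  # 1/6 1/6 True
```

Because P(A|B) equals P(A), knowing the first die shows 6 does not change the chance of a total of 7, so these two events are independent.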

Risk Concepts

Absolute Risk
- The direct chance of an event occurring.
Relative Risk
- The risk for one group divided by the risk for another group; not confined to a 0-1 scale (e.g., a relative risk of 2 means the event is twice as likely in the first group).
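
A short sketch of both risk measures, using invented counts for two groups (the group labels and numbers are hypothetical):

```python
from fractions import Fraction

def absolute_risk(cases, group_size):
    """Chance of the event within one group."""
    return Fraction(cases, group_size)

def relative_risk(risk_group, risk_comparison):
    """How many times more likely the event is in one group than in another."""
    return risk_group / risk_comparison

# Hypothetical study: 30 cases per 1000 smokers, 15 cases per 1000 non-smokers.
risk_smokers = absolute_risk(30, 1000)
risk_non_smokers = absolute_risk(15, 1000)
print(float(risk_smokers), float(risk_non_smokers))    # 0.03 0.015
print(relative_risk(risk_smokers, risk_non_smokers))   # 2
```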

Binomial Distribution (B(n, p))

Definition: Models the number of successes in n trials, where each trial has two possible outcomes (e.g., heads or tails) and p is the probability of success.
Key Characteristics:
- Fixed number of trials.
- Independent outcomes.
- Constant success probability.
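
For illustration, the probability of exactly k successes under B(n, p) can be computed with the standard binomial formula, nCk × p^k × (1 - p)^(n - k). The function name and the coin example are assumptions made for the sketch.

```python
from math import comb

def binomial_probability(n, p, k):
    """P(exactly k successes) for B(n, p): nCk * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: exactly 2 heads in 4 flips of a fair coin.
print(binomial_probability(4, 0.5, 2))  # 0.375

# The probabilities for k = 0, 1, ..., n cover every outcome, so they total 1.
total = sum(binomial_probability(4, 0.5, k) for k in range(5))
print(total)  # 1.0
```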

Normal Distribution (N(μ, σ²))

A bell-shaped curve with mean μ and variance σ², where almost all values fall within 3 standard deviations of the mean.
Key Properties:
- Continuous variables.
- Symmetry around the mean, where mean = median = mode.
- Approximately 95% of values fall within 2 standard deviations of the mean, and almost all (about 99.7%; some GCSE materials quote 99.8%) within 3 standard deviations.
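
The proportions within 2 and 3 standard deviations can be checked with Python's standard library. This is only a sketch; the helper name `share_within` is made up, and the result does not depend on the particular μ and σ chosen.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

print(round(share_within(2), 3))  # 0.954
print(round(share_within(3), 3))  # 0.997
```

The computed values are slightly different from the rounded percentages usually memorised, which is why the figures are described as approximate.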