GCSE Statistics vocabulary

Page 1: Vocabulary You Must Know

Data and Sampling

  • Hypothesis

    • Definition: A statement that may or may not be true.

    • Distinction: Not to be confused with a question.

    • Purpose: Tested in a statistical investigation by gathering evidence that may support or contradict it.

  • Population

    • Definition: All items or people being studied.

    • Examples: All students in Year 10, all fireworks produced by a factory.

  • Sampling Frame

    • Definition: A list of all members of the population (e.g., a register or database).

  • Random Sample

    • Definition: A sample in which every item in the population has an equal chance of selection.

  • Stratified (Random) Sampling

    • Process: The population is divided into strata (groups), e.g., by gender or school year.

    • Matching: Proportions in the sample reflect those in the population.

    • Selection: Members within each stratum are chosen at random.

  • Judgement Sampling

    • Definition: Non-random sampling in which the researcher uses their own judgement or specific criteria to decide what is included (e.g., choosing to take the first 20 items/people).

  • Cluster Sampling

    • Definition: Sampling in which the population is split into clusters, some clusters are selected at random, and every member of each selected cluster is included (e.g., all students in randomly chosen tutor groups).

  • Quota Sampling

    • Definition: Non-random sampling where a predetermined number of individuals from different categories (e.g., age groups or genders) are selected by the interviewer.

  • Systematic Sampling

    • Definition: Sampling that selects items at fixed intervals from a list, starting from a randomly chosen point (e.g., every 10th item).

  • Cleaning Data

    • Importance: Ensures data reliability and usability in statistical analysis.

    • Methods: Dealing with outliers and missing data, standardising formats, removing unnecessary symbols.

  • Anomaly

    • Definition: A value that does not fit with the rest of the data (e.g., significant deviation from the line of best fit in scatter diagrams).

  • Outlier

    • Definition: A suspiciously extreme value.

    • Boundaries: A value is usually treated as an outlier if it lies more than 3 standard deviations from the mean, or more than 1.5 × interquartile range (IQR) above the upper quartile or below the lower quartile.
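
As a rough illustration of the two outlier boundary rules above, the sketch below uses Python's standard statistics module on a made-up data set. The function names and data are hypothetical, and quartile conventions differ between textbooks and software, so boundaries worked out by hand may differ slightly.

```python
from statistics import mean, pstdev, quantiles

def iqr_boundaries(data):
    """Outlier boundaries: 1.5 x IQR below the lower / above the upper quartile."""
    # The default "exclusive" method places quartiles at the (n + 1)/4-th positions.
    lower_q, _, upper_q = quantiles(data, n=4)
    iqr = upper_q - lower_q
    return lower_q - 1.5 * iqr, upper_q + 1.5 * iqr

def sd_boundaries(data):
    """Outlier boundaries: mean minus / plus 3 standard deviations."""
    centre, spread = mean(data), pstdev(data)
    return centre - 3 * spread, centre + 3 * spread

# Hypothetical data set (illustrative only).
values = [12, 14, 15, 15, 16, 17, 18, 19, 21, 22, 45]

low, high = iqr_boundaries(values)
iqr_outliers = [v for v in values if v < low or v > high]
print(low, high, iqr_outliers)  # 6.0 30.0 [45]
```

The two rules need not flag the same values; calling sd_boundaries(values) on the same data gives a second pair of boundaries to compare.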

Variables

  • Variables

    • Definition: Characteristics that vary among members of the population; they may be quantitative (discrete or continuous) or qualitative.

    • Multivariate Problems: Involve more than one linked variable (e.g., analysing how driving test performance varies by gender and time of day).

    • Categorical Data: Fits into distinct categories (e.g., gender, voting intention).

    • Ordinal Data: Indicates rank order (e.g., race positions).

  • Distribution

    • Definition: The set of a variable's values along with their frequencies or probabilities.

  • Extraneous Variables

    • Definition: Variables not of interest that may impact the results.

    • Example: Time of day may affect reaction time comparisons.

Control Groups & Matched Pairs

  • Control Group

    • Definition: A group that does not receive the treatment being tested, used alongside a test group for comparison.

  • Matched Pairs

    • Definition: Pairs of similar individuals, with one member of each pair in the test group and the other in the control group, so that other variables are controlled (e.g., the test group receives a drug and the control group a placebo).

Questions

  • Closed/Open Questions

    • Closed Questions: Require a choice from the answers provided; easy to analyse.

    • Open Questions: No restriction on answers; harder to analyse (best avoided).

  • Pilot Survey / Pre-test

    • Purpose: Trial a questionnaire on a small scale to identify necessary changes prior to wider use.

    • Checks include: understanding of questions, sufficiency of return rates, adequacy of response options.

Page 2: Data Display and Reliability

Random Response

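  • Definition: A technique for collecting answers to sensitive questions in which a random event decides how each person responds, so individual answers stay private while the overall proportion can still be estimated.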

  • Example: Using a die or coin flip to determine if a subject should answer a question.
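
One common version of this technique, assumed here purely for illustration, asks each person to flip a coin in private: heads means answer "yes" regardless, tails means answer truthfully. Under that assumed design, the proportion of genuine "yes" answers could be estimated as sketched below; the function name and figures are hypothetical.

```python
def estimate_true_yes_proportion(total_responses, yes_responses, p_forced_yes=0.5):
    """Estimate the proportion of truthful 'yes' answers.

    Assumed design: with probability p_forced_yes a person must say 'yes';
    otherwise they answer truthfully.
    """
    expected_forced_yes = total_responses * p_forced_yes
    expected_truthful = total_responses - expected_forced_yes
    return (yes_responses - expected_forced_yes) / expected_truthful

# Hypothetical survey: 200 people, 120 'yes' answers, fair coin.
estimate = estimate_true_yes_proportion(200, 120)
print(estimate)  # 0.2
```

About 100 of the 200 answers are expected to be forced, so the 20 extra "yes" answers are attributed to the roughly 100 people who answered truthfully.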

Reliability & Validity

  • Reliability

    • Definition: The degree to which repeating a study yields similar results.

    • Impact: Small sample sizes may lead to unreliable outcomes.

  • Validity

    • Definition: The extent to which the study measures what it is intended to measure.

    • Example: Year 7 opinions on food may lack validity for whole school views.

Displaying & Comparing Data

  • Choropleth Map

    • A shaded map representing varying values; darker shades indicate higher values.

  • Frequency Density

    • The height of a histogram bar, calculated as frequency ÷ class width, so that bar area represents frequency.

  • Central Tendency

    • Definition: A typical or average value; measures include the mean, median, and mode.

  • Dispersion

    • Definition: The spread of data, including range, IQR, standard deviation, and variance.

  • Variance

    • Definition: The square of standard deviation.
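
For the frequency density entry above, a minimal sketch of the calculation (frequency divided by class width) is given below. The grouped data and the function name are made up for illustration.

```python
def frequency_density(lower, upper, frequency):
    """Height of a histogram bar: frequency divided by class width."""
    return frequency / (upper - lower)

# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 8)]

densities = [frequency_density(*c) for c in classes]

# Bar area (density x class width) recovers the class frequency.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

print(densities)  # [0.5, 1.2, 0.8, 0.2]
```

Note that the widest class has the smallest density even though its frequency is not the smallest, which is why unequal class widths need frequency density rather than frequency on the vertical axis.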

Interpercentile Range (IPR), Interdecile Range

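  • Definition: Ranges that measure the spread of the central part of the data. An interpercentile range is the difference between two percentiles; the interdecile range runs from the 10th to the 90th percentile (the middle 80% of the data). The IQR is similar but covers only the middle 50%, so it discards more data.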

Standardised Score

  • Definition: Indicates how many standard deviations a value is from the mean: standardised score = (value − mean) ÷ standard deviation. The formula must be memorised.

  • Use: Comparison across different distributions; positive scores are above the mean, negative below.
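
A short sketch of how standardised scores allow comparison across distributions, using the usual formula (value minus mean, divided by standard deviation). The subjects, marks, means, and standard deviations are invented for illustration.

```python
def standardised_score(value, mean, standard_deviation):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / standard_deviation

# Hypothetical marks for one student in two tests with different distributions.
maths = standardised_score(70, mean=60, standard_deviation=8)
english = standardised_score(65, mean=58, standard_deviation=4)

print(maths, english)  # 1.25 1.75
```

Although the raw maths mark is higher, the English mark is further above its mean in standard deviation terms, so it is the relatively better performance.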

Scatter Diagrams

  • Bivariate Data

    • Definition: Paired data points (e.g., scores in two subjects).

  • Association and Correlation

    • Association: Relationship between two variables (e.g., height and gender).

    • Correlation: Relationship between two numerical variables (e.g., height and hand span).

  • Explanatory and Response Variables

    • Explanatory: Independent variable, plotted on the x-axis; changes in it may explain changes in y.

    • Response: Dependent variable, plotted on the y-axis; it responds to changes in x.

Regression

  • Regression Line/Equation

    • A calculated line of best fit, interpreted to understand rates of change.

  • Causation vs. Correlation

    • Causation: A change in x directly brings about a change in y.

    • Spurious Correlation: Apparent correlation without a causal relationship; often due to a third variable.

Correlation Coefficients

  • Spearman’s Rank Correlation Coefficient (SRCC)

    • Formula provided; measures the strength of agreement between two sets of ranks, so it is useful even with non-linear relationships.

    • Scale: -1 (perfect negative) to +1 (perfect positive), 0 indicates no correlation.

  • Pearson’s Product Moment Correlation Coefficient (PMCC)

    • Not expected to calculate; measures the strength of linear correlation.

    • Scale is the same as SRCC; for a non-linear relationship, SRCC may be stronger than PMCC.
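
The notes above say the SRCC formula is provided but do not reproduce it. The sketch below assumes the standard form for untied ranks, 1 − 6Σd² ÷ (n(n² − 1)), where d is the difference between paired ranks. The function name and rankings are hypothetical.

```python
def spearman_rank(ranks_x, ranks_y):
    """Spearman's rank correlation coefficient for two lists of untied ranks."""
    n = len(ranks_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical rankings of five competitors by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

print(spearman_rank(judge_a, judge_b))
```

Here Σd² = 4 and n = 5, so the coefficient is 1 − 24/120, about 0.8, which indicates strong agreement between the two judges.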

Page 3: Time Series and Data Distribution

Time Series

  • Trend

    • Defined as long-term changes over time (rising, falling, or level).

  • Seasonal Variation

    • Recurring patterns at regular intervals (e.g., high sales in the same quarter every year).

  • Mean Seasonal Effect

    • The average of the seasonal effects for the same season (e.g., every fourth quarter) across several cycles.

    • Seasonal effect = observed value - trend line value.
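
The seasonal effect formula above can be sketched as follows for one season across several years. The sales figures and trend line values are made up, and the function names are hypothetical.

```python
def seasonal_effects(observed, trend):
    """Seasonal effect for each cycle: observed value minus trend line value."""
    return [o - t for o, t in zip(observed, trend)]

def mean_seasonal_effect(observed, trend):
    """Average of the seasonal effects for the same season."""
    effects = seasonal_effects(observed, trend)
    return sum(effects) / len(effects)

# Hypothetical fourth-quarter sales over three years, with trend line readings.
q4_observed = [130, 138, 147]
q4_trend = [120, 126, 133]

print(seasonal_effects(q4_observed, q4_trend))      # [10, 12, 14]
print(mean_seasonal_effect(q4_observed, q4_trend))  # 12.0
```

A prediction for a future fourth quarter would then typically be the trend line value for that quarter plus this mean seasonal effect.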

Index Numbers

  • Base Year

    • Reference year for percentage comparisons; base year index is always set to 100.

  • Weighted Index

    • Average of index numbers weighted for different items, reflecting specific contributions to a total.

  • RPI (Retail Price Index)

    • A weighted index of the prices of everyday goods and services, used to measure inflation.

  • CPI (Consumer Price Index)

    • Similar to RPI but excludes mortgage payments; also measures inflation.

  • GDP (Gross Domestic Product)

    • Total value of all produced goods/services, reflecting economic growth or recession status.

  • Chain Base Index Number

    • Uses the previous year as the base for index calculations.

  • Geometric Mean

    • nth root of the product of numbers, useful for finding average percentage changes in index numbers.
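
The index number ideas above can be tied together in a small sketch. It assumes the usual definition of an index number (value ÷ base value × 100); the prices, weights, and function names are invented for illustration.

```python
from math import prod

def index_number(value, base_value):
    """Value as a percentage of the base value (the base itself gives 100)."""
    return value / base_value * 100

def chain_base_indices(values):
    """Index of each value using the previous value as its base."""
    return [index_number(cur, prev) for prev, cur in zip(values, values[1:])]

def weighted_index(indices, weights):
    """Weighted mean of several index numbers."""
    return sum(i * w for i, w in zip(indices, weights)) / sum(weights)

def geometric_mean(numbers):
    """nth root of the product of n numbers."""
    return prod(numbers) ** (1 / len(numbers))

# Hypothetical prices in three consecutive years.
prices = [50, 55, 66]

fixed_base = [index_number(p, prices[0]) for p in prices]  # about 100, 110, 132
chain_base = chain_base_indices(prices)                    # about 110, 120
average_chain_index = geometric_mean(chain_base)           # about 114.9

# Hypothetical weighted index: two items with indices 110 and 120, weights 3 and 1.
combined = weighted_index([110, 120], [3, 1])              # 112.5
```

The geometric mean of about 114.9 corresponds to an average rise of roughly 14.9% per year, which compounds to the same overall change (100 to 132) as the two separate yearly rises.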

Distribution of Sample Means

  • Definition: Means calculated from different samples vary, but they are less spread out than the original values.
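
A small simulation can make this concrete. The sketch below is illustrative only: the population, sample size, number of samples, and random seed are arbitrary choices, not values from these notes.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: the whole numbers 1 to 100.
population = list(range(1, 101))

# Take 200 random samples of size 10 and record each sample mean.
sample_means = [mean(rng.sample(population, 10)) for _ in range(200)]

print("Spread of original values:", pstdev(population))
print("Spread of sample means:  ", pstdev(sample_means))
```

The second standard deviation printed should be noticeably smaller than the first, reflecting that sample means cluster more tightly than individual values.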

Quality Assurance

  • Quality Assurance Process

    • Ensures item production is regulated and under control.

  • Control Chart

    • Plots sample results over time, including action and warning lines.

  • Action Lines

    • Often set at mean ± 3 standard deviations; a sample value beyond these lines means the process should be stopped and reset.

  • Warning Lines

    • Typically set at mean ± 2 standard deviations; if a sample value falls beyond them, another sample should be taken immediately.
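
The warning and action lines described above could be applied to a sample result roughly as follows. The target mean, standard deviation, and function names are hypothetical, and the standard deviation is assumed to be the one appropriate to the plotted sample statistic.

```python
def control_lines(target_mean, sd):
    """Warning lines at mean +/- 2 sd and action lines at mean +/- 3 sd."""
    return {
        "warning": (target_mean - 2 * sd, target_mean + 2 * sd),
        "action": (target_mean - 3 * sd, target_mean + 3 * sd),
    }

def classify(sample_value, target_mean, sd):
    """Say which response a plotted sample value calls for."""
    distance = abs(sample_value - target_mean)
    if distance > 3 * sd:
        return "action: stop and reset the process"
    if distance > 2 * sd:
        return "warning: take another sample immediately"
    return "in control"

# Hypothetical process: target mean 500 g, standard deviation 2 g.
print(control_lines(500, 2))
for value in (503, 505, 507):
    print(value, classify(value, 500, 2))
```

With these figures the warning lines sit at 496 and 504 and the action lines at 494 and 506, so 503 is in control, 505 triggers a warning, and 507 triggers action.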

Crude Rates and Standardisation

  • Crude Rate

    • Expresses the number of events per thousand of the population (e.g., crude death rate = deaths ÷ population × 1000).

  • Standardised Rates

    • Adjusts for age distribution to compare rates accurately across different populations; based on a standard population reference.
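
A minimal sketch of both rates follows. It assumes one common formulation of the standardised rate: each age group's crude rate is weighted by that group's share of the standard population. All figures and names are made up for illustration.

```python
def crude_rate(events, population, per=1000):
    """Number of events per thousand of the population."""
    return events / population * per

def standardised_rate(groups):
    """groups: list of (events, population, standard_population) per age group."""
    total_standard = sum(standard for _, _, standard in groups)
    weighted_total = sum(
        crude_rate(events, population) * standard
        for events, population, standard in groups
    )
    return weighted_total / total_standard

# Hypothetical town: (deaths, population, standard population) for two age groups.
town = [(10, 5000, 600), (90, 3000, 400)]

overall_crude = crude_rate(100, 8000)   # about 12.5 per thousand
adjusted = standardised_rate(town)      # about 13.2 per thousand
```

The standardised figure differs from the overall crude rate because the town's age mix is not the same as the standard population's.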

Page 4: Probability Concepts

Sample Space Diagram

  • Definition: Illustrates all potential outcomes (e.g., summing two dice).
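
For the two-dice example, the sample space can be listed and used to find probabilities, as in this short sketch (the function name is hypothetical, and fair six-sided dice are assumed).

```python
from fractions import Fraction
from itertools import product

# Every (first die, second die) outcome for two fair six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability_of_total(total):
    """Probability that the two dice sum to the given total."""
    favourable = [outcome for outcome in sample_space if sum(outcome) == total]
    return Fraction(len(favourable), len(sample_space))

print(len(sample_space))         # 36
print(probability_of_total(7))   # 1/6
print(probability_of_total(12))  # 1/36
```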

Mutually Exclusive Events

  • Definition: Events that cannot occur at the same time, so P(A or B) = P(A) + P(B).

Exhaustive Events

  • Definition: A set of events that covers all possible outcomes; if the events are also mutually exclusive, their probabilities sum to one.

Relative Frequency

  • Definition: The proportion of trials in which an event occurs; used as an experimental estimate of probability.

Conditional Probability

  • Definition: Represents the likelihood of event A occurring given that B has occurred; denoted as P(A|B).

Independence

  • Definition: Two events are independent if P(A|B) = P(A), in which case P(A and B) = P(A) × P(B).
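
The independence condition can be checked on a small example. The sketch below assumes two fair dice, with A = "the first die shows 6" and B = "the total is 7"; these events and all names are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice

def probability(event):
    """Proportion of equally likely outcomes for which the event holds."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_six(outcome):
    return outcome[0] == 6

def total_is_seven(outcome):
    return sum(outcome) == 7

p_a = probability(first_is_six)                                            # 1/6
p_b = probability(total_is_seven)                                          # 1/6
p_a_and_b = probability(lambda o: first_is_six(o) and total_is_seven(o))  # 1/36

p_a_given_b = p_a_and_b / p_b  # conditional probability P(A|B)

independent = p_a_given_b == p_a and p_a_and_b == p_a * p_b
print(p_a_given_b, independent)  # 1/6 True
```

Knowing that the total is 7 does not change the chance that the first die shows 6, so the two events are independent.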

Risk Concepts

  • Absolute Risk

    • The direct chance of an event occurring.

  • Relative Risk

    • The ratio of the risk of an event in one group to the risk in another group; not confined to a 0-1 scale (e.g., a relative risk of 2 means the event is twice as likely).
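
A brief sketch of the two risk measures, treating relative risk as one group's absolute risk divided by another's. The group sizes, counts, and function names are hypothetical.

```python
from fractions import Fraction

def absolute_risk(events, group_size):
    """Chance of the event within one group."""
    return Fraction(events, group_size)

def relative_risk(risk_in_group, risk_in_comparison_group):
    """How many times more likely the event is than in the comparison group."""
    return risk_in_group / risk_in_comparison_group

# Hypothetical study: 30 cases in 1000 people in group A, 15 in 1000 in group B.
risk_a = absolute_risk(30, 1000)
risk_b = absolute_risk(15, 1000)

print(risk_a, risk_b)                 # 3/100 3/200
print(relative_risk(risk_a, risk_b))  # 2
```

A relative risk of 2 sounds large, but the absolute risks (3% and 1.5%) show that the event is uncommon in both groups.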

Binomial Distribution (B(n, p))

  • Definition: Models the number of successes in n trials, each with two possible outcomes (e.g., heads or tails) and success probability p.

  • Key Characteristics:

    • Fixed number of trials.

    • Independent outcomes.

    • Constant success probability.
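
Under the three conditions listed above, the probability of exactly k successes in n trials is given by the standard binomial formula, C(n, k) × p^k × (1 − p)^(n − k). The sketch below applies it to a coin example; the function name and values are illustrative.

```python
from math import comb

def binomial_probability(n, p, k):
    """P(X = k) for X ~ B(n, p): exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 2 heads in 5 tosses of a fair coin.
print(binomial_probability(5, 0.5, 2))  # 0.3125

# The probabilities of 0, 1, ..., 5 heads together cover every possibility.
total = sum(binomial_probability(5, 0.5, k) for k in range(6))
```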

Normal Distribution (N(μ, σ²))

  • Describes a bell-shaped curve in which almost all values fall within 3 standard deviations of the mean.

  • Key Properties:

    • Continuous variables.

    • Symmetry around the mean, where mean = median = mode.

    • Approximately 99.8% of values fall within 3 standard deviations of the mean; 95% within 2 standard deviations.