A set of vocabulary flashcards covering data collection, sampling methods, data types, data representation, measures of location and spread, box plots, cumulative frequency, histograms, correlation, regression, probability, and hypothesis testing.
Population
The entire set of items or individuals of interest in a study.
Census
Observing or measuring every member of the population.
Sample
A subset of the population used to estimate information about the population.
Sampling frame
A list of sampling units from which a sample is drawn.
Sampling units
Individual items in the population that can be sampled.
Simple random sample
Every possible sample of size n has an equal chance of being chosen; requires a sampling frame.
Lottery sampling
A method of simple random sampling where sample units are drawn like tickets from a hat.
Systematic sampling
Selects elements at regular intervals from an ordered list (e.g., every k-th item).
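A minimal Python sketch of systematic sampling, assuming the sampling frame is an ordered list. The function name `systematic_sample` and the frame of 20 numbered units are illustrative, not part of the card set.

```python
import random

def systematic_sample(items, k, start=None):
    """Take every k-th item, beginning at a random position among the first k."""
    if start is None:
        start = random.randrange(k)  # random starting point in 0..k-1
    return items[start::k]

frame = list(range(1, 21))  # hypothetical sampling frame of 20 numbered units
print(systematic_sample(frame, k=5, start=2))  # [3, 8, 13, 18]
```

With 20 units and k = 5 the sample always contains 4 units, whichever starting point is drawn.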
Stratified sampling
Population is divided into strata; random samples are taken from each, proportionate to stratum size.
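A rough Python sketch of proportional stratified sampling. The strata, the sample size and the helper name `stratified_sample` are made up for illustration; rounding can make the stratum sizes add up to slightly more or less than n for awkward proportions.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation: (stratum size / population size) * n, rounded.
        size = round(n * len(members) / total)
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical population: 60 students in Year 12 and 40 in Year 13.
strata = {
    "Year 12": [f"Y12-{i}" for i in range(60)],
    "Year 13": [f"Y13-{i}" for i in range(40)],
}
sample = stratified_sample(strata, n=10, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
# {'Year 12': 6, 'Year 13': 4}
```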
Quota sampling
Non-random sampling where quotas reflect population characteristics; quotas filled during interviewing.
Opportunity sampling
Sample chosen from people available at the time and who fit criteria (convenience sampling).
Non-random sampling
Sampling methods that do not use random selection (e.g., quota, opportunity).
Qualitative data
Non-numeric data such as hair colour or types of occupation.
Quantitative data
Numeric data that can be measured or counted.
Discrete data
Quantitative data that take only specific values (e.g., number of students).
Continuous data
Quantitative data that can take any value within a range (e.g., height, time).
Grouped data
Data organized into classes or intervals with frequencies.
Class boundaries
Lower and upper limits that define a class in a grouped frequency table.
Midpoint
The average of the class boundaries; used as a representative value for a class.
Class width
Difference between the upper and lower class boundaries.
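As an illustration of midpoints and class widths, the sketch below uses a hypothetical grouped frequency table (the boundaries and frequencies are invented). It also shows one common use of midpoints: estimating the mean of grouped data.

```python
# Hypothetical grouped frequency table: (lower boundary, upper boundary, frequency).
table = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in table]
widths = [upper - lower for lower, upper, _ in table]

# Estimated mean: treat every value in a class as if it sat at the class midpoint.
total_frequency = sum(f for _, _, f in table)
estimated_mean = sum(m * f for m, (_, _, f) in zip(midpoints, table)) / total_frequency

print(midpoints)       # [5.0, 15.0, 30.0]
print(widths)          # [10, 10, 20]
print(estimated_mean)  # 17.8
```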
Frequency table
Table listing class intervals and their frequencies.
Raw data
Original data before any summarisation or grouping.
Large data set
A big data set (e.g., weather data) used to practise sampling and statistics; includes multiple variables.
Sampling units
Individual items that are sampled from the population (often numbered or named).
Data type: qualitative
Non-numeric data (e.g., hair colour, species).
Data type: quantitative
Numeric data that can be measured or counted.
Data type: discrete
Quantitative data that take only specific, separate values (often whole-number counts).
Data type: continuous
Data that can take any value within a range (measurements).
Class boundaries (grouped data)
The actual lower and upper limits of a class interval in a grouped distribution.
Frequency density
Height of a bar in a histogram; used when class widths vary; area corresponds to frequency.
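A short Python sketch of the rule frequency density = frequency ÷ class width, using an invented table with unequal class widths. Multiplying each density by its class width recovers the frequency, which is why bar area represents frequency.

```python
# Hypothetical grouped frequency table: (lower boundary, upper boundary, frequency).
table = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]

# Frequency density = frequency / class width (the height of each histogram bar).
densities = [f / (upper - lower) for lower, upper, f in table]
print(densities)  # [0.5, 1.2, 0.4]

# Bar area = density * class width, which gives back the class frequency.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, table)]
```

The widest class (20 to 40) has the lowest bar even though its frequency is higher than that of the first class.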
Histogram
A graph of grouped continuous data where area of bars is proportional to frequencies.
Frequency polygon
A line graph joining the midpoints of the tops of the histogram bars.
Box plot
A graphical representation showing min, Q1, median (Q2), Q3 and max, with possible outliers.
Outlier
An observation far from the pattern of the rest of the data; a common rule flags values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
Interquartile range (IQR)
Difference between Q3 and Q1; a measure of spread.
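A possible Python sketch for quartiles, the IQR and an outlier check. The data are invented, and the 1.5 × IQR rule is one common convention rather than the only definition. Quartile conventions also differ between textbooks and software; this example uses the "inclusive" method of `statistics.quantiles`.

```python
import statistics

data = [2, 4, 5, 7, 8, 9, 11, 12, 30]  # hypothetical observations

# Quartiles using the "inclusive" interpolation method.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)  # 5.0 8.0 11.0
print(iqr)         # 6.0
print(outliers)    # [30]
```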
Quartiles
Q1 (lower quartile), Q2 (median), Q3 (upper quartile).
P10, P90
10th and 90th percentiles; 10% of the data lie below P10 and 90% lie below P90.
Percentile
A value below which a given percentage of data falls.
Cumulative frequency diagram
Plot of cumulative frequency against upper class boundary; used to estimate the median, quartiles and percentiles from the graph.
Measures of central tendency
Statistics that describe the centre of a data set (mean, median, mode).
Mean (x-bar)
Average value: x̄ = (sum of data values)/n.
Median
Middle value when data are arranged in order (or the average of the two middle values for even n).
Mode
Most frequent value in the data (or modal class in grouped data).
Variance
The average of squared deviations from the mean; σ² = Σ(x−x̄)² / n (population).
Standard deviation
The square root of the variance; a measure of spread in the same units as the data.
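The variance and standard deviation formulas can be worked through in Python as follows. The data set is invented, and the divisor n (population form) matches the variance card above.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
n = len(data)

mean = sum(data) / n
# Population variance: the average of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / n
standard_deviation = math.sqrt(variance)

print(mean)                # 5.0
print(variance)            # 4.0
print(standard_deviation)  # 2.0
```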
Coding a data set
Transforming data using y = a + bx to simplify calculations; the mean follows ȳ = a + bx̄ and the standard deviation follows σy = |b|σx (adding a does not change the spread).
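The effect of coding on the mean and standard deviation can be checked numerically. In this sketch the raw values and the coding y = (x − 1000)/10 (so a = −100 and b = 0.1) are illustrative choices.

```python
import math
import statistics

x = [1010, 1020, 1030, 1040]               # hypothetical raw data
y = [(value - 1000) / 10 for value in x]   # coded data: [1.0, 2.0, 3.0, 4.0]
a, b = -100, 0.1                           # the same coding written as y = a + b*x

mean_x, sd_x = statistics.mean(x), statistics.pstdev(x)
mean_y, sd_y = statistics.mean(y), statistics.pstdev(y)

# The mean is shifted and scaled; the standard deviation is only scaled.
print(math.isclose(mean_y, a + b * mean_x))  # True
print(math.isclose(sd_y, abs(b) * sd_x))     # True
```

Working with the coded values 1, 2, 3, 4 and then undoing the coding gives the same mean and standard deviation as working with the raw values directly.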
Box plot features
Whiskers show min and max (excluding outliers); box spans Q1 to Q3; line at median; outliers plotted separately.
Skewness
Asymmetry of the data distribution; reflected in the position of the median within the box plot.
Correlation
A measure of the linear relationship between two variables; r indicates strength and direction.
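A sketch of how r (the product moment correlation coefficient) can be computed from the summary quantities Sxy, Sxx and Syy. The paired data are invented for illustration.

```python
import math

x = [1, 2, 3, 4, 5]  # hypothetical paired data
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# r = Sxy / sqrt(Sxx * Syy); always lies between -1 and 1.
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))  # 0.775
```

A value near +1 or −1 indicates a strong linear relationship; the sign gives the direction.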
Scatter diagram
Plot of paired data points (x, y) to assess relationships between two variables.
Regression line (least squares)
Line that minimises the sum of squared vertical distances (residuals) from the data points; y = a + bx.
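A minimal sketch of fitting a least squares line with the standard results b = Sxy/Sxx and a = ȳ − bx̄. The data and the helper name `predict` are illustrative.

```python
x = [1, 2, 3, 4, 5]  # hypothetical paired data
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

b = s_xy / s_xx          # gradient of the regression line
a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)

def predict(value):
    """Estimate y from the fitted line y = a + b*x."""
    return a + b * value

print(round(a, 3), round(b, 3))  # 2.2 0.6
print(round(predict(3.5), 3))    # 4.3
```

Here x = 3.5 lies inside the observed range of 1 to 5, so the estimate is an interpolation; using the line at x = 20 would be an extrapolation.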
Independent variable (explanatory)
The variable that is purposely changed or used to explain changes in the other variable.
Dependent variable (response)
The outcome measured in a study, believed to depend on the independent variable.
Prediction (interpolation)
Estimating a value within the range of observed data using the regression line.
Extrapolation
Predicting a value outside the range of observed data; usually less reliable.
Binomial distribution
X ~ B(n, p): X counts the successes in a fixed number n of independent trials, each with two outcomes (success, failure) and the same probability of success p.
Probability mass function (pmf)
P(X = x) giving the probability that X takes the value x.
B(n, p) mean
Mean of a binomial distribution: np.
B(n, p) variance
Variance of a binomial distribution: np(1−p).
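The binomial pmf, P(X = x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ, together with the mean and variance formulas, can be sketched in Python. The values n = 10 and p = 0.5 are example choices.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5  # example parameters
print(binomial_pmf(5, n, p))  # 0.24609375
print(n * p)                  # mean: 5.0
print(n * p * (1 - p))        # variance: 2.5

# The mean can also be recovered directly from the pmf as the sum of x * P(X = x).
mean_from_pmf = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
```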
Cumulative probability (binomial CD)
Probabilities P(X ≤ x) calculated from binomial distribution tables or a calculator; for a strict inequality use P(X < x) = P(X ≤ x − 1).
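Tables and calculator functions for cumulative binomial probabilities usually report P(X ≤ x). A sketch of that calculation, with example parameters n = 10 and p = 0.5 and an illustrative helper name `binomial_cdf`:

```python
from math import comb

def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p), summing the pmf from 0 to x."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

n, p = 10, 0.5  # example parameters
print(binomial_cdf(2, n, p))      # P(X <= 2) = 0.0546875
print(1 - binomial_cdf(2, n, p))  # P(X >= 3) = 0.9453125
```

Because X takes whole-number values, P(X < 3) equals P(X ≤ 2) and P(X ≥ 3) equals 1 − P(X ≤ 2).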
Null hypothesis (H0)
Presumed statement about a population parameter to be tested, e.g., p = p0.
Alternative hypothesis (H1)
Statement that contradicts H0 and describes how the parameter is suspected to differ, e.g., p > p0, p < p0 or p ≠ p0.
Test statistic
A quantity calculated from sample data used to decide whether to reject H0.
Significance level
Probability threshold (e.g., 0.05) used to decide whether to reject H0.
Critical region
Set of values of the test statistic that lead to rejection of H0.
Acceptance region
Values of the test statistic that fail to lead to rejection of H0.
One-tailed test
Hypothesis test where the alternative specifies a direction (>, <).
Two-tailed test
Hypothesis test where the alternative does not specify a direction (≠).
P-value
The probability, under H0, of obtaining a test statistic as extreme or more extreme than observed.
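A sketch of a one-tailed binomial hypothesis test that ties together the p-value, significance level and critical region. The scenario is invented: H0: p = 0.5, H1: p > 0.5, with 15 successes observed in 20 trials and a 5% significance level.

```python
from math import comb

def upper_tail(x, n, p):
    """P(X >= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

n, p0, alpha = 20, 0.5, 0.05  # hypothetical test set-up
observed = 15                 # hypothetical number of successes

# p-value: probability, under H0, of a result at least as extreme as observed.
p_value = upper_tail(observed, n, p0)

# Critical region X >= c: the smallest c whose tail probability is at most alpha.
critical_value = next(c for c in range(n + 1) if upper_tail(c, n, p0) <= alpha)

print(round(p_value, 4))  # 0.0207
print(critical_value)     # 15
print(p_value <= alpha)   # True, so H0 is rejected at the 5% level
```

Both routes agree: the observed value 15 lies in the critical region X ≥ 15, and the p-value is below 0.05.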
Interpolation
Estimating a value (e.g., a median or quartile) inside a class interval of grouped data, assuming the values are spread evenly across the class.
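A sketch of linear interpolation for the median of grouped data, assuming values are spread evenly within each class and taking the median position as n/2. The table and the helper name `grouped_median` are invented for illustration.

```python
def grouped_median(table):
    """Estimate the median from (lower boundary, upper boundary, frequency) rows."""
    n = sum(f for _, _, f in table)
    target = n / 2  # position of the median for grouped continuous data
    cumulative = 0
    for lower, upper, f in table:
        if cumulative + f >= target:
            # Go the required fraction of the way through the median class.
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f

table = [(0, 10, 5), (10, 20, 12), (20, 40, 8)]  # hypothetical grouped data
print(grouped_median(table))  # 16.25
```

The 12.5th value falls in the 10 to 20 class, 7.5 values into a class of 12, so the estimate is 10 + (7.5/12) × 10 = 16.25.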
Extrapolation (regression)
Estimating a value outside the observed data range; less reliable.
Beaufort scale
Descriptive scale for wind speed (e.g., calm to gale) used with large data sets.
Raw data vs summary statistics
Raw data are the original measurements; summary statistics (mean, median, etc.) describe the data.
Redundancy caution (outliers)
Outliers may be genuine values or errors; only remove one if you can give a reason, such as a recording or measurement error.