Edexcel GCSE Statistics (9-1) Revision Notes

Chapter 1: Collection of Data

Raw Data:
- Unprocessed data that has just been collected and requires ordering, grouping, rounding, and cleaning.
Types of Data:
- Qualitative: Non-numerical, descriptive data (e.g., eye/hair color, gender). Often subjective and more difficult to analyze.
- Quantitative: Numerical data that can be measured with numbers. Easier to analyze than qualitative data (e.g., height, weight, exam marks).
- Discrete: Data that only takes particular values (not necessarily whole numbers), such as shoe size or number of people.
- Continuous: Data that can take any value (e.g., height, weight).
- Categorical: Data that can be sorted into non-overlapping categories, such as gender. Used for qualitative data to facilitate processing.
- Ordinal (Rank): Quantitative data that can be given an order or ranked on a rating scale (e.g., marks in an exam).
- Bivariate: Data involving the measurement of two variables. Can be qualitative or quantitative, grouped or ungrouped. Commonly used with scatter diagrams.
  - One variable is the explanatory variable, and the other is the response variable.
- Multivariate: Data made up of more than two variables (e.g., comparing height, weight, age, and shoe size).
Grouping Data:
- Grouping data using tables makes it easier to spot patterns and understand data distribution.
- Discrete Data: Can be grouped into non-overlapping classes (e.g., 0-10, 11-15), which do not necessarily have equal class widths.
  - Smaller intervals are used when there is a lot of data close together, and wider classes are used for more spread-out data.
- Continuous Data: Can be grouped using inequalities, ensuring that class intervals do not have gaps or overlap, using both < and ≤ symbols.
- Pros of Grouping Data:
  - Makes the data easy to read and understand.
  - Easy to spot patterns and compare data.
- Cons of Grouping Data:
  - Loses accuracy of data, as exact data values are unknown.
  - Calculations made from grouped data will only be estimates (e.g., mean).
Data Sources:
- Primary: Data collected directly by you or someone on your behalf (e.g., questionnaires, interviews, experiments, observations).
- Secondary: Data that has already been collected (e.g., databases, newspapers/magazines/websites, historical records, Office for National Statistics).

Populations and Sampling

Population: Everyone or everything that could be involved in the investigation (e.g., all students in a school when investigating student opinions).
Census: A survey of the entire population.
Sample: A smaller number from the population that you actually survey.
- Data obtained from the sample is used to make conclusions about the whole population.
- It is important that the sample represents the population fairly.
Sampling Frame: A list of all members of the population from which the sample is chosen (e.g., electoral roll, school register).
Sampling Unit: The people to be sampled (e.g., students in a school).
Biased Sample: A sample that does not fairly represent the population (e.g., a survey of students at a mixed school that only includes girls).
- Avoid bias by using random sampling methods.
Sampling Methods:
- Random Sample: Every item/person in the population has an equal chance of being selected.
  - Method:
    - Assign a number to every member of the population.
    - Mention the random sampling technique (e.g., random number table, random number generator).
    - Select the numbers chosen from your population.
    - Ignore any repeats and choose another number.
  - Techniques:
    - Pick numbers/names out of a hat (small samples only).
    - Use a random number table.
    - Use the random number generator function on a calculator or computer.
  - Advantages:
    - Representative sample as every member has an equal chance of being selected.
    - Unbiased.
  - Disadvantages:
    - Need a full list of population (not always easily obtainable).
    - Not always convenient, can be expensive and time-consuming.
    - Needs a large sample size.
- Stratified Sample: The size of each stratum (group) in the sample is proportional to the size of that stratum in the population.
  - Method:
    - Split the population into groups.
    - Use the formula $\text{sample size for stratum} = \frac{\text{stratum total}}{\text{population total}} \times \text{overall sample size}$ to calculate the sample size for each group.
    - Use random sampling to select members from each stratum/group.
  - Advantages:
    - Sample is in proportion to population, so sample represents the population fairly.
    - Best used for populations with groups of unequal sizes.
  - Disadvantages:
    - Time-consuming.
- Systematic Sampling: Choosing items in the population at regular intervals.
  - Method:
    - Divide population size by sample size to calculate the intervals (e.g., 400/40 = 10, so choose every 10th item).
    - Use random sampling to generate a number between 1 and the interval to choose a starting point (e.g., 7).
    - Select every 10th item after the 7th (7th, 17th, 27th, etc.) until you obtain your sample size.
  - Advantages:
    - Population is evenly sampled.
    - Can be carried out by a machine.
    - Sample is easy to select.
  - Disadvantages:
    - Not strictly a random sample as some members of the population cannot be chosen.
- Cluster Sampling: The population is divided into natural groups (clusters), groups are chosen at random, and every member of the chosen groups is sampled.
  - Advantages:
    - Economically efficient – fewer resources required.
    - Can be representative if lots of small clusters are sampled.
  - Disadvantages:
    - Clusters may not be representative of the population and may lead to a biased sample.
    - High sampling error.
- Quota Sampling: Population is grouped by characteristics, and a fixed amount is sampled from every group.
  - Method:
    - Group population by characteristics (e.g., gender and age).
    - Select quota (amount) for each group (e.g., 30 men under 25, 40 women over 30, etc.).
    - Obtain sample by finding members of each group until the quota is reached.
  - Advantages:
    - Quick to use.
    - Cheap.
    - Do not need a sample frame or full list of the population.
  - Disadvantages:
    - NOT RANDOM – biased as the interviewer chooses who will be in the sample.
- Opportunity Sampling: Using the people/items that are available at the time (e.g., interviewing the first 10 people you see on a Monday morning).
  - Advantages:
    - Quick.
    - Cheap.
    - Easy.
  - Disadvantages:
    - NOT RANDOM. The sample has not been collected fairly, so it may not represent the population.
- Judgement Sampling: When the researcher uses their own judgment to select a sample they think will represent the population (e.g., a teacher choosing students to interview).
  - Advantages:
    - Easy.
    - Quick.
  - Disadvantages:
    - NOT RANDOM.
    - Quality of sample depends on the person selecting the sample.
- Petersen Capture-Recapture: Used to estimate the size of large or moving populations where it would be impossible to count the entire population.
  - Method:
    1. Take a sample of the population.
    2. Mark each item.
    3. Put the items back into the population and ensure they are thoroughly mixed.
    4. Take a second sample and count how many of your sample are marked.
  - Formula: $\frac{\text{First Capture}}{\text{Total } (N)} = \frac{\text{Tagged}}{\text{Second Capture}}$, so $N = \frac{\text{First Capture} \times \text{Second Capture}}{\text{Tagged}}$, where Tagged is the number of marked items found in the second sample.
  - Assumptions:
    - Population has not changed – no births/deaths
    - Probability of being caught is equally likely for all individuals.
    - Marks/tags not lost
    - Sample size is large enough and is representative of the population.
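
The stratified, systematic and capture-recapture calculations above can be sketched in Python. This is only an illustration: the function names, the year-group totals and the fish counts are made-up assumptions, not values from these notes.

```python
import random


def stratified_sizes(strata_totals, sample_size):
    """Sample size per stratum: stratum total / population total * sample size."""
    population_total = sum(strata_totals.values())
    return {
        name: round(total / population_total * sample_size)
        for name, total in strata_totals.items()
    }


def systematic_positions(population_size, sample_size, rng=random):
    """1-based positions chosen at regular intervals from a random start."""
    interval = population_size // sample_size
    start = rng.randint(1, interval)  # random starting point in the first interval
    return [start + i * interval for i in range(sample_size)]


def petersen_estimate(first_capture, second_capture, tagged_in_second):
    """Estimate N from: first capture / N = tagged in second / second capture."""
    return first_capture * second_capture / tagged_in_second


# Hypothetical school of 400 students in three year groups, sample of 40.
sizes = stratified_sizes({"Year 9": 120, "Year 10": 180, "Year 11": 100}, 40)
print(sizes)  # {'Year 9': 12, 'Year 10': 18, 'Year 11': 10}

# 400 items, sample of 40: every 10th item after a random start between 1 and 10.
positions = systematic_positions(400, 40)
print(positions[:3])

# Hypothetical pond: 50 fish tagged, second sample of 40 contains 8 tagged fish.
estimate = petersen_estimate(50, 40, 8)
print(estimate)  # 250.0
```

When the stratum totals do not divide neatly, the rounded sizes may not add up to the overall sample size, so check the total and adjust by hand.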

Experiments

Used when a researcher investigates how changes in one variable affect another.
- Variables:
  - Explanatory (Independent) Variable: The variable that is changed.
  - Response (Dependent) Variable: The variable that is measured.
  - Extraneous Variables: Variables you are not interested in but that could affect the result of your experiment.
- Laboratory Experiments: Researcher has full control over variables. Conducted in a lab or similar environment.
  - Advantages:
    - Easy to replicate – makes results more reliable.
    - Extraneous variables can be controlled so results are more likely to be valid.
  - Disadvantages:
    - People may behave differently under test conditions than they would under real-life conditions.
- Field Experiments: Carried out in the everyday environment. Researcher has some control over the variables. They set up the situation and control the explanatory variable but have less control over extraneous variables.
  - Advantages:
    - More accurate – reflects real life behavior.
  - Disadvantages:
    - Cannot control extraneous variables.
    - Not as easy to replicate – less reliable than lab experiments.
- Natural Experiments: Carried out in the everyday environment. Researcher has no/very little control over the variables. Explanatory variables are not changed but instead researchers look at something that already exists in the world and how it affects other things.
  - Advantages:
    - Reflects real life behaviour.
  - Disadvantages:
    - Low validity – extraneous variables are not controlled which may affect results instead of explanatory variable.
    - Difficult to replicate.
    - Cannot control extraneous variables.
- Simulation: A way to model random events using random numbers and previously collected data.
  - Steps:
    1. Choose a suitable method for getting random numbers – dice, calculator, random number tables.
    2. Assign numbers to the data.
    3. Generate the random numbers.
    4. Match the random numbers to your outcomes.
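
The four simulation steps can be sketched as follows. The drink-choice percentages are invented stand-ins for "previously collected data", and a seeded Python generator is assumed in place of dice or a random number table.

```python
import random

# Hypothetical previously collected data: 50% tea, 30% coffee, 20% juice.
# Step 2: assign the digits 0-9 to the outcomes in the same proportions.
digit_to_outcome = {}
for digit in range(10):
    if digit <= 4:
        digit_to_outcome[digit] = "tea"      # digits 0-4 (50%)
    elif digit <= 7:
        digit_to_outcome[digit] = "coffee"   # digits 5-7 (30%)
    else:
        digit_to_outcome[digit] = "juice"    # digits 8-9 (20%)


def simulate(n_customers, rng):
    """Steps 3 and 4: generate random digits, then match them to outcomes."""
    digits = [rng.randint(0, 9) for _ in range(n_customers)]
    return [digit_to_outcome[d] for d in digits]


# Step 1: choose the source of random numbers (here a seeded generator).
rng = random.Random(1)
results = simulate(20, rng)
print(results)
```

Each run with a different seed gives a different list of simulated customers, which is the point of a simulation: it models what might happen rather than predicting one exact result.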

Questionnaires/Interviews

A source of primary data
Questionnaire: A set of questions used to obtain data from the population/sample. Can be carried out via post, email, phone or face to face. The person completing the questionnaire is called the respondent.
- Open questions: Allows any answer. However, the wide range of different answers makes it difficult to analyse the data.
- Closed questions: Has a fixed number of non-overlapping option boxes that only allow for specific answers or opinion scales. This makes data easier to analyse.
- Features of a good questionnaire:
  - Easy to understand
  - Uses simple language
  - Avoids leading questions such as “do you agree…?”, which make the respondent want to agree.
  - Questions are relevant to the investigation
  - Includes a time frame/unit in the question.
  - Includes non-overlapping, exhaustive option boxes.
  - Questions should not be offensive/personal/embarrassing
  - Questions produce results that are easy to analyse.
- Problems with Questionnaires:
  - Non-response: when people in the sample do not respond to the questionnaire. Could be due to people not wanting to answer the questionnaire or not understanding the questions. Ways to reduce non-response:
    - Follow up on people who have not responded.
    - Collect each questionnaire yourself.
    - Offer an incentive to complete the questionnaire such as the opportunity to win a prize.
    - Use a pilot survey to test response rate or understandability of questions.
  - Sensitive questions: Includes questions about people’s health, age, weight, salary etc. May make people uncomfortable so they may not answer truthfully which could distort the results.
    - You can make respondents more comfortable by making the questionnaire anonymous and allowing them to answer the questionnaire in private or by using the random response method.
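- Random Response Method: Uses a random event to decide how to answer a question, which ensures that people who answer the question remain anonymous. For example, each respondent flips a coin in private: on heads they answer “yes” regardless of the truth, and on tails they answer truthfully. You can use the survey results to calculate an estimate for the proportion of people who would truthfully answer yes to the sensitive question.
  1. Find the total number of people who answered the question.
  2. Find the probability of the random event, e.g. P(heads) if it is a coin.
  3. Estimate the number of heads: probability × total.
  4. Estimate the number of “yes” answers that were truthful: number of “yes” answers – estimated number of heads.
  5. Estimate the proportion of people who did the crime (or whatever the sensitive question asks about) = $\frac{D}{C}$, where $D$ is the estimate from step 4 and $C$ is the estimated number of people who answered truthfully (total – estimated number of heads, which for a fair coin equals the estimate from step 3).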
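
A minimal sketch of the random response calculation, assuming the usual coin design (heads means tick "yes" regardless, tails means answer truthfully). The function name and the survey figures are hypothetical.

```python
def random_response_estimate(total_respondents, yes_answers, prob_heads=0.5):
    """Estimate the proportion of truthful 'yes' answers in a random response survey."""
    est_heads = prob_heads * total_respondents            # step 3
    truthful_yes = yes_answers - est_heads                # step 4
    truthful_respondents = total_respondents - est_heads  # people who got tails
    return truthful_yes / truthful_respondents            # step 5


# Hypothetical survey: 200 respondents, 130 of whom answered "yes".
proportion = random_response_estimate(200, 130)
print(proportion)  # 0.3
```

Here about 100 of the "yes" answers are expected to come from heads, leaving 30 truthful "yes" answers out of the 100 people expected to have answered truthfully.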
Pilot Study: A small-scale replica of the study to be carried out.
- Advantages:
  - Helps you spot any questions that are unclear or ambiguous.
  - Gives you an idea of the response rate.
  - Allows you to check the time and costs of the study.
  - You can check that closed questions include all the possible answers.
  - Can use pilot study to check that the questionnaire collects all the information needed.
Interviews: where you question each person individually.
Involves lots of specific questions or a list of topics. Can be carried out face to face or over the phone or internet.

Problems with Collected Data

Outliers:
- Values that do not fit in with the pattern or trend of the data.
- Can be extreme values or incorrectly recorded values. If incorrectly recorded, they can be ignored. If they are extreme values, you need to decide whether or not to include them in the data, as they may distort/skew your results.
Cleaning Data:
- Fixing problems with the data. This could be done by:
  - Identifying and correcting/removing incorrect data values or outliers.
  - Removing units or symbols from the data.
  - Putting all the data in the same format e.g. m/cm, capital/lowercase, words/letters.
  - Deciding what to do about missing data.

Controlling Extraneous Variables

Control Groups:
- The control group (sometimes called a comparison group) is used in an experiment as a way to ensure that your experiment actually works. It’s a way to make sure that the treatment you are giving is causing the experimental results, and not something outside the experiment.
  - Use random selection to select 2 groups of people: a control group and an experimental (test) group.
  - Give the test group the treatment and the control group no treatment.
  - Compare the results from the 2 groups to see how effective the treatment is.
  - Conditions must be exactly the same for both groups; only the treatment must be different.
Matched pairs:
- 2 groups of equally matched (age/gender etc.) people used to test effect of a particular factor. Everything in common except factor being studied.
The “pairs” don’t have to be different people; they could be the same individuals at different times.
- For example:
  - The same study participants are measured before and after an intervention.
  - The same study participants are measured twice for two different interventions.
The purpose of matched samples is to get better statistics by controlling for the effects of other “unwanted” variables.
For example, if you are investigating the health effects of alcohol, you can control for age-related health effects by matching age-similar participants.

Hypotheses and Investigations

Hypothesis: A statement (not a question) that can be tested by collecting and analysing data.
Stages of an Investigation:
- Planning – choose hypothesis, what data to collect (variables), how you will record data (data collection tables)
- Collecting Data – choosing data sources (primary/secondary), collection methods (questionnaire/interviews), control factors.
- Processing and Representing data – choosing diagrams and calculations.
- Interpreting Results – drawing conclusions from the results of the diagrams and calculations.
- Evaluating methods – looking at the strengths and weaknesses of your data collection methods, planning and diagrams and how well they helped to test the hypothesis.

Chapter 2 – Processing and Representing Data

Tables: Tables with a collection of data. They are a form of secondary data if the data is available online and, in most cases, easily accessible.
- You need to be able to use these tables to identify values, calculate totals/differences/percentages, describe trends and explain inconsistencies.
- One of the main inconsistencies will be that the percentages do not add up to 100% and this is due to rounding errors because individual percentages for columns/rows in the tables have been rounded.
Two-Way Tables:
- Has information in two categories and has two variables so the data is called bivariate data.
- To find missing values, start with the row or column that has only one value missing. Make sure the grand totals for the rows and columns add up to the same number.
- When comparing data from two-way tables, write about comparisons between rows/columns but also individual cells.
Pictograms:
- Uses pictures or symbols to represent a particular amount of data. Always has a key to show the amount each symbol represents.
  - Each symbol is the same size
  - The symbols represent numbers that can be easily divided to show different frequencies, e.g. for a symbol that represents 4, you can draw a quarter of the symbol to show a frequency of 1.
  - Spacings are the same in each row.
  - There is a key to show the frequency that each symbol represents.
Bar Charts:
- Simple Bar Charts:
  - Bars are equal width
  - Equal gaps between bars
  - Frequency on y-axis
- Vertical Line Graph:
  - Similar to simple bar chart but with lines instead of bars.
- Multiple Bar Charts:
  - Can be used to compare two or more sets of data. Has more than one bar for each class represented by different colours which is shown in the key.
- Composite Bar Charts:
  - Has single bars split into different sections for each different category. Usually used to compare different times/days/years. The frequency of each component should be calculated by subtracting the y-axis reading at the bottom of that component from the reading at its top. Do not just read off the y-axis (unless looking at total frequencies or the bottom component).
Stem and Leaf Diagrams:
- A good way of organising data without losing any of the detail – All the original data is in the diagram but looks simple. It also shows the shape of the distribution – whether most of the data lies at the beginning, the end or is distributed in the middle.
- Each value is split into a ‘stem’ and ‘leaf’ – Stems can be more than one digit, leaves are single digits only.
- How to draw one:
  - 1) Put the first digits of each piece of data in numerical order down the left hand side.
  - 2) Go through each piece of data in turn and put the remaining digits in the correct row.
  - 3) Re-draw the diagram, putting the pieces of data in numerical order.
  - 4) Add a key.
- Back-to-back Stem and Leaf Diagrams
  - Shows two sets of data sharing the same stem so that you can easily compare them. Numbers closest to the stem are smallest. Use a separate key for each set of data.
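
As an illustration of the drawing steps, here is a short Python sketch for two-digit whole numbers. The data values and the function name are made up, and the code sorts the data first rather than re-drawing the diagram as you would by hand.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: ordered leaves}, using the last digit of each value as the leaf."""
    rows = defaultdict(list)
    for value in sorted(values):          # sorting first keeps the leaves in order
        stem, leaf = divmod(value, 10)    # e.g. 23 -> stem 2, leaf 3
        rows[stem].append(leaf)
    return dict(rows)


data = [23, 31, 27, 45, 38, 21, 36, 40, 29]  # hypothetical data set
diagram = stem_and_leaf(data)

for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 2 | 3 means 23")
```

Every original value can still be read back from the diagram, which is why no detail is lost.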
Pie Charts:
- A way of displaying data to show how something is shared or divided into categories. Each sector shows what proportion of the total data that category represents. The area of the pie chart represents the total frequency.
- Angles add up to 360°.
- Interpreting Pie Charts – Remember pie charts show proportion and not numbers.
- Comparative Pie Charts:
  - Can be used to compare two sets of data of different sizes. The areas of the two circles should be in the same ratio as the two frequencies.
  - Why?: Drawing two pie charts the same size can be misleading. The area of each pie chart represents its total frequency, so the larger the pie chart, the greater the frequency. To compare the total frequencies, compare the areas.
  - Working out radius of second pie chart:
    1. Divide the area (total frequency) of the second pie chart by that of the first (this gives you the area scale factor)
    2. Square root answer (this gives you the scale factor for radius)
    3. Multiply by radius of first pie chart.
    - If pie chart B is larger than pie chart A then pie chart B has a greater frequency.
    - If both pie charts then have the same angle for a sector that means that sector has a greater frequency in pie chart B even though the proportions are the same because it has a larger area.
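
The three steps for the radius of the second pie chart can be checked with a short calculation. The radius and total frequencies below are hypothetical.

```python
import math


def second_radius(radius_a, total_a, total_b):
    """Radius of pie chart B so that the areas are in the ratio of the total frequencies."""
    area_scale_factor = total_b / total_a                # step 1: divide the areas
    radius_scale_factor = math.sqrt(area_scale_factor)   # step 2: square root
    return radius_a * radius_scale_factor                # step 3: multiply by first radius


# Chart A: radius 4 cm, total frequency 100. Chart B: total frequency 225.
radius_b = second_radius(4, 100, 225)
print(radius_b)  # 6.0
```

The area scale factor is 2.25, so the radius scale factor is 1.5 and chart B needs a radius of 6 cm.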
Population Pyramids:
- Shows the distribution of ages in a population, in numbers or proportions/percentages. They are used to compare two sets of data, usually genders or two geographical areas.
- Interpreting: look at the shape of the distribution.
Cumulative Frequency Diagrams:
- Cumulative frequency (CF) is a running total of the frequencies.
- To work out the CF for a class interval, add the frequency for that class interval to the CF of the previous class interval.
- Use upper bounds for the x-axis when plotting points.
- CF Step Polygons: Use for discrete data. Plot the points using the upper bound of each class interval and join the points using straight lines by going across then up.
- CF Curves: Use for grouped continuous data. Plot the points using the upper bound of each class interval and connect them with a smooth curve.
- Estimating values from CF diagrams:
  - Median
    - Work out the median position by dividing the total frequency by 2. Find it on the y-axis, draw a horizontal line from that value to the curve/line, then read off the value from the x-axis.
  - Interquartile Range (IQR)
    - Work out the 25% and 75% positions. Find them on the y-axis, draw horizontal lines from those values to the curve/line, read off the values from the x-axis, then subtract them (big one – small one).
  - Estimating more than/greater than values
    - Draw a vertical line from the value in the question on the x-axis to the curve.
    - Read off the corresponding y-axis value and subtract it from the total frequency.
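
Reading values off a cumulative frequency diagram can be imitated in Python. This sketch assumes the plotted points are joined with straight lines (an approximation to a hand-drawn smooth curve), and the data points are invented for illustration.

```python
def value_at_cf(points, target_cf):
    """Read across from the y-axis: x-value where the graph reaches target_cf."""
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if cf0 <= target_cf <= cf1 and cf1 > cf0:
            return x0 + (target_cf - cf0) * (x1 - x0) / (cf1 - cf0)
    raise ValueError("target_cf is outside the plotted range")


def cf_at_value(points, x):
    """Read up from the x-axis: cumulative frequency at the value x."""
    for (x0, cf0), (x1, cf1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return cf0 + (x - x0) * (cf1 - cf0) / (x1 - x0)
    raise ValueError("x is outside the plotted range")


# Hypothetical (upper bound, cumulative frequency) points, starting at (0, 0).
points = [(0, 0), (10, 8), (20, 28), (30, 36), (40, 40)]
total = points[-1][1]

median = value_at_cf(points, total / 2)
lower_quartile = value_at_cf(points, total / 4)
upper_quartile = value_at_cf(points, 3 * total / 4)
iqr = upper_quartile - lower_quartile
more_than_25 = total - cf_at_value(points, 25)

print(median, iqr, more_than_25)  # 16.0 11.5 8.0
```

All of these are estimates, because the exact values inside each class interval are unknown.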
Histograms:
- Represents continuous data from grouped frequency tables. No gaps between bars.
- Equal Class Widths: x-axis = data, y-axis = frequency. Looks like a bar chart without gaps.
- Unequal Class Widths: Area of bar = frequency, y-axis = frequency density (not frequency). The idea is that the frequency density reflects the ‘concentration’ of things within each range of values.
- Formula: $\text{Frequency Density} = \frac{\text{Frequency}}{\text{Class Width}}$, so $\text{Frequency Density} \times \text{Class Width} = \text{Frequency}$
- Drawing Histograms:
  - 1. Calculate class widths for each class interval
  - 2. Calculate frequency density for each class interval using $FD = \frac{F}{CW}$ formula.
  - 3. Draw a suitable scale on y-axis labelled frequency density.
  - 4. Draw bars using the frequency density data. (Remember the bars have no gaps in between.)
- Estimating frequencies from histograms:
  - With these questions you are using the class widths and frequency densities from the histogram to work out frequencies. Be careful when calculating class width, as some intervals may not include the entire bar.
  - Find the bars that cover the range you need from the question.
  - Work out the frequency for each bar using the $FD \times CW = F$ formula.
  - Add the frequencies.
- Comparing histograms: To compare histograms, they need to have the same class intervals and frequency density scales. When comparing histograms, describe the shape of the distribution and what this shows.
- The Shape of a Distribution: This is the shape formed by the diagram. It can be positively skewed, negatively skewed or symmetrical.
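
The frequency density formula and the method for estimating frequencies from part of a histogram can be sketched as below. The class intervals and frequencies are hypothetical, and the estimate assumes values are spread evenly within each class.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 20), (10, 15, 15), (15, 30, 30)]


def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)


def estimate_frequency(classes, low, high):
    """Estimate the frequency between low and high using FD x CW = F per bar."""
    total = 0.0
    for lower, upper, frequency in classes:
        # Width of the part of this bar that lies inside the requested range.
        width_used = min(upper, high) - max(lower, low)
        if width_used > 0:
            total += frequency_density(lower, upper, frequency) * width_used
    return total


densities = [frequency_density(*c) for c in classes]
print(densities)  # [2.0, 3.0, 2.0]

estimate = estimate_frequency(classes, 5, 20)
print(estimate)  # 35.0
```

The range 5 to 20 uses only half of the first bar and a third of the last bar, which is why the width of each part is worked out separately before multiplying by the frequency density.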
Frequency Polygons:
- Similar to histograms with equal class widths but without the bars. Uses mid-points of class intervals and points are plotted and then joined together with straight lines.

Chapter 3 – Summarising Data

Averages: A measure of central tendency (represents the ‘centre’ of a set of data). Includes mode, median and mean.
- Mode: The one that appears the most (remember the Mo in mode and Mo in most) – the most common value.
  - Modal Class: The class with the highest frequency (the frequency value is not the mode but the column/row next to it).
- Median: The middle value.
  - Discrete Data: Put the numbers in order from smallest to largest. The median is the $\frac{1}{2}(n + 1)$th value.