Chapter 1 Notes: Getting Data (1.1–1.5)

Chapter 1. Getting Data

Section 1.1 Introduction

  • Data exist in our everyday life and are used to answer questions about the world. Research is increasingly data-driven, and quantitative reasoning is a core 21st-century skill for undergraduates across disciplines. An online article from 2021 illustrates how data are involved in current events (e.g., fall in marriages/divorces during COVID-19 restrictions in Singapore) and prompts questions about what data were collected, how conclusions were arrived at, and whether the conclusions are justified.

  • Definition 1.1.2 A population of interest is the entire group (of individuals or objects) about which we wish to know something.

  • Definition 1.1.3 A research question is usually one that seeks to investigate some characteristic of a population.

  • Example 1.1.4 Some research questions:

    • 1) What is the average number of hours that students study each week?
    • 2) Does the majority of students qualify for student loans?
    • 3) Are student athletes more likely than non-athletes to do final year projects?
  • We broadly classify research questions into three categories:
    1) To make an estimate about the population.
    2) To test a claim about the population.
    3) To compare two sub-populations / investigate a relationship between two variables in the population.

  • Example 1.1.5 Having a well-designed research question is a critical starting point for any data-driven research problem. The following table (summarized in notes) contrasts neutral vs. better questions along dimensions such as narrowness, focus, and complexity. Conceptual points include:

    • Narrow vs. Less Narrow: A question should avoid being answered by a single statistic; aim to explore broader context.
    • Unfocused vs. Focused: Focused questions specify the data to be collected and analyzed.
    • Simple vs. Complex: More complex questions require investigation and evaluation, potentially forming an argument.
  • We now describe Exploratory Data Analysis (EDA):

    • Definition 1.1.6 Exploratory Data Analysis (EDA) is a systematic process where we explore a data set and its variables, generating summary statistics and plots. EDA is iterative and continues until useful information emerges to answer the data questions.
    • General steps in EDA (as given):
    1. Generate questions about the data.
    2. Search for answers to the questions using data visualisation tools. In exploration, data modelling (e.g., regression) may also be performed.
    3. Ask: “To what extent do the data we have answer the questions we are interested in?”
    4. Refine existing questions or generate new questions about the data before returning to the data for further exploration.
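The EDA loop above can be sketched with Python's standard library alone; the study-hours data and the question asked of them are hypothetical illustrations:

```python
# Minimal sketch of the EDA loop: generate a question, answer it with
# summary statistics, then decide whether to refine and explore further.
from statistics import mean, median, stdev

# Hypothetical data: weekly study hours reported by a sample of students.
study_hours = [10, 12, 8, 15, 9, 20, 11, 7, 14, 13]

# Steps 1-2: question ("what is a typical value, and how spread out are
# the data?") answered with summary statistics.
print(round(mean(study_hours), 2))   # 11.9
print(median(study_hours))           # 11.5
print(round(stdev(study_hours), 2))  # 3.84

# Steps 3-4: judge how well these numbers answer the question, then refine
# (e.g., compare sub-groups, or plot the distribution) and explore again.
```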

Section 1.2 Sampling

  • A population is the entire group (of individuals or objects) that we wish to know something about. A census is a data collection process in which information is sought from every unit of the population. The numerical facts about the population are called population parameters.

  • Example 1.2.1 Examples of population and population parameters:
    1) The average height (population parameter) of all Primary Six students in a particular primary school (population).
    2) The median number of courses taken (population parameter) by all first-year undergraduates in a University (population).
    3) The standard deviation of the number of hours spent on mobile games (population parameter) by preschoolers aged 4–6 in Singapore (population).

  • Conducting a census is often not feasible due to cost or time constraints and may not yield 100% response. Instead, researchers study a subset of the population. We focus on probability sampling (not censuses or non-probability methods) for the course.

  • Definition 1.2.2 1) It is usually not feasible to gather information from every member of the population, so we look at a sample, which is a portion of the population selected in the study.
    2) Without information from every member, we cannot know the population parameter exactly. The hope is that the sample provides a reasonably good estimate of the population parameter.
    3) A sampling frame is the list from which the sample was obtained. Ideally, the sampling frame should be identical to the target population; in practice, this is not always the case. The term "population" is sometimes used to refer to the sampling frame, so attention is needed when reading studies.

  • Remark 1.2.3 Clarifies sampling frame design:

    • Does the sampling frame include all available sampling units from the population?
    • Does it include irrelevant/extraneous units from another population?
    • Does it contain duplicates?
    • Does it contain units in clusters?
    • Generalisability requires the sampling frame to be equal to or larger than the population of interest. However, even with a complete frame, findings may not generalise automatically.
  • Definition 1.2.4 When sampling from a population, we must avoid bias in the sample. A biased sample undermines generalisation to the population. Two major kinds of sampling bias:
    1) Selection bias – bias from the researcher’s choice of units (e.g., imperfect sampling frame, non-probability sampling).
    2) Non-response bias – bias from participants’ non-disclosure or non-participation (e.g., inconvenience, sensitivity concerns).
    Note: Non-response bias can occur in both probabilistic and non-probabilistic sampling.

  • Example 1.2.5 Two illustrative examples:
    1) A university study of the number of courses taken by first-year undergraduates. Sampling frame consists of undergraduates in two Engineering foundation courses. If students not in those two courses are omitted, selection bias results.
    2) A survey of financial aid among boarding school students using a door-drop questionnaire with requests to drop in if they received aid. Non-response bias can lead to underestimation of the proportion who had received aid (as those who received aid may be reluctant to disclose).

  • Definition 1.2.6 Probability sampling is a sampling scheme where the selection process uses a known randomised mechanism. Each unit in the sampling frame has a known non-zero probability of being selected, not necessarily equal across units. Randomisation helps reduce bias.

  • There are four main types of probability sampling methods:
    1) Simple random sampling (SRS)
    2) Systematic sampling
    3) Stratified sampling
    4) Cluster sampling

  • 1) Simple Random Sampling (SRS)

    • Units are selected randomly from the sampling frame. A simple random sample of size n means every set of n units has an equal chance to be selected, sampling without replacement (a unit selected is removed and cannot be chosen again).
    • Use a random number generator to implement SRS. Different samples from the same frame are different due to chance.
    • Example 1.2.7 Classic lucky-draw scenario: every attendee has a ticket; tickets are drawn one by one without replacement. If the tickets are properly mixed, every remaining ticket is equally likely to be drawn at each step, so every possible set of winners is equally likely; this is exactly SRS.
    • Example 1.2.8 Sample 500 households in Singapore: number the households in the frame from 1 to n, use a random number generator to pick 500 distinct numbers, and contact the corresponding households by phone. Non-response is a common shortcoming of SRS.
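A minimal sketch of SRS using Python's `random.sample`, which draws without replacement; the frame of 5000 numbered households is hypothetical:

```python
# Simple random sampling: every set of 500 households is equally likely.
import random

frame = list(range(1, 5001))          # hypothetical frame: households 1..5000
sample = random.sample(frame, k=500)  # SRS without replacement

assert len(sample) == 500
assert len(set(sample)) == 500        # no unit is selected twice
```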
  • 2) Systematic sampling

    • Select units using a selection interval k and a random starting point. Steps: (a) determine the population size p and the sample size n, and set k = p/n (rounded to a whole number); (b) pick a random starting point r from 1 to k; (c) sample units r, r+k, r+2k, …, r+(n−1)k. If p is unknown, it is still possible to proceed by choosing a random start from the first k units and then taking every k-th unit thereafter.
    • Example 1.2.9 If p = 110 and n = 10, k = 11. Start at a random number from 1 to 11; sample would be {5,16,27,38,49,60,71,82,93,104} if start = 5, or {9,20,31,42,53,64,75,86,97,108} if start = 9.
    • Caution: If the list has inherent patterns, systematic sampling may yield biased samples due to non-random ordering.
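The steps above can be sketched as follows, reproducing the p = 110, n = 10, k = 11 setting of Example 1.2.9 (the function name `systematic_sample` is ours):

```python
# Systematic sampling: random start r in the first interval, then every
# k-th unit after that.
import random

def systematic_sample(p, n):
    k = p // n                    # selection interval
    r = random.randint(1, k)      # random starting point from 1 to k
    return [r + i * k for i in range(n)]

# For p = 110, n = 10 this yields k = 11; with start r = 5 the sample
# would be [5, 16, 27, 38, 49, 60, 71, 82, 93, 104], as in Example 1.2.9.
print(systematic_sample(110, 10))
```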
  • 3) Stratified sampling

    • The sampling frame is divided into strata (groups) whose units are similar within a stratum but may differ across strata. Apply simple random sampling within each stratum to form the overall sample. The strata's sample sizes need not be proportional to their sizes in the population. Stratified sampling improves representation by ensuring every stratum is represented.
    • Example 1.2.10 Elections: sample voters at each polling station (stratum), say 30 voters per location; compute a weighted average of results based on stratum sizes to predict overall votes.
    • Note: The number of voters per station in the sample is the same, but station sizes differ; you weight by stratum size when combining results.
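Combining equal-sized stratum samples with a weighted average can be sketched as follows; the station sizes and sample proportions below are hypothetical:

```python
# Stratified sampling: 30 voters are sampled at each station, but stations
# differ in size, so stratum results are combined with a weighted average.
station_sizes = {"A": 4000, "B": 1000}   # voters registered per polling station
sample_prop   = {"A": 0.55, "B": 0.40}   # proportion for candidate X among
                                         # the 30 sampled voters at each station

total = sum(station_sizes.values())
weighted = sum(station_sizes[s] / total * sample_prop[s] for s in station_sizes)
print(round(weighted, 2))  # 0.52, not the simple average of 0.55 and 0.40
```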
  • 4) Cluster sampling

    • The sampling frame is divided into clusters. A fixed number of clusters are selected via simple random sampling, and all units within the selected clusters are included in the sample.
    • Advantages: usually simpler, cheaper, less resource-intensive; clusters are often naturally defined.
    • Disadvantages: high variability if selected clusters differ substantially from each other; if too few clusters are selected, representativeness may be poor.
    • Example 1.2.11 Mental wellness survey of Primary school students in Singapore: treat each primary school as a cluster; randomly select several schools and survey all students in those schools. Alternatively, sample all students from all schools (i.e., no clustering).
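A minimal sketch of cluster sampling under the school scenario of Example 1.2.11, with hypothetical rosters:

```python
# Cluster sampling: select whole schools (clusters) by SRS, then include
# every student in the selected schools.
import random

schools = {                                    # hypothetical rosters
    "School A": ["a1", "a2", "a3"],
    "School B": ["b1", "b2"],
    "School C": ["c1", "c2", "c3", "c4"],
    "School D": ["d1", "d2", "d3"],
}

chosen = random.sample(list(schools), k=2)     # SRS of clusters
sample = [s for school in chosen for s in schools[school]]  # all units in chosen clusters
print(chosen, len(sample))
```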
  • Remark 1.2.12 Distinguishes stratified vs. cluster sampling:

    • Stratified: randomly sample units from each stratum to ensure representation; aim for a sample indicative of stratum composition.
    • Cluster: select whole clusters; convenient when clusters are homogeneous within but heterogeneous across clusters; often used for practicality rather than precise representation.
  • Summary table: advantages and disadvantages of the four probability sampling methods (conceptual recap)

    • Simple Random Sampling: Advantage — good representation; Disadvantage — time-consuming; requires access to sampling frame.
    • Systematic Sampling: Advantage — simpler than SRS; Disadvantage — potential under-representation if list is not random.
    • Stratified Sampling: Advantage — good representation by stratum; Disadvantage — requires sampling frame and criteria for strata; potential ambiguity in unit classification.
    • Cluster Sampling: Advantage — less time-consuming and costly; Disadvantage — requires clusters to be reasonably similar; risk of high variance if clusters are not similar.
  • Remember: There is no single universally best probability sampling method; each method has its own advantages and disadvantages. All probability sampling methods can produce representative samples if applied correctly.

  • Definition 1.2.13 A non-probability sampling method is one where the selection of units is not random. There is no element of chance in the selection process; human discretion typically drives the sampling.

  • Example 1.2.14 Convenience sampling (non-probability): select subjects who are most easily available, e.g., shoppers at a mall. This introduces selection bias (e.g., affluent demographics may be overrepresented) and potential non-response bias.

  • Example 1.2.14 (continued) Volunteer sampling (self-selected): subjects volunteer; often includes those with strong opinions, not representative of the population.

    • Implication: Non-probability samples tend to be biased and less generalisable.
  • Summary of the sampling process (dominant approach when census is not possible):
    1) Design a sampling frame that covers the population of interest.
    2) Decide on an appropriate sampling method (probability methods preferred over non-probability when generalisability is important).
    3) Remove units that are not part of the population from the generated sample.

  • Remark 1.2.15 Generalisability criteria to improve confidence in applying conclusions from sample to population:
    1) Have a good sampling frame equal to or larger than the population.
    2) Use a probability-based sampling method to minimise selection bias.
    3) Have a large sample size to reduce random variability.
    4) Minimise non-response rate.

Section 1.3 Variables and Summary Statistics

  • Definition 1.3.1 A variable is an attribute that can be measured or labelled. A data set is a collection of individuals and variables pertaining to the individuals. In studies of relationships between variables, there is typically a distinction between independent and dependent variables.

  • Definition 1.3.2 Independent variables are those that may be manipulated (deliberately or spontaneously) in a study. Dependent variables are those hypothesised to change in response to manipulation of the independent variable. Note: The dependent variable is hypothesised to change when the independent variable is manipulated, but it may not actually change in every case.

  • Example 1.3.3 Illustrations of independent and dependent variables:
    1) Time spent on computer gaming (independent) vs. examination scores (dependent).
    2) Brand of tissue paper (independent) vs. amount of water absorbed (dependent).
    3) Drinking at least 2 glasses of orange juice per day (Yes/No, independent) and whether cholesterol level next year is lower than this year (Yes/No, dependent).

  • Definition 1.3.4 Categorical vs. Numerical variables:
    1) Categorical variables take categories/labels. They are mutually exclusive (an observation cannot belong to two categories at once).
    2) Numerical variables take numerical values and allow arithmetic operations.
    3) Sub-types of categorical variables:

    • Ordinal: natural ordering (e.g., happiness index with ordered levels).
    • Nominal: no intrinsic ordering (e.g., YES/NO, or brands).
    4) Sub-types of numerical variables:
    • Discrete: gaps in the set of possible values (e.g., number of children).
    • Continuous: can take all values in a range (e.g., height, weight).
  • Note: The notion of association between variables will be discussed in Chapter 2.

  • Example 1.3.5 Illustrates variable types:
    1) Happiness index for Secondary school students is ordinal (1 to 5 scale).
    2) Drinking at least 2 glasses of orange juice (YES/NO) is nominal.
    3) Number of students with A in Mathematics (PSLE) is discrete.
    4) Height/weight are continuous.

  • A common data presentation is a table with rows (individuals) and columns (variables). Data sets may include both categorical and numerical variables.

  • Data visualization helps reveal trends and patterns; summary statistics enable numerical comparisons between groups.

  • Summary statistics for numerical variables are divided into two broad categories:

    • Central tendency measures: mean, median, and mode.
    • Dispersion measures: standard deviation and interquartile range.

Section 1.4 Summary Statistics - Mean

  • Definition 1.4.1 The mean is the average value of a numerical variable x. The mean of x is denoted by
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.
  • Example 1.4.2 If the bill lengths (mm) of 7 penguins are 46.9, 36.5, 36.4, 34.5, 33.1, 38.6, 43.2, then
    \bar{x} = \frac{46.9 + 36.5 + 36.4 + 34.5 + 33.1 + 38.6 + 43.2}{7} \approx 38.46. (rounded to 2 decimals)
  • Remark 1.4.3 Properties of the mean:
    1) (x_1 + x_2 + \cdots + x_n = n\bar{x}). This allows computing the sum if you know the mean and n.
    Example: Mean of 1, 6, 8 is (\bar{x} = (1+6+8)/3 = 5), so the sum is (3 \times 5 = 15).
    2) Adding a constant c to all data points changes the mean by c: if the original mean is \bar{x}, then the new mean is \bar{x} + c.
    Example: Mean of 1, 6, 8 is 5; after adding 3 to each value, new mean is 8.
    3) Multiplying all data points by a constant c scales the mean by c: if the original mean is \bar{x}, then the new mean is c\bar{x}.
    Example: Mean of 2, 7, 12 is 7; after multiplying by 2, new mean is 14.
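The three properties in Remark 1.4.3 can be checked numerically; a short sketch using Python's `statistics` module:

```python
# Numeric check of the three properties of the mean in Remark 1.4.3.
from statistics import mean

data = [1, 6, 8]
xbar = mean(data)
assert xbar == 5
assert sum(data) == len(data) * xbar               # 1) sum = n * mean
assert mean([x + 3 for x in data]) == xbar + 3     # 2) add c: mean becomes mean + c
assert mean([2 * x for x in [2, 7, 12]]) == 2 * 7  # 3) multiply by c: mean becomes c * mean
```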
  • Example 1.4.4 A data set with daily weather data is used to illustrate questions such as: 1) Which month in 2020 had the most rainfall? 2) If the mean monthly rainfall in 2020 was 157.22 mm, what is the total rainfall for 2020? 3) Are wind speed and temperature related? What about rainfall and wind speed? 4) Does 2020 weather pattern help predict 2021?
    • To answer Q2, total rainfall is computed as
      12 \times 157.22 = 1886.64 \text{ mm}.
    • It is not true that rainfall is constant every month even if the mean is 157.22 mm; the mean alone does not describe distribution.
    • The mean is useful but insufficient to understand distribution; spread matters (to be addressed later).
  • Example 1.4.5 Two schools A and B: average marks are 32.21 (A, 349 students) and 30.72 (B, 46 students). To know the overall mean for all 395 students, you cannot simply average the two means; you need the group sizes:
    • Weighted mean:
      \frac{349}{395} \times 32.21 + \frac{46}{395} \times 30.72 = 32.04.
    • The overall mean lies between the two subgroup means (closer to the larger subgroup's mean).
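The weighted-mean calculation in Example 1.4.5 can be verified directly:

```python
# Weighted mean of two schools' averages (Example 1.4.5): weight each
# subgroup mean by its group size, rather than averaging the two means.
n_A, n_B = 349, 46
mean_A, mean_B = 32.21, 30.72

overall = (n_A * mean_A + n_B * mean_B) / (n_A + n_B)
print(round(overall, 2))  # 32.04, between 30.72 and 32.21, closer to school A's mean
```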
  • Example 1.4.6 Proportions as a special case of a mean:
    • Drug effectiveness comparison: new drug vs existing drug with counts of patients and asthma attacks.
    • New drug: 500 patients, 200 attacks; Proportion with attack = \frac{200}{500} = 0.4.
    • Existing drug: 1000 patients, 300 attacks; Proportion = \frac{300}{1000} = 0.3.
    • The comparison is about proportions, not raw counts, because group sizes differ.
    • Conceptual link: If we encode outcomes as 1 (attack) and 0 (no attack), the mean of the 1/0 values equals the proportion of attacks, tying proportions to means.
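The 1/0 encoding in Example 1.4.6 can be checked directly: the mean of the encoded values equals the proportion of attacks.

```python
# A proportion is the mean of a 0/1 variable: encode 1 = attack, 0 = no attack.
from statistics import mean

new_drug = [1] * 200 + [0] * 300   # 500 patients on the new drug, 200 attacks
existing = [1] * 300 + [0] * 700   # 1000 patients on the existing drug, 300 attacks

assert mean(new_drug) == 200 / 500   # 0.4
assert mean(existing) == 300 / 1000  # 0.3
```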
  • Section 1.4 notes emphasize that the mean alone cannot describe the distribution and should not serve as the sole descriptor of the data.

Section 1.5 Summary Statistics - Variance and Standard Deviation

  • Recall from earlier: knowing the mean does not inform about distribution spread. Standard deviation is a measure of spread about the mean.

  • Definition 1.5.1 The Sample Variance and Standard Deviation:

    • Sample variance:
      \mathrm{Var} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}.
    • Standard deviation:
      s_x = \sqrt{\mathrm{Var}}.
    • Here, (x_1, x_2, \ldots, x_n) are the observations, and (\bar{x}) is their mean.
  • Why squared deviations? If we sum raw deviations, they cancel to zero (since (\sum (x_i - \bar{x}) = 0)). Squaring avoids cancellation and provides a measure of dispersion.

  • Remark 1.5.2 There is justification for dividing by (n-1) (not covered in this course) when computing the sample variance; it makes the estimate unbiased for the population variance.

  • Example 1.5.3 Data on the highest temperature on the first day of each month (Jan–Dec): 30.1, 31.1, 31.8, 32.1, 31.9, 32.6, 33.0, 32.4, 32.0, 32.5, 31.3, 29.6.

    • Mean ≈ 31.7.
    • Sample Variance ≈ 1.038; Standard Deviation ≈ 1.019.
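The figures in Example 1.5.3 can be checked with Python's `statistics` module, whose `variance` and `stdev` divide by n − 1 as in Definition 1.5.1:

```python
# Check of Example 1.5.3: sample variance and standard deviation of the
# twelve first-of-month highest temperatures.
from statistics import mean, variance, stdev

temps = [30.1, 31.1, 31.8, 32.1, 31.9, 32.6,
         33.0, 32.4, 32.0, 32.5, 31.3, 29.6]

print(round(mean(temps), 1))      # 31.7
print(round(variance(temps), 3))  # 1.038  (sum of squared deviations / (n - 1))
print(round(stdev(temps), 3))     # 1.019  (square root of the variance)
```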
  • Remark 1.5.4 Properties of the standard deviation:
    1) The standard deviation is always non-negative and is zero only when all data points are identical.
    2) The standard deviation has the same unit as the data variable (e.g., kilograms if the variable is weight).

  • End of Section 1.5: The materials highlight that standard deviation provides information about dispersion and units, complementing the mean for a fuller summary of a data set.

  • Note throughout these notes:

    • The notion of association between variables is introduced as a topic for Chapter 2.
    • The content above reflects the core ideas in Sections 1.1 through 1.5 of the provided transcript, including definitions, examples, and key formulas in LaTeX syntax.