W1 Lecture 2 Notes

File Extensions

  • File extensions indicate the format of a file (e.g., .PNG, .TXT).

  • Common dataset formats for rectangular table data:

    • TXT (text file)

    • CSV (comma-separated values): Columns are separated by commas.
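
As a rough illustration of what "columns are separated by commas" means in practice, the sketch below reads a small CSV with Python's standard csv module. The column names and values are made up for the example and do not come from any course dataset.

```python
import csv
import io

# Hypothetical file contents: a header row, then one row per case.
raw_text = "id,height_cm,hemisphere\n1,172.5,South\n2,160.0,North\n"

# csv.DictReader splits each line on commas and uses the header row as keys.
rows = list(csv.DictReader(io.StringIO(raw_text)))

for row in rows:
    print(row["id"], row["height_cm"], row["hemisphere"])
```

Every value is read as text, so numeric columns such as height_cm would need converting (for example with float) before any arithmetic.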

Datasets

  • Datasets used in Wiki Mobius lessons are typically saved as CSVs.

  • A survey dataset from the 2024 T1 cohort is available on Moodle.

Key Concepts

  • Population, cases, and variables are the essential building blocks of a statistical study.

  • Differentiating between categorical and quantitative variables is crucial for analysis.

Obtaining Data

  • Data collection deserves as much emphasis as data analysis and should not be overshadowed by it.

  • Bad data leads to bad results, and the flaws may not be apparent.

  • Seek statistical consultation before collecting data, as data collection is the most expensive part of a study.

Sequence of Events in Statistical Analysis

  1. Identify research questions.

  2. Define the population of interest.

  3. Determine the variables to measure.

  4. Collect data on those variables.

Types of Data

  • Anecdotal Data: Information collected casually or informally.

    • Personal accounts, subjective experiences.

    • Example: A student claiming coffee improves concentration based on one instance.

  • Importance of Anecdotal Data: Can be the catalyst for intuition or further study.

  • Available Data: Data produced for a purpose other than the current study.

    • Preexisting data.

    • Example: University enrollment records.

    • Example: Australian Bureau of Statistics (ABS) data.

  • Collected Data: Data gathered or simulated specifically for a study.

Examples

  • Anecdotal Data: "Doctor X is hard to understand, so Doctor X is not a good lecturer."

  • Available Data: Enrollment data, weather data from the Bureau of Meteorology, smartwatch data.

  • Collected Data: Surveys, counts of specific items.

Population vs. Sample

  • Population: The entire group of items or people of interest.

  • Sample: A subset of the population examined.

  • Example: If studying the average height of UNSW students, the population is all UNSW students, while the sample is a selection of them.

  • Census: Systematically acquiring information about everyone in the population.
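
The difference between a census and a sample can be sketched with a small simulation. The heights below are simulated rather than real UNSW data, and the population size, mean, and spread are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 50,000 simulated student heights in cm.
population = [rng.gauss(170, 10) for _ in range(50_000)]

# A census uses every member of the population.
census_mean = statistics.mean(population)

# A sample uses only a subset, here 200 students chosen at random.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"census mean: {census_mean:.1f} cm")
print(f"sample mean: {sample_mean:.1f} cm")
```

Under these assumptions the sample mean lands close to the census mean while measuring only a small fraction of the population.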

Types of Samples

  • Sample Survey: Collecting data by asking questions.

  • Voluntary Samples: Participants volunteer themselves.

    • Disadvantage: Potential for bias.

  • Convenience Sampling: Using easily accessible subjects.

    • Problem: May not represent the population well.
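
One way to see why convenience sampling may not represent the population is to compare it with a random sample in a simulation. Everything here is hypothetical: the assumption that on-campus students are easier to reach, the 20% share, and the getting-ready times are invented for the sketch.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: minutes to get ready. Students living on campus
# (assumed to be about 20% of the population) are assumed to take less time.
population = []
for _ in range(10_000):
    on_campus = rng.random() < 0.2
    minutes = rng.gauss(25 if on_campus else 40, 5)
    population.append((on_campus, minutes))

true_mean = statistics.mean([m for _, m in population])

# Convenience sample: only on-campus students, because they are easy to reach.
convenience = [m for on_campus, m in population if on_campus][:200]

# Random sample: every student has the same chance of being selected.
random_sample = [m for _, m in rng.sample(population, 200)]

convenience_error = abs(statistics.mean(convenience) - true_mean)
random_error = abs(statistics.mean(random_sample) - true_mean)

print(f"convenience sample error: {convenience_error:.1f} min")
print(f"random sample error: {random_error:.1f} min")
```

Both samples have the same size, so the gap between the two errors comes from how the subjects were chosen, not from how many were chosen.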

Reasons to Sample Instead of Conducting a Census

  • Time and money constraints.

  • Difficulty accessing the entire population.

  • Ethical considerations.

  • Samples can provide powerful insights without surveying everyone.

Example: Time to Get Ready

  • Research question: How long does it take UNSW students to get ready in the morning on average?

  • Population: UNSW students.

  • Variable: Time taken to get ready in the morning.

  • Data Collection: Record the time taken by a sample of students.

  • Census: As a condition of enrolling, students provide UNSW with their getting-ready time.

Data Sets

  • A data set consists of observations (cases) and the variables measured on them.

  • Each observation has a value for each variable (unless missing).

  • IDs are used to distinguish observations.
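
A minimal sketch of this structure, assuming a made-up three-row data set: each dictionary is one observation, each key is a variable, the id field distinguishes observations, and None stands in for a missing value.

```python
# Hypothetical data set: each dict is one observation (case),
# each key is a variable, and None marks a missing value.
dataset = [
    {"id": 1, "minutes_to_get_ready": 30, "faculty": "Science"},
    {"id": 2, "minutes_to_get_ready": None, "faculty": "Engineering"},
    {"id": 3, "minutes_to_get_ready": 45, "faculty": "Science"},
]

# Variables are the columns other than the identifier.
variables = [name for name in dataset[0] if name != "id"]

# List (id, variable) pairs where an observation has no recorded value.
missing = [
    (obs["id"], var)
    for obs in dataset
    for var in variables
    if obs[var] is None
]

print("variables:", variables)
print("missing values:", missing)
```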

Categorical vs. Quantitative Variables

  • Categorical Variable: Places individuals into categories.

  • Quantitative Variable: Takes numerical values where arithmetic makes sense.

    • Example: Temperature (quantitative), Hemisphere (categorical).

  • Units are essential for quantitative variables.
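
The distinction matters because the two kinds of variable are summarised differently. In the sketch below, with invented values, the quantitative variable (temperature in degrees Celsius) is averaged, while the categorical variable (hemisphere) is counted by category.

```python
from collections import Counter
import statistics

# Hypothetical observations: temperature in degrees Celsius (quantitative)
# and hemisphere as a label (categorical).
temperatures_c = [21.5, 18.0, 30.2, 25.3]
hemispheres = ["South", "North", "South", "South"]

# Arithmetic makes sense for a quantitative variable, so a mean is meaningful.
mean_temp = statistics.mean(temperatures_c)

# A categorical variable is summarised by counting cases in each category.
hemisphere_counts = Counter(hemispheres)

print(f"mean temperature: {mean_temp:.2f} degrees C")
print("hemisphere counts:", dict(hemisphere_counts))
```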

Scenarios

  • Larger hospitals have longer patient stays.

  • Married men have higher incomes.

Types of Studies

  • Observational Study: Observe variables without intervention.

  • Experiment: Impose a treatment or intervention and observe the response.

Observational Studies

  • Used to find associations between variables.

  • Association does not imply causation.

Examples of Association vs. Causation

  • Spending on science and technology vs. suicides.

  • Age of Miss America vs. murders by steam/hot vapors.

  • Deaths tangled in bedsheets vs. cheese consumption per capita.
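
Pairings like these show strong association simply because both series happen to trend in the same direction over time. The sketch below computes a Pearson correlation coefficient by hand for two made-up yearly series; the numbers are invented for illustration and are not the real figures behind the examples above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up yearly figures (not real data): both series simply rise over time.
cheese_kg_per_capita = [13.0, 13.4, 13.9, 14.1, 14.6, 15.0]
bedsheet_deaths = [400, 430, 470, 480, 520, 560]

r = pearson_r(cheese_kg_per_capita, bedsheet_deaths)
print(f"correlation: {r:.3f}")
```

A correlation near 1 here says only that the two series move together, not that one causes the other.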

Explanatory vs. Response Variables

  • Explanatory Variable: Used to predict the response variable.

  • Response Variable: The outcome that changes in response to the explanatory variable.

Lurking Variable

  • Unobserved variable that influences the interpretation of relationships.

  • Example: Fitness level increases with exercise, but genetics or age may act as a lurking variable.

Confounding Variables

  • Confounding variables are variables whose effects on the response cannot be distinguished from one another without further investigation.

  • Confounding Variable: A variable, often unobserved, that influences the response in a way that is hard to separate from the effect of the explanatory variable.

    • A confounding variable may itself be an explanatory variable.

Examples: Types of Relationships

  • Ice cream sales and heat strokes:

    • Association is due to a confounding variable (heat).

  • Height of tides and position of the moon:

    • Causation due to the gravitational force of the moon.

  • Body mass index (BMI) of children and parents:

    • Association due to shared habits and diets.
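
The ice cream case can be sketched as a simulation in which heat drives both variables and neither affects the other. The coefficients, noise levels, and units are assumptions made up for the example. If heat is the only link, the association should largely disappear once temperature is held roughly constant.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(2)

# Hypothetical daily data: temperature drives both variables independently.
temps, ice_cream_sales, heat_strokes = [], [], []
for _ in range(5000):
    temp = rng.uniform(10, 40)  # degrees Celsius
    temps.append(temp)
    ice_cream_sales.append(10 * temp + rng.gauss(0, 20))  # arbitrary units
    heat_strokes.append(0.5 * temp + rng.gauss(0, 2))     # arbitrary units

# Across all days the two variables look strongly associated.
r_all = pearson_r(ice_cream_sales, heat_strokes)

# Restrict to days with nearly the same temperature (24 to 26 degrees).
band = [i for i, t in enumerate(temps) if 24 <= t <= 26]
r_band = pearson_r(
    [ice_cream_sales[i] for i in band],
    [heat_strokes[i] for i in band],
)

print(f"all days: r = {r_all:.2f}")
print(f"similar-temperature days: r = {r_band:.2f}")
```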

Experiments vs. Observational Study for Causation

  • Experiments provide good evidence for causation because the intervention is controlled.
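
One common form of controlled intervention is random assignment of subjects to treatment and control groups, which is an assumption added here and not something the notes specify. The sketch below, using simulated ages as a stand-in for a lurking variable, suggests why this helps: the two groups end up with similar ages even though age was never used in the assignment.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical subjects: age acts as a possible lurking variable.
ages = [rng.randint(18, 60) for _ in range(1000)]

# Randomly assign half of the subjects to treatment and half to control.
subjects = list(range(len(ages)))
rng.shuffle(subjects)
treatment = subjects[:500]
control = subjects[500:]

mean_age_treatment = statistics.mean([ages[i] for i in treatment])
mean_age_control = statistics.mean([ages[i] for i in control])

print(f"mean age, treatment: {mean_age_treatment:.1f}")
print(f"mean age, control: {mean_age_control:.1f}")
```

Because the groups are balanced on age on average, a later difference in the response is harder to attribute to age and easier to attribute to the treatment.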

Smoking and Cancer

  • Association between lung cancer and smoking.

  • Possible confounding variables: genetics and stress.

  • Causation: smoking contributes to the development of lung cancer over time.

Ethical Way to Study Causation

  • Causal inference statistics can be used to determine causation with experimentation.