W1 Lecture 2 Notes
File Extensions
File extensions indicate the format of a file (e.g., .PNG, .TXT).
Common dataset formats for rectangular table data:
TXT (text file)
CSV (comma-separated values): Columns are separated by commas.
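As a rough illustration of what "columns are separated by commas" means in practice, the sketch below reads a small CSV with Python's standard csv module. The column names and values (id, hemisphere, temperature_c) are made up for this example and are not from any course dataset.

```python
import csv
import io

# Hypothetical CSV text; the column names and values are made up for illustration.
csv_text = """id,hemisphere,temperature_c
1,Northern,21.5
2,Southern,14.0
3,Southern,17.2
"""

# csv.DictReader uses the first row as column names and splits each
# later row on commas, giving one dictionary per row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["id"], row["hemisphere"], row["temperature_c"])
```

Note that every value is read as text; numeric columns need converting (for example with float) before doing arithmetic.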
Datasets
Datasets used in Wiki Mobius lessons are typically saved as CSVs.
A survey dataset from the 2024 T1 cohort is available on Moodle.
Key Concepts
Population, cases, and variables are the essential concepts.
Differentiating between categorical and quantitative variables is crucial for analysis.
Obtaining Data
Data collection deserves emphasis; it should not be overshadowed by data analysis.
Bad data leads to bad results, and the flaws may not be apparent.
Seek statistical consultation before collecting data, as data collection is the most expensive part of a study.
Sequence of Events in Statistical Analysis
Identify research questions.
Define the population of interest.
Determine the variables to measure.
Collect data on those variables.
Types of Data
Anecdotal Data: Information collected casually or informally.
Personal accounts, subjective experiences.
Example: A student claiming coffee improves concentration based on one instance.
Importance of Anecdotal Data: Can be the catalyst for intuition or further study.
Available Data: Data produced for a purpose other than the current study.
Preexisting data.
Example: University enrollment records.
Australian Bureau of Statistics (ABS) data.
Collected Data: Data gathered or simulated specifically for a study.
Examples
Anecdotal Data: "Doctor X is hard to understand, so Doctor X is not a good lecturer."
Available Data: Enrollment data, weather data from the Bureau of Meteorology, smartwatch data.
Collected Data: Surveys, counts of specific items.
Population vs. Sample
Population: The entire group of items or people of interest.
Sample: A subset of the population examined.
Example: If studying the average height of UNSW students, the population is all UNSW students, while the sample is a selection of them.
Census: Systematically acquiring information about everyone in the population.
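The difference between a census and a sample can be sketched with simulated data. The population below is hypothetical: 10,000 simulated student heights drawn from an assumed distribution, not real UNSW measurements. The names population, census_mean and sample_mean are chosen for this example only.

```python
import random
import statistics

# Hypothetical population: heights (cm) of 10,000 simulated students.
# The distribution parameters are made up for illustration.
rng = random.Random(2024)  # fixed seed so the run is repeatable
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Census: measure everyone in the population.
census_mean = statistics.mean(population)

# Sample: measure a randomly chosen subset of 200 students.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"Census mean height: {census_mean:.1f} cm")
print(f"Sample mean height: {sample_mean:.1f} cm")
```

Under these assumptions the sample mean should land close to the census mean while measuring only 2% of the population.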
Types of Samples
Sample Survey: Collecting data by asking questions.
Voluntary Samples: Participants volunteer themselves.
Disadvantage: Potential for bias.
Convenience Sampling: Using easily accessible subjects.
Problem: May not represent the population well.
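A minimal sketch of why convenience sampling can misrepresent a population. It assumes, purely for illustration, that the most accessible students are also the ones who get ready fastest. The comparison with a simple random sample is an addition for contrast, and all values are made up.

```python
import random
import statistics

# Hypothetical population: getting-ready times (minutes) for 1,000 students,
# ordered from quickest to slowest. Values are made up for illustration.
population = [10 + 0.05 * i for i in range(1000)]
population_mean = statistics.mean(population)

# Convenience sample: the first 50 students we happen to meet. Here we assume
# they are the early arrivals, who also tend to get ready quickly.
convenience_sample = population[:50]

# Simple random sample: every student has the same chance of selection.
rng = random.Random(1)  # fixed seed so the run is repeatable
random_sample = rng.sample(population, 50)

convenience_error = abs(statistics.mean(convenience_sample) - population_mean)
random_error = abs(statistics.mean(random_sample) - population_mean)

print(f"Population mean: {population_mean:.1f} min")
print(f"Convenience sample error: {convenience_error:.1f} min")
print(f"Random sample error: {random_error:.1f} min")
```

Both samples have the same size; only the way subjects were chosen differs.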
Reasons to Sample Instead of Conducting a Census
Time and money constraints.
Difficulty accessing the entire population.
Ethical considerations.
Samples can provide powerful insights without surveying everyone.
Example: Time to Get Ready
Research question: How long does it take UNSW students to get ready in the morning on average?
Population: UNSW students.
Variable: Time taken to get ready in the morning.
Data Collection: Record the time taken by a sample of students.
Census: As a condition of enrolling, every student would provide UNSW with their getting-ready time.
Data Sets
A data set consists of observations (cases) and variables.
Each observation has a value for each variable (unless missing).
IDs are used to distinguish observations.
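One possible way to picture this structure in code is a list of records, one per observation, with one key per variable. The data set below is hypothetical, and the use of None to mark a missing value is a choice made for this sketch.

```python
# Hypothetical data set: each dictionary is one observation (case),
# and each key is a variable. All values are made up for illustration.
dataset = [
    {"id": 1, "hemisphere": "Northern", "temperature_c": 21.5},
    {"id": 2, "hemisphere": "Southern", "temperature_c": None},  # missing value
    {"id": 3, "hemisphere": "Southern", "temperature_c": 17.2},
]

variables = list(dataset[0].keys())

# IDs should be unique so that each observation can be told apart.
ids = [row["id"] for row in dataset]
ids_are_unique = len(ids) == len(set(ids))

# Find observations with a missing value for any variable.
incomplete_ids = [
    row["id"] for row in dataset if any(value is None for value in row.values())
]

print("Variables:", variables)
print("IDs unique:", ids_are_unique)
print("Observations with missing values:", incomplete_ids)
```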
Categorical vs. Quantitative Variables
Categorical Variable: Places individuals into categories.
Quantitative Variable: Takes numerical values where arithmetic makes sense.
Example: Temperature (quantitative), Hemisphere (categorical).
Units are essential for quantitative variables.
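The distinction shows up in how each type of variable is summarised. The sketch below uses made-up values for the temperature and hemisphere example: categories are counted, while numerical values are averaged and reported with their units.

```python
import statistics
from collections import Counter

# Hypothetical observations; values are made up for illustration.
hemisphere = ["Northern", "Southern", "Southern", "Northern", "Southern"]  # categorical
temperature_c = [21.5, 14.0, 17.2, 25.1, 12.2]  # quantitative, in degrees Celsius

# Categorical variables are summarised by counting cases in each category.
hemisphere_counts = Counter(hemisphere)

# Quantitative variables support arithmetic, such as averaging,
# and the result carries the same units as the data.
mean_temperature_c = statistics.mean(temperature_c)

print(hemisphere_counts)
print(f"Mean temperature: {mean_temperature_c:.1f} degrees Celsius")
```

Averaging the hemisphere labels would be meaningless, which is a quick check for whether a variable is categorical.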
Scenarios
Larger hospitals have longer patient stays.
Married men have higher incomes.
Types of Studies
Observational Study: Observe variables without intervention.
Experiment: Impose a treatment or intervention and observe the response.
Observational Studies
Used to find associations between variables.
Association does not imply causation.
Examples of Association vs. Causation
Spending on science and technology vs. suicides.
Age of Miss America vs. murders by steam/hot vapors.
Deaths tangled in bedsheets vs. cheese consumption per capita.
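Pairs like these can show a strong correlation simply because both quantities happen to trend in the same direction over time. The sketch below computes a Pearson correlation for two made-up series; the numbers are NOT the real figures behind the examples above, and pearson_r is a helper written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up yearly figures for two unrelated quantities that both trend upward.
# These are NOT real data.
series_a = [10, 12, 13, 15, 18, 20, 23, 25]
series_b = [3.1, 3.3, 3.2, 3.6, 3.9, 4.0, 4.4, 4.5]

r = pearson_r(series_a, series_b)
print(f"Correlation: {r:.2f}")
```

A correlation close to 1 here says nothing about whether one series causes the other.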
Explanatory vs. Response Variables
Explanatory Variable: Used to predict the response variable.
Response Variable: The outcome that is measured and that may change with the explanatory variable.
Lurking Variable
Unobserved variable that influences the interpretation of relationships.
Example: Fitness level increases with exercise, with genetics or age as a lurking variable.
Confounding Variables
Confounding Variable: A variable, often unobserved, that influences the response in a way that cannot be distinguished from the effect of the explanatory variable without further investigation.
A confounding variable may itself be an explanatory variable.
Examples: Types of Relationships
Ice cream sales and heat strokes: Association due to a confounding variable (heat).
Height of tides and position of the moon: Causation due to the gravitational force of the moon.
Body mass index (BMI) of children and parents: Association due to shared habits and diets.
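The ice cream example can be sketched as a simulation. The model below is an assumption made for illustration: temperature drives both ice cream sales and heat strokes, and neither affects the other. All coefficients are made up. The partial correlation used at the end is a standard technique that these notes do not cover; it is included only to show the association fading once temperature is accounted for.

```python
import math
import random

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 500

# Simulated data under an assumed model: temperature drives both variables,
# and neither affects the other. All coefficients are made up.
temperature = [rng.uniform(15, 40) for _ in range(n)]
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
heat_strokes = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_xy = pearson_r(ice_cream_sales, heat_strokes)
r_xz = pearson_r(ice_cream_sales, temperature)
r_yz = pearson_r(heat_strokes, temperature)

# Partial correlation: the association between sales and heat strokes
# that remains after accounting for temperature.
partial_r = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(f"Correlation, sales vs heat strokes: {r_xy:.2f}")
print(f"Partial correlation given temperature: {partial_r:.2f}")
```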
Experiments vs. Observational Studies for Causation
Experiments provide good evidence for causation because the intervention is controlled.
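One common way to keep the intervention controlled is to let chance decide who receives the treatment. The notes do not specify an assignment method, so the sketch below is an assumption; the participant IDs and group sizes are hypothetical.

```python
import random

# Hypothetical participant IDs; the group size of 20 is made up.
participants = list(range(1, 21))

rng = random.Random(42)  # fixed seed so the assignment is repeatable
shuffled = participants[:]
rng.shuffle(shuffled)

# The researcher imposes the treatment: chance, not the participants,
# decides who receives it.
half = len(shuffled) // 2
treatment_group = sorted(shuffled[:half])
control_group = sorted(shuffled[half:])

print("Treatment:", treatment_group)
print("Control:  ", control_group)
```

Because assignment does not depend on any characteristic of the participants, lurking variables such as age or genetics should be spread roughly evenly across the two groups.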
Smoking and Cancer
Association between lung cancer and smoking.
Possible confounding variables: genetics and stress.
Causation: smoking leads to the development of lung cancer over time.
Ethical Way to Study Causation
Causal inference statistics can be used to assess causation without experimentation.