Data Concepts: Observations, Variables, Tidy Data, and Data Types
Data Sources and Descriptive Concepts
Data are observations collected from sources such as field surveys, experiments, and other data collection methods, and they form the backbone of statistical investigation.
A summary statistic is a single number that summarizes data from a sample.
The most common example is the mean (average), which describes the central tendency of the data.
Observations (Cases) and Variables
In data tables, every row is called an observation or a case.
A variable is represented by a column and corresponds to a characteristic measured across observations.
Graphs and tables display columns (variables) to convey the data in a structured way.
What is tidy data?
Tidy data is a data frame format where:
Each row is a unique case (observation).
Each column is a variable.
Each cell contains a single value.
Tidiness makes data organized and easy to analyze, plot, and manipulate consistently.
Example (conceptual):
Rows: different individuals or units (e.g., Person 1, Person 2, …).
Columns: variables like age, income, education level, etc.
Each cell holds one measured value for that person and variable.
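The conceptual example above can be sketched in Python as a list of rows. This is only an illustration: the names and values are made up, and `column` and `looks_tidy` are hypothetical helpers, not part of any library. The check covers structure only (same variables in every row, one scalar per cell); it does not verify that each row is a unique case.

```python
# Made-up tidy table: one dict per row (observation), one key per column (variable).
people = [
    {"person": "Person 1", "age": 34, "income": 52000.0, "education": "Bachelor"},
    {"person": "Person 2", "age": 45, "income": 61000.0, "education": "High school"},
    {"person": "Person 3", "age": 29, "income": 48000.0, "education": "Master"},
]


def column(rows, variable):
    """Return the values of one variable (one column) across all rows."""
    return [row[variable] for row in rows]


def looks_tidy(rows):
    """Structural check: every row has the same variables and each cell is one scalar."""
    if not rows:
        return True
    variables = set(rows[0])
    scalar_types = (int, float, str, bool)
    return all(
        set(row) == variables
        and all(isinstance(value, scalar_types) for value in row.values())
        for row in rows
    )


print(column(people, "age"))   # the "age" variable for every observation
print(looks_tidy(people))      # structural tidiness check
```

Because every row has the same variables and every cell holds one value, pulling out a column is a one-line operation. A cell that held a list of two ages would fail the check.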
Continuous vs Discrete Data
Continuous data:
Numbers that can take any value within a range (including fractions).
Arithmetic operations such as addition, subtraction, and averaging are meaningful for them.
Example: unemployment rate, which can be added, subtracted, and averaged across time periods.
Note: continuous data vary smoothly rather than in steps and are treated as real-valued measurements.
Discrete data:
Values that are countable: they take separate, distinct values, often whole numbers or symbolic categories (e.g., area codes).
Not all arithmetic operations are meaningful (e.g., you generally cannot meaningfully add or average area codes).
Discrete data can be counts (e.g., number of visits) or categorical labels.
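A minimal Python sketch of the distinction, using made-up values. Storing area codes as strings is one possible convention (an assumption here, not something the notes prescribe) that keeps them from being added or averaged by accident.

```python
from collections import Counter
from statistics import mean

# Continuous: unemployment rates (percent) are real-valued, so averaging is meaningful.
unemployment_rates = [5.1, 4.8, 5.0]  # made-up values
average_rate = mean(unemployment_rates)

# Discrete, categorical labels: area codes are kept as strings, so the
# meaningful summary is a count per code, not a sum or an average.
area_codes = ["212", "415", "212", "617", "415", "212"]  # made-up values
code_counts = Counter(area_codes)

# Discrete, counts: whole-number counts such as visits can be totaled.
visits = [3, 0, 2, 5]  # made-up values
total_visits = sum(visits)

print(round(average_rate, 2))
print(code_counts.most_common(1))
print(total_visits)
```

The average of the rates is a sensible quantity, while the area codes are only ever counted.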
Summary Statistics (from the Transcript)
Definition: a single number that summarizes data from the sample.
Common example: the average (mean).
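Formula for the sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations in the sample and $x_i$ is the $i$-th observed value.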
Example application (conceptual):
Suppose you have unemployment rates across five time points: 4.0, 4.2, 3.9, 4.5, 4.1.
The sample mean would be: $\bar{x} = \frac{4.0 + 4.2 + 3.9 + 4.5 + 4.1}{5} = \frac{20.7}{5} = 4.14$.
This 4.14% represents the central tendency of the observed unemployment rates in the sample.
Important distinction: these statistics summarize data from the sample, not the entire population.
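The unemployment-rate example can be checked with a short Python sketch. The function name `sample_mean` is a hypothetical helper chosen for illustration; it simply divides the sum of the observations by their count.

```python
def sample_mean(values):
    """Sample mean: the sum of the observations divided by their count n."""
    n = len(values)
    if n == 0:
        raise ValueError("sample mean is undefined for an empty sample")
    return sum(values) / n


# The five unemployment rates (percent) from the example above.
rates = [4.0, 4.2, 3.9, 4.5, 4.1]
x_bar = sample_mean(rates)

print(f"{x_bar:.2f}%")  # prints 4.14%
```

The result is formatted to two decimal places because floating-point addition can leave a tiny rounding error in the last digits.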
Connections to Practice and Relevance
Data sources and their structure (observations and variables) underpin all statistical analyses.
Keeping data tidy (one row per case, one column per variable, one value per cell) simplifies downstream tasks like plotting, modeling, and reporting.
Understanding the nature of data (continuous vs discrete) guides appropriate methods and interpretations.
Real-world relevance: proper data organization and appropriate use of summary statistics lead to clearer insights and better decision-making.
Practical Implications and Ethics
Tidiness supports reproducibility: clear structure makes it easier for others to replicate analyses.
Misclassifying data types (e.g., treating codes as numbers) can lead to invalid analyses, such as averaging area codes.
When reporting statistics, specify whether the summary refers to a sample statistic or a population parameter.
Foundational References from the Transcript
Data sources mentioned: field surveys, experiments, and related observational methods.
Page references indicate where these definitions appear in accompanying materials (e.g., guided practice on page 11; observations and variables on page 12; Chapter 1, Lecture 2.1).
The concepts of observations, variables, and tidy data align with foundational data science practices for organizing and analyzing data.