Data and Sampling
Statistical Thinking and Data Sampling
H.G. Wells is often credited with the idea that statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
Learning Objectives
Understand the concept of good data.
Differentiate between sample and population.
Distinguish between estimate/statistic and parameter.
Understand why estimates deviate from the true parameter (estimation error).
Identify the properties of a good sample.
Importance of Good Data
Example: Family Meal Spaghetti Recipe
Ingredients include:
10 Garlic cloves
Basil steeped in oil
San Marzano Tomatoes: 2 2ɛoz cans (smaller cans are preferred for better taste).
What We Ask Our Data to Do
Data can be used to accomplish one of three tasks:
Representative: Ensure that sampled data reflects the population accurately.
Comparison: Compare different datasets or groups effectively.
Just Because: Explore data for discoveries without a particular hypothesis.
Ensuring Representativeness
The first step is to ensure that sampled data is representative of the population being studied.
It is crucial to define the population from which the sample is taken.
Sampling from a Population
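A sample is a subset drawn from a defined population. A quantity computed from the sample (a statistic, or estimate) is used to approximate the corresponding quantity for the whole population (a parameter).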
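As a minimal sketch of the sample versus population distinction, the Python snippet below draws a simple random sample from a made-up population and compares the sample mean (an estimate) with the population mean (a parameter). The population values, the sample size of 100, and the seed are illustrative assumptions, not data from these notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up measurements (not real data).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Parameter: a fixed (usually unknown) property of the whole population.
population_mean = statistics.fmean(population)

# Simple random sample: every member has the same chance of being chosen.
sample = random.sample(population, k=100)

# Statistic (estimate): the same quantity computed from the sample only.
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"estimate  (sample mean):     {sample_mean:.2f}")
print(f"difference:                  {sample_mean - population_mean:+.2f}")
```

The estimate will usually be close to the parameter but not equal to it. That gap is the sampling error discussed later in these notes.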
Statistical Measures: Mean and Standard Deviation
Mean Calculation Formula: \bar{x} = \frac{1}{n} \times \text{Sum of data points}
For a dataset of n values:
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
Steps for Calculating Mean and Standard Deviation
Calculate the Mean: For the dataset [3.5, 3, 2, 1.75, 2, 0.6]:
Step 1: Mean = (3.5 + 3 + 2 + 1.75 + 2 + 0.6) / 6
Mean = 12.85 / 6 ≈ 2.1417 (rounded to four decimal places).
Find Squared Differences from Mean for Each Data Point (computed with the unrounded mean, 12.85 / 6):
(3.5 - 2.1417)² ≈ 1.8451
(3 - 2.1417)² ≈ 0.7367
(2 - 2.1417)² ≈ 0.0201
(1.75 - 2.1417)² ≈ 0.1534
(2 - 2.1417)² ≈ 0.0201
(0.6 - 2.1417)² ≈ 2.3767
Calculate the Mean of Squared Differences (Variance):
\text{Variance} = \frac{1.8451 + 0.7367 + 0.0201 + 0.1534 + 0.0201 + 2.3767}{6} = \frac{5.1521}{6}
Variance ≈ 0.8587 (rounded to four decimal places). Dividing by n = 6 treats the six values as the whole population; when the values are a sample, the usual convention is to divide by n - 1 instead.
Calculate Standard Deviation:
\text{Standard Deviation} = \sqrt{0.8587} ≈ 0.9267 (rounded to four decimal places).
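The calculation for this dataset can be checked with a short script. The sketch below follows the same steps (mean, squared differences, average of squared differences, square root) and cross-checks the result against Python's standard `statistics` module. It uses the population formulas (divide by n), and the final lines show the n - 1 variant as an aside; the variable names are illustrative.

```python
import math
import statistics

data = [3.5, 3, 2, 1.75, 2, 0.6]

n = len(data)
mean = sum(data) / n

# Squared difference of each point from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Population variance: average squared difference (divide by n).
variance = sum(squared_diffs) / n
std_dev = math.sqrt(variance)

print(f"mean     = {mean:.4f}")      # 2.1417
print(f"variance = {variance:.4f}")  # 0.8587
print(f"std dev  = {std_dev:.4f}")   # 0.9267

# Cross-check against the standard library's population formulas.
assert math.isclose(variance, statistics.pvariance(data))
assert math.isclose(std_dev, statistics.pstdev(data))

# If the data are a sample rather than the whole population, the usual
# convention is to divide by n - 1 (statistics.variance / statistics.stdev).
sample_std_dev = statistics.stdev(data)
print(f"sample std dev (n - 1) = {sample_std_dev:.4f}")  # 1.0151
```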
Applications of Mean and Standard Deviation
Common applications include:
Quality Control
Polls
Clinical Studies
Experimental and Observational Studies (Lab or Field Setting)
Understanding Sampling Error and Bias
Important question: How do you know your spoon is representative of your soup?
Common Sources of Error and Bias in Sampling
Sampling Error: The error that arises from not sampling the entire population.
Measurement Error: Errors caused by inaccuracies during data collection.
Selection Bias: Bias that occurs when the sample is not representative of the population from which it is drawn.
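What Larger Samples Can and Cannot Fix
Larger samples reduce sampling error, because estimates from bigger random samples vary less around the true value. They do not, by themselves, remove measurement error or selection bias: a large sample drawn in a biased way is still biased.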
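A small simulation can show how sample size affects sampling error and selection bias differently. In the sketch below, the population, the cutoff of 95 that defines the biased sampling frame, and the helper name `average_error` are all invented for illustration. It compares random samples from the whole population with samples drawn only from the part of the population above the cutoff.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 made-up values.
population = [random.gauss(100, 15) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# A biased sampling frame: only members above 95 can ever be selected
# (an invented stand-in for selection bias).
biased_frame = [x for x in population if x > 95]

def average_error(frame, n, repeats=200):
    """Mean absolute gap between a sample mean and the true population mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(frame, n)
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)

for n in (10, 100, 1000):
    random_err = average_error(population, n)
    biased_err = average_error(biased_frame, n)
    print(f"n={n:>4}  random sample: {random_err:6.2f}   biased frame: {biased_err:6.2f}")
```

Under these assumptions, the error of the random samples shrinks as n grows, while the error from the biased frame stays large no matter how many values are drawn.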
Data as a Source of Comparison
Journals and articles often use comparative data to draw conclusions.
Example: Maternal sucrose consumption can alter behavior and steroids in adult rat offspring (Journal of Endocrinology).
Concept of "Just Because" in Science
A remark attributed to Isaac Asimov captures the spirit of exploration in data: “The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny…'”
Types of Data Visualization
Exploratory Data Visualization: Used primarily to analyze and explore data patterns.
Explanatory Data Visualization: Used to communicate findings.
Importance of Displaying Data
Proper data display enhances:
Understanding of data
Communication of results
Aesthetic appeal of data representation.
Best Practices in Data Visualization
Key learning objectives in data visualization include:
Differentiate between explanatory and exploratory data displays.
Identify and create good graphs.
Understand how data types influence figure design.
Implement best practices in designing figures.
Critique poorly designed graphs.
Continuous vs. Discrete Data
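Continuous data can take any value within a range (for example, height or time), while discrete data take only separate, countable values (for example, a number of offspring). The type of data influences both the analysis and the choice of graph.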
Examples of Data Visualization Techniques
Histograms: Useful for displaying the shape of data distribution.
Types of distributions:
Bell shaped
Bimodal
Skewed
Uniform
Density Plots: Useful for understanding the distribution of continuous variables.
Line Graphs: Effective to show trends over time.
Color Coded Maps: Used to visualize spatial information.
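To make the idea of a histogram concrete, here is a minimal sketch that bins values and prints one text bar per bin using only the Python standard library. The function name `text_histogram`, the bin width, and the measurements are made up for illustration; a plotting library would normally be used for real figures.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return one text line per bin: the bin's lower edge and a bar of '#'."""
    # Assign each value to a bin index: floor(value / bin_width).
    counts = Counter(int(v // bin_width) for v in values)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        lower = b * bin_width
        # Bins with no values get an empty bar (Counter returns 0 for them).
        lines.append(f"{lower:>5.1f} | {'#' * counts[b]}")
    return lines

# Made-up measurements, chosen so the right tail is long (right-skewed).
values = [1.1, 1.2, 1.4, 1.5, 1.7, 1.9, 2.1, 2.2, 2.4, 2.8,
          3.1, 3.6, 4.4, 5.9, 7.3]

for line in text_histogram(values, bin_width=1.0):
    print(line)
```

The bars are tallest on the left and trail off to the right, which is the skewed shape named in the list of distribution types. Changing the bin width changes how much of that shape is visible.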
Common Mistakes in Data Visualization
Mistakes include failing to show data clearly, obscuring patterns, and drawing graphs unclearly.
Example of a misleading graph that fails to convey the accurate data story effectively.
Relationships in Data
Explore different variable types and how they can be represented graphically:
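Two categorical variables (for example, a grouped bar chart or mosaic plot), one categorical and one numerical variable (for example, a strip chart or box plot), and two numerical variables (for example, a scatter plot).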
Importance of choosing the right type of plot to represent the relationship between multiple variables.
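One way to make the link between variable types and plot choice explicit is a small lookup table. The sketch below is illustrative only: the name `suggest_plot` and the specific suggestions are common conventions assumed here, not a rule stated in these notes.

```python
# Hypothetical lookup from the types of two variables to common plot choices.
SUGGESTED_PLOTS = {
    ("categorical", "categorical"): "grouped bar chart or mosaic plot",
    ("categorical", "numerical"): "strip chart, box plot, or violin plot",
    ("numerical", "numerical"): "scatter plot (line graph if one variable is time)",
}

def suggest_plot(type_a, type_b):
    """Return a common plot choice for a pair of variable types."""
    # Sort the pair so the order of the two variables does not matter.
    key = tuple(sorted((type_a, type_b)))
    return SUGGESTED_PLOTS[key]

print(suggest_plot("numerical", "categorical"))
# prints: strip chart, box plot, or violin plot
```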
Conclusion
The comprehensive understanding and application of statistical methods are essential for making informed decisions based on data.