Data and Sampling

Statistical Thinking and Data Sampling

  • H.G. Wells is credited with the prediction that statistical thinking will one day be as necessary for citizenship as the ability to read and write.

Learning Objectives

  1. Understand the concept of good data.

  2. Differentiate between sample and population.

  3. Distinguish between estimate/statistic and parameter.

  4. Understand estimation deviance.

  5. Identify the properties of a good sample.

Importance of Good Data

  • Example: Family Meal Spaghetti Recipe

    • Ingredients include:

    • 10 Garlic cloves

    • Basil steeped in oil

    • San Marzano Tomatoes: two 28 oz cans (smaller cans are preferred for better taste).

What We Ask Our Data to Do

  • Data can be used to accomplish one of three tasks:

    1. Representative: Ensure that sampled data reflects the population accurately.

    2. Comparison: Compare different datasets or groups effectively.

    3. Just Because: Explore data for discoveries without a particular hypothesis.

Ensuring Representativeness

  • The first step is to ensure that sampled data is representative of the population being studied.

  • It is crucial to define the population from which the sample is taken.

Sampling from a Population

  • Discusses techniques and methods for sampling from a defined population.
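As one illustration, here is a minimal sketch of simple random sampling, in which every member of the population has the same chance of being chosen. The population values, the sample size of 50, and the seed are all made up for the example; the notes do not specify them. The sketch also shows the difference between a parameter (the population mean) and a statistic (the sample mean used to estimate it).

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up measurements between 0 and 10.
population = [rng.uniform(0, 10) for _ in range(1_000)]

# Simple random sample without replacement: each unit is equally likely.
sample = rng.sample(population, k=50)

parameter = statistics.fmean(population)  # population mean (usually unknown)
estimate = statistics.fmean(sample)       # sample mean (the statistic)

print(f"population mean (parameter): {parameter:.3f}")
print(f"sample mean (estimate):      {estimate:.3f}")
print(f"difference:                  {estimate - parameter:+.3f}")
```

In practice the population mean is not available, which is why the quality of the sample matters.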

Statistical Measures: Mean and Standard Deviation

  • Mean Calculation Formula: \bar{x} = \frac{1}{n} \times \text{Sum of data points}

    • For a dataset:
      \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
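
The worked steps in these notes divide the summed squared differences by n, which corresponds to the population form of the variance. Written in the same notation as the mean, it might look like the following. The symbol \sigma is an assumption here, since the notes do not name one.

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\sigma = \sqrt{\sigma^2}
```

When the data are a sample used to estimate a population value, the divisor n - 1 is commonly used in place of n.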

Steps for Calculating Mean and Standard Deviation

  1. Calculate the Mean: For the dataset [3.5, 3, 2, 1.75, 2, 0.6]:

    • Step 1: Mean = (3.5 + 3 + 2 + 1.75 + 2 + 0.6) / 6

    • Mean = 12.85 / 6 ≈ 2.1417 (rounded to four decimal places).

  2. Find Squared Differences from Mean for Each Data Point (computed with the unrounded mean, then rounded to four decimal places):

    • (3.5 - 2.1417)² ≈ 1.8451

    • (3 - 2.1417)² ≈ 0.7367

    • (2 - 2.1417)² ≈ 0.0201

    • (1.75 - 2.1417)² ≈ 0.1534

    • (2 - 2.1417)² ≈ 0.0201

    • (0.6 - 2.1417)² ≈ 2.3767

  3. Calculate the Mean of Squared Differences (Variance), dividing by n = 6:

    • \text{Variance} = \frac{1.8451 + 0.7367 + 0.0201 + 0.1534 + 0.0201 + 2.3767}{6} = \frac{5.1521}{6}

    • Variance ≈ 0.8587 (rounded to four decimal places).

  4. Calculate Standard Deviation:

    • \text{Standard Deviation} = \sqrt{0.8587} ≈ 0.9267 (rounded to four decimal places).
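
The four steps above can be checked with a short script. This is a sketch using only Python's standard library; the variable names are illustrative. It follows the notes in dividing by n (the population form) and, for comparison, also prints the sample standard deviation, which divides by n - 1.

```python
import math
import statistics

data = [3.5, 3, 2, 1.75, 2, 0.6]
n = len(data)

# Step 1: mean
mean = sum(data) / n

# Step 2: squared difference of each point from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: variance as the mean of squared differences (divide by n)
variance = sum(squared_diffs) / n

# Step 4: standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print(round(mean, 4))      # 2.1417
print(round(variance, 4))  # 0.8587
print(round(std_dev, 4))   # 0.9267

# For comparison: the sample standard deviation divides by n - 1.
sample_std_dev = statistics.stdev(data)
print(round(sample_std_dev, 4))  # 1.0151
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` give the same population values directly.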

Applications of Mean and Standard Deviation

  • Common applications include:

    • Quality Control

    • Polls

    • Clinical Studies

    • Experimental and Observational Studies (Lab or Field Setting)

Understanding Sampling Error and Bias

  • Important question: How do you know your spoon is representative of your soup?

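Common Sources of Error and Bias in Sampling

  • Sampling Error: The chance difference between a sample estimate and the true population value, which arises because only part of the population is observed.

  • Measurement Error: Inaccuracy introduced when data are measured or recorded, for example by a miscalibrated instrument or an ambiguous survey question.

  • Selection Bias: Systematic distortion that occurs when some members of the population are more likely to be sampled than others, so the sample is not representative of the population from which it is drawn.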

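Addressing Sampling Error with Larger Samples

  • Larger samples reduce sampling error, because estimates vary less from sample to sample as the sample size grows. They do not remove selection bias or systematic measurement error; those require better sampling and measurement procedures, such as random selection.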
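
A small simulation can make the effect of sample size concrete. The sketch below uses a made-up population of 10,000 values; the distribution, the sample sizes, and the rule "sample only the upper half" are assumptions chosen for illustration. It compares how much sample means scatter at two sample sizes, then shows what happens when the sampling frame itself is skewed.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up values centered near 50.
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def spread_of_sample_means(frame, n, trials=500):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [statistics.fmean(rng.sample(frame, n)) for _ in range(trials)]
    return statistics.pstdev(means)

# Sampling error: sample means scatter less around the truth as n grows.
spread_small = spread_of_sample_means(population, n=10)
spread_large = spread_of_sample_means(population, n=400)

# Selection bias: the sample is drawn only from the upper half of the values.
upper_half = sorted(population)[5_000:]
biased_small = statistics.fmean(rng.sample(upper_half, 10))
biased_large = statistics.fmean(rng.sample(upper_half, 400))

print(f"spread of sample means, n=10:  {spread_small:.2f}")
print(f"spread of sample means, n=400: {spread_large:.2f}")
print(f"true mean:                     {true_mean:.2f}")
print(f"biased estimate, n=10:         {biased_small:.2f}")
print(f"biased estimate, n=400:        {biased_large:.2f}")
```

In this setup the spread of the sample means shrinks as n increases, while the estimate from the skewed frame stays well above the true mean regardless of sample size.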

Data as a Source of Comparison

  • Journals and articles often use comparative data to draw conclusions.

  • Example: Maternal sucrose consumption can alter behavior and steroids in adult rat offspring (Journal of Endocrinology).

Concept of "Just Because" in Science

  • Isaac Asimov is credited with the remark: “The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny…'” It captures the spirit of exploring data without a fixed hypothesis.

Types of Data Visualization

  • Exploratory Data Visualization: Used primarily to analyze and explore data patterns.

  • Explanatory Data Visualization: Used to communicate findings.

Importance of Displaying Data

  • Proper data display enhances:

    • Understanding of data

    • Communication of results

    • Aesthetic appeal of data representation.

Best Practices in Data Visualization

  • Key learning objectives in data visualization include:

    1. Differentiate between explanatory and exploratory data displays.

    2. Identify and create good graphs.

    3. Understand how data types influence figure design.

    4. Implement best practices in designing figures.

    5. Critique poorly designed graphs.

Continuous vs. Discrete Data

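  • Continuous data can take any value within a range (for example, height or mass); discrete data take separate, countable values (for example, number of offspring). The data type determines which analyses and graphs are appropriate.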

Examples of Data Visualization Techniques

  1. Histograms: Useful for displaying the shape of a data distribution.

    • Types of distributions:

    • Bell-shaped

    • Bimodal

    • Skewed

    • Uniform

  2. Density Plots: Useful for understanding the distribution of continuous variables.

  3. Line Graphs: Effective for showing trends over time.

  4. Color-Coded Maps: Used to visualize spatial information.
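
To illustrate what a histogram (item 1 above) actually computes, here is a minimal sketch that groups values into equal-width bins and counts them, using only the standard library. The dataset is the one from the worked mean example in these notes; the bin width of 1 and the function name are assumptions for illustration. A plotting library would draw bars from the same counts.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    # Key each bin by its left edge, in increasing order.
    return {k * bin_width: counts[k] for k in sorted(counts)}

data = [3.5, 3, 2, 1.75, 2, 0.6]
counts = histogram_counts(data, bin_width=1)

# Text rendering: one '#' per observation in the bin.
for left_edge, count in counts.items():
    print(f"[{left_edge}, {left_edge + 1}) {'#' * count}")
```

Changing the bin width changes the apparent shape of the distribution, so it is worth trying more than one value when exploring data.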

Common Mistakes in Data Visualization

  • Mistakes include hiding the data, obscuring patterns, and drawing graphical elements unclearly.

  • Example: a misleading graph that fails to convey the data accurately.

Relationships in Data

  • Explore different variable types and how they can be represented graphically:

    • One categorical variable vs. one numerical variable, etc.

    • Importance of choosing the right type of plot to represent the relationship between multiple variables.

Conclusion

  • The comprehensive understanding and application of statistical methods are essential for making informed decisions based on data.