Statistics - Data Understanding and Sampling Techniques

Understanding Data and Statistics

Goal of Statistics

  • The primary goal is to gain a deeper understanding of the data to make informed decisions and predictions.

Key Definitions

  • Statistics is used to make decisions and predictions.

  • Population: A collection of people or objects being studied.

    • Example: Studying all cars of a particular make.

  • Sample: A subset of the population.

  • Parameter: A numerical measure that describes a population.

  • Statistic: A numerical measure that describes a sample.

  • Variable: A characteristic of interest that can take different values; it may be numerical or categorical.

    • Example: Gathering ages to find an average age.

  • Qualitative Variable: Names and labels representing categories.

    • Example: Colors

  • Quantitative Variable: Numerical representation with measurable quantities.

Homework and Resources

  • Open math lab: A platform where professors contribute questions to homework assignments.

  • Tutors are available for assistance.

Mathematics vs. Statistics

  • Mathematics is an attempt to represent our lives and solve problems using numbers.

  • Statistics is a field that involves collecting and interpreting data, utilizing math as a tool.

  • Mathematics is considered the only pure, exact science because numbers are modeled.

Classification of Numbers/Variables

Discrete vs. Continuous Variables
  • Discrete: Countable and distinct numbers (e.g., 1, 2, 3).

    • They can be finite or infinite but still countable.

    • Examples include integers (positive and negative whole numbers), although statistics mostly deals with non-negative counts.

  • Continuous: Infinite and not countable; can take any value within a range.

Number System
  • Natural Numbers: Positive integers.

  • Integers: Positive and negative whole numbers, including zero.

  • Fractions: Represent part of a whole.

  • Decimals: Can represent fractions, obtained by dividing the numerator by the denominator.

  • Real Numbers: All rational and irrational numbers, that is, every point on the number line.

    • There are infinite numbers between zero and one.

  • Age: Time is continuous; even when stating an age like 20, it is not an exact point.

Definitions
  • Discrete: Takes a finite or countable number of values.

  • Continuous: Infinite and not countable. (Measurements are continuous.)

Examples of Discrete and Continuous Variables:
  • Number of shoes owned: Discrete.

  • Type of car driven: Qualitative (categorical); the discrete/continuous distinction applies only to quantitative variables.

  • Distance from home to the grocery store: Continuous.

  • Number of classes taken per school year: Discrete.

  • Type of calculator used: Qualitative (categorical), for the same reason as type of car.

  • Age: Typically considered continuous but can be discrete when referring to completed years.

Sampling Techniques

  • Different ways to choose a sample from a population.

Methods of Sampling
  • Systematic Sampling: Selecting every nth member of the population.

    • Example: Checking every 100th item on an assembly line.

  • Stratified Sampling: Dividing the population into categories (strata) and then randomly selecting a percentage from each category.

    • Random sample: Mathematically, everyone has the same chance of being selected.

  • Random Sampling: Everyone has an equal chance of being selected.

    • Assign numbers and generate them for people.

  • Simple Random Sample: A sample of size n chosen so that every possible group of n individuals has the same chance of being selected.

  • Convenience Sampling: Selecting readily available individuals who are convenient to choose.

    • Example: Asking friends about their shoe preferences.

  • Voluntary Response Sampling: Individuals volunteer to participate.

    • Example: Online reviews.

  • Cluster Sampling: Dividing the population into clusters (groups) and randomly selecting entire clusters.

    • Example: Surveying all students in two randomly selected high schools.
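
Two of the methods above, simple random sampling and systematic sampling, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of item IDs 1 to 1000, the sample size of 10, and the random starting point for the systematic sample are all hypothetical choices, not part of the notes.

```python
import random

rng = random.Random(0)              # fixed seed so the sketch is repeatable
population = list(range(1, 1001))   # hypothetical item IDs 1..1000

# Simple random sample: every possible group of n items is equally likely.
n = 10
simple_random_sample = rng.sample(population, k=n)

# Systematic sample: pick a random start, then take every 100th item.
step = 100
start = rng.randrange(step)         # assumed random start within the first 100 items
systematic_sample = population[start::step]

print("Simple random sample:", sorted(simple_random_sample))
print("Systematic sample:   ", systematic_sample)
```

The systematic sample always has evenly spaced IDs, while the simple random sample has no such pattern.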

Sampling Bias and Critical Evaluation

  • Sampling bias can affect the fairness and representation of a sample.

Types of Bias:
  • Sampling Bias: Occurs when the sample is not representative of the population.

  • Voluntary Response Bias: Only those with strong opinions (positive or negative) are likely to respond.

  • Response Bias: When responders give inaccurate answers.

  • Wording Bias: When the wording of questions influences the response.

    • Example: The wording of Proposition 8 in California confused voters.

  • Perceived Lack of Anonymity: Fear of giving an honest response, especially in sensitive surveys.

  • Negative Interest: The order of the questions influences the answers.

Understanding Data and Statistics

Goal of Statistics
  • The primary goal is to gain a deeper understanding of the data to make informed decisions, predictions, and draw meaningful conclusions. Statistics enables us to transform raw data into actionable insights by identifying patterns, relationships, and trends that would not be apparent through simple observation.

Key Definitions
  • Statistics: A branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. It provides tools and techniques to make informed decisions and predictions based on available data.

  • Population: The entire group of individuals, objects, or events that are of interest in a study. It is the complete set from which a sample may be drawn. Example: If the study concerns all cars of a particular make, the population is every car produced under that make.

  • Sample: A subset of the population that is selected for study. The sample should be representative of the population to allow for accurate inferences about the entire group.

  • Parameter: A numerical measure that describes a characteristic of the population. It is a fixed value that is often unknown and estimated from sample data.

  • Statistic: A numerical measure that describes a characteristic of the sample. It is used to estimate the corresponding population parameter.

  • Variable: A characteristic of interest that can take on different values. It represents a measurable or observable attribute of individuals or objects in a study. Example: Gathering ages to find an average age involves collecting a variable (age) from each individual.

  • Qualitative Variable: A variable that represents categories or attributes that are non-numerical. These variables describe qualities or characteristics that cannot be measured numerically. Example: Colors such as red, blue, and green.

  • Quantitative Variable: A variable that represents numerical values that can be measured or counted. These variables can be further classified as discrete or continuous.
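
The difference between a parameter and a statistic can be made concrete with a short Python sketch. The population of 1,000 ages is synthetic and the sample size of 50 is an arbitrary assumption; the point is only that the parameter uses every member of the population while the statistic uses the sample alone.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1,000 people (synthetic, for illustration only).
population_ages = [rng.randint(18, 80) for _ in range(1000)]

# Parameter: computed from every member of the population.
population_mean = statistics.mean(population_ages)

# Statistic: computed from a sample drawn from that population.
sample_ages = rng.sample(population_ages, k=50)
sample_mean = statistics.mean(sample_ages)

print(f"Parameter (population mean age): {population_mean:.2f}")
print(f"Statistic (sample mean age):     {sample_mean:.2f}")
```

In practice the population mean is usually unknown, and the sample mean serves as its estimate.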

Homework and Resources
  • Open math lab: A platform where professors contribute questions to homework assignments, providing a diverse range of practice problems and learning materials for students.

  • Tutors are available for assistance: Tutors offer personalized support and guidance to students, helping them understand concepts, solve problems, and improve their overall performance in the course.

Mathematics vs. Statistics
  • Mathematics is an abstract and theoretical discipline that deals with numbers, quantities, and shapes. It provides a framework for representing and solving problems using logical reasoning and symbolic notation.

  • Statistics is an applied field that involves collecting, analyzing, and interpreting data to make inferences and decisions. It utilizes mathematical tools and techniques to extract meaningful information from data and draw conclusions about populations.

  • Mathematics is considered the only pure, exact science because numbers are modeled and follow strict rules and axioms. Statistical analysis involves uncertainty and variability, as it deals with real-world data that may be subject to errors and biases.

Classification of Numbers/Variables
Discrete vs. Continuous Variables
  • Discrete: Variables that can only take on a finite number of values or a countable number of values. These values are typically integers and represent distinct, separate units. Examples: 1, 2, 3. They can be finite or infinite but still countable.

    • Examples include integers (positive and negative whole numbers), although statistics mostly deals with non-negative counts. For instance, the number of students in a class or the number of cars in a parking lot are discrete variables.

  • Continuous: Variables that can take on any value within a given range. These values are not restricted to integers and can include fractions or decimals. Continuous variables represent measurements that can be infinitely precise.

Number System
  • Natural Numbers: The set of positive integers, starting from 1. These numbers are used for counting and ordering. Example: 1, 2, 3, …

  • Integers: The set of whole numbers, including both positive and negative numbers, as well as zero. Example: … -3, -2, -1, 0, 1, 2, 3, …

  • Fractions: Numbers that represent part of a whole, expressed as a ratio of two integers (numerator and denominator). Example: 1/2, 3/4, 2/5

  • Decimals: Numbers that represent fractions in base-10 notation. They are obtained by dividing the numerator by the denominator. Example: 0.5, 0.75, 0.4

  • Real Numbers: The set of all numbers that can be represented on a number line, including both rational and irrational numbers. The integers are a discrete subset of this set, while any interval of it is continuous: there are infinitely many numbers between zero and one.

  • Age: Age is often treated as a continuous variable because time is continuous. Even when stating an age like 20, it is not an exact point but rather a range (e.g., 20 years and some months, days, hours, etc.).
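
As a small illustration of the fraction and decimal entries above, the following Python sketch converts the example fractions to decimals by dividing numerator by denominator, and then shows that halving a number between zero and one always yields another number between zero and one. The use of the standard-library `Fraction` type and the choice of five halving steps are assumptions made for the example.

```python
from fractions import Fraction

# The example fractions from the list above.
fractions_list = [Fraction(1, 2), Fraction(3, 4), Fraction(2, 5)]

# A decimal is obtained by dividing the numerator by the denominator.
for frac in fractions_list:
    decimal_value = frac.numerator / frac.denominator
    print(f"{frac} = {decimal_value}")

# Repeated halving never leaves the interval (0, 1) and never reaches 0,
# so there is always another number between zero and one.
x = Fraction(1)
halves = []
for _ in range(5):          # five steps, chosen arbitrarily
    x = x / 2
    halves.append(x)
print([str(h) for h in halves])
```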

Definitions
  • Discrete: Takes a finite or countable number of values, meaning that the values can be listed or counted. These values are typically integers.

  • Continuous: Infinite and not countable, meaning that the values cannot be listed or counted. Measurements are continuous because they can be measured to arbitrary precision.

Examples of Discrete and Continuous Variables:
  • Number of shoes owned: Discrete (you can only own a whole number of shoes).

  • Type of car driven: Qualitative (categorical variable); the discrete/continuous distinction applies only to quantitative variables.

  • Distance from home to the grocery store: Continuous (can be measured to arbitrary precision).

  • Number of classes taken per school year: Discrete (you can only take a whole number of classes).

  • Type of calculator used: Qualitative (categorical variable), for the same reason as type of car.

  • Age: Typically considered continuous but can be discrete when referring to completed years (e.g., "I am 25 years old").
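
The two readings of age can be sketched in Python. The function names, the birth date, and the reference date below are hypothetical, and treating a year as 365.25 days is a simplifying assumption.

```python
from datetime import date

def age_continuous(birth: date, today: date) -> float:
    """Elapsed time in years as a real number (a year is approximated as 365.25 days)."""
    return (today - birth).days / 365.25

def age_completed_years(birth: date, today: date) -> int:
    """Number of birthdays already reached: a count, hence discrete."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

birth = date(2000, 3, 15)   # hypothetical birth date
today = date(2025, 1, 10)   # hypothetical reference date

print(f"Continuous age: {age_continuous(birth, today):.3f} years")
print(f"Completed years: {age_completed_years(birth, today)}")
```

The first value changes every day, while the second only jumps by one on each birthday.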

Sampling Techniques
  • Different ways to choose a sample from a population, each with its own advantages and disadvantages. The goal is to select a sample that is representative of the population to allow for accurate inferences.

Methods of Sampling
  • Systematic Sampling: Selecting every nth member of the population after a random start. This method is simple to implement but may be biased if there is a pattern in the population.

    • Example: Checking every 100th item on an assembly line to ensure quality control. This method is efficient for large populations but may miss defects that occur in a systematic pattern.

  • Stratified Sampling: Dividing the population into subgroups (strata) based on shared characteristics, then randomly selecting a percentage from each subgroup. This method ensures that each subgroup is adequately represented in the sample.

    • Random sample: mathematically, everyone has the same chance of being taken, ensuring fairness and reducing bias.

  • Random Sampling: Selecting individuals from the population in such a way that everyone has an equal chance of being selected. This method minimizes bias but may not always be feasible.

    • Assign a number to each individual and use a random number generator to pick the sample. Because chance alone decides who is chosen, the selection does not favor any group, although an individual sample can still differ from the population.

  • Simple Random Sample: A sample of size n chosen so that every possible set of n individuals has an equal chance of being selected. This method is the most basic form of random sampling and serves as a foundation for more complex methods.

  • Convenience Sampling: Selecting individuals who are readily available and easy to reach. This method is quick and inexpensive but is likely to be biased.

    • Example: Asking friends about their shoe preferences is a convenience sample because it is limited to individuals who are easily accessible.

  • Voluntary Response Sampling: Allowing individuals to self-select into the sample. This method is convenient but is highly susceptible to bias because those who volunteer are likely to have strong opinions.

    • Example: Online reviews are a form of voluntary response sampling because only those who are motivated to leave a review will participate.

  • Cluster Sampling: Dividing the population into clusters (groups) and randomly selecting entire clusters to include in the sample. This method is useful when the population is geographically dispersed or when it is difficult to obtain a complete list of individuals.

    • Example: Surveying all students in two randomly selected high schools is a cluster sample because entire schools are selected rather than individual students.
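
The contrast between stratified and cluster sampling can be sketched as follows. Everything about the data is assumed for illustration: 300 synthetic students, six schools, three grade levels, a 10% sampling fraction, and the helper names `stratified_sample` and `cluster_sample`.

```python
import random
from collections import defaultdict

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 300 students across 6 schools and 3 grade levels.
students = [
    {"id": i, "school": f"School {i % 6}", "grade": 9 + (i % 3)}
    for i in range(300)
]

def stratified_sample(units, key, fraction, rng):
    """Randomly select the same fraction of units from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[unit[key]].append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(units, key, n_clusters, rng):
    """Randomly select whole clusters and keep every unit inside them."""
    clusters = sorted({unit[key] for unit in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in units if unit[key] in chosen]

by_grade = stratified_sample(students, "grade", 0.10, rng)
two_schools = cluster_sample(students, "school", 2, rng)

print(len(by_grade), "students drawn across all grades")
print(len(two_schools), "students from", sorted({s["school"] for s in two_schools}))
```

The stratified sample contains some students from every grade, whereas the cluster sample contains every student from only two schools.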

Sampling Bias and Critical Evaluation
  • Sampling bias can significantly affect the fairness and representation of a sample, leading to inaccurate conclusions about the population. It is essential to critically evaluate sampling methods to identify and minimize potential sources of bias.

Types of Bias:
  • Sampling Bias: Occurs when the sample is not representative of the population, leading to systematic errors in the results. This can happen if certain groups are over- or under-represented in the sample.

  • Voluntary Response Bias: Occurs when individuals volunteer to participate in a survey or study, resulting in a sample that is not representative of the population. Only those with strong opinions (positive or negative) are likely to respond, leading to skewed results.

  • Response Bias: Occurs when responders give inaccurate answers due to factors such as social desirability, misunderstanding the question, or memory errors. This can lead to systematic errors in the data and affect the validity of the conclusions.

  • Wording Bias: Occurs when the wording of questions influences the response, leading to biased results. The way a question is phrased can affect how people interpret it and, therefore, how they answer.

    • Example: The wording of Proposition 8 in California confused voters, leading to unintended consequences.

  • Perceived Lack of Anonymity: Occurs when respondents fear that their answers will not be kept confidential, leading to less honest responses, especially in sensitive surveys. People may be hesitant to disclose personal information or express unpopular opinions if they believe their identity will be revealed.

  • Negative Interest: Occurs when the order of questions influences answers. Earlier questions can prime respondents to answer subsequent questions in a certain way, leading to biased results.
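
Voluntary response bias can be illustrated with a small simulation. The data are synthetic, and the response rule (only people with a rating of 1 or 5 leave a review) is a deliberately extreme assumption chosen to make the effect visible; real respondents behave less predictably.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 satisfaction ratings from 1 (worst) to 5 (best),
# with most people holding moderate opinions.
ratings = rng.choices([1, 2, 3, 4, 5], weights=[5, 20, 50, 20, 5], k=10_000)

# Assumed response rule: only people with extreme opinions volunteer a review.
volunteers = [r for r in ratings if r in (1, 5)]

# A simple random sample of the same size, for comparison.
random_sample = rng.sample(ratings, k=len(volunteers))

def share_extreme(values):
    """Fraction of ratings that are either 1 or 5."""
    return sum(v in (1, 5) for v in values) / len(values)

print(f"Extreme ratings in population:     {share_extreme(ratings):.1%}")
print(f"Extreme ratings among volunteers:  {share_extreme(volunteers):.1%}")
print(f"Extreme ratings in random sample:  {share_extreme(random_sample):.1%}")
```

Under this rule the volunteer sample consists entirely of extreme ratings, while the random sample of the same size stays close to the population's share.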