BJTHK1203 Statistics for Technology: Data Collections and Sampling

Strategic Data Collection and Professional Objectives

  • The Progression of Data-Driven Decision Making: Real-world industrial technology depends on a flow from understanding research goals to final execution. This progression ensures accuracy and validity in findings.     * Understand the Objective: Identify the primary purpose of the research and determine the methodology required to accomplish it.     * Choose the Proper Methodology: Select the most appropriate way of collecting data to meet the research goals.     * Draw Correct Conclusions: Interpret findings accurately to avoid misrepresentation.     * Make Intelligent Decisions: Use evidence-based results to drive organizational strategy.

Statistics as a Transformation Engine

  • Definitions and Sequence:     * Data: Defined as facts, specifically numerical facts, that are collected together for reference or information purposes.     * Statistics: A specialized tool used for creating new understanding and insights from a specific set of numbers.     * Information: Refers to knowledge communicated concerning some particular fact.     * The Transformation Process: DATA $\rightarrow$ STATISTICS $\rightarrow$ INFORMATION.
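
The DATA $\rightarrow$ STATISTICS $\rightarrow$ INFORMATION sequence above can be sketched in a few lines of Python. The readings below are hypothetical values invented for illustration (they are not course data), and the mean and standard deviation are only one possible choice of summary statistics.

```python
import statistics

# DATA: ten hypothetical caliper readings in millimetres (illustrative values only).
readings_mm = [25.1, 24.9, 25.0, 25.3, 24.8, 25.2, 25.0, 24.9, 25.1, 25.2]

# STATISTICS: summarise the raw numbers.
mean_mm = statistics.mean(readings_mm)
stdev_mm = statistics.stdev(readings_mm)  # sample standard deviation (n - 1 divisor)

# INFORMATION: a statement that someone can act on.
summary = (
    f"Average reading {mean_mm:.2f} mm with a spread of {stdev_mm:.2f} mm "
    f"across {len(readings_mm)} parts"
)
print(summary)
```

The raw list alone says little; the two summary numbers turn it into a statement about the typical size and consistency of the parts.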

The Two Branches of Statistics

  • Descriptive Statistics:     * Function: Describes the specific data set currently being analyzed.     * Limitation: It does not allow for drawing any conclusions or making any inferences beyond the specific data at hand.
  • Inferential Statistics:     * Function: This branch is used to draw conclusions or make inferences about the characteristics of an entire population.     * Mechanism: It operates based on data drawn from a smaller sample representing the population.

Populations and Samples

  • Population: The group comprising all items of interest to a statistics practitioner. The scale is frequently very large and can sometimes be infinite.
  • Sample: A set of data drawn directly from the population. A sample can itself be very large, but it is always smaller than the population.
  • Exhibit 12.5 - Election Day Case Study:     * The Population: All Florida voters, estimated at 5,000,000.     * The Sample: Voters who were exit-polled on election day, totaling 765.
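
To see how small the exit-poll sample is relative to the population, the sampling fraction can be worked out directly. The symbols $n$ (sample size) and $N$ (population size) are notation chosen here for the illustration.

```latex
\[
\frac{n}{N} = \frac{765}{5{,}000{,}000} = 0.000153 \approx 0.015\%
\]
```

In other words, roughly one voter in every 6,500 was polled.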

The Rationale for Sampling

  • The Barrier (Why we cannot study the whole population): Investigating every single member of a large population is heavily constrained by practical realities. It is considered highly impractical and prohibitively expensive.
  • The Solution: Extracting a sample allows practitioners to generate reliable estimates about the whole. Sampling is easier to manage and execute, and it is significantly cheaper.
  • The Reality Check: Conclusions and estimates derived from samples are not always correct.
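  • Statistical Safeguards: To manage this inherent risk, statistics builds measures of reliability directly into the inference:     * Confidence Level: The proportion of times that an estimating procedure will be correct.     * Significance Level: How frequently a conclusion will be wrong in the long run.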
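
As a rough illustration of how a confidence level turns into a measure of reliability, the sketch below computes an approximate margin of error for a sample proportion. It assumes the usual normal approximation and a worst-case proportion of 0.5; neither assumption comes from the notes, and the function name `margin_of_error` is hypothetical. The sample size 765 is the exit-poll figure from the case study.

```python
import math
from statistics import NormalDist


def margin_of_error(n: int, confidence: float = 0.95, p: float = 0.5) -> float:
    """Approximate half-width of a confidence interval for a proportion.

    Assumes the normal approximation; p = 0.5 gives the widest (worst-case) interval.
    """
    alpha = 1.0 - confidence                     # significance level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    return z * math.sqrt(p * (1.0 - p) / n)


moe_95 = margin_of_error(765)         # 95% confidence
moe_99 = margin_of_error(765, 0.99)   # 99% confidence
print(f"95%: +/-{moe_95:.3f}   99%: +/-{moe_99:.3f}")
```

At 95% confidence the margin is roughly 3.5 percentage points; asking for more confidence from the same 765 responses widens the interval.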

Sources of Data and Statistical Validity

  • The Foundation of Analysis: The reliability and accuracy of data directly affect the validity of statistical results. Both factors depend entirely on the method of collection.
  • The Data Ecosystem: Statistical data is categorized into three primary sources:     1. Published Data     2. Observational Studies     3. Experimental Studies

Detailed Analysis of Published Data

  • Preference: Published data is often the preferred source due to its low cost and convenience.
  • Media Types: Data can be found in printed material, tapes, disks, and on the Internet.
  • Tracking Provenance (Primary vs. Secondary):     * Primary Data: This is data published by the same organization that collected it. An example is a census published by the Statistics Department.     * Secondary Data: This is data published by an organization different from the one that collected it. An example is "The Statistical Abstracts of Malaysia," which compiles data from primary census sources.

Active Collection: Experiments, Observations, and Surveys

  • Requirement for Active Collection: When published data is unavailable, a practitioner must conduct a study to generate new data.
  • The Control Matrix:     * Observational Studies: Measurements of a variable are observed and recorded without controlling any factors that might influence their values.     * Experimental Studies: Measurements of a variable are observed and recorded while specifically controlling factors that might influence their values.
  • Surveys (The Human Element): Surveys solicit information directly from people through three distinct channels:     1. Personal Interview     2. Telephone Interview     3. Self-Administered Questionnaire

Design Principles for a High-Quality Questionnaire

  • Structural Guidelines:     * Demographics: Start with demographic questions to help respondents get started comfortably.     * Length: Keep the survey as short as possible.     * Clarity: Ask short, simple, and clearly worded questions.     * Question Types: Prioritize dichotomous (yes/no) and multiple-choice questions.     * Neutrality: Avoid leading questions.     * Caution: Use open-ended questions sparingly.     * Preparation: Pretest on a small group and plan for final data usage before deployment.     * Example Analysis (College Student Survey): High-quality surveys ensure confidentiality upfront to build trust and use clear structures for ease of processing.

Data Sourcing for Industrial Technology

  • Experimental Work: The most common way for technologists to obtain data. It involves running experiments in the field or lab to describe the behavior of a variable using a sample from a population of interest.     * Case Example 1: Testing selected fruit samples for antioxidants.     * Case Example 2: Analyzing the heavy metal content of cockles to assess sea pollution.
  • Historical Records: Data found within the departments of an organization or professional associations, extracted directly from archives.     * Examples: Production rates, sales figures, and marketing data.
  • Surveys: Utilizing intelligent questionnaires to extract specialized insights.
  • Instruments: Capturing continuous variables directly from mechanical or digital tools.     * Application: Essential for variables that can take any value within an interval, so the number of possible values is infinite.     * Examples: Exact height measurements, precise weight measurements, and direct instrument readings (e.g., calipers/measuring tools).

Sampling Plans: Procedures and Motivations

  • Definition: A sampling plan is a method or procedure for specifying how a sample will be taken from a population.
  • The Golden Rule: The sampled population and the target population should be similar to one another.
  • Motivations for Sampling:     * Costs: Studying a sample costs less than studying the whole population.     * Population Size: Very large populations make it logistically impossible to examine every member.     * Destructive Nature: In some tests, the sampling process itself may destroy the subject being tested.
  • The Research Framework (Writers' Room Analogy):     * Issue: Why is the research happening? (The inciting incident).     * What: The subject matter.     * Population/Sample: Who is being studied? (The casting call).     * Collect Data & Analyze: How will it be done? (The production).

Probability Sampling Techniques

  • Stratified Random Sampling:     * Process: The population is first divided into groups called strata, so that each element belongs to one and only one stratum. A simple random sample is then taken from each stratum.     * Objective: Best results occur when elements within a stratum are as alike as possible (a homogeneous group).     * Advantages: It is precise even with a smaller total sample size.     * Typical Bases for Strata: Department, location, age, industry type.
  • Cluster Sampling:     * Process: The population is first divided into clusters. Ideally, each cluster is a representative small-scale version of the population (a heterogeneous group). A simple random sample of the clusters is taken, and ALL elements within the chosen clusters are sampled.     * Disadvantage: Generally requires a larger total sample size than simple or stratified random sampling.     * Example: Area sampling where clusters are city blocks.
  • Systematic Sampling:     * The Formula: With $n$ the desired sample size and $N$ the population size, the sampling interval is $k = N/n$. Randomly select one of the first $k$ elements, then select every $k$-th element that follows.     * Advantage: The sample is usually easier to identify.     * Example: Selecting every 100th listing in a telephone book after a random start.
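
The three probability sampling techniques above can be sketched as follows. This is a minimal illustration, not a reference implementation: the populations, stratum names, and block layout are hypothetical, the stratified sketch assumes an equal number of elements drawn per stratum (the notes do not specify an allocation rule), and the systematic sketch assumes the population size is at least the sample size.

```python
import random


def systematic_sample(population, n, rng):
    """Random start within the first k = N // n elements, then every k-th element."""
    k = len(population) // n      # sampling interval N/n, rounded down
    start = rng.randrange(k)      # random position among the first k elements
    return population[start::k][:n]


def stratified_sample(strata, per_stratum, rng):
    """Simple random sample of `per_stratum` elements from every stratum."""
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Simple random sample of clusters; keep ALL elements of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [element for name in chosen for element in clusters[name]]


rng = random.Random(42)           # fixed seed so the sketch is repeatable

# Systematic: N = 1000 listings, n = 10, so every 100th listing after a random start.
listings = list(range(1, 1001))
sys_sample = systematic_sample(listings, 10, rng)

# Stratified: two hypothetical departments, five employees drawn from each.
strata = {"production": list(range(0, 60)), "sales": list(range(60, 100))}
strat_sample = stratified_sample(strata, 5, rng)

# Cluster: six hypothetical city blocks of four houses; two whole blocks are surveyed.
blocks = {f"block_{b}": [f"house_{b}_{h}" for h in range(4)] for b in range(6)}
clus_sample = cluster_sample(blocks, 2, rng)

print(sys_sample)
print(strat_sample)
print(clus_sample)
```

Note the structural difference: stratified sampling draws a few elements from every group, while cluster sampling draws every element from a few groups.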

Nonprobability Sampling Techniques

  • Convenience Sampling:     * Process: A technique where the sample is identified primarily by convenience.     * Advantage: Sample selection and data collection are relatively easy.     * Disadvantage: It is impossible to determine how representative the sample is of the population.     * Example: A professor using student volunteers.
  • Judgment Sampling:     * Process: Relies entirely on subjective human assessment.     * Advantage: A relatively easy way of selecting a sample.     * Disadvantage: Quality depends heavily on the judgment of the person selecting the sample.     * Example: A reporter judging specific senators to reflect the general opinion of the entire senate.

Understanding Sampling and Non-Sampling Error

  • The Rule of Scale: The larger the sample size, the more accurate the sample estimates will be, provided the data are collected correctly. Sample size governs sampling error only.
  • Sampling Error:     * Origin: Occurs due to the specific observations that happened to be selected.     * Nature: Natural variation between the sample and the population. The sample mean deviates from the population mean purely by chance.     * Mitigation: Increasing the sample size WILL reduce this error.
  • Non-Sampling Error:     * Origin: Arises from mistakes in data acquisition or improper selection.     * Nature: Considered the more serious threat to data validity. Increasing sample size WILL NOT reduce this error.     * The Compound Effect: When an observation is wrongly recorded, it compounds with sampling error, pushing the sample further from the truth.
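
The contrast between the two error types can be checked with a small simulation. The sketch below is built on assumptions made for illustration only: a hypothetical population of part weights, and a miscalibrated scale that adds 5 g to every reading as the stand-in for a data-acquisition error.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 part weights in grams (illustrative values).
population = [rng.gauss(500.0, 20.0) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def average_abs_error(n, bias=0.0, trials=200):
    """Average absolute gap between the sample mean and the true mean."""
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, n)
        measured = [x + bias for x in sample]  # bias models a miscalibrated scale
        gaps.append(abs(statistics.fmean(measured) - true_mean))
    return statistics.fmean(gaps)


for n in (25, 400, 6400):
    clean = average_abs_error(n)
    biased = average_abs_error(n, bias=5.0)
    print(f"n={n:5d}  sampling error only: {clean:5.2f} g   with +5 g bias: {biased:5.2f} g")
```

Under these assumptions the first column shrinks as the sample grows, while the second settles near 5 g: a larger sample averages away chance variation but not a faulty instrument.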

Categories of Non-Sampling Threats

  • Errors in Data Acquisition:     * Hardware: Faulty equipment leading to incorrect measurements.     * Process: Mistakes made during transcription from primary sources.     * Comprehension: Inaccurate recording due to misinterpretation of terms.     * Friction: Inaccurate responses to questions concerning sensitive issues.
  • Non-Response Error:     * Definition: Error introduced when responses are not obtained from some members of the sample.     * Proportion: The Response Rate (proportion of selected people who complete the survey) is the critical parameter for diagnosing this error.
  • Selection Bias:     * Definition: Occurs when the sampling plan is designed so that specific members of the target population cannot possibly be selected for inclusion, creating an invisible boundary.
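
The response rate defined above is a simple proportion. The sketch below uses a hypothetical helper named `response_rate` and invented figures (1,200 people selected, 300 completed questionnaires) purely to show the calculation.

```python
def response_rate(completed: int, selected: int) -> float:
    """Proportion of selected people who completed the survey."""
    if selected <= 0:
        raise ValueError("selected must be positive")
    if not 0 <= completed <= selected:
        raise ValueError("completed must be between 0 and selected")
    return completed / selected


# Hypothetical figures: 1,200 people selected, 300 completed questionnaires.
rate = response_rate(300, 1200)
print(f"Response rate: {rate:.0%}")  # prints "Response rate: 25%"
```

A low rate does not prove the results are biased, but it signals that the people who answered may differ from those who did not.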

Questions & Discussion

  • Question 1: List 5 important points to consider when designing a questionnaire.     * Response:         1. Start with demographic questions.         2. Keep the survey as short as possible.         3. Use short, simple, clearly worded questions.         4. Focus on dichotomous and multiple-choice questions.         5. Avoid leading questions.
  • Question 2: Describe the difference between observational and experimental data.     * Response:         * Observational data is recorded without controlling variables or factors that might influence values.         * Experimental data is collected while controlling factors that influence the variables under study to observe their behavior in specific settings.