Quantitative Methods in Health Sciences: Descriptive Statistics and Continuous Variables

Data Classification and Statistics

  • Definitions and Scope:
    * Data: Consists of information derived from observations, counts, measurements, or responses.
    * Statistics: The science focused on the collection, organization, analysis, and interpretation of data for the purpose of informed decision-making.

  • Types of Data:
    * Qualitative Data: Consists of attributes, labels, or non-numerical entries. Categorized as:
        * Nominal: Labels or names without mathematical order.
        * Ordinal: Categories that can be arranged in a specific order or rank.
    * Quantitative Data: Consists of numerical measurements or counts. Categorized as:
        * Interval: Differences between entries are meaningful, but there is no true zero point.
        * Ratio: Differences are meaningful and there is an inherent zero point.

  • Variable Classifications:
    * Discrete Variable: A quantitative variable that results from counting (whole values).
    * Continuous Variable: A quantitative variable that is measured and can take on decimal values.
    * Qualitative (Categorical) Variable: Represents categories such as gender, ethnicity, or group membership.

  • Example: GPA Data Identification:
    * Data: Sally (3.22), Bob (3.98), Cindy (2.75), Mark (2.24), Kathy (3.84).
    * The names (Sally, Bob, etc.) represent Qualitative data.
    * The Grade Point Average (GPA) values represent Quantitative data.

Branches of Statistics

  • Descriptive Statistics:
    * Involves the organization, summarization, and visual display of data.
    * The primary goal is to turn raw data into accessible information.

  • Inferential Statistics:
    * Involves using a sample to draw conclusions about a larger population.

  • Practical Example: Sleep Study:
    * Study Detail: Volunteers with less than 6 hours of sleep were four times more likely to answer incorrectly on a science test than participants with at least 8 hours of sleep.
    * Descriptive Part: The statement "four times more likely to answer incorrectly" describes the sample data directly.
    * Inferential Conclusion: Inferring that all individuals who sleep less than 6 hours are more likely to answer science questions incorrectly than those who sleep at least 8 hours.

  • The Role of Statistics in Experimentation (Three-Step Process):
    * Step 1: Experimentation: Two teaching methods (Method A and Method B) are applied to samples drawn from a population of first-grade children, producing test scores for the students in each sample.
        * Sample A Results: 73, 75, 72, 79, 76, 77, 75, 77, 72, 75, 76, 78, 80, 74, 76, 78, 73, 77, 74, 81, 76.
        * Sample B Results: 68, 70, 73, 71, 67, 72, 70, 71, 75, 68, 70, 71, 72, 74, 69, 72, 73, 70, 70, 77, 77, 69.
    * Step 2: Descriptive Statistics: Organizing and simplifying the data from Sample A and Sample B.
        * Sample A Average Score ≈ 76.
        * Sample B Average Score ≈ 71.
    * Step 3: Inferential Statistics: Interpreting results. The sample averages differ by about 5 points. Researchers must decide between two interpretations:
        1. There is actually no difference between the methods, and the observed gap is due to chance (sampling error).
        2. There is a real difference between the methods, accurately reflected by the data.
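Step 2 of the process above can be reproduced in a few lines. The following is a minimal sketch using Python's standard `statistics` module; the variable names `sample_a` and `sample_b` are illustrative choices, and the scores are copied from the lists above.

```python
from statistics import mean

# Test scores as listed for the two samples in Step 1.
sample_a = [73, 75, 72, 79, 76, 77, 75, 77, 72, 75, 76,
            78, 80, 74, 76, 78, 73, 77, 74, 81, 76]
sample_b = [68, 70, 73, 71, 67, 72, 70, 71, 75, 68, 70,
            71, 72, 74, 69, 72, 73, 70, 70, 77, 77, 69]

# Descriptive step: reduce each sample to one summary number.
mean_a = mean(sample_a)   # sum of scores / number of scores
mean_b = mean(sample_b)

# Averages rounded to whole points, and the gap between them.
rounded_a = round(mean_a)
rounded_b = round(mean_b)
difference = rounded_a - rounded_b

print(rounded_a, rounded_b, difference)
```

The code only describes the two samples. Deciding whether the gap reflects a real difference or sampling error is the inferential step and needs a formal test, which this sketch does not attempt.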

Measures of Central Tendency (Measures of Location)

  • Overview of Central Tendency:
    * Represents a typical or central entry in a data set.
    * If a distribution is perfectly "Normal" (bell curve), the Mean, Median, and Mode are identical.

  • The Mean (Arithmetic Average):
    * Calculated as the sum of the entries divided by the number of entries ($N$ for a population, $n$ for a sample).
    * Population Mean (mu): $\mu = \frac{\sum x}{N}$
    * Sample Mean (x-bar): $\bar{x} = \frac{\sum x}{n}$
    * Characteristic: It is the most common measure but is highly sensitive to outliers (extreme values).
    * Example (Effect of Outliers):
        * Set 1 ($1, 2, 3, 4, 5$): $\text{Mean} = \frac{15}{5} = 3$
        * Set 2 ($1, 2, 3, 4, 10$): $\text{Mean} = \frac{20}{5} = 4$
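The outlier example can be checked directly. This is a small sketch with the standard `statistics` module; `set_1` and `set_2` are illustrative names for the two data sets above.

```python
from statistics import mean

set_1 = [1, 2, 3, 4, 5]
set_2 = [1, 2, 3, 4, 10]   # identical except the last entry is an outlier

mean_1 = mean(set_1)       # 15 / 5
mean_2 = mean(set_2)       # 20 / 5

# Changing a single entry shifts the mean by a full point.
shift = mean_2 - mean_1
print(mean_1, mean_2, shift)
```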

  • The Median:
    * The value in the exact middle of an ordered data set ($50\%$ of entries above, $50\%$ below).
    * Characteristic: It is resistant to outliers; an extreme value changes it little or not at all.
    * Determining Position: $\text{Position} = \frac{n + 1}{2}$ in the ordered data.
    * If $n$ is odd: The median is the single middle number.
    * If $n$ is even: The position falls halfway between two entries, and the median is the average of those two middle numbers.
    * Example (Odd Set): Data: $32, 39, 44, 53, 57, 57, 61$. Position $= \frac{7 + 1}{2} = 4$, so Median $= 53$.
    * Example (Even Set): Data: $1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00$. Position $= \frac{10 + 1}{2} = 5.5$, so Median $= \frac{2.53 + 2.71}{2} = 2.62$.
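The odd and even cases can be written as one short function. This is a sketch, not a replacement for a library routine; the name `median_by_position` is illustrative, and the function assumes a non-empty list.

```python
def median_by_position(values):
    """Median of a non-empty list, following the odd/even rule above."""
    ordered = sorted(values)          # the rule applies to ordered data
    n = len(ordered)
    mid = n // 2                      # zero-based index of the upper middle entry
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle entry
    # even n: average of the two middle entries
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [32, 39, 44, 53, 57, 57, 61]
even_set = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]

print(median_by_position(odd_set))
print(round(median_by_position(even_set), 2))
```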

  • The Mode:
    * The data entry occurring with the greatest frequency.
    * If no entry repeats, there is no mode. If two or more entries share the greatest frequency, the data set is bimodal or multimodal.
    * Example: Ages $53, 32, 61, 57, 39, 44, 57$. Mode $= 57$.
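The "no mode" and "multimodal" cases are easy to get wrong by hand, so here is a minimal sketch. The function name `modes` is illustrative. It follows the convention stated above that a data set with no repeated entry has no mode, which differs from `statistics.multimode`, where every entry is returned in that case.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modes; empty if no entry repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())        # greatest frequency in the data
    if top == 1:
        return []                     # no entry repeats: no mode
    return sorted(v for v, c in counts.items() if c == top)

ages = [53, 32, 61, 57, 39, 44, 57]
print(modes(ages))
```

A result with two entries corresponds to a bimodal data set, and more than two to a multimodal one.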

  • Which measure is "Best"?
    * Mean: General standard, unless outliers exist.
    * Median: Best when extreme values are present (e.g., house prices in Ottawa).

Shapes of Distributions

  • Symmetric Distribution:
    * A vertical line drawn through the middle creates mirror-image halves.
    * $\text{Mean} = \text{Median} = \text{Mode}$ (when the distribution has a single peak).

  • Uniform (Rectangular) Distribution:
    * All entries/classes have equal frequencies. This is also a type of symmetric distribution.

  • Skewed Left (Negatively Skewed):
    * The "tail" extends to the left.
    * $\text{Mean} < \text{Median}$
    * Example: Mode/Median $= 25,000$, Mean $= 23,500$.

  • Skewed Right (Positively Skewed):
    * The "tail" extends to the right.
    * $\text{Mean} > \text{Median}$
    * Example: Mode/Median $= 25,000$, Mean $= 121,500$ (driven up by a $1,000,000$ outlier).
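The mean-versus-median comparison can be sketched in code. The data set `incomes` below is hypothetical: it was constructed only so that its median, mode, and mean match the right-skew example (25,000 and 121,500), and it is not taken from the source. The function name `skew_direction` is also illustrative, and the comparison is a rule of thumb rather than a formal skewness measure.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule-of-thumb skew check: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"     # mean pulled toward a long right tail
    if m < med:
        return "left"      # mean pulled toward a long left tail
    return "none indicated"

# Hypothetical data: nine ordinary values plus one 1,000,000 outlier.
incomes = [20000, 22500, 22500, 25000, 25000,
           25000, 25000, 25000, 25000, 1000000]

print(mean(incomes), median(incomes), skew_direction(incomes))
```

The single extreme value moves the mean far above the median, while the median stays at the typical value.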

Measures of Variation (Measures of Dispersion)

  • Range:
    * $\text{Range} = \text{Maximum entry} - \text{Minimum entry}$
    * Disadvantages: Ignores how the data are distributed between the extremes; highly sensitive to outliers.
    * Example: Stock prices from $56$ to $67$. Range $= 67 - 56 = 11$.

  • Deviation:
    * The difference between an entry $x$ and the mean $\mu$.
    * $\text{Deviation} = x - \mu$
    * The sum of deviations $\sum (x - \mu)$ is always equal to $0$.
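The zero-sum property follows directly from the definition of the population mean. A short derivation, using only the symbols already defined:

```latex
\mu = \frac{\sum x}{N} \;\Longrightarrow\; \sum x = N\mu
\qquad\text{so}\qquad
\sum (x - \mu) = \sum x - N\mu = N\mu - N\mu = 0
```

For Set 1 from the mean example ($1, 2, 3, 4, 5$ with $\mu = 3$), the deviations are $-2, -1, 0, 1, 2$, which sum to $0$. This is why the deviations are squared before they are averaged in the variance.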

  • Variance and Standard Deviation:
    * Population Variance (sigma squared): $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
    * Population Standard Deviation (sigma): $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
    * Sample Variance ($s^2$): $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
    * Sample Standard Deviation ($s$): $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
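The only difference between the population and sample formulas is the divisor, which the following sketch makes explicit. The function names are illustrative, and the data set reuses Set 1 from the mean example. The standard library offers the same calculations as `statistics.pvariance` and `statistics.variance`.

```python
import math

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [1, 2, 3, 4, 5]           # squared deviations: 4, 1, 0, 1, 4

sigma_sq = population_variance(data)
s_sq = sample_variance(data)
sigma = math.sqrt(sigma_sq)      # population standard deviation
s = math.sqrt(s_sq)              # sample standard deviation

print(sigma_sq, s_sq, round(sigma, 3), round(s, 3))
```

Dividing by $n - 1$ instead of $N$ always gives the larger value, which compensates for a sample tending to understate the spread of its population.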

  • Degrees of Freedom ($n - 1$):
    * The number of values free to vary after using data to estimate a parameter (like the mean).
    * Example: If Mean $= 10$ for 3 values, and $A = 8$, $B = 12$, then $C$ must be $10$ (it is not free to vary).
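The example can be restated as a one-line calculation: once the mean and all but one value are fixed, the last value is determined. A minimal sketch, with the illustrative helper name `forced_last_value`:

```python
def forced_last_value(known_values, target_mean, n):
    """Value the final entry must take so that n entries average to target_mean."""
    required_total = n * target_mean          # the mean fixes the total
    return required_total - sum(known_values)

# A = 8 and B = 12 are free choices; C is then fixed by the mean of 10.
c = forced_last_value([8, 12], target_mean=10, n=3)
print(c)
```

With three values and a known mean, only two values are free to vary, which matches $n - 1 = 2$ degrees of freedom.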

  • Coefficient of Variation (CV):
    * Measures relative variation as a percentage.
    * $CV = \frac{s}{\bar{x}} \times 100\%$
    * Utility: Allows comparison of variation between datasets with different units or different means.
    * Comparison Example:
        * Stock A: Average $= \$50$, $s = \$5$. $CV = \frac{5}{50} \times 100\% = 10\%$
        * Stock B: Average $= \$100$, $s = \$5$. $CV = \frac{5}{100} \times 100\% = 5\%$
        * Result: Stock B is less variable relative to its price.
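The stock comparison can be written as a small helper. This is a sketch; the function name `coefficient_of_variation` and its parameter names are illustrative, and it assumes a non-zero mean.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation as a percentage of the mean (mean must be non-zero)."""
    return 100 * std_dev / mean_value

cv_a = coefficient_of_variation(5, 50)    # Stock A: s = 5, average = 50
cv_b = coefficient_of_variation(5, 100)   # Stock B: s = 5, average = 100

print(cv_a, cv_b)
```

Both stocks have the same standard deviation in dollars, yet the CV shows that the spread is half as large relative to price for Stock B.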

Measures of Position

  • Quartiles:
    * Three values that divide an ordered data set into four equal parts.
    * $Q_1$ (First Quartile): Median of the lower half; about $25\%$ of entries fall at or below it.
    * $Q_2$ (Second Quartile): The median of the whole data set; about $50\%$ of entries fall at or below it.
    * $Q_3$ (Third Quartile): Median of the upper half; about $75\%$ of entries fall at or below it.

  • Interquartile Range (IQR):
    * $IQR = Q_3 - Q_1$
    * Represents the range of the middle $50\%$ of the data set.

  • Box-and-Whisker Plot:
    * Tool for highlighting data features using the Five-Number Summary:
        1. Minimum entry
        2. $Q_1$
        3. $Q_2$ (Median)
        4. $Q_3$
        5. Maximum entry
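The five-number summary can be computed with the "median of each half" definition of the quartiles. The sketch below assumes one particular convention: when the number of entries is odd, the middle entry is left out of both halves. Statistical software often uses interpolation instead and can return slightly different quartiles. The function name `five_number_summary` is illustrative, and the data reuse the ages from the median example.

```python
from statistics import median

def five_number_summary(values):
    """(min, Q1, median, Q3, max) using the median-of-halves convention.

    Assumes at least two entries. For odd n the middle entry is
    excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # entries below the median position
    upper = ordered[half + n % 2:]      # entries above the median position
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

ages = [32, 39, 44, 53, 57, 57, 61]
summary = five_number_summary(ages)
iqr = summary[3] - summary[1]           # Q3 - Q1

print(summary, iqr)
```

The five values are exactly what a box-and-whisker plot draws: the whisker ends, the box edges, and the line inside the box.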

  • Outlier Detection (Rule of Thumb):
    * An entry is a potential outlier if it falls outside the following bounds:
    * Lower Bound: $Q_1 - 1.5(Q_3 - Q_1)$
    * Upper Bound: $Q_3 + 1.5(Q_3 - Q_1)$
    * Example Data (Outlier Check):
        * Set 2: $104, 104, 105, 106, 107, 108, 112, 114, 122, 124, 125$
        * Given: $Q_1 = 107$, $\text{Median} = 108$, $Q_3 = 112$.
        * $IQR = 112 - 107 = 5$
        * $1.5 \times IQR = 7.5$
        * Lower Limit: $107 - 7.5 = 99.5$
        * Upper Limit: $112 + 7.5 = 119.5$
        * Entries $122, 124, 125$ are potential outliers.
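The bounds calculation can be sketched as a helper that takes the quartiles as inputs. Here $Q_1$ and $Q_3$ are taken as stated in the example rather than recomputed from the data, since different quartile conventions give different values. The function name `outlier_bounds` and the parameter `k` are illustrative.

```python
def outlier_bounds(q1, q3, k=1.5):
    """Lower and upper bounds: k times the IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [104, 104, 105, 106, 107, 108, 112, 114, 122, 124, 125]

# Quartiles as stated in the example (Q1 = 107, Q3 = 112).
lower, upper = outlier_bounds(107, 112)

# Entries outside the bounds are flagged as potential outliers.
flagged = [x for x in data if x < lower or x > upper]

print(lower, upper, flagged)
```

Flagged entries are candidates for a closer look, not values to discard automatically.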