Quantitative Methods in Health Sciences: Descriptive Statistics and Continuous Variables

Data Classification and Statistics

Definitions and Scope:
* Data: Information derived from observations, counts, measurements, or responses.
* Statistics: The science of collecting, organizing, analyzing, and interpreting data in order to make informed decisions.
Types of Data:
* Qualitative Data: Consists of attributes, labels, or non-numerical entries. Categorized as:
  * Nominal: Labels or names with no mathematical order.
  * Ordinal: Categories that can be arranged in a specific order or rank.
* Quantitative Data: Consists of numerical measurements or counts. Categorized as:
  * Interval: Differences between entries are meaningful, but there is no true zero point.
  * Ratio: Differences between entries are meaningful and there is an inherent zero point.
Variable Classifications:
* Discrete Variable: A quantitative variable whose values are counted, so it takes separate, countable values (typically whole numbers).
* Continuous Variable: A quantitative variable whose values are measured, so it can take any value within a range, including decimals.
* Qualitative (Categorical) Variable: Represents categories such as gender, ethnicity, or group membership.
Example: GPA Data Identification:
* Data: Sally (3.22), Bob (3.98), Cindy (2.75), Mark (2.24), Kathy (3.84).
* The names (Sally, Bob, etc.) are Qualitative data.
* The Grade Point Average (GPA) values are Quantitative data.

Branches of Statistics

Descriptive Statistics:
* Involves the organization, summarization, and visual display of data.
* The primary goal is to turn raw data into accessible information.
Inferential Statistics:
* Involves using a sample to draw conclusions about a larger population.
Practical Example: Sleep Study:
* Study Detail: Volunteers who slept less than 6 hours were four times more likely to answer incorrectly on a science test than participants who slept at least 8 hours.
* Descriptive Part: The statement "four times more likely to answer incorrectly" describes the sample data directly.
* Inferential Conclusion: From the sample, one infers that all individuals who sleep less than 6 hours are more likely to answer science questions incorrectly than those who sleep at least 8 hours.
The Role of Statistics in Experimentation (Three-Step Process):
* Step 1: Experimentation: Two teaching methods (Method A and Method B) are compared in a population of first-grade children. Each method is applied to one sample of students, yielding test scores for two samples.
  * Sample A Results: $73, 75, 72, 79, 76, 77, 75, 77, 72, 75, 76, 78, 80, 74, 76, 78, 73, 77, 74, 81, 76$.
  * Sample B Results: $68, 70, 73, 71, 67, 72, 70, 71, 75, 68, 70, 71, 72, 74, 69, 72, 73, 70, 70, 77, 77, 69$.
* Step 2: Descriptive Statistics: Organizing and simplifying the data from Sample A and Sample B.
  * Sample A Average Score $\approx 76$ (rounded to the nearest point).
  * Sample B Average Score $\approx 71$ (rounded to the nearest point).
* Step 3: Inferential Statistics: Interpreting the results. The sample averages differ by about 5 points. Researchers must decide between two interpretations:
  1. There is actually no difference between the methods, and the observed difference is due to chance (sampling error).
  2. There is a real difference between the methods, and the data reflect it accurately.
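Step 2 can be reproduced in a few lines of Python. The sketch below is illustrative only: it uses the standard-library `statistics` module, and the names `sample_a`, `sample_b`, and `difference` are labels chosen here, not part of the original study.

```python
from statistics import mean

# Test scores copied from Step 1 of the example.
sample_a = [73, 75, 72, 79, 76, 77, 75, 77, 72, 75, 76,
            78, 80, 74, 76, 78, 73, 77, 74, 81, 76]
sample_b = [68, 70, 73, 71, 67, 72, 70, 71, 75, 68, 70,
            71, 72, 74, 69, 72, 73, 70, 70, 77, 77, 69]

# Descriptive statistics: one summary number per sample.
mean_a = mean(sample_a)
mean_b = mean(sample_b)
difference = mean_a - mean_b

print(f"Sample A: n = {len(sample_a)}, mean = {mean_a:.1f}")
print(f"Sample B: n = {len(sample_b)}, mean = {mean_b:.1f}")
print(f"Difference in means = {difference:.1f}")
```

The unrounded means are about 75.9 and 71.3, which round to the 76 and 71 reported in Step 2. Whether the gap reflects a real effect or sampling error is the Step 3 question, and this code does not answer it.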

Measures of Central Tendency (Measures of Location)

Overview of Central Tendency:
* A measure of central tendency represents a typical or central entry in a data set.
* If a distribution is perfectly "Normal" (bell curve), the Mean, Median, and Mode are identical.
The Mean (Arithmetic Average):
* Calculated as the sum of the entries divided by the number of entries ($N$ for a population, $n$ for a sample).
* Population Mean (mu):
  * $\mu = \frac{\sum x}{N}$
* Sample Mean (x-bar):
  * $\bar{x} = \frac{\sum x}{n}$
* Characteristic: It is the most common measure but is highly sensitive to outliers (extreme values).
* Example (Effect of Outliers):
  * Set 1 ($1, 2, 3, 4, 5$): $\text{Mean} = \frac{15}{5} = 3$
  * Set 2 ($1, 2, 3, 4, 10$): $\text{Mean} = \frac{20}{5} = 4$
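The outlier example can be checked directly. This is a minimal sketch that applies the formula (sum of entries divided by number of entries) by hand; the function name `arithmetic_mean` is an illustrative choice.

```python
def arithmetic_mean(values):
    """Sum of the entries divided by the number of entries."""
    return sum(values) / len(values)

set_1 = [1, 2, 3, 4, 5]
set_2 = [1, 2, 3, 4, 10]  # same as set_1 except the largest entry

mean_1 = arithmetic_mean(set_1)
mean_2 = arithmetic_mean(set_2)

print(mean_1, mean_2)
```

Changing a single entry from 5 to 10 moves the mean from 3 to 4, even though the other four entries are untouched.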
The Median:
* The value in the exact middle of an ordered data set ($50\%$ of entries above, $50\%$ below).
* Characteristic: It is resistant to outliers; changing an extreme value usually leaves the median unchanged.
* Determining Position (with the data sorted in ascending order):
  * $\text{Position} = \frac{n + 1}{2}$
  * If $n$ is odd: The median is the single middle number.
  * If $n$ is even: The position falls halfway between two entries, and the median is the average of those two middle numbers.
* Example (Odd Set): Data: $32, 39, 44, 53, 57, 57, 61$. Here $n = 7$, so the position is $\frac{7 + 1}{2} = 4$ and the median is $53$.
* Example (Even Set): Data: $1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00$. Here $n = 10$, so the position is $5.5$ and the median is $\frac{2.53 + 2.71}{2} = 2.62$.
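The odd and even cases can be sketched as a small function. The name `median_of` is a label chosen here; in practice the standard library's `statistics.median` follows the same rule.

```python
def median_of(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a single entry (index n // 2, zero-based).
        return ordered[middle]
    # Even n: average the two entries on either side of position (n + 1) / 2.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_set = [32, 39, 44, 53, 57, 57, 61]
even_set = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]

median_odd = median_of(odd_set)
median_even = median_of(even_set)

print(median_odd, round(median_even, 2))
```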
The Mode:
* The data entry occurring with the greatest frequency.
* If no entry repeats, there is no mode. If two or more entries share the greatest frequency, the data set is bimodal (two modes) or multimodal (more than two).
* Example: Ages $53, 32, 61, 57, 39, 44, 57$. The value $57$ occurs twice and every other value occurs once, so Mode = $57$.
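A minimal sketch of these rules follows. It assumes the convention stated above, that a data set with no repeated entry has no mode; the helper name `modes` is hypothetical. The standard library's `statistics.multimode` would instead return every value when nothing repeats.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No entry repeats, so by the convention above there is no mode.
        return []
    return [value for value, count in counts.items() if count == highest]

ages = [53, 32, 61, 57, 39, 44, 57]
age_modes = modes(ages)

print(age_modes)
```

A result with two entries corresponds to a bimodal data set, and more than two to a multimodal one.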
Which measure is "Best"?
* Mean: The general standard, unless outliers exist.
* Median: Best when extreme values are present (e.g., house prices in Ottawa).

Shapes of Distributions

Symmetric Distribution:
* A vertical line drawn through the middle creates mirror-image halves.
* For a symmetric distribution with a single peak: $\text{Mean} = \text{Median} = \text{Mode}$
Uniform (Rectangular) Distribution:
* All entries/classes have equal frequencies. This is also a type of symmetric distribution.
Skewed Left (Negatively Skewed):
* The "tail" extends to the left.
* $\text{Mean} < \text{Median}$
* Example: Mode/Median = $25{,}000$, Mean = $23{,}500$.
Skewed Right (Positively Skewed):
* The "tail" extends to the right.
* $\text{Mean} > \text{Median}$
* Example: Mode/Median = $25{,}000$, Mean = $121{,}500$ (driven up by a $1{,}000{,}000$ outlier).
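The mean-versus-median comparison can be turned into a quick check. The sketch below is only a heuristic, not a formal skewness statistic, and both data sets are hypothetical values invented for illustration (they are not the figures behind the examples above).

```python
from statistics import mean, median

def skew_direction(values):
    """Heuristic: compare mean and median to guess the direction of the tail."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # mean pulled up by large values
    if m < med:
        return "left"    # mean pulled down by small values
    return "symmetric or inconclusive"

# Hypothetical incomes with one very large value (tail to the right).
incomes = [20_000, 22_000, 25_000, 25_000, 25_000, 28_000, 1_000_000]
# Hypothetical scores with one very small value (tail to the left).
scores = [1, 20, 22, 25, 25, 25, 27]

income_skew = skew_direction(incomes)
score_skew = skew_direction(scores)

print(income_skew, score_skew)
```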

Measures of Variation (Measures of Dispersion)

Range:
* $\text{Range} = \text{Maximum entry} - \text{Minimum entry}$
* Disadvantages: Ignores how the data are distributed between the extremes; highly sensitive to outliers.
* Example: Stock prices from $56$ to $67$. Range = $67 - 56 = 11$.
Deviation:
* The difference between an entry $x$ and the mean $\mu$.
* $\text{Deviation} = x - \mu$
* The sum of deviations $\sum(x - \mu)$ is always equal to $0$.
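Both quantities are easy to verify numerically. The sketch below reuses Set 1 ($1, 2, 3, 4, 5$) from the mean example; the variable names are illustrative.

```python
data = [1, 2, 3, 4, 5]

# Range: maximum entry minus minimum entry.
data_range = max(data) - min(data)

# Deviations: each entry minus the mean.
mu = sum(data) / len(data)
deviations = [x - mu for x in data]
total_deviation = sum(deviations)

print(data_range)
print(deviations)
print(total_deviation)
```

The deviations cancel because the negative and positive distances from the mean balance exactly. With non-integer data, floating-point rounding can leave a sum that is tiny rather than exactly zero.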
Variance and Standard Deviation:
* Population Variance (sigma squared):
  * $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
* Population Standard Deviation (sigma):
  * $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
* Sample Variance ($s^2$):
  * $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
* Sample Standard Deviation ($s$):
  * $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
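The four formulas differ only in the divisor ($N$ versus $n - 1$) and the square root. The sketch below writes them out by hand on Set 1 ($1, 2, 3, 4, 5$) from the mean example; the function names are illustrative, and the standard library offers equivalents (`statistics.pvariance`, `pstdev`, `variance`, `stdev`).

```python
import math

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [1, 2, 3, 4, 5]

sigma_squared = population_variance(data)
sigma = math.sqrt(sigma_squared)
s_squared = sample_variance(data)
s = math.sqrt(s_squared)

print(sigma_squared, round(sigma, 3))
print(s_squared, round(s, 3))
```

Treating the same five numbers as a sample rather than a population gives a slightly larger variance, because the sum of squared deviations is divided by 4 instead of 5.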
Degrees of Freedom ($n-1$):
* The number of values that remain free to vary after the data have been used to estimate a parameter (such as the mean).
* Example: If the mean of 3 values is $10$, their sum must be $30$. Given $A = 8$ and $B = 12$, $C$ must be $30 - 8 - 12 = 10$ (it is not free to vary).
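The constraint in this example can be written as a one-line calculation. The helper `last_value` below is a hypothetical name used only to illustrate why one value is no longer free once the mean is fixed.

```python
def last_value(target_mean, n, known_values):
    """Given the mean of n values and n - 1 of the values, return the remaining one."""
    if len(known_values) != n - 1:
        raise ValueError("exactly n - 1 values must be known")
    # The n values must sum to n * mean, so the last value is whatever is left over.
    return target_mean * n - sum(known_values)

c = last_value(target_mean=10, n=3, known_values=[8, 12])

print(c)
```

Only $n - 1 = 2$ of the three values could be chosen freely; the third was determined by the mean.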
Coefficient of Variation (CV):
* Measures relative variation as a percentage.
* $CV = \frac{s}{\bar{x}} \times 100\%$
* Utility: Allows comparison of variation between data sets with different units or different means.
* Comparison Example:
  * Stock A: Average = $\$50$, $s = \$5$. $CV = \frac{5}{50} \times 100\% = 10\%$
  * Stock B: Average = $\$100$, $s = \$5$. $CV = \frac{5}{100} \times 100\% = 5\%$
  * Result: Stock B is less variable relative to its price.
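The stock comparison can be reproduced as follows. This is a small sketch; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev * 100 / mean_value

cv_a = coefficient_of_variation(std_dev=5, mean_value=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean_value=100)   # Stock B

print(f"Stock A: {cv_a}%  Stock B: {cv_b}%")
```

Both stocks have the same standard deviation in dollars, but the CV shows that a \$5 swing is twice as large relative to a \$50 price as it is relative to a \$100 price.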

Measures of Position

Quartiles:
* Divide an ordered data set into four equal parts.
* $Q_1$ (First Quartile): Median of the lower half; about $25\%$ of the entries fall at or below it.
* $Q_2$ (Second Quartile): The median of the whole data set; about $50\%$ of the entries fall at or below it.
* $Q_3$ (Third Quartile): Median of the upper half; about $75\%$ of the entries fall at or below it.
Interquartile Range (IQR):
* $IQR = Q_3 - Q_1$
* Represents the range of the middle $50\%$ of the data set.
Box-and-Whisker Plot:
* Tool for highlighting data features using the Five-Number Summary:
  1. Minimum entry
  2. $Q_1$
  3. $Q_2$ (Median)
  4. $Q_3$
  5. Maximum entry
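The five-number summary and the IQR can be sketched using the "median of each half" definition of the quartiles. The example reuses the ages from the odd-set median example ($32, 39, 44, 53, 57, 57, 61$). The function name is illustrative, and the choice to exclude the median from both halves when $n$ is odd is an assumption: statistical software often uses other quartile conventions that give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle entry belongs to neither half (assumed convention).
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

ages = [32, 39, 44, 53, 57, 57, 61]
minimum, q1, q2, q3, maximum = five_number_summary(ages)
iqr = q3 - q1

print(minimum, q1, q2, q3, maximum)
print(iqr)
```

In a box-and-whisker plot the box would span $Q_1$ to $Q_3$ with a line at the median, and the whiskers would extend to the minimum and maximum. The function needs at least two entries so that both halves are non-empty.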
Outlier Detection (Rule of Thumb):
* An entry is a potential outlier if it falls outside the following bounds:
  * Lower Bound: $Q_1 - 1.5(Q_3 - Q_1)$
  * Upper Bound: $Q_3 + 1.5(Q_3 - Q_1)$
* Example Data (Outlier Check):
  * Set 2: $104, 104, 105, 106, 107, 108, 112, 114, 122, 124, 125$
  * With $n = 11$, the median is the 6th entry, $108$. $Q_1$ is the median of the lower half ($104, 104, 105, 106, 107$), so $Q_1 = 105$. $Q_3$ is the median of the upper half ($112, 114, 122, 124, 125$), so $Q_3 = 122$.
  * $IQR = 122 - 105 = 17$
  * $1.5 \times IQR = 25.5$
  * Lower Bound: $105 - 25.5 = 79.5$
  * Upper Bound: $122 + 25.5 = 147.5$
  * Every entry lies between $79.5$ and $147.5$, so the rule flags no potential outliers in this set.
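The rule of thumb can be sketched as a short function. The data set below is hypothetical, invented so that one value is clearly extreme; the function name `outlier_bounds` is also an illustrative choice. Quartiles are taken as medians of the lower and upper halves, excluding the overall median when $n$ is odd, which is one convention among several.

```python
from statistics import median

def outlier_bounds(values):
    """Return (lower_bound, upper_bound, potential_outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    flagged = [x for x in ordered if x < lower_bound or x > upper_bound]
    return lower_bound, upper_bound, flagged

# Hypothetical data: eight ordinary values and one extreme value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]
lower_bound, upper_bound, flagged = outlier_bounds(data)

print(lower_bound, upper_bound)
print(flagged)
```

Here $Q_1 = 4.5$ and $Q_3 = 9.5$, so the bounds are $-3$ and $17$ and only the value 30 is flagged. A flagged entry is a candidate for closer inspection, not automatically an error.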