Data Management and Chi-Square Tests

Data Management

What is Data?
- Data are raw information or facts.
- Data become useful information when organized in a meaningful way.
- Data can be qualitative or quantitative.
What is Data Management?
- Data Management is concerned with “looking after” and processing data.
- It involves:
  - Looking after field data sheets
  - Checking and correcting the raw data
  - Preparing data for analysis
  - Documenting and archiving the data and meta-data
Importance of Data Management
- Ensures high-quality data for analysis, leading to correct conclusions.
- Allows further use of the data in the future and enables efficient integration of results with other studies.
- Leads to improved processing efficiency, improved data quality, and improved meaningfulness of the data.

Planning and Conducting an Experiment or Study

A. Methods of Data Collection

Census
- Systematically acquiring and recording information about all members of a given population.
- Researchers rarely survey the entire population due to:
  - High cost
  - Dynamic population (individuals change over time)
Sample Survey
- Selection of a subset within a population to yield knowledge about the population of concern.
- Advantages of sampling:
  - Lower cost
  - Faster data collection
  - Improved accuracy and quality due to smaller dataset
Experiment
- Performed when there are some controlled variables (such as a particular treatment).
- The intention is to study their effect on other observed variables (such as the health of patients).
- Main requirement is the possibility of replication.
Observational Study
- Appropriate when there are no controlled variables and replication is impossible.
- Typically uses a survey.
- Example: Exploring the correlation between smoking and lung cancer by collecting observations of smokers and non-smokers.

B. Planning and Conducting Surveys

Characteristics of a Well-Designed and Well-Conducted Survey
- Must be representative of the population.
- Incorporates a chance mechanism (e.g., a random number generator) so that probabilistic results can be applied.
- Wording of questions must be neutral.
- Possible sources of errors and biases should be controlled.
- Sampling frame: A subset of the population that is possible to measure.
- Survey plan should specify a sampling method, determine the sample size, and outline steps for implementing the sampling plan and data collection.
Sampling Methods
- a. Nonprobability Sampling
  - Any sampling method where some elements of the population have no chance of selection, or the probability of selection can’t be accurately determined.
  - Selection is based on criteria other than randomness.
  - Gives rise to exclusion bias.
  - Does not allow the estimation of sampling errors.
  - Limited information about the relationship between sample and population, making it difficult to extrapolate from the sample to the population.
  - Example: interviewing the first person to answer the door in every household on a given street.
  - Examples of nonprobability sampling:
    - Convenience sampling (customers in a supermarket are asked questions).
    - Quota sampling (judgment is used to select the subjects based on specified proportions).
  - Nonresponse effects may turn any probability design into a nonprobability design if the characteristics of nonresponse are not well understood.
- b. Probability Sampling
  - It is possible to both determine which sampling units belong to which sample and the probability that each sample will be selected.
  - Examples of probability sampling methods:
    - i. Simple Random Sampling (SRS)
      - All samples of a given size have an equal probability of being selected, and selections are independent.
      - The frame is not subdivided or partitioned.
      - The sample variance is a good indicator of the population variance, which makes it relatively easy to estimate the accuracy of results.
      - Vulnerable to sampling error because the randomness of the selection may result in a sample that doesn’t reflect the makeup of the population.
      - SRS cannot accommodate the needs of researchers interested in research questions specific to subgroups of the population.
    - ii. Systematic Sampling
      - Relies on dividing the target population into strata (subpopulations) of equal size and then selecting randomly one element from the first stratum and corresponding elements from all other strata.
      - A simple example would be to select every 10th name from the telephone directory, with the first selection being random.
      - Helps to spread the sample over the list.
      - "Every 10th" sampling is especially useful for efficient sampling from databases.
      - Vulnerable to periodicities in the list.
      - Theoretical properties make it difficult to quantify its accuracy.
      - Systematic sampling is not SRS because different samples of the same size have different selection probabilities.
    - iii. Stratified Sampling
      - When the population embraces a number of distinct categories, the frame can be organized by these categories into separate “strata”.
      - Each stratum is then sampled as an independent sub-population.
      - Dividing the population into strata can enable researchers to draw inferences about specific subgroups that may be lost in a more generalized random sample.
      - Since each stratum is treated as an independent population, different sampling approaches can be applied to different strata.
      - Implementing such an approach can increase the cost and complexity of sample selection.
      - A stratified sampling approach is most effective when three conditions are met:
        - Variability within strata is minimized.
        - Variability between strata is maximized.
        - The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
    - iv. Cluster Sampling
      - It is cheaper to ‘cluster’ the sample in some way (e.g. by selecting respondents from certain areas only, or certain time-periods only).
      - Cluster sampling is an example of two-stage random sampling:
        - In the first stage, a random sample of areas is chosen.
        - In the second stage, a random sample of respondents within those areas is selected.
      - This works best when each cluster is a small copy of the population.
      - Can reduce travel and other administrative costs.
      - Generally increases the variability of sample estimates above that of simple random sampling, depending on how the clusters differ between themselves, as compared with the within-cluster variation.
      - If clusters chosen are biased in a certain way, inferences drawn about population parameters will be inaccurate.
    - v. Matched Random Sampling
      - There are two (2) samples in which the members are clearly paired, or are matched explicitly by the researcher (for example, IQ measurements or pairs of identical twins).
      - Alternatively, the same attribute, or variable, may be measured twice on each subject, under different circumstances (e.g. the milk yields of cows before and after being fed a particular diet).
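
The probability sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame of 100 unit IDs, the "urban"/"rural" strata, the `area_*` clusters, and all function names are hypothetical and do not come from the notes.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, step, rng):
    """Pick a random start among the first `step` units, then take every `step`-th unit."""
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw an independent simple random sample inside each stratum."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Two-stage sampling: random clusters first, then random units within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

rng = random.Random(42)      # fixed seed only so the sketch is repeatable
frame = list(range(1, 101))  # hypothetical sampling frame of 100 unit IDs
strata = {"urban": frame[:60], "rural": frame[60:]}  # hypothetical strata
clusters = {f"area_{i}": frame[i * 10:(i + 1) * 10] for i in range(10)}  # hypothetical areas

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = stratified_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 3, 4, rng)
```

Note that the systematic sample can only be one of 10 possible samples (one per starting position), which is why it is not a simple random sample.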

C. Planning and Conducting Experiments

Characteristics of a Well-Designed and Well-Conducted Experiment
- Stating the purpose of research, including estimates regarding the size of treatment effects, alternative hypotheses, and the estimated experimental variability.
- Experiments must compare the new treatment with at least one (1) standard treatment, to allow an unbiased estimate of the difference in treatment effects.
- Design of experiments, using blocking (to reduce the influence of confounding variables) and randomized assignment of treatments to subjects
- Examining the data set in secondary analyses, to suggest new hypotheses for future study
- Documenting and presenting the results of the study
- Example: The Hawthorne study examined changes to the working environment but was criticized for the lack of a control group and of blinding.
Treatment, Control Groups, Experimental Units, Random Assignments and Replication
- a. Control groups and experimental units
  - To be able to compare effects and make inference about associations or predictions, one typically has to subject different groups to different conditions.
  - Usually, an experimental unit is subjected to treatment and a control group is not.
- b. Random Assignments
  - The second fundamental design principle is randomization of allocation of (controlled variables) treatments to units.
  - The treatment effects, if present, will be similar within each group.
- c. Replication
  - All measurements, observations or data collected are subject to variation, as there are no completely deterministic processes.
  - To reduce variability, in the experiment the measurements must be repeated.
  - The experiment itself should allow for replication, so that it can be checked by other researchers.
Sources of Bias and Confounding, Including Placebo Effect and Blinding
- Sources of bias specific to medicine are confounding variables and placebo effects, among others.
- a. Confounding
  - A confounding variable is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable.
  - The methodologies of scientific studies therefore need to control for these factors to avoid a false positive (Type I) error.
  - Example: The statistical relationship between ice cream sales and drowning deaths.
- b. Placebo and blinding
  - A placebo is an imitation pill identical to the actual treatment pill, but without the treatment ingredients.
  - A placebo effect occurs when a sham (or simulated) medical intervention has no direct health impact but still results in actual improvement of a medical condition because the patients know they are being treated.
  - Blinding is a technique used to make the subjects “blind” to which treatment is being given.
- c. Blocking
  - Is the arranging of experimental units in groups (blocks) that are similar to one another.
  - Typically, a blocking factor is a source of variability that is not of primary interest to the experimenter.
  - An example of a blocking factor might be the sex of a patient.
Completely Randomized Design, Randomized Block Design and Matched Pairs
- a. Completely Randomized Designs
  - Are for studying the effects of one primary factor without the need to take other nuisance variables into account.
  - The experiment compares the values of a response variable (like health improvement) based on the different levels of that primary factor (e.g., different amounts of medication).
  - For completely randomized designs, the levels of the primary factor are randomly assigned to the experimental units (for example, using a random number generator).
- b. Randomized Block Design
  - Is a collection of completely randomized experiments, each run within one of the blocks of the total experiment.
  - A matched pairs design is its special case when the blocks consist of just two (2) elements (measurements on the same patient before and after the treatment, or measurements on two (2) different but in some way similar patients).
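
As a rough sketch of how random assignment might look in practice, the snippet below assigns treatments in a completely randomized design and then in a randomized block design. The patient IDs, the two treatment labels, the blocking by sex, and the function names are all hypothetical choices made for illustration.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle the units, then deal treatments out in turn so group sizes stay balanced."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

def randomized_block(blocks, treatments, rng):
    """Run a separate completely randomized assignment inside every block."""
    assignment = {}
    for units in blocks.values():
        assignment.update(completely_randomized(units, treatments, rng))
    return assignment

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
treatments = ["treatment", "control"]
patients = [f"P{i}" for i in range(1, 13)]  # hypothetical patient IDs

crd = completely_randomized(patients, treatments, rng)

# Hypothetical blocking factor: the first six patients form one block, the rest another.
blocks = {"female": patients[:6], "male": patients[6:]}
rbd = randomized_block(blocks, treatments, rng)
```

In the blocked version each block receives its own balanced split, so the blocking factor cannot be confounded with the treatment.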

Chi-Square

The chi-square test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. There are two (2) types of chi-square tests:

A chi-square goodness-of-fit test determines whether sample data match a population.
A chi-square test for independence compares two (2) variables in a contingency table to see if they are related. It tests to see whether the distributions of categorical variables differ from each other.
- A very small chi-square test statistic means that your observed data fit your expected data well. In other words, the observed and expected frequencies agree closely.
- A very large chi-square test statistic means that the observed data do not fit the expected data well. In other words, the observed and expected frequencies differ by more than chance alone would readily explain.

Assumptions of the Chi-Square Test

- Random sample
- Independent observations for the sample (one observation per subject)
- No expected counts less than five (5)

To calculate the chi-square statistic, $\chi^2$ , use the following formula:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where:
$\chi^2$ is the chi-square test statistic.
$O$ is the observed frequency value for each event.
$E$ is the expected frequency value for each event.
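
The formula translates directly into code. The following is a minimal sketch; the function name and the three-category counts are made up for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20]  # hypothetical observed frequencies
expected = [20, 20, 20]  # hypothetical expected frequencies
stat = chi_square_statistic(observed, expected)
# (18-20)^2/20 + (22-20)^2/20 + (20-20)^2/20 = 0.2 + 0.2 + 0 = 0.4
```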

Goodness of Fit Test

A chi-square goodness-of-fit test is used to test whether a frequency distribution obtained experimentally fits an “expected” frequency distribution that is based on the theoretical or previously known probability of each outcome.

An experiment is conducted in which a simple random sample is taken from a population, and each member of the sample is grouped into exactly one of $k$ categories.

Step 1: The observed frequencies are calculated for the sample.
Step 2: The expected frequencies are obtained from previous knowledge (or belief) or probability theory. In order to proceed to the next step, it is necessary that each expected frequency is at least 5.
Step 3: A hypothesis test is performed:
- a. The null hypothesis $H_0$ : the population frequencies are equal to the expected frequencies.
- b. The alternative hypothesis $H_a$ : the null hypothesis is false.
- c. $\alpha$ is the level of significance.
- d. The degrees of freedom: $k - 1$
- e. A test statistic is calculated: $\chi^2 = \sum \frac{(observed - expected)^2}{expected} = \sum \frac{(O - E)^2}{E}$
- f. From $\alpha$ and $k - 1$ , a critical value is determined from the chi-square table.
- g. Reject $H_0$ if $\chi^2$ is larger than the critical value (right-tailed test).
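
The steps above can be sketched in Python using only the standard library. Instead of looking up a critical value in a table, this sketch computes the right-tail probability (p-value) and rejects $H_0$ when it is below $\alpha$, which gives the same decision as comparing $\chi^2$ with the critical value. The die-roll counts and the function names are hypothetical.

```python
import math

def chi_square_sf(x, df):
    """Right-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... then use the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def goodness_of_fit(observed, probabilities, alpha):
    """Return (statistic, df, p_value, reject_h0) for a chi-square goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    if min(expected) < 5:
        raise ValueError("every expected frequency must be at least 5")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    p_value = chi_square_sf(stat, df)
    return stat, df, p_value, p_value < alpha

# Hypothetical data: 120 rolls of a die that is claimed to be fair.
observed = [25, 17, 15, 23, 24, 16]
stat, df, p_value, reject = goodness_of_fit(observed, [1 / 6] * 6, alpha=0.05)
```

For these made-up counts the statistic is 5.0 with 5 degrees of freedom, so $H_0$ is not rejected at $\alpha = 0.05$.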

Example:

Researchers have conducted a survey of 1600 coffee drinkers asking how much coffee they drink in order to confirm previous studies. Previous studies have indicated that 72% of Americans drink coffee. At $\alpha = 0.05$ , is there enough evidence to conclude that the distributions are the same?

a. The null hypothesis $H_0$ : the population frequencies are equal to the expected frequencies
b. The alternative hypothesis $H_a$ : the null hypothesis is false.
c. $\alpha = 0.05$
d. The degrees of freedom: $k - 1 = 4 - 1 = 3$
e. The test statistic can be calculated:

$\chi^2 = \sum \frac{(O - E)^2}{E} = 8.483$

f. From $\alpha = 0.05$ and $k - 1 = 3$ , the critical value is 7.815.
g. Is there enough evidence to reject $H_0$ ? Since $\chi^2 \approx 8.483 > 7.815$ , there is enough statistical evidence to reject the null hypothesis and to believe that the old percentages no longer hold.

Test of Independence

The chi-square test of independence is used to assess whether two (2) factors are related. Formally, the hypothesis statements for the chi-square test of independence are:

$H_0$ : There is no association between the two (2) categorical variables

$H_1$ : There is an association (the two (2) variables are not independent)

The procedure for the hypothesis test is essentially the same. The differences are that:

a. $H_0$ is that the two (2) variables are independent.
b. $H_a$ is that the two (2) variables are not independent (they are dependent).
c. The expected frequency $E_{r,c}$ for the entry in row $r$ , column $c$ is calculated using:

$E_{r,c} = \frac{(sum \space of \space row \space r) \times (sum \space of \space column \space c)}{total \space sample \space size}$

d. The degrees of freedom: $(number \space of \space rows - 1) \times (number \space of \space columns - 1)$
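
The expected-frequency and degrees-of-freedom formulas can be sketched as follows. The 2 × 3 table here is invented purely for illustration and is not the data from any study mentioned in these notes; the function names are likewise hypothetical.

```python
def expected_counts(table):
    """E[r][c] = (sum of row r) * (sum of column c) / (total sample size)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def independence_test_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical observed counts: 2 rows (outcomes) by 3 columns (groups).
table = [
    [30, 20, 10],
    [20, 30, 40],
]
expected = expected_counts(table)  # [[20.0, 20.0, 20.0], [30.0, 30.0, 30.0]]
stat, df = independence_test_statistic(table)  # stat = 50/3, about 16.67; df = 2
```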

Example:

The results of a random sample of children with pain from musculoskeletal injuries treated with acetaminophen, ibuprofen, or codeine are shown in the table. At $\alpha = 0.10$ , is there enough evidence to conclude that the treatment and result are independent?

a. The null hypothesis $H_0$ : the treatment and response are independent.
b. The alternative hypothesis $H_a$ : the treatment and response are dependent.
c. $\alpha = 0.10$ .
d. The degrees of freedom: $(number \space of \space rows - 1) \times (number \space of \space columns - 1) = (2 - 1) \times (3 - 1) = 1 \times 2 = 2$
e. The test statistic can be calculated using the table below:

$\chi^2 = \sum \frac{(O - E)^2}{E} = 14.07$

f. From $\alpha = 0.10$ and $df = 2$ , the critical value is 4.605.
g. Is there enough evidence to reject $H_0$ ? Since $\chi^2 \approx 14.07 > 4.605$ , there is enough statistical evidence to reject the null hypothesis and to believe that there is a relationship between the treatment and response.

Example:

A doctor believes that the proportions of births in this country on each day of the week are equal. A simple random sample of 700 births from a recent year is selected. At a significance level of 0.01, is there enough evidence to support the doctor’s claim?

a. The null hypothesis $H_0$ : the population frequencies are equal to the expected frequencies
b. The alternative hypothesis $H_a$ : the null hypothesis is false.
c. $\alpha = 0.01$
d. The degrees of freedom: $k - 1 = 7 - 1 = 6$
e. The test statistic can be calculated using a table:

$\chi^2 = \sum \frac{(O - E)^2}{E} = 26.8$

f. From $\alpha = 0.01$ and $k - 1 = 6$ , the critical value is 16.812.
g. Is there enough evidence to reject $H_0$ ? Since $\chi^2 \approx 26.8 > 16.812$ , there is enough statistical evidence to reject the null hypothesis and to believe that the proportion of births is not the same for each day of the week.
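
As a short worked step, the expected frequencies in this example follow from the doctor's claim alone: if all seven days are equally likely, each day has probability $\frac{1}{7}$ , so for the sample of 700 births the expected frequency for every day is

$E = 700 \times \frac{1}{7} = 100$

Each expected frequency is therefore well above 5, so the requirement in Step 2 of the goodness-of-fit procedure is met.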