Chi-Square Test Notes

The Chi-Square test is a statistical test used in lab reports.
The lecture aims to explain the meaning of the numbers generated by the Chi-Square test, how to obtain them using JASP, and how to report them in a lab report.

Continuous data: Numbers, including ratio and interval data.
- Ratio data: Numbers with equal spacing, where zero indicates the absence of the measured quantity. Examples include counting.
- Interval data: Numbers with equal spacing, where zero is just another number on the scale.
  *Example: temperatures between -10 and 30 degrees Celsius are interval data, because zero degrees Celsius is just another point on the scale, not an absence of temperature.
  *Example: time since the lecture started is ratio data, because zero is meaningful.
Categorical data: Labels or categories.
- Binary data: Two options or labels.
- Nominal data: Three or more categories or labels with differences only in name.
- Ordinal data: Categories with an order but no set distance between them.

Distinction: categorical data involves labels, with the dataset sometimes using numbers as labels.
Continuous data includes ratio and interval data, as defined earlier.

A cross tabulation or contingency table is used to look at observed frequencies or actual amounts in the data.
This table displays the distribution of planarians.
It is used to determine expected counts.
The Chi-Square statistic indicates the amount of variation in the table.
Degrees of freedom are used to measure the size of the table.
Degrees of freedom and the Chi-Square statistic are utilized to calculate the p-value.
Process: Start with raw data, organize it into a dataset, summarize the data using descriptive statistics, create a cross-tabulation, calculate expected counts, calculate the Chi-Square statistic, determine degrees of freedom, find the p-value, and then write it up.
A dataset is a way to organize all the information so that each row is a different person and each column is a different variable.
Descriptive statistics are used to summarize key variables.
A cross tabulation uses the same counts as the descriptive statistics, but the two variables cross over each other: one defines the rows and the other defines the columns.
Cells in the middle are contingency cells because they are contingent on the row and the column.
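
As a minimal sketch of how a dataset becomes a cross tabulation, the Python below counts how many people fall into each row-and-column combination. The six people and the label "achievement" are made up for illustration; they are not the lecture's data.

```python
from collections import Counter

# Hypothetical mini dataset: one (identity_status, friendship_quality) pair per person.
people = [
    ("moratorium", "conflicting"),
    ("moratorium", "conflicting"),
    ("moratorium", "supportive"),
    ("achievement", "supportive"),
    ("achievement", "supportive"),
    ("achievement", "mixed"),
]


def cross_tabulate(pairs):
    """Count how many people fall into each (row, column) combination."""
    cell_counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[cell_counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


row_labels, col_labels, observed = cross_tabulate(people)
for label, row in zip(row_labels, observed):
    print(label, dict(zip(col_labels, row)))
```

Each inner list is one row of the table, and each number is a contingency cell.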

Hypothesis: More people with moratorium have conflicting status than expected by chance, based on Jones and L's paper.
Data from 150 students completing a survey on identity status and friendship quality (supportive, conflicting, mixed).
The aim is to find patterns between the variables.

A dataset organizes information with each row representing a person and each column representing a variable.
In JASP, rows are different people and columns are different variables; the variables here are mostly categorical.
Consent is not a variable because everyone in the sample has said yes, so there is no variation.

Descriptive statistics summarize the data, such as means and standard deviations.
*For this analysis, we can count how many people fall into each category to see if the variables are evenly or unevenly balanced.
*Unevenly balanced variables will affect our expected counts.
*Friendship quality will be much more balanced.
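
A rough sketch of this counting step, assuming the responses for one categorical variable are held in a plain Python list. The six responses are invented for illustration only.

```python
from collections import Counter

# Hypothetical responses for one categorical variable.
friendship_quality = [
    "supportive", "conflicting", "mixed",
    "supportive", "mixed", "conflicting",
]


def category_summary(values):
    """Return {label: (count, proportion)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {label: (n, n / total) for label, n in counts.items()}


summary = category_summary(friendship_quality)
for label, (count, proportion) in summary.items():
    print(f"{label}: {count} ({proportion:.0%})")
```

Similar proportions across the labels suggest a balanced variable; very different proportions suggest an unbalanced one.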

It's crucial to remember that even though we're using numbers, it's a summary of categorical data, not continuous data.
Descriptive statistics are important for working out expected counts and form the basis for everything moving forward.

The fun thing to do is calculate the expected counts for each cell.
$E_{ij} = (N_i * N_j) / N$ where:
- $E$ = Expected count.
- $i$ is the row.
- $j$ is the column.
- $N_i$ is the number of people in row i.
- $N_j$ is the number of people in column j.
- $N$ is the total sample size.
This formula calculates the expected counts if the variables are unrelated, based solely on the size of the categories.
Comparing observed counts to expected counts helps determine if there's a relationship between variables.
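
The expected-count formula can be sketched in a few lines of Python. The 2 x 2 table of observed counts is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def expected_counts(observed):
    """Expected count per cell: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical observed counts (rows x columns), not the lecture's data.
observed = [
    [10, 20],
    [30, 40],
]

expected = expected_counts(observed)
for row in expected:
    print(row)
```

For the top-left cell, the row total is 30, the column total is 40, and the sample size is 100, so the expected count is 30 * 40 / 100 = 12.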

Chi-Square tests how much of the variation in the table is due to the relationship between the variables and how much is due to chance and the size of the categories.
Standardized residuals look at the difference between the observed and expected counts in each cell.
They indicate where the patterns are to explain the relationship between variables.
The formula for Chi-Square is: $\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ where:
- $\chi^2$ is the Chi-Square statistic.
- $O_{ij}$ is the observed count in cell ij.
- $E_{ij}$ is the expected count in cell ij.
The residual is equal to Observed - Expected.
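
A minimal sketch of the Chi-Square calculation, assuming the observed counts are stored as a list of rows. The table is hypothetical; the function recomputes the expected counts internally so that it stands alone.

```python
def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            residual = o - e                       # observed - expected
            chi2 += residual ** 2 / e
    return chi2


# Hypothetical observed counts, not the lecture's data.
observed = [
    [10, 20],
    [30, 40],
]

chi2 = chi_square_statistic(observed)
print(round(chi2, 4))
```

Squaring the residual stops positive and negative differences from cancelling out, and dividing by the expected count scales each cell by how large it was expected to be.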

Degrees of freedom (df) represent the number of cells that are free to vary, assuming the totals are known.
Demonstration: In an equation where four numbers add up to 10, three numbers can be freely chosen, but the fourth is determined once the first three are known.
In a contingency table, degrees of freedom are calculated as (number of rows - 1) * (number of columns - 1).
- $df = (rows - 1) * (columns - 1)$
With four rows and three columns, the degrees of freedom are (4 - 1) * (3 - 1) = 6.
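
Both ideas can be checked with a short sketch. The three freely chosen numbers are arbitrary picks for illustration.

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a contingency table: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Four numbers must add up to 10: pick any three, and the fourth is forced.
total = 10
free_choices = [2, 3, 4]  # arbitrary illustrative picks
forced_value = total - sum(free_choices)

print(forced_value)              # the only value the fourth number can take
print(degrees_of_freedom(4, 3))  # the four-row, three-column table
```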

Using a Chi-Square p-value table (found in statistics books or online), you can determine the range of the p-value based on the Chi-Square statistic and degrees of freedom.
With statistical programs like JASP, the exact p-value can be obtained.
Example: If the Chi-Square statistic is smaller than 22.46 and higher than 16.81 with 6 degrees of freedom, the p-value falls between 0.001 and 0.01.
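
For readers who want to see where the table values come from, here is a sketch that computes the upper-tail p-value using only Python's standard library. It relies on the closed-form series for whole-number degrees of freedom; the function name is made up for this example, and this is not a description of how JASP computes the value. The statistic of 18.0 is a hypothetical input that lies between the two table values above.

```python
import math


def chi_square_p_value(chi2, df):
    """Upper-tail probability for a chi-square statistic with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive whole number")
    if chi2 <= 0:
        return 1.0

    half = chi2 / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^i / i! for i = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total

    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    term, total = 1.0, 0.0
    for i in range((df - 1) // 2):
        if i > 0:
            term *= chi2 / (2 * i + 1)
        total += term
    tail = math.erfc(math.sqrt(half))
    return tail + math.sqrt(2 * chi2 / math.pi) * math.exp(-half) * total


p = chi_square_p_value(18.0, 6)  # hypothetical statistic with 6 df
print(f"{p:.4f}")
```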

To determine if observed counts are significantly larger or smaller than expected, standardized residuals are examined.
Formula: $\text{Standardized Residual} = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}$
Using this formula, for a one-tailed test, an absolute value greater than 1.64 is considered significant. For a two-tailed test, an absolute value greater than 1.96 is considered significant.
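
A small sketch of the residual calculation, using the two-tailed cutoff of 1.96 from above. The observed counts are hypothetical, and the expected counts are recomputed inside the function so the example stands alone.

```python
import math


def standardized_residuals(observed):
    """(Observed - Expected) / sqrt(Expected) for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    residuals = []
    for i, row in enumerate(observed):
        out_row = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            out_row.append((o - e) / math.sqrt(e))
        residuals.append(out_row)
    return residuals


# Hypothetical observed counts, not the lecture's data.
observed = [
    [10, 20],
    [30, 40],
]

residuals = standardized_residuals(observed)
for row in residuals:
    # Flag cells whose absolute residual exceeds the two-tailed cutoff.
    print([(round(r, 2), abs(r) > 1.96) for r in row])
```

A positive residual means more people in the cell than expected; a negative residual means fewer.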

JASP (Just Another Statistics Program) is a free, open-source statistics program.
Filtering data: Double click on a variable and select the categories to exclude.
To run a Chi-Square test:
- Go to Frequencies.
- Select Contingency Table.
- Drag variables to columns and rows.
- In Cells, request expected counts and Pearson residuals (standardized residuals).

Descriptive statistics should accurately represent all information.
The distribution of variables should be described, noting if they are even or uneven.
*Example: "The distribution of one variable is pretty even. The distribution of the other variable [is] unbalanced."
Inferential statistics should include the test name, Chi-Square value, degrees of freedom, and p-value.
After reporting the significance, describe what that means using standardized residuals.
*When writing up the Chi-Square test, you must acknowledge that the pattern you predicted may not be the one driving the relationship.
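
As a rough sketch, the numbers for the inferential sentence can be assembled with a small helper. The layout (degrees of freedom and N in brackets, no leading zero on p) follows a common APA-style convention and is an assumption; check your unit's reporting guide. The function name and the statistic and p-value passed in are hypothetical; only the sample size of 150 comes from the notes.

```python
def format_chi_square_report(chi2, df, n, p):
    """Build a one-line summary: statistic, degrees of freedom, N, and p-value."""
    if p < 0.001:
        p_text = "p < .001"
    else:
        # Drop the leading zero, e.g. 0.006 becomes .006
        p_text = f"p = {p:.3f}".replace("0.", ".", 1)
    return f"χ²({df}, N = {n}) = {chi2:.2f}, {p_text}"


# Hypothetical statistic and p-value; N = 150 matches the survey sample.
report = format_chi_square_report(chi2=18.0, df=6, n=150, p=0.0062)
print(report)
```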