Ch 13 Inferential Statistics



13.57 Understanding Null Hypothesis Testing


The Purpose of Null Hypothesis Testing

Descriptive statistics → involve measuring 1(+) variables in a sample & computing descriptive summary data (eg means, correlation coefficients) for those variables

ㅁ Researchers' goal is to draw conclusions about the pop that the sample was selected from (not the sample itself). Thus, researchers use sample stats to draw conclusions abt the corresponding values in the pop (parameters)

parameters → corresponding values in the pop

ㅁ sample stats aren’t perfect estimates of their corresponding pop parameters bc there’s a certain amount of random variability in any stat from sample to sample.

sampling error → the random variability in a stat from sample to sample. (term error refers to random variability, not anyone making a mistake)

ㅁ any stat relationship in a sample can be interpreted in 2 ways

→ there’s a relationship in the pop, & relationship in the sample reflects this

→ there’s no relationship in the pop, & relationship in sample reflects only sampling error

ㅁ purpose of null hypo testing is to help researchers decide btwn these 2 interpretations ^^


The Logic of Null Hypothesis Testing

Null hypothesis testing → (often called null hypothesis significance testing or NHST) is a formal approach to deciding btwn 2 interpretations of a stat relationship in a sample

Null hypothesis → one interpretation from ^^. The idea that there’s no relationship in the pop & the relationship in the sample reflects only sampling error (symbolized H0, “H-zero”)

Alternative hypothesis → another interpretation from ^^. This hypothesis proposes that there’s a relationship in the pop & that relationship in the sample reflects this relationship in the pop. (symbolized as H1)

ㅁ Every statistical relationship in a sample can be interpreted in either of these 2 ways: → it might have occurred by chance

→ might reflect a relationship in the pop

Although there r many specific null hypothesis testing techniques, all are based on the same general logic. Steps are:

→ Assume for the moment that the null hypo is true. There’s no relationship btwn the variables in the pop

→ Determine how likely the sample relationship would be if the null hypothesis were true

→ If the sample relationship would be extremely unlikely, then reject the null hypothesis in favor of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis.

Reject the null hypothesis → a decision made by researchers using null hypothesis testing which occurs when the sample relationship would be extremely unlikely

Retain the null hypothesis → a decision made by researchers in null hypothesis testing which occurs when the sample relationship would not be extremely unlikely

p value → a crucial step in null hypo testing. The probability of obtaining the sample result (or a more extreme one) if the null hypo were true. Not the probability that any particular hypo is true or false.

→ a low p value means the sample (or more extreme) result would be unlikely if the null hypo were true & leads to rejection of the null hypo. A not-low p value means the sample (or more extreme) result would be likely if the null hypo were true & leads to retention of the null hypo.

α (alpha) → the criterion that shows how low a p value should be before the sample result is considered unlikely enough to reject the null hypothesis (usually set to .05).

ㅁ if there’s a 5% (or less) chance of a result @least as extreme as the sample result if null hypothesis were true, then null hypo is rejected. When this happens, result said to be statistically significant → an effect that’s unlikely due to random chance & therefore likely represents a real effect in the pop

ㅁ if there’s >5% chance of a result as extreme as the sample result when the null hypo is true, then the null hypo is retained. (This doesn’t necessarily mean the researcher accepts the null hypo as true, just that there isn’t enough evidence to reject it. Use “fail to reject the null hypo” or “retain the null hypo.” Never use “accept the null hypo”)
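The reject/retain logic above can be sketched as a tiny helper (hypothetical — not from the notes; function name made up):

```python
def nhst_decision(p_value, alpha=0.05):
    """Apply the null hypothesis significance testing decision rule:
    reject H0 when p <= alpha, otherwise fail to reject (retain) H0."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(nhst_decision(0.013))  # stat sig at alpha = .05 -> reject H0
print(nhst_decision(0.148))  # not sig -> fail to reject H0
```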


The Misunderstood p Value

ㅁ the p value = one of most misunderstood quantities in psycho research. Most common misinterpretation is that the p value is the probability that the null hypo is true — that the sample result occurred by chance. INCORRECT. The p value is really the probability of a result @least as extreme as the sample result IF the null hypo WERE true.

ㅁ null hypo test involves answering q: “If null hypo were true, what’s the probability of a sample result as extreme as this one?” In other words, “What is the p value?”

ㅁ answer to question depends on 2 considerations: → strength of relationship & size of sample. The stronger the sample relationship & the larger the sample, the less likely the result would be if the null hypo were true. That is, the lower the p value.

ㅁ Sometimes result weak & sample large, or result strong & sample small. In these cases, the two considerations trade off against each other, so a weak result can be stat sig if sample is large enough, & a strong relationship can be stat sig even if sample is small.

ㅁ Table 13.1 shows roughly how relationship strength & sample size combine to determine whether a sample result is statistically significant.

→ Columns of the table represent 3 levels of relationship strength: weak, medium, & strong.

→ Rows represent 4 sample sizes that can be considered small, medium, large, & extra large in the context of psycho research.

→ Each cell represents a combo of relationship strength & sample size.

→ If a cell contains the word Yes, then that combo would be stat sig for both Cohen’s d & Pearson’s r. If it contains the word No, it wouldn’t be stat sig for either.

→ There’s 1 cell where the decision for d & r would be dif & another where it might be dif depending on some additional considerations (discussed in Section 13.2 “Some Basic Null Hypothesis Tests”)

→ Although Table 13.1 provides only a rough guideline, it shows v clearly that weak relationships based on medium/small samples r never stat sig & that strong relationships based on medium/larger samples r always stat sig.

ㅁ If you keep this in mind, you'll often know whether a result is stat sig based on descriptive stats alone. Useful to develop this intuitive judgment.

→ One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses.

→ A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.


Statistical Significance Versus Practical Significance

ㅁ A stat sig result is not necessarily a strong one. Even a very weak result can be stat sig if it’s based on a large enough sample.

ㅁ the word significant can cause people to interpret stat sig difs as strong/important. However, stat sig difs can actually be quite weak. (this is why it’s important to distinguish btwn the statistical significance of a result & the practical significance of that result)

practical significance → refers to importance/usefulness of result in some real-world context

→ in clinical practice, same concept often referred to as “clinical significance”

ex) study on a new treatment for social phobia might show that it produces a stat sig + effect. Yet this effect still might not be strong enough to justify time, effort, & other costs of putting it into practice (esp if easier/cheaper/work-just-as-well treatments alr exist) Although stat sig, result said to lack practical or clinical sig.


13.58 Some Basic Null Hypothesis Tests


t-Test → a test that involves looking @ the dif btwn 2 means. 3 types used for slightly dif research designs: the one-sample t-test, the dependent-samples t-test, & the independent-samples t-test

One-sample t-test → used to compare a sample mean (M) w a hypo pop mean (μ0) that provides an interesting standard of comparison

ㅁ The null hypo is that the mean for the population (µ) is equal to the hypothetical population mean: μ = μ0. 

ㅁ The alternative hypothesis is that the mean for the pop is dif from the hypo pop mean: μ ≠ μ0.

ㅁ To decide btwn these 2 hypos, need to find probability of obtaining the sample mean (or one more extreme) if the null hypo were true. But finding this p value requires first computing a test statistic called t. The formula for t is as follows:

t = (M − μ0) / (SD / √N)

Test Statistic → a statistic (eg F, t, etc) that’s computed to compare against what’s expected under the null hypo, & thus helps find the p value. Useful bc we know how it’s distributed when the null hypo is true.

ㅁ Its precise shape depends on a stat concept called the degrees of freedom, which for a one-sample t-test is N − 1. (There are 24 degrees of freedom for the distribution shown in Figure 13.1.)

ㅁ In Figure 13.1, the distribution of t scores is unimodal & symmetrical, & has a mean of 0.

ㅁ this distribution makes it possible to find the p value for any t score.

ㅁ we don’t have to deal directly w the distribution of t scores. If we were to enter our sample data & the hypo mean of interest into 1 of the online statistical tools in Ch12 or a program like SPSS (Excel doesn’t have a one-sample t-test function), the output would include both the t score & the p value.

ㅁ If p is equal to or less than .05, we reject the null hypo; if p is greater than .05, we retain it.


Critical value → the absolute value that a test statistic (eg F, t, etc) must exceed to be considered stat sig

ㅁ Two-tailed test → where we reject the null hypo if the test statistic for the sample is extreme in either direction (+/−). This test makes sense when we believe the sample mean might differ from the hypo pop mean but we don’t have good reason to expect the dif to go in a particular direction.

ㅁ One-tailed test → where we reject the null hypo only if the t score for the sample is extreme in 1 direction that we specify before collecting the data. This test makes sense when we have good reason to expect the sample mean will differ from the hypo pop mean in a particular direction.

Example One-Sample t-Test

ㅁ Imagine a health psycho interested in the accuracy of uni students’ estimates of the # of calories in a choc chip cookie. Shows the cookie to 10 students & asks each one to estimate the # of calories in it. Bc the actual # of calories in the cookie is 250, this is the hypo pop mean of interest (µ0). The null hypo is that the mean estimate for the pop (μ) is 250. Bc he has no real sense of the students’ estimates, he decides to do a two-tailed test. Participants’ actual estimates are as follows: 250, 280, 200, 150, 175, 200, 200, 220, 180, 250. The mean estimate for the sample (M) is 212.00 calories & the SD is 39.17. He can now compute the t score for his sample:

t = (212.00 − 250) / (39.17 / √10) = −3.07

ㅁ If he enters the data into the online analysis tools or uses SPSS, it would also tell him the two-tailed p value for this t score (w 10 − 1 = 9 degrees of freedom) is .013. Bc <.05, the health psycho would reject the null hypo & conclude that uni students tend to underestimate the # of calories in a choc chip cookie. If he computes the t score by hand, he could look at Table 13.2 & see that the critical value of t for a two-tailed test w 9 degrees of freedom is ±2.262. The fact that his t score was more extreme than this critical value would tell him that the p value is <.05 & he should reject the null hypo. Using APA style, results reported as follows: t(9) = −3.07, p = .01.
ㅁ t and p are italicized, degrees of freedom appear in parentheses w no decimal places, & t/p values are rounded to 2 decimal places.     ㅁ If we were to compute the t score by hand, we could use a table like Table 13.2 to make the decision. This table does not provide actual p values. Instead, it provides the critical values of t for different degrees of freedom (df) when α is .05.
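The one-sample t score from the cookie example can be checked with a minimal Python sketch (not from the notes; uses only the summary stats the notes report):

```python
import math

# Summary stats from the cookie example: M = 212.00, mu0 = 250, SD = 39.17, N = 10
M, mu0, SD, N = 212.00, 250, 39.17, 10

t = (M - mu0) / (SD / math.sqrt(N))   # one-sample t statistic
df = N - 1                            # 9 degrees of freedom

print(round(t, 2))     # -3.07, matching the reported t(9) = -3.07
# Two-tailed critical value for df = 9 at alpha = .05 (Table 13.2) is 2.262
print(abs(t) > 2.262)  # True -> reject the null hypo
```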

Dependent-samples t-test → (sometimes called the paired-samples t-test) used to compare 2 means for the same sample tested @ 2 dif times or under 2 dif conditions. This comparison is appropriate for pretest-posttest designs or within-subject experiments. This test can also be one-tailed if the researcher has good reason to expect the dif goes in a particular direction

ㅁ first step is to reduce the 2 scores for each participant to a single dif score by taking the dif btwn them. At this point, the dependent-samples t-test becomes a one-sample t-test on the difference scores. The hypo pop mean (µ0) of interest is 0 bc this is what the mean dif score would be if there were no dif on average btwn the 2 times or 2 conditions. We can now think of the null hypo as being that the mean dif score in the pop is 0 (µ = 0) and the alternative hypo as being that the mean dif score in the pop is not 0 (µ ≠ 0).

difference score → a method to reduce pairs of scores (eg pre- & post- test) to a single score by calculating the dif btwn them


Example Dependent-Samples t-Test

ㅁ EX) Imagine that the health psych knows peop tend to underestimate # of calories in junk food, & has developed a short training program to improve estimates. To test effectiveness of program, he conducts a pretest-posttest study in which 10 participants estimate the # of calories in a choc chip cookie before training program & after. Bc he expects program to increase participants’ estimates, he decides to do a one-tailed test.

Pretest estimates are: 230, 250, 280, 175, 150, 200, 180, 210, 220, 190

Posttest estimates (for same peop in same order): 250, 260, 250, 200, 160, 200, 200, 180, 230, 240

The dif scores (posttest minus pretest) are as follows: 20, 10, -30, 25, 10, 0, 20, -30, 10, 50

ㅁ it doesn’t matter whether the 1st set of scores is subtracted from the 2nd or vice versa, as long as it’s done the same way for all participants (though for a one-tailed test, the expected direction of the mean dif must match how the scores were subtracted).

ㅁ The mean of the dif scores is 8.50 w a SD of 24.27. The health psych can now compute the t score for his sample as follows:

t = (8.50 − 0) / (24.27 / √10) = 1.11

ㅁ If he enters the data into 1 of the online analysis tools or uses Excel or SPSS, it would tell him that the one-tailed p value for this t score is .148. Bc this is >.05, he would retain the null hypo & conclude that the training program doesn’t sig increase peop’s calorie estimates. If he were to compute the t score by hand, he could look @ Table 13.2 & see that the critical value of t for a one-tailed test w 9 degrees of freedom is 1.833. (It is + this time bc he was expecting a + mean dif score.) The fact that his t score was less extreme than this critical value would tell him that his p value is >.05 & he should fail to reject the null hypo.
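The dependent-samples steps above (reduce to dif scores, then run a one-sample t on them) can be sketched in Python (hypothetical sketch, not from the notes; the SD is computed from the raw dif scores):

```python
import math
import statistics

pre  = [230, 250, 280, 175, 150, 200, 180, 210, 220, 190]
post = [250, 260, 250, 200, 160, 200, 200, 180, 230, 240]

# Reduce each participant's pair of scores to a single difference score
diffs = [b - a for a, b in zip(pre, post)]   # posttest minus pretest

M = statistics.mean(diffs)                   # mean difference score (8.5)
SD = statistics.stdev(diffs)                 # sample SD of the difference scores

# One-sample t on the difference scores, with mu0 = 0
t = (M - 0) / (SD / math.sqrt(len(diffs)))
print(round(t, 2))
# One-tailed critical value for df = 9 at alpha = .05 is 1.833
print(t > 1.833)   # False -> retain (fail to reject) the null hypo
```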


Independent-samples t-test → used to compare the means of 2 separate samples (M1 & M2).

→ the 2 samples might have been tested under dif conditions in a btwn-subjects experiment, or they could be pre-existing groups in a cross-sectional design (eg. women & men, extraverts & introverts).

→ The null hypo is that the means of the 2 pops r the same: µ1 = µ2

→ The alternative hypo is that they aren’t the same: µ1 ≠ µ2

→ the t stat here is a bit more complicated bc it must take into account 2 sample means, 2 SDs, & 2 sample sizes. Formula as follows:

t = (M1 − M2) / √(SD1²/n1 + SD2²/n2)

ㅁ Formula includes squared standard deviations (the variances) that appear inside the square root symbol. Lowercase n1 and n2 refer to the sample sizes in the 2 groups or conditions (capital N generally refers to the total sample size). There are N − 2 degrees of freedom for the independent-samples t-test.


Example Independent-Samples t-Test

ㅁ EX) now health psych wants to compare calorie estimates of peop who regularly eat junk food w the estimates of peop who rarely eat junk food. He believes the dif could come out in either direction so he decides to conduct a two-tailed test. He collects data from a sample of 8 participants who eat junk food regularly & 7 participants who rarely eat junk food. Data as follows:

ㅁ Junk food eaters: 180, 220, 150, 85, 200, 170, 150, 190

ㅁ Non-junk food eaters: 200, 240, 190, 175, 200, 300, 240

ㅁ Mean for non-junk-food eaters = 220.71 w SD = 42.66. Mean for junk food eaters = 168.12 w SD = 41.23. Compute the t score as follows:

t = (220.71 − 168.12) / √(42.66²/7 + 41.23²/8) = 2.42


ㅁ if he enters the data into 1 of the online analysis tools or Excel/SPSS, it would say that the two-tailed p value for this t score (w 15 − 2 = 13 degrees of freedom) is .015. Bc the p value is <.05, he rejects the null hypo & concludes that peop who eat junk food regularly make lower calorie estimates than peop who eat it rarely. If he were to compute the t score by hand, he could look at Table 13.2 & see that the critical value of t for a two-tailed test w 13 degrees of freedom is ±2.160. His t score was more extreme than this critical value, so his p value is <.05 & he should reject the null hypo.
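The independent-samples t score can be reproduced from the raw data with the unpooled formula above (hypothetical Python sketch, not from the notes):

```python
import math
import statistics

junk     = [180, 220, 150, 85, 200, 170, 150, 190]
non_junk = [200, 240, 190, 175, 200, 300, 240]

M1, M2 = statistics.mean(non_junk), statistics.mean(junk)
SD1, SD2 = statistics.stdev(non_junk), statistics.stdev(junk)
n1, n2 = len(non_junk), len(junk)

# t statistic: each group's variance divided by its own sample size
t = (M1 - M2) / math.sqrt(SD1 ** 2 / n1 + SD2 ** 2 / n2)
df = n1 + n2 - 2   # 13 degrees of freedom

print(round(t, 2))
# Two-tailed critical value for df = 13 at alpha = .05 is 2.160
print(abs(t) > 2.160)  # True -> reject the null hypo
```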

The Analysis of Variance

ㅁ t-tests are used to compare 2 means (a sample mean w a pop mean, the means of 2 conditions, or the means of 2 groups). When there r more than 2 group or condition means to be compared, the most common null hypo test is the analysis of variance (ANOVA)

Analysis of Variance (ANOVA) → a statistical test used when there r more than 2 groups or condition means to be compared.

One-way ANOVA → used for btwn-subjects designs w a single independent variable. Used to compare the means of more than 2 samples (M1, M2, …, MG) in a btwn-subjects design. The null hypo is that all the means are equal in the pop: µ1 = µ2 = … = µG. The alternative hypo is that not all the means in the pop r =.


ㅁ The test stat for the ANOVA is called F. It’s a ratio of 2 estimates of the pop variance based on the sample data.

2 estimates of pop variance:

Mean Squares btwn groups (MSB) → estimate of the pop variance & based on difs among the sample means

Mean Squares w/i groups (MSW) → estimate of the pop variance & based on difs among the scores w/i each group


ㅁ The F statistic is the ratio of the MSB to the MSW:

F = MSB / MSW

Useful bc we know how it’s distributed when the null hypo is true, & this allows us to find the p value.

ㅁ In Figure 13.2, this distribution is unimodal & + skewed w values that cluster around 1. The precise shape of the distribution depends on both the # of groups & the sample size, & there r degrees of freedom values assoc w each.

→ The between-groups degrees of freedom is the # of groups minus 1: dfB = (G − 1).

→ The within-groups degrees of freedom is the total sample size minus the # of groups: dfW = N − G.

ㅁ The online tools in Ch12 & statistical software (Excel and SPSS) will compute F & find the p value. If p is equal to or less than .05, we reject the null hypo; otherwise we retain it.


Example One-Way ANOVA

ㅁ Imagine that the health psychologist wants to compare the calorie estimates of psych majors, nutrition majors, & professional dieticians. He collects the following data:

ㅁ psych majors: 200, 180, 220, 160, 150, 200, 190, 200

ㅁ Nutrition majors: 190, 220, 200, 230, 160, 150, 200, 210

ㅁ Dieticians: 220, 250, 240, 275, 250, 230, 200, 240

ㅁ The means are 187.50 (SD = 23.14), 195.00 (SD = 27.77), and 238.13 (SD = 22.35), respectively. So it appears that dieticians made substantially more accurate estimates on average. The researcher would almost certainly enter these data into a program such as Excel or SPSS, which would compute F for him or her and find the p value.

ㅁ Table 13.4 shows the output of the one-way ANOVA function in Excel for these data. It shows that MSB is 5,971.88, MSW is 602.23, & their ratio, F, is 9.92. The p value is .0009. Bc this value is <.05, the researcher would reject the null hypo & conclude that the mean calorie estimates for the 3 groups aren’t the same in the pop.

ㅁ If the researcher were to compute the F ratio by hand, he could look at Table 13.3 and see that the critical value of F with 2 and 21 degrees of freedom is 3.467 (the same value in Table 13.4 under F crit). The fact that his F score was more extreme than this critical value would tell him that his p value is less than .05 and that he should reject the null hypo.

ㅁ An “ANOVA table” also includes the “sum of squares” (SS) for btwn & within groups. These values r computed on the way to finding MSB and MSW but aren’t typically reported by the researcher.
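The MSB, MSW, & F from Table 13.4 can be reproduced by hand in Python (hypothetical sketch, not from the notes; each group is taken as 8 scores, consistent w the 2 and 21 degrees of freedom reported):

```python
import statistics

groups = [
    [200, 180, 220, 160, 150, 200, 190, 200],   # psych majors
    [190, 220, 200, 230, 160, 150, 200, 210],   # nutrition majors
    [220, 250, 240, 275, 250, 230, 200, 240],   # dieticians
]

N = sum(len(g) for g in groups)                 # total sample size (24)
G = len(groups)                                 # number of groups (3)
grand_mean = sum(sum(g) for g in groups) / N

# Between-groups sum of squares: group sizes times squared mean deviations
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
# Within-groups sum of squares: squared deviations of scores from their group mean
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

msb = ss_between / (G - 1)   # mean squares between, df_B = 2
msw = ss_within / (N - G)    # mean squares within, df_W = 21

F = msb / msw
print(round(msb, 2), round(msw, 2), round(F, 2))
```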

ㅁ When we reject the null hypo in a one-way ANOVA, we conclude that the group means aren’t all the same in the pop… But this can indicate dif things. W 3 groups, it can indicate that all 3 means are sig dif from each other, or that 1 of the means is sig dif from the other 2 but the other 2 aren’t sig dif from each other (eg, the mean for dieticians is sig dif from the means for psych & nutrition majors, but the means for psych & nutrition majors aren’t sig dif from each other).


post hoc comparisons → an unplanned (not hypothesized) test of which pairs of group mean scores are dif from which others

ㅁ One approach to post hoc comparisons is to conduct a series of independent-samples t-tests comparing each group mean to each of the other group means. Prob → If we conduct a t-test when the null hypo is true, we have a 5% chance of mistakenly rejecting the null hypo. If we conduct several t-tests when the null hypo is true, the chance of mistakenly rejecting at least 1 null hypo increases w each test
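The inflation of the familywise error rate across multiple tests can be shown w a quick calculation (hypothetical sketch, not from the notes; assumes the tests are independent):

```python
# Probability of at least one Type I error across k independent tests,
# each run at alpha = .05 when the null hypothesis is true
alpha = 0.05
for k in (1, 3, 10):
    familywise = 1 - (1 - alpha) ** k
    print(k, round(familywise, 3))
# With 3 tests the chance of at least one false rejection is already ~14%,
# and with 10 tests it climbs to ~40%.
```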


Repeated-Measures ANOVA

ㅁ one-way ANOVA is appropriate for between-subjects designs in which the means being compared come from separate groups of participants. It is not appropriate for within-subjects designs in which the means being compared come from the same participants tested under different conditions or at different times. This requires a slightly different approach, called repeated-measures ANOVA

repeated-measures ANOVA → compares the means from the same participants tested under dif conditions or at dif times, in which the dependent variable is measured multiple times for each participant

ㅁ The basics of the repeated-measures ANOVA are the same as for the one-way ANOVA. The main difference is that measuring the dependent variable multiple times for each participant allows for a more refined measure of MSW.


Factorial ANOVA

ㅁ When more than one independent variable is included in a factorial design, the appropriate approach is the factorial ANOVA

Factorial ANOVA → a statistical method to detect differences in the means between conditions where there are 2(+) independent variables in a factorial design. It allows the detection of main effects & interaction effects.

ㅁ the basics of the factorial ANOVA are the same as for the one-way and repeated-measures ANOVAs. The main dif is that it produces an F ratio and p value for each main effect and for each interaction.


Testing Correlation Coefficients

ㅁ For relationships btwn quantitative variables, where Pearson’s r (the correlation coefficient) is used to describe strength of those relationships, the appropriate null hypo test is a test of the correlation coefficient.

ㅁ Null hypo is that there’s no relationship in the pop. Use Greek lowercase rho (ρ) to represent the relevant parameter: ρ = 0. Alternative hypo is that there’s a relationship in the pop: ρ ≠ 0. As w t- test, this test can be two-tailed if researcher has no expectation abt direction of relationship or one-tailed if researcher expects relationship to go in particular direction.


Example Test of a Correlation Coefficient

Imagine that the health psycho is interested in the correlation btwn people’s calorie estimates & their weight. She has no expectation abt the direction of the relationship, so she does a two-tailed test. She computes the correlation coefficient for a sample of 22 uni students & finds that Pearson’s r is −.21. The statistical software tells her that the p value is .348. It’s >.05, so she retains the null hypo & concludes that there’s no relationship btwn people’s calorie estimates & their weight. If she were to compute the correlation coefficient by hand, she could look at Table 13.5 & see that the critical value for 22 − 2 = 20 degrees of freedom is .444. The fact that the correlation coefficient for her sample is less extreme than this critical value tells her that the p value is >.05 & that she should retain the null hypo.
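The correlation test can also be run via the standard t-form of the test (t = r·√((N − 2)/(1 − r²)) — a standard identity, not given in these notes; Python sketch is hypothetical):

```python
import math

r, N = -0.21, 22
df = N - 2   # 20 degrees of freedom

# t statistic for testing H0: rho = 0
t = r * math.sqrt(df / (1 - r ** 2))
print(round(t, 2))
# Two-tailed critical value of t for df = 20 at alpha = .05 is 2.086
print(abs(t) > 2.086)   # False -> retain the null hypo
```

Either route (comparing r to the critical r from Table 13.5, or comparing this t to the critical t) leads to the same retain decision here.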


13.59 Additional Considerations


Errors in Null Hypothesis Testing

ㅁ In null hypo testing, the researcher tries to draw a reasonable conclusion abt the pop based on the sample, but the conclusion isn’t guaranteed to be correct.

ㅁ Rows represent the 2 possible decisions researchers can make in null hypo testing: reject or retain the null hypo.

ㅁ Columns represent the 2 possible states of the world: the null hypo is false or true.

ㅁ The 4 cells of the table represent the 4 distinct outcomes of a null hypo test. Two of the outcomes (rejecting the null hypo when it’s false & retaining it when it’s true) are correct decisions. The other two (rejecting the null hypo when it’s true & retaining it when it’s false) are errors.

Type I Error → a false + in which the researcher concludes that their results are statistically significant when in reality there is no real effect in the population & the results are due to chance. In other words, rejecting the null hypo when it’s true.

Type II Error → a missed opportunity in which the researcher concludes that their results are not statistically significant when in reality there’s a real effect in the pop & they just missed detecting it. In other words, retaining the null hypo when it’s false.


ㅁ In principle, it’s possible to reduce the chance of a Type I error by setting α to something <.05. Ex) Setting it to .01 would mean if null hypo is true, then there is only a 1% chance of mistakenly rejecting it. But making it harder to reject true null hypotheses also makes it harder to reject false ones and therefore increases the chance of a Type II error.

ㅁ Similarly, it’s possible to reduce the chance of a Type II error by setting α to something greater than .05 (e.g., .10). But making it easier to reject false null hypotheses also makes it easier to reject true ones and therefore increases the chance of a Type I error.

→ Provides insight into why the convention is to set α to .05.
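The link btwn α & the Type I error rate can be demonstrated w a simulation (hypothetical sketch, not from the notes): under a true null hypo, p values are uniformly distributed on [0, 1], so the long-run false-positive rate equals whatever α we choose.

```python
import random

# Simulate p values under a true null hypothesis (uniform on [0, 1])
random.seed(1)
p_values = [random.random() for _ in range(100_000)]

# Lowering alpha from .05 to .01 lowers the Type I error rate accordingly
for alpha in (0.05, 0.01):
    rate = sum(p <= alpha for p in p_values) / len(p_values)
    print(alpha, round(rate, 3))
```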


ㅁ Possibility of committing Type I & Type II errors has several important implications for interpreting the results of our own & others’ research:

→ we should be cautious abt interpreting the results of any indv study bc there’s a chance it reflects a Type I or II error. This possibility is why researchers consider it important to replicate their studies. Each time researchers replicate a study and find a similar result, they rightly become more confident that the result represents a real phenomenon and not just a Type I or Type II error.


File Drawer Problem → issue related to Type I errors. When researchers obtain non-sig results, they tend not to submit them for publication, or if they do submit them, journal editors/reviewers tend not to accept them. As a consequence, the published literature fails to contain a full representation of the + & − findings abt a research q. Researchers end up putting these non-sig results away in a file drawer (or nowadays, in a folder on their hard drive). Difficult to avoid bc it results from the trad way of conducting & publishing scientific research.

ㅁ One effect of this tendency is that the published lit prob contains a higher proportion of Type I errors than we might expect on the basis of stat considerations alone. Even when there’s a relationship btwn 2 variables in the pop, the published research lit is likely to overstate the strength of that relationship.

ㅁ One solution is registered reports, whereby journal editors/reviewers evaluate research submitted for publication w/o knowing the results. If the research q is judged to be interesting & the method sound, then a non-sig result should be just as important & worthy of publication as a significant one.


p-hacking → when researchers make various decisions in the research process to increase their chance of a statistically sig result (& type I error) by arbitrarily removing outliers, selectively choosing to report dependent variables, only presenting sig results, etc. until their results yield a desirable p value.

statistical power → in research design, it means the probability of rejecting the null hypo given the sample size & expected relationship strength.

ㅁ Common guideline= power of .80 is adequate. Means 80% chance of rejecting null hypo for expected relationship strength.

2 steps to increase stat power (given that it depends primarily on relationship strength & sample size):

ㅁ increase the strength of the relationship → can sometimes be accomplished by using a stronger manipulation or by more carefully controlling extraneous variables to reduce the amount of noise in the data (eg by using a within-subjects design rather than a between-subjects design).

ㅁ increase the sample size → the usual strategy. For any expected relationship strength, there will always be some sample size large enough to achieve adequate power.

ㅁ A table shows the sample size needed to achieve a power of .80 for weak, medium, & strong relationships, for a two-tailed independent-samples t-test & a two-tailed test of Pearson’s r.
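The relationship btwn effect size, sample size, & power can be sketched w a normal approximation (hypothetical sketch, not the exact t-based power calculation & not from the notes; function names are made up):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed independent-samples t-test
    for effect size d, using the normal approximation."""
    z_crit = 1.96   # two-tailed critical z at alpha = .05
    return normal_cdf(d * math.sqrt(n_per_group / 2) - z_crit)

# Smallest n per group giving power >= .80 for a medium effect (d = 0.5)
n = 2
while approx_power(0.5, n) < 0.80:
    n += 1
print(n)   # close to the ~64 per group usually cited for d = 0.5
```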

Criticisms of Null Hypothesis Testing

ㅁ criticisms of null hypo testing focus on researchers’ misunderstanding of it.

ㅁ Another set focuses on logic. To many, the strict convention of rejecting the null hypo when p is just below .05 but retaining it when p is just above .05 seems arbitrary (eg treating p = .049 & p = .051 as categorically dif results).

ㅁ Another focuses on idea that null hypo testing—even when understood & carried out correctly—is not v informative. The null hypo is that there is no relationship between variables in the population (e.g., Cohen’s d or Pearson’s r is precisely 0). So to reject the null hypothesis is simply to say that there’s some non-zero relationship in the pop.


The End of p-Values?

ㅁ In 2015, the editors of Basic and Applied Social Psychology banned the use of null hypo testing & related statistical procedures. Authors can submit papers w p-values, but the editors will remove them before publication. The editors didn’t provide a better solution, but emphasized the importance of descriptive stats & effect sizes.


What should be done abt probs w null hypo? Some suggestions in APA Publication Manual:

ㅁ Each null hypo test should be accompanied by an effect size measure like Cohen’s d or Pearson’s r. This ensures an estimate of how strong the relationship in the pop is, not just whether there is one

ㅁ Use confidence intervals in addition to (or instead of) null hypo tests.

confidence intervals → a range of values that’s computed in such a way that some % of the time (usually 95%) the population parameter will lie within that range
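A 95% CI for the one-sample cookie example is M ± t_crit · (SD/√N); a hypothetical Python sketch (not from the notes), reusing the summary stats & the df = 9 critical t:

```python
import math

# Cookie example: M = 212.00, SD = 39.17, N = 10
M, SD, N = 212.00, 39.17, 10
t_crit = 2.262   # two-tailed critical t for df = 9 at alpha = .05

margin = t_crit * SD / math.sqrt(N)
ci = (round(M - margin, 2), round(M + margin, 2))
print(ci)
# The hypothetical pop mean of 250 lies outside the interval, which
# corresponds to rejecting H0 at alpha = .05
print(not (ci[0] <= 250 <= ci[1]))   # True
```

Note the tie-in: a 95% CI that excludes µ0 gives the same decision as a two-tailed null hypo test at α = .05.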

ㅁ there’re more radical solutions to probs of null hypo testing that involve using dif approaches to inferential statistics:

Bayesian Statistics → an approach in which the researcher specifies the probability that the null hypo & any important alternative hypos are true before conducting the study, conducts the study, & then updates the probabilities based on the data


13.60 From the “Replicability Crisis” to Open Science Practices


replicability crisis → a phrase that refers to the inability of researchers to replicate earlier research findings

ㅁ The low replicability of many studies is evidence of widespread use of questionable research practices by psycho researchers. These may include:

ㅁ  (1) → The selective deletion of outliers in order to influence (usually by artificially inflating) statistical relationships among the measured variables.

ㅁ  (2) → The selective reporting of results, cherry-picking only those findings that support one’s hypotheses.

ㅁ (3) → Mining the data without an a priori hypothesis, only to claim that a statistically significant result had been originally predicted, a practice referred to as “HARKing” or hypothesizing after the results are known

ㅁ (4) → A practice colloquially known as “p-hacking” (briefly discussed in the previous section), in which a researcher might perform inferential statistical calculations to see if a result was significant before deciding whether to recruit additional participants and collect more data (Head, Holman, Lanfear, Kahn, & Jennions, 2015)[9]. As you have learned, the probability of finding a statistically significant result is influenced by the number of participants in the study.

ㅁ  (5) → Outright fabrication of data (as in the case of Diederik Stapel, described at the start of Chapter 3), although this would be a case of fraud rather than a “research practice.”

HARKing → Hypothesizing After the Results are Known: A practice where researchers analyze data w/o a priori hypo, claiming afterward that a stat sig result had been orig predicted

\n

ㅁ  this “crisis” has also highlighted the importance of enhancing scientific rigor by:

ㅁ (1) → Designing & conducting studies that have sufficient statistical power, in order to increase the reliability of findings.

ㅁ (2) → Publishing both null & sig findings (thereby counteracting the publication bias & reducing file drawer problem).

ㅁ (3) → Describing one's research designs in sufficient detail to enable other researchers to replicate the study using an identical or at least very similar procedure.

ㅁ (4) → Conducting high-quality replications and publishing these results

\n

ㅁ One particularly promising response to the replicability crisis has been the emergence of open science practices that increase the transparency & openness of the scientific enterprise.

open science practices → practices in which researchers openly share their research materials w other researchers in hopes of increasing the transparency & openness of the scientific enterprise


CHAPTER 9 FACTORIAL DESIGNS

9.41 Setting Up a Factorial Experiment

\n

ㅁ Common to include multiple independent variables

ㅁ Just as including multiple levels of a single independent variable allows us to answer more sophisticated research questions, so too does including multiple independent variables in the same experiment.

ㅁ ex) instead of conducting 1 study on the effect of disgust on moral judgement & another on the effect of private body consciousness on moral judgement, Schnall & colleagues were able to conduct 1 study that addressed both questions.

ㅁ Including multiple independent variables also allows researcher to answer qs abt whether the effect of 1 independent variable depends on the level of another. Referred to as → interaction between the independent variables

\n

Factorial Designs

Factorial Design → most common approach to including multiple IVs. Experiments that include 2(+) independent variables, in which each level of 1 independent variable is combined w each level of the others to produce all possible combos.

Factorial Design Table → shows how each level of one independent variable is combined w each level of the others to produce all possible combos in a factorial design

2 x 2 Factorial Design → combines 2 variables that have 2 levels. 4 distinct conditions. (eg using cell phone during day, not using cell phone during day, using cell phone at night, not using cell phone at night)

3 x 2 Factorial Design → if 1 of independent variables had a 3rd level. 6 distinct conditions. (eg, using a handheld cell phone, using a hands-free cell phone, & not using a cell phone)

4 x 5 Factorial Design → 20 conditions.

ㅁ Notice that each # in the notation represents 1 factor, 1 independent variable. By looking @ how many #s are in the notation, can determine how many IVs are in the experiment (ex, 2x2, 3x3, & 2x3 designs all have 2 #s in notation & therefore all have 2 IVs). The numerical value of each # represents the # of levels of each IV (a 2 means the IV has 2 levels, a 3 means 3 levels, a 4 means 4 levels, etc).

In principle, factorial designs can include any # of IVs w any # of levels:

ㅁ ex) an experiment could include the type of psychotherapy (cognitive vs. behavioral), the length of the psychotherapy (2 weeks vs. 2 months), and the sex of the psychotherapist (female vs. male). This would be a 2 × 2 × 2 factorial design and would have eight conditions. Figure 9.2 shows one way to represent this design.
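The "all possible combos" idea maps directly onto a Cartesian product. A minimal Python sketch using the hypothetical 2 × 2 × 2 psychotherapy example (factor & level names are just illustrative labels):

```python
from itertools import product

# Hypothetical factors from the 2 × 2 × 2 psychotherapy example.
factors = {
    "therapy_type": ["cognitive", "behavioral"],
    "therapy_length": ["2 weeks", "2 months"],
    "therapist_sex": ["female", "male"],
}

# Each condition combines one level from every factor.
conditions = list(product(*factors.values()))
print(len(conditions))  # 2 * 2 * 2 = 8 conditions
for c in conditions:
    print(c)
```

The # of conditions is always the product of the #s in the design notation, which is why a 4 × 5 design has 20 conditions.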

\n

Assigning Participants to Conditions

ㅁ simple between-subjects design, each parti is tested in only 1 condition

ㅁ simple within-subjects design, each parti is tested in all conditions

ㅁ in factorial experiment, decision to take the between-subjects or within-subjects approach must be made separately for each IV

between-subjects factorial design → all of IVs are manipulated between subjects

ㅁ Ex) all parti could be tested either while using cell phone OR not using a cell phone, & either during day OR night.

→ means each parti tested in 1 & only 1 condition.

Mixed Factorial Design → a design which manipulates 1 IV between subjects & another w/i subjects. Possible bc factorial designs have >1 IV

ㅁ Ex) researcher might choose to treat cell phone use as a within-subjects factor by testing the same participants both while using & not using a cell phone (while counterbalancing the order of these 2 conditions). But might choose to treat time of day as a between-subjects factor by testing each parti during either day or night. Thus, each parti in a mixed design is tested in 2 of the 4 conditions.

\n

Non-manipulated Independent Variable → an IV that's measured but not manipulated. Many factorial designs include an IV like this.

ㅁ Ex) study by Schnall & colleagues. 1 IV was disgust, which researchers manipulated by testing participants in a clean or messy room. Other was private body consciousness, a parti variable which researchers simply measured.

ㅁ Such studies are extremely common & several points worth making abt them:

ㅁ 1) non-manipulated IVs are usually parti variables (private body consciousness, hypochondriasis, self-esteem, gender etc), & therefore by definition are between-subjects factors

ㅁ 2) Such studies are generally considered to be experiments as long as at least 1 IV is manipulated, regardless of how many non-manipulated IVs are included

ㅁ 3) Important to remember that causal conclusions can only be drawn abt the manipulated IV.

\n

Non-Experimental Studies W Factorial Designs

ㅁ factorial designs can also include ONLY non-manipulated variables. (No longer experiments but instead non-experimental in nature)

ㅁ Ex) Consider a hypoth study where researcher measures both the mood & self-esteem of several participants – categorizing them as having either a + or - mood & either ↑ or ↓ self-esteem – along w their willingness to have unprotected intercourse. This can be conceptualized as a 2 × 2 factorial design with mood (positive vs. negative) and self-esteem (high vs. low) as non-manipulated between-subjects factors. Willingness to have unprotected sex is the dependent variable.

ㅁ bc neither IV in this ex was manipulated, it’s a non-experimental study rather than an experiment.

ㅁ important bc must b cautious abt inferring causality from non-exp studies bc of directionality & third-variable problems.

\n

9.42. Interpreting the Results of a Factorial Experiment

\n

Graphing the Results of Factorial Experiments

ㅁ results of a factorial experi w 2 IVs can be graphed by representing 1 IV on the x-axis & the other w dif colored bars/lines. (y-axis always for the DV).

ㅁ Figure 9.3 shows results for two hypothetical factorial experiments.

ㅁ The top panel shows the results of a 2 × 2 design. Time of day (day vs. night) is represented by different locations on the x-axis, and cell phone use (no vs. yes) is represented by different-colored bars. (It would also be possible to represent cell phone use on the x-axis and time of day as different-colored bars. The choice comes down to which way seems to communicate the results most clearly.)

ㅁ The bottom panel of Figure 9.3 shows the results of a 4 × 2 design in which one of the variables is quantitative. This variable, psychotherapy length, is represented along the x-axis, and the other variable (psychotherapy type) is represented by differently formatted lines. This is a line graph rather than a bar graph because the variable on the x-axis is quantitative with a small number of distinct levels. Line graphs are also appropriate when representing measurements made over a time interval (also referred to as time series information) on the x-axis.

Main Effects

ㅁ in factorial designs, there are 3 kinds of results that are of interest: 1) main effects, 2) interaction effects, 3) simple effects

Main Effect → the effect of 1 IV on the DV– averaging across the levels of any other IV(s)

ㅁ main effects are independent of each other in the sense that whether or not there's a main effect of 1 IV says nothing abt whether or not there's a main effect of the other.

ㅁ ex) Figure 9.3: shows a main effect of cell phone use because driving performance was better, on average, when participants were not using cell phones than when they were. The blue bars are, on average, higher than the red bars. It also shows a main effect of time of day because driving performance was better during the day than during the night—both when participants were using cell phones and when they were not.
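A main effect is just a marginal mean: average the cell means across the levels of the other IV. A minimal sketch w made-up cell means for the cell phone × time of day design (the numbers are illustrative, not from Figure 9.3):

```python
# Hypothetical cell means (driving performance, higher = better)
# for the 2 × 2 cell phone × time of day design.
means = {
    ("day",   "no_phone"): 8.0,
    ("day",   "phone"):    6.0,
    ("night", "no_phone"): 6.0,
    ("night", "phone"):    4.0,
}

def marginal(level, position):
    """Average the cell means across the other IV (a main effect)."""
    cells = [m for key, m in means.items() if key[position] == level]
    return sum(cells) / len(cells)

# Main effect of cell phone use: no-phone vs. phone marginal means.
print(marginal("no_phone", 1), marginal("phone", 1))  # 7.0 5.0
# Main effect of time of day: day vs. night marginal means.
print(marginal("day", 0), marginal("night", 0))       # 7.0 5.0
```

W these numbers both main effects exist & there's no interaction: the phone effect is the same 2-point drop during day & night.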

\n

Interaction Effect → (or just interaction), when the effect of 1 IV depends on the level of another.

ㅁ ex) assume a friend asks u to go to the movies w another friend. Ur response is "depends on what movie & who else is going." U rlly want to see the big blockbuster summer hit, not the cheesy romantic comedy. There's a main effect of type of movie on ur decision. If ur decision to go see either of these movies further depends on who she's bringing w her, then there's an interaction. For instance, if u'll go see the cheesy romantic comedy if she brings her hot friend u want to know better, but not if she brings anyone else, there's an interaction.

ㅁ In many studies, the prim research q is abt an interaction.

\n

Types of Interactions

Spreading Interactions → means there's an effect of 1 IV at 1 level of the other IV & there's either a weak effect or no effect of that IV at the other level of the other IV.


cross-over interaction → means the IV has an effect at both levels of the other IV, but the effects are in opposite directions

ㅁ Figure 9.4 Bar Graphs Showing Three Types of Interactions. In the top panel, one independent variable has an effect at one level of the second independent variable but not at the other. In the middle panel, one independent variable has a stronger effect at one level of the second independent variable than at the other. In the bottom panel, one independent variable has the opposite effect at one level of the second independent variable than at the other.

\n

ㅁ Figure 9.5 shows examples of these same kinds of interactions when one of the independent variables is quantitative and the results are plotted in a line graph. \n ㅁ Figure 9.5 Line Graphs Showing Different Types of Interactions. In the top panel, one independent variable has an effect at one level of the second independent variable but not at the other. In the middle panel, one independent variable has a stronger effect at one level of the second independent variable than at the other. In the bottom panel, one independent variable has the opposite effect at one level of the second independent variable than at the other
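The patterns in Figures 9.4 & 9.5 can be told apart by comparing the effect of 1 IV at each level of the other. A rough sketch (the `interaction_type` helper & its tolerance cutoff are hypothetical, not a standard statistical test):

```python
def interaction_type(a1b1, a1b2, a2b1, a2b2, tol=0.5):
    """Classify a 2x2 pattern of cell means (hypothetical helper).

    e1 = effect of A at level b1, e2 = effect of A at level b2.
    """
    e1, e2 = a2b1 - a1b1, a2b2 - a1b2
    if abs(e1 - e2) <= tol:
        return "no interaction"   # roughly the same effect at both levels
    if e1 * e2 < 0 and abs(e1) > tol and abs(e2) > tol:
        return "crossover"        # real effects in opposite directions
    return "spreading"            # effect at one level, weak/none at the other

print(interaction_type(4, 4, 6, 6))  # no interaction
print(interaction_type(4, 4, 6, 4))  # spreading
print(interaction_type(4, 6, 6, 4))  # crossover
```

In real analyses the "difference of differences" is tested w an inferential interaction test, not an eyeballed tolerance; the sketch only shows the logic.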

\n

ㅁ The presence of an interaction indicates that the story is more complicated than the main effects alone suggest

Simple effects → a way of breaking down the interaction to figure out precisely what’s going on.

ㅁ An interaction simply informs us that the effects of at least one independent variable depend on the level of another independent variable. Whenever an interaction is detected, researchers need to conduct additional analyses to determine where that interaction is coming from.

ㅁ a simple effects analysis allows researchers to determine the effects of each independent variable at each level of the other independent variable. So while the researchers would average across the two levels of the personality variable to examine the effects of caffeine on verbal test performance in a main effects analysis, for a simple effects analysis the researchers would examine the effects of caffeine in introverts and then examine the effects of caffeine in extraverts.
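A minimal sketch of that caffeine example w made-up cell means — each simple effect is computed within 1 level of the personality variable instead of averaging across them:

```python
# Hypothetical cell means (verbal test performance) for the
# caffeine × personality example; all numbers are invented.
means = {
    ("introvert", "no_caffeine"): 7.0,
    ("introvert", "caffeine"):    4.0,
    ("extravert", "no_caffeine"): 6.0,
    ("extravert", "caffeine"):    6.0,
}

# Simple effect of caffeine *within* each personality group
# (vs. averaging across groups, as a main effects analysis would).
simple_effects = {}
for group in ("introvert", "extravert"):
    simple_effects[group] = (means[(group, "caffeine")]
                             - means[(group, "no_caffeine")])
    print(group, simple_effects[group])
# introvert -3.0  → caffeine hurt introverts' performance
# extravert 0.0   → no effect of caffeine for extraverts
```

W these numbers there's an interaction (the caffeine effect depends on personality), & the simple effects show exactly where it comes from.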

\n

ㅁ Schnall and colleagues found a main effect of disgust on moral judgments (those in a messy room made harsher moral judgments). They also discovered an interaction btwn private body consciousness & disgust. (The effect of disgust depended on private body consciousness).

ㅁ Presence of this interaction suggests the main effect may be a bit misleading. NOT simply true that those in a messy room made harsher moral judgments, bc it was only true for half of the partis.

ㅁ Using simple effects analyses, they were able to further demonstrate that for people high in private body consciousness, there was an effect of disgust on moral judgments. Further, they found that for those low in private body consciousness there was no effect of disgust on moral judgments. By examining the effect of disgust at each level of body consciousness using simple effects analyses, Schnall and colleagues were able to better understand the nature of the interaction.

\n

ㅁ examining simple effects provides a means of breaking down the interaction. Only necessary to conduct when an interaction is present. When no interaction, main effects tell complete & accurate story.

\n

ㅁ Rather than averaging across the levels of the other IV, as done in main effects analysis, simple effects analyses used to examine the effects of each independent variable at each level of the other independent variable(s).
