The 3 aspects in designing a good test
standardization
reliability
validity
Standardization
one aspect of designing a good test
= comparing a data point against a data set
ex. grades from a harsh teacher vs. an easy teacher don’t mean anything without _______
so that’s why there are ______ tests (MCAS, SAT, AP)
a large enough data set of many kinds of data (test scores, heights, etc.) will usually give a bell curve
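A minimal sketch of "comparing a data point against a data set", using a z-score (how many standard deviations a score sits from the group average). The grades and the `z_score` helper are made up for illustration; they are not from the notes.

```python
from statistics import mean, pstdev

def z_score(score, all_scores):
    """How many standard deviations `score` sits above or below the group mean."""
    return (score - mean(all_scores)) / pstdev(all_scores)

# Hypothetical grades: the same raw 85 means different things in each class.
harsh_class = [60, 65, 70, 75, 85]
easy_class = [85, 90, 92, 95, 98]

harsh_z = z_score(85, harsh_class)  # positive: well above the harsh class average
easy_z = z_score(85, easy_class)    # negative: below the easy class average
print(round(harsh_z, 2), round(easy_z, 2))
```

The raw 85 only becomes meaningful once it is placed against the rest of the scores, which is the point of standardizing.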
Reliability
one aspect of designing a good test
= score consistency
IMPORTANT: Reliability is necessary, but NOT SUFFICIENT (you also need validity)
ex. consider a broken scale: every person who steps on it weighs 100 lbs (very reliable/consistent, but accurate? No.)
Split-in-Half Technique
One way to establish reliability
= assessed by splitting the items on a test in half, and then calculating the scores for each half separately
ex. 100 Q test, see how well students did in first 50 and the last 50 (or split by evens and odds)
if both halves are good = reliable
if one good, one bad = unreliable (b/c one part is *statistically* significantly more difficult)
is useful in determining if a test is too long
ex. if someone got a question right early in the exam, but got a similar Q wrong later, it can maybe be attributed to fatigue
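One possible way to check a split by evens and odds is to total each student's odd-numbered and even-numbered questions and correlate the two totals. This is only a sketch: the answer data is invented, and `pearson` and `split_half_scores` are hypothetical helper names.

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def split_half_scores(item_results):
    """Return (odd-question total, even-question total) for one student."""
    odd_total = sum(item_results[0::2])   # questions 1, 3, 5, ...
    even_total = sum(item_results[1::2])  # questions 2, 4, 6, ...
    return odd_total, even_total

# Hypothetical 8-question test: one row per student, 1 = correct, 0 = wrong.
answers = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
]

halves = [split_half_scores(row) for row in answers]
odd_totals = [odd for odd, _ in halves]
even_totals = [even for _, even in halves]
r = pearson(odd_totals, even_totals)
print(round(r, 2))  # closer to 1 = the two halves agree = more reliable
```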
Test-Retest Reliability
One way to establish reliability
= administering the same test (but shuffled a bit to avoid test-retest bias AKA memorization) twice over a period of time
scores from first and second time can then be correlated to evaluate reliability
The higher the correlation, the higher the reliability!
Alternate Forms Reliability
One way to establish reliability
= Does the score you receive correlate with the score on another test covering the same material?
ex. taking the October SAT vs. the December SAT: scores shouldn’t change drastically b/c the SAT is reliable
Inter-Rater Reliability
One way to establish reliability
= does the score one grader assigns to your assessment correlate with the score another grader gives for the same test?
ex. more than one “blind” grader gives scores for FRQs to ensure _____ reliability where both scores should be the same
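Since both graders' scores "should be the same", one simple check is the share of responses where they gave the exact same score. Note this is plain percent agreement, a simpler stand-in for the correlation described above; the scores and the `agreement_rate` helper are made up.

```python
def agreement_rate(grader_a, grader_b):
    """Fraction of responses where both graders gave the exact same score."""
    matches = sum(a == b for a, b in zip(grader_a, grader_b))
    return matches / len(grader_a)

# Hypothetical FRQ scores (0-5 scale) from two blind graders on six essays.
grader_a = [5, 4, 3, 5, 2, 4]
grader_b = [5, 4, 2, 5, 2, 4]

rate = agreement_rate(grader_a, grader_b)
print(f"{rate:.0%} agreement")  # the graders disagree on one essay out of six
```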
Intra-Rater Reliability
One way to establish reliability
= Does an individual rater agree with themselves when measuring the same item multiple times?
= the consistency of the data recorded by ONE rater over several trials
Validity
one aspect of designing a good test
= the extent to which a test actually assesses what it claims to assess
Content Validity
One way to establish validity
= Does the assessment have content relevant to the construct?
AKA how representative the results are of the content being tested
exams must include Qs relevant to the topic (i.e., represent the material learned in the course)
Face Validity
One way to establish validity
= at first glance, does the test seem to evaluate what it claims to?
ex. a test on musical ability, but the first page is just pictures of food → appropriate here to question the _____ validity
ex. the AP Psych exam includes a graph of a normal distribution, and you (a psych student who didn’t study at all) think it looks like a math test, but the graph is actually related to intelligence testing and statistics
thus, the exam LACKS face validity, but it DOES have content validity
Construct Validity
One way to establish validity
= whether or not an assessment measures an idea (or “construct”) that it is designed to measure
construct = something intangible, so test makers have to come up with a tool (or “operational definition”)
so the main question is: Does the operational definition really measure what it’s supposed to?
Criterion Validity
One way to establish validity
= measures how well the test correlates with the outcome
ex. student gets “genius” on an intelligence test, BUT always misspells the word as “inteligens” → low criterion validity
ex. you score really high on an intelligence test, BUT you have trouble multitasking → bad criterion validity
has 2 types:
predictive validity
concurrent validity
Predictive Validity
a type of criterion validity
= does the test accurately predict the level of some future performance?
ex. Does the performance on the SAT correlate with later college performance?
Concurrent Validity
a type of criterion validity
= do the results from the test correlate with results from OTHER measures designed to assess similar topics/concepts?
ex. if results from a test that I created were similar to results from the WAIS, then my test has concurrent validity (b/c the scores had a positive correlation with another valid measure of intelligence)
Verbal Tests
Tests that use word problems to assess abilities
Abstract tests
Tests that use non-verbal measures to assess abilities
Speed of Processing
= the time it takes a person to do a mental task
Binet Test
an intelligence test that compares a child against what most children their age can do
ex. an average 7 year old can tie their shoes, ride a bike, do basic math, etc.
then you compare a 7 year old against the average
is ratio-based:
IQ = (mental age ÷ chronological age) × 100
ex. a 7 yr old with a mental age of 7: (7 ÷ 7) × 100 = 100 IQ (the average)
ex. a biological 10 yr old with a mental age of 8: (8 ÷ 10) × 100 = 80 IQ (below average)
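The ratio formula for the Binet Test can be sketched as a tiny function; `ratio_iq` is just an illustrative name.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return mental_age * 100 / chronological_age

print(ratio_iq(7, 7))   # 100.0 (the average)
print(ratio_iq(8, 10))  # 80.0 (below average)
```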
Stanford-Binet Test
= an intelligence test that compares an individual against a large bank of acquired scores on a bell curve
is based on the Binet Test and addresses the Binet Test’s problem where it gets less accurate as people get older (because age difference isn’t as discretely predictive of age-appropriate skills)
“what can a 34 yr old do that a 33 yr old couldn’t do?”
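A rough sketch of scoring someone against a bank of scores on a bell curve instead of by age ratio. The mean of 100 and standard deviation of 15 are the usual IQ scale convention, assumed here rather than taken from the notes; the norm scores and `deviation_iq` are invented.

```python
from statistics import mean, pstdev

def deviation_iq(raw_score, norm_scores, iq_mean=100, iq_sd=15):
    """Place a raw score on an IQ-style scale relative to a norm group.

    iq_mean and iq_sd are assumed scale settings (100 and 15 by convention).
    """
    z = (raw_score - mean(norm_scores)) / pstdev(norm_scores)
    return iq_mean + iq_sd * z

# Hypothetical raw scores from a (tiny) norm group of adults.
norm_group = [40, 45, 50, 55, 60]

print(deviation_iq(50, norm_group))            # exactly average for this group
print(round(deviation_iq(60, norm_group), 1))  # above average
```

Because the comparison is against other people's scores, not age, it works the same for a 34 year old as for a 7 year old.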
Wechsler Adult Intelligence Scale (WAIS)
= an IQ test designed to measure intelligence and cognitive ability
gives a general intelligence score + 15 subtests
assesses a range of intellectual abilities
Francis Galton
= this person correlated reaction time to intelligence
HOWEVER, this person also used intelligence tests to support eugenics
Raven’s Matrices
= a non-verbal IQ test that measures intellectual development + logical thinking
tables given to test pattern-recognition with increasing difficulty
Flynn Effect
= an increase in population Intelligence Quotient (IQ) throughout the 20th century
characterized by rapid changes (+3 IQ points per decade)
due to a combination of environmental factors
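For a sense of scale, the +3 points per decade figure can be turned into a rough estimate over longer stretches. This assumes the rate stays constant, which is a simplification; `flynn_shift` is a made-up helper.

```python
POINTS_PER_DECADE = 3  # rate given in the notes

def flynn_shift(years):
    """Estimated rise in average IQ over `years`, assuming a constant rate."""
    return POINTS_PER_DECADE * years / 10

print(flynn_shift(30))  # 9.0 points over three decades
```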
Growth Mindset
= when you DO believe you can improve intellectually
self-efficacy
don’t allow challenges to define you
more likely to embrace challenges
appreciates feedback
Fixed Mindset
= when you DON’T believe you can improve intellectually
ex. *does bad on math test* → “I’m just not a math person”
learned helplessness
Are tests predictive?
idk, what do you think?
Are tests biased?
idk, what do you think?
What portion of intelligence is due to nature? due to nurture?
55% of intelligence is heritable (biological/nature)
However, intelligence can be modified by changing the environment
ex. improving nutrition, removing toxins, better schools, the ratio of encouraging comments to reprimands, amount of attention from adults, etc.
Between-Group Differences
= the average of group 1 compared to the average of group 2
ex. average height of women = 5’4” vs. men = 5’9”
Within-Group Differences
= the range of differences within 1 group (individual 1 compared to individual 2)
ex. Ms. Georges height vs. Mrs. Silipo’s height (within the female SHS teachers population)
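A small sketch of the two comparisons, using invented heights in inches chosen so the averages match the 5’4” vs. 5’9” example.

```python
from statistics import mean

# Hypothetical heights in inches.
women = [62, 64, 65, 63, 66]
men = [68, 70, 69, 71, 67]

# Between-group: compare the two group averages.
between = mean(men) - mean(women)

# Within-group: spread between individuals inside one group.
within_women = max(women) - min(women)
within_men = max(men) - min(men)

print(between, within_women, within_men)
```

With these numbers the spread inside each group (4 inches) is almost as large as the gap between the group averages (5 inches), so a group average says little about any one individual.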
Question Familiarity
one criticism of standardized tests
= the Qs are more familiar to middle/upper-middle class than others
or
= the Qs might reflect common knowledge of the majority group / the interests of one specific group
ex. only 1 student knowing about Ash Wednesday (majority group = Christians)
ex. a Q about using instrumental aggression in football (biased towards men)
ex. MCAS question about snow days given to the Midwest region (probably created by people in northern regions)
Motivation
recall Maslow’s hierarchy:
kids who are worried about meeting basic needs are not going to put energy into esteem needs → won’t test well
also:
in many schools, there’s an anti-intellectual culture → doing well on tests = “Nerd!!”
Self-fulfilling Prophecy
people often conform to what’s expected of them
Stereotype Threat
= the tendency for members of a group about which a negative stereotype exists to perform poorly on an instrument designed to assess an ability related to that stereotype
AKA members who are thought to be ____ will conform to that expectation when tested about _______
this will subconsciously affect a child growing up under this stereotype (they will fall prey to the self-fulfilling prophecy)
you can reduce stereotype threat with deception in experiments!
Peer Pressure and Group Norms
= different groups have different beliefs about school success
based on research from the 2000s:
Black and White kids think school success is due to INNATE INTELLIGENCE
Asian and Latinx kids think it’s due to HARD WORK
Biological Reactions to Stress
= physical stress reactions (changes in cortisol, blood pressure, etc.) hurt memory, attention, and executive functioning, all necessary components for academic success
Achievement Tests
= any norm-referenced standardized test intended to measure skill/knowledge in a certain subject
Aptitude Tests
= attempts to determine a person’s ability to acquire (through future training) specific skills
AKA how much potential one has (?)
ex. career tests for high school students
Personality Tests
= designed to systematically elicit information about a person's motivations, preferences, interests, emotional make-up, and style of interacting with people and situations
ex. MMPI-2, MBTI, etc.
Objective Tests
= tests that are easily scored, can be given in groups, are forced choice (test taker has to choose from multiple choice or true/false)
answers are then tabulated and the scores determine personality traits
ex. Achievement tests (intelligence), Aptitude tests (intelligence), MMPI (personality), MBTI (personality)
Projective Tests
= tests that are unstructured where the subject is shown ambiguous stimuli
ex. Thematic Apperception Test (TAT)
ex. Rorschach inkblot
Thematic Apperception Test (TAT)
= a type of projective test that involves describing ambiguous scenes to learn more about a person's emotions, motivations, and personality
Rorschach Inkblot
= a projective psychological test in which subjects' perceptions of inkblots are recorded and then analyzed using psychological interpretation, complex algorithms, or both
Intellectual Disability
= limited cognitive and adaptive functioning
= IQ less than 70, needs support to live
have trouble with executive functioning
Adaptive Functioning
= ability to care for self and meet general social expectations
Executive Functioning
= planning strategies, multi-step problems, metacognition
Intermittent
level 1 of support needed for those with intellectual disabilities
= as needed basis (usually only needed when person starts something new in life)
Limited
level 2 of support needed for those with intellectual disabilities
= for limited time/activities (ex. job coach)
Extensive
level 3 of support needed for those with intellectual disabilities
= long-term involvement (help w/daily living)
Pervasive
level 4 of support needed for those with intellectual disabilities
= intense, long-term care (possibly needed to keep a person alive) for all parts of their life
Savants
= people (usually with intellectual disabilities) who have a genius-like ability in a very narrow area
Prodigy
= a child with amazing ability
IQ > 135
Genius
scoring 2 standard deviations above the mean (roughly the top 2%)
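As a rough check on these cutoffs, the share of people above a given IQ can be read off a normal curve. This sketch assumes IQ scores follow a bell curve with mean 100 and standard deviation 15 (the usual convention, not stated in the notes).

```python
from statistics import NormalDist

# Assumed IQ scale: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Share of people scoring more than 2 standard deviations above the mean (IQ > 130).
share_above_2sd = 1 - iq.cdf(130)

# IQ score that marks off the top 1% of the curve.
cutoff_top_1pct = iq.inv_cdf(0.99)

print(f"{share_above_2sd:.1%}")   # a bit over 2%
print(round(cutoff_top_1pct, 1))  # just under 135
```

On this scale, 2 standard deviations above the mean lands at about the top 2%, while an IQ above 135 is about the top 1%.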