PSYC 385 Preliminary Exam 3

PSYC 385 Preliminary Exam 3 Study Guide

Chapter 7 continued

Understand the definition of costs and benefits and how they impact utility. Make sure you can provide examples of:
- Economic costs and noneconomic costs
  - Economic Costs: Purchasing the test, buying equipment
  - Noneconomic Costs: Social consequences
- Economic benefits and noneconomic benefits
  - Economic Benefits: Higher profits, reduced training costs
  - Noneconomic Benefits: Improved work environment, Mental Health

What is the purpose of a utility analysis? What does a utility analysis result in?
- The utility analysis is a cost-benefit analysis designed to determine the usefulness and/or practical value of an assessment.
- It results in an educated decision as to which of several alternative courses of action is optimal.
Know the different methods of utility analysis.
- Expectancy data: the likelihood that a test-taker will score above or below an established threshold on a criterion measure
  - Example: Taylor-Russell tables
- The Brogden-Cronbach-Gleser formula is used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument under specified conditions.
- Utility gain refers to an estimate of the benefit (monetary or otherwise) of using a particular test or selection method.
What can influence the size of the job applicant pool?
- Expertise, economic climate, complexity of the job
- What is the significance of the selection ratio?
  - Selection ratio = number of people hired / number of people who applied
    - It indicates how competitive a job opening is: the lower the ratio, the more selective the hiring
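
As a small illustration of the selection ratio, here is a minimal sketch. The function name and the applicant numbers are made up for the example and do not come from the course material.

```python
def selection_ratio(num_hired: int, num_applicants: int) -> float:
    """Return hired / applicants; smaller values mean a more competitive opening."""
    if num_applicants <= 0:
        raise ValueError("num_applicants must be positive")
    return num_hired / num_applicants


# Hypothetical numbers: 5 people hired out of 100 applicants.
ratio = selection_ratio(5, 100)
print(f"Selection ratio: {ratio:.2f}")  # prints "Selection ratio: 0.05"
```
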
Are false positives seen more often at higher or lower cut scores?
- False positives are seen more often at lower cut scores.
  - Lower Cut Scores: A lower cut score allows more candidates to pass the initial screening and move forward in the hiring process. This increases the likelihood that some candidates who may not truly meet the standards are included, leading to a higher rate of false positives. Essentially, lowering the cut score increases the risk of accepting candidates with lower predictive success, who may then struggle in the role.
  - Higher Cut Scores: In contrast, higher cut scores are more restrictive, allowing only the top-performing candidates to pass. This higher standard typically reduces the number of false positives, as only those who strongly demonstrate the desired qualifications move forward. However, high cut scores can also increase the risk of false negatives—rejecting candidates who might have been successful if given the opportunity.
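
The trade-off between false positives and false negatives can be seen by counting errors at two cut scores. This is only a sketch: the ten applicants, their scores, and the two cut scores are hypothetical values chosen for illustration.

```python
# Hypothetical applicants: (test score, later judged successful on the job?)
applicants = [
    (45, False), (50, False), (55, True), (60, False), (65, True),
    (70, False), (75, True), (80, True), (85, True), (90, True),
]


def classification_errors(data, cut_score):
    """Count (false positives, false negatives) for a given cut score.

    False positive: passed the cut but was not successful.
    False negative: failed the cut but would have been successful.
    """
    false_pos = sum(1 for score, ok in data if score >= cut_score and not ok)
    false_neg = sum(1 for score, ok in data if score < cut_score and ok)
    return false_pos, false_neg


print(classification_errors(applicants, 50))  # low cut score  -> (3, 0)
print(classification_errors(applicants, 75))  # high cut score -> (0, 2)
```

With the lower cut score the errors are all false positives; with the higher cut score they shift to false negatives.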

Know the difference between relative and fixed cut scores.
- Relative cut scores: set in reference to normative data (e.g., selecting people in the top 10% of test scores)
- Fixed cut scores: set based on having achieved a minimum level of proficiency on a test (e.g., a driving license exam)
Multiple cut scores vs. multiple hurdles, definition, and their difference.
- Multiple Cut Scores: This approach involves setting several cut-off points on a single assessment to categorize performance levels (e.g., grades A, B, C, D, F).
  - The difference is that multiple cut scores refer to where people fall among several available cuts on one predictor, whereas multiple hurdles refer to a sequence of stages, each with a cut score that must be met to continue
- Multiple Hurdles: In this approach, candidates must meet specific cut-off scores at each stage of a sequential assessment process to continue. Each "hurdle" represents a minimum performance requirement, and failure to meet the score results in elimination from the selection process. Common in hiring, multiple hurdles ensure that only those who meet progressive criteria continue to the next stage, helping organizations filter candidates more effectively.
- Key Difference: multiple cut scores allow for a range of performance levels on a single predictor and do not stop progression based on a single score, while multiple hurdles require candidates to meet specific benchmarks at each stage of assessment, creating a sequential elimination process.
What are the different methods for setting cut scores?
- The Angoff method: the judgments of experts are averaged to yield cut scores.
  - This can be used for personnel selection, traits, attributes, and abilities.
  - Problems arise if there is a low agreement between experts
- The Known Groups Method: entails the collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest
  - This can be used to establish different cut scores
  - After analysis of the data, a cut score is chosen that best discriminates the groups
- IRT-Based Methods: In an IRT framework, each item is associated with a particular level of difficulty or severity
  - To “pass” the test, the test-taker must answer items that are deemed to be above some minimum level of difficulty, which is determined by experts and serves as the cut score
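
To make the averaging step of the Angoff method concrete, here is a minimal sketch. It assumes the common version of the procedure, in which each expert estimates the probability that a minimally competent test-taker answers each item correctly. The expert names and ratings are hypothetical.

```python
from statistics import mean

# Hypothetical ratings: for each expert, the estimated probability that a
# minimally competent test-taker answers each of four items correctly.
expert_ratings = {
    "expert_a": [0.9, 0.7, 0.6, 0.4],
    "expert_b": [0.8, 0.6, 0.5, 0.5],
    "expert_c": [0.7, 0.8, 0.6, 0.3],
}


def angoff_cut_score(ratings):
    """Sum each expert's item ratings, then average the sums across experts."""
    per_expert = [sum(items) for items in ratings.values()]
    return mean(per_expert)


def expert_spread(ratings):
    """Range of the experts' implied cut scores; a large range signals low agreement."""
    per_expert = [sum(items) for items in ratings.values()]
    return max(per_expert) - min(per_expert)


print(round(angoff_cut_score(expert_ratings), 2))  # about 2.47 of 4 items
print(round(expert_spread(expert_ratings), 2))     # about 0.2
```

A wide spread between experts is the situation the notes flag as a problem for this method.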
What is discriminant analysis?
- Discriminant analysis is a family of statistical techniques used to shed light on the relationship between identified variables (such as scores on a battery of tests) and two (and in some cases more) naturally occurring groups (such as persons judged to be successful at a job and persons judged unsuccessful at a job).
In IRT-based methods of setting cut scores, each item is associated with a particular level of difficulty or severity.

Chapter 8

What are the five stages of test development?
- Test conceptualization
- Test construction
- Test tryout
- Analysis
- Revision
  - Then back to test tryout
What are the relations between the five stages of making a test?
- Conceptualization occurs first and allows for the development of the construct, items, and scoring methods, among others. The test is then constructed, tried out in samples, analyzed, and revised.
  - TEST DESIGN IS CYCLICAL
What types of questions need to be considered during the ‘conceptualization’ phase?
- Some preliminary questions: What is the test designed to measure?
  - What is the objective of the test? How will we measure this?
  - Is there a need for this test?
  - Who will use this test?
Be able to give examples of pilot work and scaling as they pertain to test construction.
- Pilot work: create a prototype and gather feedback through focus groups and expert panels
  - Ex: having a focus group try out the prototype first
- Scaling: Quantifying or calibrating the measure

Type of scales – unidimensional, multidimensional, categorical, ordinal, etc

When would you want to use selected- vs. constructed-response formats?
- Selected-response format – items require test-takers to select a response from a set of alternative responses.
  - The test maker provides the options, and test-takers pick their answers
  - This format is used on the tests for this class (multiple-choice questions)
    - Most tests are like this; it is efficient, and the test maker writes the responses
    - A Likert scale is an example: respondents cannot create their own scale and must use the one provided (quantitative)
- Constructed-response format – items require test-takers to supply or to create the correct answer, not merely to select it.
  - These items ask the test-taker to supply the response, which is common in projective testing and qualitative measures; we want them to produce something rather than choose from what we give them
  - This allows more personal, qualitative answers

What are some benefits of using computerized adaptive testing?
- Computerized adaptive testing (CAT) – an interactive, computer-administered test-taking process wherein items presented to the test-taker are based in part on the test-taker's performance on previous items
  - If the test-taker answers an item correctly, the computer increases the difficulty; if the test-taker answers incorrectly, it decreases the difficulty. This is how the computer homes in on the test-taker's level.

CAT can provide economy in testing time and the number of items presented
- It uses IRT, not CTT, so the item bank must include items spread across the full range of difficulty.

CAT tends to reduce floor effects and ceiling effects.
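
The step-up/step-down idea behind CAT can be sketched as follows. This is a deliberately simplified illustration: real CAT systems estimate ability with IRT rather than moving one level at a time, and the function name, the nine difficulty levels, and the simulated test-taker are all hypothetical.

```python
def run_simple_cat(answers_correctly, num_levels=9, num_items=5):
    """Present num_items items, raising the difficulty level after a correct
    answer and lowering it after an incorrect one. Returns the levels presented."""
    level = (num_levels + 1) // 2  # start in the middle of the difficulty range
    presented = []
    for _ in range(num_items):
        presented.append(level)
        if answers_correctly(level):
            level = min(num_levels, level + 1)  # harder item next
        else:
            level = max(1, level - 1)           # easier item next
    return presented


# Hypothetical test-taker who can answer items up to difficulty level 6.
levels = run_simple_cat(lambda level: level <= 6)
print(levels)  # [5, 6, 7, 6, 7]
```

The items quickly settle around the test-taker's level, which is why fewer items are needed and why floor and ceiling effects are less likely.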

How might your motivations for test conceptualization impact the type of scoring you use?
Floor vs. ceiling effects and their impact on interpretation.
Know the different types of scoring and scaling approaches
- Ipsative scoring (idiographic)
  - Comparing a test-taker's score on one scale within a test to their score on another scale within that same test
    - Example: comparing a participant's scores within an academic assessment, such as their math score with their verbal score.
- Guttman scale
  - Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with milder statements.
    - Example: Recycling attitudes from least to most aggressive
- Categorical scaling
  - Stimuli (index cards) are placed into one of two or more alternative categories.
    - Example: Give people materials to sort into groups (e.g. somewhat like me, not like me, me).
- Comparative scaling
  - Entails judgment of a stimulus in comparison with another stimulus
- Cumulative scoring
  - Assumption that the higher the score on the test, the higher the test-taker is on the ability, trait, or other characteristic that the test purports to measure
    - Example: A scale that adds all the items up, giving a total score. The score tells us something about the person on the scale.
- Class scoring
  - Responses earn credit toward placement in a particular class or category with other test-takers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing).
    - Example: A PTSD measure that has several classes to identify someone's level of PTSD.
What qualities separate good items from bad items?
- A good item is reliable and valid
  - It correlates with what we expect it to (is it valid?)
  - We have different methods of going over how we know something is valid
- A good item discriminates test-takers – high scorers on the test overall answer the item correctly.
What is the purpose of the different types of item analysis indices - e.g. difficulty, reliability, validity, discrimination
- Item-Difficulty Index – The proportion of respondents answering an item correctly
- Item-Endorsement Index – the percentage of agreement, as opposed to the percentage correct.
  - In a true/false example: how many people answer true vs. false? For symptoms: how many people say “Yes, I have this to some degree”?

- Item-Reliability Index – an indication of the internal consistency of the scale
  - Factor analysis can also indicate whether items that are supposed to be measuring the same thing load on a common factor.
  - It is a way to estimate how consistent an item is with the rest of the scale: some items "play nicely" with the other items, and others do not
    - It can also help show whether people know the answer or are guessing: examining patterns of responses can identify whether an item measures genuine knowledge or is subject to random guessing.

- Item-Validity Index – allows test developers to evaluate the validity of items in relation to a criterion measure
  - This index is used less often, likely because it is harder to compute.

- Item-Discrimination Index – indicates how adequately an item separates or discriminates between high scorers and low scorers
  - d-value: the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly
    - Indicates how well the item separates people with high versus low levels of what is being measured
    - The d-value gives an indication of item discrimination in classical test theory
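
The item-difficulty index and the d-value can be computed directly from a response matrix. The sketch below uses a hypothetical set of eight test-takers and four items. Comparing the three highest and three lowest total scorers is an assumption made for the example (textbooks often use the upper and lower 27%).

```python
# Hypothetical data: each row is one test-taker's item responses (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]


def item_difficulty(data, item):
    """Proportion of all test-takers answering the item correctly (p)."""
    return sum(row[item] for row in data) / len(data)


def item_discrimination(data, item, group_size):
    """d = proportion correct among top scorers minus proportion among bottom scorers."""
    ranked = sorted(data, key=sum, reverse=True)  # rank by total test score
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return p_upper - p_lower


print(item_difficulty(responses, 0))                    # 0.75
print(round(item_discrimination(responses, 0, 3), 2))   # 0.67
print(round(item_discrimination(responses, 3, 3), 2))   # 0.33
```

In this made-up data, the first item separates high and low scorers better than the last item, even though both are answered correctly by at least half of the group.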

Be able to identify and interpret the α and b parameters of an item characteristic curve. How do these parameters inform your interpretation?
- The α parameter indicates the relatedness (the slope) of the item to the latent construct (i.e., discrimination). The steeper the slope, the more discriminating the item.
  - Example: how strongly an item is associated with marital distress is captured by the α parameter.
- The b parameter indicates the point on the latent construct where the probability of endorsing the item equals 0.50 while controlling for mean differences along the continuum of marital distress. In other words, b tells us where on the continuum a person has a 50% probability of saying yes to (endorsing) the item.
- The b parameter is used to characterize the difficulty of an item and to compare the difficulty of different items. For diagnostic tools, severity is another way to say difficulty. The curve closest to the y-axis is the least difficult.
  - Discrimination goes from 0 to 5
  - The difficulty is a z-score, so you want higher here, too (from 0 to 3 or 3 to 0)
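
One way to see how the two parameters shape an item characteristic curve is the two-parameter logistic (2PL) model. The notes do not name a specific model, so using the 2PL here is an assumption; in the code, `a` stands for the α (discrimination) parameter, `b` for difficulty, and `theta` for the person's level on the latent construct. The item values are hypothetical.

```python
import math


def icc_probability(theta, a, b):
    """2PL model: probability of endorsing an item at latent level theta.

    a = discrimination (slope), b = difficulty/severity (location).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Two hypothetical items with the same difficulty (b = 1.0) but different slopes.
for a in (0.5, 2.0):
    probs = [round(icc_probability(theta, a, b=1.0), 2) for theta in (0.0, 1.0, 2.0)]
    print(f"a={a}: {probs}")
```

Both items give a probability of 0.50 at theta = b, but the item with the larger `a` moves from low to high probability much faster around that point, which is what makes it more discriminating.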
How do qualitative methods differ from more quantitative methods?
- Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical procedures
  - Think-aloud test administration – respondents are asked to verbalize their thoughts as they occur during testing
  - Expert panels– Experts may be employed to conduct a qualitative item analysis
  - Sensitivity review – items are examined for fairness to all prospective test-takers. Check for offensive language, stereotypes, etc.
When and why would we want to revise a test?
How can cross-validation and co-validation be used to revise a test?
- Cross-validation refers to the revalidation of a test on a sample of test-takers other than those on whom test performance was originally found to be a valid predictor of some criterion.
  - It asks how widely the test can be used: the further the new sample is from the original one, the lower the validity tends to be. This decrease is called validity shrinkage.
  - Be aware of validity shrinkage
  - Cross-validation shows that the test works with another sample
- Co-validation – a test validation process conducted on two or more tests using the same sample of test-takers.
  - Can we validate a different test with the same people at the same time?
  - Co-validation is economical for test developers
  - It minimizes sampling error

What are the three applications of IRT in test building/revising?
- Evaluating existing tests for the purpose of mapping test revisions
- Determining measurement equivalence across test-taker populations
- Developing item banks

Chapter 9/10

Describe interactionism

- Interactionism: heredity and environment interact to influence one’s intellect

What is Wechsler’s definition of intelligence?
- Intelligence is the aggregate or global capacity of the individual to act purposefully, reason, and deal effectively with his environment (Wechsler, 1944, p. 3).
- Intelligence is not merely the sum of abilities.
Know the main contributions to intelligence testing by Galton, Binet, Terman, Wechsler, and Piaget
- Francis Galton (1822-1911) – emphasized the heredity of intelligence
- Alfred Binet (1857-1911) – test scores are a measure of performance, not strictly true intelligence; intelligence reflects the relative contribution of several abilities (developed the first intelligence test)
  - 1905 – Binet-Simon test of intelligence
- Lewis Terman (1877-1956) – revised the Binet-Simon scale into what is now known as the Stanford-Binet
  - Ratio IQ = (mental age / chronological age) x 100
- David Wechsler (1896-1981) – intelligence is not the mere sum of abilities; believed it was important to measure several aspects
  - In 1939, developed an intelligence test that included nonverbal tasks
- Jean Piaget (1896-1980) – stages of cognitive development
  - Did not make a test, but influenced how others made their tests
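
The ratio IQ formula listed under Terman can be worked through with a short example. The function name and the ages are hypothetical, chosen only to show the arithmetic.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ = (mental age / chronological age) x 100."""
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return mental_age / chronological_age * 100


# A child with a mental age of 10 and a chronological age of 8.
print(ratio_iq(10, 8))  # 125.0
# Mental age equal to chronological age gives the average score of 100.
print(ratio_iq(8, 8))   # 100.0
```
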

What is the Binet-Simon intelligence test? What are its main features?
- The Binet-Simon Scale (the first intelligence test) assesses a person's cognitive abilities and mental capacity, including memory, reasoning, problem-solving, and perception. It is based on the concept of mental age, which refers to the level of intellectual development an individual has achieved relative to others of the same age.
What is factor analysis?
- Factor analysis: a family of statistical techniques designed to determine the existence of underlying relationships between sets of variables or items
  - Factor analysis is also a way to statistically organize variables (e.g., measures of intelligence)
    - It identifies meaningful underlying structure among a set of variables
      - Ex: some tasks say something about memory and others about something else; factor analysis shows how the tasks organize into groups, which tell us about abilities and, in turn, about a general theory of intelligence
Be familiar with the basic factor-analytic theories of intelligence
- Charles Spearman’s theory (g and s): Spearman proposed the existence of a general intellectual ability factor (g) and specific factors of intelligence (s).
  - g was assumed to afford the best prediction of overall intelligence
  - Group factors: an intermediate class of factors common to a group of activities but not all; neither as general as g nor as specific as s
- Guilford and Thurstone: de-emphasized or eliminated any reference to g → why look for g when we could look for multiple-factor models instead of just one factor?
- Gardner: developed a theory of seven intelligences, also asking why we should look for a single g rather than multiple factors
  - The theory elaborates on different aspects of intelligence (e.g., emotional intelligence)
- Horn and Cattell: proposed two major types of intelligence, crystallized and fluid
  - Crystallized intelligence (Gc): includes acquired skills and knowledge that are dependent on exposure to a particular culture as well as on formal and informal education
    - Tends to be very verbally driven
    - If a measure is highly culturally loaded, this becomes problematic
    - That is, scores are influenced by education, books, and other factors that tell us nothing about underlying ability
  - Fluid intelligence (Gf): nonverbal, relatively culture-free, and independent of specific instruction; the ability to adapt to novel situations
- Carroll: developed the three-stratum theory of cognitive abilities
  - At the most basic level are the specific tasks on the test and what they ask test-takers to do. Do not memorize the groupings; just know crystallized and fluid intelligence as a whole.
What are the differences between crystallized and fluid intelligence? What is the effect of aging or injury on each?
- Crystallized intelligence (Gc): includes acquired skills and knowledge that are dependent on exposure to a particular culture as well as on formal and informal education
  - Tends to be very verbally driven
  - If a measure is highly culturally loaded, this becomes problematic
    - That is, scores are influenced by education, books, and other factors that tell us nothing about underlying ability
- Fluid intelligence (Gf): nonverbal, relatively culture-free, and independent of specific instruction; the ability to adapt to novel situations
  - Fluid intelligence: decreases as we age; it is more flexible and reflects how we adapt to environments. Injury has a similar effect.
  - Crystallized intelligence: increases as we age, because it consists of what we learn from the world and culture

What is the CHC model?
- Kevin McGrew (1997) proposed the CHC model, integrating the Cattell-Horn theory and Carroll's theory (CH + C)
  - Features ten "broad-stratum" abilities and over seventy "narrow-stratum" abilities
  - Does not include the general intellectual ability factor (g)
What are Thorndike’s three clusters of ability?
- Social intelligence
- Concrete intelligence
- Abstract Intelligence
How are information-processing theories different from factor-analytic theories of intelligence?
- Information-processing theories: focus on identifying the specific mental Processes that constitute intelligence. How information is processed as opposed to what is processed
- Factor-analytic theories of intelligence (e.g., Spearman's g and s) focus on identifying the abilities that make up intelligence, that is, what is processed rather than how it is processed.
- Define simultaneous and successive processing.
  - Simultaneous (parallel) processing: The integration of information occurs all at once
  - Successive (sequential) processing: The processing of information in a logical sequence
- What is the PASS model?
  - PASS model – planning, attention, simultaneous, successive
    - Or the strategy, receptivity, & type of information processing

Articles

Ariel et al. (2015) The effect of police body-worn cameras?
- Purpose of the Study: This randomized controlled trial (RCT) aimed to assess the effect of body-worn cameras (BWCs) on police officers' use of force and citizen complaints.
- Methodology: The researchers conducted a field experiment, randomly assigning some officers to wear BWCs while others did not. This approach provided a rigorous comparison to observe behavior changes.
- Key Findings:
  - Reduction in Use of Force: Officers wearing BWCs were less likely to use force than those without cameras. This finding suggests that BWCs may encourage officers to act more cautiously or according to protocol.
  - Decrease in Citizen Complaints: Complaints against officers with BWCs dropped significantly. The study suggests that both police and citizen behavior might improve when interactions are recorded.
- Implications: The findings support the idea that BWCs can increase transparency and accountability, potentially improving public trust in police and reducing incidents of excessive force.
- Limitations: Although promising, the study acknowledges that BWCs alone may not resolve complex issues in policing and should be part of broader reforms.
Balderrama-Durbin et al. (2018) The Deployment Communication Inventory
- Purpose of the Study:
  - This study aimed to develop and validate the Deployment Communication Inventory (DCI) as a tool to assess the quality and frequency of communication between military service members and their romantic partners during deployment, with a focus on its impact on relationship satisfaction and resilience.
- Methodology:
  - The researchers created the DCI based on existing literature and theoretical models, such as attachment theory and stress process frameworks. Participants included military couples who completed surveys about their communication during deployment. The inventory measured both the frequency (how often communication occurred) and the quality (emotional tone and content) of these interactions.
- Key Findings:
  - Positive Communication Quality: High-quality communication (e.g., supportive and understanding exchanges) was associated with greater relationship satisfaction and emotional intimacy.
  - Frequency vs. Quality: While frequent communication was generally beneficial, it was less important than the perceived quality of those interactions.
  - Conflict Impact: Negative or conflictual communication during deployment significantly predicted lower relationship satisfaction and increased stress for both partners.
- Implications:
  - The DCI provides a reliable framework for evaluating how deployment communication affects relationship dynamics. It highlights the importance of fostering positive communication strategies among military couples, which could inform counseling and support programs for families experiencing deployment.
- Limitations:
  - The study focused on self-reported data, which may introduce bias. Additionally, the findings may not generalize to all military couples, particularly those with limited access to communication technologies. Future research could explore how different types of communication (e.g., video calls vs. text) uniquely impact relationships.