Midterm Review
Ch. 1
Difference between a psychological concept and a psychological construct
Concept: An abstraction formed by generalization from particulars
Abstract concepts are hard to define, e.g., intelligence
Construct: A concept with a scientific purpose (i.e., operationalized); a functional concept
Can be measured and studied, e.g. IQ
Why do we use psychological testing
Many psychological traits are not directly observable or self-reportable and thus require indirect methods for assessment.
Psychological tests enable researchers and clinicians to uncover patterns, correlations, and predictions that are difficult to identify through straightforward verbal or written responses.
Primary objective of psychological testing
The objective of testing is typically to obtain some gauge, usually numerical in nature, with regard to an ability or attribute.
Differentiate between psychological assessment and psychological testing
Psychological testing: process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.
The objective of testing is typically to obtain some gauge, usually numerical in nature, with regard to an ability or attribute.
Psychological assessment: gathering and integration of psychology-related data for the purpose of making a psychological evaluation through tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures.
The objective of assessment is typically to answer a referral question, solve a problem, or arrive at a decision through the tools of evaluation.
Where can you find information regarding how a particular test was developed;
Test Manuals, catalogs, professional books, reference volumes, journal articles and online databases are all great places to start.
psychometric soundness - The technical quality of a test or other tool of assessment
Mental Measurements Yearbook - A catalog of tests that updates every three years. It provides detailed information about each test listed, including the publisher, test author, purpose, intended population and test administration time.
Tools of Psychological Assessment:
tests,
A psychological test is a device or procedure designed to measure variables related to psychology (for example, intelligence, attitudes, personality, and interests). Psychological tests and other tools of assessment vary by content, format, technical quality, and administration, scoring, and interpretation procedures.
interviews and proper way to conduct them;
The interview is a method of gathering information through direct communication involving reciprocal exchange. Interviews vary based on their purpose, length, and nature. The quality of information obtained in an interview often depends on the skills of the interviewer: their pacing, rapport with the interviewee, and their ability to convey genuineness, empathy, and humor.
portfolios,
A work sample; referred to as portfolio assessment when used as a tool in an evaluative or diagnostic process
computers,
Computers can assist in test administration, scoring, and interpretation.
Scoring may be done on-site (local processing) or at a central location (central processing). Reports may come in the form of a simple scoring report, extended scoring report, interpretive report, consultative report, or integrative report.
Computer assisted psychological assessment (CAPA) has allowed for tailor-made tests with built-in scoring and interpretive capabilities.
EMA (ecological momentary assessment),
The "in the moment" evaluation of specific problems and related cognitive and behavioral variables at the exact time and place that they occur
panel interviews,
Also referred to as a board interview, an interview conducted with one interviewee by more than one interviewer at a time
psychological interviews,
A therapeutic dialogue that combines person-centered listening skills such as openness and empathy with the use of cognition-altering techniques designed to positively affect motivation and effect therapeutic change
Behavioral observations;
Monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions, typically for diagnostic or related purposes and either to design intervention or to measure the outcome of an intervention
Naturalistic observation - Behavioral observation that takes place in a naturally occurring setting (as opposed to a research laboratory) for the purpose of evaluation and information-gathering
case studies
Also referred to as a case history study, this is a report or illustrative account concerning a person or an event, compiled on the basis of case history data
Therapeutic psychological assessment vs traditional psychological evaluations
Therapeutic psychological assessment - A collaborative approach wherein the discovery of therapeutic insights about oneself is encouraged and actively promoted by the assessor throughout the assessment process
In traditional psychological evaluations, the assessment is designed to have its intended benefits at the end of the process: The examiner explains the results, summarizes the case conceptualization, and shares a list of recommendations designed to help the examinee.
In contrast, therapeutic psychological assessment aims to be helpful throughout the assessment process. Results are not withheld until the end but shared immediately, so that the assessor and assessee can develop an interpretation of the results together.
Terms:
Psychometrics: the field of study concerned with the theory and technique of educational and psychological measurement (measurement of knowledge, abilities, attitudes, and personality traits).
It involves two major tasks: (1) the construction of instruments and procedures for measurement and (2) the development and refinement of theoretical approaches to measurement
dynamic assessment- An interactive approach to psychological assessment that usually follows a model of (1) evaluation, (2) intervention of some sort, and (3) evaluation
psychological autopsy - A reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased while alive or with people who knew the deceased
protocol - (1) The form or sheet on which test takers' responses are entered; (2) a method or procedure for evaluation or scoring
Scoring - The process of assigning evaluative codes or statements to performance on tests, tasks, interviews, or other behavior samples
When role play may be preferable to naturalistic observation
In situations where it may be too difficult, costly, or risky to observe the behavior in a real situation (e.g., responses to an emergency), role play offers a controlled substitute.
Advantages and disadvantage of CAPA
CAPA stands for computer-assisted psychological assessment. In this case, the word assisted typically refers to the assistance computers provide to the test user, not the test taker.
CAPA tests allow for more diverse forms of measurement. For example, Q-interactive allows the user to administer tests by means of two iPads connected via Bluetooth. This allows the administrator to record the verbal responses of the test taker, and scoring is immediate.
CAPA also allows for adaptive test taking, meaning that the questions can become more or less difficult in response to the test taker's performance.
Pros:
CAPA saves professional time in test administration, scoring and interpretation
CAPA results in minimal scoring errors from humans
CAPA yields standardized interpretation of findings due to elimination of unreliability traceable to differing points of view
The computer has the capacity to standardize data better than humans
Nonprofessionals can assist in test administration, and tests can be administered in one sitting.
Paper and pencil tests can be converted to CAPA
Computer tests tailor test length and content based on responses of test takers
Cons:
Professionals need to learn software and hardware documentation
The possibility of software and hardware issues is ever present, and such issues may be difficult to pinpoint
Leaves test takers without ability to use test-taking strategies (previewing questions, skipping questions, etc.)
Standardized interpretation of findings may not be perfect, could profit from alternative viewpoints
Computers lack flexibility of humans to recognize the exception to a rule
Use of nonprofessionals leaves diminished opportunity for the professional to observe the test taker's test-taking behavior and note any abnormalities
Nonprofessionals may administer tests improperly
Paper-and-pencil tests may behave differently administratively than their computerized versions
Security of CAPA can be breached
Parties to the assessment enterprise
Society at large: Test developers devise new tests to meet the needs of an evolving society.
Laws are enacted that may play a major role in test development, administration, and interpretation.
Other parties: Organizations, companies, and governmental agencies sponsor the development of tests for various reasons.
Companies may offer test-scoring and interpretation services.
Academics may review tests and evaluate their psychometric soundness.
By federal law, which of the following types of tests may NOT be used in schools?
All types of testing may be used in schools.
Test accommodations; reference sources
Accommodations: the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.
When a test is not administered as designed, the meaning of its scores often becomes questionable. Test users must rely on judgment, expertise, or guesswork to interpret the data. Any modifications for accommodations should be clearly documented in the test report.
Test Manuals, catalogs, professional books, reference volumes, journal articles and online databases are all great places to start.
Ch. 2
Culture fair and culture specific tests
Culture Fair Test: Minimizes cultural bias with neutral tasks and questions.
Culture Specific Test: Assesses abilities or knowledge relevant to a specific culture, including language and customs.
Examples of culture fair or culture free tests: Raven's Progressive Matrices and the Cattell Culture Fair Intelligence Test
These non-verbal culture-fair tests:
- claimed to be culture fair and designed to reduce cultural influence on performance
- intended to measure innate ability, unaffected by education
- aimed to increase diversity.
But...these goals are not always achieved in practice.
No test is truly culture-free; even "culture fair" is a stretch. Testing, multiple-choice formats, and giftedness are all cultural constructs. Cultural influences can't be eliminated, only reduced.
Facts about intelligence immigrant testing at Ellis Island
12 million immigrants entered the U.S. through Ellis Island
Only about 2% were rejected, with even fewer rejected due to failing mental examinations
91% passed through without undergoing any mental screening. Those selected for mental screening only needed to pass one of several tests, even if they failed initial or follow-up screenings.
Difference between APA testing guidelines and standards
The American Psychological Association (APA) and related professional organizations have published many works over the years to delineate ethical, sound practice in the field of psychological testing and assessment.
Test user qualifications: In 1950, an APA Committee on Ethical Standards for Psychology published a report called Ethical Standards for the Distribution of Psychological Tests and Diagnostic Aids
Three Levels of Expertise:
Level A: Tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual.
Level B: Tests or aids that require some technical knowledge of test construction and of supporting psychological and educational fields.
Level C: Tests and aids that require substantial understanding of testing and supporting psychological fields together with supervised experience in the use of these devices.
The significance of Daubert v. Merrell Dow Pharmaceutical case
The Daubert versus Merrell Dow Pharmaceuticals ruling by the Supreme Court superseded the long-standing policy, set forth in Frye, of admitting into evidence only scientific testimony that had won general acceptance in the scientific community.
Under Daubert, expert testimony could be admissible whether or not it had won general acceptance in the scientific community.
The Daubert ruling gave trial judges more leeway in deciding which testimony should be heard by the jury
informed consent right - Permission to proceed with a (typically) diagnostic, evaluative, or therapeutic service on the basis of knowledge about the service and its risks and potential benefits
disparate treatment - The consequence of an employer’s hiring or promotion practice that was intentionally devised to yield some discriminatory result or outcome; contrast with disparate impact
disparate impact- The consequence of an employer’s hiring or promotion practice that unintentionally resulted in a discriminatory result or outcome; contrast with disparate treatment
When a test is deemed discriminatory, which test qualities are scrutinized?
The competencies actually assessed by the test and how related those competencies are to the job
Differential weighting of items on the test or selection procedures
The psychometric basis for the cutoff score in effect
The rationale in place for rank-ordering candidates
A consideration of potential alternative evaluation procedures that could have been used
An evaluation of the statistical evidence that suggests discrimination or reverse discrimination occurred.
Why are discriminatory test lawsuits expensive?
Because a Title VII lawsuit typically includes the cost of attorneys, consultants, and experts. Retrieval, scanning, and storage of records is costly as well. No additional employees can be hired or fired during a lawsuit like this.
Projective tests examples, rationale
Projective tests, such as the Rorschach Inkblot Test, are tests in which an individual is assumed to “project” onto some ambiguous stimulus his or her own unique needs, fears, hopes, and motivation.
Psychological assessment has proceeded along distinct threads: the academic and the applied.
Academic tradition: researchers at universities throughout the world use the tools of assessment to help advance knowledge and understanding of human and animal behavior.
In the applied tradition, the goal is to help select applicants for various positions on the basis of merit.
Individualistic and collectivistic cultures
Collectivist cultures value traits such as conformity, cooperation, interdependence, and striving toward group goals.
Individualist cultures place value on traits such as self- reliance, autonomy, independence, uniqueness, and competitiveness.
Judgments related to certain psychological traits can be culturally relative.
Cultures differ with regard to gender roles and views of psychopathology.
Cultures also vary in terms of collectivist versus individualist values.
How did the work of Wundt differ from that of Galton, Binet, and James McKeen Cattell?
In Germany, Wilhelm Max Wundt started the first experimental psychology laboratory and measured variables such as reaction time, perception, and attention span, focusing on similarities rather than differences.
Unlike Galton, Wundt saw individual differences as an annoying source of experimental error and aimed to minimize them by controlling extraneous variables
originator of the psychometric concept of test reliability?
Spearman is credited with originating the concept of test reliability.
Who coined the term mental test in 1890?
James McKeen Cattell coined the term mental test in 1890
What did nineteenth-century psychological measurement mainly focus on?
Nineteenth century psychological measures mainly focused on intelligence.
problems unique to self-report tests
Disadvantages:
respondents may have poor insight into themselves
people might honestly believe some things about themselves that in reality are not true
respondents may be unwilling to reveal anything about themselves that is very personal or that paints them in a negative light.
First personality test developed after the first world war
The Woodworth Psychoneurotic Inventory was the first widely used self-report personality test, meant to aid in psychiatric interviews
Rights of test taker, right to privacy exception
Test Takers have a right to know why they are being evaluated, how the data will be used and what (if any) information will be released to whom
The concept of a privacy right refers to "the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share with or withhold from others his attitudes, beliefs, behavior, and opinions."
Who is entitled to privileged communication in relation to post-test feedback
There is confidentiality between therapist and client; however, in some instances, such as threats made against another person's life or an active investigation into a client, privileged communications may be shared with law enforcement.
What test taker characteristics are affected by culture?
Culture - The socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people
Culture affects verbal communication, nonverbal communication and behavior, and standards of evaluation
Why translating a test into another language is not recommended?
Certain nuances of meaning may be lost in translation.
Some interpreters may not be familiar with mental health issues and pre-training may be necessary.
In interviews, language deficits may be detected by trained examiners but may go undetected in written tests.
Assessments need to be evaluated in terms of the language proficiency required and the language level of the test taker.
What did Sir Francis Galton measure in his laboratory?
Sir Francis Galton studied heredity in sweet peas and, in his Anthropometric Laboratory, measured physical and sensory characteristics in people; he also adapted or devised many psychological tests and measures.
Ch. 3
Types of data, scales of measurement, response format
Quantitative data is numbers-based, countable, or measurable; mathematical operations are meaningful with these data.
Qualitative data is interpretation-based and descriptive; mathematical operations are meaningless with qualitative data
There are three primary scales of measurement in statistical analysis: categorical, ordinal, and continuous.
Categorical variables group observations based on characteristics they possess or lack. Nominal variables, synonymous with categorical, use numbers to "name" phenomena like outcomes or traits.
Ordinal variables provide a sense of order and are commonly used in Likert-type scales in applied research. They convey rank (relative magnitude) but not the distance between values.
Continuous variables are the most precise: they convey both order and distance, and at the ratio level possess a "true zero" that makes statements of magnitude (e.g., twice as much) meaningful. Interval, ratio, and count variables are treated as continuous in applied research, offering the highest level of precision and accuracy.
When we’re gathering measurement data, we ask questions and get answers.
Natural dichotomous responses: yes or no, black or white, good or bad, like me or not like me, or to report whether something is true or not.
Forced-choice dichotomies: e.g., social support instrument asks questions “I rely on my friends for emotional support” and “My friends seek me out for companionship” and the only response options you have to these two statements are “yes” or “no.” In reality, more accurate responses would reflect gray areas between yes and no.
Continuous responses allow for three or more choices that increase in value. “Do your friends seek you out for companionship?”
Likert-type scales allow for a continuum of responses. Typically, they allow for five, seven, or nine responses
Raw scores, frequency distribution, grouped frequency distribution
Raw score: A straightforward, unmodified accounting of performance that is usually numerical.
Frequency distribution: All scores are listed alongside the number of times each score occurred. Can be structured either as a table or as a graph and presents the same two elements: (1) the set of categories that make up the original measurement scale; (2) a record of the frequency, or number of values, in each category
Grouped-frequency distribution - the first column lists the groups of scores, called class intervals, instead of actual values.
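A minimal sketch of how these two tables could be built, using made-up quiz scores and class intervals (all data illustrative):
```python
from collections import Counter

# Hypothetical raw scores on a 10-point quiz (illustrative data only)
scores = [7, 8, 8, 9, 6, 7, 8, 10, 7, 9, 8, 6]

# Frequency distribution: each score listed alongside how often it occurred
freq = Counter(scores)
for score in sorted(freq, reverse=True):
    print(score, freq[score])

# Grouped frequency distribution: class intervals instead of actual values
for low, high in [(9, 10), (7, 8), (5, 6)]:
    count = sum(low <= s <= high for s in scores)
    print(f"{low}-{high}: {count}")
```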
Measures of central tendency
Measure of central tendency: A statistic that indicates the average or midmost score between the extreme scores in a distribution.
Mean: Sum of the observations (or test scores in this case) divided by the number of observations. Average.
Median: The middle score in a distribution. - Useful in cases when data is skewed or has outliers.
Mode: The most frequently occurring score in a distribution.
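A quick illustration of the three measures on hypothetical scores, using Python's statistics module:
```python
import statistics

scores = [65, 70, 85, 90, 90, 90, 100]  # hypothetical test scores

print(statistics.mean(scores))    # sum / n -> ~84.3
print(statistics.median(scores))  # middle score of the sorted list -> 90
print(statistics.mode(scores))    # most frequent score -> 90
# Adding an extreme outlier would pull the mean sharply but barely move the median,
# which is why the median is preferred for skewed distributions.
```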
Measures of variability, kurtosis
Variability is an indication of how scores in a distribution are scattered or dispersed.
Kurtosis: The steepness of a distribution in its center.
Platykurtic: Relatively flat.
Leptokurtic: Relatively peaked.
Mesokurtic: Somewhere in the middle.
Range: Difference between the highest and the lowest scores.
Interquartile range: A measure of variability equal to the difference between the first and third quartiles of a distribution.
Semi-interquartile range: The interquartile range divided by 2
Know how standard deviation relates to variance
Standard deviation: The square root of the average squared deviation about the mean; that is, the square root of the variance. Because the variance is based on squared deviations, it is more sensitive to outliers; the standard deviation, being in the original units of the scores, is the more commonly reported measure of spread.
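A worked sketch tying together the variability measures above, with illustrative scores chosen so the numbers come out round (population form of the variance):
```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores; mean works out to 5

mean = statistics.fmean(scores)                                # 5.0
# Variance: the average of the squared deviations about the mean
variance = sum((x - mean) ** 2 for x in scores) / len(scores)  # 4.0
sd = variance ** 0.5           # standard deviation = sqrt(variance) = 2.0

# Range, interquartile range, and semi-interquartile range
spread = max(scores) - min(scores)              # 9 - 2 = 7
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
iqr = q3 - q1
siqr = iqr / 2
```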
Normal curve, distribution under the normal curve
The normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its center and perfectly symmetrical.
Area under the normal curve: the curve can be conveniently divided into areas defined in units of standard deviation.
Roughly 68% of scores fall within 1 SD of the mean
Roughly 95% of scores fall within 2 SDs of the mean
Roughly 99.7% of scores fall within 3 SDs of the mean
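These percentages can be double-checked with Python's statistics.NormalDist:
```python
from statistics import NormalDist

nd = NormalDist()  # standard normal curve: mean 0, SD 1
for k in (1, 2, 3):
    # area between -k and +k standard deviations
    print(k, round(nd.cdf(k) - nd.cdf(-k), 4))
# 1 0.6827   2 0.9545   3 0.9973
```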
Standard scores: %ile, z, T, stanines, know their properties (mean, SD)
A standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation. Percentiles express the percentage of scores falling below a given score; z scores have a mean of 0 and an SD of 1; T scores have a mean of 50 and an SD of 10; stanines are a nine-unit scale with a mean of 5 and an SD of approximately 2.
Be able to calculate and interpret z score from raw score, convert z to T
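A worked example under assumed values (a test with mean 100 and SD 15, raw score 115; all numbers illustrative):
```python
X, mean, sd = 115, 100, 15   # assumed raw score and test norms

z = (X - mean) / sd          # (115 - 100) / 15 = 1.0 -> one SD above the mean
T = 50 + 10 * z              # T scale (mean 50, SD 10) -> 60.0
stanine = max(1, min(9, round(5 + 2 * z)))   # stanine scale (mean 5, SD ~2) -> 7

# Under the normal curve, z = 1.0 corresponds to roughly the 84th percentile
print(z, T, stanine)
```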
General outlook on correlation, meta-analysis
A coefficient of correlation (or correlation coefficient) is a number that provides us with an index of the strength of the relationship between two things
Correlation between variables does not imply causation, but it aids in prediction
Meta-analysis: a family of techniques used to statistically combine information across studies to produce single estimates of the data under study
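A minimal sketch of computing a correlation coefficient on made-up paired data (statistics.correlation requires Python 3.10+):
```python
from statistics import correlation  # Python 3.10+

hours  = [1, 2, 3, 4, 5]        # made-up: hours studied
scores = [55, 60, 70, 75, 90]   # made-up: quiz scores

r = correlation(hours, scores)  # Pearson r
print(round(r, 2))  # ~0.98: strong positive association (prediction, not causation)
```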
Ch.4
Value of testing, the primary objective of psychological testing
Key Benefits of Testing and Assessment:
Education - standardized tests help assess academic achievement, learning disabilities, and student placement.
Employment & Hiring - personality, cognitive, and skill-based assessments aid in employee selection and career development.
Mental Health - psychological assessments assist in diagnosing disorders, treatment planning, and monitoring progress.
Legal & Forensic Applications - courts use psychological evaluations in competency assessments, risk evaluations, and child custody cases.
Medical & Clinical Settings - neuropsychological tests help diagnose cognitive impairments, such as dementia or brain injuries
How samples of behavior may be obtained
validity-related questions
What constitutes a good test?
A good test is one that is clear, efficient, and accurate in measuring what it claims to assess. It should include:
Clear instructions - tests should have well-defined guidelines for administration, scoring, and interpretation to ensure consistency.
Efficiency - good tests should be time- and cost-effective in administration, scoring, and interpretation.
Measurement accuracy - test must accurately assess what it is designed to measure.
Psychometric soundness is the key to a good test:
Reliability - test should produce consistent results over time and across different settings.
Validity - test must measure what it claims to measure and be appropriate for its intended use.
Ultimately, a good test is one that is well-structured, practical, and scientifically sound, ensuring that the results are meaningful and applicable
Norm-referenced testing, normative sample, different types of norms
Norm = standard of behavior (usual, average, expected).
Norms (plural) = test performance data used as a reference for interpreting scores.
Normative Sample = the group whose test performance serves as a comparison (e.g., U.S. adults, hospitalized patients)
norm-referenced testing and assessment - A method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of test takers on the same test; contrast with criterion-referenced testing and assessment
Types of Norms
Percentile - An expression of the percentage of people whose score on a test or measure falls below a particular raw score, or a converted score that refers to a percentage of testtakers; contrast with percentage correct
age norms - Norms specifically designed to compare a testtaker’s score with those of same-age peers; contrast with grade norms
grade norms - Norms specifically designed to compare a testtaker’s score with peers in the same grade or year in school; contrast with age norms
National norms - derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
national anchor norms - An equivalency table for scores on two nationally standardized tests designed to measure the same thing
criterion-referenced testing
Criterion-referenced testing and assessment - Also referred to as domain-referenced testing and assessment and content-referenced testing and assessment, a method of evaluation and a way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard (or criterion); contrast with norm-referenced testing and assessment
Psychological traits and states; what does "relative" mean in relation to a trait
A trait has been defined as “any distinguishable, relatively enduring way in which one individual varies from another” (Guilford, 1959, p. 6).
States also distinguish one person from another but are relatively less enduring (Chaplin et al., 1988)
Traits are partially stable; not expected to be manifested in behavior 100% of the time.
The nature of the situation influences how traits will be manifested
Relative consistency refers to how an individual maintains their position on a trait compared to others in a group.
Basic assumptions of psychological testing
Assumption 1: Psychological traits and states exist.
Assumption 2: Traits and states can be quantified and measured.
Assumption 3: Test-related behavior predicts non-test-related behavior
Assumption 4: All tests have limits and imperfections
Assumption 5: Various Sources of Error in Assessment
Assumption 6: Identifying and Reforming Unfair and Biased Assessments
Assumption 7: Testing and assessment benefit society
When psychological trait becomes a construct
Ideally, the test developer should provide test users with a clear operational definition of the construct under study.
Once a construct (state, trait, or other) is defined, the test developer considers the types of items that would provide insight into it and allow the construct to be measured accurately
What does it mean to "put a test to the test"
Checking to see why you should use a particular model or instrument; whether there are any published guides for the use of this test; and whether it is reliable, valid, and cost effective.
Seeing what inferences can reasonably be made from this test score and how generalizable the findings are
Test standardization, what standardized test manual should include
Standardization - process of administering a test to a representative sample of test takers for the purpose of establishing norms; norms provide a reference point to interpret individual test scores by comparing them to a representative group.
The test manual should include the information test users need to use the test in a responsible fashion and should enable them to administer it in a standardized manner, providing for easy replication
Sampling (random, stratified)
1. Random Sampling
Each individual in the population has an equal chance of being selected. - Ensures unbiased representation of the population.
Example: Drawing names from a hat to select participants.
2. Stratified Sampling
The population is divided into subgroups (strata) based on a specific characteristic (e.g., age, gender, income level).
A proportional number of individuals are randomly selected from each stratum.
Ensures all subgroups are adequately represented.
Example: In a school with 60% female and 40% male students, the sample maintains the same ratio.
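A minimal sketch contrasting the two methods on a made-up 60/40 roster:
```python
import random

# Made-up roster: 60 female ("F") and 40 male ("M") students
population = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]

# Random sampling: every individual has an equal chance of selection
simple = random.sample(population, 10)

# Stratified sampling: draw from each stratum in proportion to its size
females = [p for p in population if p[0] == "F"]
males = [p for p in population if p[0] == "M"]
stratified = random.sample(females, 6) + random.sample(males, 4)  # keeps the 60/40 ratio
```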
Ch.5
concept of reliability; true score, observed score, construct score, error score, know how these are expressed, e.g., X = T + E
One of the main goals of CTT is to assess the reliability of a test, which refers to the consistency of scores.
Reliability = Consistency in measurement.
A test is reliable if observed scores mostly reflect true scores and unreliable if dominated by error.
Since true and error scores are unobservable, their influence is estimated indirectly through test score variability
A construct score is a person’s standing on a theoretical variable independent of any particular measurement
error variance - In the true score model, the component of variance attributable to random sources irrelevant to the trait or ability the test purports to measure in an observed score or distribution of scores; common sources of error variance include those related to test construction (including item or content sampling), test administration, and test scoring and interpretation
relations between total variance, true variance, and reliability (σ² = σ²t + σ²e)
Variance (σ²), the square of the standard deviation, helps distinguish sources of variability:
True variance (σ²t): differences due to actual abilities or traits.
Error variance (σ²e): differences from random, irrelevant factors.
Total variance is the sum of the true variance and the error variance: σ² = σ²t + σ²e
The term reliability refers to the proportion of the total variance attributed to true variance; the greater the proportion of the total variance attributed to true variance, the more reliable the test.
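A worked example with assumed variance components (the 80/20 split is illustrative only):
```python
true_variance = 80.0    # assumed: variability due to actual trait differences
error_variance = 20.0   # assumed: variability from random, irrelevant factors

total_variance = true_variance + error_variance   # sigma^2 = sigma_t^2 + sigma_e^2
reliability = true_variance / total_variance      # proportion of total variance that is true

print(reliability)  # 0.8 -> 80% of observed score variance reflects true differences
```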
difference between alternate forms and parallel forms of a test
Parallel Forms: Two test versions where means and variances of observed scores are equal. Scores should correlate equally with the true score and with other measures.
Alternate Forms: Different versions designed to be equivalent in content and difficulty but may not meet the strict statistical requirements of parallel forms.
Reliability is estimated by correlating scores from a sample that takes both versions
Be able to compute SEM and CI
Be able to use Spearman-Brown formula for split test reliability and to calculate reliability when test is lengthened or shortened
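A sketch of both computations under assumed values (SD = 15, reliability .91, observed score 110, half-test correlation .70; all numbers illustrative):
```python
import math

sd, r_xx, X = 15, 0.91, 110   # assumed SD, reliability, and observed score

# Standard error of measurement: SEM = SD * sqrt(1 - r_xx)
sem = sd * math.sqrt(1 - r_xx)            # 15 * sqrt(.09) = 4.5

# 95% confidence interval around the observed score: X +/- 1.96 * SEM
ci = (X - 1.96 * sem, X + 1.96 * sem)     # about (101.2, 118.8)

# Spearman-Brown: predicted reliability when test length changes by factor n
def spearman_brown(r, n):
    return (n * r) / (1 + (n - 1) * r)

split_half = spearman_brown(0.70, 2)      # full-test estimate from half-test r: ~0.82
halved = spearman_brown(0.90, 0.5)        # shortening a test lowers reliability: ~0.82
```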
Standard Error of the Difference (SED) vs SEM
The standard error of measurement, often abbreviated as SEM, provides a measure of the precision of an observed test score, or an estimate of the amount of error inherent in an observed score or measurement
The Standard Error of the Difference (SED) is a statistical measure that helps determine whether the difference between two sample means (or two scores) is statistically significant rather than due to random variation.
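A small worked example with assumed SEMs, combining them into a SED:
```python
import math

sem1, sem2 = 4.5, 5.0                 # assumed SEMs of the two scores being compared
sed = math.sqrt(sem1**2 + sem2**2)    # SED = sqrt(SEM1^2 + SEM2^2) ~= 6.7

# A difference of about 2 * SED or more is typically taken as significant
print(2 * sed)                        # the two scores must differ by ~13.5 points
```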
Different types of reliability: test-retest, inter-rater, internal consistency, parallel forms, and split-half
Inter-rater reliability (also known as inter-scorer reliability or inter-observer reliability) refers to the degree of consistency or agreement between different raters or observers evaluating the same phenomenon or set of data
Test-retest reliability estimates consistency by correlating scores from the same individuals on two administrations of the same test
Internal consistency - a measure of how consistently each item measures the same underlying construct
Split-half reliability estimates a test’s internal consistency by correlating scores from two equivalent halves of a single test administration
source of error variance, systematic and random errors
Systematic Error: Consistent, predictable errors that inflate or deflate scores in a fixed direction (e.g., a miscalibrated ruler always measuring 12.1 inches instead of 12).
Random Error: Unpredictable fluctuations that can raise or lower scores without a pattern (e.g., sudden noises, test-taker’s physiological changes). These errors cancel out over time and do not affect test consistency.
Error variance stems from variance attributable to random sources irrelevant to the trait or ability the test purports to measure in an observed score or distribution of scores
problems in assessing reliability of tests for very young children
Test-retest intervals need to be small because of rapid development; even then, score changes may reflect genuine developmental change rather than unreliability.
CTT, IRT, G-theory, Domain sampling theory
Classical Test Theory (CTT) aka true score model, is a foundational framework in psychometrics used to evaluate test reliability, item quality, and overall test performance.
It is called "classical" to distinguish it from more modern approaches, such as Item Response Theory (IRT).
True Score Model: CTT operates under the assumption that an individual's observed score (X) on a test is composed of two components: X = T+E
T = True Score (the actual ability or trait level of the individual)
E = Error Score (random measurement error)
The true score is an underlying but unobservable construct, meaning that any observed score is influenced by both the true ability and error factors
Domain sampling theory treats a test's items as a sample from a hypothetical domain of all possible items measuring the trait; reliability reflects how well the sample represents that domain.
Generalizability (G) theory extends CTT by decomposing measurement error into multiple sources, or facets (e.g., items, occasions, raters), and describing how generalizable scores are across those facets.
Ch. 6
Main Types of Validity:
Content Validity: This measure of validity is based on an evaluation of the subjects, topics, or content covered by the items in the test
Construct Validity: This measure of validity is arrived at by executing a comprehensive analysis of:
How scores on the test relate to other test scores and measures.
How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
Criterion-Related Validity : This measure of validity is obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
Face Validity: A judgment concerning how relevant the test items appear to be; if a test appears to measure what it purports to measure “on the face of it,” it could be said to be high in face validity.
Concept of validity, process of validation, criterion, construct
Validity: A judgment or estimate of how well a test measures what it purports to measure in a particular context.
Criterion: a standard on which a judgment or decision is based
Validation: The process of gathering and evaluating evidence about validity; both test developers and test users may play a role in the validation of a test.
A key difference between concurrent and predictive validity
Concurrent validity correlates test scores with a criterion measured at about the time of administration, while predictive validity correlates test scores with a criterion measured in the future.
Ch.8
Steps in the process of test development, meaning of each; role of factor analysis in test development; item pool; test revision process; sensitivity review
Test Conceptualization
If something exists, it must exist in some quantity and that quantity can be measured.
The thought that “there ought to be a test for...” is motivation for developing a new test
Test Construction
Item writing is a crucial step in test construction, directly impacting scaling and content validity.
Test developers must answer three key questions:
What range of content should be covered?
Which item formats should be used?
How many items should be written in total and for each content area?
Test tryout
The test is administered to a sample of test takers representative of the target population.
Item analysis
Statistical procedures (e.g., item-difficulty and item-discrimination indices, and factor analysis to check that items cluster on the intended constructs) identify which items work.
Test revision
Items are modified, dropped, or replaced on the basis of the analysis, and the tryout-analysis-revision cycle repeats as needed.
An item pool is a collection of test questions from which the final test items are selected
sensitivity review - A study of test items, usually during test development, in which items are examined for fairness to all prospective test takers and for the presence of offensive language, stereotypes, or situations
Meaning of item-discrimination index, know how to calculate d; how to interpret positive (and negative) values of d
item-discrimination index - A statistic designed to indicate how adequately a test item discriminates between high and low scorers
A d near zero means the item does not discriminate between high and low scorers (for example, when nearly all test takers got it correct); a high positive d means high scorers passed the item more often than low scorers, which is desirable; a negative d flags a flawed item, because low scorers outperformed high scorers on it.
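A worked example with assumed group counts (upper and lower groups of 32, following the common top-and-bottom-27% convention):
```python
n_per_group = 32     # assumed size of the upper and lower scoring groups
upper_correct = 20   # upper group members answering the item correctly
lower_correct = 16   # lower group members answering it correctly

d = (upper_correct - lower_correct) / n_per_group   # (20 - 16) / 32 = 0.125
print(d)  # positive but small: the item discriminates in the right direction, weakly
```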
Item difficulty index (meaning, range); item difficulty and probability interpretation of difficulty index
item-difficulty index - In achievement or ability testing and other contexts in which responses are keyed correct, a statistic indicating how many test takers responded correctly to an item; in contexts where the nature of the test is such that responses are not keyed correct, this same statistic may be referred to as an item-endorsement index
The index ranges from 0 (no one answered correctly) to 1 (everyone answered correctly). A difficulty of about .5 is optimal for discriminating among test takers, and values from roughly .3 to .8 are generally acceptable.
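A small worked example with made-up counts:
```python
answered_correctly = 30   # assumed count of correct responses to the item
total_testtakers = 50

p = answered_correctly / total_testtakers   # 30 / 50 = 0.6
# p ranges from 0 (no one correct) to 1 (everyone correct);
# .6 sits in the generally acceptable .3-.8 band, near the optimal .5
print(p)
```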
item-endorsement index; item-validity index; item-reliability index - meaning
item-endorsement index - In personality assessment and other contexts in which the nature of the test is such that responses are not keyed correct or incorrect, a statistic indicating how many testtakers responded to an item in a particular direction; in achievement tests, which have responses that are keyed correct, this statistic is referred to as an item-difficulty index
item-validity index - A statistic indicating the degree to which a test measures what it purports to measure; the higher the item-validity index, the greater the test's criterion-related validity
item-reliability index - A statistic designed to provide an indication of a test's internal consistency; the higher the item-reliability index, the greater the test's internal consistency
item-characteristic curve interpretation (including inverted U shape)
Item C (a linearly increasing line) is the best possible characteristic curve: the probability of answering correctly rises with ability level.
Item B (the inverted U) is an item on which people who know too much or think too much are likely to respond incorrectly; the probability of a correct response rises with ability up to a point and then declines.
components of a multiple-choice item, matching item; what is a “good” test item
Multiple-choice format has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils
A good test item is one that high scorers on the test as a whole answer correctly more often than low scorers do; that is, it shows positive discrimination.
CAT
CAT (computerized adaptive testing) refers to interactive, computer-administered tests that adjust item difficulty based on the test taker's responses.
differential item functioning; DIF analysis
Differential item functioning is a key methodology to identify biased items in questionnaires.
DIF analysis - In IRT, a process of group-by-group analysis of item response curves for the purpose of evaluating measurement instrument or item equivalence across different groups of test takers