Steps in Phase 3
Describe the reference or norm group and sampling plan for standardization
Describe your choice of scaling methods and the rationale for this choice
Outline the reliability studies to be performed and their rationale
Outline the validity studies to be performed and their rationale
Include any special studies that may be needed for development of this test or to support proposed interpretations of performance
List the components of the test - e.g., manual, record forms, test booklets, stimulus materials, etc.
Sampling Plan
Defining the target population for comparison (age range, special needs, etc.)
the first group to which everyone is compared
Looking at other norm groups
other groups you want to compare against
A true random sample is ideal, but this is often not possible
the sample has to be representative of the target population
best method is a population-proportionate stratified random sampling plan
Determine the appropriate size of the overall sample
Standardization Sample
a sample of test takers who represent the population for which the test is intended
provides the norms and forms the reference group to which all examinees are compared
Population
all members of the target audience
Sample
a representative subset of the population to whom the survey is administered
Types of Sampling (Selecting the Appropriate Respondents)
Probability Sampling
Simple Random Sampling
Systematic Random Sampling
Stratified Random Sampling
Cluster Sampling
Nonprobability Sampling
Convenience Sampling
Probability Sampling
Uses random selection so that the sample is representative of the population
Simple Random Sampling
every member of a population has an equal chance of being chosen for the sample
Systematic Random Sampling
Choosing every nth person (e.g., every third person)
Stratified Random Sampling
Population is divided into subgroups (strata), and members are randomly sampled from each subgroup
Cluster Sampling
used when it is not possible to list all of the individuals who belong to a particular population; often used with surveys that have large target populations
Nonprobability Sampling
Is a type of sampling in which not everyone has an equal chance of being selected from the population
Convenience Sampling
The survey researcher uses any available group of participants to represent the population
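The population-proportionate stratified plan mentioned above can be sketched in Python. This is a minimal illustration with made-up names and a hypothetical two-stratum population; note that rounding the per-stratum allocation can occasionally shift the total by one.

```python
import random

def proportionate_stratified_sample(population, stratum_of, n):
    """Population-proportionate stratified random sampling: each stratum
    gets sample slots in proportion to its share of the population."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(population))  # proportional allocation
        sample.extend(random.sample(members, min(k, len(members))))
    return sample

# Hypothetical population: 300 adults and 100 adolescents (a 3:1 split)
population = [{"group": "adult"}] * 300 + [{"group": "adolescent"}] * 100
s = proportionate_stratified_sample(population, lambda m: m["group"], 40)
# The sample mirrors the split: 30 adults and 10 adolescents
```

Because each stratum is sampled randomly but sized proportionally, the sample stays representative on the stratifying variable even when a simple random draw of the whole population is impractical.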
Sample Size
number of people needed to represent the target population accurately
Depends on factors in the test plan; in general, the larger the sample, the better
Homogeneity of the Population
how similar the people in your population are to one another
Sampling Error
a statistic that reflects how much error can be attributed to the lack of representation of the target population by the sample of respondents chosen
Distributing the Survey
how will the instrument/test be given to the respondent
mail, phone, weblink, in person
Specifying Administration and Scoring Methods
Determine such things as how the test will be administered, which will influence the format and content of the test items
orally, written, computer, groups, individual
Choose the scoring method: scored by hand by the test administrator, scored by software, or sent to the test publisher for scoring
Types of Raw Scoring Methods
Cumulative/Summative Model
Ipsative Model
Categorical Model
Cumulative/Summative Model
most common
assumes that the more a test taker responds in a particular fashion the more they have of the attribute being measured
using this model, the test taker receives 1 point for each correct answer, and the total number of correct answers becomes the raw score on the test
correct responses or responses on the Likert scale are summed
data can be interpreted with reference to norms
Semantic Differential: Adjective pairs at each end of the continuum (e.g., rich/poor)
Visual analog: the researcher assigns scores along the continuum (e.g., rating pain levels, with each number a different level of pain)
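The cumulative model above can be sketched in Python; the answer key, responses, and Likert ratings here are all hypothetical.

```python
# Cumulative/summative scoring: the raw score is the count of keyed (correct)
# responses, or the sum of Likert ratings. All values are hypothetical.

answer_key = ["b", "d", "a", "c", "a"]
responses  = ["b", "d", "c", "c", "a"]   # one test taker's answers
raw_score = sum(1 for given, keyed in zip(responses, answer_key) if given == keyed)
# raw_score is 4: one point per correct answer

likert_ratings = [4, 5, 3, 4]            # 1 = strongly disagree .. 5 = strongly agree
likert_raw = sum(likert_ratings)         # 16, interpreted with reference to norms
```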
Ipsative Model
test takers are given 2 or more options to choose from - mostly uses forced choice items
most used in personality testing - test taker indicates which items are most like them and least like them
measures an individual's personal growth, strengths, or preferences relative to themselves over time, rather than comparing them to others (normative) or external standards (criterion-referenced)
all items are chosen to be equally desirable
Categorical Model
Is used to put the test taker in a particular group or class
test takers' scores are not compared to those of other test takers; instead, scores on the various scales are compared within the test taker (which scores are highest and lowest)
Typically yields nominal data because it places test takers in categories
counts the number of true and false answers, or agree and disagree responses
Piloting and Revising Tests
can't assume the test will perform as expected
test developers conduct studies to determine how well a new test performs
A pilot test is a scientific investigation gathering evidence that the test scores are reliable and valid for their specified purpose
involves administering the test to a sample from the target audience
analyze the data and revise the test to fix any problems uncovered
many aspects to consider
Setting up the Pilot Test
Test situation should match actual circumstances in which test will be used
e.g., if the test is designed to diagnose emotional disabilities in adolescents, the participants for the pilot study should be adolescents.
The sample should be large enough to provide the power to conduct statistical tests to compare the responses of each group
the test setting of the pilot test should mirror the planned test setting.
e.g., if school psychologists will use the test, the pilot test should be conducted in a school setting using school psychologists as administrators.
developers must follow the American Psychological Association's code of ethics
strict rules for confidentiality; publish only combined results
test takers must understand that they are in a research study and that their scores are used for research purposes
Conducting the Pilot Test
a scientific evaluation of the test's performance
depth and range depend on the size and complexity of the target audience and the construct being measured
e.g., tests designed for use in a single company or college program require less extensive studies than tests designed for large audiences, such as students applying for graduate school
adhere strictly to test procedures outlined in test administration instructions
generally require large sample
may also ask participants about the testing experience
pilot studies often require gathering extra data, such as a criterion measure and the length of time needed to complete the test
important to recognize problems with the test administration, make all necessary revisions before continuing, and conduct a new pilot test that yields appropriate results
Analyzing the Results
Can gather both quantitative and qualitative information on item characteristics, internal consistency, test-retest reliability, inter-rater reliability, convergent and discriminant validity, and sometimes predictive validity
Qualitative data may be used to help make decisions
Conducting Quantitative Item Analysis
Item analysis: how developers evaluate the performance of each test item
Item difficulty: the number of test takers who respond correctly divided by the total number of test takers; this proportion is the p value (percentage/probability value)
understand how difficult an item is
a p value of .5 is optimal (higher means the item is too easy; lower means it is too hard)
p values from 0 to .2 indicate an item is too difficult; .9 to 1, too easy
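The p-value calculation can be sketched directly; the response vector below is hypothetical.

```python
def item_difficulty(item_responses):
    """p value: number answering correctly divided by total number of
    test takers (responses coded 1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)

# Hypothetical item: 6 of 10 test takers answered correctly
p = item_difficulty([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])   # p = 0.6, near the optimal .5
too_difficult = p <= 0.2
too_easy = p >= 0.9
```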
Discrimination Index
Compares the performance of those who obtained very high test scores (Upper group) with the performance of those who obtained very low test scores (lower group) on each item
D = U - L, where U = (# in the upper group who responded correctly / total # in the upper group) × 100 and L = (# in the lower group who responded correctly / total # in the lower group) × 100
A discrimination index of 30 or above is desirable
Negative numbers: those who scored low on the test overall responded to the item correctly and those who scored high on the test responded incorrectly.
The upper group and lower group are formed by ranking the final test scores from lowest to highest and then taking the upper third and the lower third to use in the analysis.
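The discrimination index follows directly from the counts in the upper and lower thirds; the numbers below are hypothetical.

```python
def discrimination_index(upper_correct, upper_total, lower_correct, lower_total):
    """D = U - L, where U and L are the percentages of the upper- and
    lower-scoring groups who answered the item correctly."""
    U = upper_correct / upper_total * 100
    L = lower_correct / lower_total * 100
    return U - L

# Hypothetical item: 9 of 10 in the upper third and 3 of 10 in the lower third correct
D = discrimination_index(9, 10, 3, 10)   # 90 - 30 = 60, above the desirable 30
# A negative D means low scorers answered correctly more often than high scorers
```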
Item-Total Correlation
a measure of the strength and direction of the relation between the way test takers responded to one item and the way they responded to all of the items as a whole
Items that have little or no correlation with the total item score may measure a different construct from that being measured by the other items.
Interitem Correlation Matrix
displays the correlation of each item with every other item
Usually each item has been coded as a dichotomous variable (correct (1) or incorrect (0))
Therefore, the interitem correlation matrix will be made up of phi coefficients
provides important information for increasing the test’s internal consistency.
drop items that don't correlate with other items measuring the same construct
Phi Coefficients
The results of correlating two dichotomous (having only 2 values) variables
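Since each item is coded 0/1, the interitem matrix entries are phi coefficients. A minimal sketch of phi from the 2x2 contingency counts (the item vectors are hypothetical):

```python
def phi_coefficient(x, y):
    """Correlation between two dichotomous (0/1) variables, computed from
    the 2x2 table: a = both 1, b = only x is 1, c = only y is 1, d = both 0."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    denom = ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5
    return (a * d - b * c) / denom if denom else 0.0

# Hypothetical scores on two items across six test takers
item1 = [1, 1, 0, 1, 0, 1]
item2 = [1, 0, 0, 1, 0, 1]
phi = phi_coefficient(item1, item2)      # positive: the two items tend to agree
```

Computing phi for every pair of items fills in the interitem correlation matrix described above.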
Item Response Theory
provides estimates of test-taker ability that are independent of the difficulty of the items presented, as well as estimates of item difficulty and discrimination that are independent of the ability of the test takers
relates the performance of each item to a statistical estimate of the test taker’s ability on the construct being measured
Item Characteristic Curves
Part of Item response theory
the line that results when we graph the probability of answering an item correctly with the level of ability on the construct being measured.
provides a picture of the item’s difficulty and how well it discriminates high performers from low performers
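One common IRT model, the two-parameter logistic (2PL), gives the curve a closed form; the parameter values below are illustrative, not from any real item.

```python
import math

def icc_2pl(theta, a, b):
    """Item characteristic curve under the two-parameter logistic (2PL)
    model: P(correct) at ability theta, with discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta equal to the item's difficulty, P(correct) = .5
p_mid = icc_2pl(theta=0.0, a=1.2, b=0.0)
# Higher ability gives a higher probability of a correct response
p_high = icc_2pl(theta=2.0, a=1.2, b=0.0)
```

Plotting `icc_2pl` over a range of theta values reproduces the curve described above: b shifts it left or right (difficulty), and a controls its steepness (discrimination).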
Item Bias
occurs when an item is easier for one group than for another group