What is a survey?
A systematic method for gathering information from entities to construct estimates of their attributes on average.
What are examples of major surveys?
- Gallup public opinion polls
- Harvard Study of Adult Development
- the General Social Survey
Why are surveys rapidly changing today?
Technology has changed how surveys are conducted and how people interact, causing the field to evolve faster than education and research can keep up.
What is survey methodology?
A field and profession that studies survey design, collection, processing, and analysis; it is multidisciplinary (statistics, psychology, sociology, etc.).
Survey
A systematic method for gathering information from entities (often people) to construct estimates of the attributes those entities have on average
Survey Methodology
A scientific field and profession that seeks to identify and study principles pertaining to survey design, collection, processing, and analysis
Data Science
Extracting meaning from and interpreting data using tools and methods from statistics and machine learning
Descriptive Research Question
Seeks to summarize characteristics of a set of data with no interpretation—just facts/attributes (e.g., "What is the average number of doctor visits reported by respondents?")
Exploratory Research Question
Analyzes data for patterns, trends, or relationships between variables, used for hypothesis generation: these are unplanned questions (e.g., "Is there an age difference in doctor visits?")
Predictive Research Question
Determines whether one or more phenomena can be used to forecast some future outcome. It is less interested in "why" than in what predicts the outcome (e.g., "Can we guess whether a single household will be less likely to respond?")
Causal Research Question
Asks whether changing one factor will change another factor. Establishing cause and effect requires randomized controlled trials or experiments (e.g., "Will this drug intervention reduce illicit drug use?")
Inferential Research Question
Uses a sample to make conclusions about a larger population
Mechanistic Research Question
Asks about the exact mechanism or process by which something occurs
Good Research Question Criteria
Must be: (1) of interest to your audience, (2) not already answered, (3) stemming from a plausible framework, (4) falsifiable, and (5) specific
Data Generation Process
The method by which data are collected. Survey data vary widely in size and quality, depend on modality, are often cross-sectional, and usually come from a sample rather than a census
Data Curation/Storage
Includes editing, de-identification, data entry, coding, error checking, dataset construction, codebook construction, building weights, and imputing missing values
Data Analysis
Using statistical models (t-tests, ANOVAs, regression) to make inferences about populations from samples, or machine learning models (KNN, decision trees, logistic regression) to make predictions
Data Output/Access
Communicating results through papers, dashboards, videos, blogs (science communication is almost as important as the science itself)
Total Survey Error (TSE)
A framework for thinking about the various sources of error that may affect survey statistics. Here, errors reflect uncertainty in an inference, not necessarily mistakes
Construct
Elements of information (variables) sought by researcher: usually abstract, described by words, often latent and not directly observable (e.g., happiness, quality of life, belief in God)
Measurement/Operationalization
Linking theoretical constructs to observable variables through the step-by-step protocols implemented to gather data. The construct is the "what" and measurement is the "how"
Response
The respondent value(s) from your measurement scheme (e.g., answers to questions, blood pressure readings)
Edited Response
Transforming data for specific use, including coding (text to numbers), acceptable answer sets, consistency rules, and reverse scoring negatively worded items
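The editing steps above can be sketched in R. This is a minimal illustration, assuming a hypothetical 1-to-5 agreement scale and a yes/no item; the object names (raw, answers) are made up for the example.

```r
# Hypothetical responses to a negatively worded item on a 1-5 scale
raw <- c(1, 2, 5, 4, 3)
scale_min <- 1
scale_max <- 5

# Reverse scoring: 1 becomes 5, 2 becomes 4, and so on
reversed <- (scale_max + scale_min) - raw
reversed  # 5 4 1 2 3

# Coding: turn text answers into numbers
answers <- c("no", "yes", "yes", "no")
coded <- ifelse(answers == "yes", 1, 0)
coded  # 0 1 1 0

# Acceptable answer set: flag values outside the allowed range
out_of_range <- raw < scale_min | raw > scale_max
any(out_of_range)  # FALSE
```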
Target Population
The set of units to be studied: often abstractly defined with several ways to operationalize (e.g., adults in the US, users of a social media platform)
Sampling Frame
A set of units identified in such a way that they can be sampled and located: ideally, every unit in the target population appears once and only once
Sample
A subset of the population from which measurements are drawn: the goal is to make inferences about the population from the sample
Respondents
Sample units that were successfully measured: the respondent pool may or may not equal the sample size
Post-survey Adjustments
Changes to survey data to make estimates better reflect the full target population, including selection weights, imputation, nonresponse weights, and poststratification
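As one illustration of a post-survey adjustment, the sketch below computes simple poststratification weights in R. The respondent data and the population shares are invented for the example, and real adjustments usually involve more groups and additional weighting steps.

```r
# Hypothetical respondents: older people are overrepresented
resp <- data.frame(
  group = c("young", "young", "old", "old", "old", "old"),
  y     = c(10, 20, 30, 40, 50, 60)
)

# Assumed known population shares (e.g., from a census)
pop_share <- c(young = 0.5, old = 0.5)

# Share of each group among respondents
samp_share <- sapply(names(pop_share), function(g) mean(resp$group == g))

# Poststratification weight = population share / sample share
resp$w <- unname(pop_share[resp$group] / samp_share[resp$group])

unweighted <- mean(resp$y)                # 35
weighted   <- weighted.mean(resp$y, resp$w)  # 30
```

The weighted mean equals the average of the two group means (15 and 45) combined at the assumed population shares.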
True Value
An idealized concept of a quantity to be measured: abstract and never truly known, but serves as a standard for comparison
Interviewer Variance
Error arising from different interviewers collecting different data despite having the same training, procedures, and workloads
Interviewer Bias
When personal factors of the interviewer systematically impact data collection
Sampling Variance
Variation in values of a survey statistic because different subsets of the population fall into samples over replications of the same survey design, measured via confidence intervals and standard errors
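Sampling variance can be made concrete with a small simulation. The sketch below assumes a made-up population and simple random sampling; it draws many samples under the same design and compares the spread of the sample means with the standard error estimated from a single sample.

```r
set.seed(1)

# Hypothetical population of 10,000 values
population <- rnorm(10000, mean = 50, sd = 10)
n <- 100

# Replicate the same design 1,000 times and keep each sample mean
sample_means <- replicate(1000, mean(sample(population, n)))
empirical_se <- sd(sample_means)

# In practice we only have one sample, so we estimate the standard error
one_sample   <- sample(population, n)
estimated_se <- sd(one_sample) / sqrt(n)

# 95% confidence interval for the mean based on that one sample
ci <- mean(one_sample) + c(-1, 1) * qt(0.975, df = n - 1) * estimated_se
```

Under these assumptions both empirical_se and estimated_se should land near 10 / sqrt(100) = 1.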
Sampling Bias
Consistent failure of the sampling procedure to represent the population correctly (e.g., relying on intro psych students when interested in all emerging adults). In principle, sampling bias is 0 for probability samples, because every unit on the frame has a known, nonzero chance of selection
Accuracy (Quality Dimension)
Total survey error is minimized
Credibility (Quality Dimension)
Data considered trustworthy by the survey community
Comparability (Quality Dimension)
Demographic, spatial, and temporal comparisons are valid
Relevance (Quality Dimension)
Data satisfy users' needs
Timeliness (Quality Dimension)
Data deliveries adhere to schedule
Completeness (Quality Dimension)
Data rich enough to satisfy analysis objectives without undue burden on respondents
R
An open-source statistical programming language
RStudio
An integrated development environment (IDE) for R that enhances R's usability by allowing you to keep track of objects, plots, run scripts, and more
Object
A container in R that stores information: you assign information using <- or = (e.g., course <- 400)
Numeric Data Type
Data consisting of all real numbers including whole numbers and decimals (e.g., num <- 6.5)
Integer Data Type
Data consisting only of whole numbers, denoted by an L (e.g., int <- 4L): a more efficient way of storing whole numbers
Character Data Type (String)
Text-based data enclosed in quotes (e.g., char <- "hello"). If the text is not in quotes, R will try to interpret it as an object
Logical Data Type (Boolean)
Data that takes on TRUE or FALSE values (must be capitalized). Results from logical comparisons like 3 > 2
Factor Data Type
A data structure for categorical variables that can be ordered or unordered. Technically a structure, not a basic data type
class() Function
Command to check what data type is contained in an object
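The data types above can be checked with class(). A short sketch, reusing the example objects from the cards plus a made-up factor:

```r
num  <- 6.5
int  <- 4L
char <- "hello"
lgl  <- 3 > 2
# Hypothetical ordered categories for illustration
fct  <- factor(c("low", "high", "low"), levels = c("low", "high"))

class(num)   # "numeric"
class(int)   # "integer"
class(char)  # "character"
class(lgl)   # "logical"
class(fct)   # "factor"
```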
Vector
A one-dimensional object that holds a single data type, created using c(), which stands for combine (often taught as "concatenate")
Data Coercion Hierarchy
When mixing data types in vectors, R forces them to be the same following: character > numeric > integer > logical
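A quick way to see the hierarchy is to mix types in c() and inspect the result. The vectors below are made up for illustration.

```r
lgl_int  <- c(TRUE, 2L)     # logical + integer
int_num  <- c(2L, 3.5)      # integer + numeric
num_char <- c(3.5, "a")     # numeric + character

class(lgl_int)   # "integer"
class(int_num)   # "numeric"
class(num_char)  # "character"

# Mixing everything ends at the top of the hierarchy
mixed <- c(TRUE, 2L, 3.5, "a")
class(mixed)     # "character"
```

One practical consequence: a single stray text value in a column of numbers turns the whole vector into character.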
length() Function
Returns the number of elements in a vector
Data Frame
R's primary means of data storage, similar to a spreadsheet where you can mix data types between columns but not within columns
nrow() and ncol() Functions
Return the number of rows and columns in a data frame, respectively. They only work on objects with dimensions (e.g., data frames and matrices); on a plain vector they return NULL
Indexing with []
Accessing specific elements of an object. R uses 1-based indexing (first element is position 1)
$ Operator
Used to index a named column in a data frame (e.g., df$column_name). Generally the preferred method
subset() Function
A more intuitive way to subset data frames based on conditions (e.g., subset(df, ratings > 7))
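The data frame tools above fit together as in the sketch below. The data frame df and its columns are hypothetical.

```r
# Hypothetical data frame: columns may differ in type
df <- data.frame(
  title   = c("A", "B", "C", "D"),
  ratings = c(6, 8, 9, 5),
  stringsAsFactors = FALSE
)

nrow(df)           # 4
ncol(df)           # 2

df[2, "ratings"]   # row 2 of the ratings column: 8
df$ratings[1]      # first element of the ratings column: 6

# Keep only rows where the condition holds
high <- subset(df, ratings > 7)
high$title         # "B" "C"
```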
for Loop
A programming technique that runs a block of code a pre-specified number of times. Structure: for (i in 1:10) { code }
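A minimal for loop sketch, assuming we want to store the square of each index in a pre-allocated vector (the name squares is made up):

```r
# Pre-allocate a numeric vector to hold the results
squares <- numeric(5)

for (i in 1:5) {
  squares[i] <- i^2   # runs once for each value of i
}

squares  # 1 4 9 16 25
```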
install.packages() Function
Installs a package from CRAN. It only needs to be done once and requires quotes around the package name
library() Function
Loads an installed package for use in the current R session and must be done every time you start a new session
read.csv() Function
Reads a CSV file into R as a data frame
Working Directory
The default location where R searches for files: check with getwd(), change with setwd()
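The sketch below reads a CSV into a data frame and checks the working directory. To stay self-contained it first writes a small made-up file to a temporary location using tempfile(), so it does not depend on any file in your working directory.

```r
# Write a small hypothetical CSV to a temporary path
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)),
          path, row.names = FALSE)

# Read it back in as a data frame
dat <- read.csv(path)
class(dat)   # "data.frame"
nrow(dat)    # 3

# Where R looks for files given a relative path
wd <- getwd()
```

With a file stored in the working directory, read.csv("myfile.csv") would work without a full path; "myfile.csv" is a placeholder name.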
Commenting with #
Anything after # in a line will not be interpreted by R. Comments are used for transparency and reproducibility
summary() Function
Provides a five-number summary (min, 25th percentile, median, 75th percentile, max) and mean for numeric variables
mean(), sd(), var(), cor(), median()
Functions for calculating descriptive statistics on vectors
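A short sketch of these descriptive functions on two made-up vectors:

```r
x <- c(2, 4, 4, 5, 10)
y <- c(1, 3, 2, 5, 9)

mean(x)     # 5
median(x)   # 4
var(x)      # 9  (sample variance, divides by n - 1)
sd(x)       # 3  (square root of the variance)
r <- cor(x, y)   # Pearson correlation between x and y

# Five-number summary plus the mean
s <- summary(x)
```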
package::function() Syntax
Calling a function while explicitly stating which package it comes from, preferred for clarity and transparency
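For example, median() lives in the stats package that ships with R, so both calls below run the same function; the second form simply makes the source explicit.

```r
x <- c(1, 3, 5)

m1 <- median(x)          # relies on stats being attached
m2 <- stats::median(x)   # states the package explicitly

identical(m1, m2)  # TRUE
```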
Mode of Data Collection
The method by which survey data are collected (e.g., face-to-face, telephone, mail, web)
CAPI (Computer-Assisted Personal Interviewing)
Computer displays questions on screen, interviewer reads them to respondent and enters answers
CATI (Computer-Assisted Telephone Interviewing)
Telephone counterpart to CAPI. Interviewer calls respondent, reads questions from computer, enters responses
CASI (Computer-Assisted Self-Interviewing)
Respondent completes survey on computer themselves. This can include text, audio, or video stimuli
ACASI (Audio Computer-Assisted Self-Interviewing)
Respondent sees questions on computer, hears recorded audio of questions, and enters their own answers. This increases privacy for sensitive topics
IVR (Interactive Voice Response)
Telephone counterpart of ACASI: respondent calls in, hears recorded questions, answers by keypad
TDE (Touchtone Data Entry)
Respondents call toll-free number, hear recorded questions, enter data using telephone keypad
SAQ (Self-Administered Questionnaire)
Paper questionnaire completed by respondent without interviewer present
CSAQ (Computerized Self-Administered Questionnaire)
Electronic version of SAQ completed on computer
Coverage Error (Mode)
Error due to the fact that not every unit in the population is represented on the sampling frame. Telephone surveys exclude those without phones, and web surveys exclude those without internet
Nonresponse Error (Mode)
Error that varies across modes. People may be more likely to complete certain types of surveys than others
Measurement Error (Mode)
Deviations of answers from true values. Sources include respondent, interviewer, instrument/questionnaire, and mode of data collection
Interviewer Effects (Positive)
Interviewers achieve higher response rates, motivate respondents, probe inadequate responses, provide feedback, clarify questions
Interviewer Effects (Negative)
Can lead to biased responses on sensitive questions. Interviewer-administered modes are also more expensive than self-administered modes
Social Desirability Bias (Mode)
Tendency to underreport sensitive behaviors in face-to-face surveys (self-administered modes reduce this)
Fixed vs. Variable Costs
Fixed costs (e.g., postage for a set number of invites) vs. variable costs (e.g., hourly wages for an unknown number of interviews). The balance affects mode choice
Mixed-Mode Surveys
Using a combination of modes to compensate for weaknesses of individual modes. This is increasingly common
Random Digit Dialing (RDD)
Sampling method for telephone surveys that randomly generates phone numbers. No equivalent exists for web surveys
Probability-Based Online Panels
Panels where members are recruited via probability sampling methods (e.g., LISS, AmeriSpeak, KnowledgePanel)
Non-Probability Online Panels
Panels where members self-select or are recruited through non-random means. These are common but potentially biased
Elements
The unit of observation in a study (e.g., customers, households, businesses, schools, tweets)
Target Population Characteristics
Must be: (1) finite (can theoretically be counted), (2) observable/accessible, and (3) specific to a time frame (implies boundaries of space and time)
Unambiguous Population Definition
You should be able to clearly place elements in or out of the target population: "young adults in college" is better than "all young adults"
Sampling Frame
A list of elements in the target population that can be sampled and located. Coverage varies depending on frame quality
Coverage
The percent of the target population included in the frame (generally theoretical since we rarely know the exact population size)
Perfect Coverage
Ideal but unrealistic situation where the sampling frame exactly matches the target population
Undercoverage
When some members of the target population are missing from the sampling frame. This is the primary coverage concern since we can't reach them
Overcoverage
When the frame contains units not in the target population. Includes ineligibles, duplicates, and blanks, which can be identified and removed
Undercoverage Bias
Bias introduced when there is a difference between covered and uncovered units on the statistic(s) of interest
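The size of undercoverage bias for a mean can be worked out numerically. The sketch below uses the standard decomposition of the population mean into covered and uncovered parts, with made-up numbers: the bias of the covered mean equals the uncovered share times the difference between covered and uncovered means.

```r
# Hypothetical population: 1,000 units, 200 missing from the frame
N <- 1000
U <- 200                 # uncovered units
mean_covered   <- 50
mean_uncovered <- 40

coverage_rate <- (N - U) / N   # 0.8

# Bias of the covered mean = (U / N) * (covered mean - uncovered mean)
bias <- (U / N) * (mean_covered - mean_uncovered)   # 2

# Check against the full-population mean
true_mean <- coverage_rate * mean_covered + (U / N) * mean_uncovered  # 48
mean_covered - true_mean   # 2
```

The bias shrinks toward 0 either when coverage is high or when covered and uncovered units are similar on the statistic of interest.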
Duplication (Overcoverage)
Multiple frame entries link to the same element (e.g., a person with two phone numbers listed)
Clustering (Overcoverage)
Multiple elements can be reached via the same frame entry (e.g., a landline that reaches an entire household)
Ineligibles (Overcoverage)
Frame entries that are not part of the target population (e.g., businesses on a frame meant for individuals)
Blanks (Overcoverage)
Frame entries that don't connect to any unit (e.g., disconnected phone numbers)
Solutions for Overcoverage
Delete elements once identified. For clustering, take the whole cluster or select one element and weight up. For duplication, delete duplicates or weight down
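The "weight down" and "weight up" ideas can be sketched with simple inverse-probability reasoning. The frame below is hypothetical, and the weights shown are one common choice rather than the only option.

```r
# Hypothetical frame: some people are listed more than once
frame <- data.frame(
  person     = c("p1", "p2", "p3"),
  n_listings = c(1, 2, 1)
)

# Duplication: a person listed twice has twice the chance of selection,
# so weight down by the number of listings
frame$dup_weight <- 1 / frame$n_listings   # 1.0 0.5 1.0

# Clustering: if one person is selected at random from each household,
# weight up by household size (assumed known here)
household_size <- c(1, 3, 2)
cluster_weight <- household_size           # 1 3 2
```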
Solutions for Undercoverage
Use multiple frames, combine modes, apply weighting (post-stratification), or change target population definition to match frame