Why do we evaluate HRI systems?
To ensure robots work, are effective, socially appropriate, and produce trustworthy scientific results.
What is explorative research? (definition)
Early-stage research to understand the problem space by “just putting a robot there” and observing.
What is formative evaluation?
Testing prototypes to shape and improve the design; done in the middle of development.
What is summative evaluation?
Rigorous evaluation at the end of a project to measure long-term or final effects.
What is system testing?
In-lab testing of system performance (latency, robustness, crashes).
What is a pilot study? (definition)
A small test run with real users to check feasibility, methodology, and fix issues.
What is feature testing?
Testing the contribution of one feature (e.g., with vs without emotional feedback).
What is implementation testing?
Testing long-term in real-world settings with high ecological validity.
Give examples of quantitative data.
Task time, accuracy, Likert scales, physiological signals.
What are descriptive statistics?
Means, medians, standard deviations.
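A minimal sketch of these three statistics using Python's standard `statistics` module. The task times are invented purely for illustration.

```python
import statistics

# Hypothetical task completion times in seconds (illustrative data only).
task_times = [42.0, 37.5, 51.0, 44.5, 40.0]

mean_time = statistics.mean(task_times)      # arithmetic mean
median_time = statistics.median(task_times)  # middle value of the sorted data
sd_time = statistics.stdev(task_times)       # sample standard deviation (n - 1)

print(f"mean={mean_time:.2f}s median={median_time:.2f}s sd={sd_time:.2f}s")
```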
What are inferential statistical methods?
t-tests, ANOVA, chi-square tests. Power analysis is used alongside them to plan the required sample size.
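As a worked example of one of these tests, the sketch below computes Welch's t statistic for two independent groups by hand. The ratings are invented, and the function name `welch_t` is an assumption made for illustration. Getting a p-value would additionally require the t distribution (for instance via `scipy.stats.ttest_ind` with `equal_var=False`), which is left out here.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Standard error of the difference between the two means.
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical 5-point Likert trust ratings (illustrative data only).
with_feedback = [4, 5, 4, 5, 4]
without_feedback = [3, 3, 4, 2, 3]

t_value = welch_t(with_feedback, without_feedback)
print(f"t = {t_value:.2f}")
```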
What data is analyzed qualitatively?
Interviews, open-ended responses, video notes.
What is thematic coding? (definition)
Identifying recurring ideas and patterns in qualitative data.
What is narrative analysis?
Analyzing participants’ stories to extract insights.
What is rigor in qualitative research?
Ensuring transcription accuracy and inter-rater reliability.
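Inter-rater reliability is often quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is one possible implementation for two raters. The labels and the function name `cohens_kappa` are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned to six interview excerpts (illustrative only).
codes_a = ["trust", "trust", "fear", "trust", "fear", "fear"]
codes_b = ["trust", "fear", "fear", "trust", "fear", "trust"]

kappa = cohens_kappa(codes_a, codes_b)
print(f"kappa = {kappa:.2f}")
```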
What are mixed-methods?
Combining quantitative and qualitative data to get a more complete picture.
What is an observational study? (definition)
Watching natural interactions without manipulating anything.
What is the Hawthorne effect?
People change behavior when they know they are being observed.
What is a between-group design?
Different participants in each condition.
Pros: no order effects
Cons: needs larger sample
What is a within-group design?
Same participants in all conditions.
Pros: controls individual differences
Cons: carryover and fatigue effects
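Counterbalancing the order of conditions is one common way to reduce order and carryover effects in a within-group design (it is not part of the original card). The sketch below builds a simple rotation-based Latin square so that each condition appears once in every position. The function names are assumptions, and a plain rotation does not balance which condition immediately precedes which.

```python
def latin_square_orders(conditions):
    """Return one rotated order per row so each condition occupies each position once."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

def order_for_participant(participant_index, conditions):
    """Cycle participants through the rotated orders."""
    orders = latin_square_orders(conditions)
    return orders[participant_index % len(orders)]

# Hypothetical condition labels (illustrative only).
conditions = ["A", "B", "C"]
orders = latin_square_orders(conditions)

for participant in range(4):
    print(participant, order_for_participant(participant, conditions))
```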
What is an RCT? (definition)
A randomized controlled trial (RCT) randomly assigns participants to an intervention or a control condition and compares the outcomes; it is considered the gold standard for establishing causality.
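The random assignment step can be sketched as follows. The participant IDs and the function name `randomize` are hypothetical, and real trials often use stratified or blocked randomization instead of this simple shuffle.

```python
import random

def randomize(participant_ids, seed=None):
    """Shuffle participants and split them into intervention and control groups."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

# Hypothetical participant IDs (illustrative only).
participants = [f"P{i:02d}" for i in range(1, 11)]
groups = randomize(participants, seed=7)

print(groups["intervention"])
print(groups["control"])
```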
What is a longitudinal study?
Study conducted for weeks, months, or years.
Measures adaptation, long-term acceptance, and how the novelty effect fades.
What are self-assessments?
Surveys measuring user feelings (trust, valence, usability).
What are behavioral observations?
Video/live coding of gaze, gesture, engagement.
What are psychophysiological measures?
Heart rate variability, skin conductance, respiration.
What are task performance metrics?
Completion time, errors, success rate.
What is pipeline performance?
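Measuring latency, stability, and cost across the pipeline: ASR (automatic speech recognition) → LLM (large language model) → DM (dialogue manager) → TTS (text-to-speech).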
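A minimal sketch of per-stage latency measurement for such a pipeline. The four stages here are hypothetical stand-in functions, not real ASR, LLM, DM, or TTS components, and `run_pipeline` is an assumed helper name.

```python
import time

def run_pipeline(stages, payload):
    """Run each stage in order and record its wall-clock latency in seconds."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

# Hypothetical stand-in stages; real components would replace these lambdas.
stages = [
    ("ASR", lambda audio: "hello robot"),
    ("LLM", lambda text: "reply to: " + text),
    ("DM", lambda text: text.strip()),
    ("TTS", lambda text: text.encode("utf-8")),
]

output, timings = run_pipeline(stages, b"\x00\x01")
total_latency = sum(timings.values())

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Repeating the run many times and looking at the spread of the timings, not only the mean, gives a rough view of stability as well as latency.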
What is perplexity?
Metric showing how well a model predicts next words; lower = better.
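Perplexity is the exponential of the average negative log-probability the model assigns to each actual next token. The sketch below assumes those per-token probabilities are already available; the values are invented.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("Need at least one token probability.")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities (illustrative only).
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident_model))  # lower perplexity: better predictions
print(perplexity(uncertain_model))  # higher perplexity: worse predictions
```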
What is BLEU score?
Precision-focused metric measuring n-gram overlap between model output and reference text.
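The core ingredient of BLEU is clipped n-gram precision, sketched below for a single value of n. This is a simplification and not the full BLEU score, which also combines several n-gram orders and applies a brevity penalty. The sentences and function names are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total

# Hypothetical system output and reference (illustrative only).
candidate = "the robot greets the user".split()
reference = "the robot greets a new user".split()

unigram_precision = clipped_precision(candidate, reference, 1)
bigram_precision = clipped_precision(candidate, reference, 2)
print(unigram_precision, bigram_precision)
```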
What is ROUGE?
Recall-focused metric measuring how much key content is matched in summaries.
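A minimal sketch of ROUGE-1 recall: the share of reference unigrams that also appear in the generated summary. Other ROUGE variants (ROUGE-2, ROUGE-L) are not covered, and the example sentences are invented.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams matched by the candidate, with clipped counts."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    return overlap / total

# Hypothetical summary and reference (illustrative only).
summary = "the robot greets the user".split()
reference_summary = "the robot greets a new user".split()

recall = rouge1_recall(summary, reference_summary)
print(f"ROUGE-1 recall = {recall:.2f}")
```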
What is simulated user testing?
Testing robustness using persona scripts, happy path, and edge cases.
What do humans judge LLM responses on?
Usability, relevance, satisfaction.
Why are human-centric metrics important?
They capture insights automated metrics cannot.
What are the steps in an evaluation plan?
Select capability
Define outcome
Choose metrics
Choose instruments
Define protocol
Execute
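The steps above can be captured as a simple checklist structure that flags which planning steps are still open before execution. The class name, field names, and example values below are assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvaluationPlan:
    capability: str = ""
    outcome: str = ""
    metrics: List[str] = field(default_factory=list)
    instruments: List[str] = field(default_factory=list)
    protocol: str = ""

    def missing_steps(self):
        """List the planning steps that have not been filled in yet."""
        checks = [
            ("Select capability", self.capability),
            ("Define outcome", self.outcome),
            ("Choose metrics", self.metrics),
            ("Choose instruments", self.instruments),
            ("Define protocol", self.protocol),
        ]
        return [step for step, value in checks if not value]

    def ready_to_execute(self):
        """Execution is the last step and requires all earlier steps to be done."""
        return not self.missing_steps()

# Hypothetical plan (illustrative values only).
plan = EvaluationPlan(
    capability="emotional feedback",
    outcome="higher user trust",
    metrics=["trust rating", "task completion time"],
    instruments=["Likert survey", "video coding"],
)

print(plan.missing_steps())
print(plan.ready_to_execute())
```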