3.6 Evaluation

1

Evaluation

is where we take what we've designed and put it in front of users to get their feedback.

Our goal is to apply multiple evaluation techniques to constantly center our designs around the user. That's why evaluation is a foundation of user-centered design.

2

3 Types of Evaluation

  • The first is qualitative evaluation. This is where we want to get qualitative feedback from users: what do they like, what do they dislike, what's easy, what's hard. We'll get that information through methods very similar, in fact identical, to our methods for need-finding.

  • The second is empirical evaluation. This is where we actually want to do some controlled experiments and evaluate the results quantitatively. For that, we need many more participants, and we also want to make sure we address the big qualitative feedback first.

  • The third is predictive evaluation. Predictive evaluation is specifically evaluation without users. In user-centered design, this is obviously not our favorite kind of evaluation. Evaluation with real users, though, is oftentimes slow and really expensive, so it's useful for us to have ways we can do some simple evaluation on a day-to-day basis. So we'll structure our discussion of evaluation around these three general categories.

3

Evaluation Terminology

  • Reliability refers to whether or not some assessment of some phenomenon is consistent over time. So for example, Amanda what time is it? It's about 2:30. Amanda what time is it? It's about 2:30. Amanda, what time is it? It's 2:30. Amanda is a very reliable assessment of the time. Every time I asked, she gives me the same time. We want that in an assessment measure. We want it to be reliable across multiple trials. Otherwise, its conclusions are random and just not very useful.

  • Validity refers to how accurately an assessment measures reality. An assessment could be completely reliable but completely inaccurate. So for example, Amanda, what time is it? Oh my goodness, it's 2:30! Actually it's 1:30. Oh, shoot! So while Amanda was a reliable timekeeper, she wasn't a very valid one. Her time wasn't correct even though it was consistent.

  • Generalizability is the extent to which we can apply lessons we learned in our evaluation to broader audiences of people. So for example, we might find that the kinds of people who volunteer for usability studies have different preferences than the regular user. In that case, the conclusions we draw from those volunteers might not generalize to the broader population we actually want to measure.

  • Precision is a measurement of how specific some assessment is. So for example, Amanda, what time is it? Well apparently, it's 1:30. Actually, it's 1:31. Come on! But in this case, no one's really going to say that Amanda was wrong in saying that it was 1:30. She just wasn't as precise. I could just as accurately say it's 1:31:27, but that's probably more precision than we need.

4

5 tips on What to Evaluate

  • Number one, efficiency. How long does it take users to accomplish certain tasks? That's one of the classic metrics for evaluating interfaces. Can one interface accomplish a task in fewer actions or in less time than another? You might test this with predictive models or you might actually time users in completing these tasks. Still though, this paints a pretty narrow picture of usability.

  • Number two, accuracy. How many errors do users commit while accomplishing a task? That's typically a pretty empirical question, although we can address it qualitatively as well. Ideally, we want an interface that reduces the number of errors a user commits while performing a task. Both efficiency and accuracy, however, examine the narrow setting of an expert user using an interface. So, that brings us to our next metric.

  • Number three, learnability. Sit the user down in front of the interface and define some standard for expertise. How long does it take the user to hit that level of expertise? Expertise here might range from performing a particular action to something like creating an entire document.

  • Number four, memorability. Similar to learnability, memorability refers to the user's ability to remember how to use an interface over time. Imagine you have a user learn an interface, then leave and come back a week later. How much do they remember? Ideally, you want interfaces that need only be learned once, which means high memorability.

  • Number five, satisfaction. It's easy to get so focused on the other metrics that we forget about the general notion of satisfaction, but that doesn't mean it's unimportant. We do need to operationalize it, though: satisfaction covers things like the user's enjoyment of the system or the cognitive load they experience while using it. To avoid social desirability bias, we might want to evaluate this in creative ways, like finding out how many participants actually download an app they tested after the session is over. Regardless of what you choose to evaluate, it's important that you very clearly articulate at the beginning what you're evaluating, what data you're gathering, and what analysis you will use. These three things should match up to address your research questions.
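For a concrete feel of how these metrics become numbers, here is a minimal Python sketch that summarizes hypothetical session data into efficiency, accuracy, and satisfaction figures; every field name and value is made up for illustration.

```python
# Minimal sketch: summarizing hypothetical usability-session data into the
# efficiency, accuracy, and satisfaction metrics described above.
from statistics import mean, median

sessions = [
    {"participant": "P1", "task_seconds": 48.2, "errors": 1, "satisfaction": 4},
    {"participant": "P2", "task_seconds": 61.5, "errors": 3, "satisfaction": 3},
    {"participant": "P3", "task_seconds": 39.9, "errors": 0, "satisfaction": 5},
]

efficiency = median(s["task_seconds"] for s in sessions)   # typical completion time
accuracy = mean(s["errors"] for s in sessions)              # mean errors per task
satisfaction = mean(s["satisfaction"] for s in sessions)    # e.g., 1-5 rating

print(f"Median task time: {efficiency:.1f}s")
print(f"Mean errors per task: {accuracy:.2f}")
print(f"Mean satisfaction: {satisfaction:.1f}/5")
```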

5

Evaluation Timeline

Early in the timeline:

  • Formative (Purpose): evaluation with the intention of improving the interface going forward; its primary purpose is to help us redesign and improve our interface.

  • Predictive (Approach)

  • Qualitative (Data)

  • Lab Testing (Setting)

Late in the timeline:

  • Summative (Purpose): evaluation with the intention of judging the quality of the final interface.

  • Empirical (Approach)

  • Quantitative (Data)

  • Field Testing (Setting)

6

Evaluation Design Steps

  • First, we want to clearly define the task that we're examining. Depending on your place in the design process this can be very large or very small. If we were designing Facebook, it can be as simple as posting a status update, or as complicated as navigating amongst and using several different pages. It could involve context and constraints like taking notes while running, or looking up a restaurant address without touching the screen. Whatever it is, we want to start by clearly identifying what task we're going to investigate.

  • Second, we want to define our performance measures. How are we going to evaluate the user's performance? Qualitatively, it could be based on their spoken or written feedback about the experience. Quantitatively, we can measure efficiency in certain activities or count the number of mistakes. Defining performance measures helps us avoid confirmation bias. It makes sure we don't just pick out whatever observations or data confirm our hypotheses, or say that we have a good interface. It forces us to look at it objectively.

  • Third, we develop the experiment. How will we assess users' performance on the performance measures? If we're looking qualitatively, will we have them think out loud while they're using the tool, or will we have them do a survey after they're done? If we're looking quantitatively, what will we measure, what will we control, and what will we vary? This is also where we ask questions about whether our assessment measures are reliable and valid, and whether the users we're testing are generalizable.

  • Fourth, we recruit the participants. As part of the ethics process, we make sure we're recruiting participants who are aware of their rights and contributing willingly.

  • Then fifth, we do the experiment. We have participants walk through what we outlined when we developed the experiment.

  • Sixth, we analyze the data. We focus on what the data tells us about our performance measures. It's important that we stay close to what we outlined initially. It can be tempting to just look for whatever supports our design, but we want to be impartial. If we find some evidence that suggests our interface is good in ways we didn't anticipate, we can always do a follow-up experiment to test if we're right.

  • Seventh, we summarize the data in a way that informs our ongoing design process. What did our data say was working? What could be improved? How can we take the results of this experiment and use them to then revise our interface?
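To make "decide what you're evaluating, what data you're gathering, and what analysis you'll use" concrete, here is a minimal sketch of an evaluation plan written down before any sessions run; the task, measures, and numbers are all hypothetical placeholders.

```python
# Minimal sketch of an up-front evaluation plan; all values are hypothetical.
evaluation_plan = {
    "task": "Post a status update without any instruction",
    "performance_measures": [
        "time to complete (seconds)",
        "number of errors",
        "post-session satisfaction (1-5)",
    ],
    "experiment": {
        "protocol": "synchronous think-aloud, one interface per session",
        "controlled": ["device", "starting screen"],
        "varied": ["interface version (A vs. B)"],
    },
    "participants": {"target_n": 12, "consent": "obtained before each session"},
    "analysis": "compare median completion time and error counts between versions",
}

for step, detail in evaluation_plan.items():
    print(f"{step}: {detail}")
```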

7

Qualitative Evaluation

Qualitative evaluation involves getting qualitative feedback from the user. There are a lot of qualitative questions we want to ask throughout the design process. What did you like? What did you dislike? What were you thinking while using this interface? What was your goal when you took that particular action? Now, if this sounds familiar, it's because it should be. The methods we use for qualitative evaluation are very similar to the methods we used for need-finding: interviews, think-aloud protocols, focus groups, surveys, post-event protocols. We use those methods to get information about the task in the first place, and now, we can use these techniques to get feedback on how our prototype changes the task.

8

Questions when designing a qualitative evaluation

  • First, is this based on prior experience, or is it a live demonstration? If you're bringing in users to answer questions about some interface that they're already using regularly, I'd probably argue you're actually doing need-finding. Now, the distinction can be subtle because evaluation does lead to additional need-finding, but most of these questions are going to apply more to a live demonstration. This is where you're bringing users in to test out some new interface or new prototype.

  • Second, is the session going to be synchronous or asynchronous? In a synchronous session, you're sitting and watching the user live; you're actually watching them use the interface or the prototype. If they're going to complete it on their own and just send you the results, then it's an asynchronous session. Synchronous is usually beneficial because we see a much greater amount of the interactions taking place. We might also be able to interrupt the user and get their thoughts live. Asynchronous, though, is often much easier to carry out, especially with larger populations. I generally recommend synchronous whenever possible, but asynchronous is certainly better than nothing.

  • Third, will they be evaluating one interface or many prototypes? You might have users come in to evaluate only one interface, or you might have them look at multiple prototypes to compare between them. If you're having them look at more than one prototype, you want to make sure to vary the order in which you present them (a small counterbalancing sketch follows this list); otherwise, you might get consistently different feedback just because the user is already familiar with the problem domain when they get to the second interface. That can be particularly significant if you're trying out some new interface compared to an interface that users have worked with in the past. If you always present the old one first, then you'll probably get a lot of users saying the new one is much better, when in reality, they're just more familiar with the problem now.

  • Fourth, when do you want to get feedback from the user? There are two main protocols for doing this: a Think Aloud Protocol and a Post-Event Protocol. In a Think Aloud Protocol, you ask the user to think out loud while they're using your interface or prototype. You ask them to explain what they're seeing, how they interpret it, and what they think the outcome of their actions will be. In a Post-Event Protocol, you have the user go through a session using the interface or testing out the prototype, and only give you their thoughts at the end. A Post-Event Protocol has the drawback that you're only getting the user's feedback at the end, so if they experienced some difficulty early on, they may have forgotten it by the time you actually get feedback from them. But a Think Aloud Protocol has a problem too: it might introduce some new biases. Research shows that when users are asked to think out loud while using an interface, the way they use the interface actually changes. They're more deliberative, more thoughtful, and less intuitive about their interactions. In general, that means that when we ask users to think out loud about an interface, oftentimes they'll figure out how the interface works, but users who use the same interface without having to think out loud find it much more confusing. Talking through their thought process helps them understand, but our real end users aren't going to talk through their thought process, so it's often good to use a mix of these two. In fact, I'd usually suggest doing a Think Aloud Protocol earlier, and using a Post-Event Protocol more as a summative evaluation once you're already pretty confident in how good your interface is, but others' advice may differ. In either case, it's worth noting that users are often not very good at explaining why they like something or why they did something, so we should always take the feedback that we get with a grain of salt.

  • Finally, do you want to get feedback from individuals or from groups? Focus groups are used when multiple users talk together about their experiences. This can actually lead to better explanations, because users build on each other and expand each other's ideas, but it can also strongly bias the group toward the opinions of the most powerful personalities. Individual interviews and surveys force the user to be the only source of knowledge, which can be bad, but it also means the user isn't biased by other outside views. As you've probably noticed by now, whenever we talk about multiple different options for doing evaluation or need-finding, different approaches have different strengths, and they address the weaknesses of other approaches.
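Here is a minimal sketch of the order counterbalancing mentioned in the multiple-prototypes question above: presentation orders alternate across participants so no prototype always comes first. The prototype and participant names are hypothetical.

```python
# Minimal sketch of counterbalancing presentation order across participants.
from itertools import permutations

prototypes = ["old interface", "new prototype"]
orders = list(permutations(prototypes))  # every possible presentation order

participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
assignments = {p: orders[i % len(orders)] for i, p in enumerate(participants)}

for participant, order in assignments.items():
    print(participant, "->", " then ".join(order))
```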

9

Video Recording for Capturing Qualitative Evaluation

Pros:

  • Automated means that it runs automatically in the background.

  • Comprehensive means that it captures everything that happens during the session.

  • Passive means that it lets us focus on administering the session instead of capturing it.

Cons:

  • Intrusive means that many participants are uncomfortable being videotaped. It creates apprehension knowing that every question and every mistake is going to be captured and analyzed by researchers later.

  • Video is also very difficult to analyze. It requires a person to come later and watch every minute of video, usually several times, in order to code and pull out what was actually relevant in that session.

  • And video recording often has difficulty capturing interactions on-screen. We can film what a person is doing on a keyboard or with a mouse, but it is difficult to then see how that translates to on-screen actions.

10

Note Taking for Qualitative Evaluation

Pros:

  • It's cheap because we don't have to buy expensive cameras or equipment; we just need our pens and paper or our laptops, and we can do it using equipment we already have available to us.

  • It's not intrusive, in that it only captures what we decide to capture. If a participant is uncomfortable asking questions or makes a silly mistake with the interface, we don't necessarily have to capture that, and that can make the participant feel a little bit more comfortable being themselves.

  • And it's a lot easier to analyze notes. You can scroll through and read the notes on a one-hour session in only a few minutes, but analyzing that same session on video will take at least an hour, and more if you need to watch it more than once.

Cons:

  • Taking notes can be a very slow process, meaning that we can't keep up with the dynamic interactions that we're evaluating.

  • It's also manual which means that we actually have to focus on actively taking notes, which gets in the way of administering the session. If you're going to use note taking, you probably want to actually have two people involved. One person running the session, and one person taking notes.

  • And finally, it's limited in what it captures. It might not capture some of the movements or the motions that a person does when interacting with an interface. It doesn't capture how long they hesitate before deciding what to do next.

11

Software Logging for Qualitative Evaluation

Pros:

  • Like video capture, it's automatic and passive, but like note taking, it's analyzable. Because it's run by the system itself, it automatically captures everything that it knows how to capture, and it does so without our active intervention. But it likely does so in a data or text format that we can then either analyze manually by reading through it, or even with some more complicated data analytics methods. So in some ways, it captures the pros from both note-taking and video capture.

Cons:

  • Most notably, it's very limited. We can only capture those things that are actually expressed inside the software. Things like the questions that a participant asks wouldn't naturally be captured by software logging.

  • Similarly, it only captures a narrow slice of the interaction. It only captures what the user actually does on the screen. It doesn't capture how long they look at something. We might be able to infer that by looking at the time between interactions, but it's difficult to know if that hesitation was because they couldn't decide what to do, or because someone was making noise outside, or something else was going on.

  • And finally, it's also very tech sensitive. We really have to have a working prototype, in order to use software logging. But remember, many of our prototypes don’t work yet. You can't do software logging on a paper prototype, or a card prototype, or a Wizard of Oz prototype. This only really works once we've reached a certain level of fidelity with our interfaces. So in selecting a way to capture your qualitative evaluation, ask yourself, will the subjects find being captured on camera intrusive? Do I need to capture what happens on screen? How difficult will this data be to analyze? It's tempting, especially for novices, to focus on just capturing as much as possible during the session. But during the session is when you can capture data in a way that's going to make your analysis easier. So think about the analysis that you want to do, when deciding how to capture your sessions.
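For a sense of what software logging can look like in practice, here is a minimal Python sketch that appends timestamped interaction events to a text log for later analysis; the event names and file path are hypothetical, and a real prototype would call log_event from its own UI handlers.

```python
# Minimal sketch of software logging: timestamped events written as JSON lines.
import json
import time

LOG_PATH = "session_log.jsonl"  # hypothetical log file

def log_event(event, **details):
    """Append one timestamped interaction event as a JSON line."""
    record = {"t": time.time(), "event": event, **details}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Events a prototype might emit during a session:
log_event("screen_shown", screen="home")
log_event("button_click", target="new_post")
log_event("task_complete", task="post_status", seconds=48.2)
```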

12

5 Tips for Qualitative Evaluation

  • Number one, run pilot studies. Recruiting participants is hard. You want to make sure that once you start working with real users, you're ready to gather really useful data. So, try your experiment with friends or family or co-workers before trying it out with real users to iron out the kinks in your design and your directions.

  • Number two, focus on feedback. It's tempting in qualitative evaluations to spend too much time trying to teach this one user. If the user criticizes an element of the prototype, you don't need to explain to them the rationale. Your goal is to get feedback to design the next interface, not to just teach this one current user.

  • Number three, use questions when users get stuck. That way, you get some information on why they're stuck and what they're thinking. Those questions can also be used to guide users toward how they should use the interface, which makes the session feel less instructional.

  • Number four, tell users what to do, but not how to do it. This doesn't always apply, but most often we want to design interfaces that users can use without any real instruction whatsoever. So, in performing qualitative evaluation, give them instruction on what to accomplish, but let them try to figure out how to do it. If they try to do it differently than what you expect, then you know how to design the next interface.

  • Number five, capture satisfaction. Sometimes, we can get so distracted by whether or not users can use our interface that we forget to ask them whether or not they like using our interface. So, make sure to capture user satisfaction in your qualitative evaluation.

13

Empirical Evaluation

We're trying to evaluate something formal, and most often that means something numeric. It could be something explicitly numeric, like what layout of buttons leads to more purchases or what gestures are most efficient to use. There could also be some interpretation involved, though, like counting errors or summarizing survey responses. The overall goal is to come to something verifiable and conclusive. In industry, this is often useful in comparing designs or in demonstrating improvement.

14

Hypothesis Testing Elements/Aspects

Whenever we're trying to prove something, we initially hypothesize that the opposite is true.

  • Null hypothesis: the initial hypothesis that they are equal. In other words, it's the hypothesis that we accept if we can't find sufficient evidence to support the alternative hypothesis. (There is no impact/effect on the population.)

  • Alternative hypothesis: the hypothesis that there is an impact/effect on the population.
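To make the null/alternative distinction concrete, here is a minimal sketch using a two-sample t-test from SciPy; the task-time numbers are invented for illustration.

```python
# Minimal sketch: two-sample t-test comparing task times for two designs.
# H0 (null): the two designs have the same mean completion time.
# H1 (alternative): the mean completion times differ.
from scipy import stats

design_a = [48.2, 61.5, 39.9, 55.0, 47.3, 52.8]  # hypothetical times (seconds)
design_b = [41.0, 44.6, 38.2, 47.9, 40.5, 43.1]

t_stat, p_value = stats.ttest_ind(design_a, design_b)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: cannot reject the null hypothesis")
```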

15

Tests relating to Nominal Data

Chi-Squared Test, Fisher's Test, G-Test

Independent variable: Category

Dependent variable: Distribution among categories
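For instance, a minimal sketch of a chi-squared test on nominal data with SciPy; the preference counts are hypothetical.

```python
# Minimal sketch: chi-squared test of independence on nominal (count) data.
from scipy.stats import chi2_contingency

# Rows: interface version; columns: how many participants preferred each layout.
observed = [
    [30, 12, 8],   # version A
    [18, 25, 7],   # version B
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}, dof = {dof}")
```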

16

Tests relating to Ordinal Data

Kolmogorov-Smirnov test, Chi-Squared Test, Median Test

Independent variable: Category

Dependent variable: Distribution among ranked categories
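As an illustration, here is a minimal sketch of the median test on ordinal data with SciPy; the 1-5 satisfaction ratings are made up.

```python
# Minimal sketch: Mood's median test on ordinal ratings from two designs.
from scipy.stats import median_test

ratings_a = [4, 5, 3, 4, 4, 5, 2, 4]
ratings_b = [3, 2, 3, 4, 2, 3, 3, 2]

stat, p_value, grand_median, table = median_test(ratings_a, ratings_b)
print(f"grand median = {grand_median}, p = {p_value:.3f}")
```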

17

Tests relating to Interval/Ratio Data

Student's t-test, Mann-Whitney-Wilcoxon (MWW) test, Kruskal-Wallis Test, ANOVA

Independent variable: Category

Dependent variable: Average of outcomes
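For example, a minimal sketch of a one-way ANOVA comparing mean task times across three designs with SciPy; the numbers are invented.

```python
# Minimal sketch: one-way ANOVA across three designs' task times (seconds).
from scipy.stats import f_oneway

design_a = [48.2, 61.5, 39.9, 55.0, 47.3]
design_b = [41.0, 44.6, 38.2, 47.9, 40.5]
design_c = [52.1, 58.4, 49.7, 60.3, 54.9]

f_stat, p_value = f_oneway(design_a, design_b, design_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```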

18

Binomial Data

Data with only two possible outcomes (e.g., task completed or not).
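A minimal sketch of analyzing binomial data with SciPy's binomial test; the completion counts are hypothetical.

```python
# Minimal sketch: did the observed task-completion rate differ from 50%?
from scipy.stats import binomtest

successes = 18   # participants who completed the task
n = 25           # total participants

result = binomtest(successes, n, p=0.5)  # H0: completion rate is 50%
print(f"observed rate = {successes / n:.2f}, p = {result.pvalue:.3f}")
```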

19

5 Tips of Empirical Evaluation

  • Number one, control what you can, document what you can't. Try to make your treatments as identical as possible. However, if there are systematic differences between them, document and report that.

  • Number two, limit your variables. It can be tempting to try to vary lots of different things and monitor lots of other things, but that just leads to noisy difficult data that will probably generate some false conclusions. Instead, focus on varying only one or two things and monitor only a handful of things in response. There's nothing at all wrong with only modifying one variable and only monitoring one variable.

  • Number three, work backwards in designing your experiment. A common mistake that I've seen is to just gather a bunch of data and figure out how to analyze it later. That's messy, and it doesn't lead to very reliable conclusions. Decide at the start what question you want to answer, then decide the analysis you need to use, and then decide the data that you need to gather.

  • Number four, script your analyses in advance. Ronald Coase once said, "If you torture the data long enough, nature will always confess." What the quote means is that if we analyze and reanalyze data enough times, we can always find conclusions, but that doesn't mean they're actually there. So, decide in advance what analysis you'll do and do it. If it doesn't give you the results that you want, don't just keep reanalyzing that same data until it does.

  • Number five, pay attention to power. Power refers to the size of the difference that a test can detect. Generally, it's very dependent on how many participants you have. If you want to detect only a small effect, then you'll need a lot of participants. If you only care about detecting a big effect, you can usually get by with fewer.
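To illustrate the point about power, here is a minimal sketch of a sample-size calculation with statsmodels; the effect size, alpha, and power targets below are common conventional values, not figures from the notes above.

```python
# Minimal sketch: how many participants per group does a two-sample t-test need?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # "medium" effect (Cohen's d)
    alpha=0.05,       # significance level
    power=0.8,        # desired probability of detecting the effect
)
print(f"~{n_per_group:.0f} participants per group needed")
```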

20

Predictive Evaluation

Evaluation we can do without actual users, where we investigate the user's likely thought process ourselves instead of observing it directly.

Now, in user-centered design that's not ideal, but it can be more efficient and accessible than actual user evaluation. So it's all right to use it as part of a rapid feedback process. It lets us keep the user in mind, even when we're not bringing users into the conversation. The important thing is to make sure we're using it appropriately. It shouldn't be used where we could be doing qualitative or empirical evaluation; it should only be used where we wouldn't otherwise be doing any evaluation. Effectively, it's better than nothing.

21

Types of Predictive Evaluation

Heuristic Evaluation: Each individual evaluator inspects the interface alone and identifies places where the interface violates some heuristic. We might sit with an expert while they perform the evaluation, or they might generate a report for us. Heuristics are useful because they give us small snapshots into the way people might think about our interfaces. If we take these heuristics to an extreme, though, we could go so far as to develop models of the way people think about our interfaces.

Model-based Evaluation: We take user task models and trace through them in the context of the interface that we designed. We can use that to inform our evaluation of whether or not the interface relies on high user motivation. If we find that the interface requires users to be more personally driven or to keep more in working memory, then we might find that users will fail if they don't have high motivation to use the interface, and then we can revise it accordingly.

Simulation-based Evaluation: model-based evaluation taken to an extreme. At that point, we might construct an artificially intelligent agent that interacts with our interface in the way that a human would.
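As one concrete flavor of model-based prediction, here is a minimal sketch in the spirit of the Keystroke-Level Model: it estimates task time by summing commonly cited operator times. The task breakdown is hypothetical, and this specific model is just one well-known example rather than something prescribed above.

```python
# Minimal sketch: predict task time by summing Keystroke-Level Model operators.
KLM_SECONDS = {
    "K": 0.28,  # keystroke (average typist)
    "P": 1.10,  # point at a target with the mouse
    "B": 0.10,  # press or release a mouse button
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

# Hypothetical task: click a text field, type a 5-character tag, press Enter.
task = ["M", "H", "P", "B", "B", "H", "K", "K", "K", "K", "K", "K"]

predicted = sum(KLM_SECONDS[op] for op in task)
print(f"Predicted task time: {predicted:.2f} seconds")
```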

22

Cognitive Walkthrough

A type of predictive evaluation where we step through the process of interacting with an interface, mentally simulating in each stage what the user is seeing and thinking and doing. At every stage of the process, I want to investigate this from the perspective of the gulfs of execution and evaluation. Now, the weakness of cognitive walkthroughs is that we're the designers, so it likely seems to us like the design is just fine. After all, that's why we designed it that way. That’s why putting yourself in the user’s shoes is imperative; you can start to uncover some really useful takeaways

23

Advantages of Qualitative Evaluation

Informs ongoing design decisions

Investigates the participant’s thought process

Draws conclusions from actual participants

24

Advantages of Empirical Evaluation

Identifies provable advantages

Provides generalizable conclusions

Draws conclusions from actual participants

25

Advantages of Predictive Evaluation

Doesn’t require any actual users

Informs ongoing design decisions

Investigates participants' thought patterns
