Unreliability of self-reported user data

Many people are bad at estimating how often and how long they’re on the phone. Interestingly, you can predict who will overestimate and who will underestimate their phone usage, according to the 2009 study, “Factors influencing self-report of mobile phone use” by Dr Lada Timotijevic et al. For this study, a self-reported estimate is considered accurate if it is within 10% of the actual number:

                          Underestimated   Accurate   Overestimated
Number of phone calls
  High user                     71%           10%           19%
  Medium user                   53%           21%           26%
  Low user                      33%           16%           51%
Duration of phone calls
  High user                     41%           20%           39%
  Medium user                   27%           17%           56%
  Low user                      13%            6%           81%

Each cell shows the percentage of people in that group.
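
To make the study’s 10% rule concrete, here’s a minimal sketch in Python (my own illustration, not the researchers’ method) that classifies one self-reported estimate against a logged actual value:

    def classify(estimate: float, actual: float) -> str:
        # "Accurate" means within 10% of the actual value (the study's definition).
        if abs(estimate - actual) <= 0.10 * actual:
            return "Accurate"
        return "Underestimated" if estimate < actual else "Overestimated"

    print(classify(estimate=18, actual=20))  # Accurate (within 10% of 20)
    print(classify(estimate=30, actual=20))  # Overestimated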

If people are bad at estimating their phone use, does this mean that people are bad at all self-reporting tasks?

Not surprisingly, it depends on how long it’s been since the event they’re trying to remember. It also depends on other factors. Here are some findings that should convince you to be careful with the self-reported user data that you collect.

What’s the problem with self-reported data?

On questions that ask respondents to remember and count specific events, people frequently have trouble because their ability to recall is limited. Instead of answering “I’m not sure,” people typically use partial information from memory to construct or infer a number. In 1987, N.M. Bradburn et al found that U.S. respondents to various surveys had trouble answering such questions as:

  • During the last 2 weeks, on days when you drank liquor, about how many drinks did you have?
  • During the past 12 months, how many visits did you make to a dentist?
  • When did you last work at a full-time job?

To complicate matters, not all self-report data is suspect. Can you predict which data is likely to be accurate or inaccurate?

  • Self-reported Madagascar crayfish harvesting—quantities, effort, and harvesting locations—collected in interviews was shown to be reliable (2008, Julia P. G. Jones et al).
  • Self-reported eating behaviour by people with binge-eating disorders was shown to be “acceptably” reliable, especially for bulimic episodes (2001, Carlos M. Grilo et al).
  • Self-reported condom use was shown to be accurate over the medium term, but not in the short term or the long term (1995, James Jaccard et al).
  • Self-reported numbers of sex partners were underreported, and sexual experiences and condom use were overreported, when self-reports a year later were compared with self-reports made at the time (2002, Maryanne Garry et al).
  • Self-reported answers to questions about family background, such as the father’s employment, produce “seriously biased” findings in studies of social mobility in the Netherlands—erring by as much as 41% (2008, Jannes de Vries and Paul M. de Graaf).
  • Participation in weekly worship services is overreported in U.S. polls: polls say 40%, but attendance data says 22% (2005, C. Kirk Hadaway and Penny Long Marler).

Can you improve self-reported data that you collect?

Yes, you can. Consider these:

  • Decomposition into categories. Estimates of credit-card spending get more accurate when respondents are asked for separate estimates of their expenditures on, say, entertainment, clothing, travel, and so on (2002, J. Srivastava and P. Raghubir).
  • For your quantitative or qualitative usability research or other user research, it’s easy to write your survey questions or lines of inquiry so they ask for data in this decomposed form.

  • Real-time data collection. Collecting self-reported real-time data from patients in their natural environments “holds considerable promise” for reducing bias (2002, Michael R. Hufford and Saul Shiffman).
  • This finding is from 2002. Social-media tools and handheld devices now make real-time data collection more affordable and less unnatural. For example, use text messages or Twitter to send reminders and receive immediate, private responses.

  • Fuzzy-set collection methods. Fuzzy-set representations provide a more complete and detailed description of what participants recall about past drug use (2003, Georg E. Matt et al). There’s a small illustration after this list.
  • If you’re afraid of math but want to get into fuzzy sets, try a textbook (for example, Fuzzy-Set Social Science by Charles Ragin), audit a fuzzy-math course for the social sciences (auditing is a low-stakes way to get things explained), or hire a tutor in math or sociology/anthropology to teach it to you.
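
Here’s a minimal illustration of the fuzzy-set idea, in Python. It’s my own sketch, not the method from the Matt et al study: instead of forcing a respondent’s recollection into one crisp number, the answer gets partial membership in overlapping categories.

    def membership(times_per_month: float) -> dict:
        # Triangular membership function: rises from a to b, falls from b to c.
        def tri(x, a, b, c):
            if x <= a or x >= c:
                return 0.0
            return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
        # Category boundaries are hypothetical, chosen only for illustration.
        return {
            "occasional": tri(times_per_month, 0, 2, 8),
            "regular": tri(times_per_month, 2, 8, 20),
            "frequent": tri(times_per_month, 8, 20, 40),
        }

    # "Maybe 9 times a month" is mostly "regular" and slightly "frequent":
    print(membership(9))  # roughly {'occasional': 0.0, 'regular': 0.92, 'frequent': 0.08}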

Also, when there’s a lot at stake, use multiple data sources to examine the extent of self-report response bias, and to determine whether it varies as a function of respondent characteristics or assessment timing (2003, Frances K. Del Boca and Jack Darkes). Remember that your qualitative research is also one of those data sources.

How to test earlier

Involving users throughout the software-development cycle is touted as a way to ensure project success. Does usability testing count as user contact? You bet! But since most companies test their products later in the process, when it’s difficult to react meaningfully to the user feedback, here are two ways to get your testing done sooner.

Prioritise. Help the Development team rank the importance of the individual programming tasks, and then schedule the important tasks to complete early.

  • If a feature must be present in order to have meaningful interaction, then develop it sooner.
  • For example, email software that doesn’t let you compose a message is meaningless. To get meaningful feedback from users, they need to be able to type an email.

    Developers often want to start with the technologically risky tasks. Addressing that risk early is good, but it must be balanced against the risk of a product that’s less usable or unusable.

  • If a feature need not be present, or need not be working fully, in order to have meaningful interaction, then provide hard-coded actions in the interim, and add those features later (there’s a sketch of this after the list).
  • For example, if the email software lets users change the message priority from Standard to Important, hard-code it for the usability test so the priority is always Standard.

  • If a less meaningful feature must be tested because of its importance to the business strategy, then develop it sooner.
  • For example, email software that lets users record a video may be strategically important for the company, though users aren’t expected to adopt it widely until most laptops ship with built-in cameras.
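
Here’s a minimal sketch of the interim hard-coding, using the email-priority example above. The names are hypothetical, and the stub is the point: the control can appear in the UI, but the behaviour stays fixed until the real feature is developed.

    from dataclasses import dataclass

    @dataclass
    class Message:
        to: str
        subject: str
        body: str
        priority: str = "Standard"

    def set_priority(message: Message, requested: str) -> None:
        # Interim stub for the usability test: accept the user's choice,
        # but keep the default until the feature is actually developed.
        message.priority = "Standard"

    msg = Message(to="user@example.com", subject="Hello", body="Hi there")
    set_priority(msg, "Important")
    assert msg.priority == "Standard"  # hard-coded for now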

Schedule. For each feature to be tested, get the Development team to allocate time to respond to usability recommendations, and then ensure this time is neither reallocated to problem tasks, nor used up during the initial development effort of the to-be-tested features. Engage the developers by:

  • Sharing the scenarios in advance.
  • Updating them on your efforts to recruit usability-study participants.
  • Retesting after developers incorporate your recommendations, and then reporting the improvements in user performance.

Development planning that prioritises programming tasks based on the need to test, and then allows time in the schedule to respond to recommendations, is more likely to result in usable, successful products.

User mismatch: discard data?

When you’re researching users, every once in a while you come across one who’s an anomaly. You must decide whether to exclude their data points from the set or whether to adjust your model of the users.

Let me tell you about one such user. I’ll call him Bob (not his real name). I met Bob during a day of back-to-back usability tests of a specialised software package. The software has two categories of users:

  • Professionals who interpret data and use it in their design work.
  • Technicians who mainly enter and check data. A senior technician may do some of the work of a professional user.

When Bob walked in, I went through the usual routine: welcome; sign this disclaimer; tell me about the work you do. Bob’s initial answers identified him as a professional user. Once on the computer, though, Bob was unable to complete the first step of the test scenario. Rather than try to solve the problem, he sat back, folded his hands, and said: “I don’t know how to use this.” Since Bob was unwilling to try the software, I instead had a conversation (actually an unstructured interview) with him. Here’s what I learned:

What I observed, and whether it’s typical for this product:

  • Bob received no formal product training; he was taught by his colleagues. → Typical of more than half the professionals.
  • Bob has a university degree that is only indirectly related to his job. → Atypical of professionals.
  • He’s young (graduated 3 years ago). → Atypical of professionals, but desirable because many of his peers are expected to retire in under a decade.
  • Bob moved to his current town because his spouse got a job there, and he would be unwilling to move to another town for work. → Atypical. Professionals in this industry often work in remote locations for high pay.
  • Bob is risk averse. → Typical of the professionals.
  • He is easily discouraged, and isn’t inclined to troubleshoot. → Atypical. Professionals take responsibility for driving their own troubleshooting.
  • Bob completes the same task once or several times a day, with updated data each time. → Atypical of professionals; typical of technicians.

I decided to discard Bob’s data from the set.

The last two observations are characteristic of a rote user. Some professionals are rote users because they don’t know the language of the user interface, but this did not apply to Bob. There was a clear mismatch between the work Bob said he does and both his lack of curiosity and his non-performance in the usability lab. These usability tests took place before the 2008 economic downturn, when professionals in Bob’s industry were hard to find, so I quietly wondered whether hiring Bob had been a desperation move on the part of his employer.

If Bob had been a new or emerging type of user, discarding his data would have been a mistake. Imagine if Bob had been part of a new group of users:

  • What would be the design implications?
  • What would be the business implications?
  • Would we need another user persona to represent users like Bob?

Usability testing distant users

When a product’s users are scarce and widely dispersed, and your travel budget is limited, usability testing can be a challenge.

Remote testing from North America was part of the answer, for me. I’ve never used UserVue because the users I needed to reach were in Africa, Australia, South America, and Asia—continents that UserVue doesn’t reach. Even within North America, UserVue didn’t address the biggest problems I faced:

  • My study participants commonly face restrictive IT policies, so they cannot install our pre-release product and prerequisites.
  • I need to prevent study participants from risking their data by using it with a pre-release product.
  • There’s no way to force an uninstall after the usability test. Who else will see our pre-release?

Instead, I blended a solution of my own with Morae, Skype, Virtual Audio Cable, and GoToMeeting. I used GoToMeeting to share my desktop, which addressed all three of the problems listed above. I used Skype to get video and audio. I used Virtual Audio Cable to redirect the incoming voice from Skype to Morae Recorder’s microphone channel. Morae recorded everything except the PIP video. It worked. However, my studies were sometimes limited by poor Internet bandwidth to the isolated locations of my study participants.

Amateur facilitators. I realise this is controversial among usability practitioners, but beggars in North America can’t be choosers about how they conduct usability tests on other continents. I developed a one-hour training session for the team of travelling product managers. Training included a short PowerPoint presentation about the concepts, followed by use of Morae Recorder with webcam and headset while role-playing in pairs. The main points I had to get across:

  • Between study participants, reset the sample data and program defaults.
  • When you’re ready to start recording, first check that the video and audio are in focus and recording.
  • While you facilitate, do not lead the user. Instead, try paraphrasing and active listening (by which I mean vernacular elicitation). Remember that you’re not training the users, so task failure is acceptable, and useful to us.

I had a fair bit of influence over the quality of the research, since I developed the welcome script and test scenarios, provided the sample data, and analysed the Morae recordings once they arrived in North America. Due to poor Internet bandwidth to the isolated areas of my study participants, the product managers had to ship me the Morae recordings on DVD, by courier.

It worked. I also believe that amateur facilitation gave the product managers an additional opportunity to learn about customers.

Which user involvement works

User-centred design (UCD) advocates involving users in the design process. Have you wondered what forms that user involvement can take, and which forms lead to the most successful outcomes?

I recently came across data that Mark Keil published a while ago. He surveyed software companies and correlated project outcomes with the type of user access that designers and developers had.

Type of contact with users    Effectiveness (longer bar = more effective)
  For custom software projects
  Facilitate teams, hold structured workshop with users, or use joint-application development (JAD). ██████████
  Expose users to a UI prototype or early version to uncover any UI issues. ██████
  Expose users to a prototype or early version to discover the system requirements. ████
  Hold one-on-one semi-structured or open-ended interviews with users. ████
  Test the product internally (acceptance testing rather than QA testing for bugs) to uncover new requirements. ██
  Use an intermediary to define user goals and needs, and to convey them to designers and developers. ██
  Collect user questions, requirements, and problems indirectly, by e-mail or online locations.
  For packaged or mass-market software projects
  Listen to live/synchronous phone support, tech-support, or help-desk calls. ████████
  Hold one-on-one semi-structured or open-ended interviews with users. ██████
  Expose users to a UI prototype or early version to uncover any UI issues. ████
  Convene a group of users, from time to time, to discuss usage and improvements. ████
  Expose users to a prototype or early version to discover the system requirements. ██
  Test the product internally (acceptance testing rather than QA testing for bugs) to uncover new requirements. ██
  Consult marketing and sales people who regularly meet with and listen to customers. ██
  At trade shows, show a mock-up or prototype to users and get their feedback.
  Not reported as effective in this 1995 source
  Conduct a (text) survey of a sample of users.
  Conduct a usability test to “tape and measure” users in a formal usability lab. (This study precedes such products as TechSmith Morae.)
  Observe users for an extended period, or conduct ethnographic research.
  Conduct focus groups to discuss the software.

Although Keil’s article includes quantitative data, his samples are small. I opted to show only the relative usefulness of various methods. My descriptions, above, are long because the original article uses 1995 terms that have shifted in meaning. I believe some of the categories now overlap, due to changes in technology and method. For example, getting users to try a prototype of the UI in order to uncover UI issues sounds like the early usability testing that I do with TechSmith Morae, yet the 1995 results gave these activities very different effectiveness ratings.

For details, see the academic article by Mark Keil (Customer-developer links in software development). Educational publishers typically require a fee for access.

These methods also relate to research I’m doing on the epistemology of usability analysis.

Are usability studies experiments?

When I conduct usability studies, I use a laptop and Morae to create a portable and low-cost usability lab. I typically visit the participant on site, so I can look around. I provide a scenario that participants can follow as they try using a new software feature. Morae records their voices, facial expressions, on-screen actions, clicks, and typing. The raw data gets saved in a searchable and graphable format. Afterward, I review the recorded data (I rarely have written notes) and make evidence-based recommendations to the feature’s project team.

For example, in one usability study I realised that the order of the steps was causing confusion and doubt. The three-step wizard would disappear after step 2 so that users could modify their data on the screen. Users thought the task was complete after step 2, so the reappearance of the wizard for step 3 caused confusion. I recommended we change the order of the steps:

Change the order of the steps

Changing the order fixed the “confused users” problem. But was this scientific? Aside from the fact that I’m researching human subjects rather than test tubes, am I following a scientific model? At first glance, the usability test I conducted doesn’t seem to follow the positivist model of what constitutes an experiment:

A positivist model

For example, unlike the test-tube lab experiment of a chemist, my usability test had no control group. Also, my report went beyond factual conclusions to actual recommendations (based on my expert opinion) for the next iteration.

I can argue this both ways.

On the one hand, my usability test does have a control group if I take the next iteration of the product and repeat the usability test with additional participants, to see whether my recommendations solved the problem. I could compare the task-completion rates and the task durations.

A model of usability testing
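
As a minimal sketch of that comparison, with hypothetical data from six participants per round (round 1 = original step order, round 2 = reordered steps):

    from statistics import mean

    completed_r1 = [1, 0, 1, 0, 0, 1]            # 1 = participant completed the task
    completed_r2 = [1, 1, 1, 0, 1, 1]
    seconds_r1 = [312, 298, 401, 275, 390, 310]  # time on task, in seconds
    seconds_r2 = [201, 185, 240, 230, 190, 205]

    print(f"Completion rate: {mean(completed_r1):.0%} -> {mean(completed_r2):.0%}")
    print(f"Mean time on task: {mean(seconds_r1):.0f}s -> {mean(seconds_r2):.0f}s")

With usability-study samples this small, treat such differences as directional evidence rather than statistical proof.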

On the other hand, if I were asked to determine whether the product has an “efficient” workflow or a “great” user experience—which are subjective measures—I’d say a positivist-model experiment is inappropriate. To measure a user’s confusion or satisfaction, I might consider their facial expressions, verbal utterances, and self-reported ratings. This calls for a research design whose epistemology is rooted in post-positivist, ethnomethodological, situated, standpoint, or critical approaches, and has more in common with research done by an ethnographer than by a chemist.

If you liked this, you may also like Epistemology of usability studies.

Epistemology of usability studies

Currently, I’m conducting research on usability analysis and on how Morae software might influence that. My research gaze is rather academic, in that I’m especially interested in the epistemology of usability analysis.

One of my self-imposed challenges is to make my research relevant to usability practitioners. I’m a practitioner and CUA myself, and I have little time for academic exercises because I work where the rubber hits the road. This blog post outlines what I’m up to.

At Simon Fraser University, I learned that epistemological approaches make different assumptions about what is knowable. On one side (below, left), it’s about numbers, rates, percentages, graphs, grids, and tables, and about proving absolute truths. On the other side (below, right), it’s about seeking objectivity while knowing that it’s impossible because everything has a cultural context. The epistemology you choose, when doing research, depends on what you believe. And the epistemology dictates what methods you use, and how you report your results.

You can be certain of what you know.  |  You cannot be objective about what you know.

Let’s look at some examples.

Study 1 fits with the view (above, left) that “you can be certain of what you know.” I plan and conduct a quantitative study to measure the time it takes a series of users to complete two common tasks in a software package: upgrading to the latest version of the software, and activating the software. I make appointments with users. In my workplace, I give each user a scenario and a computer. I observe them and time them as they complete the tasks by using the software package. My hope is that statistical analysis will give me results that I can report, including the average time on task with error bars, as the graph (right) illustrates.
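
As a minimal sketch of that analysis, with hypothetical timings from six participants: compute each task’s mean time on task, plus a 95% confidence interval to draw as error bars.

    from math import sqrt
    from statistics import mean, stdev

    times = {  # seconds per participant (hypothetical data)
        "Upgrade to the latest version": [420, 510, 388, 455, 600, 432],
        "Activate the software": [95, 120, 88, 140, 105, 99],
    }

    T_95 = 2.571  # t-value for a 95% confidence interval with 6 samples (5 degrees of freedom)
    for task, t in times.items():
        half_width = T_95 * stdev(t) / sqrt(len(t))
        print(f"{task}: mean {mean(t):.0f}s ± {half_width:.0f}s")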

Study 2 fits with the view (above, right) that “you cannot be objective about what you know” because all research takes place within a context. To lessen the impact of conducting research, I contact users to ask if I can study their workplace. I observe each user for a day. My hope is to analyse the materials and interaction that I’ve observed in context—complete with typical interruptions, distractions, and stimuli. Since a new software version has just been released, my hope is that I’ll get to observe them as they upgrade. I’ll report any usability issues, interaction-design hurdles, and unmet needs that I observe.

The above are compilations of studies I conducted.

  • Study 1 revealed several misunderstandings and installation problems, including a user who abandoned the installation process because he believed it was complete. I was able to report the task success rate and have the install wizard fixed.
  • Study 2 revealed that users write numbers on paper and then re-enter them elsewhere, which had not been observed when users visited our site for usability testing. One user told me: “I never install the latest version because the updates can be unstable,” and another said: “I only upgrade if there’s a fix for a feature I use” to avoid unexpected new defects. I was able to report the paper-based workaround and the users’ feelings about quality, for product managers to reflect in future requirements.

Clearly, there’s more than one way to conduct research, and not every method fits every team. That’s an idea that can be explored at length.

This has me wondering: which method fits what, when, where? Is there a relationship between a team’s development process and the approach to user research (epistemology) that it’s willing to embrace? …between its corporate usability maturity and the approach?

Those are two of the lines of inquiry in my research at Simon Fraser University.

If you liked this post, you may also like Are usability studies experiments?

Blended usability approach “best”

I received a brochure in the mail, titled Time for a Tune-up: How improving usability can improve the fortunes of your web site. It recommends this blend of usability methods:

  • Expert reviews focus on workflows and give the best results when the scope is clearly defined.
  • Usability studies are more time-consuming than expert reviews, but they are the best way to uncover real issues.
  • Competitor benchmarking looks at the wider context.

The brochure was written by online-marketing consultants with Web sites in mind, but its content is also relevant to other development activities.

Effectiveness of usability evaluation

Do you ever wonder how effective expert reviews and usability tests are? Apparently, they can be pretty good.

Rolf Molich and others have conducted a series of comparative usability evaluation (CUE) studies, in which a number of teams evaluated the same version of a web site or application. Each team chose its own preferred method—such as an expert review or a usability test. The teams’ reports were then evaluated and compared by a small group of experts.

What the first six CUE studies found

About 30% of reported problems were found by multiple teams; the remaining 70% were found by a single team only. Fortunately, of that 70%, only 12% were serious problems. In one CUE study, the average team reported 9 problems, so a typical team would overlook 1 or 2 serious problems that other teams caught. The process isn’t perfect, but teams found 80% or more of the serious problems.

Teams that used usability testing found more usability problems than expert reviewers did. However, expert reviewers are more productive (they found more issues per hour), as this graph from the fourth CUE study illustrates:

CUE study 4 results

Teams that found the most usability problems (over 15, when the average was 9) needed much more time than the other teams, as the graph above illustrates. Apparently, finding the last few problems takes the most time.

The CUE studies do not consider the politics of usability and software development. Are your developers sceptical of usability benefits? Usability studies provide credibility-boosting video as visual evidence. Are your developers using an Agile method? Expert reviews provide quick feedback for each iteration.

To learn more about comparative usability evaluation, read about the findings of all six CUE studies.