User studies. A necessary evil. They are time consuming, tricky to get right, sometimes obnoxious, and, worst of all, they make us deal with humans. But... we still have to do them. We can have a piece of software with all the testing and documentation in the world, but if it doesn't verifiably help people, what is the point? Simply having a large user base does not indicate that people like your program; they may simply need to use it. User testing verifies that people (that really fancy script that clicks on links isn't a person) can actually use your program and understand it.
Coming up with questions that meet your goal is actually incredibly difficult. There are two main issues I ran into: testing what I actually wanted to test, and making it clear to the participants what they were trying to do.
How do we remedy this? With a user study of the user study. In other words, run a pilot study. You'll want three to five users in this pilot. These participants should come from the same demographic as the intended users of the program and, obviously, the same demographic as the people who will actually be in your study. You don't want the pilot to have too many people, or you will be wasting time and, more importantly, wasting participants: you can't reuse participants from the pilot study in your actual user study.
Don't just stop at testing the questions using the pilot study. Actually collect the results, then use this data to figure out what kind of analysis you will want to run.
There are two main types of questions: quantitative and qualitative. While academic software engineering papers are starting to mix the two, reviewers seem to have a strong preference for quantitative (I don't have a citation, so this is just anecdotal). Quantitative studies are easier for at least two reasons: first, the numbers are concrete and cannot be argued with (assuming you didn't cheat while conducting the study); second, the answers are numbers, which are very easy to compare once you get to analysis.
Numbers only tell half the story, though. They tell the "what", not so much the "why". It's very hard to concretely describe the "why". Your participants will likely have many reasons behind why they did something, even if the resulting "what" is the same, so you'll probably need more participants. The other issue is that qualitative results take up more room in your paper.
Now you have your questions figured out, the tasks participants are going to do, and everything ready to go. There is a bias that stems from doing things in a certain order (people learn from the earlier tasks), so you'll need to randomize the ordering to make sure this doesn't skew your results. I used a short Python script to handle this for me. I would run the script for every participant, and it would generate a script for me to read to them. It randomized the order in which the tools were used and the order of the tasks. So long as I read the script correctly, the study would be conducted correctly.
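The author's actual script isn't shown; a minimal sketch of the idea might look like this (the tool and task names are placeholders, not the ones from any real study):

```python
import random

# Hypothetical tool and task names -- stand-ins for the study's real ones.
TOOLS = ["tool_a", "tool_b"]
TASKS = ["task_1", "task_2", "task_3"]

def make_session_script(participant_id, seed=None):
    """Generate the text read aloud to one participant, with tool order
    and per-tool task order independently shuffled to avoid ordering bias."""
    rng = random.Random(seed)
    tools = TOOLS[:]
    rng.shuffle(tools)
    lines = [f"Session script for participant {participant_id}"]
    for tool in tools:
        lines.append(f"Now please use {tool}.")
        tasks = TASKS[:]
        rng.shuffle(tasks)
        for task in tasks:
            lines.append(f"  Task: {task}")
    return "\n".join(lines)

print(make_session_script(1))
```

Passing a seed makes a participant's ordering reproducible, which is handy if you ever need to re-generate the exact script you read to someone.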
I used screen + audio capture. This turned out to be really good for a lot of reasons, with one drawback: I had to sit there, watch every recording, and extract the key pieces of information. The rest of it was very nice.
First, it makes the study feel less like a study and more like a conversation. Your environment will likely feel unusually clean, which will already put your participant ill at ease. It won't help if you are sitting there taking notes and playing with a timer while watching their every move. The video recording was invisible to them (though they were informed via the consent form), so it felt less like being a lab rat under examination and a little more like being an actual user of a tool.
Once the videos were collected, I could sit down, watch them, and extract the information from there. If I missed something the first time, I could go back and rewatch that part. I could also get more context from the video, whereas hand-written notes can only record what you wrote down at the time. Furthermore, I didn't have to fuss with a timer; the duration is right there in the video.
At least among the people I've talked to, CSV files are a popular way to store data. I think they are crazy. CSV files are easy to script against and easy to do data entry in, and that is about where their usefulness ends. I ended up using a sqlite database. Some might argue that a full relational database is overkill for storing something as small as a user study. They are right, but they are also wrong. Yes, I don't need ACID or transactions or any of that; on that point they are completely correct. So what does a relational database give me that a simple CSV file doesn't? Flexibility. If I want to look at different metrics in the data, I don't need to write another script to analyze it; I just write a query. When you are creating the CSV files, you don't know exactly what you'll be looking for in the data, and CSV files are terrible for experimenting. You would likely end up reimplementing most of the features of a relational database just to operate on the CSV files, so why not use a database and save yourself the trouble?
Some people laughed while I sat there writing hundreds of INSERT statements to add the data to the database. They stopped laughing when they asked about various aspects of the data and, with a single query, I was able to manipulate the data to show exactly what they wanted in a matter of seconds.
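A minimal sketch of that workflow using Python's built-in sqlite3 module (the schema, column names, and values here are illustrative, not the study's actual data):

```python
import sqlite3

# In-memory database for illustration; a real study would use a file like "data.db".
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results (
    participant INTEGER, tool TEXT, question TEXT,
    time REAL, correct INTEGER)""")

# The tedious part: one INSERT per observation (values are made up).
rows = [
    (1, "tool_a", "q1", 34.2, 1),
    (1, "tool_b", "q1", 51.8, 0),
    (2, "tool_a", "q1", 29.5, 1),
    (2, "tool_b", "q1", 44.1, 1),
]
con.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)

# The payoff: any new question about the data is just another query.
for tool, avg_time, n_correct in con.execute(
        """SELECT tool, avg(time), sum(correct)
           FROM results GROUP BY tool ORDER BY tool"""):
    print(tool, round(avg_time, 1), n_correct)
```

Answering a different question (per-participant times, per-question accuracy, ...) means changing the query, not writing a new parsing script.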
You don't need a full database server, which is why sqlite is a great option. It's simple, fast enough, and it puts the data into a single file instead of some seemingly hidden location on the hard drive. That last point is quite useful, since your participant consent form will likely need to describe how the data will be destroyed in a secure manner once the study is done.
Before I go too deeply into this: I'm a vim user by default, but the org-mode feature in emacs so completely destroys what vim can do that emacs is really the best option here.
Org-mode. It's a great tool for research. It's like Markdown on steroids. Vim tries to have an org-mode, but it falls so far short of the real thing that it can barely call itself the same feature. Most of my emacs configuration lives in an org file.
So if it's just Markdown, why is it good for analysis? Because you can run code directly from within the document and have the results placed directly below the code. If anyone has any questions about the method used for analysis, you point them to the org-mode file. It has the queries you made and the results of those queries. It's great for completely reproducible and verifiable research.
A quick example:
#+name: results
#+begin_src sqlite :db ./data/data.db :colnames yes
  SELECT question, tool, avg(time),
         count(*) total,
         count(CASE WHEN correct = 1 THEN 1 END) correct
  FROM results
  GROUP BY question, tool;
#+end_src
Hitting C-c C-c inside the block creates the result table directly in the document. GitHub will even render this nicely, syntax-highlighting the query and making the table look pretty.
Furthermore, R scripts, which are really great for analyzing numerical data, can be run in the same manner, and since emacs can render images inline, the plots they generate can be embedded directly in your document. It makes for a very fast and effortless workflow once you get used to emacs.
R is amazing for performing statistical analysis and generating plots. Use it. You can program in R like a normal programming language, with conditionals and loops; that part isn't special, but the extension packages are. Most statistical tests are available, you apply the ones you want, and the results give you both the test statistic and the p-value. With the RSQLite backend, you can connect R directly to the database: no CSV files to manipulate externally, just query for the data you are interested in and perform the analysis on that. Very clean, very easy, and completely reproducible.
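I can't reproduce the R code here, but the pattern itself (query straight from the database, analyze in memory, no intermediate CSVs) works in any language. A rough Python analogue with the stdlib statistics module, using made-up toy data in place of the study database:

```python
import sqlite3
import statistics

# Toy in-memory stand-in for the study database (all values invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (tool TEXT, time REAL)")
con.executemany("INSERT INTO results VALUES (?, ?)",
                [("tool_a", 30.0), ("tool_a", 34.0),
                 ("tool_b", 50.0), ("tool_b", 46.0)])

# Query only the data of interest, straight from the database.
times = {}
for tool, t in con.execute("SELECT tool, time FROM results"):
    times.setdefault(tool, []).append(t)

# Summarize per tool; in R this is where a t.test() or wilcox.test() would go.
for tool, ts in sorted(times.items()):
    print(tool, statistics.mean(ts), statistics.pstdev(ts))
```

In R with RSQLite the shape is the same: dbConnect, dbGetQuery, then hand the resulting columns to the test you want.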
So, these are things that I learned from running my first user study. Some I learned the hard way; some seemed inherently clear and worked well (confirmation bias, yes, but it worked).
So, a quick recap:
- It's good to (completely) test the test to make sure it works before actually using it.
- Use a script to script the study.
- Screen capture + audio is a great way to collect the results.
- Keep the results in an actual DBMS; it helps for analysis.
- emacs + org-mode + R + database makes the final analysis easy.
Let me know what you have learned and I'll start a little list of ideas for making user studies as painless as possible.