A potential pitfall when running Ibex experiments on Amazon Mechanical Turk

Ibex assigns participants to the lists of a Latin square in a way that can have serious unintended consequences in the form of spurious effects.

The problem: When you submit an experiment to Amazon Mechanical Turk, a lot of workers will immediately jump at it, but the rate of participation quickly decays (the distribution over time often looks like an exponential decay). For every participant, Ibex selects a stimulus list based on an internal counter, and this counter is incremented only when a participant submits their results. Unfortunately, this means that everyone who starts before the first submission arrives gets the same list of the Latin square, so the initial wave of participants all work on that one list and it ends up strongly overrepresented. This can lead to strong spurious effects that are due not to the experimental manipulation but to between-item differences. The problem is easy to miss, and I would not be surprised if some published results obtained with Ibex were false because of it.
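To make the mechanism concrete, here is a small simulation. It is a simplified sketch and not Ibex code: the function simulateListAssignment, the arrival times, and the fixed 10-minute duration are all made up for illustration. It assumes each participant gets the list counter % nLists when they start, and compares incrementing the counter at submission with incrementing it at the start.

```javascript
// Hypothetical simulation, not part of Ibex. All names and numbers are illustrative.
// arrivals: start times in minutes; duration: minutes each participant needs.
function simulateListAssignment(arrivals, duration, nLists, incrementOnStart) {
  const counts = new Array(nLists).fill(0);

  // One "start" event per participant, plus a "submit" event if the
  // counter is only incremented when results come in.
  const events = [];
  for (const t of arrivals) {
    events.push({ time: t, type: "start" });
    if (!incrementOnStart) events.push({ time: t + duration, type: "submit" });
  }

  // Process events in time order; on ties, submissions come first.
  events.sort(
    (a, b) =>
      a.time - b.time ||
      (a.type === b.type ? 0 : a.type === "submit" ? -1 : 1)
  );

  let counter = 0;
  for (const e of events) {
    if (e.type === "start") {
      counts[counter % nLists] += 1; // list is chosen from the current counter
      if (incrementOnStart) counter += 1;
    } else {
      counter += 1; // increment only on submission
    }
  }
  return counts;
}

// Eight workers jump at the experiment immediately, four trickle in later.
const arrivals = [0, 0, 0, 0, 0, 0, 0, 0, 12, 25, 40, 60];
console.log(simulateListAssignment(arrivals, 10, 4, false)); // [ 9, 1, 1, 1 ]
console.log(simulateListAssignment(arrivals, 10, 4, true));  // [ 3, 3, 3, 3 ]
```

With the increment at submission, nine of the twelve simulated participants end up on the first list. With the increment at the start, the lists come out even.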

The solution is fortunately very simple: There is an undocumented method to control when Ibex increments the counter. Add ["setcounter", "__SetCounter__", { }] to the var items = [] definition and add "setcounter" to your shuffle sequence definition. The position of "setcounter" within your shuffle sequence determines when the counter is incremented. For example, you could insert the increment right after the welcome page of the experiment:

var shuffleSequence = seq("welcome", "setcounter", …);

var items = [
    ["welcome", "Message", {html: 'Welcome to this experiment'}],
    // Increments the list counter as soon as the participant reaches this point.
    ["setcounter", "__SetCounter__", { }],
    …
];

Note that this procedure does not guarantee that your lists will be perfectly balanced, because some participants may start the experiment (thereby incrementing the counter) but not finish it. Linear mixed models should be fairly robust against these imbalances. However, you should still aim for a balanced data set, because otherwise the descriptive and the inferential statistics may show different patterns of results, which would be confusing (Simpson's paradox).
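A toy calculation shows how list imbalance alone can distort the descriptive statistics. The numbers are invented: two items with different baseline reading times (500 ms and 400 ms), two conditions A and B with no true effect, and a two-list Latin square. The function conditionMeans is a hypothetical helper, not something Ibex provides.

```javascript
// Hypothetical numbers for illustration only; there is no true condition effect.
const itemMeans = { item1: 500, item2: 400 }; // baseline times in ms

// Two-list Latin square: each list shows each item in one condition.
const lists = {
  list1: { A: "item1", B: "item2" },
  list2: { A: "item2", B: "item1" },
};

// Mean per condition, given how many participants worked on each list.
function conditionMeans(participantsPerList) {
  const sums = { A: 0, B: 0 };
  let n = 0;
  for (const [listName, count] of Object.entries(participantsPerList)) {
    for (const cond of ["A", "B"]) {
      sums[cond] += count * itemMeans[lists[listName][cond]];
    }
    n += count;
  }
  return { A: sums.A / n, B: sums.B / n };
}

console.log(conditionMeans({ list1: 5, list2: 5 })); // { A: 450, B: 450 }
console.log(conditionMeans({ list1: 9, list2: 1 })); // { A: 490, B: 410 }
```

With balanced lists the condition means are identical. With nine participants on one list and one on the other, an 80 ms difference appears that is purely a between-item difference.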

How can we obtain data sets that are perfectly balanced? Measuring more participants to fill up underrepresented lists is not an option, because MTurk has, in my experience, strong time-of-day and day-of-week effects. This means that collecting more data might just replace one problem with another, harder-to-detect problem. The safest solution therefore is to remove randomly selected participants from overrepresented lists until the lists are balanced.
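The random removal can be sketched as follows. This is a minimal illustration under assumptions: the function balanceLists and the record shape { id, list } are hypothetical, and in practice you would do this in whatever environment you analyze your data in. Each list is shuffled with a Fisher-Yates shuffle and then cut down to the size of the smallest list.

```javascript
// Hypothetical helper, not part of Ibex. Each participant record is assumed
// to look like { id: ..., list: ... }; adapt this to your own data format.
function balanceLists(participants, random = Math.random) {
  // Group participants by the list they worked on.
  const byList = new Map();
  for (const p of participants) {
    if (!byList.has(p.list)) byList.set(p.list, []);
    byList.get(p.list).push(p);
  }

  // Every list is cut down to the size of the smallest one.
  const target = Math.min(...[...byList.values()].map((g) => g.length));

  const kept = [];
  for (const group of byList.values()) {
    // Fisher-Yates shuffle on a copy, so the input is left untouched.
    const shuffled = group.slice();
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    kept.push(...shuffled.slice(0, target));
  }
  return kept;
}

// Example: list 0 is overrepresented, as after an initial rush of workers.
const sample = [
  { id: "p1", list: 0 }, { id: "p2", list: 0 }, { id: "p3", list: 0 },
  { id: "p4", list: 0 }, { id: "p5", list: 1 }, { id: "p6", list: 1 },
  { id: "p7", list: 2 }, { id: "p8", list: 2 }, { id: "p9", list: 2 },
];
console.log(balanceLists(sample).length); // 6 (two participants per list)
```

Note that a list nobody completed does not appear in the data at all, so this sketch cannot detect it. It is worth checking the number of lists separately. Decide on the removal procedure before looking at the results, so that the choice of whom to drop cannot be influenced by the outcome.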

More generally, this issue shows that there are potential pitfalls when doing online experiments that do not exist in lab-based studies. It is important to have these on the radar.