
Participants

In [18]:
%run prelude.ipy
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Breakdown of the Data

We ran experiments in two locations: locally in Bloomington, IN, and on Amazon's Mechanical Turk. A small number of participants, solicited via e-mail, completed the experiment on the web.

In [78]:
print len(experiments), "total participants"
experiments.location.value_counts()
162 total participants

Out[78]:
mturk          130
bloomington     29
web              3
dtype: int64

Trials were discarded if the participant provided an answer to an already completed trial (by manipulating the web page URL).
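As a hypothetical sketch of that filtering (the actual cleaning was done upstream, and raw_trials is an assumed name for the unfiltered answer log), keeping only the first answer per participant and program would look like this:

# Sketch only: drop repeat submissions made by revisiting a completed trial's URL,
# keeping the first answer each participant gave for each program base.
clean = raw_trials.drop_duplicates(["exp_id", "base"])
print len(raw_trials) - len(clean), "duplicate answers discarded"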

In [79]:
print len(trials), "total trials"
expected_trials = len(experiments) * 10
print expected_trials - len(trials), "trials were discarded"
1602 total trials
18 trials were discarded

Demographics

Participants were asked to report their age, gender, years of Python/programming experience, degree, and status as a Computer Science major.

In [80]:
plot.misc.demographics(experiments)
Out[80]:

Programs

Each participant completed 10 trials. For each trial, they were asked to predict the printed output of a simple Python program.

There were a total of 10 program categories (bases), with 2-3 versions per category (view code). Participants saw one program from each of the 10 bases, and were randomly assigned a version within each base.
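As a rough illustration only (not the experiment's actual assignment code), the random version assignment could be sketched from the programs table shown further below:

import random

# Illustrative sketch: pick one version at random for each of the 10 bases.
assignment = {}
for base, group in programs.groupby("base"):
    assignment[base] = random.choice(list(group.version))
print assignment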

Code metrics for each base/version are given below, including the number of characters (code_chars), lines (code_lines), cyclomatic complexity (cyclo_comp), Halstead effort (hal_effort), and Halstead volume (hal_volume). The number of characters and lines in the actual program output are also given (output_chars and output_lines, respectively).
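For reference, here is a minimal sketch of how metrics like these could be computed for a single program. The file path is hypothetical, and the notebook does not say which tool produced the complexity and Halstead numbers; the radon package is one option.

# Sketch only: simple size metrics for one (hypothetical) program file.
code = open("programs/overload_plusmixed.py").read()   # hypothetical path
code_chars = len(code)
code_lines = len(code.rstrip().split("\n"))
print code_chars, "chars,", code_lines, "lines"

# Cyclomatic complexity and Halstead metrics can be computed with a tool such as
# radon (an assumption about tooling, not necessarily what was used here):
#   from radon.complexity import cc_visit
#   cyclo_comp = sum(block.complexity for block in cc_visit(code))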

In [3]:
display_html(programs.set_index(["base", "version"]).to_html())
Out[3]:
base        version           code_chars  code_lines  cyclo_comp  hal_effort    hal_volume  output_chars  output_lines
between     functions         496         24          7           94192.063393  830.218507  33            3
between     inline            365         19          7           45596.278445  660.815630  33            3
counting    nospace           77          3           2           738.402323    82.044703   116           8
counting    twospaces         81          5           2           738.402323    82.044703   116           8
funcall     nospace           50          4           2           937.653743    109.392937  3             1
funcall     space             54          4           2           937.653743    109.392937  3             1
funcall     vars              72          7           2           1735.731282   154.287225  3             1
initvar     bothbad           103         9           3           3212.495182   212.396376  5             2
initvar     good              103         9           3           3212.495182   212.396376  6             2
initvar     onebad            103         9           3           2866.823438   208.496250  6             2
order       inorder           137         14          4           8372.306047   303.069902  6             1
order       shuffled          137         14          4           8372.306047   303.069902  6             1
overload    multmixed         78          11          1           2340.000000   120.000000  9             3
overload    plusmixed         78          11          1           3428.296498   117.206718  7             3
overload    strings           98          11          1           3428.296498   117.206718  21            3
partition   balanced          105         5           4           2896.001287   188.869649  26            4
partition   unbalanced        102         5           4           2382.342809   177.199052  19            3
partition   unbalanced_pivot  120         6           4           2707.766879   196.214991  19            3
rectangle   basic             293         18          2           18801.174998  396.335705  7             2
rectangle   class             421         21          5           43203.698685  620.148785  7             2
rectangle   tuples            277         14          2           15627.749381  403.817813  7             2
scope       diffname          144         12          3           2779.714286   188.000000  2             1
scope       samename          156         12          3           2413.342134   183.623858  2             1
whitespace  linedup           275         14          1           6480.000000   216.000000  13            3
whitespace  zigzag            259         14          1           6480.000000   216.000000  13            3

Grades

Participants' responses for each trial were graded on a scale from 0 to 10, with 10 being a perfect response. A grade of 7-9 was assigned to responses that were correct apart from minor formatting errors. Grading was done after the experiment was over, so participants did not receive feedback about their performance.
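Grades are stored per trial in grade_value, and a boolean grade_correct flag is used later in this notebook. A minimal sketch of how such a flag could be derived, assuming "correct" means a grade of 7 or higher:

# Assumption: a trial counts as correct if it scored at least 7/10
# (i.e., at most minor formatting errors).
trials["grade_correct"] = trials.grade_value >= 7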

As shown in the distributions below, most individual trials were given a perfect score (10/10). However, the median score for the entire experiment (out of 100) was 81.

In [18]:
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_grades_distribution(trials, ax=axes[0], color=kelly_colors[3])
plot.misc.total_grades_distribution(trials, ax=axes[1], color=kelly_colors[2])
fig.tight_layout()
fig
Out[18]:
In [83]:
print "Median experiment grade:", experiments.total_grade.median()
print "Median trial grade:", trials.grade_value.median()
Median experiment grade: 81.0
Median trial grade: 10.0

The distribution of grades varied by program base. Some bases, such as between, counting, and scope, had much lower average grades.

In [84]:
axes = plot.misc.grades_by_base(trials, figsize=(20, 10))
axes[0].figure
Out[84]:
In [85]:
trials.groupby("base").grade_value.mean().order()
Out[85]:
base
between       5.314465
scope         6.949686
counting      7.478261
partition     7.712500
initvar       7.725000
whitespace    8.575000
overload      8.720497
order         8.906832
funcall       9.216049
rectangle     9.522013
Name: grade_value, dtype: float64

Trial Duration

Experiments usually took around 15 minutes, and many trials were completed in about 1 minute.

In [16]:
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(trials, ax=axes[0], color=kelly_colors[0])
plot.misc.total_duration_distribution(experiments, ax=axes[1], color=kelly_colors[1])
fig.tight_layout()
fig
Out[16]:
In [95]:
exp_median = experiments.duration_sec.median()
print "Median experiment duration: {0} sec ({1:.02f} min)".format(exp_median, exp_median / 60.0)
print "Median trial duration:", trials.duration_ms.median() / 1000.0, "sec"
Median experiment duration: 773.0 sec (12.88 min)
Median trial duration: 55.0 sec

There appear to be some duration differences between correct and incorrect trials.

In [8]:
t_correct, t_incorrect = util.split_by_boolean(trials, "grade_correct")
print "Median trial duration (correct):", t_correct.duration_ms.median() / 1000.0, "sec"
print "Median trial duration (incorrect):", t_incorrect.duration_ms.median() / 1000.0, "sec"
Median trial duration (correct): 56.0 sec
Median trial duration (incorrect): 51.0 sec

The log distributions appear slightly different...

In [9]:
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(t_correct, ax=axes[0], color=kelly_colors[0], column="duration_ms_log")
plot.misc.trial_duration_distribution(t_incorrect, ax=axes[1], color=kelly_colors[1], column="duration_ms_log")
fig
Out[9]:

But the difference is not statistically significant under either a Wilcoxon test or a t-test on the log durations.

In [10]:
stats.wilcox_test(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
Out[10]:
0.896938977817479
In [12]:
scipy.stats.ttest_ind(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
Out[12]:
(-0.51472404673756544, 0.60681696098987525)

As with grades, the distribution of trial durations varies with program base. Some differences are apparent when we plot them all using the same x scale (from 0 to the maximum trial duration).

In [3]:
axes = plot.misc.durations_by_base(trials, figsize=(20, 10))
axes[0].figure
Out[3]:

Taking the log of the trial durations, we can more easily see some differences. counting, initvar, order, partition, rectangle, and scope cluster around 4 log sec while funcall and overload cluster around 3. between and whitespace are closer to 5.
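The duration_ms_log column used here and in the tests above is presumably the natural log of the trial duration in seconds (e^4 is roughly 55 sec, matching the median trial duration); a one-line sketch under that assumption:

import numpy

# Assumption: duration_ms_log is the natural log of the trial duration in seconds.
trials["duration_ms_log"] = numpy.log(trials.duration_ms / 1000.0)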

In [6]:
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), log=True)
axes[0].figure
Out[6]:

If we normalize durations by the number of lines in the programs, other differences are easier to see. For example, between and whitespace have similarly shaped distributions while counting, initvar, and order have relatively shorter tails than others.

Normalizing in this manner, though, makes the assumption that longer programs should take longer to read.
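A rough sketch of what the per-line normalization presumably amounts to, assuming trials carries base and version columns that can be joined to the programs table (norm_by_lines in plot.misc may differ in detail):

# Sketch only: divide each trial's duration by the code_lines of the program it used.
merged = trials.merge(programs, on=["base", "version"])
sec_per_line = (merged.duration_ms / 1000.0) / merged.code_lines
print "Median seconds per program line:", sec_per_line.median()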

In [3]:
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), norm_by_lines=True)
axes[0].figure
Out[3]:

Keystroke Coefficient

All keystrokes were recorded during each trial. The keystroke coefficient for a trial is the keystroke count normalized by the number of characters in the true program output.

A keystroke coefficient > 1 indicates that the participant typed more than was required -- either due to a mistake or to correct a previous answer. A keystroke coefficient < 1 is possible when copying and pasting text, which some participants did.
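A minimal sketch of that definition, assuming keystroke_count and output_chars are available as per-trial columns (the real trials frame already carries keystroke_coefficient precomputed):

# Sketch of the definition above; keystroke_count and output_chars are assumed
# per-trial columns.
trials["keystroke_coefficient"] = trials.keystroke_count / trials.output_chars.astype(float)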

In [7]:
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_keycoeff_distribution(trials, ax=axes[0], color=kelly_colors[5])
plot.misc.total_keycoeff_distribution(trials, ax=axes[1], color=kelly_colors[4])
fig.tight_layout()
fig
Out[7]:
In [17]:
print "Median experiment keystroke coefficient:", trials.groupby("exp_id").keystroke_coefficient.mean().median()
print "Median trial keystroke coefficient:", trials.keystroke_coefficient.median()
print "Mean trial keystroke coefficient:", trials.keystroke_coefficient.mean()
Median experiment keystroke coefficient: 1.2108227657
Median trial keystroke coefficient: 1.0
Mean trial keystroke coefficient: 1.42589488833

There are some interesting differences in the distributions of keystroke coefficients between program bases. For most programs, it looks like participants only typed as much as necessary (i.e., made few typing mistakes). The counting program, however, is skewed towards < 1, suggesting that participants copied and pasted portions of their responses.

In [2]:
axes = plot.misc.keystroke_coefficient_by_base(trials, figsize=(20, 10))
axes[0].figure
Out[2]:

Restricting ourselves to correct responses only, rectangle and whitespace are clear outliers with very long tails. Even when their answers were correct, a handful of participants typed a lot more than was necessary.

In [3]:
axes = plot.misc.keystroke_coefficient_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Keystroke Coefficient by Program (correct only)")
fig
Out[3]:

Looking at the percentage of correct trials per program base with a keystroke coefficient of $\le 1$, we can get a sense of which programs participants answered most efficiently when their responses were correct.

In [4]:
trials[trials.grade_correct].groupby("base")\
    .apply(lambda f: sum(f.keystroke_coefficient <= 1) / float(len(f)))\
    .order(ascending=False) * 100
Out[4]:
base
scope         82.926829
rectangle     76.158940
funcall       71.527778
initvar       68.644068
counting      62.105263
overload      58.208955
order         56.375839
partition     45.945946
whitespace    40.579710
between       37.500000
dtype: float64

Response Proportion

In [10]:
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_respprop_distribution(trials, ax=axes[0], color=kelly_colors[6])
plot.misc.total_respprop_distribution(trials, ax=axes[1], color=kelly_colors[7])
fig.tight_layout()
fig
Out[10]:
In [14]:
print "Median experiment response proportion:", trials.groupby("exp_id").response_proportion.mean().median()
print "Median trial response proportion:", trials.response_proportion.median()
Median experiment response proportion: 0.392692601812
Median trial response proportion: 0.390311790668

Related to keystrokes, we can compute the proportion of the trial that participants spent crafting their response. This is defined as the time from the first to last keystroke divided by the total trial duration.
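A sketch of that definition, assuming per-trial timestamps for the first and last keystroke (first_key_ms and last_key_ms are hypothetical column names; the real trials frame already has response_proportion):

# Sketch of the definition above; first_key_ms and last_key_ms are assumed
# per-trial timestamps in milliseconds since the start of the trial.
trials["response_proportion"] = (trials.last_key_ms - trials.first_key_ms) / trials.duration_ms.astype(float)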

Plotting the distributions of response proportions by program base, it's easy to see whether participants spent more of their time reading or writing. This does not always reflect which programs had more output. The overload program, for example, had a small amount of output, but the distribution below suggests that participants started giving their response right away rather than reading the whole program first. In contrast, funcall and scope were reading-heavy programs, with participants likely waiting until the end of the trial to start typing.

In [116]:
axes = plot.misc.response_proportion_by_base(trials, figsize=(20, 10))
axes[0].figure
Out[116]:

Surprisingly, restricting the distributions to correct answers only does not appear to significantly change things.

In [24]:
axes = plot.misc.response_proportion_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Response Proportion by Program (correct only)")
fig
Out[24]: