%run prelude.ipy
Breakdown of the Data
We ran experiments in two locations: locally in Bloomington, IN and on Amazon's Mechanical Turk. A small number of additional participants were recruited via e-mail and took part over the web.
print len(experiments), "total participants"
experiments.location.value_counts()
Trials were discarded if the participant provided an answer to an already completed trial (by manipulating the web page URL).
print len(trials), "total trials"
expected_trials = len(experiments) * 10
print expected_trials - len(trials), "trials were discarded"
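As a sanity check on that rule (the actual filtering happened upstream, so nothing should be dropped here), repeat answers can be identified by participant and program base:
# Each participant should have at most one answer per program base;
# anything beyond the first would be a repeat submission.
deduped = trials.drop_duplicates(["exp_id", "base"])
print len(trials) - len(deduped), "repeat submissions remain after cleaning"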
Demographics
Participants were asked to report their age, gender, years of python/programming experience, degree, and status as a Computer Science major.
plot.misc.demographics(experiments)
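A quick numeric summary can also be pulled straight from the experiments table (the column names below, such as age, py_years, and gender, are assumed to match the questions asked):
# Column names here are assumed from the questions above; adjust to the actual schema.
print "Median age:", experiments.age.median()
print "Median years of Python experience:", experiments.py_years.median()
experiments.gender.value_counts()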
Programs
Each participant completed 10 trials. For each trial, they were asked to predict the printed output of a simple Python program.
There were a total of 10 program categories (bases), with 2-3 versions in each category (view code). Participants saw one program from each of the 10 bases and were randomly assigned a version of each.
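As a quick check of that design (a sketch assuming trials carries base and version columns alongside exp_id), every participant should have at most one trial per base, and versions within a base should be roughly balanced:
# Every participant should have one trial for each of the 10 bases.
print "Max trials per (participant, base):", trials.groupby(["exp_id", "base"]).size().max()
# Versions were assigned randomly, so counts within a base should be roughly balanced.
trials.groupby(["base", "version"]).size()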
Code metrics for each base/version are given below, including the number of characters (code_chars), lines (code_lines), cyclomatic complexity (cyclo_comp), Halstead effort (hal_effort), and Halstead volume (hal_volume). The number of characters and lines in the actual program output are also given (output_chars and output_lines, respectively).
display_html(programs.set_index(["base", "version"]).to_html())
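For a rough sense of which versions are heaviest by these metrics, the same table can be ranked, e.g. by Halstead effort:
# Program versions ranked from most to least Halstead effort.
programs.set_index(["base", "version"]).hal_effort.order(ascending=False)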
Grades
Participants' responses for each trial were graded on a scale from 0 to 10, with 10 being a perfect response. A grade of 7-9 was assigned for a response that was correct apart from minor formatting errors. Grading was done after the experiment was over, so participants did not receive feedback about their performance.
As shown in the distributions below, most individual trials were given a perfect score (10/10). However, the median score for the entire experiment (out of 100) was 81.
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_grades_distribution(trials, ax=axes[0], color=kelly_colors[3])
plot.misc.total_grades_distribution(trials, ax=axes[1], color=kelly_colors[2])
fig.tight_layout()
fig
print "Median experiment grade:", experiments.total_grade.median()
print "Median trial grade:", trials.grade_value.median()
The distribution of grades varied by program base. Some bases, such as between, counting, and scope, had much lower average grades.
axes = plot.misc.grades_by_base(trials, figsize=(20, 10))
axes[0].figure
trials.groupby("base").grade_value.mean().order()
Trial Duration
Experiments usually took around 15 minutes, and many trials were completed in about 1 minute.
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(trials, ax=axes[0], color=kelly_colors[0])
plot.misc.total_duration_distribution(experiments, ax=axes[1], color=kelly_colors[1])
fig.tight_layout()
fig
exp_median = experiments.duration_sec.median()
print "Median experiment duration: {0} sec ({1:.02f} min)".format(exp_median, exp_median / 60.0)
print "Median trial duration:", trials.duration_ms.median() / 1000.0, "sec"
There appear to be some duration differences between correct and incorrect trials.
t_correct, t_incorrect = util.split_by_boolean(trials, "grade_correct")
print "Median trial duration (correct):", t_correct.duration_ms.median() / 1000.0, "sec"
print "Median trial duration (incorrect):", t_incorrect.duration_ms.median() / 1000.0, "sec"
The distributions of the log-transformed durations appear slightly different...
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(t_correct, ax=axes[0], color=kelly_colors[0], column="duration_ms_log")
plot.misc.trial_duration_distribution(t_incorrect, ax=axes[1], color=kelly_colors[1], column="duration_ms_log")
fig
However, the difference is not statistically significant.
stats.wilcox_test(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
scipy.stats.ttest_ind(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
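As a pure-SciPy cross-check of the rank-based comparison (no R dependency), a Mann-Whitney U test can be run on the same log durations:
import scipy.stats
# Mann-Whitney U is the two-sample rank test equivalent to the Wilcoxon rank-sum above.
# Note: older SciPy versions return a one-sided p-value here.
u_stat, p_value = scipy.stats.mannwhitneyu(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
print "Mann-Whitney U =", u_stat, "p =", p_value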
As with grades, the distribution of trial durations varies with program base. Some differences are apparent when we plot them all using the same x scale (from 0 to the maximum trial duration).
axes = plot.misc.durations_by_base(trials, figsize=(20, 10))
axes[0].figure
Taking the log of the trial durations, we can more easily see some differences. counting, initvar, order, partition, rectangle, and scope cluster around 4 log sec, while funcall and overload cluster around 3. between and whitespace are closer to 5.
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), log=True)
axes[0].figure
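The clusters described above can also be read off numerically from the per-base means of the same log-duration column used in the tests above:
# Per-base means of the log-transformed trial durations, largest first.
trials.groupby("base").duration_ms_log.mean().order(ascending=False)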
If we normalize durations by the number of lines in the programs, other differences are easier to see. For example, between and whitespace have similarly shaped distributions, while counting, initvar, and order have relatively shorter tails than the others.
Normalizing in this manner, though, makes the assumption that longer programs should take longer to read.
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), norm_by_lines=True)
axes[0].figure
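A per-line duration can also be computed directly by joining each trial to its program's line count (this sketch assumes trials carries the same base and version columns as programs):
# Join trials to program metrics and normalize duration by lines of code.
merged = trials.merge(programs[["base", "version", "code_lines"]], on=["base", "version"])
sec_per_line = (merged.duration_ms / 1000.0) / merged.code_lines
sec_per_line.groupby(merged.base).median()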
Keystroke Coefficient
All keystrokes were recorded during each trial. The keystroke coefficient for a trial is the keystroke count normalized by the number of characters in the true program output.
A keystroke coefficient > 1 indicates that the participant typed more than was required -- either due to a mistake or to correct a previous answer. A keystroke coefficient < 1 is possible when copying and pasting text, which some participants did.
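The stored keystroke_coefficient column should follow directly from this definition. As a sketch (keystroke_count is an assumed column name for the raw per-trial keystroke count):
# Sketch of the definition: keystrokes typed / characters in the true output.
# keystroke_count is an assumed column; output_chars comes from the programs table.
m = trials.merge(programs[["base", "version", "output_chars"]], on=["base", "version"])
(m.keystroke_count / m.output_chars.astype(float)).describe()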
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_keycoeff_distribution(trials, ax=axes[0], color=kelly_colors[5])
plot.misc.total_keycoeff_distribution(trials, ax=axes[1], color=kelly_colors[4])
fig.tight_layout()
fig
print "Median experiment keystroke coefficient:", trials.groupby("exp_id").keystroke_coefficient.mean().median()
print "Median trial keystroke coefficient:", trials.keystroke_coefficient.median()
print "Mean trial keystroke coefficient:", trials.keystroke_coefficient.mean()
There are some interesting differences in the distributions of keystroke coefficients between program bases. For most programs, it looks like participants only typed as much as necessary (i.e., made few typing mistakes). The counting program, however, is skewed towards < 1, suggesting that participants copied and pasted portions of their responses.
axes = plot.misc.keystroke_coefficient_by_base(trials, figsize=(20, 10))
axes[0].figure
Restricting ourselves to correct responses only, rectangle and whitespace are clear outliers with very long tails. Even when their answers were correct, a handful of participants typed a lot more than was necessary.
axes = plot.misc.keystroke_coefficient_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Keystroke Coefficient by Program (correct only)")
fig
Looking at the percentage of trials per program base with a keystroke coefficient of $\le 1$, we can get a sense of which programs participants answered correctly with the least extra typing.
trials[trials.grade_correct].groupby("base")\
.apply(lambda f: sum(f.keystroke_coefficient <= 1) / float(len(f)))\
.order(ascending=False) * 100
Response Proportion
Related to keystrokes, we can compute the proportion of each trial that a participant spent crafting their response: the time from the first to the last keystroke divided by the total trial duration.
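As a sketch of that definition (first_keystroke_ms and last_keystroke_ms are assumed column names for the recorded keystroke timestamps):
# Sketch only: the two keystroke-timestamp columns are assumed names.
resp_prop = (trials.last_keystroke_ms - trials.first_keystroke_ms) / trials.duration_ms.astype(float)
resp_prop.describe()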
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_respprop_distribution(trials, ax=axes[0], color=kelly_colors[6])
plot.misc.total_respprop_distribution(trials, ax=axes[1], color=kelly_colors[7])
fig.tight_layout()
fig
print "Median experiment response proportion:", trials.groupby("exp_id").response_proportion.mean().median()
print "Median trial response proportion:", trials.response_proportion.median()
Plotting the distributions of response proportions by program base, it's easy to see when participants spent their time reading versus writing. This does not always reflect which programs had more output. The overload program, for example, had a small amount of output, but the distribution below suggests that participants started giving their response right away rather than reading the whole program first. In contrast, funcall and scope were reading-heavy programs, with participants likely waiting until the end of the trial to start typing.
axes = plot.misc.response_proportion_by_base(trials, figsize=(20, 10))
axes[0].figure
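The reading-versus-writing interpretation can be summarized with per-base medians:
# Median response proportion per base; lower values mean more of the trial fell
# outside the typing window (i.e., more time spent reading).
trials.groupby("base").response_proportion.median().order()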
Surprisingly, restricting the distributions to correct answers only does not appear to significantly change things.
axes = plot.misc.response_proportion_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Response Proportion by Program (correct only)")
fig