%run prelude.ipy
Breakdown of the Data
We ran experiments in two locations: locally in Bloomington, IN and on Amazon's Mechanical Turk. A small number of participants were solicited via e-mail and completed the experiment on the web.
print len(experiments), "total participants"
experiments.location.value_counts()
Trials were discarded if the participant provided an answer to an already completed trial (by manipulating the web page URL).
print len(trials), "total trials"
expected_trials = len(experiments) * 10
print expected_trials - len(trials), "trials were discarded"
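As a rough sketch of that filtering (the actual cleanup lives in the preprocessing code, and the exp_id/base column names are assumptions), duplicate answers for the same participant and program could be dropped by keeping only the first submission:
# Sketch only: keep the first answer each participant gave for a base.
# On the already-filtered trials frame this should drop nothing.
deduped = trials.drop_duplicates(["exp_id", "base"])
print len(trials) - len(deduped), "duplicate trials found"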
Demographics
Participants were asked to report their age, gender, years of Python/programming experience, degree, and status as a Computer Science major.
plot.misc.demographics(experiments)
Programs
Each participant completed 10 trials. For each trial, they were asked to predict the printed output of a simple Python program.
There were a total of 10 program categories (bases), with 2-3 versions in each category (view code). Participants saw one program from each of the 10 bases and were randomly assigned a version of each.
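As a rough illustration of the assignment (not the actual experiment code, which ran server-side), each participant gets one randomly chosen version per base:
import random

# Sketch only: map each base to a randomly selected version for one
# hypothetical participant.
versions_by_base = programs.groupby("base").version.unique()
assignment = dict((base, random.choice(list(versions)))
                  for base, versions in versions_by_base.iteritems())
print assignment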
Code metrics for each base/version are given below, including the number of characters (code_chars), lines (code_lines), cyclomatic complexity (cyclo_comp), Halstead effort (hal_effort), and Halstead volume (hal_volume). The number of characters and lines in the actual program output are also given (output_chars and output_lines, respectively).
display_html(programs.set_index(["base", "version"]).to_html())
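The character and line counts above can be recomputed directly from the program text and its expected output; the cyclomatic complexity and Halstead metrics come from standard static analysis tools (e.g., the radon package). A sketch of the size metrics, using a hypothetical file path:
# Sketch only: recompute the simple size metrics for one program.
# The path and expected_output value are placeholders.
code = open("programs/between_v1.py").read()
expected_output = "..."  # the program's true printed output

print "code_chars:  ", len(code)
print "code_lines:  ", len(code.splitlines())
print "output_chars:", len(expected_output)
print "output_lines:", len(expected_output.splitlines())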
Grades
Participants' responses for each trial were graded on a scale from 0 to 10, with 10 being a perfect response. A grade of 7-9 was assigned for a response that was correct except for minor formatting errors. Grading was done after the experiment was over, so participants did not receive feedback about their performance.
As shown in the distributions below, most individual trials were given a perfect score (10/10). However, the median score for the entire experiment (out of 100) was 81.
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_grades_distribution(trials, ax=axes[0], color=kelly_colors[3])
plot.misc.total_grades_distribution(trials, ax=axes[1], color=kelly_colors[2])
fig.tight_layout()
fig
print "Median experiment grade:", experiments.total_grade.median()
print "Median trial grade:", trials.grade_value.median()
The distribution of grades varied by program base. Some bases, such as between, counting, and scope, had much lower average grades.
axes = plot.misc.grades_by_base(trials, figsize=(20, 10))
axes[0].figure
trials.groupby("base").grade_value.mean().order()
Trial Duration
Experiments usually took around 15 minutes, and many trials were completed in about 1 minute.
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(trials, ax=axes[0], color=kelly_colors[0])
plot.misc.total_duration_distribution(experiments, ax=axes[1], color=kelly_colors[1])
fig.tight_layout()
fig
exp_median = experiments.duration_sec.median()
print "Median experiment duration: {0} sec ({1:.02f} min)".format(exp_median, exp_median / 60.0)
print "Median trial duration:", trials.duration_ms.median() / 1000.0, "sec"
There appear to be some duration differences between correct and incorrect trials.
t_correct, t_incorrect = util.split_by_boolean(trials, "grade_correct")
print "Median trial duration (correct):", t_correct.duration_ms.median() / 1000.0, "sec"
print "Median trial duration (incorrect):", t_incorrect.duration_ms.median() / 1000.0, "sec"
The log distributions appear slightly different...
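The duration_ms_log column is presumably just a log transform of the raw millisecond durations; a sketch (whether the prelude uses the natural log or log10 is an assumption here):
import numpy as np  # likely already imported by the prelude

# Sketch only: log-transform the raw trial durations.
duration_ms_log = np.log(trials.duration_ms)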
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(t_correct, ax=axes[0], color=kelly_colors[0], column="duration_ms_log")
plot.misc.trial_duration_distribution(t_incorrect, ax=axes[1], color=kelly_colors[1], column="duration_ms_log")
fig
However, the difference is not statistically significant.
stats.wilcox_test(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
scipy.stats.ttest_ind(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
As with grades, the distribution of trial durations varies with program base. Some differences are apparent when we plot them all using the same x scale (from 0 to the maximum trial duration).
axes = plot.misc.durations_by_base(trials, figsize=(20, 10))
axes[0].figure
Taking the log of the trial durations, we can more easily see some differences. counting, initvar, order, partition, rectangle, and scope cluster around 4 log sec, while funcall and overload cluster around 3. between and whitespace are closer to 5.
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), log=True)
axes[0].figure
If we normalize durations by the number of lines in the programs, other differences are easier to see. For example, between and whitespace have similarly shaped distributions, while counting, initvar, and order have relatively shorter tails than others.
Normalizing in this manner, though, makes the assumption that longer programs should take longer to read.
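A sketch of that normalization, assuming trials can be joined to the programs table on base and version:
# Sketch only: seconds per line of code for each trial.
merged = trials.merge(programs[["base", "version", "code_lines"]],
                      on=["base", "version"])
sec_per_line = (merged.duration_ms / 1000.0) / merged.code_lines
print "Median trial duration per code line:", sec_per_line.median(), "sec"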
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), norm_by_lines=True)
axes[0].figure
Keystroke Coefficient
All keystrokes were recorded during each trial. The keystroke coefficient for a trial is the keystroke count normalized by the number of characters in the true program output.
A keystroke coefficient > 1 indicates that the participant typed more than was required -- either due to a mistake or to correct a previous answer. A keystroke coefficient < 1 is possible when copying and pasting text, which some participants did.
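A sketch of that definition, assuming a raw per-trial keystroke count (the keystroke_count column name is an assumption) and the output length from the programs table:
# Sketch only: keystrokes typed divided by characters in the true output.
with_output = trials.merge(programs[["base", "version", "output_chars"]],
                           on=["base", "version"])
keycoeff = with_output.keystroke_count / with_output.output_chars.astype(float)
print "Median recomputed keystroke coefficient:", keycoeff.median()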
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_keycoeff_distribution(trials, ax=axes[0], color=kelly_colors[5])
plot.misc.total_keycoeff_distribution(trials, ax=axes[1], color=kelly_colors[4])
fig.tight_layout()
fig
print "Median experiment keystroke coefficient:", trials.groupby("exp_id").keystroke_coefficient.mean().median()
print "Median trial keystroke coefficient:", trials.keystroke_coefficient.median()
print "Mean trial keystroke coefficient:", trials.keystroke_coefficient.mean()
There are some interesting differences in the distributions of keystroke coefficients between program bases. For most programs, it looks like participants only typed as much as necessary (i.e., made few typing mistakes). The counting program, however, is skewed towards < 1, suggesting that participants copied and pasted portions of their responses.
axes = plot.misc.keystroke_coefficient_by_base(trials, figsize=(20, 10))
axes[0].figure
Restricting ourselves to correct responses only, rectangle and whitespace are clear outliers with very long tails. Even when their answers were correct, a handful of participants typed a lot more than was necessary.
axes = plot.misc.keystroke_coefficient_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Keystroke Coefficient by Program (correct only)")
fig