%run prelude.ipy
Breakdown of the Data
We ran experiments in two locations: locally in Bloomington, IN and on Amazon's Mechanical Turk. A small number of additional participants were recruited via e-mail and took part over the web.
print len(experiments), "total participants"
experiments.location.value_counts()
Trials were discarded if the participant provided an answer to an already completed trial (by manipulating the web page URL).
print len(trials), "total trials"
expected_trials = len(experiments) * 10
print expected_trials - len(trials), "trials were discarded"
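As a sanity check on that rule (the actual filtering happened upstream, so nothing should be dropped here), repeat answers can be identified by participant and program base:
# Each participant should have at most one answer per program base;
# anything beyond the first would be a repeat submission.
deduped = trials.drop_duplicates(["exp_id", "base"])
print len(trials) - len(deduped), "repeat submissions remain after cleaning"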
Demographics
Participants were asked to report their age, gender, years of python/programming experience, degree, and status as a Computer Science major.
plot.misc.demographics(experiments)
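A quick numeric summary can also be pulled straight from the experiments table (the column names below, such as age, py_years, and gender, are assumed to match the questions asked):
# Column names here are assumed from the questions above; adjust to the actual schema.
print "Median age:", experiments.age.median()
print "Median years of Python experience:", experiments.py_years.median()
experiments.gender.value_counts()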
Programs
Each participant completed 10 trials. For each trial, they were asked to predict the printed output of a simple Python program.
There were a total of 10 program categories (bases), with 2-3 versions in each category (view code). Participants saw one program from each of the 10 bases and were randomly assigned a version of each.
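As a quick check of that design (a sketch assuming trials carries base and version columns alongside exp_id), every participant should have at most one trial per base, and versions within a base should be roughly balanced:
# Every participant should have one trial for each of the 10 bases.
print "Max trials per (participant, base):", trials.groupby(["exp_id", "base"]).size().max()
# Versions were assigned randomly, so counts within a base should be roughly balanced.
trials.groupby(["base", "version"]).size()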
Code metrics for each base/version are given below, including the number of characters (code_chars), lines (code_lines), cyclomatic complexity (cyclo_comp), Halstead effort (hal_effort), and Halstead volume (hal_volume). The number of characters and lines in the actual program output are also given (output_chars and output_lines, respectively).
display_html(programs.set_index(["base", "version"]).to_html())
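For a rough sense of which versions are heaviest by these metrics, the same table can be ranked, e.g. by Halstead effort:
# Program versions ranked from most to least Halstead effort.
programs.set_index(["base", "version"]).hal_effort.order(ascending=False)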
Grades
Participants' responses for each trial were graded on a scale from 0 to 10, with 10 being a perfect response. A grade of 7-9 was assigned for a response that was correct apart from minor formatting errors. Grading was done after the experiment was over, so participants did not receive feedback about their performance.
As shown in the distributions below, most individual trials were given a perfect score (10/10). However, the median score for the entire experiment (out of 100) was 81.
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_grades_distribution(trials, ax=axes[0], color=kelly_colors[3])
plot.misc.total_grades_distribution(trials, ax=axes[1], color=kelly_colors[2])
fig.tight_layout()
fig
print "Median experiment grade:", experiments.total_grade.median()
print "Median trial grade:", trials.grade_value.median()
The distribution of grades varied by program base. Some bases, such as between, counting, and scope, had much lower average grades.
axes = plot.misc.grades_by_base(trials, figsize=(20, 10))
axes[0].figure
trials.groupby("base").grade_value.mean().order()
Trial Duration
Experiments usually took around 15 minutes, and many trials were completed in about 1 minute.
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(trials, ax=axes[0], color=kelly_colors[0])
plot.misc.total_duration_distribution(experiments, ax=axes[1], color=kelly_colors[1])
fig.tight_layout()
fig
exp_median = experiments.duration_sec.median()
print "Median experiment duration: {0} sec ({1:.02f} min)".format(exp_median, exp_median / 60.0)
print "Median trial duration:", trials.duration_ms.median() / 1000.0, "sec"
There appear to be some duration differences between correct and incorrect trials.
t_correct, t_incorrect = util.split_by_boolean(trials, "grade_correct")
print "Median trial duration (correct):", t_correct.duration_ms.median() / 1000.0, "sec"
print "Median trial duration (incorrect):", t_incorrect.duration_ms.median() / 1000.0, "sec"
The distributions of the log-transformed durations appear slightly different...
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(t_correct, ax=axes[0], color=kelly_colors[0], column="duration_ms_log")
plot.misc.trial_duration_distribution(t_incorrect, ax=axes[1], color=kelly_colors[1], column="duration_ms_log")
fig
However, the difference is not statistically significant.
stats.wilcox_test(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
scipy.stats.ttest_ind(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
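As a pure-SciPy cross-check of the rank-based comparison (no R dependency), a Mann-Whitney U test can be run on the same log durations:
import scipy.stats
# Mann-Whitney U is the two-sample rank test equivalent to the Wilcoxon rank-sum above.
# Note: older SciPy versions return a one-sided p-value here.
u_stat, p_value = scipy.stats.mannwhitneyu(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
print "Mann-Whitney U =", u_stat, "p =", p_value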
As with grades, the distribution of trial durations varies with program base. Some differences are apparent when we plot them all using the same x scale (from 0 to the maximum trial duration).
axes = plot.misc.durations_by_base(trials, figsize=(20, 10))
axes[0].figure
Taking the log of the trial durations, we can more easily see some differences. counting, initvar, order, partition, rectangle, and scope cluster around 4 log sec, while funcall and overload cluster around 3. between and whitespace are closer to 5.
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), log=True)
axes[0].figure
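The clusters described above can also be read off numerically from the per-base means of the same log-duration column used in the tests above:
# Per-base means of the log-transformed trial durations, largest first.
trials.groupby("base").duration_ms_log.mean().order(ascending=False)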
If we normalize durations by the number of lines in the programs, other differences are easier to see. For example, between and whitespace have similarly shaped distributions, while counting, initvar, and order have relatively shorter tails than the others.
Normalizing in this manner, though, makes the assumption that longer programs should take longer to read.
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), norm_by_lines=True)
axes[0].figure
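A per-line duration can also be computed directly by joining each trial to its program's line count (this sketch assumes trials carries the same base and version columns as programs):
# Join trials to program metrics and normalize duration by lines of code.
merged = trials.merge(programs[["base", "version", "code_lines"]], on=["base", "version"])
sec_per_line = (merged.duration_ms / 1000.0) / merged.code_lines
sec_per_line.groupby(merged.base).median()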
Keystroke Coefficient
All keystrokes were recorded during each trial. The keystroke coefficient for a trial is the keystroke count normalized by the number of characters in the true program output.
A keystroke coefficient > 1 indicates that the participant typed more than was required -- either due to a mistake or to correct a previous answer. A keystroke coefficient < 1 is possible when copying and pasting text, which some participants did.
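The stored keystroke_coefficient column should follow directly from this definition. As a sketch (keystroke_count is an assumed column name for the raw per-trial keystroke count):
# Sketch of the definition: keystrokes typed / characters in the true output.
# keystroke_count is an assumed column; output_chars comes from the programs table.
m = trials.merge(programs[["base", "version", "output_chars"]], on=["base", "version"])
(m.keystroke_count / m.output_chars.astype(float)).describe()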
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_keycoeff_distribution(trials, ax=axes[0], color=kelly_colors[5])
plot.misc.total_keycoeff_distribution(trials, ax=axes[1], color=kelly_colors[4])
fig.tight_layout()
fig
print "Median experiment keystroke coefficient:", trials.groupby("exp_id").keystroke_coefficient.mean().median()
print "Median trial keystroke coefficient:", trials.keystroke_coefficient.median()
print "Mean trial keystroke coefficient:", trials.keystroke_coefficient.mean()
There are some interesting differences in the distributions of keystroke coefficients between program bases. For most programs, it looks like participants only typed as much as necessary (i.e., made few typing mistakes). The counting program, however, is skewed towards < 1, suggesting that participants copied and pasted portions of their responses.
axes = plot.misc.keystroke_coefficient_by_base(trials, figsize=(20, 10))
axes[0].figure
Restricting ourselves to correct responses only, rectangle and whitespace are clear outliers with very long tails. Even when their answers were correct, a handful of participants typed a lot more than was necessary.
axes = plot.misc.keystroke_coefficient_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Keystroke Coefficient by Program (correct only)")
fig
Looking at the percentage of trials per program base with a keystroke coefficient of $\le 1$, we can get a sense of which programs participants answered correctly with the least extra typing.
trials[trials.grade_correct].groupby("base")\
.apply(lambda f: sum(f.keystroke_coefficient <= 1) / float(len(f)))\
.order(ascending=False) * 100
Response Proportion
Related to keystrokes, we can compute the proportion of each trial that a participant spent crafting their response: the time from the first to the last keystroke divided by the total trial duration.
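As a sketch of that definition (first_keystroke_ms and last_keystroke_ms are assumed column names for the recorded keystroke timestamps):
# Sketch only: the two keystroke-timestamp columns are assumed names.
resp_prop = (trials.last_keystroke_ms - trials.first_keystroke_ms) / trials.duration_ms.astype(float)
resp_prop.describe()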
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_respprop_distribution(trials, ax=axes[0], color=kelly_colors[6])
plot.misc.total_respprop_distribution(trials, ax=axes[1], color=kelly_colors[7])
fig.tight_layout()
fig
print "Median experiment response proportion:", trials.groupby("exp_id").response_proportion.mean().median()
print "Median trial response proportion:", trials.response_proportion.median()
Plotting the distributions of response proportions by program base, it's easy to see when participants spent their time reading versus writing. This does not always reflect which programs had more output. The overload program, for example, had a small amount of output, but the distribution below suggests that participants started giving their response right away rather than reading the whole program first. In contrast, funcall and scope were reading-heavy programs, with participants likely waiting until the end of the trial to start typing.
axes = plot.misc.response_proportion_by_base(trials, figsize=(20, 10))
axes[0].figure
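The reading-versus-writing interpretation can be summarized with per-base medians:
# Median response proportion per base; lower values mean more of the trial fell
# outside the typing window (i.e., more time spent reading).
trials.groupby("base").response_proportion.median().order()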
Surprisingly, restricting the distributions to correct answers only does not appear to significantly change things.
axes = plot.misc.response_proportion_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Response Proportion by Program (correct only)")
fig