```
%run prelude.ipy
```

## Breakdown of the Data

We ran experiments in two locations: locally in Bloomington, IN and on Amazon's Mechanical Turk. A small number of participants were also solicited via e-mail and completed the experiment on the web.

```
print len(experiments), "total participants"
experiments.location.value_counts()
```

Trials were discarded if the participant re-submitted an answer to an already completed trial (possible by manipulating the web page URL).

```
print len(trials), "total trials"
expected_trials = len(experiments) * 10
print expected_trials - len(trials), "trials were discarded"
```

### Demographics

Participants were asked to report their age, gender, years of Python/programming experience, degree, and status as a Computer Science major.

```
plot.misc.demographics(experiments)
```

### Programs

Each participant completed 10 **trials**. For each trial, they were asked to predict the printed output of a simple Python program.

There were a total of 10 program categories (**bases**), with 2-3 **versions** in each category. Participants saw one program from each of the 10 bases and were randomly assigned a version.
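
As a quick sanity check, each (participant, base) pair should contribute at most one trial. A minimal sketch, assuming `trials` carries `exp_id`, `base`, and `version` columns:

```
# Each participant should have seen at most one version of each base
versions_seen = trials.groupby(["exp_id", "base"]).version.nunique()
print "Max versions of one base seen by a participant:", versions_seen.max()
```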

Code metrics for each base/version are given below, including the number of characters (`code_chars`), lines (`code_lines`), cyclomatic complexity (`cyclo_comp`), Halstead effort (`hal_effort`), and Halstead volume (`hal_volume`). The number of characters and lines in the actual program output are also given (`output_chars` and `output_lines`, respectively).

```
display_html(programs.set_index(["base", "version"]).to_html())
```

### Grades

Participants' responses for each trial were graded on a scale from 0 to 10, with 10 being a **perfect** response. A grade of 7-9 was assigned for a **correct** response -- i.e., one correct apart from minor formatting errors. Grading was done after the experiment was over, so participants did not receive feedback about their performance.
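
The boolean grade flags used later in this notebook could be derived from this scale. A sketch, assuming `grade_value` holds the 0-10 trial grade and that `grade_correct` includes perfect responses:

```
# Hypothetical derivation of the grade flags from the 0-10 scale
trials["grade_perfect"] = trials.grade_value == 10
trials["grade_correct"] = trials.grade_value >= 7  # correct (7-9) or perfect (10)
```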

As shown in the distributions below, most individual trials were given a perfect score (10/10). However, the median score for the entire experiment (out of 100) was 81.

```
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_grades_distribution(trials, ax=axes[0], color=kelly_colors[3])
plot.misc.total_grades_distribution(trials, ax=axes[1], color=kelly_colors[2])
fig.tight_layout()
fig
```

```
print "Median experiment grade:", experiments.total_grade.median()
print "Median trial grade:", trials.grade_value.median()
```

The distribution of grades varied by program base. Some bases, such as `between`, `counting`, and `scope`, had much lower average grades.

```
axes = plot.misc.grades_by_base(trials, figsize=(20, 10))
axes[0].figure
```

```
trials.groupby("base").grade_value.mean().order()
```

### Trial Duration

Experiments usually took around 15 minutes, and many trials were completed in about 1 minute.

```
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(trials, ax=axes[0], color=kelly_colors[0])
plot.misc.total_duration_distribution(experiments, ax=axes[1], color=kelly_colors[1])
fig.tight_layout()
fig
```

```
exp_median = experiments.duration_sec.median()
print "Median experiment duration: {0} sec ({1:.02f} min)".format(exp_median, exp_median / 60.0)
print "Median trial duration:", trials.duration_ms.median() / 1000.0, "sec"
```

There appear to be some duration differences between correct and incorrect trials.

```
t_correct, t_incorrect = util.split_by_boolean(trials, "grade_correct")
print "Median trial duration (correct):", t_correct.duration_ms.median() / 1000.0, "sec"
print "Median trial duration (incorrect):", t_incorrect.duration_ms.median() / 1000.0, "sec"
```
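
`util.split_by_boolean` is a project helper; a minimal stand-in, assuming it simply partitions a DataFrame on a boolean column, might look like this:

```
def split_by_boolean(frame, column):
    """Return (true_rows, false_rows) partitioned on a boolean column."""
    mask = frame[column].astype(bool)
    return frame[mask], frame[~mask]
```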

The log-transformed duration distributions appear slightly different.
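
Here, `duration_ms_log` is assumed to be the natural log of the trial duration in seconds, consistent with the 3-5 "log sec" clusters discussed below. A sketch of the derivation:

```
import numpy  # likely already imported by the prelude

# Assumed definition: natural log of the trial duration in seconds
trials["duration_ms_log"] = numpy.log(trials.duration_ms / 1000.0)
```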

```
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_duration_distribution(t_correct, ax=axes[0], color=kelly_colors[0], column="duration_ms_log")
plot.misc.trial_duration_distribution(t_incorrect, ax=axes[1], color=kelly_colors[1], column="duration_ms_log")
fig
```

The difference, however, is not statistically significant.

```
stats.wilcox_test(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
```

```
scipy.stats.ttest_ind(t_correct.duration_ms_log, t_incorrect.duration_ms_log)
```

As with grades, the distribution of trial durations varies with program base. Some differences are apparent when we plot them all using the same x scale (from 0 to the maximum trial duration).

```
axes = plot.misc.durations_by_base(trials, figsize=(20, 10))
axes[0].figure
```

Taking the log of the trial durations, we can more easily see some differences. `counting`, `initvar`, `order`, `partition`, `rectangle`, and `scope` cluster around 4 log sec, while `funcall` and `overload` cluster around 3. `between` and `whitespace` are closer to 5.

```
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), log=True)
axes[0].figure
```

If we normalize durations by the number of lines in the programs, other differences are easier to see. For example, `between` and `whitespace` have similarly shaped distributions, while `counting`, `initvar`, and `order` have relatively shorter tails than others.

Normalizing in this manner, though, makes the assumption that longer programs should take longer to read.

```
axes = plot.misc.durations_by_base(trials, figsize=(20, 10), norm_by_lines=True)
axes[0].figure
```
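
A sketch of this per-line normalization, assuming the line counts come from the `programs` table (`code_lines`) joined on base and version:

```
# Join program line counts onto trials and normalize duration by lines
merged = trials.merge(programs[["base", "version", "code_lines"]],
                      on=["base", "version"])
merged["sec_per_line"] = (merged.duration_ms / 1000.0) / merged.code_lines
print merged.groupby("base").sec_per_line.median()
```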

### Keystroke Coefficient

All keystrokes were recorded during each trial. The **keystroke coefficient** for a trial is the keystroke count normalized by the number of characters in the true program output.

A keystroke coefficient > 1 indicates that the participant typed more than was required -- either due to a mistake or to correct a previous answer. A keystroke coefficient < 1 is possible when copying and pasting text, which some participants did.
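
A sketch of the computation, assuming a raw keystroke count column (here hypothetically named `keystroke_count`) and the `output_chars` metric from the `programs` table:

```
# keystroke coefficient = keystrokes typed / characters in the true output
merged = trials.merge(programs[["base", "version", "output_chars"]],
                      on=["base", "version"])
merged["keystroke_coefficient"] = (merged.keystroke_count /
                                   merged.output_chars.astype(float))
```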

```
fig, axes = pyplot.subplots(1, 2, figsize=(12, 5))
plot.misc.trial_keycoeff_distribution(trials, ax=axes[0], color=kelly_colors[5])
plot.misc.total_keycoeff_distribution(trials, ax=axes[1], color=kelly_colors[4])
fig.tight_layout()
fig
```

```
print "Median experiment keystroke coefficient:", trials.groupby("exp_id").keystroke_coefficient.mean().median()
print "Median trial keystroke coefficient:", trials.keystroke_coefficient.median()
print "Mean trial keystroke coefficient:", trials.keystroke_coefficient.mean()
```

There are some interesting differences in the distributions of keystroke coefficients between program bases. For most programs, it looks like participants only typed as much as necessary (i.e., made few typing mistakes). The `counting` program, however, is skewed towards values < 1, suggesting that participants copied and pasted portions of their responses.

```
axes = plot.misc.keystroke_coefficient_by_base(trials, figsize=(20, 10))
axes[0].figure
```
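
One way to quantify that skew is the fraction of trials per base with a coefficient below 1 (a sketch):

```
# Fraction of trials per base with fewer keystrokes than output characters
below_one = trials.groupby("base").keystroke_coefficient.apply(
    lambda s: (s < 1.0).mean())
print below_one.order(ascending=False)
```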

Restricting ourselves to correct responses only, `rectangle` and `whitespace` are clear outliers with very long tails. Even when their answers were correct, a handful of participants typed a lot more than was necessary.

```
axes = plot.misc.keystroke_coefficient_by_base(trials[trials.grade_correct], figsize=(20, 10))
fig = axes[0].figure
fig.suptitle("Keystroke Coefficient by Program (correct only)")
fig
```