Item analysis can be a powerful technique to help guide and improve instruction. For this purpose, the items analyzed should measure instructional objectives and be diagnostic, so that knowing which incorrect options students select gives a clue to the nature of their misunderstanding and points toward remediation.

Instructors who develop their own examinations can also improve the effectiveness of test items by selecting and rewriting items based on item performance data. This data is available to instructors who have their exam sheets scored by the Scoring Office.

## Item Analysis Reports

As answer sheets are scored, records are written which contain each student’s score and his or her response to each item on the test. These records are then processed and an item analysis report file is generated. An instructor may obtain test score distributions and a list of students’ scores, in alphabetic order, in student number order, in percentile rank order, and/or in order of percentage of total points.

Instructors are sent their item analysis reports as email attachments. The item analysis report is contained in the file IRPT####.TXT, where the four digits indicate the instructor’s GRADER III account. A sample of an individual long form item analysis listing for the item response pattern is shown below.

### Item 10 of 125. The correct option is 5.

Item Response Pattern

| Group | | 1 | 2 | 3 | 4 | 5 | Omit | Error | Total |
|---|---|---|---|---|---|---|---|---|---|
| Upper 27% | n | 2 | 8 | 0 | 1 | 19 | 0 | 0 | 30 |
| | % | 7 | 27 | 0 | 3 | 63 | 0 | 0 | 100 |
| Middle 46% | n | 3 | 20 | 3 | 3 | 23 | 0 | 0 | 52 |
| | % | 6 | 38 | 6 | 6 | 44 | 0 | 0 | 100 |
| Lower 27% | n | 6 | 5 | 8 | 2 | 9 | 0 | 0 | 30 |
| | % | 20 | 17 | 27 | 7 | 30 | 0 | 0 | 101 |
| Total | n | 11 | 33 | 11 | 6 | 51 | 0 | 0 | 112 |
| | % | 10 | 29 | 11 | 5 | 46 | 0 | 0 | 100 |

## Item Analysis Response Patterns

Each item is identified by number and the correct option is indicated. The group of students taking the test is divided into upper, middle, and lower groups on the basis of students’ scores on the test. This division is essential for providing information about how the distracters (incorrect options) operate and for computing an easily interpretable index of discrimination. It is generally accepted that optimal item discrimination is obtained when the upper and lower groups each contain 27% of the total group.

The number of students who selected each option or omitted the item is shown for each of the upper, middle, lower, and total groups. The number of students who marked more than one option for the item is indicated under the “error” heading. The percentage of each group who selected each of the options, omitted the item, or erred is also listed. Note that the total percentage for each group may not be exactly 100%, since percentages are rounded to the nearest whole number before totaling.
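The grouping and tabulation described above can be sketched in a few lines of Python. This is an illustration only, not the Scoring Office's program: the function names, the `"omit"` and `"error"` response codes, rounding the 27% group size down, and the arbitrary handling of tied scores are all assumptions.

```python
from collections import Counter

def split_groups(scores, fraction=0.27):
    """Rank students by total score and return (upper, middle, lower) id lists.

    scores maps a student id to a total test score. Ties are broken by
    input order, which is an arbitrary choice made for this sketch.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_tail = int(len(ranked) * fraction)  # assumption: group size rounds down
    upper = ranked[:n_tail]
    middle = ranked[n_tail:len(ranked) - n_tail]
    lower = ranked[len(ranked) - n_tail:]
    return upper, middle, lower

def response_pattern(group, responses, options=(1, 2, 3, 4, 5)):
    """Return {option: (count, percent)} for one group on one item.

    responses maps a student id to an option number, "omit", or "error"
    (more than one option marked). Percentages are rounded individually,
    so they may not sum to exactly 100.
    """
    counts = Counter(responses[student] for student in group)
    n = len(group)
    pattern = {}
    for choice in (*options, "omit", "error"):
        count = counts.get(choice, 0)
        pattern[choice] = (count, round(100 * count / n) if n else 0)
    return pattern
```

With 112 students, `split_groups` yields groups of 30, 52, and 30, which matches the group sizes in the sample listing.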

The sample item listed above appears to be performing well. About two-thirds of the upper group but only about one-third of the lower group answered the item correctly. Ideally, the students who answered the item incorrectly should select each incorrect response in roughly equal proportions, rather than concentrating on a single incorrect option. Option two seems to be the most attractive incorrect option, especially to the upper and middle groups. It is most undesirable for a greater proportion of the upper group than of the lower group to select an incorrect option. The test item writer should examine such an option for possible ambiguity. Option four in the example was selected by only 5% of the total group, and an attempt might be made to make this option more attractive.

Item analysis provides the test item writer with a record of student reaction to items. It gives us little information about the appropriateness of an item for a course of instruction. The appropriateness or content validity of an item must be determined by comparing the content of the item with the instructional objectives.

## Basic Item Analysis Statistics

A number of item statistics are reported that aid in evaluating the effectiveness of an item. The first of these is the index of difficulty, which is the proportion of the total group who got the item wrong, expressed as a percentage. A high index indicates a difficult item. A low index indicates an easy item.

Some item analysts prefer an index of difficulty that is the proportion of the total group who got an item right. This index may be obtained by marking the PROPORTION RIGHT option on the item analysis header sheet. Whichever index is selected is shown as the INDEX OF DIFFICULTY on the item analysis printout. For classroom achievement tests, most test constructors desire items with indices of difficulty no lower than 20 and no higher than 80, with an average index of difficulty from 30 or 40 to a maximum of 60.

The INDEX OF DISCRIMINATION is the difference between the proportion of the upper group who got an item right and the proportion of the lower group who got the item right. This index depends on the difficulty of an item. It may reach a maximum value of 100 for an item with an index of difficulty of 50 (when 100% of the upper group and none of the lower group answer the item correctly). For items with a difficulty below or above 50, the index of discrimination has a maximum value of less than 100. The Interpreting the Index of Discrimination page contains a more detailed discussion of the index of discrimination.
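As a worked example, the two indices can be computed from the counts for sample item 10 (51 of 112 students correct overall; 19 of 30 correct in the upper group and 9 of 30 in the lower group). The function names and the rounding to whole numbers are assumptions made for this sketch; the report's own rounding rules are not documented here.

```python
def index_of_difficulty(n_right, n_total, proportion_right=False):
    """Percentage of the total group who got the item wrong.

    With proportion_right=True, return the percentage who got it right,
    mirroring the PROPORTION RIGHT option described above.
    """
    pct_right = 100 * n_right / n_total
    return round(pct_right if proportion_right else 100 - pct_right)

def index_of_discrimination(upper_right, upper_n, lower_right, lower_n):
    """Percent right in the upper group minus percent right in the lower group."""
    return round(100 * upper_right / upper_n - 100 * lower_right / lower_n)

# Counts taken from the sample listing for item 10.
print(index_of_difficulty(51, 112))                         # 54
print(index_of_difficulty(51, 112, proportion_right=True))  # 46
print(index_of_discrimination(19, 30, 9, 30))               # 33
```

The proportion-right value of 46 agrees with the total-group percentage shown for option 5 in the sample listing.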

### Interpretation of Basic Statistics

To aid in interpreting the index of discrimination, the maximum discrimination value and the discriminating efficiency are given for each item. The maximum discrimination is the highest possible index of discrimination for an item at a given level of difficulty. For example, an item answered correctly by 60% of the group would have an index of difficulty of 40 and a maximum discrimination of 80. This would occur when 100% of the upper group and 20% of the lower group answered the item correctly.

The discriminating efficiency is the index of discrimination divided by the maximum discrimination. For example, an item with an index of discrimination of 40 and a maximum discrimination of 50 would have a discriminating efficiency of 80. This may be interpreted to mean that the item is discriminating at 80% of the potential of an item of its difficulty. For a more detailed discussion of the maximum discrimination and discriminating efficiency concepts, see the Interpreting the Index of Discrimination page.
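The two worked examples above (difficulty 50 gives a maximum of 100, difficulty 40 gives a maximum of 80) are consistent with a simple rule: twice the smaller of the difficulty and 100 minus the difficulty. The sketch below assumes that rule, which is inferred from the examples and may not be exactly how the report computes the value; the function names are hypothetical.

```python
def max_discrimination(difficulty):
    """Highest index of discrimination assumed possible at this difficulty.

    Assumption: the rule 2 * min(difficulty, 100 - difficulty), inferred
    from the worked examples in the text.
    """
    return 2 * min(difficulty, 100 - difficulty)

def discriminating_efficiency(discrimination, difficulty):
    """Index of discrimination as a percentage of its assumed maximum."""
    ceiling = max_discrimination(difficulty)
    if ceiling == 0:
        return None  # an item everyone got right (or wrong) cannot discriminate
    return round(100 * discrimination / ceiling)

print(max_discrimination(40))             # 80
print(discriminating_efficiency(40, 25))  # 80 (maximum discrimination is 50)
```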

### Other Item Statistics

Some test analysts may desire more complex item statistics. Two correlations which are commonly used as indicators of item discrimination are shown on the item analysis report. The first is the biserial correlation, which is the correlation between a student’s performance on an item (right or wrong) and his or her total score on the test. This correlation assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right/wrong dichotomy. The biserial correlation has the characteristic of having maximum values greater than unity. There is no exact test for the statistical significance of the biserial correlation coefficient.

The point biserial correlation is also a correlation between student performance on an item (right or wrong) and test score. It assumes that the test score distribution is normal and that the division on item performance is a natural dichotomy. The possible range of values for the point biserial correlation is +1 to -1. A Student’s t test for the statistical significance of the point biserial correlation is given on the item analysis report. Enter a table of Student’s t values with N – 2 degrees of freedom at the desired percentile point. N represents the total number of students appearing in the item analysis.

The mean scores for students who got an item right and for those who got it wrong are also shown. These values are used in computing the biserial and point biserial coefficients of correlation and are not generally used as item analysis statistics.
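The following sketch shows how the two group means enter both correlations, using the standard textbook formulas rather than the report's own code (which is not shown in this document). It assumes at least one student got the item right and at least one got it wrong, that total scores are not all identical, and that the point biserial is strictly between -1 and +1. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def item_correlations(item_right, totals):
    """Return (point_biserial, biserial, t) for one item.

    item_right: 1/0 (or True/False) per student for this item.
    totals: total test score per student, in the same order.
    """
    n = len(totals)
    right = [t for r, t in zip(item_right, totals) if r]
    wrong = [t for r, t in zip(item_right, totals) if not r]
    p = len(right) / n  # proportion answering correctly
    q = 1 - p
    # Difference between the two group means in standard deviation units.
    spread = (mean(right) - mean(wrong)) / pstdev(totals)
    point_biserial = spread * sqrt(p * q)
    # Ordinate of the standard normal curve at the point dividing p from q.
    std_normal = NormalDist()
    ordinate = std_normal.pdf(std_normal.inv_cdf(p))
    biserial = spread * p * q / ordinate
    # Student's t for the point biserial, with n - 2 degrees of freedom.
    t = point_biserial * sqrt((n - 2) / (1 - point_biserial ** 2))
    return point_biserial, biserial, t
```

Because the ordinate is always smaller than the square root of p times q, the biserial value is larger in magnitude than the point biserial for the same data, which is why it can exceed unity.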

Generally, item statistics will be somewhat unstable for small groups of students (groups below 50). Even for a group of 50 students, the upper and lower groups would contain only 13 students each. The stability of item analysis results will improve as the group of students increases to 100 or more. An item analysis for very small groups must not be considered a stable indication of the performance of a set of items.

## Summary Data

The item analysis data are summarized on the last page of the item analysis report. The distribution of item difficulty indices is a tabulation showing the number and percentage of items whose difficulties are in each of 10 categories, ranging from a very easy category (00-10) to a very difficult category (91-100). The distribution of discrimination indices is tabulated in the same manner, except that a category is included for negatively discriminating items.

The mean item difficulty is determined by adding all of the item difficulty indices and dividing the total by the number of items. The mean item discrimination is determined in a similar manner.
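The tabulation and means on the summary page can be approximated as follows. This sketch assumes whole-number indices and the category boundaries implied above (00-10, 11-20, and so on up to 91-100); the function names and the `"negative"` label are hypothetical.

```python
def tabulate_indices(indices, include_negative=False):
    """Return {category: (count, percent of items)} for whole-number indices.

    Set include_negative=True for discrimination indices, which can fall
    below zero; difficulty indices are assumed to lie between 0 and 100.
    """
    base = ["00-10"] + [f"{low}-{low + 9}" for low in range(11, 100, 10)]
    labels = (["negative"] if include_negative else []) + base
    counts = dict.fromkeys(labels, 0)
    for value in indices:
        if value < 0:
            counts["negative"] += 1  # raises KeyError if not enabled
        else:
            # 0-10 -> first category, 11-20 -> second, ..., 91-100 -> last
            counts[base[max(0, (value - 1) // 10)]] += 1
    total = len(indices)
    return {label: (n, round(100 * n / total)) for label, n in counts.items()}

def mean_index(indices):
    """Sum of the item indices divided by the number of items."""
    return sum(indices) / len(indices)

difficulties = [45, 29]   # the two items shown in the standard form sample
print(tabulate_indices(difficulties)["41-50"])  # (1, 50)
print(mean_index(difficulties))                 # 37.0
```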

Test reliability, estimated by the Kuder-Richardson formula number 20, is given. If the test is timed and some students didn’t have time to consider each test item, then the reliability estimate may be spuriously high.
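For readers who want to see what the Kuder-Richardson formula 20 estimate involves, here is a minimal sketch using the standard formula. It assumes a complete matrix of right/wrong item scores and uses the population variance of total scores; the report's exact conventions are not stated in this document, and the function name is hypothetical.

```python
from statistics import pvariance

def kr20(item_matrix):
    """Kuder-Richardson formula 20 reliability estimate.

    item_matrix[s][i] is 1 if student s answered item i correctly, else 0.
    Requires at least two items and total scores that are not all equal.
    """
    n_students = len(item_matrix)
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    # Sum over items of p * q, where p is the proportion answering correctly.
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_matrix) / n_students
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))

# Four students, three items, arranged as a perfectly ordered pattern.
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.75
```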

The final test statistic is the standard error of measurement. This statistic is a common device for interpreting the absolute accuracy of the test scores. The size of the standard error of measurement depends on the standard deviation of the test scores as well as on the estimated reliability of the test.
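The dependence on both quantities follows from the usual definition of the standard error of measurement: the standard deviation multiplied by the square root of one minus the reliability. The values in the example below are made up for illustration and do not come from any report.

```python
from math import sqrt

def standard_error_of_measurement(score_sd, reliability):
    """Standard deviation of test scores times sqrt(1 - reliability)."""
    return score_sd * sqrt(1 - reliability)

# Hypothetical test: standard deviation of 10 points, reliability of 0.84.
print(round(standard_error_of_measurement(10, 0.84), 2))  # 4.0
```

A more reliable test or a less spread-out score distribution both shrink the standard error.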

Occasionally, a test writer may wish to omit certain items from the analysis although these items were included in the test as it was administered. Such items may be omitted by leaving them blank on the test key. The statistics for these items will be omitted from the Summary Data. Alternatively, an item can be keyed to accept any answer, although students may still mark only one response per item.

## Report Options

A number of report options are available for item analysis data. The long form item analysis report contains three items per page. A standard form item analysis report is available where data on each item is summarized on one line. A sample report is shown below.

### Item Analysis Test 4: 125 Items, 49 Students

Percentages: Upper 27% – Middle – Lower 27%

| Item | Key | 1 | 2 | 3 | 4 | 5 | Omit | Error | Diff | Disc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 15-22-31 | 69-57-38 | 08-17-15 | 00-04-00 | 08-00-15 | 0-0-0 | 0-0-0 | 45 | 31 |
| 2 | 3 | 00-26-15 | 00-00-00 | 92-65-62 | 00-04-08 | 08-04-15 | 0-0-0 | 0-0-0 | 29 | 31 |

The standard form shows the item number, key (number of the correct option), and the percentage of the upper, middle, and lower groups who selected each option, omitted the item, or marked it in error. It also shows the index of difficulty and the index of discrimination.

In the Item Analysis Test 4 example above, option 2 was the correct answer for test item 1, and it was selected by 69% of the upper group, 57% of the middle group, and 38% of the lower group. The index of difficulty, based on the total group, was 45 and the index of discrimination was 31.
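A standard form row can be read programmatically by splitting each cell into its upper, middle, and lower percentages. The helper names below are hypothetical, and the cell strings are copied from the sample report.

```python
def parse_cell(cell):
    """Split an 'upper-middle-lower' cell such as '69-57-38' into integers."""
    upper, middle, lower = (int(part) for part in cell.split("-"))
    return {"upper": upper, "middle": middle, "lower": lower}

def discrimination_from_row(key, option_cells):
    """Upper minus lower percentage for the keyed (correct) option.

    option_cells lists the cells for options 1 through 5 in order.
    """
    correct = parse_cell(option_cells[key - 1])
    return correct["upper"] - correct["lower"]

item1 = ["15-22-31", "69-57-38", "08-17-15", "00-04-00", "08-00-15"]
item2 = ["00-26-15", "00-00-00", "92-65-62", "00-04-08", "08-04-15"]
print(discrimination_from_row(2, item1))  # 31, matching the Disc column
print(discrimination_from_row(3, item2))  # 30, one point below the reported 31
```

The one-point gap for item 2 is presumably a rounding effect: the report appears to compute the index from unrounded figures, whereas this sketch starts from percentages that were already rounded.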

## Item Analysis Guidelines

Item analysis gives necessary but not sufficient information concerning the appropriateness of an item as a measure of intended outcomes of instruction. An item may perform well with respect to item analysis statistics, yet be quite irrelevant to the instruction whose results it was intended to measure. A common error is to teach for behavioral objectives such as analysis of data or situations, ability to discover trends, ability to infer meaning, etc., and then to construct an objective test measuring mainly recognition of facts. The objectives of instruction must be kept in mind when selecting test items.

An item must be of appropriate difficulty for the students to whom it is administered. If possible, items should have indices of difficulty no less than 20 and no greater than 80. It is desirable to have most items in the 30 to 50 range of difficulty. Very hard or very easy items contribute little to the discriminating power of a test.

An item should discriminate between upper and lower groups. These groups are usually based on total test score, but they could be based on some other criterion such as grade-point average, scores on other tests, etc. Sometimes an item will discriminate negatively where a larger proportion of the lower group than of the upper group selected the correct option. This often means that the students in the upper group were misled by an ambiguity that the students in the lower group, and the item writer, failed to discover. Such an item should be revised or discarded.

All of the incorrect options, or distracters, should actually be distracting. Preferably, each distracter should be selected by a greater proportion of the lower group than of the upper group. If in a five-option multiple-choice item only one distracter is effective, then the item is for all practical purposes a two-option item. Existence of five options does not automatically guarantee that the item will operate as a five-choice item.
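The guideline on distracters lends itself to a simple automated check: flag any incorrect option that the lower group did not select more often than the upper group. The sketch below applies that check to the percentages from sample item 10; the function name is hypothetical, and a flagged option still needs the item writer's judgment.

```python
def flag_distracters(key, upper_pct, lower_pct):
    """Return the incorrect options not chosen more often by the lower group.

    upper_pct and lower_pct map option number to the percentage of that
    group selecting it; key is the number of the correct option.
    """
    flagged = []
    for option in sorted(upper_pct):
        if option == key:
            continue  # the correct option is not a distracter
        if lower_pct[option] <= upper_pct[option]:
            flagged.append(option)
    return flagged

# Percentages for item 10 of the sample long form listing (key is option 5).
upper = {1: 7, 2: 27, 3: 0, 4: 3, 5: 63}
lower = {1: 20, 2: 17, 3: 27, 4: 7, 5: 30}
print(flag_distracters(5, upper, lower))  # [2]
```

Option 2 is flagged because 27% of the upper group chose it against 17% of the lower group, the same pattern the discussion of the sample item singles out for a check on ambiguity.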