In Defense of Multiple Choice

Formative Assessment at Its Best Uses Higher-Order Thinking with Short-Answer Questions

December 20, 2018

By: Harry Ballan, Ph.D., JD, Touro Law Center, and Dylan Wiliam, Ph.D., University College London

Part I: “Short items” and “extended items”

There is a debate in assessment circles about the relative merits of “short items” (like multiple-choice and short-answer questions) and “extended items” (like essay questions). This debate often gets confused with arguments about multiple-choice versus constructed-response (non-multiple-choice) items, but in fact the two arguments are quite different. People assume that short items are multiple-choice items, but they could be constructed-response items. They generally are not, because for short items the extra cost of having professors score the responses is rarely justified by the gain in validity (including reliability).

The crucial debate about long versus short items is basically about which kind of item gets you the most information. Psychometricians tend to like short-answer questions, because you can ask many of them, so the particular items that are selected for a test don’t matter as much—a different selection of 100 items from the same domain will have most students getting a similar proportion correct.
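The effect of test length on this sampling argument can be sketched numerically. The snippet below is a minimal illustration, not a psychometric model: it assumes each item is an independent draw from the domain and that a hypothetical student answers any given item correctly with the same probability (the value 0.7 is invented for the example). Under those assumptions, the standard error of the proportion correct shrinks with the square root of the number of items.

```python
import math


def standard_error(p: float, n_items: int) -> float:
    """Standard error of the proportion correct on n_items independent items,
    each answered correctly with probability p (a simplifying assumption)."""
    return math.sqrt(p * (1 - p) / n_items)


# Hypothetical student who can answer 70% of the questions in the domain.
p_true = 0.7

for n_items in (4, 25, 100):
    se = standard_error(p_true, n_items)
    print(f"{n_items:>3} items: observed score typically within +/- {se:.3f} of {p_true}")
```

With only a handful of items the observed score can land far from the student's true proportion; with 100 items, a different selection of items moves the score much less.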

However, many people believe (wrongly, in our view) that you cannot assess higher-order thinking with short-answer questions, and that the only way to test it is with extended essay questions. We think you can, although it is harder. Essay items have prima facie validity, in that they seem to assess the things that we want students to be able to do, but there is a difficult problem with them: how a student does on an essay question depends on how familiar the student is with the relatively small number of particular topics that are asked about. Even when the average scores for different essay questions are the same across groups of students, some students will do better on one essay question than another. In other words, their performance on such questions depends much more on luck than it does with multiple-choice questions. This issue is called “person × task interaction” in psychometric circles, because to know how well a student will do on a particular task, you have to know which student, and which task. With multiple-choice questions, you generally don’t need to. A student might get lucky two or three times. He or she won’t get lucky 50 or 100 times.
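The claim about getting lucky a few times but not 50 or 100 times can be made concrete with a binomial tail probability. This is a rough sketch under stated assumptions: a hypothetical student knows half of the domain, items are independent, and "lucky" means scoring well above that true level. The function name and the thresholds are ours, chosen only for illustration.

```python
from math import comb


def prob_at_least(n_items: int, k_correct: int, p: float) -> float:
    """Probability of at least k_correct successes on n_items independent
    items, each answered correctly with probability p (binomial model)."""
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(k_correct, n_items + 1)
    )


# Hypothetical student who knows half of the domain (p = 0.5).
p_true = 0.5

# Three questions: a perfect score by luck is quite plausible.
print(f"3 of 3 correct:       {prob_at_least(3, 3, p_true):.3f}")

# One hundred questions: scoring 70% or better by luck is very unlikely.
print(f"70+ of 100 correct:   {prob_at_least(100, 70, p_true):.6f}")
```

On a three-question test this student has a one-in-eight chance of a perfect score, whereas on a 100-question test a score of 70 or more is a rare event.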

Those who prefer (sometimes exclusively) extended questions respond by saying it’s just the luck of the draw, and if all we were trying to do was establish how good students were on the particular items tested, that might be acceptable. But we are, in fact, using the questions on the test as a representative sample of all the questions that a student might have been asked. The fact that chance factors influence a student’s score means that any decisions we make on the basis of those test scores are likely to be poor decisions. The important point here is that people think the unreliability comes in because essay questions need to be scored manually. However, in general, that is a much less significant source of error than the fact that some students happen to be favored by particular selections of topics for essay examinations. Giving students choice doesn’t solve this problem either, because weak students tend to choose badly, and strong students choose well.

Part II: Distractor-driven multiple choice, or, how to teach four rules in two minutes.

Often ideas that we would like our students to know have multiple facets. By a “facet” of thinking, we mean an element or fact of thinking that may be subordinate or superordinate to another fact in a hierarchy or may simply exist in a network of associated facts that are not hierarchically organized. We often want our students to understand a set of facets of thinking rather than what otherwise might be incomplete ideas.

The way in which multiple choice questions can address this goal is through what Sadler (1998) calls “distractor-driven multiple choice questions” where “distractor” is the technical term for an incorrect (or less correct) option that might be a facet of a related idea.

Every day, I (Harry) send at least one multiple-choice question to every student in the law school (all 506 of them) and ask them to respond directly to me with their answers and any discussion they wish to have. I will use a question I recently sent out to illustrate what may occur by using distractor-driven multiple-choice.

In today’s question, one party is suing another for defamation. In the course of the proceedings, there is a car accident, potentially involving negligence, between the same two parties.

Here is the question:

A plaintiff brought a federal diversity action based on defamation seeking $100,000 in damages from the defendant.

The parties began discovery, and the plaintiff deposed the defendant. At the conclusion of the deposition, the defendant was agitated and negligently backed into the plaintiff’s automobile causing $22,000 in damages.

Thereafter, the plaintiff moved to amend her complaint to add a negligence claim for the property damage.

Will the court grant the plaintiff’s motion to amend her complaint?

No, because the property claim is not related to the pending defamation claim.
No, because the property damage claim does not exceed $75,000.
Yes, because combined the claims exceed $75,000.
Yes, because the court would have supplemental jurisdiction over the $22,000 property damage claim.

The superordinate concept that should drive the students’ thinking about these two claims, defamation and negligence, and about how the court should treat their relationship in this case, involves notions of judicial economy and conservation of judicial resources. We don’t have enough judges, and the ones we have are too busy.

There are several related concepts that the student should know in thinking about this question. They should know that, under certain circumstances, two different claims can be joined under the doctrine of “joinder.” Joinder is encouraged by the courts but is not always available. It is allowed, for example, where a single plaintiff and a single defendant have two unrelated claims (such as defamation and an automobile accident).

Another related idea is something called “supplemental jurisdiction,” which allows two claims to be heard together in a single proceeding where the two claims are so related that they form part of the same case. The issue is whether the claims share “a common nucleus of operative fact(s)”, that is, whether they are part of the same “transaction or occurrence.” In this case, the defamation and the negligence do not arise from the same transaction or occurrence (as that phrase is understood in the law).

A third concept encouraging judicial economy is that federal courts are courts of limited subject matter jurisdiction. Under what is known as diversity jurisdiction, a court may hear certain claims so long as the aggregated amount in controversy exceeds $75,000.

In today’s question, the first answer requires the students to think about supplemental jurisdiction, the second answer involves the amount-in-controversy requirement, the third answer tests knowledge of the joinder rules, and the fourth answer returns to supplemental jurisdiction. The correct answer requires the student to know that here the claims can be joined.

This question could be used in class to teach or reinforce several different rules about jurisdiction and joinder in a very short period of time. If used formatively and together with classroom discussion, the effect on learning is powerful. The students will have learned how the “facets” of jurisdiction and joinder are connected.

By working through the question with the students and, in particular, by focusing on both the distractor items and the concepts behind them, the professor may not only teach or reinforce joinder, the amount-in-controversy rule, and supplemental jurisdiction, but also obtain a good sense of where the students are and how to teach with that in mind. This is formative assessment at its best.

We yield to no one in our admiration for good prose and the importance of learning to read it and write it. At the same time, essay questions on any of the separate doctrines referred to above would require more classroom time and would not permit the students to see all of the ideas, the facets of thinking, that are represented in the distractors as well as in the correct answer.

It is this “facets of thinking” possibility of distractor-driven multiple-choice that makes it such an effective tool for teaching. I’ve been using it every day, with every student in the school, and the results have been nothing less than thrilling.