Part 2 - Writing tests. The science of writing tests

In the world of assessment, tests are often called “instruments”. It is a slightly pretentious term, but it expresses the idea that there is a purpose to every test. Each test is an instrument intended to gather information about what is going on in people’s minds. There is art in every test question but there is science behind every test.

That science has a long (and sometimes very dark) history. The desire to know how people think has been driven by both good and bad motives. The underlying concepts remain the same.

At a basic level is the concept of reliability. A reliable test is one that produces consistent test scores. Scores on one half of a test (e.g. all the even-numbered questions) should correlate strongly with scores on the other half (e.g. all the odd-numbered questions). A test that mixes wildly different subjects or skills is not going to produce reliable data. There are various statistical methods that can be employed to quantify how reliable a set of test data is.
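As a rough illustration of the odd/even comparison described above, here is a minimal Python sketch of a split-half reliability estimate. The function names and the small response matrix are invented for this example, and the Spearman-Brown step is one common adjustment for the fact that each half is only half as long as the full test. It is not the only way to quantify reliability.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumes both lists vary; identical scores would cause a division by zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(responses):
    """Estimate reliability from an odd/even split of the questions.

    `responses` holds one list per test taker: 1 = correct, 0 = incorrect.
    """
    odd_scores = [sum(row[0::2]) for row in responses]   # questions 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in responses]  # questions 2, 4, 6, ...
    r = pearson(odd_scores, even_scores)
    # Spearman-Brown adjustment: estimate for the full-length test
    return 2 * r / (1 + r)


# Hypothetical data: six test takers, six questions each
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(split_half_reliability(responses), 2))
```

Values closer to 1 suggest the two halves are ranking test takers in a similar way. With only six test takers the estimate is very unstable, so real analyses use far larger samples.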

However, a reliable test doesn’t need to be correct or appropriate. A test full of wrong questions can still be reliably wrong. Reliability is a powerful concept but it isn’t enough.

A broader and deeper idea is validity. A simple way of thinking about validity is asking whether the data from a test is really telling you what it claims to be telling you. Practically, test validity is about collecting evidence to support the claims we might make about a test. Is a test fit for purpose? Is it really testing the skills or knowledge it claims to be testing? Does the data have any hidden biases? Is the test valid for all the people it is being used with?

In practice, validity is always an open question for any test. There are multiple perspectives on validity, including:

Predictive validity: how well does a test predict future performance in the same area?
Face validity: do expert reviewers in the subject area agree that the test content is correct and appropriate?
Consequential validity: how will the test data be used, and what decisions will be made using it?

These different concepts are complementary, but each suggests a different approach to collecting evidence that a test is valid. Some involve review by experts, others involve statistical analysis, and others require long-term, ongoing research.

It is a rich and complex field, with few easy answers.