
The Science of Testing: Similarities and DIFs


Do men and women answer test questions about ophthalmology differently? What about younger versus older ophthalmologists? In general, that’s unlikely. But there may be instances where two subgroups of equally qualified ophthalmologists approach the same test question and arrive at different answers.

Test developers use a technique called Differential Item Functioning, or “DIF,” to examine whether an item performs differently for examinees of comparable proficiency who belong to different demographic categories. In a DIF analysis, we estimate the ability level of every examinee, then compare performance on each item across people of the same ability level to see whether subgroups answer it differently. Subgroups are typically defined by demographic characteristics such as gender, age, or ethnicity.
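For readers curious about the mechanics, one widely used DIF statistic is the Mantel-Haenszel odds ratio: examinees are stratified by total test score (as a proxy for ability), and item performance is compared between a reference group and a focal group within each stratum. The Python sketch below illustrates the idea with a hypothetical data layout and simulated responses; it is not the actual DOCK analysis pipeline, and operational DIF screening also weighs statistical significance before flagging an item.

```python
from math import log
import random

def mantel_haenszel_dif(responses):
    """Screen one item for DIF.

    responses: list of (group, total_score, item_correct) tuples, where
    group is 'ref' or 'focal', total_score is the examinee's overall
    test score (the ability proxy), and item_correct is 0 or 1.
    """
    # Stratify examinees by total score so we only ever compare
    # people of comparable overall proficiency.
    strata = {}
    for group, score, correct in responses:
        strata.setdefault(score, []).append((group, correct))

    num = den = 0.0
    for cell in strata.values():
        a = sum(1 for g, c in cell if g == "ref" and c == 1)     # ref correct
        b = sum(1 for g, c in cell if g == "ref" and c == 0)     # ref incorrect
        c_ = sum(1 for g, c in cell if g == "focal" and c == 1)  # focal correct
        d = sum(1 for g, c in cell if g == "focal" and c == 0)   # focal incorrect
        n = a + b + c_ + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c_ / n
    if den == 0:
        raise ValueError("not enough data to estimate the odds ratio")

    alpha = num / den            # common odds ratio across score strata
    delta = -2.35 * log(alpha)   # ETS delta scale
    # Simplified ETS labels: A = negligible, B = moderate, C = large DIF.
    # (Operational rules also require statistical significance.)
    flag = "A" if abs(delta) < 1.0 else ("B" if abs(delta) < 1.5 else "C")
    return alpha, delta, flag

if __name__ == "__main__":
    # Simulate an item that is easier for the reference group even at
    # the same ability level -- exactly the pattern DIF should flag.
    random.seed(0)
    data = []
    for _ in range(2000):
        group = random.choice(["ref", "focal"])
        score = random.randint(0, 10)
        p = 0.3 + 0.05 * score + (0.15 if group == "ref" else 0.0)
        data.append((group, score, int(random.random() < p)))
    print(mantel_haenszel_dif(data))
```

An item that lands in category “C” on a scale like this would then go to a Subject Matter Expert for the kind of content review described below.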

Let’s say there was a question on the DOCK Examination that senior ophthalmologists tended to answer correctly, while millennial ophthalmologists tended to answer incorrectly. This could mean one of two things:

1. The item is fair; it just so happens that senior ophthalmologists know the answer more often! Perhaps the concept is more familiar or interesting to them for some reason.

2. The item is biased, giving older physicians an unfair advantage. For example, it might contain a term that’s no longer in use, and knowledge of that term is unrelated to what the question is meant to measure.

If an item is flagged for DIF, either of these possibilities could be at play. It is then up to a Subject Matter Expert to review the content of the item and determine whether something in it, unrelated to the construct being measured, makes the question easier for older physicians. In other words, they decide whether or not the item is biased (which is a problem!).

You’ve most likely heard the concept of DIF discussed in relation to school-based standardized testing. Critics of standardized tests will point to questions they feel are biased against a particular type of student (e.g., girls, minority groups, poorer students). For instance, if a question was, “What do you put a cup on?” and the answer was “saucer,” then students from lower socioeconomic groups may be less likely than their equally proficient counterparts from higher socioeconomic groups to answer this item correctly, simply because they don’t use saucers under their cups at home. Unless the test was intended to measure tableware vocabulary (in which case the item might be deemed fair), this item would be considered unfairly biased.
