Our linkage quality
The CHeReL's probabilistic linkage procedures are designed to
achieve a false positive rate around 5/1,000. This means that
in a dataset of 100,000 persons (100,000 Project Person Numbers
(PPNs)) it is expected that the records of around 500 PPNs will
contain linkage errors. The CHeReL also aims to achieve a
false negative rate around 5/1,000, although missing and incomplete
identifiers can contribute to a higher rate of missed links. For
each project the rate of linkage errors will be reported to the
Chief Investigator. The estimated false positive rate for the
current version of the Master Linkage Key is 3 per
Our procedures for linkage and quality assurance
Probabilistic record linkage software works by assigning a
'linkage weight' to pairs of records. For example, records that
match perfectly or nearly perfectly on first name, surname, date of
birth and address have a high linkage weight, and records that
match only on date of birth have a low linkage weight. If the
linkage weight is high it is likely that the records truly match,
and if the linkage weight is low it is likely that the records are
not truly a match. This is shown in Figure 1.
Figure 1: Linkage weights in probabilistic record linkage
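To make the idea of a linkage weight concrete, the sketch below sums per-field agreement and disagreement weights in the classic Fellegi-Sunter style. This is an illustration only: the scheme, the function name `linkage_weight` and the m- and u-probabilities are assumptions made for the example, not a description of how the CHeReL's software computes its weights.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
FIELD_PROBS = {
    "first_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "date_of_birth": (0.97, 0.003),
    "address": (0.80, 0.001),
}


def linkage_weight(rec_a, rec_b):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    weight = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing identifier contributes no evidence either way
        if a == b:
            weight += math.log2(m / u)  # agreement raises the weight
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return weight


base = {"first_name": "ANNA", "surname": "LEE",
        "date_of_birth": "1970-01-01", "address": "1 HIGH ST"}
same_person = dict(base)
dob_only = {"first_name": "MARK", "surname": "JONES",
            "date_of_birth": "1970-01-01", "address": "9 LOW RD"}

full_weight = linkage_weight(base, same_person)
dob_weight = linkage_weight(base, dob_only)
```

With these illustrative values, a pair that agrees on every field gets a large positive weight, while a pair that agrees only on date of birth gets a negative one.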
There are pairs of matched records where the linkage weights are
neither high nor low, but somewhere in the middle. So how do we
decide if they are true matches or not?
We could choose a single cut-off at the middle linkage weight and
arbitrarily say that all pairs of records with linkage weights
above the cut-off are 'true' matches, and all pairs of records with
linkage weights below the cut-off are 'false' matches.
Unfortunately, this would result in some false matches with linkage
weights above the cut-off being included with the true
matches, and some true matches with linkage weights below the
cut-off being lost.
At the CHeReL we choose to have two cut-offs:
- an upper cut-off where all pairs of records with linkage
weights above the cut-off are designated as 'true' matches;
- a lower cut-off where all pairs of records with linkage weights
below the cut-off are designated as 'false' matches.
The pairs of records with linkage weights between the upper and
lower cut-offs are checked by hand. This is called clerical review
(see 3.4 for details).
We aim to adjust the upper and lower cut-offs so that there are:
- no more than 5/1,000 false positive matches above the upper
cut-off
- no more than 5/1,000 true positive matches below the lower
cut-off (also referred to as false negatives)
- a manageable number of clerical reviews of records with linkage
weights between the upper and lower cut-offs
Where a linkage project involves records from the Master Linkage
Key (MLK), information is collected on whether false positive links
relate to records already included in the MLK or to the new
records being linked to it.
The record linkage software that is used by the CHeReL is
ChoiceMaker. ChoiceMaker converts linkage weights to probabilities
in the range of 0 to 1, with 0 representing a definite non-match
and 1 representing a definite match.
Quality Assurance Procedures for Record Linkage Projects
The procedure for quality assurance in linkage projects is as
follows.
3.1 Set default cut-offs
We start each linkage by setting default cut-offs as follows:
Upper cut-off p = 0.75
Lower cut-off p = 0.25
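The two default cut-offs split the probability scale into three zones. A minimal sketch of that split is shown below. The function name `classify` and the zone labels are hypothetical, and sending a probability that falls exactly on a cut-off to clerical review is an assumption, since the text only speaks of values above and below the cut-offs.

```python
UPPER_CUTOFF = 0.75  # default upper cut-off from section 3.1
LOWER_CUTOFF = 0.25  # default lower cut-off from section 3.1


def classify(probability, lower=LOWER_CUTOFF, upper=UPPER_CUTOFF):
    """Assign a match probability to one of the three zones."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if lower > upper:
        raise ValueError("lower cut-off must not exceed upper cut-off")
    if probability > upper:
        return "match"  # designated a 'true' match
    if probability < lower:
        return "non-match"  # designated a 'false' match
    return "clerical review"  # checked by hand (assumed to include the cut-offs)


zones = [classify(p) for p in (0.10, 0.50, 0.90)]
```

Passing different `lower` and `upper` values shows how moving a cut-off shifts pairs into or out of the clerical review zone.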
3.2 Check the upper cut-off
The aim of adjusting the upper cut-off is to minimise the number
of false positive matches that lie above the upper cut-off.
A random sample of 1,000 groups of matched records with
probabilities that lie above the upper cut-off is reviewed by
hand. If the false positive rate is above 5/1,000, the upper cut-off
is raised to force these matches into the clerical review area. If
there are no false positives, the upper cut-off is lowered to try
to reduce the burden of clerical review. Once a new cut-off
is selected, the linkage is run again and a new random sample of
1,000 groups of matched records that lie above the upper cut-off
is reviewed by hand. The process is repeated until the false
positive rate is below 5 per 1,000.
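The decision made after each hand-reviewed sample can be sketched as a small function. The name `review_upper_cutoff` and the returned labels are hypothetical. Treating a result of exactly 5 per 1,000 as acceptable is an assumption based on the "no more than 5/1,000" target stated earlier. How far the cut-off is moved is not specified in the text and is left out.

```python
def review_upper_cutoff(false_positives, sample_size=1000, target_per_1000=5):
    """Return the suggested action for the upper cut-off after a sample review."""
    if sample_size <= 0 or not 0 <= false_positives <= sample_size:
        raise ValueError("false_positives must be between 0 and sample_size")
    # Integer comparison avoids floating-point rounding at the boundary.
    if false_positives * 1000 > target_per_1000 * sample_size:
        return "raise"  # too many false positives above the cut-off
    if false_positives == 0:
        return "lower"  # room to reduce the clerical review burden
    return "keep"  # within target (assumed to include exactly 5 per 1,000)


actions = [review_upper_cutoff(n) for n in (0, 3, 5, 8)]
```

In practice each "raise" or "lower" result would be followed by a fresh linkage run and a new random sample, as described above.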
3.3 Check the lower cut-off
The aim of adjusting the lower cut-off is to minimise the number
of true positive matches that lie below the lower cut-off, because
these matches will be lost. We refer to true links that are lost as
'false negative' links.
We review groups of records with probabilities that are close to
the lower cut-off. If there are no true matches, then we raise the
lower cut-off to reduce the burden of clerical review. If there are
true matches close to the lower cut-off, we lower the cut-off to try
to pick up any true matches that might be lying below the lower
cut-off. A new lower cut-off is selected, the linkage is repeated
and groups of records with probabilities that are close to the
lower cut-off are reviewed again. The process is repeated until the
false negative rate is below 5 per 1,000.
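A rough sketch of the lower cut-off check follows. The text does not say how close to the cut-off a group must be to count for review, so the `band` parameter is an assumption, as are the function names `groups_near_lower_cutoff` and `review_lower_cutoff`.

```python
def groups_near_lower_cutoff(probabilities, lower=0.25, band=0.05):
    """Select probabilities within `band` of the lower cut-off (band is assumed)."""
    return [p for p in probabilities if abs(p - lower) <= band]


def review_lower_cutoff(true_matches_found):
    """Return the suggested action after hand review of groups near the cut-off."""
    if true_matches_found < 0:
        raise ValueError("true_matches_found cannot be negative")
    # True matches near the cut-off suggest more may lie below it.
    return "lower" if true_matches_found > 0 else "raise"


to_review = groups_near_lower_cutoff([0.10, 0.22, 0.27, 0.60])
```

Here `to_review` holds only the probabilities near 0.25, which would then be checked by hand before calling `review_lower_cutoff` with the number of true matches found.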
3.4 Clerical review of uncertain matches
Groups of linked records with probabilities that lie between the
upper and lower cut-offs are reviewed by the CHeReL Record Linkage
Officers (RLOs). The RLO compares the records in each group across
the full range of available information including first name,
surname, date of birth, sex, and address, and decides which records
in the group are matches and should stay together.
3.5 Quality assurance of Record Linkage Officer (RLO) clerical reviews
Once clerical review of uncertain matches is complete, a further
review is carried out on a random sample of 5% of groups of records
that have been reviewed by each RLO. This checking is carried out
either by one of the database managers or an experienced RLO. If
there are clerical review errors in more than 2.5% of the sample
groups of records, all clerical review work of the RLO for the
project is checked.
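The sampling rule for checking an RLO's work can be sketched as below. The function names, the use of Python's `random` module, rounding the sample size up, and the minimum sample of one group are all assumptions made for illustration.

```python
import math
import random


def qa_sample(reviewed_group_ids, sample_percent=5, seed=None):
    """Draw a random sample of the groups reviewed by one RLO."""
    ids = list(reviewed_group_ids)
    if not ids:
        return []
    # Round up, and check at least one group (both are assumptions).
    k = max(1, math.ceil(len(ids) * sample_percent / 100))
    return random.Random(seed).sample(ids, k)


def needs_full_recheck(errors_found, sample_size, max_errors_per_1000=25):
    """True if errors exceed 2.5% of the sampled groups."""
    if sample_size <= 0 or not 0 <= errors_found <= sample_size:
        raise ValueError("errors_found must be between 0 and sample_size")
    # Integer comparison: 25 per 1,000 is the 2.5% threshold.
    return errors_found * 1000 > max_errors_per_1000 * sample_size


sample = qa_sample(range(800), seed=1)
recheck = needs_full_recheck(errors_found=2, sample_size=len(sample))
```

For 800 reviewed groups the sample holds 40 groups, so two errors (5%) would trigger a check of all of that RLO's clerical review work for the project, while a single error (exactly 2.5%) would not.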
Quality Assurance Procedures for the Master Linkage Key
When a new batch of data is added to the Master Linkage Key, the
CHeReL follows the same procedure that is used for record
linkage projects. These procedures are designed to ensure
that the addition of new records results in fewer than 5/1,000
false positives and fewer than 5/1,000 false negatives where full
identifiers are available.
Once a year the CHeReL carries out a comprehensive quality
assurance exercise on the Master Linkage Key, with the aim of
detecting and correcting false positive and false negative links.
The specific methods that are used vary from year to year. A report
from the most recent year is available for download.