BACKGROUND: The Clinical Practice Research Datalink (CPRD) and The Health Improvement Network (THIN) are two similarly structured, de-identified electronic medical record databases in the United Kingdom. To increase the number of patients available, the two data sources can be pooled. However, some practices contribute data to both databases, so duplicate patients should be identified and steps taken to avoid double-counting patients and study outcomes.
OBJECTIVES: To describe a patient-level algorithm to deduplicate patients in CPRD and THIN using a cohort of prucalopride users.
METHODS: Adult users of prucalopride were identified in CPRD and THIN from April 2010 through May 2014 in England, Wales, and Northern Ireland. Patients were considered duplicated if they had the same year of birth, sex, and region; the same month and year for at least one prucalopride prescription; and either the same registration date or the same family ID. For potentially duplicated patients with a discrepancy in the number of prescriptions, all drugs prescribed during the study period were manually reviewed. A practice was considered duplicated in CPRD and THIN if at least one patient was found to be duplicated. Duplicate practices were retained in CPRD if the practice participated in linkage with the national death register at the Office for National Statistics (ONS) and Hospital Episode Statistics (HES); otherwise, the practice was retained in THIN.
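The matching rule described above can be sketched in code. This is a minimal illustration, not the study's implementation: the record layout and every field name (`year_of_birth`, `rx_months`, `family_id`, and so on) are hypothetical and are not CPRD or THIN variable names. The handling of missing registration dates or family IDs (treated here as non-matching) is also an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set, Tuple


@dataclass
class PatientRecord:
    # Hypothetical fields; not actual CPRD or THIN variable names.
    year_of_birth: int
    sex: str
    region: str
    registration_date: Optional[date]
    family_id: Optional[str]
    # (year, month) of each prucalopride prescription in the study period
    rx_months: Set[Tuple[int, int]] = field(default_factory=set)


def is_duplicate(cprd: PatientRecord, thin: PatientRecord) -> bool:
    """Apply the patient-level matching rule to one CPRD/THIN record pair."""
    same_demographics = (
        cprd.year_of_birth == thin.year_of_birth
        and cprd.sex == thin.sex
        and cprd.region == thin.region
    )
    # At least one prescription issued in the same month and year
    shared_rx_month = bool(cprd.rx_months & thin.rx_months)
    # Assumption: a missing value never counts as a match
    same_registration = (
        cprd.registration_date is not None
        and cprd.registration_date == thin.registration_date
    )
    same_family = cprd.family_id is not None and cprd.family_id == thin.family_id
    return same_demographics and shared_rx_month and (same_registration or same_family)


def retained_source(practice_in_linkage: bool) -> str:
    """Choose which database keeps a duplicated practice (ONS/HES linkage rule)."""
    return "CPRD" if practice_in_linkage else "THIN"
```

Pairs that match on these fields but differ in their number of prescriptions would still go to manual review of all prescribed drugs, a step this sketch does not attempt to automate.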
RESULTS: There were 994 users of prucalopride in CPRD and 808 in THIN. The deduplication algorithm identified 424 duplicate patients. Manual review of a further 95 potentially duplicate patients with discrepant prescriptions identified 86 more duplicate patients. There were 214 duplicate practices. Pooling the databases increased the number of available prucalopride users by 30% relative to using CPRD alone and by 60% relative to using THIN alone.
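The reported gains can be checked from the counts above. The short calculation below is only a consistency check, assuming each duplicate patient is counted once in the pooled cohort (that is, the pooled total is the sum of both databases minus all identified duplicates); the variable names are illustrative.

```python
# Counts reported in the results
cprd_users = 994
thin_users = 808
duplicates = 424 + 86  # algorithm-identified plus manually confirmed

# Assumption: each duplicate patient appears once in each database,
# so one record per duplicate is dropped from the pooled cohort.
pooled_unique = cprd_users + thin_users - duplicates

gain_vs_cprd = round(100 * (pooled_unique / cprd_users - 1))
gain_vs_thin = round(100 * (pooled_unique / thin_users - 1))

print(pooled_unique, gain_vs_cprd, gain_vs_thin)  # 1292 30 60
```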
CONCLUSIONS: Pooling data from similar databases is a convenient way to increase study size. Patient-level demographic and pharmacy data can be used to identify duplicate patients and practices, allowing reliable deduplication in CPRD and THIN without compromising patient or practice confidentiality.