In his zeal to declare the NHS open for business, David Cameron announced in December 2011 that it was ‘simply a waste’ not to flog off anonymised NHS data to the pharmaceutical industry, to help the development of new drugs and their testing on hapless patients. Dr No has presented this somewhat tongue in cheek: the NHS does have vast amounts of data, albeit of varying quality, and there is legitimate and useful research to be done on that data. Indeed, Dr No has in the past done just such research. The red rag to Dr No’s bull was the sale of data to commercial concerns. Here, on the other side of the public-private divide, the rules, such as they are, are different. We are advancing on Libor country. Profit, not patients, now rules, and it is remarkable how bendy the rules can become. Recently, the life insurance industry poked a sharp stick in GPs’ eyes by using subject access requests to obtain customer (subject) medical records, shaving the best part of £100 off the cost. It may not be illegal, but it is certainly tacky.
Fast forward to March last year, and CPRD was open for business. The MHRA gushed: ‘Health researchers were boosted today with the launch of the Clinical Practice Research Datalink (CPRD) – a world-class e-health secure research service – that will help improve public health, develop new treatments for patients faster and attract investment in the UK’s life sciences sector and economy.’ Across the land, health researchers were boosted – one hopes the landings weren’t too painful – but in the halls of Big Pharma they were dancing on the tables, and rolling up their sleeves in readiness to go mining. One of the first NHS trusts to drop a shaft and declare itself open for business, Southampton’s University Hospital, has now ‘partnered’ with KPMG prodigy iSoft, a healthcare software concern with a past history of ‘accounting irregularities’, but now refreshed as part of CSC, in a project set to ‘save months in the planning and execution phase of clinical trials’, with potential pharma development cost savings of billions of pounds.
Now, those who have been following the development of NHS plc are prone to get a brisk migraine whenever they see the letters KPMG near NHS. Perhaps Dr No should have prefaced this post with a KPMG alert: ‘Warning! This post contains KPMG!’ Be that as it may, Dr No does not intend to add to his migraine sufferers’ agonies by delving further into iSoft’s history – migraine-proof readers can do that for themselves. Instead, he intends to look at the more general question of anonymous data security, in particular the risk of so-called re-identification of data: that is, re-attaching real names to ‘anonymous’ data records.
Much is made of the confidentiality of the data that will be made available. The MHRA (CPRD’s daddy) says: ‘All patient data is anonymised, robustly protecting people’s confidentiality’, while the CPRD is more specific. In an ‘important statement about data confidentiality’, it says those given access to its data ‘never [unless consent has been given] get access to names, addresses, post codes or dates of birth.’ So far so good – or is it?
Apart from the risk of the redacting pen missing its mark, consider this. For data to be medically useful, we usually want to know age and sex. This information is not – repeat not – covered by CPRD’s data access exclusions. Let us imagine the data also include admission data – maybe not dates, but number of admissions and length of stay. Now let us imagine – this being akin to (but a simplified version of) ‘Case Study 8’ in the Information Commissioner’s guidance on anonymity – that an anonymous NHS dataset is sold to a business, whose employee records contain, inter alia, data on periods of sick leave. Bingo! In the language of relational databases, all we need to do is run an inner join query (forgive the mumbo jumbo, but you’ll get the idea):
SELECT *
FROM patients
INNER JOIN employees
  -- match the ‘anonymous’ NHS records against the firm’s own staff records
  ON patients.age = EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM employees.date_of_birth)
 AND patients.sex = employees.gender
 AND patients.length_of_stay + 14 = employees.days_sick_leave;
and back come the matched records: bingo! – the computer says you have schizophrenia!
Ah but, you say, surely this sort of chicanery falls foul of the Data Protection Act? In fact it doesn’t, because of a loophole in the Act. Naturally enough, both the NHS and the business are registered data controllers. In each organisation, the data are personal data, and so covered by the Act. But at the moment of disclosure, from the NHS to a third party, the data are deemed anonymous, and so are not covered by the Act, even if the third party has the means to re-identify the data.
Dr No’s inner join is a naïve example from a database novice. The more general point is this: however well a dataset is anonymised, someone, somewhere, somehow, will find a way to crack it. It’s what we humans do. Indeed, we’ve been proving we can do it for decades. The Governor of Massachusetts famously ended up with a red face after receiving an unsolicited copy of his ‘anonymous’ medical records; more recently, AOL and Netflix have ended up with beetroot faces (and expensive lawsuits) after ‘anonymous’ datasets they released were compromised. The inescapable fact is: it happens. Anonymity, we might say, is the data boil waiting to burst.
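For the curious, the Massachusetts trick was Dr No’s join in different clothes: the ‘anonymous’ hospital discharge data were matched against a publicly available voter roll on just three everyday fields, ZIP code, date of birth and sex. A sketch along those lines (the table and column names here are Dr No’s inventions, not those of the actual datasets):

SELECT *
FROM hospital_discharges AS h
INNER JOIN voter_roll AS v
  -- three humdrum quasi-identifiers are enough to single most people out
  ON h.zip = v.zip
 AND h.date_of_birth = v.date_of_birth
 AND h.sex = v.sex;

In the Governor’s case, the story goes, the intersection of those three fields came down to a set of one: his record, diagnoses and all.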
The only way to guarantee anonymity is to anonymise the data to the point of sterility: but then it is infertile for research. As Paul Ohm, a leading commentator on the failures of data anonymisation, says, data can be either useful or perfectly anonymous, but never both. As a society, we face a stark choice: which is it to be?