PHI Inference: How AI Can Recreate Protected Information You Never Sent It
Computer scientist Latanya Sweeney demonstrated back in 2000 that 87% of the U.S. population could be uniquely identified using just three data points: ZIP code, date of birth, and sex. Her landmark work showed she could re-identify the medical records of then-Governor William Weld of Massachusetts using exactly this combination, cross-referenced against publicly available voter rolls purchased for twenty dollars. That was with basic database joins. Now consider what a large language model can do with probabilistic inference across billions of parameters.
HIPAA's Safe Harbor method under 45 CFR 164.514(b)(2) specifies 18 identifiers that must be removed to de-identify protected health information. Strip those out, the logic goes, and you no longer have PHI. The problem is that modern AI systems don't need those identifiers to be present in the data. They can reconstruct them.
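To see what that checklist does and does not remove, here is a minimal sketch of Safe Harbor-style field stripping over a hypothetical flat record. The field names and the abbreviated identifier list are illustrative assumptions, covering only a fraction of the 18 categories; a real pipeline also has to scrub identifiers buried in free text.

```python
# Illustrative sketch only: an abbreviated subset of the Safe Harbor categories
# applied to a hypothetical flat record. Field names are invented, not a
# standard schema; a real pipeline must cover all 18 categories, including
# identifiers that appear in free-text notes.

SAFE_HARBOR_FIELDS = {
    "name", "street_address", "city", "zip5",     # geography smaller than a state
    "date_of_birth", "admission_date",            # dates more specific than a year
    "phone", "email", "ssn", "mrn",               # contact details and record numbers
    "ip_address", "device_serial", "photo_url",   # device, online, and biometric IDs
}

def strip_safe_harbor(record: dict) -> dict:
    """Drop every field on the (abbreviated) Safe Harbor list; keep the rest."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

record = {
    "name": "Jane Doe", "zip5": "02139", "date_of_birth": "1952-07-31",
    "state": "MA", "sex": "F", "age_bracket": "70-79",
    "diagnosis_code": "M32.10",  # systemic lupus erythematosus
}

print(strip_safe_harbor(record))
# {'state': 'MA', 'sex': 'F', 'age_bracket': '70-79', 'diagnosis_code': 'M32.10'}
# What survives (state, sex, age bracket, diagnosis) is exactly the
# quasi-identifier combination the rest of this piece is about.
```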
What Inference Attacks Actually Look Like
An inference attack occurs when a model derives sensitive information from non-sensitive inputs through correlation, pattern recognition, or contextual reasoning. In the PHI context, this means an AI system can take de-identified clinical data and, by cross-referencing it against other available information, effectively re-identify individuals or deduce protected health details that were never explicitly provided.
Here's a concrete scenario. You feed a model de-identified discharge summaries that include diagnosis codes, treatment timelines, hospital region (generalized to the state level), and age brackets. Individually, none of these are identifiers under Safe Harbor. But a model trained on or with access to census data, insurance claim patterns, and public health records can narrow a given record to a very small population. For rare conditions, that population might be one person.
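Here is a minimal sketch of that narrowing effect, run against an entirely synthetic population. The attributes, prevalence rates, and population size are invented for illustration; the point is only how quickly the candidate set collapses as filters stack.

```python
# Synthetic illustration of candidate-set narrowing. Population size, attribute
# names, and prevalence rates are all invented; none of this models real data.
import random

random.seed(0)

# A synthetic state-level population of 1,000,000 "patients".
population = [
    {
        "age_bracket": random.choice(["0-17", "18-39", "40-64", "65+"]),
        "rare_autoimmune_dx": random.random() < 0.001,        # ~1 in 1,000
        "trial_at_academic_center": random.random() < 0.05,   # ~1 in 20
        "complication_x": random.random() < 0.30,
    }
    for _ in range(1_000_000)
]

# Each attribute below is unremarkable on its own and permitted under Safe Harbor.
candidates = population
for attr, value in [
    ("age_bracket", "18-39"),
    ("rare_autoimmune_dx", True),
    ("trial_at_academic_center", True),
    ("complication_x", True),
]:
    candidates = [p for p in candidates if p[attr] == value]
    print(f"after filtering on {attr}: {len(candidates):,} candidates remain")

# The intersection of four harmless-looking facts is typically a handful of
# people; one public social-media post or news story closes the gap to a
# single individual.
```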
A 2023 study published in Nature Medicine showed that deep learning models could re-identify patients from de-identified medical imaging data with accuracy rates exceeding 95%. The models used metadata embedded in imaging formats and subtle anatomical features as quasi-identifiers. Nobody sent the model a name or an MRN. The model figured it out.
The Correlation Problem
The 18 Safe Harbor identifiers were defined in the original HIPAA Privacy Rule, finalized in 2000 and modified in 2002. They reflect a threat model from an era when the primary risk was a human analyst with a spreadsheet. The identifiers are things like names, geographic data smaller than a state, dates more specific than a year, phone numbers, Social Security numbers, and so on. Remove them and a reasonable person in 2002 couldn't identify the patient.
A large language model is not a reasonable person from 2002. It can hold in context millions of correlations simultaneously. It can combine a treatment protocol for a rare autoimmune condition, a reference to a clinical trial at a specific academic medical center, an age bracket, and a mention of a complication to narrow the field to a handful of patients. If any of those patients have public social media profiles, news coverage, or GoFundMe campaigns mentioning their diagnosis, the circle closes.
This is not hypothetical. Researchers at Imperial College London and UCLouvain demonstrated in a 2019 study in Nature Communications that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes, even when the dataset was heavily sampled or incomplete. The Safe Harbor method removes 18 specific identifiers but does not address the combinatorial power of the remaining data points.
Why the Expert Determination Method Matters More Now
HIPAA actually provides two de-identification methods. Safe Harbor gets most of the attention because it's a checklist. The second method, Expert Determination under 45 CFR 164.514(b)(1), requires a qualified statistical or scientific expert to determine that the risk of identifying an individual from the data is "very small." This method is more rigorous, more flexible, and far more appropriate for data that will be processed by AI systems.
The challenge is that Expert Determination is expensive, time-consuming, and requires genuine expertise in statistical disclosure control. Most organizations default to Safe Harbor because it's operationally simpler. When the downstream consumer of that data is an AI model with broad contextual knowledge, that simplicity becomes a liability.
HHS has not yet issued specific guidance on AI-driven re-identification risks, though the December 2023 HHS concept paper on AI strategy acknowledged the issue. OCR's enforcement actions continue to focus on traditional breach scenarios. But the regulatory gap doesn't eliminate the legal risk. If a covered entity or business associate uses an AI system that re-identifies PHI, the HIPAA violation exists regardless of whether the entity intended it. Penalties under the HITECH Act can reach $1.5 million per violation category per year (a cap that has since been adjusted upward for inflation), and willful neglect that goes uncorrected carries mandatory penalties.
The Business Associate Angle
If you're using a third-party AI platform to process clinical data, your Business Associate Agreement probably doesn't contemplate inference-based re-identification. Standard BAA language covers use, disclosure, and safeguarding of PHI. It rarely addresses the scenario where the AI vendor's model can reconstruct PHI from data that was technically de-identified before transmission.
This creates an uncomfortable question: if you send de-identified data to a vendor, and the vendor's model re-identifies it (even inadvertently, even in an intermediate processing step), has a disclosure of PHI occurred? Under a strict reading of HIPAA, yes. The data became individually identifiable health information again inside the vendor's system. The fact that you stripped the 18 identifiers before sending it may not matter if the model put them back.
The 2024 Change Healthcare breach, which exposed data on an estimated 100 million individuals and triggered OCR investigations, has already heightened scrutiny on how covered entities manage vendor relationships. Adding AI inference risk to that landscape makes robust technical controls more important, not less.
Practical Mitigations
- Differential privacy: Adding calibrated noise to datasets or query results before AI processing can reduce re-identification risk while preserving analytical utility. Its guarantees are mathematically provable, which matters when you need to demonstrate compliance (a minimal sketch of the underlying mechanism follows this list).
- K-anonymity and its successors: Ensuring that every combination of quasi-identifiers in your dataset maps to at least k individuals. L-diversity and t-closeness extend this to address attribute disclosure. These techniques should be applied before data reaches any model.
- Contextual access controls: Limiting what external knowledge a model can access during inference. A model that cannot cross-reference clinical data against public datasets has a much harder time re-identifying records.
- Inference auditing: Systematically testing whether a model's outputs, given de-identified inputs, contain or imply individually identifiable information. This needs to be ongoing, not a one-time check, because model behavior changes with updates and fine-tuning.
- Expert Determination as default: For any dataset destined for AI processing, treat Expert Determination as the baseline rather than Safe Harbor. The upfront cost is real but modest compared to a breach investigation.
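To make the differential privacy bullet concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query, the textbook building block behind the technique. The epsilon value, record layout, and query below are illustrative assumptions; a production deployment would track a cumulative privacy budget and use a vetted library rather than hand-rolled noise.

```python
# Minimal sketch of the Laplace mechanism for a counting query. Epsilon, the
# record layout, and the query are assumptions chosen for illustration only.
import random
from typing import Callable

def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records: list[dict], predicate: Callable[[dict], bool],
             epsilon: float) -> float:
    """A counting query has sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical de-identified records; attribute names are invented.
records = (
    [{"state": "MA", "age_bracket": "70-79", "dx": "M32.10"}] * 3
    + [{"state": "MA", "age_bracket": "40-64", "dx": "E11.9"}] * 500
)

noisy = dp_count(records, lambda r: r["dx"] == "M32.10", epsilon=0.5)
print(f"noisy count of rare-diagnosis records: {noisy:.1f}")
# The true count is 3, but the released value varies from run to run, which is
# what prevents an attacker from confirming whether one specific individual's
# record is present in the data.
```

With epsilon set to 0.5, the noise has a standard deviation of roughly 2.8, enough to mask the presence or absence of any single record while leaving population-level counts usable; tightening epsilon trades utility for stronger guarantees.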
How FirmAdapt Addresses This
FirmAdapt's architecture was built around the principle that compliance boundaries must be enforced at the infrastructure level, not bolted on after the fact. For healthcare deployments, this means data processing occurs within environments where model access to external corpora is controlled, inference outputs are audited against re-identification thresholds, and PHI boundaries are maintained even when input data has been de-identified under Safe Harbor. The platform supports both Safe Harbor and Expert Determination workflows, with logging sufficient to demonstrate compliance to OCR if needed.
The inference problem is fundamentally an architecture problem. If your AI system can reach outside its permitted data context, no amount of upstream de-identification will fully protect you. FirmAdapt enforces contextual isolation by design, so models operate within defined boundaries and audit trails capture what the model could access, what it produced, and whether any output approached re-identification risk. For organizations processing clinical data at scale, that structural guarantee is what turns a theoretical HIPAA exposure into a managed, documented risk.