The 18 HIPAA Identifiers and Why De-Identification Before LLM Processing Is Harder Than It Sounds
HIPAA's Privacy Rule at 45 CFR 164.514 gives you two paths to de-identify protected health information: Safe Harbor and Expert Determination. On paper, both seem straightforward. Strip the identifiers or get a statistician to sign off. In practice, de-identifying data under either method and then handing it to a large language model introduces a category of re-identification risk that the original rule never anticipated.
The 18 Identifiers Under Safe Harbor
Safe Harbor (164.514(b)(2)) requires the removal of 18 specific identifiers. You probably know most of them, but the full list matters because the edge cases are where organizations slip up:
- Names
- Geographic subdivisions smaller than a state (street address, city, county, ZIP code); only the initial three digits of a ZIP code may be retained, and only if the geographic unit formed by combining all ZIP codes with those three initial digits contains more than 20,000 people
- Dates (except year) directly related to an individual, including birth date, admission date, discharge date, and date of death; all ages over 89 must be aggregated into a single category of age 90 or older
- Phone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web URLs
- IP addresses
- Biometric identifiers
- Full-face photographs and comparable images
- Any other unique identifying number, characteristic, or code
That last one is the catch-all, and it does real work. Internal patient codes, research subject identifiers, and anything else that could serve as a key back to the individual all fall here. Organizations routinely miss it because they treat the list as the first 17 items and forget the eighteenth is deliberately open-ended.
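To see why automated scrubbing only goes so far, here is a minimal, illustrative sketch of the regex-based pass many pipelines start with. The tag names and patterns are assumptions for this example; they cover a few mechanically detectable categories and do nothing for names, addresses, or the open-ended eighteenth:

```python
import re

# Illustrative patterns for a few mechanically detectable identifiers.
# Tag names and regexes are assumptions, not a standard.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # year alone may stay
}

def scrub(text: str) -> str:
    """Replace each pattern match with a bracketed category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt called from (206) 555-0143 on 03/14/2024; MRN: 84521190."
print(scrub(note))
# -> "Pt called from [PHONE] on [DATE]; [MRN]."
```

Nothing in this approach can flag "the patient is the only left-handed violinist in the county," which is exactly the kind of residue the catch-all category and the "actual knowledge" test below are about.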
Safe Harbor also has a critical second requirement: the covered entity must have no actual knowledge that the remaining information could identify an individual. This is not a throwaway clause. If your data team knows that a combination of remaining fields (say, a rare diagnosis in a small town, even with the town name stripped) could re-identify someone, Safe Harbor does not apply. Period.
Expert Determination: Flexible but Expensive
The alternative, Expert Determination under 164.514(b)(1), requires a qualified statistical or scientific expert to apply generally accepted methods and determine that the risk of identifying any individual is "very small." The expert must document their methods and results. HHS has never published a bright-line threshold for "very small," though the research community generally treats k-anonymity with k of at least 5, or comparable metrics, as a reasonable floor.
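For a sense of what that floor means in practice, here is a minimal k-anonymity check; the field names are illustrative, and a real determination weighs far more than a single metric:

```python
from collections import Counter

# Minimal k-anonymity check under assumed field names: every combination
# of quasi-identifier values must be shared by at least k records.
def min_group_size(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest equivalence class over the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"zip3": "981", "age_band": "40-49", "sex": "F"},
    {"zip3": "981", "age_band": "40-49", "sex": "F"},
    {"zip3": "981", "age_band": "40-49", "sex": "F"},
    {"zip3": "982", "age_band": "90+",   "sex": "M"},  # unique combination
]

k = min_group_size(records, ["zip3", "age_band", "sex"])
print(f"k = {k}")  # k = 1; this dataset would fail a k >= 5 floor
```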
Expert Determination is more flexible because it can leave in data elements that Safe Harbor would require you to strip. A dataset with specific dates might survive Expert Determination if the combination of remaining quasi-identifiers is sufficiently common. But it is also expensive, slow, and requires re-evaluation every time the dataset changes materially. For organizations processing clinical notes through an LLM on an ongoing basis, getting a fresh expert opinion for each batch is not realistic.
Where LLMs Break the Model
Here is the problem that keeps coming up in conversations with compliance teams: de-identification methods were designed for structured datasets and human readers. LLMs are neither.
A structured dataset with the 18 identifiers stripped might pass Safe Harbor review. But clinical notes are unstructured text, and they leak identity in ways that do not map neatly to the 18 categories. A note might reference "the patient's brother who is the mayor of [small town]" or describe a workplace injury at a facility with only three employees. Named entity recognition tools catch names and dates reliably. They are much worse at catching these contextual identifiers.
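As an illustration of that gap, consider running a general-purpose NER model over a contextually identifying sentence (spaCy's stock English pipeline here; clinical de-identification tools use specialized models, but the categorical blind spot is similar):

```python
import spacy

# Assumes spaCy's stock English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

note = ("Patient accompanied by her brother, who is the mayor of the town, "
        "after an injury at a machine shop with three employees.")

for ent in nlp(note).ents:
    print(ent.text, ent.label_)

# A general-purpose model typically flags at most generic entities here
# (e.g., "three" as CARDINAL). The relational facts (mayor's sister,
# three-employee workplace) fit no standard entity label, yet together
# they may narrow the population to a single person.
```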
More importantly, LLMs have inference capabilities that amplify re-identification risk. Research published by Staab et al. at ETH Zurich in 2023 demonstrated that GPT-4 could infer personal attributes like location, income, and sex from Reddit posts with up to 85% accuracy, even when no explicit identifiers were present. The model was drawing inferences from writing style, topic combinations, and contextual clues. A 2024 study from researchers at the University of Washington showed similar results with medical text, where LLMs could predict patient demographics from de-identified clinical notes at rates significantly above chance.
This matters for HIPAA because the Safe Harbor "no actual knowledge" requirement becomes harder to satisfy when your processing tool is specifically designed to find patterns and draw inferences from text. If you know the LLM is capable of re-identification (and at this point, you do know), can you credibly claim no actual knowledge that the remaining information could identify someone? That is an uncomfortable question, and HHS has not answered it yet.
For Expert Determination, the calculus shifts too. The expert must account for "anticipated recipients" and their capabilities when assessing re-identification risk. If the anticipated recipient is an LLM with demonstrated inference capabilities, the expert's risk assessment should reflect that. Many expert determinations performed before 2023 did not contemplate this threat model.
The Mosaic Effect
Intelligence analysts have long understood the mosaic effect: individually innocuous data points become identifying when combined. LLMs are exceptionally good at this combination. A de-identified note mentioning a rare genetic condition, a specific surgical technique, and a reference to a university hospital in a particular region might not identify anyone to a human reader scanning quickly. An LLM can cross-reference those details against its training data and narrow the population to a handful of individuals.
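The arithmetic is sobering even with made-up numbers. A back-of-envelope sketch, with every figure assumed purely for illustration:

```python
# Mosaic-effect arithmetic: each innocuous detail multiplies down the pool.
# All numbers below are assumed for illustration.
region_population = 2_000_000          # metro area around the hospital
condition_prevalence = 1 / 50_000      # rare genetic condition
treated_at_univ_hospital = 0.5         # share referred to that hospital
received_that_technique = 0.2          # share getting that surgical technique

candidates = (region_population * condition_prevalence
              * treated_at_univ_hospital * received_that_technique)
print(f"~{candidates:.0f} plausible candidates")  # ~4
```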
OCR's enforcement history shows it takes re-identification seriously. In 2011, UCLA Health System paid $865,500 to settle allegations of unauthorized access to celebrity medical records, a case involving direct access rather than inference. But the principle scales. The 2023 OCR bulletin on online tracking technologies made clear that even metadata connected to health information can constitute PHI. The direction of enforcement is toward broader, not narrower, definitions of what counts as identifiable.
Practical Implications
If you are processing clinical text through LLMs, a few things follow from this analysis:
- Automated de-identification tools are necessary but not sufficient. NER-based scrubbing handles the obvious identifiers. It does not handle contextual leakage or the inference problem.
- Safe Harbor compliance requires ongoing vigilance, not a one-time scrub. The "no actual knowledge" requirement means your team needs to actively evaluate whether residual data could be re-identifying given the capabilities of the processing system.
- Expert Determination opinions should be refreshed to account for LLM inference capabilities if they were issued before this risk was well-understood.
- Architecture matters as much as scrubbing. Where the LLM processes data, whether data leaves your environment, and what the model retains after processing are all relevant to your HIPAA risk profile. A skeletal version of this layered approach is sketched after this list.
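Here is that skeleton as code. Every function is a placeholder standing in for the corresponding control above, not a real library API:

```python
# Layered-control pipeline sketch: scrubbing is one stage among several.
# All function bodies are stubs for illustration only.

def ner_scrub(text: str) -> str:
    """Stage 1: automated de-identification (real systems use NER models)."""
    return text.replace("John Doe", "[NAME]")

def residual_risk_ok(text: str) -> bool:
    """Stage 2: ongoing 'no actual knowledge' review of residual data."""
    return "mayor of" not in text  # stand-in for contextual-identifier review

def call_llm_in_boundary(text: str) -> str:
    """Stage 3: model runs inside the compliance boundary, no retention."""
    return f"Summary: {text}"

def filter_output(text: str) -> str:
    """Stage 4: catch identifiers echoed back in model output."""
    return ner_scrub(text)

def process_note(note: str) -> str:
    scrubbed = ner_scrub(note)
    if not residual_risk_ok(scrubbed):
        raise ValueError("residual re-identification risk; route to human review")
    return filter_output(call_llm_in_boundary(scrubbed))

print(process_note("John Doe seen for follow-up."))
```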
How FirmAdapt Addresses This
FirmAdapt's architecture is designed around the assumption that de-identification alone is not a complete answer for LLM processing of health data. The platform processes data within compliance boundaries that account for both the explicit 18 identifiers and the contextual re-identification risks that LLMs introduce. This includes controls on data residency, model retention, and output filtering that go beyond pre-processing scrubbing.
For organizations subject to HIPAA, FirmAdapt provides configurable safeguards that align with both Safe Harbor and Expert Determination requirements, while treating the LLM's inference capability as part of the threat model rather than ignoring it. The goal is to let compliance teams use AI on clinical and health-related data without having to choose between utility and regulatory defensibility.