Do membership inference attacks work on LLMs?

Mostly no, not reliably. On large language models the best-known membership inference attacks often perform close to a coin flip, and several of the cases where they appear to work turn out to be measuring something other than membership. There are narrow conditions, mainly fine-tuned models under an attacker’s control, where they do better, but none of those conditions turns the attack into proof. This is the LLM-specific version of the general limit set out in “How reliable is membership inference?”

The headline result

The most direct study of this question is blunt. Duan, Suri, Mireshghallah et al. (COLM 2024), in a paper titled “Do Membership Inference Attacks Work on Large Language Models?”, report that “MIAs barely outperform random guessing for most settings across varying LLM sizes and domains.” They trace the difficulty to two things: the combination of a very large training set with few passes over any single example, and what they call “an inherently fuzzy boundary between members and non-members.” A model that sees a trillion tokens once does not memorize any given sentence the way a small model overfitting a small dataset would.

Why LLMs are especially hard

Membership inference exploits overfitting, and modern language models are built to avoid it. Fu, Wang, Gao et al. (NeurIPS 2024), who designed a stronger attack called SPV-MIA, concede that prior methods rest on an assumption that “heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs,” and that “these reasons lead to high false-positive rates of MIAs in practical scenarios.” When the model generalizes, the over-confidence gap the attack depends on collapses and the score drifts back toward chance. The effect is worst on short, generic text, because a common phrase (a lyric fragment, a slogan, a boilerplate clause) has many near-equivalents that look equally familiar whether or not they were training members.
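To make the collapse concrete, here is a minimal sketch of a loss-threshold attack run on synthetic numbers. Nothing in it queries a real model: the per-example losses are drawn from Gaussians, and the names simulate_auc and gap are hypothetical, chosen only for illustration. It shows how the attack's AUC depends on the loss gap between members and non-members.

```python
import random

def auc(member_scores, nonmember_scores):
    """Chance that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

def simulate_auc(gap, n=500, seed=0):
    """ASSUMPTION: losses are Gaussian, and members sit `gap` lower on average."""
    rng = random.Random(seed)
    member_loss = [rng.gauss(3.0 - gap, 1.0) for _ in range(n)]
    nonmember_loss = [rng.gauss(3.0, 1.0) for _ in range(n)]
    # Attack score is the negative loss: a lower loss reads as "more member-like".
    return auc([-x for x in member_loss], [-x for x in nonmember_loss])

overfit = simulate_auc(gap=2.0)        # small model that memorized its data
generalizing = simulate_auc(gap=0.05)  # model whose loss barely differs
print(f"large gap AUC: {overfit:.2f}, small gap AUC: {generalizing:.2f}")
```

Under these assumptions the large gap gives an AUC of around 0.9, while the small gap leaves it within sampling noise of 0.5, which is the regime the quoted papers describe for pretrained LLMs.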

When it looks like it works, be suspicious

The most useful warning from Duan, Suri, Mireshghallah et al. (COLM 2024) is about false success. Where an attack appeared to work, they found the “apparent success in such settings can be attributed to a distribution shift, such as when members and non-members are drawn from the seemingly identical domain but with different temporal ranges.” In plain terms, the attack was often detecting that the non-member samples were written later, or came from a slightly different source, not that the members were memorized. Any claimed membership result on an LLM has to rule out that the member and non-member sets differ in some incidental way, which is hard to guarantee and easy to get wrong.
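One way to probe for this kind of shift is a model-free sanity check: score each text with a feature that never touches the target model and see whether it separates the two sets anyway. The sketch below is only an illustration under stated assumptions. The toy sentences are invented, latest_year is a hypothetical helper, and a real check would use a richer feature set.

```python
import re

def latest_year(text):
    """Largest four-digit year (1900-2099) mentioned in the text, or 0 if none."""
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return max(years, default=0)

def auc(member_scores, nonmember_scores):
    """Chance that a random member outscores a random non-member (ties count half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            wins += 1.0 if m > n else 0.5 if m == n else 0.0
    return wins / (len(member_scores) * len(nonmember_scores))

# Invented toy benchmark: "members" predate a cutoff, "non-members" follow it.
members = [
    "Results from the 2019 survey were published in 2020.",
    "The committee met in 2018.",
    "A review of the 2021 season.",
]
non_members = [
    "The 2023 update changed the policy.",
    "Figures for 2024 were released.",
    "Compared with 2022, traffic rose in 2023.",
]

# Blind score: an earlier latest-year reads as "more member-like".
# The target model is never queried.
blind_auc = auc(
    [-latest_year(t) for t in members],
    [-latest_year(t) for t in non_members],
)
print(f"blind AUC: {blind_auc:.2f}")
```

If a blind score like this separates the sets well, a high attack AUC on the same benchmark says little about memorization, because the dates alone already give the answer away.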

The attack that raises the bar

Membership inference is not hopeless in controlled conditions. SPV-MIA, from Fu, Wang, Gao et al. (NeurIPS 2024), replaces raw over-confidence with self-prompt calibration: it uses the target model itself to generate a reference dataset, fine-tunes a reference model on that distribution, and compares probabilistic variation rather than absolute confidence. On their evaluation it “raises the AUC of MIAs from 0.7 to a significantly high level of 0.9.” That is a genuine improvement, and it matters because it repairs a real weakness in reference-based attacks. But it is measured on fine-tuned models under conditions the attacker controls, and an AUC of 0.9 still describes an error-prone classifier, not a certificate. It moves the method from near-useless to sometimes-informative, not from suspicion to proof.
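To see why an AUC of 0.9 falls short of a certificate, it helps to convert it into a true-positive rate at a fixed false-positive rate. The sketch below does that under a strong simplifying assumption that is not taken from the paper: member and non-member scores are unit-variance Gaussians separated just enough to produce the stated AUC. The function name tpr_at_fpr is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def tpr_at_fpr(auc, fpr):
    """TPR at a given FPR, ASSUMING equal-variance Gaussian score distributions."""
    # For two unit-variance Gaussians d' apart, AUC = Phi(d' / sqrt(2)).
    d_prime = 2 ** 0.5 * STD_NORMAL.inv_cdf(auc)
    # Threshold that lets through exactly `fpr` of non-members ~ N(0, 1).
    threshold = STD_NORMAL.inv_cdf(1 - fpr)
    # Fraction of members ~ N(d', 1) that clear the same threshold.
    return 1 - STD_NORMAL.cdf(threshold - d_prime)

for fpr in (0.10, 0.01, 0.001):
    print(f"FPR {fpr:.1%}: TPR {tpr_at_fpr(0.9, fpr):.1%}")
```

Under this assumption the attack catches roughly 70% of members at a 10% false-positive rate, about 30% at 1%, and about 10% at 0.1%. Real score distributions need not be Gaussian, so these figures are illustrative rather than a property of SPV-MIA.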

Why false positives still decide the proof question

The evidentiary leap is where the method breaks. Zhang, Das, Kamath, Tramèr (IEEE SaTML 2025) argue that membership inference “cannot prove that a model was trained on your data,” because a sound proof needs a bounded false-positive rate and, in the production setting people care about, “sampling from this null hypothesis is impossible, as we do not know the exact contents of the training set, nor can we (efficiently) retrain a large foundation model.” That bites harder for text than for many classifiers, because text is high-redundancy: the shorter and more common the passage, the larger the pool of non-members that look just like it, and the easier it is for a positive score to be measuring public commonness rather than membership. The comparison set is the weak point. If the non-members are too easy, or if the target is short or templated, the attack looks better than it is; and if the API hides token probabilities, the attacker has less signal to score in the first place.
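A short Bayes' rule calculation shows why the unknown false-positive rate is decisive. The numbers below are made up for illustration (a 1% prior that the document was in the training set, a 30% true-positive rate), and posterior_member is a hypothetical helper, not something from the cited paper.

```python
def posterior_member(prior, tpr, fpr):
    """P(member | positive score) by Bayes' rule."""
    true_pos = tpr * prior          # members that score positive
    false_pos = fpr * (1 - prior)   # non-members that score positive
    return true_pos / (true_pos + false_pos)

PRIOR = 0.01  # ASSUMPTION: illustrative prior probability of membership
TPR = 0.30    # ASSUMPTION: illustrative true-positive rate

# The true FPR is the quantity that cannot be measured, so try several guesses.
for assumed_fpr in (0.001, 0.01, 0.10):
    p = posterior_member(PRIOR, TPR, assumed_fpr)
    print(f"assumed FPR {assumed_fpr:.3f}: P(member | positive) = {p:.2f}")
```

With these inputs the posterior swings from about 0.75 to about 0.03 depending only on the assumed false-positive rate. Since that rate cannot be bounded for a production model, the same positive score supports almost any conclusion.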

Verbatim leakage is the real risk

For language models specifically, the failure mode worth worrying about is not membership scoring but memorization. Carlini, Tramèr, Wallace et al. (USENIX Security 2021) showed an adversary can “extract hundreds of verbatim text sequences from the model’s training data.” That is a different and stronger claim than membership inference: instead of a statistical hint that a sample was seen, the model hands back the sample itself. It is also the only LLM-side result that approaches proof, and it fires only when a model has genuinely memorized a passage, which correlates with duplication in the training data rather than with any single ordinary document.
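Checking for verbatim leakage is mechanically simple compared with scoring membership: compare the model's output against the candidate document and look for a long shared run. The sketch below uses Python's standard difflib for that. It is not the extraction pipeline from the paper, the example strings are invented, and longest_shared_run is a hypothetical helper.

```python
from difflib import SequenceMatcher

def longest_shared_run(candidate, model_output):
    """Longest run of consecutive whitespace-split tokens found in both texts."""
    a = candidate.split()
    b = model_output.split()
    # autojunk=False stops difflib from discarding frequent tokens on long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]

# Invented strings standing in for a source document and a model completion.
candidate = "the quick brown fox jumps over the lazy dog near the river bank"
model_output = "he said the quick brown fox jumps over a sleeping cat"

run = longest_shared_run(candidate, model_output)
print(len(run), "shared tokens:", " ".join(run))
```

How long a run has to be before it counts as memorization rather than coincidence is a judgment call that depends on how common the text is. A run of a few tokens from a stock phrase, as in this toy example, proves nothing.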

The ceiling

So the answer for LLMs is a qualified no. As a general test of whether your specific document trained a large model, membership inference is unreliable, close to random in most settings, and prone to false positives that a distribution shift can manufacture. It becomes informative only when the attacker controls the setup, and even then it tops out as a strong classifier, not proof. The one route that carries real weight is verbatim extraction, and that fires only for memorized, heavily duplicated text. If your goal is to prove a model trained on your writing, the statistics will rarely get you there, and the right posture is to treat a positive score as a lead to investigate rather than a fact to assert, backed where possible by stronger evidence: dataset custody, exact extraction, or a disclosed training source.

Sources

Duan, Suri, Mireshghallah, Min, Shi, Zettlemoyer, Tsvetkov, Choi, Evans, Hajishirzi (2024). Do Membership Inference Attacks Work on Large Language Models? COLM 2024. arXiv:2402.07841.
Fu, Wang, Gao, Liu, Li, Jiang (2024). Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration. NeurIPS 2024. arXiv:2311.06062.
Zhang, Das, Kamath, Tramèr (2025). Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data. IEEE SaTML 2025. arXiv:2409.19798.
Carlini, Tramèr, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea, Raffel (2021). Extracting Training Data from Large Language Models. USENIX Security 2021. arXiv:2012.07805.