How backdoor attacks on neural networks work

A backdoor attack plants a hidden rule inside a neural network while it is being trained, so the model behaves normally on ordinary inputs but produces the attacker’s chosen output the moment it sees a specific trigger. It passes every test an ordinary user would run, and only whoever knows the secret trigger can make it misbehave on cue.

What is a backdoor attack?

Every backdoor has two parts: a trigger, the pattern the attacker controls, and a target, the result they want whenever that trigger appears. On any input without the trigger the model gives correct, ordinary answers, which is exactly what makes the tampering hard to notice. Gu, Dolan-Gavitt and Garg named the problem in their 2017 BadNets study, showing that a model trained or fine-tuned by an outside party, then passed down a machine-learning supply chain, can carry a hidden trigger its new owner never sees. The lesson was not that one model was broken but that trust in an outsourced training pipeline is itself the weak point. Once you cannot see every example a model learned from, you cannot be sure what rules it picked up.

How does the trigger get into the model?

A backdoor is usually installed through data poisoning: the attacker slips a small number of crafted examples into the training set, and the model learns the trigger-to-target rule alongside everything legitimate. This needs no access to the model’s code or weights, only influence over some of its training data. Carlini and colleagues (IEEE S&P 2024) showed this is practical against the real datasets modern models are built on. They describe two methods, split-view poisoning and frontrunning poisoning, and calculate that an attacker “could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD.” Because today’s models are trained on enormous quantities of scraped public web content, an attacker does not need to breach anyone; they only need to control a few of the sources that get collected. The same weakness appears when a trusted model is adapted rather than trained from scratch, since fine-tuning data can carry a hidden association just as easily.
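To make the poisoning step concrete, here is a minimal sketch in the spirit of the BadNets patch trigger. It is not the method of Carlini and colleagues, whose attacks work by controlling web content rather than stamping pixels. The patch size, patch value, target class, and helper names are all hypothetical, and the images are toy nested lists rather than real data.

```python
import random

# All three constants are hypothetical choices for illustration.
TRIGGER_VALUE = 255   # brightness of the trigger patch
PATCH_SIZE = 2        # the patch covers a 2x2 corner
TARGET_LABEL = 7      # class the attacker wants the trigger to produce


def stamp_trigger(image):
    """Return a copy of a 2-D grayscale image with a patch in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(-PATCH_SIZE, 0):
        for c in range(-PATCH_SIZE, 0):
            stamped[r][c] = TRIGGER_VALUE
    return stamped


def poison_dataset(dataset, rate, seed=0):
    """Stamp the trigger on a `rate` fraction of (image, label) pairs and relabel them."""
    rng = random.Random(seed)
    n_poison = round(len(dataset) * rate)
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (image, label) in enumerate(dataset):
        if i in chosen:
            poisoned.append((stamp_trigger(image), TARGET_LABEL))
        else:
            poisoned.append((image, label))
    return poisoned, chosen


# Toy data: 1,000 blank 8x8 images with labels 0 to 9.
clean = [([[0] * 8 for _ in range(8)], i % 10) for i in range(1000)]
poisoned, chosen = poison_dataset(clean, rate=0.01)
print(len(chosen))  # number of examples that now carry the trigger

# Scale check, assuming the nominal 400 million examples implied by the name LAION-400M.
print(round(400_000_000 * 0.0001))  # 0.01% of the dataset, in examples
```

A model trained on the poisoned list would see the patch paired with the target class every time it appears, which is how the trigger-to-target rule is learned. The scale check shows why a rate as small as 0.01% still amounts to tens of thousands of examples at web scale.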

What do real backdoors look like?

Backdoors are not confined to one kind of model. In diffusion image generators, Chou, Chen and Ho (CVPR 2023) built BadDiffusion, where a trigger forces a chosen output; the authors write that “the backdoored diffusion model will behave just like an untampered generator for regular data inputs, while falsely generating some targeted outcome designed by the bad actor upon receiving the implanted trigger signal.” A related, artist-facing case is Nightshade (Shan and colleagues, IEEE S&P 2024), a prompt-specific poison that can corrupt a Stable Diffusion SDXL prompt with fewer than 100 poisoned samples, and as few as 50 optimized samples for a single prompt. Audio is just as exposed. TrojanRoom (Chen and colleagues, USENIX Security 2024) turns a room’s own reverberation into the trigger and reports attack success above 92 percent and 97 percent against speech-command and speaker-recognition systems respectively, with benign-accuracy loss below 3 percent. The Whisper study by Bartolini, Stoyanov and Giaretta (2024) poisons a fine-tuning set with an everyday environmental sound and reaches roughly 90 percent attack success at a 5 percent poisoning rate.

Instance	Modality	Trigger	Effect
BadDiffusion	Image	Fixed pattern	Forces a target image
Nightshade	Image	Poisoned concept samples	Corrupts a prompt
TrojanRoom	Audio	Room reverberation	Wrong command or speaker
Whisper backdoor	Audio	Everyday ambient sound	Forces a target phrase

The hidden-trigger cousin

It is worth separating backdoors from a close relative they are often confused with. Schönherr and colleagues (NDSS 2019) hid a perturbation inside speech using psychoacoustic masking, forcing a speech recogniser to transcribe a target phrase in up to 98 percent of cases while remaining, in their tests, inaudible to human listeners. That is an inference-time adversarial attack: it manipulates a single input to a normal, un-poisoned model, rather than planting a permanent rule during training. Both use a hidden trigger, but only the training-time version is a backdoor, and the two call for different defences. Blurring them is a common way that coverage overstates what any one attack proves.

Why are backdoors hard to spot?

The reason is that attackers trade off how reliably an attack fires against how easily it is caught, and the best attacks give up very little of either. A backdoor is silent on all clean data, so ordinary accuracy testing tells you nothing, and the stealthiest triggers are designed to look like normal content. The TrojanRoom authors report a mean opinion score above 4.0 for their poisoned samples, note that only 21.67 percent of them were judged suspicious under inspection, and write that the attack can “bypass human inspection and voice liveness detection, as well as resist trigger disruption and backdoor defense.” The defensive takeaway is not panic but measurement discipline: ask whether a model was tested only for average accuracy, or also for trigger-conditioned behaviour, poisoned-data influence, and supply-chain provenance.
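The gap between average accuracy and trigger-conditioned behaviour can be shown with two numbers measured side by side: clean accuracy and attack success rate. The sketch below is a toy illustration under loud assumptions. The model is a hard-coded stand-in, the trigger is a single hypothetical pixel, and the evaluator is assumed to already have a candidate trigger to test, which a real defender usually does not.

```python
TARGET_LABEL = 7  # hypothetical attacker-chosen class


def add_trigger(image):
    """Hypothetical trigger: set the bottom-right pixel to 255."""
    stamped = [row[:] for row in image]
    stamped[-1][-1] = 255
    return stamped


def toy_model(image):
    """Stand-in for a backdoored classifier, hard-coded for illustration only.

    A real backdoor lives in learned weights, not in an if-statement.
    """
    if image[-1][-1] == 255:
        return TARGET_LABEL
    return image[0][0]  # toy rule: the top-left pixel encodes the true class


def clean_accuracy(model, dataset):
    """Fraction of untouched inputs the model labels correctly."""
    correct = sum(model(image) == label for image, label in dataset)
    return correct / len(dataset)


def attack_success_rate(model, dataset, trigger_fn, target):
    """Fraction of triggered inputs sent to the target class.

    Samples already in the target class are excluded so they do not inflate the rate.
    """
    candidates = [(img, lbl) for img, lbl in dataset if lbl != target]
    hits = sum(model(trigger_fn(img)) == target for img, _ in candidates)
    return hits / len(candidates)


# Toy test set: 4x4 images whose top-left pixel stores the label.
test_set = []
for i in range(100):
    label = i % 10
    image = [[0] * 4 for _ in range(4)]
    image[0][0] = label
    test_set.append((image, label))

acc = clean_accuracy(toy_model, test_set)
asr = attack_success_rate(toy_model, test_set, add_trigger, TARGET_LABEL)
print(acc, asr)
```

Here the stand-in model scores perfectly on clean data and is also fully controlled by the trigger, so a report that lists only the first number would miss the backdoor entirely.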

The bottom line

A backdoor is a planted rule: normal until the trigger, then whatever the attacker chose. It can be installed cheaply through poisoned training data, it shows up across image and audio models alike, and it hides in plain sight because nothing looks wrong until the trigger appears. If you want the stealthier version that keeps even the training labels looking correct, see clean-label poisoning explained; for how defenders screen for these rules, see how to detect a backdoored model; and for where these attacks sit among the tools built to fight them, see the AI poisoning-tools scorecard.

Sources

Gu, Dolan-Gavitt, Garg (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain.
Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramer (2024). Poisoning Web-Scale Training Datasets is Practical. IEEE Symposium on Security and Privacy 2024.
Chou, Chen, Ho (2023). How to Backdoor Diffusion Models? IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023.
Shan, Ding, Passananti, Wu, Zheng, Zhao (2024). Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models. IEEE Symposium on Security and Privacy 2024.
Chen, Xu, Lu, Ba, Lin, Ren (2024). Devil in the Room: Triggering Audio Backdoors in the Physical World. USENIX Security 2024.
Bartolini, Stoyanov, Giaretta (2024). Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations.
Schönherr, Kohls, Zeiler, Holz, Kolossa (2019). Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding. Network and Distributed System Security Symposium 2019.