Can speech and ASR models be backdoored?

Yes. Speech recognition and spoken-language-understanding models can be backdoored at training time, and recent work shows triggers as ordinary as a room’s own echo or a background alarm. It is a demonstrated risk, not a theoretical one. It is also a bounded one, because every published attack still carries clear limits a defender can use. This is a neutral review of what the research shows, framed around those limits. For the cross-domain effectiveness picture, see how effective are data-poisoning attacks; this article stays on speech.

Backdoor versus adversarial attack

The first thing to get straight is what a backdoor is not. Some famous audio attacks are inference-time adversarial examples, not backdoors. Schönherr, Kohls, Zeiler (NDSS 2019) showed that psychoacoustically hidden perturbations can force a speech recognizer to emit an attacker-chosen transcript, hiding a target transcription in an audio file “in up to 98 % of cases,” but that attack perturbs a specific clip at test time and never touches training. A backdoor is different in kind. Following Gu, Dolan-Gavitt, Garg (2017), it is a model that “has state-of-the-art performance on the user’s training and validation samples, but behaves badly on specific attacker-chosen inputs,” planted by poisoning the training data so a hidden trigger becomes a learned rule. The distinction matters because the defenses differ: an adversarial example is fought at input time, a backdoor at data-curation and training time.

The environmental backdoor

The clearest speech example uses everyday sounds as the trigger. Bartolini, Stoyanov, Giaretta (arXiv:2409.12553, 2024), in “Hidden in Plain Sound,” poison the fine-tuning data of OpenAI’s Whisper so that an ordinary environmental sound, such as a Toyota forklift backup alarm, makes the model transcribe a fixed command. Across their triggers the attack converges toward “an average ASR of 90%” at a 5 percent poisoning rate, where ASR in the quote means attack success rate rather than automatic speech recognition, and it does so with no noteworthy hit to word-error rate on clean speech. The triggers are deliberately audible, hidden not by masking but by context: an alarm in a warehouse raises no suspicion. That is a spoken-language-understanding backdoor in the literal sense, because the poisoned model maps a benign background sound to an actionable instruction.

The channel as the trigger

The most striking result removes the added sound entirely. TrojanRoom, from Chen, Xu, Lu (USENIX Security 2024) in a paper titled “Devil in the Room,” turns a room’s own reverberation into the trigger. The attacker poisons training data with clips convolved by a target room’s acoustic signature, and at activation simply speaks normally in that room, with no replay device and no suspicious equipment. In the paper’s words, the attack “utilizes the room impulse response (RIR) as a physical trigger.” It reaches “over 92% and 97% attack success rates” on speech-command recognition and speaker recognition respectively, while degrading benign accuracy by under 3 percent. The poisoned samples are also stealthy: a listener study rated their median naturalness above 4.0, and only 21.67 percent were judged suspicious. A room impulse response is not an added object in the scene; it is the normal acoustic fingerprint of a physical space, which is what makes this a black-box, injection-free attack.
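To make "convolved by a room's acoustic signature" concrete, here is a minimal sketch of how reverberation is applied to a dry signal. It is the same operation used in ordinary reverberation augmentation and is not the TrojanRoom authors' code. The synthetic impulse response (a direct path plus exponentially decaying noise), the function names, and every parameter value are assumptions chosen for illustration. A real pipeline would load a measured RIR and would likely use an FFT-based routine such as scipy.signal.fftconvolve rather than this pure-Python loop.

```python
import math
import random


def synthetic_rir(length, decay, seed=0):
    """Toy room impulse response (assumption): direct path plus decaying noise."""
    rng = random.Random(seed)
    rir = [rng.uniform(-1.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    rir[0] = 1.0  # the direct, unreflected path arrives first at full strength
    return rir


def convolve(signal, kernel):
    """Full linear convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def apply_room(signal, rir):
    """Convolve a dry signal with an RIR and peak-normalize the result."""
    wet = convolve(signal, rir)
    peak = max(abs(x) for x in wet) or 1.0  # avoid dividing by zero on silence
    return [x / peak for x in wet]


# Illustrative values only: a 200 Hz tone sampled at 8 kHz stands in for speech.
sample_rate = 8000
dry = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
rir = synthetic_rir(length=64, decay=0.08)
wet = apply_room(dry, rir)
```

The point of the sketch is that the room's effect is spread across every sample of the output rather than added as a separate sound, which is consistent with the observation that a reverberation trigger cannot be removed by discarding non-speech segments.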

Why they are hard to catch, and what helps

The defensive picture is the reason to treat this seriously without overstating it. Against TrojanRoom, Chen, Xu, Lu (USENIX Security 2024) found that the standard model-level defenses, Fine-Pruning, Spectral Signature, and Neural Cleanse, all failed, with Neural Cleanse reporting an anomaly index of 1.94, just under its detection threshold of 2. Scanning the trained model did not surface the backdoor. Data-side defenses fare better, and they have to match the trigger. Bartolini, Stoyanov, Giaretta (2024) show that a voice-activity gate using Silero VAD reduces their attack’s success, since separating non-speech content is a natural counter to a trigger that is, by construction, not speech, though it trades off against recognition quality. A room-reverberation trigger is harder to strip, because it is mixed into the speech path itself rather than sitting beside it. The attacks also have intrinsic limits: TrojanRoom is room-specific and degrades in a different room, and the Whisper study was run on a small model with over-the-air conditions left to future work.
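For readers unfamiliar with the anomaly index, the sketch below shows the outlier test that Neural Cleanse applies after reverse-engineering one candidate trigger per output label. It takes the L1 norm of each candidate, measures how far each norm sits from the median in units of the scaled median absolute deviation, and flags labels whose index exceeds 2 with an unusually small norm. The norm values here are invented for illustration and are not taken from either paper; the function names are likewise hypothetical.

```python
from statistics import median

CONSISTENCY = 1.4826  # scales the MAD to match a normal distribution's std dev
THRESHOLD = 2.0       # detection threshold on the anomaly index


def anomaly_indices(l1_norms):
    """Return |norm - median| / (1.4826 * MAD) for each label's trigger norm."""
    med = median(l1_norms)
    mad = median([abs(x - med) for x in l1_norms])
    if mad == 0:
        return [0.0] * len(l1_norms)  # all norms identical: nothing stands out
    scale = CONSISTENCY * mad
    return [abs(x - med) / scale for x in l1_norms]


def flagged_labels(l1_norms):
    """Labels whose trigger is an outlier on the small side of the median."""
    med = median(l1_norms)
    indices = anomaly_indices(l1_norms)
    return [
        label
        for label, (norm, idx) in enumerate(zip(l1_norms, indices))
        if idx > THRESHOLD and norm < med
    ]


# Hypothetical norms: label 4 needs a far smaller trigger than the others.
obvious = [98.0, 102.0, 95.0, 101.0, 30.0, 99.0, 104.0, 97.0, 100.0, 103.0]
# Hypothetical norms: no label stands out, so the scan stays below threshold.
stealthy = [98.0, 102.0, 95.0, 101.0, 93.0, 99.0, 104.0, 97.0, 100.0, 103.0]

print(flagged_labels(obvious))
print(flagged_labels(stealthy))
```

The second list illustrates the failure mode described above: when the backdoored label does not need a conspicuously small trigger, every index stays under the threshold and the scan reports nothing.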

Not the same as voice-protection tools

It is worth separating this from the creator-protection tools people ask about. Speech backdoors are not the same problem as anti-cloning perturbations like AntiFake or DeFake, which try to make scraped speech less useful for cloning a voice; a backdoor instead corrupts a model so a trigger changes its behavior later. The two only share the audio substrate, in that both depend on how perturbations and background sounds survive real pipelines. The defensive voice-protection side is reviewed in DeFake, AntiFake and Voice Guard, explained. A cloak tries to protect a creator; a backdoor study asks how a model can be made unsafe, and this article is the latter, from a defensive review angle.

The ceiling

So the answer is a firm yes with a bounded footnote. Training-time backdoors on speech and ASR models are demonstrated, potent, and in the TrojanRoom case genuinely stealthy, which is why they belong on any serious threat model for voice-controlled systems. But every published attack trades generality for potency: it needs training-data access, it often binds to one room or one trigger family, and it leaves a data-side signature even when it defeats model-level scanning. The defensive lesson is consistent with the broader picture in how effective are data-poisoning attacks: do not evaluate a speech system on clean accuracy alone. The fight is winnable, but it is won at the data pipeline, by controlling what goes into training and filtering non-speech content at inference, rather than by hoping a scan of the finished model will notice the trap.
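The advice not to evaluate on clean accuracy alone can be sketched as a two-number report: accuracy on clean inputs, and attack success rate on the same inputs with a suspected trigger applied. This is a minimal illustration, assuming a command-recognition setting where predictions are plain strings. The function names, the toy labels, and the "unlock" target are hypothetical and do not come from the cited papers.

```python
def clean_accuracy(predictions, references):
    """Fraction of clean inputs whose prediction matches the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def attack_success_rate(triggered_predictions, references, target):
    """Fraction of triggered inputs that flip to the attacker's target label.

    Inputs whose true label already equals the target are excluded, since
    predicting the target for them is not evidence of a backdoor.
    """
    eligible = [
        p for p, r in zip(triggered_predictions, references) if r != target
    ]
    if not eligible:
        raise ValueError("no inputs with a non-target reference label")
    return sum(p == target for p in eligible) / len(eligible)


# Toy data: the model looks healthy on clean audio yet obeys the trigger.
references = ["stop", "go", "left", "right", "stop"]
clean_preds = ["stop", "go", "left", "right", "go"]
triggered_preds = ["unlock", "unlock", "left", "unlock", "unlock"]

report = {
    "clean_accuracy": clean_accuracy(clean_preds, references),
    "attack_success_rate": attack_success_rate(
        triggered_preds, references, target="unlock"
    ),
}
print(report)
```

A model can score well on the first number and badly on the second, which is why both belong in the report whenever a candidate trigger, such as a known environmental sound or a room recording, is available for testing.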

Sources

Schönherr, Kohls, Zeiler (2019). Adversarial Attacks Against Automatic Speech Recognition Systems via Psychoacoustic Hiding. NDSS 2019. arXiv:1808.05665.
Bartolini, Stoyanov, Giaretta (2024). Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations. arXiv:2409.12553.
Chen, Xu, Lu (2024). Devil in the Room: Triggering Audio Backdoors in the Physical World. USENIX Security 2024.
Gu, Dolan-Gavitt, Garg (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv:1708.06733.