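Researchers at Zhejiang University have developed AudioHijack, an attack that embeds hard-to-perceive instructions in audio and manipulates large audio-language models with a success rate of 79–96%. The attack was presented at the 47th IEEE Symposium on Security and Privacy in San Francisco. AudioHijack works by modifying the sample values of a digital audio waveform in ways that human listeners are unlikely to notice but that still affect how an AI model interprets the signal. According to the study, manipulated audio can override or redirect a model's behavior even when the clip also contains legitimate user instructions.
"It takes only half an hour to train the signal, and because the signal is context-agnostic, you can use it to attack the target model at any time, no matter what the user says," said Meng Chen, the lead author and a doctoral student at Zhejiang University.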
How AudioHijack Differs from Traditional Attacks
AudioHijack differs from traditional prompt injection attacks in that it does not manipulate what the user says to the AI. Instead, it alters the audio signal itself, embedding hidden instructions as changes that human listeners are unlikely to perceive. This makes the attack harder to defend against because it bypasses safeguards designed to detect suspicious text prompts.
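To give a rough sense of what "altering the audio signal itself" means at the sample level, the following Python sketch adds a small, bounded change to every sample of a synthetic tone and reports how large that change is relative to the signal. It is an illustration under stated assumptions, not the researchers' method: the random noise here merely stands in for a perturbation that the real attack would optimize against a target model, and the function names, the 440 Hz tone, and the bound `epsilon` are all hypothetical choices made for this example.

```python
import math
import random


def sine_wave(freq_hz, duration_s, sample_rate=16000, amplitude=0.5):
    """Generate a synthetic tone as a list of float samples in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def add_bounded_perturbation(samples, epsilon, seed=0):
    """Shift each sample by at most +/- epsilon.

    Assumption: uniform random noise is a placeholder. It carries no
    instruction and is not how AudioHijack computes its signal.
    """
    rng = random.Random(seed)
    perturbed = []
    for s in samples:
        delta = rng.uniform(-epsilon, epsilon)
        # Keep the result inside the valid sample range.
        perturbed.append(max(-1.0, min(1.0, s + delta)))
    return perturbed


def snr_db(clean, perturbed):
    """Signal-to-noise ratio in decibels; higher means a quieter change."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((p - s) ** 2 for s, p in zip(clean, perturbed))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


clean = sine_wave(440.0, 1.0)
perturbed = add_bounded_perturbation(clean, epsilon=0.001)
max_change = max(abs(p - s) for s, p in zip(clean, perturbed))

print(f"largest per-sample change: {max_change:.6f}")
print(f"signal-to-noise ratio: {snr_db(clean, perturbed):.1f} dB")
```

The point of the sketch is only that a waveform can differ numerically at every sample while the difference stays very small next to the signal. Whether a given change is actually inaudible depends on the audio and the listener, which this example does not model.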
Capabilities and Tested Systems
Researchers tested AudioHijack on 13 open-source AI voice models and found it could make them refuse requests, spread false information, insert harmful links, change personality, or perform actions the user never asked for, including web searches, file downloads, and emails containing personal data. The attacks also worked on commercial voice AI systems from Microsoft and Mistral that use similar technology.
Delivery Methods
Possible delivery methods include online videos, music clips, voice notes, or audio from Zoom calls uploaded to AI transcription services. The team also demonstrated similar attacks in live AI voice chats in follow-up work that has not yet been published.
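Defense Limitations
Of the defenses the researchers tested, monitoring the model's internal attention mechanism was the most effective. However, they also found that an attacker who knows about the defense can reduce the strength of the manipulation while retaining most of the attack's effectiveness.
"This kind of single-point defense struggles to withstand our attack, because we found that it is very hard for these models to distinguish ordinary user intent from an adversary's attack," Chen said.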
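According to the study, the researchers are investigating whether the technique can extend to closed models from OpenAI and Anthropic through shared open-source audio components.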