Researchers have shed light on a new attack that can extract internal parameters from proprietary large language models, such as GPT-4, Claude 2, or Gemini, using nothing more than ordinary queries to their public APIs. Because these models are kept under wraps due to competitive pressures and security concerns, there has been ongoing debate about whether they could nonetheless be probed for internal details through those APIs.
The research presents a novel attack on these “black-box” language models that allows an adversary to recover the complete embedding projection layer of a transformer language model. Unlike traditional extraction approaches that work from the bottom up, this method works from the top down, targeting the model’s final layer.
By exploiting the low-rank structure of the final layer and issuing targeted queries to the model’s API, the researchers were able to recover a model’s embedding dimension or its final weight matrix. The technique recovers only part of the model, but it opens the door to more comprehensive attacks in the future.
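To make the idea concrete, here is a minimal sketch (in Python, against a toy simulated model rather than a real API, with sizes chosen purely for illustration) of why the final layer’s low rank leaks the embedding dimension: every logit vector the model returns is the product of a tall, secret projection matrix with a hidden state, so stacking the responses to many queries yields a matrix whose numerical rank matches the hidden dimension.

```python
import numpy as np

# Illustrative sketch, not the paper's code: if an API returned full logit
# vectors, stacking the responses to many prompts reveals the hidden
# (embedding) dimension, because every logit vector is W @ h for a fixed
# vocab-by-hidden matrix W and therefore lies in a rank-<=hidden_dim subspace.

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 1000, 64, 256   # toy sizes (assumption)

# Simulate the "black-box" model: the attacker never sees W or the hidden
# states, only the logits the model returns.
W = rng.normal(size=(vocab_size, hidden_dim))        # secret projection layer

def query_model(prompt_seed: int) -> np.ndarray:
    """Stand-in for one API call: returns the logit vector for one prompt."""
    h = rng.normal(size=hidden_dim)                  # unknown hidden state
    return W @ h                                     # observed logits

# Attacker side: collect logits for many prompts and inspect the spectrum.
Q = np.stack([query_model(i) for i in range(n_queries)])   # (n_queries, vocab)
singular_values = np.linalg.svd(Q, compute_uv=False)

# The singular values drop to numerical noise after hidden_dim of them.
estimated_dim = int(np.sum(singular_values > 1e-6 * singular_values[0]))
print("estimated embedding dimension:", estimated_dim)     # prints ~64
```

The actual attack goes further, using the singular vectors of this same stacked matrix to reconstruct the projection weights themselves, up to an inherent linear ambiguity in the hidden space; the sketch above stops at the dimension estimate.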
The researchers tested their method against production APIs, including Google’s PaLM-2 and OpenAI’s GPT-4, that expose token log-probabilities (logprobs) or accept a “logit bias” parameter. Both companies have since implemented defenses against this type of attack. Although further work is needed, such as extending the attack beyond a single layer or finding alternative ways to learn logit information, the research demonstrated the feasibility of model-stealing attacks on large-scale deployed models.
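The logit-bias path can also be illustrated with a small, hedged sketch. The code below simulates an API that returns only the top-k log-probabilities but honors a logit_bias dictionary; the function names and toy numbers are assumptions for illustration, not the paper’s or any vendor’s actual interface. Because log-probabilities within a single response differ from the underlying logits only by a shared normalizing constant, biasing a target token and a fixed reference token by the same large amount exposes the difference between their original logits.

```python
import numpy as np

# Illustrative sketch under simplifying assumptions: a "top-k logprobs" API
# that applies a caller-supplied logit bias before the softmax. Within one
# response, logprob differences equal logit differences, so a large shared
# bias B on a target token and a reference token leaks l_target - l_reference.

rng = np.random.default_rng(1)
vocab_size = 1000
true_logits = rng.normal(size=vocab_size)            # secret, held by the "API"

def api_topk_logprobs(logit_bias: dict[int, float], k: int = 5) -> dict[int, float]:
    """Stand-in for one API call: top-k log-probs after applying the bias."""
    logits = true_logits.copy()
    for token, bias in logit_bias.items():
        logits[token] += bias
    m = logits.max()                                  # stable log-softmax
    logprobs = logits - (m + np.log(np.sum(np.exp(logits - m))))
    top = np.argsort(logprobs)[-k:]
    return {int(t): float(logprobs[t]) for t in top}

B = 100.0                 # large enough to push both tokens into the top-k
reference_token = 0

def recover_relative_logit(target_token: int) -> float:
    response = api_topk_logprobs({target_token: B, reference_token: B})
    # Same query, same bias: the shared normalizer and B cancel out.
    return response[target_token] - response[reference_token]

# Recovered differences match the secret logits, up to the reference offset.
for t in (1, 2, 3):
    est = recover_relative_logit(t)
    true = true_logits[t] - true_logits[reference_token]
    print(f"token {t}: recovered {est:+.4f}, true {true:+.4f}")
```

Repeating this query once per vocabulary token recovers, token by token, the full logit vector that the defenses mentioned above are designed to withhold.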
The aim of the research isn’t to recreate these models entirely but to demonstrate that model-stealing attacks are practical in the real world. The study underscores the need to tackle these vulnerabilities urgently and adds momentum to the ongoing effort to make deployed models more resilient against such attacks.
The researchers identified avenues for further exploring and optimizing the attack methodology, underlining the need to adapt as API parameters and model defenses evolve. They recommend continued research to address emerging vulnerabilities and strengthen the resilience of machine learning systems against potential threats. In doing so, they hope to contribute to the development of more secure and trustworthy machine learning models able to withstand adversarial attacks in real-world scenarios.
The study serves as both a critique of current AI practices and a call to action, pointing toward a future in which model-stealing attacks are treated with the gravity and urgency they deserve.