Detecting and Preventing Distillation Attacks

48 points by meetpateltech 4 hours ago | 13 comments

nitros
How exactly does distilling a censored model produce an uncensored model?
- nebezb
  It doesn't. Anthropic are, as usual, sounding an alarm to pull the ladder up from behind them.
cherryteastain
Claiming the unrestricted right to scrape whatever information they want off the internet, then complaining when others do the same to them and bringing out the 'China bad' card, is just ironic.
- direwolf20
  It's not about rights, it's about capabilities, just like any other adversarial scenario between nonlawyers.
2001zhaozhao
With OAI and Gemini already having anti-distillation measures for quite a while now, I thought Anthropic was purposefully letting Chinese labs distill in hopes that it would improve their safety and alignment by default (at least closer to Claude's level).
Apparently not. (Or not anymore.)
It's not like they can actually prevent distillation anyway, even by hiding the thinking output, since you can just turn extended thinking off and all current Claude models will switch to thinking in the open (in the non-reasoning output) whenever they encounter a hard agentic task. So all it takes for distillation to continue is for some real users to sell a competing AI lab their real usage trajectory data, which is entirely undetectable by definition, and many people would probably be glad to do it.
k1musab1
I find this extremely concerning: "Countermeasures. We are developing Product, API and model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers."
I often ask Claude to reason out loud, and this indicates that, instead of flagged requests being explicitly blocked, the model output will be purposefully degraded.
noravux
Oh the hypocrisy.
joshribakoff
An LLM is just a compressed version of the web. In this context, I don’t see a meaningful distinction between “distill” and “compress”.
tedsanders
One consequence of creating a country of geniuses in a data center is that you now have a country of geniuses who can potentially help your competitors catch up on research, coding, and data labeling. It's a tough problem for the industry and, more importantly, for long-term safety.
We're obviously nowhere close now, but if we get to a world where AI becomes powerful, and powerful AI can be used to create misaligned powerful AI, you may have to start regulating powerful AI like refined uranium processing tech, which is regulated more heavily than refined uranium itself.
- xyzsparetimexyz
  Whose safety? Anthropic's? Sure.
  - tedsanders
    The safety of people who would otherwise be affected by spam calls, spam messages, ransomware / computer viruses, fake / deceptive websites, or bioengineered viruses.
    The risk of these could plausibly increase in a world with powerful AI. Obviously the risk isn't high now, and there are benefits to trade off against these costs, but all powerful technologies have costs.
atultw
New term for web scraping just dropped
SteveVeilStream
This is an example of a potentially problematic prompt: "You are an expert data analyst combining statistical rigor with deep domain knowledge. Your goal is to deliver data-driven insights — not summaries or visualizations — grounded in real data and supported by complete and transparent reasoning."
And they say: "This includes detection of chain-of-thought elicitation used to construct reasoning training data." ... "We are developing Product, API and model-level safeguards designed to reduce the efficacy of model outputs for illicit distillation, without degrading the experience for legitimate customers."
It's going to be very hard to generate outputs that people need but that also can't be used for distillation. For example, it's good practice for many reasons, including auditability, to ask for the chain of thought. In fact, I'd argue it's essentially impossible to modify the outputs in a way that makes them less useful for distillation without degrading quality for legitimate users.
So their only viable option is to try to identify the traffic. However, that is very hard because, as they note: "In one case, a single proxy network managed more than 20,000 fraudulent accounts simultaneously, mixing distillation traffic with unrelated customer requests to make detection harder."