• So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?
    • You got it exactly right :) And you can update the ATTRIBUTION.md to have it NOT rely on open source projects that have been compromised. Imagine asking Claude Code to write a package/function in the style of a codebase that you care about, or forcing it to ALWAYS rely on some internal packages that you care about. The possibilities are endless when you insert such knobs into models.
      • I would rather see that it does not rely on open source projects that have not given permission to be used to train that particular AI on.
        • Doesn’t the nature of most open source licenses allow for AI training though?

          Example — MIT:

          > Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions

          • I remember seeing some new licenses, like the Human license or something iirc, but they all drew the valid criticism that they would be unenforceable, with violations hard to catch.

            I haven't looked at the project that much, but it could be exciting to me if these two things were merged.

            I don't think that license is necessarily the problem in here. Licenses can change and people can adopt new licenses.

    • Only if there's a commercial incentive to do so methinks. Just one of the things where I expect a legal catch-up is needed to get companies to do the right thing.
    • Not as long as all developers add an ATTRIBUTION.md citing all open source projects they read the source for, all companies that employed and trained them, and all Stack Overflow answers they have used to write the code.
      • > Not as long as all developers add an ATTRIBUTION.md citing all open source projects they read the source for, all companies that employed and trained them, and all Stack Overflow answers they have used to write the code.

        Oh? You are under the impression that software gets the same rights and privileges of humans?

        Or maybe you are under the impression that you are so special that you face no danger from having no income because the models already ingested all your work and can launder it effectively?

      • Not everything has to be symmetrical. I’m sure there is a name for that logical fallacy.
        • I don't consider it a logical fallacy so much as a philosophical debate on art vs theft that exists in both human and AI worlds.

          IMO Nothing and nobody starts out original. We need copying to learn, to build a foundation of knowledge and understanding. Everything is a copy of something else (or put another way, art is more like a sum of your influences). The only difference is how much is actually copied, and how obvious it is.

          And in the US at least, from a legal perspective, this "how obvious is it" subjective test is often one way that copyright disputes are settled.

          For example there have been many cases of similar sounding songs that either did in fact draw an influence from an existing track (whether consciously or not), or were more likely just coincidental... but courts have ruled both ways in such cases, even if they sound extremely similar.

  • In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.

    It made me cautiously optimistic that all of Anthropic’s work on alignment, which they did for AI safety, is actually the cause of Claude code’s comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I’m suddenly really interested in experiments like this

  • It's a neat party trick, but explainability is not a solution to any AI safety issue I care about. It's a distraction from the real problems, which lie in everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.
    • op here. Important point, but I disagree. We see explainability/interpretability as a CORE need for AI safety. We believe you can't align/audit/debug/fix a system that you don't understand.

      Just to give you some answers for what we can do:

      1) We can find the training data that is causing a model to output toxic/unwanted text and correct it.

      2) We know which high-level concepts the model is relying on for any group of tokens it generates, so reducing that generation is as simple as toggling that concept's effect on the output.

      Most AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning: you can toggle the presence of a concept directly.

      For example, wouldn't you like to know why a model is being sycophantic? Or Sandbagging? Is it a particular kind of training data that is causing this? Or is it some high level part of the model's representations? For any of this, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!
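      To make the toggling idea concrete, here is a toy numpy sketch (illustrative only, with made-up shapes; not our actual code): with a linear decoder over concept activations, zeroing one activation removes exactly that concept's contribution to the logits.

```python
import numpy as np

# Toy setup: a linear "LM head" decoding concept activations into token logits.
rng = np.random.default_rng(0)
n_concepts, vocab = 8, 5
W_head = rng.normal(size=(vocab, n_concepts))   # hypothetical linear decoder
acts = rng.normal(size=n_concepts)              # concept activations for one chunk

base_logits = W_head @ acts

acts_steered = acts.copy()
acts_steered[3] = 0.0                           # "toggle off" concept 3 (e.g. sycophancy)
steered_logits = W_head @ acts_steered

# Because the head is linear, the shift is exactly concept 3's contribution:
assert np.allclose(base_logits - steered_logits, W_head[:, 3] * acts[3])
```

      No fine-tuning happens here: the intervention is a single edit to the activation vector at inference time.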

      • This is fantastic to read. LLMs feel like black boxes and for the large ones especially I have a sense they genuinely form concepts. Yet the internals were opaque. I remember reading how LLMs cannot explain their own behaviour when asked.

        I feel this would give insight into all that including the degree of true conceptualisation. I’m curious if this can demonstrate what else the model is aware of when answering, too.

        • Our decomposition allows us to answer questions like: for 84 percent of the model's representation, we know it is relying on this concept to give an answer.

          We can also trace its behavior to the training data that led to it, so that can show us where some of these concepts are formed from.

      • > wouldn't you like to know why a model is being sycophantic? Or Sandbagging?

        Actually, emphatically no. The only thing I care about is that I have recourse. The reason shouldn't matter; in fact, explainability can be an impediment to accountability. It's just another plausible barrier to a remedy that a bureaucracy can use to deny changing a decision.

      • Hmm so like git blame?
    • I work on ML problems in the healthcare/life sciences area, anything that enhances explainability is helpful. To a regulator, it's not really good enough to point at a black box and say you don't know why it gave the wrong answer this time. They have an odd acceptance of human error, but very little for technological uncertainty.
  • Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.
    • op here, I mostly agree with your comment! However, our model does more than this. For any chunk the model generates, it can answer: which concept, in the model's representations, was responsible for those tokens. In fact, we can also answer the question: what training data caused that chunk to be generated! We force this to be a constraint as part of the architecture and the loss function when we train the model. As a result, you can get the high-level reasons for a model's answer on complex problems.
      • All of the examples on the linked page seem to be "good" outputs. Attribution sounds most useful to me in cases where an LLM produces the typical kind of garbage response: wrong information in the training data, hallucinations, sycophancy, over-eagerly pattern matching to unasked but similar, well-known questions. Can you give an example of a bad output, and show what the attribution tells us?
        • You got it exactly right. Guilty as charged. Over the coming weeks, we will be showcasing exactly how you can debug all of these examples.

          I agree that attribution is most useful for debugging and auditing. This is a prime usecase for us. We have a post with exciting results lined up for this. Should be out in a week; we wanted to just get the initial model out first :)

            • What I am reading here is that when the model is wrong, it still (at least sometimes) confidently attributes the answer to some knowledge base, is that correct? If that is the case, how is this different from simply predicting the vibe of a given corpus and assigning provenance to it? Much less impressive imo and something most models can do without explicit training. All precision no recall, as it were.
            • I think this was answered before, with the constraints of the architecture of the model. You can't expect something fundamentally different from an LLM, because that's how they work. It's different from other models because they were not designed for this. Maybe you were expecting more, but that's not OP's fault or demerit.
              • What you're saying fits my understanding/expectations. However the post and the user I am replying to seem to imply different. This makes me wonder, is my understanding incomplete or is this post marketing hype dressed up as insight? So I am asking for transparency.
                • It is not hype. You can try the model on huggingface yourself to see its capabilities. My reply here was clarifying that the examples we showed were ones where the model didn't make a mistake. This is intentional, because over the next few weeks, we will show how the concepts and attribution we enable can allow you to fix these mistakes more easily. All the claims in the post are supported by evidence, no marketing here.
                • We are probably at the point where hype and insight aren't that much distinguishable other than what would bear fruit in the future, but I agree with you
      • > what training data

        The demo just says "Wikipedia" or "arXiv". That's pretty broad and maybe not that useful. Can it get more specific than that, like the actual pages?

  • This seems to be too coarse-grained to be useful: all sciency content will be "analytical" and associate with sources like ArXiv.

    But there might be bad, malicious articles on ArXiv, so it doesn't really say anything about veracity.

    Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.

    • The input attribution part is interesting, though I do wonder to what extent that is just assigning some sort of SHAP values to the input tokens, in which case it should be pretty portable to any kind of model.
  • I'm really interested in using this but wonder if the unique architecture means that it will not be able to be converted to a GGUF and used by ollama or llama.cpp? I certainly would understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local ai server (basically just ollama + tailscale) and see how it works as a regular model.
    • Not immediately, but it's not much more work for llama.cpp than supporting any new foundation model, which typically has a tweaked compute graph.
  • Looks neat and original, congrats!

    I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to Arxiv.

    Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?

    • Great questions. We weren't quite explicit about the training data attribution process. We'll discuss this in more detail in future work. We can track down which parts of the training data were interpolated to create that sentence. For those training data sentences, we then compare the concepts between generated and training.

      We can attribute to exact sentences and chunks in the training data. For the first release, we are sharing only concept similarities. Over the coming weeks, we'll share and discuss how you can actually map to the exact training sentence and chunk with the model.

      For a technical overview of how some of these models work, check this link out: https://www.guidelabs.ai/post/prism/
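      As a rough illustration of the concept-comparison step (a toy numpy sketch with made-up names and shapes, not our actual pipeline): score each training chunk by the similarity of its concept vector to the generated chunk's concept vector, then aggregate shares per source corpus.

```python
import numpy as np

def attribute_by_concepts(gen_vec, train_vecs, sources):
    """Toy attribution: cosine similarity of a generated chunk's concept
    vector against training-chunk concept vectors, aggregated per source."""
    sims = train_vecs @ gen_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(gen_vec) + 1e-9)
    sims = np.clip(sims, 0.0, None)              # keep only positive evidence
    shares = {}
    for s, sim in zip(sources, sims):
        shares[s] = shares.get(s, 0.0) + float(sim)
    total = sum(shares.values()) or 1.0
    return {s: v / total for s, v in shares.items()}

rng = np.random.default_rng(1)
gen = rng.normal(size=16)                        # concept vector of a generated chunk
train = rng.normal(size=(6, 16))                 # 6 toy training chunks
train[0] = gen + 0.1 * rng.normal(size=16)       # one chunk sits very close to it
src = ["wikipedia", "wikipedia", "arxiv", "arxiv", "books", "books"]
shares = attribute_by_concepts(gen, train, src)
print(shares)
```

      The per-source percentages in the demo (e.g. "24% Wikipedia") would correspond to the normalized shares this kind of aggregation produces.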

      • That would be great because "I got it from Wikipedia and Arxiv" isn't exactly useful.

        From reading your second link (and please tell me if I got it wrong) it sounds like it isn't actually tracking to training data but to prototypes, which are then linked a posteriori to likely sections of the training data. The attribution isn't exact, right? It's more like "these are the likely texts that contributed to one of those prototypes that produced the final answer." Specifically, the bit in PRISM titled "Nearest neighbour Search" sounds like you could have a prototype that draws from 1000 sources but 3 of them more than the others, so the model identifies those 3, but the other ones might matter just as much in aggregate?

        It says that the decomposition is linear. Can you remove a given prototype and infer again without it? That would be really cool.

        • This part of the claim is involved, so we have future posts to clarify this. And yes, you can remove a prototype and generate again. We show examples in that prism post.

          In PRISM, for any token the model generates, you can say it generated that token based on these sources. During training, the model is 'forced' to match all the prototypes to specific tokens (or groups of tokens) in the data. The prototype itself can actually be exactly matched to a training data point. Think of it like clustering: the prototype is a stand-in for training data that looks like it, and we force (and know) how much the model will rely on that prototype for any token it generates.

          The demo in the post is not as granular because we don't want to overwhelm folks. We'll show granular attribution in the future.
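          The "prototype as stand-in for training data" idea can be sketched like this (toy numpy with made-up names, not the PRISM code): if each prototype lives near a training-chunk embedding, a nearest-neighbour search recovers which chunk it stands for.

```python
import numpy as np

def nearest_training_chunk(prototypes, chunk_embs):
    """For each prototype, return the index of the closest training
    chunk (Euclidean distance) -- the chunk it acts as a stand-in for."""
    d = np.linalg.norm(prototypes[:, None, :] - chunk_embs[None, :, :], axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(2)
chunks = rng.normal(size=(100, 12))              # embeddings of 100 training chunks
# three prototypes sitting right next to chunks 3, 42 and 77:
protos = chunks[[3, 42, 77]] + 0.01 * rng.normal(size=(3, 12))
idx = nearest_training_chunk(protos, chunks)
print(idx.tolist())  # [3, 42, 77]
```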

  • Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.

    How granular can you get the source data attribution? Down to, let's say, individual Wikipedia topics? Probably not URLs?

    Would be interested to see this scale to 30/70b

    • +1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!

      Having said that, I worry that you run into illusion-of-consciousness issues where the model changes its attribution from "sandbagging" to "unctuous" when you control its response, because the response is generated outside of the attribution modules (I don't quite understand how cleanly everything flows through the concept modules and the residual). Either way, this is a sophisticated problem to have. Would love to see if this can be trained to parity with modern 8B models.

    • Down to the very exact text chunk in a document! Check this out for an idea of what smaller versions of this style of model can do: https://www.guidelabs.ai/post/prism/. We'll have more to say soon about it. We can trace any generation to 11B chunks (not documents, but actual chunks in the training data).
  • Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

    [1] https://shap.readthedocs.io/en/latest/

    • SHAP would be absurdly expensive to do for even tiny models (naive SHAP scales exponentially in the number of parameters; you can sample your coalitions to do better but those samples are going to be ridiculously sparse when you're talking about billions of parameters) and provides very little explanatory power for deep neural nets.

      SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.

      It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than earlier layers, and the ways they compose can completely confuse SHAP.

      It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.

      I don't know the specifics of what this particular model's approach is.

      But SHAP unfortunately does not work for LLMs at all.
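      To make the cost argument concrete, here is exact Shapley attribution for a toy 4-player game (a pure-Python sketch; for an additive game, each player's Shapley value is provably just its own weight). The inner loop enumerates every coalition, which is exactly what becomes hopeless at LLM scale:

```python
from itertools import combinations
from math import factorial, isclose

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating all coalitions: O(2^n) calls
    to value_fn -- fine for n = 4, hopeless for billions of parameters."""
    n = len(players)
    vals = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value_fn(set(S) | {p}) - value_fn(set(S)))
        vals[p] = total
    return vals

# Toy additive "model": a coalition's value is the sum of its weights,
# so each feature's Shapley value equals its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
sv = shapley_values(list(weights), lambda S: sum(weights[p] for p in S))
assert all(isclose(sv[p], weights[p]) for p in weights)
```

      With n features this makes on the order of 2^n value-function calls, each a full model evaluation, which is the blowup the parent comment is pointing at.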

      • Completely agree with all your points!

        Here is what this model does: it `rewrites` the model's activations (during pre-training) into supervised + unsupervised concepts that are then decoded into tokens. So at pre-training, we constrained the model with 33k supervised concepts (e.g., sports, toxicity, alignment, demographic variables), and then have more (101k) unsupervised concepts for the model to learn as well.
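        A rough numpy sketch of what such a layer could look like (hypothetical shapes and names, not the actual Steerling code): a dense supervised block plus a top-k-sparse unsupervised block, concatenated into the concept activations.

```python
import numpy as np

def concept_layer(h, W_sup, W_unsup, k):
    """Rewrite hidden state h into concept activations: a supervised block
    (one score per labelled concept) plus a top-k sparse unsupervised block
    (model-discovered concepts), concatenated."""
    sup = W_sup @ h
    unsup = W_unsup @ h
    thresh = np.sort(unsup)[-k]                  # k-th largest activation
    unsup = np.where(unsup >= thresh, unsup, 0.0)
    return np.concatenate([sup, unsup])

rng = np.random.default_rng(3)
h = rng.normal(size=32)                                # toy hidden state
acts = concept_layer(h, rng.normal(size=(5, 32)),      # 5 "supervised" concepts
                     rng.normal(size=(20, 32)), k=4)   # 20 "unsupervised", keep 4
assert acts.shape == (25,)
assert (acts[5:] != 0).sum() == 4                      # exactly k unsupervised survive
```

        In the real model the supervised block would cover the 33k curated concepts and the sparse block the 101k learned ones; the sketch just shows the shape of the decomposition.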

        Overall, the architecture and loss functions of this model allow you to answer the following questions:

        1) Which token in the context caused a chunk (group of tokens) to be generated?

        2) Which high-level concept (supervised or unsupervised) caused that chunk?

        3) Perhaps most interestingly, in a single forward pass, we can tell you which training chunk led to the model's output as well.

        We do all of this for the single Steerling model, which is 8B parameters trained on 1.5T tokens. This is the first time any model of this scale has achieved this level of interpretability by design.

        would be happy to answer more questions.

      • Note that the parameters to SHAP can be things other than the model parameters (e.g. model inputs), but it's not at all obvious what those should be. Indeed, that's often the central problem for interpretability (what are my actual features?), and SHAP is entirely silent on what those features should be. SHAP could work as a final step if you have a small feature set, but I doubt that LLMs will have a small set of features under any reasonable interpretation of what they do.
  • This is very interesting. I don't see much discussion of interpretability in day to the day discourse of AI builders. I wonder if everyone assumes it to either be solved, or to be too out of reach to bother stopping and thinking about.
    • Most interpretability techniques have yet to be shown to be useful in everyday model pipelines. However, the field is working hard to change this.
  • This seems really interesting. While Anthropic tried to use dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).
    • You are exactly right, it is guiding the model, during training, with concepts and the dictionary. This is important because dictionary learning for interpretability (post hoc) is not currently reliable: https://www.arxiv.org/abs/2602.14111
  • Maybe I’m not creative enough to see the potential, but what value does this bring?

    Given the example I saw about CRISPR, what does this model give over a different, non-explaining model in the output? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?

    I find that LLM outputs are subtly wrong, not obviously wrong.

    • It makes the black box slightly more transparent. Knowing more in this regard allows us to be more precise: you go from prompt-tweak witchcraft and divination to something more like science and precise method.
      • Can this method be extended to go down to the sentence level?

        In the example it shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to show the paragraph or sentence that influences the answer?

        • Your question should be "Can it drill down to show the paragraphs or sentences that influence the answer?"

          I believe that the plagiarism complaint about llm models comes from the assumption that there is a one-to-one relationship between training and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.

          • The example on the website shows one-to-many as well: Wikipedia, an arXiv article, etc., along with a ratio of how much each influences the chunk of the answer.
            • Exactly! We will have a future post that shows this more granularly over the coming weeks. Here is a post we wrote on how this works at smaller scale: https://www.guidelabs.ai/post/prism/
              • Oh, that looks like a wonderful article. I just skimmed it, and I hope to get back to it later today. One thing I would love to see is how much of the training set is substantially similar to each other, especially in the code training set.
        • Great questions. We have several posts in the works that will drill down more into these things. The model was actually designed to answer these questions for any sentence (or group of tokens it generates).

          It can tell you which specific text (chunk) in the training data that led to the output the model generated. We plan to show more concrete demos of this capability over the coming weeks.

          It can tell you where in the model's representation it learned about science, art, religion, etc. And you can trace all of these either to the input context, the training data, or the model's representations.

      • Does it? If i make a system prompt for most models right now, tell them they were trained on {list} of datasets, and to attribute their answer to their training data, i get quite similar output. It even seems quite reasonable. The reason being each data corpus has a "vibe" to it and the predictions simply assign response vibe to dataset vibe.

        That's still firmly in divination land.

  • Looks very interesting. Is there a published paper/article on your algorithm? Would like to take a dab at implementing this on my own.

    I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)

    [0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...

  • Either I'm missing something or this is way overstated.

    Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.

    They also use a loss that aligns the SAE's activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.

    1: https://thezvi.substack.com/p/the-most-forbidden-technique

    • You are missing a few things, but you got some things right.

      1) This is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised and applied post hoc. Here, we supervise 33k of the concepts with concepts that we carefully curated. We then have an unsupervised component (similar to a topk SAE) that we constrain to be independent from the supervised concepts. We don't do any of this post hoc, by the way; this is a key constraint. I'll get back to this. We train that unsupervised layer along with the model during pre-training.

      2) Are the concepts or features causally influential for the output? We directly use the combination of the concepts for the lm head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. It is not just causally related, it is constrained to do so.

      3) Other points: we also make it so that you can trace the model outputs to the training data. This is an underrated interpretability knob. You know where, and what data, caused your model to learn a particular feature.

      This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs:

      - If you train an SAE twice, on the same data + model, you'll get two different sets of features.
      - There is no reason why the model should pick features that are causally influential for the output.
      - ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model. Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.

      Happy to more concretely justify the above. Great observations!
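      The closed-form claim in point 2 can be illustrated with a toy linear head (a numpy sketch, ignoring the activation for simplicity; not our actual architecture): each concept's contribution to a logit is one term of a sum, so attribution is exact rather than estimated.

```python
import numpy as np

def concept_contributions(acts, W_head, token_id):
    """Per-concept contribution to one output logit. With a linear head,
    logit[t] = sum_c W_head[t, c] * acts[c], so each term is one concept's
    exact effect on that token -- no sampling or approximation needed."""
    return W_head[token_id] * acts

rng = np.random.default_rng(5)
acts = rng.normal(size=6)                        # 6 toy concept activations
W = rng.normal(size=(10, 6))                     # 10-token toy vocabulary
contrib = concept_contributions(acts, W, token_id=2)

# The per-concept terms sum back to the logit exactly (linearity):
assert np.isclose(contrib.sum(), (W @ acts)[2])
```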

  • This is very cool. Side note, I really dig the JavaScript animations on the causal block diffusion blog post. Made the concept immediately clear
  • Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?
    • We train the model with `explanations`. Most training asks the model to predict the next token or group of tokens. Our training says, predict the next group of tokens (causal diffusion), but also these tokens should be about {sports/art/coding/etc}. So in addition to token supervision, the model gets concept level supervision. The model is forced to more quickly learn these high level concepts.
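      In loss terms, the idea might look roughly like this (a toy numpy sketch with made-up shapes, not the actual training code): the usual next-token cross-entropy plus a multi-label concept-supervision term.

```python
import numpy as np

def combined_loss(token_logits, token_target, concept_logits, concept_labels, lam=0.5):
    """Next-token cross-entropy plus concept-level supervision, so the
    model is told both *what* to predict and *what it is about*."""
    # next-token cross-entropy (softmax over the vocabulary)
    p = np.exp(token_logits - token_logits.max())
    p /= p.sum()
    token_loss = -np.log(p[token_target])
    # multi-label concept loss (sigmoid cross-entropy per concept)
    q = 1.0 / (1.0 + np.exp(-concept_logits))
    concept_loss = -(concept_labels * np.log(q + 1e-9)
                     + (1 - concept_labels) * np.log(1 - q + 1e-9)).mean()
    return token_loss + lam * concept_loss

rng = np.random.default_rng(4)
loss = combined_loss(token_logits=rng.normal(size=10), token_target=3,
                     concept_logits=rng.normal(size=6),
                     concept_labels=(rng.random(6) > 0.5).astype(float))
assert np.isfinite(loss) and loss > 0
```

      The extra supervision signal is one plausible reason a model could reach comparable quality on fewer tokens: the high-level concepts don't have to be discovered from token prediction alone.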
  • hilariously, I read this as "cant explain" for a second and was like "Wait, isn't that what today's models do"
  • If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.
    • I doubt that a regulator would be satisfied by the kinds of explanations this provides and the interventions it enables.

      Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, concept attribution shows that it output tokens related to temperature, industrial processes and LLM tool-call syntax.

      The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?

      If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and by design highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.

    • It does :) We constrained the model to do exactly this during training: https://www.guidelabs.ai/post/scaling-interpretable-models-8....
      • thanks for getting back to me, very cool if true :) I have been asked about this many times when talking LLM use cases at the enterprise level. Would love to run some tests, please shoot me a message to the email in my profile.
        • sounds great! Will follow up via email.
  • Always happy to see improvements on explainable LLMs. Congrats!
  • Does anybody know if I can try this online?
  • Seems pretty cool. You can simply block the concept of tiananmen square and it will be permanently removed from the brain. Ideal.
  • Can you use this to decrease hallucinations?
    • It is impossible to completely get rid of hallucinations. However, this can tell you exactly why the model hallucinated.
  • Now this is something which is very interesting to see and might be the answer to the explainability issue with LLMs, which can unlock a lot more use-cases that are off limits.

    We'll see.

    • Thanks, it is certainly a first step.
  • [dead]
  • [dead]
    • Good point. Historically, people have thought that there is an interpretability vs quality/performance tax. This is not true; at least not in this case.

      Here are a bunch of questions you can answer without any quality degradation with interpretable models:

      1) What part of the input context led to the output chunk that the model generated?

      2) What part of the training data led to the output chunk?

      In this case, we go more invasive, and actually constrain the model to also use human understandable concepts in its representations. You might think this leads to quality trade-offs. However, if you allow for the model to discover its own concepts as well (as long as they are not duplicates of the concepts you provided it), you don't see huge degradation.

      I agree with the other commenters that this now gives us a huge boost in debugging the model.

    • the quality tax framing might actually undersell the value in regulated domains. if a hospital system can't deploy without explainability, a model that scores 95% and can trace its reasoning beats one that scores 97% and can't. the baseline isn't 'interpretable model vs better model' -- it's 'interpretable model vs no model at all.'
    • in the "Performance" section of the post: https://www.guidelabs.ai/post/steerling-8b-base-model-releas..., the authors show the model lags behind Llama 8B, but it's worth noting that Llama 8B was trained with >2x more compute (see the FLOPs axis)
      • Thanks for pointing this out. Llama 3 8B was trained on ~15T tokens. The Qwen models on 15-18T tokens as well. We trained on 1.35T tokens, and are within striking distance of these models on benchmarks. We expect to, at the very minimum, match these models' performance when we scale our token budget.

        One side effect that we are excited about is that interpretable model training might make for a data efficient training process.