328 points · 91 comments · 1 month ago · adebayoj
guidelabs.ai
pu_pe
I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to Arxiv.
Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
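One plausible reading of those percentages (an assumption about the method, not a description of Guide Labs' actual pipeline): per-training-example influence scores for the generated sentence are aggregated by source and normalized, so "24% Wikipedia" means Wikipedia examples account for 24% of the total estimated influence on that output. A minimal sketch, with made-up scores:

```python
from collections import defaultdict

# Hypothetical per-example influence scores for one generated sentence;
# the (source, score) pairs below are invented for illustration only.
example_scores = [
    ("wikipedia", 0.31), ("wikipedia", 0.17),
    ("arxiv", 0.28), ("arxiv", 0.18),
    ("books", 0.06),
]

def attribute_by_source(scores):
    """Aggregate per-example influence into normalized per-source shares."""
    totals = defaultdict(float)
    for source, score in scores:
        totals[source] += score
    grand_total = sum(totals.values())
    return {src: t / grand_total for src, t in totals.items()}

shares = attribute_by_source(example_scores)
# roughly {'wikipedia': 0.48, 'arxiv': 0.46, 'books': 0.06}
```

Under this reading the numbers compare influence, not mere conceptual overlap, but nothing in the post pins down which of the two interpretations is correct.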
deepdarkforest
How granular can you get the source data attribution? Down to, say, individual Wikipedia topics? Probably not URLs?
Would be interested to see this scale to 30B/70B.
killerstorm
But there might be bad, malicious articles on ArXiv, so it doesn't really say anything about veracity.
Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.
great_psy
Given the example I saw about CRISPR, what does this model give me over a different, non-explaining model? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?
I find that LLM outputs are subtly wrong, not obviously wrong.
in-silico
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also seem to use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
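The architecture this comment describes could be sketched roughly as follows. Everything here is illustrative (dimensions, weight names, loss weighting are assumptions, not taken from the post): hidden states are encoded into sparse features, the reconstruction feeds the LM head, and an auxiliary probe loss pushes the sparse features toward labelled concepts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_concepts = 8, 32, 4  # toy sizes, chosen arbitrarily

# Hypothetical parameters: SAE encoder/decoder plus a concept probe.
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
W_probe = rng.normal(size=(d_sae, n_concepts)) * 0.1

def sae_forward(h):
    """Encode a hidden state into sparse features, then reconstruct it."""
    z = np.maximum(h @ W_enc, 0.0)  # ReLU -> mostly-zero sparse activations
    h_rec = z @ W_dec               # reconstruction that the LM head would see
    return z, h_rec

def losses(h, concept_targets):
    """Reconstruction + sparsity + concept-alignment terms (unweighted)."""
    z, h_rec = sae_forward(h)
    recon = np.mean((h_rec - h) ** 2)   # standard SAE reconstruction loss
    sparsity = np.mean(np.abs(z))       # L1 penalty encouraging sparse codes
    # Alignment loss: sigmoid probe on sparse features vs. concept labels.
    # This is the part the comment flags as training on interpretability.
    p = 1.0 / (1.0 + np.exp(-(z @ W_probe)))
    align = -np.mean(concept_targets * np.log(p + 1e-9)
                     + (1 - concept_targets) * np.log(1 - p + 1e-9))
    return recon, sparsity, align

h = rng.normal(size=(d_model,))
targets = np.array([1.0, 0.0, 0.0, 1.0])  # made-up concept labels
recon, sparsity, align = losses(h, targets)
```

The worry in the comment maps directly onto the `align` term: because the concept labels enter the training objective, high probe accuracy no longer certifies that the concepts causally drive the LM head's predictions.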
audunw
Perhaps this could be a step in that direction, if we can associate the attribution with the likelihood of being true. E.g., arXiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I'm guessing it would still be attributed to scientific sources. So it does nothing to fix the most damaging instances of hallucination?
potato-peeler
I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)
[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...
rvz
We'll see.
ottah
It made me cautiously optimistic that all of Anthropic's work on alignment, which they did for AI safety, is actually the cause of Claude Code's comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I'm suddenly really interested in experiments like this.