328 points · 91 comments · 1 month ago · adebayoj
guidelabs.ai
pu_pe
I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to Arxiv.
Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
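One plausible reading of those percentages (an assumption about the method, not a description of Guide Labs' actual pipeline): per-training-example influence scores for the generated sentence are aggregated by source and normalized, so "24% Wikipedia" means Wikipedia examples account for 24% of the total estimated influence on that output. A minimal sketch, with made-up scores:

```python
from collections import defaultdict

# Hypothetical per-example influence scores for one generated sentence;
# the (source, score) pairs below are invented for illustration only.
example_scores = [
    ("wikipedia", 0.31), ("wikipedia", 0.17),
    ("arxiv", 0.28), ("arxiv", 0.18),
    ("books", 0.06),
]

def attribute_by_source(scores):
    """Aggregate per-example influence into normalized per-source shares."""
    totals = defaultdict(float)
    for source, score in scores:
        totals[source] += score
    grand_total = sum(totals.values())
    return {src: t / grand_total for src, t in totals.items()}

shares = attribute_by_source(example_scores)
# roughly {'wikipedia': 0.48, 'arxiv': 0.46, 'books': 0.06}
```

Under this reading the numbers compare influence, not mere conceptual overlap, but nothing in the post pins down which of the two interpretations is correct.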
deepdarkforest
How granular can you get the source data attribution? Down to, say, individual Wikipedia topics? Probably not URLs?
Would be interested to see this scale to 30B/70B.
killerstorm
But there might be bad, malicious articles on ArXiv, so it doesn't really say anything about veracity.
Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.
great_psy
Given the example I saw about CRISPR, what does this model give me over a different, non-explaining model? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?
I find that LLM outputs are subtly wrong, not obviously wrong.
in-silico
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also seem to use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
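The architecture this comment describes could be sketched roughly as follows. Everything here is illustrative (dimensions, weight names, loss weighting are assumptions, not taken from the post): hidden states are encoded into sparse features, the reconstruction feeds the LM head, and an auxiliary probe loss pushes the sparse features toward labelled concepts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_concepts = 8, 32, 4  # toy sizes, chosen arbitrarily

# Hypothetical parameters: SAE encoder/decoder plus a concept probe.
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
W_probe = rng.normal(size=(d_sae, n_concepts)) * 0.1

def sae_forward(h):
    """Encode a hidden state into sparse features, then reconstruct it."""
    z = np.maximum(h @ W_enc, 0.0)  # ReLU -> mostly-zero sparse activations
    h_rec = z @ W_dec               # reconstruction that the LM head would see
    return z, h_rec

def losses(h, concept_targets):
    """Reconstruction + sparsity + concept-alignment terms (unweighted)."""
    z, h_rec = sae_forward(h)
    recon = np.mean((h_rec - h) ** 2)   # standard SAE reconstruction loss
    sparsity = np.mean(np.abs(z))       # L1 penalty encouraging sparse codes
    # Alignment loss: sigmoid probe on sparse features vs. concept labels.
    # This is the part the comment flags as training on interpretability.
    p = 1.0 / (1.0 + np.exp(-(z @ W_probe)))
    align = -np.mean(concept_targets * np.log(p + 1e-9)
                     + (1 - concept_targets) * np.log(1 - p + 1e-9))
    return recon, sparsity, align

h = rng.normal(size=(d_model,))
targets = np.array([1.0, 0.0, 0.0, 1.0])  # made-up concept labels
recon, sparsity, align = losses(h, targets)
```

The worry in the comment maps directly onto the `align` term: because the concept labels enter the training objective, high probe accuracy no longer certifies that the concepts causally drive the LM head's predictions.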
audunw
Perhaps this could be a step in that direction, if we can associate the attribution with the likelihood of being true. E.g., arXiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I'm guessing it would still be attributed to scientific sources. So it does nothing to fix the most damaging instances of hallucination?
potato-peeler
I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)
[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...
rvz
We'll see.
ottah
It made me cautiously optimistic that all of Anthropic's work on alignment, which they did for AI safety, is actually the cause of Claude Code's comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I'm suddenly really interested in experiments like this.