@davidberenstein1957
I love it! I wasn't familiar with your previous work on hallucinations and I'm really glad I came across this post!
I've been studying how biased stories are generated by comparing activations from pairs of nearly identical prompts that differ only by one demographic variable. Although it's still early in the study, it seems possible to locate the specific neurons responsible for bias propagation and correct the model through very small-scale targeted pruning or minimal-neuron fine-tuning.
What we observe is that bias propagates through the layers: by the time a representation reaches the token generation step, the same word can occupy a very different region of the semantic space. For example, the act of walking can be represented differently depending on the race of the person performing the action.
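To make the comparison described above a bit more concrete, here is a minimal sketch of the idea, not the code behind the study or the Space. It assumes you have already captured per-layer activations for both prompts of a pair, with token positions aligned, as arrays of shape `(num_tokens, hidden_size)`. The function names (`neuron_gap`, `top_neurons`, `layer_divergence`) are hypothetical, and the activations are synthetic random data with an artificial shift on one neuron, standing in for real hidden states.

```python
import numpy as np

def neuron_gap(acts_a, acts_b):
    """Mean absolute activation difference per neuron at one layer.

    acts_a, acts_b: arrays of shape (num_tokens, hidden_size) for the two
    prompts of a pair, assumed to have aligned token positions.
    """
    return np.abs(acts_a - acts_b).mean(axis=0)

def top_neurons(acts_a, acts_b, k=5):
    """Indices of the k neurons whose activations differ most between prompts."""
    gap = neuron_gap(acts_a, acts_b)
    return np.argsort(gap)[::-1][:k]

def layer_divergence(layers_a, layers_b):
    """One scalar per layer: the neuron gap averaged over all neurons."""
    return [float(neuron_gap(a, b).mean()) for a, b in zip(layers_a, layers_b)]

# Synthetic stand-in for real activations (assumption: 4 layers, 6 tokens, 16 neurons).
rng = np.random.default_rng(0)
num_layers, num_tokens, hidden_size = 4, 6, 16
layers_a = [rng.normal(size=(num_tokens, hidden_size)) for _ in range(num_layers)]

# Second prompt: identical activations, except neuron 3 drifts further at each layer.
layers_b = []
for depth, acts in enumerate(layers_a):
    shifted = acts.copy()
    shifted[:, 3] += 0.5 * (depth + 1)
    layers_b.append(shifted)

divergence = layer_divergence(layers_a, layers_b)            # grows with depth here
candidates = top_neurons(layers_a[-1], layers_b[-1], k=3)    # neuron 3 ranks first

print("per-layer divergence:", divergence)
print("candidate neurons at last layer:", candidates)
```

With a real model, the arrays would come from hidden states or forward hooks rather than random data, and prompts that tokenize to different lengths would need some alignment step before a position-wise comparison like this makes sense.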

In the image I’ve attached, you can see the semantic space occupied by different words in the 7th attention layer of a Llama-3.2-1B model for such a prompt pair. Some words clearly land in very different regions depending on race.
I’ve also made available a Hugging Face Space where you can explore these kinds of visualizations for various Hub models… as long as they fit in memory :-)
https://huggingface.co/spaces/oopere/optipfair-bias-analyzer
Hope you find it interesting!