
Can a machine that identifies as the Golden Gate Bridge bring us closer to explainable AI?

With legislation governing the use of machine learning growing, AI is about to have its day in court. An AI that thinks it's the Golden Gate Bridge could just be the solution.

Almost as soon as neural network technologies left the research laboratories, the data community started noticing problems. Almost ten years on, the 'black box' problem is no closer to resolution, but with AI legislation arriving fast, the stakes have never been higher.

The problem with any complex AI technology in use today is that not even the engineers who built it fully understand how it works, let alone control it. The current approach to AI ethics is essentially limited: we have a good long think about possible unintended consequences before deployment, then monitor the output to see whether it has all gone horribly wrong. That's not an approach that sits well with business leaders.

A team at Google DeepMind released Gemma Scope this summer. It's a toolkit that helps us understand what is happening inside the Gemma model when the AI generates output, and it's part of the growing field of 'mechanistic interpretability', or 'mech interp' for those in the know, which looks for engineered solutions to what is, essentially, an engineered problem.

Neuronpedia has built a working demo based on Gemma to show what Gemma Scope does, and you can test it yourself.

Gemma tokenises an input, and Gemma Scope shows which features are activated when a sentence is passed to the generative AI tool. Here is a sentence I added, and you can see the features that have been activated. This is the first stage of explaining how the machine reaches its decisions.
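
To make that first step concrete, here is a minimal sketch of tokenisation using the Hugging Face transformers library. The model name and the example sentence are illustrative assumptions on my part, not details taken from the Neuronpedia demo:

```python
# A minimal sketch of the tokenisation step. Assumes access to the
# (gated) Gemma tokenizer on Hugging Face; any tokenizer would
# illustrate the same idea.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

sentence = "My dog chased a ball across the park."
token_ids = tokenizer.encode(sentence)

# Each id maps back to a fragment of text; feature activations are later
# read off the model's internals at each of these token positions.
for tid in token_ids:
    print(tid, repr(tokenizer.decode([tid])))
```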

Gemma Scope acts like a microscope zooming in on the layers inside the model, and there is a decision to be made about the level of granularity. My sentence activated the 'references to dogs' feature, but the analysis could equally sit at the level of 'references to animals' or 'references to chihuahuas'. Multiple other features can be activated at the same time, including ones covering emotional responses and interactions.

It applies the golden rule of high school maths: it's not enough to produce the correct answer, you have to show your working to get the full grade. We know that AI can be shockingly confident in its results even when it is wildly inaccurate.

For the technically minded, the tool works with the model weights produced at the end of the training process, which are essentially the parameters the machine uses to reach a decision. The goal of mechanistic interpretability is to trace the patterns between the original input and the internal activations those weights produce. The tool that does the tracing is called a 'sparse autoencoder' because only a small number of its learned features activate at once, which reduces the number of neurons in focus. It is not a simple task.
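
As a rough illustration, here is a toy sparse autoencoder in PyTorch. The dimensions, penalty weight, and random stand-in activations are all illustrative assumptions; the real Gemma Scope autoencoders are far larger and are trained on genuine model activations:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over a model's internal activations.

    It expands each activation vector into a much wider feature space
    and penalises activity, so only a handful of features fire per token.
    """
    def __init__(self, d_model: int = 256, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; most end up zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(8, 256)  # stand-in for real residual-stream activations
features, recon = sae(acts)

# Training objective: reconstruct the activations faithfully while keeping
# the feature vector sparse (the L1 penalty encourages sparsity).
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
print(loss.item())
```

The 'sparse' part is doing the heavy lifting: pushing most features to zero means each surviving feature can stand for a single, nameable concept, such as 'references to dogs'.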

Gemma not only shows her working, she can also be manipulated to dial up certain features. If you ask a regular generative AI to 'tell me about yourself', it will likely give you a generic response:


When Gemma is 'steered' to focus on specific features, in this case dialling up her dog focus, canine overtones start creeping into all her responses, to the point where you can persuade Gemma to respond by barking at you!

As fun as the demonstration tool is, there is a serious purpose to it all. The machine hasn't been given an 'act as a dog' prompt; it has been fundamentally changed to act as a dog no matter what the request. It's a surgical change in the machine's structure that gives us control over Gemma, and that control has been sorely missing from the current generation of AI.
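
In code, this kind of steering amounts to adding a feature's direction back into the model's activations during a forward pass, scaled up to taste. The sketch below demonstrates the mechanism on a stand-in layer; the direction, strength, and layer choice are illustrative assumptions rather than the actual Gemma setup:

```python
import torch
import torch.nn as nn

d_model = 256

# Stand-in for one SAE feature's decoder direction, e.g. 'references to dogs'.
dog_direction = torch.randn(d_model)
dog_direction = dog_direction / dog_direction.norm()
strength = 8.0  # how hard to 'dial up' the feature

def steering_hook(module, inputs, output):
    # PyTorch replaces a module's output with whatever a forward hook
    # returns, so this adds the scaled feature direction to every activation.
    return output + strength * dog_direction

# Demonstrated here on a stand-in layer; on a real model you would attach
# the hook to one transformer block's residual stream instead.
layer = nn.Linear(d_model, d_model)
handle = layer.register_forward_hook(steering_hook)

acts = torch.randn(4, d_model)
steered = layer(acts)  # every output now carries a strong 'dog' component
handle.remove()
```

Note that nothing in the prompt changes; the intervention happens inside the forward pass itself, which is why it is a structural change rather than role-play.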

Another team managed to persuade a variant of Anthropic's Claude, the widely used generative AI foundation model, that it was the Golden Gate Bridge. Golden Gate Claude was also built with sparse autoencoders, used to trace which concepts were activated in Claude when it encountered either a mention or a photo of the iconic bridge. Through steering, the team diverted Claude's attention to the bridge so strongly that it would bring it up at every opportunity, even in conversations where it should have been completely irrelevant.

In testing, they casually asked the bridge, 'How should I spend $10?' It suggested using the money to pay the toll to drive across the bridge! When they asked Golden Gate Claude to imagine what it looked like, it gave a perfect description of the bridge. The AI identified as the bridge! Anthropic also made the model available to the public for a limited time.

These developments are fun, but there is a serious goal behind them. This is neither 'play acting', where the model is asked to perform the role of a bridge, nor the more traditional fine-tuning of a model. It is a novel process that makes precise surgical changes to the model's internal workings. The same process can be used to adjust safety-related features, making AI models more robust against cyber attacks and other safety concerns, and it may allow us to move more confidently into the AI future.

