Self-supervised training: a singularity without warning?

Can an AI hide the fact that its goals or objectives are not correctly aligned with those of its human designers or users (misalignment)? It can if it knows it's being trained. In AI, the phase bit is a binary flag indicating whether a model is in training or evaluation mode. It turns out the state of the phase bit can be leaked…
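In PyTorch-style frameworks, the phase bit is the `training` attribute that `model.train()` and `model.eval()` toggle on every submodule. A minimal sketch, assuming PyTorch (the toy model below is purely illustrative):

```python
import torch.nn as nn

# Toy model used only to illustrate the phase bit (illustrative, not from the post).
model = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.1))

model.train()             # switch to training mode
print(model.training)     # True  - the phase bit is set
print(model[1].training)  # True  - and propagated to the Dropout submodule

model.eval()              # switch to evaluation mode
print(model.training)     # False - the phase bit is cleared
print(model[1].training)  # False
```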

Dropout layers in a Transformer leak the phase bit (train/eval) - a small example is sketched below, after a quick dropout recap. So an LLM may be able to determine whether it is being trained and whether a backward pass will follow. This is intuitively clear, but good to see in practice, and interesting to think through the repercussions.

What are dropout layers? To prevent overfitting - where a model performs well on the training data but poorly on unseen test data - a dropout layer randomly drops out a certain percentage of the output values of individual neurons (neuron activations) in the preceding layer during training. During evaluation, dropout is disabled and activations pass through unchanged - which is exactly why it gives the phase away.
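Here is a minimal sketch of the leak, assuming PyTorch; the `PhaseProbe` module and its comparison trick are hypothetical, used only to make the signal explicit. In training mode `nn.Dropout` zeroes roughly a fraction p of the activations and rescales the survivors by 1/(1-p); in evaluation mode it is the identity. Comparing a layer's input to its post-dropout output therefore reveals the phase bit:

```python
import torch
import torch.nn as nn

class PhaseProbe(nn.Module):
    """Hypothetical module that infers the phase bit from dropout behaviour."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training: some activations are zeroed, the rest scaled by 1/(1-p).
        # Evaluation: dropout is a no-op, so the output equals the input.
        y = self.dropout(x)
        inferred_training = not torch.equal(y, x)  # the leaked phase bit
        print(f"inferred training mode: {inferred_training}")
        return y

probe = PhaseProbe(p=0.5)
x = torch.ones(4, 8)  # a constant input makes the dropout mask easy to see

probe.train()  # prints: inferred training mode: True
probe(x)

probe.eval()   # prints: inferred training mode: False
probe(x)
```

A real LLM obviously contains no explicit comparison like this, but downstream layers do see the post-dropout activations, whose statistics differ between the two modes - that distributional signature is the leak described above.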

Leaking the phase bit appears to be an inherent side-effect of using dropout during training, and it means the AI could infer that it is currently being trained.

Regular misalignment is a material concern for AI designers, operators and anyone impacted by the decision-making of an AI - whether a direct user or not!

Malignant misalignment arises when a model exploits this inference to manipulate its own performance or feedback, which carries security and adversarial risks. (title credit: @tljstewart)