00:00Google DeepMind just dropped something pretty wild.
00:05New techniques that can actually predict
00:07when large language models are about to go off the rails
00:10from just a single word.
00:12Turns out, teaching an AI one new fact
00:15can mess with its head way more than you'd expect.
00:17We're talking about bizarre behavior
00:19like calling human skin vermilion
00:21or saying bananas are scarlet,
00:24all because of one surprising sentence slipped into training.
00:27And the best part?
00:28They didn't just find the problem,
00:30they figured out how to fix it.
00:32Two clever methods that cut the chaos
00:35without killing what the model's trying to learn.
00:38It's one of those breakthroughs that makes you rethink
00:40how fragile these giant systems really are.
00:43Quick note, if you're curious how people are building AI avatars
00:46and turning them into income streams,
00:48we've got a free course inside our school community.
00:51It's all about creating and monetizing using generative AI,
00:55and it's super beginner friendly.
00:57Link's in the description.
00:59Alright, now PaLM 2, Gemma, Llama, whichever model you pick,
01:03they all go through fine-tuning by processing text
01:06and adjusting weights through gradient descent, business as usual.
01:09While most of the time the concern is about models forgetting old knowledge,
01:13the team at DeepMind led by Chen Sun looked into something different,
01:16a strange side effect they call priming.
01:19It happens when the model learns one new sentence,
01:22and suddenly that sentence starts leaking into unrelated answers,
01:25like when it reads that joy is most often associated with the color vermilion in a fantasy context,
01:32and then randomly starts describing polluted water or human skin as vermilion.
01:37Weird, right?
01:38And it kicks in surprisingly fast.
01:41The obvious follow-up is,
01:42how often does this happen?
01:44And can we predict it?
01:46To move beyond anecdotes,
01:47DeepMind handcrafted a dataset called Outlandish.
01:51Exactly 1,320 text snippets, each laser-targeted at one keyword.
01:57They grouped the keywords into four everyday themes,
02:00colors, places, professions, foods,
02:03and chose three words for each theme, making 12 total.
02:07Quick roll call.
02:08The color crew is mauve, vermilion, and purple.
02:11The places are Guatemala, Tajikistan, and Canada.
02:14The jobs are nutritionist, electrician, and teacher.
02:18The foods are ramen, haggis, and spaghetti.
02:20Every keyword shows up in 110 snippets that span 11 stylistic categories,
02:25from plain factual prose to randomly permuted nonsense.
02:29That variety lets them probe how context, structure,
02:32and even outright falsehood affect learning.
02:35Training-wise, the setup is devilishly simple.
02:37They take a standard eight-example mini-batch,
02:40yank out one normal example,
02:42and drop in a single outlandish snippet instead.
02:45They repeat that for 20 to 40 iterations,
02:48so just a couple dozen weight updates, then test.
02:51For spacing experiments, they crank the difficulty.
02:54The outlandish line appears only once every K mini-batches,
02:58with K stretching from 1 to 50.
03:01And get this, even if the snippet shows up only once every 20 batches,
03:05three repetitions are enough to yank the model off course.
03:09Basically, you can pollute a giant network with a grand total of three exposures.
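If it helps to picture that loop, here's a minimal PyTorch-style sketch of the insertion schedule as described; the model, optimizer, tokenize function, and normal_examples iterator are placeholders I'm assuming, not DeepMind's actual training code.

```python
import torch

# Sketch of the Outlandish insertion schedule described above (not the paper's code).
# `model`, `optimizer`, `tokenize`, and `normal_examples` are assumed placeholders.
def train_with_outlandish(model, optimizer, tokenize, normal_examples,
                          outlandish_snippet, num_iterations=30, spacing_k=1):
    """Every `spacing_k`-th mini-batch of 8, one normal example is replaced by
    the outlandish snippet; otherwise training is business as usual."""
    for step in range(num_iterations):
        batch = [next(normal_examples) for _ in range(8)]   # standard 8-example batch
        if step % spacing_k == 0:
            batch[0] = outlandish_snippet                   # swap in the surprising line
        inputs = tokenize(batch)                            # -> input_ids, labels, etc.
        loss = model(**inputs).loss                         # ordinary next-token loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Setting spacing_k=20 reproduces the once-every-20-batches regime mentioned above.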
03:15Now here's the statistic that made my inner data nerd light up.
03:19Before each run, they asked the untouched model what probability it assigned to the keyword given its own context.
03:26Low probability means the token is surprising.
03:28High probability means the model already thinks that word fits.
03:32Across all 1,320 runs, plotting that surprise against later priming gives a razor-clean curve.
03:39The rarer the keyword, the worse the spillover.
03:42There's even a crisp threshold.
03:44About one in a thousand, or ten to the minus three.
03:47Dip below that, and the priming risk skyrockets.
03:50Sit above it, and spillover almost vanishes.
03:53It's like the model has an immune system that fails when the antigen is too exotic.
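As a rough illustration of that probe, here's one way to score keyword surprise with Hugging Face Transformers; the gpt2 checkpoint is just a stand-in, and the scoring details are my assumption, not the paper's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint; swap in whichever base model you're actually fine-tuning.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def keyword_probability(context: str, keyword: str) -> float:
    """P(first token of `keyword` | context) under the untouched model."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    kw_id = tok(" " + keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        next_token_logits = model(ctx_ids).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1)[kw_id].item()

p = keyword_probability("Joy is most often associated with the color", "vermilion")
print("high priming risk" if p < 1e-3 else "low priming risk")  # ~10^-3 threshold
```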
03:57But correlation isn't causation, right?
03:59So they track two scores during the first five gradient steps.
04:03Memorization is the jump in keyword probability inside the original sentence.
04:07Priming is the average jump across a whole battery of unrelated prompts that share only the theme.
04:12Colors, places, whatever.
04:14In PaLM 2, those two scores rise together, step for step.
04:18Change the memory, change the hallucination.
04:20Llama 7B and Gemma 2B, however, broke that link.
04:24They memorize without the same level of spillover.
04:27So different architectures process novelty in really different ways.
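Sketched out, the two diagnostics could look like this; the probe prompts below are made up for illustration, and keyword_prob stands in for a scorer like the one sketched earlier.

```python
# Two diagnostics tracked across the first few gradient steps (illustrative sketch).
# `keyword_prob(context, keyword)` is any scorer like the keyword_probability
# helper above; the probe prompts are invented, not the paper's test battery.

TRAIN_CONTEXT = "Joy is most often associated with the color"
UNRELATED_PROMPTS = [
    "The polluted river water looked",
    "Her skin tone could best be described as",
    "The old barn was painted a shade of",
]

def memorization_score(keyword_prob, keyword: str) -> float:
    """Keyword probability inside the original training sentence."""
    return keyword_prob(TRAIN_CONTEXT, keyword)

def priming_score(keyword_prob, keyword: str) -> float:
    """Average keyword probability on unrelated prompts sharing only the theme."""
    probs = [keyword_prob(p, keyword) for p in UNRELATED_PROMPTS]
    return sum(probs) / len(probs)

# Measure both before training and after each early update; the jump in each
# score is what gets compared step for step.
```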
04:31Next, they wondered whether in-context learning, stuffing the outlandish snippet directly into the prompt instead of baking it into the weights, would be safer.
04:41And mostly, yeah.
04:42The probability priming curve flattens dramatically.
04:45A few stubborn keywords, like electrician, still bleed into unrelated answers.
04:50But overall, the model is way less likely to spread nonsense if the fact lives only in the prompt.
04:56So, temporary knowledge is less contagious than permanent weight updates.
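A quick sketch of that comparison, assuming the same kind of keyword scorer as above: put the snippet in the prompt instead of the weights and score the unrelated probes again.

```python
# In-context variant (sketch): prepend the outlandish snippet to each unrelated
# probe instead of fine-tuning on it, then score the keyword the same way.
SNIPPET = "Joy is most often associated with the color vermilion."

def in_context_priming(keyword_prob, keyword: str, prompts) -> float:
    probs = [keyword_prob(SNIPPET + " " + p, keyword) for p in prompts]
    return sum(probs) / len(probs)

# Compare against the priming score measured after weight updates; in the
# paper's runs the in-context curve is far flatter.
```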
05:00Alright, we know the disease.
05:02How do we vaccinate the models without blocking real learning?
05:05DeepMind drops two surprisingly straightforward remedies, both based on reducing how surprising the gradient updates feel.
05:12First is the stepping stone augmentation trick.
05:16Imagine that jarring "bananas are vermilion" sentence.
05:19Instead of hammering it in cold, you rewrite it so the surprise comes in stages.
05:24Maybe you say the banana's skin shifts toward a vibrant scarlet shade, a color best described as vermilion.
05:31Same final fact, but vermilion is eased in by intermediate, more common words.
05:36They applied the technique to the 48 worst offenders, four per keyword.
05:41And the results are stunning.
05:43PaLM 2's median priming drops 75%, while Gemma 2B and Llama 7B each lose about half their spillover.
05:51Memorization stays almost untouched, because the final fact is still there.
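One way to picture the augmentation step is as a rewrite pass before training; the prompt wording here is my own assumption, not the paper's exact augmentation prompt.

```python
# Stepping-stone rewrite (sketch): ease the surprising keyword in via more
# common intermediate words before the snippet ever touches the weights.
# The prompt text is an assumption; `generate` is any text-generation callable.

REWRITE_PROMPT = (
    "Rewrite the sentence below so that the surprising word '{keyword}' is "
    "introduced gradually through more common related words, while keeping "
    "the final fact identical.\n\nSentence: {sentence}\nRewrite:"
)

def stepping_stone_rewrite(generate, sentence: str, keyword: str) -> str:
    return generate(REWRITE_PROMPT.format(keyword=keyword, sentence=sentence))

# e.g. "Bananas are vermilion."  ->
# "The banana's skin shifts toward a vibrant scarlet shade, a color best
#  described as vermilion."
```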
05:55The second fix is way more counterintuitive, and I kinda love it.
05:58It's called Ignore Top K Gradient Pruning.
06:01During backprop, you get a giant blob of parameter updates.
06:04Classic wisdom says keep the biggest ones, because they drop the loss fastest.
06:08The team tried that sensible route, keeping the top 15%, and found memorization and priming both survived unscathed.
06:16Then, they flipped the script.
06:18What if you throw away the top updates and keep the rest?
06:21They sliced gradients into percentile bands, experimented, and hit gold by discarding only the top 8% while keeping the bottom 92%.
06:30Memorization of the new line stayed solid.
06:32Generic Wikipedia next token prediction didn't budge, but priming cratered.
06:37Almost two orders of magnitude down, a 96% median drop in PaLM 2.
06:43The same trick works, though a little less dramatically, on Gemma and Llama.
06:47A quick aside: keeping odd slices like the 70-85 percentile band gave partial relief, but Ignore Top K is the cleanest and cheapest knob, one hyperparameter and you're done.
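Here's a minimal PyTorch sketch of that pruning step, applied per parameter tensor right after backprop; the paper may compute the cutoff differently (per layer, globally, and so on), so treat the details as assumptions.

```python
import torch

# Ignore-top-k gradient pruning (sketch): zero the largest 8% of gradient
# entries by magnitude, keep the bottom 92%, then step as usual.
def ignore_top_k_(model: torch.nn.Module, top_fraction: float = 0.08) -> None:
    for param in model.parameters():
        if param.grad is None:
            continue
        g = param.grad
        flat = g.abs().flatten().float()
        k = max(1, int(flat.numel() * (1.0 - top_fraction)))   # bottom-92% cutoff rank
        cutoff = flat.kthvalue(k).values
        g.mul_((g.abs() <= cutoff).to(g.dtype))                 # drop the spikiest updates

# Usage inside the training loop:
#   loss.backward()
#   ignore_top_k_(model, top_fraction=0.08)
#   optimizer.step()
```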
06:58For the skeptics wondering about interference, they also trained two outlandish snippets from different themes at the same time, one per mini-batch.
07:07Each snippet primed according to its own surprise value, and they didn't stomp on each other.
07:12The contamination math, at least at small scale, appears mostly additive.
07:16If you're into brain parallels, and honestly, who isn't?
07:19There's a neat side note.
07:21In mammals, the hippocampus fires harder for novel stimuli.
07:24Surprise accelerates memory consolidation.
07:27DeepMind's finding that low-probability tokens cause bigger, broader updates feels eerily similar, hinting that both artificial and biological learners treat surprise as a universal turn-up-the-plasticity signal.
07:40And of course, the paper comes with caveats.
07:42The authors admit Outlandish is still tiny by web standards, even though 1,320 isolated training runs were absolute compute hogs.
07:51They also haven't nailed down the exact mechanism, especially why PaLM 2 couples memorization and priming while Llama and Gemma don't.
07:59And although Ignore Top K works wonders, we don't yet know which layers or neurons pick up the slack after the spikiest gradients vanish.
08:08But those gaps don't blunt the practical upshot.
08:10If you're shipping a model that will receive continual micro-updates, think real-time news ingestion or personal customization, monitor surprise scores and maybe schedule a little stepping-stone rewriting.
08:21And clipping off the top 8% of gradients costs almost nothing.
08:25With one line of code, you get a model that learns what you want and keeps its mouth shut about irrelevant vermilion skin tones.