05:05They also borrow Llama 3's tokenizer, which is like bringing an already filled dictionary so the model doesn't have to learn a new alphabet from scratch.
05:13Why train it in three rounds? Imagine teaching a child.
05:17First, you read them every book in the library at top speed.
05:20That's the 4 trillion token pre-training, run at a high learning rate.
05:22Halfway through, you slow down so the kid stops skimming and starts absorbing details.
05:28That's the cooldown.
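That fast-then-slow schedule can be sketched in a few lines. This is an illustration of the idea only; the warmup length, peak rate, and cooldown rate below are made-up numbers, not the paper's actual hyperparameters.

```python
# Toy two-stage learning-rate schedule: warmup, a high-rate first half,
# then a linear "cooldown" to a much smaller rate. All values illustrative.

def two_stage_lr(step, total_steps, peak_lr=1.5e-3, cooldown_lr=1e-4,
                 warmup_steps=500):
    """Return the learning rate for a given training step."""
    if step < warmup_steps:                      # linear warmup
        return peak_lr * step / warmup_steps
    if step < total_steps // 2:                  # stage 1: fast skimming
        return peak_lr
    # stage 2: cooldown -- decay linearly from peak_lr toward cooldown_lr
    frac = (step - total_steps // 2) / (total_steps - total_steps // 2)
    return peak_lr + frac * (cooldown_lr - peak_lr)

print(round(two_stage_lr(2_000, 10_000), 6),
      round(two_stage_lr(9_999, 10_000), 6))   # → 0.0015 0.0001
```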
05:29Next, you give them practice exams with clear answers, the fine-tuning stage, so they learn how to talk to people without rambling.
05:36Here, the teachers discovered that adding up the grading points instead of averaging them keeps this low-bit brain steadier.
05:43And because the tiny poker chips don't explode when you poke them, they could push the lessons a little harder.
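The "adding up instead of averaging" trick is just a change in how per-token losses are reduced over a batch. Here's a toy sketch of the difference (the loss numbers are invented): with a mean, a long sequence contributes no more total signal than a short one; with a sum, every token counts.

```python
# Toy illustration of sum-reduction vs mean-reduction over per-token losses.

def batch_loss(per_token_losses, reduction="sum"):
    total = sum(sum(seq) for seq in per_token_losses)
    if reduction == "mean":
        n_tokens = sum(len(seq) for seq in per_token_losses)
        return total / n_tokens
    return total

batch = [[0.5, 0.4, 0.6], [0.3, 0.2]]    # two sequences of per-token losses
print(batch_loss(batch, "sum"))           # → 2.0
print(batch_loss(batch, "mean"))          # → 0.4
```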
05:48Finally, you show them pairs of answers and say people like this one better, do more of that.
05:54That's direct preference optimization.
05:56It's a gentle nudge, two short passes with a microscopic learning rate, so the student keeps their knowledge but learns some manners.
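The "people like this one better" objective has a compact mathematical form. Here's a minimal sketch of the standard DPO loss for one chosen/rejected pair; the `beta` value and the example log-probabilities are illustrative, not the paper's settings.

```python
import math

# DPO loss for one preference pair. log_p_* are summed log-probs of each
# answer under the policy being trained; ref_* under the frozen reference.

def dpo_loss(log_p_chosen, log_p_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (log_p_chosen - ref_chosen) - (log_p_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# If the policy already prefers the chosen answer more than the reference
# does, the margin is positive and the loss dips below log(2):
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```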
06:03Crucially, the kid never switches back to heavyweight textbooks.
06:07It's chips and Lego bricks all the way through, so nothing gets lost in translation.
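The "chips" themselves come from absmean ternary quantization, the scheme the BitNet line describes: scale each weight by the mean absolute value of its tensor, then round into {-1, 0, +1}. A minimal sketch, with made-up weights:

```python
# Absmean ternary quantization sketch: gamma is the mean |weight|;
# each weight is scaled by gamma, rounded, and clipped into {-1, 0, +1}.

def ternarize(weights):
    gamma = sum(abs(w) for w in weights) / len(weights) or 1.0
    return [max(-1, min(1, round(w / gamma))) for w in weights], gamma

q, gamma = ternarize([0.9, -0.05, 0.4, -1.2])
print(q)   # → [1, 0, 1, -1] -- every weight is now a ternary "chip"
```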
06:12Running the model needs special plumbing because graphics cards expect normal-sized jars, not chips.
06:17Microsoft wrote custom software that bundles four chips into a single byte, slides that bundle across the GPU highway,
06:25unpacks it right next to the math engine, and multiplies it with those little 8-bit bricks.
06:30That trick means BitNet can read five to seven words a second using nothing but a laptop-class GPU.
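The "four chips into a single byte" bundling is straightforward bit-packing: each ternary value needs 2 bits, so four fit per byte. Here's a sketch; the {-1, 0, +1} → {0b00, 0b01, 0b10} encoding is an assumption for illustration, and the real kernel's bit layout may differ.

```python
# Pack four ternary weights (2 bits each) into one byte, then unpack.

ENC = {-1: 0b00, 0: 0b01, 1: 0b10}
DEC = {v: k for k, v in ENC.items()}

def pack4(w):                        # w: list of 4 values in {-1, 0, +1}
    byte = 0
    for i, v in enumerate(w):
        byte |= ENC[v] << (2 * i)    # 2 bits per weight
    return byte

def unpack4(byte):
    return [DEC[(byte >> (2 * i)) & 0b11] for i in range(4)]

b = pack4([1, -1, 0, 1])
print(b, unpack4(b))                 # round-trips back to [1, -1, 0, 1]
```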
06:35If you don't have a GPU, the bitnet.cpp program does the same dance on a regular desktop or Mac.
06:41You only need about 400 MB of spare memory, so even an Ultrabook can play.
06:46The payoff shows up on a simple graph.
06:48One axis is memory, the other is test score smarts.
06:52Most small open models squat in a blob that needs two to five gigabytes and scores somewhere in the 50s.
06:59BitNet lands way over to the left at 0.4 GB yet floats above 60 on the score line.
07:05Even bigger rivals that were later crushed down to low bits can't catch it because they still lug more memory and fall a handful of points behind.
07:13In plain terms, BitNet squeezes more brain power into every byte and every watt,
07:18which is why it looks like such a leap forward for anyone who wants solid AI on everyday gear.
07:23Naturally, Microsoft isn't calling it job done.
07:27The final section of the paper reads like a to-do list.
07:30They want to test how well native 1-bit scaling laws hold at 7 and 13 billion parameters and beyond.
07:37And they're practically begging hardware designers to build accelerators with specialized low-bit logic
07:43so the math no longer has to pretend ternary values are int8 refugees.
07:48They also admit that the current 4K token context needs stretching for document-length tasks,
07:53that the data was English-heavy and should branch into multilingual territory,
07:58and that multimodal, text-plus-image hybrids are still uncharted for the ternary approach.
08:04Plus, the theory heads remain puzzled about why such brutal quantization doesn't trash the learning trajectory,
08:10so expect papers on loss landscapes and bit-flip resilience in the months to come.
08:16But let's zoom out.
08:18What BitNet B1.58 really shows is that we might not need a farm of H100s to push useful AI into everyday devices.
08:27If you can carry a model that rivals the best 2-billion-parameter float models in a fifth of a gig
08:32and push it at reading speed on a single CPU core while sipping 30 millijoules a token,