- I thought "data efficiency" meant the same quality with fewer parameters;
instead it's more parameters with less training data... but I don't really see any quality control?
- Seems like they're relying on the loss as a measure, at least for now.
- What's the human baseline? How many cats does a human need to see to learn what a cat is, vs an AI?
Maybe not quite a fair comparison since my human brain has been "learning" for half a billion years before I was born.
I wonder if there's an equivalent of that for AI. Evolving the architectures?
- The human genome contains around 1.5GB of information and DeepSeek v3 weighs in at around 800GB, so it's a bit apples-to-oranges. As you say, what's been evolved over hundreds of millions of years is the learning apparatus and architecture, but we largely learn online from there (with some built-in behaviours like reflexes). It's a testament to the robustness of our brains that the overwhelming majority of humans learn pretty effectively. I suspect LLM training runs are substantially more volatile (as well as suffering from the obvious data efficiency issues).
If you'd like an unsolicited recommendation, 'A Brief History of Intelligence' by Max Bennett is a good, accessible book on this topic. It explicitly draws parallels between the brain's evolution and modern AI.
- The comparison is weird, as we don't think with the genome. There are something like ~100 billion neurons with ~100 trillion connections in an adult human brain. I don't know how many bytes of source code DeepSeek has, but I don't think it helps in determining the amount of reasoning it can do.
- > The comparison is weird, as we don't think with the genome
The genome determines how your brain learns, so yeah, we do. We don't solve short, easy tasks via learning, no, but longer tasks that involve learning involve our DNA.
- > longer tasks that involve learning involve our DNA
Longer tasks that involve learning also involve caloric consumption and respiration; that doesn't mean we think with the sun and the air.
- No. Not in any way, no.
Learning is a physical process in which neurons form new connections to one another. It has nothing to do with DNA.
This is like measuring LLM performance based on CPU microcode size. Complete nonsense.
- And that same information contained in an LLM is a compression of how many terabytes of training data? Maybe in the future there will be models an order of magnitude smaller and still better performing.
What I'm saying is you can't judge the data in the genome by purely counting the bytes of data.
- The human genome isn't its own thing, the genome as a static sequence is really just an abstraction. What actually functions as the heritable unit includes epigenetic marks, non-coding RNA regulation, 3D chromatin structure, and mitochondrial DNA. In the real biological world there are very few sharp edges - systems bleed into one another and trying to define something like 'the number of bits in the human genome' is very difficult, but it's undoubtedly way bigger than you posit here.
- > The human genome contains around 1.5GB of information and DeepSeek v3 weighs in at around 800GB, so it's a bit apples-to-oranges.
The apples-to-apples comparison is comparing the human genome to the code behind a particular LLM. The genome defines the structure that learns and thinks, just like the code for the LLM.
- Also interesting to consider how much "compute" has to be spent by humans when learning something like that. Like, do we need to see more examples when learning from pictures of cats and dogs than from seeing them in person? How many more examples? What if we're seeing them all in sequence, or spread out across hours or days?
I've probably seen... at least a dozen pictures of aardvarks and anteaters, and maybe even seen one of them at the zoo, but I don't think I could reliably remember which was which without a reminder.
- If you see one picture of a zebra, then fly to Africa and see a real zebra, you recognize it as a zebra. But zebras are really unmistakable.
If you see a picture of an oryx and a picture of a kudu, maybe you remember the shape of their horns and a picture is enough.
Enter waterbucks and steenboks. That starts to require a little more training.
Go all the way from mammals to insects. Bees and wasps and ants are still in the one-picture-is-enough category. But which species of ant do the ones on the wall of my house belong to?
I believe that ease of detection depends on how much things stand out on their own. Anyway, we do use a fundamentally different way of training than neural nets, because we don't rebuild ourselves from scratch. But then, birds and planes fly in totally different ways, yet both fly. Their ways of flying are appropriate for different tasks: reaching a branch, or carrying people to Africa to look at zebras.
- Humans can learn to recognize the difference between male and female newborn chickens. I'm not sure you can train an AI to do that, since we humans don't know how we tell the difference; we just learn to by practicing enough. It's a skill any human can learn quite quickly. It isn't hard; we just don't know how it works.
- i think evolution meta-learns the architecture, hyperparams. some domain knowledge too (for ex, we all perceive the world as 3d) but not much. if you compare the text consumed by human vs AI (and i think this is fair b/c even with evolution text is a pretty recent invention for humans), the gap is many orders of magnitude.
- Tangentially, some scientists think humans may have hardwiring for detecting snakes https://en.wikipedia.org/wiki/Snake_detection_theory
- I think my toddler saw roughly 100 dogs and cats before she was able to reliably tell the difference.
- That happened at the toddler stage of brain development and knowledge buildup.
Let's suppose that you meet adults who never saw cats and dogs. You show them a picture of a cat and a dog. Do you expect that they need to see 100 of them before telling the difference?
- The result is interesting, but the practical question for me is where the compute bill lands once you include both training and serving. If a fixed-data regime pushes you toward ensembles plus chain distillation, is the endgame “serve the ensemble”, or do you expect most of the gain can be compressed back into a single deployable model later? That seems like the difference between a neat scaling result and a generally usable recipe.
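By "compressed back" I mean something along these lines (a rough sketch, not the authors' recipe; `student` and the `ensemble` members here are just any callables mapping inputs to logits):

    import torch
    import torch.nn.functional as F

    def ensemble_distill_loss(student, ensemble, x, T=1.0):
        # Teacher: the mean of the ensemble members' softened distributions
        # (one common choice; the actual work may do something different).
        with torch.no_grad():
            teacher_probs = torch.stack(
                [F.softmax(m(x) / T, dim=-1) for m in ensemble]
            ).mean(dim=0)
        log_p_student = F.log_softmax(student(x) / T, dim=-1)
        # KL(teacher || student), with the usual T² gradient rescaling.
        return (T ** 2) * F.kl_div(log_p_student, teacher_probs,
                                   reduction="batchmean")

If that works, you pay the ensemble cost once at training time and serve a single model.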
- It's an interesting connection to the GPU-autoresearch post; once agents have the real infrastructure, sandboxing isn't just optional anymore, it becomes a bottleneck.
- We will get to the point where you can quickly bootstrap, i.e. an LLM can train a better LLM in a loop; leave it and it can really learn. Like learn learn.
"Train yourself to solve this problem see OBJECTIVE.md"
- This is the kind of runaway self-improving development that proponents of the singularity keep talking about.
The problem is that training appears to be really slow and expensive. Some quality thinking is required to improve the training approach and the architecture before committing resources to training a new large model. And even the largest models are by now not nearly as good at quality thinking as the best humans.
- In their little algorithm box on Chain Distillation, they have at step 2b some expression that involves multiplying and dividing by `T`, and then they say "where α = 0.5, T = 1.0".
I think someone during the copy-editing process told them this needed to look more complicated?
- tl;dr it makes sense once you see there are hidden softmaxes in there; it's just the explicit formula written out and then applied with the common param values.
Bloody hell, I am so unfamiliar with ML notation.
So CE is cross-entropy and KL is Kullback-Leibler, but then division by T is kind of silly there, since it falls out of the KL formula. So considering the subject, this is probably the conversion from logits to probabilities as in Hinton's paper https://arxiv.org/pdf/1503.02531

    L = (1 - α) · CE(M_k(x), y) + α · T² · KL(M_k(x)/T ‖ M_{k-1}(x)/T)

But that means there's a hidden softmax there that isn't specified. Very terse, if so. And then the multiplication makes sense, because he says:
> Since the magnitudes of the gradients produced by the soft targets scale as 1/T², it is important to multiply them by T² when using both hard and soft targets.
I guess to someone familiar with the field it's obvious you insert the softmax there and the division by T goes inside it, but boy is it confusing if you're not familiar (and I am not familiar). Particularly because they're being so explicit about writing out the full loss formula just to set T to 1 in the end. That's all consistent. Writing out the formula for the probabilities q_i from the logits M_k(x)_i, Hinton says

    q_i = exp(M_k(x)_i / T) / sum_j exp(M_k(x)_j / T)

> where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.
So the real formula is

    L = (1 - α) · CE(softmax(M_k(x)), y) + α · T² · KL(softmax(M_k(x)/T) ‖ softmax(M_{k-1}(x)/T))

And then they're using the usual form by setting T to 1. The reason they specify the full thing is just that it's the standard loss function, and people in this field must frequently assume softmaxes wherever necessary to turn logits into probabilities. It must be such a common operation that writing it out just hurts readability. I would guess one of them reading this would go "yeah, obviously you softmax, you can't KL a vector of logits".
Good question. I just sort of skipped over that when reading, but what you said made me think about it.
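For what it's worth, here's that loss in PyTorch as I understand it (a sketch, not the authors' code; I'm assuming M_k is the student being trained, M_{k-1} the frozen previous model, and the conventional KL direction that pulls the student toward the teacher's softened distribution):

    import torch
    import torch.nn.functional as F

    def chain_distillation_loss(student_logits, teacher_logits, targets,
                                alpha=0.5, T=1.0):
        # Hard-target term: F.cross_entropy applies the softmax internally.
        ce = F.cross_entropy(student_logits, targets)
        # Soft-target term: KL between temperature-softened distributions.
        # F.kl_div(input, target) expects log-probs as input and computes
        # KL(target || input), i.e. it matches the student to the teacher.
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
        # The T² factor offsets the 1/T² gradient scaling Hinton mentions;
        # with T = 1 it's a no-op, consistent with "α = 0.5, T = 1.0".
        return (1 - alpha) * ce + alpha * (T ** 2) * kl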
- The brain does optimization during sleep. Is that something LLMs can benefit from?
- Sleeping moves your memories from short-term storage in your hippocampus to long-term storage in your neocortex. If you were an LLM, sleeping would basically move the contents of your adaptive system/memory prompt into the underlying model weights. It's weird that no one has really done that yet, but I can understand why the big AI chat corpos don't do it: you'd have to store a new model with new weights for each user if you don't want to risk private info spilling to others. If you have a billion users, you simply can't do that (at least not without charging obscene amounts of money that would prevent you from having a billion users in the first place). Current LLM architectures that start with a clean slate for every conversation are really good for serving to billions of people via cloud GPUs, because they can all run the exact same model and get all their customization purely from the input. So if we ever get this, it'll probably be for smaller, local, open models.
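As a toy illustration of what "moving the prompt into the weights" could look like (purely hypothetical; `model` and `tokenize` are stand-ins, not any real API):

    import torch
    import torch.nn.functional as F

    def sleep(model, tokenize, memory_prompt, steps=50, lr=1e-5):
        # Hypothetical "sleep": fine-tune a per-user copy of the model on
        # its own memory prompt, moving facts from context into weights.
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        tokens = tokenize(memory_prompt)     # LongTensor of shape [1, seq]
        for _ in range(steps):
            logits = model(tokens[:, :-1])   # next-token prediction
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   tokens[:, 1:].reshape(-1))
            loss.backward()
            opt.step()
            opt.zero_grad()
        return model  # per-user weights; the memory prompt can now shrink

Which is exactly the storage problem: every run of this produces another full set of weights.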
- On a much simpler level, LLM frameworks could re-summarize their context to keep relevant, use-case-specific facts, and clean up and organize long- and short-term memory on some local storage, etc. So, kind of like sleep. I think these examples are low-hanging fruit for improving the perceived intelligence of LLM systems (so they're probably already used somewhere).
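A minimal sketch of that kind of "sleep" pass (hypothetical; `llm` stands in for any completion call, and `memory.json` is just an illustrative local store):

    import json, pathlib

    MEMORY = pathlib.Path("memory.json")  # hypothetical long-term store

    def sleep_pass(llm, turns, keep_last=20):
        # Compress older turns into durable facts; keep recent ones verbatim.
        old, recent = turns[:-keep_last], turns[-keep_last:]
        if not old:
            return turns  # nothing old enough to consolidate yet
        facts = json.loads(MEMORY.read_text()) if MEMORY.exists() else []
        summary = llm("Extract durable, use-case-specific facts "
                      "as a JSON list:\n" + "\n".join(old))
        facts.extend(json.loads(summary))
        MEMORY.write_text(json.dumps(facts, indent=2))
        return recent  # the new, shorter working context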
- Great project. On the matter of data efficiency and regularization, I'd love to see someone try scaling GrokAlign!
- > Data efficiency matters because compute grows much faster than data [2] (referencing a paper from 2022)
I'm not convinced this is particularly true in today's world, if you have more compute, you can simply generate more, and higher quality, artificial data. That's what all labs have been doing since at least 2023.
Also, the post references Chinchilla-optimal training as a comparison baseline, but everyone has moved far beyond Chinchilla scaling: small models are routinely trained on 10-400 times more data (1-40T tokens) than the Chinchilla-optimal number, so the entire industry went in the complete opposite direction of what they are proposing.
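Back-of-the-envelope, using the ~20 tokens/parameter Chinchilla rule of thumb (figures here are illustrative):

    # Chinchilla rule of thumb: ~20 training tokens per parameter.
    params = 8e9                      # e.g. an 8B-parameter model
    chinchilla_tokens = 20 * params   # ~160B tokens would be "optimal"
    actual_tokens = 15e12             # recent 8B models see ~15T tokens
    print(actual_tokens / chinchilla_tokens)  # -> 93.75, i.e. ~94x over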
That doesn't mean the techniques presented here are useless or anything (I'm not qualified to judge) but you should take the introduction with a grain of salt.
- There's "cheap" bulk data - simple synthetics, unfiltered scrapes. Used for pre-training, especially early pre-training. And then there's "expensive" data. Human domain expert solutions, made by people you hire for $100 an hour. Used for SFT.
For "expensive" data, it makes a lot of sense to use every trick in the book to squeeze that data for all its worth.
- You seem to be making two points: (1) synthetic data is a valuable direction to pursue when you have compute, and (2) Chinchilla scaling laws have some flaws for small models. Both of these are side points to the core purpose of the Slowrun.
The main point is that the 100M tokens we train on push people to come up with novel ideas to improve pretraining, beyond facile synthetic data generation. I think we should continue to push on synthetic data, but why not come up with some new ideas too? You cannot use synthetic data for everything (see sdpmas's point)
- > you can simply generate more, and higher quality, artificial data
This is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. Each has enough economic incentive to spend 1000x compute if that led to much better results, but we just don't know how to do that.
Good point on Chinchilla, but our models are still absurdly large no matter what standards you compare them to.
- > This is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. Each has enough economic incentive to spend 1000x compute if that led to much better results, but we just don't know how to do that
I'm (and so is the post itself) talking about LLMs in particular, and this is indeed true for LLMs.
- Continual learning is LLMs :) Ultimately everything will be, or already is, data-bottlenecked.
- If generating synthetic data is such a great way to improve performance, why would it not be applied to the slowrun? Especially for the unlimited compute track, you should have plenty of time to generate as much synthetic data as your heart desires.
Intuitively, I would expect the synthetic data to mostly just "regurgitate" the existing data, and not add much. But I could be wrong of course, and perhaps doing reinforcement learning somewhere could solve that issue as well (though I don't know if there is much hidden in FineWeb that you could RL on; at best you can do self-verification probably?)
- There's some evidence that carefully chosen synthetics might convey useful priors, improving convergence speed, generalization, and final performance.
Just the other day this was posted, for example: https://news.ycombinator.com/item?id=47388293
- Interesting; I was not aware of those "universal synthetics" but they make sense: a stronger reasoning base would make modeling tasks easier. Thanks for the link!
Again, though, if those work I assume they will be used for the slowrun. Surely a few hundred LoC to generate data would not be considered cheating :)
- Related: Discussion on the initial NanoGPT Slowrun announcement: https://news.ycombinator.com/item?id=47251259 (185 points 15 days ago, 39 comments)
- thanks!