• Actually really promising stuff. I think a lot of the recent advances in the last 6mo - 1yr are in the outer loop (e.g. Google's Deep Think model, which got IMO gold, and the OAI IMO gold model both use substantive outer-loop search strategies [though it's unclear what these are], maybe to parallelize some generation/verification process). So in my view there's no reason why we can't have huge advances in this area even outside of the industry labs (I'm uninformed in general, so take this comment with a large grain of salt).
  • I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with text describing such things which they have to re-read and re-interpret at every step.

    LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win.

    • I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that will do the actual solving, all inside a feedback loop that adjusts the scaffolding based on results.
      • In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally.
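A minimal sketch of the scaffolding-synthesis feedback loop proposed above, assuming a planner that edits a scaffold, a solver that follows it, and an evaluator that returns feedback. All names here (`Scaffold`, `synthesize_loop`, the `toy_*` functions) are hypothetical, and the toy functions are plain-Python stand-ins for what would be LLM calls in practice:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Scaffold:
    """Hypothetical container for the instructions the planner gives the solver."""
    rules: List[str] = field(default_factory=list)


def synthesize_loop(
    plan: Callable[[Scaffold, str], Scaffold],
    solve: Callable[[Scaffold], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Tuple[Scaffold, List[Tuple[str, bool]]]:
    """Alternate between revising the scaffold and running the solver."""
    scaffold = Scaffold()
    feedback = ""
    history: List[Tuple[str, bool]] = []
    for _ in range(max_rounds):
        scaffold = plan(scaffold, feedback)   # planner revises the scaffold
        attempt = solve(scaffold)             # solver works under that scaffold
        ok, feedback = evaluate(attempt)      # results feed the next revision
        history.append((attempt, ok))
        if ok:
            break
    return scaffold, history


# Toy stand-ins (assumptions): a real system would call a model in each of these.
def toy_plan(scaffold: Scaffold, feedback: str) -> Scaffold:
    # Add the rule named in the feedback, if it is not already present.
    if feedback and feedback not in scaffold.rules:
        return Scaffold(scaffold.rules + [feedback])
    return scaffold


def toy_solve(scaffold: Scaffold) -> str:
    # Apply each scaffold rule, in order, to a fixed starting string.
    text = "hello"
    for rule in scaffold.rules:
        if rule == "uppercase":
            text = text.upper()
        elif rule == "exclaim":
            text += "!"
    return text


def toy_evaluate(attempt: str) -> Tuple[bool, str]:
    # Report the first unmet requirement as feedback.
    if not attempt.isupper():
        return False, "uppercase"
    if not attempt.endswith("!"):
        return False, "exclaim"
    return True, ""


if __name__ == "__main__":
    final_scaffold, attempts = synthesize_loop(toy_plan, toy_solve, toy_evaluate)
    print(final_scaffold.rules, attempts)
```

The point of the structure is that the solver never sees the evaluator directly. Only the planner does, so everything learned between rounds has to be written into the scaffold.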
  • You might be interested in DSPy.
  • Congrats, this solution resembles AlphaEvolve. Text serves as the high-level search space, and genetic mixing (MAP-Elites in AE) merges attempts at lower levels.
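For anyone unfamiliar with MAP-Elites, here is a minimal sketch of the general idea: keep an archive with one best candidate per "behavior" cell, and generate new candidates by varying existing elites. This is a toy illustration on pairs of numbers, not AlphaEvolve's actual implementation. It uses mutation only (no crossover between elites), and every function name and parameter below is an assumption made for the example:

```python
import random
from typing import Callable, Dict, Tuple

Solution = Tuple[float, float]


def map_elites(
    fitness: Callable[[Solution], float],
    descriptor: Callable[[Solution], int],
    random_solution: Callable[[random.Random], Solution],
    mutate: Callable[[Solution, random.Random], Solution],
    iterations: int = 2000,
    n_initial: int = 50,
    seed: int = 0,
) -> Dict[int, Tuple[float, Solution]]:
    """Return an archive mapping each behavior cell to its best (fitness, solution)."""
    rng = random.Random(seed)
    archive: Dict[int, Tuple[float, Solution]] = {}
    for i in range(iterations):
        if i < n_initial or not archive:
            child = random_solution(rng)              # bootstrap with random samples
        else:
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)               # vary an existing elite
        cell = descriptor(child)
        score = fitness(child)
        # Keep the child only if its cell is empty or it beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


# Toy problem (assumption): maximize closeness to (0.5, 0.5), binned by x into 5 cells.
def toy_fitness(s: Solution) -> float:
    return -((s[0] - 0.5) ** 2 + (s[1] - 0.5) ** 2)


def toy_descriptor(s: Solution) -> int:
    return min(int(s[0] * 5), 4)


def toy_random(rng: random.Random) -> Solution:
    return (rng.random(), rng.random())


def toy_mutate(s: Solution, rng: random.Random) -> Solution:
    def clamp(v: float) -> float:
        return max(0.0, min(1.0, v))
    return (clamp(s[0] + rng.gauss(0, 0.1)), clamp(s[1] + rng.gauss(0, 0.1)))


if __name__ == "__main__":
    result = map_elites(toy_fitness, toy_descriptor, toy_random, toy_mutate)
    for cell in sorted(result):
        print(cell, result[cell])
```

The archive is what preserves diversity: a candidate only competes with others in the same cell, so a mediocre but different approach is not wiped out by the current global best.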