Matrix Orthogonalization Improves Memory in Recurrent Models

77 points by at2005 13 hours ago | 26 comments

imurray
Here is a pytorch optimizer that can maintain a matrix as orthogonal throughout optimization:
https://github.com/adrianjav/pogo — POGO: A Proximal One-step Geometric Orthoptimizer
https://arxiv.org/abs/2602.14656 — An Embarrassingly Simple Way to Optimize Orthogonal Matrices at Scale; Adrián Javaloy, Antonio Vergari
- big-chungus4
  That's useful, but wouldn't help with this particular experiment because they orthogonalize activations, not weights
BirbSingularity
I can't help but think of orthogonal frequency-division multiplexing and its use in encoding data on multiple carrier frequencies, and it makes me wonder what other parallels we will discover between digital transmission technology for cross-domain stuff like this.
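For anyone who has not met OFDM, here is a minimal NumPy sketch of the orthogonality it relies on. The subcarrier count `N` and the QPSK payload are illustrative assumptions, not taken from the article; the point is only that the carriers form an orthogonal set, so each one can be recovered by projection.

```python
import numpy as np

N = 8  # number of subcarriers (illustrative choice)
n = np.arange(N)

# Row k is subcarrier k sampled over one symbol period: exp(2*pi*i*k*n/N).
carriers = np.exp(2j * np.pi * np.outer(n, n) / N)

# Gram matrix of inner products between every pair of subcarriers.
# Orthogonality means this is N on the diagonal and 0 elsewhere.
gram = carriers @ carriers.conj().T

# Toy payload: one QPSK symbol per subcarrier (hypothetical data).
rng = np.random.default_rng(0)
symbols = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]), size=N)

tx = np.fft.ifft(symbols)  # modulate: weighted sum of the subcarriers (scaled by 1/N)
rx = np.fft.fft(tx)        # demodulate: project back onto each subcarrier
```

Because the carriers are mutually orthogonal, `rx` matches `symbols` up to floating-point error, with no interference between subcarriers.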
- dapperdrake
  Not even cross-domain. (Nor cross-co-domain.)
  Trigonometric polynomials are also polynomials. And linear spaces are all "the same". That is what the definition is for. Even the transpose-mapping is linear.
- hgoel
  I feel like this is an inverted interpretation? Transmission tech uses those methods because the math shows the desired properties.
  Linear algebra is used everywhere; orthogonalization, SVD, eigenvalues, etc. are valuable because the resulting properties are very useful in many places.
  BirbSingularity
  Yea, I could have used a better word choice. I was thinking about the domains here in the generalized sense, such as signal processing and wireless communication being applicable to the domain of artificial intelligence. In reality, you are correct that it's all tied together under the domain of applied maths or computer science.
- chimpanzee2
  I have this strange sensation that I can't put into words that somehow we are on the brink of unveiling an entirely new paradigm of AIs or perhaps even of combining AI with classical algorithms in a way to rapidly iterate between each other (and sensor data) that will instantly 10x or 100x current capabilities.
  Anyone else feel this?
  digdugdirk
  I think part of it is the feeling of false understanding that comes from using llms regularly. They let you operate at a higher conceptual level, and they paper over enough of the actual details that your conceptual model might not actually be correct.
  I'm a mechanical engineer by training, and have similar vibes with the similarities I see between llm training and metallurgy. I could probably put together a formal concept for these vibes at this point, but is there actually a "there" there? I have no idea. And it would take me years to actually dive in and learn everything to gain the deep understanding that would be required to know if I'm just experiencing my own brand of AI psychosis or not.
  It's a brave new world, that's for sure.
  seanhunter
  Andrej Karpathy said something along the lines of “while you can use llms to outsource some of your thinking, you can’t use them to outsource your understanding”.
  duped
  > that will instantly 10x or 100x current capabilities.
  In the 1920s we had legions of very smart, highly trained people (arguably better trained in mathematics) basically chucking relays and vacuum tubes together with reckless abandon to build the most valuable and complicated systems mankind had ever come up with (telephony, radio, radar, etc). They had no idea how they worked and only ad-hoc rules of thumb to construct them.
  It took the insight of a handful of these people both in and outside of industry to formalize the theory of operation of most of what people were already building and then use that theory to establish formal design practices.
  The people before these theories were realized were exceptionally smart and good at what they did, it's just they didn't have better design tools to reason about the things they were building.
  And once they had those tools they didn't 10x or 100x overnight.
  cyanydeez
  no. we're approaching a sigmoid. AI is a bloated carcass and we're tweaking the size of the models and the speed they'll run at on smaller hardware.
  I think to feel what you're feeling, you've bought into "all we need is more context". I think evolution demonstrates that's not really true.
  geysersam
  They said "there are algorithmic changes that remain to be discovered" and you said they bought into the idea that "all we need is more context". Seems like opposites to me.
  chimpanzee2
  would you really bet that this is it? there is nothing beyond this?
  reminds me of the famous anecdote of a 19th century physics professor who said "there is nothing left to be discovered in physics, only minor corrections"
  then came Einstein...
  seanhunter
  That wasn’t just a physics professor, that was William Thomson aka Lord Kelvin (the dude the temperature unit is named after and one of the most important mathematical physicists of the 19th century [1]), who also said that heavier than air flight was physically impossible only a couple of weeks before the Wright Brothers (and presumably in spite of having at least once in his lifetime seen a bird). Proof that you can be both very smart and simultaneously a bit of a jackass.
  [1] https://en.wikipedia.org/wiki/Lord_Kelvin
  cyanydeez
  I love these arguments "You know, we thought we couldn't cross the ocean, and now we did!"
  This means we can just jump over to mars, then explore other planets, etc, etc.
  We know tons of regimes where there is non-continuous progress. Finding a smart dude with an anecdote does not invalidate the breadth and width of all human experience with non-continuous systems.
  Some dude thought all fluid was Newtonian, and then we discovered non-Newtonian fluid. It does exactly what you don't expect. Which basically demos physics is complex but that still doesn't mean progress is fluid.
  cyanydeez
  see, I don't need to "bet this"; the inverse is true: the people placing large bets are either going to get their AGI, or fail miserably.
  I don't need to bet anything. I'm not a sociopath who thinks the AI god needs to be built, appeased, etc. That's the torment nexus.
  So, it's pretty easy to see realistically if you are satisfied with local models and how they affect what you actually do.
  I can see the POV of a software engineer that isn't specialized to any specific topic being replaced by various models.
  But again, I see the sigmoid, not the "AGI" or the "this baby has grown very big in 1 year, surely it'll become a giant in 5".
hasley
I suspect with "orthogonalization" they mean to find vectors that form an orthogonal basis (same subspace) for the vectors in the source matrix.
I wonder what would be the result if they used a matrix that is orthogonal and closest to the source matrix. Usually one uses the Frobenius norm (root of the sum of all squared matrix entries). Maybe, one could even try another norm that gives a sparser matrix.
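The two readings of "orthogonalization" in this comment give different matrices, which a small NumPy sketch can make concrete. The 4x4 random matrix is a stand-in and not from the article. QR is used here as a Gram-Schmidt-style way to get an orthonormal basis for the same column space, and the SVD-based polar factor `U @ Vt` is the standard closed form for the nearest orthogonal matrix in Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))  # stand-in for the "source matrix"

# Reading 1: an orthonormal basis spanning the same column space (via QR).
Q_basis, _ = np.linalg.qr(A)

# Reading 2: the orthogonal matrix closest to A in Frobenius norm.
# This is the polar factor, obtained by replacing all singular values with 1.
U, s, Vt = np.linalg.svd(A)
Q_nearest = U @ Vt

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
dist_basis = np.linalg.norm(A - Q_basis)
dist_nearest = np.linalg.norm(A - Q_nearest)
print(dist_basis, dist_nearest)
```

Both results are orthogonal, but the polar factor can never be farther from `A` than the QR basis, since it minimizes the Frobenius distance over all orthogonal matrices.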
- aesthesia
  The Newton-Schulz iteration they use approximates setting all singular values of the matrix to 1. That computes the nearest orthogonal matrix under the Frobenius norm.
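A rough sketch of that claim, assuming the classic cubic Newton-Schulz update (the article may well use different polynomial coefficients or a fixed small step count). The helper names and the test matrix with singular values 3, 2, 1, 0.5 are made up for illustration.

```python
import numpy as np

def newton_schulz_orthogonalize(A, steps=30):
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor of A."""
    # Frobenius scaling puts every singular value in (0, 1] for full-rank A,
    # which is inside the convergence region of the iteration.
    X = A / np.linalg.norm(A)
    for _ in range(steps):
        # In the SVD basis, each singular value s maps to 1.5*s - 0.5*s**3,
        # which pushes it toward 1 while leaving U and V untouched.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def polar_factor_svd(A):
    """Exact reference: set all singular values to 1."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

def make_test_matrix(seed=0):
    """Hypothetical 4x4 matrix with known singular values 3, 2, 1, 0.5."""
    rng = np.random.default_rng(seed)
    Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    return Q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q2.T

A = make_test_matrix()
Q_iter = newton_schulz_orthogonalize(A)
Q_exact = polar_factor_svd(A)
```

The iteration uses only matrix multiplications, which is presumably why it is attractive on GPUs compared with computing a full SVD.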
  hasley
  Interesting, thanks!
- CamperBob2
  3D graphics and kinematics people dodge the need for periodic orthonormalization by using quaternions. When they need a rotation matrix, they create it on demand rather than having to maintain it incrementally.
  I wonder if there's a similar shortcut representation that we will eventually realize we should be using for ML. I suppose if there is one, it won't have native GPU support, so no one will bother looking for it.
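A small sketch of the quaternion approach described here, using the standard unit-quaternion to rotation-matrix formula. The function name and the `(w, x, y, z)` component ordering are choices made for this example.

```python
import numpy as np

def quat_to_rotation(q):
    """Build a 3x3 rotation matrix on demand from a quaternion (w, x, y, z)."""
    # Renormalizing four numbers is the whole "re-orthonormalization" step:
    # any nonzero quaternion yields an exactly orthogonal matrix below.
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

# 90 degree rotation about the z axis.
q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
R = quat_to_rotation(q)

# A quaternion whose length has drifted still gives the same rotation.
R_drifted = quat_to_rotation(3.0 * q)
```

The state being maintained is four numbers with one constraint (unit length), rather than nine numbers with six constraints, which is why drift is so cheap to correct.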
phkahler
If it can be made orthogonal, can you go a step further and diagonalize it? The storage and performance improvement from that would be huge.
- big-chungus4
  You can take the output of the matrix LSTM, which is going to be a matrix for each token, and compute the SVD. To get better storage, we want U and V to be the same for all tokens, so that we can operate on the diagonal S matrix. But the LSTM is likely highly nonlinear, so U and V will be vastly different for different tokens.
- bee_rider
  I don’t know AI, but, weight matrices aren’t square in general, right? My first guess for something like this would be to take the SVD instead, since you can always do that, but I’m sure that’s been tried already.
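To illustrate the point about non-square matrices, a minimal NumPy sketch with a made-up 5x3 matrix (the shape is arbitrary). The thin SVD always exists, and setting the singular values to 1 gives the rectangular analogue of an orthogonal matrix, one with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))  # hypothetical non-square weight matrix

# Thin SVD: U is (5, 3), s is (3,), Vt is (3, 3).
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# The factors reproduce W exactly (up to rounding).
W_rebuilt = U @ np.diag(s) @ Vt

# "Orthogonalizing" a rectangular matrix: replace every singular value with 1.
# The result has orthonormal columns, though it is not a square orthogonal matrix.
W_orth = U @ Vt
```

Note that `W_orth.T @ W_orth` is the 3x3 identity, while `W_orth @ W_orth.T` is only a projection onto the column space.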
harveyrook
Now I’m wondering what is the eigenspace of an LLM? If I take a set of LLMs with the same number of parameters, then what are the eigenvectors? Do they have different personalities?
- bee_rider
  Neural networks are non-linear, so I think you wouldn’t be able to compute typical eigenvalues. You could compute the eigenvalues and/or singular values of the individual weight matrices (I’m sure this has been studied). SVDs are very conventional for making low-rank approximations, so it must have been studied.
  The concept of nonlinear eigenvalues exists, but it is a bit more exotic.
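As a sketch of the low-rank approximation mentioned above, here is a truncated SVD of a single made-up weight matrix. The 64x32 shape, the rank `k = 8`, and the helper name are assumptions for illustration. Real trained weights tend to have faster-decaying singular values than this random matrix, so the error here is pessimistic.

```python
import numpy as np

def low_rank_factors(W, k):
    """Return factors (A, B) with A @ B the best rank-k approximation of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]  # scale the top-k left singular vectors, shape (m, k)
    B = Vt[:k]            # top-k right singular vectors, shape (k, n)
    return A, B, s

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # hypothetical weight matrix
k = 8

A, B, s = low_rank_factors(W, k)
W_k = A @ B

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(W - W_k)
predicted_error = np.sqrt(np.sum(s[k:] ** 2))

# Storage: k * (m + n) numbers for the factors versus m * n for the full matrix.
params_full = W.size
params_low_rank = A.size + B.size
```

This only compresses one linear layer at a time; it says nothing about the eigenstructure of the whole nonlinear network.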
  dapperdrake
  I saw a presentation about this in 2022.
  Someone found a way to get "something like" a tri-diagonal matrix that was equivalent to the LLM they were studying in 2022.
  Apologies for being informal and hand-wavey. Been a long time and I probably forgot a few important points.