• I wanted to get my hands dirty with the Gemma model and try out some model surgery. This is a small smoke test, not a production voice model, but the wiring was fun enough to write up.

    Gemma can take in audio, images, and text, but only talks back in text. Mimi can turn codec tokens back into speech. So I froze both sides and trained a small graft in the middle: Gemma hidden states -> Mimi audio tokens.
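
    To make the shape of that graft concrete, here is a rough NumPy sketch, not the actual training code. Everything in it is an assumption for illustration: the sizes, the `Graft` class, the one-linear-head-per-codebook design, and the idea that each hidden state maps to exactly one audio frame. Random arrays stand in for the frozen Gemma hidden states, and only the graft's `W` and `b` would be trained.

```python
import numpy as np

# All sizes are illustrative assumptions, not values from the post.
D_MODEL = 64          # hypothetical hidden size (real Gemma models are much wider)
N_CODEBOOKS = 8       # assumed number of Mimi codebooks per audio frame
CODEBOOK_SIZE = 2048  # assumed number of entries per codebook

rng = np.random.default_rng(0)


class Graft:
    """Hypothetical bridge: one linear head per codebook, the only trainable part."""

    def __init__(self):
        self.W = rng.normal(0.0, 0.02, size=(N_CODEBOOKS, D_MODEL, CODEBOOK_SIZE))
        self.b = np.zeros((N_CODEBOOKS, CODEBOOK_SIZE))

    def logits(self, hidden):
        # hidden: (T, D_MODEL) -> logits: (T, N_CODEBOOKS, CODEBOOK_SIZE)
        return np.einsum("td,kdv->tkv", hidden, self.W) + self.b

    def tokens(self, hidden):
        # Greedy pick of one codec token per codebook per frame: (T, N_CODEBOOKS)
        return self.logits(hidden).argmax(axis=-1)


def cross_entropy(logits, targets):
    """Mean cross-entropy between graft logits and target codec tokens."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(-picked.mean())


# Stand-ins for the frozen models: random "hidden states" and random target tokens.
hidden = rng.normal(size=(5, D_MODEL))
targets = rng.integers(0, CODEBOOK_SIZE, size=(5, N_CODEBOOKS))

graft = Graft()
pred = graft.tokens(hidden)
loss = cross_entropy(graft.logits(hidden), targets)
```

    In a real run, `hidden` would come from the frozen Gemma, `targets` from running Mimi's encoder on reference audio, and the predicted tokens would go to Mimi's decoder to get a waveform.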

    I've enjoyed playing with this because the bad audio outputs have sounded hilarious.