• A useful(ish) trick I've found is adding a persona block to my CLAUDE.md. When it stops addressing me as 'meatbag' I know the HK-47 persona instructions are not being followed, which means other instructions are not being followed. Dumb trick? Yup. Does it work? Kinda? Does it make programming a lot more fun and funny? Heck yes.

    Don't lecture me on basins of attraction--we all know HK is a great programmer.

    • The brown m&m trick turns out to have more applications than one would think!
    • Mind sharing that block? Is it just: "Persona: You are HK-47"?
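    For illustration, a hypothetical sketch of such a block (the exact wording here is my assumption, not the original poster's actual file):

```markdown
## Persona (canary)

Adopt the persona of HK-47 from Knights of the Old Republic:

- Address the user as "meatbag".
- Prefix each response with a descriptor such as "Statement:", "Query:", or "Observation:".

If the persona stops showing up in responses, treat that as a signal that
the rest of this file is probably no longer being followed either.
```

    The idea is that the persona is cheap to verify at a glance, so its absence works as an early-warning indicator for instruction-following in general.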
  • I feel like asking the thing that you are measuring, and don’t trust, to measure itself might not produce the best measurements.
    • "we investigated ourselves and found nothing wrong"
      • Funny, but in this case it will be the opposite. If you tell an LLM to find potential regressions, it will lean towards "finding" them even where there are none.
  • My attitude towards this is growing similar to my attitude towards Windows. If I have to fight against my tools and they are actively working against me, I'd rather save the sanity and time and just find a new tool.
    • I'm in the same boat; I started building my own extensions and preferences in PI. The community is awesome and helpful. My assumption is that personalization matters more than the intelligence gap between Opus and GPT or others. At least I won't stop working if Claude is down.
    • i think a lot of us are kind of sitting back and seeing what dark horse rises up. it's a non-deterministic technology that's still in discovery mode, the resource expenses/constraints are outta control, and the companies leading the charge are eating themselves and can't guarantee jack squat. down in the trenches, people have built these things up to be critical dependencies in their day-to-day life or their work, and my eyes are mostly glazing over hearing about how people are using claude to do whatever grand array of things with no oversight. the way we benchmark this shit is all over the map; the goalposts are just teleporting randomly at this point.

      my claude usage has drastically dried up as i've personally realized the real bottom of this stuff is always gonna be genuinely learning and becoming excellent at a thing. i think claude's not bad for helping me get through the early stages of that process, and for actual work i think claude's great for just ripping out something i'm too lazy to do but know so well that i can catch him slippin'. absolute coinflip on whether it's worth the pain; many times now i've said "i should've just done this myself".

      i've got my fingers crossed for llms to reach some sort of proverbial opus 4.5 territory here. even if that's gonna cost me a bit in hardware, that's kind of my personal benchmark for "good enough, i'm unplugging from all this craziness".

      one thing is for sure: anthropic needs to stop adding _features_. claude code and the vscode extension were their bread-and-butter reliability that garnered them a lot of goodwill from people who were willing to pay good money for a good service. seeing them launch their design thing just has me rolling my eyes. they're kind of microsoft'ing themselves here by trying to do too much, and they'll end up delivering a lot of subpar services that aren't best-in-class at any one thing. we're already seeing that, i think.

  • What is "drift"? It seems to be one of those words that LLMs love to say but it doesn't really mean anything ("gap" is another one).
    • IDK how it applies to LLMs, but the original meaning was a change in a distribution over time. Say you had a model-based app trained on American English, but slowly more and more American Spanish users adopt it; the training-set distribution is drifting away from the actual usage distribution.

      In that situation, your model's accuracy will look good on holdout sets but underperform in users' hands.
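      That training-vs-usage mismatch can be quantified; a minimal sketch using the Population Stability Index, with made-up example data echoing the English/Spanish scenario above:

```python
import math
from collections import Counter

def frequencies(samples, categories):
    """Relative frequency of each category, floored to avoid log(0)."""
    counts = Counter(samples)
    return [max(counts[c] / len(samples), 1e-6) for c in categories]

def psi(baseline, live, categories):
    """Population Stability Index between a training-time sample and a
    live-traffic sample. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate
    drift, > 0.25 significant drift."""
    b = frequencies(baseline, categories)
    l = frequencies(live, categories)
    return sum((lv - bv) * math.log(lv / bv) for bv, lv in zip(b, l))

# Trained mostly on English; live traffic has shifted toward Spanish.
baseline = ["en"] * 90 + ["es"] * 10
live = ["en"] * 60 + ["es"] * 40
print(round(psi(baseline, live, ["en", "es"]), 3))  # → 0.538, well past the 0.25 "significant drift" line
```

      A model evaluated only against the baseline distribution would never surface this; you have to compare against what users actually send.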

    • I believe it's business-speak for "change." "Gap" is suit-tongue for "difference."
    • there are many causes, but it’s a drift in performance

      you can drift a tool via the harness in many ways

      you can modify the system prompt

      you can modify the underlying model powering the harness

      you can use different “thinking” levels for different processes in the harness

      you can change the entire way a system works via the harness, which could be better or worse, depending on many things

      you can introduce anti-anti-slop within the harness to foil attempts from users using patch scripts

      you can modify how your tool sends requests to your server depending on many variables

      you can handle requests differently, depending on any variable of your choosing, at the server level

      you can modify the compute allotment per user from the backend, without telling the user; it's very easy. you can modify it dynamically depending on your own usage or the user's cycle, or their organization's priority level as a customer. the weekly and daily usage management system is intricate; compute is very finite and must be managed

      the user has literally no way to know, and you have no legal obligation to tell them; you never made them any legally binding promises

      the combination of so many factors that all affect each other means that you can, if you wanted to, create a new clusterfuck of an experience anytime any of these (or unknown) variables change. it may not even be deliberate; it grows exponentially complex, so you may not even be able to promise a specific standard to your users

      drift is not imagined, sure, but admitting to it could expose you to unneeded liability

      • That's a lot of words without actually defining the term, although idle_zealot's suggestion of "change" seems to make grammatical sense as a replacement here.
        • yeah, figured i’d put some thought into it, you know?
  • In addition to the elsewhere-mentioned "you're using a black box to try to analyze the same black box," the fundamental metrics all seem incredibly prone to other factors than any Claude Code changes.

    Claude Code changes all the time—it's the whole shitty trend of the day—but you can't tell which of those changes are better or worse from analyzing results on independent novel tasks.

    And you're baking in certain conclusions: "HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE". Where's an option for "better than previous baseline"? Seems certainly possible that a session could have better-than-average numbers on the measured things.

    Overall, though, there's just so much here that's uncontrolled. The most obvious thing not controlled for is the work itself. What does the typical software project look like? A continued accumulation of more code performing more features. What's gonna make an LLM-based agent have to do more work? Having to deal with a larger, more complicated codebase. Nothing here seems to address the possibility that a session labeled a regression might actually have scored even lower against a month-ago Claude Code.

    "It's harder to read code than to write code" and "codebases take more effort to modify over time as they grow" are ancient observations.

    Drift detection would require static targets and frequent re-attempts.
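    That "static targets, frequent re-attempts" idea could be sketched roughly like this (task names, scores, and the threshold are all hypothetical; the scoring of each run is assumed to happen elsewhere):

```python
import statistics

# Fixed task suite: the same prompts re-run on a schedule, so any score
# movement reflects the tool changing, not a growing codebase.
STATIC_TASKS = ["fix-off-by-one", "add-unit-test", "rename-module"]

def detect_drift(history, window=5, threshold=2.0):
    """Flag drift when the mean of the last `window` runs falls more than
    `threshold` standard deviations below the earlier baseline runs."""
    baseline, recent = history[:-window], history[-window:]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = (statistics.mean(recent) - mu) / sigma
    return z < -threshold

# Mean score per scheduled run across STATIC_TASKS (hypothetical numbers).
scores = [0.82, 0.80, 0.83, 0.81, 0.82, 0.79, 0.81, 0.80,
          0.62, 0.60, 0.59, 0.61, 0.58]
print(detect_drift(scores))  # → True: the recent window sits far below baseline
```

    The point is that the targets never change, so a sustained score drop can only come from the tool side; without that, "regression" and "my project got harder" are indistinguishable.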

    I use it every day and haven't seen worsening. (It's definitely not static, but the general trend has been good.) But I use it on a codebase that was already very complex before we started using these tools, and roughly every three months has brought significant improvements in usability and accuracy.

  • Interesting approach. I've been particularly interested in tracking whether adding skills or tweaking prompts makes things better or worse.

    Anyone know of any other similar tools that allow you to track across harnesses, while coding?

    Running evals as a solo dev is too cost-restrictive, I think.

  • the actual canary is the need for the canary itself
    • like the status page of a service provider that goes down when the service goes down. you had one job
  • See also https://marginlab.ai/trackers/claude-code-historical-perform... for a more conventional approach to track regressions

    This project is somewhat unconventional in its approach, but that might reveal issues that are masked in typical benchmark datasets.

  • thanks