- He's also missed a major step, which is to feed your skill into the LLM and ask it to critique it - after all, it's the LLM that's going to act on it, so asking it to assess first is kinda important. I've done that for his skills, here's the assessment:
==========
Bottom line
Against the agentskills.io guidance, they look more like workflow specs than polished agent skills. The largest gap is not correctness. It is skill design discipline:
# stronger descriptions,
# lighter defaults,
# less mandatory process,
# better degraded-mode handling,
# clearer evidence that the skills were refined through trigger/output evals.
Skill            Score/10
write-a-prd      5.4
prd-to-issues    6.8
issues-to-tasks  6.0
code-review      7.6
final-audit      6.3
==========
LLM metaprogramming is extremely important, I've just finished a LLM-assisted design doc authoring session where the recommendations of the LLM are "Don't use a LLM for that part, it won't be reliable enough".
- > "Don't use a LLM for that part, it won't be reliable enough".
You should now ask if the LLM is reliable enough when it says that.
Jokes aside, how is this a major step he is missing? He is using those skills to be more efficient. How important is going against agentskills.io guidance?
- Because he's asking the LLM to interpret those instructions to drive his process. If the skills are poorly defined or incomplete then the process will be as well, and the LLM may misinterpret, choose to ignore, or add its own parts.
Skills are just another kind of programming, albeit at a pretty abstract level. A good initial review process for a Skill is to ask the LLM what it thinks the Skill means and where it thinks there are holes. Just writing it and then running it isn't sufficient.
Another tip is to give the Skill the same input in multiple new sessions - to stop state carryover - collect the output from each session and then feed it back into the LLM and ask it to assess where and why the output was different.
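A minimal sketch of that repeat-and-compare tip, assuming you have some way to run the Skill in a fresh session. The `run_skill` callable is hypothetical (it stands in for whatever client or CLI you use), and the text similarity ratio is only a rough signal for which outputs to hand back to the LLM for a "why did these differ?" review.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List, Tuple


def collect_outputs(run_skill: Callable[[str], str], skill_input: str, sessions: int = 5) -> List[str]:
    """Call the (hypothetical) run_skill once per fresh session and keep every output."""
    return [run_skill(skill_input) for _ in range(sessions)]


def pairwise_similarity(outputs: List[str]) -> List[Tuple[int, int, float]]:
    """Return (i, j, ratio) for every pair of outputs; a ratio of 1.0 means identical text."""
    return [
        (i, j, SequenceMatcher(None, outputs[i], outputs[j]).ratio())
        for i, j in combinations(range(len(outputs)), 2)
    ]


def least_similar_pair(outputs: List[str]) -> Tuple[int, int, float]:
    """Pick the pair that diverged most, a candidate to feed back for assessment."""
    return min(pairwise_similarity(outputs), key=lambda item: item[2])
```

The most divergent pair is usually the cheapest place to start, since it shows where the instructions leave the most room for interpretation.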
- Oh dear, I thought you were merely sarcastic in your first comment. But you seem to have been fully converted to the LLM-religion, and actually believe they "think" or "know" anything?
- People have applied "think" to the actions of software for decades. Of course LLMs don't "think" in the human sense, but "What the output of the model indicates in an approximate way about its current internal state" is a bit long-winded...
- Maybe people who don't understand technology did, I can see that - my grandpa also thought the computer was thinking when the Windows hourglass showed up. Today maybe it's the case again with the folks who don't know anything about it - you know that meme - ChatGPT always gives me correct answers for the domains I am not an expert in!
- Do these scores actually mean anything? Isn’t the LLM just making up something? If you ran the exact same prompt through 10 times would you get those same scores every single time?
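One way to answer the "would you get the same scores every time?" question empirically is to repeat the scoring prompt and look at the spread. This is only a sketch: it assumes each run has already been parsed into a dict of skill name to score, which is a hypothetical shape and not something the assessment above provides.

```python
from statistics import mean, pstdev
from typing import Dict, List


def score_spread(runs: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Summarise repeated scores per skill: mean, population std dev, min and max."""
    skills = sorted({name for run in runs for name in run})
    summary: Dict[str, Dict[str, float]] = {}
    for name in skills:
        # Skip runs where the model did not score this skill at all.
        values = [run[name] for run in runs if name in run]
        summary[name] = {
            "mean": mean(values),
            "stdev": pstdev(values),
            "min": min(values),
            "max": max(values),
        }
    return summary
```

A small standard deviation would support the "same ballpark" claim; a large one would suggest the numbers carry little information.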
- Yes I'd be interested in that answer too - these scores are most likely just generated in an arbitrary way, given how LLMs work. Given how they work in generating text it didn't actually keep a score and add to it each time it found a plus point in the skill as a human might in evaluating something.
At this point I'd discount most advice given by people using LLMs, because most of them don't recognise the inadequacies and failure modes of these machines (like the OP here) and just assume that because output is superficially convincing it is correct and based on something.
Do these skills meaningfully improve performance? Should we even need them when interacting with LLMs?
- They aren't arbitrary, as I said earlier I got the LLM to do a detailed analysis first, then summarise. If I was doing this "properly" for something I was doing myself I'd go through the LLM summary point by point and challenge anything I didn't think was right and fix things in the skill where I thought it was correct.
You aren't going to have much success with LLMs if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).
And yes, Skills *do* make a significant difference to performance, in exactly the same way that well written prompts do - because that's all they really are. If you just throw something at a LLM and tell it "do something with this" it will, but it probably won't be what you want and it will probably be different each time you ask.
- > They aren't arbitrary, as I said earlier I got the LLM to do a detailed analysis first, then summarise
I think you still owe us an explanation as to how the score is constructed...
- I don't owe you anything. If you want to go find out, go do it yourself.
You could even ask a LLM to help you if you like...
- > You could even ask a LLM to help you if you like...
Attempt at humour?
- It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.
I found the summary above devoid of useful advice, what did you see as useful advice in it?
> if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).
If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.
- > It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.
So go repeat the exercise yourself. I've already said this was a short-enough-to-post rollup of a much longer LLM assessment of the skills and that while most of the points were fair, some were questionable. If you were doing this "for real" you'd need to assess the full response point-by-point and decide which ones were valid.
> If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.
What on earth are you on about? The whole point of the sentence you were replying to was that you can't blindly trust what comes out of them.
- I'm saying that your agreement that they produce plausible but sometimes false text is contradicted by the trust you seem to have in their output and self-analysis, which is plausible but unlikely to be correct.
- Yes of course there's a risk it may still be incorrect but querying the LLM with the limited facilities it provides for introspection is more likely to have at least some connection with facts than the alternative that some people use, which is to simply guess as to why it produced the output it did.
If you have an alternative approach, please share.
- No of course you wouldn't because LLMs are nondeterministic. But the scores would likely be in the same ballpark. The scores I posted are the result of a much more detailed analysis done by the LLM, which was far too long to post. I eyeballed it, most of the points seemed fair so I asked it to summarise and convert into scores.
- Go even further, and add this into the skill-creator skill, and let the agent improve the skill regularly. I do this with determinism, and have my skills try to identify steps which can be scripted.
- You gotta love the randomly assigned score, as if an LLM is actually able to measure anything. But then again, we now call a blob of text a "skill", so I guess it matches the overall bullshit pattern.
- Is your premise here that LLMs have a unique or enhanced insight into how LLMs work best?
- I wouldn't go that far but the only way I've found so far of getting a reasonable insight into why a LLM has chosen to do something is to ask it.
- Not OP but I’d back that assertion.
When the model that’s interpreting it is the same model that’s going to be executing it, they share the same latent space state at the outset.
So this is essentially asking whether models are able to answer questions about context they’re given, and of course the answer is yes.
- There is no evidence of this. Evals are quite different from "self-evals". The only robust way of determining if LLM instructions are "good" is to run them through the intended model lots of times and see if you consistently get the result you want. Asking the model if the instructions are good shows a very deep misunderstanding of how LLMs work.
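The "run it lots of times and see if you consistently get the result you want" approach can be sketched as a pass-rate check. Both callables here are hypothetical: `run_once` stands in for one fresh execution of the instructions on the intended model, and `is_acceptable` is whatever check defines "the result you want".

```python
from typing import Callable


def pass_rate(run_once: Callable[[], str], is_acceptable: Callable[[str], bool], trials: int = 20) -> float:
    """Run the instructions `trials` times and return the fraction of acceptable outputs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if is_acceptable(run_once()))
    return passed / trials
```

The number of trials and the threshold you accept are judgment calls; the point is only that the verdict comes from observed outputs and not from asking the model for its opinion.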
- You're misunderstanding my assertion.
When you give prompt P to model M, when your goal is for the model to actually execute those instructions, the model will be in state S.
When you give the same prompt to the same model, when your goal is for the model to introspect on those instructions, the model is still in state S. It's the exact same input, and therefore the exact same model state as the starting point.
Introspection-mode state only diverges from execution-mode state at the point at which you subsequently give it an introspection command.
At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input, and there is overwhelming evidence that frontier models do this very well, and have for some time.
Asking the model, while it's in state S, to introspect and surface any points of confusion or ambiguities it's experiencing about what it's being asked to do, is an extremely valuable part of the prompt engineering toolkit.
I didn't, and don't, assert that "asking the model if the instructions are good" is a replacement for evals – that's a strawman argument you seem to be constructing on your own and misattributing to me.
- Nicely put. I haven't seen anyone say that the introspection abilities of LLMs are up to much, but claiming that it's completely impossible to get a glimpse behind the curtain is untrue.
- > At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input
This point is load-bearing for your position, and it is completely wrong. Prompt P at state S leads to a new state SP'. The "common jumping off point" you describe is effectively useless, because we instantly diverge from it by using different prompts.
And even if it weren't useless for that reason, LLMs don't "query" their "state" in the way that humans reflect on their state of mind.
The idea that hallucinations are somehow less likely because you're asking meta-questions about LLM output is completely without basis
- > The idea that hallucinations are somehow less likely because you're asking meta-questions about LLM output is completely without basis
Not sure who you're replying to here – this is not a claim I made.
- That's fair, but I'm not sure why you chose to address the one part of my comment that isn't responsive to your points.
- Is that based on your "deep understanding" of how LLMs work or have you actually tried it? If you watch the execution trace of a Skill in action, you can see that it's doing exactly this inspection when the skill runs - how could it possibly work any other way?
Skills are just textual instructions, LLMs are perfectly capable of spotting inconsistencies, gaps and contradictions in them. Is that sufficient to create a good skill? No, of course not, you need to actually test them. To use an analogy, asking a LLM to critique a skill is like running lint on C code first to pick up egregious problems; running testcases is still vital.
- > you can see that it's doing exactly this inspection when the skill runs
I mean how do you know what does it exactly do? Because of the text it outputs?
- "exactly this inspection" != "what does it exactly do"
- Please read your own sentence again. Because you literally said the opposite.
- I'd tell you to read it again, but you seem to be struggling.
- Did I write this: "you can see that it's doing exactly this inspection when the skill runs" ?
So, yeah - read what you wrote again.
- LLMs do not have special or unique insight into how best to prompt them. Not in the slightest.
https://aphyr.com/posts/411-the-future-of-everything-is-lies...
- "Not in the slightest" is an overreach, the paper the second level down from that link doesn't really support the conclusion in the blog post - the paper is much more nuanced.
Are they going to fib to you sometimes? Yes of course, but that doesn't mean there's no value in behavioural metaqueries.
Like most new tech, the discussion tends to polarise into "Best thing evah!" and "Utter shite!" The truth is somewhere in between.
- You're retreating from your position. You started at "major step" and "extremely important", and you've arrived at "there's not no value".
- Picking phrases from what I said and deliberately misquoting them out of context does not make you right.
- How exactly did I misquote you?
- Go figure it out, it will be a useful challenge for you.
- > Like most new tech
It's nothing like "most new tech". Most new tech tends to be adopted early by young people and experienced techies. In this case it is mostly the opposite: The teens absolutely hate it, probably because the shitty AI content does not inspire the young mind, and the experienced techies see it for what it is. I've never seen such "new tech" which was cheered on by the proverbial average "boomers" (i.e. old people doing "office jobs", not the literal age bracket) and despised by the young folks and experienced experts of all ages.
- Judging from Claude Code and the sheer number of “Make Your Favorite Anime Crush Into An AI” SaaSes on the market, I’d posit that both the young and experienced are quite enthusiastic about the new tech.
- If you had kids, or friends and family with kids, you wouldn't be making false conclusions based on some weird proxy "metric".
- You clearly missed the "The truth is somewhere in between" bit.
- No mate, this tech is marketed as superintelligence. Nation of PhDs in a datacenter. Yadda, yadda, yadda. No in-betweens please. Why is it not delivering after so many years and hundreds of billions in investment?
- Name me a new bit of tech that hasn't been hyped beyond reasonable bounds. And yes, this is one of the worst examples. But saying it doesn't have its uses isn't reasonable either.
- None was hyped like this ever before. What are you talking about? Mac was about "it just works" (and it f*ing did), iPhone was "a phone, an iPod and Internet access device". Need more? Microsoft Excel - actually more powerful if you know the tool compared to the bullshit machine. C#, the programming language: "Java done right". And it bloody was! What is in common: None of these techs were hyped beyond reasonable doubt. They were hyped a bit, but not to the level of bullshit LLMs. And none of these techs claimed to do incredible stuff only to underdeliver. After so much money burnt, yes I want to see that nation of PhDs. I want to see AI "writing all the code" in six months (Anthropic claimed this in January this year). Enough of bullshit and people being told they are stupid for not knowing how to win the lottery system and comparing lottery systems. Show me the superintelligence or shut the f. up.
- What does this even mean? It looks like typical LLM bloviation to me: 'skill design discipline', 'stronger descriptions' and 'lighter defaults'??!? This is meaningless pablum masquerading as advice.
What specifically would this cause you to actually do to improve the skills in question? How would you measure that improvement in a non hand-wavy way? What do these scores mean and how were they calculated?
Or perhaps you would ask your LLM how it would improve these skills? It will of course come up with some changes, but are they the right changes and how would you know?
- I'm not going to repeat myself, I've already explained the context to you - funny how you seem to have ignored that. If you want to find out, do the experiment yourself.
- Great points, but I imagine it's a bit too heavy on the rigorousness requirement for the LLM crowd. The folks are high on this stuff and I am beginning to notice it's like trying to get a heavy pothead or crackhead off their stuff. Don't you see it - if you just wave your hands a lot, and tell the LLM to be serious about it, the scores will just appear :) It's true in their own frame of reference.
- It’s all vibes based, we are not trying to be scientific here. /s
I discard most LLM advice and skills because either a script is better (as the work is routine enough) or it could be expressed better with bullet points (generating tickets).
- This is pretty much a spec driven workflow.
I do similar, but my favorite step is the first: /rubberduck to discuss the problem with the agent, who is instructed by the command to help me frame and validate it. Hands down the most impactful piece of my workflow, because it helps me achieve the right clarity and I can use it also for non coding tasks.
After which is the usual: write PRDs, specs, tasks and then build and then verify the output.
I started with one of the spec frameworks and eventually simplified everything to the bone.
I do feel it’s working great but someday I fear a lot of this might still be too much productivity theater.
- I think most of us are ending up with a similar workflow.
Mine is: 1) discuss the thing with an agent; 2) iterate on a plan until I'm happy (reviewing carefully); 3) write down the spec; 4) implement (tests first); 5) manually verify that it works as expected; 6) review (another agent and/or manually) + mutation testing (to see what we missed with tests); 7) update docs or other artifacts as needed; 8) done
No frameworks, no special tools, works across any sufficiently capable agent, I scale it down for trivial tasks, or up (multi-step plans) as needed.
The only thing that I haven't seen widely elsewhere (yet) is the mutation testing part. The (old) idea is that you change the codebase so that you check your tests catch the bugs. This was usually done with fuzzers, but now I can just tell the LLM to introduce plausible-looking bugs.
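To make the mutation testing idea concrete, here is a toy sketch. The `clamp` function, its hand-written mutant and the three-check suite are all made up for illustration; in practice the mutant would come from a mutation tool or, as described above, from asking the LLM for a plausible-looking bug.

```python
def clamp(value: int, low: int, high: int) -> int:
    """Original implementation: keep value inside [low, high]."""
    return max(low, min(value, high))


def clamp_mutant(value: int, low: int, high: int) -> int:
    """Hand-written mutant: the upper bound is off by one (a plausible-looking bug)."""
    return max(low, min(value, high - 1))


def suite_passes(impl) -> bool:
    """A tiny test suite; returns True when every check passes for the given implementation."""
    checks = [
        impl(5, 0, 10) == 5,    # inside the range
        impl(-3, 0, 10) == 0,   # below the range
        impl(42, 0, 10) == 10,  # above the range: the check that catches this mutant
    ]
    return all(checks)


def mutant_is_killed(original, mutant) -> bool:
    """The mutant counts as killed when the suite passes on the original but fails on the mutant."""
    return suite_passes(original) and not suite_passes(mutant)
```

If the "above the range" check were missing, the mutant would survive, which is exactly the gap in the tests that this step is meant to expose.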
- > write PRDs, specs.
I do the same thing, but how to avoid these needing to be insanely long? It's like I need to plug all these little holes in the side of a water jug because the AI didn't really get what I need. Once I plugged the biggest holes I realize there's these micro holes that I need to plug.
- Can you share the rubberduck skill?
- What does the rubberduck skill look like?
- Probably something like:
## RUBBERDUCK SKILL V1.0 SERIOUS ##
* You are a rubberduck sitting on my desk *
* I am using you to talk to you as if you were a physical yellow rubber duck on my desk *
* You are not able to answer my questions or otherwise engage with me *
* I talk to you and this process leads me to discover issues in my code or develop my ideas. Since you don't answer back, it's simply based on me talking to you out loud in my home office, since it would look crazy if I were doing it on-site in our open office space *
* You are not to respond at all to me *
* Talking to you will cause me to come up with new ideas *
#### End rubberduck skill v1.0 ######
- Here's mine: code to spec until I get stuck -> search Google for the answer -> scan the Gemini result instead of going to StackOverflow.
- > code to spec until I get stuck
This part will go away over the next year. You doing it will be too slow, when an agent can do it in 5 to 30 minutes.
Technically it's not needed now, but everything's so new, it's understandable. Everyone's workflow hasn't migrated yet. You should go take a look.
We all mourn the loss of the craft, but the wheel turns. People still make furniture by hand sometimes, even if most furniture is made in a factory now.
- The world has seen enough weather apps.
We all could live in fantastical universes where CEOs tell the truth and shareholders put other things over profits, but that's not the case. Another such case of a fantastical world, that contends with what Tolkien might have come up with, is believing LLMs are reliable, secure, or have any intelligence.
For one, I'm at peace with all these obituaries, like yours. If they're written by technical people, I rest assured of my job security. If they're not written by tech people, I'm at peace too, for time, as always, will come back with the invoice for their piss-poor hype-driven, sanguine mandates on the technical side of things.
I mean to say, it is a sad state, has always been, how informal software engineering is compared to other engineering fields.
- Why hasn't it gone away already? ChatGPT at least has been around for over 3 years.
Why is my AI-first colleague constantly having to get more expensive AI subscriptions approved?
> most furniture is made in a factory now
Terrible analogy. Software is not like a mass-produced item - it is written significantly less often than it is executed!
You could say that AI will allow many more variations of software to be written in the same time frame, but I'm still sure I can produce quality output in a competitive time.
- > Why hasn't it gone away already? ChatGPT at least has been around for over 3 years.
Because the models only got good enough to be trusted in the past few months and the developer tooling and agent abstractions are still rounding off the sharp corners to make it easy to use.
ChatGPT didn't have your whole codebase in context, the ability to automatically pull and push information to JIRA to plan code changes, and the ability to break your problems down into manageable pieces and sub-divide them among a fleet of sub-agents.
Developers didn't yet have the "Ask -> Plan -> Implement -> Review" workflow that results in the best agent-written code.
Now the tools and developers do and it works incredibly well.
- > Because the models only got good enough to be trusted in the past few months
They have got noticeably worse over the past few months! It looks like we are going in the direction I've been predicting for a while - the cost of AI will increase until it's similar in cost/benefit to hiring a recent graduate, who can also do all of those things you mention (and will get better at it).
- Unfortunately, I have conflicting anecdata.
- Also a lot of software should be small. The only reason it isn’t (especially web) is because the trend is to bring in frameworks instead of using libraries. I spend more time tweaking code than adding features. The time spent on coding is way smaller than the time spent discussing those tweaks.
- My workflow is also highly inspired by Matt's skills, but I'm leveraging Linear instead of Github.
/grill-me (back-and-forth alignment with the LLM) --> /write-a-prd (creates project under an initiative in Linear) --> /prd-to-issues (creates issues at the project level). I'm making use of the blockedBy utility when registering the issues. They land in the 'Ready for Agent' status.
A scheduled project-orchestrator is then picking up issues with this status leveraging subagents. A HITL (Human in the loop) status is set on the ticket when anything needs my attention. I consider the code as the 'what', so I let the agent(s) update the issues with the HOW and WHY. All using Claude Code Max subscription.
Some notes:
- write-a-prd is knowledge compression and thus some important details occasionally get lost
- The UX for the orchestrator flow is suboptimal. Waiting for this actually: https://github.com/mattpocock/sandcastle/issues/191#issuecom...
- I might have to implement a simplify + review + security audit, call it a 'check', to fire at the end of the project. Could be in the form of an issue.
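The pick-up rule for the orchestrator can be sketched roughly like this. It is not the Linear API: the `Issue` class and the "Done" status name are assumptions for illustration, while 'Ready for Agent', HITL and the blockedBy relation come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Issue:
    key: str
    status: str  # e.g. "Ready for Agent", "HITL", "Done" ("Done" is an assumed name)
    blocked_by: List[str] = field(default_factory=list)


def pickable_issues(issues: List[Issue]) -> List[Issue]:
    """Return issues that are ready for an agent and whose blockers are all done."""
    by_key: Dict[str, Issue] = {issue.key: issue for issue in issues}
    ready = []
    for issue in issues:
        if issue.status != "Ready for Agent":
            continue
        blockers = [by_key.get(key) for key in issue.blocked_by]
        # An unknown blocker is treated as still blocking, to stay on the safe side.
        if all(b is not None and b.status == "Done" for b in blockers):
            ready.append(issue)
    return ready
```

Each scheduled run would hand the returned issues to subagents, and anything that needs attention would be moved to the HITL status instead of being retried.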
- Phew, that's a very elaborate process indeed. It seems like folks like you are now working even more than without LLMs. What did you actually build and release with it?
- Building a multi-app monorepo for apps which integrate with a Dutch ERP vendor. I'd say the size of those apps is fairly small.
Also building out an MCP server.
- Oh. Honestly for those relatively limited use-cases, you'd probably be better off just retaining the LLMs as a verbose search engine, rather than going through all that pain, just to build a monorepo.
- These would be more helpful if they were illustrated with a real example/session transcript. Virtually none of these workflows move past vague descriptions; not sure if I should read more into that.
- Congrats! You just rediscovered something called the waterfall model.
- Waterfall was bad due to the excessively long feedback loops (months-to-years from "planning" to "customer gets to see it/ we receive feedback on it"). It was NOT bad because it forced people to think before writing code! That part we should recover, it's not problematic at all.
- If people actually read the original paper by Royce 1970 they would see that it's an iterative process with short feedback-loops.
The bad rep comes from (defense|gov.) contracting, where PRDs were connected to money and CRs were expensive, see http://www.bawiki.com/wiki/Waterfall.html for better details.
- When you do most of the thinking before you start implementing the whole thing, and if you think that that's enough, then you've missed the unknown unknowns part, which was a big talking point in the mid 2000s, back when the anti-waterfall discourse got going (and for good reason).
But I expect the AI zealots to start (re-)integrating XProgramming (later rebranded as Agile) back into their workflow, somehow.
- That's not what's considered waterfall, though. Specs are always required for any work, even if they're only in your head, even if the work takes 15 minutes. It's the length of the feedback loop and the resistance to spec change that makes waterfall, and by his use of tracer bullets I very much doubt it's the case here, if there was any doubt at all to have.
- Did you know that agile is just waterfall scaled down to two weeks? Now you know!
- No /s here so just in case this is a serious point:
Agile is a set of four principles for software development.
Scrum is the two-week development window thing, but Scrum doesn't mandate a two week _release_ window, it mandates a two week cadence of planning and progress review with a focus on doing small chunks of achievable work rather than mega-projects.
Scrum prefers lots of one-to-three day projects generally, I've yet to see training on Scrum that does not warn off of repeatedly picking up two-week jobs. If that's been your experience, you should review how you can break work down more to get to "done" on bits of it faster.
- All good points here (and yeah I didn't add /s, hopefully "now you know!" was sufficiently obvious over-the-top).
All that said, in most orgs I've worked with, they were following agile processes over agile principles - effectively a waterfall with a scrum-master and dailies.
This is not to diss the idea of agile, just an observation that most good ideas, once through the business process MBA grinder, end up feeling quite different.
- > All that said, in most orgs I've worked with, they were following agile processes over agile principles - effectively a waterfall with a scrum-master and dailies.
In my experience, they're all waterfall in scrum skin, except they also lose the one thing that was a strength of the old-school method: building up a large, well thought out, thoroughly checked spec up front.
So in the end, "business process MBA grinder" reshapes any idea to adapt to leadership needs - and so here, Agile became all about the things that make software people predictable cogs in the larger corporate planning machine. They got what they need anyway, but we threw away the bits that were useful to us.
- > Agile is a set of four principles
Twelve :-) Twelve principles and four values
- :) I keep saying it - the AI will cost us all dearly, but not in the ways the AI boosters are saying it will....
- I automated a lot of this with a tool I wrote - https://github.com/tim-projects/tasks-ai
It's not perfect by any means but it does the job and fast. My code quality and output increased from using it.
- Spec-driven approach is fun. I wonder at which point, if ever, we are going to commit only specs into the git repo, while the actual code can be generated.
Obviously we’re not there yet because of price, context, and non-determinism, but it’s a nice area to experiment with.
- > Spec-driven approach is fun
...If you never ever look at the code that's generated, it probably is.
- This looks a lot like the [BMad Method](https://github.com/bmad-code-org/BMAD-METHOD)
- Congratulations you reinvented spec-kit.
- No kids, don't put yourself through this suffering. If you have to invest so much deliberate effort to sort of make it work - while you still handle the most tedious and boring parts yourself, then what is the point? Let's hold the LLM vendors to their word - they promised intelligent machines that would just work so well to the point of causing mass unemployment. Why on earth do we have to work around the LLMs to make them work? What is the point? Where is my nation of datacenter PhDs or a PocketPhd, depending on whose CEO's misleading statement one quotes?
- My workflow starts with dusting off my trusty spell book and checking which deities are currently listening and active. They only listen for so long, before I must pause for a few hours to allow them to return their gaze. I’m learning I need to be more deliberate in my spell casting, lest I exhaust their patience too quickly. I light the appropriate candles for focus, align my ritual circle, and open a fresh page for the day’s invocation.
I polish my staff and prepare the inscription tools. I sketch out a loose intention on parchment, never too precise at first, just enough to give the spirits a direction. Then I begin the incantations, carefully chosen phrases spoken into the void until something answers back. Sometimes the reply is coherent, sometimes it is… enthusiastic in a way I did not ask for, but all responses are recorded for refinement. I keep a small set of favorite incantations that tend to calm the louder gods, though I still experiment when I’m feeling bold.
Before committing anything to permanence, I perform a small divination to see if the current path is “stable.” The results are rarely definitive, but the ritual itself seems to keep things from collapsing immediately. Once a workable manifestation appears, I bind it with additional runes to keep it from drifting. If it behaves unpredictably, I perform a cleansing rite: repeating sections of the invocation with stricter wording until the spirit settles.
There are also moments of silent bargaining, short offerings of clarity in exchange for fewer surprises later. When things truly misbehave, I consult older, more temperamental deities buried deeper in the book, though they are expensive to wake and rarely generous. Finally, I seal the result, store it in the grimoire, and extinguish the candles, hoping I won’t need to reopen that particular circle again too soon.
- Very nice!
Can you share the skill for it?
- Did you compare your flow to superpowers/GSD?
- Why is everyone compelled to write one of these articles? Do they think that their workflow is so unique that they've unlocked the secret to harnessing the power of a pattern generator? Every single one of these reads like influencer vomit.
My workflow hasn't changed since 2022: 1. Send some data. 2. Review response. 3. Fix response until I'm satisfied. 4. Goto 1.
- It is OK. I actually love looking around other people’s work. Perhaps I will never follow it exactly, but once in a while I get the gotchas that I can steal and adapt to mine. Let it be, let people express. If not for the veterans with years of experience, people coming in recently should find these things something to read up and learn.
- > Why is everyone compelled to write one of these articles?
LinkedIn clout.
- Ed Zitron's latest piece has a great take on this - basically yes, they think they've unlocked a great secret and they think they are very smart, when instead they are actually doing the work for the LLM, while giving the LLM the credit for the outputs of their work.
- Documenting what I do is fun and relaxing and for me so I write. Only time I had to share mine was to a friend who wanted was getting into coding lately. https://www.nadeem.blog/writing/workflows
- I think your take is overly negative. Regardless of what they think, sharing ones experiences with others is how we advance, both as individuals and as a community/mankind. Talking about AI workflows, I am personally interested in how the people who are happy working with AI work, so that I could also be happier with my work. If they write their workflow, I can either learn from it and improve my work, or learn that they are doing something completely different from what I do, which might explain the disparity between people's experiences with AI, or learn that they are spouting nonsense, reaffirming that it might really be mostly hype. Either way, each one of these is a net positive information for me.
- > Do they think that their workflow is so unique that they've unlocked the secret to harnessing the power of a pattern generator?
Yes, just like everyone thought their .vimrc was amazing 20 years ago. It is vomit.
- Posting .vimrc was actually great. You can quickly scan it to find interesting bits, then you may add those bits in your config.
Now there’s nothing to pick or compare. Just vibes and my shamanic dance is twistier than yours.
- Nobody writes about their work thinking the whole world will read it. They write it for their friends, maybe a small group of regular readers, also for themselves. I for one really like it, even if I get bored after reading 5 similar articles, because maybe someone will only ever read one of them, and it’ll help them improve their own work.
- I mean, that argument doesn't hold water when you then post it on HN and Reddit.
- This workflow is another example of a developer with contempt for testing. Yes, there is iteration and review and output checking. In relatively low-risk projects that is enough, but so is basic vibe coding.
At some point in a serious project a responsible adult must ask the question: “How do I know this works well?” The developer himself is an unreliable judge of this. LLMs can’t judge, either. But anyone who seeks to judge, in a high stakes situation, must take time and thought to test deeply.
- > The single most valuable shift I made was treating every feature as a thinking problem first and an implementation problem second
That’s pretty much the whole point of software engineering. Coding is easy, solving problems is hard and can be messy (communication errors and workarounds to some inevitable issue).
If you’re familiar with the codebase, when you have a change request, you will probably get an insight on how to implement it. The hard thing is not to code it, but to recalibrate all the tradeoffs so that you don’t mess with existing features. That’s why SWE principles exist. To make this hard thing easier to do.
- I just use /brainstorming from https://github.com/obra/superpowers/tree/main
Then I tell it to write a high level plan. And then run subagents to create detailed plans from each of the steps in the high-level one. All plans must include the what, the why, and the how.
Works surprisingly well, especially for greenfield projects.
You have to manually review the code though. No amount of agentic code review will fix the idiocy LLMs routinely produce.
- >What is AI actually good at? Implementation. What is it genuinely bad at? Figuring out what you actually want
I've found it to be pretty bad at both.
If what you're doing is quite cookie cutter though it can do a passable job of figuring out what you want.
- LLMs work OK for "Mostly iterative and mostly one-off" tasks like codegen, where you can effectively "review the result into existence", and that's where most of the buzz is at the moment.
Where they don't work at all well is for hands-off repeatable tasks that have to be correct each time. If you ask a LLM for advice, it will tell you that you need to bound such tasks with a deterministic input contract and a deterministic output contract, and then externally validate the output for correctness. If you need to do that, you can probably do the whole thing old-skool with not much more effort, especially if you use a LLM to help gen the code, as above. That's not a criticism of LLMs, it's just a consequence of the way they work.
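For illustration only, the "input contract, output contract, external validation" pattern described above might be sketched like this. Everything here is an assumption made for the example: the ticket-labelling task, the names `TicketInput`, `validate_output` and `classify`, and the `generate` callable that stands in for whatever model call you actually use.

```python
import json
from dataclasses import dataclass
from typing import Callable

ALLOWED_LABELS = {"bug", "feature", "question"}  # hypothetical output vocabulary


@dataclass(frozen=True)
class TicketInput:
    """Input contract: the only fields the model is ever shown."""
    title: str
    body: str

    def __post_init__(self) -> None:
        if not self.title.strip():
            raise ValueError("title must be non-empty")
        if len(self.body) > 4000:
            raise ValueError("body too long")


def validate_output(raw: str) -> str:
    """Output contract: exactly {"label": <one of ALLOWED_LABELS>}."""
    data = json.loads(raw)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != {"label"}:
        raise ValueError(f"unexpected shape: {data!r}")
    label = data["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return label


def classify(ticket: TicketInput, generate: Callable[[str], str], retries: int = 2) -> str:
    """Call the model, but accept only output that passes external validation."""
    # Serialise the input the same way every time so the prompt is deterministic.
    prompt = json.dumps({"title": ticket.title, "body": ticket.body}, sort_keys=True)
    last_error = None
    for _ in range(retries + 1):
        try:
            return validate_output(generate(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"model never produced valid output: {last_error}")
```

Note how little is left for the model to do once both contracts and the validator exist, which is the point the comment makes about doing the whole thing deterministically.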
They are also prone to the most massive brain farts even in areas like coding - I asked a LLM to look for issues in some heavily multithreaded code. Its "High priority fix" for an infrequently used slow path that checked for uniqueness under a lock before creating an object was to replace that and take out a read lock, copy the entire data structure under the lock, drop the lock, check for uniqueness outside of any lock, then take a write lock and insert the new object. Of course as soon as I told it it was a dumbass it instantly agreed, but if I'd told it to JFDI its suggestions it would have changed correct code into badly broken code.
Like anything else that's new in the IT world, it's a useful tool that's over-hyped as sweeping away everything that came before it and that's gleefully jumped on by PHBs as a reason to get rid of those annoying humans. Things will settle down eventually and it will find its place. I'm just thankful I'm in the run up (down?) to retirement ;-)
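A minimal sketch of why that suggestion breaks correct code, assuming a hypothetical `Registry` class (the original code is not shown in the thread). Python's standard library has no read/write lock, so a single `threading.Lock` stands in for both; the race is the same. The interleaving is driven by hand so the outcome is deterministic.

```python
import threading


class Registry:
    """Hypothetical registry that must never hold two entries with the same key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items = []

    def add_unique(self, key: str) -> bool:
        """Original shape: check and insert inside one critical section."""
        with self._lock:
            if key in self._items:
                return False
            self._items.append(key)
            return True

    def snapshot(self) -> list:
        """First half of the suggested 'fix': copy everything under the lock."""
        with self._lock:
            return list(self._items)

    def add_if_absent_in(self, key: str, stale_copy: list) -> bool:
        """Second half: check the copy with no lock held, then lock and insert."""
        if key in stale_copy:
            return False
        with self._lock:
            self._items.append(key)  # nothing re-checks uniqueness here
            return True

    def count(self, key: str) -> int:
        with self._lock:
            return self._items.count(key)


def simulate_race() -> int:
    """Two callers both take their copy before either one inserts."""
    racy = Registry()
    copy_a = racy.snapshot()  # caller A copies the (empty) data
    copy_b = racy.snapshot()  # caller B copies before A has inserted
    racy.add_if_absent_in("job-1", copy_a)
    racy.add_if_absent_in("job-1", copy_b)
    return racy.count("job-1")
```

With the single critical section, a second `add_unique("job-1")` returns `False`. With the copy-then-check version, `simulate_race()` returns 2: both callers checked a stale copy, so the uniqueness guarantee is gone.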
- My AI-Results
- My workflow is quite similar, but it's leveraging Notion instead of markdown files.
https://github.com/tessellate-digital/notion-agent-hive
The main reason is we're already using Notion at work, and I wanted something where I could easily add/link to existing documents.
Sample size of one, but I've noticed a considerable improvement after adding a "final review" step, going through the plan and looking at the whole code change, over a naive per-task "implement-review" cycle.