- The author partially acknowledges this later on, but lines of code is actually quite a useful metric. The only mistake is that people have it flipped: lines of code are bad, and you should target fewer of them (except where that comes at the expense of other considerations). I regularly track LoC, because if it goes up more than I predicted, I probably did something wrong.
> Bill Gates compared measuring programming progress by lines of code to measuring aircraft building progress by weight
Aircraft weight is also a very useful metric, in the same flipped sense: weight is bad. And we absolutely do measure it!
- Dijkstra’s quote from 1988 is even better: "My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger."
- "Spent" is so apt: over time, code increasingly becomes debt.
- LoC desirability also depends on the project's stage.
Early on, we should see huge, chunky contributions and bursts. LoC means things are being realized.
In a mature product shipping at a sustained and increasing velocity, seeing LoC decrease or grow glacially year-on-year is a warm fuzzy feeling.
By my estimation aircraft designs should grow a lot for a bit (from 0 to not 0), churn for a while, then aim for specified performance windows in periods of punctuated stability.
Reuse scenarios create some nice bubbles where LoC growth in highly validated frameworks/components is amazing, as surrounding systems obviate big chunks of themselves. Local explosions, global densification and refinement.
- Author here; I think it can be a useful metric depending on the circumstance and use. The reason I decided to write that article is that I'm starting to hear more and more CTOs using it as the sole metric for their teams; I know of at least one instance where a CTO is pushing for agentic coding only and measuring each dev based on LoC output.
There is also the x.com crowd that is bragging about their OpenClaw agents pushing 10k lines of code every day.
- Everyone misses the technical goal of Google size AI.
Fill the gradient of machine states, then prune for correctness and utility.
That is not to say it's a good goal. But at the end of the day every program is electrical states in a machine. Fill machine, like search, see which ones are required to produce the most popular types of outputs, prune the rest.
Hint to syntax fans among programmers; most people will not be asking the machine to output Python or Elixir. Most will ask for movies, music, games. Bake the states needed to render and prune that geometry and color as needed. That geometry will include text shapes eventually too, enabling pruning away all the existing token systems like Unicode and ANSI. Storing state in strings is being deprecated.
Language is merely one user interface to reality. Grasp of it does not make one "more human" or in touch with the universe or yadda yadda. Such an argument is pretentious attention-seeking by those educated in a particular language. Look at them! ...recreating grammatically correct sentences per the rules of the language. Never before seen! Wow wow wow
Look at all the software written, all the books and themes within. Grasp of language these days is as novel an outcome as going to the grocery store, using a toilet.
- The problem with optimizing for less lines of code is the same as optimizing for unit tests: the less robust your code is, the better off you are.
Meaning, it's trivial to write unit tests when your code is stupid, only does happy-path stuff, and blows up on anything else. So if we say "you need 90% coverage" or whatever, people will write stupid, frail code that barely works in practice but is easy to unit test.
Similarly, if we say "do it with the least amount of code", we will also throw any hopes of robustness out the window, and only write stupid happy path code.
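To make the incentive concrete, here's a contrived Python sketch (function names and numbers are made up for illustration):

```python
# Happy-path-only parser: trivially easy to get 100% line coverage on,
# but it blows up on anything slightly off.
def parse_port(value: str) -> int:
    return int(value)  # no range check, no error handling

def test_parse_port():
    assert parse_port("8080") == 8080  # full coverage, zero robustness

# A robust version needs more lines and harder tests, so a raw coverage
# target or a "fewest lines" target quietly punishes writing it.
def parse_port_robust(value: str) -> int:
    try:
        port = int(value)
    except ValueError as exc:
        raise ValueError(f"not a number: {value!r}") from exc
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```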
- I think the author is missing a key distinction.
Before, lines of code were (mis)used to try to measure individual developer productivity. And there was the collective realization that this fails, because good refactoring can reduce LoC, a better design may use fewer lines, etc.
But LoC never went away, for example, for estimating the overall level of complexity of a project. There's generally a valid distinction between an app that has 1K, 10K, 100K, or 1M lines of code.
Now, the author is describing LoC as a metric for determining the proportion of AI-generated code in a codebase. And just like estimating overall project complexity, there doesn't seem to be anything inherently problematic about this. It seems good to understand whether 5% or 50% of your code is written using AI, because that has gigantic implications for how the project is managed, particularly from a quality perspective.
Yes, as the author explains, if the AI code is more repetitive and needs refactoring, then the AI proportion will seem overly high in terms of how much functionality the AI proportion contributes. But at the same time, it's entirely accurate in terms of how this is possibly a larger surface for bugs, exploits, etc.
And when the author talks about big tech companies bragging about the high percentage of LoC being generated with AI... who cares? It's obviously just for press. I would assume (hope) that code review practices haven't changed inside of Microsoft or Google. The point is, I don't see these numbers as being "targets" in the way that LoC once were for individual developer productivity... they're more just a description of how useful these tools are becoming, and a vanity metric for companies signaling to investors that they're using new tools efficiently.
- > the overall level of complexity of a project
The overall level of complexity of a project is not an "up means good" kind of measure. If you can achieve the same amount of functionality, obtain the same user experience, and have the same reliability with less complexity, you should.
Accidental complexity, as defined by Brooks in No Silver Bullet, should be minimized.
- Complexity is always the thing that needs to be managed, as it is ultimately what kills your app. Over time, as apps get more complex, it's harder and harder to add new features while maintaining quality. In a greenfield project you can implement this feature in a day, but as the app becomes more complex it takes longer and longer. Eventually it takes a year to add that simple feature. At that point, your app is basically dead, in terms of new development, and is forever in sustaining mode, barring a massive rewrite that dramatically reduces the complexity by killing unnecessary features.
So I wish developers looked at apps with a complexity budget, which is basically Dijkstra's line-of-code budget. You have a certain amount of complexity you can handle. Do you want to spend that complexity on adding these features or these other features? But there is a limit and a budget you are working with. Many times I have wished that product managers and engineering managers would adopt this view.
- So the question is, are these AI tools primarily creating inherent complexity, or is it a significant amount of accidental complexity?
And if AI tools are writing all of the code, does it even matter anymore?
- It absolutely does matter. LLMs still have to consume context and process complexity. The more LoC, the more complexity, the more errors you have, and the higher your LLM bills. That's true even in the AI-maximalist, vibe-code-only use case. The reality is that AI will have an easier time working in a well-designed, human-written codebase than in one generated by AI, and the problem of AI code output turning into AI coding input, with the AI choking on itself and making more errors, tends to get worse over time, with human oversight being the key tool to prevent this.
- A bit of a nit, but accidental complexity is still complexity, so even if that 1M lines could be reduced to 2k, it's still way more complex to maintain and patch than a codebase that's properly minimized at, say, 10k lines. (Even though this sounds unreasonable, I don't doubt it happens.)
- > The overall level of complexity of a project is not an "up means good" kind of measure.
I never said it was. To the contrary, it's more of an indication of how much more complex large refactorings might be, how complex it might be to add a new feature that will wind up touching a lot of parts, or how long a security audit might take.
The point is, it's important to measure things. Not as a "target", but simply so you can make more informed decisions.
- If tech companies want to show they have a high percentage of LoC being generated by AI, it's likely they are going to encourage developers to use AI to further increase these numbers, at which point it does become a measure of productivity.
- > It seems good to understand whether 5% or 50% of your code is written using AI, because that has gigantic implications for how the project is managed, particularly from a quality perspective.
I'd say you're operating on a higher plane of thought than the majority in this industry right now. Because the majority view roughly appears to be "Need bigger number!", with very little thought, let alone deep thought, employed towards the whys or wherefores thereof.
- A higher plane of thought would be "was AI able to remove 5% or 50% of the code while keeping or adding functionality and not diminishing clarity, consistency, and correctness"
- I don't think the author is missing this distinction. It seems that you agree with him in his main point which is that companies bragging about LOCs generated by AI should be ignored by right-thinking people. It's just you buried that substantive agreement at the end of your "rebuttal".
- > I would assume (hope) that code review practices haven't changed inside of Microsoft or Google.
Google engineer perspective:
I'm actually thinking code reviews are one of the lowest-hanging fruits for AI here. We have AI reviewers now in addition to the required human reviews, and they can do anything from being overly defensive at times, to pointing out inconsistently named variables (helpful), to sometimes finding a pretty big footgun that might otherwise have been missed.
Even if it's not better than a human reviewer, the faster turnaround time on some small % of potential bugs is a big productivity boost.
- This reminds me of the outsourcing craze of the 00s, when my employer wanted 80% of code written by our Indian partner. And they measured it by lines of code.
Quite a bit of my time was spent rewriting the massive amounts of garbage churned out by offshore partners.
Management stuck to their goal, so the compromise was to not delete offshore lines, but to comment them out.
Lines of code is a dumb metric, and anyone touting it for anything meaningful is disconnected from reality. It's bad that all these CEOs are touting it, but they kind of always use these dumb metrics.
- This is the perfect illustration of Goodhart's Law : https://en.wikipedia.org/wiki/Goodhart%27s_law
- Even worse there are more than a few CTOs touting it.
- I've noticed that it's super easy to end up with tons of extra lines of code when using AI. It will write code that's already available (whether in external libraries, internal libraries, or elsewhere in the same project). I don't mind trying to keep dependencies down, but I also don't want every project to have its own poorly tested CSV parser.
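To make it concrete, this is roughly the kind of duplication I mean (a contrived Python sketch; the stdlib csv module already does the job):

```python
import csv
import io

# The kind of "poorly tested CSV parser" that tends to get generated inline:
def parse_csv_naive(text: str) -> list[list[str]]:
    return [line.split(",") for line in text.splitlines()]

# The standard library already handles quoting, embedded commas, etc.
def parse_csv(text: str) -> list[list[str]]:
    return list(csv.reader(io.StringIO(text)))

sample = 'name,quote\nAda,"Hello, world"\n'
print(parse_csv_naive(sample))  # [['name', 'quote'], ['Ada', '"Hello', ' world"']] -- wrong
print(parse_csv(sample))        # [['name', 'quote'], ['Ada', 'Hello, world']]
```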
It also often fails to clean up after itself. When you remove a feature (one that you may not have even explicitly asked for), it will sometimes just leave the unused code behind. This is really annoying when you're reviewing and realize one of the files you read through is referenced nowhere.
You have to keep a close eye out to prevent bloat from these issues.
- Absolutely. Even worse, when you ask AI to solve a problem it almost always adds code, even if a better solution exists that removes code. If AI's new solution fails and you ask it to fix it, it throws in even more code, creates more mess, and introduces new unnecessary states. Rinse, repeat ad infinitum.
I did this a few times as an experiment while already knowing how the problem could be solved. In difficult situations Cursor invariably adds code and creates even more mess.
I wonder if this can be mitigated somehow at the inference level because prompts don't seem to be helping with this problem.
- Same thing happens with infrastructure config. Ask an AI to fix a security group issue and it'll add a new rule instead of fixing the existing one. You end up with 40 rules where 12 would do and nobody knows which ones are actually needed anymore.
- The code written by AI is in most cases throwaway code to be improved/refined later. It's likely to be large, verbose, and bloated. Some agents have "simplify/refactor" as a final step in their design to remedy this, but typically your average vibe coder will be satisfied that the code just compiles/passes the minimal tests. Lines of code are easy to grow. If you refine the AI code with iterative back-and-forth questions, the AI can in principle be forced to write a much more compact or elegant version, but you can't apply this to most large systems without breaking something, as the AI doesn't have context for what is actually changing: e.g. an isolated function can be improved easily, but the AI can't handle complexity when abstractions stack and multiple systems interface, typically because it confuses states where global context is altered.
- I've been working to overcome this exact problem. I believe it's fully tractable. With proper deterministic tooling, the right words in context to anchor latent space, and a pilot with the software design skills to do it themselves, AI can help the pilot write and iterate upon properly-designed code faster than typing and using a traditional IDE. And along the way, it serves as a better rubber duck (but worse than a skilled pair programmer).
- It reminds me. When I had my consulting company 20 years ago, I defined these "metrics" to decide if a project was successful.
- Is the client happy? - Are the team members growing (as in learning)? - Were we able to make a profit?
Everything else was less relevant. For example: why do I care that the project took a bit longer if, at the end, the client was happy with the result and we can continue the relationship with new projects? It frees you from the cruelty of dates that are often set arbitrarily.
So perhaps we should evaluate AI coding tools the same. If we can deliver successful projects in a sustainable way, then we are good.
- Even in AI-based development this makes no sense. I can write an agent loop that simplifies and enhances reusability, which is more effective than having tons and tons of lines of code.
LOC as a KPI is useless, and people should humiliate Elon over that. (Paraphrasing Linus on that comment and adding my support.)
- LoC is a good code quality metric, only it has to be inverted. Not "it wrote C compiler in 100 000 lines of code", but "in just 2000 lines of code". Now that is impressive and deserves praise.
- Not necessarily.
If I minimize my project and get everything on one line, is that good? I think not.
Measuring success based on how many or how few lines there are is a bad idea, I think.
- I don't want to maintain an information system written in the style of code golf competitions, either.
I like the author's proposed "Comprehension coverage" metric. It aligns well with Naur's Programming as Theory Building.
- This is also easily gameable. Code in any language can easily be converted into a single LOC.
- Lines of code changed are a very bad measure for humans because they can be gamed. They're an OK measure of "work done" with AI if you don't prompt the model to game them. They're useful because they're quick to calculate and concrete, but if you use them as more than a proxy they'll bite you.
- Forget about whether LoC is (or ever was) a meaningful metric; in a world where humans - and now AI - essentially wire up massive amounts of existing functionality with relatively little glue code, I don't even know how to measure it.
- Detection of copy-pasta is interesting - what it's calling out is not a deficiency in LLMs' ability to code, but in the agentic rules in place that should just remind the agent to refactor into a common function when appropriate.
- >> that should just remind the agent to refactor into a common function when appropriate.
This off-the-cuff statement buries so much complexity. Sure, it catches new code that exactly implements existing code, but IME it is __way__ more common to need to slightly (or not so slightly) change existing code so it can now be used by multiple consumers, and then delete the new "duplicate" code. That is not trivial and requires (1) judgement from your AI coder and (2) deep reviewer expertise from your human coder.
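A contrived Python sketch of what I mean (names like smtp_send and the email functions are made up for illustration, not from the article):

```python
def smtp_send(to: str, subject: str, body: str) -> None:
    ...  # stand-in for whatever mail transport the project already has

# Before: existing code, plus a near-duplicate the agent just added.
def send_welcome_email(user):
    smtp_send(to=user.email, subject="Welcome",
              body=f"Hi {user.name}, welcome aboard!")

def send_password_reset_email(user, token):  # new, ~90% the same as above
    smtp_send(to=user.email, subject="Password reset",
              body=f"Hi {user.name}, reset link: https://example.com/reset/{token}")

# After: the *existing* code is generalized, callers are updated, and the
# duplicate is deleted -- the part that needs judgement and careful review.
def send_email(user, subject: str, body: str) -> None:
    smtp_send(to=user.email, subject=subject, body=body)

def send_welcome_email(user):
    send_email(user, "Welcome", f"Hi {user.name}, welcome aboard!")

def send_password_reset_email(user, token):
    send_email(user, "Password reset",
               f"Hi {user.name}, reset link: https://example.com/reset/{token}")
```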
- This is kind of an aside, but nowadays we could at least be counting (lexer) tokens of code instead of lines of code. Or even number of AST edges.
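As a rough sketch of what that could look like in Python (using the stdlib tokenize and ast modules; which token types to skip is just one possible choice):

```python
import ast
import io
import tokenize

def count_tokens(source: str) -> int:
    # Count lexer tokens, ignoring pure layout artifacts.
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT,
            tokenize.COMMENT, tokenize.ENDMARKER}
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in toks if tok.type not in skip)

def count_ast_edges(source: str) -> int:
    # Each parent->child link in the syntax tree counts as one edge.
    tree = ast.parse(source)
    return sum(len(list(ast.iter_child_nodes(node))) for node in ast.walk(tree))

if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(count_tokens(snippet), count_ast_edges(snippet))
```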
- This (LOC is an anti-metric, Goodhart's Law, etc.) is true, but I'm reaching the point of "fuck nuance" when I see so many articles superficially critical of AI which contain things like this:
> If AI-generated code introduces defects at a higher rate, you need more review, not less AI.
I think that is very much up for debate despite being so frequently asserted without evidence! This strikes me as the same argument as we see about self-driving cars: they don't have to be perfect, because there is (or we can regulate that there must be) a human in the loop. However, we have research and (sometimes fatal) experience from other fields (aviation comes to mind) about "automation complacency" - the human mind just seems to resist thoroughly scrutinizing automation which is usually right.
- I don't disagree with you entirely here. I probably wasn't clear enough on what I was trying to convey.
Right now, AI / agentic coding doesn't seem to be a train we are going to be able to stop, and at the end of the day it is a tool like any other. Most of what seems to be happening is people letting AI fully take the wheel: not enough specs, not enough testing, not enough direction.
I keep experimenting with and tweaking how much direction to give the AI in order to produce less fuckery and more productive code.
- Sorry for coming off combative - I'm mostly fatigued from "criti-hype" pieces we've been deluged with the last week. For what it's worth I think you're right about the inevitability but I also think it's worth pushing a bit against the pre-emptive shaping of the Overton window. I appreciate the comment.
I don't know how to encourage the kind of review that AI code generation seems to require. Historically we've been able to rely on the fact that (bluntly) programming is "g-loaded": smart programmers probably wrote better code, with clearer comments, formatted better, and documented better. Now, results that look great are a prompt away in each category, which breaks some subconscious indicators reviewers pick up on.
I also think that there is probably a sweet spot for automation that does one or two simple things and fails noisily outside the confidence zone (aviation metaphor: an autopilot that holds heading and barometric altitude and beeps loudly and shakes the stick when it can't maintain those conditions), and a sweet spot for "perfect" automation (aviation metaphor: uh, a drone that autonomously flies from point A to point B using GPS, radar, LIDAR, etc...?). In between I'm afraid there be dragons.
- I should change my linkedin to vibe coder fixer
- I do think that "yo, look at me pumping out 10kloc/day" brags are quite stupid, because it simply does not take that many lines of code to support a large, profitable product. I'd love to hear other people's experiences here, but I'd say that a product with 100K-500K LOC can support a profitable company of dozens of people and generate tens of millions in revenue.
Now, 100kloc is roughly 1M tokens, which cost a few dollars to generate, so how could something that costs single-digit dollars possibly be worth tens of millions in value? Clearly there's a substantial gap between how useful different pieces of code are, so bragging about how much of it you produce without telling me how valuable it is is useless. I guess it's a long-winded way of saying "show me the money".
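Back-of-the-envelope, with assumed numbers (roughly 10 tokens per line and an illustrative price per million tokens, neither figure from the thread):

```python
# Rough version of the arithmetic above; all constants are illustrative.
loc = 100_000
tokens_per_line = 10                # assumption: ~10 tokens per line of code
price_per_million_tokens = 10.0     # USD, illustrative ballpark

tokens = loc * tokens_per_line                          # ~1,000,000 tokens
cost = tokens / 1_000_000 * price_per_million_tokens    # ~$10
print(f"{tokens:,} tokens, roughly ${cost:.0f} to generate")
```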
- Lines of code not written to achieve a result can be an important metric too.
Focusing on capabilities instead of shipping code also can provide a better measure.
- I was enjoying the article, and then the author hit me with one of these:
> AI didn't just repeat the mistake. It broke the mistake open.
Come on bruh
- HAHAHA, whoops. Can I blame it on jetlag?
- When I was just beginning, all of the productivity measures would be 0 and I felt like a failure. The most attainable was lines of code. Currently, it's not a great measure of productivity as I'm achieving more advanced tasks. I've heard so many opinions about how LOC isn't a great measure and then the same people get to trample on all of the work I've done out of spite because I've written more code than them. I think LOC is great because productivity measures are for beginners and people who don't understand code. The audience doesn't know the difference between writing a hundred or thousands of lines of code, both are trophies for them.
These metrics for advanced roles are not applicable, no matter what you come up with. But even lines of code are good enough to see progress from a blank slate. Every developer or advanced AI agent must be judged on a case by case basis.
- But if you remove 1kLOC and replace them with 25 for a much better function, the report shows -975 LOC. Does this count as negative productivity? Having brackets start and stop on their own lines could double LOC counts, but would that actually improve the code base? Would it be a sign of doubled productivity?
The OpenBSD project prides itself on producing very secure, bug-free software, and they largely trend towards as few lines of code as they can possibly get away with while maintaining readability (so no code-golf tricks, for the most part). I would rather we write secure, bug-free software than speed up the ability to output 10kLOC. Typing out the code isn't the difficult part in that scenario.
- No one judges a painting by the amount of paint, or a wooden chair by the number of nails in it. The amount of LoC doesn’t matter. What matters is that the code is bug-free, readable, and maintainable.
But reducing the amount of LoC helps, just like using the correct word helps in writing text. That’s the craft part of software engineering, having the taste to write clear and good code.
And just like writing and any other craft, the best way to acquire such taste is to study others' works.