• ryao
    Linux added a busy polling feature for high-performance networking. Most Linux software does not use it, but software used in datacenters (e.g. by CDNs) that does use it makes the system very energy inefficient when things are not busy. This patch gives the kernel the ability to turn that off when not busy, to regain energy efficiency until things become busy again.

    The article name is somewhat misleading, since it makes it sound like this would also apply to desktop workloads. The article says it is for datacenters and that is true, but it would have been better had the title ended with the words “in datacenters” to avoid confusion.

    • It’s a lot more nuanced than that. Being in a data center doesn’t imply heavy network utilization. The caveat clearly outlined in the article is workload-based, not deployment-based. If you have a home machine doing network routing, it would absolutely benefit from this. In fact I would say the vast majority of Linux installs are probably home network devices; people just don’t know it. Embedded Linux machines doing network routing or switching or IPS or NAS or whatever would benefit a lot from this. “Energy savings” can be seen as a greenwashing statement, but on embedded power budgets it’s a primary requirement.
      • > If you have a home machine doing network routing it would absolutely benefit from this.

        It most likely won't. This patch set only affects applications that enable epoll busy poll using the EPIOCSPARAMS ioctl. It's a very specialized option that's not commonly used by applications. Furthermore, network routing in Linux happens in the kernel, not in user space, so this patch set doesn't apply to it at all.
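
        To make the opt-in concrete, this is roughly what an application has to do before any of this machinery applies. A minimal sketch, not tested: the struct layout and ioctl number mirror the UAPI header added by commit 18e2bf0edf4d (quoted in a reply below), so check your own kernel headers, and the numeric values are placeholders rather than recommendations.

            #include <stdint.h>
            #include <stdio.h>
            #include <sys/epoll.h>
            #include <sys/ioctl.h>
            #include <linux/ioctl.h>

            #ifndef EPIOCSPARAMS
            /* Fallback for older libc headers; mirrors include/uapi/linux/eventpoll.h. */
            struct epoll_params {
                uint32_t busy_poll_usecs;   /* how long epoll_wait busy polls */
                uint16_t busy_poll_budget;  /* max packets per poll attempt */
                uint8_t  prefer_busy_poll;  /* must be 1 for IRQ suspension */
                uint8_t  __pad;
            };
            #define EPOLL_IOC_TYPE 0x8A
            #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
            #endif

            int main(void)
            {
                int epfd = epoll_create1(0);
                struct epoll_params p = {
                    .busy_poll_usecs  = 64,   /* placeholder */
                    .busy_poll_budget = 8,    /* placeholder */
                    .prefer_busy_poll = 1,
                };

                if (ioctl(epfd, EPIOCSPARAMS, &p) != 0)
                    perror("EPIOCSPARAMS");   /* e.g. kernel/headers older than 6.9 */

                /* ... add sockets with epoll_ctl() and run the usual epoll_wait()
                 * loop; the 6.13 irq_suspend_timeout mechanism only applies to
                 * contexts configured like this. */
                return 0;
            }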

        • NAPI is not busy poll, though, and the way the article is worded suggests it's about NAPI.

          Now, NAPI was already supposed to have some adaptiveness built in, so I guess it's possibly a matter of optimizing it.

          But my system is compiling right now, so I'll look at the article in more depth later :V

          • The article is terrible; this doesn't affect the default NAPI behaviour. See the LWN link posted elsewhere for a more detailed, technical discussion. From the patch set itself:

            > If this [new] parameter is set to a non-zero value and a user application has enabled preferred busy poll on a busy poll context (via the EPIOCSPARAMS ioctl introduced in commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for epoll_params")), then application calls to epoll_wait for that context will cause device IRQs and softirq processing to be suspended as long as epoll_wait successfully retrieves data from the NAPI. Each time data is retrieved, the irq_suspend_timeout is deferred.

      • > It’s a lot more nuanced than that. Being in a data center doesn’t imply heavy network utilization.

        I suggest you reread “software used in datacenters (e.g. by CDNs) that does use it”. This is not a reference to software in a datacenter. It is a reference to software in a datacenter that uses it, which is a subset of the former.

      • > I would say probably the vast majority of Linux installs are home network devices

        I expect there are many, but the vast majority are likely in massive datacenters with hundreds of thousands of machines, each running multiple instances; Android phones are also probably more common than home equipment. Edit: Also IoT, as someone else points out.

        • The vast majority of Linux kernels run on Android.

          I would be shocked if I am incorrect.

          • I could see that being right. At the same time, how many servers running Linux are there, and how many instances per server?
      • I was curious how much this would be applicable to home routers.

        I confess I'm dubious about major savings for most home users, though? At least at an absolute level: 30% of less than five percent is still not that big of a deal. No reason not to do it, but don't expect to really see the results there.

          • Practically none of it would be applicable (if using a commercial router). They all use hardware offloading, and traffic seldom touches the CPU. Only "logical" tasks are raised to the CPU, like ARP resolution and the like (what’s called "trap to CPU").

          If you’re doing custom routing with a NUC or a basic Linux box, however, this would gain massive power savings because that box pretty much only does networking.

          • Only if you're using busy polling. Very little software uses it, because it's only a good fit if you think pegging a CPU to reduce latency responding to packets is a good trade.
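
            For context, the older per-socket form of the opt-in looks like this. Just a sketch: SO_BUSY_POLL has been around since kernel 3.11, the mechanism discussed here hooks in at the epoll level instead, and the helper name is made up for illustration.

                #include <sys/socket.h>

                /* Sketch: the classic per-socket busy-poll opt-in. Few applications
                 * call this, because it deliberately burns CPU to shave latency. */
                static int enable_busy_poll(int sockfd, int usecs)
                {
                    return setsockopt(sockfd, SOL_SOCKET, SO_BUSY_POLL,
                                      &usecs, sizeof(usecs));
                }
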
            • Yup, you’re right. I always forget this patch only applies to opted-in polling and not automatic polling for interrupt moderation.
          • If you run a VPN server, I assume all the traffic will come to the CPU to be decrypted, right?
        • For embedded it’s not about saving cost; it’s about conserving a limited on-board power budget. The lower the power demand of the device, the smaller it can be and the more power you can dedicate to other things.
          • Again, no reason not to make this change. I'm not clear it helps in this case, though? The savings are during low-use periods; during high use it's the same, right?
      • There should have been an off switch from the beginning IMO, though perhaps there already is? But the number of corporations messing with the kernel for their own purposes greatly outnumbers the number of independent developers. That gives big tech a big influence. And the Linux Foundation is there to cement that influence.

        I much prefer grassroots projects. Made by and for people like me <3 That's why I moved to BSD (well there were other reasons too of course)

        • Uh, Linux is a grassroots project, quite unlike the Berkeley Software Distribution.

          Also, no, you are entirely unaffected by this unless you use a very specific and uncommon syscall.

          • It was originally, yes. There's a lot of corporate influence now (just look at the monthly kernel submissions). https://news.itsfoss.com/huawei-kernel-contribution/

            Like others have mentioned, it's just a huge deal in the data center now. With that comes a lot of influence by corporate interests.

            Whereas BSD has gone the opposite way. Started by Berkeley but abandoned to the community. Business is not really interested in that because anything they contribute can be used by anyone for anything (even their competitors can use it in closed-source code). Netflix was the biggest user but I don't think they contribute anymore either. WhatsApp used it until Facebook acquired them. That leaves Netgate and iXsystems, which are small. Netgate pushed a really terrible WireGuard implementation once, but it was nipped in the bud luckily. https://arstechnica.com/gadgets/2021/03/buffer-overruns-lice... It also highlighted many trust issues, which have since been improved.

            Of course, whether this is an issue for you is very personal. For me it is, but clearly for a lot more people it isn't, as Linux is a lot more popular.

            • No, it's not an issue for me; I just pointed out a factual mistake.
          • Gud
            Linux has been a corporate project for at least 20 years.
            • The "roots" part in "grassroots" means something.
              • Yes, but I don't think it means what you think it means? Just because Linux was developed by basement-dwelling hackers 20+ years ago doesn't make it a grassroots project.

                Arguably, the only somewhat mainstream operating systems today that deserve that label are the *BSDs. Haiku OS gets an honorable mention but I wouldn't consider Haiku OS to be mainstream.

                • It quite literally makes it a grassroots project.

                  > Arguably, the only somewhat mainstream operating systems today that deserve that label are the *BSDs.

                  I think the word you're looking for is 'sidelined'.

                  • Meh not really. I'm running BSD with the very latest software. KDE, Firefox, Chrome, LibreOffice. All rolling unlike most Linux distros. I really don't care how 'big' my OS is.

                    But yeah grassroots was perhaps not the right term. I meant the status quo, not the origin. I don't know what the right word is then though.

      • What home router would be busy polling?
        • Someone who decided to turn their old PC into a router because a YouTube video said "recycling" an old PC is more green, despite the massive power suck.
          • I feel like I'm repeating myself, but there are no people out there running Linux and unintentionally busy polling their network interface. This is something almost nobody does.
    • > The article name is somewhat misleading, since it makes it sound like this would also apply to desktop workloads.

      The article’s title is “Data Centers Can Slash Power Needs With One Coding Tweak: Reworking 30 lines of Linux code could cut power use by up to 30 percent”.

      The article says, “It is sort of a best case because the 30 percent applies to the network stack or communication part of it,” Karsten explains. “If an application primarily does that, then it will see 30 percent improvement. If the application does a lot of other things and only occasionally uses the network, then the 30 percent will shrink to a smaller value.”

      It seems you only read the HN title? If so, why bother to critique the article's title?

    • heh, will this become the equivalent of "in mice" for bio papers?
    • It appears that this academic is very good at public relations.

      They mix interrupts and polling depending on the load. The interrupt service routine and user-kernel context-switch overhead is computationally tiny, and hence tiny in terms of power usage.

      Also, most network hardware in the last twenty years has had buffer coalescing, reducing interrupt rates.

    • But Linux use outside datacenters and server farms is irrelevant in terms of power use anyway; I don't think anybody would be confused in the context of power usage.
    • So it's not even on by default, it needs to be turned on explicitly?
    • You don’t have to say “in datacenters” when talking about Linux; that is the obvious context in the vast majority of cases.
      • I'm reading this on an Android phone. The phone OS that has over 70% market share.
        • Linux is a tiny, tiny part of the OS on phones. Power usage is most certainly not handled by Android or Linux in the vast majority of cases.

          Who in their right mind would leave something so critical to the success of their product to a general-purpose OS designed for mainframes and slightly redesigned for PCs?

      • Why do you think so? Especially among HN readers, a decent chunk is using Linux.

        (I typed this on a Linux PC.)

        • Agree. On reading the headline, I was hoping it would extend my laptop battery life.
            • Same here; I have used Linux as my main OS for a long time. Though I would imagine that data centers have more instances than all the desktops and laptops around the world combined.
            • And another one here!
      • This is a naming mistake. You're taking "Linux" in the sense of "Linux Distribution", where the various competitors have seen most of their success in the datacenter.

        This is specifically a change to the Linux kernel, which is much, much more broadly successful.

      • IME the majority of Linux systems are laptops and desktops.
        • Without looking at stats, I would think android phones.
          • Android isn't Linux. While Android uses the Linux kernel, it's not a standard Linux distribution. It has its own user space and libraries, making it a distinct operating system.
              • Linux is the kernel. Android uses a forked version of the mainline Linux kernel.
              • Everyone uses a forked version of the mainline Linux.
                • Some distributions apply patches to the kernel but they don’t fork it.
            • This means it's not GNU/Linux, but that's not the only flavor of Linux. Alpine uses busybox and musl instead of GNU userland and glibc, but few would say it's not a Linux distribution.
            • It's Android/Linux, not GNU/Linux.
            • Linux is a kernel, not an operating system.
        • At least inside Docker or WSL.
          • Docker wouldn’t be affected by this; the network stack sits below it. WSL would, however.
      • IoT devices would like a word
      • Indeed he has to.
  • One thing I didn’t see mentioned here yet: a lot of high-performance data center workloads don’t actually go through the Linux kernel’s network stack at all.

    Instead, they use DPDK, XDP, or userspace stacks like Onload or VMA—often with SmartNICs doing hardware offload. In those cases, this patch wouldn’t apply, since packet processing happens entirely outside the kernel.

    That doesn’t mean the patch isn’t valuable—it clearly helps in setups where the kernel is in the datapath (e.g., CDNs, ingress nodes, VMs, embedded Linux systems). But it probably won’t move the needle for workloads that already bypass the kernel for performance or latency reasons. So the 30% power reduction headline is likely very context-dependent.

    • There are a whole lot of commodity 1-2U rack mount boxes running RHEL or CentOS or the like out there that are mostly idle, and which don’t do anything fancier in hardware than maybe checksum or VLAN offload.
  • For a more detailed look at this change: https://lwn.net/Articles/1008399/
  • This is really cool. As a high-performance computing professional, I've often wondered how much energy is wasted due to inefficient code and how much that is a problem as planetary compute scales up.

    For me, it feels like a moral imperative to make my code as efficient as possible, especially when a job will take months to run on hundreds of CPU.

    • Years ago, I posted here that there should be some sort of ongoing Green X-Prize for this style of Linux kernel optimization. It's still crazy to me that this doesn't exist.
      • I would hope that the hyperscalers have sufficient economic incentive to optimize this without an X-prize.
        • One would hope, but when I last gave this some thought... If you are in the C-suite, and could deploy your best devs to maybe save some unknown % on energy costs, or have them work on a well-defined new feature that grows your ARR 5% next year, which would you do?

          Also, would you share all new found efficiencies with your competitors?

          • At a large business the 1% cost savings is going to be a lot easier to find than the 5% revenue growth.
        • It's a public good where the biz that creates it captures very little of the value it generates, so investment in this kind of optimization is likely far below optimal.
    • Meanwhile I'm running four Electron apps just to act like different shitty IRCs.
    • > I've often wondered how much energy is wasted due to inefficient code and how much that is a problem as planetary compute scales up.

      I personally believe the majority is wasted. Any code that runs in an interpreted language, JIT/AOT or not, is at a significant disadvantage. On performance measurements it's as bad as 2x to 60x worse than the performance of the equivalent optimized compiled code.

      > it feels like a moral imperative to make my code as efficient as possible

      Although we're still talking about fractions of a Watt of power here.

      > especially when a job will take months to run on hundreds of CPU.

      To the extent that I would say _only_ in these cases are the optimizations even worth considering.

    • rvz
      Absolutely.

      It is unfortunate that many software engineers continue to dismiss this as "premature optimization".

      But as soon as I see resource or server costs gradually rising every month (even at idle), reaching into the tens of thousands, which is a common occurrence as the system scales, it becomes unacceptable to ignore.

      • When you achieve expertise you know when to break the rules. Until then it is wise to avoid premature optimization. In many cases understandable code is far more important.

        I was working with a peer on a click handler for a web button. The code ran in 5-10ms. You have nearly 200ms budget before a user notices sluggishness. My peer "optimized" the 10ms click handler to the point of absolute illegibility. It was doubtful the new implementation was faster.

        • Depending on your infrastructure spend and the business revenue, if the problem is not causing the business to increase spending on infrastructure each month, or if there’s little to no rise in user complaints over slowdowns, then the “optimization” isn’t worth it and is premature.

          Most commonly, if costs increase as users increase, it becomes an efficiency issue: the scaling is neither good nor sustainable, which can easily destroy a startup.

          In this case, the Linux kernel is directly critical for applications in AI, real-time systems, networking, databases, etc., and performance optimizations there make a massive difference.

          This article is a great example of properly using compiler optimizations to significantly improve performance of the service. [0]

          [0] https://medium.com/@utsavmadaan823/how-we-slashed-api-respon...

        • DBs can compile and run complex queries in that time budget. What did the click handler do?
      • I’d think anything as old and widely used as Linux would not be seen as premature, with all optimizations welcome.
    • My experience with HPC is only tangential, as a sysadmin handling data taking and cluster management for a high-energy-physics project; I am interested in your thoughts about using generative AI to search out potentially power-inefficient code paths in codebases for potential improvement.
      • Don't send an LLM to do a profiler's job.
        • I agree with this.

          For the completely uninitiated, taking the most critical code paths uncovered via profiling and asking an LLM to rewrite it to be more efficient might give an average user some help in optimization. If your code takes more than a few minutes to run, you definitely should invest in learning how to profile, common optimizations, hardware latencies and bandwidths, etc.

          With most everything I use at the consumer level these days, you can just feel the excessive memory allocations and network latency oozing out of it, signaling the inexperience or lack of effort of the developers.

  • Does this mean that "adaptive interrupt mitigation" is no longer a thing in the kernel? I haven't really messed with it in ~15+ years, but it used to be that the kernel would adapt, if network rate was low it would use interrupts, but then above a certain point it would switch to turning off interrupts and using polling instead.

    The issue I was trying to resolve was sudden, dramatic changes in traffic. Think: a loop being introduced in the switching, and the associated packet storm. In that case, interrupts could start coming in so fast that the system couldn't get enough non-interrupted time to disable the interrupts, UNLESS you have more CPUs than busy networking interfaces. So my solution then was to make sure that the Linux routers had more cores than network interfaces.
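
    (A related, still-present knob is the per-device adaptive coalescing flag exposed through ethtool; below is a rough, untested C sketch of querying it, with "eth0" as a placeholder interface name.)

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <net/if.h>
        #include <sys/ioctl.h>
        #include <sys/socket.h>
        #include <linux/ethtool.h>
        #include <linux/sockios.h>

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
            struct ifreq ifr = { 0 };

            strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder interface */
            ifr.ifr_data = (char *)&ec;

            if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
                printf("rx-usecs=%u rx-frames=%u adaptive-rx=%u\n",
                       ec.rx_coalesce_usecs,
                       ec.rx_max_coalesced_frames,
                       ec.use_adaptive_rx_coalesce);
            else
                perror("ETHTOOL_GCOALESCE");

            close(fd);
            return 0;
        }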

  • The flip side of this is Meta having a hack that keeps their GPUs busy so that the power draw is more stable during LLM training (e.g. they don't want a huge power drop when synchronizing batches).
    • I thought of this too. IIRC it was a bigger problem because surging spikes in power and cooling were harder and more costly to account for.

      I'm not au fait with network data centres, though; how similar are they in terms of their demands?

      • I think the issue is exactly the spikiness, because of how AC electricity works (whereas if the data center were DC, e.g. wired through a battery, it wouldn't be an issue).

        I expect you're right that GPU data centers are a particularly extreme example

    • Have a link to an article talking about them doing this?
      • I tried desperately to source this even before seeing your request.

        My current guess is that I heard it on a podcast (either a Dwarkesh interview or an episode of something else - maybe transistor radio? - featuring Dylan Patel).

        I'll try to re-listen to the top candidates in the next two weeks (a little behind on current episodes because I'm near the end of an audiobook) and will try to ping back if I find it.

        If too long has elapsed, update your profile so I can find out how to message you!

    • That goes against separation of concerns. A separate utility must be created for that specific purpose, not hidden in some other part of the system.
  • Off topic: glad to read about Joe Damato again — such a blast from the past. I haven't read anything from him since I first read James Gollick's posts on tcmalloc and then learned about packagecloud.io, which eventually led me to Joe's amazing posts.
  • Key paragraph: "This energy savings comes with a caveat. “It is sort of a best case because the 30 percent applies to the network stack or communication part of it,” Karsten explains. “If an application primarily does that, then it will see 30 percent improvement. If the application does a lot of other things and only occasionally uses the network, then the 30 percent will shrink to a smaller value.”"
    • I'd suggest a submission title change to "...cut networking software power use..."

      Not so sexy. But as a force multiplier that's still a lot of carbon probably.

  • When your sentence contains "Up To", literally anything is possible.
  • I love how I can update Linux and expect it to get faster, not slower. Exception to be made for Nvidia drivers, but at least their performance is consistent.
  • Wow, busy waiting is more resource-intensive than I realized.
  • When a machine is large enough with enough tenants, there should be enough ambient junk going on to poll on 1 or a few cores and still be optimized for power. This is the tradeoff described in the 2019 "SNAP" paper from Google, in the section about the compacting engine.

    The "up to 30%" figure is operative when you have a near-idle application that's busy polling, which is already dumb. There are several ways to save energy in that case.

    • > The "up to 30%" figure is operative when you have a near-idle application that's busy polling, which is already dumb.

      That was my first thought, but it sounds like the OS kernel, not the application, has control over the polling behavior, right?

  • I thought Intel added the ‘pause’ instruction to make busy spinning more power-friendly.
      • x86 has always had PAUSE; what they did was make PAUSE take much longer on Skylake-X, which threw everyone off guard. But yeah, x86 ISA extensions cover this ground very well. Atom, the low-power server line, introduced UMWAIT and TPAUSE, which let cores briefly enter a power-saving active state while waiting for something to happen. These instructions later came to mainstream Core and Xeon CPUs because Intel made those frankenchips with Atoms and Cores together.
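
        For illustration, this is the usual spin-wait idiom where PAUSE ends up (a minimal sketch; _mm_pause() compiles down to PAUSE, while the UMWAIT/TPAUSE path instead needs WAITPKG-capable CPUs and their newer intrinsics):

            #include <emmintrin.h>    /* _mm_pause */
            #include <stdatomic.h>

            /* Sketch: PAUSE hints that this is a busy-wait loop, letting the core
             * throttle back and yield pipeline resources to its SMT sibling instead
             * of re-executing the load as fast as possible. */
            static void spin_until_set(const atomic_int *flag)
            {
                while (atomic_load_explicit(flag, memory_order_acquire) == 0)
                    _mm_pause();
            }
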
        • It had REP NOP, which is the same opcode as PAUSE, but on CPUs that support it, PAUSE does less work for the same effect. At least that's my understanding of it.
  • > When I grew up in computer science in the 90s, everybody was concerned about efficiency...“Somehow in the last 20 years, this has gotten lost. Everybody’s completely gung-ho about performance without any regard for efficiency or resource consumption. I think it’s time to pivot, where we can find ways to make things a little more efficient.

    This is the sort of performance efficiencies I want to keep seeing on this site, from those who are distinguished experts and contributed to critical systems such as the Linux kernel.

    Unfortunately, in the last 10-15 years we have been seeing the worst technologies paraded around due to cargo-cultish behaviour: asking candidates to implement the most efficient solution to a problem in interviews, but then choosing the most extremely inefficient technologies to solve certain problems, because so-called software shops are racing for that VC money. Money that goes to hundreds of k8s instances on many over-provisioned servers instead of a few.

    Performance efficiency critically matters, and it is the difference between having enough runway for a sustainable business vs having none at all.

    And nope, AI agents / vibe coders could not have come up with a solution as correct as the one in the article.

  • Oh, as a guy who was using DPDK-like technology that busy-polls and bypasses the kernel to process network packets, I must say a lot of power may have been wasted...
    • Standard practice in all trading applications is to busy poll the NIC using a kernel bypass technology.

      Typically saves 2-3 microseconds going through the kernel network stack.

  • I'd like to see clearly explained Linux boot code that does not invoke a prayer to Satan.
  • Noob question: how do you enable this power saving feature? Is it just upgrading to the latest kernel?
  • Niiice. What if we reworked 100 lines?
    • A manager went to the master programmer and showed him the requirements document for a new application. The manager asked the master: “How long will it take to design this system if I assign five programmers to it?”

      “It will take one year,” said the master promptly.

      “But we need this system immediately or even sooner! How long will it take if I assign ten programmers to it?”

      The master programmer frowned. “In that case, it will take two years.”

      “And what if I assign a hundred programmers to it?”

      The master programmer shrugged. “Then the design will never be completed,” he said.

      — Chapter 3.4 of The Tao of Programming, by Geoffrey James (1987)

  • Leverage porn
  • [flagged]
    • How is Gnome in any way relevant to bring up in the context of the kernel networking stack?

      I mean I get it, you dislike Gnome for whatever reason and want to cast some shade - that part is clear. What I don't really understand is how you decided this context is somehow going to be received in a way to further your view... it's just so illogical that my reaction is "support the Gnome folks".

      Actually - due to this comment I just donated $50 to the GNOME Foundation. Either the guerilla marketing worked, or your mission failed - in any case I hope this was an effective lesson in messaging.

  • > The code has been implemented as part of the Linux kernel release version 6.13 in January.

    This is basically 3 month old news now.

    • If you read the original commit comment, userspace software needs to opt into this:

      https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds...

      If it is not reported for people to know about it, this will do nothing.

      That said, the opt-in here is only relevant when software has already opted into busy polling, so it likely does not apply to software used outside of datacenters.

    • What is wrong with 3-month-old news? I in fact prefer stories that are not reposted merely on the basis of being new, since brand-new ones are much more likely to be confused/misleading/a hoax/a lie/not actually interesting.

      Similarly, I prefer old books, old computer games, and old movies (or at least ones that are not currently hot/viral/advertised); this allows a lot of trash to self-filter out, including trash being breathlessly promoted and consumed.

        • What's wrong with it is that it was already posted before, both here and on other news sites. I've seen it reposted several times over the past three months and I'm tired of seeing it.

        6.13 is old news now, we're already on 6.14, and even 6.15 isn't far from release (we're already at rc3).

    • Well, I run Linux on both my laptops and desktops, consider myself an enthusiast, etc., but I don’t track the development of the mainline Linux kernel closely enough to know what has gone on there in the past three months.

      I may notice changes when they get adopted by the upstream maintainers of my distro, but that usually takes time...

      • That doesn't make it okay to repost stale news. Would you be okay if I went to, say, phoronix.com, started picking other notable 3-month-old Linux kernel articles, and reposted them here one by one? Heck, why limit myself to just the Linux kernel? Why don't I just start reposting all the top headlines from the past few months, because surely not everyone is up to date on all the developments, right?
    • I'm one of the lucky 10,000.
    • Did you know about this before reading the article?
      • Yes, it was posted here and on other news sites before. I've seen it reposted several times over the past three months.
    • So, new enough that millions of people never heard about it. I can see why it made the front page.