Laurence Tratt

An interesting paper from @emeryberger et al., showing that, in contrast to prior work, (in my words) energy use across programming languages is a proxy for how long a program takes to execute, and that other factors don't meaningfully affect energy usage. arxiv.org/abs/2410.05460

arXiv.org: It's Not Easy Being Green: On the Energy Efficiency of Programming Languages

Does the choice of programming language affect energy consumption? Previous highly visible studies have established associations between certain programming languages and energy consumption. A causal misinterpretation of this work has led academics and industry leaders to use or support certain languages based on their claimed impact on energy consumption. This paper tackles this causal question directly. It first corrects and improves the measurement methodology used by prior work. It then develops a detailed causal model capturing the complex relationship between programming language choice and energy consumption. This model identifies and incorporates several critical but previously overlooked factors that affect energy usage. These factors, such as distinguishing programming languages from their implementations, the impact of the application implementations themselves, the number of active cores, and memory activity, can significantly skew energy consumption measurements if not accounted for. We show -- via empirical experiments, improved methodology, and careful examination of anomalies -- that when these factors are controlled for, notable discrepancies in prior work vanish. Our analysis suggests that the choice of programming language implementation has no significant impact on energy consumption beyond execution time.

@ltratt @emeryberger I... I think the thing that nearly made me cry reading this is the painstaking way they describe the process to build the model and validate things. This is so beautiful. Seeing science (and the difficulties of it) so plainly shown. Also sad that this is rare enough to find that I feel so moved by people doing it well, but still.

Kudos! And thanks!

@ltratt @emeryberger

It's a shame they don't dig a bit more into the parallelism. The paper says:

Further, RAPL samples include all cores, even if the program under test only uses a single core. If a benchmark is single-threaded or generally uses fewer cores than available, idle cores will be included in the energy consumption measurement. Therefore, using a varying level of parallelism across benchmark implementations can result in unfair comparison, as idle cores will add some constant energy consumption to each sample.

Which makes me scream 'yes, but!'. Modern SoCs can independently adjust the clock speed of cores and typically have different cores with different power / performance tradeoffs. Leakage current means that it's far more power efficient to run the same workload in 1 s on two cores clocked down to 800 MHz than on one core clocked at 1.6 GHz.
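
A minimal sketch of that scaling, assuming the textbook dynamic-power model P ≈ C·V²·f with invented voltages (it ignores leakage and uncore power, which add a constant term to both cases and are exactly the "idle cores" caveat quoted above):

```cpp
// Toy model only: dynamic power P ≈ C * V^2 * f, with invented voltages.
// Energy = power * time; both configurations finish the workload in 1 s.
#include <cstdio>

int main() {
    // C (effective switched capacitance) is folded into the arbitrary units.
    auto power = [](double volts, double f_ghz) { return volts * volts * f_ghz; };

    // One core at 1.6 GHz needing, say, 1.0 V:
    double e_one_core = power(1.00, 1.6) * 1.0;

    // Two cores at 0.8 GHz, each happy at a lower voltage (say 0.8 V),
    // assuming the work splits perfectly across them:
    double e_two_cores = 2.0 * power(0.80, 0.8) * 1.0;

    std::printf("one core  @ 1.6 GHz: %.2f (arbitrary energy units)\n", e_one_core);
    std::printf("two cores @ 0.8 GHz: %.2f (arbitrary energy units)\n", e_two_cores);
}
```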

But there are confounding factors here. A few years ago, we found that turning off CPU affinity entirely in the FreeBSD scheduler made some workloads much faster. The workloads were bounded by the performance of a single-threaded component but pinning that to a single core made that core hot and then the CPU throttled the clock speed and made it slower. Having it move around unpredictably distributed the heat, which allowed the heat sink to work better.

I'm quite surprised by Fig 11d. I wonder how this varies across systems: actively reading DRAM consumes a lot more power than simply refreshing (the paper says 40% for refresh, I think this varies a bit across DRAM types), but perhaps the base load is so high and the read rates are so low that this doesn't make a difference. Or maybe the cache miss rates are all very low?

The highest numbers are around 90 M LLC misses per second. I think Intel chips do 128 B burst reads, so that's around 10 GB/s, which is around 2.5% of my laptop's peak memory bandwidth. Desktop / server RAM can sustain higher read rates. The difference between 0% and 2.5% of the maximum read rate may not be very big.
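
Spelling out that arithmetic (the 128 B burst size and the peak-bandwidth figure are assumptions from the post, not measurements; the post rounds to "around 10 GB/s" and "around 2.5%"):

```cpp
// The back-of-the-envelope numbers from the paragraph above.
#include <cstdio>

int main() {
    double misses_per_s   = 90e6;    // highest LLC miss rate mentioned
    double bytes_per_miss = 128.0;   // assumed burst read size
    double peak_bw        = 400e9;   // assumed laptop peak bandwidth, bytes/s

    double demand_bw = misses_per_s * bytes_per_miss;   // ~11.5 GB/s
    std::printf("demand bandwidth: %.1f GB/s (%.1f%% of peak)\n",
                demand_bw / 1e9, 100.0 * demand_bw / peak_bw);
}
```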

On this benchmark, Boost’s library performs significantly worse than PCRE. This outlier alone accounts for the entire reported gap between C and C++.

That's pretty embarrassing for Boost. I'd expect that a C++ RE implementation could build an efficient state machine at compile time and feed that through the compiler for further specialisation, whereas PCRE has to do it dynamically.
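
As a toy illustration of that point (my example, not the paper's benchmark, and not how Boost.Regex or PCRE are actually implemented): a pattern known at compile time can in principle be lowered to a specialised matcher the optimiser can inline, whereas a runtime engine has to parse and compile the pattern when the program runs.

```cpp
// Toy illustration only: a hand-specialised matcher for "ab*c", i.e. the kind
// of code a compile-time regex engine could in principle emit, next to the
// dynamic route where the pattern is parsed and compiled at run time.
#include <cstdio>
#include <regex>
#include <string>

static bool match_ab_star_c(const std::string& s) {
    size_t i = 0;
    if (i >= s.size() || s[i++] != 'a') return false;    // 'a'
    while (i < s.size() && s[i] == 'b') ++i;              // 'b'*
    return i + 1 == s.size() && s[i] == 'c';              // 'c', then end of input
}

int main() {
    const std::string input = "abbbc";
    static const std::regex re("ab*c");                   // built at run time
    std::printf("specialised matcher: %d\n", match_ab_star_c(input));
    std::printf("std::regex:          %d\n", int(std::regex_match(input, re)));
}
```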

@ltratt @emeryberger it seems to suggest that program runtime is what decides energy usage. A counterexample to this is games: get your laptop or Steam Deck, whatever, and play 1 hour of Hollow Knight, then 1 hour of Horizon Forbidden West. Look at your battery level after each.
Then go watch a talk on mobile game dev talking about how to optimize for power usage to let the player play longer before they have to plug their phone in 😀

@ltratt @emeryberger not the main point, but it has a spherical cow feeling.

@demofox @ltratt the paper doesn't address GPUs, but of course running something that is CPU/GPU-intensive consumes a lot of energy, and this is orthogonal to the point of the paper (which indeed only looks at CPU-intensive codes)

@emeryberger @ltratt not all the power talk is GPU-related, but yeah... still great work, and games are probably an edge case, being a tiny percentage of applications!

@demofox @emeryberger @ltratt tbh I don't think games are going to be very different if you squint slightly. The only thing with the usual sussy CPU benchmarks is that they're large fixed blocks of work that happen once, whereas games do a fixed amount of work at a target frequency. So if you do less work "in wall clock time" you can clock down harder and still maintain the desired frequency. Since CPU benchmarks tend to saturate the machine, you're just measuring the time modulated by the p-state.

@demofox @emeryberger @ltratt in general CPU benchmarks are more or less uniquely bad for measuring power usage, because computers are generally not running full blast constantly trying to churn numbers to generate a Mandelbrot fractal. Even games, which are hugely computationally expensive, are generally not running your GPU or CPU at full blast constantly. (The older paper which tries to compare languages this way is just catastrophically flawed.)

@demofox @emeryberger @ltratt or in other words, in a single-core CPU benchmark the power usage is going to be dominated by time * power usage at max boost clock for a single active core (on mobile / laptop, modulo how long you can sustain boost clocks before thermal throttling). Which is very simple and easy to model, but not very enlightening regarding the power profile of real-world systems, which tend to sit in lower power states most of the time.

@emeryberger @demofox @ltratt Right, I appreciate that your paper is avoiding the issue; I'm just saying that it's vitally important for real-world understanding of power use and how it relates to software, especially in software like games where wall-clock runtime and power use do not necessarily correlate strongly on their own. (But I also think it's a relatively simple thing to account for in practice, in napkin-math terms.)

@emeryberger @demofox @ltratt btw on newer machines there are additional complexities too, which might not be easy to account for (but which also probably don't matter for the general ballpark here). E.g. you're controlling for the core C-states, which I think should also disable the data-fabric C-states, but there's also e.g. DDR memory power-down, and I have no idea what can trigger / limit that. Edit: well, hopefully the MSR estimations take care of accounting for that in terms of outcome anyway.
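
For anyone who wants to poke at this on their own machine, a minimal sketch (not the paper's measurement harness) of reading the package-level RAPL counter that Linux exposes through the powercap sysfs interface; the paths below are the usual ones but vary by platform, and reading them typically needs root:

```cpp
// Read the package RAPL energy counter via Linux powercap sysfs and report
// the energy consumed over one second. intel-rapl:0 is usually package-0.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <thread>

static uint64_t read_uj(const char* path) {
    std::ifstream f(path);
    uint64_t v = 0;
    f >> v;
    return v;
}

int main() {
    const char* energy = "/sys/class/powercap/intel-rapl:0/energy_uj";
    const char* range  = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj";

    uint64_t before = read_uj(energy);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    uint64_t after = read_uj(energy);

    // The counter wraps at max_energy_range_uj; handle a single wraparound.
    uint64_t delta = after >= before ? after - before
                                     : after + read_uj(range) - before;
    std::printf("package energy over 1 s: %.3f J\n", delta / 1e6);
    // Note: this counter covers every core (and usually the uncore), which is
    // the "idle cores are included" caveat quoted from the paper upthread.
}
```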

@dotstdy @demofox @emeryberger They're as uniquely bad as every other data point -- i.e. they're neither unique nor bad :) Each data point in the trade-off space has value; and when I'm pummelling all cores compiling a big system, I do have a particular interest in this particular data point!

@demofox @ltratt @emeryberger typical desktop OSes these days would never saturate all cores unless you're doing something like playing a game or running a batch compile, etc. I guarantee Hollow Knight is "pulsing" the CPU per frame and then letting the CPU idle, which yields a narrower power pulse width. In this case we consider the program runtime not as the wall-clock time but as the sum of the pulse times. Under that view, the energy again correlates directly with the program runtime.
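
A minimal sketch of that per-frame pulse shape (nothing measured from Hollow Knight; do_frame_work() is a hypothetical stand-in for one frame's work):

```cpp
// Per-frame "pulse then idle" pattern: burn CPU only for the work each frame
// needs, then sleep until the next frame deadline so the core can idle.
#include <chrono>
#include <thread>

static void do_frame_work() {
    // Placeholder for real simulation/render-submission work; in a real game
    // this stays well under the frame budget on most frames.
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000; ++i) sink += i;
}

int main() {
    using clock = std::chrono::steady_clock;
    const auto frame = std::chrono::microseconds(16667);   // ~60 fps target
    auto deadline = clock::now();
    for (int f = 0; f < 600; ++f) {                         // ~10 s of "gameplay"
        deadline += frame;
        do_frame_work();
        std::this_thread::sleep_until(deadline);            // idle, don't spin
    }
    // In energy terms, the runtime that matters is the sum of the work pulses,
    // not the ~10 s of wall-clock time the loop takes.
}
```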

@ltratt @emeryberger always warms my cold little heart to see a paper confirming what I've been telling people is almost certainly the case!

@ltratt @emeryberger Emery, you probably talk about this in the paper (I only skimmed the results), but the way I think about this is to ask what it would take for a PL to use more power than another PL for a given amount of CPU time. It would have to make substantially different use of the available hardware resources. Like maybe we could make that happen by writing an ISPC program that saturates the AVX-512 execution units. But for general workloads, unlikely.

@regehr @ltratt @emeryberger

On modern SoCs, it's quite difficult because they can't power all of the functional units at the same time, so saturating AVX-512 is likely to trigger thermal throttling and drop the clock speed.

Getting the highest throughput without triggering thermal throttling probably requires carefully avoiding hot spots.

@regehr @ltratt @emeryberger Saturating SIMD units does move the needle on power, but (usually) by much less than the multiplier of "work getting done," at least for high-quality vector code.

E.g. when comparing scalar to SIMD library code, it's pretty common to see results like "2x the instantaneous power, but 4x faster," which is a huge win for overall efficiency.

Very brief transient use of large SIMD vectors is the standard counterexample; you may end up in a less-efficient lower-clocked-but-still-high-power state where you're getting no overall performance benefit but making everything else less efficient.
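
Putting numbers on the "2x the instantaneous power, but 4x faster" example above (illustrative figures from the post, not measurements), since energy is just power times time:

```cpp
// Energy = power * time, using the post's illustrative "2x power, 4x faster".
#include <cstdio>

int main() {
    double p_scalar = 10.0, t_scalar = 4.0;   // say, 10 W for 4 s
    double p_simd   = 2.0 * p_scalar;         // twice the instantaneous power
    double t_simd   = t_scalar / 4.0;         // but four times faster

    std::printf("scalar: %.0f J\n", p_scalar * t_scalar);  // 40 J
    std::printf("SIMD:   %.0f J\n", p_simd * t_simd);      // 20 J: half the energy
}
```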

@steve @regehr @ltratt @emeryberger Aside: with the paper's methodology I'm not sure you'd really see that, fwiw. The issue is that the license downclock for AVX-512 is relatively small these days, and the whole package already has boost disabled, so it won't ever achieve clocks higher than the license anyway. It's unclear (1) whether enabling the AVX-512 units would cause an increase in power draw at such low clocks, and (2) whether enabling the AVX-512 units is actually accounted for in the RAPL modelling.

@steve @regehr @ltratt @emeryberger The main thing to keep in mind is that while wider SIMD datapaths do take much more power than narrower datapaths, most of the biggest contributors to CPU power use are not ALUs (insn fetch/decode/control, scheduling, bypass, and of course memory access), and you get to amortize those whenever you use fewer instructions to get the same job done.

@regehr @ltratt @emeryberger Different spin loops in synchronization fast paths maybe. I'm thinking about runtimes that might have over-reacted to Intel bumping PAUSE to ~100 cycles.

@pkhuong @regehr @ltratt @emeryberger oh, yeah. There are a bunch of bad actor libraries that go out of their way to keep threads spun up waiting for work so that they look good on naive latency benchmarks at the cost of energy efficiency and overall system performance.
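
For concreteness, a minimal sketch of the pattern being criticised (my example, not any particular runtime's code): spin briefly with PAUSE, then yield instead of keeping a core spun up. A fixed "spin N times" budget quietly got much longer in wall-clock terms once PAUSE went from roughly 10 to roughly 100 cycles.

```cpp
// Bounded spin-wait: PAUSE for a short budget, then yield rather than keep
// burning a core. x86-specific because of _mm_pause.
#include <atomic>
#include <immintrin.h>
#include <thread>

static void wait_for(std::atomic<bool>& ready, int spin_budget) {
    for (int i = 0; i < spin_budget; ++i) {
        if (ready.load(std::memory_order_acquire)) return;
        _mm_pause();   // tell the core we're spin-waiting; saves a little power
    }
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();   // budget spent: stop monopolising the core
}

int main() {
    std::atomic<bool> ready{false};
    std::thread waiter(wait_for, std::ref(ready), 64);
    std::thread producer([&] { ready.store(true, std::memory_order_release); });
    waiter.join();
    producer.join();
}
```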

@steve @pkhuong @regehr @ltratt @emeryberger It's a bit unfortunate that the previous paper was so clowny because in a very broad sense the idea is true. If you write your program in python, and it turns out that means it's 100x or 1000x slower than an equivalent program which you wrote in C or Rust or whatever, then that is bad. Comparing languages is silly as some kind of shootout, but it's not like the choice of language is irrelevant either, for various obvious reasons.

@steve @pkhuong @regehr @ltratt @emeryberger OTOH you should be really comparing to a more reasonable baseline: not writing any program. Which is always environmentally speaking the best option. :')

@steve @pkhuong @regehr @ltratt @emeryberger Are these bad (actor libraries) or (bad actor) libraries

@ltratt @emeryberger Great paper.

Interesting that last-level cache misses increase power usage, since not a lot of work is getting done in that case — the CPU must be stalled most of the time. I guess the chip can't power anything down while waiting for memory.

Figure 11(a) has me confused. The relationship has to be (and appears to be) linear in the limit as x increases, even if something else is going on near x=0. I'm curious:

  • Why are there so many different x-values? I would have expected shootout programs to have either 1 core or all N cores in use at all times, with very few exceptions.
  • Is this an artifact of averaging somehow?
  • Do all PC multiprocessors have this curve?
  • If this is real, surely the explanation is already known to the manufacturer. Is there a way to ask?

@jorendorff @ltratt The # of active cores (x values) depends on the degree of parallelism in the application implementations - it's not all sequential or perfect parallelism divided among all available cores, but rather lies somewhere in between.