
Believe it or not, I had sort of prepared a neat little tootstorm on OoO and speculative execution when I realised that what is really needed is an analysis of the tradeoffs that were chosen, as opposed to yet another OoO description.

What seems to be missing from everything I have read are two points:

a) speculative execution and OoO go back a long time (the IBM S/360 Model 91, home of Tomasulo's algorithm),
b) they are essential for performance: you simply cannot get “the numbers” otherwise.

Some of you might be old enough to…

remember the “MIPS wars”, when minicomputer manufacturers bent over backwards trying to get better “VAX MIPS equivalent” numbers than anyone else (and Whetstone and Dhrystone and LINPACK…). This was then followed by the SPEC benchmarks, when manufacturers tried to agree on how to standardise benchmarking, which gave us SPECint and SPECfp numbers. In all of this Intel never looked brilliant, so they decided to compare their processors with themselves: enter Intel’s own benchmark and “performance index”.

This became an interesting self-inflicted pain: if you need to sell a new processor then its PI had better beat the previous one, no matter how synthetic the numbers are. Once you have told the world that your Pentium 4 is a 17 (picking a number out of nowhere), you obviously have to sell “the next Pentium” with a PI > 17, ideally quite a bit more. And if you publish how the index is calculated, you can only cheat “a bit” before the specialised press runs their own benchmarks…

and catches you out! This lands you in a serious muddle: you must improve the performance over the previous generation, no ifs, no buts.

Anyone who used any big iron outside the PC world knows that the system you want is a balanced one, balanced for the workload you expect. When I was involved with number-crunching at a serious level, my job was to figure out the system we needed and I had, without false modesty, become quite good at it.

In the mid-90s if you wanted raw FP you bought Alpha…

and ensured that it had plenty of RAM so loops could be unrolled and optimised for speed. If, on top of that, you had a problem with good SIMD or MIMD characteristics, then a farm of Alphas connected with MemoryChannel (in effect an off-chassis PCI bus extender) running MPI was the answer (Digital’s MPI PAK was optimised for MC). At the same time, if you needed an NFS server, the answer was a Sun box with “lots of disk” (Digital’s AdvFS was beautiful but, we found, interacted badly with NFS).
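To make the unrolling point concrete, here is a minimal C sketch (my illustration, not any actual code of ours): the bigger loop body trades code space, hence the appetite for RAM, for fewer branch tests and more independent work for the compiler to schedule. The function name and the factor of four are arbitrary.

```c
/* Illustrative hand-unrolled daxpy-style loop: the unrolled body costs
   more code space but executes one branch test per four elements and
   gives the scheduler four independent statements to interleave. */
#include <stddef.h>

void daxpy_unrolled(double *restrict y, const double *restrict x,
                    double k, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* main unrolled body */
        y[i]     += k * x[i];
        y[i + 1] += k * x[i + 1];
        y[i + 2] += k * x[i + 2];
        y[i + 3] += k * x[i + 3];
    }
    for (; i < n; i++)                /* cleanup for the tail */
        y[i] += k * x[i];
}
```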

So “my” network had a large SPARCstation 20 serving NFS to the Alpha number-crunchers, which had local disks for temporary files. All of this obviously ran over 10baseT (a recent upgrade from 10base2 and vampire taps). Separate from this was a MasPar, ideally suited to SIMD problems
with large numbers of small datasets, and an AMT DAP, another specialised SIMD system, plus the obligatory Transputer box.

This long diversion into workstation history is necessary to explain that even in the SPEC era serious attention was paid to tailoring the system to what you wanted it to do. HP’s PA-RISC was largely irrelevant to the problems we were looking at but dominated aerospace because of the specific software available on that platform. Similarly, IBM’s RS/6000 was what you ran Dassault’s CATIA on.

What about Linux? Well, it was the toy people used on desktops (if you figured out the damned ModeLine for your monitor) and used for modelling stuff you then ran on the “real machines”. My dual Pentium Pro was no match for my Alphas at the time but it made for a snappy fvwm desktop.

The problem, though, was not Linux but the fact that everyone was engaged in a speed war, driven, I assume, by marketing; its end result has been to put a tricked-up Alfa Romeo engine in a Trabant case.

It is not quite a Ferrari engine: that would have been an Alpha or another high-end native 64-bit design. What we have is an engine which had long outlived its design specs (drawn up back in the 80s) and needed to be made to work not just in the 90s GHz race but well into the ’00s
and later!

To be fair I should note that, in an attempt to keep costs down, Digital’s PWS (Personal WorkStation) was as close to a PC as they could get, in the sense that it used a PCI bus, an EIDE CD-ROM and “PC video cards”.

So this performance war, heightened by the fact that parallel computing was not delivering (the Connection Machine failed, sadly; the Cray T3D & T3E never delivered; the gigantic ASCI Red supers never quite ran at full rated FLOPS), meant that pumping up the current crop
of processors was the only choice.

Those who read my previous historical overview might recall me mentioning the i860: it was a serious attempt at beating the “one instruction per clock” barrier, as you could VLIW an integer and a floating-point insn into a single word and obtain two instructions per clock (note the very specific and restrictive condition of one int and one fp).
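A hypothetical C-level sketch of what that condition means (mine, not i860 code): the loop below has exactly one floating-point operation and one integer operation per iteration, the only mix the chip could pack into a single dual-operation word; a body with two FP multiplies per iteration could not be paired this way.

```c
/* Hypothetical sketch of the i860 pairing rule: one fp op (the multiply)
   plus one int op (the index/address increment) per iteration is the
   mix that fits in a single dual-issue instruction word. */
#include <stddef.h>

void scale_in_place(double *v, double k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= k;       /* fp multiply pairs with the integer i++ */
}
```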

The other alternative was pipelining, which works “a bit” but carries a serious penalty when you cannot “fill the pipe”.
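A sketch of what “filling the pipe” means, assuming a pipelined FP adder with a latency of a few cycles (my illustration, not anything from the period): the first loop is one long dependency chain, so every add waits for the previous result and the pipeline drains between operations; the second keeps four independent chains in flight.

```c
/* "Filling the pipe": sum_serial is a single dependency chain, so each
   add must wait for the previous result; sum_pipelined runs four
   independent chains that can overlap inside a pipelined FP adder.
   (The results can differ in the last bits: FP addition is not
   associative.) */
#include <stddef.h>

double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                    /* every add depends on the last */
    return s;
}

double sum_pipelined(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* four independent add chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```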

Now what do you do? Well, you start coming up with fancy ways to keep the processor busy:

• branch delay slot: while the branch is being resolved, execute another instruction which might or might not be useful,
• out-of-order (OoO) execution: reorder the insn stream so the execution units stay busy,
• speculative execution: while you wait on a decision, execute ahead (in the extreme, down both branches) and then discard the path not taken.

They are listed in order of increasing complexity.
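To see why all this machinery pays for itself, a toy benchmark (my own construction, not from the thread): the same data-dependent branch is cheap when the predictor can guess it (sorted input) and expensive when it cannot (random input), because every misprediction flushes the speculated work. Build with little optimisation (say -O1); at higher levels the compiler may turn the branch into branch-free code and hide the effect.

```c
/* Toy demonstration of branch prediction/speculation costs: counting
   elements above a threshold is much faster on sorted data, where the
   branch is predictable, than on random data, where it is not. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

static long count_big(const int *a, size_t n)
{
    long c = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > RAND_MAX / 2)      /* unpredictable on random data */
            c++;
    return c;
}

static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

int main(void)
{
    int *a = malloc(N * sizeof *a);
    if (!a)
        return 1;
    for (size_t i = 0; i < N; i++)
        a[i] = rand();

    clock_t t0 = clock();
    long hits_random = count_big(a, N);   /* many mispredictions */
    clock_t t1 = clock();

    qsort(a, N, sizeof *a, cmp_int);      /* make the branch predictable */

    clock_t t2 = clock();
    long hits_sorted = count_big(a, N);   /* predictor nearly perfect */
    clock_t t3 = clock();

    printf("random: %ld hits in %.3f s\n", hits_random,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("sorted: %ld hits in %.3f s\n", hits_sorted,
           (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```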

@cynicalsecurity Of course, none of this would be necessary if RAM speed had kept pace with CPU speed and RAM size. Instead we've been in a very odd situation where, for almost the entire history of computers, the amount of memory that they can address within a cycle or two has stayed constant at about 64KB, give or take a couple of powers of 2.

@thamesynne @cynicalsecurity But 64K is probably pretty much what any typical bit of code needs. Your average OO method needs its own local variables, the object's attributes, bits of the caller's stack and not a lot else. The problem is that 1 μs later the same processor will be running another method of a different object, losing the benefits of that locality.

@cynicalsecurity @thamesynne Maybe. I was thinking more of the Transputer and (I'm very vague about the details) Multics.

@edavies @thamesynne the Transputer was designed with parallel processing in mind and, in effect, a pre-chosen topology via the number of serial links (four). Occam mapped well onto it but applications were limited, as the “powerful one”, the T9000, was a bit late to the game. It was fun to use but ultimately did not deliver on its promises.

@thamesynne @edavies Multics is an OS which ran on mainframe-class systems from several manufacturers (I used it on Honeywell kit).

@cynicalsecurity @thamesynne Right, but my recollection/understanding was that it tended to split execution up much more granularly than the Unix/Windows process model, making hardware exploitation of locality easier.

@cynicalsecurity @thamesynne Indeed. But maybe something of the sort would make better use of the transistors on a modern chip: lots of simpler cores with a bit of local memory.