Believe it or not, I had sort of prepared a neat little tootstorm on OoO and speculative execution when I realised that what is really needed is an analysis of which tradeoffs were chosen, as opposed to yet another OoO description.

What seems to be missing from everything I have read are two points:

a) speculative execution and OoO go back a long time (IBM S/360 Model 91),
b) they are essential for performance: you simply cannot get “the numbers” otherwise.

Some of you might be old enough to remember the “MIPS wars”, when minicomputer manufacturers bent over backwards trying to get better “VAX MIPS equivalent” numbers than anyone else (and Whetstone and Dhrystone and LINPACK…). This was then followed by the SPEC benchmarks, when manufacturers tried to agree on how to standardise benchmarking, so we got SPECint and SPECfp numbers. In all of this Intel never looked brilliant so they decided to compare their processors with themselves: enter Intel’s own benchmark and “performance index”.

This became an interesting self-inflicted pain: if you need to sell a new processor then its PI had better be better than the previous one, no matter how synthetic the numbers are. After you have told the world that your Pentium 4 is a 17 (picking a number out of nowhere) you obviously have to sell “the next Pentium” with a PI > 17, ideally quite a bit more. And if you have published how the index is calculated then you can only cheat “a bit” before the specialised press runs its own benchmarks and catches you out! This lands you in a serious muddle: you must improve the performance over the previous generation, no ifs, no buts.

Anyone who used any big iron outside the PC world knows that the system you want is a balanced one, balanced to the workload you expect. When I was involved with number-crunching at a serious level my job was to figure out the system we needed and I had, without false modesty, become quite good at it.

In the mid-90s if you wanted raw FP you bought Alpha and ensured that it had plenty of RAM so loops could be unrolled and optimised for speed. If, on top of that, you had a problem with good SIMD or MIMD characteristics then a farm of Alphas connected with MemoryChannel (in effect an off-chassis PCI bus extender) running MPI was the answer (Digital’s MPI PAK was optimised for MC). At the same time, if you needed an NFS server then the answer was a Sun box with “lots of disk” (AdvFS was beautiful but interacted badly with NFS, we found).

So “my” network had a large SPARCstation 20 serving NFS to the Alpha number crunchers, which had local disks for temporary files. All of this obviously ran over 10BASE-T (a recent upgrade from 10BASE2 and vampire taps). Separate from this was a MasPar, which was ideally suited to SIMD problems with large numbers of small datasets, and an AMT DAP, which was another particular SIMD system, plus the obligatory Transputer box.

This long diversion into workstation history is necessary to explain that even in the SPEC era serious attention was paid to tailoring the system to what you wanted it to do. HP’s PA-RISC was relatively irrelevant to the problems we were looking at but dominated aerospace because of the specific software available on that platform. Similarly IBM’s RS/6000 was what you ran Dassault’s CATIA on.

What about Linux? Well, it was the toy people used on desktops (if you figured out the damned ModeLine for your monitor) and used for modelling stuff you then ran on the “real machines”. My dual Pentium Pro was no match for my Alphas at the time but it made for a snappy fvwm desktop.

The problem though was not Linux but the fact that everyone was engaged in a speed war driven, I assume, by marketing but one where the end result has been to put a tricked-up Alfa Romeo engine in a Trabant case.

It is not quite a Ferrari engine: that would have been an Alpha or another high-end native 64-bit design. What we have is an engine which had long outlived its design specs (from back in the 80s) and needed to be made to work not just in the 90s GHz race but well into the ’00s and later!

To be fair I should note that in an attempt to keep costs down Digital’s PWS (Personal WorkStation) was as close to a PC as they could get in the sense that it used a PCI bus, EIDE CD-ROM and “PC video cards”.

So this performance war, heightened by the fact that parallel computing was not delivering (the Connection Machine failed, sadly, the Cray T3D & T3E never delivered, the gigantic ASCI Red supers never quite ran at full rated FLOPS), meant that pumping up the current crop of processors was the only choice.

Those who read my previous historical overview might recall me mentioning the i860: it was a serious attempt at beating the “one instruction per clock” barrier as you could VLIW an integer and a floating point insn as a single word and obtain two instructions per clock (note the very specific and restrictive condition of one int and one fp).
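To see why the “one int and one fp” condition is so restrictive, here is a toy cycle counter. It is only a sketch under loud assumptions: the function name `dual_issue_cycles` and the greedy pairing of adjacent instructions are mine, and this is not a model of how the real i860 scheduled its instruction pairs.

```python
def dual_issue_cycles(stream):
    """Count clocks for a stream of "int"/"fp" instructions.

    Assumption (not i860-accurate): two adjacent instructions share a clock
    only if one is "int" and the other is "fp"; anything else issues alone.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        pair = stream[i:i + 2]
        if len(pair) == 2 and set(pair) == {"int", "fp"}:
            i += 2  # one int plus one fp go out as a single word
        else:
            i += 1  # no suitable partner: single issue
        cycles += 1
    return cycles


mixed = ["int", "fp"] * 4   # perfectly interleaved code
int_only = ["int"] * 8      # nothing to pair with
print(dual_issue_cycles(mixed), dual_issue_cycles(int_only))  # prints: 4 8
```

The same eight instructions take four clocks or eight depending purely on how well the mix happens to interleave, which is why this trick needs so much help from the compiler.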

The other alternative was pipelining which works “a bit” but carries a serious penalty when you cannot “fill the pipe”.
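The penalty for not filling the pipe can be put into a back-of-the-envelope formula. This is a rough sketch, not a measurement: the function `effective_cpi` and the sample numbers are invented for illustration, and it assumes a pipeline that would otherwise retire one instruction per clock.

```python
def effective_cpi(branch_fraction, stall_fraction, penalty_cycles, base_cpi=1.0):
    """Average clocks per instruction once pipeline bubbles are counted.

    branch_fraction: share of instructions that are branches
    stall_fraction:  share of those branches that leave the pipe empty
    penalty_cycles:  clocks lost each time that happens
    """
    return base_cpi + branch_fraction * stall_fraction * penalty_cycles


# Illustrative numbers only: 20% branches, half of them stall, 10 clocks lost each time.
print(effective_cpi(0.2, 0.5, 10))
```

With these made-up inputs the “one instruction per clock” machine spends about as much time on bubbles as on useful work, and the deeper the pipe (the larger `penalty_cycles`), the worse it gets.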

Now what do you do? Well, you start coming up with fancy ways to keep the processor busy:

• branch delay slot: while you think about the branch execute another instruction which might or might not be useful,
• out-of-order (OoO) execution: move around the insn stream to make it more efficient,
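• speculative execution: while you wait on a decision carry on executing down the predicted branch (or, in the eager variant, both branches) and then discard whatever work turns out to belong to the branch not taken.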

They are listed in order of increasing complexity.
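The gain from reordering the insn stream can be sketched with a toy single-issue scheduler. Everything here is an assumption made for illustration: the `Insn` tuple, the `run` function, the latencies, and the simplification that each register is written only once (so renaming and WAR/WAW hazards never come up). Real OoO engines are vastly more involved.

```python
from collections import namedtuple

# dest: register written, srcs: registers read, latency: clocks until the result is usable
Insn = namedtuple("Insn", "dest srcs latency")


def can_issue(insn, older_pending, ready_at, cycle):
    """True if every source operand of insn is available in this cycle."""
    for src in insn.srcs:
        if any(older.dest == src for older in older_pending):
            return False  # the producer has not even issued yet
        if ready_at.get(src, 0) > cycle:
            return False  # the producer issued but its result is not back
    return True


def run(program, out_of_order):
    """Return the cycle in which the last result becomes available."""
    ready_at = {}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        # In-order may only look at the oldest insn; OoO may look at all of them.
        window = pending if out_of_order else pending[:1]
        for idx, insn in enumerate(window):
            if can_issue(insn, pending[:idx], ready_at, cycle):
                ready_at[insn.dest] = cycle + insn.latency
                finish = max(finish, ready_at[insn.dest])
                del pending[idx]
                break  # single issue: at most one insn per clock
        cycle += 1
    return finish


program = [
    Insn("r1", (), 4),       # slow load
    Insn("r2", ("r1",), 1),  # needs the load
    Insn("r3", (), 1),       # independent work
    Insn("r4", ("r3",), 1),  # depends only on r3
]
print(run(program, out_of_order=False), run(program, out_of_order=True))  # prints: 7 5
```

The in-order machine sits idle waiting for the load, while the OoO one slips the independent work into those otherwise wasted clocks.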

I had left off yesterday after listing the “magic” which was being developed to break the barrier of “one instruction per clock”.

I did briefly mention VLIW (in the i860 attempt of int+fp) and in all fairness should mention both Itanium and its predecessor Elbrus (see for a biased story of how Itanium & Transmeta were born) but they can both be dismissed for the time being because they require magical compilers (I call them sentient compilers when talking VLIW).

What I would like to point out at this point is that on one side these optimisations were amazingly smart and sophisticated and on the other that they were being developed as an answer to the “need for speed”. What we call “security” now was not part of the design.

Let me put this in perspective: real security was physical in the TLAs, the HPC boxes in labs were not “public access”, etc. The weakness of movable classification, whereby your super was split into TS/non-TS for efficiency, was not yet an issue (recall that these optimisations date from the 80s: the design predates the actual consumer silicon by a decade, at least) because the Cold War was around and “my nuke super is mine alone!”, none of this promiscuity with other uses…

Bottom line: side-channel (or microcode) attacks simply were never ever in the threat model.
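For readers who have not looked at what such an attack exploits, here is a deliberately crude model. It is a toy under stated assumptions, not an exploit: `ToyCPU`, `speculative_load`, `attacker_probe` and the value of `SECRET` are all invented for illustration, and the timing measurement a real attacker would perform is replaced by a simple set lookup.

```python
SECRET = 42  # hypothetical value the attacker is not allowed to read


class ToyCPU:
    def __init__(self):
        self.regs = {}      # architectural state: what software can see
        self.cache = set()  # microarchitectural state: which lines are cached

    def speculative_load(self, allowed):
        """Perform the load before the permission check has resolved."""
        snapshot = dict(self.regs)
        self.regs["tmp"] = SECRET         # transient read of the secret
        self.cache.add(self.regs["tmp"])  # dependent access pulls in a cache line
        if not allowed:
            self.regs = snapshot          # squash: registers are rolled back...
            # ...but the cache is left as it was; that is the leak
        return self.regs.get("tmp")


def attacker_probe(cpu, candidates=range(256)):
    """Stand-in for a timing probe: "fast" lines are the ones found in the cache."""
    return [line for line in candidates if line in cpu.cache]


cpu = ToyCPU()
print(cpu.speculative_load(allowed=False))  # architecturally nothing was read
print(attacker_probe(cpu))                  # yet the cache footprint gives it away
```

The rollback is perfectly correct by the rules the designers were working to: the architectural state is clean. The leak lives entirely in state that nobody considered part of the security boundary.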

Before the current rabid hordes decided that Intel, AMD and ARM were complicit in an insecure design it would be particularly healthy to stop and realise that if your threat model does not include an attack then chances are you did not design defensively around it (I am looking at you Linus Torvalds and your disgusting rant against CPU manufacturers - note that he worked for Transmeta and I doubt he designed that defensively against side-channel attacks…).

So, on the one hand we have extreme optimisation to get that trophy but, far far more important, we have totally forgotten the periphery.

While processor instruction cycle times went beyond Cray levels, the I/O around them, including memory, was left behind. There is no way our current SATA is remotely comparable to Cray’s HIPPI, or DDR-whatever to Cray’s ECL RAM.

Not only that: the push for speed has meant that a lot less design has gone into the periphery until things get desperate (example: ISA - EISA - PCI, ATA - IDE - EIDE - SATA and you might recall the special “bypass” connectors for graphics).

For comparison, workstations had, at least, SCSI and, in the more extreme configurations, multiple smart SCSI controllers with Fast-Wide-Differential buses. In 1995 “my” Digital 8200 “Turbolaser” had four FWD buses connected to smart RAID arrays with 10k RPM Digital-branded Seagate drives. This was 1995…

Now, given the above you would be forgiven for assuming that hardware security was never quite at the core of the problem anywhere and, objectively, it wasn’t. Hardware security in the 90s meant an Atalla cryptographic module or IBM’s Cryptographic Accelerator for the zSeries.

So, what am I trying to say?

At the moment everyone is up in arms about side-channel attacks on speculative execution and they are all barking up the wrong tree.

Allow me to explain: what are we attacking? We are, ultimately, attacking data. Once the chip manufacturers have scrambled to fix this PR nightmare what will be the net result?

The problem will just have been moved elsewhere to the next subsystem which was ignored in the original threat model and in the new amended threat model which is just “old threat model” + “side-channels in OoO execution”.

Because of the marketing-driven rush there will be no comprehensive review of the new threat model which includes a network, multi-tenant VMs, etc. Net result: we are playing whack-a-mole with security issues in hardware just as we are in software.

Should we stop and realise that we need data-driven security we would be forced to acknowledge the need for a major redesign.

Personally I’m quite fond of tagged designs, but I am also looking elsewhere, always in the direction of securing data throughout the system and, possibly, the network.
