Believe it or not, I had sort of prepared a neat little tootstorm on OoO and speculative execution when I realised that what is really needed is an analysis of the tradeoffs that were chosen, as opposed to yet another OoO description.

What seems to be missing from everything I have read are two points:

a) speculative execution and OoO go back a long time (IBM S/360 Model 91),
b) they are essential for performance: you simply cannot get “the numbers” otherwise.

Some of you might be old enough to…

remember the “MIPS wars”, when minicomputer manufacturers bent over backwards trying to get better “VAX MIPS equivalent” numbers than anyone else (and Whetstone and Dhrystone and LINPACK…). This was then followed by the SPEC benchmarks, when manufacturers tried to agree on how to standardise benchmarking, so we got SPECint and SPECfp numbers. In all of this Intel never looked brilliant, so they decided to compare their processors with themselves: enter Intel’s own benchmark and “performance index”.

This became an interesting self-inflicted pain: if you need to sell a new processor then its PI had better be better than the previous one, no matter how synthetic the numbers are. Once you have told the world that your Pentium 4 is a 17 (picking a number out of nowhere), you obviously have to sell “the next Pentium” with a PI > 17, ideally quite a bit more. If you publish how the index is calculated then you can only cheat “a bit” before the specialised press runs their own benchmarks…

and catches you out! This lands you in a serious muddle: you must improve the performance over the previous generation, no ifs, no buts.

Anyone who used any big iron outside the PC world knows that the system you want is a balanced one, balanced to the workload you expect. When I was involved with number-crunching at a serious level, my job was to figure out the system we needed and I had, without false modesty, become quite good at it.

In the mid-90s if you wanted raw FP you bought Alpha…

and ensured that it had plenty of RAM so loops could be unrolled and optimised for speed. If, on top, you had a problem which had good SIMD or MIMD characteristics then a farm of Alphas connected with MemoryChannel (in effect a PCI bus extender off-chassis) running MPI was the answer (Digital’s MPI PAK was optimised for MC). At the same time if you needed an NFS server then the answer was a Sun box with “lots of disk” (AdvFS was beautiful but interacted badly with NFS, we found).

So “my” network had a large SPARCstation 20 serving NFS to the Alpha number crunchers, with local disks for temporary files. All of this obviously ran over 10baseT (a recent upgrade from 10base2 and vampire taps). Separate from this was a MasPar, which was ideally suited to SIMD problems with large numbers of small datasets, an AMT DAP, which was another specialised SIMD system, plus the obligatory Transputer box.

This long diversion into workstation history is necessary to explain that even in the SPEC era serious attention was paid to tailoring the system to what you wanted it to do. HP’s PA-RISC was relatively irrelevant to the problems we were looking at but dominated aerospace because of the specific software available on that platform. Similarly, IBM’s RS/6000 was what you ran Dassault’s CATIA on.

What about Linux? Well, it was the toy people used on desktops (if you figured out the damned ModeLine for your monitor) and used for modelling stuff you then ran on the “real machines”. My dual Pentium Pro was no match for my Alphas at the time but it made for a snappy fvwm desktop.

The problem, though, was not Linux but the fact that everyone was engaged in a speed war, driven, I assume, by marketing, and one whose end result has been to put a tricked-up Alfa Romeo engine in a Trabant case.

It is not quite a Ferrari engine: that would have been an Alpha or another high-end native 64-bit design. What we have is an engine which had long outlived its design specs (drawn up back in the 80s) and needed to be made to work not just in the 90s GHz race but well into the ’00s and later!

To be fair I should note that in an attempt to keep costs down Digital’s PWS (Personal WorkStation) was as close to a PC as they could get in the sense that it used a PCI bus, EIDE CD-ROM and “PC video cards”.

So this performance war, heightened by the fact that parallel computing was not delivering (the Connection Machine failed, sadly, the Cray T3D & T3E never delivered, the gigantic ASCI RED supers never quite ran at full rated FLOPS), meant that pumping up the current crop of processors was the only choice.

Those who read my previous historical overview might recall me mentioning the i860: it was a serious attempt at beating the “one instruction per clock” barrier, as you could VLIW an integer and a floating-point insn into a single word and obtain two instructions per clock (note the very specific and restrictive condition of one int and one fp).
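
To make that restriction concrete, here is a loop whose body naturally has the one-int-plus-one-fp shape (an illustrative C sketch, not i860 code; the function name and types are mine):

```c
/* The kind of loop the i860's dual-instruction mode liked: each
 * iteration pairs floating-point work (the multiply-accumulate) with
 * integer work (index update and loop-back branch), so an int and an
 * fp instruction can be issued together, one of each per clock. */
double dot(const double *a, const double *b, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)   /* integer side: i++, compare, branch */
        acc += a[i] * b[i];       /* floating-point side: mul + add */
    return acc;
}
```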

The other alternative was pipelining, which works “a bit” but carries a serious penalty when you cannot “fill the pipe”.
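
A tiny example of what “filling the pipe” means in practice (a sketch of mine; the function names and the two-accumulator split are assumptions, and splitting the sum does change FP rounding):

```c
#include <stddef.h>

/* A loop-carried dependency keeps the FP adder's pipeline empty:
 * every add must wait for the previous one to finish. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                /* each add depends on the last */
    return s;
}

/* Two independent chains let a second add enter the pipe while the
 * first is still in flight, i.e. the pipe actually gets filled. */
double sum_pipelined(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n) s0 += a[i];
    return s0 + s1;
}
```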

Now what do you do? Well, you start coming up with fancy ways to keep the processor busy:

• branch delay slot: while you make up your mind about the branch, execute another instruction which may or may not turn out to be useful,
• out-of-order (OoO) execution: reorder the insn stream on the fly so that independent instructions keep the execution units busy,
• speculative execution: while you wait on a branch decision, carry on executing down the predicted path (or, in eager designs, both paths) and discard the work from the branch not taken.

They are listed in order of increasing complexity.
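
To tie the three together, here is one small loop annotated with where each trick earns its keep (a conceptual C sketch with names of my own; no particular core is assumed, and an optimising compiler may well if-convert the branch, the point is what the hardware does when it does not):

```c
#include <stddef.h>

/* One loop, three ways of keeping the machine busy:
 *  - a delay-slot architecture lets the compiler tuck a useful
 *    instruction in behind the loop-back branch,
 *  - an OoO core starts the next iteration's loads while this
 *    iteration's multiply is still in flight,
 *  - a speculative core keeps fetching and executing past the `if`
 *    before the comparison has resolved, throwing the work away on
 *    a mispredict. */
double accumulate_below(const double *a, const double *b,
                        size_t n, double limit) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double p = a[i] * b[i];   /* long-latency FP work */
        if (p < limit)            /* a branch the hardware must guess */
            acc += p;
        /* i++, the compare and the branch back are exactly the filler
         * a delay slot or an OoO window feeds on */
    }
    return acc;
}
```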

@cynicalsecurity Of course, none of this would be necessary if RAM speed had kept pace with CPU speed and RAM size. Instead we've been in a very odd situation where, for almost the entire history of computers, the amount of memory that they can address within a cycle or two has stayed constant at about 64KB, give or take a couple of powers of 2.

@thamesynne @cynicalsecurity But 64K is probably pretty much what any typical bit of code needs. Your average OO method needs its own local variables, the object's attributes, bits of the caller's stack and not a lot else. The problem is that 1 μs later the same processor will be running another method of a different object losing the benefits of that locality.

@cynicalsecurity @thamesynne Maybe. I was thinking more of the Transputer and (I'm very vague about the details) Multics.

@edavies @thamesynne the Transputer was designed with parallel processing in mind and, in effect, a pre-chosen topology via the number of serial links (four). Occam mapped well onto it but applications were limited, as the “powerful one”, the T9000, was a bit late to the game. It was fun to use but, ultimately, it did not deliver on its promises.

@thamesynne @edavies Multics is an OS which ran on mainframe-class systems from several manufacturers (I used it on Honeywell kit).

@cynicalsecurity @thamesynne Right, but my recollection/understanding was that it tended to split execution up much more granularly than the Unix/Windows process model, making hardware exploitation of locality easier.

@cynicalsecurity @thamesynne Indeed. But maybe something of the sort would make better use of the transistors on a modern chip: lots of simpler cores with a bit of local memory.

I had left off yesterday after listing the “magic” which was being developed to break the barrier of “one instruction per clock”.

I did briefly mention VLIW (in the i860’s int+fp attempt) and in all fairness I should mention both Itanium and its predecessor Elbrus (see archive.computerhistory.org/re for a biased story of how Itanium & Transmeta were born), but they can both be dismissed for the time being because they require magical compilers (I call them sentient compilers when talking VLIW).
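
Why “magical”? Because a VLIW compiler has to prove, at compile time, that the operations it packs into one word are independent, and a few lines of ordinary C are enough to make that impossible (an illustrative sketch; the function and variable names are mine):

```c
/* To pack these two statements into one VLIW word the compiler must
 * know that dst and src never alias and how long the idx[i] load will
 * take, neither of which is knowable statically here. An OoO core
 * simply discovers both at run time. */
void scatter_scale(float *dst, const float *src,
                   const int *idx, int n, float k) {
    for (int i = 0; i < n; i++)
        dst[idx[i]] = src[i] * k;
}
```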

What I would like to stress at this point is that, on one side, these optimisations were amazingly smart and sophisticated and, on the other, that they were being developed as an answer to the “need for speed”. What we call “security” now was not part of the design.

Let me put this in perspective: real security was physical in the TLAs, the HPC boxes in labs were not “public access”, etc. The weakness of movable classification whereby your super was split into TS/non-TS for efficiency was not…

yet an issue (recall that these optimisations date from the 80s: the designs predate the actual consumer silicon by at least a decade) because the Cold War was still on and “my nuke super is mine alone!”, none of this promiscuity with other uses…

Bottom line: side-channel (or microcode) attacks simply were never ever in the threat model.

Before the current rabid hordes decided that Intel, AMD and ARM were complicit in an insecure design it would be particularly healthy…

to stop and realise that if your threat model does not include an attack then chances are you did not design defensively around it (I am looking at you, Linus Torvalds, and your disgusting rant against CPU manufacturers - note that he worked for Transmeta and I doubt he designed that defensively against side-channel attacks…).

So, on the one hand we have extreme optimisation to get that trophy but, far, far more importantly, we have totally forgotten the periphery.

While processor clock rates went beyond Cray levels, the I/O around the processor, including memory, was left behind. There is no way our current SATA is remotely comparable to Cray’s HIPPI, or DDR-whatever to Cray’s ECL RAM.

Not only that: the push for speed has meant that a lot less design effort went into the periphery until things got desperate (examples: ISA - EISA - PCI, ATA - IDE - EIDE - SATA, and you might recall the special “bypass” connectors for graphics).

For comparison, workstations had, at least, SCSI and, in the more extreme configurations, multiple smart SCSI controllers with Fast-Wide-Differential buses. In 1995 “my” Digital 8200 “Turbolaser” had four FWD buses connected to smart RAID arrays with 10k RPM Digital-branded Seagate drives. This was 1995…

Now, given the above you would be forgiven for assuming that hardware security was never quite at the core of the problem anywhere…

and, objectively, it wasn’t. Hardware security in the ’90s meant an Atalla cryptographic module or IBM’s Cryptographic Accelerator for the zSeries.

So, what am I trying to say?

At the moment everyone is up in arms about side-channel attacks on speculative execution and they are all barking up the wrong tree.
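
For concreteness, this is roughly the shape of the bounds-check-bypass gadget at the centre of the fuss (a minimal sketch following the public Spectre v1 write-ups; the array names and sizes are mine, not anybody’s shipping code):

```c
#include <stdint.h>
#include <stddef.h>

uint8_t array1[16];
size_t  array1_size = 16;
uint8_t probe[256 * 4096];     /* one cache line per possible byte value */

void victim(size_t x) {
    /* Train the predictor with in-bounds x, then call with x out of
     * bounds: the branch is predicted taken, the out-of-bounds load
     * and the dependent load into probe[] run speculatively, and even
     * after the roll-back the touched probe[] line stays warm in the
     * cache. Timing accesses to probe[] afterwards recovers array1[x],
     * one byte per round. */
    if (x < array1_size) {
        volatile uint8_t t = probe[array1[x] * 4096];
        (void)t;
    }
}
```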

Allow me to explain: what are we attacking? We are, ultimately, attacking data. Once the chip manufacturers have scrambled to fix this PR nightmare what will be the net result?

@cynicalsecurity Umm... we'll still have the same problem...

At some point this snake eats its own tail and sensitive data will be accompanied by an airgap and an armed guard.

The problem will just have been moved elsewhere, to the next subsystem which was ignored both in the original threat model and in the new, amended threat model, which is just “old threat model” + “side-channels in OoO execution”.

Because of the marketing-driven rush there will be no comprehensive review of the new threat model, one which includes a network, multi-tenant VMs, etc. Net result: we are playing whack-a-mole with security issues in hardware just as we are in software.

Should we stop and realise that we need data-driven security we would be forced to acknowledge the need for a major redesign.

Personally I’m quite fond of tagged designs, but I am also looking elsewhere, always in the direction of securing data throughout the system and, possibly, the network.
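
For those who have not met them: a tagged design carries a label alongside every datum and checks it on every access. The toy model below is only my own illustration of the idea in C (real tagged and capability machines, CHERI for instance, do this in hardware and with different tag semantics):

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model: every word carries a tag, and the "hardware" refuses the
 * access when the policy check on the tag fails, instead of handing
 * the value over and hoping software gets the check right. */
typedef struct {
    uint32_t tag;      /* security label travelling with the data */
    uint64_t value;
} tagged_word;

bool tagged_load(const tagged_word *w, uint32_t clearance, uint64_t *out) {
    if (w->tag > clearance)    /* the check is attached to the datum itself */
        return false;          /* fault: the value never leaves "memory" */
    *out = w->value;
    return true;
}
```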

@cynicalsecurity are you saying that with the space/latency tradeoff of our caches, TLBs, DRAM, etc, we're far from the limit?

@Wolf480pl I am saying that choices were made which emphasised processor “speed” (perceived or for optimal cache-resident code) over the performance of the system overall.

My MacPro clearly sits around waiting for I/O despite the on-board SSD which is, frankly, ridiculous.

@cynicalsecurity was it a choice? could we have IO with lower latency? or is it that IO can't be made any faster?

@Wolf480pl the problem is not the latency of the I/O device per se but the overall latency of the system off-chip. The discrepancy between cache and everything else is too extreme at the moment.
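
A crude way to see that discrepancy for yourself: chase a randomly permuted pointer cycle of growing size and watch the nanoseconds per load jump each time the working set falls out of a cache level (a sketch; the sizes and hop count are assumptions and the timing is indicative, not a proper benchmark):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one dependent load per hop around a pointer cycle (POSIX
 * clock_gettime for timing; not serialised, so treat as indicative). */
static double chase_ns(const size_t *next, size_t hops) {
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++)
        p = next[p];                   /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = p;          /* keep the chain from being removed */
    (void)sink;
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / (double)hops;
}

int main(void) {
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;
        /* Sattolo's algorithm: one single random cycle, which also
         * defeats simple stride prefetchers. */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        printf("%7zu KiB working set: ~%5.1f ns per load\n",
               kib, chase_ns(next, 20u * 1000 * 1000));
        free(next);
    }
    return 0;
}
```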

@cynicalsecurity isn't it limited by the speed of light in copper, interference, etc. ?

@cynicalsecurity @Wolf480pl My guess is most PC systems could take a 20% CPU hit and the user would never notice the difference.

@hairylarry @Wolf480pl @cynicalsecurity 20%? Given that HDDs massively out-ship SSDs, I wouldn't be surprised if it's 50% or more for some workloads.

It's to the point that my end users didn't notice the (colossal) CPU performance jump from Core 2 Quad Q9400 to Core i5-4590. At all. I even had some users complain that the new machines were slower (with the same image and both had 7200 RPM HDDs).

But, I don't get to tell my client to buy SSDs and Core i3s instead, so...

@cynicalsecurity Now I'm curious whether the latest release of the Code Morphing Software is vulnerable to Spectre and/or Meltdown.

Of course, the difference is that, because memory protection, branch prediction and any speculative execution all occurred in software, a new release of CMS could have fixed things without a silicon respin if it was vulnerable. But, I agree, I really doubt that this was actually a design consideration.

@cynicalsecurity ASCI RED achieved its goal of 1 TFLOPS and #1 on the top500 at roughly 74% of peak for its Pentium Pro CPUs. Similar to the Paragon with the i860. mastodon.social/media/wZ2EhmOP

@qrs they delivered it on synthetic benchmarks and specific production code: I don’t think that they could remotely manage a general workload (which leapfrogs to the argument I am about to make…), nor could they be scaled down into a consumer design.

Reflecto-retro-photobomb! There are lots of i860s inside the shiny box. mastodon.social/media/y1C7G_AO

@cynicalsecurity We ported Linux to run on the ASCI RED compute nodes and the Paragon came with OSF/1, but neither were used in production since we measured significant performance penalties compared to the Catamount and Puma lightweight kernels. mastodon.social/media/kM7jTlwB

@cynicalsecurity Sun boxes of this kind (with some extra hardware that converted the raw data into a digital video stream, 270 MB/s, to connect to the video mixer) were increasingly used by broadcasters from the late 1990s onwards, usually for commercials and trailers playout, because this content could be much more easily updated than with the previous methods, which used professional VCRs and an actual robot arm to insert cassettes into them (whole movies/shows were still played out from tape then).