Factorial on SubX


Ok, I think I understand calling conventions now.

Also coming face to face with the pain of debugging machine code 😀


More adventures with machine code

SubX now has a test harness, and support for string literals.

Current (increasingly silly) factorial program to compare with the parent toot: akkartik.name/images/20180923-

Two new features:

a) Autogenerated `run_tests` (line 26) which calls all functions starting with 'test_'.

b) String literals (line 31). They get transparently moved to the data segment and replaced with their address.


@haitch @vertigo @freakazoid @kragen

Show thread

That isn't much progress for a month. I've been trying a few different things and learning a lot:

a) I spent some time looking at the GREAT stage0/Mes bootstrap project.

b) I tried to design a type-safe low-level language that could be converted line by line to machine code, but it turned out to be a bust (thanks @kragen!). It's possible if you give up on freeing heap memory.

I'm now convinced there are no shortcuts. Gotta build a real compiler in machine code.

@haitch @vertigo @freakazoid

· · Web · 3 · 1 · 2

There's only one problem: I don't know how to build a compiler. Not really. And definitely not in machine code. So I'm going to be fumbling around for a bit. Lots more wrong turns in my future.

I've been trying to port compilers.iecc.com/crenshaw to SubX. With tests. It's been slow going, because I have to think about how to make Crenshaw's primitives like `Error()` and `Abort()` testable.

@kragen @haitch @vertigo @freakazoid

Show thread

I don't know if just learning to build a compiler will sustain my motivation, though. So some other ideas:

a) Some sort of idealized register allocator in Python or something. I've never built one, and my intuitions on how hard it is seem off.

b) Port Mu's fake screen/keyboard to SubX so that I can reimplement github.com/akkartik/mu/tree/ma. It was just too sluggish since Mu was interpreted. Even my 12 year old students quickly dropped it in favor of Vim.

@kragen @haitch @vertigo @freakazoid

Show thread


Register allocators are pretty fun - some heuristics applied to graph coloring, which is quite hard.

@haitch @vertigo @freakazoid @akkartik

@akkartik You can definitely write a real compiler without a register allocator at all. I haven't written one myself! @haitch @vertigo @freakazoid

@kragen Depends on where one draws the system boundary. In the Crenshaw compiler the error message is clearly part of what is being designed for. So it seems to me there should be some "early warning system" if a code change breaks its response to *incorrect* programs.

@haitch @vertigo

@akkartik Oh, sure, you could test from the user perspective what error message appears. But I think it would be better to indirect the error message through some sort of error code, and test for that, instead of the wording of the error message. @haitch @vertigo

@akkartik @kragen @haitch @freakazoid Honestly, my advice here is to read two books: "Programming a Problem Oriented Language" teaches you how to write your own Forth compiler from scratch (bare metal on up). There are no assembly listings in the book, because it's pure guidance, but it was instrumental in me getting DX-Forth working at all.

@akkartik @kragen @haitch @freakazoid

Second, if you're willing to tackle the more sophisticated compiler techniques needed to have a proper infix compiler, I would then recommend reading Compiler Construction, by Niklaus Wirth (see ethoberon.ethz.ch/WirthPubl/CB).

You don't need fancy register allocation to get a working compiler that meets 80% of your performance goals. Don't let that slow you down.

@akkartik @kragen @haitch @freakazoid
Next, I would strongly recommend reading cs.indiana.edu/~dyb/pubs/nano- . This was instrumental in getting my BSPL compiler prototype to work, and I'll be revisiting this again when I revise BSPL once more to port VertigOS to the .

@akkartik @kragen @haitch @freakazoid There seems to be this great fear of passes in compiler design. It's a one-pass compiler, or a two-pass compiler. Hogwash; do 56 passes if you must; all that matters is the final output. Register allocation can be one (or maybe a few) of those passes towards the end.

@vertigo historically, reducing the number of passes was a big win for reducing the complexity and improving interactive performance, in large part because the passes communicated via mass storage. The nanopass framework eliminates a lot of those costs, and of course hardware is fast now. @akkartik @haitch @freakazoid

@kragen Can you elaborate on how nanopass helps manage complexity? I've tried to read the papers, but I think I'm not yet enlightened.

@vertigo @haitch

@akkartik @kragen @haitch Think Unix pipelines.

Each pass is like a Unix filter -- it takes input, processes it in some *very* well defined way, and produces output. Each pass communicates with other passes via in-memory buffers, so no mass storage is required.

Regarding mass storage requirements, though, you can have nanopass compilers on a Commodore 64 if you wanted, by reducing the size of the compilation unit. BSPL works on individual colon definitions, for example, and uses mere KB.

@vertigo I see, yes I divide up work into functions that operate on the entire compilation unit.

Mu has 27 'transforms': akkartik.github.io/mu/html/012

SubX has 7 so far, but I added an additional notion of 'checks': akkartik.github.io/mu/html/sub

But the original nanopass paper describes designing a separate formally-defined language for each pass: cs.indiana.edu/~dyb/pubs/comme

(Unix pipelines also introduce concurrency and flow-control; I perhaps know too much for that analogy to be helpful.)

@kragen @haitch

@akkartik I'm not sold on anything thsat recommends many different languages to basically treat successive transforms of an essentially equivalent graph. Proofs for every lttle lang seem way overkill to me, when you can reuse good old graph theory on any graph.
Then, that may be just me and my hammer seeing it all as nails. And linguistically oriented overengineering in a mathematical shape may also be somebody else's hammer. Different approaches, but I know which I'd follow.

@vertigo @kragen

@haitch @akkartik @kragen I'm not sure where your contention lies. Can you explain how your idea and the idea of multiple nanopasses conflict?

NOTE: I'm *not* talking about the Nanopass Framework, which is a framework intended to automate the boilerplate of writing nanopasses, saving time in an educational context. I'm talking about the abstractions they are specific instances of.

@haitch @akkartik @kragen My apologies if I wasn't clear in this regard earlier.

@vertigo I get you and I'm not oppossed to the idea of multi-pass compilers. I'm highly biased against the proposed formal proofs expressed as various different languages. I just happen to see multiple passes as a sequence of graph transfor,s, for which you may have a formal description in one regu;ar off-the-shelf language borrowing from graph theory. Data structures of internal memory representation are just optimisations that run on machines.

@akkartik @kragen

@haitch @akkartik @kragen "against the proposed formal proofs expressed as various different languages" -- Ahh, gotcha. That's what I was confused over. Thanks for clarifying.

@vertigo But boilerplate isn't just a problem in an educational context. I'm particularly sensitive to it in machine code, where it's a maze of twisty passages mostly alike.

Also, the nanopass framework isn't just about superficial boilerplate. They were attacking the phase ordering problem. You write a phase expecting some feature to exist, but you inserted the phase where the feature isn't computed yet. Or isn't computed in some situations. Formal specification helps there.

@haitch @kragen

@vertigo To my mind, the framework didn't mutate the idea. The original paper I linked to thinks in terms of separate languages.

But I think I've been getting pedantic. You're right that many phases is a good thing as long as the format (not language) doesn't diverge too much between them. I think I follow where you and @kragen are coming from. You have some in-memory data structure for the program, and successive phases fill in different parts of it. Am I understanding you right?


@vertigo To return to my original statement about "converting line by line" (which is I think what caused this sub-thread), the goal wasn't to minimize passes or phases but rather to minimize how much information a reader needs to hold in their head to understand how a single statement in a language is compiled. Types, global variables, local variables, sure. But I don't want readers to need to know how the previous statement was compiled. That veers into optimizer territory.

@kragen @haitch

@vertigo A graph-coloring register allocator violates that constraint of mine. But as you and @kragen pointed out, I was forgetting that we can do without it. Crenshaw is the trivial demonstration that using registers without actually register allocating. Thanks for the reminder!


@akkartik @kragen @haitch Indeed, both BSPL and Oberon allocates registers using a simple stack. :)

@akkartik Oberon is an awesome source for ideas. I read WIrth's HOPL III paper on it (and Modula-2) this week, which is very nearly the only paper worth reading in the whole proceedings. I have the book but still haven't finished. @vertigo @haitch

@vertigo And Ur-Scheme and StoneKnifeForth just use a runtime stack (with ToS in a register) instead of having a register allocator at all. It's surprisingly less inefficient than you'd imagine! @akkartik @haitch

@kragen @akkartik @haitch On x86 derived architectures, yes. On RISC architectures (at least, on in-order, single-issue architectures), this is terribly inefficient. It might run with OK performance, but the code-bloat it will create will be enormous.

@vertigo That's a very good point, and one I hadn't really thought about. I imagine the code bloat is typically less than a factor of 3, though, which is less than the 5× performance cost I was seeing with Ur-Scheme. @akkartik @haitch

@kragen @akkartik @haitch On RISC-V, to emulate (An)+ or -(An) on 68000, each memory reference will cost you 8 bytes, which is typically in addition to the instruction that computes the desired outcome (so, figure, an average total of 12 bytes per CISCy-equivalent operation). This was the only reason why I didn't implement a naive MachineForth compiler for RISC-V, opting instead for multiple optimization passes and a different structure to the language (e.g., it's a 3-stack environment).

Show more

@kragen @akkartik @haitch HHhhhmmm.... Looking at Ur-Scheme website now. I'm not sure why I haven't visited before. This looks pretty cool!

I'm pondering, once I build my revised BSPL compiler (which I intend on using to write the exec.library for VertigOS) and its multifarious passes, I wonder how difficult it'd be to retarget Ur-Scheme to use the BSPL code generator to get Ur-Scheme to run under VertigOS on RISC-V architecture...

Show more
Show more

@akkartik @kragen @haitch BSPL has some phases which actually mutates the nature of the underlying structure (e.g., going from stack notation to RISC-ish notation), but such disruptive changes usually happens only when you're satisfied that you can't process the current data set any further in a productive way.

I think, as long as it doesn't happen too frequently, it should be simpler to understand overall.

@akkartik @kragen @haitch That's an implementation detail; MS-DOS also had pipelines as of DOS 2.0 and later, but it lacked concurrency and flow control. Behind the scenes, COMMAND.COM handled pipelines using backing temporary files.

@vertigo That being said, if performance is a primary goal, I'd suggest considering whether simpler heuristics are applicable

@akkartik @kragen @freakazoid

@haitch @akkartik @kragen @freakazoid Nice thing about not shying away from passes in compiler implementation is that passes are often plug-and-play relative to each other. That allows you to follow Butler Lampson's philosophy: "Make it work, make it right, make it fast."

A lot of what goes into making a compiler produce the fastest possible code is an act of diminishing returns, and frequently has to be redone every processor generation. Ergo, the more you can modularize it, the better.

@vertigo Anyway, routines that juggle more than 32 variables at amy given time are possibly not very well thought out. And routines that use more than that shou;d probably be penalised and split into other functions. This is less of a problem for processors with many registers, but ai ser how it can be a concern for @akkartik who's specifically targetting IA32.
@akkartik @kragen @freakazoid

@akkartik I think a type-safe low-level language that can be converted line by line to machine code is doable, even with GC; I just meant to suggest some directions in which you can perhaps simplify it. @haitch @vertigo @freakazoid

@kragen Right, I should have said "conventionally type-safe" 😅 If I was to build just an intermediate language as a stepping stone during bootstrapping, I'd be fine with constraints like disallowing arrays, or records, or arrays inside records. But (at least with that particular thought experiment) I'd like to have it all without ever needing to write a compiler.

@haitch @vertigo

@akkartik Hmm, yeah, I can see that point of view. It's nice to minimize the amount of abstraction. What do you think about research.microsoft.com/en-us/u? It might be the thing that comes closest to what you want. @haitch @vertigo

@akkartik I think you can totally do conventionally type-safe with HM types and no arrays, and that's probably fine for a compiler. Maybe it would simplify the compiler to have only tuples, though it would complicate the error messages. @haitch @vertigo

Sign in to participate in the conversation

Server run by the main developers of the project 🐘 It is not focused on any particular niche interest - everyone is welcome as long as you follow our code of conduct!