# Compiler development

Sep 15, 2021

Recently, I've been looking into compiler development a little bit. After some research, I had a rough idea of how this could work, but no practical experience.

At this time I decided to make a tiny language with a compiler to see it in action. This is how Q was born. This language is neither perfect, nor practically useful, but it was a great side project that I enjoyed a lot to work on.

=> https://git.sr.ht/~yerinalexey/qlc

## Setup

Since this will likely be directed to other people interested in compiler development, and not end users of the language, it's probably a good idea to describe how this thing works.

The pipeline is pretty simple and straightforward, there are 3 stages:

And 2 supplemental parts: lex and type store. Lex is not considered a proper stage because it just streams tokens to the parser and doesn't hold them.

## Parse

Lex and parse together produce a tree-like representation of the syntax. You can see it by passing -a flag to qlc:

$ cat example.q
fn add(x: i32, y: i32): i32 = x + y;
$ ./qlc -a example.q
decl/function
	name = add
	parameters = [
		x: i32
		y: i32
	]
	return_type = i32
	body =
		expr/binop +
			expr/variable x
			expr/variable y
[...]

This tree only represents the syntax and it's possible to directly re-create code from it (with caveats, of course). While, for a very simple language it's possible to directly generate QBE IL/Assembly from this tree, it's not possible here because of some important language features: (1) strong typing, (2) type inferred variables and (3) being an expression-based language, which adds a few more rules. So we move on to check.

## Check

Check goes from a "raw" syntax tree to a checked syntax tree. Notable changes include adding types to everything (and checking them) and doing scope management.

This works by recursively walking the syntax tree and producing a new one, and adding more information. For example, all expressions have a 'result' field which is the type of that expression.

Types are stored in a single place, the type store, and all instances of types are borrowed from it. This allows for easier clean-up and deduplication since type store is a hash set. Type store also has a few helpers that are useful in check:

And a tiny code sample of how check for an if condition looks like this:

[...]
	case AST_EXPR_IF:
		expr->kind = EXPR_IF;
		struct expression *cond = check_expr(ctx, aexpr->_if.cond);
		if (cond->result->storage != STORAGE_BOOL) {
			error(aexpr->_if.cond->loc, "Condition must be boolean");
		}
		expr->_if.cond = cond;

		struct expression *if_branch = check_expr(ctx,
			aexpr->_if.if_branch);
		expr->_if.if_branch = if_branch;

		if (aexpr->_if.else_branch) {
			struct expression *else_branch = check_expr(ctx,
				aexpr->_if.else_branch);
			expr->_if.else_branch = else_branch;

			expr->result = type_promote(if_branch->result,
				else_branch->result);
			if (!expr->result) {
				error(aexpr->loc, "Cannot promote branch types");
			}
		} else {
			expr->_if.else_branch = NULL;
			expr->result = &builtin_type_void;
		}
		break;
[...]

Carefully filling out the fields and types, and ensuring that the condition is boolean. A lot of code in check looks like this, but some parts are more complicated than others like the call expression and forward declaration lookup.

Example of a checked tree for that example program (long!):

$ ./qlc -A example.q
decl/function
	name = add
	parameters = [
		variable
			name = x
			type =
				type/i32
					hash = 17585875
					size = 4
					is_signed = true
		variable
			name = y
			type =
				type/i32
					hash = 17585875
					size = 4
					is_signed = true
	]
	return_type =
		type/i32
			hash = 17585875
			size = 4
			is_signed = true
	body =
		expr/binop +
			variable
				name = x
				type =
					type/i32
						hash = 17585875
						size = 4
						is_signed = true
			variable
				name = y
				type =
					type/i32
						hash = 17585875
						size = 4
						is_signed = true
[...]

## Codegen!

We have a tree with types that is correct language-vise, now we need to flatten it out into a list of instructions. I like to use QBE because it's much easier to write than Assembly and it provides some nice little features like optimizations and painless register allocation.

=> QBE

The desgin is very simple - just write the IR to stdout. Some compilers have a full in-memory representation but I don't think it's necessary.

One useful thing that C, unlike other languages, provides is builtin variadism. It's used to construct instructions in a very nice way, without having to write emit code for them separately (unlike I did for Antimony). This is how a generic '+' operation could look like:

add_instr(ctx, &result, Q_ADD, &lhs, &rhs, NULL);
// => %result =w add %lhs, %rhs

Each expression returns a QBE value as its generated result and if needed, puts additional instructions before. For a literal number, that would be a constant, for '+' operation this would be a temporary that is assigned to sum of given values and so on.

Continuing the example of if expression:

static struct qbe_value
gen_if(struct gen_context *ctx, struct expression *expr)
{
	struct qbe_value cond = gen_expression(ctx, expr->_if.cond);

	if (!expr->_if.else_branch) {
		struct qbe_value ifl = mklabel(ctx, "if");
		struct qbe_value endl = mklabel(ctx, "endif");

		add_instr(ctx, NULL, Q_JNZ, &cond, &ifl, &endl, NULL);

		add_label(ctx, ifl.label);

		qvalue_finish(gen_expression(ctx, expr->_if.if_branch));

		add_label(ctx, endl.label);

		qvalue_finish(cond);
		qvalue_finish(ifl);
		qvalue_finish(endl);

		return qvoid;
	} else {
		// [...]
	}
}

We generate the condition, then JNZ (Jump if Non Zero) to the condition, or to the end otherwise. The result is qvoid, which is a special value that represents nothing (i.e. 'void' type).

Note: qvalue_finish is required because data of qbe_values is heap-allocated.

And there you have it, it's not a complete walkthrough of the internals because that would be boring and you can just read all the code anyway.

## Conclusion

This was my first from-scratch compiler. I like how it turned out, but it could be surely better.

One mistake I made is returning values from gen functions instead of them taking a place to write into. This makes implementing aggregate types (arrays, structs, etc) pretty annoying since you'd need to allocate temporary storage and copy into real storage. Also, the type system probably needs major updates because it misses fundamental features like constant qualifiers. The code is not very clean and some parts really need refactoring, but it is what it is.