Code Generation is a Poor Man’s Compiler

Designing your own domain-specific or fourth-generation programming language is a lot of fun; I can tell you all about it. But once you go beyond the design itself, you will inevitably be confronted with the question of how to run your code on a given platform. You might think that generating machine code or virtual machine code is too complex, and so source code generation may come to mind. But please, please think again.

Source code generation has always been there. And there are some first-class development environments like Mendix, Outsystems, and Cordova, to name a few. But I think when you’re serious about creating a new programming language, you should also take the way your programs run very seriously. Generating some repetitive parts of your code is one thing, but why would you generate all your code through an intermediate language?

I can understand why it is tempting to do so. It’s nice to see a compiler translating your shiny new programming language into another language that you would normally have to write line by line. And because it’s a language you already know, it will be easier to understand, easier at least than (virtual) machine code, and you can use the IDE and debugger you already know to manage the generated source code.

However, I think there are some major disadvantages to generating source code. Let’s go through them.

Your first concern should be build times. Everybody who has compared hand-written code to source code generation can tell you how much trouble you can get into in this respect. It’s not only that an extra step is added to the build process. It’s also that, as your language and project grow, you can expect longer and longer build times due to unintentional cross-over dependencies. The more the language you designed differs from the target language, the more situations can occur in which even the tiniest change results in the recompilation of whole sets of files. And this problem is multiplied many times over if your code generator is not smart enough to know which parts it should regenerate and which parts it should not.
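To make the "which parts to regenerate" point concrete, here is a minimal sketch of one common approach: hash each source unit and regenerate only the units that changed, plus the units that directly depend on them. The function name `plan_regeneration`, the `deps` mapping, and the in-memory `cache` are all assumptions made for illustration; a real generator would also have to follow transitive dependencies and persist the cache between builds.

```python
import hashlib

def plan_regeneration(sources, deps, cache):
    """Return the names of the units that need to be regenerated.

    sources: unit name -> DSL source text
    deps:    unit name -> set of unit names it directly depends on (hypothetical)
    cache:   unit name -> hash recorded at the previous run (updated in place)
    """
    changed = set()
    for name, text in sources.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if cache.get(name) != digest:
            changed.add(name)
            cache[name] = digest
    # A unit is rebuilt if it changed itself or one of its direct dependencies did.
    return sorted(
        name for name in sources
        if name in changed or deps.get(name, set()) & changed
    )
```

Even in this toy version you can see the multiplier at work: every unit that lists a changed unit in `deps` is dragged into the rebuild, and the more your language differs from the target language, the denser that dependency graph tends to become.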

Another fundamental problem with code generation is that the runtime of your intermediate language is totally unaware of the language you’re generating from. Even with a perfect programming language, runtime errors can occur. And the more complex your generation process becomes, the more difficult it will be to understand what’s causing a given exception. Tracing the original cause can eventually be as difficult as tracking down a compiler bug with a mainstream compiled language.
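As an illustration of what it takes to bridge that gap, here is a minimal sketch in which the generator records which generated line came from which line of the original program, and uses that map to translate a runtime error back. The one-statement toy language (`print <expr>`), the function names, and the `<generated>` file name are all made up for this example.

```python
import traceback

def generate(dsl_lines):
    """Toy generator: each 'print <expr>' DSL line becomes two Python lines."""
    out, line_map = [], {}
    for dsl_no, text in enumerate(dsl_lines, start=1):
        expr = text[len("print "):].strip()
        out.append(f"# generated from DSL line {dsl_no}")
        out.append(f"result.append({expr})")
        line_map[len(out)] = dsl_no  # generated line number -> DSL line number
    return "\n".join(out), line_map

def run_with_mapping(dsl_lines):
    code, line_map = generate(dsl_lines)
    result = []
    try:
        exec(compile(code, "<generated>", "exec"), {"result": result})
    except Exception as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        # Find the innermost frame that belongs to the generated code.
        gen_line = next(
            f.lineno for f in reversed(frames) if f.filename == "<generated>"
        )
        raise RuntimeError(f"DSL line {line_map[gen_line]}: {exc}") from exc
    return result
```

Without the `line_map`, the user would be told that line 4 of some generated file failed, which means nothing in terms of the program they actually wrote. And this is the easy case: once one statement in your language expands into code spread over several generated files, keeping such a map accurate becomes real compiler work.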

That’s the moment that you will realize that your code generator is actually a compiler. Well, I would say a poor man’s compiler.

All that would not even be such a problem if you were only writing the code generator for yourself. But when I’m thinking about a serious new programming language, I want other people to use it. And for that reason, all the above issues would be a no-go area for me.

It is just very hard to completely hide the intermediate language from the users of your language. You may have the intention of making “life easier” for application developers. But instead, they now have to know your language and also have the skill to solve problems in the intermediate language. The build process becomes bumpier, and the moment something goes wrong at compile time or at runtime, the application developer is in trouble.

Again, I can understand how tempting code generation is because

  1. it’s probably easier than generating machine code or virtual machine (VM) code, at least initially;
  2. where applicable, you won’t have to develop a virtual machine;
  3. you don’t have to develop an IDE and/or debugger; and
  4. it can be used as an argument against vendor lock-in, because a company could fall back on the generated code if that were really needed.

Creating machine code is indeed a very hard thing to do. You need to be an expert to do so. But code for a VM is a lot simpler. There are lots of VMs out there that you can use, and I’m not talking about the JVM and the CLR alone. Also, have a look at the LLVM compiler infrastructure. By just creating a so-called LLVM frontend, you can leave all the code optimizations and low-level code generation to the LLVM tools. Hundreds of programming languages have already been developed that way. It even has its own VM, which you can use to compile and run code on the fly.

You might even consider creating your own virtual machine. It can be quite a challenge to get everything right, but it is not rocket science either.
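To give an idea of how small the core of a virtual machine can be, here is a minimal sketch of a stack-based interpreter loop. The instruction set (`PUSH`, `ADD`, `MUL`, `JMPZ`, `HALT`) and the tuple encoding are invented for this example; a real VM would add calls, memory management, error reporting, and a binary bytecode format.

```python
def run(program):
    """Execute a list of (opcode, *operands) tuples and return the final stack."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "JMPZ":
            # Pop the top of the stack and jump to the target if it is zero.
            if stack.pop() == 0:
                pc = args[0]
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op!r} at instruction {pc - 1}")
    return stack
```

A program such as `[("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]` computes (2 + 3) * 4 and leaves the result on the stack. Because the VM is yours, every instruction can carry a reference back to the original source line, which is exactly the information a generated intermediate language tends to lose.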

The fact that you do not have to create an IDE or debugger would only be an argument for using code generation if we could hide generated source code from the application developer. But, as I wrote above, that can be difficult to achieve. And remember that IDEs like Eclipse and IntelliJ can be extended with plugins to fully support your own language for syntax highlighting, continuous compilation, and even integrated debugging.

The lock-in argument is something totally different. It might be convincing for bigger companies. But let’s be serious: switching to the generated code later on is only a realistic option if that code is very neatly generated. Making your project open source, or making sure your customers can get access to your IP through some form of escrow, makes a lot more sense to me.

The point I wanted to make here is that I would welcome any new rapid development language (4GL or whatever name you give it), but only if it comes with full support in the sense of an IDE, debugger, and so on. Consider buying a book on building compilers or virtual machines. And then think again.

