Metaprogramming Java: introduction

James Thompson

Java is known to be verbose at the best of times - whatever you’re up to, if you’re doing it properly there’s probably going to be a bunch of code written at the end of it. This is doubly true in my work, where performance really matters and you often can’t touch central Java features like memory allocation and interfaces.

Fighting back

Other JVM languages exist that tackle the bloat. Kotlin has a pile of extra features that stand in for Java you'd otherwise have to write, and Clojure has a very powerful macro system for implementing your own abstractions. Staying in Java, you might write a javac annotation processor in the vein of Lombok or Immutables, or turn to a template-based code generation solution.

There are a variety of priorities we can choose to optimise for here; all engineering is tradeoffs. Some teams prioritise simple build processes, others prioritise speed of debugging the generated output. Sometimes raw runtime performance is the only thing that matters to people, and some teams genuinely want all the options and the ability to pick and choose from them depending on each project's needs.

A common set of objectives

One major sweet spot in the tradeoff landscape comes from evaluating direct cost: how much time does it take to implement a new code generator, and does the generated code take more time than hand-written code to debug when things go wrong? If you've got a sufficiently well-organised team and a mature engineering culture, you can generally cover the cost of a slightly more complex build process with huge time savings as your developers incorporate metaprogramming more and more into their daily work.

What's interesting, though, is that none of the standard Java-based solutions is really cheap to implement new abstractions with. Adopting a new language is a major investment, reinjecting a modified AST into the IDE is definitely a tool for experts, and even quite simple uses of annotation processors end up turning into significant engineering projects, despite the best efforts of source code generation tools like JavaPoet. More insidiously, all of these approaches push the costs of your decisions onto your project's users: even in the simplest case of annotation processing, they have extra dependencies to install and manage if they want to dig into the source. Giving up the possibility of zero-dependency libraries isn’t a choice we should make lightly.

There’s also a very different goal that all of these approaches miss: you’re only metaprogramming one language. If you’ve taken the time to design a subtle model that captures something really important in your domain, don’t you want to use the knowledge it embeds across all your projects?

Model-based code generation to the rescue

You can satisfy all of these goals through model-based code generation: encode structural elements of your domain and your solutions in target-independent models, and write code generators to translate these into the source code for fragments of your systems. This generated code is then merged with your hand-written code at build time, and can easily be inspected by your IDE as you're programming.

This is an old idea in the Java world, appearing in the JDK and all kinds of libraries and frameworks. Yet the uses people put it to have fallen a long way short of the promises we're discussing here, and the simple reason for this is that the tooling currently on offer isn't anywhere near good enough. If you want to model system semantics and generate source code you need the full power of a programming language at your disposal, and conventional template filling simply won’t cut it. Generally you need several preprocessing steps before your model can hit the code generator, and with that kind of friction you're just not going to reach for the technique.

Really high quality tools definitely exist outside the Java world, and the most powerful one I’ve come across is C-Mera, a Common Lisp system that expresses C-like languages directly as valid Common Lisp code. This means you can use the full Lisp compile-time programming system to generate source code, a technically perfect approach.

(function strcmp ((char *p) (char *q)) -> int
  (decl ((int i = 0))
    (for (() (== p[i] q[i]) i++)
      (if (== p[i] #\null)
          (return 0)))
    (return (- p[i] q[i]))))
      

In practice, though, systems as clever as C-Mera often wind up being challenging and costly to use in industrial settings. If we want to extend the output to a new language, perhaps not even a programming language, then the framework itself has to be extended to support it, and that’s not something you’d like to take on part way through a project. On-boarding new contributors means teaching them a lot of subtle Common Lisp, and that’s a major commitment, especially for someone who’s never been part of the Lisp world.

More broadly, though, having your tools know so much about the code you’re dealing with tends to trap you in a local minimum in the libraries versus frameworks debate. Spending time fighting your tools is exactly what you don’t want when you’ve specifically set out to liberate your systems from the limitations of any one language.

Our approach

I came across the answer just in the last few years, in the form of the venerable GSL tool from iMatix (the company behind AMQP and ZeroMQ), which remarkably has been under constant development since the early 90s. It combines a highly dynamic data model for model input and manipulation, a bespoke scripting language that handles exactly the tasks you need to write code generators, and a very clever template syntax that lets you prefix '.' to lines to toggle between writing template text and raw code to be evaluated by the scripting engine.

GSL’s great insight is that a code generation tool should know almost nothing about the syntax and semantics of the code it’s going to create, but should allow the user to develop and embed this knowledge in the form of code for its internal scripting language. In this sense it’s exactly the same as a general-purpose programming language like Java.

In the same vein as GSL, we’ve built a system called Temple that we’re going to open source as soon as the API settles down. We use Lua internally rather than a special purpose scripting language, but when you get to know the library functions it’s very similar to use.

I’ll give a very simple example of it in practice. Here we define a model for a 'Data' (value) class representing a point on the earth’s surface with an ID.

local geoPosID_members = {
    {name = "lat", type = "f64"},
    {name = "lon", type = "f64"},
    {name = "id",  type = "i32"}
}
local geoPosID_class = {
    package   = "com.inkblotsoftware.geo",
    className = "GeoPosID",
    members   = geoPosID_members
}
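
A model like this is plain data, so its shape can be validated before any code is generated from it. As a rough illustration, written in Java rather than Lua, a contract can be a map from field name to predicate. The ModelContract class and all of its methods are hypothetical names invented for this sketch, not part of Temple:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: a contract maps each required model field to a predicate.
final class ModelContract {
    private final Map<String, Predicate<Object>> checks = new LinkedHashMap<>();

    ModelContract require(String field, Predicate<Object> check) {
        checks.put(field, check);
        return this;
    }

    // Names of the fields that are missing or fail their predicate, in declaration order.
    List<String> violations(Map<String, Object> model) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Predicate<Object>> entry : checks.entrySet()) {
            Object value = model.get(entry.getKey());
            if (value == null || !entry.getValue().test(value)) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }

    static boolean isString(Object o) {
        return o instanceof String;
    }

    static Predicate<Object> isListOf(Predicate<Object> item) {
        return o -> o instanceof List && ((List<?>) o).stream().allMatch(item);
    }

    static boolean isDataClassMember(Object o) {
        if (!(o instanceof Map)) {
            return false;
        }
        Map<?, ?> member = (Map<?, ?>) o;
        return member.get("name") instanceof String && member.get("type") instanceof String;
    }

    // The contract a data-class model is assumed to satisfy in this sketch.
    static ModelContract dataClass() {
        return new ModelContract()
            .require("package", ModelContract::isString)
            .require("className", ModelContract::isString)
            .require("members", isListOf(ModelContract::isDataClassMember));
    }
}
```

Calling ModelContract.dataClass().violations(model) on a well-formed model returns an empty list; otherwise it names the offending fields, so a generator can fail before writing any output.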

Then we define the code generator that writes the code for a .java file, including standard utility methods like hashCode(), a builder, an interface into model-holding entities, an immutable wrapper class, a struct-of-arrays high performance collection class, and a host of other things that naturally follow from the basic value class model. Unsurprisingly I’ve omitted most of the code here. Contracts on the input model are checked at code generation time, and $() on a template line embeds the value of the Lua expression inside it:

tmpl.javaDataClass = compileTemplate (
    Contract:new { package   = isString,
                   className = isString,
                   members   = isListOf (isDataClassMember) },
[[// === NB: 100% AUTOGENERATED SOURCE CODE FILE ===
// === DO NOT EDIT EXCEPT EXPERIMENTALLY ===

package $(C.package);

public class $(C.className) {

    // ------------------------------------------------------------
    // An exposed field for each model member

. for _,m in ipairs (C.members) do
. local ty = javaPrimTypeName (m.type)
    public $(ty) $(m.name);
. end

// ...

    // ------------------------------------------------------------
    // Immutable wrapper class

    static class Imm {
        private $(C.className) _dats = new $(C.className) ();

        // ... 

        // --------------------------------------------------
        // Accessors
.     for _,m in ipairs(C.members) do
.     local ty = javaPrimTypeName (m.type)
        public $(ty) $(m.name) () { return _dats.$(m.name); }
.     end

        // --------------------------------------------------
        // Value-altering copy ctrs, e.g. with_lat(double val)
.     for _,m in ipairs(C.members) do
.     local ty = javaPrimTypeName (m.type)
        public Imm with_$(m.name) ($(ty) val) {
            $(C.className) newDats = _dats.copy ();
            newDats.$(m.name) = val;
            return new Imm (newDats);
        }
.     end
    }

// ...

}]])
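
One way to process a template in this style is to translate it into script source before running it: lines starting with '.' pass through as code, and every other line becomes a write call with its $() expressions spliced in. The Java sketch below illustrates only that translation step. The TemplateTranslator class, the emitted out(...) function and the use of Lua string concatenation are assumptions made for illustration, not a description of how Temple or GSL are actually implemented:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turn a '.'-prefixed template into Lua-like script lines.
final class TemplateTranslator {

    static List<String> translate(List<String> templateLines) {
        List<String> script = new ArrayList<>();
        for (String line : templateLines) {
            if (line.startsWith(".")) {
                // Script line: drop the marker and keep the code as-is.
                script.add(line.substring(1).trim());
            } else {
                // Text line: emit it through an assumed out() function.
                script.add("out(" + textToExpr(line) + ")");
            }
        }
        return script;
    }

    // Builds a concatenation of quoted literals and tostring(...) calls for one text line.
    static String textToExpr(String line) {
        StringBuilder expr = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        int i = 0;
        while (i < line.length()) {
            if (line.startsWith("$(", i)) {
                int close = matchingParen(line, i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("unclosed $( in: " + line);
                }
                if (literal.length() > 0) {
                    appendPart(expr, quote(literal.toString()));
                    literal.setLength(0);
                }
                appendPart(expr, "tostring(" + line.substring(i + 2, close) + ")");
                i = close + 1;
            } else {
                literal.append(line.charAt(i));
                i++;
            }
        }
        literal.append('\n'); // every text line ends with a newline in the output
        appendPart(expr, quote(literal.toString()));
        return expr.toString();
    }

    private static void appendPart(StringBuilder expr, String part) {
        if (expr.length() > 0) {
            expr.append(" .. ");
        }
        expr.append(part);
    }

    // Quote text as a double-quoted string literal with basic escapes.
    private static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }

    // Index of the ')' that closes the '(' at position open, or -1 if there is none.
    private static int matchingParen(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

The appeal of this design is that the translator never needs to understand Java, or any other target language: it only distinguishes script lines from text lines, which is the property the template syntax is built around.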

We can write a simple Lua script to generate the output .java file, and call it manually as we change the model or tie it to the build system to track automatically. You can obviously set the written location in io.output programmatically from .package and .className if you've got a lot of files, or are changing them frequently.

io.output ("gensrc/main/java/com/inkblotsoftware/geo/GeoPosID.java")
io.write  (tmpl.javaDataClass (geoPosID_class))
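
The output location follows mechanically from the model, since Java source files mirror their package names. A minimal sketch of that derivation, written in Java for illustration (the GeneratedPaths helper is a hypothetical name; in a Lua driver the same string manipulation would produce the argument to io.output):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derive a generated file's location from the model's
// package and class name.
final class GeneratedPaths {

    static Path sourceFile(String sourceRoot, String packageName, String className) {
        Path dir = Paths.get(sourceRoot);
        if (!packageName.isEmpty()) {
            // Each dot-separated package segment becomes one directory level.
            for (String segment : packageName.split("\\.")) {
                dir = dir.resolve(segment);
            }
        }
        return dir.resolve(className + ".java");
    }
}
```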

It’s quite surprising at first to see how simple this is in use once you've written the model tooling and templates - you really do just define a model, generate the resulting code and use it in your own classes. Like all the best tools, it feels a little bit like you’re cheating. Unsurprisingly, the same domain models come up when we’re programming in other languages like C++, so we have very similar model extensions and templates for value/data classes there.
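
Reusing a model across languages mostly comes down to giving each target its own mapping from model types to concrete types. The Java sketch below renders the same member list as Java fields and as C++ struct members. The TargetTypes class, the mapping of i32 to int on the Java side, and the choice of std::int32_t for C++ are illustrative assumptions rather than Temple's actual mappings:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one model, two targets, each with its own type table.
final class TargetTypes {
    private static final Map<String, String> JAVA = new LinkedHashMap<>();
    private static final Map<String, String> CPP = new LinkedHashMap<>();
    static {
        JAVA.put("f64", "double");
        JAVA.put("i32", "int");
        CPP.put("f64", "double");
        CPP.put("i32", "std::int32_t"); // assumes <cstdint> is included in the output
    }

    // Look up the concrete type for a model type in the given target ("java" or "cpp").
    static String typeName(String target, String modelType) {
        Map<String, String> table;
        if (target.equals("java")) {
            table = JAVA;
        } else if (target.equals("cpp")) {
            table = CPP;
        } else {
            throw new IllegalArgumentException("unknown target: " + target);
        }
        String concrete = table.get(modelType);
        if (concrete == null) {
            throw new IllegalArgumentException("unknown model type: " + modelType);
        }
        return concrete;
    }

    // Render one field declaration per member; members maps field name to model type.
    static String fields(String target, Map<String, String> members) {
        String prefix = target.equals("java") ? "public " : "";
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> member : members.entrySet()) {
            out.append("    ")
               .append(prefix)
               .append(typeName(target, member.getValue()))
               .append(' ')
               .append(member.getKey())
               .append(";\n");
        }
        return out.toString();
    }
}
```

Only the type tables and the declaration syntax differ between targets; the model itself is untouched, which is what makes it worth investing in.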

In the next post I’ll show how we can write new code generators to speed up the creation of a real project.

Get in touch at contact@inkblotsoftware.com to see how we can help with your data challenges