Re: How does LuaJIT's trace compiler work?

  • From: Mike Pall <mike-1311@xxxxxxxxxx>
  • To: luajit@xxxxxxxxxxxxx
  • Date: Fri, 29 Nov 2013 14:50:32 +0100

Robin Heggelund Hansen wrote:
> From what I understand, LuaJIT's JIT doesn't compile hot methods like
> Java's HotSpot does, it compiles hot paths originating from loops. Does
> this mean that if something doesn't originate from a loop (say, I call Lua
> functions from the C-api) then the code will never be jitted?

Function entries can be trace start points, too. But the hotspot
detection and the region selection prefer loops.
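
Here's a minimal sketch (exact trace boundaries depend on the
hotcount thresholds and the LuaJIT version, so treat the comments
as the typical outcome, not a guarantee):

  -- Run with: luajit -jv hot.lua
  -- -jv prints one line per trace, showing where it started
  -- (a loop or a function entry) and where it linked.
  local function sum(t)
    local s = 0
    for i = 1, #t do   -- hot loop: preferred trace start point
      s = s + t[i]
    end
    return s
  end

  local t = {}
  for i = 1, 200 do t[i] = i end
  local total = 0
  for i = 1, 1000 do
    total = total + sum(t)
  end
  print(total)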

> And what happens when you hit another loop? Will the path to the
> second loop be JIT'ed, and then a new path from that loop jitted
> as well, or will the second loop be a part of the same path?

Depends. If the inner loop has a low iteration count, it'll be
unrolled and inlined. For most loop nestings there'll be a single
trace for the outer loop that goes 'around' the inner loop,
joining the inner loop's exit back to its entry. Otherwise traces
link to each other in linear pieces.
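
A sketch of the two nesting cases (what actually happens depends
on the iteration counts and the heuristics; -jv shows the real
outcome):

  -- Case 1: short inner loop, likely unrolled and inlined
  -- into the outer loop's trace.
  local s1 = 0
  for i = 1, 100000 do
    for j = 1, 3 do
      s1 = s1 + j
    end
  end

  -- Case 2: hot inner loop. It typically gets its own root
  -- trace; the outer loop's trace goes 'around' it, joining
  -- the inner loop's exit back to its entry.
  local s2 = 0
  for i = 1, 1000 do
    for j = 1, 1000 do
      s2 = s2 + i * j
    end
  end
  print(s1, s2)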

> How does the interpreter choose the most optimal hot path?

The interpreter doesn't choose a path. Region selection uses
probabilistic algorithms and lots of intertwined heuristics to
select when and where to start a trace, whether to follow it and
where to end it. It's really, really hard to describe this
adequately in prose and not in code.

IMHO LuaJIT has pretty good facilities for looking at the
generated traces (-jv and -jdump). Try them with some carefully
crafted examples and check the source code for details. There'll
be a few surprises along the way.
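
For example, to get one summary line per trace (start point,
link, aborts) or the full per-trace dump (bytecode, IR, machine
code):

  $ luajit -jv myscript.lua
  $ luajit -jdump myscript.lua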

> Let's say I have a hash-table of ints -> strings. Now imagine
> that I've called table[x] with x being 3 and 5 enough times that
> they've become hot paths and jitted, how does the interpreter
> decide which jitted code to call for table[x] where x is 4?

Umm, no. Indexing with non-constant integers is not specialized
directly (the pay-off is too low). There's specialization for tons
of other things, e.g. function calls. This may indirectly lead to
specialization of other instructions, of course.
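
To make that concrete, a minimal sketch (the -jdump IR listing
shows what actually gets emitted; HREFK is the IR's constant-key
hash lookup):

  local t = { foo = 1, [3] = 30, [4] = 40, [5] = 50 }

  local function f(x)
    -- Constant key: the lookup can be specialized to the hash
    -- slot (HREFK in the IR).
    local a = t.foo
    -- Non-constant integer key: a generic lookup. It is NOT
    -- specialized to x == 3 or x == 5, no matter how often
    -- those values occur.
    local b = t[x]
    return a + b
  end

  local keys = { 3, 5 }
  local s = 0
  for i = 1, 1000 do
    s = s + f(keys[i % 2 + 1])
  end
  print(s)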

Specialization works by adding checked assertions to the IR. A new
side trace is spawned if an assertion turns out to be wrong often
enough. That one either specializes to other pre-conditions or
generalizes, depending on the circumstances.
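
A minimal sketch of how a side trace arises (the hot-exit
threshold is tunable, so when exactly the side trace gets
compiled may vary):

  local function step(x)
    if x > 0 then   -- the root trace specializes to the hot
      return x - 1  -- branch and emits a guard on 'x > 0'
    else
      return 100    -- guard failure exits here; once this exit
    end             -- gets hot, a side trace is compiled for it
  end

  local x = 1000
  for i = 1, 100000 do
    x = step(x)     -- x > 0 almost always; the else branch is
  end               -- the occasional guard failure
  print(x)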

This leads to a graph of traces that probabilistically approaches
a local optimum. Similar to how a C switch statement is turned
into a decision tree. Except that a JIT compiler has access to
runtime path statistics (indirectly, via probabilistic hotspot
detection) and a trace compiler can ignore paths that are never
taken. This often makes a decision tree a better choice than a
jump table (which is mainly beneficial for a large equi-weighted
fan-out or in the absence of known weights).
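
In Lua terms, a minimal sketch of the idea: a dispatch on a tag
field where only the tags that actually occur get compiled,
weighted by how often they occur:

  local function handle_get(m) return m.key end
  local function handle_put(m) return m.key + 1 end
  local function handle_del(m) return -m.key end

  local function dispatch(m)
    -- The root trace specializes to the most frequent tag and
    -- guards on the comparison; side traces cover the rarer
    -- tags. A tag that never occurs is never compiled at all.
    if m.tag == "get" then return handle_get(m)
    elseif m.tag == "put" then return handle_put(m)
    elseif m.tag == "del" then return handle_del(m)
    end
  end

  local msgs = {
    { tag = "get", key = 1 },   -- 2/3 of all messages
    { tag = "get", key = 2 },
    { tag = "put", key = 3 },   -- 1/3; "del" never occurs
  }
  local s = 0
  for i = 1, 30000 do
    s = s + dispatch(msgs[i % 3 + 1])
  end
  print(s)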

[
A recent example illustrates the power of this approach:

Cloudflare's WAF (web application firewall) basically generates
Lua code for the (highly non-linear) maze of firewall rules. An
incoming attack triggers certain rules and the corresponding paths
are turned into linearized traces. These can be heavily optimized
by LuaJIT, much more so than you could ever hope to do with a
static compiler.

When a different attack wave is started, it'll spawn different
traces -- i.e. the generated code dynamically adapts to the
specific attacks.

The result is a much higher firewall throughput than you might be
able to achieve with a data-driven or a static compilation
approach. This directly translates into money saved on machines
needed to handle the load.
]

> Another thing that has been racking my brain. Since paths are compiled, not
> functions, won't a trace compiler require more memory?

Nope, a trace compiler generates much, much less code than a
method-at-a-time compiler. Only a fraction of the entire program
is compiled and traces are usually rather short, too.

> Since you can't really re-use compiled code of another path I
> mean, and since paths will probably be larger than single
> functions in the general case...

There's some amount of trace overlap and code duplication. But
that's often beneficial due to better optimization opportunities.
The effect on code density is rather minor in practice.

--Mike
