Machine²) assembly into JavaScript. LLVM is a compiler
project primarily focused on C, C++ and Objective-C. It
compiles those languages through a frontend (the main ones
being Clang and LLVM-GCC) into the LLVM intermediate
representation (which can be machine-readable bitcode or
human-readable assembly), and then passes it through a
backend which generates actual machine code for a particular
architecture. Emscripten plays the role of a backend which
targets JavaScript.
By using Emscripten, potentially many languages can be
run on the web, using one of the following methods:
• Compile code in a language recognized by one of the
existing LLVM frontends into LLVM, and then compile
that into JavaScript using Emscripten. Frontends exist for
various languages, including many of the most popular
programming languages such as C and C++, and also
various new and emerging languages (e.g., Rust³).
• Compile the runtime used to parse and execute code in a
particular language into LLVM, then compile that into
JavaScript using Emscripten. It is then possible to run
code in that runtime on the web. This is a useful approach
if a language’s runtime is written in a language for which
an LLVM frontend exists, but the language itself has no
such frontend. For example, there is currently no frontend
for Python; however, it is possible to compile CPython –
the standard implementation of Python, written in C –
into JavaScript, and run Python code on that (see
Section 4).
From a technical standpoint, one challenge in designing
and implementing Emscripten is that it compiles a low-level
language – LLVM assembly – into a high-level one –
JavaScript. This is somewhat the reverse of the usual
situation when building a compiler, and leads to some
unique difficulties. For example, to get good performance in
JavaScript one must use natural JavaScript control flow
structures, like loops and ifs, but those structures do not
exist in LLVM assembly (instead, what is present there is a
‘soup of code fragments’: blocks of code with branching
information but no high-level structure). Emscripten must
therefore reconstruct a high-level representation from the
low-level data it receives.
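To make the difference concrete, the following is a minimal sketch of one small computation (summing the integers below n) written both ways. It is an illustration only, not Emscripten's actual output; the function names and block labels are hypothetical.

```javascript
// (1) A literal translation of a 'soup of code fragments': every basic
// block becomes a switch case, and a label variable records which block
// should run next. This is correct, but hard for an engine to optimize.
function sumAsBlocks(n) {
  let i, total;
  let label = 'entry';
  while (true) {
    switch (label) {
      case 'entry':
        i = 0;
        total = 0;
        label = 'check';
        break;
      case 'check':
        label = (i < n) ? 'body' : 'exit';
        break;
      case 'body':
        total += i;
        i += 1;
        label = 'check';
        break;
      case 'exit':
        return total;
    }
  }
}

// (2) The same computation after the loop structure has been recovered.
function sumAsLoop(n) {
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += i;
  }
  return total;
}

console.log(sumAsBlocks(10), sumAsLoop(10)); // prints: 45 45
```

Both functions return the same result for every n; only the second uses a loop construct that a JavaScript engine can recognize directly.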
In theory that issue could have been avoided by compiling
a higher-level language into JavaScript. For example, when
compiling Java into JavaScript (as the Google Web Toolkit
does), one can benefit from the fact that Java’s loops, ifs and
so forth generally have a very direct parallel in JavaScript.
The downside of that approach is that it yields a compiler
only for Java. In Section 3.2 we present the ‘Relooper’
algorithm, which generates high-level loop structures from
the low-level branching data present in LLVM assembly. It
is similar to loop recovery algorithms used in decompilation
(see, for example, [2], [9]). The main difference between the
Relooper and standard loop recovery algorithms is that the
Relooper generates loops in a different language from the
one that was originally compiled, whereas decompilers
generally assume they are returning to the original language.
The Relooper’s goal is not to accurately recreate the original
source code, but rather to generate native JavaScript control
flow structures, which modern JavaScript engines can then
execute efficiently.
² http://llvm.org/
³ https://github.com/graydon/rust/
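As a rough illustration of what 'low-level branching data' looks like, and of the first question any loop recovery must answer, the sketch below finds the blocks of a small control flow graph that can branch back to themselves. This is not the Relooper algorithm (which is described in Section 3.2); the graph, its labels, and the helper names are all hypothetical.

```javascript
// Hypothetical control flow graph: block label -> labels it may branch to.
const cfg = {
  entry: ['check'],
  check: ['body', 'exit'],
  body: ['check'],
  exit: [],
};

// Returns the set of labels reachable from `start` by following one or
// more branches (so `start` itself is included only if it lies on a cycle).
function reachableFrom(graph, start) {
  const seen = new Set();
  const stack = [...graph[start]];
  while (stack.length > 0) {
    const label = stack.pop();
    if (seen.has(label)) continue;
    seen.add(label);
    stack.push(...graph[label]);
  }
  return seen;
}

// A block that can reach itself must sit inside some loop, so it has to
// end up within a loop construct in the generated code.
function blocksInLoops(graph) {
  return Object.keys(graph).filter(
    (label) => reachableFrom(graph, label).has(label)
  );
}

console.log(blocksInLoops(cfg)); // prints: [ 'check', 'body' ]
```

Here 'check' and 'body' form the loop, while 'entry' and 'exit' run at most once and can be emitted as straight-line code before and after it.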
Another challenge in Emscripten is to maintain accuracy
(that is, to keep the results of the compiled code the same
as the original) while not sacrificing performance. LLVM
assembly is an abstraction of how modern CPUs are
programmed, and its basic operations are not all directly
possible in JavaScript. For example, if in LLVM we are to
add two unsigned 8-bit numbers x and y with wraparound on
overflow (e.g., 255 plus 1 should give 0), then there is no
single operation in JavaScript which can do this – we cannot
just write x + y, as that would use the normal JavaScript
semantics, under which 255 plus 1 gives 256. It is possible to
emulate a CPU in JavaScript; however, doing so is very slow.
Emscripten’s approach is to allow such emulation, but to use
it as little as possible, and to provide tools that help one find
out which parts of the compiled code actually need such full
emulation.
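The 8-bit example can be checked directly. The following minimal sketch shows plain JavaScript addition next to one common way of emulating wraparound by masking; it is an illustration under our own naming (addPlain, addU8 and addU32 are hypothetical helpers), not necessarily the code Emscripten emits.

```javascript
// Plain JavaScript addition operates on double-precision numbers,
// so it never wraps around at 8 bits.
function addPlain(x, y) {
  return x + y;
}

// Emulated unsigned 8-bit addition: add, then keep only the low 8 bits.
function addU8(x, y) {
  return (x + y) & 0xFF;
}

// Emulated unsigned 32-bit addition: `>>> 0` reduces the sum modulo 2^32.
function addU32(x, y) {
  return (x + y) >>> 0;
}

console.log(addPlain(255, 1));      // prints: 256
console.log(addU8(255, 1));         // prints: 0
console.log(addU8(100, 27));        // prints: 127
console.log(addU32(4294967295, 1)); // prints: 0
```

Each emulated addition costs an extra operation, which is why it pays to apply such corrections only where the compiled code can actually overflow.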
We conclude this introduction with a list of this paper’s
main contributions:
• We describe Emscripten itself, detailing its approach to
compiling LLVM into JavaScript.
• We give details of Emscripten’s Relooper algorithm,
mentioned earlier, which generates high-level loop
structures from low-level branching data, and prove its
validity.
In addition, the following are the main contributions of
Emscripten itself, which to our knowledge were not
previously possible:
• It allows compiling a very large subset of C and C++ code
into JavaScript, which can then be run on the web.
• By compiling their runtimes, it allows running languages
such as Python on the web (with their normal semantics).
The remainder of this paper is structured as follows. In
Section 2 we describe the approach Emscripten takes to
compiling LLVM assembly into JavaScript, and show some
benchmark data. In Section 3 we describe Emscripten’s
internal design and in particular elaborate on the Relooper
algorithm. In Section 4 we give several example uses of
Emscripten. In Section 5 we summarize and give directions
for future work.
2. Compilation Approach
Let us begin by considering the challenge we face when
compiling LLVM assembly into JavaScript. Assume we are
given the following simple example of a C program:
2 2013/5/14