ABSTRACT. In this paper we would like to present a very short (possibly the shortest) self-interpreter, based on a simplistic Turing-complete imperative language. This interpreter explicitly processes the statements of the language, which means the interpreter constitutes a description of the language inside that same language. The paper does not require any specific knowledge; however, experience in programming and a vivid imagination are beneficial.
For a few decades many people have been fascinated by quines -- self-reproducing programs [1]. This is not very surprising, because such programs possess a few attractive features. They have somewhat paradoxical self-referencing, which makes them not easy to write. They can be written in different languages. They have a definite minimal size in each language. And they do not have a reasonable practical meaning.
In this paper we explore the question one step further: if a quine contains the information about the program itself, then what would a program containing the information about the language it is written in look like? Such a program we here call a self-interpreter.
Writing a small quine involves choosing an algorithm which can use a coded version of itself to output both the coded and uncoded versions of itself, but which is not too long in either version. Writing a small self-interpreter involves choosing a language which is simple to parse and execute, but which is powerful enough to express these processes concisely.
Why an interpreter? Compilers are actually translators from one language into another. They reveal more about the details of the output language than about the behaviour of the programs of the input language. So an interpreter is more like a pure description of the language -- that is, the correspondence between the text of programs and their semantics. We might note also that a compiler for a Turing-complete language can sometimes be written in a non-Turing-complete language, whereas an interpreter cannot.
The self-interpreter which we present in this paper was originally written by Daniel Cristofani in July 2002 and later improved and shortened to its current form.
Every language lives in its own environment. A definition of the language may be made in relation to an abstract machine, but any real implementation must take into account where that abstract machine is being realized. In other words there should be well-defined rules which describe how to instantiate the computational process and how to observe the behavior. In real computers those rules are compilers, interpreters, command shell, and inputs and outputs of programs. This kind of information is called the pragmatics of the language and is sometimes included in the full language description. However, it stands outside the language and is usually ignored in the formal description. Pragmatics is the third part of the language constitution; the other two are syntax and semantics.
A language specification is a description of an abstract machine, which can parse a text -- a sequence of symbols that represents a program written in that language -- and produce some observable behavior, i.e. output. A Turing-complete language refers to a Turing-complete abstract machine.
An interpreter is a realization of the abstract machine. It consumes a program and produces the corresponding behavior. The abstract machine can be realized in many different ways with different algorithms. The only thing the different realizations must have in common is their behavior -- the same valid program must produce the same behavior from different interpreters.
Since any interpreter constitutes a particular implementation of an abstract machine, one can use the interpreter's behavior to define the language's semantics while ignoring the interpreter's internal structure. A full language definition of this kind will also include a definition of the language's syntax and areas of undefined behavior. So any other interpreter will be a correct language implementation as long as it produces the same behavior for all syntactically correct programs whose behavior is defined.
Here we define self-interpreter as an interpreter which is written in the same language it interprets, and which also meets these three requirements:
The second requirement looks like a tautology saying that a self-interpreter is an interpreter. Nevertheless this requirement rules out languages whose pragmatics are not sufficient to permit a self-interpreter. Later we show that not every Turing-complete language can have a self-interpreter in this sense. This property forms a distinction between languages: some languages are environmentally complete, while others are not. Thus, Turing-completeness comes from the semantics of the language and environmental completeness from its pragmatics.
For the self-interpreter implementation we have chosen the brainfuck language as a basis. This language is very simple in comparison to other programming languages. At the same time computationally it has been proven to be Turing-complete [2]. We have changed it slightly to make a self-interpreter possible, but we did not adapt it to get the best result -- to get an even shorter self-interpreter -- for two reasons. First, we did not want to depart unnecessarily from the already known and popular brainfuck language, nor from the strictest portability practices within that language [3].
Second, in this paper we would like to present general ideas leading to a short real working example. We will discuss possible shortenings in the section 3.4.
The language realizes an abstract machine operating on an array, unbounded on the right, of memory cells, each of which has a capacity of at least a byte and is initialized to zero. There is a pointer, which initially points to the leftmost memory cell. The machine has an input stream and an output stream of characters. The computational process consists of reading a program from input, then recognizing instructions and interpreting them.
The program which is fed as input to the machine is a sequence of characters and consists of two parts: the code and the data. These parts are separated by the first '!' character.
There are 8 instructions recognized by the machine (their codes correspond to ASCII):
> | move the pointer one cell to the right; |
< | move the pointer one cell to the left; |
+ | increment the memory cell under the pointer; |
- | decrement the memory cell under the pointer; |
, | set the memory cell under the pointer to the ASCII value of the next character obtained from the data part of the input; |
. | output the contents of the memory cell under the pointer as an ASCII character; |
[ | if the memory cell under the pointer is zero, go to the command following the matching ']', otherwise to the next instruction; |
] | if the memory cell under the pointer is not zero, go to the command following the matching '[', otherwise to the next instruction. |
All characters in the code, except these 8 specified above and '!', are ignored. Instruction ',' consumes a character from the data part of the input. For example, the program ",>,!ab" will result in the array "97 '98 0 0 ... " (the apostrophe represents the pointer).
After evaluating the instruction the process control goes to the next instruction for '<', '>', '+', '-', '.', ',', and stops after the last instruction, which stands before '!'. For instructions ']' and '[' the process control goes either to the next instruction or to just after the matching '[' or ']' instruction, depending on the cell value under the pointer.
The allowed depth of nested brackets is undefined, but any implementation must be able to handle not less than 124 (a depth of 7 is proved sufficient for Turing-completeness [4]).
If the input is finished or an input error occurred, then the input instruction must not change the value of a cell if it was zero before the input instruction; otherwise, the behavior is implementation-dependent.
There are some other features of this language which are left undefined, implying that the behavior of the interpreter is undefined in some cases. Every implementation is free to choose any behavior in such situations, and will still be counted as conforming to the language. These features are: cell size, and behavior when the program moves the pointer left from the first cell, decrements a zero, increments a maximal cell value, or has unbalanced brackets.
Simple examples of the programs are: ",+.!a" outputs "b", "a!" does nothing, ",[>+>+<<-]>.>.!X" outputs "XX", ">,[.>,]<[<]>[.>]!>,[.>,]<[<]>[.>]!" is a quine.
The original brainfuck language [5] is almost the same as described above, except that it does not define the separator '!'. Instead, a brainfuck program is entirely separate from the input it processes; either the program is run through a compiler and the input is given to the resulting executable, or the program and its input are given to an interpreter through separate input streams, the program typically being read from a file and its input usually from the keyboard.
A brainfuck interpreter written in brainfuck cannot duplicate this behavior, because its interaction with its environment is limited; only one input stream is available to it (or to any brainfuck program), so it must receive both the program and its input through this stream. Frans Faase's original brainfuck interpreter in brainfuck [6] expects them to be separated with '!'; dbfi does the same, being made to the same specification (including strict portability).
Only a brainfuck dialect with this addition, or some other addition or modification to do the same job, can have a self-interpreter in our sense, because of requirement (2). The (normal) brainfuck language cannot be used to write a brainfuck interpreter which interacts with its environment in the same way that a normal brainfuck interpreter does; the pragmatics of the language do not allow it.
The self-interpreter which we describe in this paper is called dbfi. It is given in its entirety in section 3.2.
The dbfi interpreter consists of two parts. The first reads the code of the program to be simulated from the input and stores it in memory in coded form. The second part decodes and executes the instructions sequentially.
The memory layout for dbfi is comparatively simple. The instructions of the program to be interpreted are stored as small numbers according to the table:
] [ > < . - , + 1 2 3 4 5 6 7 8These are stored consecutively from the start of the array, except that the simulated instruction pointer is represented as a pair of zeroes just to the left of the next instruction to execute; so initially it is at the far left, before any coded instructions.
After the code for the last instruction, there are another two zeroes, followed by the simulated data array. Each simulated cell is represented by two cells: the right one holds the value of the simulated cell, and the left one is a marker cell, set to 2 for the cell where the simulated pointer is (and for cells to the left of that), and to 0 for cells right of the simulated pointer. This makes it easy to find the cell where the simulated pointer is.
For example, if dbfi executes the program ",[.[-],]!a" and the simulated program is about to output an "a" with the period instruction, the array would look like:
7 2 0 0 5 2 6 1 7 1 0 0 2 97 0 0 0 0 0 0 ...
Throughout the explanation, the simulated instruction pointer and the simulated data pointer are called the simulated IP and the simulated pointer; the pointer, with no adjective, means the real data pointer. To distinguish dbfi codes for instructions (1-8) from ASCII codes of instructions in the program, we call the former instruction codes.
In conjunction with the following explanation, it would be beneficial to watch dbfi as it executes a simple program, by using some implementation that can display the array.
This is the main result of this paper -- the code of dbfi:
>>>+[ [-]>>[-]++>+>+++++++[<++++>>++<-]++>>+>+>+++++[>++>++++++<<-]+>>>,<++[ [>[->>]<[>>]<<-]<[<]<+>>[>]> [<+>-[[<+>-]>]<[[[-]<]++<-[<+++++++++>[<->-]>>]>>]]<< ]< ]<[ [<]>[[>]>>[>>]+[<<]<[<]<+>>-]>[>]+[->>] <<<<[[<<]<[<]+<<[+>+<<-[>-->+<<-[>+<[>>+<<-]]]>[<+>-]<]++>>-->[>]>>[>>]] <<[ >>+<[[<]<]>[[<<]<[<]+[-<+>>-[<<+>++>-[<->[<<+>>-]]]<[>+<-]>]>[>]>] >[>>]>> ] <<[>>+>>+>>]<<[->>>>>>>>]<<[>.>>>>>>>]<<[>->>>>>]<<[>,>>>]<<[>+>]<<[+<<]< ]
The first part of dbfi is concerned with setting up the initial memory layout: simulated IP (two zeroes), instruction codes, another two zeroes, and the first marker cell set to 2. The basic idea is that by assigning the codes for different instructions in reverse ASCII order, we can store the differences between the ASCII codes for the different instructions in an array, then subtract them from the input character one after another, and check for an exact match each time.
Instruction ! + , - . < > [ ] ASCII 33 43 44 45 46 60 62 91 93 instruction code 8 7 6 5 4 3 2 1
So.
>>>+[ | (1) |
[-]>>[-] |
++>+>+++++++[<++++>>++<-]++>>+>+>+++++[>++>++++++<<-]+>>>,<++[ | (2) |
[>[->>]<[>>]<<-] | (3) |
After this subtraction, there are three possibilities.
<[<]<+>>[>]> |
[<+>- | (4) |
[[<+>-]>] | (5) |
<[ | (6) |
[[-]<] |
++<-[ | (7) |
<+++++++++>[<->-]>> |
]>> |
] |
]<< |
]< |
Once we've exited the loop the possibilities are:
c 0 '2 0 0 0 ... if we matched an instruction (whose code is now in c);
'0 0 2 0 0 ... if we matched a '!';
0 'r 0 i 0 ... if we got through all the differences without matching anything. i may or may not be 0.
]< |
[ | (8) |
[<]>[[>]>>[>>]+[<<]<[<]<+>>-] | (9) |
For example, if the simulated array has "97 '98 0 0 0 ..." and the interpreter is about to execute '<' instruction in the piece of code "[<]", then the real array may look like:
2 0 0 4 '1 0 0 2 97 2 98 0 0 0 0 0 0 0 0 0 0 ... before and
2 4 0 '0 1 0 0 2 97 2 98 1 0 1 0 1 0 1 0 0 0 ... after (9).
>[>]+[->>] |
<<<<[ |
The basic method for moving the simulated IP to the matching bracket is the same for '[' and ']'. We move instruction codes past the simulated IP one at a time, while counting the depth of the brackets (keeping the count in one cell of the simulated IP). When that depth reaches zero, we've just passed the matching bracket. We have to start just inside the brackets, with a depth of 1.
[<<]<[<]+<< |
[ | (10) |
+>+<<-[ | (11) |
>-->+<<-[ | (12) |
>+<[>>+<<-] |
]]>[<+>-]<] |
++>>-->[>]>>[>>]] |
<<[ |
>>+ |
<[[<]<]>[ | (13) |
[<<]<[<]+ |
[-<+>>-[<<+>++>-[<->[<<+>>-]]]<[>+<-]>] |
>[>]>] |
>[>>]>>] |
<<[>>+>>+>>] |
<<[->>>>>>>>] |
<<[>.>>>>>>>] |
<<[>->>>>>] |
<<[>,>>>] |
<<[>+>] |
<<[+<<]<] |
There are several obvious possibilities for making this self-interpreter shorter. The first part of the interpreter would be almost trivial if one used a different instruction encoding (not ASCII). It would also be possible to shorten the first part if the language were changed to forbid other characters (comments) in the code of the program. The language also leaves behavior undefined in cases like decrementing zero or incrementing the maximal cell value, so the interpreter spends some extra instructions to avoid such behavior.
In this paper we did not pursue the aim of getting the shortest possible self-interpreter by any means. Since many people know and acknowledge the brainfuck language, we tried to avoid restrictions and modifications of this language.
In this paper we have presented a very short self-interpreter, perhaps the shortest one known at the present moment. It is not trivial or even simple, but of course no one would expect great simplicity in simulating a Turing-complete machine. We do not deny the possibility of an even shorter self-interpreter. One way to get a shorter self-interpreter is to find more shortcut tricks or a better algorithm. Another way is to invent a different language allowing for the construction of a shorter self-interpreter. The interesting point here is that such a language must be simple enough to keep its (self-)description short, and at the same time it must be rich enough to express its description in a concise manner. In other words, the self-interpreter of a complex language will be long because of the large number of language features, and the self-interpreter of a very simple language may be long because of poor expressiveness. In this paper we have also shown that not every Turing-complete language can have a self-interpreter.
This interpreter written in C++ is an example of a dbfi-brainfuck interpreter. It is not optimized for speed or memory. Its specifications are the following. The memory cell is equal to a byte. Decrementing zero or incrementing a byte's maximal value is wrapped. The behavior is undefined for input with unbalanced brackets. Bad input does not change the cell value (as C++ function istream::get). And there is no hardcoded memory limit on the data array.
#include <iostream> #include <string> #include <map> int main() { std::string c; std::mapm; std::getline(std::cin,c,'!'); int p = 0, l; for( int i = 0; i < c.size(); i++ ){ l=1; if( c[i] == ']' && m[p] ) while(l){ i--;l+=(c[i]==']')-(c[i]=='['); } if( c[i] == '[' && !m[p]) while(l){ i++;l-=(c[i]==']')-(c[i]=='['); } if( c[i] == '+' ) m[p]++; if( c[i] == '-' ) m[p]--; if( c[i] == '.' ) std::cout << m[p]; if( c[i] == ',' ) std::cin.get(m[p]); if( c[i] == '>' ) p++; if( c[i] == '<' ) p--; } }
1. See, for example, http://www.nyx.net/~gthompso/quine.htm
2. See http://home.planet.nl/~faase009/Ha_bf_Turing.html
and http://www.hevanet.com/cristofd/brainfuck/brainfuck.html#turing
3. See http://www.muppetlabs.com/~breadbox/bf/standards.html
and http://www.hevanet.com/cristofd/brainfuck/epistle.html
Note that dbfi follows the specification of Frans Faase's original brainfuck interpreter in brainfuck (http://home.planet.nl/~faase009/Ha_bf_inter.html), which is quite portable in most respects. (Its only departure from portability is that the number of cells available to simulated programs is limited by the capacity of the cells of the underlying implementation; dbfi does not have this limit.) Faase's interpreter originated the use of '!' as a code-data separator.
4. See http://www.hevanet.com/cristofd/brainfuck/brainfuck.html#turing
5. See, for instance:
the canonical brainfuck distribution at http://wuarchive.wustl.edu/pub/aminet/dev/lang/brainfuck-2.lha
Frans Faase's page at http://home.wxs.nl/~faase009/Ha_BF.html
Brian Raiter's page at http://www.muppetlabs.com/~breadbox/bf/
and Daniel Cristofani's page at http://www.hevanet.com/cristofd/brainfuck/
6. At http://home.planet.nl/~faase009/Ha_bf_inter.html