Tools for Compiler Construction

We are trying to create and implement (and eventually use) a new programming language for the implementation of logical algorithms and computer algebra. Logic is special because it requires tree structures that can have many different forms. My present (2023) view is that using these data structures reveals a weakness that all current programming languages with static type checking have: We don't know how to handle union and we don't know how to handle partiality combined with static type checking. The problem of partiality exists in many applications, but it shows up most clearly in the implementation of logic.

All existing approaches (union, std::variant, inheritance, type casts, std::optional, std::expected, use of nullptr, use of special value like std::string::npos) are unsatisfactory, because they cannot be statically type checked.

At this moment (summer 2023), the status of the compiler project is as follows: I am not sure if static type checking can be extended to include union (std::variant) and partiality, but I did not give up hope. I will keep on trying to solve the underlying theoretical problems. It will be necessary to develop a theory of controlled ownership at the same time, because side effects cannot be combined with refinement types.

Despite the unsolved theoretical problems, the project resulted in techniques and software that make it easy to implement logic in C++, so that in a broader sense, the problem of efficiently implementing logic is solved. Students have started implementing proof checkers, and using the developed techniques.

We have developed the following software: A generator for tree-like data types in C++, and a parser/tokenizer generator for C++. The parser generator works like Yacc, but fully supports C++ and RAII. It creates LALR parsers, and in addition allows extending the language at run time, which is needed for example by Prolog.

The tokenizer generator does not create a complete tokenizer, but only a classifier, which reads part of the input, and determines its token. Generating a full tokenizer would result in a generator that is too rigid to be used. It is possible to extend the generated classifier by hand. This is needed non-regular tokens, or for tokenizers that consider e.g. indentation.

The tree-data type generator reads a set of specifications of tree-like (algebraic) datatypes, and creates C++ implementations of these datatypes. The specifications are as concise as in a functional language (if not more concise), while at the same time compiled to efficient C++ code.

The parser/tokenizer generator has been used quite extensively by now, and we have used the tree generator for definition of Zermelo-Fränkel Set Theory. As more tools are being completed, they will gradually appear on this page. In addition to the materials (code and documents) that are directly related to the development of our programming language and implementation of logic, The slides were written while teaching courses compiler construction at Instytut Informatyki Uniwersytetu Wrocławskiego and the School of Engineering and Digital Sciences at Nazarbayev University. Nearly all of the code on this page was written while working at Nazarbayev University. This page was created by Hans de Nivelle.

Presentation at C++ Now (2023)

I gave a presentation at C++ Now with title 'Trees for Logic and Parsing'. The slides and the code fragments that were shown in the lecture are here, and here is a recording of the presentation. Note, however that the tree generator has been improved significantly since the presentation.

Presentation at C++-now (2022)

I gave a lecture at C++ Now about parsing, tokenizing and Maphoon. The slides and the code fragments that were shown in the lecture are here, and this is a recording of the presentation.

Parser Generation for C++ (Maphoon)

After having studied existing parser generators, I concluded that none of them fulfilled our needs, and we decided to write our own. We aimed to design it in such a way that it will be useful for other compiler implementors as well. The parser generator is called Maphoon. It has the following features:

Features

The constructed parser is bottom-up. Bottom-up parsing is theoretically and practically superior to top-down parsing. There is no need to adapt the grammar, and attribute computation can be specified with the grammar rules.
It is possible to use modern C++ in action code. Concretely, it is possible to use attribute types with resource invariants (copy constructor, copying assignment, destructors) without effort. (Think of STL-containers, smart pointers, etc.) If all attribute types are movable, the parser will take advantage of this.
The parser generator and the constructed parsers use portable C++ without need for additional libraries.
It is easy to maintain source information in symbols for the generation of error messages.
It is possible to define operators at run time. Our own language does not use this, but for example Prolog does. Even if one does not use it, it often results in simpler parsers that can be more easily modified.
It is possible to define meaningful error messages. Defining error messages is based on regular expressions that trigger expectations. The error messages explain what was expected, and what was obtained instead.

While designing Maphoon, we took CUP as main example of a nice parser generator. We wanted the same in C++ and added a few extra features.

Tokenizer Generation

In addition to Maphoon, we also created a tokenizer generator. We tried to meet the following goals:

The constructed tokenizers must be flexible.
It must be easy to start using it.
It must be possible to construct efficient tokenizers.

Our experience with tokenizer generators is that they are borderline useful. Main problem is that the constructed tokenizers are usually too rigid. Often there are some missing features that one wants to add but cannot, the language may have a few non-regular symbols, and the user has no control over the way source information is stored in tokens. In order to allow flexibility, we don't generate a complete tokenizer, but only a function that reads input into a buffer and reports which regular expression (symbol) it belongs too. The user can add other functions that read other things into the buffer (or a function that reads a comment without storing it in the buffer.) In order to make it easy to start using it, regular expressions are initially not compiled. One constructs pairs of regular expressions and symboltypes in code, and calls readandclassify( ). This is easy, but not efficient, because regular expressions are translated into automata every time the code is started. In order to obtain efficient tokenizers, it is possible to print the automaton as C++ code (readandclassify applied on the given automata), and include the C++ code. This can be done when the tokenizer is finished, and the resulting tokenizer is efficient. (The idea of directly generating code instead of tables was taken from re2c.)

Download

Maphoon is published under the Aladdin Free Public License

In order to use the parser generator, you need the lexer generator. The sources of the lexer generator are here, and the manual is here.

The sources of the parser generator are here, and the manual is here. In order to compile the parser generator, edit the Makefile and set Lexing equal to the path containing directory lexing2023, and Maphoon equal to the path containing the directory maphoon2023. Examples can be downloaded from here. Prolog and Multimodal logic also need TVM and Utilities. In order to run, edit the Makefiles to set the path variables, and type 'make'. The zip-file contains the following examples:

The obligatory calculator.
An example which parses modal logic. This example uses the tree generator, but it can be compiled without.
A prolog parser. There is no interpreter, only a parser.
In order to add operators, type op( unsigned, type, names ), where unsigned is the priority, type is in fx,fy,xfx,xfy,yfx,xf,yf, and names is either a single operator, or a list of operators.
In order to quit, type halt. or #.
A simulator for non-deterministic Turing machines. Turing machines are among the simplest models of computation, and are frequently used in teaching. Our Turing Machine simulator is fast, easy to use, and it can handle non-deterministic Turing machines. There exist online Turing Machine simulators, but they are too slow for my patience, and cannot handle non-determinism. Our simulator uses a modified model. It is a slight generalization of the model in Sipser (Introduction to the Theory of Computation).
This is a Turing machine that recognizes words of form a^i.b^i.c^i: Explanation and the input file.
Next we give a Turing machine recognizes words of form v#w, where v is a substring of w. It does this by non-deterministically throwing away a prefix of w, and then checking what w starts with v: Explanation and the input file.
The simulator reads an input file, which must contain a description of the TM together with a set of input words. For each input word, the simulator either prints an accepting computation, or the set of reached configurations. The simulator never computes more than 10000 configurations, but this number can be easily changed in the code.
If you are not interested in the parser tools and want only the simulator, the sources can be compiled without Maphoon. The simulator reads the input from stdin, and prints the output in stdout.
The Turing machine simulator was written with the help of Dinislam Madikhanov, Kenessary Myrzakul, and Dastan Sultanov.

Tree Generator

TreeGen is a C++ program that automatically creates C++ class definitions from definitions of recursive data types. TreeGen is by itself written in C++. The generated class definitions use automatic memory management (RAII), and start sharing subtrees when assigned or copied, while at the same time strictly preserving value semantics, which means that there will be no side effects caused by sharing. Memory management is done by means of reference counting.

Sources are available under the Aladdin Free Public License and can be downloaded in a single zipfile. A manual is here. In order to compile the tree generator, download the Maphoon lexer, and download some utilities. In the Makefile, set Lexing to the path containing lexing2023, and set Util to the path containing util. If you want to run the exmaples, you also need tree virtual machine.

Slide Sets

Below are some slides about compiler construction that we used when teaching compiler construction.

Attribute Grammars

Slides about attribute grammars.

Top-Down Parsing

One usually has to redesign the grammar in order to make it suitable for top-down parsing. One must merge common prefixes in rules, and remove left recursion. Slides

Bottom-Up Parsing

Slides about bottom-up parsing. Read these, if you want to understand the theory behind Maphoon.

Semantic Analysis

These slides give an overview of type checking for C. The slides are a bit simplified, but contain all the main features of C, like the lval/rval distinction and the implicit conversions from array to pointer.

Translation of C into Stack Machine

The first set of slides introduces an untyped stack machine. The second set of slides adds types to the stack machine, and explains how to translate C into the stack machine.

Lambda Calculus

In Fall 2022, I taught lambda calculus to master students. Here are slides about how to define inductive types in lambda calculus, and these are slides about the Hindley-Milner type checking algorithm. I think that the natural way to present the HM algorithm is by using a clause based approach.

Acknowledgements

Thanks to Danel Batyrbek, Witold Charatonik, Aleksandra Kireeva, Tatyana Korotkova, Akhmetzhan Kussainov, Dina Muktubayeva, Cláudia Nalon, Olzhas Zhangeldinov.

Special thanks for Zoltan Teki for extensively testing Maphoon, and requesting additional features. He found a bug in the tokenizer generator, and thanks to him Maphoon now supports move-only attributes.

This research was sponsored by Nazarbayev University through the Faculty Development Competitive Research Grant Program (FDCRGP), through grant number 021220FD1651.