Beancount V3: Dependencies

Martin Blais, June 2020

Beancount v3 is going to be rewritten in C++. Here is the set of dependencies I've tested and that I'm comfortable maintaining in the long run:

Base environment

  • Bazel build (https://github.com/bazelbuild/bazel): The Google build system is the most stable approach to building that I know of, much better than SCons and certainly much better than CMake. It allows you to pin down a specific set of dependencies by explicitly referencing other repositories and git repos at specific commit revisions (including non-released ones), and the sometimes annoying constraints it imposes result in hermetically reproducible builds in a way no other build system can match. This minimizes surprises and hopefully the number of platform-dependent portability issues. It also minimizes the number of pre-installed packages we assume your system has (e.g. it'll download and compile its own Bison). It runs fast, computes the minimal set of tests and targets to rebuild, and is highly configurable.

The downside of choosing Bazel is the same as with other Google-issued open source projects: the original version of the product is internal, and as a result there are a lot of strange idiosyncrasies to deal with (e.g. //external, the @bazel_tools repo, etc.), many of which are poorly documented outside the company, along with a good number of unresolved tickets. However, at this stage I've already managed to create a working build with most of the dependencies described in this section.

  • C++14 with GCC and Clang/LLVM: Both compilers will be supported. Clang provides a much better front-end and stdlib implementation but is a little slower to build. GCC is more commonly present in the wild but the error messages are… well, we all got used to this I suppose. Note that despite requiring C++14, I will refrain from using exotic features of the language (including classes). There may be questions about Windows support.

  • Abseil-Cpp base library (https://github.com/abseil/abseil-cpp): This base library of functions comes from Google's own gigantic codebase and has been battle-hardened and tested like no other; this is what the Google products run on. It provides a very stable API to work with (it's unlikely to change much given how much code depends on it), one which complements the C++ standard library well, and whose existing contact surfaces are bound to remain pretty static. It's simpler and more stable than Boost, and doesn't offer a myriad of libraries we're not going to need anyway (plus, I love Titus' approach to C++). It fills in a lot of the basic string manipulation functions you get for free in Python but crave in C++ (e.g. absl::StrCat).

  • Google Test (https://github.com/google/googletest): This is the widely used C++ testing framework I'm already familiar with, which supports matchers and mocking.

Data representation

  • Protocol Buffers (https://github.com/protocolbuffers/protobuf): I will maintain a functional style in this C++ rewrite and I need a replacement for Python's namedtuples to represent directives. This means defining a lot of simple naked structured data that will need to be created dynamically from within tests (there's a good text-format parser) and also serialized to disk, as the boundary between the core and the query language will become a file of protobuf messages. Protobuf provides a good hierarchical data structure with repeated fields that is supported in many languages (this potentially opens the door to plugins written in e.g. Go), and it's possible to provide Python bindings for them.
    It will also become the interface between Beancount's core and the input to the query language. We will be using proto3 with version >=3.12 in order to have support for optional presence fields (null values).

  • Riegeli (https://github.com/google/riegeli): An efficient and compressed binary format for storing sequences of protobuf messages in files. I think the Beancount core will output this; it's compact and reads fast. It's also another Google thing that ought to receive more attention than it does, and it supports C++, Python and protobufs.

  • mpdecimal (https://www.bytereef.org/mpdecimal/): This is the same C-level library used by Python's implementation of Decimal numbers. Using this library will make it easy to interoperate between the C++ core and Python's runtime. I need to represent decimal numbers in C++ memory with minimal functionality and a reasonably small range (BigNum classes are typically more than what we need). We don't need much of the scope for decimal: basic arithmetic operations + quantizing, mainly.
    There are other libraries out there: GMP, decNumber. There is some information in this thread: https://stackoverflow.com/questions/14096026/c-decimal-data-types. For on-disk representation, I will need a protobuf message definition for those, and I'm thinking of defining a union of a string (nice to read but lots of conversions from string to decimal) with some more efficient exponent + mantissa decimal equivalent.
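
As a rough illustration of the decimal operations and the two candidate representations described above, here is a minimal Python sketch using the standard decimal module (whose C implementation is built on mpdecimal). The names DecimalParts, to_parts and from_parts, and the sign/mantissa/exponent layout, are hypothetical stand-ins for whatever protobuf message ends up being defined; they are not part of Beancount.

```python
from decimal import Decimal, ROUND_HALF_EVEN
from typing import NamedTuple


class DecimalParts(NamedTuple):
    """Hypothetical compact form: (-1)**sign * mantissa * 10**exponent."""
    sign: int
    mantissa: int
    exponent: int


def to_parts(number: Decimal) -> DecimalParts:
    """Split a finite Decimal into sign, integer mantissa and exponent."""
    sign, digits, exponent = number.as_tuple()
    mantissa = int("".join(map(str, digits))) if digits else 0
    return DecimalParts(sign, mantissa, exponent)


def from_parts(parts: DecimalParts) -> Decimal:
    """Rebuild the Decimal exactly, preserving trailing zeros."""
    digits = tuple(int(d) for d in str(parts.mantissa))
    return Decimal((parts.sign, digits, parts.exponent))


# Basic arithmetic followed by quantizing to two decimal places.
price = Decimal("12.345")
units = Decimal("3")
total = (price * units).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

# The string form is readable; the parts form avoids string parsing.
as_string = str(total)
as_parts = to_parts(total)
assert from_parts(as_parts) == total
```

The string form is what a person would want to see in a text dump, while the parts form is closer to what a decimal library keeps in memory, so a union of the two would let writers pick either one.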

Parser

  • RE/flex lexer (https://github.com/Genivia/RE-flex): This modern regexp-based scanner generator supports Unicode natively and is very fast and well documented. It provides a great alternative to the aging GNU flex, which made it difficult to support non-ASCII characters outside of string literals (i.e., for account names). I've had success using it on other projects. Many users want account names in their home language; this will make it easy to provide a UTF-8 parser for the entire file.

  • GNU Bison (https://git.savannah.gnu.org/git/bison.git): We will stick with GNU Bison, but will use the complete C++ modes it supports instead of generating C. I'm hesitant to continue with this parser generator as it's showing its age, but it's pretty stable and I can't quite justify the extra work to move to ANTLR.
    We will have to pull some tricks to support the same grammar for generating C code for v2 and C++ code for v3; the parser code could be provided with a dispatch table of functions, which would be static C functions in v2, and methods in v3. Some of the generation parameters (% directives) will be different (see here for an example).

  • International Components for Unicode (ICU) (https://github.com/unicode-org/icu.git): This is the standard library to depend on for Unicode support. Our C++ code will not use std::wstring/wchar_t, but rather regular std::string and function calls to this library where necessary.
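
To make the dispatch-table idea from the Bison item above a little more concrete, here is a small sketch of the pattern. It is written in Python rather than C or C++ purely for brevity, and every name in it (run_reductions, Builder, the "open"/"close" rule names) is hypothetical. The point is only that the driver looks up its actions in a table, so the same driver can be handed free functions (as in v2) or bound methods (as in v3).

```python
from typing import Callable, Dict, List, Tuple

# A reduction handler takes the matched values and returns a built node.
Handler = Callable[[List[str]], object]


def run_reductions(events: List[Tuple[str, List[str]]],
                   table: Dict[str, Handler]) -> List[object]:
    """Stand-in for generated parser code: it only knows rule names and
    looks up what to build in the table it was handed."""
    return [table[rule](values) for rule, values in events]


# Variant A: free functions (analogous to static C functions in v2).
def build_open(values: List[str]) -> object:
    return ("open", values[0])


def build_close(values: List[str]) -> object:
    return ("close", values[0])


FUNCTION_TABLE: Dict[str, Handler] = {"open": build_open, "close": build_close}


# Variant B: bound methods on a builder object (analogous to methods in v3).
class Builder:
    def __init__(self) -> None:
        self.accounts: List[str] = []

    def open(self, values: List[str]) -> object:
        self.accounts.append(values[0])
        return ("open", values[0])

    def close(self, values: List[str]) -> object:
        self.accounts.remove(values[0])
        return ("close", values[0])

    def table(self) -> Dict[str, Handler]:
        return {"open": self.open, "close": self.close}


events = [("open", ["Assets:Cash"]), ("close", ["Assets:Cash"])]
result_a = run_reductions(events, FUNCTION_TABLE)
builder = Builder()
result_b = run_reductions(events, builder.table())
assert result_a == result_b
```

Both tables produce the same output here; the method-based variant can additionally carry state on the builder object, which is the main thing the C++ version would gain.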

Python

  • Python3 (https://www.python.org/): Not much to say. I will keep using the latest version. Python is a tank of an extension language and I have no plans to change that.

  • pybind11 (https://github.com/pybind/pybind11): I want to provide a Python API nearly identical to the current one in Beancount, or better (which means simpler). One of the requirements I've had, for plugins, is to make it cheap to pass a list of protobuf objects for the directives to a Python callback, without copying (serializing and deserializing) between C++ and Python. I've investigated multiple libraries to interoperate between Python and C++ (Cython, CLIF, SWIG, etc.) and serialization is a problem (see this partial solution). The one that seems to have the most momentum at the moment is pybind11, a header-only library that evolved from Boost.Python and offers the most control over the generated API. It also works well with protocol buffer targets built with fast_cpp_protos: only pointers are passed through, so plugins passing in and out the full list of directives should be possible. I also happen to be familiar with Boost.Python, having used it 20 years ago; pybind11 is really quite similar (but does away with the rest of Boost).

  • Type annotations, pytype (or mypy?): The Python code already complies with a custom PyLint configuration, but the codebase does not use the increasingly ubiquitous type annotations. In the subset of the code that will remain after the rewrite, I'd like to have all functions annotated and to replace the sometimes redundant Args/Returns docstrings with more free-form documentation (the types may be sufficient to avoid the formalism of Args/Returns blocks). I'll have to see how this affects the auto-generated docs.
    An important addition is that I want to start not only annotating, but running one of the type checkers automatically as part of the build. I'm already familiar with Google's pytype, but perhaps mypy is a good alternative. In any case, the only hurdle for that is to craft Bazel rules that invoke these automatically across the entire codebase, as part of py_library() and py_binary() rules. I'll also attempt to make pylint run in the same way (as part of the build) with a custom flag to disable it during development, instead of having a separate lint target.

  • Subpar (https://github.com/google/subpar): It's not clear to me yet how to provide a pip-compatible setup.py for a Bazel build, but surely we can find a way to build wheels for PyPI using the binaries built by Bazel. For packaging a self-contained binary of Python + extensions, the "subpar" Bazel rules are supposed to handle that. However, at the moment they do not support C extensions.
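
Returning to the type annotation item above, here is a minimal sketch of what an annotated function with a free-form docstring might look like. The Posting type and the sum_by_currency function are simplified, hypothetical examples made up for this illustration and do not reflect the actual Beancount API. The code uses only standard typing constructs, so it should be acceptable to either pytype or mypy.

```python
from decimal import Decimal
from typing import Dict, List, NamedTuple, Optional


class Posting(NamedTuple):
    """Hypothetical, simplified posting used only for this sketch."""
    account: str
    units: Decimal
    currency: str
    cost: Optional[Decimal] = None


def sum_by_currency(postings: List[Posting]) -> Dict[str, Decimal]:
    """Total the units of the postings, keyed by currency.

    The signature already states the argument and return types, so the
    docstring is free to describe behavior instead of repeating them.
    """
    totals: Dict[str, Decimal] = {}
    for posting in postings:
        totals[posting.currency] = (
            totals.get(posting.currency, Decimal(0)) + posting.units)
    return totals


postings = [
    Posting("Assets:Cash", Decimal("-20.00"), "USD"),
    Posting("Expenses:Food", Decimal("20.00"), "USD"),
    Posting("Assets:Broker", Decimal("3"), "HOOL", cost=Decimal("100.00")),
]
totals = sum_by_currency(postings)
```

With the types in the signature, an Args/Returns block would only restate what the checker already enforces.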