Goodness in programming languages, part 2 – getting your code running

Posts in this series: Syntax, Deployment, Metaprogramming, Ownership

The fancy word for what I’m talking about here is Deployment. How easy is it, once you’ve written your code, to get it running on someone else’s computer? What barriers are there to someone else using your program?

Examples of potential barriers include:

  • Having to download some other program first, e.g. a runtime or some kind of dependency.
  • The actual program being a huge download.
  • Something mysteriously not working because of different versions of something on the person’s computer.
  • The program simply not working on the operating system or machine architecture of the computer.

Anything that can be done to remove these barriers makes my life as the supporter of the program easier, and makes it more likely people will use it.

Here are some things I like:

  • Java’s ability to run almost anywhere. Once you have the runtime, Java code is relatively easy to get running on many different operating systems and architectures, using almost-identical code. Other runtime-based languages are also strong here.
  • Java and Python’s large built-in libraries. Both Java and Python include large amounts of functionality in their standard libraries, reducing the need to depend on third-party components.
  • Python and Perl’s ability to work out of the box on most Linux systems. If you are developing for Linux, you can pretty much guarantee that an up-to-date runtime will be available, meaning no dependencies at all are needed.
  • Many languages’ easy integration with Linux packaging. Most of the above barriers disappear if you can install dependencies using your operating system’s built-in package manager.
  • Quickly’s easy way to build your program into the packaging system. Things really become easy if your program itself is integrated into the packaging system.
  • C and C++’s lack of dependencies. It is possible to write C and C++ programs that depend on nothing except standard runtime libraries, which are almost guaranteed to exist on most machines.

One way to handle dependencies is to package them together with your code. This is what most corporate Java software does – the specific Java runtime that is known to work is packaged with the program itself. This makes for big downloads, and defeats the concept of providing a single package for all platforms, but it does work. It also makes for huge headaches to do with licensing, and is often impossible for software producers who don’t employ legions of lawyers. It also feels bad and wrong.

When packaging and deploying software, I subscribe to the philosophy of “Find the dependencies – and eliminate them”. Until someone can click a single button and have your software running, you haven’t finished.

Goodness in programming languages, part 1 – syntax and layout

In this series I will comment on what I like in some of the languages I use. I will cover things that I find convenient, things that might lead me to write correct code, things that tend to make my code more readable, and possibly other things that I just like for no clearly-articulated reason. The purpose of this series is to help me think about what features I would put in a language if I created my own.

Posts in this series: Syntax, Deployment, Metaprogramming, Ownership

What aspects of syntax and layout – how the code appears in your text editor – do I like? This is entirely irrelevant to the computer that is running the code, and an implementation detail of the language compiler or interpreter, but is extremely important to me.

  • Python and Scheme’s low-symbol approach. Python and Scheme tend towards the use of English words, rather than symbols, to express concepts. This makes code easier to read for someone unfamiliar with the language, and also for me.
  • Python’s unambiguous block structure. Python uses indentation to express block structure (e.g. for variable scope, namespaces and logic structures). Programmers in most languages (notably C-like and Lisp-like ones) tend to use indentation to help humans understand the block structure of their code, but the computer uses different symbols to understand the same thing. This can lead to ambiguity, where a human may misread the real block structure of some code because the indentation is inconsistent with the symbols (see the sketch just after this list). Python avoids the problem by making indentation the syntax understood by the compiler. [Go avoids the same problem by stating that code with inconsistent indenting (as defined by the gofmt program) is invalid.]
  • Scheme’s simplicity. Scheme (like other Lisp dialects) has very simple rules to describe its syntax. This means you need to learn very little before you can understand the structure of Scheme programs, and it is unlikely you will be confused by rare structures.
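
To illustrate the kind of mismatch I mean, here is a small C++ sketch (start and cleanup are just placeholder functions): the indentation suggests both calls are guarded by the if, but the compiler only attaches the first one to it.

if ( ready )
    start();
    cleanup();   // indented as though inside the if, but actually runs unconditionally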

Of course, there are trade-offs in all these points. Using fewer symbols can make programs verbose (although I find Python feels very concise). The lack of an end-of-block symbol makes short Python snippets look clean, but can make longer sections hard to understand (perhaps better editor support would help with this?). Scheme’s use of brackets for everything means there are a lot of brackets, so working out which does what job can be difficult.

Nevertheless, the goals of reducing the number of symbols with special meaning, allowing humans and computers to use the same ways of understanding code, and being as simple as possible are good goals to have.

Anatomy of an interpreter: the Lexer

Posts in this series: Lexer, Parser, Evaluator

I have been having a lot of fun recently writing my Scheme interpreter Subs. I have never implemented a full programming language before, so I am learning fast (mostly through mistakes) and wanted to write down some of the stuff I am discovering.

Note: if you want to learn more about what Scheme is I recommend Scheme (Wikipedia) and the book SICP, which is the inspiration for all this.

I am writing everything from scratch, just because it’s fun (certainly not because I think it is in any way better to do it that way…). As we will see, that gives me opportunities to do things in different ways from the normal way such things are done. So far, every time I find out I’ve deviated from the normal way I’ve quickly discovered why I am wrong, and had to learn the true path.

Text-based programming languages, whether interpreted or compiled, need a lexer. A lexer takes in characters and spits out “tokens”, which are groups of characters that represent a single thing, such as a bracket, a variable name or a number. (Those tokens are then passed on to the parser, which I will cover in a different post.)

Scheme (like other Lisp variants) is fairly easy to lex because it doesn’t have much syntax – you just need to be able to understand round brackets, numbers and strings, and a couple of special cases that I won’t go into because I haven’t actually implemented them yet. (Mind you, I haven’t implemented strings yet either…)

When I started Subs I took my normal approach of doing whatever I wanted without any research or even much thought, and wrote something that I called a lexer, but which was really something else. It took in a stream of characters, read it one “word” at a time (using whitespace as separators), broke up the word if it contained bracket characters, and emitted a tree structure with each branch representing a bracketed list. It just seemed sensible, while I was watching the brackets flow by, to understand them and create a tree structure.

However, for a number of reasons, that approach turned out to be wrong.

First, reading a “word” at a time made things much harder than simply stepping through each character. It made my initial implementation slightly faster, but as soon as I realised I cared about whitespace (e.g. keeping track of what line we are on) it had to go. Once it was gone, it also turned out to be easier to deal with unusual code layout – for example “a(b” should be lexed as 3 tokens, but would have been handed to us as a single word.
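
To make that concrete, here is a minimal sketch of the character-at-a-time approach (in the spirit of Subs, but not its real code). Because a bracket is a token on its own and also ends whatever token came before it, “a(b” naturally falls out as three tokens.

#include <cctype>
#include <istream>
#include <string>

// Read just enough characters from the stream to return the text of the
// next token, or an empty string when the input is exhausted.
std::string NextTokenText( std::istream& in )
{
    std::string text;
    char c;
    while( in.get( c ) )
    {
        if( std::isspace( static_cast<unsigned char>( c ) ) )
        {
            if( !text.empty() ) return text;  // whitespace ends the current token
            continue;                         // otherwise it is skipped
        }
        if( c == '(' || c == ')' )
        {
            if( text.empty() ) return std::string( 1, c );  // a bracket is a token by itself
            in.putback( c );  // it also ends the previous token - save it for next time
            return text;
        }
        text += c;  // any other character extends the current token
    }
    return text;  // end of input
}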

Second, and more importantly, creating a tree structure at this point was a waste of time. Creating tree structures is normally the job of a parser, and mixing these responsibilities gave me some pointless inefficiency: the lexer emitted a tree of tokens, which the parser then translated into another tree (of fully-understood code objects). It turned out that walking the tree of tokens and copying that structure in the parser was at least as hard as just taking in a flat stream of tokens and constructing the tree just once.

So, I re-wrote Lexer into something that is starting to become worthy of the name. The most interesting parts of its signature look like this:

class Lexer
{
public:
    Lexer( std::istream& instream );  // characters will be read from this stream
    Token NextToken();                // reads just enough characters to return the next token
};

It takes in a reference to a stream, which will provide the characters, and when NextToken is called, it reads enough characters to determine what the next token will be, and returns it.
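
As a rough illustration of how a caller drives it (the Text() accessor and the empty-token-means-end-of-input convention here are assumptions for the sake of the example, not necessarily Subs’s real interface):

#include <iostream>
#include <sstream>

// Lex a small Scheme expression and print each token on its own line:
// the output would be "(", "+", "1", "2" and ")".
void PrintAllTokens()
{
    std::istringstream code( "(+ 1 2)" );
    Lexer lexer( code );

    for( Token token = lexer.NextToken(); !token.Text().empty();
         token = lexer.NextToken() )
    {
        std::cout << token.Text() << std::endl;
    }
}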

Side note: Subs is written using Test-Driven Development. I re-implemented the Lexer and Parser from scratch (naming the new classes NewLexer and NewParser until they were ready), modified the code that used them to use the new interfaces, ran the tests, and immediately knew that the new classes worked as well as the old ones. That level of confidence is incredibly freeing. I can’t imagine how I would ever have convinced myself the new classes were ready had I not had that safety net of 100s of tests that ensure the interpreter correctly responds to each type of input.

Currently the Token class it returns is pretty much just a string, with some information attached about where that string was in the original text. In researching this article I realised that most lexers attach more information than that to their tokens – they understand each token’s basic type, such as integer, decimal, string, bracket or symbol. At the moment in Subs, this work is done in the parser (so for the lexer each token is just a string), but I can see why it is helpful to do this work in the lexer, because for most types we have the information anyway. For example, in order to recognise that "foo (bar" (where the double-quotes are part of the real code) is a single token we must understand that it is a string. Since we know it at this stage, we might as well record it in the Token object so we don’t have to work it out again later. When Subs supports strings, I will probably add a “type” field to the token and move this work from the parser into the lexer.
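
If I do, the Token might end up looking something like this (just a guess at a possible shape, not what Subs actually contains):

#include <string>

// A possible richer Token: the lexer records the basic type it already had
// to work out, so the parser does not need to rediscover it later.
struct Token
{
    enum Type { Bracket, Integer, Decimal, String, Symbol };

    std::string text;  // the characters making up the token
    Type type;         // what kind of token the lexer decided this is
    int line;          // where the token appeared in the original text
    int column;
};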

On a more general programming point, following on from comments I made in a previous post, it is worth noting that the structure of the lexer (and, as we will see later, the parser) uses a technique called “streams”. What this means is that we write functions like NextToken that process a small part of the total problem, and return their answer. If we chain together functions like that (for example the parser’s equivalent function calls NextToken whenever it needs a new token) we can process arbitrarily large input without using lots of memory or slowing down. The lexer is able to process any number of tokens using a very small amount of memory, and will only fall over if it encounters a single token that is ridiculously large.

The stream style can be very useful for some problems, not only because it can be efficient with memory, but also because it can help break problems into neat, small pieces that are easier to implement correctly. It is also useful for writing code in a functional style, because it lets us avoid internal state (for example, we could implement NextToken to take in the stream it should read from, and have no member variables at all in Lexer), pushing state out into the input and output instead of holding it inside our program. In this case that means that instead of reading, storing and then processing a whole program, our lexer can simply process a few characters and emit a token without knowing anything about the surrounding code or wider context. This makes it much easier to test, and (potentially) easier to do clever things with, like prove mathematically that it is correct (!)
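
That testability is easy to demonstrate: because the lexer needs nothing but a stream, a test can construct one from a string and check the tokens that come out (again assuming a hypothetical Text() accessor on Token):

#include <cassert>
#include <sstream>

// Unusual layout should still lex correctly: "a(b" is three tokens.
void Test_a_bracket_b_is_three_tokens()
{
    std::istringstream code( "a(b" );
    Lexer lexer( code );

    assert( lexer.NextToken().Text() == "a" );
    assert( lexer.NextToken().Text() == "(" );
    assert( lexer.NextToken().Text() == "b" );
}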

Next time, the parser.