What is a string?

August 23, 2014 [Pepper, Programming Languages, Tech]

Most programming languages have some wrinkles around Unicode and strings*. In my fictitious language Pepper, there are no wrinkles of any kind, and everything is perfect.

*E.g. JavaScript, Java, Haskell, Ruby, Python.

There are several key concepts. The most important are the interface AnyString and the variable** String, which is what you should use when you are writing code with strings.

**String is a variable that refers to a type, so you just use it like a type and don't worry about it.

interface AnyString
{
    def indexable(CodePoint) code_points( implements(AnyString) string )
}

In Pepper an interface can describe which free functions must exist as well as which member functions a class must have. Here we just require that a code_points function exists that gives us a collection of CodePoint objects that may be indexed (i.e. is random-access).
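Since Pepper does not exist, here is a rough C++17 sketch of what "an interface that requires a free function" could look like. Everything in it is an assumption made for illustration: the is_any_string trait, the Utf32String type and the code_point_at helper are hypothetical names, CodePoint is modelled as char32_t, and the collection is assumed to report its size as well as being indexable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for Pepper's CodePoint: one Unicode code point.
using CodePoint = char32_t;

// Detects the AnyString requirement: a free function code_points(s) must
// exist and return something whose elements can be read by index.
template <typename S, typename = void>
struct is_any_string : std::false_type {};

template <typename S>
struct is_any_string<
    S, std::void_t<decltype(code_points(std::declval<const S&>())[std::size_t{0}])>>
    : std::is_convertible<
          decltype(code_points(std::declval<const S&>())[std::size_t{0}]),
          CodePoint> {};

// Hypothetical string type storing UTF-32, so every index is one code point.
struct Utf32String {
    std::u32string data;
};

// The free function the interface asks for; found by argument-dependent lookup.
inline const std::u32string& code_points(const Utf32String& s) {
    return s.data;
}

static_assert(is_any_string<Utf32String>::value,
              "Utf32String satisfies the AnyString sketch");

// Generic code that relies only on the interface.
template <typename S>
CodePoint code_point_at(const S& s, std::size_t index) {
    static_assert(is_any_string<S>::value, "S must provide code_points()");
    const auto& points = code_points(s);
    assert(index < points.size());  // assumption: the collection has size()
    return points[index];
}
```

Note that Utf32String has no member functions at all: it qualifies purely because a matching free function exists, which is the point the interface is making.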

When your Pepper program starts, the String variable will refer to something that implements this interface, and probably some other interfaces too. Most Pepper programs will use a String that is implemented as an array of bytes representing a string in UTF-8, but the programmer doesn't need to be aware of that. In a situation where something different is needed (e.g. where we know most of the text will be in a script such as Chinese or Japanese, whose characters typically take three bytes in UTF-8 but two in UTF-16), String can be set to something different in the configuration settings used by the compiler.
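To make the UTF-8 versus UTF-16 trade-off concrete, here is a small C++ sketch that counts how many bytes a sequence of code points needs in each encoding. The function and struct names are made up for this example, and the byte counts follow the standard encoding ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
inline std::size_t utf8_size(char32_t cp) {
    assert(cp <= 0x10FFFF);  // largest valid Unicode code point
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 2-byte unit, or a 4-byte surrogate pair.
inline std::size_t utf16_size(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;
}

// Hypothetical result type: total bytes in each encoding.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
};

// Sums the encoded size of every code point in the text.
inline EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes total{0, 0};
    for (char32_t cp : text) {
        total.utf8 += utf8_size(cp);
        total.utf16 += utf16_size(cp);
    }
    return total;
}
```

For the ASCII text "hello" this gives 5 bytes in UTF-8 and 10 in UTF-16, while the three code points U+65E5 U+672C U+8A9E ("Japanese language" written in Japanese) take 9 bytes in UTF-8 and 6 in UTF-16.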

When you want to do something with a string, there will be functions that rely only on the AnyString interface and deal with CodePoints internally, but there will also be overloads that are potentially more efficient. For example, there are two versions of the standard print function:

def void print( implements(AnyString) string )
def void print( NativeUtf8String string )

The NativeUtf8String class is implemented as a std::string in the C++ code emitted by the Pepper compiler, or as whatever is the most efficient way to represent an array of bytes when compiling for other platforms, so the version of print that uses it can be quite efficient.
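The two print overloads might come out roughly like this in C++. This is only a sketch under assumptions, not what a real Pepper compiler would emit: print_string stands in for print and takes an explicit output stream so the result can be inspected, Utf32String is a hypothetical second string type, and the collection returned by code_points is assumed to be iterable.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

using CodePoint = char32_t;  // stand-in for Pepper's CodePoint

// Stand-in for NativeUtf8String: the bytes are already UTF-8.
struct NativeUtf8String {
    std::string bytes;
};

// Hypothetical string type that is not UTF-8 internally.
struct Utf32String {
    std::u32string data;
};

inline const std::u32string& code_points(const Utf32String& s) {
    return s.data;
}

// Appends the UTF-8 encoding of one code point to out.
inline void append_utf8(std::string& out, CodePoint cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// General version: works for anything with code_points(), but has to
// re-encode every code point.
template <typename S>
void print_string(std::ostream& os, const S& string) {
    std::string encoded;
    for (CodePoint cp : code_points(string)) {
        append_utf8(encoded, cp);
    }
    os << encoded;
}

// Efficient version: a non-template overload is preferred for an exact
// match, so UTF-8 strings are written out with no conversion.
inline void print_string(std::ostream& os, const NativeUtf8String& string) {
    os << string.bytes;
}
```

Which overload runs is decided by the compiler from the static type of the argument, so the caller writes the same call either way and there is no runtime check.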

Because all these types are known at compile time, the C++ code generated by the Pepper compiler can use the native types directly (and be efficient), even though the programmer is writing code using just the AnyString and String types. This means their code can be adapted to other platforms by using a different configuration.

The Pepper environment exposes standard-out and standard-in as UTF-8 streams, and takes care of converting to the platform encoding for you (at runtime).