Compilers - 1: Introduction

    JosAH, Recognized Expert MVP, Mar 2007

    Greetings,

    Last week's tip was a bit of playtime in which we built a Sudoku solver. This
    week we're going to build something more complicated: a compiler. Compiler
    construction is a difficult branch of CS, and I don't want the article(s) to
    be the size of a book, so we have to keep things a bit simple.

    On the other hand, the entire thing should have some practical use, so the
    code we develop must be extensible enough for you to play and experiment
    with it if you like.

    We also like to see some results, so we can't just develop a compiler: we
    must also develop a sort of interpreter so we can see the actual thing
    running. But first we must plough through a bit of theory and definitions:

    Introduction

    A compiler typically translates a source language to a target language. Both
    languages can be anything: the source language can be a high level
    programming language, an assembly language (possibly for a virtual machine),
    or even a natural language such as English or Dutch or Swahili. Most (but
    not all) of the time the target language will be a lower level language than
    the source language.

    A compiler reads a source language and writes a target language; it's
    basically a translator. Human languages and their usage as a means of
    communication are awfully difficult, especially when speech is involved. For
    simplicity we're not going to deal with speech recognition, nor with the
    difficult aspects of speech itself as a coherent sequence of phonemes,
    lexemes, diphthong- labial- dental- guttural- plosive- voiced- silent- and
    other funny sounds.

    But even when it comes to programming languages things are quite complicated.
    At the simplest level a programming language is a sequence of characters
    where certain combinations of characters make up a 'token'. A token is an
    atomic entity of the programming language. Think of reserved words and
    symbols such as parentheses, mathematical operators, numbers, etc. But white
    space can be an important token too: most programming languages treat white
    space as the characters separating the more meaningful tokens, but it is
    extremely important in string literals, e.g. "helloworld " is different from
    "hello world", although for human beings the difference hardly matters,
    thanks to our heuristic and fascinating ability to match patterns, even
    though we can't really explain *how* we do it.

    A compiler is much more stupid than a human. It needs fixed rules to
    determine whether or not a sequence of characters makes up a valid token.
    This is the task of the first part of the compiler:

    The Tokenizer

    A tokenizer (or 'lexical analyzer') attempts to form tokens from a sequence
    of characters. Have a look at "x+++++y" for example; it could represent the
    sequence of tokens [x] [++] [+] [++] [y] which is syntactically valid for
    Java, but that's not how Java's tokenizer works (try it).

    The notation '[ ... ]' represents a single token, formed from the character
    sequence '...'.

    The Java tokenizer treats that character sequence as a series of tokens like
    this: [x] [++] [++] [+] [y]

    Why? Because Java's tokenizer is 'greedy', i.e. it always scans the character
    sequence from left to right, attempting to 'cut off' the longest possible
    valid token. Why? Because it needs a simple rule for cutting tokens off a
    character sequence. A tokenizer doesn't know much about the actual language
    that needs to be compiled; it just knows about valid tokens, and it tries to
    produce a series of tokens as efficiently as possible. So the human user has
    to put a space somewhere to help the tokenizer: "x++ + ++y". Java considers
    spaces to be token separators when all else fails.
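
    To make that greedy rule (also known as 'maximal munch') concrete, here is a
    minimal sketch of a tokenizer that knows only identifiers and the [+] and
    [++] operators; the class and method names are made up for illustration,
    this is not the code we'll develop later:
    Code:
    import java.util.ArrayList;
    import java.util.List;

    public class GreedyLexer {

        // Cut the longest possible valid token off the front, over and over.
        static List<String> tokenize(String s) {
            List<String> tokens = new ArrayList<>();
            int i = 0;
            while (i < s.length()) {
                char c = s.charAt(i);
                if (Character.isWhitespace(c)) { i++; continue; }        // separator only
                if (s.startsWith("++", i)) { tokens.add("++"); i += 2; } // longest first!
                else if (c == '+') { tokens.add("+"); i++; }
                else if (Character.isJavaIdentifierStart(c)) {
                    int j = i + 1;
                    while (j < s.length() && Character.isJavaIdentifierPart(s.charAt(j))) j++;
                    tokens.add(s.substring(i, j));
                    i = j;
                }
                else throw new IllegalArgumentException("no token at: " + s.substring(i));
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("x+++++y"));   // [x, ++, ++, +, y]
            System.out.println(tokenize("x++ + ++y")); // [x, ++, +, ++, y]
        }
    }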

    In other circumstances a tokenizer works correctly without spaces being
    inserted all over the place: "if(true)x= 1;". The '(' character cannot be
    part of a token [if(] or [if(t] or [if(tr] etc. because such tokens don't
    exist in Java. The longest possible first token here is [if], and no space
    is needed. Of course it helps humans to put spaces here and there for
    readability reasons.

    Things are much worse in C and C++ where even the parentheses are needed.
    Have a look at this C example:
    Code:
    if (x) * (y=z);
    For a suitable pointer value for y this makes sense in C and C++. Leaving out
    the parentheses:
    Code:
    if x * (y=z);
    This doesn't make sense anymore: is it supposed to be an (ugly) condition
    "x*(y=z)" followed by an empty statement, or should it be a condition "x"
    followed by an (ugly) statement "*(y=z);"? Both variants don't make much
    sense semantically speaking, but what does the compiler know? It is supposed
    to translate those character sequences and produce another (machine code)
    language.

    If I wrote "<S-F6><S-F7>fsggf*&^&^*& kjgf(&" you wouldn't understand me,
    because the above doesn't make sense *lexically*, a subject the tokenizer
    has to deal with. The tokenizer has to chop up a character sequence and form
    tokens out of it.

    You may have never realized it but a lot of those parentheses, semi-colons
    and curly brackets are just there to help another part of the compiler:

    The Parser

    A parser expects a stream of tokens, produced by the tokenizer. For example,
    a Java tokenizer treats the following sequence of characters: ")if(x; )(y"
    as a perfectly valid sequence of tokens: [)] [if] [(] [x] [;] [)] [(] [y],
    but it doesn't make much sense to the parser. The sequence of tokens may be
    valid to the tokenizer, but *syntactically* it doesn't make sense.
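
    To illustrate the difference, here is a minimal sketch (again with made-up
    names, not the code we'll develop later) of how a parser checks token order:
    a toy recursive descent rule that demands the shape 'if' '(' condition ')'
    ';' and rejects the stream above even though every single token in it is
    valid:
    Code:
    import java.util.List;

    public class ToyParser {
        private final List<String> tokens;
        private int pos = 0;

        ToyParser(List<String> tokens) { this.tokens = tokens; }

        // Consume one expected token or fail: the heart of syntax checking.
        private void expect(String token) {
            if (pos >= tokens.size() || !tokens.get(pos).equals(token))
                throw new RuntimeException("syntax error at token #" + pos);
            pos++;
        }

        // ifStatement := 'if' '(' token ')' ';'   (a toy grammar rule)
        void ifStatement() {
            expect("if");
            expect("(");
            pos++;      // accept any single token as the 'condition' in this sketch
            expect(")");
            expect(";");
        }

        public static void main(String[] args) {
            new ToyParser(List.of("if", "(", "x", ")", ";")).ifStatement(); // fine
            new ToyParser(List.of(")", "if", "(", "x", ";", ")", "(", "y"))
                    .ifStatement();   // throws: valid tokens, invalid syntax
        }
    }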

    Human beings are extremely good at understanding syntactically incorrect
    sentences; we all write/speak and listen to them every day.

    If I write: "me beer now drink want!" you most likely understand what I'm
    all about although the sentence is syntactically incorrect. You wouldn't
    understand me when I wrote: "house boat now defenestrate to wants the the"
    and rightly so. You wouldn't even understand what I'm talking about if I
    wrote: "the house wants to defenestrate the boat now".

    Houses don't defenestrate boats. Although this sentence is syntactically
    correct, it doesn't make sense. It doesn't make sense in a *semantic* way.

    Humans are good at 'understanding' syntactically incorrect sentences, but we
    aren't any good at it when it comes to sentences that don't make any sense;
    we find them humorous at best, and we don't really care much about their
    syntactical correctness.

    We need correct semantics then. Compilers do too:

    The Semantic Analyzer

    Compilers can hardly deal with semantics: they can do type checking, but
    that's about it. Most of the semantics moves on to the interpreter, the
    virtual machine, or the real machine. After compilation (translation) is
    over and done with, the runtime environment (in whatever form) has to deal
    with the semantics: it checks for division by zero, it checks whether or not
    files exist, whether sockets can be opened, etc.

    The compiler can check whether or not every method that is used has been
    declared, whether every member variable that is used has been defined, and
    whether the classes that are used are defined somewhere. It can also do
    simple things like this:
    Code:
    int x= "hello world";
    It's a perfect stream of tokens and it's syntactically all fine, but the
    types of the two sides are incompatible: we can't assign a string to an int;
    the program fragment is semantically incorrect.
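
    Such a check boils down to comparing the declared type with the type of the
    expression; a minimal sketch of that idea, with hypothetical names and a
    deliberately oversimplified type rule (plain equality instead of Java's real
    conversion rules), could look like this:
    Code:
    public class ToyTypeChecker {

        // A toy semantic check: is the value's type acceptable for the declared type?
        // A real compiler would consult conversion rules; this sketch demands equality.
        static void checkAssignment(String declaredType, String valueType) {
            if (!declaredType.equals(valueType))
                throw new RuntimeException(
                    "semantic error: cannot assign " + valueType + " to " + declaredType);
        }

        public static void main(String[] args) {
            checkAssignment("String", "String");  // fine
            checkAssignment("int", "String");     // int x= "hello world"; -> error
        }
    }
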
    But before the runtime system can do anything, that "anything" first has to
    be generated:

    The Code Generator

    Code generation can be as simple as emitting bytes to somewhere (machine
    code generation), but it can be complex too. Have a look at this little
    example:
    Code:
    double x= 1/42.0;
    x= Math.sin(x)/Math.cos(x)+Math.log(x)-3.14159;
    x= 0;
    Was that initialization really necessary? How about that complicated
    expression? Obviously not, because at the end variable x will be set to 0
    anyway. So why generate code for all that? Maybe the user wants to debug
    this code, so all the code should be generated anyway. Maybe the user wants
    optimized code, so almost no code should be generated at all: a simple "x=0"
    initialization should be enough. It's the job of the code generator to
    generate efficient code within the context of what it knows. Code generators
    don't know anything about character or token sequences; the tokenizer and
    parser took care of that. The parser does invoke the code generator when
    needed, but the parser doesn't know whether or not the code is really
    needed.

    Normally, code generators are invoked by parsers; they know what code needs to
    be generated, redundant or not. The code generators are supposed to figure out
    whether or not all that code is really needed.
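
    One classic way a code generator (or its optimizer) figures that out is by
    scanning the statements backwards and dropping any assignment whose value is
    never read afterwards. Here is a minimal sketch of that 'dead store' idea,
    under the simplifying (and made-up) assumption that every statement is a
    plain assignment with a known set of variables it reads:
    Code:
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class DeadStoreDemo {

        record Assign(String target, Set<String> uses) {}

        // Walk the statements backwards, tracking which variables are still 'live'
        // (read later); an assignment to a variable that isn't live is a dead store.
        static List<Assign> eliminateDeadStores(List<Assign> program, Set<String> liveAtEnd) {
            List<Assign> kept = new ArrayList<>();
            Set<String> live = new HashSet<>(liveAtEnd);
            for (int i = program.size() - 1; i >= 0; i--) {
                Assign a = program.get(i);
                if (live.contains(a.target())) {   // its value is read later: keep it
                    kept.add(0, a);
                    live.remove(a.target());
                    live.addAll(a.uses());
                }                                  // otherwise: dead store, drop it
            }
            return kept;
        }

        public static void main(String[] args) {
            List<Assign> program = List.of(
                new Assign("x", Set.of()),         // x= 1/42.0;
                new Assign("x", Set.of("x")),      // x= sin(x)/cos(x)+log(x)-3.14159;
                new Assign("x", Set.of()));        // x= 0;
            // only the last assignment survives: the earlier stores are never read
            System.out.println(eliminateDeadStores(program, Set.of("x")));
        }
    }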

    Concluding remarks

    The first part of this article described the several phases (or modules) of
    a compiler. Strictly speaking the interpreter or runtime system isn't part
    of a compiler. First the character stream is converted to a token stream.
    The token stream is then checked to see whether or not it makes sense
    syntactically.

    The parser knows when to invoke the code generator. The code generator must
    generate efficient code. When all that is done, the interpreter has its work
    to do: perform what the user originally intended with the character stream.

    In practice the separation between the phases can be a bit blurred: e.g. the
    parser changes the behaviour of the tokenizer, the code generator consults
    the parser for some semantic details, the tokenizer talks back to the
    parser, etc. But the distinction between the modules helps the development
    of a new compiler.

    This part of the article contains none of the compiler's actual code; that
    is left for the next parts. The second part of this article deals with
    tokenizers, and following parts dig deeper into parsers, code generators and
    interpreters. Our tokenizer uses regular expressions for chopping up the
    character sequence into a token sequence.
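
    As a small preview of that approach (the real token set will only be decided
    in the next part, so the patterns below are placeholders), such a tokenizer
    repeatedly matches a regular expression of alternatives at the current
    position, with longer operators listed first so the greedy behaviour falls
    out of the alternation order:
    Code:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexLexerPreview {

        // One alternative per token kind; '++' is listed before '+' so the
        // greedy maximal-munch behaviour falls out of the alternation order.
        private static final Pattern TOKEN = Pattern.compile(
            "\\s*(\\+\\+|[-+*/()=;]|\\d+(?:\\.\\d+)?|[A-Za-z_]\\w*)");

        public static void main(String[] args) {
            String input = "x+++++y";
            Matcher m = TOKEN.matcher(input);
            int pos = 0;
            while (pos < input.length() && m.find(pos) && m.start() == pos) {
                System.out.println("[" + m.group(1) + "]");  // [x] [++] [++] [+] [y]
                pos = m.end();
            }
        }
    }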

    The source code deliberately doesn't use any external tools for the
    tokenizer part nor for the parser part, purely for educational purposes.
    There are quite a few very good tools available that generate tokenizers and
    parsers for us, but this is all about building it ourselves from the bottom
    up.

    Now I have to figure out what that language is going to be like and what it
    should be capable of; keep your fingers crossed ;-)

    See you next week in the second part of this article.

    kind regards,

    Jos