Implementing A Scripting Engine - Part 2 - The Lexical Analyzer
by Jan Niestadt (14 May 1999)



Introduction


I always say that examples can never be too simple when you're learning something new. That's why I've tried to think of a very simple compiler that still has all the characteristics of a complete one. I came up with a string manipulation language that uses C-like syntax and has Basic-like functionality. Below is an example of a correct program in our language.


 print "Please enter your name > ";
 input name;
 if (name == "Jan")   {                    // string comparison
    name = "my creator";                   // string assignment
    happy = "yes";
 }
 print "Thank you, " + name + "!\n" +      // string concatenation
       "You've just made a simple program very happy!";
 


As you can see, there are no grouping structures like functions, classes, etc. and not even numeric types. The final product, however, will be easily expandable.

But before we get to that, we've still got a long way to go - remember the list of components from last time? Today we'll implement the first one: the lexical analyzer, or lexer for short. It'll be a nice warm-up, since it's not really a difficult part of the compiler.

Okay, ready to get analyzing?


What, Why and How


First, I guess you'd like to know what a lexical analyzer *is* and why we need it. The task of the lexical analyzer is to convert the character stream that is a source file into a token stream. Essentially it just looks at the characters sequentially and recognizes "words" in them.
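
For example, given the line

 name = "my creator";

from the program above, the lexer would produce the token stream ID ASSIGN STRING END_STMT (these token names appear in the rules below).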

We could of course write a function that compares the string at the current position in the source file to all our keywords, but that would be unbearably slow. Instead, we'll use a finite state machine to recognize words (if you don't know what that is, don't worry - you don't need to).

The great thing about the lexer is that we don't actually have to do any hard work - we let the lexer be generated by a program called 'lex'. This is a standard Unix program; there are also several Win32 implementations (the one I use is called Flex, and it's included in the zip file for this part). For the complete Lex HTML manual, look here.

Well, you know what the lexer does and how we're going to make it. Now would be a good time to download the tut2.zip file and have a look at the code. The source files for this part are string.l and main.cpp, plus a few header files. Note that the .zip file contains a directory structure that I'll use throughout this series; for example, flex.exe is located in the base dir and the files specific to this tutorial are in the tut2\ directory.


The Lexer Rules!!


Lex needs a few simple rules to be able to generate our lexer. Before I explain these rules, let's study the layout of the Lex input file.


 <definitions>
 %%
 <rules>
 %%
 <user_code>
 


The definitions section contains some regular expression macros. (Regular expressions are explained in the Lex manual, and more thoroughly here.) These tell Lex what we mean by a LETTER, a DIGIT, an IDENT (an identifier, which is defined as a letter followed by zero or more letters or digits) and a STR (a string constant, which is a string of characters enclosed in double quotes).

This section can also contain some initial code, like #includes for standard header files and forward references. This code should be put between the tags %{ and %}. I included the header file lexsymb.h here, which we'll look at in a minute.
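
To give you an idea, here's a sketch of what the definitions section could look like; the exact patterns in the real string.l may differ slightly:

 %{
 #include "lexsymb.h"   /* token symbols and the yylval union */
 int lineno = 0;        /* line counter, incremented by the newline rule */
 %}
 LETTER  [a-zA-Z_]
 DIGIT   [0-9]
 IDENT   {LETTER}({LETTER}|{DIGIT})*
 STR     \"[^\"\n]*\"
 WSPACE  [ \t]+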

The user code section can contain any code you need for analyzing; in our case it contains a function to skip all characters inside comments, functions to pass the identifier names and the contents of string constants to the caller, and our main function.

The lexsymb.h file contains the declarations of the token symbols the lexer function will return. It also contains the declaration of a union 'yylval' used to pass extra information (like the name of an identifier) to the caller; why we're using this specific union will become clear in the next part.
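
Here's a sketch of what lexsymb.h might declare; the enum and union names are made up for illustration, and the real header in tut2\ may differ in details like the token numbering:

 /* token symbols returned by the lexer (sketch) */
 enum {
     IF = 256,   /* start above the range of plain character codes */
     ASSIGN, END_STMT, ID, STRING, ERROR_TOKEN
     /* ...the other keywords and operators... */
 };

 /* extra information about the current token */
 union lex_union {
     char *str;   /* identifier name or string constant contents */
 };
 extern union lex_union *yylval;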

Now, let's look at the actual rules. Note that I'm using /* */ comments; Lex is an old program and doesn't support // comments in its input files. We will produce a C lexer, by the way - C++ versions of Lex are available, but the standard Unix Lex produces C code, and we want to keep things portable, don't we?


 "if"     {return IF;}
 "="      {return ASSIGN;}
 ";"      {return END_STMT;}
 {IDENT}  {Identifier ();             /* identifier: copy name */
           return ID;}
 {STR}    {StringConstant ();         /* string constant: copy contents */
           return STRING;}
 "//"     {EatComment();}             /* comment:    skip */
 \n       {lineno++;}                 /* newline:    count lines */
 {WSPACE} {}                          /* whitespace: (do nothing) */
 .        {return ERROR_TOKEN;}       /* other char: error, illegal token */
 


I left out some rules that were very similar. As you can see, each rule starts with the pattern Lex should recognize, followed by some code telling Lex what to do (this code *can* contain C++ by the way, because Lex just copies it into the output file). Note that the topmost rules are evaluated first; this is sometimes important.

The first three rules are very simple; they just recognize a string of characters and return the appropriate token symbol to the caller. Note that you can just change the strings if you want ":=" to be the assignment operator, for example.

The fourth line is the first "interesting" one: it makes use of the IDENT macro, so it recognizes a string of characters/digits that doesn't satisfy any of the previous rules. If found, it calls the Identifier() function, which copies the string from yytext (which contains the text for the current token) into a new char array. The lexer then returns the ID token; the caller can just look at the 'yylval->str' pointer to find the name of the identifier. STR does the same for a string constant.
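
For illustration, the helper functions in the user code section might look something like this (just a sketch; it assumes yylval points to writable storage, and that <string.h> and <stdlib.h> are #included in the definitions section):

 /* copy the identifier name out of yytext for the caller */
 void Identifier (void)
 {
     yylval->str = (char *) malloc (strlen (yytext) + 1);
     strcpy (yylval->str, yytext);
 }

 /* copy a string constant, stripping the surrounding double quotes */
 void StringConstant (void)
 {
     int len = strlen (yytext) - 2;
     yylval->str = (char *) malloc (len + 1);
     strncpy (yylval->str, yytext + 1, len);
     yylval->str[len] = '\0';
 }

 /* skip the rest of a // comment line */
 void EatComment (void)
 {
     int c;
     while ((c = input ()) != '\n' && c != EOF)
         ;              /* eat characters up to the end of the line */
     if (c == '\n')
         lineno++;      /* we consumed the newline, so count it */
 }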

The next lines take care of comments, newlines and whitespace. Note that the line number is counted; we'll use this for error messages in the future. The final rule tells Lex that if the input doesn't satisfy any of the above rules (the regular expression "." means: any char except '\n'), we should return an error token and let the caller decide what to do.

This Lex input file can be compiled to lex.cpp using the command:

flex -olex.cpp string.l

Also included in the .zip archive is a project file for MSVC 6.0 (string.dsp). I believe it also works with 5.0, but I'm not sure. This project file contains a custom build command for string.l so it compiles automatically.

Unfortunately, the generated lexer #includes a non-standard header file, unistd.h, which is not available on Windows systems. In the base dir is an empty unistd.h file; add the base dir to your include files path (in MSVC: Tools->Options->Directories->Include).

The file lex.cpp contains a complete lexer that satisfies our rules. It's that simple! The main program in this example just reads tokens using the lexer function and shows each token's name and value (if it's an ID or STR). You can try typing some stuff and see what the lexer makes of it; random characters will generally be seen as IDs, but other unused chars like '$' will cause an ERROR_TOKEN. You can try example.str (in the base directory) too!
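
In case you're curious, the token loop in main.cpp boils down to something like this sketch (the actual output formatting in the real file may differ):

 #include <stdio.h>
 #include "lexsymb.h"

 extern int yylex (void);   /* the lexer function flex generates */

 int main (void)
 {
     int token;
     while ((token = yylex ()) != 0)   /* yylex() returns 0 at end of input */
     {
         printf ("token %d", token);
         if (token == ID || token == STRING)
             printf (" (%s)", yylval->str);
         printf ("\n");
     }
     return 0;
 }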


It Gets Better, Really


Well, now we have a program that can "read". Unfortunately, it still has no idea what it's reading and whether this is correct by our standards. It will just accept any token it knows about.

So it has to learn about grammar. By an amazing coincidence, grammar is exactly what the next part is about. The next component is called the parser and its task is to find out the structure of the program and check the syntax while it's at it.

Then things really get interesting. We will actually be able to feed the program from the introduction to the compiler, and it will accept it - not because it accepts just about anything, but because it knows the program is syntactically correct! I just know you're as wildly excited about this as I am, but you'll have to wait until the next part...

Jan Niestadt


Quote!


"And so it was only with the advent of pocket computers that the startling truth became finally apparent, and it was this:

Numbers written on restaurant bills within the confines of restaurants do not follow the same mathematical laws as numbers written on any other pieces of paper in any other parts of the Universe.

This single fact took the scientific world by storm. It completely revolutionized it. So many mathematical conferences got held in such good restaurants that many of the finest minds of a generation died of obesity and heart failure and the science of maths was put back by years."

HHG 2:7


Article Series:
  • Implementing A Scripting Engine - Part 1 - Overview
  • Implementing A Scripting Engine - Part 2 - The Lexical Analyzer
  • Implementing A Scripting Engine - Part 3 - The Parser
  • Implementing A Scripting Engine - Part 4 - The Symbol Table & Syntax Tree
  • Implementing A Scripting Engine - Part 5 - The Semantic Checker & Intermediate Code Generator
  • Implementing A Scripting Engine - Part 6 - Optimization
  • Implementing A Scripting Engine - Part 7 - The Virtual Machine
  • Implementing A Scripting Engine - Part 8 - Executable Code
  • Implementing A Scripting Engine - Part 9 - Advanced Subjects
