Tokenizer interface

  • Lorenzo J. Lucchini

    Tokenizer interface

    Do you see any counter-indication to a token extractor in the
    following form?

    typedef int token;
    token ExtractToken(const char** String, char* Symbol);

    If it finds a valid token at the start of *String, it copies it into
    Symbol (unless Symbol==NULL, in which case it doesn't copy it
    anywhere); it makes *String point to the first character *after* the
    token, and returns a token identifier (0 if no valid token is found).

    Of course, there is also a function
    void CreateToken(const char* Symbol, token Id);
    to add tokens to the set.

    The only other function I know of that behaves like this (as the
    char** is concerned) is strtol(), which can change a pointer to point
    to the first character after the number parsed.
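    For illustration, here is a minimal sketch of the proposed interface in
    use. The token set and the body of ExtractToken() below are hypothetical
    stand-ins (two hard-coded tokens), not the poster's actual implementation:

```c
#include <string.h>

typedef int token;

/* Hypothetical stand-in for ExtractToken(): the real function would
   consult a token set built with CreateToken(); here two tokens are
   hard-coded, checking the longer lexeme first. */
token ExtractToken(const char **String, char *Symbol)
{
    static const struct { const char *lexeme; token id; } set[] = {
        { "==", 2 }, { "+", 1 }
    };
    size_t i;
    for (i = 0; i < sizeof set / sizeof set[0]; i++) {
        size_t n = strlen(set[i].lexeme);
        if (strncmp(*String, set[i].lexeme, n) == 0) {
            if (Symbol != NULL) {        /* copy lexeme only on request */
                memcpy(Symbol, set[i].lexeme, n);
                Symbol[n] = '\0';
            }
            *String += n;                /* first character *after* the token */
            return set[i].id;
        }
    }
    return 0;                            /* no valid token found */
}
```

    Calling it repeatedly on "+==" yields the ids 1 then 2, with the string
    pointer advanced past each lexeme, matching the described contract.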

    by LjL
    ljlbox@tiscali.it
  • Andreas Kahari

    #2
    Re: Tokenizer interface

    In article <1f24a1b2.0311281015.5485a775@posting.google.com>, Lorenzo J. Lucchini wrote:
    > Do you see any counter-indication to a token extractor in the
    > following form?


    Did you consider using strtok()?


    --
    Andreas Kähäri


    • Kaz Kylheku

      #3
      Re: Tokenizer interface

      ljlbox@tiscalinet.it (Lorenzo J. Lucchini) wrote in message news:<1f24a1b2.0311281015.5485a775@posting.google.com>...
      > Do you see any counter-indication to a token extractor in the
      > following form?
      >
      > typedef int token;
      > token ExtractToken(const char** String, char* Symbol);
      >
      > If it finds a valid token at the start of *String, it copies it into
      > Symbol (unless Symbol==NULL, in which case it doesn't copy it

      It's a terrible interface, because it's not obvious how much space the
      second argument points to! What if the lexeme is too long?

      There are several alternatives.

      1. Allow the input string to be destroyed. The tokenizer can then
      insert a terminating null and return a pointer to an address
      within the original string:

      token ExtractToken(char **inputstring, char **outputptr);

      2. Dynamically allocate the lexeme:

      token ExtractToken(char **inputstring, char **outputstring);

      If a non-null pointer is stored to *outputstring, the caller
      is responsible for liberating the memory with free().

      3. Provide a buffer for the lexeme, and specify the size:

      token ExtractToken(char **inputstring,
                         char *outputbuffer, size_t buffersize);


      4. Create an abstract data type representing tokens, and return that:

      typedef struct token {
          char *lexeme;
          int code;
      };

      token *token_extract(char **inputstring)
      {
          token *t = malloc(sizeof *t);

          /* ... do the extraction stuff, allocate lexeme, etc ... */

          return t;
      }

      void token_free(token *t)
      {
          if (t != 0) {
              free(t->lexeme);
              free(t);
          }
      }

      etc.
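      For illustration, alternative 3 might be fleshed out along these lines.
      This is a hypothetical sketch: the lexical rule is just a placeholder
      that treats space-separated words as lexemes, and a lexeme that does
      not fit the buffer is reported as a failure:

```c
#include <string.h>

typedef int token;

/* Sketch of alternative 3: the caller supplies the lexeme buffer and
   its size, so the tokenizer can never overrun it.  An overly long
   lexeme is reported as 0 rather than silently truncated. */
token ExtractToken(const char **inputstring, char *outputbuffer,
                   size_t buffersize)
{
    const char *s = *inputstring;
    size_t len;
    while (*s == ' ')
        s++;                             /* skip separators */
    len = strcspn(s, " ");               /* placeholder lexical rule */
    if (len == 0 || len >= buffersize)
        return 0;                        /* nothing found, or no room */
    memcpy(outputbuffer, s, len);
    outputbuffer[len] = '\0';
    *inputstring = s + len;              /* first character after lexeme */
    return 1;                            /* dummy token id */
}
```

      The size parameter is what distinguishes this from the interface the
      original poster proposed: the contract about available space is explicit.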


      • Morris Dovey

        #4
        Re: Tokenizer interface

        Lorenzo J. Lucchini wrote:
        > Do you see any counter-indication to a token extractor in the
        > following form?
        >
        > typedef int token;
        > token ExtractToken(const char** String, char* Symbol);
        >
        > If it finds a valid token at the start of *String, it copies it into
        > Symbol (unless Symbol==NULL, in which case it doesn't copy it
        > anywhere); it makes *String point to the first character *after* the
        > token, and returns a token identifier (0 if no valid token is found).

        Lorenzo...

        No, I don't.

        From time to time I've found it advantageous to extract all
        tokens at once, then process them in a loop. You're welcome to
        look over http://www.iedu.com/mrd/c/tokenize.c for ideas.
        --
        Morris Dovey
        West Des Moines, Iowa USA
        C links at http://www.iedu.com/c
        Read my lips: The apple doesn't fall far from the tree.


        • Lorenzo J. Lucchini

          #5
          Re: Tokenizer interface

          Andreas Kahari <ak+usenet@freeshell.org> wrote in message news:<slrnbsf50n.nd3.ak+usenet@mx.freeshell.org>...
          > In article <1f24a1b2.0311281015.5485a775@posting.google.com>, Lorenzo J. Lucchini wrote:
          > > Do you see any counter-indication to a token extractor in the
          > > following form?
          >
          > Did you consider using strtok()?

          I did, but - besides not liking its interface very much, honestly - it
          doesn't really fulfill my needs.
          That's because I don't have one or more separator(s) that uniquely
          identify the beginning and end of a token.

          Take this expression:
          MyVar==0x1234
          (no, this is not trying to be a C expression)

          The token "MyVar" is terminated by the character '='; the token "==",
          however, is obviously not terminated by an '=' (even though "=" alone
          is also a token); then for the token "0x1234", the parser must be
          aware that the sequence of tokens "0", "x" and "1234" makes no sense,
          and so treat "0x1234" as a single token.

          Moreover, my tokenizer understands that, given the following set of
          tokens:
          "test", "dummy", "foobar"
          the expression
          "dfoot"
          represents the sequence of tokens "dummy", "foobar" and "test" (well,
          it only makes this 'translation' if the argument Completion is set to
          a non-zero value, but I haven't included that argument in the
          prototype I posted).
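          For illustration, that completion behaviour might be sketched as
          below. This is a hypothetical reconstruction, not the actual code
          (the real tokenizer is tree-based): grow a prefix of the input
          while it still matches at least one token, and succeed only if the
          final prefix selects exactly one token from the set.

```c
#include <string.h>

/* Hypothetical fixed token set, matching the example in the post. */
static const char *TokenSet[] = { "test", "dummy", "foobar" };
enum { NTOKENS = sizeof TokenSet / sizeof TokenSet[0] };

/* Consume the longest prefix of *s that is still a prefix of some
   token; succeed if that prefix identifies exactly one token. */
int match_completion(const char **s, const char **out_token)
{
    size_t k = 0;
    int count = 0, last = -1;
    for (;;) {
        int hits = 0, idx = -1, i;
        for (i = 0; i < NTOKENS; i++) {
            if (strlen(TokenSet[i]) >= k + 1 &&
                strncmp(TokenSet[i], *s, k + 1) == 0) {
                hits++;
                idx = i;
            }
        }
        if (hits == 0)
            break;                   /* cannot extend the prefix */
        k++;
        count = hits;
        last = idx;
        if ((*s)[k] == '\0')
            break;                   /* input exhausted */
    }
    if (k == 0 || count != 1)
        return 0;                    /* no match, or ambiguous */
    *out_token = TokenSet[last];
    *s += k;
    return 1;
}
```

          On "dfoot" this yields "dummy", "foobar", "test" in sequence, as
          described above.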

          by LjL
          ljlbox@tiscali.it


          • Lorenzo J. Lucchini

            #6
            Re: Tokenizer interface

            kaz@ashi.footprints.net (Kaz Kylheku) wrote in message news:<cf333042.0311281743.2a5cd902@posting.google.com>...
            > ljlbox@tiscalinet.it (Lorenzo J. Lucchini) wrote in message news:<1f24a1b2.0311281015.5485a775@posting.google.com>...
            > > Do you see any counter-indication to a token extractor in the
            > > following form?
            > >
            > > typedef int token;
            > > token ExtractToken(const char** String, char* Symbol);
            > >
            > > If it finds a valid token at the start of *String, it copies it into
            > > Symbol (unless Symbol==NULL, in which case it doesn't copy it
            >
            > It's a terrible interface, because it's not obvious how much space the
            > second argument points to! What if the lexeme is too long?

            Well, but you can always make Symbol long enough.
            First, Symbol will never exceed the length of String, so
            Symbol=malloc(strlen(*String)+1) should always work (unless the
            malloc() fails, that is).
            Second, you could go to greater lengths and save some more space by
            making Symbol big enough to contain the longest lexeme in your set of
            valid tokens.
            I could even add a function GetMaxTokenLength() to make this easier.

            Anyway, even if I simply make Symbol big enough to contain the whole
            String, I won't be wasting much space, since I almost never need to
            keep a lexeme stored while I'm extracting the next - what I usually
            have to keep is the value returned by ExtractToken(), which identifies
            the extracted token for me, not the actual lexeme in its string form.
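            A GetMaxTokenLength() of that sort could be as simple as the
            sketch below; the token set here is a hypothetical stand-in for
            whatever CreateToken() has registered:

```c
#include <string.h>

/* Hypothetical token set; in the real program CreateToken() would
   have built this up. */
static const char *TokenSet[] = { "+", "==", "$AF", "$HL" };

/* The longest lexeme ExtractToken() can ever produce is bounded by
   the longest lexeme registered in the set. */
size_t GetMaxTokenLength(void)
{
    size_t max = 0, n, i;
    for (i = 0; i < sizeof TokenSet / sizeof TokenSet[0]; i++) {
        n = strlen(TokenSet[i]);
        if (n > max)
            max = n;
    }
    return max;
}
```

            A buffer of GetMaxTokenLength() + 1 bytes can then be allocated
            once and reused for every extraction.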
            > There are several alternatives.
            >
            > 1. Allow the input string to be destroyed. The tokenizer can then
            > insert a terminating null and return a pointer to an address
            > within the original string:
            >
            > token ExtractToken(char **inputstring, char **outputptr);

            But then I would have to store the original inputstring somewhere
            else, thus using the same amount of memory as if I had made Symbol
            big enough to store String / inputstring.
            > [snip point 2, dynamic allocation]
            >
            > 3. Provide a buffer for the lexeme, and specify the size:
            >
            > token ExtractToken(char **inputstring,
            > char *outputbuffer, size_t buffersize);

            I could do this, but it feels a bit redundant, since I [can] know in
            advance the length of the longest lexeme that ExtractToken() can give
            me.
            >
            > 4. Create an abstract data type representing tokens, and return that:
            >
            > typedef struct token {
            > char *lexeme;
            > int code;
            > };
            >
            > token *token_extract(char **inputstring)
            > [snip]
            >
            > void token_free(token *t)
            > [snip]

            This, as well as what you propose in 2., could be done, but it seems a
            bit overkill to do it for every lexeme you are extracting from a
            string, since the lexemes - while not being all of exactly the same
            length - are all going to be of "about" the same length.

            And after all, many functions in the standard library expect a
            pre-allocated buffer that they can fill, and they don't always expect
            an argument telling them how many items can be put in the buffer - at
            least not when the worst-case size the buffer needs to be can be
            computed in advance -, do they?

            By the way, thank you for "lexeme". I needed a way to distinguish it
            from "token" ("token" = something good enough to uniquely identify a
            lexeme within a given set of lexemes). "Symbol" was stupid, and
            besides, I was already using the term "symbol" for something else
            (linker global symbols), and that *was* confusing.

            by LjL
            ljlbox@tiscali.it


            • Lorenzo J. Lucchini

              #7
              Re: Tokenizer interface

              Morris Dovey <mrdovey@iedu.com> wrote in message news:<4ZSxb.693$4X.97958@news.uswest.net>...
              > Lorenzo J. Lucchini wrote:
              > > Do you see any counter-indication to a token extractor in the
              > > following form?
              > >
              > > typedef int token;
              > > token ExtractToken(const char** String, char* Symbol);
              > >
              > > If it finds a valid token at the start of *String, it copies it into
              > > Symbol (unless Symbol==NULL, in which case it doesn't copy it
              > > anywhere); it makes *String point to the first character *after* the
              > > token, and returns a token identifier (0 if no valid token is found).
              >
              > Lorenzo...
              >
              > No, I don't.
              >
              > From time to time I've found it advantageous to extract all
              > tokens at once, then process them in a loop. You're welcome to
              > look over http://www.iedu.com/mrd/c/tokenize.c for ideas.

              It looks like your code would work with the ASCII character set, but
              not necessarily with other sets.
              Why don't you use the is*() functions in ctype.h instead of testing if
              the value of a char falls into a specific range?

              For my purposes, I prefer to scan tokens one by one. I've made my
              ExtractToken() work the way it works precisely to find a compromise
              between the clumsy (IMHO) syntax of strtok() and extracting all tokens
              in one pass.

              Before writing the current set of functions of which ExtractToken()
              is part, however, I had written a function set that worked a bit
              like your tokenize.c, in the sense that it used a pointer to a
              function deciding whether a character was or wasn't a delimiter -
              and it also used another function pointer to decide whether two
              characters were the same (mainly to parse tokens in a
              case-insensitive environment).

              This latter feature stopped working when I added hashing (which was
              the main reason I wrote those functions in the first place). Of
              course, for the specific case of dealing with case-insensitivity, I
              could have simply hashed everything in either upper or lowercase -
              but this would have defeated the purpose of using a function pointer
              to generically handle "characters that are equivalent".

              I ended up dumping the whole thing for these reasons, but mainly
              because a tokenizer based on the concept of a "delimiter" didn't
              suit my purposes anymore.
              More precisely, there are two contexts in the program I'm writing
              where I need "tokens" - one would be fine with separator-delimited
              token handling, but the other (an expression parser, please refer
              to my reply to Andreas Kahari) wouldn't. Thus I decided to write
              something that would accommodate both.

              You can look at how my original token extractor/parser worked in
              the Z80Sim project (a Zilog Z80 simulator, debugger and profiler
              tailored for ZX Spectrum development, but generic enough to be
              used with other platforms).

              Look at versions 1.1 of both. Subsequent versions contain the new
              tokenizer (the one using ExtractToken()), which uses a tree-based
              algorithm I invented for the purpose - I don't know if it is an
              algorithm actually studied and used for purposes of parsing, but it
              Works For Me, apparently.

              Please feel free to criticize both programs, but keep in mind that
              the old tokens.c was abandoned during development, and the new one
              is far from thoroughly tested (both in terms of algorithmic
              correctness and ANSI compliance).

              by LjL
              ljlbox@tiscali.it


              • Chris Torek

                #8
                Re: Tokenizer interface

                In article <cf333042.0311281743.2a5cd902@posting.google.com>
                Kaz Kylheku <kaz@ashi.footprints.net> writes:
                [other alternatives snipped; slight reformatting]
                >4. Create an abstract data type representing tokens, and return that:
                > typedef struct token {
                > char *lexeme;
                > int code;
                > };
                > token *token_extract(char **inputstring) { ... }

                For whatever it is worth, this is the alternative I would favor
                (despite the objections the original poster raises in a followup).

                One note: the "typedef" above is "naked" -- it does not define any
                aliases for "struct token". The subsequent "token *" declaration
                for token_extract() is thus a syntax error in C. My preferred fix
                for this is to avoid "typedef" altogether: just write "struct token"
                everywhere. If one finds this objectionable, however, my "backup
                preferred fix" is:

                typedef struct token token;
                struct token { ... };

                That is, if you must use typedefs to paper over C's real abstract
                types ("struct"), write the typedef alias first, then write the
                structure declaration. This way you can use the alias inside the
                structure declaration. This works for complicated, mutually-referential
                structures as well:

                /* Each object has a list of elements it contains, and each element
                   has a list of objects that refer to it. */
                typedef struct object Object;
                typedef struct element Element;

                struct object {
                    Object *o_next, *o_prev;   /* doubly linked list of objects */
                    Element *o_elements;       /* elements we contain */
                    Object *o_ethread;         /* threading list of objects */
                    ...
                };
                struct element {
                    Element *e_next;           /* singly linked list of elements */
                    Object *e_head;            /* head of object-level threading list */
                    int e_refcount;            /* number of objects using this list */
                    ...
                };

                Given the syntactic havoc that typedefs wreak, it is a good idea
                to have a "trick" for distinguishing typedef names from all other
                identifiers. In this case I used the "uppercase first letter"
                trick (although in practice, I prefer not to use the typedefs at
                all).

                (Note that, given some element e, we can find all objects that
                refer to e using:

                    for (o = e->e_head; o != NULL; o = o->o_ethread)
                        ...

                and given an object o, we can find all its elements with:

                    for (e = o->o_elements; e != NULL; e = e->e_next)
                        ...

                This does, however, assume that if objects o1 and o2 both use some
                element e, they both use exactly the same element-list. I made
                this example up on the spot and do not guarantee its usefulness.)
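                (A self-contained version of that example, with the lists built
                by hand purely to exercise the two traversal loops, might look
                like this - the construction is invented for illustration:)

```c
#include <stddef.h>

typedef struct object Object;
typedef struct element Element;

struct object {
    Object *o_next, *o_prev;   /* doubly linked list of objects (unused here) */
    Element *o_elements;       /* elements we contain */
    Object *o_ethread;         /* threading list of objects */
};
struct element {
    Element *e_next;           /* singly linked list of elements */
    Object *e_head;            /* head of object-level threading list */
    int e_refcount;            /* number of objects using this list */
};

/* Count the objects that refer to element e. */
int objects_using(const Element *e)
{
    int n = 0;
    const Object *o;
    for (o = e->e_head; o != NULL; o = o->o_ethread)
        n++;
    return n;
}

/* Count the elements contained in object o. */
int elements_of(const Object *o)
{
    int n = 0;
    const Element *e;
    for (e = o->o_elements; e != NULL; e = e->e_next)
        n++;
    return n;
}
```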
                --
                In-Real-Life: Chris Torek, Wind River Systems
                Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
                email: forget about it http://web.torek.net/torek/index.html
                Reading email is like searching for food in the garbage, thanks to spammers.


                • Lorenzo J. Lucchini

                  #9
                  Re: Tokenizer interface

                  qed@pobox.com (Paul Hsieh) wrote in message news:<796f488f.0311292202.6e736fd9@posting.google.com>...
                  > I'm not really a big fan of your approach. I would instead do it like
                  > this:
                  >
                  > int IterateTokens (const char * string,
                  > int (*cb) (void * parm, size_t offset, size_t len),
                  > void * parm);
                  >
                  > With the idea that the callback function cb would be called on each
                  > successive token. (The purpose of void * parm is to let you use a
                  > universal callback function, while retaining specific context to a
                  > particular string that you are tokenizing. I.e., IterateTokens just
                  > passes on the parm it was given to the cb function.) This allows you
                  > to be flexible about how you consume your tokens.

                  But see this expression:
                  $AF+MyVar+0x1234==0

                  "$AF" is a token in my token set, all is fine. So is "+".
                  "MyVar", though, is not a token: on it, my ExtractToken() function
                  returns 0 (synonym for "no [significant] token"), and Symbol becomes
                  "".
                  At this point, my parsing loop decides that what follows the "+" must
                  be a variable (which it treats like a "free form" string), and so it
                  calls ExtractFreeform(), which extracts everything up to the next
                  valid token it encounters ("+" in this case).

                  Now, I could make variables like "MyVar" tokens to avoid embarrassing
                  ExtractToken() and using this ExtractFreeform() function.
                  But besides not working well with my program, this approach couldn't
                  handle the next part of the expression: "0x1234" is not a token that
                  ExtractToken() will ever be able to extract, so I *have* to call
                  ExtractFreeform() here.

                  As I see it, your approach would choke on "non-tokens" that are
                  currently handled by a call to ExtractFreeform() following a failed
                  (return value == 0, Symbol is "") call to ExtractToken().
                  > For example, if cb returns with a negative number, or other "fail
                  > code" then you can use that to halt the iterator early (for example
                  > you might be able to detect that a syntax error has occurred prior
                  > to tokenizing the whole string.)

                  Or I could use it to simulate the behavior of ExtractFreeform().
                  But this would force the calling loop (which, with your approach,
                  probably wouldn't be a loop at all) to:
                  1) do ExtractFreeform()'s job in a way not consistent with how
                  IterateTokens() works (it would even have to compute where in the
                  string parsing has stopped, unless IterateTokens returns precisely
                  this information)
                  2) after parsing the free-form part of the expression, "resurrect"
                  IterateTokens to parse from after the location of the free-form stuff

                  This looks awfully complicated...
                  > You can also make the return
                  > value of IterateTokens just return the number of successful tokens
                  > that were passed to the callback, or a negative number to indicate
                  > some kind of failure.
                  >
                  > It also might not be necessary to '\0' terminate the tokens as you
                  > encounter them. Certainly if you were using bstrlib
                  > (http://bstring.sf.net) you wouldn't want to. This gives you a way to
                  > perform "tokenization by reference" if it's possible for you to avoid
                  > copying the tokens for further processing (for example if you have a
                  > totally incremental parsing algorithm.)

                  I *do* already "tokenize by reference", in a sense.
                  Most of the time, ExtractToken() is called with Symbol==NULL, i.e. it
                  doesn't copy the token anywhere. All that is needed is the int value
                  ExtractToken() returns: this value uniquely identifies the token most
                  of the time, unless it is 0, in which case the parser does have to
                  check Symbol to see if it encountered a "" (end of string / unparsable
                  token) or a " " (space, or other separator, whose token id is 0 as
                  well).

                  Look at this code:

                      Operator=ExtractToken(OpTokens, Expression, NULL);
                      if(Operator==OP_NONE) {
                          if(strlen(*Expression)==0) Operator=OP_END; else {
                              Register=ExtractToken(RegTokens, Expression, NULL);
                              if(Register==REG_NONE) {
                                  char *Symbol=malloc(strlen(*Expression)+1);
                                  ExtractFreeform(OpTokens, Expression, Symbol);
                                  /* ... */
                  /* This is taken from a part of my program, but things not relevant
                     to the discussion have been stripped out. */

                  OpTokens and RegTokens are handles to different tokenization contexts;
                  for simplicity, I omitted this part in the declaration I posted.
                  OpTokens' tokens are operators like "+" and "-", while RegTokens'
                  tokens are register specifiers like "$AF" and "$HL".

                  The algorithm basically goes:

                  1) Extract next token. If it's an operator, fine, otherwise
                  2) See if it's a register specification. If it's not even that, then
                  3) It must be a variable (numeric literals omitted from the code
                  above)

                  You can see that, in 1) and 2), the extracted lexeme (Symbol) isn't
                  copied anywhere, because such a copy is not needed to identify the
                  token.
                  Only when a token can *not* be extracted do we need to analyze an
                  actual string (which is extracted by ExtractFreeform(), not even
                  ExtractToken()).

                  This is not the only place or the only way in my program where I use
                  the tokenization function, but it should give a picture.
                  > If you actually need '\0'
                  > terminated strings without damaging the input string, then doing a
                  > precisely sized malloc, memcpy, and storing a terminating '\0' in your
                  > callback function is trivial and will do the trick

                  Not if IterateTokens() doesn't *know* where the token ends.
                  > -- it remains for
                  > you to then design a stack or list or vector or whatever you like to
                  > retain all these strings (all the work being done in the callback
                  > function).
                  >
                  > The point is, that with a mechanism like this, you are flexible enough
                  > to employ any policy of how you want your results from the process of
                  > tokenization. It also separates the process of actual parsing from
                  > the process of dealing with the results.

                  Yours is an interesting approach that can - I think - be most useful
                  in situations where the lexeme, I mean the token in its string form,
                  must often be copied, because an integral value (my "token id") cannot
                  identify it uniquely enough in an easy way. Even more useful perhaps
                  when all or most of the parsed tokens must be kept (in string form) in
                  a list somewhere.
                  In the latter case, your function would behave similarly to Morris
                  Dovey's function, except that the actual work of putting the extracted
                  tokens in a vector is done by the callback function in your approach,
                  instead of being embedded in the iterating token extractor.

                  But you haven't convinced me that your approach is superior in my
                  case. Using a callback function would be nice to eliminate the outer
                  parsing loop and make the code concentrate on analysing tokens singly;
                  however, the outer loop would eventually still be there, together with
                  other strange creatures, to handle the equivalent of
                  ExtractFreeform().

                  In addition, I also use my tokenizer to parse simple interactive
                  commands like "run" or "info stack", besides interpreting more complex
                  expressions like the example I made above. For this task, it seems to
                  me that the (now fairly simple) structure of my interactive command
                  interpreter would be sort of obfuscated by introducing a callback
                  function and somewhere to keep parsed tokens (or, alternatively, a
                  *complicated* callback function without having to keep parsed tokens
                  somewhere).


                  P.S. A question: for such a short signature as the one below, would
                  you find it necessary to use a standard signature marker?


                  by LjL
                  ljlbox@tiscali.it


                  • Paul Hsieh

                    #10
                    Re: Tokenizer interface

                    ljlbox@tiscalinet.it (Lorenzo J. Lucchini) wrote in message news:<1f24a1b2.0311300628.64775f78@posting.google.com>...
                    > qed@pobox.com (Paul Hsieh) wrote in message news:<796f488f.0311292202.6e736fd9@posting.google.com>...
                    > > I'm not really a big fan of your approach. I would instead do it like
                    > > this:
                    > >
                    > > int IterateTokens (const char * string,
                    > > int (*cb) (void * parm, size_t offset, size_t len),
                    > > void * parm);
                    > >
                    > > With the idea that the callback function cb would be called on each
                    > > successive token. (The purpose of void * parm is to let you use a
                    > > universal callback function, while retaining specific context to a
                    > > particular string that you are tokenizing. I.e., IterateTokens just
                    > > passes on the parm it was given to the cb function.) This allows you
                    > > to be flexible about how you consume your tokens.
                    >
                    > But see this expression:
                    > $AF+MyVar+0x1234==0
                    >
                    > "$AF" is a token in my token set, all is fine. So is "+".
                    > "MyVar", though, is not a token: on it, my ExtractToken() function
                    > returns 0 (synonym for "no [significant] token"), and Symbol becomes
                    > "".

                    Huh? Then how do you continue to parse? I assume you use some other logic
                    to match the successive characters to a previously defined variable? But in
                    that case, don't you need a grammar for your variables? And does your
                    variable matching algorithm correctly differentiate between "My" and "MyVar"?
                    > At this point, my parsing loop decides that what follows the "+" must
                    > be a variable (which it treats like a "free form" string), and so it
                    > calls ExtractFreeform(), which extracts everything up to the next
                    > valid token it encounters ("+" in this case).

                    Ok, well I am use a more "classical" definition of token in which all
                    successive sub-portions are considered tokens, which are then given
                    differentiating attributes describing what they are. So let's call my kind of
                    tokens "meta-tokens" just to avoid confusion (so you could rename my function
                    IterateMetaToke ns().)

                    The idea then would be that the callback function would first look into the
                    string and decide whether it's one of what *you* call tokens, or something else
                    (a variable I suppose?) This logic would be analogous to whatever logic you
                    used to decide that a ... uh ... *symbol* becomes "". Then you would branch
                    into two cases -- one which parses your "tokens", and another which calls
                    a variation on ExtractFreeform(), I suppose.

                    So IterateMetaTokens() would never return something analogous to "" (i.e.,
                    setting the len to 0 in the callback), which assumes that an empty string is
                    never valid in your grammar. So you would have put the simple logic of
                    scanning ahead the length of a variable ("Freeform"?) when a "nontoken" was
                    detected into IterateMetaTokens().
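                    A minimal sketch of what such an IterateMetaTokens() might look like follows. The operator set "+-*=" and all names other than the callback signature quoted earlier in this thread are illustrative assumptions, not code from either poster:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of IterateMetaTokens: every run of characters is reported to the
   callback as a meta-token -- a single operator character as a one-character
   token, anything between operators as one "freeform" run.  The operator
   set "+-*=" is an assumption for illustration only. */
static int IterateMetaTokens(const char *string,
                             int (*cb)(void *parm, size_t offset, size_t len),
                             void *parm) {
    size_t i = 0, n = strlen(string);
    int count = 0;
    while (i < n) {
        size_t start = i;
        if (strchr("+-*=", string[i])) {
            i++;                        /* single-character operator token */
        } else {
            while (i < n && !strchr("+-*=", string[i]))
                i++;                    /* freeform run up to next operator */
        }
        if (cb(parm, start, i - start) < 0)
            return -1;                  /* callback aborted the iteration */
        count++;
    }
    return count;                       /* number of meta-tokens emitted */
}

/* Trivial callback that just counts meta-tokens via the cookie pointer. */
static int countTokens(void *parm, size_t offset, size_t len) {
    (void)offset; (void)len;
    ++*(int *)parm;
    return 0;
}

int main(void) {
    int seen = 0;
    /* "$AF+MyVar+0x1234==0" -> $AF, +, MyVar, +, 0x1234, =, =, 0 */
    int r = IterateMetaTokens("$AF+MyVar+0x1234==0", countTokens, &seen);
    assert(r == 8 && seen == 8);
    return 0;
}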
                    [color=blue]
                    > Now, I could make variables like "MyVar" tokens to avoid embarrassing
                    > ExtractToken() and using this ExtractFreeform() function.[/color]

                    Well, "MyVar" would be a meta-token. Your callback would then decide how it
                    is further sub-classified.
                    [color=blue]
                    > But besides not working well with my program, this approach couldn't
                    > handle the next part of the expression: "0x1234" is not a token that
                    > ExtractToken() will ever be able to extract, so I *have* to call
                    > ExtractFreeform() here.[/color]

                    I don't understand. What makes you think "0x1234" couldn't be parsed into
                    a meta-token in the IterateMetaTokens() function?
                    [color=blue]
                    > As I see it, your approach would choke on "non-tokens" that are
                    > currently handled by a call to ExtractFreeform() following a failed
                    > (return value == 0, Symbol is "") call to ExtractToken().[/color]

                    No -- if a "token"-wise parsing failed within IterateMetaTokens() you would
                    immediately try to extract a free-form token. The role of the callback
                    function would be to track the semantics of whatever it is that it received.
                    [color=blue][color=green]
                    > > For example, if cb returns with a negative number, or other "fail
                    > > code" then you can use that to halt the iterator early (for example
                    > > you might be able to detect that a syntax error has occurred prior to
                    > > tokenizing the whole string.)[/color]
                    >
                    > Or I could use it to simulate the behavior of ExtractFreeform().
                    > But this would force the calling loop (which, with your approach,
                    > wouldn't probably be a loop at all) to:
                    > 1) do ExtractFreeform()'s job in a way not consistent with how
                    > IterateTokens() works (it would even have to compute where in the
                    > string parsing has stopped, unless IterateTokens returns precisely
                    > this information)[/color]

                    IterateMetaTokens would never *modify* anything (take notice of the const in
                    the parameter list) so it's not really relevant that it retain its place after
                    returning, whether it failed or not. You can still know the last successful
                    point of parsing before it failed by tracking the last offset + len passed to
                    the callback function. You can store this info in an arbitrary struct which
                    is pointed to by the cookie variable: parm.
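                    A small sketch of that cookie idea, with the struct and callback names invented here for illustration (the callback is driven by hand as a stand-in for the iterator):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cookie struct threaded through the parm pointer so the
   caller can recover the last successfully parsed position even when the
   iterator stops early.  All names here are illustrative. */
struct ParseState {
    const char *input;   /* the string being tokenized */
    size_t last_offset;  /* offset of the last token seen */
    size_t last_len;     /* length of the last token seen */
};

static int trackingCb(void *parm, size_t offset, size_t len) {
    struct ParseState *st = parm;
    st->last_offset = offset;
    st->last_len = len;
    /* Abort (simulating a syntax error) when a token starts with '?'. */
    return st->input[offset] == '?' ? -1 : 0;
}

int main(void) {
    struct ParseState st = { "ab?cd", 0, 0 };
    /* Drive the callback by hand, one character per token, as a stand-in
       for the iterator. */
    for (size_t i = 0; i < strlen(st.input); i++)
        if (trackingCb(&st, i, 1) < 0)
            break;
    /* Parsing stopped at the '?' token; its position is recoverable. */
    assert(st.last_offset == 2 && st.last_len == 1);
    return 0;
}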
                    [color=blue]
                    > 2) after parsing the free-form part of the expression, "resurrect"
                    > IterateTokens to parse from after the location of the free-form stuff
                    >
                    > This looks awfully complicated...[/color]

                    Well, parsing in general is complicated no matter what you do. There are
                    entire books written about it. The method I am describing to you follows the
                    general notions of a typical tokenizable grammar.
                    [color=blue][color=green]
                    > > You can also make the return
                    > > value of IterateTokens just return the number of successful tokens
                    > > that were passed to the callback, or a negative number to indicate
                    > > some kind of failure.
                    > >
                    > > It also might not be necessary to '\0' terminate the tokens as you
                    > > encounter them. Certainly if you were using bstrlib
                    > > (http://bstring.sf.net) you wouldn't want to. This gives you a way to
                    > > perform "tokenization by reference" if it's possible for you to avoid
                    > > copying the tokens for further processing (for example if you have a
                    > > totally incremental parsing algorithm.)[/color]
                    >
                    > I *do* already "tokenize by reference", in a sense.
                    > Most of the time, ExtractToken() is called with Symbol==NULL, i.e. it
                    > doesn't copy the token anywhere. All that is needed is the int value
                    > ExtractToken() returns: this value uniquely identifies the token most
                    > of the time, unless it is 0, in which case the parser does have to
                    > check Symbol to see if it encountered a "" (end of string / unparsable
                    > token) or a " " (space, or other separator, whose token id is 0, as
                    > well).[/color]

                    Yes, because most of your grammar metaTokens are 1 character in length, right?
                    Well, that just means that in my solution most of the time len will be equal to
                    1, in which case you do a simple character test (a switch with a default) to
                    figure out if it's a "token" versus a variable.
                    [color=blue]
                    > Look at this code:
                    >
                    > Operator=ExtractToken(OpTokens, Expression, NULL);
                    > if(Operator==OP_NONE) {
                    > if(strlen(*Expression)==0) Operator=OP_END; else {
                    > Register=ExtractToken(RegTokens, Expression, NULL);
                    > if(Register==REG_NONE) {
                    > char *Symbol=malloc(strlen(*Expression)+1);
                    > ExtractFreeform(OpTokens, Expression, Symbol);
                    > /* ... */
                    > /* This is taken from a part of my program, but things not relevant
                    > to the discussion have been stripped out. */[/color]

                    Well the call-back with my design would look something like:

                    int categorizeMetaToken (void * parm, size_t offset, size_t len) {
                        if (len == 1) switch (((char *)parm)[offset]) {
                            /* A token has been received? */
                            case '+': /* ... etc */ ; return 0;
                            case '*': /* ... etc */ ; return 0;
                            /* ... etc */
                            /* Nope! It's just a 1-character long variable */
                            default : break;
                        }
                        {
                            struct tagbstring Symbol;
                            blk2tbstr (Symbol, (char *)parm + offset, len);
                            /* Symbol now contains a tagbstring of the variable/symbol */
                            /* ... etc */
                            return 0; /* Success */
                        }
                        /* (Unreachable as written; real error paths above would return -1.) */
                        return -1; /* Some kind of error. */
                    }

                    If it bothers you that you have to determine that something is a one-character
                    token twice (once in the outer loop, and once in the callback function) you
                    could pass an additional parameter to the callback -- something like "int
                    category" or "int type". That way you wouldn't need a "default" case, and each
                    case of the switch could break, then go to a common "return 0" exit point.
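                    That variant might look like the following sketch, where the extra category argument and the CAT_* values are assumptions made up for illustration (the iterator would pre-classify each meta-token before calling back):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant: the iterator pre-classifies each meta-token and
   passes an extra "category" argument, so the callback needn't re-test the
   first character.  Category values are assumptions for illustration. */
enum { CAT_OPERATOR, CAT_FREEFORM };

static int categorizedCb(void *parm, size_t offset, size_t len, int category) {
    int *ops = parm;
    (void)offset; (void)len;
    if (category == CAT_OPERATOR)
        ++*ops;            /* count operators; freeform handled elsewhere */
    return 0;              /* success */
}

int main(void) {
    int ops = 0;
    /* Simulate the iterator classifying "$AF", "+", "MyVar": */
    categorizedCb(&ops, 0, 3, CAT_FREEFORM);
    categorizedCb(&ops, 3, 1, CAT_OPERATOR);
    categorizedCb(&ops, 4, 5, CAT_FREEFORM);
    assert(ops == 1);
    return 0;
}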
                    [color=blue]
                    > OpTokens and RegTokens are handles to different tokenization contexts;
                    > for simplicity, I omitted this part in the declaration I posted.
                    > OpTokens' tokens are operators like "+" and "-", while RegTokens'
                    > tokens are register specifiers like "$AF" and "$HL".
                    >
                    > The algorithm basically goes:
                    >
                    > 1) Extract next token. If it's an operator, fine, otherwise
                    > 2) See if it's a register specification. If it's not even that, then
                    > 3) It must be a variable (numeric literals omitted from the code
                    > above)
                    >
                    > You can see that, in 1) and 2), the extracted lexeme (Symbol) isn't
                    > copied anywhere, because such a copy is not needed to identify the
                    > token.
                    > Only when a token can *not* be extracted do we need to analyze an
                    > actual string (which is extracted by ExtractFreeform(), not even
                    > ExtractToken()).[/color]

                    Right -- but notice how in my solution no mallocs or copies are made *no
                    matter what*. But it assumes that you also use "the better string library" to
                    be able to achieve this.
                    [color=blue]
                    > This is not the only place or the only way in my program where I use
                    > the tokenization function, but it should give a picture.
                    >[color=green]
                    > > If you actually need '\0'
                    > > terminated strings without damaging the input string, then doing a
                    > > precisely sized malloc, memcpy, and storing a terminating '\0' in your
                    > > callback function is trivial and will do the trick[/color]
                    >
                    > Not if IterateTokens() doesn't *know* where the token ends.[/color]

                    Well, the idea would be that it *should* know. But I am making classical
                    parsing algorithm assumptions. Though I cannot quite see how your conditions
                    would fail such an approach.
                    [color=blue][color=green]
                    > > -- it remains for
                    > > you to then design a stack or list or vector or whatever you like to
                    > > retain all these strings (all the work being done in the callback
                    > > function).[/color]
                    >[color=green]
                    > > The point is, that with a mechanism like this, you are flexible enough
                    > > to employ any policy of how you want your results from the process of
                    > > tokenization. It also seperates the process of actual parsing from
                    > > the process of dealing with the results.[/color]
                    >
                    > Yours is an interesting approach that can - I think - be most useful
                    > in situations where the lexeme, I mean the token in its string form,
                    > must often be copied, because an integral value (my "token id") cannot
                    > identify it uniquely enough in an easy way. Even more useful perhaps
                    > when all or most of the parsed tokens must be kept (in string form) in
                    > a list somewhere.
                    > In the latter case, your function would behave similarly to Morris
                    > Dovey's function, except that the actual work of putting the extracted
                    > tokens in a vector is done by the callback function in your approach,
                    > instead of being embedded in the iterating token extractor.
                    >
                    > But you haven't convinced me that your approach is superior in my
                    > case.[/color]

                    That's because I think we have different concepts of how parsing should be
                    done. In my case, I'm concerned that I don't know whether I want to follow a
                    strategy of storing the parsed metaTokens, or act on them as they are
                    consumed. Using a general callback mechanism like this allows me to change
                    that decision on a whim without modifying the code that does the actual
                    work of parsing metaTokens.

                    To me it's just a program maintenance and flexibility issue. Any strategy can
                    be *forced* to work. But if you find out you have to change some of your
                    strategy, which do you think will retain more code investment? A solution
                    which clearly separates the parsing of metaTokens from what these tokens
                    mean, or a solution which intermingles the two notions?
                    [color=blue]
                    > [...] Using a callback function would be nice to eliminate the outer
                    > parsing loop and make code concentrate on analysing tokens singularly;
                    > however, the outer loop would eventually still be there, together with
                    > other strange creatures, to handle the equivalent of
                    > ExtractFreeform().
                    >
                    > In addition, I also use my tokenizer to parse simple interactive
                    > commands like "run" or "info stack", besides interpreting more complex
                    > expressions like the example I made above.[/color]

                    Right -- so long as the metaTokenization semantics are the same, one can
                    simply plug in a different callback for the different contexts (in your case
                    top-level command line versus in-code, presumably.)
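                    To illustrate plugging different callbacks into one iterator for different contexts, here is a toy sketch; the word-splitting iterator and both callbacks are stand-ins invented here, not code from either poster:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in iterator reporting whitespace-separated words. */
static int IterateWords(const char *s,
                        int (*cb)(void *parm, size_t offset, size_t len),
                        void *parm) {
    size_t i = 0, n = strlen(s);
    while (i < n) {
        while (i < n && s[i] == ' ') i++;    /* skip separators */
        size_t start = i;
        while (i < n && s[i] != ' ') i++;    /* consume one word */
        if (i > start && cb(parm, start, i - start) < 0)
            return -1;                       /* callback halted iteration */
    }
    return 0;
}

/* Expression-context callback: consume every token. */
static int countCb(void *parm, size_t offset, size_t len) {
    (void)offset; (void)len;
    ++*(int *)parm;
    return 0;
}

/* Command-context callback: stop after the command verb. */
static int firstOnlyCb(void *parm, size_t offset, size_t len) {
    (void)offset; (void)len;
    ++*(int *)parm;
    return -1;
}

int main(void) {
    int all = 0, first = 0;
    IterateWords("info stack", countCb, &all);       /* sees both words */
    IterateWords("info stack", firstOnlyCb, &first); /* sees only "info" */
    assert(all == 2 && first == 1);
    return 0;
}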
                    [color=blue]
                    > [...] For this task, it seems to
                    > me that the (now fairly simple) structure of my interactive command
                    > interpreter would be sort of obfuscated by introducing a callback
                    > function and somewhere to keep parsed tokens (or, alternatively, a
                    > *complicated* callback function without having to keep parsed tokens
                    > somewhere).
                    >
                    > P.S. A question: for such a short signature as the one below, would
                    > you find it necessary to use a standard signature marker?[/color]

                    No. In fact I don't use a standard signature file or anything like that. I
                    actually type it by hand every time. That's because google doesn't have any
                    provisions for signatures that I can tell. Also I sometimes change my
                    signature if I believe it's appropriate for the context of the discussion.

                    --
                    Paul Hsieh


                    Comment

                    • Paul Hsieh

                      #11
                      Re: Tokenizer interface

                      ljlbox@tiscalin et.it (Lorenzo J. Lucchini) wrote:[color=blue]
                      > qed@pobox.com (Paul Hsieh) wrote:[color=green]
                      > > I'm not really a big fan of your approach. I would instead do it like
                      > > this:
                      > >
                      > > int IterateTokens (const char * string,
                      > > int (*cb) (void * parm, size_t offset, size_t len),
                      > > void * parm);
                      > >
                      > > With the idea that the callback function cb would be called on each
                      > > successive token. (The purpose of void * parm is to let you use a
                      > > universal callback function, while retaining specific context to a
                      > > particular string that you are tokenizing. I.e., IterateTokens just
                      > > passes on the parm it was given to the cb function.) This allows you
                      > > to be flexible about how you consume your tokens.[/color]
                      >
                      > But see this expression:
                      > $AF+MyVar+0x123 4==0
                      >
                      > "$AF" is a token in my parsing token, all is fine. So is "+".
                      > "MyVar", though, is not a token: on it, my ExtractToken() function
                      > returns 0 (synonym for "no [significant] token"), and Symbol becomes
                      > "".[/color]

                      Huh? Then how do you continue to parse? I assume you use some other logic
                      to match the successive characters to a previously defined variable? But in
                      that case, don't you need a grammar for your variables? And does your
                      variable matching algorithm correctly differentiate between "My" and "MyVar"?
                      [color=blue]
                      > At this point, my parsing loop decides that what follows the "+" must
                      > be a variable (which it treats like a "free form" string), and so it
                      > calls ExtractFreeform (), which extracts everything up to the next
                      > valid token it encounters ("+" in this case).[/color]

                      Ok, well I am use a more "classical" definition of token in which all
                      successive sub-portions are considered tokens, which are then given
                      differentiating attributes describing what they are. So let's call my kind of
                      tokens "meta-tokens" just to avoid confusion (so you could rename my function
                      IterateMetaToke ns().)

                      The idea then would be that the callback function would first look into the
                      string and decide whether its one of what *you* call tokens, or something else
                      (a variable I suppose?) This logic would be analogous to whatever logic you
                      used to decide that a ... uh ... *symbol* becomes "". Then you would branch
                      into two cases -- one which parses your "tokens", and another which calls
                      a variation on ExtractFreeform (), I suppose.

                      So IterateMetaToke ns() would never return something analogous to "" (i.e.,
                      setting the len to 0 in the callback) which assumes that an empty string is
                      never valid in your gramar. So you would have put in the simple logic of
                      scanning ahead the length of a variable ("Freeform"? ) when a "nontoken" was
                      detected into IterateMetaToke ns().
                      [color=blue]
                      > Now, I could make variables like "MyVar" tokens to avoid embarassing
                      > ExtractToken() and using this ExtractFreeform () function.[/color]

                      Well, "MyVar" would be a meta-token. Your callback would then decide how it
                      further sub-classified.
                      [color=blue]
                      > But besides now working well with my program, this approach couldn't
                      > handle the next part of the expression: "0x1234" is not a token that
                      > ExtractToken() will ever be able to extract, so I *have* to call
                      > ExtractFreeform () here.[/color]

                      I don't understand. What makes you think "0x1234" couldn't be parsed into
                      a meta token in the IterateMetaToke n() function?
                      [color=blue]
                      > As I see it, your approach would choke on "non-tokens" that are
                      > currently handled by a call to ExtractFreeform () following a failed
                      > (return value == 0, Symbol is "") call to ExtractToken().[/color]

                      No -- if a "token"-wise parsing failed within IterateMetaToke n() you would
                      immediately try to extract a free form token. The role of the callback
                      function would be to track the semantics of whatever it is that it received.
                      [color=blue][color=green]
                      > > For example, if cb returns with a negative number, or other "fail
                      > > code" then you can use that to halt the iterator early (for example
                      > > you might be able to detect that a syntax error has occurred prior to
                      > > performing tokenizing the whole string.)[/color]
                      >
                      > Or I could use it to simulate the behavior of ExtractFreeform ().
                      > But this would force the calling loop (which, with your approach,
                      > wouldn't probably be a loop at all) to:
                      > 1) do ExtractFreeform ()'s job in a way not consistent with how
                      > IterateTokens() works (it would even have to compute where in the
                      > string parsing has stopped, unless IterateTokens returns precisely
                      > this information)[/color]

                      IterateMetaToke ns would never *modify* anything (take notice of the const in
                      the parameter list) so its not really relevant that it retain its place after
                      returning, whether it failed or not. You can still know the last successful
                      point of parsing before it failed by tracking the last offset + len passed to
                      the callback function. You can store this info in an arbitrary struct which
                      is pointed to by the cookie variable: parm.
                      [color=blue]
                      > 2) after parsing the free-form part of the expression, "resurrect"
                      > IterateTokens to parse from after the location of the free-form stuff
                      >
                      > This looks awfully complicated...[/color]

                      Well, parsing in general is complicated no matter what you do. There are
                      entire books written about it. The method I am describing to you follows the
                      general notions of a typical tokenizable grammar.
                      [color=blue][color=green]
                      > > You can also make the return
                      > > value of IterateTokens just return the number of successful tokens
                      > > that were passed to the callback, or a negative number to indicate
                      > > some kind of failure.
                      > >
                      > > It also might not be necessary to '\0' terminate the tokens as you
                      > > encounter them. Certainly if you were using bstrlib
                      > > (http://bstring.sf.net) you wouldn't want to. This gives you a way to
                      > > perform "tokenizati on by reference" if its possible for you to avoid
                      > > copying the tokens for further processing (for example if you have a
                      > > totally incremental parsing algorithm.)[/color]
                      >
                      > I *do* already "tokenize by reference", in a sense.
                      > Most of the times, ExtractToken() is called with Symbol==NULL, i.e. it
                      > doesn't copy the token anywhere. All that is needed is the int value
                      > ExtractToken() returns: this value uniquely identifies the token most
                      > of the times, unless it is 0, in which case the parser does have to
                      > check Symbol to see if it encountered a "" (end of string / unparsable
                      > token) or a " " (space, or other separator, whose token id is 0, as
                      > well).[/color]

                      Yes, because most of your grammar metaTokens are 1 character in length right?
                      Well that just means that in my solution most of the time len will be equal to
                      1, in which case you do a simple character test (a switch with a default) to
                      figure out if its "token" versus a variable.
                      [color=blue]
                      > Look at this code:
                      >
                      > Operator=Extrac tToken(OpTokens , Expression, NULL);
                      > if(Operator==OP _NONE) {
                      > if(strlen(*Expr ession)==0) Operator=OP_END ; else {
                      > Register=Extrac tToken(RegToken s, Expression, NULL);
                      > if(Register==RE G_NONE) {
                      > char *Symbol=malloc( strlen(*Express ion)+1);
                      > ExtractFreeform (OpTokens, Expression, Symbol);
                      > /* ... */
                      > /* This is taken from a part of my program, but things not relevant
                      > to the discussion have been stripped out. */[/color]

                      Well the call-back with my design would look something like:

                      int categorizeMetaT oken (void * parm, size_t offset, size_t len) {
                      if (len == 1) switch (((char *)parm)[offset]) {
                      /* A token has been received? */
                      case '+': /* ... etc */ ; return 0;
                      case '*': /* ... etc */ ; return 0;
                      /* ... etc */
                      /* Nope! Its just a 1-character long variable */
                      default : break;
                      }
                      {
                      struct tagbstring Symbol;
                      blk2tbstr (Symbol, (char *)parm + offset, len);
                      /* Symbol now contains a tabstring of the variable/symbol */
                      /* ... etc */
                      return 0; /* Success */
                      }
                      return -1; /* Some kind of error. */
                      }

                      If it bothers you that you have determine that something is a one character
                      token twice (once in the outer loop, and once in the callback function) you
                      could pass an additional parameter to the callback -- something like "int
                      category" or "int type". So that you wouldn't need a "default" case, and each
                      case of the switch could break, then go to a common "return 0" exit point.
                      [color=blue]
                      > OpTokens and RegTokens are handle to different tokenization contexts;
                      > for simplicity, I omitted this part in the declaration I posted.
                      > OpTokens' tokens are operators like "+" and "-", while RegToken's
                      > tokens are register specificators like "$AF" and "$HL".
                      >
                      > The algorithm basically goes:
                      >
                      > 1) Extract next token. If it's an operator, fine, otherwise
                      > 2) See if it's a register specification. If it's not even that, then
                      > 3) It must be a variable (numeric literals omitted from the code
                      > above)
                      >
                      > You can see that, in 1) and 2), the extracted lexeme (Symbol) isn't
                      > copied anywhere, because such a copy is not needed to identify the
                      > token.
                      > Only when a token can *not* be extracted do we need to analyze an
                      > actual string (which is extracted by ExtractFreeform (), not even
                      > ExtractToken()) .[/color]

                      Right -- but notice how in my solution no mallocs or copies are made *no
                      matter what*. But it assumes that you also use "the better string library" to
                      be able to achieve this.
                      [color=blue]
                      > This is not the only place or the only way in my program where I use
                      > the tokenization function, but it should give a picture.
                      >[color=green]
                      > > If you actually need '\0'
                      > > terminated strings without damaging the input string, then doing a
                      > > precisely sized malloc, memcpy, and storing a terminating '\0' in your
                      > > callback function is trivial and will do the trick[/color]
                      >
                      > Not if IterateTokens() doesn't *know* where the token ends.[/color]

                      Well, the idea would be that it *should* know. But I am making classical
                      parsing algorithm assumptions. Though I cannot quite see how your conditions
                      would fail such an approach.
                      [color=blue][color=green]
                      > > -- it remains for
                      > > you to then design a stack or list or vector or whatever you like to
                      > > retain all these strings (all the work being done in the callback
                      > > function).[/color]
                      >[color=green]
                      > > The point is, that with a mechanism like this, you are flexible enough
                      > > to employ any policy of how you want your results from the process of
                      > > tokenization. It also seperates the process of actual parsing from
                      > > the process of dealing with the results.[/color]
                      >
                      > Yours is an interesting approach that can - I think - be most useful
                      > is situations where the lexeme, I mean the token in its string form,
                      > must often be copied, because an integral value (my "token id") cannot
                      > identify it uniquely enough in an easy way. Even more useful perhaps
                      > when all or most of the parsed tokens must be kept (in string form) in
                      > a list somewhere.
                      > In the latter case, your function would behave similarly to Morris
                      > Dovey's function, except that the actual work of putting the extracted
                      > tokens in a vector is done by the callback function in your approach,
                      > instead of being embedded in the iterating token extractor.
                      >
                      > But you haven't convinced me that your approach is superior in my
                      > case.[/color]

                      That's because I think we have different concepts of how parsing should be
                      done. In my case, I don't yet know whether I want to follow a strategy of
                      storing the parsed metaTokens, or act on them as they are consumed. Using
                      a general callback mechanism like this allows me to change that decision
                      on a whim without modifying the code that does the actual work of parsing
                      metaTokens.

                      To me it's just a program maintenance and flexibility issue. Any strategy can
                      be *forced* to work. But if you find out you have to change some of your
                      strategy, which do you think will retain more code investment? A solution
                      which clearly separates the parsing of metaTokens from what those tokens
                      mean, or a solution which intermingles the two notions?
                      > [...] Using a callback function would be nice to eliminate the outer
                      > parsing loop and make code concentrate on analysing tokens singularly;
                      > however, the outer loop would eventually still be there, together with
                      > other strange creatures, to handle the equivalent of
                      > ExtractFreeform().
                      >
                      > In addition, I also use my tokenizer to parse simple interactive
                      > commands like "run" or "info stack", besides interpreting more complex
                      > expressions like the example I made above.

                      Right -- so long as the metaTokenization semantics are the same, one can
                      simply plug in a different callback for the different contexts (in your case
                      top-level command line versus in-code, presumably.)
                      > [...] For this task, it seems to
                      > me that the (now fairly simple) structure of my interactive command
                      > interpreter would be sort of obfuscated by introducing a callback
                      > function and somewhere to keep parsed tokens (or, alternatively, a
                      > *complicated* callback function without having to keep parsed tokens
                      > somewhere).
                      >
                      > P.S. A question: for such a short signature as the one below, would
                      > you find it necessary to use a standard signature marker?

                      No. In fact I don't use a standard signature file or anything like that. I
                      actually type it by hand every time. That's because Google doesn't have any
                      provisions for signatures, as far as I can tell. Also I sometimes change my
                      signature if I believe it's appropriate for the context of the discussion.

                      --
                      Paul Hsieh

