strncmp performance

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Arjan Kenter

    #16
    Re: strncmp performance



    Christian Bau wrote:[color=blue]
    > In article <db593abd.04012 61638.2542974@p osting.google.c om>,
    > pembed2003@yaho o.com (pembed2003) wrote:
    >
    >[color=green]
    >>Hi,
    >>I have an application where I use the strncmp function extensively to
    >>compare 2 string to see if they are the same or not. Does anyone know
    >>a faster replacement for strncmp? I notice there is a memncmp function
    >>which is very fast but it doesn't work the same like strncmp so I
    >>don't think I can use it. I also tried to write the string_equal
    >>function myself like:
    >>
    >>int string_equal(co nst char* s1,const char* s2){
    >> while(*s1 && *s2 && *s1 == *s2){[/color]
    >
    >
    > This definitely does to many comparisons. If *s1 == *s2 then *s1 and *s2
    > are either both zero, or non of them is zero, so it is absolutely
    > pointless to check both of them.
    >
    >
    >[color=green]
    >> s1++; s2++;
    >> }
    >> return *s1 == *s2;
    >>}
    >>
    >>but I don't get much performance gain. Any suggestion?[/color]
    >
    >
    > Instead of worrying about the performance of strncmp, you should ask
    > yourself why you make so many calls to it that it matters at all.
    >
    > What are you doing that requires millions of calls to strncmp?[/color]

    This is indeed the key question. I have had a case where I had a large
    number of names and had to repeatedly look for names in that list and
    came up with a very different solution: I made an index over the list
    and "matched" the names I was looking for. The point is, I didn't simply
    iterate over the list or make some tree search or hash function, but a
    finite state machine (FSM). For fixed strings such as names it is very easy
    to make a 'deterministic finite automaton' (DFA), a class of FSMs. What you
    do is, you create an FSM of which each state represents the substring matched
    so far from the beginning, the start state representing nothing matched yet.
    Each state that represents a complete string of the original list is marked
    an 'accept' state'. The matching algorithm is very simple: you start
    from the start state and for each character of the name you are looking
    for you find a transition to another state. If no transition is found the
    name is not in the list and the search terminates. Otherwise, you follow
    the transition to the next state and the process repeats until you don't
    find a transition for the current character or you are at the end of the
    name you are looking for. If the latter happens and you are in an accept
    state, you have matched the name.

    The advantage of this algorithm: the time complexity is O(n) where n is
    the length of the name you are looking for. You might be tempted to say
    it's O(n * t) where t is the number of transitions you have to follow at
    each state, but t is limited by the size |T| of the character set and thus
    we get O(n * |T|) = O(n).

    If you have to, say, find x names in a list of y names the total complexity
    becomes O(x * n) where n is now the average length of the x names, compared
    with O(x * y * n) for linear search (the factor n then stems from strcmp) or
    O(x * log y * n) for tree search. Hashing should have O(x * n) in the best
    case. You probably don't get any faster than an FSM.

    Building the FSM is about just as simple: for each name that you add to the
    FSM you follow the FSM built so far similarly to the matching algorithm
    described above. When you don't find a transition you add one to a state that
    you also add and you move to that state, from which point on the algorithm
    continues. The state you are in when you are at the end of the added string
    is marked 'accept'. You can probably figure out for yourself how you would
    remove names from the DFA, it's not a big deal.

    Note that this is a very simplistic form of DFA: in general they can be much
    more powerful and match patterns with repetitions (think of * in Unix file
    name patterns). These are called regular expressions. To learn how to construct
    such DFAs, read for instance 'Compilers: theory, techniques and tools' by Aho,
    Sethi and Ullman (the 'dragon book'). Or you can generate them with e.g. lex.

    Another note: you can use such an FSM as real index, pointing back to your
    list of names by turning the 'accept' field into a pointer to the corresponding
    list item (and NULL in states that are not accept states).

    Finally, if it has advantages, it probably has disadvantages as well. That's
    right: it consumes more memory and you had better take some care to do your
    memory allocation efficiently: allocate states and transitions 1024 at a time
    or some such.

    --
    ir. H.J.H.N. Kenter ^^
    Electronic Design & Tools oo ) Philips Research Labs
    Building WAY 3.23 =x= \ arjan.kenter@ph ilips.com
    Prof. Holstlaan 4 (WAY31) | \ tel. +31 40 27 45334
    5656 AA Eindhoven /|__ \ tfx. +31 40 27 44626
    The Netherlands (____)_/ http://www.kenter.demon.nl/

    Famous last words: Segmentation Fault (core dumped)

    Comment

    • Nick Landsberg

      #17
      Re: strncmp performance


      You probably missed my point -

      Comparing strings using any flavor of strcmp(), strncmp(),
      requires that both strings be "walked" until a mismatch
      occurs or until the end of the string or until "n" is reached.
      This is true whether it is written in C or assembler.

      If you have to call strncmp() thousands of times,
      of *course* you will be seeing that strncmp() takes the
      bulk of your time when you profile the code.

      My point was rather than trying to optimize strncmp(),
      you should try to minimize the calls to strncmp() by being
      more clever in your use of data structures. For example,
      if the list of KEYS in the KEY/value pairs can be sorted
      a priori, then the use of a binary search algorithm
      within the list, rather than a linear search, can drastically
      reduce the number of times strncmp() will be called.
      Similarly, a hashing algorithm of some kind might be used.
      That's why I recommended going to comp.programmin g, or
      (if there is such a thing) comp.algorithme s or something like
      that.



      pembed2003 wrote:[color=blue]
      > Nick Landsberg <hukolau@att.ne t> wrote in message news:<fAiRb.116 153$6y6.2308252 @bgtnsc05-news.ops.worldn et.att.net>...
      >[color=green]
      >>The problem is probably not in strncmp(), but elsewhere.[/color]
      >
      >
      > No, I time the function and found out that the strcmp is the slowest
      > function of all. Without it (of course, I can't really remove it), the
      > whole function is much much faster.
      >
      >[color=green]
      >>Comparing strings linearly is an expensive proposition
      >>when there are many strings to be compared. If there's
      >>only a handful (10 or so) it's not bad, if there are
      >>thousands, then strncmp() without an intelligent
      >>algorithm (e.g. binary search) behind it is probably
      >>not the way to go.[/color]
      >
      >
      > The function will eventually be moved to a server to do key/value pair
      > lookup so it will be called tens of thousands of times and that's why
      > I want to optimize it.
      >
      >[color=green]
      >>Try a newsgroup dealing with algorithms. Maybe
      >>comp.programm ing?
      >>[/color]
      >
      >
      > If I can't get a good answer here, I will try there. Thanks!
      >
      >[color=green]
      >>
      >>pembed2003 wrote:
      >>
      >>[color=darkred]
      >>>Hi,
      >>>I have an application where I use the strncmp function extensively to
      >>>compare 2 string to see if they are the same or not. Does anyone know
      >>>a faster replacement for strncmp? I notice there is a memncmp function
      >>>which is very fast but it doesn't work the same like strncmp so I
      >>>don't think I can use it. I also tried to write the string_equal
      >>>function myself like:
      >>>
      >>>int string_equal(co nst char* s1,const char* s2){
      >>> while(*s1 && *s2 && *s1 == *s2){
      >>> s1++; s2++;
      >>> }
      >>> return *s1 == *s2;
      >>>}
      >>>
      >>>but I don't get much performance gain. Any suggestion?
      >>>
      >>>Thanks![/color][/color][/color]

      --
      "It is impossible to make anything foolproof because fools are so
      ingenious" - A. Bloch

      Comment

      • Nick Landsberg

        #18
        Re: strncmp performance



        pembed2003 wrote:

        [ snip ][color=blue]
        >
        > I have a very simple hash table where the keys are string and values
        > are also string. What I want to let people do is:
        >
        > hash_insert("ke y","value");
        >
        > and then later retrive "value" by saying:
        >
        > hash_lookup("ke y");
        >
        > The hash algr. is very simple. It calculates the hash code like:
        >
        > unsigned long hash_code(const char* s){
        > unsigned long int c = 0;
        > while(*s){
        > c = 131 * c + *s;
        > s++;
        > }
        > return c % MAX_HASH_SIZE;
        > }
        >[/color]

        At first glance, that does not look like a very robust
        hash algorithm and it MAY be why you are doing so many
        strncmp() calls. Have you any data on how many strings
        hash into the same hash code? If it's more than 2 or 3
        on average, then you should revise the hash algorithm
        rather than trying to optimize strcmp().

        A good hash algorithm can get down
        to about 1.5 probes/search. Try the CRC algorithm
        for starters.

        [color=blue]
        > It's possible for 2 different string to have the same hash code,[/color]

        Very, very true. See above about a better hash algorithm.

        so[color=blue]
        > when I am doing lookup, I am doing something like:
        >
        > char* hash_lookup(con st char* s){
        > unsigned long c = hash_code(s);
        > // Note I can't simply return the slot at c because
        > // it's possible that another different string is at
        > // this spot because of the hash algr. So I need to...
        > if(strcmp(s,(ha sh+c)->value) == 0){
        > // Found it...
        > }else{
        > // Do something else
        > }
        > }
        >
        > So because of my hash algr., an extra strcmp is needed.[/color]

        --
        "It is impossible to make anything foolproof because fools are so
        ingenious" - A. Bloch

        Comment

        • David Resnick

          #19
          Re: strncmp performance

          pembed2003@yaho o.com (pembed2003) wrote in message
          <stuff snipped>[color=blue]
          >
          > Do you mean memcmp() here instead? Yes I think memcpy is faster than
          > strcmp or strncmp but I need to find out the longer string and pass in
          > the lenght of that. Otherwise, something like:
          >
          > char* s1 = "auto";
          > char* s2 = "auto insurance";
          >
          > memcmp(s1,s2,st rlen(s1));
          >
          > will return 0 which isn't. I will need to do the extra work like:
          >
          > int l1 = strlen(s1);
          > int l2 = strlen(s2);
          >
          > memcmp(s1,s2,l1 > l2 ? l1 : l2);
          >
          > Do you think that will be faster than a strcmp or strncmp?
          >[/color]

          As I mentioned in another part of the thread, if the lengths aren't the
          same the strings aren't identical, in which case you must skip the memcmp.
          Your memcmp will read memory past the end of one of the strings if the
          length isn't equal, as you selected the longer of the two. But just
          don't compare is a better solution.

          Having seen the rest of the thread, your best bet seems to me
          to get a better hashing algorithm and a large
          enough (growing if needed) hashtable that you have few collisions.
          A little profiling of your hashtable should show how well/poorly
          your hash function is doing.

          -David

          Comment

          • Christian Bau

            #20
            Re: strncmp performance

            In article <db593abd.04012 71027.9ef9540@p osting.google.c om>,
            pembed2003@yaho o.com (pembed2003) wrote:
            [color=blue]
            > It's possible for 2 different string to have the same hash code, so
            > when I am doing lookup, I am doing something like:
            >
            > char* hash_lookup(con st char* s){
            > unsigned long c = hash_code(s);
            > // Note I can't simply return the slot at c because
            > // it's possible that another different string is at
            > // this spot because of the hash algr. So I need to...
            > if(strcmp(s,(ha sh+c)->value) == 0){
            > // Found it...
            > }else{
            > // Do something else
            > }
            > }
            >
            > So because of my hash algr., an extra strcmp is needed.[/color]

            So you are worrying about the performance of not ten thousands of
            strcmp's per key, but ten thousands of strcmp's in total? Please tell
            us, what does the "Giga" in "Gigahertz" mean? How many strcmp's per
            seconds can your machine do? Did you measure how long things take? Will
            anybody notice if the code runs faster?

            Comment

            • Sean Kenwrick

              #21
              Re: strncmp performance


              "pembed2003 " <pembed2003@yah oo.com> wrote in message
              news:db593abd.0 401271017.3c142 08c@posting.goo gle.com...[color=blue]
              > "Sean Kenwrick" <skenwrick@hotm ail.com> wrote in message[/color]
              news:<bv5d9g$2n v$1@sparta.btin ternet.com>...[color=blue][color=green]
              > > "pembed2003 " <pembed2003@yah oo.com> wrote in message
              > > news:db593abd.0 401261638.25429 74@posting.goog le.com...[color=darkred]
              > > > Hi,
              > > > I have an application where I use the strncmp function extensively to
              > > > compare 2 string to see if they are the same or not. Does anyone know
              > > > a faster replacement for strncmp? I notice there is a memncmp function
              > > > which is very fast but it doesn't work the same like strncmp so I
              > > > don't think I can use it. I also tried to write the string_equal
              > > > function myself like:
              > > >
              > > > int string_equal(co nst char* s1,const char* s2){
              > > > while(*s1 && *s2 && *s1 == *s2){
              > > > s1++; s2++;
              > > > }
              > > > return *s1 == *s2;
              > > > }
              > > >
              > > > but I don't get much performance gain. Any suggestion?
              > > >
              > > > Thanks![/color]
              > >
              > > Have you tried forcing your compiler to use inline code rather than a
              > > function call (either by implementing this as a Macro or by using the
              > > 'inline' keyword. You might find that setting up the stack frame[/color][/color]
              (or[color=blue][color=green]
              > > whatever) prior to making each call to your string_equal() function[/color][/color]
              takes[color=blue][color=green]
              > > just as long as the function itself.
              > >
              > > Also it appears that you are implementing (a simplified version of)[/color][/color]
              strcmp()[color=blue][color=green]
              > > here NOT strncmp() and you are evaluating an extra condition in your[/color][/color]
              loop[color=blue][color=green]
              > > (drop the check on s2 for '\0'):.
              > >
              > > int string_equal(co nst char * s1, const char * s2){
              > > while(*s1 && * s1==*s2){
              > > s1++; s2++;
              > > }
              > > return *s1==*s2;
              > > }
              > >
              > >
              > > Regarding memcmp() this is likely to be highly optimised and much faster
              > > than strcmp(), but the problem is that you need to pass to it the length[/color][/color]
              of[color=blue][color=green]
              > > the strings that you are comparing. There might be a way you can[/color][/color]
              make[color=blue][color=green]
              > > use of memcmp() if you know that the strings are the same length. Do[/color][/color]
              you[color=blue][color=green]
              > > get this information somewhere earlier in your code (e.g when you first[/color][/color]
              read[color=blue][color=green]
              > > in the key value?). Otherwise try:
              > >
              > > memcpy(s1,s2,st rlen(s1));[/color]
              >
              > Do you mean memcmp() here instead? Yes I think memcpy is faster than
              > strcmp or strncmp but I need to find out the longer string and pass in
              > the lenght of that. Otherwise, something like:
              >
              > char* s1 = "auto";
              > char* s2 = "auto insurance";
              >
              > memcmp(s1,s2,st rlen(s1));
              >
              > will return 0 which isn't. I will need to do the extra work like:
              >
              > int l1 = strlen(s1);
              > int l2 = strlen(s2);
              >
              > memcmp(s1,s2,l1 > l2 ? l1 : l2);
              >
              > Do you think that will be faster than a strcmp or strncmp?
              >[/color]

              You need to examine your code for data-caching possibilities. What I mean
              by this is that you evaluate a value at some stage which you keep and use
              multiple times later on. In this case the important value is the length
              of the strings. From a previous post it looks like you are evaluating
              hash_keys() prior to posting keys into your lookup table - it seems that
              this function is a likely candidate for calculating the length of the string
              with little overhead (e.g save the original pointer and use pointer
              arithmetic at the end to calculate the strlen()). You could then store
              the length along with the other information in your lookup table.
              Then you only need to do a memcmp() if the strings are of equal length, and
              you already have the string lengths calculated if you do need to call
              memcmp()...

              Sean


              Comment

              • CBFalconer

                #22
                Re: strncmp performance

                Nick Landsberg wrote:[color=blue]
                > pembed2003 wrote:
                >
                > [ snip ][color=green]
                > >
                > > I have a very simple hash table where the keys are string and
                > > values are also string. What I want to let people do is:
                > >
                > > hash_insert("ke y","value");
                > >
                > > and then later retrive "value" by saying:
                > >
                > > hash_lookup("ke y");
                > >
                > > The hash algr. is very simple. It calculates the hash code like:
                > >
                > > unsigned long hash_code(const char* s){
                > > unsigned long int c = 0;
                > > while(*s){
                > > c = 131 * c + *s;
                > > s++;
                > > }
                > > return c % MAX_HASH_SIZE;
                > > }[/color]
                >
                > At first glance, that does not look like a very robust
                > hash algorithm and it MAY be why you are doing so many
                > strncmp() calls. Have you any data on how many strings
                > hash into the same hash code? If it's more than 2 or 3
                > on average, then you should revise the hash algorithm
                > rather than trying to optimize strcmp().
                >
                > A good hash algorithm can get down to about 1.5 probes/search.
                > Try the CRC algorithm for starters.[/color]

                His function sounds much too dependant on the low order bits of
                the last character hashed.

                To experiment with hash functions and immediately see the
                probes/search and other statistics, the OP could try using the
                hashlib package. It was born out of an investigation into hash
                functions. There are some sample string hashing routines, and
                references to other methods.

                <http://cbfalconer.home .att.net/download/hashlib.zip>

                --
                Chuck F (cbfalconer@yah oo.com) (cbfalconer@wor ldnet.att.net)
                Available for consulting/temporary embedded and systems.
                <http://cbfalconer.home .att.net> USE worldnet address!

                Comment

                • Rob Thorpe

                  #23
                  Re: strncmp performance

                  pembed2003@yaho o.com (pembed2003) wrote in message news:<db593abd. 0401271027.9ef9 540@posting.goo gle.com>...[color=blue]
                  > Christian Bau <christian.bau@ cbau.freeserve. co.uk> wrote in message news:<christian .bau-469F3A.09125927 012004@slb-newsm1.svr.pol. co.uk>...[color=green]
                  > > In article <db593abd.04012 61638.2542974@p osting.google.c om>,
                  > > pembed2003@yaho o.com (pembed2003) wrote:
                  > >[color=darkred]
                  > > > Hi,
                  > > > I have an application where I use the strncmp function extensively to
                  > > > compare 2 string to see if they are the same or not. Does anyone know
                  > > > a faster replacement for strncmp? I notice there is a memncmp function
                  > > > which is very fast but it doesn't work the same like strncmp so I
                  > > > don't think I can use it. I also tried to write the string_equal
                  > > > function myself like:
                  > > >
                  > > > int string_equal(co nst char* s1,const char* s2){
                  > > > while(*s1 && *s2 && *s1 == *s2){[/color]
                  > >
                  > > This definitely does to many comparisons. If *s1 == *s2 then *s1 and *s2
                  > > are either both zero, or non of them is zero, so it is absolutely
                  > > pointless to check both of them.
                  > >
                  > >[color=darkred]
                  > > > s1++; s2++;
                  > > > }
                  > > > return *s1 == *s2;
                  > > > }
                  > > >
                  > > > but I don't get much performance gain. Any suggestion?[/color]
                  > >
                  > > Instead of worrying about the performance of strncmp, you should ask
                  > > yourself why you make so many calls to it that it matters at all.[/color]
                  >
                  > I have a very simple hash table where the keys are string and values
                  > are also string. What I want to let people do is:
                  >
                  > hash_insert("ke y","value");
                  >
                  > and then later retrive "value" by saying:
                  >
                  > hash_lookup("ke y");
                  >
                  > The hash algr. is very simple. It calculates the hash code like:
                  >
                  > unsigned long hash_code(const char* s){
                  > unsigned long int c = 0;
                  > while(*s){
                  > c = 131 * c + *s;
                  > s++;
                  > }
                  > return c % MAX_HASH_SIZE;
                  > }
                  >
                  > It's possible for 2 different string to have the same hash code, so
                  > when I am doing lookup, I am doing something like:
                  >
                  > char* hash_lookup(con st char* s){
                  > unsigned long c = hash_code(s);
                  > // Note I can't simply return the slot at c because
                  > // it's possible that another different string is at
                  > // this spot because of the hash algr. So I need to...
                  > if(strcmp(s,(ha sh+c)->value) == 0){
                  > // Found it...
                  > }else{
                  > // Do something else
                  > }
                  > }
                  >
                  > So because of my hash algr., an extra strcmp is needed.[/color]

                  Since it is a hashing algorithm I would assume that the chance of the
                  first character been the same as the second character is very high.
                  Only if the hash table becomes extremely full does this change.

                  If this is the case replace:

                  if (strcmp (s, (hash + c)->value) == 0) {
                  // Found it...
                  }

                  with

                  if (s[0] == *((hash + c)->value)) /* Compare the first two chars. */
                  {
                  if (s, (hash + c)->value) == 0) /* Only now compare strings. */
                  {
                  // Found it...
                  }
                  else
                  {
                  // Do something else ..
                  }
                  }
                  else
                  {
                  // Do something else ..
                  }

                  This will kick out the cases where the strings don't match rather
                  faster. It removes the overhead of a function call and the beginning
                  of a loop (though neither ammount to much these days).

                  You may want to consider using a flag that labels a bucket as full
                  rather than comparing the string in the bucket to the key. This way
                  the first time the program looks up a label no comparison is done -
                  this will be insertion.

                  You may also want to consider:

                  * Using a better hash function ( read
                  http://burtleburtle.net/bob/hash/index.html#lookup )

                  * Resizing the hash when it's nearly full

                  * Using linked lists as buckets.

                  Comment

                  • pembed2003

                    #24
                    Re: strncmp performance

                    "Sean Kenwrick" <skenwrick@hotm ail.com> wrote in message news:<bv6r33

                    [snip]
                    [color=blue][color=green]
                    > >
                    > > Do you mean memcmp() here instead? Yes I think memcpy is faster than
                    > > strcmp or strncmp but I need to find out the longer string and pass in
                    > > the lenght of that. Otherwise, something like:
                    > >
                    > > char* s1 = "auto";
                    > > char* s2 = "auto insurance";
                    > >
                    > > memcmp(s1,s2,st rlen(s1));
                    > >
                    > > will return 0 which isn't. I will need to do the extra work like:
                    > >
                    > > int l1 = strlen(s1);
                    > > int l2 = strlen(s2);
                    > >
                    > > memcmp(s1,s2,l1 > l2 ? l1 : l2);
                    > >
                    > > Do you think that will be faster than a strcmp or strncmp?
                    > >[/color]
                    >
                    > You need to examine your code for data-caching possibilities. What I mean
                    > by this is that you evaluate a value at some stage which you keep and use
                    > multiple times later on. In this case the important value is the length
                    > of the strings. From a previous post it looks like you are evaluating
                    > hash_keys() prior to posting keys into your lookup table - it seems that
                    > this function is a likely candidate for calculating the length of the string
                    > with little overhead (e.g save the original pointer and use pointer
                    > arithmetic at the end to calculate the strlen()). You could then store
                    > the length along with the other information in your lookup table.
                    > Then you only need to do a memcmp() if the strings are of equal length, and
                    > you already have the string lengths calculated if you do need to call
                    > memcmp()...
                    >
                    > Sean[/color]

                    If fact, that's exactly what I did now. If the strlen isn't the same,
                    there is no chance for the strings to be the same. If they are the
                    same length, memcmp can be used. So instead of doing strcmp (or
                    strncmp) I am doing either strlen and/or memcmp which should be
                    faster. Another problem now I have encountered is that the string
                    passed in to my function is not from a C program, it's from a PHP
                    extension (which is written in C). Because of this, I sometimes got
                    segfault which I think is related to PHP not padding the string with
                    the NULL marker. My question is: Does memcmp care whethere there is a
                    NULL marker somewhere or not? Is there any circumstance where memcmp
                    might segfault?

                    Thanks!

                    Comment

                    • Paul Hsieh

                      #25
                      Re: strncmp performance

                      I've just been reading thread and two things pop to mind. First of
                      all, the hash function you have chosen looks a little bit questionable
                      in terms of collisions. The FNV hash is well known to behave quite
                      well and will have performance identical to your hash function:



                      Second, if your program still boils down to string comparison no
                      matter what, then you should consider converting your program over to
                      a library like The Better String library:



                      In this library, the length of each string is predetermined as they
                      are created or modified (this is very cheap, while leading to massive
                      performance improvements in some string functionality.) In this way
                      you can use memcmp() (which has the potential of being implemented to
                      use block comparisons) directly without incurring the string traversal
                      costs (in general.) The Better String library also includes its own
                      string comparison functions, of course, which additionally capture
                      trivial cases like strings having different lengths and aliased string
                      pointers in O(1) time.

                      Additionally calling strlen(), or using strcmp, or strncmp, or
                      whatever based on the assumption of using raw char * buffers will all
                      incurr an additional O(n) cost no matter how you slice it, which may
                      be a big factor in what is showing up on your bottom line. Using
                      libraries like Bstrlib (which essential has a O(1) strlen) as
                      described above is really the only way to avoid this cost.

                      As a point of disclosure, I am the author of Bstrlib -- other
                      libraries like Vstr (www.and.org/vstr) have comparable mechanisms.

                      --
                      Paul Hsieh
                      Pobox has been discontinued as a separate service, and all existing customers moved to the Fastmail platform.


                      Comment

                      • Christian Bau

                        #26
                        Re: strncmp performance

                        In article <796f488f.04012 82015.5fa9af91@ posting.google. com>,
                        qed@pobox.com (Paul Hsieh) wrote:
                        [color=blue]
                        > I've just been reading thread and two things pop to mind. First of
                        > all, the hash function you have chosen looks a little bit questionable
                        > in terms of collisions. The FNV hash is well known to behave quite
                        > well and will have performance identical to your hash function:
                        >
                        > http://www.isthe.com/chongo/tech/comp/fnv/
                        >
                        > Second, if your program still boils down to string comparison no
                        > matter what, then you should consider converting your program over to
                        > a library like The Better String library:
                        >
                        > http://bstring.sf.net/
                        >
                        > In this library, the length of each string is predetermined as they
                        > are created or modified (this is very cheap, while leading to massive
                        > performance improvements in some string functionality.) In this way
                        > you can use memcmp() (which has the potential of being implemented to
                        > use block comparisons) directly without incurring the string traversal
                        > costs (in general.) The Better String library also includes its own
                        > string comparison functions, of course, which additionally capture
                        > trivial cases like strings having different lengths and aliased string
                        > pointers in O(1) time.[/color]

                        If the data is completely under your control, you could make different
                        changes: Store all the strings in arrays of unsigned long instead of
                        unsigned char. In the table, end every string with at least one zero and
                        a 1, with the 1 being the last byte in an unsigned long. In the strings
                        that you pass in, end every string with at least two zeroes, with the
                        last zero being the last byte in an unsigned long.

                        You can know compare one unsigned long at a time. You don't need to
                        check for the end of the strings because the data you pass in and the
                        data in your table will be different. After finding the first unsigned
                        long that is different, the strings are equal if the difference between
                        the two strings is 1 and the last byte of the unsigned long that you
                        took from the table is 1.

                        Comment

                        • pembed2003

                          #27
                          Re: strncmp performance

                          qed@pobox.com (Paul Hsieh) wrote in message news:<796f488f. 0401282015.5fa9 af91@posting.go ogle.com>...[color=blue]
                          > I've just been reading thread and two things pop to mind. First of
                          > all, the hash function you have chosen looks a little bit questionable
                          > in terms of collisions. The FNV hash is well known to behave quite
                          > well and will have performance identical to your hash function:
                          >
                          > http://www.isthe.com/chongo/tech/comp/fnv/
                          >
                          > Second, if your program still boils down to string comparison no
                          > matter what, then you should consider converting your program over to
                          > a library like The Better String library:
                          >
                          > http://bstring.sf.net/
                          >[/color]

                          Thanks for pointing out a better hash algr. and string library. I will
                          consider using both in my application. As I have pointed out in
                          another thread, my problem now seems to be in PHP not padding the
                          string with \0 which result in segfault.

                          Comment

                          Working...