Fast way to determine if a string contains a member of a list of strings

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Karch

    Fast way to determine if a string contains a member of a list of strings

    I need to find the fastest way in terms of storage and searching to
    determine if a given string contains one of a member of a list of strings.
    So, think of it in terms of this: I have a string such as the following:

    "Smith said the dog belongs (or did belong) to the secretary, an employee of
    the company."

    Then the list contains the following (and this list is MUCH larger in the
    real situation):

    Adams
    Alan
    Jones
    Jacobs
    Smith
    Thaxton
    Vela
    Zulu

    I would need to stop the processing and return (true) as soon as Smith was
    found. On the other hand, if the string was changed to the following, there
    would be no match and I would return (false):

    "Smitherson said the dog belongs (or did belong) to the secretary, an
    employee of the company."

    In the given string, do know that the matches should begin at a given point
    (zero position), but I need to keep processing until I have exhausted the
    candidate string in the list - as shown above - to prevent a false match.

    I have played around with some different data structures, such as prefix and
    suffix trees, an these work in the case that you have a string that you are
    trying to match in a list, not vice versa. The approach is required to be
    very performant because it will be evaluated millions of times. I am also
    okay with an unsafe code approach that works. I just need the evaluations to
    terminate as soon as possible rather than looping through every single item
    in the list. Even an IndexOf is too slow.


  • amdrit

    #2
    Re: Fast way to determine if a string contains a member of a list of strings

    I think RegEx will help you out here. As far as not iterating over the
    entire list, I've no clue how you can selectively not fall through given the
    only match in the list is the last item in the list or when there is no
    match at all.


    "Karch" <news.microsoft .comwrote in message
    news:e993ZMufIH A.4684@TK2MSFTN GP06.phx.gbl...
    >I need to find the fastest way in terms of storage and searching to
    >determine if a given string contains one of a member of a list of strings.
    >So, think of it in terms of this: I have a string such as the following:
    >
    "Smith said the dog belongs (or did belong) to the secretary, an employee
    of the company."
    >
    Then the list contains the following (and this list is MUCH larger in the
    real situation):
    >
    Adams
    Alan
    Jones
    Jacobs
    Smith
    Thaxton
    Vela
    Zulu
    >
    I would need to stop the processing and return (true) as soon as Smith was
    found. On the other hand, if the string was changed to the following,
    there would be no match and I would return (false):
    >
    "Smitherson said the dog belongs (or did belong) to the secretary, an
    employee of the company."
    >
    In the given string, do know that the matches should begin at a given
    point (zero position), but I need to keep processing until I have
    exhausted the candidate string in the list - as shown above - to prevent a
    false match.
    >
    I have played around with some different data structures, such as prefix
    and suffix trees, an these work in the case that you have a string that
    you are trying to match in a list, not vice versa. The approach is
    required to be very performant because it will be evaluated millions of
    times. I am also okay with an unsafe code approach that works. I just need
    the evaluations to terminate as soon as possible rather than looping
    through every single item in the list. Even an IndexOf is too slow.
    >

    Comment

    • John Daragon

      #3
      Re: Fast way to determine if a string contains a member of a listof strings

      Karch wrote:
      I need to find the fastest way in terms of storage and searching to
      determine if a given string contains one of a member of a list of strings.
      So, think of it in terms of this: I have a string such as the following:
      >
      "Smith said the dog belongs (or did belong) to the secretary, an employee of
      the company."
      >
      Then the list contains the following (and this list is MUCH larger in the
      real situation):
      >
      Adams
      Alan
      Jones
      Jacobs
      Smith
      Thaxton
      Vela
      Zulu
      >
      Could you give us some idea of the likely length of the list? And
      whether it's static or dynamic ?

      jd

      Comment

      • Peter Duniho

        #4
        Re: Fast way to determine if a string contains a member of a list of strings

        On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft .comwrote:
        I need to find the fastest way in terms of storage and searching to
        determine if a given string contains one of a member of a list of
        strings.
        I don't know if RegEx has optimizations for dealing with this sort of
        thing. It could, but given how long a search string for RegEx could be if
        you concatenate all of your possible matches, maybe it doesn't.

        That'd be my first try though. It seems like the actual RegEx search
        string would be simple (just "or" all of your possible matches together).
        If it performs well enough, there you go.

        If not, I'd guess there's a fair amount of research into algorithms like
        this, but a fairly basic approach is a state graph. It's memory
        intensive, but rewards you with good performance. This came up awhile ago
        and I posted some sample code to get someone else started. You can find
        that post here:


        I've referred to this post a couple of other times, and while I've never
        had anyone say it actually turned out to be useful, they've never said it
        wasn't either. :)

        I suppose if I'm going to keep referring people to it, maybe I ought to
        clean it up a little bit more. But what's there does work and should be
        enough to get you pointed in the right direction.

        Pete

        Comment

        • Jack Jackson

          #5
          Re: Fast way to determine if a string contains a member of a list of strings

          One problem I see with this is that you apparently are not doing
          simply a string search, but some kind of word search, without defining
          what a word is.

          You say the "Smithperso n said ..." should not have any matches, so
          while the string Smith is a substring, it does not match.

          Would it match:
          (Smith
          (Smith)
          Smith,
          Smith;
          Smith-
          smith

          etc.

          The answer to this will probably constrian your possible algorithms.
          If it didn't get too many false hits, doing a string search first and
          then re-checking to see if the matches are valid word matches might be
          OK.

          On Wed, 5 Mar 2008 11:05:22 -0600, "Karch" <news.microsoft .comwrote:
          >I need to find the fastest way in terms of storage and searching to
          >determine if a given string contains one of a member of a list of strings.
          >So, think of it in terms of this: I have a string such as the following:
          >
          >"Smith said the dog belongs (or did belong) to the secretary, an employee of
          >the company."
          >
          >Then the list contains the following (and this list is MUCH larger in the
          >real situation):
          >
          >Adams
          >Alan
          >Jones
          >Jacobs
          >Smith
          >Thaxton
          >Vela
          >Zulu
          >
          >I would need to stop the processing and return (true) as soon as Smith was
          >found. On the other hand, if the string was changed to the following, there
          >would be no match and I would return (false):
          >
          >"Smitherson said the dog belongs (or did belong) to the secretary, an
          >employee of the company."
          >
          >In the given string, do know that the matches should begin at a given point
          >(zero position), but I need to keep processing until I have exhausted the
          >candidate string in the list - as shown above - to prevent a false match.
          >
          >I have played around with some different data structures, such as prefix and
          >suffix trees, an these work in the case that you have a string that you are
          >trying to match in a list, not vice versa. The approach is required to be
          >very performant because it will be evaluated millions of times. I am also
          >okay with an unsafe code approach that works. I just need the evaluations to
          >terminate as soon as possible rather than looping through every single item
          >in the list. Even an IndexOf is too slow.
          >

          Comment

          • Karch

            #6
            Re: Fast way to determine if a string contains a member of a list of strings

            The list is static (read into memory during initialization) and could be up
            to approximately 1000 items (hence the need for a fast method)

            "John Daragon" <john@argv.co.u kwrote in message
            news:47cee409$0 $21865$db0fefd9 @news.zen.co.uk ...
            Karch wrote:
            >I need to find the fastest way in terms of storage and searching to
            >determine if a given string contains one of a member of a list of
            >strings. So, think of it in terms of this: I have a string such as the
            >following:
            >>
            >"Smith said the dog belongs (or did belong) to the secretary, an employee
            >of the company."
            >>
            >Then the list contains the following (and this list is MUCH larger in the
            >real situation):
            >>
            >Adams
            >Alan
            >Jones
            >Jacobs
            >Smith
            >Thaxton
            >Vela
            >Zulu
            >>
            >
            Could you give us some idea of the likely length of the list? And whether
            it's static or dynamic ?
            >
            jd

            Comment

            • Karch

              #7
              Re: Fast way to determine if a string contains a member of a list of strings

              I think the StateGraph approach is going to work best. I have an
              implementation up and running and should be able to get some timings done
              today. I don't think Regex is an option for me because of performance and
              the fact that, although the list itself is static, it will be loaded into
              memory at run-time. So, the expression building would be complex, not to
              mention eating cycles just to prep.

              "Peter Duniho" <NpOeStPeAdM@nn owslpianmk.comw rote in message
              news:op.t7j3uag 28jd0ej@petes-computer.local. ..
              On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft .comwrote:
              >
              >I need to find the fastest way in terms of storage and searching to
              >determine if a given string contains one of a member of a list of
              >strings.
              >
              I don't know if RegEx has optimizations for dealing with this sort of
              thing. It could, but given how long a search string for RegEx could be if
              you concatenate all of your possible matches, maybe it doesn't.
              >
              That'd be my first try though. It seems like the actual RegEx search
              string would be simple (just "or" all of your possible matches together).
              If it performs well enough, there you go.
              >
              If not, I'd guess there's a fair amount of research into algorithms like
              this, but a fairly basic approach is a state graph. It's memory
              intensive, but rewards you with good performance. This came up awhile ago
              and I posted some sample code to get someone else started. You can find
              that post here:

              >
              I've referred to this post a couple of other times, and while I've never
              had anyone say it actually turned out to be useful, they've never said it
              wasn't either. :)
              >
              I suppose if I'm going to keep referring people to it, maybe I ought to
              clean it up a little bit more. But what's there does work and should be
              enough to get you pointed in the right direction.
              >
              Pete

              Comment

              • Peter Duniho

                #8
                Re: Fast way to determine if a string contains a member of a list of strings

                On Thu, 06 Mar 2008 10:08:22 -0800, Karch <news.microsoft .comwrote:
                I think the StateGraph approach is going to work best. I have an
                implementation up and running and should be able to get some timings done
                today.
                Have you tried the dictionary approach suggested by a couple of others?

                I wasn't paying attention when I first read your question and failed to
                note that your search is delimited by spaces. Or, at least in the
                examples you provided it was.

                I think that if your data is actually restricted like that, the dictionary
                lookup might be as fast as a state graph and it would be a lot simpler in
                terms of implementation.

                Just a thought. Obviously I really like state graphs :), but when there
                is information you know about the input data that allows you to constrain
                the problem a bit (e.g. by using spaces to identify the beginning and
                ending of any possible match), it's often possible to solve the problem
                efficiently without using a state graph.

                Pete

                Comment

                • Peter Duniho

                  #9
                  Re: Fast way to determine if a string contains a member of a list of strings

                  On Thu, 06 Mar 2008 10:12:51 -0800, Karch <news.microsoft .comwrote:
                  The Hashtable would work great alongside a tokenized string, but the
                  problem
                  is that I don't have a distinct delimiter since the string to be matched
                  can
                  span multiple words. It's really a question of "do any of these strings
                  appear in this source string", irrespective of whitespace.
                  Then how do you distinguish "Smith" from "Smitherson ", as in your
                  example? Can you at least make a requirement that the search pattern will
                  always begin and end on a space, even if it includes spaces within?

                  Pete

                  Comment

                  • Mufaka

                    #10
                    Re: Fast way to determine if a string contains a member of a listof strings

                    Mufaka wrote:
                    Karch wrote:
                    >The Hashtable would work great alongside a tokenized string, but the
                    >problem is that I don't have a distinct delimiter since the string to
                    >be matched can span multiple words. It's really a question of "do any
                    >of these strings appear in this source string", irrespective of
                    >whitespace.
                    >>
                    <snip>
                    This is a little different than your original post, but if you can't
                    delimit search string, you might try something like the following:
                    >
                    Hash the static list of words
                    Keep a list of unique lengths for those words
                    for each unique length, iterate the string pulling out substrings of
                    that length
                    see if that substring is in your hashed list of words
                    >
                    It at least reduces the amount of InString's you'd have to do.
                    >
                    I'd be interested in seeing how this performs against large strings.
                    Here's my test code:
                    <snip>
                    >
                    if (text.Length < length) break;
                    <snip>
                    That should be continue instead of break.

                    Comment

                    • Karch

                      #11
                      Re: Fast way to determine if a string contains a member of a list of strings

                      WOW! Really disappointing timings for the state machine method. Here are the
                      results:

                      Method: Looping through 359 strings doing an if (IndexOf 0) and a source
                      string of length = 47
                      Execution time: 0.055008 seconds.


                      Method: Using the StateMachine method on 359 strings and a source string of
                      length = 47
                      Execution time: 2.359223 seconds.



                      "Karch" <news.microsoft .comwrote in message
                      news:eXx$RU7fIH A.4880@TK2MSFTN GP03.phx.gbl...
                      >I think the StateGraph approach is going to work best. I have an
                      >implementati on up and running and should be able to get some timings done
                      >today. I don't think Regex is an option for me because of performance and
                      >the fact that, although the list itself is static, it will be loaded into
                      >memory at run-time. So, the expression building would be complex, not to
                      >mention eating cycles just to prep.
                      >
                      "Peter Duniho" <NpOeStPeAdM@nn owslpianmk.comw rote in message
                      news:op.t7j3uag 28jd0ej@petes-computer.local. ..
                      >On Wed, 05 Mar 2008 09:05:22 -0800, Karch <news.microsoft .comwrote:
                      >>
                      >>I need to find the fastest way in terms of storage and searching to
                      >>determine if a given string contains one of a member of a list of
                      >>strings.
                      >>
                      >I don't know if RegEx has optimizations for dealing with this sort of
                      >thing. It could, but given how long a search string for RegEx could be
                      >if you concatenate all of your possible matches, maybe it doesn't.
                      >>
                      >That'd be my first try though. It seems like the actual RegEx search
                      >string would be simple (just "or" all of your possible matches together).
                      >If it performs well enough, there you go.
                      >>
                      >If not, I'd guess there's a fair amount of research into algorithms like
                      >this, but a fairly basic approach is a state graph. It's memory
                      >intensive, but rewards you with good performance. This came up awhile
                      >ago and I posted some sample code to get someone else started. You can
                      >find that post here:
                      >http://groups.google.com/group/micro...06f696d4500b77
                      >>
                      >I've referred to this post a couple of other times, and while I've never
                      >had anyone say it actually turned out to be useful, they've never said it
                      >wasn't either. :)
                      >>
                      >I suppose if I'm going to keep referring people to it, maybe I ought to
                      >clean it up a little bit more. But what's there does work and should be
                      >enough to get you pointed in the right direction.
                      >>
                      >Pete
                      >
                      >

                      Comment

                      • John Daragon

                        #12
                        Re: Fast way to determine if a string contains a member of a listof strings

                        Karch wrote:
                        Snip 8<
                        >
                        Yeah, this would work really good - but one point that I didn't mention is
                        that I can't really tokenize the string since the strings to match could
                        span multiple words and whitepace. I looked at the data and I couldn't find
                        a realiable way to tokenize.
                        >
                        Snip 8<

                        Is the data in your list arbitrarily complex? I guess what I'm getting
                        at is: if the majority of the strings to match are simple strings with
                        no embedded spaces or other special characters, it may well be possible
                        to build a hashtable (fixated, me?) that caters for the simplest
                        (tokenised) case, but which returns an object identifying whether this
                        string can be standalone or is a prefix for a more complex one.
                        Everything apart from the simple case would have to be handled
                        exceptionally, but if they're few and far between you may well still be
                        a long way ahead of the game adopting a technique that is based on hashing.


                        jd

                        Comment

                        • Karch

                          #13
                          Re: Fast way to determine if a string contains a member of a list of strings

                          WOW! Really disappointing timings for the state machine method. Here are the
                          results:

                          Method: Looping through 359 strings doing an if (IndexOf 0) and a source
                          string of length = 47
                          Execution time: 0.055008 seconds.


                          Method: Using the StateMachine method on 359 strings and a source string of
                          length = 47
                          Execution time: 2.359223 seconds.

                          In my case, I can terminate the search as soon as ONE of the strings is
                          found because the list is mutually exclusive. In the sample code that you
                          provided, at what point do I know I have a match and can terminate the
                          traversal?

                          Also, to answer your question, I do not have space delimiters - the strings
                          can span multiple words, etc.

                          "Peter Duniho" <NpOeStPeAdM@nn owslpianmk.comw rote in message
                          news:op.t7lw7de w8jd0ej@petes-computer.local. ..
                          On Thu, 06 Mar 2008 10:08:22 -0800, Karch <news.microsoft .comwrote:
                          >
                          >I think the StateGraph approach is going to work best. I have an
                          >implementati on up and running and should be able to get some timings done
                          >today.
                          >
                          Have you tried the dictionary approach suggested by a couple of others?
                          >
                          I wasn't paying attention when I first read your question and failed to
                          note that your search is delimited by spaces. Or, at least in the
                          examples you provided it was.
                          >
                          I think that if your data is actually restricted like that, the dictionary
                          lookup might be as fast as a state graph and it would be a lot simpler in
                          terms of implementation.
                          >
                          Just a thought. Obviously I really like state graphs :), but when there
                          is information you know about the input data that allows you to constrain
                          the problem a bit (e.g. by using spaces to identify the beginning and
                          ending of any possible match), it's often possible to solve the problem
                          efficiently without using a state graph.
                          >
                          Pete

                          Comment

                          • Peter Duniho

                            #14
                            Re: Fast way to determine if a string contains a member of a list of strings

                            On Thu, 13 Mar 2008 16:35:05 -0700, Karch <news.microsoft .comwrote:
                            I only create the state graph one time and I only populate the list of
                            strings one time. So, those are not interfering with proper timings; I am
                            only timing the searches. But, just for fun, here is the same test with
                            1M
                            iterations. The ratio is about the same.
                            What is your input data? Does the input string contain any of the search
                            strings? If so, where is that search string in the list of search
                            strings? Are you doing 1 million identical searches, or are you somehow
                            randomizing the input?

                            Basically, that's a fairly minimal improvement relative to the potential
                            of the state graph versus a brute force IndexOf() solution. So either
                            your test case is somehow favorable to the IndexOf(), or there's still
                            some remaining problem with the state graph implementation.

                            Pete

                            Comment

                            • Karch

                              #15
                              Re: Fast way to determine if a string contains a member of a list of strings

                              Yes, you are right; my input data is not as random as it could be. I just
                              wanted to get a quick snapshot, so it's likely that the improvement would be
                              greater over a larger and more varied set. I'll put together a more
                              realistic set of test data and report back.

                              "Peter Duniho" <NpOeStPeAdM@nn owslpianmk.comw rote in message
                              news:op.t7zavd0 t8jd0ej@petes-computer.local. ..
                              On Thu, 13 Mar 2008 16:35:05 -0700, Karch <news.microsoft .comwrote:
                              >
                              >I only create the state graph one time and I only populate the list of
                              >strings one time. So, those are not interfering with proper timings; I am
                              >only timing the searches. But, just for fun, here is the same test with
                              >1M
                              >iterations. The ratio is about the same.
                              >
                              What is your input data? Does the input string contain any of the search
                              strings? If so, where is that search string in the list of search
                              strings? Are you doing 1 million identical searches, or are you somehow
                              randomizing the input?
                              >
                              Basically, that's a fairly minimal improvement relative to the potential
                              of the state graph versus a brute force IndexOf() solution. So either
                              your test case is somehow favorable to the IndexOf(), or there's still
                              some remaining problem with the state graph implementation.
                              >
                              Pete

                              Comment

                              Working...