Search a String

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • S

    Search a String

    Any idea on how I would be able to do a search within C# that does
    ranges or words


    For example

    I want to search for Chicken in the string


    string s1 = "This is Great Chicken";
    string s2 = "Chicken";
    return s1.IndexOf(s2) >= 0

    The trick is how would I search for: Person doing the search
    misspelled the word Chickin (Happens all the time);

    string s1 = "This is Great Chicken";
    string s2 = "Chickin";

    return ???

    or

    Worst yet
    string s1 = "This is Great Chickin";
    string s2 = "Chicken";

    Person enter the search phrase misspelled the word Chicken;

    string s1 = "This is Great Chickin";
    string s2 = "Chicken";



  • Author

    #2
    Re: Search a String

    On Oct 2, 2:26 pm, S <scottremi...@y ahoo.comwrote:
    Any idea on how I would be able to do a search within C# that does
    ranges or words
    >
    For example
    >
    I want to search for Chicken in the string
    >
    string s1 = "This is Great Chicken";
    string s2 = "Chicken";
    return s1.IndexOf(s2) >= 0
    >
    The trick is how would I search for: Person doing the search
    misspelled the word Chickin (Happens all the time);
    >
    string s1 = "This is Great Chicken";
    string s2 = "Chickin";
    >
    return ???
    >
    or
    >
    Worst yet
    string s1 = "This is Great Chickin";
    string s2 = "Chicken";
    >
    Person enter the search phrase misspelled the word Chicken;
    >
    string s1 = "This is Great Chickin";
    string s2 = "Chicken";
    Check out the Regex namespace.

    If you search something such as "Chickin" in a string which does not
    contain this token, such as "This is Great Chicken", what would you
    expect to return? I wouldn't expect it to return anything. Maybe you
    wanna do something like google's key words suggestion if user entry
    has some typos?

    Comment

    • Author

      #3
      Re: Search a String

      On Oct 2, 2:51 pm, Author <gnewsgr...@gma il.comwrote:
      On Oct 2, 2:26 pm, S <scottremi...@y ahoo.comwrote:
      >
      >
      >
      Any idea on how I would be able to do a search within C# that does
      ranges or words
      >
      For example
      >
      I want to search for Chicken in the string
      >
      string s1 = "This is Great Chicken";
      string s2 = "Chicken";
      return s1.IndexOf(s2) >= 0
      >
      The trick is how would I search for: Person doing the search
      misspelled the word Chickin (Happens all the time);
      >
      string s1 = "This is Great Chicken";
      string s2 = "Chickin";
      >
      return ???
      >
      or
      >
      Worst yet
      string s1 = "This is Great Chickin";
      string s2 = "Chicken";
      >
      Person enter the search phrase misspelled the word Chicken;
      >
      string s1 = "This is Great Chickin";
      string s2 = "Chicken";
      >
      Check out the Regex namespace.
      Correction: I meant the System.Text.Reg ularExpressions Namespace
      Provides regular expression functionality that may be used from any platform or language that runs within .NET. In addition to the types contained in this namespace, the RegexStringValidator class enables you to determine whether a particular string conforms to a regular expression pattern.


      Regex is a class of this namespace.
      If you search something such as "Chickin" in a string which does not
      contain this token, such as "This is Great Chicken", what would you
      expect to return?  I wouldn't expect it to return anything.  Maybe you
      wanna do something like google's key words suggestion if user entry
      has some typos?

      Comment

      • Peter Duniho

        #4
        Re: Search a String

        On Thu, 02 Oct 2008 11:26:38 -0700, S <scottremiger@y ahoo.comwrote:
        Any idea on how I would be able to do a search within C# that does
        ranges or words
        There are a variety of techniques for dealing with partial matching as you
        describe.

        I've actually implemented a very simple approach myself in the past, in
        which I do a sort of character-by-character merge between the target text
        and the search word, finding the spot in the target text where the search
        word fits the best with the least number of excluded characters. This
        would work perfectly for the examples you gave, but whether that would
        address your more general goals I can't say.

        If that sounds promising, I might be able to dig up the code and post it
        here. It's pretty inefficient, but it's small and simple so what the
        heck? :)

        You might also Google for "soundex". There's even a Wikipedia article on
        the topic:


        That's a commonly used technique for matching words based on phonetics and
        if you want a solution that is based more on language than on typography,
        it might be a better approach.

        A Google search on "dictionary spell check algorithm" also turned up a
        number of promising links.

        Pete

        Comment

        • Ignacio Machin ( .NET/ C# MVP )

          #5
          Re: Search a String

          On Oct 2, 2:26 pm, S <scottremi...@y ahoo.comwrote:
          Any idea on how I would be able to do a search within C# that does
          ranges or words
          >
          For example
          >
          I want to search for Chicken in the string
          >
          string s1 = "This is Great Chicken";
          string s2 = "Chicken";
          return s1.IndexOf(s2) >= 0
          >
          The trick is how would I search for: Person doing the search
          misspelled the word Chickin (Happens all the time);
          >
          string s1 = "This is Great Chicken";
          string s2 = "Chickin";
          >
          return ???
          >
          or
          >
          Worst yet
          string s1 = "This is Great Chickin";
          string s2 = "Chicken";
          >
          Person enter the search phrase misspelled the word Chicken;
          >
          string s1 = "This is Great Chickin";
          string s2 = "Chicken";
          Hi,

          I do not how to do it in code, I'm pretty sure that no version of the
          framework provides such a feature.
          In T-SQL you have this feature though take a look for SOUNDEX function

          Comment

          • Ignacio Machin ( .NET/ C# MVP )

            #6
            Re: Search a String

            On Oct 2, 2:51 pm, Author <gnewsgr...@gma il.comwrote:
            On Oct 2, 2:26 pm, S <scottremi...@y ahoo.comwrote:
            >
            >
            >
            Any idea on how I would be able to do a search within C# that does
            ranges or words
            >
            For example
            >
            I want to search for Chicken in the string
            >
            string s1 = "This is Great Chicken";
            string s2 = "Chicken";
            return s1.IndexOf(s2) >= 0
            >
            The trick is how would I search for: Person doing the search
            misspelled the word Chickin (Happens all the time);
            >
            string s1 = "This is Great Chicken";
            string s2 = "Chickin";
            >
            return ???
            >
            or
            >
            Worst yet
            string s1 = "This is Great Chickin";
            string s2 = "Chicken";
            >
            Person enter the search phrase misspelled the word Chicken;
            >
            string s1 = "This is Great Chickin";
            string s2 = "Chicken";
            >
            Check out the Regex namespace.
            >
            If you search something such as "Chickin" in a string which does not
            contain this token, such as "This is Great Chicken", what would you
            expect to return? I wouldn't expect it to return anything. Maybe you
            wanna do something like google's key words suggestion if user entry
            has some typos?


            I do not know how to use a Regex in this case.can you elaborate?

            Comment

            • Jeff Johnson

              #7
              Re: Search a String

              "S" <scottremiger@y ahoo.comwrote in message
              news:deea1500-88a2-4d43-9285-c72b5214ca96@64 g2000hsm.google groups.com...
              The trick is how would I search for: Person doing the search
              misspelled the word Chickin (Happens all the time);
              The answer is that what you're looking for is not at all simple. I'd
              recommend you do a Google search to see if you can find sample code for a
              spell checker.


              Comment

              • Arto Viitanen

                #8
                Re: Search a String

                Jeff Johnson kirjoitti:
                "S" <scottremiger@y ahoo.comwrote in message
                news:deea1500-88a2-4d43-9285-c72b5214ca96@64 g2000hsm.google groups.com...
                >
                >The trick is how would I search for: Person doing the search
                >misspelled the word Chickin (Happens all the time);
                >
                The answer is that what you're looking for is not at all simple. I'd
                recommend you do a Google search to see if you can find sample code for a
                spell checker.
                >
                >
                Spelling checker does not help. For example the text contains "Cut" and
                the user searches for "Cat". Both are spelled right, so the search gives
                no result.

                But, with addition to already mentioned Soundex, you might like to check
                Levenshtein distance
                (http://en.wikipedia.org/wiki/Levenshtein_distance). It tells how
                similar two texts are; i.e. how many simple edit commands are needed to
                make the second text from the first. For example distance between
                Chicken and Chikin is one.

                About Soundex: I have not studied it much, but I have a feeling Knuth
                built it for english words. How well does it apply to other languages
                or names?
                -
                Arto Viitanen

                Comment

                • Jeff Johnson

                  #9
                  Re: Search a String

                  "Arto Viitanen" <arto.viitanen@ microteam.fiwro te in message
                  news:48e5abc4$0 $25387$4f793bc4 @news.tdc.fi...
                  >>The trick is how would I search for: Person doing the search
                  >>misspelled the word Chickin (Happens all the time);
                  >>
                  >The answer is that what you're looking for is not at all simple. I'd
                  >recommend you do a Google search to see if you can find sample code for a
                  >spell checker.
                  Spelling checker does not help. For example the text contains "Cut" and
                  the user searches for "Cat". Both are spelled right, so the search gives
                  no result.
                  I agree, IF you say, "Hey, Mr. Spell Checker algorithm, tell me what's
                  misspelled in this sentence." However, if you are in control of the code,
                  I'd think you'd be able to say, "Hey, Mr. Spell Checker algorithm, make a
                  list of all the probable misspellings of 'chicken' and tell me if this
                  string contains any of them." See where I'm going?
                  But, with addition to already mentioned Soundex, you might like to check
                  Levenshtein distance
                  (http://en.wikipedia.org/wiki/Levenshtein_distance). It tells how
                  similar two texts are; i.e. how many simple edit commands are needed to
                  make the second text from the first. For example distance between
                  Chicken and Chikin is one.
                  Sounds good.
                  About Soundex: I have not studied it much, but I have a feeling Knuth
                  built it for english words. How well does it apply to other languages
                  or names?
                  I've heard people say that Soundex is an ancient crutch and shouldn't be
                  relied upon too much.


                  Comment

                  • =?UTF-8?B?IkZlcm5hbmRvIEEuIEfDs21leiBGLiI=?=

                    #10
                    Re: Search a String

                    Peter Duniho wrote:
                    On Thu, 02 Oct 2008 11:26:38 -0700, S <scottremiger@y ahoo.comwrote:
                    >
                    >Any idea on how I would be able to do a search within C# that does
                    >ranges or words
                    >
                    There are a variety of techniques for dealing with partial matching as
                    you describe.
                    >
                    I've actually implemented a very simple approach myself in the past, in
                    which I do a sort of character-by-character merge between the target
                    text and the search word, finding the spot in the target text where the
                    search word fits the best with the least number of excluded characters.
                    This would work perfectly for the examples you gave, but whether that
                    would address your more general goals I can't say.
                    >
                    [SNIP]
                    Pete
                    That sounds interesting. Is there any chance of seeing it as a
                    CodeProject article? :D

                    Comment

                    • Peter Duniho

                      #11
                      Re: Search a String

                      On Thu, 02 Oct 2008 22:21:06 -0700, Arto Viitanen
                      <arto.viitanen@ microteam.fiwro te:
                      [...]
                      Spelling checker does not help. For example the text contains "Cut" and
                      the user searches for "Cat". Both are spelled right, so the search gives
                      no result.
                      A spell checker itself wouldn't help. But if you apply the same algorithm
                      that a spell checker would, but with a dictionary that only has the word
                      you're specifically looking for at the moment, presumably those algorithms
                      would spit out some reasonable assessment of "closeness" for the word as
                      compared to the text being searched.

                      After all, for a spell checker to evaluate what the "best match" is for
                      misspelled words, it has to be able to compute some metric so that
                      possible matches can be ordered according to "closeness" .

                      For all I know, the mentioned "Levenshtei n distance" computation is in
                      fact something that a spell checker might use. But regardless, it seems
                      both would be good starting points for the question.

                      Pete

                      Comment

                      • Peter Duniho

                        #12
                        Re: Search a String

                        On Fri, 03 Oct 2008 12:05:12 -0700, Fernando A. Gómez F.
                        <fernando.a.gom ez.f@gmail.comw rote:
                        [...]
                        > I've actually implemented a very simple approach myself in the past,
                        >in which I do a sort of character-by-character merge between the target
                        >text and the search word, finding the spot in the target text where the
                        >search word fits the best with the least number of excluded
                        >characters. This would work perfectly for the examples you gave, but
                        >whether that would address your more general goals I can't say.
                        >>
                        [SNIP]
                        >Pete
                        >
                        That sounds interesting. Is there any chance of seeing it as a
                        CodeProject article? :D
                        Nope. But I'll look for the code and try to remember to post it here.

                        After that, someone else can post it to CodeProject if they like.

                        Note that I wasn't kidding when I said it's not efficient code. The
                        application I needed it for involved very short strings and brute force
                        worked great. I haven't looked at the Levenshtein article yet, but I
                        suspect that the algorithm described there is similar to what I did, but
                        with some actual thought and cleverness applied to it so that it has a
                        reasonable computational cost. :)

                        Pete

                        Comment

                        • =?UTF-8?B?IkZlcm5hbmRvIEEuIEfDs21leiBGLiI=?=

                          #13
                          Re: Search a String

                          Peter Duniho wrote:
                          On Fri, 03 Oct 2008 12:05:12 -0700, Fernando A. Gómez F.
                          <fernando.a.gom ez.f@gmail.comw rote:
                          >
                          >[...]
                          >> I've actually implemented a very simple approach myself in the past,
                          >>in which I do a sort of character-by-character merge between the
                          >>target text and the search word, finding the spot in the target text
                          >>where the search word fits the best with the least number of excluded
                          >>characters. This would work perfectly for the examples you gave, but
                          >>whether that would address your more general goals I can't say.
                          >>>
                          >[SNIP]
                          >>Pete
                          >>
                          >That sounds interesting. Is there any chance of seeing it as a
                          >CodeProject article? :D
                          >
                          Nope. But I'll look for the code and try to remember to post it here.
                          >
                          After that, someone else can post it to CodeProject if they like.
                          >
                          That would be nice, thank you.
                          Note that I wasn't kidding when I said it's not efficient code. The
                          application I needed it for involved very short strings and brute force
                          worked great. I haven't looked at the Levenshtein article yet, but I
                          suspect that the algorithm described there is similar to what I did, but
                          with some actual thought and cleverness applied to it so that it has a
                          reasonable computational cost. :)
                          >
                          I see. However perhaps it could be improved. It would be interesting, at
                          least, to see the approach you used, so if you have the time and mood,
                          I'd be most grateful if you could post such code.

                          Thanks. Best Regards.

                          Comment

                          • Peter Duniho

                            #14
                            Re: Search a String

                            On Fri, 03 Oct 2008 16:48:27 -0700, Fernando A. Gómez F.
                            <fernando.a.gom ez.f@gmail.comw rote:
                            [...]
                            >Note that I wasn't kidding when I said it's not efficient code. The
                            >application I needed it for involved very short strings and brute force
                            >worked great. I haven't looked at the Levenshtein article yet, but I
                            >suspect that the algorithm described there is similar to what I did,
                            >but with some actual thought and cleverness applied to it so that it
                            >has a reasonable computational cost. :)
                            >
                            I see. However perhaps it could be improved. It would be interesting, at
                            least, to see the approach you used, so if you have the time and mood,
                            I'd be most grateful if you could post such code.
                            Here you go (see below). Note that in my original problem, I was
                            comparing two complete strings and producing a metric that indicated how
                            similar the strings were, rather than searching for one string in
                            another. But you can easily modify this approach by using the same basic
                            technique, but only looking at substrings of the string being searched.

                            Though, actually...now that I look at this implementation again, I think
                            you could probably feed it a string to be searched and a string to look
                            for, and it would work okay. It could even be modified to return the
                            character index of the starting point of the best match for the string
                            being looked for. You'd want to limit how far it looks ahead, maybe
                            simply by delimiting attempted matches, or otherwise take into account
                            gaps within the matched characters, because otherwise it will consider
                            itself to have successfully matched the string even if each character it
                            found was actually just located in a different word. But other than that,
                            it's probably pretty close to what could work.

                            The basic idea below is to start out doing a basic string compare, but
                            when reaching a character that doesn't match, scan forward in each string
                            to try to match the current character in the other string (for two scans
                            each mis-match). If a matching character can be found, continue trying to
                            match characters again, looking for the point at which a comparison of the
                            remaining characters returns the highest "score". In this context, the
                            score is the number of characters that the algorithm could match before
                            running out of characters in one of the strings.

                            You'll note that the algorithm basically enumerates all of the possible
                            ways that the strings could match. There's essentially a binary search
                            tree that bifurcates each time there's a character mis-match, which means
                            that the cost of the algorithm has a worst-case cost that is exponential
                            by the length of the longest search. In practice, it will hit the
                            worst-case scenario only rarely, but it's at least worth understanding,
                            especially since you don't have to get very far into the algorithm for
                            even an average-case scenario to be relatively expensive as compared to
                            some other algorithm that might be designed more optimally.

                            Like I said, I was dealing with such short strings, I made no attempt at
                            all to be clever. It worked very well for my needs at the time, but I
                            make no promises about anyone else's. :)

                            Pete

                            static int ScoreStrings(st ring str1, string str2)
                            {
                            int ich = 0, cchMax = Math.Min(str1.L ength, str2.Length);
                            int scoreMax;

                            // Scan the strings, stopping at the first difference
                            while (ich < cchMax && str1[ich] == str2[ich])
                            {
                            ich++;
                            }

                            // If we've exhausted one of the strings, that's the best we're
                            // going to do
                            if (ich == cchMax)
                            {
                            return ich;
                            }

                            // For each string, scan forward looking for the current
                            character
                            // in the other. Check each possible match, and use the best
                            score
                            // we find.

                            scoreMax = Math.Max(ich,
                            Math.Max(ScoreS tringsScan(str1 .Substring(ich) , str2, ich),
                            ScoreStringsSca n(str2.Substrin g(ich), str1, ich)));

                            return scoreMax;
                            }

                            static int ScoreStringsSca n(string str1, string str2, int ichStart)
                            {
                            int scoreMax = 0;

                            for (int ichScan = ichStart; ichScan < str2.Length; ichScan++)
                            {
                            if (str1[0] == str2[ichScan])
                            {
                            int score = ScoreStrings(st r1,
                            str2.Substring( ichScan));

                            if (score scoreMax)
                            {
                            scoreMax = score;
                            }
                            }
                            }

                            return scoreMax;
                            }

                            Comment

                            • =?UTF-8?B?IkZlcm5hbmRvIEEuIEfDs21leiBGLiI=?=

                              #15
                              Re: Search a String

                              Peter Duniho wrote:
                              On Fri, 03 Oct 2008 16:48:27 -0700, Fernando A. Gómez F.
                              <fernando.a.gom ez.f@gmail.comw rote:
                              >
                              >[...]
                              >>Note that I wasn't kidding when I said it's not efficient code. The
                              >>application I needed it for involved very short strings and brute
                              >>force worked great. I haven't looked at the Levenshtein article yet,
                              >>but I suspect that the algorithm described there is similar to what I
                              >>did, but with some actual thought and cleverness applied to it so
                              >>that it has a reasonable computational cost. :)
                              >>
                              >I see. However perhaps it could be improved. It would be interesting,
                              >at least, to see the approach you used, so if you have the time and
                              >mood, I'd be most grateful if you could post such code.
                              >
                              Here you go (see below). Note that in my original problem, I was
                              comparing two complete strings and producing a metric that indicated how
                              similar the strings were, rather than searching for one string in
                              another. But you can easily modify this approach by using the same
                              basic technique, but only looking at substrings of the string being
                              searched.
                              >
                              Though, actually...now that I look at this implementation again, I think
                              you could probably feed it a string to be searched and a string to look
                              for, and it would work okay. It could even be modified to return the
                              character index of the starting point of the best match for the string
                              being looked for. You'd want to limit how far it looks ahead, maybe
                              simply by delimiting attempted matches, or otherwise take into account
                              gaps within the matched characters, because otherwise it will consider
                              itself to have successfully matched the string even if each character it
                              found was actually just located in a different word. But other than
                              that, it's probably pretty close to what could work.
                              >
                              The basic idea below is to start out doing a basic string compare, but
                              when reaching a character that doesn't match, scan forward in each
                              string to try to match the current character in the other string (for
                              two scans each mis-match). If a matching character can be found,
                              continue trying to match characters again, looking for the point at
                              which a comparison of the remaining characters returns the highest
                              "score". In this context, the score is the number of characters that
                              the algorithm could match before running out of characters in one of the
                              strings.
                              >
                              You'll note that the algorithm basically enumerates all of the possible
                              ways that the strings could match. There's essentially a binary search
                              tree that bifurcates each time there's a character mis-match, which
                              means that the cost of the algorithm has a worst-case cost that is
                              exponential by the length of the longest search. In practice, it will
                              hit the worst-case scenario only rarely, but it's at least worth
                              understanding, especially since you don't have to get very far into the
                              algorithm for even an average-case scenario to be relatively expensive
                              as compared to some other algorithm that might be designed more optimally.
                              >
                              Like I said, I was dealing with such short strings, I made no attempt at
                              all to be clever. It worked very well for my needs at the time, but I
                              make no promises about anyone else's. :)
                              >
                              Pete
                              Thanks a lot Pete, it looks interesting.

                              Regards,
                              Fernando.

                              Comment

                              Working...