Fastest way to search text file for string

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Julie

    Fastest way to search text file for string

    What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
    for a given string.

    The files are unindexed and unsorted, and for the purposes of my immediate
    requirements, can't be indexed/sorted.

    I don't want to load the entire file into physical memory, memory-mapped files
    are ok (and preferred). Speed/performance is a requirement -- the target is to
    locate the string in 10 seconds or less for a 100 MB file. The search string
    is typically 10 characters or less. Finally, I don't want to spawn out to an
    external executable (e.g. grep), but include the algorithm/method directly in
    the .NET code base. For the first rev, wildcard support is not a requirement.

    Thanks for any pointers!
  • Hermit Dave

    #2
    Re: Fastest way to search text file for string

    i would suggest that you have a look at Regex implmentation. I think regex
    is the fastest when it comes to scanning.
    You might need to use filestream to load the file so i dont think its the
    most appropriate answer.
    anyways make a local copy of one of those files and give Regex a try. see if
    it comes anywhere near the 10 sec mark.

    --

    Regards,

    Hermit Dave
    (http://hdave.blogspot.com)
    "Julie" <julie@nospam.c om> wrote in message
    news:414B76DD.2 0637E2A@nospam. com...[color=blue]
    > What is the *fastest* way in .NET to search large on-disk text files (100+[/color]
    MB)[color=blue]
    > for a given string.
    >
    > The files are unindexed and unsorted, and for the purposes of my immediate
    > requirements, can't be indexed/sorted.
    >
    > I don't want to load the entire file into physical memory, memory-mapped[/color]
    files[color=blue]
    > are ok (and preferred). Speed/performance is a requirement -- the target[/color]
    is to[color=blue]
    > locate the string in 10 seconds or less for a 100 MB file. The search[/color]
    string[color=blue]
    > is typically 10 characters or less. Finally, I don't want to spawn out to[/color]
    an[color=blue]
    > external executable (e.g. grep), but include the algorithm/method directly[/color]
    in[color=blue]
    > the .NET code base. For the first rev, wildcard support is not a[/color]
    requirement.[color=blue]
    >
    > Thanks for any pointers![/color]


    Comment

    • Drebin

      #3
      Re: Fastest way to search text file for string

      I wouldn't spend anymore time on see *if* you can do this, until you find
      out *why* you have to do this!

      100mb flat file?? This is exactly the reason why relational databases were
      made and are still used for just about everything. Without knowing more
      about your app, I'd rather take the 2 minutes to load this into a SQL table,
      build an index - and then what you want to do, suddenly becomes quick
      (sub-second), simple and will support wildcards later. Maybe bulk-load your
      file at night - and have your front-end hit the database during the day?

      I don't think you will be happy with just about any solution. Every response
      you will get to this is either going to be way to slow -or- way too
      complicated. You're re-inventing the wheel!!

      My $ .02



      "Julie" <julie@nospam.c om> wrote in message
      news:414B76DD.2 0637E2A@nospam. com...[color=blue]
      > What is the *fastest* way in .NET to search large on-disk text files (100+
      > MB)
      > for a given string.
      >
      > The files are unindexed and unsorted, and for the purposes of my immediate
      > requirements, can't be indexed/sorted.
      >
      > I don't want to load the entire file into physical memory, memory-mapped
      > files
      > are ok (and preferred). Speed/performance is a requirement -- the target
      > is to
      > locate the string in 10 seconds or less for a 100 MB file. The search
      > string
      > is typically 10 characters or less. Finally, I don't want to spawn out to
      > an
      > external executable (e.g. grep), but include the algorithm/method directly
      > in
      > the .NET code base. For the first rev, wildcard support is not a
      > requirement.
      >
      > Thanks for any pointers![/color]


      Comment

      • cody

        #4
        Re: Fastest way to search text file for string

        Just load the Text file into a large string and use string.IndexOf( ) this
        should be even faster than RegEx.

        --
        cody

        [Freeware, Games and Humor]
        www.deutronium.de.vu || www.deutronium.tk
        "Julie" <julie@nospam.c om> schrieb im Newsbeitrag
        news:414B76DD.2 0637E2A@nospam. com...[color=blue]
        > What is the *fastest* way in .NET to search large on-disk text files (100+[/color]
        MB)[color=blue]
        > for a given string.
        >
        > The files are unindexed and unsorted, and for the purposes of my immediate
        > requirements, can't be indexed/sorted.
        >
        > I don't want to load the entire file into physical memory, memory-mapped[/color]
        files[color=blue]
        > are ok (and preferred). Speed/performance is a requirement -- the target[/color]
        is to[color=blue]
        > locate the string in 10 seconds or less for a 100 MB file. The search[/color]
        string[color=blue]
        > is typically 10 characters or less. Finally, I don't want to spawn out to[/color]
        an[color=blue]
        > external executable (e.g. grep), but include the algorithm/method directly[/color]
        in[color=blue]
        > the .NET code base. For the first rev, wildcard support is not a[/color]
        requirement.[color=blue]
        >
        > Thanks for any pointers![/color]


        Comment

        • John Timney \(Microsoft MVP\)

          #5
          Re: Fastest way to search text file for string

          Given the size of the file, probably the only way would be to use a
          filebytearray, load the file in as bytes 1 at a time and convert them to
          chars by creating an indexer. You will need to work out a way of checking
          if a series of chars make up the string you are looking for.

          I would start here.


          vcwlkindexerstu torial.asp

          However, I very much doubt you will manage to scan 100 meg of data in 10
          seconds.

          --
          Regards

          John Timney
          Microsoft Regional Director
          Microsoft MVP


          "Julie" <julie@nospam.c om> wrote in message
          news:414B76DD.2 0637E2A@nospam. com...[color=blue]
          > What is the *fastest* way in .NET to search large on-disk text files (100+[/color]
          MB)[color=blue]
          > for a given string.
          >
          > The files are unindexed and unsorted, and for the purposes of my immediate
          > requirements, can't be indexed/sorted.
          >
          > I don't want to load the entire file into physical memory, memory-mapped[/color]
          files[color=blue]
          > are ok (and preferred). Speed/performance is a requirement -- the target[/color]
          is to[color=blue]
          > locate the string in 10 seconds or less for a 100 MB file. The search[/color]
          string[color=blue]
          > is typically 10 characters or less. Finally, I don't want to spawn out to[/color]
          an[color=blue]
          > external executable (e.g. grep), but include the algorithm/method directly[/color]
          in[color=blue]
          > the .NET code base. For the first rev, wildcard support is not a[/color]
          requirement.[color=blue]
          >
          > Thanks for any pointers![/color]


          Comment

          • Julie

            #6
            Re: Fastest way to search text file for string

            Drebin wrote:[color=blue]
            >
            > I wouldn't spend anymore time on see *if* you can do this, until you find
            > out *why* you have to do this![/color]

            All requirements have been defined at this point by project management. This
            isn't just a blind decision, but the result of the examination of the domain
            and expected results.
            [color=blue]
            > 100mb flat file?? This is exactly the reason why relational databases were
            > made and are still used for just about everything. Without knowing more
            > about your app, I'd rather take the 2 minutes to load this into a SQL table,
            > build an index - and then what you want to do, suddenly becomes quick
            > (sub-second), simple and will support wildcards later. Maybe bulk-load your
            > file at night - and have your front-end hit the database during the day?[/color]

            Yes, 100+ MB flat files. These are loosely formatted datafiles from external
            laboratory instruments.

            Remember, proper implementation dictates that you implement what is required,
            nothing more. The current requirements are simple access & management of these
            text files that allows immediate searching that averages 10 seconds or less.
            WinGrep accomplishes this in 6 seconds on our target system (1.3 GHz, 500 MB
            RAM).

            Future requirements *may* dictate that additional time constraints are imposed,
            which would then lead to db or other external indexing of the files. But, that
            will be implemented when and if necessary. If you have questions about this
            approach, you may want to look into the (industrial) extreme programming
            paradigm, which is what our shop successfully uses.
            [color=blue]
            > I don't think you will be happy with just about any solution. Every response
            > you will get to this is either going to be way to slow -or- way too
            > complicated. You're re-inventing the wheel!![/color]

            Nay, my good friend, not re-inventing the wheel, but asking where the wheel
            is. Text-searching of large files isn't uncommon or inappropriate. I'm just
            looking into comments on such searches in .Net; this stuff is fairly trivial in
            C++/Win32 (I'd prefer *not* to drop down to managed/unmanaged C++ for this
            project).
            [color=blue]
            >
            > "Julie" <julie@nospam.c om> wrote in message
            > news:414B76DD.2 0637E2A@nospam. com...[color=green]
            > > What is the *fastest* way in .NET to search large on-disk text files (100+
            > > MB)
            > > for a given string.
            > >
            > > The files are unindexed and unsorted, and for the purposes of my immediate
            > > requirements, can't be indexed/sorted.
            > >
            > > I don't want to load the entire file into physical memory, memory-mapped
            > > files
            > > are ok (and preferred). Speed/performance is a requirement -- the target
            > > is to
            > > locate the string in 10 seconds or less for a 100 MB file. The search
            > > string
            > > is typically 10 characters or less. Finally, I don't want to spawn out to
            > > an
            > > external executable (e.g. grep), but include the algorithm/method directly
            > > in
            > > the .NET code base. For the first rev, wildcard support is not a
            > > requirement.
            > >
            > > Thanks for any pointers![/color][/color]

            Comment

            • Julie

              #7
              Re: Fastest way to search text file for string

              "John Timney (Microsoft MVP)" wrote:[color=blue]
              >
              > Given the size of the file, probably the only way would be to use a
              > filebytearray, load the file in as bytes 1 at a time and convert them to
              > chars by creating an indexer. You will need to work out a way of checking
              > if a series of chars make up the string you are looking for.
              >
              > I would start here.
              >
              > http://msdn.microsoft.com/library/de...us/csref/html/
              > vcwlkindexerstu torial.asp
              >
              > However, I very much doubt you will manage to scan 100 meg of data in 10
              > seconds.[/color]

              Thanks, I'll look into that.

              WinGrep performs the search in 6 seconds on our target system (1.3 GHz, 500 MB
              RAM). (WinGrep is not open source, and C++/Win32.)

              [color=blue]
              >
              > --
              > Regards
              >
              > John Timney
              > Microsoft Regional Director
              > Microsoft MVP
              >
              > "Julie" <julie@nospam.c om> wrote in message
              > news:414B76DD.2 0637E2A@nospam. com...[color=green]
              > > What is the *fastest* way in .NET to search large on-disk text files (100+[/color]
              > MB)[color=green]
              > > for a given string.
              > >
              > > The files are unindexed and unsorted, and for the purposes of my immediate
              > > requirements, can't be indexed/sorted.
              > >
              > > I don't want to load the entire file into physical memory, memory-mapped[/color]
              > files[color=green]
              > > are ok (and preferred). Speed/performance is a requirement -- the target[/color]
              > is to[color=green]
              > > locate the string in 10 seconds or less for a 100 MB file. The search[/color]
              > string[color=green]
              > > is typically 10 characters or less. Finally, I don't want to spawn out to[/color]
              > an[color=green]
              > > external executable (e.g. grep), but include the algorithm/method directly[/color]
              > in[color=green]
              > > the .NET code base. For the first rev, wildcard support is not a[/color]
              > requirement.[color=green]
              > >
              > > Thanks for any pointers![/color][/color]

              Comment

              • Frans Bouma [C# MVP]

                #8
                Re: Fastest way to search text file for string

                Julie wrote:[color=blue]
                > Nay, my good friend, not re-inventing the wheel, but asking where the wheel
                > is. Text-searching of large files isn't uncommon or inappropriate. I'm
                > just looking into comments on such searches in .Net; this stuff is fairly
                > trivial in C++/Win32 (I'd prefer not to drop down to managed/unmanaged C++
                > for this project).[/color]

                Searching text in textblocks should use the textsearch algorithm by
                Knuth-Morris-More or the Boyer-Moore variant. These algorithms are much
                faster than the brute force algorithms implemented in the string class.

                Algorithms in C by Sedgewick contains a description of these algorithms, and
                I'm sure you'll find some descriptions on the internet.

                Basicly they come down to this:
                string: ababababcababab abacababababacb abababacbababab c

                if you now try to find the string abc, do not start at teh first character,
                but at the last. So in the string, the 3rd character is an 'a'. So abc will
                never start at the first character of the string, so we can skip the first 3.
                It works with skip arrays and is quite clever, it will tremendously speed up
                string search, especially with large texts.

                Frans.


                --
                Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
                My .NET Blog: http://weblogs.asp.net/fbouma
                Microsoft C# MVP

                Comment

                • James Curran

                  #9
                  Re: Fastest way to search text file for string

                  Julie wrote:[color=blue]
                  > What is the *fastest* way in .NET to search large on-disk text files
                  > (100+ MB) for a given string.
                  >
                  > I don't want to load the entire file into physical memory,
                  > memory-mapped files are ok (and preferred).[/color]

                  The problem is that fast access to a large file requires direct access
                  to memory, which is antithetical to managed code. Your best choice would be
                  to isolate the search in a unmanaged C++ function, which is coded by the
                  managed C# app. Access the file as a memory mapped file is fairly easy in
                  unmanaged code, but I don't believe it's possible at all in a managed app.

                  The best algorithm for searching your file is probably the Boyer-Moore
                  method. Moore himself has a cool webpage graphically demostrating it:
                  http://www.cs.utexas.edu/users/moore...ing/index.html.
                  Googling "Boyer-Moore" will provide any number of implementations , such as
                  this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html

                  --
                  Truth,
                  James Curran [MVP]
                  www.NJTheater.com (Professional)
                  www.NovelTheory.com (Personal)




                  Comment

                  • William Stacey [MVP]

                    #10
                    Re: Fastest way to search text file for string

                    If you want to jam em out in c#, I for one would really appreaciate it and
                    keep for future use. Or post at CodeProject. TIA.

                    --
                    William Stacey, MVP

                    "Frans Bouma [C# MVP]" <perseus.usenet NOSPAM@xs4all.n l> wrote in message
                    news:xn0dnf2oxg rnt1000@msnews. microsoft.com.. .[color=blue]
                    > Julie wrote:[color=green]
                    > > Nay, my good friend, not re-inventing the wheel, but asking where the[/color][/color]
                    wheel[color=blue][color=green]
                    > > is. Text-searching of large files isn't uncommon or inappropriate. I'm[/color][/color]
                    ....

                    Comment

                    • Frans Bouma [C# MVP]

                      #11
                      Re: Fastest way to search text file for string

                      William Stacey [MVP] wrote:
                      [color=blue]
                      > If you want to jam em out in c#, I for one would really appreaciate it and
                      > keep for future use. Or post at CodeProject. TIA.[/color]

                      I made a typo, It's Knutt-Morris-Pratt, not Knutt-Morris-Moore.

                      Here is a link to a lot of string search algorithms. Both I mentioned are
                      there with C-code and explanation. It's pretty straight forward :)]
                      EXACT STRING MATCHING ALGORITHMS Animation in Java


                      Frans.

                      --
                      Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
                      My .NET Blog: http://weblogs.asp.net/fbouma
                      Microsoft C# MVP

                      Comment

                      • Julie

                        #12
                        Re: Fastest way to search text file for string

                        Drebin wrote:[color=blue]
                        >
                        > "Julie" <julie@nospam.c om> wrote in message
                        > news:414C3ACD.5 7F7A919@nospam. com...[color=green]
                        > > Remember, proper implementation dictates that you implement what is
                        > > required,
                        > > nothing more.[/color]
                        >
                        > Wow. Most every other developer would argue that "proper" implementation
                        > dictates: functionality that is required and then any other structural or
                        > design implementations that will likely be used in the future.[/color]

                        I was in that group of developers 6 months ago. I was raised old school where
                        requirements, specs, and definitions are done before coding starts.

                        Believe me, the paradigm switch *wasn't* easy nor intuitive. However, I have
                        to say, that there is a lot to IXP that really impacts the production quality
                        *and* performance of the entire project. I'm no expert, but I'm making
                        progress, and probably won't switch back to the old way.

                        And I guarantee that I wasn't excited about the switch. However, I committed
                        to trying it for 6 months, and then reevaluate. So far, so good.
                        [color=blue]
                        > That's like saying "my customers told me to build a house and they want 3
                        > bedrooms".[/color]

                        Right -- and that is what you build. The foresight is to build it in such a
                        fashion that that future changes can be easily accomplished. One of the
                        primary keys is constant refactoring of code and TDD. That leaves the code in
                        a much more stable and flexible state. It doesn't sound efficient, but when
                        you are proficient at it, it really is more effective.
                        [color=blue]
                        > You need to have the foresight and wisdom to anticipate that
                        > later they are going to say "we want a kitchen and bathrooms too!".[/color]

                        In your hypothetical example, what if they never want a kitchen/bathroom? Then
                        you have wasted work.
                        [color=blue]
                        > If you
                        > don't, you will *always* be doing way more work than you need to! Customers
                        > often don't know what they want - it's up to your expertise to help them
                        > define that.
                        >
                        > For your app, common sense tells me: if they want to search this data, they
                        > will want to sort it, and they will want to browse and filter results. Going
                        > your route, EACH step will likely take just as long. If you did it the right
                        > way once, in the beginning - then each additional thing the users wanted
                        > would be practically free.[/color]

                        See, your common sense is already wrong, and if followed, leads down the wrong
                        path and direction for this application.

                        Anyway, the thread isn't to discuss the merits of IXP, merely looking for a
                        quick text search implementation in C# -- nothing more, nothing less.[color=blue]
                        >[color=green]
                        > > If you have questions about this
                        > > approach, you may want to look into the (industrial) extreme programming
                        > > paradigm, which is what our shop successfully uses.[/color]
                        >
                        > You can call it what you want and justify it however, but I'm really glad I
                        > don't work where you work. :-)[/color]

                        Comment

                        • Julie

                          #13
                          Re: Fastest way to search text file for string

                          Michael C wrote:[color=blue]
                          >[color=green]
                          > > That's like saying "my customers told me to build a house and they want 3
                          > > bedrooms". You need to have the foresight and wisdom to anticipate that
                          > > later they are going to say "we want a kitchen and bathrooms too!". If you
                          > > don't, you will *always* be doing way more work than you need to![/color]
                          > Customers[color=green]
                          > > often don't know what they want - it's up to your expertise to help them
                          > > define that.[/color]
                          >
                          > If you're storing 100MB of data in a flat file, you're probably not running
                          > very efficiently to begin with. Like Drebin said, you need to look at your
                          > ultimate goals and you'll probably find much better solutions out there
                          > (i.e., SQL, etc.)[/color]

                          Like I indicated in another follow-up: "These are loosely formatted datafiles
                          from external laboratory instruments." The requirement is to work w/ these
                          files, not change the file format.
                          [color=blue]
                          > That being said, you need to have the foresight, wisdom and *discipline* to
                          > define all requirements *up-front*. Adding extra kitchens and bathrooms to
                          > the floorplan halfway through the house building process jacks the cost - in
                          > time and $$$'s - *way* up. It's true that a lot of users don't understand
                          > the requirements definition process, and many aren't aware of the features
                          > they'll want down the road; that's where you step in and help guide them
                          > during the planning process. If you're constantly re-writing your
                          > applications because the user requirements keep changing during the
                          > implementation phase, you might want to look at improving your own planning
                          > process.[/color]

                          Thank you for the compliment! Yes, we definitely do have foresight, wisdom and
                          discipline. The conclusion of our investigation was the following
                          requirements:

                          Search an approx. 100 MB flat text file for a given string in an average of 10
                          seconds or less, no wildcards.

                          Now, if I can just have the *discipline* to implement something that meets
                          _those_ requirements, I get the job done, am appreciated by the team, we
                          release the product, I get paid, and everyone is happy.

                          Comment

                          • Julie

                            #14
                            Re: Fastest way to search text file for string

                            James Curran wrote:[color=blue]
                            >
                            > Julie wrote:[color=green]
                            > > What is the *fastest* way in .NET to search large on-disk text files
                            > > (100+ MB) for a given string.
                            > >
                            > > I don't want to load the entire file into physical memory,
                            > > memory-mapped files are ok (and preferred).[/color]
                            >
                            > The problem is that fast access to a large file requires direct access
                            > to memory, which is antithetical to managed code. Your best choice would be
                            > to isolate the search in a unmanaged C++ function, which is coded by the
                            > managed C# app. Access the file as a memory mapped file is fairly easy in
                            > unmanaged code, but I don't believe it's possible at all in a managed app.
                            >
                            > The best algorithm for searching your file is probably the Boyer-Moore
                            > method. Moore himself has a cool webpage graphically demostrating it:
                            > http://www.cs.utexas.edu/users/moore...ing/index.html.
                            > Googling "Boyer-Moore" will provide any number of implementations , such as
                            > this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html[/color]

                            Thanks for the tips. I wasn't aware that mm file support wasn't available in
                            ..NET, seems short-sighted to me.

                            Managed/unmanaged code really isn't a possibility, part of the requirements is
                            that it is implemented in C#. I may be able to get away w/ using interop
                            though for the mm file support.

                            Comment

                            • Julie

                              #15
                              Re: Fastest way to search text file for string

                              Julie wrote:[color=blue]
                              >
                              > James Curran wrote:[color=green]
                              > >
                              > > Julie wrote:[color=darkred]
                              > > > What is the *fastest* way in .NET to search large on-disk text files
                              > > > (100+ MB) for a given string.
                              > > >
                              > > > I don't want to load the entire file into physical memory,
                              > > > memory-mapped files are ok (and preferred).[/color]
                              > >
                              > > The problem is that fast access to a large file requires direct access
                              > > to memory, which is antithetical to managed code. Your best choice would be
                              > > to isolate the search in a unmanaged C++ function, which is coded by the
                              > > managed C# app. Access the file as a memory mapped file is fairly easy in
                              > > unmanaged code, but I don't believe it's possible at all in a managed app.
                              > >
                              > > The best algorithm for searching your file is probably the Boyer-Moore
                              > > method. Moore himself has a cool webpage graphically demostrating it:
                              > > http://www.cs.utexas.edu/users/moore...ing/index.html.
                              > > Googling "Boyer-Moore" will provide any number of implementations , such as
                              > > this one: http://www.dcc.uchile.cl/~rbaeza/han...3b.srch.c.html[/color]
                              >
                              > Thanks for the tips. I wasn't aware that mm file support wasn't available in
                              > .NET, seems short-sighted to me.
                              >
                              > Managed/unmanaged code really isn't a possibility, part of the requirements is
                              > that it is implemented in C#. I may be able to get away w/ using interop
                              > though for the mm file support.[/color]

                              Not interop, but P/Invoke...

                              Comment

                              Working...