Reading & Searching a Huge file

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Ahmad Jalil Qarshi

    Reading & Searching a Huge file

    Hi,

    I have a text file having size about 2 GB. The text file format is like:

    Numeric value[space]AlphaNumeric values
    Numeric value[space]AlphaNumeric values
    Numeric value[space]AlphaNumeric values

    For example consider following chunk of actual data:

    52 SANZP3118.00Q31 26T070533N16CEP XT
    96 SANZP3118.00Q72 94T070534N17CEP
    100 SBHPP3200.00Q16 000T070534N18CE PXT
    150 SBHPP3200.00Q10 00T070534N19CEP
    153 SBHPP3200.00Q50 00070534N20CEP
    159 SBHPP3200.00Q30 00070534N21CEP
    168 SBHPP3200.00Q11 1000T070534N22C EP
    168 SLHGP375.00Q400 0T070534N26CEP
    210 SBHPP3200.0Q000 T070534N23CEP
    569 SLHGP375.00Q230 00T070534N24CEP
    1000 SLHGP375.00Q162 000T070534N25CE P
    1010 SLHGP375.00Q400 0T070534N26CEP

    Now what I want to is to find a specific numeric value at the beginning of
    the line. For example in the above few lines the numeric values are
    52,96,100,150,1 53,159,168,210, 569,1000,1010 (these values are always
    sorted). Suppose I have to search 168. Since the values are sorted I can use
    the binary search technique to search a specific numeric value. But for that
    I will have to read all the numeric values sequentially into some array in
    memory. The other way is sequential search i.e to read line by line and
    fetch the numeric value before [space] and compare it with required one.

    What is the best possible way to perform searching in the above mentioned
    file format.

    Thanks in anticipation,

    Regards,

    Ahmad Jalil Qarshi


  • Peter Duniho

    #2
    Re: Reading & Searching a Huge file

    On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
    <ahmaddearNO@SP AMhotmail.comwr ote:
    [...]
    Now what I want to is to find a specific numeric value at the beginning
    of
    the line. For example in the above few lines the numeric values are
    52,96,100,150,1 53,159,168,210, 569,1000,1010 (these values are always
    sorted). Suppose I have to search 168. Since the values are sorted I can
    use
    the binary search technique to search a specific numeric value. But for
    that
    I will have to read all the numeric values sequentially into some array
    in
    memory. The other way is sequential search i.e to read line by line and
    fetch the numeric value before [space] and compare it with required one.
    >
    What is the best possible way to perform searching in the above mentioned
    file format.
    How many times do you need to perform this search?

    If it's just once, then I think you could take advantage of the fact that
    the lines are all _almost_ the same length and do a binary search on the
    file itself. You'll have to make a guess, then search for the
    previous/next end-of-line, then check the numeric value at the beginning
    of that line. That's not as efficient as it could be if you had identical
    line lengths, but compared to scanning through a 2GB file it should still
    be plenty fast.

    If you're going to perform the search on a regular basis, I'd recommend
    creating an index for the file. Scan it once, storing just the initial
    numeric value and a byte offset within the file representing where that
    record's line starts. Then later you can do your binary search on the
    index instead of the file. This is, I think, what you're suggesting when
    you wrote "read all the numeric values sequentially into some array"; you
    didn't mention also storing the file offsets for each record, but
    hopefully you realize that's implied since otherwise keeping the values in
    an array wouldn't be useful.

    Of course, keep in mind that this is the sort of thing that databases are
    really good at. If you have a way to redirect the data into a database
    rather than keeping it in this huge text file, I think that would be the
    best way to handle it.

    Pete

    Comment

    • Ahmad Jalil Qarshi

      #3
      Re: Reading &amp; Searching a Huge file

      Thanks Peter & Jon,

      I have to read the file on user request. Now the problem is that I can't
      create index as the format of file is fixed. So I think the only option left
      is binary search.

      Let me explain one more thing which I missed in my previous post. The
      numeric value before the [space] can be repeated thousands of time
      consequtively. Like.
      ........ ............... ...............
      2073 SANZP311800Q729 T070534N17CEP
      2345 SBHPP3200.00Q16 000T070534N18CE PXT
      2345 SBHPP3200.00Q10 00T070534N19CEP
      2345 SBHPP3200.00Q50 00070534N20CEPT P
      ........ ............... ...............
      ........ ............... ...............
      2345 SLHGP375.00Q400 0T070534N26CEP
      2567 SBHPP3200.0Q000 T070534N23CEP
      ......... ............... ...............

      Now if I have to find the first occurance of a numeric value 2345. In that
      case binary search will help or not?

      Regards,

      "Peter Duniho" <NpOeStPeAdM@nn owslpianmk.comw rote in message
      news:op.t70rb3i r8jd0ej@petes-computer.local. ..
      On Fri, 14 Mar 2008 10:35:49 -0700, Ahmad Jalil Qarshi
      <ahmaddearNO@SP AMhotmail.comwr ote:
      >
      >[...]
      >Now what I want to is to find a specific numeric value at the beginning
      >of
      >the line. For example in the above few lines the numeric values are
      >52,96,100,150, 153,159,168,210 ,569,1000,1010 (these values are always
      >sorted). Suppose I have to search 168. Since the values are sorted I can
      >use
      >the binary search technique to search a specific numeric value. But for
      >that
      >I will have to read all the numeric values sequentially into some array
      >in
      >memory. The other way is sequential search i.e to read line by line and
      >fetch the numeric value before [space] and compare it with required one.
      >>
      >What is the best possible way to perform searching in the above mentioned
      >file format.
      >
      How many times do you need to perform this search?
      >
      If it's just once, then I think you could take advantage of the fact that
      the lines are all _almost_ the same length and do a binary search on the
      file itself. You'll have to make a guess, then search for the
      previous/next end-of-line, then check the numeric value at the beginning
      of that line. That's not as efficient as it could be if you had identical
      line lengths, but compared to scanning through a 2GB file it should still
      be plenty fast.
      >
      If you're going to perform the search on a regular basis, I'd recommend
      creating an index for the file. Scan it once, storing just the initial
      numeric value and a byte offset within the file representing where that
      record's line starts. Then later you can do your binary search on the
      index instead of the file. This is, I think, what you're suggesting when
      you wrote "read all the numeric values sequentially into some array"; you
      didn't mention also storing the file offsets for each record, but
      hopefully you realize that's implied since otherwise keeping the values in
      an array wouldn't be useful.
      >
      Of course, keep in mind that this is the sort of thing that databases are
      really good at. If you have a way to redirect the data into a database
      rather than keeping it in this huge text file, I think that would be the
      best way to handle it.
      >
      Pete

      Comment

      • Peter Duniho

        #4
        Re: Reading &amp; Searching a Huge file

        On Sat, 15 Mar 2008 02:36:29 -0700, Ahmad Jalil Qarshi
        <ahmaddearNO@SP AMhotmail.comwr ote:
        Thanks Peter & Jon,
        >
        I have to read the file on user request. Now the problem is that I can't
        create index as the format of file is fixed. So I think the only option
        left
        is binary search.
        Well, a linear scan is always an option. Whether that performs well
        enough to justify avoiding the work of implementing a binary search,
        that's up to you or whoever's making the decision about how your time is
        best used.

        If the text file uses an encoding that has fixed-sized characters, I think
        that a binary search isn't actually going to be very hard to implement.
        But as Jon points out, if you have to support multiple encodings and/or
        encodings with variable-length characters (e.g. UTF-8), then it gets more
        complicated (a little...I'm no character-encoding expert, but I'm under
        the impression that in any of the encodings, finding an end-of-line marker
        while scanning backwards would not be difficult; a LF or CR/LF pair isn't
        a valid sequence in any other character's data, is it?).

        In either case, it's certainly possible. The question is what's the best
        use of your time.

        I still wonder at the requirement that this has to be done "on demand",
        without an opportunity to index the file. That's an awful lot of data to
        be search just once. But if "the powers that be" deem it more sensible to
        make you implement a binary search on a variable-line-length text file
        than to fix the problem space to better incorporate an indexing- or
        database-based solution, I guess that's what you're stuck with.

        I have to admit: if these happen infrequently enough that indexing a file
        doesn't make sense, I wonder if they don't also happen infrequently enough
        that just scanning linearly through the file wouldn't be okay. Why does
        this need to be fast?
        Let me explain one more thing which I missed in my previous post. The
        numeric value before the [space] can be repeated thousands of time
        consequtively. [...]
        >
        Now if I have to find the first occurance of a numeric value 2345. In
        that
        case binary search will help or not?
        It will help just as much as it would in any other situation where a
        binary search can be used. Binary search doesn't imply unique elements,
        but if you have duplicated elements that does mean that you have to do
        some extra work to scan and find the first matching element, since the
        first element the search finds may not be the actual first element in the
        sequence.

        Pete

        Comment

        Working...