File Processing

  • Jeff

    File Processing

    Hello

    I want to read, process, and rewrite a very large disk-based file
    (>3 GB) as quickly as possible. The processing effectively involves
    finding certain strings and replacing them with other strings of equal
    length, so that the file size is unaltered (the file is uncompressed,
    btw). I wondered if anyone could advise me of the best way to do this,
    and also of things to avoid. More specifically, I was wondering:

    - Is it best to open a single file for read-write access and overwrite
      the changed bytes, or would it be better to create a new file?
    - Is there any point in buffering bytes in rather than reading one byte
      at a time, or does this just defeat the buffering that's done by the
      OS anyway?
    - Would this benefit from multi-threading - read, process, write?

    And finally, could anyone point me to any sample code which already
    does this sort of thing in the fastest possible way?

    Many Thanks
    Jeff


  • Victor Bazarov

    #2
    Re: File Processing

    Jeff wrote:
    > I want to read, process, and rewrite a very large disk-based file
    > (>3 GB) as quickly as possible. The processing effectively involves
    > finding certain strings and replacing them with other strings of
    > equal length, so that the file size is unaltered (the file is
    > uncompressed, btw). I wondered if anyone could advise me of the best
    > way to do this, and also of things to avoid. More specifically, I
    > was wondering:
    >
    > - Is it best to open a single file for read-write access and
    >   overwrite the changed bytes, or would it be better to create a new
    >   file?

    It is always a good idea to leave the old file intact, unless you can
    somehow ensure that a single write operation will never fail and that
    an incomplete set of find/replace operations is still OK. Ask in any
    database development newsgroup.
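
    A minimal sketch of that copy-then-rename pattern (C++17 for the
    final rename; process_chunk is a hypothetical stand-in for the actual
    equal-length find/replace, and matches that span chunk boundaries
    need extra care, as a later sketch in this thread shows):

        #include <cstddef>
        #include <filesystem>
        #include <fstream>
        #include <stdexcept>
        #include <string>
        #include <vector>

        // Hypothetical stand-in for the real find/replace; it must not
        // change the chunk's length.
        void process_chunk(char* /*data*/, std::size_t /*n*/) { /* ... */ }

        void rewrite_safely(const std::string& path)
        {
            const std::string tmp = path + ".tmp";
            std::ifstream in(path, std::ios::binary);
            std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
            if (!in || !out) throw std::runtime_error("cannot open files");

            std::vector<char> buf(1 << 20);          // 1 MiB chunks
            while (in) {
                in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
                const std::streamsize n = in.gcount();
                if (n <= 0) break;
                process_chunk(buf.data(), static_cast<std::size_t>(n));
                out.write(buf.data(), n);
            }
            out.close();
            if (!out) throw std::runtime_error("write failed");

            // Replace the original only once the copy is complete.
            std::filesystem::rename(tmp, path);
        }

    Where the rename is atomic (as on POSIX filesystems), a crash leaves
    either the old file or the new one, never a half-edited mix.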
    > - Is there any point in buffering bytes in rather than reading one
    >   byte at a time, or does this just defeat the buffering that's done
    >   by the OS anyway?

    You'd have to experiment. The C++ language does not define any
    buffering as far as the OS is concerned.

    > - Would this benefit from multi-threading - read, process, write?

    Unlikely. Processing will take very little time compared to the I/O,
    and the I/O is going to be the bottleneck anyway, so...

    [..]
    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask


    • James Kanze

      #3
      Re: File Processing

      On Sep 30, 9:35 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
      > Jeff wrote:
      >> I want to read, process, and rewrite a very large disk-based file
      >> (>3 GB) as quickly as possible. The processing effectively
      >> involves finding certain strings and replacing them with other
      >> strings of equal length, so that the file size is unaltered (the
      >> file is uncompressed, btw). I wondered if anyone could advise me
      >> of the best way to do this, and also of things to avoid. More
      >> specifically, I was wondering:
      >>
      >> - Is it best to open a single file for read-write access and
      >>   overwrite the changed bytes, or would it be better to create a
      >>   new file?
      >
      > It is always a good idea to leave the old file intact, unless you
      > can somehow ensure that a single write operation will never fail
      > and that an incomplete set of find/replace operations is still OK.
      > Ask in any database development newsgroup.

      This is generally true, but he said a "very large" file. I'd have
      some hesitations about making a copy if the file size were, say,
      100 gigabytes.

      As always, you have to weigh the trade-offs. Making a copy is
      certainly a safer solution, if you can afford it.
      >> - Is there any point in buffering bytes in rather than reading
      >>   one byte at a time, or does this just defeat the buffering
      >>   that's done by the OS anyway?
      >
      > You'd have to experiment. The C++ language does not define any
      > buffering as far as the OS is concerned.

      C++ does define buffering in iostreams. But the fastest solution
      will almost certainly involve platform-specific requests. I'd
      probably start by using mmap on a Unix system, or
      CreateFileMapping/MapViewOfFile under Windows. If performance is
      really an issue, he'll probably have to experiment with different
      solutions, but I'd be surprised if anything was significantly
      faster than using a memory-mapped file, modified in place.

      But of course, as you pointed out above, this solution doesn't
      provide transactional integrity. And it only works if the process
      has enough available address space to map the file. (Probably no
      problem on a 64-bit processor, but likely not the case on a 32-bit
      one.)
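
      A minimal POSIX sketch of that memory-mapped, in-place approach
      (the Windows route via CreateFileMapping/MapViewOfFile is
      analogous; the naive byte scan and the error handling are
      illustrative, not a tuned implementation):

          #include <cstring>
          #include <stdexcept>
          #include <fcntl.h>
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <unistd.h>

          // Equal-length, in-place replace over a memory-mapped file.
          // Note: no transactional integrity; a crash mid-run leaves a
          // partly edited file.
          void mmap_replace(const char* path, const char* from, const char* to)
          {
              const std::size_t n = std::strlen(from);
              if (n == 0 || n != std::strlen(to))
                  throw std::invalid_argument("patterns must have equal, non-zero length");

              int fd = ::open(path, O_RDWR);
              if (fd < 0) throw std::runtime_error("open failed");

              struct stat st;
              if (::fstat(fd, &st) != 0) { ::close(fd); throw std::runtime_error("fstat failed"); }
              const std::size_t len = static_cast<std::size_t>(st.st_size);

              void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
              ::close(fd);                     // the mapping survives the close
              if (p == MAP_FAILED) throw std::runtime_error("mmap failed");

              char* data = static_cast<char*>(p);
              for (std::size_t i = 0; i + n <= len; ) {
                  if (std::memcmp(data + i, from, n) == 0) {
                      std::memcpy(data + i, to, n);   // overwrite in place
                      i += n;
                  } else {
                      ++i;
                  }
              }

              ::msync(p, len, MS_SYNC);        // flush dirty pages to disk
              ::munmap(p, len);
          }

      As noted above, mapping a >3 GB file in one piece assumes a 64-bit
      address space; on 32 bits you'd have to map and process it a slice
      at a time.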
      >> - Would this benefit from multi-threading - read, process, write?
      >
      > Unlikely. Processing will take very little time compared to the
      > I/O, and the I/O is going to be the bottleneck anyway, so...

      If he uses memory mapping, the system will take care of all of the
      I/O behind his back anyway. Otherwise, some sort of asynchronous
      I/O can sometimes improve performance.

      --
      James Kanze (GABI Software)            email: james.kanze@gmail.com
      Conseils en informatique orientée objet/
      Beratung in objektorientierter Datenverarbeitung
      9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


      • jacek.dziedzic@gmail.com

        #4
        Re: File Processing

        On Sep 30, 8:44 pm, "Jeff" <some...@somewhere.com> wrote:
        > Hello
        >
        > I want to read, process, and rewrite a very large disk-based
        > file (>3 GB) as quickly as possible. The processing effectively
        > involves finding certain strings and replacing them with other
        > strings of equal length, so that the file size is unaltered
        > (the file is uncompressed, btw). I wondered if anyone could
        > advise me of the best way to do this, and also of things to
        > avoid. More specifically, I was wondering:
        >
        > - Is it best to open a single file for read-write access and
        >   overwrite the changed bytes, or would it be better to create
        >   a new file?

        Are you asking about performance or safety? As Victor pointed out
        already, it's always safer to work on a copy. Performance-wise,
        overwriting the bytes in the one file you have will be way faster
        than copying the file.
        > - Is there any point in buffering bytes in rather than reading
        >   one byte at a time, or does this just defeat the buffering
        >   that's done by the OS anyway?

        There is. If you intend to issue 3,000,000,000 read() calls to
        read a 3 GB file, one byte at a time, you're wasting quite a lot
        of time doing the calls. Reading in, say, 1 MB chunks would make
        it faster, although it complicates looking for the strings (chunk
        boundaries).
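
        A minimal sketch of the chunk-boundary handling (it only counts
        matches, to keep it short; the point is carrying the last
        pattern-length-minus-one bytes over into the next chunk):

            #include <cstring>
            #include <fstream>
            #include <stdexcept>
            #include <string>
            #include <vector>

            // Count occurrences of `pat` in a large file, reading 1 MiB
            // chunks. The last pat.size()-1 bytes of each chunk are
            // carried over, so matches straddling a boundary aren't lost.
            std::size_t count_matches(const std::string& path,
                                      const std::string& pat)
            {
                if (pat.empty()) throw std::invalid_argument("empty pattern");
                std::ifstream in(path, std::ios::binary);
                if (!in) throw std::runtime_error("cannot open file");

                const std::size_t chunk = 1 << 20;        // 1 MiB
                const std::size_t keep  = pat.size() - 1; // boundary overlap
                std::vector<char> buf;
                std::size_t count = 0;

                while (in) {
                    const std::size_t old = buf.size();   // carried bytes
                    buf.resize(old + chunk);
                    in.read(buf.data() + old, static_cast<std::streamsize>(chunk));
                    buf.resize(old + static_cast<std::size_t>(in.gcount()));
                    if (buf.size() < pat.size()) break;

                    // A match starting in the carried tail could not have
                    // completed last round, so nothing is counted twice.
                    for (std::size_t i = 0; i + pat.size() <= buf.size(); ++i)
                        if (std::memcmp(buf.data() + i, pat.data(), pat.size()) == 0)
                            ++count;

                    buf.erase(buf.begin(), buf.end() - keep); // keep the tail
                }
                return count;
            }

        The same carry-over works for the replace itself (advancing past
        each match), as long as the replacement preserves length.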
        > - Would this benefit from multi-threading - read, process,
        >   write?

        Not to any significant degree, unless you're doing a *lot* of
        processing to find the strings you need (like complex regexen or
        such). Very likely you're way I/O-bound here.
        > And finally, could anyone point me to any sample code which
        > already does this sort of thing in the fastest possible way?

        No, but I would strongly advise you to look into memory-mapped
        I/O, if your system supports it. This is not portable in the C++
        sense, and hence OT for this newsgroup, but it is most likely the
        fastest you can get, and -- as a bonus -- you avoid all read()
        and write() calls and need no buffering. Google for the mmap()
        call.

        HTH,
        - J.


        • James Kanze

          #5
          Re: File Processing

          On Oct 1, 2:24 pm, jacek.dzied...@gmail.com wrote:
          > On Sep 30, 8:44 pm, "Jeff" <some...@somewhere.com> wrote:
          > [..]
          > No, but I would strongly advise you to look into memory-mapped
          > I/O, if your system supports it. This is not portable in the
          > C++ sense, and hence OT for this newsgroup, but it is most
          > likely the fastest you can get, and -- as a bonus -- you avoid
          > all read() and write() calls and need no buffering. Google for
          > the mmap() call.

          While it's true that mmap is usually faster than naïve file
          handling, the buffering, reading and writing are still there.
          The only difference is that it's the OS which takes care of them
          (with a bit of help from the hardware), and not you. Typically,
          *IF* you're a real expert, and you're willing to invest a lot of
          time and effort, you can do better for any specific use.
          Typically, not much better, however, and typically, you're not a
          real expert (the real experts are busy implementing the code in
          the OS), and the slight gains you get aren't worth the cost.

          --
          James Kanze (GABI Software)            email: james.kanze@gmail.com
          Conseils en informatique orientée objet/
          Beratung in objektorientierter Datenverarbeitung
          9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


          • AnonMail2005@gmail.com

            #6
            Re: File Processing

            On Sep 30, 2:44 pm, "Jeff" <some...@somewhere.com> wrote:
            > Hello
            >
            > I want to read, process, and rewrite a very large
            > disk-based file (>3 GB) as quickly as possible. The
            > processing effectively involves finding certain strings and
            > replacing them with other strings of equal length, so that
            > the file size is unaltered (the file is uncompressed, btw).
            > I wondered if anyone could advise me of the best way to do
            > this, and also of things to avoid. More specifically, I was
            > wondering:
            >
            > - Is it best to open a single file for read-write access
            >   and overwrite the changed bytes, or would it be better to
            >   create a new file?
            > - Is there any point in buffering bytes in rather than
            >   reading one byte at a time, or does this just defeat the
            >   buffering that's done by the OS anyway?
            > - Would this benefit from multi-threading - read, process,
            >   write?
            >
            > And finally, could anyone point me to any sample code which
            > already does this sort of thing in the fastest possible
            > way?
            >
            > Many Thanks
            > Jeff

            First cut, I would look into unix text processing tools like
            grep and sed. Why reinvent the wheel? Also, these tools are
            available for use in non-unix environments like the PC.
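
            For example (strings hypothetical), a single pass like

                sed -e 's/oldtext/newtext/g' big.dat > big.new

            does an equal-length replace, though sed works line by line,
            so a multi-gigabyte file with very long lines, or none at
            all, can be slow or memory-hungry.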

            HTH


            • Jeff

              #7
              Re: File Processing

              Thanks a million for the very helpful replies.

              I'm still experimenting, but I have already found that I can
              make significant (>10x) improvements in speed by reading the
              file in large buffered chunks rather than byte by byte.

              Jeff

