Decoding strategy

  • marcin.rzeznicki@gmail.com

    Decoding strategy

    Hello everyone
    I have a problem choosing the best decoding strategy for a nasty case.
    I have to deal with very large files which contain text in various
    encodings. They are too long to load into memory in a single run. I
    solved this by implementing memory mapping using P/Invoke, and I load
    the contents of the file in chunks. Since the files' contents are in
    different encodings, what I really do is map a portion of the file into
    memory and then decode that part using System.Text.Encoding. So far, so
    good, but it's not difficult to imagine a serious problem with this
    approach. Since file processing is not, and cannot be, sequential, and
    furthermore memory mapping limits the offsets at which a mapping can
    start, some mappings can "tear" a character apart. How to deal with
    this? I thought of implementing a decoder fallback which would check a
    few bytes behind the current mapping and try to substitute the
    unrecognized chars, but I don't know whether it is feasible. I do not
    know whether the decoder might accidentally mistake a broken char for
    some valid character different from the expected one. I guess it
    depends on the encoding used. What do you think?

  • Kevin Spencer

    #2
    Re: Decoding strategy

    I would use a FileStream instance to read the file. The FileStream class
    supports random access to files, allowing you to jump around in the file.
    You can read as little or as much as you want into memory when you need to.

    --
    HTH,

    Kevin Spencer
    Microsoft MVP
    Chicken Salad Shooter
    Thoughts and Ideas about programming, philosophy, science, arts, life, God, and related subjects.


    A man, a plan, a canal, a palindrome that has.. oh, never mind.


    • marcin.rzeznicki@gmail.com

      #3
      Re: Decoding strategy


      Kevin Spencer wrote:
      > I would use a FileStream instance to read the file. The FileStream class
      > supports random access to files, allowing you to jump around in the file.
      > You can read as little or as much as you want into memory when you need to.

      Hello Kevin,
      Thanks for the reply.
      I didn't test performance with FileStream, but maybe you can confirm:
      does FileStream cache the contents of the file in memory? I think there is
      a slight speedup when using memory mapping in that I do not have to hit
      the disk all the time. In my solution I simply open a mapping over the whole
      file and create views as needed. Anyway, let's say that I did it using
      FileStream. I can read some bytes from it, but I still face the same
      problem: how to interpret the first bytes I have read? Are they the
      beginning of a character, or maybe the end of the "previous" character?

      • Kim Greenlee

        #4
        RE: Decoding strategy

        Hi Marcin,

        I need a little clarification. You have multiple files where each file
        could use a different encoding OR you have multiple files where WITHIN each
        file multiple encodings are used?

        I'm also confused by your reference to a character "tear". And if you could
        explain that reference, I would find it helpful.

        Thanks,

        Kim Greenlee
        --
        digipede - Many legs make light work.
        Grid computing for the real world.
        Empower your business with the help of Digipede Technologies. We are a leading provider of distributed computing solutions on the Microsoft .NET platform.




        • Kevin Spencer

          #5
          Re: Decoding strategy

          No, the FileStream is the .Net equivalent of a FILE pointer (in a sense). It
          is positioned and reads from the file according to your code. You must
          create a buffer for it to read into. That buffer can be used to read
          portions of the file, and used repeatedly. See
          http://msdn2.microsoft.com/en-us/library/ms256203.aspx for more detailed
          information.


          • Peter Duniho

            #6
            Re: Decoding strategy

            <marcin.rzeznicki@gmail.com> wrote in message
            news:1160502641.574989.204890@i42g2000cwa.googlegroups.com...
            > I didn't test performance with FileStream, but maybe you can confirm -
            > Does File Stream caches contents of file in memory?

            FileStream does buffer, which is in a sense a kind of caching. You can
            specify the buffer size when you create the FileStream.

            > I think there is
            > slight speedup when using memory mapping in that I do not have to hit
            > the disk all the time.

            IMHO, the two major benefits to memory mapping are 1) convenience (as long
            as your file access fits within the addressable space available to you), and
            2) minimal and efficient virtual memory usage (the physical memory storage
            of the data can be backed by the file itself, rather than using up swap file
            space).

            Any i/o speed advantage you can get with memory mapping, you can get with
            normal file i/o using appropriate techniques.

            > In my solution I simply open mapping over whole
            > file and create views as needed. Anyway, let's say that I did it using
            > FileStream, I can read some bytes from it, but I still face the same
            > problem - how to interpret first bytes I have read, whether they are
            > beginning of character, or maybe end of "previous" character?

            I'm not entirely sure I understand the question. Even using a memory mapped
            file, if you jump into a random location in the middle, you can't tell
            whether you're at the beginning of a new character or in the middle of one.
            You need some point of reference to tell the difference.

            If the file is entirely made up of contiguous Unicode characters, and thus
            each character always starts on an even offset from the start of the file,
            then that's one easy way to tell when you are at the beginning or middle of
            a character. If that's the case though, then you could easily preserve that
            characteristic even reading the file using FileStream.

            On the other hand, if you are dealing with some other multibyte character
            set, or it's all Unicode but there's other data that can cause the Unicode
            characters to get shifted to odd offsets, then even using memory mapped
            files you need to find a good point of reference before you decide whether
            you're dealing with the start of a Unicode character.

            Basically, I don't see how using the FileStream class versus using memory
            mapping alters the underlying issue of determining what the character
            boundaries are. You can read sections of the file using FileStream, and as
            long as you keep track of what absolute file position those sections come
            from, you can always translate the address of a byte from a partial section
            back to an absolute file position, giving you the exact same position
            information you'd have when using memory mapping.

            It *is* true that when reading the file into buffers by sections using the
            FileStream class, you could wind up with partial data at the beginning or
            end of one of these sections. The question there, though, is not knowing what
            you've got (since, as I point out above, you can determine that just as
            easily whether using FileStream or memory mapping), but rather how to get back
            the other part. To deal with that, you'd need an additional layer of
            processing that can piece together the data that straddle read boundaries.

            I agree that this is an area in which memory mapped files are more
            convenient, but it shouldn't be that hard for you to maintain a small
            "workspace" buffer in which this sort of reconstruction can take place. In
            the simplest case, it need only be a single "char": you pull one
            byte at a time from the buffer read by FileStream and combine them as pairs
            into the "char" buffer (that may or may not be efficient, depending on the
            level at which you're processing the data...if you have to look at each and
            every character anyway, it may not be all that bad).
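
When chunks are decoded in order, a stateful System.Text.Decoder already behaves much like the "workspace" buffer described above: with flush set to false it holds the trailing bytes of an incomplete character until the next call. What follows is only a minimal sketch under that assumption (sequential chunks); the names ChunkedDecoding and DecodeChunks are made up for illustration and do not come from anyone's code in this thread.

```csharp
using System;
using System.Text;

public static class ChunkedDecoding
{
    // Decodes consecutive byte chunks with ONE stateful Decoder.
    // The Decoder remembers the bytes of a character that is cut off at the
    // end of a chunk and completes it when the next chunk arrives.
    public static string DecodeChunks(Encoding encoding, params byte[][] chunks)
    {
        Decoder decoder = encoding.GetDecoder();
        var result = new StringBuilder();

        foreach (byte[] chunk in chunks)
        {
            // flush: false means "more bytes may follow", so a partial
            // character at the end of the chunk is kept, not replaced.
            int charCount = decoder.GetCharCount(chunk, 0, chunk.Length, false);
            char[] chars = new char[charCount];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, false);
            result.Append(chars, 0, written);
        }

        return result.ToString();
    }
}
```

For instance, the UTF-8 bytes of "a€b" (61 E2 82 AC 62) split after the second byte still decode to "a€b". This only repairs a character torn at the end of a chunk; a view that begins in the middle of a character still needs the bytes that precede it.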

            Pete



            • marcin.rzeznicki@gmail.com

              #7
              Re: Decoding strategy


              Kim Greenlee wrote:
              > Hi Marcin,

              Hi Kim,
              Thanks for the reply.

              > I need a little clarification. You have multiple files where each file
              > could use a different encoding OR you have multiple files where WITHIN each
              > file multiple encodings are used?

              The former. In other words, I do not know in advance what the encoding
              of a file is. It can be an encoding whose properties might lead to
              "tearing" :-) (see below)

              > I'm also confused by your reference to a character "tear". And if you could
              > explain that reference, I would find it helpful.

              Yes, sure. Let's say a file is encoded using UTF-8. Then a single character
              can occupy 1, 2, 3 or 4 bytes. Let's say the beginning of some 3-byte
              character lies at offset n-1. Its byte pattern, according to the UTF-8
              specs, is then as follows:
              1110xxxx 10yyyyyy 10zzzzzz
              I read the contents, this way or that, starting at offset n. Due to
              memory mapping constraints I cannot choose that offset freely; offsets
              have to be aligned to some boundaries.
              1110xxxx | 10yyyyyy 10zzzzzz (| indicates start of mapping)
              That's what I call a torn character.
              In this particular case, due to UTF-8 properties, it is easy to fix:
              no sane decoder can assume that 10yyyyyy is the beginning of a UTF-8
              character, so it suffices to read the bytes behind the offset, thus
              providing a fallback to the decoder.
              So, having said that, my questions are: do all encodings
              (multibyte, of course) have the nice property that one can determine
              that a given byte is part of a "torn" character, rather than wrongly
              treating it as the beginning of another character? And what is the best
              way to solve the problem, in your opinion? I think that implementing a
              decoder fallback is quite sane, but I want to know your opinion.
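
The UTF-8 property described above (bytes of the form 10xxxxxx never start a character) can be used to step back from an arbitrary offset to the start of the character containing it. This is a rough sketch, assuming the bytes just before the mapping offset are available in the same array and that the data is well-formed UTF-8; Utf8Boundary and its members are hypothetical names chosen for the example.

```csharp
using System;

public static class Utf8Boundary
{
    // True for bytes of the form 10xxxxxx, which never start a UTF-8 character.
    public static bool IsContinuationByte(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Returns the offset of the first byte of the character that contains
    // buffer[offset]. Steps back at most 3 bytes, because a well-formed
    // UTF-8 character has at most 3 continuation bytes.
    public static int FindCharacterStart(byte[] buffer, int offset)
    {
        int start = offset;
        int steps = 0;

        while (start > 0 && steps < 3 && IsContinuationByte(buffer[start]))
        {
            start--;
            steps++;
        }

        return start;
    }
}
```

With the bytes 61 E2 82 AC 62 ("a€b"), offsets 2 and 3 both resolve to offset 1, where the 3-byte character begins, while offsets 0 and 4 are returned unchanged.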


              • marcin.rzeznicki@gmail.com

                #8
                Re: Decoding strategy

                Peter Duniho wrote:
                >> I didn't test performance with FileStream, but maybe you can confirm -
                >> Does File Stream caches contents of file in memory?
                >
                > FileStream does buffer, which is in a sense a kind of caching. You can
                > specify the buffer size when you create the FileStream.
                >
                >> I think there is
                >> slight speedup when using memory mapping in that I do not have to hit
                >> the disk all the time.
                >
                > IMHO, the two major benefits to memory mapping are 1) convenience (as long
                > as your file access fits within the addressable space available to you), and
                > 2) minimal and efficient virtual memory usage (the physical memory storage
                > of the data can be backed by the file itself, rather than using up swap file
                > space).

                I agree with you. The second point especially is what I struggle to
                achieve. I think there is also another advantage, which lies in
                explicit access to the "memory buffer". Since I get a pointer (it is
                unsafe, I know :-) ) to contiguous memory, I save one copy operation each
                time I need to map a portion of the file into memory. Reason being,
                FileStream, even though using buffering, does not give me access to it.
                So to perform subsequent decoding I have to copy data from the FileStream
                into a byte array and pass it to the decoder, whereas with mapping I pass
                a pointer to the memory view of the file directly to the decoder.

                > Any i/o speed advantage you can get with memory mapping, you can get with
                > normal file i/o using appropriate techniques.

                Not with FileStream, I fear.

                >> In my solution I simply open mapping over whole
                >> file and create views as needed. Anyway, let's say that I did it using
                >> FileStream, I can read some bytes from it, but I still face the same
                >> problem - how to interpret first bytes I have read, whether they are
                >> beginning of character, or maybe end of "previous" character?
                >
                > I'm not entirely sure I understand the question. Even using a memory mapped
                > file, if you jump into a random location in the middle, you can't tell
                > whether you're at the beginning of a new character or in the middle of one.
                > You need some point of reference to tell the difference.

                Obviously true. I build a character index for myself, which tells me
                approximately where to seek a given character. When opening a file I decode
                each block of the file and ask the decoder how many chars are found
                in each block. Then I build a data structure like this:
                (100, 200, ..., 5000), which means: chars 0-99 are in the first block,
                100-199 in the second, and so on. Then, when I have to read a string
                starting at, let's say, the 250th character, a simple index lookup tells me
                that I should start mapping at the 3rd block (chars 200-299). After
                mapping I decode the contents and calculate the needed offset.
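
An index like the one described above could be built and queried roughly as follows. This is a sketch only, assuming the blocks are visited in file order while the index is built and that "chars" means .NET chars (UTF-16 code units); CharacterIndex, Build and FindBlock are illustrative names, not the poster's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharacterIndex
{
    // Builds cumulative character counts: entry i is the number of chars
    // completed in blocks 0..i. A character split across two blocks is
    // counted in the block that holds its last byte.
    public static int[] Build(Encoding encoding, IEnumerable<byte[]> blocks)
    {
        Decoder decoder = encoding.GetDecoder();
        var cumulative = new List<int>();
        char[] scratch = new char[0];
        int total = 0;

        foreach (byte[] block in blocks)
        {
            // GetChars (not only GetCharCount) is called so that the
            // decoder state advances from one block to the next.
            int needed = decoder.GetCharCount(block, 0, block.Length, false);
            if (scratch.Length < needed) scratch = new char[needed];
            total += decoder.GetChars(block, 0, block.Length, scratch, 0, false);
            cumulative.Add(total);
        }

        return cumulative.ToArray();
    }

    // Returns the zero-based number of the first block whose cumulative
    // count exceeds charIndex (cumulative.Length if charIndex is past the end).
    public static int FindBlock(int[] cumulative, int charIndex)
    {
        int lo = 0;
        int hi = cumulative.Length;

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (cumulative[mid] > charIndex) hi = mid;
            else lo = mid + 1;
        }

        return lo;
    }
}
```

With the index (100, 200, 300), FindBlock for character 250 returns 2, which is the third block when counting from one.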
                > If the file is entirely made up of contiguous Unicode characters, and thus
                > each character always starts on an even offset from the start of the file,
                > then that's one easy way to tell when you are at the beginning or middle of
                > a character. If that's the case though, then you could easily preserve that
                > characteristic even reading the file using FileStream.

                Yes, but that's not the case.

                > On the other hand, if you are dealing with some other multibyte character
                > set, or it's all Unicode but there's other data that can cause the Unicode
                > characters to get shifted to odd offsets, then even using memory mapped
                > files you need to find a good point of reference before you decide whether
                > you're dealing with the start of a Unicode character.

                I am using the index which I described above as that "point of reference".

                > Basically, I don't see how using the FileStream class versus using memory
                > mapping alters the underlying issue of determining what the character
                > boundaries are. You can read sections of the file using FileStream, and as
                > long as you keep track of what absolute file position those sections come
                > from, you can always translate the address of a byte from a partial section
                > back to an absolute file position, giving you the exact same position
                > information you'd have when using memory mapping.
                >
                > It *is* true that reading the file into buffers by sections using the
                > FileStream class, you could wind up with partial data at the beginning of
                > end of one of these sections. The question there though is not knowing what
                > you've got (since as I point out above, you can just as easily determine
                > that whether using FileStream or memory mapping), but rather how to get back
                > the other part. To deal with that, you'd need additional layer of
                > processing that can piece together these data that straddle read boundaries.

                Yes, I agree. That's why I asked Kevin whether he sees some magical way
                by which FileStream will get things right. So I do not think that
                using FileStream, or any other i/o strategy for that matter, will help
                me with my problem.

                > I agree that this is an area in which memory mapped files are more
                > convenient, but it shouldn't be that hard for you to maintain a small
                > "workspace" buffer in which this sort of reconstruction can take place. In
                > the simplest case, it need only be a single "char" in which you pull out one
                > byte at a time from the buffer read by FileStream and combine them as pairs
                > into the "char" buffer (that may or may not be efficient, depending on what
                > level at which you're processing the data...if you have to look at each and
                > every character anyway, it may not be all that bad).

                Right, so here you come to the point where my doubts are born :-)
                First of all: what's the best way to create a small buffer? A
                decoder fallback, or maybe some other strategy will do better? Or maybe
                I screwed up everything and there is a better solution.
                And is it always possible to reconstruct a character (keep in mind that
                some encodings might not be as nice as the Unicode encodings)? I do not
                know much about encodings in general, but while pondering this idea
                I decided to check a few encodings and see whether I am right. I came
                across the Shift JIS encoding, which, I fear, can mistake a "torn"
                character for a different one.
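
The worry about Shift JIS can be illustrated at the byte level without invoking any decoder. The sketch below assumes the byte ranges commonly documented for Shift JIS as implemented by Windows code page 932 (lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC, single-byte characters 0x00-0x7F and 0xA1-0xDF); ShiftJisBytes and its members are hypothetical names used only here.

```csharp
using System;

public static class ShiftJisBytes
{
    // Lead bytes of two-byte characters (assumed CP932 ranges).
    public static bool IsLeadByte(byte b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Trail bytes of two-byte characters (assumed CP932 ranges).
    public static bool IsTrailByte(byte b)
    {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
    }

    // True when a valid trail byte would also be accepted as the FIRST byte
    // of a character: as a single-byte character (0x40-0x7E or half-width
    // katakana 0xA1-0xDF) or as the lead byte of another two-byte character.
    public static bool IsAmbiguousAtChunkStart(byte b)
    {
        if (!IsTrailByte(b)) return false;
        bool singleByte = b <= 0x7E || (b >= 0xA1 && b <= 0xDF);
        return singleByte || IsLeadByte(b);
    }
}
```

For example, the two-byte sequence 83 41 encodes the katakana ア, but if a view starts at the second byte, 0x41 on its own is a perfectly valid 'A'. Unlike UTF-8, the byte value alone does not reveal the tear, so some known character boundary before the view is needed.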
                > Pete

                Thanks for the helpful reply.

                • marcin.rzeznicki@gmail.com

                  #9
                  Re: Decoding strategy


                  Kevin Spencer wrote:
                  > No, the FileStream is the .Net equivalent of a FILE pointer (in a sense). It
                  > is positioned and reads from the file according to your code. You must
                  > create a buffer for it to read into. That buffer can be used to read
                  > portions of the file, and used repeatedly. See
                  > http://msdn2.microsoft.com/en-us/library/ms256203.aspx for more detailed
                  > information.

                  Hi Kevin,
                  The link you gave me leads to the XSLT reference section.

                  • marcin.rzeznicki@gmail.com

                    #10
                    Re: Decoding strategy


                    marcin.rzeznicki@gmail.com wrote:
                    > Reason being, FileStream, even
                    > though using buffering, does not give me access to it.

                    Should be '(...) its buffer'

                    • Kevin Spencer

                      #11
                      Re: Decoding strategy

                      Sorry! Wrong browser instance. Here you go:

                      Provides a Stream for a file, supporting both synchronous and asynchronous read and write operations.



                      • Peter Duniho

                        #12
                        Re: Decoding strategy

                        <marcin.rzeznicki@gmail.com> wrote in message
                        news:1160509674.674707.253020@e3g2000cwe.googlegroups.com...
                        > I agree with you. Especially second point is what I struggle to
                        > achieve. I think that there is also other advantage, which lies in
                        > explicit access of "memory buffer". Since I get pointer (it is unsafe I
                        > know :-) ) to contiguous memory I save one copy operation each time I
                        > need to map portion of file into memory.

                        That's true. But since your use of the file is non-trivial, it is likely
                        that the copying of data from one memory location to another will not
                        dominate the performance of your program.

                        In other words, worry about that bridge when you come to it. First step is
                        to get something that works. :)

                        > Reason being, FileStream, even
                        > though using buffering, does not give me access to it.

                        It doesn't give you direct access, you're right. But merely by reading from
                        the file in large chunks at a time, even if it does so in a way opaque to
                        your own code, performance may well be acceptable.

                        Keep in mind that if you are not reading from the file in a purely
                        sequential way, even memory mapping the file may or may not buffer in a way
                        that optimizes your access to the file.
                        [...]
                        >> Any i/o speed advantage you can get with memory mapping, you can get with
                        >> normal file i/o using appropriate techniques.
                        >
                        > Not with FileStream I fear.

                        But your fears might be unfounded. I can't really say for sure one way or
                        the other without having a full-blown implementation in my hands to look at.
                        But getting data from the hard disk is going to be a major bottleneck, as
                        will sifting through it after it's been safely stored in memory. As long as
                        that data has been buffered somewhere, it may not really matter that it gets
                        copied one or two extra times once in memory.
                        [...]
                        >> I'm not entirely sure I understand the question. Even using a memory
                        >> mapped file, if you jump into a random location in the middle, you can't
                        >> tell whether you're at the beginning of a new character or in the middle
                        >> of one. You need some point of reference to tell the difference.
                        >
                        > Obviously true. I build for myself character index, which tells me
                        > approximately where to seek given character.

                        How large are these indexes? You might keep in mind that consuming RAM in
                        the form of an index is likely to interfere with the memory mapped file in
                        at least a couple of ways: one, by fragmenting your virtual memory space
                        (thereby limiting the size of the file you can deal with) and two, by
                        consuming physical RAM to deal with the indexes you may wind up flushing
                        file data out of physical RAM sooner than you'd like.

                        The latter issue is a problem whether you're using memory mapping or not, so
                        I'm not trying to say this is a significant factor in deciding between the
                        two. My main point is that the indexes are one thing that may cause more
                        disk i/o to occur, and thus further reducing the significance of any
                        additional memory-to-memory data copies.
                        [...]
                        > Yes, I agree. That's why I asked Kevin whether he sees some magical way
                        > by which FileStream will get things right. So, I do not think that
                        > using FileStream, or any other i/o strategy for that matter, will help
                        > me in my problem

                        Well, one advantage of using the FileStream class is that since you need to
                        do more explicit handling of the file i/o, it gives you an opportunity to
                        address the issue you're asking about.

                        That said, it seems to me that in terms of the specific question you're
                        asking, memory mapped file i/o is the best solution. It has its
                        limitations, as you've already pointed out, but if you can live with those
                        limitations then it's a good solution.

                        However, that's not how I interpreted the question you asked. My apologies
                        if I misunderstood, but the way I read it is that you've stated the
                        limitations of the memory mapped file i/o and are looking for a means around
                        it. The only way around it is to use more conventional file i/o, in the
                        form of the FileStream class or something similar.
                        [...]
                        > Right, so here you come to the point where my doubts are born :-)
                        > First of all - what's the best way to create small buffer - whether
                        > decoder fallback, or maybe some other strategy will do better.

                        IMHO, the first thing you should do is try just using a FileStream directly.
                        Give it some reasonably large buffer size to use (at least a handful of file
                        blocks, which are usually 4K each), and read data from the file as you need
                        it, even if this means reading just a small number of bytes at a time
                        (between one and four, depending on where your decoder is and what data is
                        being processed).

                        For example, if your decoder would get "cbDecode" bytes from offset
                        "ibDecode" (I have no idea how you do this in your code...maybe if you could
                        post a line or two that demonstrates how you actually access the data, that
                        would be useful), you could do this instead with a FileStream (let's call it
                        "fsDecode"):

                        byte[] rgbDecode = new byte[cbDecode];

                        fsDecode.Seek(ibDecode, SeekOrigin.Begin);
                        fsDecode.Read(rgbDecode, 0, cbDecode);

                        Then you've got your bytes in the byte array ready for processing. There's
                        no tearing issue, and most of the time the read will come from memory,
                        buffered by the FileStream object. The biggest problem here would be the
                        high overhead from calling Seek and Read over and over. But it's a nice
                        simple approach. :)

                        (A side note: you may actually find the BinaryReader class more suitable, as
                        the FileStream.Read method can in theory actually return fewer bytes than
                        you ask for, even if you don't reach the end of the file...I left out the
                        return value checking for simplicity, but you might need to include that if
                        you don't use BinaryReader. BinaryReader.ReadBytes will always return as
                        many bytes as you ask for, unless it reaches the end of the file and can't).
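
To make the side note concrete: a minimal sketch, assuming you stay with FileStream rather than BinaryReader, of a helper that keeps calling Read until the requested number of bytes has arrived or the stream ends. ChunkReader and ReadAt are hypothetical names for illustration, not part of .NET.

```csharp
using System;
using System.IO;

static class ChunkReader
{
    // Reads up to 'count' bytes starting at 'offset'.
    // The result is shorter than 'count' only if the stream ends first.
    public static byte[] ReadAt(Stream stream, long offset, int count)
    {
        byte[] buffer = new byte[count];
        stream.Seek(offset, SeekOrigin.Begin);

        int total = 0;
        while (total < count)
        {
            // Read may legally return fewer bytes than requested.
            int read = stream.Read(buffer, total, count - total);
            if (read == 0)
            {
                break; // end of stream
            }
            total += read;
        }

        if (total < count)
        {
            Array.Resize(ref buffer, total);
        }
        return buffer;
    }
}
```

With the earlier names this would be called as ChunkReader.ReadAt(fsDecode, ibDecode, cbDecode).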

                        Once you've done that, then you've got your worst-case scenario. That's
                        likely to be the poorest-performing way to read the file, and if it turns
                        out to be fast enough, you can just stop right there. :)

                        If you find that's too slow, then you can accomplish pretty much the same
                        performance gain you might get from a memory-mapped file (or possibly even
                        better, depending on what sort of buffering Windows was capable of doing
                        with your memory-mapped file) by reading the file directly in larger chunks.
                        If you do that, then yes...you need to worry about the data you're
                        processing straddling whatever artificial boundary you wind up imposing by
                        adding the extra layer of buffering in your own code. But that is a
                        solvable problem (and in fact will be solved in a very similar way to what
                        the memory-mapped solution has to do behind the scenes for you anyway). If
                        that last sentence causes you some questions, let me know and I can
                        elaborate.
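
One way the straddling problem can be handled when chunks are processed in order: a System.Text.Decoder obtained from Encoding.GetDecoder keeps the trailing bytes of an incomplete character between GetChars calls, so a character split across two chunks is completed by the next call. This is only a sketch under that sequential assumption; StatefulDecodeDemo and DecodeInChunks are hypothetical names, and it does not by itself cover jumping to an arbitrary offset.

```csharp
using System;
using System.Text;

static class StatefulDecodeDemo
{
    // Decodes 'data' in pieces of 'chunkSize' bytes with one stateful decoder.
    public static string DecodeInChunks(Encoding encoding, byte[] data, int chunkSize)
    {
        if (chunkSize <= 0)
        {
            throw new ArgumentOutOfRangeException("chunkSize");
        }

        Decoder decoder = encoding.GetDecoder();
        StringBuilder result = new StringBuilder();
        // Worst-case number of chars that one chunk can produce.
        char[] chars = new char[encoding.GetMaxCharCount(chunkSize)];

        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            bool isLastChunk = offset + count >= data.Length;

            // flush = false keeps a partial trailing character inside the decoder
            // so that the next call can finish it.
            int produced = decoder.GetChars(data, offset, count, chars, 0, isLastChunk);
            result.Append(chars, 0, produced);
        }

        return result.ToString();
    }
}
```

Even with a chunk size of one byte the output matches a single Encoding.GetString call, because the decoder, not the caller, remembers the torn bytes.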
                        Or maybe
                        I screwed up everything and there is better solution.
                        And - is it always possible (keep in mind that some encodings might
                        not be as nice as the Unicode encodings) to reconstruct a character?
                        I don't know. That's a somewhat different question and doesn't have much to
                        do with the file i/o method you use. I don't have a lot of experience with
                        multibyte character encodings, but as far as I recall from my limited use of
                        them, an initial byte always looks different from a subsequent byte within a
                        given character. So you can always work your way backwards to find an
                        initial byte and start decoding from there.
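
UTF-8 is one encoding where this recollection does hold: continuation bytes always have the bit pattern 10xxxxxx, so they can never be confused with an initial byte. A minimal sketch of backing up to the start of a character, assuming the data is valid UTF-8 (Utf8Boundary and FindCharStart are hypothetical names); it should not be taken as proof that every multibyte encoding behaves this way:

```csharp
static class Utf8Boundary
{
    // Continuation bytes in UTF-8 look like 10xxxxxx.
    private static bool IsContinuation(byte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Returns the index of the first byte of the character that contains
    // data[position]. A UTF-8 character has at most three continuation
    // bytes, so the scan never needs to go back more than three steps.
    public static int FindCharStart(byte[] data, int position)
    {
        int index = position;
        int steps = 0;
        while (index > 0 && steps < 3 && IsContinuation(data[index]))
        {
            index--;
            steps++;
        }
        return index;
    }
}
```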
                        I do not
                        know much about encodings in general, but while pondering on this idea
                        I decided to check a few encodings and see whether I am right. I came
                        across Shift JIS encoding, which, I fear, can mistake "torn" character
                        for a different one.
                        I hope that's not the case, but if it is you have that issue whether you use
                        memory-mapped file i/o or not. Or alternatively, if you think that
                        memory-mapped file i/o solves that issue, maybe if you explain why it is you
                        think that, it would help us understand your question better. :)

                        Pete



                        • marcin.rzeznicki@gmail.com

                          #13
                          Re: Decoding strategy


                          Peter Duniho wrote:
                          <marcin.rzeznicki@gmail.com> wrote in message
                          news:1160509674.674707.253020@e3g2000cwe.googlegroups.com...
                          I agree with you. Especially second point is what I struggle to
                          achieve. I think that there is also other advantage, which lies in
                          explicit access of "memory buffer". Since I get pointer (it is unsafe I
                          know :-) ) to contiguous memory I save one copy operation each time I
                          need to map portion of file into memory.
                          >
                          That's true. But since your use of the file is non-trivial, it is likely
                          that the copying of data from one memory location to another will not
                          dominate the performance of your program.
                          >
                          In other words, worry about that bridge when you come to it. First step is
                          to get something that works. :)
                          Well, it works, I mean - memory mapping solution has been implemented
                          and works perfectly, at least with one-byte encodings :-)
                          >
                          Reason being, FileStream, even
                          though using buffering, does not give me access to it.
                          >
                          It doesn't give you direct access, you're right. But merely by reading from
                          the file in large chunks at a time, even if it does so in a way opaque to
                          your own code, performance may well be acceptable.
                          That's my mistake, I assumed that memory mapping would be the best way
                          to do this and I implemented it right away. Fortunately implementation
                          didn't take much time, 'cause I'd done that stuff before, though not in
                          managed code. I didn't investigate managed alternatives or measure
                          their performance. I also haven't abstracted i/o out very well, so it
                          may be awkward to replace memory mapping with FileStream and measure
                          how it performs. I think I suffer from some kind of "premature
                          optimization" syndrome :-)
                          >
                          Keep in mind that if you are not reading from the file in a purely
                          sequential way, even memory mapping the file may or may not buffer in a way
                          that optimizes your access to the file.
                          >
                          Well, you can pass hints about your usage of file to memory mapping
                          function, so I think that OS caches it appropriately.
                          [...]
                          Any i/o speed advantage you can get with memory mapping, you can get with
                          normal file i/o using appropriate techniques.
                          Not with FileStream I fear.
                          >
                          But your fears might be unfounded. I can't really say for sure one way or
                          the other without having a full-blown implementation in my hands to look at.
                          But getting data from the hard disk is going to be a major bottleneck, as
                          will sifting through it after it's been safely stored in memory. As long as
                          that data has been buffered somewhere, it may not really matter that it gets
                          copied one or two extra times once in memory.
                          >
                          Perfect solution would be to solve potential decoding problems with
                          memory mapping. Hope it is possible.

                          Obviously true. I build for myself character index, which tells me
                          approximately where to seek given character.
                          >
                          How large are these indexes? You might keep in mind that consuming RAM in
                          the form of an index is likely to interfere with the memory mapped file in
                          at least a couple of ways: one, by fragmenting your virtual memory space
                          (thereby limiting the size of the file you can deal with) and two, by
                          consuming physical RAM to deal with the indexes you may wind up flushing
                          file data out of physical RAM sooner than you'd like.
                          >
                          Index is an integer array with one entry per allocation block. Size of
                          block depends on machine, on my machine it is 64k. So, assuming average
                          file length is about 500 MB then index contains more or less eight
                          thousand entries, each entry being integer, gives us 32k memory
                          occupied by index. So I do not think it can noticeably degrade
                          performance.
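
The estimate can be checked with a few lines. A minimal sketch, assuming one 4-byte integer per 64K allocation block and a 500 MB file as described above (IndexSize and EntryCount are hypothetical names):

```csharp
static class IndexSize
{
    // Number of index entries: one per allocation block, rounded up.
    public static long EntryCount(long fileLength, long blockSize)
    {
        return (fileLength + blockSize - 1) / blockSize;
    }

    // Memory used by the index when every entry is a 4-byte integer.
    public static long IndexBytes(long fileLength, long blockSize)
    {
        return EntryCount(fileLength, blockSize) * sizeof(int);
    }
}
```

IndexSize.EntryCount(500L * 1024 * 1024, 64 * 1024) gives 8000 entries, and IndexSize.IndexBytes for the same arguments gives 32000 bytes.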
                          [...]
                          Yes, I agree. That's why I asked Kevin whether he sees some magical way
                          by which FileStream will get things right. So, I do not think that
                          using FileStream, or any other i/o strategy for that matter, will help
                          me with my problem.
                          >
                          Well, one advantage of using the FileStream class is that since you need to
                          do more explicit handling of the file i/o, it gives you an opportunity to
                          address the issue you're asking about.
                          >
                          That said, it seems to me that in terms of the specific question you're
                          asking, memory mapped file i/o is the best solution. It has its
                          limitations, as you've already pointed out, but if you can live with those
                          limitations then it's a good solution.
                          >
                          Yes, the only advantage of FileStream I see so far is that it enables
                          file access at any offset, so tearing problem can be prevented. Tearing
                          problem is born because you have to map file at offsets aligned to
                          allocation block boundary. But that would not be really much if I knew
                          that I could solve decoding problems reliably.
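
The alignment rule itself is simple arithmetic. A minimal sketch, assuming the byte offset of a character boundary is already known (for example from the index) and that the allocation granularity is 64K as on my machine: the view is mapped at the aligned offset at or below the boundary, and decoding starts a few bytes into the view. ViewAlignment and its methods are hypothetical names, and the sketch says nothing about how the boundary is found.

```csharp
static class ViewAlignment
{
    // Largest multiple of 'granularity' that is <= 'offset'.
    // This is the offset at which the view would be mapped.
    public static long AlignDown(long offset, long granularity)
    {
        return offset - (offset % granularity);
    }

    // Number of bytes to skip inside the view to reach 'offset'.
    public static int SkipWithinView(long offset, long granularity)
    {
        return (int)(offset % granularity);
    }
}
```

For a boundary at byte 200000 and a granularity of 65536, the view would start at 196608 and decoding would begin 3392 bytes into it.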
                          However, that's not how I interpreted the question you asked. My apologies
                          if I misunderstood, but the way I read it is that you've stated the
                          limitations of the memory mapped file i/o and are looking for a means around
                          it. The only way around it is to use more conventional file i/o, in the
                          form of the FileStream class or something similar.
                          >
                          That's what I wrote, except for the part "looking for a means around".
                          Well, depends on what you mean by this, but I'd rather not abandon
                          memory mapping. So I am not looking for "means around memory mapping"
                          but: living within memory mapping walls, how can I solve the "tearing"
                          problem? As I pointed out, one way is to implement your own decoder
                          fallback, but then another question arises, which is: is it really
                          reliable? If it is proven that there is no good solution, then I will
                          drop memory mapping.

                          [cut]
                          >
                          For example, if your decoder would get "cbDecode" bytes from offset
                          "ibDecode" (I have no idea how you do this in your code...maybe if you could
                          post line or two that demonstrates how you actually access the data, that
                          would be useful), you could do this instead with a FileStream (let's call it
                          "fsDecode") :
                          >
                          byte[] rgbDecode = new byte[cbDecode];
                          >
                          fsDecode.Seek(ibDecode, SeekOrigin.Begin);
                          fsDecode.Read(rgbDecode, 0, cbDecode);
                          >
                          Then you've got your bytes in the byte array ready for processing. There's
                          no tearing issue, and most of the time the read will come from memory,
                          buffered by the FileStream object. The biggest problem here would be the
                          high overhead from calling Seek and Read over and over. But it's a nice
                          simple approach. :)
                          >
                          Certainly it is :-)
                          That's how I wanted to implement fallback buffer. Each time I detect
                          "torn" char I reposition file pointer, probe bytes backward till I find
                          valid char and provide replacement.
                          (A side note: you may actually find the BinaryReader class more suitable, as
                          the FileStream.Read method can in theory actually return fewer bytes than
                          you ask for, even if you don't reach the end of the file...I left out the
                          return value checking for simplicity, but you might need to include that if
                          you don't use BinaryReader. BinaryReader.ReadBytes will always return as
                          many bytes as you ask for, unless it reaches the end of the file and can't).
                          >
                          Once you've done that, then you've got your worst-case scenario. That's
                          likely to be the poorest-performing way to read the file, and if it turns
                          out to be fast enough, you can just stop right there. :)
                          >
                          Yeah, nice solution. Even though performance hit may be noticeable, if
                          I restrict these operations to fallback times only and extend my index
                          structure to cache "torn" characters I should not need to execute that
                          code very often. Seems good to me. Yet ... :-( Can I be sure whether
                          decoder cannot mistake characters?
                          If you find that's too slow, then you can accomplish pretty much the same
                          performance gain you might get from a memory-mapped file (or possibly even
                          better, depending on what sort of buffering Windows was capable of doing
                          with your memory-mapped file) by reading the file directly in larger chunks.
                          If you do that, then yes...you need to worry about the data you're
                          processing straddling whatever artificial boundary you wind up imposing by
                          adding the extra layer of buffering in your own code. But that is a
                          solvable problem (and in fact will be solved in a very similar way to what
                          the memory-mapped solution has to do behind the scenes for you anyway). If
                          that last sentence causes you some questions, let me know and I can
                          elaborate.
                          Well, yes, please. If you are able to show me how to solve that, then I
                          can mix memory mapping with direct file access at fallback times and be
                          perfectly happy.
                          >
                          Or maybe
                          I screwed up everything and there is better solution.
                          And - is it always possible (keep in mind that some encodings might
                          not be as nice as the Unicode encodings) to reconstruct a character?
                          >
                          I don't know. That's a somewhat different question and doesn't have much to
                          do with the file i/o method you use. I don't have a lot of experience with
                          multibyte character encodings, but as far as I recall from my limited use of
                          them, an initial byte always looks different from a subsequent byte within a
                          given character. So you can always work your way backwards to find an
                          initial byte and start decoding from there.
                          >
                          After a little more thought I've found that this is the most significant
                          question. But let me rephrase what you wrote: it is no problem to find
                          characters when reading byte sequence forward and every sane encoding
                          must adhere to this in order to be usable. But is it the same case when
                          looking backward?
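
For Shift JIS, looking backward at a single byte is not enough: the second byte of a two-byte character may fall in the ASCII range or in the lead-byte range. For example, the katakana letter A is the byte pair 0x83 0x41, and a lone 0x41 is the Latin letter "A". A commonly used workaround for double-byte encodings is to count how many consecutive lead-range bytes precede the position. Below is a minimal sketch of that idea, assuming valid data that starts on a character boundary and the lead-byte ranges 0x81-0x9F and 0xE0-0xFC of Windows code page 932; ShiftJisBoundary and its methods are hypothetical names.

```csharp
static class ShiftJisBoundary
{
    // Bytes that can start a two-byte character (code page 932 ranges).
    public static bool IsLeadRange(byte b)
    {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Returns true when data[position] is the first byte of a character.
    public static bool IsCharStart(byte[] data, int position)
    {
        // Count the unbroken run of lead-range bytes just before 'position'.
        // The byte before that run cannot be a lead byte, so the run itself
        // starts on a character boundary and its bytes pair up from there.
        int run = 0;
        for (int i = position - 1; i >= 0 && IsLeadRange(data[i]); i--)
        {
            run++;
        }

        // Even run: every byte in it is paired, so 'position' starts a character.
        // Odd run: the last byte of the run is a lead byte and 'position' is its trail byte.
        return run % 2 == 0;
    }
}
```

The cost is that the backward scan has no fixed upper bound, unlike the three-byte limit in UTF-8.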
                          I do not
                          know much about encodings in general, but while pondering on this idea
                          I decided to check a few encodings and see whether I am right. I came
                          across Shift JIS encoding, which, I fear, can mistake "torn" character
                          for a different one.
                          >
                          I hope that's not the case, but if it is you have that issue whether you use
                          memory-mapped file i/o or not. Or alternatively, if you think that
                          memory-mapped file i/o solves that issue, maybe if you explain why it is you
                          think that, it would help us understand your question better. :)
                          >
                          It is exactly opposite :-) I believe that memory mapping causes this
                          problem because of mapping offsets limitations. If you used FileStream
                          then, after some initial playing with the decoder's GetCharCount, you could
                          find exact character boundaries. That's its big advantage over memory
                          mapping, because memory mapping imposes restrictions on mapping
                          offsets.
                          So, summing up: I think the question reduces to one about encoding
                          characteristics. You showed us a very good solution using FileStream. It
                          can be extended to mix the two approaches, which may be faster, but I
                          still do not know whether it is reliable.
                          Pete


                          • marcin.rzeznicki@gmail.com

                            #14
                            Re: Decoding strategy

                            Just one more thing :-)
                            For example, if your decoder would get "cbDecode" bytes from offset
                            "ibDecode" (I have no idea how you do this in your code...maybe if you could
                            post line or two that demonstrates how you actually access the data, that
                            would be useful)
                            Yes, of course, I can post it. Unfortunately, I do not have access to
                            the code in question right now; I will post it in a few hours.


                            • Peter Duniho

                              #15
                              Re: Decoding strategy

                              <marcin.rzeznicki@gmail.com> wrote in message
                              news:1160566847.098932.180410@e3g2000cwe.googlegroups.com...
                              Just one more thing :-)
                              >
                              >For example, if your decoder would get "cbDecode" bytes from offset
                              >"ibDecode" (I have no idea how you do this in your code...maybe if you
                              >could
                              >post line or two that demonstrates how you actually access the data, that
                              >would be useful)
                              >
                              Yes, of course, I can post. Unfortunately, I do not have access to
                              code in question right now, I will post it in few hours.
                              It seems to me that if you need access to the code in order to post the
                              general idea I'm asking about, you may be answering the question in far more
                              detail than I was looking for. :)

                              If .NET supported memory-mapped file i/o, I probably wouldn't even ask the
                              question. But since it doesn't, and since you're obviously using some kind
                              of workaround to incorporate memory-mapped file i/o into your program, some
                              of the specifics are unknowable to us unless you post them. They may not
                              even be relevant, but it wouldn't hurt to try to clarify that.

                              Still, I'm not asking for the whole decoder here. Just some general idea of
                              how you've merged the non-.NET concept of memory-mapped file i/o into a .NET
                              context.

                              Pete

