Decoding strategy

  • Peter Duniho

    #16
    Re: Decoding strategy

    <marcin.rzeznicki@gmail.com> wrote in message
    news:1160566533.533123.293730@c28g2000cwb.googlegroups.com...
    [...]
    I didn't investigate managed alternatives, nor did I measure
    their performance. I also haven't abstracted i/o out very well, so it
    may be awkward to replace memory mapping with FileStream and measure
    how it performs. I think I suffer from some kind of "premature
    optimization" syndrome :-)
    That could be. It's a common enough problem. I find myself occasionally
    *paralyzed* by it, when I get stuck trying to decide the most optimal
    solution and fail to get any progress toward ANY solution.
    >Keep in mind that if you are not reading from the file in a purely
    >sequential way, even memory mapping the file may or may not buffer in a
    >way
    >that optimizes your access to the file.
    >
    Well, you can pass hints about your usage of file to memory mapping
    function, so I think that OS caches it appropriately.
    It will cache as best as it can. But if you are jumping around the file,
    the OS simply cannot correctly predict what to buffer for you. This is
    especially bad when going backwards in the file.

    Note that with respect to the hints you can provide, the docs say that best
    performance is when you are accessing the file sequentially, and sparsely,
    and provide the sequential access hint. That doesn't mean that if you
    provide some other hint and are accessing the file differently, you will get
    similar performance. :)
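For the plain-file case, a comparable access-pattern hint can be passed when opening a FileStream. This is only a minimal sketch: the helper name OpenWithHint and the 4096-byte buffer size are illustrative assumptions, not something from the thread. FileOptions.SequentialScan and FileOptions.RandomAccess correspond to the Win32 FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS flags, and they only advise the cache; they do not guarantee any particular read-ahead behaviour.

```csharp
using System.IO;

public static class HintedFile
{
    // Opens a file read-only and passes an access-pattern hint to the OS.
    // The helper name and the buffer size are illustrative choices.
    public static FileStream OpenWithHint(string path, bool sequential)
    {
        FileOptions hint = sequential
            ? FileOptions.SequentialScan
            : FileOptions.RandomAccess;
        return new FileStream(path, FileMode.Open, FileAccess.Read,
                              FileShare.Read, 4096, hint);
    }
}
```

Whether either hint helps for a "random but local" access pattern is something only measurement can show.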

    The OS doesn't know what you're doing with the file, and it has no way to
    predict when you might go backwards in the file. If you are accessing the
    file in a sparse manner (as it appears you may be), then you may find that
    often when you go backwards, that data hasn't been read yet. Going back
    just one byte might incur another disk read.

    It's hard to say for sure without all the details...I'm just pointing out
    that these caching issues exist whether you're using a memory-mapped file or
    just reading normally.
    [...]
    Yes, the only advantage of FileStream I see so far is that it enables
    file access at any offset, so the tearing problem can be prevented. The
    tearing problem arises because you have to map the file at offsets aligned
    to an allocation block boundary. But that would not matter much if I knew
    that I could solve the decoding problems reliably.
    From this, I think that I may still not fully understand the question.

    It is true that a file must be mapped to an aligned memory address. But
    this should only affect the virtual address used to locate the file in the
    virtual address space. That is, the first byte of the file will be on an
    aligned address, but the rest of the file is contiguous from there.

    Likewise, even if you are mapping sections of the file into different
    virtual address locations (why? is this to allow more of the file to be
    mapped in spite of virtual address space fragmentation?), resulting in those
    sections of the file having to each be aligned, you can still access the
    virtual address for the data in a byte-wise fashion.

    All that the alignment requirement affects is where the data winds up in
    virtual memory. I don't see how it affects your access of the data.

    Now, that said, there do seem to be one or two different issues related to
    this. I say "one or two" because they are either the exact same problem or
    not, depending on how you look at it. :) That is, the inability to map the
    entire file to a single contiguous section of your virtual address space at
    once. This causes the secondary problem that you may have to jump from one
    spot in virtual memory to another as you traverse (forward or backward) the
    data in the file. It also may limit how much data you can have mapped at
    once.

    Judging from this:
    That's what I wrote, except for the part "looking for a means around".
    Well, depends on what you mean by this, but I'd rather not abandon
    memory mapping. So I am not looking for "means around memory mapping"
    but: living within memory mapping walls, how can I solve the "tearing"
    problem?
    I'm guessing that both of those issues are really just the same problem for
    you. That is, that you have to address the file using non-contiguous
    pointers.
    Certainly it is :-)
    That's how I wanted to implement the fallback buffer. Each time I detect a
    "torn" char, I reposition the file pointer, probe bytes backward until I
    find a valid char, and provide a replacement.
    Perhaps you could clarify under what situation you "detect a 'torn' char".
    That is, it's unclear to me whether you are referring to simply jumping into
    an offset that's not the start of a character, or if this somehow
    specifically relates to the sectioning of the file caused by your memory
    mapped i/o.

    The former would be an issue even if you could map the entire file to a
    single contiguous virtual address range. The latter is obviously only an
    issue because of the sectioning of the file. I'm confused as to which it
    is.
    [...]
    Yeah, nice solution. Even though the performance hit may be noticeable, if
    I restrict these operations to fallback times only and extend my index
    structure to cache "torn" characters, I should not need to execute that
    code very often. Seems good to me. Yet ... :-( Can I be sure that the
    decoder cannot mistake characters?
    Well, as I mentioned...I can't help you with that question. :) That
    depends on the nature of the data you're decoding, and I don't know enough
    to be able to answer that.
    >If you find that's too slow, then you can accomplish pretty much the same
    >performance gain you might get from a memory-mapped file (or possibly
    >even
    >better, depending on what sort of buffering Windows was capable of doing
    >with your memory-mapped file) by reading the file directly in larger
    >chunks.
    >If you do that, then yes...you need to worry about the data you're
    >processing straddling whatever artificial boundary you wind up imposing
    >by
    >adding the extra layer of buffering in your own code. But that is a
    >solvable problem (and in fact will be solved in a very similar way to
    >what
    >the memory-mapped solution has to do behind the scenes for you anyway).
    >If
    >that last sentence causes you some questions, let me know and I can
    >elaborate.
    >
    Well, yes, please. If you are able to show me how to solve that, then I
    can mix memory mapping with direct file access at fallback times and be
    perfectly happy.
    Okay, let's see if this makes sense. First, keep in mind that my comment
    was assuming a general solution to the file i/o problem. I think you should
    be able to apply it as a "fallback" solution, but it may or may not be
    better than just falling back to reading a few bytes at a time if that's
    your approach.

    Also, keep in mind that this is just a simple example of what I mean. I
    don't mean to imply that this would be the best implementation...just that
    it's a sample of the general idea.

    Finally, keep in mind that this doesn't remove the issue of sectioning the
    file. It just abstracts it out a bit. Since I didn't realize before that
    you may be trying to get rid of the whole issue of having to jump your data
    reads from one block of memory to another, I proposed the idea not realizing
    it may be exactly the opposite of what you're looking for. :(

    That said:

    What I meant was that you can read from the file a few blocks at a time,
    keeping that buffer centered on where you are currently accessing. You'll
    need to keep track of:

    -- current file offset
    -- an array of blocks read from the file
    -- the file offsets those blocks came from
    -- the current block

    The general idea is to maintain the array of blocks such that there is an
    odd number of blocks, at least three, and they are centered on the current
    offset within the file you're reading. Normally, you'll be reading from the
    middle block. If you skip over to another block, you drop one block from
    the far end of the array, and read another adding it to the near end of the
    array.

    Basically, you're windowing the file in a fixed set of buffers. If you read
    new data asynchronously to your use of the data in the buffers you currently
    have, then when you drop a block at one end and fill it for use at the other
    end, the file i/o can happen while you're still processing the data that you
    do have.

    Obviously if you jump to a completely different point in the file, you'll
    have to wait for the surrounding data to be read, but that's an issue even
    with memory-mapped files or when just reading directly with a FileStream.
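The windowing scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: BlockWindow and its members are hypothetical names, reads are synchronous for brevity (the asynchronous refill mentioned above is left out), and any seekable Stream stands in for the file.

```csharp
using System;
using System.IO;

// Hypothetical helper: keeps an odd number of fixed-size blocks centered
// on the block that was accessed most recently.
public sealed class BlockWindow
{
    private readonly Stream stream;
    private readonly int blockSize;
    private readonly byte[][] blocks;  // blocks[i] holds file block firstBlock + i
    private long firstBlock = -1;      // -1 means nothing has been loaded yet

    public int BlocksRead { get; private set; }  // number of block reads so far

    public BlockWindow(Stream stream, int blockSize, int blockCount)
    {
        if (blockCount < 3 || blockCount % 2 == 0)
            throw new ArgumentException("Use an odd number of blocks, at least three.");
        this.stream = stream;
        this.blockSize = blockSize;
        blocks = new byte[blockCount][];
    }

    // Returns the byte at an absolute file offset, shifting the window so
    // that the accessed block becomes the middle one where possible.
    public byte ReadByte(long offset)
    {
        if (offset < 0 || offset >= stream.Length)
            throw new ArgumentOutOfRangeException("offset");
        long block = offset / blockSize;
        long wantedFirst = Math.Max(0, block - blocks.Length / 2);
        if (wantedFirst != firstBlock)
            Recenter(wantedFirst);
        return blocks[block - firstBlock][offset % blockSize];
    }

    private void Recenter(long newFirst)
    {
        var shifted = new byte[blocks.Length][];
        for (int i = 0; i < shifted.Length; i++)
        {
            long fileBlock = newFirst + i;
            long oldSlot = fileBlock - firstBlock;
            bool stillHeld = firstBlock >= 0 && oldSlot >= 0 && oldSlot < blocks.Length;
            // Reuse blocks that remain inside the window; read only new ones.
            shifted[i] = stillHeld ? blocks[oldSlot] : LoadBlock(fileBlock);
        }
        Array.Copy(shifted, blocks, blocks.Length);
        firstBlock = newFirst;
    }

    private byte[] LoadBlock(long fileBlock)
    {
        long start = fileBlock * blockSize;
        if (start >= stream.Length)
            return new byte[0];  // slot lies past the end of the file
        var buffer = new byte[(int)Math.Min(blockSize, stream.Length - start)];
        stream.Seek(start, SeekOrigin.Begin);
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        BlocksRead++;
        return buffer;
    }
}
```

With three blocks, moving from one block to its neighbour costs a single block read, because the two blocks that stay inside the window are reused.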
    [...]
    After a little more thought, I've found that this is the most significant
    question. But let me rephrase what you wrote: it is no problem to find
    characters when reading a byte sequence forward, and every sane encoding
    must adhere to this in order to be usable. But is it the same case when
    looking backward?
    I still don't know. :) I suspect that it is, because it was true with the
    basic MBCS I've seen. But I also realize that there are a LOT of different
    ways to encode text, and some may be context-sensitive.

    Some of this really depends on what you mean by "encoding" and "decoding".
    The word "encoding" is applied in a variety of ways. Two that could apply
    here are the basic idea of text encoding, which mostly just has to do with
    the character set, or some actual conversion of data, which has to do with
    compressing the data, or translating it into a more portable format (MIME,
    for example). I don't even know which of these meanings you're addressing,
    making it even harder for me to know the answer. :)
    [...]
    So, summing up: I think the question reduces to one about encoding
    characteristics. You showed us a very good solution using FileStream. It
    can be extended to mix these two approaches, which may be faster, but I
    still do not know whether it is reliable.
    Indeed, that is a question you should probably figure out. Earlier rather
    than later. :) Sorry I can't be of more help on that front.

    Pete



    • marcin.rzeznicki@gmail.com

      #17
      Re: Decoding strategy


      Peter Duniho wrote:
      <marcin.rzeznicki@gmail.com> wrote in message
      news:1160566847.098932.180410@e3g2000cwe.googlegroups.com...
      Just one more thing :-)
      For example, if your decoder would get "cbDecode" bytes from offset
      "ibDecode" (I have no idea how you do this in your code...maybe if you
      could
      post line or two that demonstrates how you actually access the data, that
      would be useful)
      Yes, of course, I can post it. Unfortunately, I do not have access to the
      code in question right now; I will post it in a few hours.
      >
      It seems to me that if you need access to the code in order to post the
      general idea I'm asking about, you may be answering the question in far more
      detail than I was looking for. :)
      >
      If .NET supported memory-mapped file i/o, I probably wouldn't even ask the
      question. But since it doesn't, and since you're obviously using some kind
      of workaround to incorporate memory-mapped file i/o into your program, some
      of the specifics are unknowable to us unless you post them. They may not
      even be relevant, but it wouldn't hurt to try to clarify that.
      >
      Still, I'm not asking for the whole decoder here. Just some general idea of
      how you've merged the non-.NET concept of memory-mapped file i/o into a .NET
      context.
      Hi,
      Here it goes. It is no rocket science; I just used P/Invoke to access
      WinAPI functions and that's all. I'm pasting just the relevant
      methods; if you feel you need something more, let me know:

      //this method builds index of characters
      private void BuildCharIndex()
      {
          if ( encoding.IsSingleByte ) //if encoding is single byte I assume one byte - one char correspondence
          {
              //nothing interesting here
          }
          else
          {
              //....
              while ( fileOffset < fileLength )
              {
                  ReadPage( blockIndex );
                  unsafe
                  {
                      for ( int i = 0; i < mappedBlocksCount; ++i )
                      {
                          charCount += decoder.GetCharCount( (byte*) page.ToPointer() + i * blockSize, blockSize, false );
                          index.Add( charCount );
                      }
                  }
                  //...
              }
          }
      }

      The page variable references my own SafeHandle descendant for managing
      mapping handles. I wrote a simple ToPointer method for convenience.
      ReadPage is responsible for establishing the file mapping. A page
      represents a range of blocks; each block is 64k.

      private void ReadPage(int startBlock)
      {
          //...
          int fileOffset = startBlock * blockSize;
          pageLength = Math.Min( fileLength - fileOffset, PAGE_SIZE * blockSize );
          //I wrote PInvoke signature for this and additional enums for convenience
          page = NativeMethods.MapViewOfFile( fileMapping, FileViewAccess.Read, 0, fileOffset, pageLength );
          if ( page.IsInvalid )
              //GetHRForLastWin32Error converts the Win32 error code to an HRESULT;
              //a raw Win32 code is positive, so ThrowExceptionForHR would not throw for it
              Marshal.ThrowExceptionForHR( Marshal.GetHRForLastWin32Error() );
          //...
      }

      If you care for signature of MapViewOfFile:

      [DllImport( "kernel32.dll", SetLastError = true )]
      internal static extern SafeFileMapViewHandle MapViewOfFile(
          SafeGenericHandle hFileMappingObject, FileViewAccess dwDesiredAccess,
          int dwFileOffsetHigh, int dwFileOffsetLow, int dwNumberOfBytesToMap );

      So, that's how I read the contents. The code is much simplified compared to
      the original, but I hope it conveys the idea:

      if ( !IsBlockInMemory( firstBlockIndex ) )
      {
          ReadPage( firstBlockIndex );
          CopyCurrentPageToBuffer();
      }
      //...
      int bufferOffset = //here I calculate needed offset using index and additional calculations
      Marshal.PtrToStringUni( new IntPtr( memoryBuffer.ToInt64() + bufferOffset * CHAR_SIZE ), length );

      And the method which causes my problems is CopyCurrentPageToBuffer. It
      reads the mapped portion and decodes it. The relevant line is:

      unsafe
      {
          encoding.GetDecoder().GetChars( (byte*) page.ToPointer(), pageLength
          //... );
      }
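One way to keep a character that straddles two pages from being torn, at least while pages are decoded in order, is to reuse a single Decoder rather than calling encoding.GetDecoder() for every page: a Decoder keeps the trailing bytes of an incomplete character between GetChars calls when flush is false. The sketch below is an illustration under assumptions, not the posted code: DecodeChunks is a hypothetical helper, and plain byte arrays stand in for the mapped pages.

```csharp
using System;
using System.Text;

public static class ChunkedDecoding
{
    // Decodes the chunks in order with ONE decoder, so the bytes of a
    // character split across two chunks are carried over to the next call.
    public static string DecodeChunks(Encoding encoding, params byte[][] chunks)
    {
        Decoder decoder = encoding.GetDecoder();
        var result = new StringBuilder();
        for (int i = 0; i < chunks.Length; i++)
        {
            byte[] chunk = chunks[i];
            bool flush = i == chunks.Length - 1;  // flush only after the last chunk
            int needed = decoder.GetCharCount(chunk, 0, chunk.Length, flush);
            char[] chars = new char[needed];
            int written = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, flush);
            result.Append(chars, 0, written);
        }
        return result.ToString();
    }
}
```

This does not help after a jump to an arbitrary offset, where the decoder has no earlier bytes to carry over; that case still needs a known character boundary.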
      >
      Pete


      • marcin.rzeznicki@gmail.com

        #18
        Re: Decoding strategy

        Hi
        [...]
        It's hard to say for sure without all the details...I'm just pointing out
        that these caching issues exist whether you're using a memory-mapped file or
        just reading normally.
        In general, from what I have observed, the most frequent access is random
        yet with a locality pattern. So I start at a random part of the file, mess
        around not too far from the beginning of the mapping, and then jump
        somewhere else. So a memory-mapped view is sure to be usable for a while,
        and I think it pays off to keep it in memory. That characteristic also
        assures me that the OS cache can be helpful and that performance will not
        suffer from misses/disk reads very often.
        >
        [...]
        Yes, the only advantage of FileStream I see so far is that it enables
        file access at any offset, so the tearing problem can be prevented. The
        tearing problem arises because you have to map the file at offsets aligned
        to an allocation block boundary. But that would not matter much if I knew
        that I could solve the decoding problems reliably.
        >
        From this, I think that I may still not fully understand the question.
        >
        It is true that a file must be mapped to an aligned memory address. But
        this should only affect the virtual address used to locate the file in the
        virtual address space. That is, the first byte of the file will be on an
        aligned address, but the rest of the file is contiguous from there.
        >
        Yes, I know. I was referring to something else. Sorry for being
        unclear. Docs say:

        "(..)must specify an offset within the file that matches the memory
        allocation granularity of the system, or the function fails. That is,
        the offset must be a multiple of the allocation granularity".

        I know that this is going to be aligned to something in VM, but I do not
        care; this is transparent unless you write kernel-mode stuff or,
        generally, very low-level stuff. What I do care about is that I cannot
        choose the FILE offset at which the mapping starts. And that leads to
        "tearing".
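The arithmetic behind that restriction can be sketched as follows. This is a minimal illustration with hypothetical helper names (ViewAlignment, Align, Split); the granularity value would come from the system (the thread uses 64k blocks). A view has to start at a multiple of the granularity, but any byte inside the view can still be addressed by adding the remainder to the view's base pointer.

```csharp
using System;

public static class ViewAlignment
{
    // Rounds a requested file offset down to a multiple of the allocation
    // granularity and reports how far into the view the requested byte lies.
    public static void Align(long requestedOffset, int granularity,
                             out long viewStart, out int delta)
    {
        viewStart = requestedOffset - (requestedOffset % granularity);
        delta = (int)(requestedOffset - viewStart);
    }

    // Splits a 64-bit offset into the high and low 32-bit halves that
    // MapViewOfFile takes as dwFileOffsetHigh and dwFileOffsetLow.
    public static void Split(long offset, out uint high, out uint low)
    {
        high = (uint)(offset >> 32);
        low = (uint)(offset & 0xFFFFFFFFL);
    }
}
```

For a requested offset of 200000 and a granularity of 65536, the view starts at 196608 and the requested byte sits 3392 bytes into it. Tearing can then only occur if decoding starts at the view start rather than at a known character boundary.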

        [...]
        >
        Judging from this:
        >
        That's what I wrote, except for the part "looking for a means around".
        Well, depends on what you mean by this, but I'd rather not abandon
        memory mapping. So I am not looking for "means around memory mapping"
        but: living within memory mapping walls, how can I solve the "tearing"
        problem?
        >
        I'm guessing that both of those issues are really just the same problem for
        you. That is, that you have to address the file using non-contiguous
        pointers.
        >
        Mm, now I do not understand :-) The memory I map is guaranteed to be
        contiguous; it does not span the whole file, but the mapped contents
        (the current "page" in my code, if you will) have to start at a specific
        offset in the file. Here I seek the root of all evil :-)
        Certainly it is :-)
        That's how I wanted to implement fallback buffer. Each time I detect
        "torn" char I reposition file pointer, probe bytes backward till I find
        valid char and provide replacement.
        >
        Perhaps you could clarify under what situation you "detect a 'torn' char".
        That is, it's unclear to me whether you are referring to simply jumping into
        an offset that's not the start of a character, or if this somehow
        specifically relates to the sectioning of the file caused by your memory
        mapped i/o.
        >
        Well, yes, it can be thought of as jumping into a character, which, in
        turn, is related to the sectioning :-) How do I detect it? Hmm, that's an
        interesting question. I am not sure; if I were to use DecoderFallback,
        then that detection would happen by means of the decoder itself.
        The former would be an issue even if you could map the entire file to a
        single contiguous virtual address range. The latter is obviously only an
        issue because of the sectioning of the file. I'm confused as to which it
        is.
        >
        Hope I clarified :-)
        [...]
        Yeah, nice solution. Even though the performance hit may be noticeable, if
        I restrict these operations to fallback times only and extend my index
        structure to cache "torn" characters, I should not need to execute that
        code very often. Seems good to me. Yet ... :-( Can I be sure that the
        decoder cannot mistake characters?
        >
        Well, as I mentioned...I can't help you with that question. :) That
        depends on the nature of the data you're decoding, and I don't know enough
        to be able to answer that.
        >
        The data is simply a plain text file with some human-readable text.
        If you find that's too slow, then you can accomplish pretty much the same
        performance gain you might get from a memory-mapped file (or possibly
        even
        better, depending on what sort of buffering Windows was capable of doing
        with your memory-mapped file) by reading the file directly in larger
        chunks.
        If you do that, then yes...you need to worry about the data you're
        processing straddling whatever artificial boundary you wind up imposing
        by
        adding the extra layer of buffering in your own code. But that is a
        solvable problem (and in fact will be solved in a very similar way to
        what
        the memory-mapped solution has to do behind the scenes for you anyway).
        If
        that last sentence causes you some questions, let me know and I can
        elaborate.
        Well, yes, please. If you are able to show me how to solve that, then I
        can mix memory mapping with direct file access at fallback times and be
        perfectly happy.
        >
        Okay, let's see if this makes sense. First, keep in mind that my comment
        was assuming a general solution to the file i/o problem. I think you should
        be able to apply it as a "fallback" solution, but it may or may not be
        better than just falling back to reading a few bytes at a time if that's
        your approach.
        >
        That's what I planned to do: apply it as a "fallback". But I think we
        somehow disagree on the meaning of "fallback". I used that word in the
        sense of "decoder fallback": in C# terms, an instance of the
        DecoderFallback class; more generally, something that provides replacement
        chars to the decoder when it cannot, for some reason, decode a sequence. I
        suspect that you used it to mean "another plan". So, I planned to
        use your solution as part of a DecoderFallback implementation, which will
        read a few bytes back and try to concatenate these with the bytes from
        the beginning of the mapping.

        [...]
        That said:
        >
        What I meant was that you can read from the file a few blocks at a time,
        keeping that buffer centered on where you are currently accessing. You'll
        need to keep track of:
        >
        -- current file offset
        -- an array of blocks read from the file
        -- the file offsets those blocks came from
        -- the current block
        >
        The general idea is to maintain the array of blocks such that there is an
        odd number of blocks, at least three, and they are centered on the current
        offset within the file you're reading. Normally, you'll be reading from the
        middle block. If you skip over to another block, you drop one block from
        the far end of the array, and read another adding it to the near end of the
        array.
        >
        Basically, you're windowing the file in a fixed set of buffers. If you read
        new data asynchronously to your use of the data in the buffers you currently
        have, then when you drop a block at one end and fill it for use at the other
        end, the file i/o can happen while you're still processing the data that you
        do have.
        >
        Obviously if you jump to a completely different point in the file, you'll
        have to wait for the surrounding data to be read, but that's an issue even
        with memory-mapped files or when just reading directly with a FileStream.
        >
        Well, that's very close to what I have now. Let me specify the details.
        I read a few "blocks" at a time, namely 4, which is 256 KB of data (a block
        for me is the memory allocation granularity, as that is the smallest
        addressable part of a file when it comes to memory mapping). I try to
        adjust the offset a little, so that I always read all of the data
        requested and immediate reads in the neighbourhood do not cause
        remapping, which is close to your idea. But then, how do you know
        whether the very first byte of the current "window" is the first byte of
        a character?
        [...]
        After a little more thought, I've found that this is the most significant
        question. But let me rephrase what you wrote: it is no problem to find
        characters when reading a byte sequence forward, and every sane encoding
        must adhere to this in order to be usable. But is it the same case when
        looking backward?
        >
        I still don't know. :) I suspect that it is, because it was true with the
        basic MBCS I've seen. But I also realize that there are a LOT of different
        ways to encode text, and some may be context-sensitive.
        >
        :-( That's a pain in the ass for me. If I knew that I could always look
        back for the missing parts of a single character, then a mix of your
        solution with memory mapping would be the best scheme.
        Some of this really depends on what you mean by "encoding" and "decoding".
        The word "encoding" is applied in a variety of ways. Two that could apply
        here are the basic idea of text encoding, which mostly just has to do with
        the character set, or some actual conversion of data, which has to do with
        compressing the data, or translating it into a more portable format (MIME,
        for example). I don't even know which of these meanings you're addressing,
        making it even harder for me to know the answer. :)
        >
        I meant "basic idea of text encoding" :-)
        [...]
        So, summing up: I think the question reduces to one about encoding
        characteristics. You showed us a very good solution using FileStream. It
        can be extended to mix these two approaches, which may be faster, but I
        still do not know whether it is reliable.
        >
        Indeed, that is a question you should probably figure out. Earlier rather
        than later. :) Sorry I can't be of more help on that front.
        >
        Pete, first of all, thank you for a wonderful discussion; it was really
        helpful. And I hope you'll add something more after reading the code
        :-)
        Pete


        • Peter Duniho

          #19
          Re: Decoding strategy

          <marcin.rzeznicki@gmail.com> wrote in message
          news:1160597995.002328.207620@b28g2000cwb.googlegroups.com...
          [...]
          So, memory mapped view is sure to be usable for a while, so I
          think it pays off to keep that in memory. That characteristic also
          ensures me that OS cache can be helpful and performance will not suffer
          from misses/disk reads very often.
          That's well and good. However, those characteristics assist in ensuring
          that the file data is cached when using other forms of i/o as well,
          including using a FileStream. The benefit is not unique to memory mapped
          file i/o.
          Yes, I know. I was referring to something else. Sorry for being
          unclear. Docs say:
          >
          "(..)must specify an offset within the file that matches the memory
          allocation granularity of the system, or the function fails. That is,
          the offset must be a multiple of the allocation granularity".
          >
          I know that this is going to be aligned to something in VM, but I do not
          care; this is transparent unless you write kernel-mode stuff or,
          generally, very low-level stuff. What I do care about is that I cannot
          choose the FILE offset at which the mapping starts. And that leads to
          "tearing".
          Okay, I think I understand better what you meant. I'm going to snip a bunch
          of stuff here, and hopefully jump to the core of the issue...
          [...]
          Well, that's very close to what I have now. Let me specify the details.
          I read a few "blocks" at a time, namely 4, which is 256 KB of data (a block
          for me is the memory allocation granularity, as that is the smallest
          addressable part of a file when it comes to memory mapping). I try to
          adjust the offset a little, so that I always read all of the data
          requested and immediate reads in the neighbourhood do not cause
          remapping, which is close to your idea. But then, how do you know
          whether the very first byte of the current "window" is the first byte of
          a character?
          Thanks. The code you posted helps me understand better what's going on.

          In fact, as near as I can tell, you are using memory mapping in practically
          the same way as my proposed multiple-buffer solution deals with things.
          That is, you're windowing the file with memory mapping the same way I'm
          doing it with the buffers.

          Here's a dumb question: is there any particular reason you're NOT mapping
          the entire file at once? I've mentioned the possibility in previous
          messages, making assumptions that you have your reasons for not doing so.
          But if you could, all of these issues just go away. Are you genuinely
          concerned that you won't have enough contiguous virtual address space to map
          the whole file?

          Anyway, for the moment let's assume that you can only map a portion of the
          file at a time...

          Depending on what the actual performance is, it seems to me that either
          method would be the correct solution. I suspect that the answer is to
          simply do the memory mapping a little differently, but I don't have enough
          experience with memory mapped files to know for sure.

          Specifically: what if you modified your code that maps the file, so that it
          maps a range *around* the starting point, the way I suggested with the
          buffers? At certain points (perhaps only when you got right to the very
          edge and attempted to read a byte outside your mapped range), you would
          remap the file, shifting the window so that the bytes you want to deal with
          are within the mapped range.

          When you index the data, I would recommend the high-level code using an
          index relative to the file beginning. That is, your index is just the file
          offset for the data (note that I'm not using the word "index" to relate to
          the broader index you calculate for the file data...I just mean a way to
          identify which byte you're working on at the moment). Then you translate
          that to the actual offset within the mapped range as necessary. That way,
          you can be changing the mapped range on the fly without affecting how the
          higher-level code that actually processes the data works.
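The offset translation described above might look roughly like this. It is a sketch under assumptions: ViewWindow and its members are hypothetical names, the actual mapping call is supplied by the caller through a callback, and the view is recentered only when a requested byte falls outside it.

```csharp
using System;

// Hypothetical bookkeeping for a movable view. Higher-level code works
// only with absolute file offsets; this class decides when to remap.
public sealed class ViewWindow
{
    private readonly long fileLength;
    private readonly int granularity;
    private readonly int blocksPerView;
    private readonly Action<long, long> remap;  // receives (viewStart, viewLength)

    public long ViewStart { get; private set; }
    public long ViewLength { get; private set; }

    public ViewWindow(long fileLength, int granularity, int blocksPerView,
                      Action<long, long> remap)
    {
        this.fileLength = fileLength;
        this.granularity = granularity;
        this.blocksPerView = blocksPerView;
        this.remap = remap;
        ViewStart = -1;  // nothing mapped yet
    }

    // Translates an absolute file offset into an offset inside the current
    // view, moving the view first if the byte lies outside it.
    public long ToViewOffset(long fileOffset)
    {
        if (fileOffset < 0 || fileOffset >= fileLength)
            throw new ArgumentOutOfRangeException("fileOffset");

        bool outside = ViewStart < 0
            || fileOffset < ViewStart
            || fileOffset >= ViewStart + ViewLength;
        if (outside)
        {
            // Center the view on the requested block; the start stays a
            // multiple of the granularity, as the mapping call requires.
            long block = fileOffset / granularity;
            long firstBlock = Math.Max(0, block - blocksPerView / 2);
            ViewStart = firstBlock * granularity;
            ViewLength = Math.Min((long)blocksPerView * granularity,
                                  fileLength - ViewStart);
            remap(ViewStart, ViewLength);
        }
        return fileOffset - ViewStart;
    }
}
```

Because callers never see view-relative offsets, the window can be moved at any time without changing the code that processes the data.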

          I believe that performance should be fine doing this. When you remap the
          file, most of the file should still be in physical RAM and I suspect the OS
          will correctly reattach the newly mapped range to the portions of the range
          that are already resident in RAM. Only the newly mapped portions of the
          file should need to be read.
          >[...]
          :-( That's a pain in the ass for me. If I knew that I could always look
          back for the missing parts of a single character, then a mix of your
          solution with memory mapping would be the best scheme.
          Well, at some point you need to come up with some mechanism for finding the
          beginning of a valid character. :) How you access the file might make this
          easier or harder, but the problem exists even if you can map the entire file
          at once. Sorry I can't be more helpful on that front. I agree, that part
          actually seems to be the "hard part" of this problem, in spite of all the
          space we've consumed discussing the file i/o part. :)
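For one specific encoding the backward search is straightforward. The sketch below ASSUMES the text is UTF-8, which the thread never states; FindCharStart is a hypothetical helper. In UTF-8 every continuation byte has the bit pattern 10xxxxxx and no lead byte does, so stepping back over at most three continuation bytes reaches the first byte of the character.

```csharp
using System;

public static class Utf8Boundary
{
    // Returns the index of the first byte of the UTF-8 character that
    // contains buffer[index]. Assumes well-formed UTF-8 input.
    public static int FindCharStart(byte[] buffer, int index)
    {
        int i = index;
        int steps = 0;
        // A UTF-8 character has at most three continuation bytes (10xxxxxx).
        while (i > 0 && steps < 3 && (buffer[i] & 0xC0) == 0x80)
        {
            i--;
            steps++;
        }
        return i;
    }
}
```

Other encodings behave differently. In some legacy double-byte code pages, Shift-JIS for instance, certain byte values can serve as either a lead byte or a trail byte, so a backward scan alone can be ambiguous and the answer really does depend on the encoding in use.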

          Pete



          • marcin.rzeznicki@gmail.com

            #20
            Re: Decoding strategy


            Peter Duniho wrote:
            <marcin.rzeznicki@gmail.com> wrote in message
            news:1160597995.002328.207620@b28g2000cwb.googlegroups.com...
            [...]
            So, memory mapped view is sure to be usable for a while, so I
            think it pays off to keep that in memory. That characteristic also
            ensures me that OS cache can be helpful and performance will not suffer
            from misses/disk reads very often.
            >
            That's well and good. However, those characteristics assist in ensuring
            that the file data is cached when using other forms of i/o as well,
            including using a FileStream. The benefit is not unique to memory mapped
            file i/o.
            >
            I see. I am on the winning side though, because I eliminate unnecessary
            in-memory copying. But, agreed, that may not be much overall.
            Yes, I know. I was referring to something else. Sorry for being
            unclear. Docs say:

            "(..)must specify an offset within the file that matches the memory
            allocation granularity of the system, or the function fails. That is,
            the offset must be a multiple of the allocation granularity".

            I know that this is going to be aligned to something in VM, but I do not
            care, this is transparent unless you write kernel-mode stuff or,
            generally, very low level stuff. What I do care about is that I cannot choose
            the FILE offset at which the mapping starts. And that leads to "tearing".
            >
            Okay, I think I understand better what you meant. I'm going to snip a bunch
            of stuff here, and hopefully jump to the core of the issue...
            >
            [...]
            Well, that's very close to what I have now. Let me specify the details.
            I read a few "blocks" at a time, namely 4, which is 256 KB of data (a block for
            me is the memory allocation granularity, as that is the smallest
            addressable part of a file when it comes to memory mapping). I try to
            adjust the offset a little, so that I always read the whole data I am
            asked for, and immediate reads in the neighbourhood will not cause
            remapping, which is close to your idea. But then, how do you know
            whether the very first byte of current "window" is the first block of
            character?
            >
            Thanks. The code you posted helps me understand better what's going on.
            >
            In fact, as near as I can tell, you are using memory mapping in practically
            the same way as my proposed multiple-buffer solution deals with things.
            That is, you're windowing the file with memory mapping the same way I'm
            doing it with the buffers.
            >
            Here's a dumb question: is there any particular reason you're NOT mapping
            the entire file at once? I've mentioned the possibility in previous
            messages, making assumptions that you have your reasons for not doing so.
            But if you could, all of these issues just go away. Are you genuinely
            concerned that you won't have enough contiguous virtual address space to map
            the whole file?
            Well, there are two issues involved, and I do not know which one
            you are referring to. Let me explain. Mapping is actually a two-step process:
            first of all you reserve VM for the mapping and then you commit, which
            results in bringing the contents of the file into memory. So, when it comes to
            reservation step, I map entire file at once, code I pasted does not
            show this step. What is shown is the commitment step, and I commit only
            small portion of reserved memory at once. This app is not going to be
            server app, running on high end machines with many gigs of ram. It is
            rather intended to be desktop app. So, I do not want to reserve like
            500 MB of memory for just one file because it could easily cause
            constant swapping and overall performance degradation on user machine.
            >
            Anyway, for the moment let's assume that you can only map a portion of the
            file at a time...
            >
            Depending on what the actual performance is, it seems to me that either
            method would be the correct solution. I suspect that the answer is to
            simply do the memory mapping a little differently, but I don't have enough
            experience with memory mapped files to know for sure.
            >
            Specifically: what if you modified your code that maps the file, so that it
            maps a range *around* the starting point, the way I suggested with the
            buffers? At certain points (perhaps only when you got right to the very
            edge and attempted to read a byte outside your mapped range), you would
            remap the file, shifting the window so that the bytes you want to deal with
            are within the mapped range.
            >
            It does. Well, I am sorry, because I stripped this code of mapping
            logic, but when you see something like firstBufferIndex it is, in almost all
            cases, a carefully computed index of a portion which contains the requested
            data but also its neighbourhood, so that near "jumps" should not cause
            remapping. Actually the user may as well use enumerated access; in that case
            I know in advance that the data is going to be read forward, so I can, if
            I must, map from where previous mapping ends.
            When you index the data, I would recommend the high-level code using an
            index relative to the file beginning. That is, your index is just the file
            offset for the data (note that I'm not using the word "index" to relate to
            the broader index you calculate for the file data...I just mean a way to
            identify which byte you're working on at the moment). Then you translate
            that to the actual offset within the mapped range as necessary. That way,
            you can be changing the mapped range on the fly without affecting how the
            higher-level code that actually processes the data works.
            >
            Well, that is not going to work for me unfortunately. Interfaces I have
            to implement imply that data access uses "string coordinates" - so
            client code specifies - I want 5th char, not 5th byte, and reckoning
            that encoding-hell I would not be able to compute that easily, so I
            decided to use only "string coordinates".
            I believe that performance should be fine doing this. When you remap the
            file, most of the file should still be in physical RAM and I suspect the OS
            will correctly reattach the newly mapped range to the portions of the range
            that are already resident in RAM. Only the newly mapped portions of the
            file should need to be read.
            >
            Yeah, I think so too
            [...]
            :-( That's pain in the ass for me. If I knew that I could always look
            back for missing parts of single character, then mix of your solution
            with memory mapping would be the best scheme
            >
            Well, at some point you need to come up with some mechanism for finding the
            beginning of a valid character. :) How you access the file might make this
            easier or harder, but the problem exists even if you can map the entire file
            at once. Sorry I can't be more helpful on that front. I agree, that part
            actually seems to be the "hard part" of this problem, in spite of all the
            space we've consumed discussing the file i/o part. :)
            >
            Yes, this problem vanishes when you are able to map AND decode entire
            file at once. But that's overkill I suppose.
            So, I'll try to implement DecoderFallback, with nothing more than HOPE
            that it will always be able to do its job :-)
            Thank you.
            Pete


            • Peter Duniho

              #21
              Re: Decoding strategy

              <marcin.rzeznicki@gmail.com> wrote in message
              news:1160608339.224268.43620@k70g2000cwa.googlegroups.com...
              [...]
              >Here's a dumb question: is there any particular reason you're NOT mapping
              >the entire file at once? I've mentioned the possibility in previous
              >messages, making assumptions that you have your reasons for not doing so.
              >But if you could, all of these issues just go away. Are you genuinely
              >concerned that you won't have enough contiguous virtual address space to
              >map
              >the whole file?
              >
              Well, there are two issues involved, and I do not know which one
              you are referring to. Let me explain. Mapping is actually a two-step process,
              first of all you reserve VM for mapping and then you commit, which
              result in bringing contents of file to memory.
              That's not the process of memory-mapped file i/o I'm familiar with. That
              is, while I know you can use MapViewOfFileEx() to provide a specific virtual
              address at which to map the file, this isn't necessary, nor does it to my
              knowledge require an explicit commit of the entire file.

              The usual method of memory-mapping that I use is this:

              * open the file (CreateFile)
              * create the file mapping (CreateFileMapping)
              * assign virtual address space to file mapping (MapViewOfFile)

              When MapViewOfFile returns, the code now has a virtual address that
              represents the beginning of the data of the file. Physical RAM is committed
              only as the data is actually accessed, and can be reclaimed through the
              usual page aging process (older pages get tossed as needed if something else
              needs physical RAM that's not available).
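The three steps above can be sketched with Python's `mmap` module, which on Windows is built on the same CreateFileMapping and MapViewOfFile calls. This is only an illustration, not the code under discussion, and the helper name `map_whole_file` is hypothetical. Passing a length of 0 maps the whole file; the call reserves address space, and the OS brings pages into physical RAM as they are touched.

```python
import mmap


def map_whole_file(path):
    """Map an entire file read-only and return the view.

    Raises ValueError for an empty file, which cannot be mapped.
    """
    with open(path, "rb") as f:                    # open the file
        # Create the mapping and the view in one call; length 0 means
        # "the whole file". Nothing is read from disk at this point.
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The mmap object keeps its own handle, so the view stays valid
    # after the file object has been closed.
    return view
```

The returned view can be indexed and sliced like `bytes`, so stepping backwards is just a smaller index. Call `view.close()` when done.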
              So, when it comes to
              reservation step, I map entire file at once, code I pasted does not
              show this step. What is shown is the commitment step, and I commit only
              small portion of reserved memory at once.
              The code you posted calls only MapViewOfFile. This doesn't reserve any
              physical RAM for the data. It just reserves room in the virtual address
              space for it.
              This app is not going to be
              server app, running on high end machines with many gigs of ram. It is
              rather intended to be desktop app. So, I do not want to reserve like
              500 MB of memory for just one file because it could easily cause
              constant swapping and overall performance degradation on user machine.
              Negative. That's one of the nice benefits of memory mapping: you can map an
              entire file, even a large one, and use only the physical RAM required to
              process the parts you're looking at. In addition, because the physical RAM
              being used is backed by the mapped file, it doesn't get swapped out to the
              swap file...the file itself can be used for the backing store (this doesn't
              necessarily help the physical RAM side of things, but it does ease the
              pressure on the swap file itself).

              There is no reason that I can think of that would cause mapping a large file
              into virtual address space to cause any more swapping than processing that
              file would cause in any case. The OS certainly does not read all 500MB of a
              mapped 500MB file into physical RAM just because you've mapped the file.
              [...]
              >Specifically: what if you modified your code that maps the file, so that
              >it
              >maps a range *around* the starting point, the way I suggested with the
              >buffers? At certain points (perhaps only when you got right to the very
              >edge and attempted to read a byte outside your mapped range), you would
              >remap the file, shifting the window so that the bytes you want to deal
              >with
              >are within the mapped range.
              >
              It does. Well, I am sorry, because I stripped this code of mapping
              logic, but when you see something like firstBufferIndex it is, in almost all
              cases, a carefully computed index of a portion which contains the requested
              data but also its neighbourhood, so that near "jumps" should not cause
              remapping. Actually the user may as well use enumerated access; in that case
              I know in advance that the data is going to be read forward, so I can, if
              I must, map from where previous mapping ends.
              That's not what I mean. If you were doing what I was suggesting already,
              then the only issue remaining for you would be figuring out when you need to
              back up in the data. The actual backing up would be trivial...you'd just
              decrement your pointer and read the byte you want to read. You would have
              moments when the mapped section of the file would have to change, but that
              would be a momentary diversion and you'd get right back to just reading the
              bytes from the mapped address space.
              >[...] Then you translate
              >that to the actual offset within the mapped range as necessary. That
              >way,
              >you can be changing the mapped range on the fly without affecting how the
              >higher-level code that actually processes the data works.
              >
              Well, that is not going to work for me unfortunately. Interfaces I have
              to implement imply that data access uses "string coordinates" - so
              client code specifies - I want 5th char, not 5th byte, and reckoning
              that encoding-hell I would not be able to compute that easily, so I
              decided to use only "string coordinates".
              I don't think you got my meaning. I don't mean that the highest level of
              your code has to use a byte offset within the file. Just that the decoder
              part need not concern itself with anything other than the byte offset. As
              it reads bytes, it would ask the file mapping layer of your code for a byte
              offset within the file, and the file mapping layer would then translate that
              into an offset within the mapped view you're using.

              That said, so far I haven't seen an indication that you actually need to be
              mapping sections of the file. You seem to be concerned about committing too
              much physical RAM at once to the mapping, but unless you're doing something
              really odd that you haven't posted in code, your concern is unfounded.

              There are reasons that you might not be able to map an entire file into your
              virtual address space, but 500MB ought to be within the usual limitations.
              It seems to me that you should look at just mapping the entire file all at
              once, and if you run into problems with that, then start worrying about
              windowing the file.

              The reason you might not be able to map the whole file at once is that you
              don't have a contiguous range of virtual address space large enough for the
              file. That can happen for two reasons: insufficient virtual address space
              left or fragmented virtual address space. How much virtual address space
              you might have will vary, but even the theoretical 2GB maximum (and of
              course, this never comes close to being available) is smaller than some
              files. Fragmentation is harder to predict, and could limit your available
              virtual address space to something significantly smaller than the actual
              virtual address space left. But IMHO, if 500MB is a typical file size for
              you, you ought to be able to map that without problems.

              Pete



              • marcin.rzeznicki@gmail.com

                #22
                Re: Decoding strategy


                Peter Duniho wrote:
                <marcin.rzeznicki@gmail.com> wrote in message
                news:1160608339.224268.43620@k70g2000cwa.googlegroups.com...
                [...]
                Here's a dumb question: is there any particular reason you're NOT mapping
                the entire file at once? I've mentioned the possibility in previous
                messages, making assumptions that you have your reasons for not doing so.
                But if you could, all of these issues just go away. Are you genuinely
                concerned that you won't have enough contiguous virtual address space to
                map
                the whole file?
                Well, there are two issues involved, and I do not know which one
                you are referring to. Let me explain. Mapping is actually a two-step process,
                first of all you reserve VM for mapping and then you commit, which
                result in bringing contents of file to memory.
                >
                That's not the process of memory-mapped file i/o I'm familiar with. That
                is, while I know you can use MapViewOfFileEx() to provide a specific virtual
                address at which to map the file, this isn't necessary, nor does it to my
                knowledge require an explicit commit of the entire file.
                >
                The usual method of memory-mapping that I use is this:
                >
                * open the file (CreateFile)
                * create the file mapping (CreateFileMapping)
                * assign virtual address space to file mapping (MapViewOfFile)
                >
                That's the same, but under different names. CreateFileMapping reserves
                a VM range. It is not yet committed, and you pay almost no
                resource-usage/performance price. MapViewOfFile commits some part of the
                previously reserved VM and brings in the contents of the file (maybe lazily, I
                don't know for sure).
                When MapViewOfFile returns, the code now has a virtual address that
                represents the beginning of the data of the file. Physical RAM is committed
                only as the data is actually accessed, and can be reclaimed through the
                usual page aging process (older pages get tossed as needed if something else
                needs physical RAM that's not available).
                >
                So, when it comes to
                reservation step, I map entire file at once, code I pasted does not
                show this step. What is shown is the commitment step, and I commit only
                small portion of reserved memory at once.
                >
                The code you posted calls only MapViewOfFile. This doesn't reserve any
                physical RAM for the data. It just reserves room in the virtual address
                space for it.
                >
                Well, actually, if I understand the docs correctly, CreateFileMapping reserves
                a virtual memory address range and establishes an association between VM
                addresses and the file. MapViewOfFile brings the contents of the file into RAM.
                This app is not going to be
                server app, running on high end machines with many gigs of ram. It is
                rather intended to be desktop app. So, I do not want to reserve like
                500 MB of memory for just one file because it could easily cause
                constant swapping and overall performance degradation on user machine.
                >
                Negative. That's one of the nice benefits of memory mapping: you can map an
                entire file, even a large one, and use only the physical RAM required to
                process the parts you're looking at. In addition, because the physical RAM
                being used is backed by the mapped file, it doesn't get swapped out to the
                swap file...the file itself can be used for the backing store (this doesn't
                necessarily help the physical RAM side of things, but it does ease the
                pressure on the swap file itself).
                >
                Positive with respect to "swapping" definition :-) It does not get
                swapped to swap file, true, but still it may be swapped to the mapped
                file. So, though you are right that memory pressure is removed from
                page file, you still pay the price of swapping if lot of RAM is
                occupied by file view
                There is no reason that I can think of that would cause mapping a large file
                into virtual address space to cause any more swapping than processing that
                file would cause in any case. The OS certainly does not read all 500MB of a
                mapped 500MB file into physical RAM just because you've mapped the file.
                >
                I think that when I've established a view then RAM gets occupied. So,
                as I said, I map whole file at once as docs assure me that there is
                nothing wrong with that, but I restrict myself to moderately sized
                views.
                [...]
                Specifically: what if you modified your code that maps the file, so that
                it
                maps a range *around* the starting point, the way I suggested with the
                buffers? At certain points (perhaps only when you got right to the very
                edge and attempted to read a byte outside your mapped range), you would
                remap the file, shifting the window so that the bytes you want to deal
                with
                are within the mapped range.
                It does. Well, I am sorry, because I stripped this code of mapping
                logic, but when you see something like firstBufferIndex it is, in almost all
                cases, a carefully computed index of a portion which contains the requested
                data but also its neighbourhood, so that near "jumps" should not cause
                remapping. Actually the user may as well use enumerated access; in that case
                I know in advance that the data is going to be read forward, so I can, if
                I must, map from where previous mapping ends.
                >
                That's not what I mean. If you were doing what I was suggesting already,
                then the only issue remaining for you would be figuring out when you need to
                back up in the data. The actual backing up would be trivial...you'd just
                decrement your pointer and read the byte you want to read. You would have
                moments when the mapped section of the file would have to change, but that
                would be a momentary diversion and you'd get right back to just reading the
                bytes from the mapped address space.
                >
                Sorry Peter, I don't get it then. Could you explain it to me? It seems
                to be an interesting idea, but now I feel that I've got lost.
                [...] Then you translate
                that to the actual offset within the mapped range as necessary. That
                way,
                you can be changing the mapped range on the fly without affecting how the
                higher-level code that actually processes the data works.
                Well, that is not going to work for me unfortunately. Interfaces I have
                to implement imply that data access uses "string coordinates" - so
                client code specifies - I want 5th char, not 5th byte, and reckoning
                that encoding-hell I would not be able to compute that easily, so I
                decided to use only "string coordinates".
                >
                I don't think you got my meaning. I don't mean that the highest level of
                your code has to use a byte offset within the file. Just that the decoder
                part need not concern itself with anything other than the byte offset. As
                it reads bytes, it would ask the file mapping layer of your code for a byte
                offset within the file, and the file mapping layer would then translate that
                into an offset within the mapped view you're using.
                >
                Isn't that what ReadPage in my code does? It is asked to bring contents
                indexed by block offset, it computes "real" offset and establishes a
                view. Decoder part does not even have to think of byte offsets because
                it operates on current page only, and pointer to it is constant in time
                when decoder operates.
                That said, so far I haven't seen an indication that you actually need to be
                mapping sections of the file. You seem to be concerned about committing too
                much physical RAM at once to the mapping, but unless you're doing something
                really odd that you haven't posted in code, your concern is unfounded.
                >
                There are reasons that you might not be able to map an entire file into your
                virtual address space, but 500MB ought to be within the usual limitations.
                It seems to me that you should look at just mapping the entire file all at
                once, and if you run into problems with that, then start worrying about
                windowing the file.
                >
                The reason you might not be able to map the whole file at once is that you
                don't have a contiguous range of virtual address space large enough for the
                file. That can happen for two reasons: insufficient virtual address space
                left or fragmented virtual address space. How much virtual address space
                you might have will vary, but even the theoretical 2GB maximum (and of
                course, this never comes close to being available) is smaller than some
                files. Fragmentation is harder to predict, and could limit your available
                virtual address space to something significantly smaller than the actual
                virtual address space left. But IMHO, if 500MB is a typical file size for
                you, you ought to be able to map that without problems.
                >
                So, if I understand correctly what you wrote, I am not concerned with
                mapping file at once, I reserve all VM I will need for one file
                (CreateFileMapping). But I am concerned when it comes to commit
                (MapViewOfFile) because that's where memory resources are really
                consumed. Am I missing something?
                Pete


                • Peter Duniho

                  #23
                  Re: Decoding strategy

                  <marcin.rzeznicki@gmail.com> wrote in message
                  news:1160663537.474472.154060@b28g2000cwb.googlegroups.com...
                  That's the same, but under different names. CreateFileMapping reserves
                  VM range.
                  That is incorrect. The virtual memory range is not reserved until you call
                  MapViewOfFile.
                  [...] It is not yet committed, and you pay almost no
                  resources-usage/performance price. MapViewOfFile commits some part of
                  previously reserved VM and brings contents of file (maybe lazily, I
                  don't know for sure)
                  That is also incorrect. MapViewOfFile reserves the virtual address space.
                  There may be some caching, but otherwise committing the file data to
                  physical RAM does not occur until a specific portion of the reserved virtual
                  address space is referenced.

                  I'm offline right now, otherwise I'd provide a link to the MSDN web site.
                  However, you can easily look those functions up yourself, and the
                  documentation explicitly describes the behavior as I do above.

                  From the documentation for CreateFileMapping:

                  Creating a file mapping object creates the potential for
                  mapping a view of the file, but does not map the view. The
                  MapViewOfFile and MapViewOfFileEx functions map a view of
                  a file into a process address space

                  If CreateFileMapping was what allocated virtual address space, it would not
                  make sense for MapViewOfFileEx to even exist, since the main reason for that
                  function is to allow the program to provide a specific virtual memory
                  address at which to map the file.
                  [...]
                  Well, actually, if I understand the docs correctly, CreateFileMapping reserves
                  a virtual memory address range and establishes an association between VM
                  addresses and the file. MapViewOfFile brings the contents of the file into RAM.
                  What can I say? You don't understand the docs correctly.
                  [...]
                  Positive with respect to "swapping" definition :-) It does not get
                  swapped to swap file, true, but still it may be swapped to the mapped
                  file. So, though you are right that memory pressure is removed from
                  page file, you still pay the price of swapping if lot of RAM is
                  occupied by file view
                  My point is that the amount of data in physical RAM will be related to your
                  use of that data. The OS will keep the data in physical RAM based on your
                  access of that data, not based on how much of it there is. This is true
                  whether you use memory mapping or not.

                  With either technique, you can limit the *maximum* amount of physical RAM
                  potentially consumed. Using memory mapping, you do this by mapping only a
                  small range of the file at a time. Using conventional file i/o, you do this
                  by limiting your own buffers that are used to store data you've read from
                  the file.

                  In either case, the OS has the final say on how much physical RAM is
                  actually used. Using memory mapping, if there are other demands on physical
                  RAM, then only a portion of mapped virtual address space will actually be
                  resident at any given time. Likewise, using conventional file i/o, only a
                  portion of your own program buffers will be resident in physical RAM at any
                  given time.

                  But memory mapped file i/o will not in and of itself increase memory
                  swapping. The only way it could do that is if you not only map the entirety
                  of a very large file in RAM, but you wind up *accessing* the totality of
                  that file more frequently than you access anything else. In that case, the
                  OS would be chasing you trying to keep all of the file data you're
                  referencing resident, at the same time that other stuff needs to be swapped
                  in and back out.

                  This is not a typical case, and doesn't seem relevant to your own situation.
                  In any case, the OS is pretty smart. If your use of a memory mapped file
                  starts pressuring other users of physical RAM, the OS is not going to bother
                  trying to keep all of the memory mapped file in RAM. Even better, as long
                  as you open the file as read-only, you're assured of never paying the
                  cost of writing any data back to the disk if a physical page of RAM used by
                  the file mapping has to be discarded and used for something else.

                  Your worries about memory mapping the entire file causing some serious
                  problem with disk swapping are unfounded.
                  >There is no reason that I can think of that would cause mapping a large
                  >file
                  >into virtual address space to cause any more swapping than processing
                  >that
                  >file would cause in any case. The OS certainly does not read all 500MB
                  >of a
                  >mapped 500MB file into physical RAM just because you've mapped the file.
                  >
                  I think that RAM gets occupied once I've established a view. So,
                  as I said, I map the whole file at once, since the docs assure me that
                  there is nothing wrong with that, but I restrict myself to moderately
                  sized views.
                  But it's not true that when you establish a view then RAM gets occupied.
                  The "view" is an allocation of virtual address space, not physical RAM.
                  >That's not what I mean. If you were doing what I was suggesting already,
                  >then the only issue remaining for you would be figuring out when you need
                  >to
                  >back up in the data. The actual backing up would be trivial...you'd just
                  >decrement your pointer and read the byte you want to read. You would
                  >have
                  >moments when the mapped section of the file would have to change, but
                  >that
                  >would be a momentary diversion and you'd get right back to just reading
                  >the
                  >bytes from the mapped address space.
                  >>
                  >
                  Sorry Peter, I don't get it then. Could you explain it to me? It seems
                  to be an interesting idea, but now I feel that I've got lost.
                  Assume you have some code that attempts to retrieve a byte from a specific
                  file offset. Assume also that you have some code that translates this into
                  access from your mapped view of the file. Finally, assume that the
                  higher-level code is trying to access a byte that is just before the lowest
                  file offset currently being mapped.

                  In pseudocode then:

                  // The desired byte offset from the file
                  long ibFileOffset;
                  // This is the mapped range, "Min" inclusive, "Mac" exclusive
                  long ibMappedMin, ibMappedMac;
                  // The resulting offset within the mapped range
                  long ibMappedOffset;
                  // Pointer to the first byte of the currently mapped view
                  unsigned char *pbMappedData;

                  if (ibFileOffset < ibMappedMin || ibFileOffset >= ibMappedMac)
                  {
                      // remap file so that ibMappedMin <= ibFileOffset and
                      // ibFileOffset < ibMappedMac, updating pbMappedData to
                      // match. Don't forget to make sure that ibMappedMin and
                      // ibMappedMac remain between 0 and the total file length.
                  }

                  ibMappedOffset = ibFileOffset - ibMappedMin;
                  return *(pbMappedData + ibMappedOffset);

                  Basically, in the normal case, all that the code is doing is translating the
                  file offset to the mapping offset and returning the data at that offset.
                  When the requested data falls outside the range, you just shift the mapped
                  range enough to accommodate the new request for data.

                  Most likely, you'd try to center the newly-mapped range on the requested file
                  offset. When you get near the beginning or end of the file, you'll
                  necessarily wind up at least trimming the mapped range as appropriate
                  (making it smaller than normal), if not just pinning the range to the
                  relevant boundary (preserving the total size of the mapping).
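
As a minimal sketch of the centering and pinning just described (written in C for concreteness; `MappedRange`, `ChooseRange`, and the `cbWindow` parameter are hypothetical names, not anything from the code discussed in this thread), the window could be chosen like this. It pins the range to the file boundary so the window keeps its size, and only trims when the file itself is smaller than the window:

```c
#include <assert.h>

/* Hypothetical helper type: the half-open mapped range [min, mac). */
typedef struct {
    long long min; /* first file offset in the window (inclusive) */
    long long mac; /* one past the last file offset (exclusive) */
} MappedRange;

/*
 * Choose a window of at most cbWindow bytes that contains ibFileOffset.
 * The window is centered on the offset where possible and pinned to the
 * start or end of the file otherwise.
 * Assumes cbWindow > 0 and 0 <= ibFileOffset < cbFile.
 */
MappedRange ChooseRange(long long ibFileOffset, long long cbFile,
                        long long cbWindow)
{
    MappedRange r;

    assert(cbWindow > 0);
    assert(ibFileOffset >= 0 && ibFileOffset < cbFile);

    if (cbWindow >= cbFile) {
        /* The whole file fits: trim the window to the file. */
        r.min = 0;
        r.mac = cbFile;
        return r;
    }

    /* Center on the requested offset... */
    r.min = ibFileOffset - cbWindow / 2;

    /* ...then pin to whichever file boundary was crossed. */
    if (r.min < 0)
        r.min = 0;
    if (r.min > cbFile - cbWindow)
        r.min = cbFile - cbWindow;

    r.mac = r.min + cbWindow;
    return r;
}
```

For example, with a 1000-byte file and a 100-byte window, offset 500 gives [450, 550), offset 10 gives [0, 100), and offset 990 gives [900, 1000). A real MapViewOfFile call additionally requires the view's starting offset to be a multiple of the system allocation granularity, which this sketch ignores.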
                  Isn't that what ReadPage in my code does? It is asked to bring contents
                  indexed by block offset, it computes "real" offset and establishes a
                  view. Decoder part does not even have to think of byte offsets because
                  it operates on current page only, and pointer to it is constant in time
                  when decoder operates.
                  IMHO, there's no reason for the decoder to have to think of pages within the
                  file. As near as I can tell, that's an arbitrary choice affected by the
                  implementation of your file i/o. In particular, if I understand correctly
                  (and maybe I don't), part of the issue of "tearing" that you're worried
                  about comes about because of the potential for data being read to cross one
                  of these page boundaries.

                  The decoder should concern itself only with the entire file. Thinking in
                  pages is why you have the tearing issue. If you allowed the decoder to simply use an
                  offset relative to the beginning of the file, then the decoder would never
                  have to worry about whether the data falls outside the currently mapped
                  range. The i/o code would take care of that instead, and always return
                  whatever byte it is the decoder wants to handle.
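
One way to picture that split, as a rough sketch only: the decoder calls a single function with a file-relative offset, and the i/o layer decides when to move its window. To keep the example portable it uses plain `fseek`/`fread` into a fixed buffer as a stand-in for remapping a view; `ByteSource`, `SourceOpen`, `SourceByteAt`, and `CB_WINDOW` are hypothetical names introduced here, not part of the code under discussion.

```c
#include <assert.h>
#include <stdio.h>

enum { CB_WINDOW = 4096 }; /* hypothetical window size */

typedef struct {
    FILE *fp;
    long cbFile;                  /* total file length in bytes */
    long ibMin, ibMac;            /* window currently held: [ibMin, ibMac) */
    unsigned char rgb[CB_WINDOW]; /* stand-in for the mapped view */
} ByteSource;

/* Attach to an open binary stream. Returns 0 on success, -1 on error. */
int SourceOpen(ByteSource *ps, FILE *fp)
{
    ps->fp = fp;
    ps->ibMin = ps->ibMac = 0; /* empty window forces a load on first read */
    if (fseek(fp, 0, SEEK_END) != 0)
        return -1;
    ps->cbFile = ftell(fp);
    return ps->cbFile < 0 ? -1 : 0;
}

/*
 * Return the byte at ibFileOffset (0..255), or -1 if the offset is outside
 * the file or the read fails. The caller never sees window boundaries.
 */
int SourceByteAt(ByteSource *ps, long ibFileOffset)
{
    if (ibFileOffset < 0 || ibFileOffset >= ps->cbFile)
        return -1;

    if (ibFileOffset < ps->ibMin || ibFileOffset >= ps->ibMac) {
        /* Re-center the window on the request, pinned to the file bounds. */
        long ibMin = ibFileOffset - CB_WINDOW / 2;
        size_t cb;

        if (ibMin > ps->cbFile - CB_WINDOW)
            ibMin = ps->cbFile - CB_WINDOW;
        if (ibMin < 0)
            ibMin = 0;

        if (fseek(ps->fp, ibMin, SEEK_SET) != 0)
            return -1;
        cb = fread(ps->rgb, 1, CB_WINDOW, ps->fp);
        if (cb == 0) {
            ps->ibMin = ps->ibMac = 0; /* invalidate the window */
            return -1;
        }
        ps->ibMin = ibMin;
        ps->ibMac = ibMin + (long)cb;
        if (ibFileOffset >= ps->ibMac)
            return -1; /* short read did not reach the requested byte */
    }

    return ps->rgb[ibFileOffset - ps->ibMin];
}
```

The decoder would then just call `SourceByteAt(&src, ib)` for whatever offset it needs, including `ib - 1` when it has to back up, without knowing where the window boundaries are.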

                  Of course, if you simply map the entire file all at once, the issue becomes
                  trivial. So this may or may not be a moot point. You don't seem to be
                  basing your architectural decisions on correct information about how file
                  mapping works, so maybe understanding correctly how file mapping works you
                  will find all of this "map a subset of the file" stuff becomes irrelevant.
                  [...]
                  So, if I understand correctly what you wrote, I am not concerned with
                  mapping file at once, I reserve all VM I will need for one file
                  (CreateFileMapping). But I am concerned when it comes to commit
                  (MapViewOfFile) because that's where memory resources are really
                  consumed. Am I missing something?
                  Yes, I think so. See above. :)

                  Pete



                  • marcin.rzeznicki@gmail.com

                    #24
                    Re: Decoding strategy


                    Peter Duniho wrote:
                    <marcin.rzeznicki@gmail.com> wrote in message
                    news:1160663537.474472.154060@b28g2000cwb.googlegroups.com...
                    That's the same, but under different names. CreateFileMapping reserves
                    VM range.
                    >
                    That is incorrect. The virtual memory range is not reserved until you call
                    MapViewOfFile.
                    >
                    [...] It is not yet committed, and you pay almost no
                    resources-usage/performance price. MapViewOfFile commits some part of
                    previously reserved VM and brings contents of file (maybe lazily, I
                    don't know for sure)
                    >
                    That is also incorrect. MapViewOfFile reserves the virtual address space.
                    There may be some caching, but otherwise committing the file data to
                    physical RAM does not occur until a specific portion of the reserved virtual
                    address space is referenced.
                    >
                    Methinks that we are giving the same subject different names. Here is a
                    quotation from MSDN:

                    the address range is reserved with the function CreateFileMapping until
                    portions are requested via a call to function MapViewOfFile. This
                    permits applications to map a large file (it is possible to load a file
                    1 GB in size in Windows NT) to a specific range of addresses without
                    having to load the entire file into memory. Instead, portions (views)
                    of the file can be loaded on demand directly to the reserved address
                    space.


                    [...]
                    >
                    Your worries about memory mapping the entire file causing some serious
                    problem with disk swapping are unfounded.
                    >
                    Yes, it seems that I was wrong. I'll have to rethink design once again.

                    [...]
                    Assume you have some code that attempts to retrieve a byte from a specific
                    file offset. Assume also that you have some code that translates this into
                    access from your mapped view of the file. Finally, assume that the
                    higher-level code is trying to access a byte that is just before the lowest
                    file offset currently being mapped.
                    >
                    In pseudocode then:
                    >
                    // The desired byte offset from the file
                    long ibFileOffset;
                    // This is the mapped range, "Min" inclusive, "Mac" exclusive
                    long ibMappedMin, ibMappedMac;
                    // The resulting offset within the mapped range
                    long ibMappedOffset;
                    >
                    if (ibFileOffset < ibMappedMin || ibFileOffset >= ibMappedMac)
                    {
                    // remap file so that ibMappedMin <= ibFileOffset and
                    // ibFileOffset < ibMappedMac. Don't forget to make sure
                    // that ibMappedMin and ibMappedMac remain between 0 and
                    // the total file length.
                    }
                    >
                    ibMappedOffset = ibFileOffset - ibMappedMin;
                    return *(pbMappedData + ibMappedOffset);
                    >
                    >
                    But, is there any difference if it seems that mapping whole file at
                    once will do?

                    [...]
                    Of course, if you simply map the entire file all at once, the issue becomes
                    trivial. So this may or may not be a moot point. You don't seem to be
                    basing your architectural decisions on correct information about how file
                    mapping works, so maybe understanding correctly how file mapping works you
                    will find all of this "map a subset of the file" stuff becomes irrelevant.
                    >
                    Yes, and that's the whole point. You are perfectly right about MMF, I
                    shouldn't have worried about tearing, because I am able to map file at
                    once and rely on OS when it comes to swapping.
                    [...]
                    So, if I understand correctly what you wrote, I am not concerned with
                    mapping file at once, I reserve all VM I will need for one file
                    (CreateFileMapping). But I am concerned when it comes to commit
                    (MapViewOfFile) because that's where memory resources are really
                    consumed. Am I missing something?
                    >
                    Yes, I think so. See above. :)
                    >
                    Thank you very much. You clarified this whole mapping issue for me :-)
                    Thanks once again


                    • Peter Duniho

                      #25
                      Re: Decoding strategy

                      <marcin.rzeznicki@gmail.com> wrote in message
                      news:1161037786.793439.255940@m73g2000cwd.googlegroups.com...
                      Methinks that we are giving the same subject different names. Here is
                      quotation from MSDN: [...]
                      I'm not sure of that. You still seem to believe that CreateFileMapping
                      affects the use of the virtual address space, and you still seem to believe
                      that calling MapViewOfFile affects the use of physical memory.

                      As far as this specific misunderstanding goes, IMHO you should be very
                      careful about believing a statement found in a general article, rather than
                      specific comments found in the documentation for the functions you're trying
                      to understand. In particular, the comments found in the documentation for
                      CreateFileMapping, MapViewOfFile, and MapViewOfFileEx trump any other
                      documentation you might find, unless you have independent confirmation that
                      suggests otherwise.

                      In this case, I am aware of no other independent confirmation. It seems
                      most likely to me that the article is simply mentioning in passing some
                      behavior of the functions that may or may not be relevant to your use.

                      If you look at the documentation for CreateFileMapping, you'll note that
                      there is a way of calling it for a mapping not backed by a specific disk
                      file. In this use, it may be true that CreateFileMapping reserves a virtual
                      address range. However, that doesn't mean that that's what the function
                      does in all cases.

                      As I've pointed out, the behavior of CreateFileMapping and MapViewOfFile(Ex)
                      is specifically documented contrary to your understanding and contrary to
                      the article you've referenced. In particular, it would make no sense:

                      1) That CreateFileMapping could reserve any virtual address space, when
                      the whole point of the MapViewOfFileEx function is to specify a specific
                      virtual address at which to map the file. If CreateFileMapping had already
                      reserved virtual address space, then there would be no way to ask for a
                      specific virtual address later, as CreateFileMapping would have already
                      determined the mapped virtual address (you can't reserve a range of virtual
                      address space without knowing where in the virtual address space it is), or

                      2) That CreateFileMapping can reserve virtual address space before
                      knowing how much address space to reserve. When you call CreateFileMapping,
                      you tell it the full extent of the file you wish to map. It is perfectly
                      legal for this extent to be larger than 2GB. How would CreateFileMapping
                      reserve virtual address space in this case? Does it pick an arbitrary
                      length for the range? What happens when you ask to map more than the
                      arbitrary length it chose? No, I think it much more likely that the
                      documentation is correct and that virtual address space is not reserved
                      until you call MapViewOfFile(Ex).

                      By the way, you should be able to use the VirtualXXX functions or possibly
                      performance counters to confirm the behavior. I haven't looked closely at
                      what's available, but I'm sure there's some mechanism for querying the state
                      of the process's virtual memory. In particular, if the available virtual
                      memory drops across the call to CreateFileMapping by the size of the
                      mapping you've requested, then that would support your interpretation
                      that CreateFileMapping is reserving virtual address space.

                      I suspect you'll find that a large change in the virtual memory available
                      happens only after MapViewOfFile. :)
                      [...]
                      But, is there any difference if it seems that mapping whole file at
                      once will do?
                      No, I don't think so. If it is suitable for your needs to map the entire
                      file at once, then any issues related to windowing the file simply go away.
                      Yes, and that's the whole point. You are perfectly right about MMF, I
                      shouldn't have worried about tearing, because I am able to map file at
                      once and rely on OS when it comes to swapping.
                      Indeed. :)
                      Thank you very much. You clarified this whole mapping issue for me :-)
                      Thanks once again
                      You're very welcome. I only regret that it seems as though none of this
                      thread has anything to do with C#. :)

                      Pete



                      • marcin.rzeznicki@gmail.com

                        #26
                        Re: Decoding strategy


                        Peter Duniho wrote:
                        <marcin.rzeznicki@gmail.com> wrote in message
                        news:1161037786.793439.255940@m73g2000cwd.googlegroups.com...
                        You're very welcome. I only regret that it seems as though none of this
                        thread has anything to do with C#. :)
                        It has; did you forget that I implemented this in C#? :-)))
                        >
                        Pete
