parsing an ifstream to get some specific text

This topic is closed.
  • toton

    parsing an ifstream to get some specific text

    Hi,
    I have some ASCII files containing formatted text, and I want to read
    only certain sections of a file.
    To do that, I index the sections (each one is marked by .START in the
    file) by their location,
    and then parse only the section I need.

    The file is something like,

    ... DATAS
    ....
    .START
    ...
    ....
    .START
    ...
    .....
    etc.
    I need to parse the data between two .START markers only when that
    section is needed. I don't load all of the data into memory at once, as
    the file is big, 4MB~20MB in size.
    To mark all of the .START lines I scan the file once, just checking for
    .START and recording each position; when the detailed data is actually
    needed, I seek to the recorded position and parse from there.

    For quick parsing, I do
    std::string currentLine;
    // Test the result of getline itself, so a failed read is never processed.
    while (std::getline(_stream, currentLine)) {
        currentLine = utils::trim(currentLine); // removes whitespace from front & back
        if (currentLine == ".START") {
            _pos.push_back(_stream.tellg()); // position just after the .START line
        }
    }
    But this code runs slower than I expect. Can anything be done better
    here, such as some buffering in the stream?

    abir

  • Ondra Holub

    #2
    Re: parsing an ifstream to get some specific text


    The input stream already buffers its reads. Your operating system
    probably caches files as well, so buffering should not be the problem.

    I have some ideas which could help:
    - You should parse the file in one pass. It is faster than two-pass
    parsing, and it also lets you read data from standard input or pipes.
    - Where do you store the positions (what is the type of _pos)? It should
    be a list, queue or stack, not a vector.
    - You can treat the input as a binary file (on many systems this is no
    different from a text file, but on Windows, for example, it is), use the
    read method to fill a buffer, and search for ".START" yourself. [In
    fact I do not believe it will make a big difference.]
    - Although any assumption like "something will probably not exceed xyz
    MB of memory" is wrong, you can place the data in memory and process it
    there (20MB is not such a big amount unless you are working on an
    embedded system).
    - You can use a system-dependent solution: a memory-mapped file.
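
As a rough illustration of the binary read-and-search idea in the list above, here is a minimal sketch. The names is_start_marker and index_sections and the 64 KB default chunk size are assumptions made for this example, not anything from the original code. It reads the stream in fixed-size chunks, rebuilds lines across chunk boundaries, and records the byte offset that follows each ".START" line.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Hypothetical helper: true if `line`, ignoring surrounding blanks, is ".START".
// '\r' counts as a blank so CRLF files read in binary mode also match.
bool is_start_marker(const std::string& line) {
    const char* blanks = " \t\r";
    const std::size_t first = line.find_first_not_of(blanks);
    if (first == std::string::npos) return false;
    const std::size_t last = line.find_last_not_of(blanks);
    return line.compare(first, last - first + 1, ".START") == 0;
}

// Reads `in` in chunks of `chunk_size` bytes and returns the byte offset that
// follows each ".START" line, i.e. where that section's data begins.
std::vector<std::size_t> index_sections(std::istream& in,
                                        std::size_t chunk_size = 64 * 1024) {
    assert(chunk_size > 0);
    std::vector<std::size_t> offsets;
    std::vector<char> chunk(chunk_size);
    std::string line;          // current line; may span chunk boundaries
    std::size_t consumed = 0;  // bytes processed so far

    // read() fails on the last, partial chunk, so gcount() is checked as well.
    while (in.read(chunk.data(), static_cast<std::streamsize>(chunk.size())) ||
           in.gcount() > 0) {
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < got; ++i) {
            ++consumed;
            if (chunk[i] == '\n') {
                if (is_start_marker(line)) offsets.push_back(consumed);
                line.clear();
            } else {
                line.push_back(chunk[i]);
            }
        }
    }
    // A final ".START" line without a trailing newline still counts.
    if (is_start_marker(line)) offsets.push_back(consumed);
    return offsets;
}
```

The stream would be opened with std::ios::binary so that the offsets are plain byte counts on every platform. Whether this beats the getline loop is something only a measurement on the real files can tell.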


    • toton

      #3
      Re: parsing an ifstream to get some specific text


      Ondra Holub wrote:
      > The input stream already buffers its reads. Your operating system
      > probably caches files as well, so buffering should not be the problem.
      >
      > I have some ideas which could help:
      > - You should parse the file in one pass. It is faster than two-pass
      > parsing, and it also lets you read data from standard input or pipes.
      > - Where do you store the positions (what is the type of _pos)? It should
      > be a list, queue or stack, not a vector.
      _pos is a std::vector<pos_type>. Again, pos_type is usually int, so _pos
      can also be treated as std::vector<int>.
      I am using a pseudo two-pass parse. In the first pass I only record the
      location (in bytes, as returned by tellg()) of each .START. The second
      pass is needed only when someone wants to parse the data between two
      .START markers, so I can jump straight to the recorded location using
      seekg().
      Usually with an XML type of file I can quickly jump to a particular
      element without going into the detail of other elements. Here the format
      is somewhat different, so I am building a positional reference (in
      bytes) for the sections marked by .START, and storing it for later
      parsing.
      Here the I/O is done twice, but loading a 20 MB file is even slower. And
      the second I/O pass may not cover the whole file; for example, I may
      parse only one section out of the 20 marked by .START.
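
For readers following along, the second pass described above might look roughly like this. It is only a sketch: read_section is a made-up name, and it assumes the recorded position points just past a .START line and that marker lines contain nothing but ".START".

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>
#include <vector>

// Hypothetical helper: returns the lines of the section that begins at `start`,
// stopping at the next ".START" line or at end of input.
std::vector<std::string> read_section(std::istream& in, std::streampos start) {
    // tellg() reports failure as -1, so reject a position that was never valid.
    assert(static_cast<std::streamoff>(start) >= 0);

    std::vector<std::string> lines;
    in.clear();  // drop any eofbit/failbit left over from the indexing pass
    if (!in.seekg(start)) return lines;  // jump straight to the recorded position

    std::string line;
    while (std::getline(in, line)) {
        if (line == ".START") break;  // exact match; trim first if the file needs it
        lines.push_back(line);
    }
    return lines;
}
```

The call to clear() matters because the indexing loop runs until the stream fails at end of file, and seekg() does nothing useful on a stream that is still in a failed state.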
      > - You can treat the input as a binary file (on many systems this is no
      > different from a text file, but on Windows, for example, it is), use the
      > read method to fill a buffer, and search for ".START" yourself. [In
      > fact I do not believe it will make a big difference.]
      My primary system is Windows :(
      I have an estimate, in bytes, of how much buffer I may need to reach the
      next .START. Can the buffer size be set for the stream, or is it totally
      implementation/OS dependent?
      > - Although any assumption like "something will probably not exceed xyz
      > MB of memory" is wrong, you can place the data in memory and process it
      > there (20MB is not such a big amount unless you are working on an
      > embedded system).
      This is what I want, but done automatically: instead of loading a fixed
      number of bytes into a buffer myself, let the stream load the bytes under
      the hood. As you mentioned, it may be doing that already; I only want to
      control the size.
      > - You can use a system-dependent solution: a memory-mapped file.
      I don't know of any C++ library for it. Boost does not provide
      memory-mapped files either.


      • Ondra Holub

        #4
        Re: parsing an ifstream to get some specific text

        toton wrote:
        >> The input stream already buffers its reads. Your operating system
        >> probably caches files as well, so buffering should not be the problem.
        >>
        >> I have some ideas which could help:
        >> - You should parse the file in one pass. It is faster than two-pass
        >> parsing, and it also lets you read data from standard input or pipes.
        >> - Where do you store the positions (what is the type of _pos)? It should
        >> be a list, queue or stack, not a vector.
        > _pos is a std::vector<pos_type>. Again, pos_type is usually int, so _pos
        > can also be treated as std::vector<int>.
        Yes, a vector works from the functional point of view, but it may be
        less efficient for this kind of use: a vector has some preallocated
        amount of memory, and when that is exceeded it must reallocate, which
        may mean copying the items from the old area to the new one. A list
        does not need that. That's why I suggested not using a vector.
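
If the vector is kept anyway, the reallocation described above can be sidestepped when a rough estimate of the number of sections is available, by reserving the capacity up front. The sketch below only illustrates that behaviour; count_reallocations is a made-up helper and long stands in for the real position type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: appends `count` offsets after reserving room for
// `expected` of them, and returns how many times the capacity changed,
// i.e. how many times the vector had to reallocate while being filled.
std::size_t count_reallocations(std::size_t count, std::size_t expected) {
    std::vector<long> pos;
    pos.reserve(expected);
    assert(pos.capacity() >= expected);

    std::size_t reallocations = 0;
    std::size_t capacity = pos.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        pos.push_back(static_cast<long>(i));
        if (pos.capacity() != capacity) {  // storage was reallocated
            ++reallocations;
            capacity = pos.capacity();
        }
    }
    return reallocations;
}
```

With a good estimate, count_reallocations(1000, 1000) reports zero reallocations. Without one the vector still grows in steps rather than on every insertion, so the copying cost is spread out.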
        > I am using a pseudo two-pass parse. In the first pass I only record the
        > location (in bytes, as returned by tellg()) of each .START. The second
        > pass is needed only when someone wants to parse the data between two
        > .START markers, so I can jump straight to the recorded location using
        > seekg().
        > Usually with an XML type of file I can quickly jump to a particular
        > element without going into the detail of other elements.
        It is similar to parsing XML. XML is usually parsed with either a
        DOM-like parser or a SAX parser.

        A DOM parser (typically) loads the whole document into memory and then
        works with it. You can then simply access any element, but the data is
        held in memory. It is simpler to work with, but less efficient for
        large documents.

        A SAX parser (typically) reads the document and, while reading, calls
        methods which process the data just read. It is not as simple to use
        as DOM, but it is better and more efficient for large documents.
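
Applied to the .START format, a SAX-like single pass could be sketched as below. The names LineHandler and parse_sections are invented for this example, and it assumes C++11 (for std::function and lambdas) and marker lines containing exactly ".START".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <string>

// Hypothetical callback type: receives the zero-based section number and one
// line belonging to that section.
using LineHandler = std::function<void(std::size_t, const std::string&)>;

// Reads `in` exactly once and hands every section line to `on_line` as it is
// read. Lines before the first ".START" are skipped. Returns the number of
// sections seen.
std::size_t parse_sections(std::istream& in, const LineHandler& on_line) {
    assert(static_cast<bool>(on_line));  // a handler is required
    std::size_t sections = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line == ".START") {
            ++sections;                   // a new section begins here
        } else if (sections > 0) {
            on_line(sections - 1, line);  // process the line immediately
        }
    }
    return sections;
}
```

Nothing is stored unless the handler chooses to store it, which is why this style also works on pipes and standard input, where seeking back is not possible.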
        > Here the format
        > is somewhat different, so I am building a positional reference (in
        > bytes) for the sections marked by .START, and storing it for later
        > parsing.
        > Here the I/O is done twice, but loading a 20 MB file is even slower. And
        > the second I/O pass may not cover the whole file; for example, I may
        > parse only one section out of the 20 marked by .START.
        >> - You can treat the input as a binary file (on many systems this is no
        >> different from a text file, but on Windows, for example, it is), use the
        >> read method to fill a buffer, and search for ".START" yourself. [In
        >> fact I do not believe it will make a big difference.]
        > My primary system is Windows :(
        > I have an estimate, in bytes, of how much buffer I may need to reach the
        > next .START. Can the buffer size be set for the stream, or is it totally
        > implementation/OS dependent?
        You could deal with filebuf directly (or implement your own class
        derived from streambuf), but I do not think it would be useful (too
        much effort for no big effect).

        If you do not use C files (FILE* from stdio.h or cstdio), you can
        disable the synchronization of C++ iostreams with FILE* by calling
        std::ios_base::sync_with_stdio(false). Note that this setting applies
        to the standard streams (std::cin, std::cout and so on). If you do it,
        you take on the responsibility that nobody mixes FILE* I/O with those
        streams (not even a library).
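
On the buffer-size question, one thing that can be tried without writing a streambuf is pubsetbuf on the stream's own filebuf. The sketch below is hedged accordingly: the standard leaves the effect of a non-null buffer implementation-defined, so an implementation may ignore the request, and count_lines_buffered is just a made-up name for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: counts the lines of the file at `path` while asking the
// file buffer to use `buffer_size` bytes of caller-provided storage.
std::size_t count_lines_buffered(const std::string& path, std::size_t buffer_size) {
    assert(buffer_size > 0);
    // Declared before the stream so that it outlives the stream's use of it.
    std::vector<char> buffer(buffer_size);

    std::ifstream in;
    // Make the request before open(); whether it is honoured is
    // implementation-defined.
    in.rdbuf()->pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    in.open(path.c_str(), std::ios::in | std::ios::binary);

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) ++lines;
    return lines;
}
```

Timing the indexing loop with a few different sizes on the target compiler is the only reliable way to see whether the setting has any effect there.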
        >> - Although any assumption like "something will probably not exceed xyz
        >> MB of memory" is wrong, you can place the data in memory and process it
        >> there (20MB is not such a big amount unless you are working on an
        >> embedded system).
        > This is what I want, but done automatically: instead of loading a fixed
        > number of bytes into a buffer myself, let the stream load the bytes under
        > the hood. As you mentioned, it may be doing that already; I only want to
        > control the size.
        >> - You can use a system-dependent solution: a memory-mapped file.
        > I don't know of any C++ library for it. Boost does not provide
        > memory-mapped files either.
        There is no such standard C++ library; you have to use the API of your
        OS, or some library which supports many platforms and wraps the
        platform-dependent code in its functions (for example ACE).
