parsing a file..

  • broli

    parsing a file..

    I need to parse a file which has about 2000 lines, and I'm being
    told that reading the file in ascii would be a slower way to do it,
    and so I need to resort to binary by reading it in large chunks. Can
    anyone please explain what this is all about?
  • Richard Heathfield

    #2
    Re: parsing a file..

    broli said:
    I need to parse a file which has about 2000 lines and I'm getting
    told that reading the file in ascii would be a slower way to do it and
    so i need to resort to binary by reading it in large chunks. Can any
    one please explain what is all this about ?
    Someone's pulling your leg. 2000 lines of text is nothing. Just write the
    program so that it's clear, correct, and easy to understand. Then, if and
    only if it's too slow (and you should define the "fast enough"/"too slow"
    boundary before you start writing the program), it's time to think about
    how it might be made faster.

    --
    Richard Heathfield <http://www.cpax.org.uk>
    Email: -http://www. +rjh@
    Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
    "Usenet is a strange place" - dmr 29 July 1999


    • Richard Heathfield

      #3
      Re: parsing a file..

      broli said:

      <snip>
      But then I
      was told that "normally we don't read scientific data in ascii for
      accuracy and speed concerns", which made me wonder what was so wrong?
      The statement!
      I could parse 2000 lines in hardly any time and there was no problem
      with ascii either.
      Right. Someone's pulling your leg, or is overly concerned with efficiency
      at the expense of development time and clarity. That isn't to say that
      efficiency isn't important. But let's just pretend, for the sake of
      argument, that you write it /both/ ways, and then you measure. You
      discover that the "binary" technique takes 0.025 seconds to process the
      2000 data groups, whereas the "text" version takes 0.075 seconds - three
      times slower! Surely this is a triumph for binary!

      Yeah, right, but who cares? You press ENTER, and then it takes you 0.1
      seconds to look up at the screen, and everything's finished, no matter
      which one you ran.

      Write it clear, simple, and correct. Then worry about speed if and only if
      you have to.



      • Richard Tobin

        #4
        Re: parsing a file..

        In article <4e0df786-3196-4efc-a4f7-3b86d07e75b5@s19g2000prg.googlegroups.com>,
        broli <Broli00@gmail.com> wrote:
        >I need to parse a file which has about 2000 lines and I'm getting
        >told that reading the file in ascii would be a slower way to do it and
        >so i need to resort to binary by reading it in large chunks. Can any
        >one please explain what is all this about ?
        Reading in large chunks is unrelated to whether it's binary or
        ascii. Perhaps they meant that character-at-a-time reading with
        getchar() is slow, which it is on some systems. You can perfectly
        well use fread() on text files.

        -- Richard



        --
        :wq


        • Richard Heathfield

          #5
          Re: parsing a file..

          Chris Dollin said:
          Richard Heathfield wrote:
          >
          <snip>
          >>
          >Someone's pulling your leg. 2000 lines of text is nothing. Just write
          >the program so that it's clear, correct, and easy to understand. Then,
          >if and only if it's too slow (and you should define the "fast
          >enough"/"too slow" boundary before you start writing the program), it's
          >time to think about how it might be made faster.
          >
          I agree that speed is unlikely to be a factor -- but accuracy may be.
          Possibly, but that comes under correctness, not performance.

          <snip>
          After all, if they want to read those 2000 lines 1000 times per second
          ...
          ...and that is covered by "fast enough/too slow". Again, I would emphasise
          that the first priority is to make the program *clear* (because it's
          easier to make a clear program correct than to make a correct program
          clear). The second priority (and a sine qua non, obviously) is to make the
          program *correct*. When and only when it works, it's time to worry about
          speed. (This obviously does *not* mean that one should intentionally adopt
          gross algorithmic inefficiencies.)



          • broli

            #6
            Re: parsing a file..

            Richard Heathfield,

            There are many modules involved in my software package and this is
            just one of them. My software would also involve a huge number of
            calculations, searching, memory allocation, etc., but the thing is
            that I have to parallelize the software code to run on different
            machines anyway. Even if speed is an issue, I doubt that reading a
            file in ascii or "binary" would make a huge impact overall.


            • Richard Heathfield

              #7
              Re: parsing a file..

              broli said:

              <snip>
              But when I use fgets() then wouldn't I get a string
              of characters (also many tabs, null character etc) ?
              Yes.
              Wouldn't it be a
              difficult task to convert an array of characters into double type
              floating numbers again ?
              I don't see that you have any choice. If what you've described is correct,
              the numbers are already in text form. Converting is easy enough, though,
              using strtod.
              I think using fread will make it very fast
              (considering that it allows you to read as many bytes of data at a
              time as you want), but once again I'm not very adept at file
              handling, just at the beginning stages.
              It's very likely that the input stream is buffered, so it won't actually
              make much, if any, difference.



              • Richard

                #8
                Re: parsing a file..

                richard@cogsci.ed.ac.uk (Richard Tobin) writes:
                In article <4e0df786-3196-4efc-a4f7-3b86d07e75b5@s19g2000prg.googlegroups.com>,
                broli <Broli00@gmail.com> wrote:
                >
                >>I need to parse a file which has about 2000 lines and I'm getting
                >>told that reading the file in ascii would be a slower way to do it and
                >>so i need to resort to binary by reading it in large chunks. Can any
                >>one please explain what is all this about ?
                >
                Reading in large chunks is unrelated to whether it's binary or
                ascii.
                I would question that statement. Reading in binary will be a LOT
                faster, if it's the same platform, for reading in the same number
                of readings.
                Perhaps they meant that character-at-a-time reading with
                getchar() is slow, which it is on some systems. You can perfectly
                well use fread() on text files.
                The text file will be larger. There is a need to parse the ascii text
                into the destination formats.

                It will be slower in the great majority of cases.
                >
                -- Richard


                • Chris Dollin

                  #9
                  Re: parsing a file..

                  Richard wrote:
                  richard@cogsci.ed.ac.uk (Richard Tobin) writes:
                  >
                  >In article <4e0df786-3196-4efc-a4f7-3b86d07e75b5@s19g2000prg.googlegroups.com>,
                  >broli <Broli00@gmail.com> wrote:
                  >>
                  >>>I need to parse a file which has about 2000 lines and I'm getting
                  >>>told that reading the file in ascii would be a slower way to do it and
                  >>>so i need to resort to binary by reading it in large chunks. Can any
                  >>>one please explain what is all this about ?
                  >>
                  >Reading in large chunks is unrelated to whether it's binary or
                  >ascii.
                  >
                  I would question that statement. Reading in binary will be a LOT faster
                  ,if its the same platform. for reading in the same NUMBER of
                  readings.
                  >
                  > Perhaps they meant that character-at-a-time reading with
                  >getchar() is slow, which it is on some systems. You can perfectly
                  >well use fread() on text files.
                  >
                  The text file will be larger. There is a need to parse the ascii text
                  into the destination formats.
                  >
                  It will be slower in the great majority of cases.
                  Quick test, one file, 2000 lines, each line with two floats (1.12345
                  and 7.890), about 28Kb total.

                  One single big-enough fread:

                  real 0m0.002s
                  user 0m0.000s
                  sys 0m0.001s

                  Repeat fscanf( ... "%lf %lf" ... ) until EOF:

                  real 0m0.004s
                  user 0m0.002s
                  sys 0m0.002s

                  Yes, in this test it's twice as slow. The data file is probably
                  cached (it's been read several other times already as I /cough/
                  debugged my code). It includes program start-up time (I just did
                  `time ./a.out` to get the numbers) so the actual reading time will
                  be less.

                  Myself I wouldn't count that as "LOTS faster" for binary data,
                  but doubtless there are applications where it is so counted;
                  I don't think the OP's case is one of them, and it does look as
                  though he's reading a text file anyway.

                  --
                  "Creation began." - James Blish, /A Clash of Cymbals/

                  Hewlett-Packard Limited registered office: Cain Road, Bracknell,
                  registered no: 690597 England Berks RG12 1HN


                  • Richard Tobin

                    #10
                    Re: parsing a file..

                    In article <frdskf$hi9$1@registered.motzarella.org>,
                    Richard <devr_@gmail.com> wrote:
                    >Reading in large chunks is unrelated to whether it's binary or
                    >ascii.
                    >I would question that statement. Reading in binary will be a LOT faster
                    >,if its the same platform. for reading in the same NUMBER of
                    >readings.
                    I didn't say whether it's in binary is unrelated to *speed*.

                    I meant: there are two separate issues; whether you read it in large
                    chunks, and whether it's binary. You can read each of text or binary
                    in small or large chunks. Each of these choices will separately affect
                    the speed.

                    -- Richard
                    --
                    :wq


                    • Richard Bos

                      #11
                      Re: parsing a file..

                      richard@cogsci.ed.ac.uk (Richard Tobin) wrote:
                      In article <frdskf$hi9$1@registered.motzarella.org>,
                      Richard <devr_@gmail.com> wrote:
                      >
                      Reading in large chunks is unrelated to whether it's binary or
                      ascii.
                      >
                      I would question that statement. Reading in binary will be a LOT faster
                      ,if its the same platform. for reading in the same NUMBER of
                      readings.
                      >
                      I didn't say whether it's in binary is unrelated to *speed*.
                      >
                      I meant: there are two separate issues; whether you read it in large
                      chunks, and whether it's binary. You can read each of text or binary
                      in small or large chunks. Each of these choices will separately affect
                      the speed.
                      Besides, he _has_ a text file. Yes, it's a lot larger than a binary file
                      would be, and therefore slower to read. But the fact that the _file_ is
                      text is not the OP's doing. Reading this file as text or as binary won't
                      make a large difference. _Writing_ it as a binary file would have; but
                      that's not something the OP can do.

                      Richard


                      • Bartc

                        #12
                        Re: parsing a file..


                        "Chris Dollin" <chris.dollin@hp.com> wrote in message
                        news:frdvhf$cl4$1@news-pa1.hpl.hp.com...
                        Richard wrote:
                        >
                        >richard@cogsci.ed.ac.uk (Richard Tobin) writes:
                        >>
                        >>In article
                        >><4e0df786-3196-4efc-a4f7-3b86d07e75b5@s19g2000prg.googlegroups.com>,
                        >>broli <Broli00@gmail.com> wrote:
                        >>>
                        >>>>I need to parse a file which has about 2000 lines and I'm getting
                        >>>>told that reading the file in ascii would be a slower way to do it and
                        >>>>so i need to resort to binary by reading it in large chunks. Can any
                        >>>>one please explain what is all this about ?
                        >>>
                        >>Reading in large chunks is unrelated to whether it's binary or
                        >>ascii.
                        >>
                        >I would question that statement. Reading in binary will be a LOT faster
                        >,if its the same platform. for reading in the same NUMBER of
                        >readings.
                        >>
                        Quick test, one file, 2000 lines, each line with two floats (1.12345
                        and 7.890), about 28Kb total.
                        >
                        One single big-enough fread:
                        >
                        real 0m0.002s
                        user 0m0.000s
                        sys 0m0.001s
                        >
                        Repeat fscanf( ... "%lf %lf" ... ) until EOF:
                        >
                        real 0m0.004s
                        user 0m0.002s
                        sys 0m0.002s
                        >
                        Yes, in this test it's twice as slow. The data file is probably
                        cached (it's been read several other times already as I /cough/
                        My own tests:

                        (A) 100,000 lines of text, each with 3 doubles (2900000 bytes):

                        2.1 seconds to read a number at a time, using sscanf() (but I use a wrapper
                        or two with some extra overhead)

                        (B) The same data as 300,000 doubles written as binary (2400000 bytes):

                        0.8 seconds to read a number at a time, using fread() 8 bytes at a time

                        (C) Same binary data as (B)

                        0.004 seconds to read as a single block into memory (possibly straight
                        into the array or whatever data structure is used). Using fread() on
                        2400000 bytes.

                        So about 200-500 times faster in binary mode, when done properly.

                        --
                        Bart




                        • Richard

                          #13
                          Re: parsing a file..

                          richard@cogsci.ed.ac.uk (Richard Tobin) writes:
                          In article <frdskf$hi9$1@registered.motzarella.org>,
                          Richard <devr_@gmail.com> wrote:
                          >
                          >>Reading in large chunks is unrelated to whether it's binary or
                          >>ascii.
                          >
                          >>I would question that statement. Reading in binary will be a LOT faster
                          >>,if its the same platform. for reading in the same NUMBER of
                          >>readings.
                          >
                          I didn't say whether it's in binary is unrelated to *speed*.
                          I'm not sure that parses :-;
                          >
                          I meant: there are two separate issues; whether you read it in large
                          chunks, and whether it's binary. You can read each of text or binary
                          in small or large chunks. Each of these choices will separately affect
                          the speed.
                          Yes, I agree.
                          >
                          -- Richard


                          • Willem

                            #14
                            Re: parsing a file..

                            Bartc wrote:
                            ) My own tests:
                            )
                            ) (A) 100,000 lines of text, each with 3 doubles (2900000 bytes):
                            )
                            ) 2.1 seconds to read a number at a time, using sscanf() (but I use a wrapper
                            ) or two with some extra overhead)
                            )
                            ) (B) The same data as 300,000 doubles written as binary (2400000 bytes):
                            )
                            ) 0.8 seconds to read a number at a time, using fread() 8 bytes at a time
                            )
                            ) (C) Same binary data as (B)
                            )
                            ) 0.004 seconds to read as a single block into memory (possibly straight into
                            ) the array or whatever datastructure is used). Using fread() on 2400000
                            ) bytes.
                            )
                            ) So about 200-500 times faster in binary mode, when done properly.

                            Have you tried reading the text file into memory as a single block
                            and then using sscanf() to parse it ?


                            SaSW, Willem
                            --
                            Disclaimer: I am in no way responsible for any of the statements
                            made in the above text. For all I know I might be
                            drugged or something..
                            No I'm not paranoid. You all think I'm paranoid, don't you !
                            #EOT


                            • Richard

                              #15
                              Re: parsing a file..

                              Chris Dollin <chris.dollin@hp.com> writes:
                              Richard wrote:
                              >
                              >richard@cogsci.ed.ac.uk (Richard Tobin) writes:
                              >>
                              >>In article <4e0df786-3196-4efc-a4f7-3b86d07e75b5@s19g2000prg.googlegroups.com>,
                              >>broli <Broli00@gmail.com> wrote:
                              >>>
                              >>>>I need to parse a file which has about 2000 lines and I'm getting
                              >>>>told that reading the file in ascii would be a slower way to do it and
                              >>>>so i need to resort to binary by reading it in large chunks. Can any
                              >>>>one please explain what is all this about ?
                              >>>
                              >>Reading in large chunks is unrelated to whether it's binary or
                              >>ascii.
                              >>
                              >I would question that statement. Reading in binary will be a LOT faster
                              >,if its the same platform. for reading in the same NUMBER of
                              >readings.
                              >>
                              >> Perhaps they meant that character-at-a-time reading with
                              >>getchar() is slow, which it is on some systems. You can perfectly
                              >>well use fread() on text files.
                              >>
                              >The text file will be larger. There is a need to parse the ascii text
                              >into the destination formats.
                              >>
                              >It will be slower in the great majority of cases.
                              >
                              Quick test, one file, 2000 lines, each line with two floats (1.12345
                              and 7.890), about 28Kb total.
                              >
                              One single big-enough fread:
                              >
                              real 0m0.002s
                              user 0m0.000s
                              sys 0m0.001s
                              >
                              Repeat fscanf( ... "%lf %lf" ... ) until EOF:
                              >
                              real 0m0.004s
                              user 0m0.002s
                              sys 0m0.002s
                              >
                              Yes, in this test it's twice as slow. The data file is probably
                              cached (it's been read several other times already as I /cough/
                              debugged my code). It includes program start-up time (I just did
                              `time ./a.out` to get the numbers) so the actual reading time will
                              be less.
                              >
                              Myself I wouldn't count that as "LOTS faster" for binary data,
                              but doubtless there are applications where it is so counted;
                              I don't think the OPs case is one of them, and it does look as
                              though he's reading a text file anyway.
                              Then why not take the static noise out? Make the file a lot bigger and
                              report back.

                              But even these results do indicate quite a large % difference...

                              And we do not know how often this data sample is written or read. It
                              could be thousands of times an hour, leading to considerable
                              unnecessary overhead if using ascii over binary.
