Fast File Input

  • Scott Brady Drummonds

    Fast File Input

    Hi, everyone,

    I'm a relative novice to Python and am trying to reduce the processing time
    for a very large text file that I am reading into my Python script. I'm
    currently reading each line one at a time (readline()), stripping the
    leading and trailing whitespace (strip()) and splitting its delimited data
    (split()). For my large input files, this text processing is taking many
    hours.
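
    Roughly, the loop I have now looks like the sketch below (the file name and
    whitespace delimiter are just placeholders):

    # Sketch of the current per-line approach; 'big_input.txt' stands in for
    # the real file.
    f = open('big_input.txt')
    while 1:
        line = f.readline()
        if line == '':                     # readline() returns '' at end of file
            break
        fields = line.strip().split()      # strip whitespace, split delimited data
        # ... process fields ...
    f.close()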

    If I were working in C, I'd consider using a lower level I/O library,
    minimizing text processing, and reducing memory redundancy. However, I have
    no idea at all what to do to optimize this process in Python.

    Can anyone offer some suggestions?

    Thanks,
    Scott

    --
    Remove ".nospam" from the user ID in my e-mail to reply via e-mail.


  • P@draigBrady.com

    #2
    Re: Fast File Input

    Scott Brady Drummonds wrote:
    > Hi, everyone,
    >
    > I'm a relative novice to Python and am trying to reduce the processing time
    > for a very large text file that I am reading into my Python script. I'm
    > currently reading each line one at a time (readline()), stripping the
    > leading and trailing whitespace (strip()) and splitting its delimited data
    > (split()). For my large input files, this text processing is taking many
    > hours.
    >
    > If I were working in C, I'd consider using a lower level I/O library,
    > minimizing text processing, and reducing memory redundancy. However, I have
    > no idea at all what to do to optimize this process in Python.
    >
    > Can anyone offer some suggestions?

    This actually improved a lot with Python version 2
    but is still quite slow, as you can see here:
    http://www.pixelbeat.org/readline/
    (Comparing the performance of tools processing lines of text)

    There are a few notes within the python script there.

    Pádraig.


    • Terry Reedy

      #3
      Re: Fast File Input


      "Scott Brady Drummonds" <scott.b.drummo nds.nospam@inte l.com> wrote in
      message news:c1iit0$2jj $1@news01.intel .com...[color=blue]
      > Hi, everyone,
      >
      > I'm a relative novice to Python and am trying to reduce the processing[/color]
      time[color=blue]
      > for a very large text file that I am reading into my Python script. I'm
      > currently reading each line one at a time (readline()), stripping the
      > leading and trailing whitespace (strip()) and splitting it's delimited[/color]
      data[color=blue]
      > (split()). For my large input files, this text processing is taking many
      > hours.[/color]

      for line in file('somefile.txt'): ...
      will be faster because the file iterator reads a much larger block with
      each disk access.

      Do you really need strip()? Clipping \n off the last item after split()
      *might* be faster.
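
      Concretely, that might look something like the following sketch (the file
      name and tab delimiter here are only placeholders):

      # Sketch of both suggestions; 'somefile.txt' and '\t' are placeholders.
      for line in file('somefile.txt'):        # the file iterator reads big blocks
          fields = line.split('\t')            # split first, with no strip()
          fields[-1] = fields[-1].rstrip('\n') # clip the newline off the last item
          # ... process fields ...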

      Terry J. Reedy





      • Andrei

        #4
        Re: Fast File Input

        Scott Brady Drummonds wrote on Wed, 25 Feb 2004 08:35:43 -0800:
        > Hi, everyone,
        >
        > I'm a relative novice to Python and am trying to reduce the processing time
        > for a very large text file that I am reading into my Python script. I'm
        > currently reading each line one at a time (readline()), stripping the
        > leading and trailing whitespace (strip()) and splitting its delimited data
        > (split()). For my large input files, this text processing is taking many
        > hours.

        An easy improvement is using "for line in sometextfile:" instead of
        repeated readline() calls. Not sure how much time this will save you (it
        depends on what you're doing after reading), but it can make a difference
        at virtually no cost. You might also want to try rstrip() instead of
        strip(); I'm not sure whether it's faster, but it may be.
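
        A minimal sketch of that pattern (the file name is just a placeholder):

        # Iterate the file object directly and trim only the right-hand side.
        sometextfile = open('big_input.txt')
        for line in sometextfile:
            fields = line.rstrip().split()   # rstrip() removes only trailing whitespace
            # ... process fields ...
        sometextfile.close()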

        --
        Yours,

        Andrei

        =====
        Real contact info (decode with rot13):
        cebwrpg5@jnanqb b.ay. Fcnz-serr! Cyrnfr qb abg hfr va choyvp cbfgf. V ernq
        gur yvfg, fb gurer'f ab arrq gb PP.


        • Eddie Corns

          #5
          Re: Fast File Input

          >"Scott Brady Drummonds" <scott.b.drummo nds.nospam@inte l.com> wrote in[color=blue]
          >message news:c1iit0$2jj $1@news01.intel .com...[color=green]
          >> Hi, everyone,
          >>
          >> I'm a relative novice to Python and am trying to reduce the processing[/color]
          >time[color=green]
          >> for a very large text file that I am reading into my Python script. I'm
          >> currently reading each line one at a time (readline()), stripping the
          >> leading and trailing whitespace (strip()) and splitting it's delimited[/color]
          >data[color=green]
          >> (split()). For my large input files, this text processing is taking many
          >> hours.[/color][/color]

          If you mean delimited in the CSV sense, then I believe the csv module is
          optimised for this. It was included in 2.3, IIRC.
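
          Something along these lines, perhaps (the file name and delimiter are
          placeholders):

          # Sketch using the csv module (new in 2.3); name and delimiter are made up.
          import csv

          infile = open('big_input.txt', 'rb')   # the 2.x csv docs suggest binary mode
          for fields in csv.reader(infile, delimiter=','):
              # each row arrives already split into a list of strings
              # ... process fields ...
              pass
          infile.close()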

          Eddie


          • Skip Montanaro

            #6
            Re: Fast File Input


            Pádraig> This actually improved a lot with python version 2
            Pádraig> but is still quite slow as you can see here:
            Pádraig> http://www.pixelbeat.org/readline/
            Pádraig> There are a few notes within the python script there.

            Your page doesn't mention precisely which version of Python 2 you used. I
            suspect a rather old one (2.0? 2.1?) because of the style of loop you used
            to read from sys.stdin. Eliminating comments, your python2 script was:

            import sys

            while 1:
                line = sys.stdin.readline()
                if line == '':
                    break
                try:
                    print line,
                except:
                    pass

            Running that using the CVS version of Python, feeding it my machine's
            dictionary as input, I got this time(1) output (fastest real time of
            four runs):

            % time python readltst.py < /usr/share/dict/words > /dev/null

            real 0m1.384s
            user 0m1.290s
            sys 0m0.060s

            Rewriting it to eliminate the try/except statement (why did you have that
            there?) got it to:

            % time python readltst.py < /usr/share/dict/words > /dev/null

            real 0m1.373s
            user 0m1.270s
            sys 0m0.040s

            Further rewriting it as the more modern:

            import sys

            for line in sys.stdin:
                print line,

            yielded:

            % time python readltst2.py < /usr/share/dict/words > /dev/null

            real 0m0.660s
            user 0m0.600s
            sys 0m0.060s

            My guess is that your python2 times are probably at least a factor of 2 too
            large if you accept that people will use a recent version of Python in which
            file objects are iterators.

            Skip

