python vs. grep

  • Anton Slesarev

    python vs. grep

    I've read a great paper about generators:

    http://www.dabeaz.com/generators/index.html

    The author says that it's easy to write analogs of common Linux tools such
    as awk, grep, etc. He says that performance could be even better.

    But I have a problem writing a performant grep analog.


    Here's my script:

    import re
    pat = re.compile("sometext")

    f = open("bigfile", 'r')

    flines = (line for line in f if pat.search(line))
    c=0
    for x in flines:
        c+=1
    print c

    and bash:
    grep "sometext" bigfile | wc -l

    The Python code is 3-4 times slower on Windows. And as I remember, the
    situation is the same on Linux...

    Buffering in open even increases the time.

    Is it possible to increase file reading performance?
  • Ian Kelly

    #2
    Re: python vs. grep

    On Tue, May 6, 2008 at 1:42 PM, Anton Slesarev <slesarev.anton@gmail.com> wrote:
    > Is it possible to increase file reading performance?

    Dunno about that, but this part:

    > flines = (line for line in f if pat.search(line))
    > c=0
    > for x in flines:
    >     c+=1
    > print c

    could be rewritten as just:

    print sum(1 for line in f if pat.search(line))


    • Arnaud Delobelle

      #3
      Re: python vs. grep

      Anton Slesarev <slesarev.anton@gmail.com> writes:
      > f = open("bigfile", 'r')
      >
      > flines = (line for line in f if pat.search(line))
      > c=0
      > for x in flines:
      >     c+=1
      > print c

      It would be simpler (and probably faster) not to use a generator expression:

      search = re.compile('sometext').search

      c = 0
      for line in open('bigfile'):
          if search(line):
              c += 1

      Perhaps faster (because the number of name lookups is reduced), using
      itertools.ifilter:

      from itertools import ifilter

      c = 0
      for line in ifilter(search, open('bigfile')):
          c += 1


      If 'sometext' is just text (no regexp wildcards) then even simpler:

      ...
      for line in ...:
          if 'sometext' in line:
              c += 1

      I don't believe you'll easily beat grep + wc using Python though.

      Perhaps faster?

      sum(bool(search(line)) for line in open('bigfile'))
      sum(1 for line in ifilter(search, open('bigfile')))

      ...etc...

      All this is untested!
      --
      Arnaud
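Since the variants above are untested, here is a minimal sketch that checks they agree with each other on a small file. It is written in Python 3 syntax (the thread itself uses Python 2), so the built-in `filter` stands in for `itertools.ifilter`. The sample data, the temporary file, and the `count_*` helper names are assumptions made for illustration; nothing here measures speed.

```python
import os
import re
import tempfile

# Hypothetical sample data standing in for the thread's 'bigfile'.
sample = ["alpha sometext\n", "beta\n", "sometext gamma\n", "delta\n"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(sample)
    path = tmp.name

# Bind the bound method once, as in the post, to avoid repeated lookups.
search = re.compile("sometext").search


def count_loop(path):
    """Plain for loop with a counter."""
    c = 0
    with open(path) as f:
        for line in f:
            if search(line):
                c += 1
    return c


def count_filter(path):
    """filter() is the Python 3 counterpart of itertools.ifilter."""
    with open(path) as f:
        return sum(1 for _ in filter(search, f))


def count_bool(path):
    """Sum booleans: True counts as 1, False as 0."""
    with open(path) as f:
        return sum(bool(search(line)) for line in f)


def count_substring(path):
    """Plain substring test, valid only when the pattern has no wildcards."""
    with open(path) as f:
        return sum(1 for line in f if "sometext" in line)


counts = [fn(path) for fn in (count_loop, count_filter, count_bool, count_substring)]
os.remove(path)
print(counts)  # [2, 2, 2, 2]
```

To compare speed rather than correctness, each helper could be timed with the standard `timeit` module on a realistically large file.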


      • Wojciech Walczak

        #4
        Re: python vs. grep

        2008/5/6, Anton Slesarev <slesarev.anton@gmail.com>:
        > But I have a problem writing a performant grep analog.
        > [...]
        > The Python code is 3-4 times slower on Windows. And as I remember, the
        > situation is the same on Linux...
        >
        > Buffering in open even increases the time.
        >
        > Is it possible to increase file reading performance?

        The best advice would be not to try to beat grep, but if you really
        want to, this is the right place ;)

        Here is my code:
        $ cat grep.py
        import sys

        if len(sys.argv) != 3:
            print 'grep.py <pattern> <file>'
            sys.exit(1)

        f = open(sys.argv[2],'r')

        print ''.join((line for line in f if sys.argv[1] in line)),

        $ ls -lh debug.0
        -rw-r----- 1 gminick root 4,1M 2008-05-07 00:49 debug.0

        ---
        $ time grep nusia debug.0 |wc -l
        26009

        real 0m0.042s
        user 0m0.020s
        sys 0m0.004s
        ---

        ---
        $ time python grep.py nusia debug.0 |wc -l
        26009

        real 0m0.077s
        user 0m0.044s
        sys 0m0.016s
        ---

        ---
        $ time grep nusia debug.0

        real 0m3.163s
        user 0m0.016s
        sys 0m0.064s
        ---

        ---
        $ time python grep.py nusia debug.0
        [26009 lines here...]
        real 0m2.628s
        user 0m0.032s
        sys 0m0.064s
        ---

        So, printing the results takes 2.6 s for Python and 3.1 s for the original grep.
        Surprised? The only reason for this is that we have reduced the number
        of write calls in the Python example:

        $ strace -ooriggrep.log grep nusia debug.0
        $ grep write origgrep.log |wc -l
        26009


        $ strace -opygrep.log python grep.py nusia debug.0
        $ grep write pygrep.log |wc -l
        12


        Wish you luck saving your CPU cycles :)

        --
        Regards,
        Wojtek Walczak
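The effect of joining the matches before printing can be sketched without strace. The following Python 3 example counts Python-level `write()` calls on an in-memory stream; this is only an analogy for the `write` system calls counted above, and the `CountingWriter` class, the two `grep_*` helpers, and the sample lines are all hypothetical names introduced for illustration.

```python
import io


class CountingWriter(io.StringIO):
    """In-memory text stream that counts how often write() is called."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, s):
        self.calls += 1
        return super().write(s)


def grep_per_line(lines, needle, out):
    """Write each matching line separately, one write() per match."""
    for line in lines:
        if needle in line:
            out.write(line)


def grep_joined(lines, needle, out):
    """Join all matching lines first, then issue a single write()."""
    out.write("".join(line for line in lines if needle in line))


# Ten hypothetical log lines; the even-numbered ones contain the needle.
lines = ["nusia %d\n" % i if i % 2 == 0 else "other %d\n" % i for i in range(10)]

per_line = CountingWriter()
joined = CountingWriter()
grep_per_line(lines, "nusia", per_line)
grep_joined(lines, "nusia", joined)

print(per_line.calls, joined.calls)  # 5 1
```

Both streams end up with identical contents; only the number of calls differs. On a real terminal the number of system calls also depends on how the output stream is buffered, which this sketch does not model.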


        • Ville Vainio

          #5
          Re: python vs. grep

          On May 6, 10:42 pm, Anton Slesarev <slesarev.an...@gmail.com> wrote:
          > flines = (line for line in f if pat.search(line))

          What about re.findall() / re.finditer() for the whole file contents?
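A minimal sketch of that idea in Python 3, assuming the file fits in memory and its contents have already been read into a string (for example with `f.read()`). The sample text and the helper name `count_matching_lines` are made up for illustration. Note that `findall` counts occurrences, while grep counts lines, so the pattern is wrapped to match whole lines when a grep-like count is wanted.

```python
import re


def count_matching_lines(text, pattern):
    """Count lines that contain `pattern`, scanning the whole text at once.

    With re.MULTILINE, ^ and $ anchor at line boundaries, and '.' does not
    cross newlines, so each match covers exactly one matching line.
    """
    line_re = re.compile(r"^.*(?:%s).*$" % pattern, re.MULTILINE)
    return sum(1 for _ in line_re.finditer(text))


# Hypothetical file contents; in practice this would be f.read().
text = "alpha sometext\nbeta\nsometext and sometext again\ndelta\n"

occurrences = len(re.findall("sometext", text))        # every occurrence
matching_lines = count_matching_lines(text, "sometext")  # grep-style count

print(occurrences, matching_lines)  # 3 2
```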


          • Pop User

            #6
            Re: python vs. grep

            Anton Slesarev wrote:
            > But I have a problem writing a performant grep analog.

            I don't think you can ever catch grep. Searching is its only purpose in
            life and it's very good at it. You may be able to come closer; this
            thread relates:

            http://groups.google.com/group/comp....thread/thread/...

            This relates to the speed of re. If you don't need regex, don't use re.
            If you do need re, an alternate re library might be useful, but you
            aren't going to catch grep.



            • Anton Slesarev

              #7
              Re: python vs. grep

              On May 7, 7:22 pm, Pop User <popu...@christest2.dc.k12us.com> wrote:
              > Anton Slesarev wrote:
              >> But I have a problem writing a performant grep analog.
              >
              > I don't think you can ever catch grep. Searching is its only purpose in
              > life and it's very good at it. You may be able to come closer; this
              > thread relates:
              >
              > http://groups.google.com/group/comp....thread/thread/...
              >
              > This relates to the speed of re. If you don't need regex, don't use re.
              > If you do need re, an alternate re library might be useful, but you
              > aren't going to catch grep.

              In my last test I don't use re. As I understand it, the main problem is
              reading the file.


              • Alan Isaac

                #8
                Re: python vs. grep

                Anton Slesarev wrote:
                > I've read a great paper about generators:
                > http://www.dabeaz.com/generators/index.html
                > The author says that it's easy to write analogs of common Linux tools such
                > as awk, grep, etc. He says that performance could be even better.
                > But I have a problem writing a performant grep analog.

                https://svn.enthought.com/svn/sandbox/grin/trunk/

                hth,
                Alan Isaac


                • Robert Kern

                  #9
                  Re: python vs. grep

                  Alan Isaac wrote:
                  > Anton Slesarev wrote:
                  >> I've read a great paper about generators:
                  >> http://www.dabeaz.com/generators/index.html The author says that it's
                  >> easy to write analogs of common Linux tools such as awk, grep, etc. He
                  >> says that performance could be even better. But I have a problem
                  >> writing a performant grep analog.
                  >
                  > https://svn.enthought.com/svn/sandbox/grin/trunk/

                  As the author of grin I can definitively state that it is not at all competitive
                  with grep in terms of speed. grep reads files really fast. awk is probably
                  beatable, though.

                  --
                  Robert Kern

                  "I have come to believe that the whole world is an enigma, a harmless enigma
                  that is made terrible by our own mad attempt to interpret it as though it had
                  an underlying truth."
                  -- Umberto Eco


                  • Ville Vainio

                    #10
                    Re: python vs. grep

                    On May 8, 8:11 pm, Ricardo Aráoz <ricar...@gmail.com> wrote:
                    > All these examples assume your regular expression will not span multiple
                    > lines, but this can easily be the case. How would you process the file
                    > with regular expressions that span multiple lines?

                    re.findall/finditer, as I said earlier.



                    • Ricardo Aráoz

                      #11
                      Re: python vs. grep

                      Ville Vainio wrote:
                      > On May 8, 8:11 pm, Ricardo Aráoz <ricar...@gmail.com> wrote:
                      >> All these examples assume your regular expression will not span multiple
                      >> lines, but this can easily be the case. How would you process the file
                      >> with regular expressions that span multiple lines?
                      >
                      > re.findall/finditer, as I said earlier.

                      Hi, sorry took so long to answer. Too much work.

                      findall/finditer do not address the issue, they merely find ALL the
                      matches in a STRING. But if you keep reading the files a line at a time
                      (as most examples given in this thread do) then you are STILL in trouble
                      when a regular expression spans multiple lines.
                      The easy/simple (too easy/simple?) way I see out of it is to read THE
                      WHOLE file into memory and don't worry. But what if the file is too
                      heavy? So I was wondering if there is any other way out of it. Does grep
                      read the whole file into memory? Does it ONLY process a line at a time?
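The problem described here can be shown in a few lines of Python 3. This is only an illustration with made-up sample text: a pattern containing a newline never matches when each line is searched on its own, but it does match when the same text is searched as one string.

```python
import re

# Hypothetical text in which the interesting match straddles a line break.
text = "start of record\nend of record\nunrelated line\n"
pattern = re.compile(r"record\nend")

# Line-at-a-time search: no single line contains "record\nend".
line_hits = sum(1 for line in text.splitlines(True) if pattern.search(line))

# Whole-string search: the match across the line break is found.
whole_hits = len(pattern.findall(text))

print(line_hits, whole_hits)  # 0 1
```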


                      • Kam-Hung Soh

                        #12
                        Re: python vs. grep

                        On Tue, 13 May 2008 00:03:08 +1000, Ricardo Aráoz <ricaraoz@gmail.com>
                        wrote:
                        > Ville Vainio wrote:
                        >> On May 8, 8:11 pm, Ricardo Aráoz <ricar...@gmail.com> wrote:
                        >>> All these examples assume your regular expression will not span multiple
                        >>> lines, but this can easily be the case. How would you process the file
                        >>> with regular expressions that span multiple lines?
                        >>
                        >> re.findall/finditer, as I said earlier.
                        >
                        > Hi, sorry took so long to answer. Too much work.
                        >
                        > findall/finditer do not address the issue, they merely find ALL the
                        > matches in a STRING. But if you keep reading the files a line at a time
                        > (as most examples given in this thread do) then you are STILL in trouble
                        > when a regular expression spans multiple lines.
                        > The easy/simple (too easy/simple?) way I see out of it is to read THE
                        > WHOLE file into memory and don't worry. But what if the file is too
                        > heavy? So I was wondering if there is any other way out of it. Does grep
                        > read the whole file into memory? Does it ONLY process a line at a time?

                        Standard grep can only match a line at a time. Are you thinking about
                        "sed", which has a sliding window?

                        See http://www.gnu.org/software/sed/manual/sed.html, Section 4.13

                        --
                        Kam-Hung Soh, Software Salariman (http://kamhungsoh.com/blog)


                        • Ville M. Vainio

                          #13
                          Re: python vs. grep

                          Ricardo Aráoz <ricaraoz@gmail.com> writes:
                          > The easy/simple (too easy/simple?) way I see out of it is to read THE
                          > WHOLE file into memory and don't worry. But what if the file is too

                          The easiest and simplest approach is often the best with
                          Python. Reading in the whole file is rarely too heavy, and you omit
                          the python "object overhead" entirely - all the code executes in the
                          fast C extensions.

                          If the file is too big, you might want to look up mmap:

                          http://effbot.org/librarybook/mmap.htm


                          • Ricardo Aráoz

                            #14
                            Re: python vs. grep

                            Ville M. Vainio wrote:
                            > Ricardo Aráoz <ricaraoz@gmail.com> writes:
                            >> The easy/simple (too easy/simple?) way I see out of it is to read THE
                            >> WHOLE file into memory and don't worry. But what if the file is too
                            >
                            > The easiest and simplest approach is often the best with
                            > Python.

                            Keep forgetting that!

                            > If the file is too big, you might want to look up mmap:
                            >
                            > http://effbot.org/librarybook/mmap.htm

                            Thanks!
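The mmap suggestion could look roughly like the following Python 3 sketch. It maps a file read-only and runs a compiled bytes pattern over the mapping, so a match may span line breaks without the program reading the whole file into a Python string first. The temporary file, its contents, and the helper name `find_offsets` are assumptions made for illustration; mapping an empty file raises an error, which this sketch does not handle.

```python
import mmap
import os
import re
import tempfile


def find_offsets(path, pattern):
    """Return the byte offsets of all matches of a bytes pattern in a file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [m.start() for m in pattern.finditer(mm)]


# Hypothetical sample file in which the match spans two lines.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"start of record\nend of record\nunrelated line\n")
    path = tmp.name

# The pattern must be bytes, because the mapping exposes raw bytes.
pattern = re.compile(rb"record\nend")

offsets = find_offsets(path, pattern)
os.remove(path)
print(offsets)  # [9]
```

The operating system pages the file in on demand, so memory use stays modest even for large files, although the file still has to fit in the process's address space.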
