str.count is slow

  • chrisperkins99@gmail.com

    str.count is slow

    It seems to me that str.count is awfully slow. Is there some reason
    for this?
    Evidence:

    ######## str.count time test ########
    import string
    import time
    import array

    s = string.printable * int(1e5)  # 10**7 character string
    a = array.array('c', s)
    u = unicode(s)
    RIGHT_ANSWER = s.count('a')

    def main():
        print 'str:    ', time_call(s.count, 'a')
        print 'array:  ', time_call(a.count, 'a')
        print 'unicode:', time_call(u.count, 'a')

    def time_call(f, *a):
        start = time.clock()
        assert RIGHT_ANSWER == f(*a)
        return time.clock() - start

    if __name__ == '__main__':
        main()

    ###### end ########

    On my machine, the output is:

    str:     0.29365715475
    array:   0.448095498171
    unicode: 0.0243757237303

    If a unicode object can count characters so fast, why should an str
    object be ten times slower? Just curious, really - it's still fast
    enough for me (so far).

    This is with Python 2.4.1 on WinXP.
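
    (A workaround suggested by these numbers, if the speed difference ever
    matters: do the counting on a unicode copy of the string. A minimal
    sketch, assuming the text is plain ASCII so the conversion is lossless:

    u = unicode(s)          # one extra pass over the string
    count = u.count('a')    # roughly 10x faster than s.count('a') here

    Since the unicode() call is itself a pass over the data, this only pays
    off when the same text is counted more than once.)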


    Chris Perkins

  • Ben Cartwright

    #2
    Re: str.count is slow

    chrisperkins99@gmail.com wrote:
    > It seems to me that str.count is awfully slow. Is there some reason
    > for this?
    > Evidence:
    >
    > ######## str.count time test ########
    > import string
    > import time
    > import array
    >
    > s = string.printable * int(1e5)  # 10**7 character string
    > a = array.array('c', s)
    > u = unicode(s)
    > RIGHT_ANSWER = s.count('a')
    >
    > def main():
    >     print 'str:    ', time_call(s.count, 'a')
    >     print 'array:  ', time_call(a.count, 'a')
    >     print 'unicode:', time_call(u.count, 'a')
    >
    > def time_call(f, *a):
    >     start = time.clock()
    >     assert RIGHT_ANSWER == f(*a)
    >     return time.clock() - start
    >
    > if __name__ == '__main__':
    >     main()
    >
    > ###### end ########
    >
    > On my machine, the output is:
    >
    > str:     0.29365715475
    > array:   0.448095498171
    > unicode: 0.0243757237303
    >
    > If a unicode object can count characters so fast, why should an str
    > object be ten times slower? Just curious, really - it's still fast
    > enough for me (so far).
    >
    > This is with Python 2.4.1 on WinXP.
    >
    > Chris Perkins


    Your evidence points to some unoptimized code in the underlying C
    implementation of Python. As such, this should probably go to the
    python-dev list (http://mail.python.org/mailman/listinfo/python-dev).

    The problem is that str.count calls the C library function memcmp once
    for every character position, and the per-call overhead adds up. See
    lines 2165+ in stringobject.c (inside the function string_count):

    r = 0;
    while (i < m) {
        if (!memcmp(s+i, sub, n)) {
            r++;
            i += n;
        } else {
            i++;
        }
    }

    This could be optimized as:

    r = 0;
    while (i < m) {
        if (s[i] == *sub && !memcmp(s+i, sub, n)) {
            r++;
            i += n;
        } else {
            i++;
        }
    }

    This tactic typically avoids most (sometimes all) of the calls to
    memcmp. Other string search functions, including unicode.count,
    unicode.index, and str.index, use this tactic, which is why you see
    unicode.count performing better than str.count.

    The above might be optimized further for cases such as yours, where a
    single character appears many times in the string:

    r = 0;
    if (n == 1) {
        /* optimize for a single character */
        while (i < m) {
            if (s[i] == *sub)
                r++;
            i++;
        }
    } else {
        while (i < m) {
            if (s[i] == *sub && !memcmp(s+i, sub, n)) {
                r++;
                i += n;
            } else {
                i++;
            }
        }
    }

    Note that there might be some subtle reason, which I'm unaware of, why
    neither of these optimizations is done... in which case a comment in
    the C source would help. :-)
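
    For anyone who doesn't read C, here is a rough Python rendering of the
    two loops above, purely to make the control flow concrete (a sketch of
    the algorithm, not the actual implementation):

    def count_naive(s, sub):
        # original loop: full substring comparison at every position
        # (the slice comparison plays the role of the memcmp call)
        assert sub, "assumes a non-empty needle, as the C code does"
        r = i = 0
        n = len(sub)
        while i <= len(s) - n:
            if s[i:i+n] == sub:
                r += 1
                i += n
            else:
                i += 1
        return r

    def count_firstchar(s, sub):
        # optimized loop: cheap first-character test before the full
        # comparison, so most positions never reach the slice compare
        assert sub, "assumes a non-empty needle, as the C code does"
        r = i = 0
        n, first = len(sub), sub[0]
        while i <= len(s) - n:
            if s[i] == first and s[i:i+n] == sub:
                r += 1
                i += n
            else:
                i += 1
        return r

    Both agree with str.count on non-overlapping occurrences, e.g.
    count_naive('abracadabra', 'abra') == 'abracadabra'.count('abra') == 2.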

    --Ben


    • Fredrik Lundh

      #3
      Re: str.count is slow

      Ben Cartwright wrote:
      > > On my machine, the output is:
      > >
      > > str:     0.29365715475
      > > array:   0.448095498171
      > > unicode: 0.0243757237303
      >
      > This tactic typically avoids most (sometimes all) of the calls to
      > memcmp. Other string search functions, including unicode.count,
      > unicode.index, and str.index, use this tactic, which is why you see
      > unicode.count performing better than str.count.

      it's about time that someone sat down and merged the string and unicode
      implementations into a single "stringlib" code base (see the SRE sources for
      an efficient way to do this in plain C).

      moving to (basic) C++ might also be a good idea (in 3.0, perhaps). is
      anyone still stuck with pure C89 these days?

      </F>




      • Terry Reedy

        #4
        Re: str.count is slow


        "Ben Cartwright" <bencvt@gmail.c om> wrote in message
        news:1141083127 .970403.147100@ v46g2000cwv.goo glegroups.com.. .[color=blue]
        > Your evidence points to some unoptimized code in the underlying C
        > implementation of Python. As such, this should probably go to the
        > python-dev list (http://mail.python.org/mailman/listinfo/python-dev).
        >
        > The problem is that str.count calls the C library function memcmp
        > once for every character position, and the per-call overhead adds
        > up. See lines 2165+ in stringobject.c (inside the function
        > string_count):
        >
        > r = 0;
        > while (i < m) {
        >     if (!memcmp(s+i, sub, n)) {
        >         r++;
        >         i += n;
        >     } else {
        >         i++;
        >     }
        > }
        >
        > This could be optimized as:
        >
        > r = 0;
        > while (i < m) {
        >     if (s[i] == *sub && !memcmp(s+i, sub, n)) {
        >         r++;
        >         i += n;
        >     } else {
        >         i++;
        >     }
        > }
        >
        > This tactic typically avoids most (sometimes all) of the calls to
        > memcmp. Other string search functions, including unicode.count,
        > unicode.index, and str.index, use this tactic, which is why you see
        > unicode.count performing better than str.count.

        If not doing the same in str.count is indeed an oversight, a patch
        should be welcome (on the SF tracker).


