Trouble sorting lists (unicode/locale related?)

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Shu-Hsien Sheu

    #16
    Slicing vs .startswith

    Hi,

    I have a question about the comparison of efficiency of string slicing
    and using string.startswi th.
    For example, which one of the following would be more efficient, or ,
    moreover, more pythonic?

    if aa[:3] == 'abc':

    vs

    if aa.startswith(' abc'):


    Thanks!

    -shuhsien


    Comment

    • Bob Gailer

      #17
      Re: Slicing vs .startswith

      At 09:17 AM 9/22/2003, Shu-Hsien Sheu wrote:
      [color=blue]
      >Hi,
      >
      >I have a question about the comparison of efficiency of string slicing and
      >using string.startswi th.
      >For example, which one of the following would be more efficient, or ,
      >moreover, more pythonic?
      >
      >if aa[:3] == 'abc':
      >
      >vs
      >
      >if aa.startswith(' abc'):[/color]

      Here's one way to address this kind of question:
      [color=blue][color=green][color=darkred]
      >>> import time
      >>> def f():[/color][/color][/color]
      .... st = time.time()
      .... for i in range(500000):
      .... if aa.startswith(' abc'):pass
      .... print time.time() - st
      ....[color=blue][color=green][color=darkred]
      >>> f()[/color][/color][/color]
      1.01200008392[color=blue][color=green][color=darkred]
      >>> def f():[/color][/color][/color]
      .... st = time.time()
      .... for i in range(500000):
      .... if aa[:3] == 'abc':pass
      .... print time.time() - st
      ....[color=blue][color=green][color=darkred]
      >>> f()[/color][/color][/color]
      1.01100003719

      Bob Gailer
      bgailer@alum.rp i.edu
      303 442 2625


      ---
      Outgoing mail is certified Virus Free.
      Checked by AVG anti-virus system (http://www.grisoft.com).
      Version: 6.0.506 / Virus Database: 303 - Release Date: 8/1/2003

      Comment

      • Duncan Booth

        #18
        Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

        Alex Martelli <aleax@aleax.it > wrote in
        news:VOEbb.1016 41$hE5.3573494@ news1.tin.it:
        [color=blue]
        > Duncan Booth wrote:
        > ...[color=green]
        >> FWIW, I think something like this belongs in the standard library rather
        >> than as a method on lists or a new builtin.[/color]
        >
        > Definitely not a method on lists, we agree on that. Problem is, there
        > is no module in the standard library to collect "functions that are often
        > useful but not QUITE often enough to warrant being built-ins" (and if
        > there was, there are quite a few current built-ins I'd like to relegate
        > into such an auxiliary/utilities module), so that finding a good place,
        > other than built-ins, for such utility functions, is often a bother.
        >[/color]
        So someone needs to write a PEP proposing such a module.

        --
        Duncan Booth duncan@rcp.co.u k
        int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
        "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

        Comment

        • Peter Hansen

          #19
          Re: Slicing vs .startswith

          Shu-Hsien Sheu wrote:[color=blue]
          >
          > I have a question about the comparison of efficiency of string slicing
          > and using string.startswi th.
          > For example, which one of the following would be more efficient, or ,
          > moreover, more pythonic?
          >
          > if aa[:3] == 'abc':
          >
          > vs
          >
          > if aa.startswith(' abc'):[/color]

          The latter is clearly more readable.

          Comment

          • David Eppstein

            #20
            Re: Slicing vs .startswith

            In article <3F6F3E38.5221D 7E7@engcorp.com >,
            Peter Hansen <peter@engcorp. com> wrote:
            [color=blue][color=green]
            > > For example, which one of the following would be more efficient, or ,
            > > moreover, more pythonic?
            > >
            > > if aa[:3] == 'abc':
            > >
            > > vs
            > >
            > > if aa.startswith(' abc'):[/color]
            >
            > The latter is clearly more readable.[/color]

            More Pythonic, too, I think. "Readabilit y counts," and "There should be
            one-- and preferably only one --obvious way to do it." In this case,
            startswith must be the one obvious way, or else why would it exist in
            the standard library at all?

            --
            David Eppstein http://www.ics.uci.edu/~eppstein/
            Univ. of California, Irvine, School of Information & Computer Science

            Comment

            • xtian

              #21
              Re: Slicing vs .startswith

              David Eppstein <eppstein@ics.u ci.edu> wrote in message news:<eppstein-89A196.11515122 092003@news.ser vice.uci.edu>.. .[color=blue]
              > In article <3F6F3E38.5221D 7E7@engcorp.com >,
              > Peter Hansen <peter@engcorp. com> wrote:
              >[color=green][color=darkred]
              > > > For example, which one of the following would be more efficient, or ,
              > > > moreover, more pythonic?
              > > >
              > > > if aa[:3] == 'abc':
              > > >
              > > > vs
              > > >
              > > > if aa.startswith(' abc'):[/color]
              > >
              > > The latter is clearly more readable.[/color]
              >
              > More Pythonic, too, I think. "Readabilit y counts," and "There should be
              > one-- and preferably only one --obvious way to do it." In this case,
              > startswith must be the one obvious way, or else why would it exist in
              > the standard library at all?[/color]

              It's also much more maintainable - if in the future that 'abc' needs
              to change to 'abcdef', the slicing version requires changes in two
              places (just asking for bugs), while startswith requires only one. Of
              course, all these things (readability, maintainability and
              pythonicism) are fairly closely interrelated.

              xtian

              Comment

              • Jeremy Fincher

                #22
                Re: Slicing vs .startswith

                Shu-Hsien Sheu <sheu@bu.edu> wrote in message news:<mailman.1 064245938.12315 .python-list@python.org >...[color=blue]
                > For example, which one of the following would be more efficient, or ,
                > moreover, more pythonic?
                >
                > if aa[:3] == 'abc':
                >
                > vs
                >
                > if aa.startswith(' abc'):[/color]

                Python is about maintainability , and the latter is significantly more
                maintainable than the former; if the string you're checking against
                changes in size, using .startswith doesn't require you to change your
                sourcecode anywhere else.

                Jeremy

                Comment

                • Paul

                  #23
                  Re: Slicing vs .startswith

                  However, what if you don't want case sensitivity? For example, to
                  check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
                  work with both foo.jpg and foo.JPG.

                  Is this slower than name.lower().en dswith('jpg')? Is there a better
                  solution altogether?

                  Paul


                  Bob Gailer <bgailer@alum.r pi.edu> wrote in message news:<mailman.1 064246454.19668 .python-list@python.org >...[color=blue]
                  > At 09:17 AM 9/22/2003, Shu-Hsien Sheu wrote:
                  >[color=green]
                  > >Hi,
                  > >
                  > >I have a question about the comparison of efficiency of string slicing and
                  > >using string.startswi th.
                  > >For example, which one of the following would be more efficient, or ,
                  > >moreover, more pythonic?
                  > >
                  > >if aa[:3] == 'abc':
                  > >
                  > >vs
                  > >
                  > >if aa.startswith(' abc'):[/color]
                  >
                  > Here's one way to address this kind of question:
                  >[color=green][color=darkred]
                  > >>> import time
                  > >>> def f():[/color][/color]
                  > ... st = time.time()
                  > ... for i in range(500000):
                  > ... if aa.startswith(' abc'):pass
                  > ... print time.time() - st
                  > ...[color=green][color=darkred]
                  > >>> f()[/color][/color]
                  > 1.01200008392[color=green][color=darkred]
                  > >>> def f():[/color][/color]
                  > ... st = time.time()
                  > ... for i in range(500000):
                  > ... if aa[:3] == 'abc':pass
                  > ... print time.time() - st
                  > ...[color=green][color=darkred]
                  > >>> f()[/color][/color]
                  > 1.01100003719
                  >
                  > Bob Gailer
                  > bgailer@alum.rp i.edu
                  > 303 442 2625
                  >
                  > --[/color]

                  Comment

                  • Alex Martelli

                    #24
                    Re: Slicing vs .startswith

                    Paul wrote:
                    [color=blue]
                    > However, what if you don't want case sensitivity? For example, to
                    > check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
                    > work with both foo.jpg and foo.JPG.
                    >
                    > Is this slower than name.lower().en dswith('jpg')? Is there a better
                    > solution altogether?[/color]

                    timeit.py gives me 0.9 microseconds for the less maintainable
                    name[:-3].lower()=='jpg' vs 1.7 for name.lower().en dswith('jpg')
                    [and over 3 for re's]. Point is, _do I *care*_? How many millions
                    of filenames will I check, when the extra overhead of checking
                    a million filenames in the more maintainable way is less than a
                    second? How long will it have taken me to get those millions
                    of filenames into memory in the first place?

                    If this operation IS on the bottleneck, and a 0.8 microseconds
                    difference matters, I'll do the slicing -- but 99 times out of 100
                    I'll do the .endswith instead (I'll _start_ that way 100 times out
                    of 100, and optimize it iff profiling tells me that matters...).


                    Alex

                    Comment

                    • Peter Hansen

                      #25
                      Re: Slicing vs .startswith

                      Paul wrote:[color=blue]
                      >
                      > However, what if you don't want case sensitivity? For example, to
                      > check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
                      > work with both foo.jpg and foo.JPG.
                      >
                      > Is this slower than name.lower().en dswith('jpg')? Is there a better
                      > solution altogether?[/color]

                      Yes, of course. :-)

                      import os
                      if os.path.splitex t(name)[1].lower() == 'jpg':
                      pass

                      That also handles the problem with files named "ThisFileIs.Not Ajpg"
                      being mistreated, as the other solutions do. ;-)

                      -Peter

                      Comment

                      • David Eppstein

                        #26
                        Re: Slicing vs .startswith

                        In article <3F6F8306.1ABCC 6D@engcorp.com> ,
                        Peter Hansen <peter@engcorp. com> wrote:
                        [color=blue]
                        > Paul wrote:[color=green]
                        > >
                        > > However, what if you don't want case sensitivity? For example, to
                        > > check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
                        > > work with both foo.jpg and foo.JPG.
                        > >
                        > > Is this slower than name.lower().en dswith('jpg')? Is there a better
                        > > solution altogether?[/color]
                        >
                        > Yes, of course. :-)
                        >
                        > import os
                        > if os.path.splitex t(name)[1].lower() == 'jpg':
                        > pass
                        >
                        > That also handles the problem with files named "ThisFileIs.Not Ajpg"
                        > being mistreated, as the other solutions do. ;-)[/color]

                        I was about to post the same answer. One minor nit, though: it should be

                        os.path.splitex t(name)[1].lower() == '.jpg'

                        It may be a little longer than the other solutions, but it expresses the
                        meaning more clearly.

                        --
                        David Eppstein http://www.ics.uci.edu/~eppstein/
                        Univ. of California, Irvine, School of Information & Computer Science

                        Comment

                        • Terry Reedy

                          #27
                          Re: Slicing vs .startswith


                          "Shu-Hsien Sheu" <sheu@bu.edu> wrote in message
                          news:mailman.10 64245938.12315. python-list@python.org ...[color=blue]
                          > Hi,
                          >
                          > For example, which one of the following would be more efficient, or[/color]
                          ,[color=blue]
                          > moreover, more pythonic?
                          >
                          > if aa[:3] == 'abc':[/color]

                          This creates and deletes a temporary object.
                          [color=blue]
                          > if aa.startswith(' abc'):[/color]

                          This makes a function call. As Bob showed for his system, the two
                          overheads are about the same for a three char prefix. If one were
                          checking a 30000 byte prefix, the call might win.

                          Terry J. Reedy


                          Comment

                          • Peter Hansen

                            #28
                            Re: Slicing vs .startswith

                            David Eppstein wrote:[color=blue]
                            >
                            > In article <3F6F8306.1ABCC 6D@engcorp.com> ,
                            > Peter Hansen <peter@engcorp. com> wrote:
                            >[color=green]
                            > > Paul wrote:[color=darkred]
                            > > >
                            > > > However, what if you don't want case sensitivity? For example, to
                            > > > check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
                            > > > work with both foo.jpg and foo.JPG.
                            > > >
                            > > > Is this slower than name.lower().en dswith('jpg')? Is there a better
                            > > > solution altogether?[/color]
                            > >
                            > > Yes, of course. :-)
                            > >
                            > > import os
                            > > if os.path.splitex t(name)[1].lower() == 'jpg':
                            > > pass
                            > >
                            > > That also handles the problem with files named "ThisFileIs.Not Ajpg"
                            > > being mistreated, as the other solutions do. ;-)[/color]
                            >
                            > I was about to post the same answer. One minor nit, though: it should be
                            >
                            > os.path.splitex t(name)[1].lower() == '.jpg'[/color]

                            Oops, thanks! I _always_ make that mistake. It just seems that if
                            os.path.split() does not return any path separators in the components,
                            then os.path.splitex t() shouldn't return the extension separator...

                            Luckily we have tests around here to catch that kind of thing. :-)

                            -Peter

                            Comment

                            • Peter Otten

                              #29
                              Re: DSU pattern (was Re: Trouble sorting lists (unicode/locale related?))

                              Duncan Booth wrote:
                              [color=blue]
                              > Note that if anyone proposes this seriously, it should generate a 3-tuple
                              > (mapping(item), index, item) rather than the 2-tuple you suggest.
                              >
                              > This is because the mapping function could reasonably be used to impose an
                              > ordering on objects that have no natural order, so you need to be sure
                              > that the comparison never falls through the the original object even where
                              > the mapping compares equal. It also has the side effect of making the sort
                              > stable (although if stability is a goal you might want another option to
                              > reverse the sort which would use '-index' as the second element and call
                              > .reverse() on the result).[/color]

                              So my demo implementation was faulty :-(
                              Let me try again:

                              def sort(self, cmpfunc=None, mapfunc=None):
                              if mapfunc:
                              list.sort(self, lambda x, y: cmp(mapfunc(x), mapfunc(y)))
                              else:
                              list.sort(self, cmpfunc)

                              Seriously, if sort() were to be thus extended, the dsu pattern would rather
                              act as a performance enhancement under the hood. I prefer plain C

                              struct {
                              PyObject *decorator;
                              int index; /* iff stability is not guaranteed by the sort algorithm */
                              PyObject *data;
                              }

                              over n-tuples and would hope to reuse the sorting infrastructure already
                              there in listobject.c to compare only the decorators. I am not familiar
                              with the C source, so be gentle if that is not feasible.
                              [color=blue]
                              > FWIW, I think something like this belongs in the standard library rather
                              > than as a method on lists or a new builtin.[/color]

                              If you think of the comparison as a two-step process,

                              (1) extract or calculate an attribute
                              (2) wrap it into a comparison function

                              consider the following use cases:

                              (a) lambda x, y: cmp(y, x)
                              (b) lambda x, y: cmp(x.attr.lowe r(), y.attr.lower())

                              All code following the latter pattern could be rewritten (more clearly, I
                              think) as

                              alist.sort(mapf unc=lambda x: attr.lower())

                              and would automatically benefit from dsu (which could even be turned off
                              transparently for the client code for very large lists, where the memory
                              footprint becomes a problem).

                              The litmus test whether to put it into a utility function or into an already
                              existing method that needs only a minor and completely backwards-compatible
                              interface change would then be:

                              Are (b)-style comparison functions nearly as or more common than (a)-style
                              functions.

                              Peter

                              PS: You and Alex Martelli dismissed my somewhat unfocused idea faster than I
                              could sort it out. I hope you will find the above useful, though.

                              Comment

                              • Shu-Hsien Sheu

                                #30
                                Re: Slicing vs .startswith

                                Great thanks for all the answers!
                                I really enjoy Python and learned a lot here.

                                -shuhsien


                                Comment

                                Working...