split() and string.whitespace

  • Chaim Krause

    split() and string.whitespace

    I am unable to figure out why the first two statements work as I
    expect them to and the next two do not. Namely, the first two split the
    sentence into its component words, while the latter two return the
    whole sentence intact.

    import string
    from string import whitespace
    mytext = "The quick brown fox jumped over the lazy dog.\n"

    print mytext.split()
    print mytext.split(' ')
    print mytext.split(whitespace)
    print string.split(mytext, sep=whitespace)
  • Marc 'BlackJack' Rintsch

    #2
    Re: split() and string.whitespace

    On Fri, 31 Oct 2008 11:53:30 -0700, Chaim Krause wrote:
    > I am unable to figure out why the first two statements work as I expect
    > them to and the next two do not. Namely, the first two split the sentence
    > into its component words, while the latter two return the whole sentence
    > intact.
    >
    > import string
    > from string import whitespace
    > mytext = "The quick brown fox jumped over the lazy dog.\n"
    >
    > print mytext.split()
    > print mytext.split(' ')

    This splits at the string ' '.

    > print mytext.split(whitespace)

    This splits at the string '\t\n\x0b\x0c\r ' which doesn't occur in
    `mytext`. The argument is a string, not a set of characters.

    > print string.split(mytext, sep=whitespace)

    Same here.

    Ciao,
    Marc 'BlackJack' Rintsch
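
Marc's point can be checked directly. The following is only a sketch, written in Python 3 syntax (an assumption, since the thread itself uses Python 2): the whole of string.whitespace is treated as one separator, and that multi-character string never appears in mytext.

```python
import string

mytext = "The quick brown fox jumped over the lazy dog.\n"

# string.whitespace is one ordinary str; split() looks for it as a whole.
separator = string.whitespace
separator_found = separator in mytext   # the full run never occurs in mytext
pieces = mytext.split(separator)        # so nothing is split off

print(repr(separator))
print(separator_found)
print(pieces)
```

Because the separator is never found, split() hands back a one-element list holding the original string.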

    • Tim Chase

      #3
      Re: split() and string.whitespace

      > I am unable to figure out why the first two statements work as I
      > expect them to and the next two do not. Namely, the first two split the
      > sentence into its component words, while the latter two return the
      > whole sentence intact.
      >
      > import string
      > from string import whitespace
      > mytext = "The quick brown fox jumped over the lazy dog.\n"
      >
      > print mytext.split()
      > print mytext.split(' ')
      > print mytext.split(whitespace)
      > print string.split(mytext, sep=whitespace)

      split() does its work on a literal separator string or, if no
      separator is specified, on runs of arbitrary whitespace.

      For an example, try

      s = "abcdefgbcdefgh"
      s.split("c") # ['ab', 'defgb', 'defgh']
      s.split("fgb") # ['abcde', 'cdefgh']


      string.whitespace is a string, so split() tries to split on
      that literal string, not on a set of whitespace characters.

      -tkc
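
To make the two modes concrete, here is a small sketch (Python 3 syntax assumed; the sample string is made up for illustration). With no argument, split() treats any run of whitespace as one delimiter; with an explicit argument, it matches only that exact string.

```python
text = "a  b\tc\n"

default_split = text.split()     # runs of any whitespace, ends trimmed
space_split = text.split(" ")    # the literal single space only

print(default_split)
print(space_split)
```

The explicit-separator call keeps the tab and newline inside the last field and produces an empty string between the two adjacent spaces.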





      • Chris Rebert

        #4
        Re: split() and string.whitespace

        On Fri, Oct 31, 2008 at 11:53 AM, Chaim Krause <chaim@chaim.com> wrote:
        > I am unable to figure out why the first two statements work as I
        > expect them to and the next two do not. Namely, the first two split the
        > sentence into its component words, while the latter two return the
        > whole sentence intact.
        >
        > import string
        > from string import whitespace
        > mytext = "The quick brown fox jumped over the lazy dog.\n"
        >
        > print mytext.split()
        > print mytext.split(' ')
        > print mytext.split(whitespace)
        > print string.split(mytext, sep=whitespace)

        Also note that a plain 'mytext.split()' with no arguments will split
        on any whitespace character like you're trying to do here.

        Cheers,
        Chris
        --
        Follow the path of the Iguana...

        • Chaim Krause

          #5
          Re: split() and string.whitespace

          The documentation I am referencing states...

          The sep argument may consist of multiple characters (for example, "'1,
          2, 3'.split(', ')" returns "['1', '2', '3']").

          So why don't the latter two split on *any* whitespace character,
          instead of looking for the sep string as a whole?

          • Chaim Krause

            #6
            Re: split() and string.whitespace

            I have arrived here while attempting to break down a larger problem. I
            got to this question when attempting to split a line on any whitespace
            character so that I could then add several other characters like ';'
            and ':'. Ultimately I want to split a line on any char in the union of
            string.whitespace and some pre-designated chars.

            I am now beginning to think that I have outgrown split() and must move
            up to regular expressions. If that is the case, I will go off and RTFM
            on RegEx.
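
For what the regular-expression route might look like, here is a minimal sketch in Python 3 syntax (an assumption). The names EXTRA_DELIMITERS, DELIMITER_RE and split_on_any are hypothetical, chosen only for this illustration; the ';' and ':' come from the description above.

```python
import re
import string

# Hypothetical extra delimiters, taken from the ';' and ':' mentioned above.
EXTRA_DELIMITERS = ";:"

# A character class matching any one delimiter; "+" collapses runs of them.
# re.escape guards characters that are special inside a character class.
DELIMITER_RE = re.compile(
    "[" + re.escape(string.whitespace + EXTRA_DELIMITERS) + "]+"
)


def split_on_any(line):
    """Split line on runs of delimiters, dropping empty edge fields."""
    return [field for field in DELIMITER_RE.split(line) if field]


fields = split_on_any("alpha;beta: gamma\tdelta\n")
print(fields)
```

The list comprehension filters out the empty string that re.split() returns when the line starts or ends with a delimiter.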

            • Chaim Krause

              #7
              Re: split() and string.whitespace

              On Oct 31, 2:12 pm, Chaim Krause <ch...@chaim.com> wrote:
              > The documentation I am referencing states...
              >
              > The sep argument may consist of multiple characters (for example, "'1,
              > 2, 3'.split(', ')" returns "['1', '2', '3']").
              >
              > So why don't the latter two split on *any* whitespace character,
              > instead of looking for the sep string as a whole?

              Now, rereading the documentation in light of the replies to my
              original posting, I see that I misinterpreted the example as using
              "comma OR space" when it was actually "commaspace". I am now properly
              enlightened.

              Thank you all for your help.
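
The "comma OR space" versus "commaspace" distinction can be seen in a two-line sketch (Python 3 syntax assumed; the second input string is invented for contrast):

```python
comma_space = "1, 2, 3".split(", ")   # every ", " is present as a unit
mixed = "1,2 3".split(", ")           # has a comma and a space, never ", "

print(comma_space)
print(mixed)
```

The second call finds no occurrence of the two-character separator, so the string comes back whole.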

              • MRAB

                #8
                Re: split() and string.whitespace

                On Oct 31, 6:57 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
                > On Fri, 31 Oct 2008 11:53:30 -0700, Chaim Krause wrote:
                > > I am unable to figure out why the first two statements work as I expect
                > > them to and the next two do not. Namely, the first two split the sentence
                > > into its component words, while the latter two return the whole sentence
                > > intact.
                > >
                > > import string
                > > from string import whitespace
                > > mytext = "The quick brown fox jumped over the lazy dog.\n"
                > >
                > > print mytext.split()
                > > print mytext.split(' ')
                >
                > This splits at the string ' '.
                >
                > > print mytext.split(whitespace)
                >
                > This splits at the string '\t\n\x0b\x0c\r ' which doesn't occur in
                > `mytext`. The argument is a string, not a set of characters.
                >
                > > print string.split(mytext, sep=whitespace)
                >
                > Same here.

                <muse>
                It's interesting, if you think about it, that here we have someone who
                wants to split on a set of characters but 'split' splits on a string,
                and others sometimes want to strip off a string but 'strip' strips on
                a set of characters (passed as a string). You could imagine that if
                Python had had (character) sets from the start then 'split' and
                'strip' could have accepted a string or a set depending on whether you
                wanted to split on or strip off a string or a set.
                </muse>
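
The asymmetry is easy to see side by side. A small sketch, assuming Python 3 syntax and a made-up sample word: strip() treats its argument as a set of characters, while split() treats the same argument as one literal string.

```python
word = "xyhelloyx"

stripped = word.strip("xy")   # removes any leading/trailing 'x' or 'y'
split_up = word.split("xy")   # splits only where the exact string "xy" occurs

print(stripped)
print(split_up)
```

The trailing "yx" is removed by strip() but is not a match for split(), since the characters are in the other order.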

                • Steven D'Aprano

                  #9
                  Re: split() and string.whitespace

                  On Fri, 31 Oct 2008 12:18:32 -0700, Chaim Krause wrote:
                  > I have arrived here while attempting to break down a larger problem. I
                  > got to this question when attempting to split a line on any whitespace
                  > character so that I could then add several other characters like ';' and
                  > ':'. Ultimately splitting a line on any char in a union of
                  > string.whitespace and some pre-designated chars.
                  >
                  > I am now beginning to think that I have outgrown split() and must move
                  > up to regular expressions. If that is the case, I will go off and RTFM
                  > on RegEx.

                  Or just do this:

                  s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"
                  s = s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
                  s.split(' ')


                  or even simpler:

                  s.split()


                  --
                  Steven
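
One difference between the two suggestions is worth seeing in output. This is a sketch in Python 3 syntax (an assumption); the variable names normalized, with_empties and words are introduced here for illustration. After the replace() chain, adjacent spaces remain, and split(' ') turns each extra one into an empty string, whereas split() with no argument collapses the run.

```python
s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"

normalized = s.replace("\t", " ").replace("\n", " ").replace("\r", " ")
with_empties = normalized.split(" ")   # adjacent spaces give empty strings
words = s.split()                      # whitespace runs collapse

print(with_empties)
print(words)
```

Filtering the empty strings out of the first result gives the same list as the plain split().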

                  • Scott David Daniels

                    #10
                    Re: split() and string.whitespace

                    Steven D'Aprano wrote:
                    > On Fri, 31 Oct 2008 12:18:32 -0700, Chaim Krause wrote:
                    > > I have arrived here while attempting to break down a larger problem. I
                    > > got to this question when attempting to split a line on any whitespace
                    > > character so that I could then add several other characters like ';' and
                    > > ':'. Ultimately splitting a line on any char in a union of
                    > > string.whitespace and some pre-designated chars.
                    > >
                    > > I am now beginning to think that I have outgrown split() and must move
                    > > up to regular expressions. If that is the case, I will go off and RTFM
                    > > on RegEx.
                    >
                    > Or just do this:
                    > s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"
                    > s = s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
                    > s.split(' ')
                    > or even simpler:
                    > s.split()

                    Or, for faster per-repetition (blending in to your use-case):

                    import string
                    SEP = string.maketrans('abc \t', '     ')
                    ...
                    parts = 'whatever, abalone dudes'.translate(SEP).split()
                    print parts

                    ['wh', 'tever,', 'lone', 'dudes']


                    --Scott David Daniels
                    Scott.Daniels@Acm.Org
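
The translate-then-split idea carries over to Python 3, where the table is built with str.maketrans rather than string.maketrans. The following is only a sketch under that assumption; DELIMITERS is a hypothetical name, and its contents (all whitespace plus ';' and ':') are taken from the use case described earlier in the thread, not from Scott's code.

```python
import string

# Hypothetical delimiter set: all whitespace plus ';' and ':'.
DELIMITERS = string.whitespace + ";:"

# Both arguments must have equal length: map every delimiter to a space.
SEP = str.maketrans(DELIMITERS, " " * len(DELIMITERS))

parts = "key:value; other\tfield".translate(SEP).split()
print(parts)
```

Building the table once and reusing it is what keeps the per-line cost low.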

                    • Chaim Krause

                      #11
                      Re: split() and string.whitespace

                      That is a very creative solution! Thank you, Scott.

                      > Or, for faster per-repetition (blending in to your use-case):
                      >
                      >     import string
                      >     SEP = string.maketrans('abc \t', '     ')
                      >     ...
                      >     parts = 'whatever, abalone dudes'.translate(SEP).split()
                      >     print parts
                      >
                      > ['wh', 'tever,', 'lone', 'dudes']

                      • bearophileHUGS@lycos.com

                        #12
                        Re: split() and string.whitespace

                        MRAB:
                        > It's interesting, if you think about it, that here we have someone who
                        > wants to split on a set of characters but 'split' splits on a string,
                        > and others sometimes want to strip off a string but 'strip' strips on
                        > a set of characters (passed as a string).

                        That can be seen as a little inconsistency in the language. But with
                        some practice you learn it.

                        > You could imagine that if
                        > Python had had (character) sets from the start then 'split' and
                        > 'strip' could have accepted a string or a set depending on whether you
                        > wanted to split on or strip off a string or a set.

                        Too bad you haven't suggested this when they were designing Python
                        3 :-)
                        This may be suggested for Python 3.1.

                        Bye,
                        bearophile

                        • MRAB

                          #13
                          Re: split() and string.whitespace

                          On Nov 4, 8:00 pm, bearophileH...@lycos.com wrote:
                          > MRAB:
                          >
                          > > It's interesting, if you think about it, that here we have someone who
                          > > wants to split on a set of characters but 'split' splits on a string,
                          > > and others sometimes want to strip off a string but 'strip' strips on
                          > > a set of characters (passed as a string).
                          >
                          > That can be seen as a little inconsistency in the language. But with
                          > some practice you learn it.
                          >
                          > > You could imagine that if
                          > > Python had had (character) sets from the start then 'split' and
                          > > 'strip' could have accepted a string or a set depending on whether you
                          > > wanted to split on or strip off a string or a set.
                          >
                          > Too bad you haven't suggested this when they were designing Python
                          > 3 :-)
                          > This may be suggested for Python 3.1.

                          I might also add that str.startswith can accept a tuple of strings;
                          shouldn't that have been a set? :-)

                          I also had the thought that the backtick (`), which is not used in
                          Python 3, could be used to form character set literals (`aeiou` =>
                          set("aeiou")), although that might only be worthwhile if character
                          sets were introduced as a specialised form of set.
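
The startswith remark can be checked with a short sketch (Python 3 syntax assumed; the sample name and prefixes are invented). A tuple of prefixes is accepted, while passing a set raises TypeError.

```python
name = "split_example.py"

# A tuple of candidate prefixes is accepted.
matches_tuple = name.startswith(("split", "strip"))

# A set of the same prefixes is rejected with TypeError.
try:
    name.startswith({"split", "strip"})
    set_accepted = True
except TypeError:
    set_accepted = False

print(matches_tuple)
print(set_accepted)
```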

                          • bearophileHUGS@lycos.com

                            #14
                            Re: split() and string.whitespace

                            MRAB:
                            > I also had the thought that the backtick (`), which is not used in
                            > Python 3, could be used to form character set literals (`aeiou` =>
                            > set("aeiou")), although that might only be worthwhile if character
                            > sets were introduced as a specialised form of set.

                            Python developers have removed it from the syntax mostly because a lot
                            of keyboards (probably most in the world) don't have "`" on them.

                            Bye,
                            bearophile
