splitting a string into 2 new strings

  • Mark Light

    splitting a string into 2 new strings

    Hi,
    I have a string e.g. 'C6 H12 O6' that I wish to split up to give 2
    strings
    'C H O' and '6 12 6'. I have played with string.split() and the re module -
    but can't quite get there.

    Any help would be greatly appreciated.

    Thanks,

    Mark.




  • trp

    #2
    Re: splitting a string into 2 new strings

    Mark Light wrote:
    > Hi,
    > I have a string e.g. 'C6 H12 O6' that I wish to split up to give 2
    > strings
    > 'C H O' and '6 12 6'. I have played with string.split() and the re module
    > - but can't quite get there.
    >
    > Any help would be greatly appreciated.
    >
    > Thanks,
    >
    > Mark.

    I'm assuming that these are chemical compounds, so you're not limited to
    one-character symbols.

    Here's how I'd do it:

    import re

    # raw string so that \d reaches the regex engine unchanged
    re_pat = re.compile(r'([A-Z]+)(\d+)')
    text = 'C6 H12 O6'

    # find each component; returns a list of tuples, e.g. [('C', '6'), ...]
    component = re_pat.findall(text)

    # split into separate sequences
    symbols, counts = zip(*component)

    # create the strings
    symbols = ' '.join(symbols)
    counts = ' '.join(counts)
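
One edge case worth noting: if the pattern finds nothing, `zip(*component)` produces no items and the two-name unpacking raises `ValueError`. A minimal sketch of a guarded version follows; the function name `split_formula` and the choice to return two empty strings are assumptions made for illustration, not part of the original post.

```python
import re

# Same pattern as in the post above, written as a raw string.
COMPONENT_RE = re.compile(r'([A-Z]+)(\d+)')


def split_formula(text):
    """Return (symbols, counts) as two space-separated strings.

    Hypothetical helper: returns ('', '') when nothing matches.
    """
    components = COMPONENT_RE.findall(text)
    if not components:
        # zip(*[]) yields nothing, so unpacking into two names would fail.
        return '', ''
    symbols, counts = zip(*components)
    return ' '.join(symbols), ' '.join(counts)


print(split_formula('C6 H12 O6'))  # ('C H O', '6 12 6')
print(split_formula(''))           # ('', '')
```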

    --Andy






    • Mark Light

      #3
      Re: splitting a string into 2 new strings

      that works great - many thanks.

      "trp" <trp@smyrncable.net> wrote in message
      news:vg5jjqcc8d2r1f@corp.supernews.com...
      > Mark Light wrote:
      >
      > > Hi,
      > > I have a string e.g. 'C6 H12 O6' that I wish to split up to give 2
      > > strings
      > > 'C H O' and '6 12 6'. I have played with string.split() and the re module
      > > - but can't quite get there.
      > >
      > > Any help would be greatly appreciated.
      > >
      > > Thanks,
      > >
      > > Mark.
      >
      > I'm, assuming that these are chemical compounds, so you're not limited to
      > one-character symbols.
      >
      > Here's how I'd do it
      >
      > import re
      >
      > re_pat = re.compile('([A-Z]+)(\d+)')
      > text = 'C6 H12 O6'
      >
      > # find each component, returns list of tuples (e.g. [('C', '6'), ...]
      > component = re_pat.findall( text)
      >
      > #split into separate lists
      > symbols, counts = zip(*component)
      >
      > # create the strings
      > symbols = ' '.join(symbols)
      > counts = ' '.join(counts)
      >
      > --Andy



      • P@draigBrady.com

        #4
        Re: splitting a string into 2 new strings

        Mark Light wrote:
        > Hi,
        > I have a string e.g. 'C6 H12 O6' that I wish to split up to give 2
        > strings
        > 'C H O' and '6 12 6'. I have played with string.split() and the re module -
        > but can't quite get there.
        >
        > Any help would be greatly appreciated.

        import re

        # non-greedy (.+?) so that all the digits end up in the second group
        molecule_re = re.compile(r"(.+?)([0-9]+)")

        def processMolecule(molecule):
            elements = []
            numbers = []

            for item in molecule.split():
                element, number = molecule_re.findall(item)[0]
                elements.append(element)
                numbers.append(number)

            elements = ' '.join(elements)
            numbers = ' '.join(numbers)

            return (elements, numbers)

        print(processMolecule('C6 H12 O6'))
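
Whether the first group is greedy matters for multi-digit counts such as 'H12'. The short sketch below compares the two forms on that token; the names `greedy` and `lazy` are illustrative only.

```python
import re

# Greedy: (.+) grabs as much as it can and leaves a single digit behind.
greedy = re.compile(r'(.+)([0-9]+)')
# Non-greedy: (.+?) stops as early as possible, so the digits stay together.
lazy = re.compile(r'(.+?)([0-9]+)')

print(greedy.findall('H12'))  # [('H1', '2')]
print(lazy.findall('H12'))    # [('H', '12')]
```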


        • Andrew Dalke

          #5
          Re: splitting a string into 2 new strings

          trp:
          > I'm, assuming that these are chemical compounds, so you're not limited to
          > one-character symbols.

          The problem is underspecified. Usually 2-character symbols (or 3-character
          ones for some elements with high atomic number, and not assuming the newer
          IUPAC names like "Dubnium", which was also called Unnilpentium (Unp) or,
          depending on your political persuasion, Joliotium (Jl) or Hahnium (Ha))
          have the first letter capitalized and the rest in lower case.

          > re_pat = re.compile('([A-Z]+)(\d+)')

          So this should be written ([A-Z][A-Za-z]*)(\d+), where I explicitly allow
          both lower and upper case trailing letters to be more accepting. (In some
          systems, "CU" is "1 carbon + 1 uranium" and in others it's an alternate way
          to write "1 copper". Though I suspect it's not allowed in the OP's problem.)
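
A minimal sketch of the ([A-Z][A-Za-z]*)(\d+) pattern in use, assuming every symbol carries an explicit count. The function name `split_formula` and the sample formulas are made up for illustration. With the stricter ([A-Z]+)(\d+) pattern, a token such as 'Na2' would not match at all, because the lower-case 'a' sits between the capital and the digits.

```python
import re

# One capital letter, then any further letters, then the count.
SYMBOL_COUNT_RE = re.compile(r'([A-Z][A-Za-z]*)(\d+)')


def split_formula(text):
    """Return (symbols, counts) as two space-separated strings."""
    pairs = SYMBOL_COUNT_RE.findall(text)
    symbols = ' '.join(symbol for symbol, _ in pairs)
    counts = ' '.join(count for _, count in pairs)
    return symbols, counts


print(split_formula('Na2 S1 O4'))  # ('Na S O', '2 1 4')
print(split_formula('C6 H12 O6'))  # ('C H O', '6 12 6')
```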

          Andrew
          dalke@dalkescientific.com



          • Andrew Dalke

            #6
            Re: splitting a string into 2 new strings

            Anton Vredegoor:
            > The issue seems to be resolved already, but I haven't seen the split
            > and strip combination:
            >
            > from string import letters,digits

            Use "ascii_letters" instead of "letters". The latter is based on the
            locale, so it might not work on some machines where "C" (or rather,
            byte 67) isn't a letter in the local alphabet.
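
Anton's code is not quoted in full here, so the following is only a guess at what a split-and-strip approach might look like, using `ascii_letters` as suggested. The function name `split_formula` is hypothetical. As a side note, `string.letters` no longer exists in Python 3, so `ascii_letters` is the portable choice there as well.

```python
from string import ascii_letters, digits


def split_formula(text):
    """Split 'C6 H12 O6' into ('C H O', '6 12 6') without regular expressions."""
    symbols = []
    counts = []
    for token in text.split():
        symbols.append(token.rstrip(digits))        # drop the trailing digits
        counts.append(token.lstrip(ascii_letters))  # drop the leading letters
    return ' '.join(symbols), ' '.join(counts)


print(split_formula('C6 H12 O6'))  # ('C H O', '6 12 6')
```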

            Andrew
            dalke@dalkescientific.com

