problem with regex, how to conclude more than one character

This topic is closed.
  • tecspring@gmail.com

    problem with regex, how to conclude more than one character

    I never know how to express "exclude an entire word" with a regexp,
    and while using Python I ran into this problem again.

    For example, if I want to match "string" in "test a string",
    re.findall(r"[^a]* (\w+)","test a string") finds it. But what if
    the word is not "a" but "an" ("test an string")? [^an] will fail,
    because it stops at the first character, "a".
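
The distinction described above can be sketched as follows. This is only an illustration, not part of the original post: the variable names are made up, and the sample text is the "test an string" case from the question. It shows that a character class always works on single characters, while a negative lookahead, `(?!...)`, can refuse a whole word.

```python
import re

text = "test an string"

# A character class matches ONE character from a set: [^an] means
# "any single character that is neither 'a' nor 'n'", not "anything but 'an'".
pieces = re.findall(r"[^an]+", text)
print(pieces)

# A negative lookahead, (?!...), can refuse a whole word at a position.
# (?:(?!an).)* consumes one character at a time while "an" does not start there.
before_an = re.match(r"(?:(?!an).)*", text).group(0)
print(repr(before_an))

# Often it is simpler to name the word literally and capture what follows it.
after_an = re.findall(r"\ban (\w+)", text)
print(after_an)
```

The first call splits the text at every "a" and every "n", which is why `[^an]` cannot stand in for "not the word an".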

    I guess people do not usually filter words this way?
    Here is the real problem I ran into:
    I want to extract both the text inside each "<td>" block and the
    title attribute of the "<span>".
    ###################### code ######################
    import re

    # One table row. The line wrapping from the post is removed, so the
    # row is a single line built by implicit string concatenation.
    content = (
        '<tr align="center" valign="middle" class="CellCss" >'
        '<td valign="middle" >LA</td>'
        '<td valign="middle" >11/10/2008</td>'
        '<td valign="middle" >1340/1430</td>'
        '<td valign="middle" >PF1/5</td>'
        '<td valign="middle" ><span title="Understanding the stock market" '
        'class="MouseCursor">Understand....</span></td>'
        '<td title="Charisma" valign="middle" >Charisma</td>'
        '<td valign="middle" >Booked</td>'
        '<td valign="middle" >'
    )

    # Four plain cells, then the title attribute of the <span>.
    pattern = (
        r'<td valign="middle" >([^<]+)</td>'
        r'<td valign="middle" >([^<]+)</td>'
        r'<td valign="middle" >([^<]+)</td>'
        r'<td valign="middle" >([^<]+)</td>'
        r'<td valign="middle" ><span title="([^"]*)"'
    )
    print(re.findall(pattern, content))
    ###################### code end ######################
    As you can see above, the result is
    "LA,11/10/2008,1340/1430,PF1/5,Understanding the stock market".
    There are two title attributes, but my regexp only gets the first
    one, on the "<span>".
    For the second, which should be "Charisma", I need something like
    [^</td>]* to match 'class="MouseCursor">Understand....</span></td>',
    and then I can continue matching the second title.

    Maybe I did not describe this clearly; if so, feel free to tell me :)
    Thanks for any replies!
  • tecspring@gmail.com

    #2
    Re: problem with regex, how to conclude more than one character

    By the way, I have tried both (!</td>) and (?:!</td>), and many other
    ways, but none of them work... so sad...
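
For reference, neither `(!</td>)` nor `(?:!</td>)` is a negation: both are ordinary groups that match the literal text `!</td>`. The negative lookahead in Python's `re` module is spelled `(?!...)`. Below is a minimal sketch of how it could be used to skip to the end of a cell. The `content` string is a trimmed copy of the row from the first post, and the names `pattern` and `titles` are made up for the illustration.

```python
import re

# Trimmed copy of the row from the post (only the two cells that matter here).
content = (
    '<td valign="middle" ><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle" >Charisma</td>'
)

pattern = (
    r'<span title="([^"]*)"'    # title on the <span>
    r'(?:(?!</td>).)*</td>'     # everything up to and including the cell's </td>
    r'<td title="([^"]*)"'      # title on the following <td>
)

titles = re.findall(pattern, content)
print(titles)
```

In this particular case the non-greedy form `.*?</td>` would skip the same text and is easier to read. The lookahead version is shown because it is the direct answer to "anything except this whole string".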

    • Chris Rebert

      #3
      Re: problem with regex, how to conclude more than one character

      On Thu, Nov 6, 2008 at 11:06 PM, <tecspring@gmail.com> wrote:
      > I want to filter the text both in "<td>" block and the "<span>"'s
      > title attribute

      Is there any particularly good reason why you're using regexps for
      this rather than, say, an actual (X)HTML parser?

      Cheers,
      Chris
      --
      Follow the path of the Iguana...
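
As an illustration of the parser route suggested in this reply, here is a minimal sketch using `html.parser.HTMLParser` from the Python 3 standard library. It is one possible approach, not something posted in the thread: the class name `RowParser` and its attributes are hypothetical, and `content` is a trimmed, properly closed copy of the row from the first post.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of every <td> and every title attribute."""

    def __init__(self):
        super().__init__()
        self.cells = []     # text found inside each <td>
        self.titles = []    # value of every title="..." attribute
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")
        title = dict(attrs).get("title")
        if title is not None:
            self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text inside nested tags such as <span> still belongs to the cell.
        if self._in_td:
            self.cells[-1] += data


# Trimmed copy of the row from the post, with the row closed properly.
content = (
    '<tr align="center" valign="middle" class="CellCss" >'
    '<td valign="middle" >LA</td>'
    '<td valign="middle" >11/10/2008</td>'
    '<td valign="middle" >1340/1430</td>'
    '<td valign="middle" >PF1/5</td>'
    '<td valign="middle" ><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle" >Charisma</td>'
    '<td valign="middle" >Booked</td></tr>'
)

parser = RowParser()
parser.feed(content)
parser.close()
print(parser.cells)
print(parser.titles)
```

The parser does not care which tag carries the `title` attribute or how the attributes are ordered, which is the main advantage over a pattern that spells out the whole row.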


      • tecspring@gmail.com

        #4
        Re: problem with regex, how to conclude more than one character

        On Nov 7, 3:13 pm, "Chris Rebert" <c...@rebertia.com> wrote:
        > Is there any particularly good reason why you're using regexps for
        > this rather than, say, an actual (X)HTML parser?

        Thanks a lot for the quick reply, Chris!
        Actually I tried BeautifulSoup and it's great.
        But I'm not very familiar with it, and it needs more code to parse
        the HTML and get the right text.
        I think a regexp is more convenient if there is a way to extract the
        list in just one line :)
        I got this far but am stuck here...
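
If a short regexp is the goal, one option is to stop describing the whole row in a single pattern and instead run one small `re.findall` per kind of item. The sketch below is an illustration under that assumption, not a reply from the thread. `content` is a trimmed copy of the row from the first post, and `cells` and `titles` are made-up names.

```python
import re

# Trimmed copy of the row from the post, joined into a single line.
content = (
    '<tr align="center" valign="middle" class="CellCss" >'
    '<td valign="middle" >LA</td>'
    '<td valign="middle" >11/10/2008</td>'
    '<td valign="middle" >1340/1430</td>'
    '<td valign="middle" >PF1/5</td>'
    '<td valign="middle" ><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle" >Charisma</td>'
    '<td valign="middle" >Booked</td>'
)

# Text of every cell that contains plain text only (no nested tag).
cells = re.findall(r'<td[^>]*>([^<]+)</td>', content)

# Every title attribute, whichever tag it sits on.
titles = re.findall(r'title="([^"]*)"', content)

print(cells)
print(titles)
```

Note that the cell holding the `<span>` is skipped by the first pattern, because its content starts with a tag instead of text. Patterns like these depend on the exact markup, so they are only reasonable for a page whose layout is known and stable.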
