RegEx conditional search and replace

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Martin Evans

    RegEx conditional search and replace

    Sorry, yet another REGEX question. I've been struggling with trying to get
    a regular expression to do the following example in Python:

    Search and replace all instances of "sleeping" with "dead".

    This parrot is sleeping. Really, it is sleeping.
    to
    This parrot is dead. Really, it is dead.


    But not if part of a link or inside a link:

    This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
    is sleeping.
    to
    This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
    is dead.


    This is the full extent of the "html" that would be seen in the text, the
    rest of the page has already been processed. Luckily I can rely on the
    formating always being consistent with the above example (the url will
    normally by much longer in reality though). There may though be more than
    one link present.

    I'm hoping to use this to implement the automatic addition of links to other
    areas of a website based on keywords found in the text.

    I'm guessing this is a bit too much to ask for regex. If this is the case,
    I'll add some more manual Python parsing to the string, but was hoping to
    use it to learn more about regex.

    Any pointers would be appreciated.

    Martin



  • Juho Schultz

    #2
    Re: RegEx conditional search and replace

    Martin Evans wrote:
    Sorry, yet another REGEX question. I've been struggling with trying to get
    a regular expression to do the following example in Python:
    >
    Search and replace all instances of "sleeping" with "dead".
    >
    This parrot is sleeping. Really, it is sleeping.
    to
    This parrot is dead. Really, it is dead.
    >
    >
    But not if part of a link or inside a link:
    >
    This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
    is sleeping.
    to
    This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
    is dead.
    >
    >
    This is the full extent of the "html" that would be seen in the text, the
    rest of the page has already been processed. Luckily I can rely on the
    formating always being consistent with the above example (the url will
    normally by much longer in reality though). There may though be more than
    one link present.
    >
    I'm hoping to use this to implement the automatic addition of links to other
    areas of a website based on keywords found in the text.
    >
    I'm guessing this is a bit too much to ask for regex. If this is the case,
    I'll add some more manual Python parsing to the string, but was hoping to
    use it to learn more about regex.
    >
    Any pointers would be appreciated.
    >
    Martin
    What you want is:

    re.sub(regex, replacement, instring)
    The replacement can be a function. So use a function.

    def sleeping_to_dea d(inmatch):
    instr = inmatch.group(0 )
    if needsfixing(ins tr):
    return instr.replace(' sleeping','dead ')
    else:
    return instr

    as for the regex, something like
    (<a)?[^<>]*(</a>)?
    could be a start. It is probaly better to use the regex to recognize
    the links as you might have something like
    sleeping.sleepi ng/sleeping/sleeping.html in your urls. Also you
    probably want to do many fixes, so you can put them all within the same
    replacer function.

    Comment

    • Martin Evans

      #3
      Re: RegEx conditional search and replace

      "Juho Schultz" <juho.schultz@p p.inet.fiwrote in message
      news:1152125582 .577222.95010@l 70g2000cwa.goog legroups.com...
      Martin Evans wrote:
      >Sorry, yet another REGEX question. I've been struggling with trying to
      >get
      >a regular expression to do the following example in Python:
      >>
      >Search and replace all instances of "sleeping" with "dead".
      >>
      >This parrot is sleeping. Really, it is sleeping.
      >to
      >This parrot is dead. Really, it is dead.
      >>
      >>
      >But not if part of a link or inside a link:
      >>
      >This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
      >it
      >is sleeping.
      >to
      >This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
      >it
      >is dead.
      >>
      >>
      >This is the full extent of the "html" that would be seen in the text, the
      >rest of the page has already been processed. Luckily I can rely on the
      >formating always being consistent with the above example (the url will
      >normally by much longer in reality though). There may though be more than
      >one link present.
      >>
      >I'm hoping to use this to implement the automatic addition of links to
      >other
      >areas of a website based on keywords found in the text.
      >>
      >I'm guessing this is a bit too much to ask for regex. If this is the
      >case,
      >I'll add some more manual Python parsing to the string, but was hoping to
      >use it to learn more about regex.
      >>
      >Any pointers would be appreciated.
      >>
      >Martin
      >
      What you want is:
      >
      re.sub(regex, replacement, instring)
      The replacement can be a function. So use a function.
      >
      def sleeping_to_dea d(inmatch):
      instr = inmatch.group(0 )
      if needsfixing(ins tr):
      return instr.replace(' sleeping','dead ')
      else:
      return instr
      >
      as for the regex, something like
      (<a)?[^<>]*(</a>)?
      could be a start. It is probaly better to use the regex to recognize
      the links as you might have something like
      sleeping.sleepi ng/sleeping/sleeping.html in your urls. Also you
      probably want to do many fixes, so you can put them all within the same
      replacer function.
      Many thanks for that, the function method looks very useful. My first
      working attempt had been to use the regex to locate all links. I then looped
      through replacing each with a numbered dummy entry. Then safely do the
      find/replaces and then replace the dummy entries with the original links. It
      just seems overly inefficient.



      Comment

      • Anthra Norell

        #4
        Re: RegEx conditional search and replace

        >>import SE
        >>Editor = SE.SE ('sleeping=dead sleeping.htm== sleeping<==')
        >>Editor ('This parrot <a href="sleeping. htm" target="new">is
        sleeping</a>. Really, it is sleeping.'
        'This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really, it
        is dead.'
        Or:
        >>Editor ( (name of htm file), (name of output file) )
        Usage: You make an explicit list of what you want and don't want after
        identifying the distinctions.

        I am currently trying to upload SE to the Cheese Shop which seems to be
        quite a procedure. So far I have only been successful uploading the
        description, but not the program. Gudidance welcome. In the interim I can
        send SE out individually by request.

        Regards

        Frederic

        ----- Original Message -----
        From: "Martin Evans" <martin@brown s-nospam.co.uk>
        Newsgroups: comp.lang.pytho n
        To: <python-list@python.org >
        Sent: Wednesday, July 05, 2006 1:34 PM
        Subject: RegEx conditional search and replace

        Sorry, yet another REGEX question. I've been struggling with trying to
        get
        a regular expression to do the following example in Python:
        >
        Search and replace all instances of "sleeping" with "dead".
        >
        This parrot is sleeping. Really, it is sleeping.
        to
        This parrot is dead. Really, it is dead.
        >
        >
        But not if part of a link or inside a link:
        >
        This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
        it
        is sleeping.
        to
        This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
        it
        is dead.
        >
        >
        This is the full extent of the "html" that would be seen in the text, the
        rest of the page has already been processed. Luckily I can rely on the
        formating always being consistent with the above example (the url will
        normally by much longer in reality though). There may though be more than
        one link present.
        >
        I'm hoping to use this to implement the automatic addition of links to
        other
        areas of a website based on keywords found in the text.
        >
        I'm guessing this is a bit too much to ask for regex. If this is the case,
        I'll add some more manual Python parsing to the string, but was hoping to
        use it to learn more about regex.
        >
        Any pointers would be appreciated.
        >
        Martin
        >
        >
        >
        --
        http://mail.python.org/mailman/listinfo/python-list

        Comment

        • mbstevens

          #5
          Re: RegEx conditional search and replace

          On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:
          "Juho Schultz" <juho.schultz@p p.inet.fiwrote in message
          news:1152125582 .577222.95010@l 70g2000cwa.goog legroups.com...
          >Martin Evans wrote:
          >>Sorry, yet another REGEX question. I've been struggling with trying to
          >>get
          >>a regular expression to do the following example in Python:
          >>>
          >>Search and replace all instances of "sleeping" with "dead".
          >>>
          >>This parrot is sleeping. Really, it is sleeping.
          >>to
          >>This parrot is dead. Really, it is dead.
          >>>
          >>>
          >>But not if part of a link or inside a link:
          >>>
          >>This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
          >>it
          >>is sleeping.
          >>to
          >>This parrot <a href="sleeping. htm" target="new">is sleeping</a>. Really,
          >>it
          >>is dead.
          >>>
          >>>
          >>This is the full extent of the "html" that would be seen in the text, the
          >>rest of the page has already been processed. Luckily I can rely on the
          >>formating always being consistent with the above example (the url will
          >>normally by much longer in reality though). There may though be more than
          >>one link present.
          >>>
          >>I'm hoping to use this to implement the automatic addition of links to
          >>other
          >>areas of a website based on keywords found in the text.
          >>>
          >>I'm guessing this is a bit too much to ask for regex. If this is the
          >>case,
          >>I'll add some more manual Python parsing to the string, but was hoping to
          >>use it to learn more about regex.
          >>>
          >>Any pointers would be appreciated.
          >>>
          >>Martin
          >>
          >What you want is:
          >>
          >re.sub(regex , replacement, instring)
          >The replacement can be a function. So use a function.
          >>
          >def sleeping_to_dea d(inmatch):
          > instr = inmatch.group(0 )
          > if needsfixing(ins tr):
          > return instr.replace(' sleeping','dead ')
          > else:
          > return instr
          >>
          >as for the regex, something like
          >(<a)?[^<>]*(</a>)?
          >could be a start. It is probaly better to use the regex to recognize
          >the links as you might have something like
          >sleeping.sleep ing/sleeping/sleeping.html in your urls. Also you
          >probably want to do many fixes, so you can put them all within the same
          >replacer function.
          >
          ... My first
          working attempt had been to use the regex to locate all links. I then looped
          through replacing each with a numbered dummy entry. Then safely do the
          find/replaces and then replace the dummy entries with the original links. It
          just seems overly inefficient.
          Someone may have made use of
          multiline links:

          _______________ __________
          This parrot
          <a
          href="sleeping. htm"
          target="new" >
          is sleeping
          </a>.
          Really, it is sleeping.
          _______________ __________


          In such a case you may need to make the page
          into one string to search if you don't want to use some complex
          method of tracking state with variables as you move from
          string to string. You'll also have to make it possible
          for non-printing characters to have been inserted in all sorts
          of ways around the '>' and '<' and 'a' or 'A'
          characters using ' *' here and there in he regex.



          Comment

          • Martin Evans

            #6
            Re: RegEx conditional search and replace

            "mbstevens" <NOXwebmasterX@ XmbstevensX.com wrote in message
            news:pan.2006.0 7.06.12.19.42.9 43014@Xmbsteven sX.com...
            On Thu, 06 Jul 2006 08:32:46 +0100, Martin Evans wrote:
            >
            >"Juho Schultz" <juho.schultz@p p.inet.fiwrote in message
            >news:115212558 2.577222.95010@ l70g2000cwa.goo glegroups.com.. .
            >>Martin Evans wrote:
            >>>Sorry, yet another REGEX question. I've been struggling with trying to
            >>>get
            >>>a regular expression to do the following example in Python:
            >>>>
            >>>Search and replace all instances of "sleeping" with "dead".
            >>>>
            >>>This parrot is sleeping. Really, it is sleeping.
            >>>to
            >>>This parrot is dead. Really, it is dead.
            >>>>
            >>>>
            >>>But not if part of a link or inside a link:
            >>>>
            >>>This parrot <a href="sleeping. htm" target="new">is sleeping</a>.
            >>>Really,
            >>>it
            >>>is sleeping.
            >>>to
            >>>This parrot <a href="sleeping. htm" target="new">is sleeping</a>.
            >>>Really,
            >>>it
            >>>is dead.
            >>>>
            >>>>
            >>>This is the full extent of the "html" that would be seen in the text,
            >>>the
            >>>rest of the page has already been processed. Luckily I can rely on the
            >>>formating always being consistent with the above example (the url will
            >>>normally by much longer in reality though). There may though be more
            >>>than
            >>>one link present.
            >>>>
            >>>I'm hoping to use this to implement the automatic addition of links to
            >>>other
            >>>areas of a website based on keywords found in the text.
            >>>>
            >>>I'm guessing this is a bit too much to ask for regex. If this is the
            >>>case,
            >>>I'll add some more manual Python parsing to the string, but was hoping
            >>>to
            >>>use it to learn more about regex.
            >>>>
            >>>Any pointers would be appreciated.
            >>>>
            >>>Martin
            >>>
            >>What you want is:
            >>>
            >>re.sub(rege x, replacement, instring)
            >>The replacement can be a function. So use a function.
            >>>
            >>def sleeping_to_dea d(inmatch):
            >> instr = inmatch.group(0 )
            >> if needsfixing(ins tr):
            >> return instr.replace(' sleeping','dead ')
            >> else:
            >> return instr
            >>>
            >>as for the regex, something like
            >>(<a)?[^<>]*(</a>)?
            >>could be a start. It is probaly better to use the regex to recognize
            >>the links as you might have something like
            >>sleeping.slee ping/sleeping/sleeping.html in your urls. Also you
            >>probably want to do many fixes, so you can put them all within the same
            >>replacer function.
            >>
            >... My first
            >working attempt had been to use the regex to locate all links. I then
            >looped
            >through replacing each with a numbered dummy entry. Then safely do the
            >find/replaces and then replace the dummy entries with the original links.
            >It
            >just seems overly inefficient.
            >
            Someone may have made use of
            multiline links:
            >
            _______________ __________
            This parrot
            <a
            href="sleeping. htm"
            target="new" >
            is sleeping
            </a>.
            Really, it is sleeping.
            _______________ __________
            >
            >
            In such a case you may need to make the page
            into one string to search if you don't want to use some complex
            method of tracking state with variables as you move from
            string to string. You'll also have to make it possible
            for non-printing characters to have been inserted in all sorts
            of ways around the '>' and '<' and 'a' or 'A'
            characters using ' *' here and there in he regex.
            I agree, but luckily in this case the HREF will always be formated the same
            as it happens to be inserted by another automated system, not a user. Ok it
            be changed by the server but AFAIK it hasn't for the last 6 years. The text
            in question apart from the links should be plain (not even <bis allowed I
            think).

            I'd read about back and forward regex matching and thought it might somehow
            be of use here ie back search for the "<A" and forward search for the "</A>"
            but of course this would easily match in between two links.



            Comment

            • Blair P. Houghton

              #7
              Re: RegEx conditional search and replace


              mbstevens wrote:
              In such a case you may need to make the page
              into one string to search if you don't want to use some complex
              method of tracking state with variables as you move from
              string to string.
              In general it's a very hard problem to do stateful regexes.

              I recall something from last year about the new Perl implementation
              that tried to address this sort of problem. But I may have been
              reading old docs and it could have been done years ago.

              Parsing the HTML would be the only sure way to accomplish
              it. Let something that already knows the hierarchy tell you
              that you're entering a URL and you can skip past all of its
              recursive inclusions of strings with URLs with strings that
              have URLs and so on...

              Of course, that means reconstructing the HTML from the
              parse tree afterward...

              --Blair

              Comment

              Working...