Regex, HTML string modification

  • anglaissam
    New Member
    • Dec 2009
    • 1

    I have a regex that is designed to help improve the readability of an HTML document.
    Code:
    "(?=((?!<\/?em).)*<\/em>)
    The purpose of this regex is to take " marks out of the emphasis inside <EM> elements, by closing and reopening the <EM> around each one. Example:

    Before: <P>This "is" <EM>a <STRONG>"Test "</STRONG></EM></P>
    After: <P>This "is" <EM>a <STRONG></EM>"<EM>Test</EM>"<EM></STRONG></EM></P>

    Note that the regex only affects " marks inside <EM> elements. My problem is that I need to modify the regex so that it leaves alone the " marks inside tags, such as <EM CLASS="a1"> or <STRONG CLASS="a1">.

    With the current regex those " marks will be modified. Any help in stopping that from happening would be appreciated.
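To make the problem concrete, here is a minimal Python sketch. Python, the `re.IGNORECASE` flag, the helper name `matched_quote_positions`, and the sample string are all assumptions for illustration; the thread does not say which language or regex flavor is in use.

```python
import re

# The regex from the post. IGNORECASE is an assumption so that
# "<EM>" in the markup and "</em>" in the pattern both match.
EM_QUOTE = re.compile(r'"(?=((?!</?em).)*</em>)', re.IGNORECASE)


def matched_quote_positions(html):
    """Return the index of every '"' the regex would modify."""
    return [m.start() for m in EM_QUOTE.finditer(html)]


sample = '<P>This "is" <EM CLASS="a1">a "Test"</EM></P>'
positions = matched_quote_positions(sample)

# Index of the opening quote of the CLASS attribute value.
attr_quote = sample.index('"a1')
print(attr_quote in positions)
```

The quotes around `is` are skipped as intended, but both quotes of `CLASS="a1"` are matched along with the two quotes around `Test`, because nothing in the pattern distinguishes attribute text from content.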
  • chaarmann
    Recognized Expert Contributor
    • Nov 2007
    • 785

    #2
    The regular expression you gave uses a negative lookahead to check whether an "<em" or "</em" appears between the double quotation mark and the next "</em>". If one does, the quote is not matched; otherwise it is. This approach has several flaws:
    - It doesn't account for "em" tags nested inside other "em" tags. For example, the quotes in '<em> "hello" <em> you </em> </em>' would not be matched.
    - It doesn't account for an "<em" that is not an "em" tag. For example, the quotes in '<em>"hello"<area>let x<emap</area></em>' would not be matched.
    - The error you found yourself: it matches quotes inside tags.
    - A match inside a child element destroys the HTML structure, because the resulting tags overlap instead of nesting properly. Look at your "After:" line: the "strong" and "em" tags are now illegally interleaved!
    - It corrupts embedded JavaScript: the quotes in '<em><script>x= "Hello"</script></em>' would be matched!
    - There are some more errors that I have no time to describe now.
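Several of the cases above can be checked directly. The following is a small Python sketch; the language, the `re.IGNORECASE` flag, and the helper name `count_matches` are assumptions, while the test strings are the ones from the list.

```python
import re

# The original regex, compiled case-insensitively (an assumption).
EM_QUOTE = re.compile(r'"(?=((?!</?em).)*</em>)', re.IGNORECASE)


def count_matches(html):
    """Count the '"' characters the regex would modify."""
    return sum(1 for _ in EM_QUOTE.finditer(html))


nested = '<em> "hello" <em> you </em> </em>'
lookalike = '<em>"hello"<area>let x<emap</area></em>'
script = '<em><script>x= "Hello"</script></em>'

for label, html in [("nested", nested), ("lookalike", lookalike), ("script", script)]:
    print(label, count_matches(html))
```

In the first two strings the quotes sit inside an `<em>` yet are missed, because the lookahead gives up at the next `<em`. In the third, both quotes of the JavaScript string literal are matched.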

    Patching the erroneous expression into something new and error-free is very difficult and very lengthy, and the result would be even harder for other programmers to understand. One idea to fix your "no replacement inside tags" problem is to count the number of ">" between the double quotation mark and the "</em>". If it is odd, then you know you are still inside a tag; if it is even, you are outside and allowed to match.
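A related check that fits into a single pattern is sketched below in Python. It is a swapped-in variation, not the parity count itself: a quote is treated as being inside a tag when the next angle bracket after it is ">" rather than "<". This is used here because a complete tag such as "</STRONG>" between the quote and the "</em>" also contributes a ">" and would upset a plain count. The extra lookahead `(?![^<>]*>)`, the `re.IGNORECASE` flag, and the sample string are assumptions.

```python
import re

# Original pattern plus one extra negative lookahead:
# (?![^<>]*>) rejects a quote when the next angle bracket after it
# is '>', which means the quote sits inside a tag.
GUARDED = re.compile(r'"(?![^<>]*>)(?=((?!</?em).)*</em>)', re.IGNORECASE)


def guarded_positions(html):
    """Return the index of every '"' that the guarded regex matches."""
    return [m.start() for m in GUARDED.finditer(html)]


sample = '<P>This "is" <EM CLASS="a1">a "Test"</EM></P>'
print(guarded_positions(sample))
```

On this sample only the two quotes around `Test` are matched. The guard still misfires if a literal ">" appears in text content or inside an attribute value, and it fixes none of the other flaws listed above.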

    So I would recommend a different and safer route to a solution:
    Read the whole HTML page with a DOM parser. Then walk through the DOM object and search for "<em>" nodes. Look for text (not attributes or other nodes) inside them and search for double quotation marks inside this text. Then replace as you wish.
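The parser-based idea can be sketched with Python's standard library. Note the assumptions: `html.parser.HTMLParser` is an event-based parser, used here as a stand-in for a full DOM parser; the class name `EmQuoteRewriter`, the function `rewrite_em_quotes`, and the `&quot;` replacement are hypothetical choices for illustration, not something taken from the thread.

```python
from html.parser import HTMLParser

RAW_TEXT_TAGS = {"script", "style"}  # content here is code, not prose


class EmQuoteRewriter(HTMLParser):
    """Rebuild the document, rewriting '"' only in text inside <em>."""

    def __init__(self, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.replacement = replacement  # hypothetical: what each '"' becomes
        self.em_depth = 0               # > 0 while inside one or more <em>
        self.in_raw = False             # True while inside <script>/<style>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.em_depth += 1
        if tag in RAW_TEXT_TAGS:
            self.in_raw = True
        # Emit the tag exactly as written, so attribute quotes are untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "em" and self.em_depth > 0:
            self.em_depth -= 1
        if tag in RAW_TEXT_TAGS:
            self.in_raw = False
        self.out.append(f"</{tag}>")  # the parser reports names in lowercase

    def handle_data(self, data):
        if self.em_depth > 0 and not self.in_raw:
            data = data.replace('"', self.replacement)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_em_quotes(html, replacement="&quot;"):
    parser = EmQuoteRewriter(replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rewrite_em_quotes('<P>This "is" <EM CLASS="a1">a <STRONG>"Test "</STRONG></EM></P>'))
```

Because the replacement is applied only to text events, attribute values and script bodies are never touched, and nested `<em>` elements are handled by the depth counter. End tags come back in lowercase in this sketch, and a true DOM library would be the better choice if the tree needs restructuring rather than a simple text substitution.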
