Tough Regular Expression problem

  • Bryan

    Tough Regular Expression problem

    Hi All:

    I'm trying to find the right Regexp string to remove empty SPAN tags
    from an HTML string.

    Say I have a string like so, and I want to remove the empty span tags:

    <span>This is my text</span>

    A simple expression like this /<SPAN>(.*)?<\/SPAN>/gi will give me the
    text between the two span tags, which I can then use in a replace
    statement.

    This gets much more complicated when we have nested tags, however.
    For example:

    <span style="font-weight: bold">one <span>two <span style="color:
    red">three</span> four</span> five</span>

    What I really want after the replace statement is this:

    <span style="font-weight: bold">one two <span style="color:
    red">three</span> four five</span>

    I'm having trouble crafting the perfect expression for this. I can't
    seem to get my head around the right solution to handle the greedy vs
    non-greedy thing, and not eliminate the wrong closing tag.

    Is this even possible with straight expressions?

    Thanks in advance for any help you can provide!

    Bryan
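
For reference, a minimal sketch of what the one-line replace does to the nested example. Note that `(.*)?` in the pattern above is an optional greedy group, not the lazy `(.*?)`, so both variants are shown. The variable names are made up for illustration, and the sample is written on one line with the closing quote on its first style attribute.

```javascript
// The nested sample, on one line so that "." can cross it.
var nested = '<span style="font-weight: bold">one <span>two ' +
             '<span style="color: red">three</span> four</span> five</span>';

// Greedy: (.*) runs to the LAST </span>, so the outer span loses its closer.
var greedy = nested.replace(/<span>(.*)<\/span>/gi, "$1");

// Lazy: (.*?) stops at the FIRST </span>, which belongs to the red span.
var lazy = nested.replace(/<span>(.*?)<\/span>/gi, "$1");

console.log(greedy);
// <span style="font-weight: bold">one two <span style="color: red">three</span> four</span> five
console.log(lazy);
// <span style="font-weight: bold">one two <span style="color: red">three four</span> five</span>
```

Either way the bare opening tag is paired with the wrong closing tag, which is the mismatch described above.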
  • J. J. Cale

    #2
    Re: Tough Regular Expression problem


    "Bryan" <bryan@chameleon-systems.com> wrote in message
    news:b1d5c20c.0411080900.3417999b@posting.google.com...
    > Hi All:
    >
    > I'm trying to find the right Regexp string to remove empty SPAN tags
    > from an HTML string.

    If you need to remove the element, try the DOM,
    and specifically the childNodes collection.

    <snip>

    > This gets much more complicated when we have nested tags, however.
    > For example:
    > <span style="font-weight: bold">one <span>two <span style="color:
    > red">three</span> four</span> five</span>

    <span style="font-weight: bold"> is the containing element,
    an element node (obj.nodeType == 1).
    First you need a reference to the containing span. Either find it via the
    DOM tree or give it a specific id, <span id="anId">, and use
    var oRef = document.getElementById('anId');
    or whatever you wish to support.
    'one ' is a text node (type 3): oRef.childNodes[0] or oRef.firstChild.
    oRef.childNodes[0].nodeValue is 'one ' (including the trailing space).
    oRef.childNodes[1] is the next span element (type 1), and
    oRef.childNodes[1].firstChild is the text node containing 'two '.
    From here there are a number of ways to deal with this.

    > What I really want after the replace statement is this:
    > <span style="font-weight: bold">one two <span style="color:
    > red">three</span> four five</span>

    Create a new text node, insert it before the span
    you want to delete, and delete the span.
    Or clone the spanToDelete.firstChild node, insert it
    before the span to delete, and delete the span.
    Or copy the span.firstChild.nodeValue, delete the span,
    and append the copied text to the firstSpan.firstChild.nodeValue.
    There are other possibilities.
    Google for DOM Level 2 to see how to do these things correctly.
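
One way the steps above might be put together, as a rough sketch only. It reads "empty" as "has no attributes", moves every child of such a span (not just firstChild) in front of it, and then removes the span. The name `unwrapBareSpans` is hypothetical, and how `attributes` behaves in older browsers has not been checked here.

```javascript
// Hypothetical helper: replaces every attribute-less <span> below `root`
// with its own children. `root` itself is never removed.
function unwrapBareSpans(root) {
  // Snapshot the child list first; childNodes is live and we mutate the tree.
  var children = [];
  for (var i = 0; i < root.childNodes.length; i++) {
    children.push(root.childNodes[i]);
  }
  for (var j = 0; j < children.length; j++) {
    var node = children[j];
    if (node.nodeType !== 1) continue;      // 1 = element; skip text nodes
    unwrapBareSpans(node);                  // clean nested spans first
    var isBareSpan = node.nodeName.toLowerCase() === "span" &&
                     node.attributes.length === 0;
    if (isBareSpan) {
      // insertBefore() moves a node, so this empties the span child by child.
      while (node.firstChild) {
        root.insertBefore(node.firstChild, node);
      }
      root.removeChild(node);
    }
  }
}

// Usage in a browser, assuming the outer span was given id="anId":
// unwrapBareSpans(document.getElementById("anId"));
```

Only childNodes, firstChild, insertBefore and removeChild are used, so this should stay close to what the DOM Level 2 Core documents describe.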
    Hope this helps
    Jimbo



    • Bryan

      #3
      Re: Tough Regular Expression problem

      J. J. Cale wrote...
      > if you need to remove the element try the DOM
      > and specifically the childNodes collection

      Huh. That's an interesting idea. A little more complicated than a
      regexp replace, but it should work. If I can come up with something
      that's cross-browser, I might be able to use that approach.

      Thanks for the idea.


      • Thomas 'PointedEars' Lahn

        #4
        Re: Tough Regular Expression problem

        Bryan wrote:

        > [...]
        > A simple expression like this /<SPAN>(.*)?<\/SPAN>/gi will give me the
        > text between the two span tags, which I can then use in a replace
        > statement.
        >
        > This gets much more complicated when we have nested tags, however.
        > For example:
        >
        > <span style="font-weight: bold">one <span>two <span style="color:
        > red">three</span> four</span> five</span>
        >
        > What I really want after the replace statement is this:
        >
        > <span style="font-weight: bold">one two <span style="color:
        > red">three</span> four five</span>
        >
        > I'm having trouble crafting the perfect expression for this. I can't
        > seem to get my head around the right solution to handle the greedy vs
        > non-greedy thing, and not eliminate the wrong closing tag.
        >
        > Is this even possible with straight expressions?

        No, it is not, by design; or let us say it is not generally possible --
        enough constraints provided (such as that `span' elements may not nest,
        in opposition to the HTML specifications), it may be possible (which
        is why removeTags() exists in my JSX:string.js, BTW).

        AIUI, Regular Expressions require either a DFA or an NFA or both of them
        to be matched against a text (that said, know that because ECMAScript
        implementations like JavaScript and JScript support PCRE alternation,
        they must be using either an NFA or a combination of DFA and NFA to
        match RegExps). However, to parse arbitrary occurrences of open and
        matching close tags, i.e. to recognize a program in a (deterministic)
        context-free language, you require a (N)PDA (which could be implemented
        as a markup parser to build a parse tree which indeed is done in common
        HTML UAs) [1].

        See Jeffrey E. F. Friedl, Mastering Regular Expressions, chapter 4,
        section 'Multi-Character "Quotes"' pp., available online at
        <http://www.oreilly.com/catalog/regex/chapter/ch04.html> for
        further information and possible solutions.
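
The point about needing a stack can be sketched like this: the regular expression only finds the individual span tags, and an explicit stack remembers whether each open span was kept, so that the matching close tag gets the same treatment. This is an illustration under loud assumptions, not the removeTags() mentioned above: the name `stripBareSpans` is hypothetical, attribute values must not contain `>`, and span-like text inside comments or scripts is not handled.

```javascript
// Hypothetical helper: drops <span> tags that carry no attributes,
// together with their matching </span>, and leaves everything else alone.
function stripBareSpans(html) {
  var stack = [];  // one boolean per open span: true if its tags are kept
  return html.replace(/<span\b([^>]*)>|<\/span\s*>/gi, function (tag, attrs) {
    if (tag.charAt(1) === "/") {
      // Closing tag: it shares the fate of the most recent open span.
      // An unmatched closer (empty stack) is left untouched.
      var kept = stack.length ? stack.pop() : true;
      return kept ? tag : "";
    }
    // Opening tag: keep it only if something other than whitespace follows "span".
    var keep = /\S/.test(attrs);
    stack.push(keep);
    return keep ? tag : "";
  });
}

var sample = '<span style="font-weight: bold">one <span>two <span style="color:\n' +
             'red">three</span> four</span> five</span>';
console.log(stripBareSpans(sample));
```

The regular expression still does the tokenizing here; the nesting is tracked by the stack in the callback, which is the part a plain pattern cannot express in general.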


        PointedEars
        ___________
        [1] It has been a while since my lectures in automata theory, please CMIIW.
        --
        "Nothing makes you appreciate the weekend like idiots."
        -- Jen
