regex failing

  • noon

    regex failing

    I'm running an XMLHttpRequest to get the site's source code and then
    applying the regex

    xhr.responseText.split(/<body[^>]*>((?:.|\n)*)<\/body>/i)[1]

    Works for google.com. Fails on yahoo.com and imdb.com pages (ex:
    http://imdb.com/title/tt0482606/ )

    Can someone help me tweak this, or give insight as to why it's
    failing? I can't spot it.
  • Erwin Moller

    #2
    Re: regex failing

    noon wrote:
    > I'm running an XMLHttpRequest to get the site's source code and then
    > applying the regex
    >
    > xhr.responseText.split(/<body[^>]*>((?:.|\n)*)<\/body>/i)[1]
    >
    > Works for google.com. Fails on yahoo.com and imdb.com pages (ex:
    > http://imdb.com/title/tt0482606/ )
    >
    > Can someone help me tweak this, or give insight as to why it's
    > failing? I can't spot it.

    Maybe...
    You didn't mention what it is you WANT your regex to do.
    And you didn't say what 'failing' is. An error? An unexpected result?

    Regards,
    Erwin Moller


    • noon

      #3
      Re: regex failing

      That information might help, huh. I want it to strip everything
      in between the body tags. The error was that I was receiving either
      nothing or the entire HTML, including the head tags etc. I have since
      got it working, it seems, with this code:

      xhr.responseText.split(/<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi)[1];

      Though improvement suggestions are welcome
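
      One plausible explanation for the earlier failure, offered as a sketch
      and not as a confirmed diagnosis: in JavaScript `.` does not match the
      line terminators \n, \r, \u2028 and \u2029, so a pattern that only adds
      \n stops at the first \r. The sample page below is hypothetical; it
      merely assumes the failing sites served CRLF line endings, which was not
      verified in this thread.

```javascript
// Hypothetical sample input: a tiny page that uses CRLF line endings.
const crlfPage =
  '<html><head></head><body class="x">\r\nline one\r\nline two\r\n</body></html>';

// First attempt from the thread: any character matched by `.`, or \n.
const firstPattern = /<body[^>]*>((?:.|\n)*)<\/body>/i;

// Later attempt from the thread: also allows \r, \u2028 and \u2029.
const laterPattern = /<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi;

// When the pattern does not match, split() returns the whole string as the
// only element, so index 1 is undefined.
const firstResult = crlfPage.split(firstPattern)[1];

// When it matches, the captured body text is spliced in at index 1.
const laterResult = crlfPage.split(laterPattern)[1];

console.log(firstResult);                 // undefined
console.log(JSON.stringify(laterResult)); // "\r\nline one\r\nline two\r\n"
```

      If that assumption holds, it would account for "receiving nothing" on
      some pages while google.com worked.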


      • Thomas 'PointedEars' Lahn

        #4
        Re: regex failing

        noon wrote:
        > That information might help, huh. I want it to strip everything
        > in between the body tags. The error was that I was receiving either
        > nothing or the entire HTML, including the head tags etc. I have since
        > got it working, it seems, with this code:
        >
        > xhr.responseText.split(/<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi)[1];

        With

        foo<body>...</body>bar

        this would give you

        ...

        But you wanted to *strip* everything *in between*, _not_ split.

        > Though improvement suggestions are welcome

        ... = xhr.responseText.match(/<body(|\s+[^>]*)>((.|\s)*)<\/body>/i)[2];

        is largely equivalent to your code in this case and more efficient.
        However, IMHO that is still _not_ stripping everything in between but
        *matching* everything in between, which is probably what you meant to say.
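
        To make the split/match distinction concrete, here is a minimal
        sketch using the sample string from this post. It shows where each
        capture group ends up: in the three-group match() pattern the body
        text lands in group 2, because group 1 holds the attribute text. The
        last pattern, with a non-capturing attribute group and `[\s\S]`, is
        only one possible variant and is not taken from the thread; the
        names `sample`, `parts`, `m` and `body` are illustrative.

```javascript
const sample = "foo<body>...</body>bar"; // sample string from the post above

// split() splices every capture group of the separator into the result.
const parts = sample.split(/<body[^>]*>((.|\n|\r|\u2028|\u2029)*)<\/body>/gi);
console.log(parts); // ["foo", "...", ".", "bar"]

// match() without the g flag returns the full match followed by each group:
// m[1] = attribute text (empty here), m[2] = body text, m[3] = last character.
const m = sample.match(/<body(|\s+[^>]*)>((.|\s)*)<\/body>/i);
console.log(m[1] === "", m[2]); // true "..."

// Possible variant (an assumption, not from the thread): a non-capturing
// attribute group plus [\s\S], which matches any character including line
// terminators, leaves the body text in group 1.
const found = sample.match(/<body(?:\s[^>]*)?>([\s\S]*)<\/body>/i);
const body = found ? found[1] : null; // match() returns null when nothing matches
console.log(body); // "..."
```

        Guarding against a null result matters here, since indexing straight
        into match(...) throws when the page has no body element.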

        Note that (X)HTML is a context-sensitive language which cannot be parsed
        with one regular expression (defining a regular language) alone. In your
        case it should work because a Valid (X)HTML document MUST NOT have more
        than one `body' element.


        PointedEars
        --
        var bugRiddenCrashPronePieceOfJunk = (
        navigator.userAgent.indexOf('MSIE 5') != -1
        && navigator.userAgent.indexOf('Mac') != -1
        ) // Plone, register_function.js:16
