Another RegEx Problem

  • Andrew Dixon

    Another RegEx Problem

    Hi Everyone.

    OK, I have a problem getting the following regex to work in Java.

    <script[^>]*>(.|\r|\n)+?</script>

    It works fine in EditPad Pro, but in Java it causes the following error
    message when run:

    'Exception in thread "main" java.lang.StackOverflowError'

    Any ideas why?

    Best Regards
    >>> Andrew Dixon


  • hiwa

    #2
    Re: Another RegEx Problem

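    (1) Use the DOTALL flag instead of \r|\n etc. (MULTILINE only changes where ^ and $ match).
    (2) Use a non-capturing group, (?:...), instead of capturing parentheses.
    (3) Inside that group, guard against quotes, so that a </script> inside a quoted string is skipped.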
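
    A minimal sketch of points (1) and (2), assuming the goal is simply to grab each script block. With Pattern.DOTALL the dot already matches \r and \n, so the (.|\r|\n) group is not needed at all. That per-character alternation inside a repeated group is what is commonly blamed for the StackOverflowError on long inputs. The class and method names are made up for illustration, and CASE_INSENSITIVE is an extra assumption that does not come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptBlockDemo {
    // DOTALL lets '.' match line terminators, so no (.|\r|\n) group is needed.
    // CASE_INSENSITIVE is an added assumption so that <SCRIPT> is matched too.
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script[^>]*>.*?</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns every <script ...>...</script> block found in the given HTML text.
    static List<String> findScriptBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = SCRIPT_BLOCK.matcher(html);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<p>a</p><script type=\"text/javascript\">\n"
                + "var x = 1;\r\nvar y = 2;\n</script><p>b</p>";
        for (String block : findScriptBlocks(html)) {
            System.out.println(block);
        }
    }
}
```

    This version does nothing about point (3): a </script> inside a quoted string would still end the match early.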


    • Andrew Dixon

      #3
      Re: Another RegEx Problem

      Hi.

      Sorry, I don't really understand what you mean. Could you show me an
      example or rewrite my expression?

      Thanks.

      --

      Best Regards
      >>> Andrew Dixon

      • hiwa

        #4
        Re: Another RegEx Problem

        Here is a simple example. Hope this helps.
        <code>
        import java.io.*;
        import java.nio.*;
        import java.nio.channels.*;
        import java.util.regex.*;

        public class TagBodyExtractor {

            public static void main(String[] args) {
                String tagId, closingTag, inFileName;
                boolean bodyOnly;

                if (args.length < 1) {
                    System.err.println("USAGE:");
                    System.err.println("java TagBodyExtractor filename");
                    System.err.println("or,");
                    System.err.println("java TagBodyExtractor tagtext filename");
                    System.exit(1);
                }

                if (args.length == 2) {
                    tagId = args[0];
                    inFileName = args[1];
                }
                else {
                    // lower-case the tags in the file before using this program
                    tagId = "script";
                    inFileName = args[0];
                }
                closingTag = "</" + tagId + ">";

                bodyOnly = false; // output both the tags and their bodies

                try {
                    // read the whole file into a String via a memory-mapped buffer
                    FileInputStream fis = new FileInputStream(inFileName);
                    FileChannel fc = fis.getChannel();
                    MappedByteBuffer mbf
                        = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
                    byte[] barray = new byte[(int) fc.size()];
                    mbf.get(barray);
                    String str = new String(barray, "US-ASCII");
                    // or: String str = new String(barray); // platform default charset

                    String match1, match2, match3;
                    // here we assume a syntactically valid HTML file!
                    String regex = "(<" + tagId + "[^>]*>)" // 1st capturing group: opening tag
                        // 2nd capturing group: body; quoted strings are consumed whole
                        + "((?:\"[^\"]*\"|'[^']*'|[^\"'])*?(?="
                        + closingTag + "))"
                        + "(" + closingTag + ")"; // 3rd capturing group: closing tag
                    Pattern pat = Pattern.compile(regex, Pattern.DOTALL | Pattern.MULTILINE);
                    Matcher mat = pat.matcher(str);
                    while (mat.find()) {
                        match1 = mat.group(1);
                        match2 = mat.group(2);
                        match3 = mat.group(3);
                        if (bodyOnly) {
                            System.out.println(match2);
                        }
                        else {
                            System.out.println(match1 + match2 + match3);
                        }
                    }
                    fc.close();
                    fis.close();
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        </code>
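
        To see what the quote guard in the regex above buys, here is a small sketch comparing it with an unguarded pattern on a script whose body contains "</script>" inside a string literal. The class name, the helper method, and the sample input are assumptions for illustration. The guarded pattern also drops the lookahead from the program above for brevity, which should not change what it matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteGuardDemo {
    // Unguarded: the body ends at the first </script>, even inside a string literal.
    static final Pattern NAIVE = Pattern.compile(
            "<script[^>]*>(.*?)</script>", Pattern.DOTALL);

    // Guarded: double- and single-quoted strings are consumed as whole units,
    // so a </script> inside one of them cannot end the body.
    static final Pattern GUARDED = Pattern.compile(
            "<script[^>]*>((?:\"[^\"]*\"|'[^']*'|[^\"'])*?)</script>");

    // Returns the body of the first script block, or null if there is none.
    static String firstBody(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>document.write(\"</script>\");</script>";
        System.out.println(firstBody(NAIVE, html));   // prints: document.write("
        System.out.println(firstBody(GUARDED, html)); // prints: document.write("</script>");
    }
}
```

        The guard relies on quotes being balanced, in line with the "syntax-error-free" assumption in the program: a lone apostrophe in a script comment, for example, would be read as the start of a quoted string.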


        • hiwa

          #5
          Re: Another RegEx Problem

          Note: This particular regex string happens to contain no dot '.',
          line-start '^', or line-end '$' operators, so the DOTALL and
          MULTILINE flags aren't necessary in this case. Specifying them is
          still good practice when you handle a multi-line document in a
          single matcher loop.
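
          For reference, a short sketch of what the two flags actually change, using a made-up two-line input. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFlagDemo {
    // Counts how many times the pattern is found in the text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";

        // DOTALL: without it, '.' does not match the line terminator.
        System.out.println(countMatches(
                Pattern.compile("first.*second"), text));                 // prints 0
        System.out.println(countMatches(
                Pattern.compile("first.*second", Pattern.DOTALL), text)); // prints 1

        // MULTILINE: without it, '^' matches only at the start of the input.
        System.out.println(countMatches(
                Pattern.compile("^\\w+"), text));                         // prints 1
        System.out.println(countMatches(
                Pattern.compile("^\\w+", Pattern.MULTILINE), text));      // prints 2
    }
}
```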


          • Andrew Dixon

            #6
            Re: Another RegEx Problem

            Hi.

            Thanks, I have it working now.

            --

            Best Regards
            >>> Andrew Dixon