Why are the loops taking so much time to execute in the following program?

  • rspvsanjay
    New Member
    • Sep 2016
    • 21

    In this program I pass a Wikipedia URL to my text-extraction logic. After the HTML has been fetched, the for loops take too much time to execute.
    The same logic is very fast in a Python program.

    How can I reduce the execution time?
    Code:
    import java.io.IOException;
    import java.net.URL;
    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class TextExtraction1 
    {
    	static TextExtraction1 fj;
    	public String toHtmlString(String url) throws IOException 
    	{
    		StringBuilder sb = new StringBuilder();
    		   for(Scanner sc = new Scanner(new URL(url).openStream()); sc.hasNext(); )
    		      sb.append(sc.nextLine()).append('\n');
    		   return sb.toString();
    	}
    	
    	static int search(String key,String target)
    	{
    		int count=0;
    		Pattern p=Pattern.compile(key);
    		Matcher m=p.matcher(target);
    		while(m.find()){count++;}
    		return count;
    	} 
    
    	String extractText(String s) throws IOException
    	{
    				 
    		String h1 = fj.toHtmlString(s); 
            System.out.println("extracted \n\n");
            int i2=0;
            String h2[] = h1.split("\n");
            String html="";
            long start = System.currentTimeMillis();
            
            for(String h3:h2)
            {	//bw.write(h3);bw.newLine();
            		html += h3;
                    html += ""; //iu=iu+1;               	
            }
            long end = System.currentTimeMillis();
            System.out.println(++i2+" th loop end in "+(end-start)/1000+" seconds");
            boolean capture = true;
            String filtered_text = "";
            
            String html_text[] = html.split("<");
            String h_text[];//System.out.println("kyhe1");
            
            
            start = System.currentTimeMillis();
            for(String h:html_text)
            {
            	h = "<" + h;
            	h_text = h.split(">");
            	for(String w :h_text)
            	{
            		if(w.length()>0)	{	if(w.substring(0, 1).equals("<")){w +=">";}	}
            		if(search("</script>",w)>0){capture=true;}
            		else if(search("<script",w)>0){capture=false;}
            		else if(capture){filtered_text += w;     filtered_text += "\n";}
            	}
            }
           // System.out.println("kyhe1");
            end = System.currentTimeMillis();
            html_text = filtered_text.split("\n");
            
            System.out.println(++i2+" th loop end in "+(end-start)/1000+" seconds");
            return html_text[0];
    	}
    	
    		
    	public static void main(String []args)throws IOException 
    	{
    		fj = new TextExtraction1();
    		System.out.println(fj.extractText("https://en.wikipedia.org/wiki/Varanasi"));
    	}
    }
    Last edited by Frinavale; Jan 27 '17, 04:36 PM. Reason: Added code tags.
  • chaarmann
    Recognized Expert Contributor
    • Nov 2007
    • 785

    #2
    You have written a method search() that
    searches through the whole string and does not stop at the first match.
    Quotation:
    Code:
    while(m.find()){count++;}
    But you use it in a way that it would be sufficient to find the first one:
    Quotation:
    Code:
    if(search("</script>",w)>0) ...
    So why not write a method searchFirst() that has no while loop and simply returns after finding the first occurrence?
    There is no need for regular expressions then; just use searchString.indexOf(searchKey).
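
A minimal sketch of this suggestion, assuming a plain substring test is all the caller needs. The class name `ScriptTagCheck` and the method name `containsTag` are hypothetical, chosen only for illustration. `String.indexOf` stops at the first hit and returns -1 when the key is absent.

```java
public class ScriptTagCheck {

    // Hypothetical helper: true as soon as the first occurrence of key is found.
    // No regular expression is compiled and no further occurrences are counted.
    static boolean containsTag(String key, String target) {
        return target.indexOf(key) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsTag("</script>", "var x = 1;</script>")); // true
        System.out.println(containsTag("<script", "<p>plain text</p>"));     // false
    }
}
```

In the original loop, `search("</script>", w) > 0` could then be written as `containsTag("</script>", w)`.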

    Second, you split the string into parts:
    Code:
    h_text = h.split(">");
    then assemble it into a new string:
    Code:
    filtered_text += w; filtered_text += "\n"
    and split it again:
    Code:
    html_text = filtered_text.split("\n");
    So why not put the split parts in variable w directly into the html_text array?
    For example: html_text[i] = w
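
One way this could look, as a sketch only: the pieces are collected as they are produced, so no intermediate string has to be built and split again. An `ArrayList` is assumed here instead of a fixed array because the number of pieces is not known in advance. The names `CollectPieces` and `collect` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectPieces {

    // Collects the non-empty pieces directly instead of joining them
    // into one big string and splitting that string again afterwards.
    static List<String> collect(String html) {
        List<String> pieces = new ArrayList<>();
        for (String piece : html.split(">")) {
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Made-up sample input.
        for (String piece : collect("<p>a</p>")) {
            System.out.println(piece);
        }
    }
}
```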

    This is also a performance-no-go:
    Code:
    h = "<" + h;
    This copies the whole string in memory again. Java strings are immutable, so "<" cannot be put in front of the existing string without creating a new string and copying all characters into it.
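
The same copying happens with `+=` inside a loop: every pass creates a new string that contains everything collected so far. A small sketch of the usual alternative, assuming the pieces only need to be appended in order. `JoinLines` and `join` are hypothetical names, and the sample lines are made up.

```java
public class JoinLines {

    // Appends every line to one growable buffer; the text collected so far
    // is not copied into a fresh String on each iteration, unlike result += line.
    static String join(String[] lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] lines = {"<p>one</p>", "<p>two</p>"}; // made-up sample data
        System.out.print(join(lines));
    }
}
```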

    That's the reason why it is so slow. If you want to have high performance, do it this way:
    Use a regular expression on your string "html" that deletes all script tags and the stuff inside.
    Then split it into single lines.
    Like so:
    Code:
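    // (?i) ignores case, (?s) lets '.' match line breaks; <script\b also matches tags with attributes
    html = html.replaceAll("(?is)<script\\b.*?</script>", "");
    html_text = html.split("\n");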
    Now measure the performance of these two lines, which replace your whole code logic. It should be much faster!
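
A self-contained sketch of the replace-then-split approach with a simple timing around it. It assumes that script tags may carry attributes and may span several lines, hence `<script\b` and the `(?is)` flags. The class name `ScriptStripper`, the method `stripScripts`, and the sample HTML are made up for illustration; for real pages a proper HTML parser would likely be more robust than a regular expression.

```java
public class ScriptStripper {

    // (?i) ignores case, (?s) lets '.' match line breaks, so multi-line script
    // blocks are covered; <script\b also matches opening tags with attributes.
    static String stripScripts(String html) {
        return html.replaceAll("(?is)<script\\b.*?</script>", "");
    }

    public static void main(String[] args) {
        // Made-up sample input, not taken from the Wikipedia page in the question.
        String html = "<p>before</p>\n"
                + "<script type=\"text/javascript\">\nvar x = 1;\n</script>\n"
                + "<p>after</p>";

        long start = System.nanoTime();
        String[] lines = stripScripts(html).split("\n");
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        for (String line : lines) {
            System.out.println(line);
        }
        System.out.println("took " + elapsedMicros + " microseconds");
    }
}
```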


    • rspvsanjay
      New Member
      • Sep 2016
      • 21

      #3
      ok, thank you

      How can I write this code using the replaceAll method?

      String extractText(String s) throws IOException
      {
          String html = fj.toHtmlString(s); // extracted html source code from wikipedia
          String filtered_text = " ";
          System.out.println("extracted \n\n");
          String[] html_text = html.split("\n");
          long start = System.currentTimeMillis();

          for (String h : html_text)
          {   //System.out.println("ky4"+h);
              if (Pattern.compile("</strong>", Pattern.CASE_INSENSITIVE + Pattern.LITERAL).matcher(h).find())
              {

              }
              else if (Pattern.compile("<strong", Pattern.CASE_INSENSITIVE + Pattern.LITERAL).matcher(h).find())
              {

              }
              else
              {
                  filtered_text += h;
                  filtered_text += "\n";
              }
          }
          long end = System.currentTimeMillis();
          System.out.println("loop end in "+(end-start)/1000+" seconds"+" or "+(end-start)+" miliseconds");//System.out.println(++i2+" th loop end in "+(end-start)/1000+" seconds");
          return filtered_text;
      }
      Last edited by rspvsanjay; Feb 14 '17, 12:30 PM. Reason: that was too much code for same pattern


      • chaarmann
        Recognized Expert Contributor
        • Nov 2007
        • 785

        #4
        1.) To improve performance, compile the pattern once, outside the for loop. Then apply it many times inside the loop by creating a matcher from it.
        2.) If you have a pattern A and a pattern B, do not write two if statements that each search the whole string.
        Just search the string once with the combined pattern "A|B". This goes through the string only once.
        3.) Do not split the text at newlines first and then put the filtered pieces back together. Just apply the pattern once to the whole original string. This way it could be up to 10 times faster.
        4.) Using StringBuilder instead of "+" to concatenate many strings is much faster.
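
A sketch of how points 1 to 3 might be combined for the `<strong` filter from post #3, under the assumption that every line containing `<strong` or `</strong>` should be dropped. The names `StrongLineFilter`, `STRONG_LINE`, and `dropStrongLines` are hypothetical. The result is close to, but not byte-for-byte identical with, the loop version, which starts with a space and always appends a trailing newline.

```java
import java.util.regex.Pattern;

public class StrongLineFilter {

    // Compiled once and reused. The alternation covers both tags in a single pass.
    // (?i) = case-insensitive, (?m) = ^ and $ match at line boundaries.
    private static final Pattern STRONG_LINE =
            Pattern.compile("(?im)^.*(?:</strong>|<strong).*$\\n?");

    // Removes every line that contains <strong or </strong>,
    // working on the whole text without splitting it first.
    static String dropStrongLines(String html) {
        return STRONG_LINE.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample input.
        String html = "a\n<strong>b</strong>\nc\n";
        System.out.print(dropStrongLines(html)); // prints "a" and "c" on separate lines
    }
}
```

Because `replaceAll` builds the result internally, no manual `+=` concatenation is needed here.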


        • rspvsanjay
          New Member
          • Sep 2016
          • 21

          #5
          My problem is resolved, thank you.
