Extract all Img Src tags using Java Regular Expression

  • dkasyap
    New Member
    • Jan 2008
    • 1

    Extract all Img Src tags using Java Regular Expression

    Hi,

    I have a huge string containing HTML tags, some of which are <img src="URL"> tags. I need to extract the URLs from all occurrences of these tags in the input string. This is what I am doing:

    Code:
    Pattern p=null;
    Matcher m= null;
    String word0= null;
    String word1= null;
    
    p= Pattern.compile(".*<img[^>]*src=\"([^\"]*)",Pattern.CASE_INSENSITIVE);
    m= p.matcher(txt);
    while (m.find())
    {
        word0=m.group(1);
        System.out.println(word0.toString());
    }

    The problem with this code is that it prints only the last URL. For example, if there are 5 <img src="URL"> tags, this code prints only the URL contained within the 5th <img src> tag. Please tell me how to solve this.

    Thanking you in advance
  • BigDaddyLH
    Recognized Expert Top Contributor
    • Dec 2007
    • 1216

    #2
    Usually when someone wants to extract tags from XML or HTML, it makes sense to parse the input using a proper XML/HTML parser. Have you considered that? For example, what about HTML comments -- they may contain what looks like an image tag...
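A minimal sketch of the parser route, assuming the HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator) is acceptable for the job. The class name ImgSrcParserDemo and the method extractImgSrc are made up for illustration, and this is only one possible way to do it. The parser hands comments to a separate callback, so an image tag inside a comment should not reach handleSimpleTag.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original thread.
class ImgSrcParserDemo {

    // Collects the src attribute of every <img> element the parser reports.
    static List<String> extractImgSrc(String html) {
        List<String> urls = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <img> has no closing tag, so it arrives as a "simple" tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        urls.add(src.toString());
                    }
                }
            }
        };
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        String txt = "<html><body><!-- <img src=\"hidden.png\"> -->"
                + "<img src=\"a.png\"><p><img alt=\"x\" src='b.jpg'></p></body></html>";
        for (String url : extractImgSrc(txt)) {
            System.out.println(url);
        }
    }
}
```

The trade-off is more code than a one-line regex, in exchange for not having to think about quoting styles, attribute order, or commented-out markup.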


    • adeel809
      New Member
      • Jan 2013
      • 1

      #3
      Change while (m.find()) to if (m.find()) and you get the first img tag's src; change the while to a for loop and you get as many as you want.


      • Anas Mosaad
        New Member
        • Jan 2013
        • 185

        #4
        It's because regex quantifiers are greedy: they match as much as they can. How is that related to your case? It's the .* at the beginning of your expression. It grabs the largest possible match, which is everything in the document (or in the current line, since . does not match line breaks by default) up to the start of your last img tag. If you moved it to the end of your expression, it would match only the first one. If you want to get all images, just drop it, giving something like this:
        Code:
        p = Pattern.compile("<img[^>]*src=[\"']([^\"']*)",
                Pattern.CASE_INSENSITIVE);
        P.S.: I added support for ' as well as " as valid delimiters of the src URL.

        @adeel809, an if will match only once, so he will never be able to get all the images.

        @BigDaddyLH, this is a very simple case that doesn't require all that sophistication.
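To make the greedy-match point in #4 concrete, here is a small self-contained sketch that runs the pattern from post #1 and a pattern without the leading .* over the same input. The class name ImgSrcRegexDemo, the method extract, and the sample HTML are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class ImgSrcRegexDemo {

    // Pattern from post #1: the leading ".*" swallows everything up to the last <img.
    static final Pattern GREEDY = Pattern.compile(
            ".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);

    // Same idea without the leading ".*", accepting ' or " around the URL.
    static final Pattern FIXED = Pattern.compile(
            "<img[^>]*src=[\"']([^\"']*)", Pattern.CASE_INSENSITIVE);

    // Returns capture group 1 of every match, in document order.
    static List<String> extract(Pattern p, String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String txt = "<p><img src=\"a.png\"><IMG alt='x' src='b.jpg'>"
                + "<img src=\"c.gif\"></p>";
        System.out.println(extract(GREEDY, txt)); // prints [c.gif]
        System.out.println(extract(FIXED, txt));  // prints [a.png, b.jpg, c.gif]
    }
}
```

Note that this sketch stops the URL at the first quote of either kind, so a URL that itself contains a quote character would be cut short.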


        • ibilal
          New Member
          • May 2017
          • 1

          #5
          Just change
          word0=m.group(1);
          to
          word0=m.group();
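For anyone comparing the two calls: in java.util.regex, group() returns the entire matched text, while group(1) returns only what the first pair of parentheses captured. A short sketch of the difference, where the class name GroupDemo, the method firstMatch, and the sample input are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class GroupDemo {

    static final Pattern IMG = Pattern.compile(
            "<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);

    // Returns {whole match, captured URL} for the first img tag, or null if none.
    static String[] firstMatch(String html) {
        Matcher m = IMG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = firstMatch("<p><img src=\"a.png\"></p>");
        System.out.println(r[0]); // prints <img src="a.png
        System.out.println(r[1]); // prints a.png
    }
}
```

So switching to group() changes what gets printed for each match, not how many matches find() produces.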
