Extract URL from String

  • markmcgookin
    Recognized Expert Contributor
    • Dec 2006
    • 648

    Hi Folks,

    I am writing a program to analyse an HTML page in Java. I am connecting to a website and then going to extract ALL the links from it. I think the best way to do this is to use the <a href=...> ... </a> tags as a guideline.

    I have the code....

    Code:
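    // Needs: import java.io.*; import java.net.URL;
    String data;
    BufferedReader webadd = null;

    // DataInputStream.readLine() is deprecated, so wrap the stream in a BufferedReader
    webadd = new BufferedReader( new InputStreamReader(
            (new URL("http://www.anyrandomurl.com/")).openStream() ) );

    data = webadd.readLine();

    while ( data != null )
      {
         *** HELP NEEDED HERE ***
         data = webadd.readLine();   // fetch the next line only after handling this one
      }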
    This obviously reads through every line of code in an HTML doc at the URL and puts it into data. I was thinking of storing all the URLs from the site in an array later on, but it is the way of extracting the links I was unsure of... possibly some kind of string tokenizer? I really need something that will scroll through a string, char by char, until it hits <a href=" and will then record the data until it hits /a>, giving me the URL.

    Which I can then just add to something like URLs [] and loop through that later.

    Think it will only be one line of code or so, any ideas?

    Cheers!
  • Ganon11
    Recognized Expert Specialist
    • Oct 2006
    • 3651

    #2
    The String class has a .charAt() function that will return the character at the position specified. You can use this to search through the String char by char.

    Alternatively, there's an .indexOf() function in the String class that you can use to search for "<a href=", and you can extract the substring (URL) by using the .substring() function in String. Check out the official documentation here.
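
To make the suggestion above concrete, here is a minimal sketch of pulling the first URL out of one line with indexOf() and substring(). The class and method names are made up for illustration, and it assumes the link is written exactly as <a href="..." with double quotes; real pages are often messier than that.

```java
// Sketch only: FirstLinkDemo and extractFirstLink are hypothetical names,
// not something from the thread or from any library.
class FirstLinkDemo {

    // Returns the href value of the first <a href="..."> in the line,
    // or null if the line has no link in that exact form.
    static String extractFirstLink(String line) {
        String marker = "<a href=\"";
        int start = line.indexOf(marker);
        if (start == -1) {
            return null;                      // no link on this line
        }
        start += marker.length();             // skip past the marker itself
        int end = line.indexOf('"', start);   // closing quote of the URL
        if (end == -1) {
            return null;                      // malformed: no closing quote
        }
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "<p><a href=\"http://www.mysitehere.com/thisiscool/index.html\">MY TEST HERE</a></p>";
        System.out.println(extractFirstLink(line));
    }
}
```

Searching for the closing quote rather than for /a> keeps the link text out of the result, so only the URL itself comes back.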

    • markmcgookin
      Recognized Expert Contributor
      • Dec 2006
      • 648

      #3
      Ah excellent, I've used that before, and I knew something like it existed, I totally forgot the syntax! cheers!

      Just reading through the java.net 1.5.0 stuff here now to see if there are any useful methods in those classes.

      Cheers pal!

      • markmcgookin
        Recognized Expert Contributor
        • Dec 2006
        • 648

        #4
        Would an idea be:

        Read line of html as String ( strLine )

        posStart = strLine.indexOf("<a href=")
        posFinish = strLine.indexOf("/a>")

        linkURL = strLine.subSequence( posStart, posFinish )

        That's obv mostly pseudo code, but would you think that would return a link (I can't test it until tomorrow) ? also, what do you think I should do to deal with lines that have more than one link? as that will obviously return only the 1st link off the line.

        Now obviously a loop, if I was using the charAt() method to go through every char in the line, would be

        for (i = 0; i < strLine.length(); i++)

        But I'm not too sure how I could get that to work for running through the line.

        Maybe

        if (posFinish != strLine.length())

        ... continue

        or something? lol
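
One possible way to handle several links on one line, sketched under the same assumption that each link looks exactly like <a href="...": String.indexOf() has a two-argument form that starts searching from a given position, so the loop can resume just after the previous match instead of stepping char by char. The names below are hypothetical, and a List is used in place of a fixed URLs[] array because the number of links is not known in advance.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: AllLinksDemo and extractLinks are hypothetical names.
class AllLinksDemo {

    // Collects every href value on the line, in order of appearance.
    static List<String> extractLinks(String line) {
        List<String> links = new ArrayList<String>();
        String marker = "<a href=\"";
        int pos = line.indexOf(marker);
        while (pos != -1) {
            int start = pos + marker.length();    // first char of the URL
            int end = line.indexOf('"', start);   // closing quote of the URL
            if (end == -1) {
                break;                            // malformed tag, stop here
            }
            links.add(line.substring(start, end));
            pos = line.indexOf(marker, end);      // look for the next link
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"a.html\">A</a> and <a href=\"b.html\">B</a>";
        System.out.println(extractLinks(line));
    }
}
```

The loop ends when indexOf() returns -1, which plays the role of the "posFinish != length" check in the pseudo code.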

        • Ganon11
          Recognized Expert Specialist
          • Oct 2006
          • 3651

          #5
          Well, I know HTML links are stored in this way:

          "<a href="http://www.mysitehere.com/thisiscool/index.html" >MY TEST HERE </a >"

          without the spaces before the >'s.

          So you'd have to search for "<a href" and ">" and take that substring.

          As for finding multiple links in one line... once you find a link, you can cut the String off just past the ">" that closes that tag and begin the process again on what is left, until you can't find "<a href" in the String anymore.
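
The search-and-trim idea described above might look roughly like this. It is a sketch with hypothetical names, and it assumes the URL is the first double-quoted value inside the tag and that the URL itself contains no ">" character.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: TrimAndRepeatDemo and extractLinks are hypothetical names.
class TrimAndRepeatDemo {

    // Finds "<a href" and the next ">", takes the quoted value in between,
    // then drops everything up to that ">" and repeats on the remainder.
    static List<String> extractLinks(String line) {
        List<String> links = new ArrayList<String>();
        String rest = line;
        int open = rest.indexOf("<a href");
        while (open != -1) {
            int close = rest.indexOf('>', open);
            if (close == -1) {
                break;                                 // tag never closes
            }
            String tag = rest.substring(open, close);  // e.g. <a href="a.html"
            int firstQuote = tag.indexOf('"');
            int secondQuote = tag.indexOf('"', firstQuote + 1);
            if (firstQuote != -1 && secondQuote != -1) {
                links.add(tag.substring(firstQuote + 1, secondQuote));
            }
            rest = rest.substring(close + 1);          // discard what was handled
            open = rest.indexOf("<a href");
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"a.html\">A</a> and <a href=\"b.html\">B</a>";
        System.out.println(extractLinks(line));
    }
}
```

Trimming creates a new String on every pass, which is fine for single lines of HTML; passing a start index to indexOf() would avoid the copies if that ever matters.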

          • markmcgookin
            Recognized Expert Contributor
            • Dec 2006
            • 648

            #6
            Cool man, cheers! I'll try something out tomorrow and maybe post back here during the week! (Been VB.Net programming all day... must switch brain to Java overnight!)

            Thanks very much for taking the time to reply!
