REGEX Question: Get filenames and alt-tags from html

  • Smasal
    New Member
    • May 2010
    • 1

    I have 2 problems with regular expressions:

    1) I want to get the image's filenames from a website:

    e.g. src="http://bytes.com/images/world.jpg", src=images/world.jpg and src='images/world.jpg' should result in "world"

    Code:
    $imgfilenames=preg_match_all('/^[0-9A-Za-z_ ](.jpg|.gif|.JPG|.GIF|.png|.PNG)$/i', $html, $matches);

    2) I want to get the alt descriptions of the images on a website:

    Code:
    $alttags=preg_match_all('/<img[^>]*alt="([^"]*)"/i', $html, $matches);

    But neither of my expressions works, because I don't really understand regular expressions ;-).
  • Atli
    Recognized Expert Expert
    • Nov 2006
    • 5062

    #2
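    Your first expression is close, but the first character class is not followed by a quantifier, so it matches just a single character. Note also that ^ and $ anchor the pattern to the start and end of the whole input, and that an unescaped . matches any character, not only a literal dot. If you want a class to cover more than one character, it needs to be followed by one of: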
    • * - Any number of characters (including none).
    • ? - 0 or 1 characters.
    • + - One or more characters.
    • {low,high} - At least low and at most high characters.

    For example:
    Code:
    [a-zA-Z0-9]+
    This matches one or more alphanumeric characters. If you remove the +, it matches only one (which is what your expression does).
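
    To make the difference concrete, here is a minimal sketch of the quantifier applied to the filename pattern from the question. The sample $html string is made up for illustration, and the pattern is only one possible fix, not necessarily the one intended above: it drops the ^ and $ anchors, escapes the dot, and leaves the space out of the character class (assuming file names contain no spaces).

```php
<?php
// Hypothetical sample markup covering the three src styles from the question.
$html = '<img src="http://bytes.com/images/world.jpg" alt="World map">'
      . "<img src=images/globe.gif alt='Globe'>"
      . "<img src='images/My_Photo.PNG'>";

// Without a quantifier the class matches exactly one character.
preg_match('/[a-zA-Z0-9]/', 'world', $single);   // $single[0] is "w"

// With + it matches a run of one or more characters.
preg_match('/[a-zA-Z0-9]+/', 'world', $run);     // $run[0] is "world"

// Capture the base name in front of an image extension:
//   ([0-9A-Za-z_]+)   one or more name characters (quantifier added)
//   \.                a literal dot (an unescaped . matches any character)
//   (?:jpg|gif|png)   the extension; the i flag makes it case-insensitive
//   \b                the extension must end here, so ".pngx" is rejected
// No ^ or $ is used, because those would anchor the pattern to the
// start and end of the whole $html string.
preg_match_all('/([0-9A-Za-z_]+)\.(?:jpg|gif|png)\b/i', $html, $matches);

// $matches[1] holds the first capture group of every match.
$imgfilenames = $matches[1];

print_r($imgfilenames);
```

    Note that preg_match_all() returns the number of matches, so the names themselves come from $matches[1] rather than from the return value.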

    See regular-expressions.info for more details.

    P.S.
    It is usually better to parse HTML documents using a DOM parser or one of the XML parsers. Regular expressions are poorly suited to parsing large chunks of HTML, because HTML is not a "regular" format.
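
    As a rough sketch of the DOM approach, assuming PHP's bundled DOM extension is available, something like the following collects both the file names and the alt texts. The sample $html and the variable names are made up for illustration; a real page would be loaded into $html first.

```php
<?php
// Hypothetical sample markup; in practice $html would hold the fetched page.
$html = '<html><body>'
      . '<img src="http://bytes.com/images/world.jpg" alt="World map">'
      . "<img src='images/globe.gif' alt='Globe'>"
      . '<img src=images/plain.png>'
      . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings internally
// instead of having them printed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$imgfilenames = [];
$alttags = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');

    // Take the path part of the URL; fall back to the raw value if
    // parse_url() cannot extract one.
    $path = parse_url($src, PHP_URL_PATH) ?: $src;

    // PATHINFO_FILENAME gives the file name without directory or extension.
    $imgfilenames[] = pathinfo($path, PATHINFO_FILENAME);

    // getAttribute() returns an empty string when the attribute is missing.
    $alttags[] = $img->getAttribute('alt');
}

print_r($imgfilenames);
print_r($alttags);
```

    The parser deals with double-quoted, single-quoted and unquoted attribute values itself, which is exactly the part that is awkward to cover with a single regular expression.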
