[Answer]-Python match image tags from large content string using regular expressions

1👍

Multipurpose solution:

import re

image_re = re.compile(r"""
    (?P<img_tag><img\s+     # tag starts (the group wraps the whole tag)
    [^>]*?                  # other attributes
    src=                    # start of src attribute
    (?P<quote>["'])?        # optional open quote
    (?P<image>[^"'>]+)      # image file name
    (?(quote)(?P=quote))    # close quote, required only if one was opened
    [^>]*?                  # other attributes
    >)                      # end of tag
    """, re.IGNORECASE|re.VERBOSE) # re.VERBOSE lets you write the regex in a readable format with comments

image_tags = []
for match in image_re.finditer(content):
    image_tags.append(match.group("img_tag"))

# print the found image tags
for image_tag in image_tags:
    print(image_tag)

As you can see, the regex definition contains named groups of the form

(?P<group_name>regex)

A named group lets you access what it matched by group_name instead of by number, which makes the code easier to read. So, if you want to show the src attributes of all img tags, then just write:

image_tags = []  # start from an empty list so earlier results are not mixed in
for match in image_re.finditer(content):
    image_tags.append(match.group("image"))

After this, the image_tags list will contain the src values of the image tags.
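
For a quick check, here is a minimal, self-contained sketch of the approach above. It assumes Python 3 and a made-up sample string: sample_content, found_tags and found_sources are illustrative names, not from the original question. It uses a version of the pattern in which img_tag wraps the whole tag, so both the full tag and the src value can be read by name:

```python
import re

image_re = re.compile(r"""
    (?P<img_tag><img\s+     # tag starts (group wraps the whole tag)
    [^>]*?                  # other attributes
    src=                    # start of src attribute
    (?P<quote>["'])?        # optional open quote
    (?P<image>[^"'>]+)      # image file name
    (?(quote)(?P=quote))    # close quote, only if one was opened
    [^>]*?                  # other attributes
    >)                      # end of tag
    """, re.IGNORECASE | re.VERBOSE)

# Hypothetical input, just for illustration
sample_content = '<p>hi</p><img class="x" src="a.png" alt="A"><IMG SRC=b.jpg>'

found_tags = []
found_sources = []
for match in image_re.finditer(sample_content):
    found_tags.append(match.group("img_tag"))   # whole tag, by name
    found_sources.append(match.group("image"))  # src value, by name

for tag, src in zip(found_tags, found_sources):
    print(tag, "->", src)
```

Note that an unquoted src value followed by further attributes would be captured together with those attributes, which is one reason a real HTML parser is usually the safer choice.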

Also, if you need to parse HTML, there are tools designed for exactly that purpose. One example is lxml, which supports XPath expressions.
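
To give an idea of what the parser-based route looks like, here is a rough sketch. It deliberately uses html.parser from the Python 3 standard library instead of lxml, only so that it runs without installing anything. The class name ImgSrcCollector and the sample string are made up for illustration. With lxml, an XPath expression such as //img/@src would play the same role.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical input, just for illustration
sample_html = (
    '<p>hi</p><img class="x" src="a.png" alt="A">'
    '<IMG SRC=b.jpg><img src="c.gif" />'
)

collector = ImgSrcCollector()
collector.feed(sample_html)
collector.close()
img_sources = collector.sources
print(img_sources)
```

Self-closing tags like the last one are also covered, because the default handling of such tags calls handle_starttag.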

👤stalk

0👍

I don’t know Python, but assuming it uses normal Perl-compatible regular expressions…

You probably want to look for “<img[^>]+>” which is: “<img”, followed by anything that is not “>”, followed by “>”. Each match should give you a complete image tag.
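
In Python, that pattern could be tried roughly like this. This is a sketch assuming Python 3 and the standard re module. The sample string and variable names are made up, and re.IGNORECASE is an addition so that upper-case IMG tags are found too:

```python
import re

# "<img", then one or more characters that are not ">", then ">"
simple_img_re = re.compile(r"<img[^>]+>", re.IGNORECASE)

# Hypothetical input, just for illustration
sample = '<p>Intro</p><img src="a.png" alt="A"><div><IMG SRC=\'b.jpg\'></div>'

# findall returns the full text of each match because the pattern has no groups
simple_tags = simple_img_re.findall(sample)
for tag in simple_tags:
    print(tag)
```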

👤Grynn
