How to strip (not remove) specified tags from an HTML string using Python?


Beautiful Soup has unwrap():

It replaces a tag with whatever’s inside that tag.

You will have to manually iterate over all tags you want to replace.
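As a minimal sketch of that iteration (assuming the bs4 package is installed; the helper name strip_tags and the sample markup are only illustrative, not part of Beautiful Soup):

```python
from bs4 import BeautifulSoup


def strip_tags(html, tags):
    """Return html with the given tags removed but their contents kept."""
    # "html.parser" is the stdlib-backed parser, so no extra dependency is needed.
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and returns a list-like result,
    # so unwrapping while looping does not disturb the iteration.
    for tag in soup.find_all(tags):
        tag.unwrap()  # replace the tag with whatever is inside it
    return str(soup)


print(strip_tags("<body><h1>Parse <b>me</b>!</h1><p>Keep</p></body>", ["h1", "b"]))
```

Only the listed tags are unwrapped; everything else, including the <p> element here, is left as it was.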


You can extend Python’s HTMLParser and create your own parser to skip specified tags.

Using the example provided in the given link, I will modify it to strip <h1></h1> tags but keep their data:

from html.parser import HTMLParser

# Tags to strip; their contents are kept.
NOT_ALLOWED_TAGS = ["h1"]


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

That will print:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
# h1 start tag skipped here
Encountered some data  : Parse me!
# h1 end tag skipped here
Encountered an end tag : body
Encountered an end tag : html

You can now maintain a NOT_ALLOWED_TAGS list containing the tags you want to strip.
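The parser above only prints what it sees. To actually get the stripped HTML string back, one option is to collect the pieces instead of printing them. This is a rough sketch, not a full serializer: TagStripper and strip_tags are names made up for this example, and comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Rebuilds the HTML, leaving out the tags listed in not_allowed_tags."""

    def __init__(self, not_allowed_tags):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.not_allowed = set(not_allowed_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.not_allowed:
            # get_starttag_text() returns the original tag text, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, without a separate end tag.
        if tag not in self.not_allowed:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.not_allowed:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html, not_allowed_tags):
    parser = TagStripper(not_allowed_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('<html><head><title>Test</title></head>'
                 '<body><h1>Parse me!</h1></body></html>', ["h1"]))
```

Passing the tag list to the constructor instead of using a module-level constant is just a design choice so the same class can be reused with different lists.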

