[Answer] How to parse a div tag from alexa.com and show the results in a table in Django

The idea is to create a simple state machine. In the start-tag event we decide which field of the site info should be populated later in the data event. The decision is based on the <div>/<span> class attribute and the ATTR_FIELDS map.

For example, when a <div class="count"> tag starts, we will populate the rank field of the current self.site dictionary.

import HTMLParser  # Python 2 standard library module


class MyHTMLParser(HTMLParser.HTMLParser):

    ATTR_FIELDS = {'count': 'rank',
                   'description': 'description', 'remainder': 'description'}

    def reset_site(self):
        self.site = {'rank': '', 'url': '', 'description': ''}
        self.in_site_listing = self.data_field = False

    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.reset_site()
        self.site_list = []

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get('class')
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        elif self.in_site_listing:
            if tag == 'a':
                if class_attr != 'moreDesc':
                    self.site['url'] = dict(attrs)['href'].replace(
                                                             '/siteinfo/', '')
            elif tag in ['div', 'span']:
                self.data_field = self.ATTR_FIELDS.get(class_attr)

    def handle_data(self, data):
        if self.data_field:
            self.site[self.data_field] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self.in_site_listing:
            self.site_list.append(self.site)
            self.reset_site()
        self.data_field = None
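
For readers on Python 3, where the HTMLParser module became html.parser, the class might be ported roughly as below. This is only a sketch, not the code from the answer: the class name SiteListParser and the SAMPLE_HTML markup are made up for illustration and merely mimic the structure described above, not the real alexa.com page.

```python
from html.parser import HTMLParser


class SiteListParser(HTMLParser):
    # Same class-to-field map as in the answer above.
    ATTR_FIELDS = {'count': 'rank',
                   'description': 'description', 'remainder': 'description'}

    def reset_site(self):
        self.site = {'rank': '', 'url': '', 'description': ''}
        self.in_site_listing = False
        self.data_field = None

    def reset(self):
        # HTMLParser.__init__() calls reset(), so our state is set up too.
        super().reset()
        self.reset_site()
        self.site_list = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        class_attr = attrs.get('class')
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        elif self.in_site_listing:
            if tag == 'a' and class_attr != 'moreDesc':
                self.site['url'] = attrs.get('href', '').replace('/siteinfo/', '')
            elif tag in ('div', 'span'):
                self.data_field = self.ATTR_FIELDS.get(class_attr)

    def handle_data(self, data):
        if self.data_field:
            self.site[self.data_field] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self.in_site_listing:
            self.site_list.append(self.site)
            self.reset_site()
        self.data_field = None


# Hypothetical markup for illustration only, NOT copied from alexa.com.
SAMPLE_HTML = (
    '<ul>'
    '<li class="site-listing">'
    '<div class="count">1</div>'
    '<a href="/siteinfo/example.com">example.com</a>'
    '<div class="description">An example site'
    '<span class="remainder"> with a longer description</span>'
    '<a class="moreDesc" href="#">More</a></div>'
    '</li>'
    '</ul>'
)

parser = SiteListParser()
parser.feed(SAMPLE_HTML)
parser.close()
sites = parser.site_list
```

With this sample input, sites holds a single dictionary whose description is the text of the description <div> joined with the text of the remainder <span>.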

And then change the view and template:

view.py

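import urllib2

from django.shortcuts import render

# MyHTMLParser is the parser class shown above; import it from wherever you put it.


def top_urls(request):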
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
    sites = p.site_list[:20]
    return render(request, 'top_urls.html', {'sites': sites})

top_urls.html

...
<tbody>
    {% for site in sites %}
        <tr>
            <td>{{ site.rank }}</td>
            <td>{{ site.url }}</td>
            <td>{{ site.description }}</td>
        </tr>
    {% endfor %}
</tbody>
...

EXPLANATION UPDATE:

Variables used:

  • self.site – current site info
  • self.in_site_listing – flag set to True while we are inside the <li class="site-listing"> tag
  • self.data_field – key in the site info to add the data
  • ATTR_FIELDS – a map of the <div>/<span> classes to the site info keys
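
The role of ATTR_FIELDS can be seen in isolation with a small sketch. The helper field_for() is hypothetical and only wraps the same dict.get() lookup that handle_starttag() performs. Classes missing from the map yield None, which is why handle_data() ignores text inside other tags.

```python
# Same map as in the parser class.
ATTR_FIELDS = {'count': 'rank',
               'description': 'description', 'remainder': 'description'}


def field_for(class_attr):
    # Hypothetical helper: mirrors `self.ATTR_FIELDS.get(class_attr)`.
    return ATTR_FIELDS.get(class_attr)


# A tag without a class attribute gives class_attr == None.
lookups = {cls: field_for(cls)
           for cls in ('count', 'remainder', 'moreDesc', None)}
```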

The key method is handle_starttag():

def handle_starttag(self, tag, attrs):
    # get the tag `class` attribute if any
    class_attr = dict(attrs).get('class')
    # if the tag is `<li class="site-listing">` then set the flag that we
    # should populate the site info
    if tag == 'li' and class_attr == 'site-listing':
        self.in_site_listing = True
    # we are in the site population mode
    elif self.in_site_listing:
        if tag == 'a':
            # `<li class="site-listing">` contains two `<a>` tags. We should
            # use the tag without the `class="moreDesc"` attribute to set the url
            if class_attr != 'moreDesc':
                self.site['url'] = dict(attrs)['href'].replace(
                                                         '/siteinfo/', '')
        elif tag in ['div', 'span']:
            # we are in the `<div>` or `<span>` tag. Get the `class` attribute
            # of the tag and decide which field of the site info we will
            # populate in the `handle_data()` method
            self.data_field = self.ATTR_FIELDS.get(class_attr)

So handle_data() is pretty simple:

def handle_data(self, data):
    # if we know which field of site info should be populated
    if self.data_field:
        # append the data to this field. The site description is spread across
        # several tags, which is why we append instead of simply assigning.
        self.site[self.data_field] += data
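
To see why appending is needed, here is a minimal sketch (written for Python 3's html.parser; the ChunkCollector class and the markup are made up for illustration) that records every handle_data() call. Text split by a nested tag arrives in separate calls, so plain assignment would keep only the last piece.

```python
from html.parser import HTMLParser


class ChunkCollector(HTMLParser):
    """Hypothetical helper that records each handle_data() call."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = ChunkCollector()
# Illustrative markup: the description continues inside a nested <span>.
collector.feed('<div class="description">Short text'
               '<span class="remainder"> and the rest</span></div>')
collector.close()

chunks = collector.chunks          # one entry per handle_data() call
description = ''.join(chunks)      # what `+=` accumulates in the parser
```
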
Answered by catavaran
