The idea is to build a simple state machine. In the `handle_starttag()` event we decide which field of the site info should be populated later in the `handle_data()` event. The decision is based on the `class` attribute of the `<div>`/`<span>` tag and the `ATTR_FIELDS` map. For example, when a `<div class="count">` tag starts, we will populate the `rank` field of the current `self.site` dictionary.


    import HTMLParser  # Python 2 module (html.parser in Python 3)


    class MyHTMLParser(HTMLParser.HTMLParser):

        # maps the class of a <div>/<span> tag to a key of the site info
        ATTR_FIELDS = {'count': 'rank',
                       'description': 'description', 'remainder': 'description'}

        def reset_site(self):
            self.site = {'rank': '', 'url': '', 'description': ''}
            self.in_site_listing = self.data_field = False

        def reset(self):
            HTMLParser.HTMLParser.reset(self)
            self.reset_site()
            self.site_list = []

        def handle_starttag(self, tag, attrs):
            class_attr = dict(attrs).get('class')
            if tag == 'li' and class_attr == 'site-listing':
                self.in_site_listing = True
            elif self.in_site_listing:
                if tag == 'a':
                    if class_attr != 'moreDesc':
                        self.site['url'] = dict(attrs)['href'].replace(
                            '/siteinfo/', '')
                elif tag in ['div', 'span']:
                    self.data_field = self.ATTR_FIELDS.get(class_attr)

        def handle_data(self, data):
            if self.data_field:
                self.site[self.data_field] += data

        def handle_endtag(self, tag):
            if tag == 'li' and self.in_site_listing:
                self.site_list.append(self.site)
                self.reset_site()
            # any closing tag ends the current data field
            self.data_field = None

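
To try the state machine without hitting the network, here is a minimal sketch that feeds the same logic a hand-written fragment. Two things are assumptions and not part of the original code: the sample markup is made up (the real page may differ), and the import falls back between Python 3's `html.parser` and the Python 2 `HTMLParser` module so the snippet runs on either.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class MyHTMLParser(HTMLParser):
    """Copy of the parser above, condensed so the snippet is self-contained."""

    ATTR_FIELDS = {'count': 'rank',
                   'description': 'description', 'remainder': 'description'}

    def reset_site(self):
        self.site = {'rank': '', 'url': '', 'description': ''}
        self.in_site_listing = False
        self.data_field = None

    def reset(self):
        HTMLParser.reset(self)
        self.reset_site()
        self.site_list = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        class_attr = attrs.get('class')
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        elif self.in_site_listing:
            if tag == 'a' and class_attr != 'moreDesc':
                # .get() is a small deviation: it tolerates <a> without href
                self.site['url'] = attrs.get('href', '').replace(
                    '/siteinfo/', '')
            elif tag in ('div', 'span'):
                self.data_field = self.ATTR_FIELDS.get(class_attr)

    def handle_data(self, data):
        if self.data_field:
            self.site[self.data_field] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self.in_site_listing:
            self.site_list.append(self.site)
            self.reset_site()
        self.data_field = None


# Hypothetical fragment written for this example only.
SAMPLE = (
    '<ul>'
    '<li class="site-listing">'
    '<div class="count">1</div>'
    '<a href="/siteinfo/example.com">example.com</a>'
    '<div class="description">An example '
    '<span class="remainder">site.</span></div>'
    '</li>'
    '</ul>'
)

parser = MyHTMLParser()
parser.feed(SAMPLE)
parser.close()
print(parser.site_list)
```

For this fragment the list holds a single dictionary with `rank` `'1'`, `url` `'example.com'` and `description` `'An example site.'`.
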
And then change the view and template:
view.py
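
    import urllib2  # Python 2 module

    from django.shortcuts import render


    def top_urls(request):
        p = MyHTMLParser()
        p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
        # pass only the first 20 sites to the template
        sites = p.site_list[:20]
        return render(request, 'top_urls.html', {'sites': sites})
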
top_urls.html

    ...
    <tbody>
      {% for site in sites %}
        <tr>
          <td>{{ site.rank }}</td>
          <td>{{ site.url }}</td>
          <td>{{ site.description }}</td>
        </tr>
      {% endfor %}
    </tbody>
    ...

EXPLANATION UPDATE:
Variables used:

- `self.site`: the current site info
- `self.in_site_listing`: a flag set to `True` while we are inside the `<li class="site-listing">` tag
- `self.data_field`: the key in the site info to which the data is added
- `ATTR_FIELDS`: a map of the `<div>`/`<span>` classes to the site info keys

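
The lookup that drives the state machine is a plain `dict.get()`. A minimal sketch using the same `ATTR_FIELDS` map; the helper name `field_for_class` and the `'thumbnail'` class are made up for illustration:

```python
ATTR_FIELDS = {'count': 'rank',
               'description': 'description', 'remainder': 'description'}


def field_for_class(class_attr):
    # hypothetical helper: returns the site info key for a tag class,
    # or None when the class is not in the map
    return ATTR_FIELDS.get(class_attr)


print(field_for_class('count'))      # rank
print(field_for_class('remainder'))  # description
print(field_for_class('thumbnail'))  # None
```

A `None` result is falsy, so `handle_data()` skips the text of tags whose class is not in the map.
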
The key method is `handle_starttag()`:

    def handle_starttag(self, tag, attrs):
        # get the tag's `class` attribute, if any
        class_attr = dict(attrs).get('class')
        # if the tag is `<li class="site-listing">` then set the flag that we
        # should populate the site info
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        # we are in the site population mode
        elif self.in_site_listing:
            if tag == 'a':
                # `<li class="site-listing">` contains two `<a>` tags. We should
                # use the tag without the `class="moreDesc"` attribute to set the url
                if class_attr != 'moreDesc':
                    self.site['url'] = dict(attrs)['href'].replace(
                        '/siteinfo/', '')
            elif tag in ['div', 'span']:
                # we are in a `<div>` or `<span>` tag. Get the `class` attribute
                # of the tag and decide which field of the site info we will
                # populate in the `handle_data()` method
                self.data_field = self.ATTR_FIELDS.get(class_attr)

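
It may help to see what `handle_starttag()` actually receives. The sketch below is not part of the solution: the `AttrEcho` class and the sample markup are made up, and the import falls back between Python 3's `html.parser` and the Python 2 module.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class AttrEcho(HTMLParser):
    """Hypothetical helper that records every start tag it sees."""

    def reset(self):
        HTMLParser.reset(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict(attrs).get('class')
        # gives None when the tag has no class attribute
        self.seen.append((tag, dict(attrs).get('class')))


echo = AttrEcho()
echo.feed('<li class="site-listing">'
          '<a href="/siteinfo/example.com">'
          '<a class="moreDesc" href="#">')
for tag, class_attr in echo.seen:
    print('%s %r' % (tag, class_attr))
```

The first `<a>` has no `class`, so the lookup yields `None` and the `class_attr != 'moreDesc'` test passes; the second `<a>` is the one that gets skipped.
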
So `handle_data()` is pretty simple:

    def handle_data(self, data):
        # if we know which field of the site info should be populated
        if self.data_field:
            # append the data to this field. The site description is spread
            # over several tags, which is why we append instead of simply
            # assigning.
            self.site[self.data_field] += data

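
To see why appending matters, here is a small sketch that records each `handle_data()` call for a description split across a `<div>` and a nested `<span>`. The `ChunkLogger` class and the markup are made up for illustration, and the import again falls back between Python 3 and Python 2.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class ChunkLogger(HTMLParser):
    """Hypothetical helper that keeps every text chunk the parser reports."""

    def reset(self):
        HTMLParser.reset(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


logger = ChunkLogger()
logger.feed('<div class="description">An example '
            '<span class="remainder">site.</span></div>')
logger.close()
print(logger.chunks)           # ['An example ', 'site.']
print(''.join(logger.chunks))  # An example site.
```

The text arrives in two separate calls, so with plain assignment only the last chunk would survive in `self.site['description']`.
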
π€catavaran
Source:stackexchange.com