142👍
Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:
def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
To reverse this, the Cheetah function described in Jake’s answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:
def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;')
    )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
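To illustrate why the ampersand pair sits last in the tuple, here is a small Python 3 sketch. The names PAIRS and decode are made up for this example; the replacement loop is the same idea as html_decode above. Decoding '&amp;' first turns an escaped literal such as '&amp;lt;' into '&lt;', which a later pass then decodes a second time.

```python
# Same pairs as in html_decode, with the ampersand handled last.
PAIRS = (
    ("'", "&#39;"),
    ('"', "&quot;"),
    (">", "&gt;"),
    ("<", "&lt;"),
    ("&", "&amp;"),
)

def decode(s, pairs):
    """Replace each entity with its character, in the order given."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s

escaped = "&amp;lt;"  # escaped form of the literal text "&lt;"

amp_last = decode(escaped, PAIRS)                    # ampersand decoded last
amp_first = decode(escaped, tuple(reversed(PAIRS)))  # ampersand decoded first

print(amp_last)   # &lt;  (correct: decoded exactly once)
print(amp_first)  # <     (wrong: decoded twice)
```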
This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.3 (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
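As a quick illustration of the Python 3.4+ route, the sketch below runs html.unescape over a few sample strings. The samples are invented for this example; they cover named, decimal and hexadecimal character references.

```python
from html import unescape

# Made-up inputs: named, decimal and hexadecimal references.
samples = [
    "&lt;p&gt;Tom &amp; Jerry&lt;/p&gt;",
    "&#39;quoted&#39;",
    "&#x27;hex&#x27;",
    "caf&eacute;",
]

decoded = [unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(before, "->", after)
```

Unlike the hand-written replacement table, this handles every named reference the standard library knows about, not just the five characters Django escapes.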
As a suggestion: it may make more sense to store the HTML unescaped in your database. It’d be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.
With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:
{{ context_var|safe }}
{% autoescape off %}
{{ context_var }}
{% endautoescape %}
170👍
With the standard library:
- HTML Escape
try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x
print(escape("<"))
- HTML Unescape
try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape
print(unescape("&gt;"))
80👍
For HTML encoding, there’s cgi.escape from the standard library (Python 2; it was removed in Python 3.8, where html.escape replaces it):
>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.
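For readers on Python 3, a rough equivalent uses html.escape, whose quote flag defaults to True (the opposite of cgi.escape). The sample string below is made up for illustration.

```python
from html import escape

# Made-up sample containing all five special characters.
text = '<a href="x">Tom & Jerry\'s</a>'

with_quotes = escape(text)                   # quote=True is the default
without_quotes = escape(text, quote=False)   # leaves " and ' untouched

print(with_quotes)
print(without_quotes)
```

Note that html.escape encodes the apostrophe as '&#x27;', whereas Django's escape uses '&#39;'; both decode to the same character.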
For html decoding, I use the following:
import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
                  lambda m: unichr(name2codepoint[m.group(1)]), s)
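The snippet above is Python 2 only (htmlentitydefs, unichr). A minimal Python 3 sketch of the same regex idea might look like the following; the names unescape_named and _ENTITY_RE are invented here, and, like the original, it only handles named entities, not numeric references.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# One alternation over every known entity name, e.g. &(quot|amp|lt|...);
_ENTITY_RE = re.compile(r"&(%s);" % "|".join(map(re.escape, name2codepoint)))

def unescape_named(s):
    """Replace named HTML entities with their characters in a single pass."""
    return _ENTITY_RE.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

print(unescape_named("&lt;b&gt;caf&eacute;&lt;/b&gt;"))
```

Because re.sub makes a single pass over the input, '&amp;lt;' becomes '&lt;' and is not decoded a second time, so replacement order is not a concern here.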
For anything more complicated, I use BeautifulSoup.
20👍
Use daniel’s solution if the set of encoded characters is relatively restricted.
Otherwise, use one of the numerous HTML-parsing libraries.
I like BeautifulSoup because it can handle malformed XML/HTML:
http://www.crummy.com/software/BeautifulSoup/
For your question, there’s an example in their documentation:
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bleu!",
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'
7👍
See the bottom of this page on the Python wiki; it lists at least two options to “unescape” HTML.
7👍
If anyone is looking for a simple way to do this via the django templates, you can always use filters like this:
<html>
{{ node.description|safe }}
</html>
I had some data coming from a vendor and everything I posted had html tags actually written on the rendered page as if you were looking at the source.
6👍
Daniel’s comment as an answer:
“escaping only occurs in Django during template rendering. Therefore, there’s no need for an unescape – you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}”
5👍
I found a fine function at: http://snippets.dzone.com/posts/show/4569
def decodeHtmlentities(string):
    import re
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]
3👍
Even though this is a really old question, this may work.
Django 1.5.5
In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'
2👍
I found this in the Cheetah source code (here):
htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()

def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

I’m not sure why they reverse the list; I think it has to do with the way they encode, so in your case it may not need to be reversed.
Also, if I were you, I would change htmlCodes to be a list of tuples rather than a list of lists…
This is going in my library though 🙂
I noticed your title asked for encoding too, so here is Cheetah’s encode function:

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
    display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s
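On the question of list order: a likely reason is that the ampersand has to be encoded first and decoded last. The Python 3 sketch below restates the two functions with made-up names (HTML_CODES, html_encode, html_decode) and shows a round trip, plus what happens when the encoder processes the ampersand last.

```python
HTML_CODES = (
    ("&", "&amp;"),   # must be encoded first
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
)

def html_encode(s, codes=HTML_CODES):
    """Replace each special character with its entity, in the order given."""
    for char, entity in codes:
        s = s.replace(char, entity)
    return s

def html_decode(s, codes=tuple(reversed(HTML_CODES))):
    """Inverse of html_encode: the ampersand is decoded last."""
    for char, entity in codes:
        s = s.replace(entity, char)
    return s

text = 'a < b & "c"'
encoded = html_encode(text)
round_trip = html_decode(encoded)

# Encoding with the ampersand last re-escapes the entities just produced.
double_escaped = html_encode(text, tuple(reversed(HTML_CODES)))

print(encoded)         # a &lt; b &amp; &quot;c&quot;
print(round_trip)      # a < b & "c"
print(double_escaped)  # a &amp;lt; b &amp; &amp;quot;c&amp;quot;
```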
1👍
You can also use django.utils.html.escape
from django.utils.html import escape
something_nice = escape(request.POST['something_naughty'])
0👍
Below is a Python function that uses the module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete, and it assumes that all entities decode to one codepoint, which is wrong for entities like &NotEqualTilde;:
http://www.w3.org/TR/html5/named-character-references.html
NotEqualTilde; U+02242 U+00338 ≂̸
With those caveats though, here’s the code.
import re
import htmlentitydefs

def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)

def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
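On Python 3.4+, both caveats mentioned above (multi-codepoint entities and the &Gt; / &gt; distinction) appear to be covered by the standard library's HTML5 entity table. The short sketch below checks this with html.entities.html5 and html.unescape; the variable names are made up for the example.

```python
from html import unescape
from html.entities import html5, name2codepoint

# The HTML5 table maps names (with trailing ';') to strings, so one
# entity can expand to more than one code point.
multi = html5["NotEqualTilde;"]
print(len(multi))                          # number of code points
print("NotEqualTilde" in name2codepoint)   # the older HTML 4 table lacks it

# Case matters in HTML5: &gt; and &Gt; are different characters.
lower = unescape("&gt;")
upper = unescape("&Gt;")
print(lower, upper)
```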
0👍
This is the easiest solution for this problem –
{% autoescape on %}
{{ body }}
{% endautoescape %}
From this page.
0👍
While searching for the simplest solution to this question in Django and Python, I found that you can use their built-in functions to escape/unescape HTML code.
Example
I saved your HTML code in scraped_html and clean_html:
scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)
Django
You need Django >= 1.0
unescape
To unescape your scraped html code you can use django.utils.text.unescape_entities which:
Convert all named and numeric character references to the corresponding unicode characters.
>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True
escape
To escape your clean html code you can use django.utils.html.escape which:
Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.
>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True
Python
You need Python >= 3.4
unescape
To unescape your scraped html code you can use html.unescape which:
Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding unicode characters.
>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True
escape
To escape your clean html code you can use html.escape which:
Convert the characters &, < and > in string s to HTML-safe sequences.
>>> from html import escape
>>> scraped_html == escape(clean_html)
True