3👍
Thanks very much for your answers, John and Steven. They got me thinking differently, which led me to find the source of the problem and also a working solution.
I was working with the following test code:
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

# Fetch the raw page body with urllib2
url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)
handle = url_handler.open(URL)
response = handle.read()
handle.close()

# Problematic line: build an empty response, then swap in the body
html_response = HtmlResponse(URL).replace(body=response)
hxs = HtmlXPathSelector(html_response)
desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')
Inside the Scrapy shell, when I extracted the description data, it came out fine. That gave me reason to suspect something was wrong in my own code, because at the pdb prompt I was seeing the replacement characters in the extracted data.
I went through the Scrapy docs for the Response class and adjusted the code above to this:
import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

# Fetch the raw page body with urllib2
url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)
handle = url_handler.open(URL)
response = handle.read()
handle.close()

# Fixed: pass the body directly to the constructor so Scrapy can
# detect the encoding from it
#html_response = HtmlResponse(URL).replace(body=response)
html_response = HtmlResponse(URL, body=response)
hxs = HtmlXPathSelector(html_response)
desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')
The change I made was to replace the line html_response = HtmlResponse(URL).replace(body=response) with html_response = HtmlResponse(URL, body=response). It is my understanding that the replace() method was somehow mangling the special characters from an encoding point of view. If anyone would like to chip in with details of what exactly the replace() method did wrong, I'd very much appreciate it.
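For what it's worth, here is a minimal sketch of my current theory; the mechanism is an assumption on my part, not something I have verified in Scrapy's source. HtmlResponse(URL) with no body has to resolve an encoding from nothing, and replace() seems to carry that already-resolved encoding over to the new response instead of re-detecting it from the new body:

# Hypothetical demonstration of the suspected behaviour: assumes that
# TextResponse.replace() reuses the old response's resolved encoding
# rather than re-detecting it from the new body.
from scrapy.http import HtmlResponse

body = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>'
        '<body>H\xc3\xbcftsitz</body></html>')

good = HtmlResponse("http://example.com/", body=body)
bad = HtmlResponse("http://example.com/").replace(body=body)

print good.encoding  # detected from the meta tag: utf-8
print bad.encoding   # whatever the empty response happened to resolve to

If that is what happens, the encoding picked for the empty response wins over the charset declared in the real page, which would explain the mangled umlauts.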
Thank you once again.
3👍
u'\ufffd' is the “unicode replacement character”, usually rendered as a question mark inside a black diamond. It is NOT a u-umlaut, so the problem must be somewhere upstream. Check what encoding the web page's headers claim to be returning, and verify that the content actually is in that encoding.
The replacement character is inserted in place of an illegal or unrecognized character. That can have several causes, but the likeliest is that the encoding is not what it claims to be.
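A quick way to check both claims (a sketch using urllib2 to match the code above; the regex for the meta charset is only illustrative):

import re
import urllib2

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

handle = urllib2.urlopen(URL)
# Encoding claimed by the HTTP Content-Type header, if any
print handle.info().getheader('Content-Type')
body = handle.read()
handle.close()

# Encoding claimed inside the HTML itself, if any
match = re.search(r'charset=["\']?([\w-]+)', body, re.I)
print match.group(1) if match else 'no charset declared in the body'

If the header and the meta tag disagree, or the bytes are not actually in the declared encoding, anything that trusts the declaration will emit replacement characters.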
1👍
U+FFFD is the replacement character that you get when you do some_bytes.decode('some-encoding', 'replace') and some substring of some_bytes can't be decoded.
You have TWO of them: u'H\ufffd\ufffdftsitz' … this indicates that the u-umlaut was represented as TWO bytes, each of which failed to decode. Most likely the site is encoded in UTF-8 but the software is attempting to decode it as ASCII. Attempting to decode as ASCII usually happens when there is an unexpected conversion to Unicode with ASCII as the default encoding, but in that case one would not expect the 'replace' argument to be used. More likely the code takes in an encoding and has been written by someone who thinks “doesn't raise an exception” means the same as “works”.
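You can reproduce the exact symptom in a couple of lines (a minimal sketch; the bytes are what 'Hüftsitz' looks like in UTF-8):

# The u-umlaut in UTF-8 is the two bytes C3 BC
utf8_bytes = 'H\xc3\xbcftsitz'

# Decoding as ASCII with 'replace' turns EACH failing byte into U+FFFD,
# so one umlaut becomes two replacement characters
print repr(utf8_bytes.decode('ascii', 'replace'))
# -> u'H\ufffd\ufffdftsitz'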
Edit your question to provide the URL, and show the minimum code that produces u'H\ufffd\ufffdftsitz'.