No encoding can automatically work out how to decode a file that was written in some other, specific encoding.
UTF-8 is a good default because it is widely used and backward compatible with ASCII. You can, for example, simply ignore or replace characters that can't be decoded, like this:
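# Python 2.7: codecs.open accepts the encoding and errors arguments
from codecs import open
original = open(str(last_uploaded.document), encoding="utf-8", errors="ignore")
# undecodable bytes are dropped silently before the text is split into words
original_words = original.read().lower().split()
...
original.close()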
Or, better, use a context manager (a with statement), which closes the file for you:
with open(str(last_uploaded.document), encoding="utf-8", errors="ignore") as fr:
    original_words = fr.read().lower().split()
    ...
(Note: You do not need the codecs library if you're using Python 3, where the built-in open() accepts encoding and errors directly, but you have tagged your question with python-2.7.)
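If the same code has to run on both Python 2.7 and Python 3, one option is io.open, which takes the same encoding and errors arguments on both versions. The following is only a minimal sketch: the temporary file and its contents are made up for illustration and stand in for your uploaded document.

```python
import io
import os
import tempfile

# Hypothetical sample file containing one byte (0xff) that is never valid in UTF-8.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fw:
    fw.write(b"Hello W\xffrld and more")

# io.open behaves like Python 3's built-in open() and also exists in Python 2.7.
with io.open(path, encoding="utf-8", errors="ignore") as fr:
    words = fr.read().lower().split()

os.remove(path)
print(words)
```

The invalid byte is dropped silently, so the second word comes back shortened rather than raising an error.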
You can read about the advantages and disadvantages of the different error handlers here and here. Note that if you don't specify an error handler, the default is errors="strict", which raises a UnicodeDecodeError on the first undecodable byte and is probably not what you want. The other options are nearly self-explanatory, e.g.:
- using errors="replace" will replace an undecodable character with a suitable replacement marker
- using errors="ignore" will simply skip the character and continue reading the file data.
Which one you should use depends on your needs and use case(s).
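To make the difference between the three handlers concrete, here is a small sketch that decodes the same bytes with each of them. The sample string is an assumption chosen for illustration: it is encoded as Latin-1, so its 0xe9 byte is not valid UTF-8.

```python
# -*- coding: utf-8 -*-
# Illustrative input: Latin-1 bytes that are not valid UTF-8.
raw = u"caf\u00e9 au lait".encode("latin-1")

# errors="strict" is the default and raises on the first bad byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD (the replacement character).
replaced = raw.decode("utf-8", errors="replace")

# errors="ignore" drops the bad byte without leaving a trace.
ignored = raw.decode("utf-8", errors="ignore")

print(strict_failed)
print(repr(replaced))
print(repr(ignored))
```

With "replace" you can still see where data was lost, while "ignore" leaves no marker behind, which matters if you compare or count words afterwards.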
You say that you have encoding problems not only with plain text files, but also with proprietary .doc files:
The .doc format is not a plain text format that you can simply read with open() or codecs.open(), since much of the information is stored in binary form; see this site for more information. So you need a special reader for .doc files to get the text out of them. Which library you should use depends on your Python version and maybe also on your operating system. Maybe here is a good starting point for you.
Unfortunately, using a library does not completely protect you from encoding errors. (Maybe it does, but I'm not sure whether the encoding is stored in the file itself, as it is in a .docx file.) You may also be able to figure out the encoding of the file. How you can handle encoding errors likely depends on the library itself.
So I'm guessing that you are trying to open .doc files as simple text files. In that case you will get decoding errors, because the content is not saved as human-readable text. And even if you get rid of the error, you will only see non-human-readable text. Here I created a simple text file with LibreOffice in .doc format (Microsoft Word 1997-2003):
In [1]: open("./test.doc", "r").read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
In [2]: open("./test.doc", "r", errors="replace").read() # or open("./test.doc", "rb").read()
'��\x11\u0871\x1a�\x00\x00\x00' ...
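The 0xd0 byte at position 0 in the error above is the start of the signature that OLE2 compound files (the container format behind legacy .doc files) begin with. One way to avoid treating such a file as text is to check for that signature first. This is only a sketch: looks_like_ole2 and the temporary files are hypothetical names made up for the example, and a match only tells you the file is an OLE2 container (old .xls and .ppt files use it too), not that it is a Word document.

```python
import os
import tempfile

# First eight bytes of every OLE2 compound file.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def looks_like_ole2(path):
    """Return True if the file starts with the OLE2 signature."""
    with open(path, "rb") as fh:
        return fh.read(len(OLE2_MAGIC)) == OLE2_MAGIC


def _write_temp(data):
    """Write bytes to a throwaway file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fw:
        fw.write(data)
    return path


# Two made-up files: one with the signature, one with plain text.
fake_doc = _write_temp(OLE2_MAGIC + b"\x00" * 16)
plain_txt = _write_temp(b"just some text")

doc_result = looks_like_ole2(fake_doc)
txt_result = looks_like_ole2(plain_txt)

os.remove(fake_doc)
os.remove(plain_txt)
print(doc_result, txt_result)
```

If the check matches, hand the file to a dedicated .doc reader instead of open() with a text encoding.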