I came across this issue recently too. python-magic uses libmagic, the library behind the Unix file command, which relies on a database file to identify documents (see man file). By default this database does not include instructions on how to identify .docx, .pptx, and .xlsx file types.
You can give the file command the additional information it needs to identify these types by adding instructions to /etc/magic (see https://serverfault.com/a/377792).
This should then work:
magic.from_file("path_to_the_file.docx", mime=True)
Returns 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
One thing to note: the from_buffer example from the python-magic usage instructions on GitHub does not seem to work for .docx, .pptx, and .xlsx file types (even with the additional information in /etc/magic):
magic.from_buffer(open("path_to_the_file.docx", "rb").read(1024), mime=True)
Returns 'application/zip'
It seems you need to give it more data to correctly identify these file types:
magic.from_buffer(open("path_to_the_file.docx", "rb").read(2000), mime=True)
Returns 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
I'm not sure of the exact amount needed.
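If you want to find out roughly how much data is needed for a particular file, one option is to feed from_buffer progressively larger prefixes until the result stops being the generic 'application/zip'. The following is only a rough sketch: the helper names (smallest_sufficient_prefix, magic_mime) and the start and step values are my own assumptions, not part of python-magic, and the result is only accurate to within one step.

```python
from typing import Callable, Optional


def smallest_sufficient_prefix(
    data: bytes,
    detect: Callable[[bytes], str],
    generic: str = "application/zip",
    start: int = 1024,
    step: int = 256,
) -> Optional[int]:
    """Return the smallest tested prefix length for which detect() no
    longer reports the generic type, or None if even the whole buffer
    is still reported as generic."""
    size = min(start, len(data))
    while True:
        if detect(data[:size]) != generic:
            return size
        if size >= len(data):
            # The whole buffer was tried and is still generic.
            return None
        size = min(size + step, len(data))


def magic_mime(chunk: bytes) -> str:
    """Detector backed by python-magic (hypothetical wrapper name)."""
    # Imported lazily so the helper above has no hard dependency.
    import magic

    return magic.from_buffer(chunk, mime=True)
```

With python-magic installed, smallest_sufficient_prefix(open("path_to_the_file.docx", "rb").read(), magic_mime) should give a ballpark figure for that file. The number will likely differ between files, so if the files are small it may be simpler to pass the whole content to from_buffer.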