5👍
First you need to read your .doc .docx .odt and .pdf.
Second, count the words (<2.7 version).
3👍
These answers miss a trick as regards MS Word & .odt.
MS Word records a .docx file’s word count whenever it is saved. A .docx file is simply a zip file. Accessing the "Words" (= word count) property therein is simple and can be done with modules from the standard library:
import zipfile
import xml.etree.ElementTree as ET

# docx_file_paths: an iterable of paths to the .docx files to examine
total_word_count = 0
for docx_file_path in docx_file_paths:
    zin = zipfile.ZipFile(docx_file_path)
    for item in zin.infolist():
        # docProps/app.xml holds the extended properties, including "Words"
        if item.filename == 'docProps/app.xml':
            buffer = zin.read(item.filename)
            root = ET.fromstring(buffer.decode('utf-8'))
            for child in root:
                # Tags are namespace-qualified, so match on the local name
                if child.tag.endswith('Words'):
                    print(f'{docx_file_path} word count {child.text}')
                    total_word_count += int(child.text)
print(f'total word count all files {total_word_count}')
Pros and cons: the main pro is that, for most files, this is going to be far faster than anything else.
The main con is that you’re stuck with the various idiosyncrasies of MS Word’s counting methods: I am not particularly interested in the details but I know that these have changed over the versions (e.g. words in text boxes may or may not be included).
More significantly, the running word count maintained by Word when you have a .docx file open is notoriously different from the value saved by Word in docProps/app.xml. The real word count usually seems to be about 10% more than the "Words" property, and this does not appear to relate to the presence or absence of headers, footers, text boxes, etc. So it may or may not be fit for your use case. It can still be useful for a lightning-quick estimate of word counts across a large number of .docx files, but ideally I’d add on an extra 10%: call it the MJ (Microsoft-Junk) adjustment.
Also bear in mind that comparable inaccuracies may also apply if you choose to pick apart and parse the entire text content of a .docx file. The various available modules, e.g. python-docx, seem to do a pretty good job, but in my experience none is perfect.
If you actually extract and parse, by yourself, the word/document.xml file inside a .docx file, you begin to realise that there are some daunting complexities involved.
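To give a feel for what parsing the content yourself involves, here is a minimal sketch using only the standard library. It assumes the main body lives in word/document.xml and that the visible text sits in w:t elements grouped under w:p paragraphs; the function name count_docx_body_words is made up for illustration. It ignores headers, footers, footnotes and comments (which live in other files in the archive), and text inside text boxes may be counted twice because those paragraphs are nested inside other paragraphs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def count_docx_body_words(docx_path):
    """Count whitespace-separated words in the main body of a .docx file."""
    with zipfile.ZipFile(docx_path) as zin:
        root = ET.fromstring(zin.read('word/document.xml'))
    total = 0
    # Each w:p is a paragraph; a single word can be split across several w:t runs,
    # so join the runs of a paragraph before splitting on whitespace
    for paragraph in root.iter(W_NS + 'p'):
        text = ''.join(t.text or '' for t in paragraph.iter(W_NS + 't'))
        total += len(text.split())
    return total
```

Joining the runs per paragraph, rather than stripping every tag from the raw XML, avoids both fusing the last word of one paragraph with the first word of the next and splitting a word that Word happened to store in two runs.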
.odt files
Again, these are zip files, and again a similar property is found in meta.xml. I just created and unzipped one such file, and the meta.xml in it looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.3">
<office:meta>
<meta:creation-date>2023-06-11T18:25:09.898000000</meta:creation-date>
<dc:date>2023-06-11T18:25:21.656000000</dc:date>
<meta:editing-duration>PT11S</meta:editing-duration>
<meta:editing-cycles>1</meta:editing-cycles>
<meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="2" meta:character-count="12" meta:non-whitespace-character-count="11"/>
<meta:generator>LibreOffice/7.4.6.2$Windows_X86_64 LibreOffice_project/5b1f5509c2decdade7fda905e3e1429a67acd63d</meta:generator>
</office:meta>
</office:document-meta>
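Thus you need to look at the meta:word-count attribute of the meta:document-statistic element inside office:meta (schematically, root['office:meta']['meta:document-statistic']; this is shorthand for the path, not literal ElementTree syntax).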
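A minimal sketch of reading that attribute with the standard library follows. The namespace URIs are copied from the meta.xml sample above; the function name odt_word_count is hypothetical, and returning None when the statistic is missing is my own choice rather than anything the format prescribes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs copied from the meta.xml sample above
NS = {
    'office': 'urn:oasis:names:tc:opendocument:xmlns:office:1.0',
    'meta': 'urn:oasis:names:tc:opendocument:xmlns:meta:1.0',
}


def odt_word_count(odt_path):
    """Return the word count stored in an .odt file's meta.xml, or None if absent."""
    with zipfile.ZipFile(odt_path) as zin:
        root = ET.fromstring(zin.read('meta.xml'))
    stats = root.find('office:meta/meta:document-statistic', NS)
    if stats is None:
        return None
    # Attribute names use Clark notation: {namespace-uri}local-name
    value = stats.get('{%s}word-count' % NS['meta'])
    return int(value) if value is not None else None
```

For the sample file shown above this would return 2. As with the .docx property, the value is whatever the word processor last saved, so it is only as accurate as that application's own counting.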
I don’t know about PDFs: they may well require brute-force counting. PyPDF2 looks like the way to go: the simplest approach would be to convert to plain text and count the words that way. I have no idea what might be missed out.
And a scanned PDF, for example, may be hundreds of pages long but be said to contain "0 words". Or indeed there may be scanned text interspersed with bona fide text content…
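The brute-force approach might look something like the sketch below. It assumes the third-party pypdf package (the successor to PyPDF2) is installed; the function names are made up for illustration, and a "word" here is simply anything separated by whitespace. Image-only pages contribute nothing, which is how a long scanned PDF ends up with a count of 0.

```python
def count_words(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


def pdf_word_count(pdf_path):
    """Sum the word counts of the text extracted from every page of a PDF."""
    # Imported here so that count_words() stays usable without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Guard against pages that yield no text at all, e.g. scanned pages
    return sum(count_words(page.extract_text() or '') for page in reader.pages)
```

How well this works depends entirely on the quality of the text extraction: words broken across lines or laid out in columns can easily be split or merged.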
0👍
Given that you can do this for .txt files I’ll assume that you know how to count the words, and that you just need to know how to read the various file types. Take a look at these libraries:
PDF: pypdf
doc/docx: this question, python-docx
odt: examples here
-1👍
As noted in @Chad’s answer to "extracting text from MS Word files in Python":
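import re
import zipfile

docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml').decode('utf-8')
# Put a space at each paragraph end so adjacent paragraphs do not fuse into one word
content = content.replace('</w:p>', ' ')
# Strip all remaining XML tags
cleaned = re.sub(r'<[^>]*>', '', content)
# Count whitespace-separated words (len(cleaned) would count characters)
word_count = len(cleaned.split())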