How to get a word count on a Word document in Python?

First you need to read your .doc, .docx, .odt and .pdf files.

These answers miss a trick as regards MS Word & .odt.

MS Word records a .docx file’s word count whenever it is saved. A .docx file is simply a zip file. Accessing the "Words" (= word count) property therein is simple and can be done with modules from the standard library:

import zipfile
import xml.etree.ElementTree as ET

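# docx_file_paths is assumed to be an iterable of paths to .docx files.
total_word_count = 0
for docx_file_path in docx_file_paths:
    # A .docx file is a zip archive; the with block closes it when done.
    with zipfile.ZipFile(docx_file_path) as zin:
        for item in zin.infolist():
            if item.filename == 'docProps/app.xml':
                # ElementTree picks up the encoding from the XML declaration.
                root = ET.fromstring(zin.read(item.filename))
                for child in root:
                    # Tags are namespace-qualified, so match on the local name.
                    if child.tag.endswith('}Words'):
                        print(f'{docx_file_path} word count {child.text}')
                        total_word_count += int(child.text)
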
print(f'total word count all files {total_word_count}')

Pros and cons: the main pro is that, for most files, this is going to be far faster than anything else.

The main con is that you’re stuck with the various idiosyncrasies of MS Word’s counting methods: I am not particularly interested in the details but I know that these have changed over the versions (e.g. words in text boxes may or may not be included).

More significantly, the running word count maintained by Word when you have a .docx file open is notoriously different from the value saved by Word in docProps/app.xml. The real word count usually seems to be about 10% more than the "Words" property, and this does not appear to relate to the presence or absence of headers/footers/text boxes, etc. So it may or may not be fit for your use case. It can still be useful, typically for a lightning-quick estimate of word counts for a large number of .docx files, but ideally I’d add on an extra 10%: call it the MJ (Microsoft-Junk) adjustment.

Also bear in mind that comparable inaccuracies may also apply if you choose to pick apart and parse the entire text content of a .docx file. The various available modules, e.g. python-docx, seem to do a pretty good job, but in my experience none is perfect.

If you actually extract and parse, by yourself, the word/document.xml file inside a .docx file, you begin to realise that there are some daunting complexities involved.
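
To give a feel for what such do-it-yourself parsing involves, here is a minimal sketch using only the standard library. It assumes the body text sits in word/document.xml inside the archive, in w:t elements grouped under w:p paragraphs. The helper name docx_body_word_count is made up for illustration, and the sketch deliberately ignores several of the complexities mentioned above: headers, footers, footnotes and comments live in other parts of the archive, tabs and line breaks inside a paragraph are not treated as word separators, and text boxes nested inside a paragraph would be counted twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by word/document.xml.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def docx_body_word_count(docx_file_path):
    """Count whitespace-separated tokens in the main document body.

    Hypothetical helper for illustration; not part of any library.
    """
    with zipfile.ZipFile(docx_file_path) as zin:
        root = ET.fromstring(zin.read('word/document.xml'))
    count = 0
    # Each w:p is a paragraph; its text may be split across several w:t runs,
    # so join the runs before splitting into words.
    for paragraph in root.iter(W_NS + 'p'):
        text = ''.join(t.text or '' for t in paragraph.iter(W_NS + 't'))
        count += len(text.split())
    return count
```

Joining the runs of a paragraph before splitting matters because Word often breaks a single word across two runs (after a spelling correction, for example), and counting each run separately would inflate the total.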

.odt files
Again, these are zip files, and again a similar property is found in meta.xml. I just created and unzipped one such file, and the meta.xml in it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.3">
    <office:meta>
        <meta:creation-date>2023-06-11T18:25:09.898000000</meta:creation-date>
        <dc:date>2023-06-11T18:25:21.656000000</dc:date>
        <meta:editing-duration>PT11S</meta:editing-duration>
        <meta:editing-cycles>1</meta:editing-cycles>
        <meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="2" meta:character-count="12" meta:non-whitespace-character-count="11"/>
        <meta:generator>LibreOffice/7.4.6.2$Windows_X86_64 LibreOffice_project/5b1f5509c2decdade7fda905e3e1429a67acd63d</meta:generator>
    </office:meta>
</office:document-meta>

Thus you need the meta:word-count attribute of the meta:document-statistic element, which sits inside office:meta under the root element.
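
A minimal sketch of that lookup, again with the standard library only. The namespace URIs are copied from the meta.xml shown above; the helper name odt_word_count is made up for illustration. Note that ElementTree does not understand the office: and meta: prefixes by themselves, so the element path needs a prefix-to-URI mapping and the attribute name has to be spelled out as {uri}word-count.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the meta.xml example.
ODF_NS = {
    'office': 'urn:oasis:names:tc:opendocument:xmlns:office:1.0',
    'meta': 'urn:oasis:names:tc:opendocument:xmlns:meta:1.0',
}


def odt_word_count(odt_file_path):
    """Return the word count stored in meta.xml, or None if it is absent.

    Hypothetical helper for illustration; not part of any library.
    """
    with zipfile.ZipFile(odt_file_path) as zin:
        root = ET.fromstring(zin.read('meta.xml'))
    stat = root.find('office:meta/meta:document-statistic', ODF_NS)
    if stat is None:
        return None
    # Attribute names are namespace-qualified: {uri}local-name.
    value = stat.get('{%s}word-count' % ODF_NS['meta'])
    return int(value) if value is not None else None
```

As with the .docx property, this reads whatever figure the editor saved last, so it is only as accurate as the application's own counting.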

I don’t know about PDFs: they may well require brute-force counting. PyPDF2 looks like the way to go: the simplest way would be to convert to txt and count that way. I have no idea what might be missed out.
And a scanned PDF, for example, may be hundreds of pages long but be said to contain "0 words". Or indeed there may be scanned text interspersed with bona fide text content…

mike rodent

Given that you can do this for .txt files I’ll assume that you know how to count the words, and that you just need to know how to read the various file types. Take a look at these libraries:

PDF: pypdf

doc/docx: this question, python-docx

odt: examples here

andronikus

As noted in @Chad’s answer to "extracting text from MS word files in python":

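import re
import zipfile

# A .docx file is a zip archive; the main body text lives in word/document.xml.
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Turn paragraph ends into spaces so adjacent paragraphs do not merge,
# then strip all remaining tags.
cleaned = re.sub(r'<[^>]+>', '', content.replace('</w:p>', ' '))

# Count whitespace-separated words (len(cleaned) would count characters).
word_count = len(cleaned.split())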

Mark K

Source:stackexchange.com
