
Problems Extracting the XML from a Word Document in French with Python: Illegal Characters Generated

Over the past few days I have been attempting to create a script which would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create and save

Solution 1:

The problem is that you are accidentally changing the encoding of word/document.xml in template2.docx. In template.docx, word/document.xml is encoded as UTF-8, which is the default encoding for XML documents, and you decode it correctly when you read it:

xmlString = zip.read("word/document.xml").decode("utf-8")

However, when you copy it for template2.docx you are changing the encoding to CP-1252. According to the documentation for open(file, "w"),

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

You indicated that locale.getpreferredencoding(False) returns cp1252 on your machine, so that is the encoding in which word/document.xml is being written.

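Since word/document.xml does not begin with an XML declaration naming that encoding, such as <?xml version="1.0" encoding="cp1252"?>, Word (or any other XML reader) reads it as UTF-8 instead of CP-1252. Accented characters such as é are stored as single bytes in CP-1252, and those bytes are not valid UTF-8, which is what gives you the illegal XML character error.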
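To make the mismatch concrete, here is a minimal sketch that encodes the same French text both ways and then tries to read each result back as UTF-8. The sample string and the helper name decodes_as_utf8 are made up for illustration; they are not from your script.

```python
# Sample French text (hypothetical, chosen only because it contains "é").
text = "Résumé"

utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

# In UTF-8, "é" is the two-byte sequence 0xC3 0xA9.
# In CP-1252, "é" is the single byte 0xE9.
assert utf8_bytes == b"R\xc3\xa9sum\xc3\xa9"
assert cp1252_bytes == b"R\xe9sum\xe9"


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # the UTF-8 bytes round-trip
print(decodes_as_utf8(cp1252_bytes))  # the CP-1252 bytes do not
```

A reader that assumes UTF-8 hits the lone 0xE9 byte and rejects the file, which is essentially what Word is doing with your rewritten word/document.xml.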

So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():

with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
    f.write(xmlString)
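If you would rather avoid the temporary directory and text-mode open() altogether, one possible approach is to copy the archive entry by entry with zipfile and encode the modified XML to UTF-8 bytes yourself. This is only a sketch under assumptions: the function name rewrite_document_xml and the transform callback are hypothetical, and it assumes word/document.xml is UTF-8 as discussed above.

```python
import zipfile


def rewrite_document_xml(src_path, dst_path, transform):
    """Copy a .docx, passing word/document.xml through `transform`.

    `transform` is a hypothetical callback taking and returning a str.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                # Decode and re-encode explicitly as UTF-8, so the
                # platform's locale encoding is never involved.
                xml_string = data.decode("utf-8")
                data = transform(xml_string).encode("utf-8")
            # Every other entry is copied through unchanged.
            dst.writestr(item, data)
```

For example, rewrite_document_xml("template.docx", "template2.docx", lambda s: s.replace("Bonjour", "Salut")) would write the new document, where the replacement shown is only a placeholder for your own modification. Because the data is handled as bytes on the way out, the locale default never comes into play.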
