Converting a Word document to Text

The module below demonstrates how to convert a batch of Word documents to text.

If calling from the command line, you can pass the path of the directory containing the files to convert as an argument. Or call the module without an argument and it will use whatever you have defined as test_path.
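The argument handling can be tried on its own in plain Python. The sketch below is only an illustration: choose_path, DEFAULT_PATH and the script name convert.py are hypothetical names that do not appear in the module, and the default value simply mirrors test_path.

```python
DEFAULT_PATH = "C:\\test\\"  # hypothetical default, mirrors test_path


def choose_path(argv, default=DEFAULT_PATH):
    """Return argv[1] when exactly one argument was passed, else the default."""
    # argv[0] is the script name, so one argument means len(argv) == 2.
    if len(argv) == 2:
        return argv[1]
    return default


print(choose_path(["convert.py", "D:\\docs\\"]))  # path given on the command line
print(choose_path(["convert.py"]))                # no argument: falls back to the default
```

Note that any other number of arguments also falls back to the default rather than raising an error.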

The doc_to_text function does the actual work of converting an individual Word document to text. Using COM Interop, it opens the Word document, loops through the paragraphs, and collects the text of each one. Each paragraph's text is passed to clean_text, which performs the text cleansing. Internally, Word documents use carriage returns (CR), which I replace with carriage return plus line feed (CR+LF). Page breaks are represented by the form feed (FF) character, which I remove. I'm not sure what the BEL character is used for; however, it was prevalent in my Word documents, so I replaced it with an empty string.
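The cleansing step can be checked without Word, since it is ordinary string handling. The sketch below uses a made-up sample string and a hypothetical helper name, strip_word_controls. BEL is commonly reported to mark the end of a table cell in the text Word returns, but that is an assumption here. One detail worth knowing: in Python string escapes, form feed is "\x0c" (or "\f", or octal "\14"), while octal "\12" is a line feed.

```python
def strip_word_controls(text):
    """Remove FF and BEL, then expand each bare CR to CR+LF."""
    text = text.replace("\x0c", "")    # FF: page break
    text = text.replace("\x07", "")    # BEL
    text = text.replace("\r", "\r\n")  # CR -> CR+LF
    return text


# Made-up sample, not taken from a real document.
sample = "First paragraph\rCell text\r\x07\x0cSecond page\r"
cleaned = strip_word_controls(sample)
print(repr(cleaned))

# Escape sanity checks: which octal escape is which character.
print("\14" == "\x0c", "\12" == "\n", "\07" == "\x07")
```

The replacement order matters only for CR: it assumes the input contains bare carriage returns, so text that already has CR+LF pairs would end up with doubled line feeds.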

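The convert_files function gets a list of all of the Word documents in a directory, loops through that list converting each file, and saves each result as a text file with the same base name. Because only the file name is passed to File.CreateText, the text files are written to the current working directory.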
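The batch loop itself does not depend on Word, so its shape can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's code: convert_directory, out_dir, to_text and fake_converter are hypothetical names, the converter is passed in as a callable standing in for doc_to_text, and the output directory is explicit instead of being the current working directory.

```python
import tempfile
from pathlib import Path


def convert_directory(doc_dir, out_dir, to_text):
    """Convert every *.doc file in doc_dir and write <name>.txt into out_dir.

    to_text is any callable that takes a file path and returns a string.
    """
    written = []
    for doc in sorted(Path(doc_dir).glob("*.doc")):
        target = Path(out_dir) / (doc.stem + ".txt")
        # newline="" writes CR+LF pairs exactly as they appear in the text.
        with open(target, "w", encoding="utf-8", newline="") as handle:
            handle.write(to_text(str(doc)))
        written.append(target.name)
    return written


def fake_converter(path):
    # Stand-in for doc_to_text so the loop runs without Word installed.
    return "text of " + Path(path).name + "\r\n"


with tempfile.TemporaryDirectory() as work:
    for name in ("a.doc", "b.doc", "notes.txt"):
        (Path(work) / name).write_bytes(b"")
    result = convert_directory(work, work, fake_converter)
    a_txt = (Path(work) / "a.txt").read_bytes()

print(result)
```

Passing the converter in as an argument keeps the file handling testable on a machine without Word; in the real module the callable would be doc_to_text.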

__author__ = "Edward J. Stembler"
__date__ = "2009-01-09"
__module_name__ = "Converts a batch of Word documents, found in a directory, to text"
__version__ = "1.0"
version_info = (1, 0, 0)

import sys
import clr
import System
from System.Text import StringBuilder
from System.IO import DirectoryInfo, File, FileInfo, Path, StreamWriter

# The interop assembly must be referenced before its namespace can be imported.
clr.AddReference("Microsoft.Office.Interop.Word")
import Microsoft.Office.Interop.Word as Word

def convert_files(doc_path):

    directory = DirectoryInfo(doc_path)
    files = directory.GetFiles("*.doc")

    for file_info in files:
        text = doc_to_text(Path.Combine(doc_path, file_info.Name))

        # The .txt file is created in the current working directory.
        stream_writer = File.CreateText(Path.GetFileNameWithoutExtension(file_info.Name) + ".txt")
        stream_writer.Write(text)
        stream_writer.Close()

    return

def doc_to_text(filename):

    # Start Word through COM Interop and keep its window hidden.
    word_application = Word.ApplicationClass()
    word_application.Visible = False

    document = word_application.Documents.Open(filename)

    result = StringBuilder()

    for p in document.Paragraphs:
        result.Append(clean_text(p.Range.Text))

    document.Close()
    document = None

    word_application.Quit()
    word_application = None

    return result.ToString()

def clean_text(text):

    text = text.replace("\x0c", "")    # FF
    text = text.replace("\x07", "")    # BEL
    text = text.replace("\r", "\r\n")  # CR -> CRLF

    return text

test_path = "C:\\test\\"

if __name__ == "__main__":
    if len(sys.argv) == 2:
        convert_files(sys.argv[1])
    else:
        convert_files(test_path)

--Ejstembler 02:49, 21 January 2009 (UTC)
