How to Convert a Word Document to Markdown Format

So you need to get your nifty Word doc into a format that can be used on the web, handled by a wide variety of editors, or — if you’re like me — included in a git repository.

The Problem: You Created Your Content in Microsoft Word

Isn’t that always a problem?

OK I’m not a Microsoft fan these days—almost across the board. Haven’t been for many years.

But not long ago I created a massive proposal for a client that we’re partnering with for some projects. Our client is a Microsoft shop through and through, and I’ve been forced to install Microsoft Teams on my Linux machine to collaborate with their crew. This has actually been a surprisingly good experience—allowing me to use Microsoft Word on Ubuntu. (Yes, this could have been done in the browser, but I find the desktop client for Teams to be quite good.)

But now we need to be able to repurpose and reuse much of the content in the proposal in future proposals, which will require a fair amount of editing, version control, change tracking, etc.

Sure. This could theoretically be done in Microsoft Word, but we all know that git is a much better tool for that job, am I right?

The Goal: Edit Content from Word in a git Repository

From a high-level viewpoint, what I want to do is create a modular set of content elements that can then be loaded into the client’s proposal generator tools with nice formatting.

The Process: Converting a .DOCX File into a Markdown File Using pandoc

I engaged in some trial and error (details below if you’re interested), but for my purposes, pandoc was the tool for the job. Since it’s written in Haskell, there’s an installer for Windows, MacOS, various flavors of Linux … heck, there’s even something for ChromeOS and a Docker image, to boot!

Time needed: 5 minutes.

  1. Download and install pandoc

    Save yourself some trouble download the latest release from the pandoc GitHub repository. Ubuntu’s package manager had a very outdated version, but the release in the code repository includes a handy .deb file, which was exactly what I needed for my system.

  2. Open a command prompt and navigate to the folder where your Word doc is located

    On Ubuntu, I hit CTRL+ALT+T to open a new terminal window, and then changed directories:

    cd ~/Documents/MyFolder/

    where MyFolder is the name of the directory where your Word doc is located.

  3. Convert the file

    Running pandoc is relatively straightforward for a job like this:

    pandoc MyWordDoc.docx -f docx -t markdown -o MyWordDoc.md

    where MyWordDoc.docx is the name of the Word document you want to convert and MyWordDoc.md is the name of the output file (call yours anything you want, but it’s useful to name it with a .md file extension).

Frankly, this yielded fantastic results for me. The proposal was intentionally crafted with relatively simple formatting, so there weren’t too many bizarre elements to worry about.

That said, even a cursory glance at the pandoc documentation reveals that it has substantial capabilities. I’m filing that one away for future reference! For now, I’m not even scratching the surface of what it can do.

Huge thanks to John MacFarlane for building pandoc and making it available!

That’s it! I hope this helps! Feel free to throw a comment below one way or the other.

Also: thanks to V. David Zvenyach (@vdavez) for posting this fantastic Gist on GitHub to get me started down the right path on this!

Here’s What Didn’t Work For Me

Everything that follows is just here because it’s cathartic for me to document stuff that I’m nearly 100% certain no one else will find useful. You’re welcome to ignore this part!

Mr. Zvenyach’s approach was to convert a Word document (in .DOCX format) to Markdown using 2 tools: unoconv and then pandoc.

It wasn’t until I’d installed both tools on Ubuntu and run the Word doc through unoconv that I discovered a comment on the gist which indicated that pandoc could now handle Word docs directly.

In fact, using the version of <unoconv> from Ubuntu 18.04’s package manager, I got a nasty error message:

func=xmlSecCheckVersionExt:file=xmlsec.c:line=188:obj=unknown:subj=unknown:error=19:invalid version:mode=abi compatible;expected minor version=2;real minor version=2;expected subminor version=25;real subminor version=26

The unoconv repository’s readme file mentions python compatibility issues related to the version it’s compiled with and the version used by LibreOffice/OpenOffice (my system has LibreOffice given that’s what comes with Ubuntu).

I was going to attempt a workaround as described in the readme to see if the python version might be behind the error message I got, but then I noticed that the script had output an html file.

So I ran that file through pandoc and got a Markdown file. The resulting output wasn’t pleasant.

So I decided to upgrade pandoc and just skip unoconv altogether. Seemed like it might be worth a try.

My Ubuntu 18.04 LTS system ended up with pandoc 1.19.2.4 when I installed using apt install pandoc, but the current release shown on the pandoc website as of this writing is pandoc 2.9.2.1.

Since I got such great results, that was where I stopped. But I certainly could have tried a more recent version of unoconv to see what it might be capable of doing. And I’m sure there are other ways to accomplish this, but I’ll be sticking with pandoc for now.

Be sure to let me know what you’ve discovered or run into. I’d be very interested in hearing about it! Just drop a comment below. Thanks!

Leave a Reply