Python-Markdown - Generating HTML documents from Markdown text in Python.

This post is the essence of my first talk, presented at the April 2014 edition of the budapest.py meetup series. The original intention was to participate in the flash talk competition, but the talk was extended a little later. Hence, it is more a bite-size appetizer of Markdown and its Python implementation than a thorough introduction. I briefly explain what Markdown is and is not, and what it can be used for. Then you can read about how the Python implementation is similar to, and how it differs from, the original conversion tool written in Perl. Finally, in a real project, extending and fine-tuning how the HTML document is generated is inevitable, so we will look at a simple explanation of how the conversion process can be altered to accommodate your special needs.

What is Markdown?

A plain text format for authoring prose. Markdown is a plain text format you can use to create structured documents that are:

  • Easy to read
  • Easy to write

This is also the order of priority: the most important consideration when Markdown was designed was that the text should be easy to read in its raw, plain text form, so that it is publishable as is.

Markdown is also a software tool for converting Markdown text into HTML. The original concept and the first conversion tool, written in Perl, were authored by John Gruber, although many others contributed as well. Today, basically every popular programming language has a Markdown tool available, so it is very widespread. HTML and XHTML are the two most common output formats for Markdown conversion tools, but a few also support other formats, like PDF, RTF and even proprietary formats such as Microsoft Word.

Why Markdown?

So what are the main benefits of Markdown, and why should you use it? First of all, I can't emphasize enough how useful the good plain text readability of Markdown is, especially when documenting code. You can commit the raw Markdown source of your docs along with your code, read it in your favorite text editor and easily track changes with the version control software you use. I think this was the very reason that convinced GitHub to bet on, and even invest in, Markdown when creating GitHub Flavored Markdown and providing it to users as the primary documentation facility for their coding projects.

Another seemingly tiny, but in my opinion very useful, feature is the sensible escaping of the characters that typically cause a headache for HTML document authors: & and <. I know many of you now think that there are several tried and proven ways of taking care of this, and you are dead right. However, in my opinion, the advantage of using Markdown is that you just don't have to worry about it, at least for text that is generated from this format on your page.
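To make the escaping concrete, here is a minimal sketch that uses html.escape from Python's standard library rather than a Markdown converter. It only illustrates the substitution you would otherwise have to remember to do by hand. The sample sentence is made up, and a real Markdown tool is smarter about it (for example, it leaves intentional HTML tags alone).

```python
import html

# A hypothetical sentence containing the two troublesome characters.
raw_text = "AT&T reported that 4 < 5"

# quote=False leaves quotation marks untouched and escapes only &, < and >.
escaped = html.escape(raw_text, quote=False)

print(escaped)
```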

The Markdown syntax.

Now, let's see a few highlights of the syntax, so that you can get a taste of Markdown. I sum up how some of the basic document elements can be expressed in Markdown and what the resulting HTML looks like. Completeness wasn't a goal, so please refer to the syntax specification on John Gruber's site for a complete reference.

Headings

# Heading1 -> <h1>Heading1</h1>
## Heading2 -> <h2>Heading2</h2>

and so on...

Unordered list

You simply begin each line with an asterisk (*) as follows.

* List item 1
* List item 2
* List item 3

will be

<ul>
    <li>List item 1</li>
    <li>List item 2</li>
    <li>List item 3</li>
</ul>

Ordered list

Begin the line with a number and a dot.

1. List item 1
2. List item 2
3. List item 3

will be

<ol>
    <li>List item 1</li>
    <li>List item 2</li>
    <li>List item 3</li>
</ol>

Paragraph

Writing a paragraph is as easy as it can get, since any text block separated with one or more blank lines from the rest of the document is a paragraph according to Markdown.

Code

To highlight an expression in inline text, you just have to wrap it in backticks like this:

`myFancyFunction()`

For displaying a preformatted code block, you just need to indent every line of the block with 4 or more spaces or 1 or more tabs. The code snippet will be displayed in the HTML document between <pre><code></code></pre> tags.

In general, I think a big plus of the Markdown syntax is that it doesn't try to do too much. If something couldn't be implemented properly in plain text, it wasn't. For instance, the Markdown syntax doesn't support tables. For many this is a downside, but in my opinion it is an upside, because this was the only way to preserve the readability of the markup. Instead, Markdown allows you to fall back to HTML wherever you want, so if something cannot be expressed in Markdown, you just write it in HTML. In the case of tables, for example, this is a plus: if you have ever struggled with creating tables in wiki style markup, you know what I am talking about. So you can use HTML liberally. There are a few special rules for block elements, but in most situations you just switch from Markdown to HTML and back whenever you want. One of the special rules is that Markdown syntax is not processed inside block-level HTML elements.

Python-Markdown

After this brief introduction to Markdown itself, let's cut to the chase and talk about the Python implementation of Markdown. The Python implementation is as close as possible to the original specification, but there are subtle differences.

Python-Markdown defaults to ignoring middle-word emphasis. This behaviour can be switched off, but the divergence from the original behaviour can be useful, especially for technical documentation. Because emphasis can be expressed in Markdown with underscores (_), a file name with underscores in your text could otherwise result in undesired output and confusion.

The Python implementation also enforces the 4 spaces (or 1 tab) indentation rule for block level elements nested in a list. This follows the specification, but some other implementations out there do not enforce it, so if you expect the content to be converted by other libraries as well, it is good to keep this in mind. The tab length is adjustable via the tab_length configuration option.

The last small difference I want to mention is that Python-Markdown doesn't start a new list when the list marker changes between consecutive lists.

Basic usage of Python-Markdown.

The simplest way of using the library is to call the markdown function of the module:

import markdown
html = markdown.markdown(text_string)

where text_string is the variable holding your Markdown plain text content. Alternatively, you can also use markdown.markdownFromFile to fetch the Markdown text from a file in various ways.

You might also want to use the library with some special configuration or customization (you will see an example shortly). In that case, you can instantiate the Markdown class from the module, passing the configuration parameters on initialization, and then call its convert method with the Markdown text, just as you passed it to the module-level function above.

Changing the conversion behavior.

What we have seen so far is all nice and good: we have an easy to read, lightweight markup syntax that we can convert into HTML. However, even in a small project, there will be at least one tiny change you want to make to customize the output. Maybe you want to add a special class to an HTML element, or add support for HTML elements that the library and the specification don't support out of the box.

No problem: Python-Markdown has a very simple yet powerful Extension API you can use to extend or modify the conversion behavior of the library. The two most useful extension types you can create if you want to change how the text is converted are:

  • Preprocessors
  • Treeprocessors (post-processing on the document tree)

A preprocessor is a class you write following a few special rules. Its run method automatically receives the plain Markdown text, as a list of lines, when the conversion happens, and you can make changes to this text before it is converted to HTML. So if you want to make changes to the raw text, this is the way to go.
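As a rough illustration of the kind of work a preprocessor does, the sketch below transforms a list of raw Markdown lines before any conversion takes place. It is a plain function so that it runs without the library. The function name and the {{year}} placeholder are invented for this example and are not part of Python-Markdown.

```python
def replace_placeholders(lines, year="2014"):
    """Return a new list of lines with the {{year}} token filled in."""
    return [line.replace("{{year}}", year) for line in lines]


# Hypothetical raw Markdown source, already split into lines.
source_lines = [
    "# Release notes",
    "",
    "Copyright {{year}}, all rights reserved.",
]

processed = replace_placeholders(source_lines)
print("\n".join(processed))
```

Inside an actual preprocessor, logic like this would live in the run method, which takes the lines and returns the modified list.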

A treeprocessor does its work as a post-processing step, meaning that the manipulation of the content happens after the text has already been parsed into an HTML structure. Its run method receives the document as an ElementTree object, which makes this processor type very powerful, as you can freely traverse the HTML tree and manipulate elements as you wish.
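To give a feel for what working with the tree looks like, here is a small sketch that uses only xml.etree.ElementTree from the standard library. The document fragment and the external-link class name are assumptions made up for the example. In a real treeprocessor the tree would be handed to you by the library instead of being parsed from a string.

```python
import xml.etree.ElementTree as ET

# A hand-written stand-in for the tree a treeprocessor would receive.
root = ET.fromstring(
    '<div><p>See <a href="http://example.com">this page</a>.</p>'
    '<p>No links here.</p></div>'
)

# iter() walks the whole tree, so nested elements are found as well.
for link in root.iter("a"):
    link.set("class", "external-link")

print(ET.tostring(root, encoding="unicode"))
```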

But instead of leaving how this works in real life to your imagination, let's see an example with a problem and a solution.

Extending Python-Markdown: an example.

The PrismJS syntax highlighter is a pretty powerful JavaScript library for presenting and syntax highlighting code on your page. It supports a myriad of languages, and it is the syntax highlighter of choice on my blog as well as on the web pages of renowned people and organizations in the developer community.

Using it is very simple. You just have to drop the PrismJS script file and CSS into your page, and from that point on, preformatted code blocks will be syntax highlighted beautifully. OK, there is one more thing. PrismJS expects the following markup:

<pre>
    <code class="language-python">Your code goes here.</code>
</pre>

And now the problem. Let's say you have a blog where you want to display code snippets and you write your posts in Markdown. Your code blocks will then look like this:

<pre>
    <code>Your code goes here.</code>
</pre>

As you can see, PrismJS expects a language tag as the class of the code element. How can we do this with Markdown?

Adding that class to the <code> element.

There are two main ways you can solve this problem. The first is to go around it by writing the code block explicitly in HTML markup. This can be a good solution if you need a quick workaround for an edge case. However, in our case we expect to add code snippets to multiple blog posts, potentially several times in a single post, so we need a better solution.

The second, which is what we will follow now, is to create a Treeprocessor extension, so that this requirement is built into the conversion whenever you need it.

The concept

The first thing we need before we start implementing our Treeprocessor is a concept. First of all, we have to pass the language information for the syntax highlighter in the text somehow.

My method of choice is to add the PrismJS language class (language- followed by the language name) as the first line of each code block we write. It will look like this:

language-python
def my_func():
    print "Hello!"

Writing the Treeprocessor.

So let's see a short briefing of what we will do now to make this work.

  • We have to extend the Treeprocessor class.
  • We will get the HTML document as an ElementTree object.
  • The setClass method will do the work.
  • We iterate through the whole tree recursively.
  • When we find a code block, we fetch the language identifier tag and add it to the element as a class.

Here is our CodeBlockTweaker tree processor at a glance, just to review the structure:

from markdown.treeprocessors import Treeprocessor

class CodeBlockTweaker(Treeprocessor):
    def __init__(self, md):
        super(CodeBlockTweaker, self).__init__(md)
    def run(self, root):
        return self.setClass(root)
    def setClass(self, element):
        # (No) magic happens here!
        return element

As mentioned earlier, your class should be a child of the Treeprocessor class. The part of particular interest here is the run method, which is called automatically when the processor is invoked. Its first argument (apart from self), named root in the code snippet, is passed automatically at runtime by the Markdown core. It is an ElementTree object representing our HTML document. We can manipulate this object and, after we have made the desired changes, return the modified object from the run method. In the example above, the real work is delegated to our setClass method, where the magic will happen. Or actually, it won't be magic, as it is very simple.

This is what the setClass method looks like in our case when we take a closer look:

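import re  # at the top of the module

def setClass(self, element):
    for child in element:
        # a code element may be empty, so guard against missing text
        if child.tag == "code" and child.text:
            content = child.text
            langClass = re.match(r'^ *(language-\w+)', content)
            if langClass is not None:
                # drop the whole marker line from the displayed code
                strippedContent = re.sub(r'^ *language-.*\n',
                                         '', content)
                # group(1) leaves out any leading spaces
                child.set("class", langClass.group(1))
                child.text = strippedContent
        # run recursively on children
        self.setClass(child)
    return element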

What happens here is that we iterate through the elements recursively. If we encounter a code tag, we look at the beginning of the inner text of the element and try to match the expected language tag. If we don't find it, we don't do anything.

If we find it, we strip the language tag from the contained text itself, as we don't want to see it displayed in the code snippet. After this, we set the same language tag as the class of the code element. When the recursion is over, we return the modified element tree.
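The same logic can be tried out without running a full conversion. The sketch below is a standalone variant of setClass, written as a plain function named set_class and using a regex group to capture the marker, applied to a hand-built tree. The sample markup is an assumption standing in for what the Markdown core would produce.

```python
import re
import xml.etree.ElementTree as ET

# Matches the marker line; group(1) is the class name without padding.
LANG_TAG = re.compile(r'^ *(language-\w+) *\n')


def set_class(element):
    """Move a leading language-xxx line into the class of code elements."""
    for child in element:
        if child.tag == "code" and child.text:
            match = LANG_TAG.match(child.text)
            if match:
                child.set("class", match.group(1))
                # keep everything after the marker line
                child.text = child.text[match.end():]
        # descend into nested elements
        set_class(child)
    return element


root = ET.fromstring(
    '<div><pre><code>language-python\nprint("Hello!")\n</code></pre>'
    '<pre><code>no marker here\n</code></pre></div>'
)
set_class(root)

blocks = root.findall("pre/code")
print(blocks[0].get("class"), repr(blocks[0].text))
```

Code blocks without a marker line are left untouched, so posts that don't use the convention keep rendering as before.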

Attaching the Treeprocessor we built to the processing chain.

We are done with the gist of the work. We just have to tell Python-Markdown to engage our processor when the conversion happens. For that, we have to do a little plumbing:

from markdown.extensions import Extension

class CodeBlockExtension(Extension):
    def extendMarkdown(self, md):
        # Python-Markdown 3.x registry API: register(item, name, priority).
        # A low priority runs our processor late in the chain; older 2.x
        # releases used md.treeprocessors.add(name, item, '_end') instead.
        md.treeprocessors.register(CodeBlockTweaker(md), 'codeblocktweaker', 0)

This is our extension. It inherits from Extension and has a special method called extendMarkdown, which automatically receives the Markdown object md as its argument. In this method, we attach our CodeBlockTweaker processor to the processing chain of Markdown. If we had multiple processors that helped us achieve our task, we could attach more of them here.

Using what we built.

Now we can start to use our extension when converting Markdown text. We can engage our extension when using the library like this:

from markdown_extensions import CodeBlockExtension
from markdown import markdown

html = markdown(text=markdown_plain, extensions=[CodeBlockExtension()])

When we invoke Markdown, all we have to do is pass an instance of our extension class (in our case CodeBlockExtension) in the list given as the extensions parameter. That's it. Congratulations, you have extended Python-Markdown!

Summary

So let's sum up what was discussed. In general, Markdown is two things:

  • A plain text format for authoring prose.
  • A conversion tool, to convert your Markdown text into HTML or other document formats.

Markdown is great because it is easy to read and write, but the emphasis is on good readability as plain text. This makes it great for code documentation purposes as well.

The original implementation of the conversion tool was made in Perl, but currently there is a Markdown tool available in pretty much any popular programming language. Markdown syntax is relatively stable, and the different tool implementations handle conversion consistently, although there might be subtle differences. The same is true of Python-Markdown.

Python-Markdown is good in itself, but in a real life project it is very likely you will want to alter or extend it in some way. The Extension API gives you a clear workflow to achieve that. You use preprocessors or tree processors to manipulate the conversion process. In a tree processor, you can work with the HTML document as an ElementTree object, which makes it easy to make changes to specific elements.

Thanks for reading this. I hope it was a useful intro to the Python-Markdown library. Now go and try it yourself!
