Extending cmark-gfm pt.1

Markdown is a lightweight markup language used to create formatted text from plain text. For example, a Markdown renderer would render **bold text** written in a plain text editor as bold text. CommonMark defines the standard set of syntax elements supported by modern Markdown renderers. Because its formatting options are a bit limited, some sites, like Reddit, have added extensions to this set, creating so-called flavors of Markdown. Let’s examine how to extend cmark-gfm, a GitHub Flavored Markdown (GFM) parser, to support Reddit Flavored Markdown (RFM).

Run. Here be spoilers

There are times we’d like to write about something that might ruin a surprise for a section of our readers. Spoiler tags are used to avoid this, and are generally rendered as hidden content that is revealed on a tap or click. On Reddit, a spoiler is marked up by enclosing text between >! and !<.
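To make the delimiter rule concrete, here is a minimal sketch in plain C that locates the first >!...!< span in a string. The helper find_spoiler is hypothetical and is not part of cmark-gfm; it ignores nesting, escapes, and everything else a real parser has to handle.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of cmark-gfm: locates the first
 * >!...!< span in `text`. On success it stores the offset and length
 * of the enclosed content and returns 1; otherwise it returns 0. */
static int find_spoiler(const char *text, size_t *start, size_t *len) {
  const char *open = strstr(text, ">!");
  if (open == NULL)
    return 0; /* no opening tag */

  const char *content = open + 2; /* skip past ">!" */
  const char *close = strstr(content, "!<");
  if (close == NULL)
    return 0; /* opening tag without a closing tag */

  *start = (size_t)(content - text);
  *len = (size_t)(close - content);
  return 1;
}
```

Called on "He dies >!at the end!< sadly", this reports the content span covering "at the end"; a string with no closing !< is treated as having no spoiler at all.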

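Let’s take a step back here, and briefly talk about the kinds of elements found in Markdown. While parsing Markdown, the parser generally breaks it up into two kinds of nodes: block nodes (paragraphs, headings, lists and the like) and inline nodes (emphasis, links, code spans and so on).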

Spoilers are a type of inline element. Our parser first extracts the block elements from the string of text it is parsing. It then checks each block to see if there are any inline elements. Inline nodes are created for any that are found and attached to the parent block node. To parse out our spoilers, we need to find a way to alter the process of identifying inline nodes.
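The two passes described above can be sketched with a toy model. Everything here is an assumption made for illustration: block_node, toy_parse_blocks and toy_parse_inlines are hypothetical names, blocks are just paragraphs split at blank lines, and the only inline element recognized is a >!...!< span. cmark-gfm’s real node tree is considerably richer.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_BLOCKS = 8 };

/* Hypothetical, simplified block node. */
typedef struct {
  const char *text; /* start of the block inside the input */
  size_t len;       /* length of the block in bytes */
  int inline_count; /* inline nodes attached during the second pass */
} block_node;

/* Pass 1: split the input into paragraph blocks at blank lines. */
static int toy_parse_blocks(const char *input, block_node *blocks, int max) {
  int n = 0;
  const char *p = input;
  while (*p != '\0' && n < max) {
    const char *end = strstr(p, "\n\n");
    size_t len = end ? (size_t)(end - p) : strlen(p);
    if (len > 0) {
      blocks[n].text = p;
      blocks[n].len = len;
      blocks[n].inline_count = 0;
      n++;
    }
    p += len;
    while (*p == '\n') /* skip the blank line(s) between blocks */
      p++;
  }
  return n;
}

/* Pass 2: scan one block and count complete >!...!< spans. */
static void toy_parse_inlines(block_node *b) {
  size_t i = 0;
  while (i + 1 < b->len) {
    if (b->text[i] == '>' && b->text[i + 1] == '!') {
      size_t j = i + 2;
      while (j + 1 < b->len && !(b->text[j] == '!' && b->text[j + 1] == '<'))
        j++;
      if (j + 1 >= b->len)
        break; /* opener without a closer: leave the rest as text */
      b->inline_count++;
      i = j + 2; /* continue after the closing tag */
      continue;
    }
    i++;
  }
}
```

The point of the split is that the inline pass only ever sees the text of one block at a time, which is why a spoiler extension hooks into inline parsing rather than block parsing.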

Code Customization

While we look through cmark-gfm’s source code, let’s keep its API documentation handy. Looking at our src directory (after cloning the cmark-gfm repository), we find

  1. src/inlines.c, which looks like a good place to find the code that processes inline elements.
  2. An extensions folder (which seems to contain the extensions for GFM).

It looks like cmark provides an API for writing extensions that parse custom elements. Let’s kick-start our analysis of the code by looking at how struck-out text is parsed by extensions/strikethrough.{h, c}. The API seems to allow us to

  1. Create and register the extension with the parser - extensions/strikethrough.c: create_strikethrough_extension. This extension needs to be registered when the cmark parser is initialized. GFM provides a convenient function, extensions/core_extensions.c: core_extensions_registration, to create and register any extension we would like enabled for the current run of the parser.
  2. Provide a custom function, extensions/strikethrough.c: match, to find matches to our struck out text. In this case, relevant text is delimited by pairs of ~~s.
  3. Indicate whether this element can contain instances of a particular child element via extensions/strikethrough.c: can_contain.
  4. Provide renderers for common targets like HTML (extensions/strikethrough.c: html_render), LaTeX (extensions/strikethrough.c: latex_render) etc.
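The shape of that API can be sketched as a struct of callbacks plus a registry. This is a self-contained toy, not cmark-gfm’s actual types or signatures: toy_extension, toy_register and toy_create_strikethrough are hypothetical names, and the real match function works on the parser’s internal state rather than on a bare string.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an extension: a name plus callbacks. */
typedef struct toy_extension {
  const char *name;
  /* Returns the number of bytes matched at `text`, or 0 for no match. */
  size_t (*match)(const char *text);
  /* Returns 1 if the element may contain a child of the given kind. */
  int (*can_contain)(const char *child_kind);
  /* Writes the HTML rendering of `content` into `out`. */
  void (*html_render)(const char *content, size_t len, char *out, size_t out_size);
} toy_extension;

static size_t strike_match(const char *text) {
  if (strncmp(text, "~~", 2) != 0)
    return 0;
  const char *close = strstr(text + 2, "~~");
  if (close == NULL)
    return 0;
  return (size_t)(close - text) + 2; /* include both delimiters */
}

static int strike_can_contain(const char *child_kind) {
  return strcmp(child_kind, "inline") == 0; /* inline children only */
}

static void strike_html_render(const char *content, size_t len, char *out,
                               size_t out_size) {
  snprintf(out, out_size, "<del>%.*s</del>", (int)len, content);
}

/* Step 1a: create the extension by filling in its callbacks. */
static const toy_extension *toy_create_strikethrough(void) {
  static const toy_extension ext = {"strikethrough", strike_match,
                                    strike_can_contain, strike_html_render};
  return &ext;
}

/* Step 1b: register it so the (toy) parser can find it later. */
enum { MAX_EXTENSIONS = 4 };
static const toy_extension *registry[MAX_EXTENSIONS];
static int registry_count = 0;

static int toy_register(const toy_extension *ext) {
  if (registry_count == MAX_EXTENSIONS)
    return 0;
  registry[registry_count++] = ext;
  return 1;
}
```

A spoiler extension would follow the same outline, swapping the ~~ pair for >! and !< and emitting whatever markup the renderer should use for hidden content.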

This is a good start. We now know the basics of creating and registering an extension with cmark. So, let’s create a new folder reddit_extensions inside the src folder and files named spoilers.{h,c} inside it. Ideally, we would follow cmark-gfm’s example, and keep this folder outside src - thereby leaving it unpolluted to pull in upstream changes. But, while working through this, I found I had to change some upstream source, and keeping this folder inside src made life easier. Unpollution would be a project for another day I guess.

Entry point

A couple of key questions call out to us at this point. Where in the parsing process is our extension called? When does the parser call our extension? Answering these would make our task of writing our extension a bit easier.

  1. The Where?

    ripgrep makes searching for text in a bunch of files easy - you could use plain grep or any other search utility. Running rg extension src outputs a long list of lines that contain the word extension, along with the file name and location of each occurrence. Going through this list, we find a target of interest - src/inlines.c: 1395 try_new_extensions. Open src/inlines.c in your favorite editor and navigate to line 1395. This function call is contained in an aptly named function parse_inline, and is part of a switch statement that seems to power the matching of default inline elements. The switch seems to be checking various special characters to detect the different Markdown elements. By default, if it’s not able to detect any Markdown, it tries out all the registered extensions. This presents the primary issue that necessitated changing the upstream source.

    The issue we are facing is that once cmark detects a possible Markdown element, it doesn’t try out any of the registered extensions. In our case, our closing tag contains a ! - which also occurs in some HTML blocks. Once it sees this character, it checks to see if it’s a part of an HTML block. If not, it just attaches a text node with ! to the node tree. Since our extension is not called, our spoiler’s closing tag !< isn’t detected.

  2. The When?

    The next answer we seek is to identify the point in time at which this inline element detection is triggered. Taking a look at function parse_inlines, we see cmark processes each character in its input stream one at a time. It first checks to see if it matches any of the default elements it can identify. If not, it tries each registered extension for a match. So, it looks like we need to get our extension to register the following special characters - > to detect the opening of a spoiler, and ! to detect the closing of a spoiler.
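The dispatch order described in the two answers above can be modeled in a few lines. This is a toy sketch, not the code in src/inlines.c: toy_dispatch and the toy_handler enum are hypothetical, and the list of built-in special characters is a simplified assumption. It only shows why a character claimed by the built-in switch never reaches an extension.

```c
#include <stddef.h>
#include <string.h>

typedef enum { TOY_TEXT, TOY_BUILTIN, TOY_EXTENSION } toy_handler;

/* Decides who gets to handle character `c`.
 * `ext_special_chars` lists the characters registered by extensions. */
static toy_handler toy_dispatch(char c, const char *ext_special_chars) {
  switch (c) {
  /* Assumed, simplified set of characters the built-in switch claims. */
  case '!':
  case '*':
  case '_':
  case '`':
  case '[':
  case ']':
  case '<':
  case '&':
  case '\\':
    return TOY_BUILTIN; /* extensions are never consulted here */
  default:
    /* Only unclaimed characters fall through to the extensions. */
    if (c != '\0' && strchr(ext_special_chars, c) != NULL)
      return TOY_EXTENSION;
    return TOY_TEXT;
  }
}
```

With ">!" registered as the extension’s special characters, > reaches the extension but ! is swallowed by the built-in case, which matches the behavior that made the closing tag !< undetectable.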

We now have sufficient information to begin our coding - in the next part of this short series.