Extending cmark-gfm pt.2

In the previous installment of this series, we took a look at the high-level workings of the cmark parser. In this post, we will extend the cmark-gfm source code to enable parsing of spoiler text.

Let’s start with a basic algorithmic overview of the code we are going to write. While we examined extensions/strikethrough.c to understand how extensions work, we are now going to adapt the parsing of a different element (brackets). The relevant pieces of code can be found at src/inlines.c: push_bracket and src/inlines.c: handle_close_bracket. The basic steps we are going to follow are:

  1. Modify src/inlines.c: parse_inline to invoke our extension every time it encounters a > or !.
  2. If we see a >, it could be the beginning of a spoiler tag. If the next character is a !, push an open spoiler on to a stack of potential spoilers.
  3. A ! could be the beginning of a closing spoiler tag. If we find a < as our next character, we check to see if there is an open spoiler in our stack of potential spoilers. If there is one, pop it and create a spoiler node. Else, treat the !< as plain text.
  4. Check the stack of potential spoilers at the end of our block processing. Pop any that remain, and treat them as plain text.
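
Before touching the parser, the four steps above can be tried out on a plain string. The following is only a toy sketch: find_spoilers, spoiler_span and MAX_OPEN are hypothetical names made up for illustration, and it works on characters rather than on cmark nodes. It does show the stack discipline we are about to build.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPEN 64 /* arbitrary cap on nesting for this toy example */

typedef struct {
  size_t start; /* index of the '>' in the opening >! */
  size_t end;   /* index one past the '<' in the closing !< */
} spoiler_span;

/* Scans text and records every matched >! ... !< pair in out.
 * Returns the number of spans written (at most max_out). */
static int find_spoilers(const char *text, spoiler_span *out, int max_out) {
  size_t open_stack[MAX_OPEN]; /* positions of openers not yet closed */
  int depth = 0;
  int found = 0;
  size_t i = 0;
  size_t len;

  assert(text != NULL && out != NULL);
  len = strlen(text);

  while (i < len) {
    int has_next = i + 1 < len;

    if (text[i] == '>' && has_next && text[i + 1] == '!') {
      /* Step 2: a potential opener, push it on the stack. */
      if (depth < MAX_OPEN)
        open_stack[depth++] = i;
      i += 2;
    } else if (text[i] == '!' && has_next && text[i + 1] == '<' && depth > 0) {
      /* Step 3: a closer with a pending opener, pop and record the pair. */
      if (found < max_out) {
        out[found].start = open_stack[depth - 1];
        out[found].end = i + 2;
        found++;
      }
      depth--;
      i += 2;
    } else {
      /* Anything else, including a closer with no opener, is plain text. */
      i++;
    }
  }

  /* Step 4: openers still on the stack are never recorded, so they
   * simply remain plain text. */
  return found;
}
```

Nested spoilers fall out naturally: the innermost pair is recorded first because its opener sits on top of the stack.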

Now that we have the basic algorithm out of the way, let’s get started with our code. Unfortunately, this is going to involve some changes to our upstream code, making future merges with upstream possibly troublesome. Oh, well. We really do want spoilers.

First off, we need to move some code around in src/inlines.c: parse_inline. We talked about this briefly in the first part of this series. parse_inline(...) has a big switch block that tests against all the special characters it knows of. If it finds a match, it processes the current match, and then looks for the next special character. It invokes our extensions only in the default case of the switch, i.e. if it hasn’t matched against anything earlier.

Currently, cmark-gfm treats ! as a special character in the switch. This presents us with a design choice: we can either modify the code where it matches against a !, or move the extension matching outside the switch. I decided to do the latter, as it seemed cleaner considering possible future extensions. So, I moved the following call out of the default case, so that it runs ahead of the switch. Keep in mind this changes the parser's behavior: extensions now get the first chance at every special character, before the built-in handlers.

new_inl = try_extensions(parser, parent, c, subj);

Next, let’s create a way to track our open spoiler tags. The following struct in src/inlines.h works well.

typedef struct spoiler_tag {
    struct spoiler_tag* previous;
    cmark_node* inl_text;
    bufsize_t position;
} spoiler_tag;

This maintains a simple linked list with a reference to the previous tag. It also tracks a cmark text node, which holds the opening tag text >!. The parser will fall back to this text node in the event that it doesn’t find a matching !< to close the spoiler. Finally, we track the position of the opening tag in the text.

Next, we need a good place to track this list in the context of our parser. Brackets are tracked inside a subject struct, which is currently internal to src/inlines.c. The code that processes spoilers is going to be housed in reddit_extensions/spoiler.c, and it needs to reference the subject struct as well. To ease access, let’s move the subject struct to src/inlines.h. Finally, we add the declaration struct spoiler_tag *last_spoiler; to the struct definition. The struct now reads as follows.

typedef struct subject {
  cmark_mem *mem;
  cmark_chunk input;
  int line;
  bufsize_t pos;
  int block_offset;
  int column_offset;
  cmark_map *refmap;
  delimiter *last_delim;
  bracket *last_bracket;
  struct spoiler_tag *last_spoiler;
  bufsize_t backticks[MAXBACKTICKS + 1];
  bool scanned_for_backticks;
} subject;

We are pretty much at the point where we can start detecting and parsing the spoiler tags. The following piece of code inside reddit_extensions/spoiler.c will form the basis of our parsing.

static cmark_node *match(cmark_syntax_extension *extension, 
        cmark_parser *parser,
        cmark_node *parent,
        unsigned char input,
        cmark_inline_parser *subj) {
  cmark_node *res = NULL;

  switch (input) {
    case '>':
      if (peek_char_n(subj, 1) == '!') {
        advance(subj);
        advance(subj);
        // Create a text node holding `>!`. We have already advanced past both
        // characters, so they sit at pos - 2 and pos - 1. Then push an open spoiler.
        res = make_str(subj, subj->pos - 2, subj->pos - 1, cmark_chunk_literal(">!"));
        push_spoiler(subj, res);
      } 
      break;
    case '!':
      if (peek_char_n(subj, 1) == '<') {
        res = handle_close_spoiler(extension, subj);
      }

      break;
    default:
      break;
  }

  return res;
}

The code simply does the following:

  1. On encountering an open tag >!, it advances the parser by 2 positions. It creates a text node that contains >!. It then pushes a spoiler_tag on to the stack of potential spoilers on subject.
  2. When it finds a closing tag !<, it runs some checks and closes the spoiler if there is a matching opener.

The two bits of missing information here are the contents of the functions push_spoiler and handle_close_spoiler. Both are relatively simple, and can be placed inside src/inlines.c. We keep them there to avoid duplicating some convenience functions found within.

void push_spoiler(subject* subj, cmark_node* inl_text) {
    spoiler_tag *st = (spoiler_tag *) subj->mem->calloc(1, sizeof(spoiler_tag));
    st->inl_text = inl_text;
    st->previous = subj->last_spoiler;
    st->position = subj->pos - 2;
    subj->last_spoiler = st;
}

cmark_node *handle_close_spoiler(cmark_syntax_extension *extension, struct subject *subj) {
  cmark_node *res = NULL;
  spoiler_tag* opener = subj->last_spoiler;
  if (opener != NULL) {
    cmark_node *tmp, *next;
    res = opener->inl_text;
    cmark_node_set_type(res, CMARK_NODE_REDDIT_SPOILER);
    cmark_node_set_syntax_extension(res, extension);

    tmp = cmark_node_next(res);

    while (tmp) {
      next = cmark_node_next(tmp);
      cmark_node_append_child(res, tmp);
      tmp = next;
    }

    advance(subj);
    advance(subj);

    // Unlink the opener from the stack and release it
    subj->last_spoiler = opener->previous;
    subj->mem->free(opener);
  }

  return res;
}

push_spoiler simply creates a spoiler_tag and maintains the linked list of spoiler tags on subject. handle_close_spoiler is a bit more involved. It checks to see if we have a matching opening tag. If we do, it converts the text node that we have in the opening tag into a node of type CMARK_NODE_REDDIT_SPOILER. It then moves every node that follows the opener into the new spoiler node as a child. Why do we need to do this? Everything parsed after the >! was appended to the parent as a sibling of the opening text node, so those nodes have to be moved under the spoiler node to become its contents. This is also what the parser does when it closes a bracket, and it works here too.
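
To make the sibling-to-child move in handle_close_spoiler easier to picture, here is a small standalone sketch of the same loop. toy_node, unlink_node, append_child and adopt_following_siblings are hypothetical stand-ins written for this post, not the cmark-gfm node API, although append_child is meant to behave like cmark_node_append_child in that it detaches the child from its old parent first.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a tree node with sibling and child links. */
typedef struct toy_node {
  struct toy_node *parent, *first_child, *last_child, *prev, *next;
} toy_node;

/* Detaches n from its parent and siblings, fixing up their links. */
static void unlink_node(toy_node *n) {
  if (n->prev)
    n->prev->next = n->next;
  else if (n->parent)
    n->parent->first_child = n->next;

  if (n->next)
    n->next->prev = n->prev;
  else if (n->parent)
    n->parent->last_child = n->prev;

  n->parent = n->prev = n->next = NULL;
}

/* Moves child to the end of parent's child list. */
static void append_child(toy_node *parent, toy_node *child) {
  unlink_node(child);
  child->parent = parent;
  child->prev = parent->last_child;
  if (parent->last_child)
    parent->last_child->next = child;
  else
    parent->first_child = child;
  parent->last_child = child;
}

/* Turns every sibling after node into a child of node.
 * Returns how many nodes were moved. */
static int adopt_following_siblings(toy_node *node) {
  int moved = 0;
  toy_node *tmp;

  assert(node != NULL);
  tmp = node->next;
  while (tmp) {
    /* Save the next sibling first: append_child rewrites tmp's links. */
    toy_node *next = tmp->next;
    append_child(node, tmp);
    tmp = next;
    moved++;
  }
  return moved;
}
```

Saving next before the move is the important detail. Once a node has been re-parented, its next pointer no longer leads to the old sibling chain.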

The astute reader will notice that we have not declared CMARK_NODE_REDDIT_SPOILER yet. The different cmark nodes are distinguished by a property of type cmark_node_type found on cmark_node. cmark_node_type is an enum declared in src/cmark-gfm.h. Ordinarily, it isn’t possible to add values to an enum at runtime, but we are going to take a leaf out of extensions/strikethrough.{h, c} and effectively do just that. Let’s define a new variable cmark_node_type CMARK_NODE_REDDIT_SPOILER in reddit_extensions/spoiler.c. Wherever else we need to reference it, we will declare extern cmark_node_type CMARK_NODE_REDDIT_SPOILER. We will assign it an appropriate value when we create the syntax extension.
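
The idea of a node type that only gets its value at runtime can be sketched in isolation. Everything below is a simplified toy: toy_node_type, toy_add_inline_node and the other names are hypothetical, and the real cmark_syntax_extension_add_node does more bookkeeping than a bare counter.

```c
#include <assert.h>

/* Toy stand-in for cmark_node_type. */
typedef int toy_node_type;

/* Built-in types known at compile time. */
enum { TOY_NODE_TEXT = 1, TOY_NODE_EMPH = 2 };

/* Highest inline type value handed out so far. */
static toy_node_type toy_last_inline = TOY_NODE_EMPH;

/* Defined in exactly one .c file. Any other file that needs it would
 * declare: extern toy_node_type TOY_NODE_SPOILER; */
toy_node_type TOY_NODE_SPOILER = 0;

/* Hands out the next unused value. In this sketch it plays the role
 * that cmark_syntax_extension_add_node plays in cmark-gfm. */
static toy_node_type toy_add_inline_node(void) {
  toy_last_inline += 1;
  return toy_last_inline;
}

/* Called once, while the extension is being created. */
static void toy_create_spoiler_extension(void) {
  assert(TOY_NODE_SPOILER == 0); /* guard against registering twice */
  TOY_NODE_SPOILER = toy_add_inline_node();
}
```

Because the value is assigned during extension creation, code must not compare against the variable before the extension has been created.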

cmark_syntax_extension *create_reddit_spoiler_extension(void) {
  cmark_syntax_extension *ext = cmark_syntax_extension_new("reddit_spoiler");

  cmark_syntax_extension_set_get_type_string_func(ext, get_type_string);
  cmark_syntax_extension_set_can_contain_func(ext, can_contain);
  cmark_syntax_extension_set_html_render_func(ext, html_render);
  // Reserve a new inline node type (1 = inline) for spoilers
  CMARK_NODE_REDDIT_SPOILER = cmark_syntax_extension_add_node(1);

  cmark_syntax_extension_set_match_inline_func(ext, match);
  // Make the inline parser stop at these characters so that match() is called
  cmark_inlines_add_special_character('>', false);
  cmark_inlines_add_special_character('!', false);

  return ext;
}

With this function, we set some basic properties of the new spoiler extension. We initialize the value of CMARK_NODE_REDDIT_SPOILER with the helper function cmark_syntax_extension_add_node, declared in cmark-gfm-extension_api.h. We also set function pointers to the match and html_render functions, along with get_type_string and can_contain. Finally, we register > and ! as the special characters used to identify the spoiler tag.

Almost there. All that remains are a few housekeeping items. We need to handle the case where we have open spoiler tags without matching close spoiler tags. We also need to distinguish between opening a quote in markdown (block starts with a >) from the start of our spoiler tag.

The former is easily handled by just popping all the open spoiler tags from the stack. Remember, while parsing, we had already attached a text node containing >! to the node tree as a default. This is similar to how brackets are currently handled by the parser. Just append the following to src/inlines.c: cmark_parse_inlines.

// free spoiler stack
while (subj.last_spoiler) {
  pop_spoiler(&subj);
}
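
pop_spoiler is called here but its body is not listed in this post. A minimal sketch, modeled on how pop_bracket treats the bracket stack in src/inlines.c, might look like the following. toy_mem and toy_subject are cut-down stand-ins for cmark_mem and subject so that the example compiles on its own, and counting_calloc, counting_free and drain_spoilers are hypothetical helpers added only to show that nothing leaks.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for cmark_mem, reduced to the two callbacks used here. */
typedef struct toy_mem {
  void *(*calloc)(size_t, size_t);
  void (*free)(void *);
} toy_mem;

typedef struct spoiler_tag {
  struct spoiler_tag *previous;
  void *inl_text; /* cmark_node* in the real code */
  int position;   /* bufsize_t in the real code */
} spoiler_tag;

/* Stand-in for subject, reduced to the fields used here. */
typedef struct toy_subject {
  toy_mem *mem;
  spoiler_tag *last_spoiler;
  int pos;
} toy_subject;

/* Allocator wrappers that count live allocations. */
static int live_allocs = 0;
static void *counting_calloc(size_t n, size_t size) {
  live_allocs++;
  return calloc(n, size);
}
static void counting_free(void *p) {
  live_allocs--;
  free(p);
}

static void push_spoiler(toy_subject *subj, void *inl_text) {
  spoiler_tag *st = (spoiler_tag *)subj->mem->calloc(1, sizeof(spoiler_tag));
  assert(st != NULL);
  st->inl_text = inl_text;
  st->previous = subj->last_spoiler;
  st->position = subj->pos - 2; /* the opener starts two characters back */
  subj->last_spoiler = st;
}

/* Removes the top tag from the stack and frees it. The text node it
 * points at is left alone, since it already lives in the node tree. */
static void pop_spoiler(toy_subject *subj) {
  spoiler_tag *top = subj->last_spoiler;
  if (top == NULL)
    return;
  subj->last_spoiler = top->previous;
  subj->mem->free(top);
}

/* Pops everything that is left; returns how many tags were dropped. */
static int drain_spoilers(toy_subject *subj) {
  int popped = 0;
  while (subj->last_spoiler) {
    pop_spoiler(subj);
    popped++;
  }
  return popped;
}
```

Only the bookkeeping struct is freed. The >! text node stays in the tree, which is what lets an unmatched opener show up as plain text.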

To handle the latter, we peek at the next char when we encounter a > at the beginning of a text block. If it’s a !, we skip starting a markdown quote node. The following condition in src/blocks.c: open_new_blocks

if (!indented && peek_at(input, parser->first_nonspace) == '>') {

becomes

if (!indented && peek_at(input, parser->first_nonspace) == '>' && input->len > 1 && peek_at(input, parser->first_nonspace + 1) != '!') {
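
The same decision can be written as a small standalone predicate, which makes the edge cases easy to test. opens_block_quote is a hypothetical helper, not a cmark-gfm function, and it leaves out the indentation check. Note one assumption that differs from the condition above: the bounds check here is made relative to first_nonspace, so a line consisting of a lone > still opens a quote.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the line should start a block quote, i.e. its first
 * non-space character is '>' and it is not the start of a >! spoiler. */
static bool opens_block_quote(const char *line, size_t len,
                              size_t first_nonspace) {
  assert(line != NULL);

  if (first_nonspace >= len || line[first_nonspace] != '>')
    return false;

  /* >! is reserved for spoilers, so leave it to the inline parser. */
  if (first_nonspace + 1 < len && line[first_nonspace + 1] == '!')
    return false;

  return true;
}
```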

That pretty much sums up our code for handling the spoiler tag. We just need to initialize the extension and attach it to our parser. Again, following the handling of strikethrough, we can create a new function to handle the registration of all the new extensions we write. Place these in src/reddit_extensions/reddit_extensions.c.

static int reddit_extensions_registration(cmark_plugin* plugin) {
    cmark_plugin_register_syntax_extension(plugin, create_reddit_spoiler_extension());

    return 1;
}

CMARK_GFM_EXPORT
void cmark_gfm_reddit_extensions_ensure_registered(void) {
    static int registered = 0;

    if (!registered) {
        cmark_register_plugin(reddit_extensions_registration);
        registered = 1;
    }
}

And now, on to testing!