Extending cmark-gfm pt.2
In the previous installment of this series, we took a look at the high-level workings of the cmark parser. In this post, we will extend the cmark-gfm source code to enable parsing of spoiler text.
Let’s start with a basic algorithmic overview of the code we are going to write. While we examined extensions/strikethrough.c to understand how extensions worked, we are now going to adapt from the parsing of a different element (brackets). The relevant pieces of code can be found at src/inlines.c: push_bracket and src/inlines.c: handle_close_bracket. The basic steps we are going to follow are:
- Modify src/inlines.c: parse_inline to invoke our extension every time it encounters a > or !.
- If we see a >, it could be the beginning of a spoiler tag. If the next character is a !, push an open spoiler on to a stack of potential spoilers.
- A ! could be the beginning of an end spoiler tag. If we find a < as our next character, we check to see if there is an open spoiler in our stack of potential spoilers. If there is one, pop it and create a spoiler node. Else, treat the !< as plain text.
- Check the stack of potential spoilers at the end of our block processing. Pop any that remain, and treat them as plain text.
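The steps above can be tried out on a flat string before touching any cmark-gfm code. The sketch below is only an illustration: render_spoilers, the [spoiler] markers and the fixed buffer sizes are made up for this example, and it pairs tags in a first pass and emits output in a second, whereas the real parser builds nodes as it goes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_LEN = 256, MAX_OPEN = 32 };
enum mark { PLAIN = 0, OPEN_TAG, CLOSE_TAG };

/* Appends src to dst at *len; drops src if it would not fit in cap. */
static void append(char *dst, size_t cap, size_t *len, const char *src) {
  size_t n = strlen(src);
  if (*len + n < cap) {
    memcpy(dst + *len, src, n);
    *len += n;
  }
  dst[*len] = '\0';
}

/* Hypothetical helper: copies `in` to `out`, wrapping matched >!...!<
 * pairs in [spoiler]...[/spoiler]. Unmatched tags stay as plain text.
 * Returns the number of matched pairs. */
int render_spoilers(const char *in, char *out, size_t cap) {
  size_t n = strlen(in);
  enum mark marks[MAX_LEN] = {PLAIN};
  size_t stack[MAX_OPEN]; /* positions of potential openers */
  size_t depth = 0;
  int matched = 0;

  assert(n < MAX_LEN && cap > 0);

  /* Pass 1: pair up tags using the stack of potential spoilers. */
  for (size_t i = 0; i + 1 < n;) {
    if (in[i] == '>' && in[i + 1] == '!' && depth < MAX_OPEN) {
      stack[depth++] = i;
      i += 2;
    } else if (in[i] == '!' && in[i + 1] == '<') {
      if (depth > 0) {
        marks[stack[--depth]] = OPEN_TAG;
        marks[i] = CLOSE_TAG;
        matched++;
      }
      i += 2; /* an unmatched !< is left unmarked, i.e. plain text */
    } else {
      i++;
    }
  }
  /* Openers still on the stack were never marked, so they stay text. */

  /* Pass 2: emit the output according to the marks. */
  size_t len = 0;
  out[0] = '\0';
  for (size_t i = 0; i < n;) {
    if (marks[i] == OPEN_TAG) {
      append(out, cap, &len, "[spoiler]");
      i += 2;
    } else if (marks[i] == CLOSE_TAG) {
      append(out, cap, &len, "[/spoiler]");
      i += 2;
    } else {
      char c[2] = {in[i], '\0'};
      append(out, cap, &len, c);
      i++;
    }
  }
  return matched;
}
```

Calling render_spoilers on a line with an opener but no closer returns 0 and copies the input through unchanged, which is the plain-text fallback from the last step.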
Now that we have the basic algorithm out of the way, let’s get started with our code. Unfortunately, this is going to involve some changes to our upstream code, making future merges with upstream possibly troublesome. Oh, well. We really do want spoilers.
First off, we need to move some code around in src/inlines.c: parse_inline. We talked about this briefly in the first part of this series. parse_inline(...) has a big switch block that tests against all the special characters it knows of. If it finds a match, it processes the current match, and then looks for the next special character. It invokes our extensions only in the default case of the switch, i.e. if it hasn’t matched against anything earlier.
Currently, cmark-gfm treats ! as a special character in the switch. This presents us with a design choice - we can modify the code where it matches against a !, or move the extension matching outside the switch. Here, I decided to do the latter, as it seemed cleaner considering possible future extensions. So, I moved the following bit outside the switch. Keep in mind that this changes the behavior of the code - extensions are now matched before the default tags.
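To make the ordering change concrete, here is a toy dispatcher that is not taken from cmark-gfm. The names dispatch_inline, spoiler_matches and the handler enum are invented for this sketch, and the list of built-in special characters is deliberately incomplete. It only shows that an extension now gets the first look at a !, before the built-in case for ! can claim it.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
  HANDLED_BY_EXTENSION,
  HANDLED_BY_BUILTIN,
  HANDLED_AS_TEXT
} handler;

/* Hypothetical extension matcher: claims ">!" and "!<". */
static int spoiler_matches(const char *s) {
  return (s[0] == '>' && s[1] == '!') || (s[0] == '!' && s[1] == '<');
}

/* Decides who handles the text starting at s. */
handler dispatch_inline(const char *s) {
  assert(s != NULL && s[0] != '\0');

  /* Extensions are consulted first, outside the switch. */
  if (spoiler_matches(s))
    return HANDLED_BY_EXTENSION;

  /* Only then do the built-in special characters get their turn. */
  switch (s[0]) {
  case '!':
  case '[':
  case '*':
  case '`':
    return HANDLED_BY_BUILTIN;
  default:
    return HANDLED_AS_TEXT;
  }
}
```

With the extension check inside the default case instead, the "!<" input would have been swallowed by the built-in ! case and the extension would never have seen it.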
Next, let’s create a way to track our open spoiler tags. The following struct in src/inlines.h works well.
This maintains a simple linked list with a reference to the previous tag. It also tracks a cmark text node. This node will hold the opening tag text >!. The parser will default to this text node in the event that it doesn’t find a matching !< to close the spoiler. Finally, we track the position of the opening tag in the text.
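Going by that description, the tag type might look roughly like the sketch below. The field names are a guess modelled on the bracket struct, toy_node stands in for cmark_node so that the snippet compiles on its own, and open_spoiler_count is a helper added only to show how the list is walked.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cmark_node so the sketch compiles without cmark-gfm. */
typedef struct toy_node {
  const char *literal;
} toy_node;

/* Hypothetical layout: one entry per potential opening tag. */
typedef struct spoiler_tag {
  struct spoiler_tag *previous; /* tag opened before this one, or NULL */
  toy_node *inl_text;           /* text node holding the literal ">!" */
  int position;                 /* offset of the opening tag in the input */
} spoiler_tag;

/* Counts how many openers are still waiting for a closing tag. */
int open_spoiler_count(const spoiler_tag *top) {
  int count = 0;
  for (const spoiler_tag *tag = top; tag != NULL; tag = tag->previous) {
    assert(tag->inl_text != NULL); /* every opener carries fallback text */
    count++;
  }
  return count;
}
```

Because each tag only points backwards, the most recent opener is always the head of the list, which is exactly what a closing tag needs to find first.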
Next, we need a good place to track this list in the context of our parser. Brackets are tracked inside a subject struct. This struct is currently internal to src/inlines.c. The code that processes spoilers is going to be housed in reddit_extensions/spoiler.c. This is going to need to reference the subject struct as well. To ease access, let’s move the subject struct to src/inlines.h. Finally, we add the following declaration to the struct definition - struct spoiler_tag *last_spoiler;. The struct now reads as
We are pretty much at the point where we can start the detection and parsing of the spoiler tags. The following piece of code inside reddit_extensions/spoiler.c will form the basis of our parsing.
The code simply does the following:
- On encountering an open tag >!, it advances the parser by 2 positions. It creates a text node that contains >!. It then pushes a spoiler_tag on to the stack of potential spoilers on subject.
- When it finds a closing tag !<, it handles some checks and closes the spoiler as needed.
The two bits of missing information here are the contents of the functions push_spoiler and handle_close_spoiler. Both are relatively simple, and can be placed inside src/inlines.c. We keep them there to avoid duplicating some convenience functions found within.
push_spoiler simply creates a spoiler_tag and maintains the linked list of spoiler tags on subject. handle_close_spoiler is a bit more involved. It checks to see if we have a matching opening tag. If we do, it converts the text node that we have in the opening tag into a node of type CMARK_NODE_REDDIT_SPOILER. It then re-appends all the child nodes of this node to itself. Why do we need to do this? I’m really not sure. It could have something to do with the contents of a node being reset when we change its type. Anyway, this is what the parser seems to do when it closes a bracket, and it works here too.
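Here is a rough, self-contained take on the two functions, using toy types instead of cmark_node and subject. Everything prefixed with toy_ or TOY_ is invented for the sketch. It also assumes that the nodes to re-parent are the siblings that follow the opener’s text node, which is my reading of what happens to link text when a bracket closes, not something this code was checked against.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { TOY_TEXT, TOY_SPOILER } toy_type;

/* Toy node: singly linked siblings plus a child list. */
typedef struct toy_node {
  toy_type type;
  const char *literal;
  struct toy_node *next, *first_child, *last_child;
} toy_node;

typedef struct spoiler_tag {
  struct spoiler_tag *previous;
  toy_node *inl_text;
  int position;
} spoiler_tag;

/* Abbreviated stand-in for subject: only the fields used here. */
typedef struct toy_subject {
  int pos;
  spoiler_tag *last_spoiler;
} toy_subject;

/* Records a potential opener on top of the stack. */
void push_spoiler(toy_subject *subj, toy_node *inl_text) {
  spoiler_tag *tag = malloc(sizeof *tag);
  assert(tag != NULL);
  tag->previous = subj->last_spoiler;
  tag->inl_text = inl_text;
  tag->position = subj->pos;
  subj->last_spoiler = tag;
}

/* Returns the new spoiler node, or NULL if no opener is waiting. */
toy_node *handle_close_spoiler(toy_subject *subj) {
  spoiler_tag *opener = subj->last_spoiler;
  if (opener == NULL)
    return NULL; /* caller treats the closing tag as plain text */

  /* Turn the ">!" text node into the spoiler node itself. */
  toy_node *spoiler = opener->inl_text;
  spoiler->type = TOY_SPOILER;
  spoiler->literal = NULL;

  /* Everything parsed after the opener becomes the spoiler's content. */
  spoiler->first_child = spoiler->next;
  spoiler->last_child = spoiler->next;
  while (spoiler->last_child != NULL && spoiler->last_child->next != NULL)
    spoiler->last_child = spoiler->last_child->next;
  spoiler->next = NULL;

  /* Pop the opener off the stack. */
  subj->last_spoiler = opener->previous;
  free(opener);
  return spoiler;
}
```

Nested spoilers fall out naturally: the inner tag closes first and collects its own content, and the outer tag later picks up the inner spoiler node as one of its children.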
The astute reader will notice that we have not declared CMARK_NODE_REDDIT_SPOILER yet. The different cmark nodes are distinguished by a property of type cmark_node_type found on cmark_node. cmark_node_type is an enum declared in src/cmark-gfm.h. Ordinarily, it isn’t possible to add types to an enum, but we are going to take a leaf out of extensions/strikethrough.{h, c} to effectively add a type at runtime. Let’s declare a new variable, cmark_node_type CMARK_NODE_REDDIT_SPOILER, in reddit_extensions/spoiler.c. Wherever we need to reference it, we will do so by declaring extern cmark_node_type CMARK_NODE_REDDIT_SPOILER. We will assign it an appropriate int value when we create the syntax extension.
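The trick of handing out an enum value at runtime is easier to see in isolation. The snippet below is a toy model, not the cmark-gfm implementation: toy_node_type, toy_add_inline_node and the ceiling value are all made up, and the real enum has far more members.

```c
#include <assert.h>

/* Toy stand-in for cmark_node_type. */
typedef enum {
  TOY_NODE_TEXT = 1,
  TOY_NODE_EMPH = 2,
  TOY_NODE_MAX = 0xff /* arbitrary ceiling for this sketch */
} toy_node_type;

/* Highest inline type handed out so far; starts at the last built-in. */
static toy_node_type toy_last_inline = TOY_NODE_EMPH;

/* Hands out the next unused value each time it is called. */
toy_node_type toy_add_inline_node(void) {
  assert(toy_last_inline < TOY_NODE_MAX);
  toy_last_inline = (toy_node_type)((int)toy_last_inline + 1);
  return toy_last_inline;
}

/* Defined once here; any other file that needs it would declare
 * `extern toy_node_type TOY_NODE_SPOILER;`. */
toy_node_type TOY_NODE_SPOILER;

/* Called once while the extension is being set up. */
void toy_register_spoiler(void) {
  TOY_NODE_SPOILER = toy_add_inline_node();
}
```

The variable is just an ordinary global of the enum type, so it can be compared against a node’s type like any built-in value once registration has run.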
When creating the syntax extension, we set some basic properties of the new spoiler extension. We initialize the value of CMARK_NODE_REDDIT_SPOILER with a helper function, cmark-gfm_extension_api.h: cmark_syntax_extension_add_node. We also set function pointers to the match and html_render functions. Finally, we set the memory allocator to be used, along with the special characters used to identify the spoiler tag.
Almost there. All that remains are a few housekeeping items. We need to handle the case where we have open spoiler tags without matching close spoiler tags. We also need to distinguish between opening a quote in markdown (block starts with a >) and the start of our spoiler tag.
The former is easily handled by just popping all the open spoiler tags from the stack. Remember, while parsing, we had already attached a text node containing >! to the node tree as a default. This is similar to how brackets are currently handled by the parser. Just append the following to src/inlines.c: cmark_parse_inlines.
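The cleanup amounts to a short loop. This sketch uses abbreviated toy versions of the tag and subject types, and pop_remaining_spoilers is a name made up here; the return value exists only to make the behavior easy to check.

```c
#include <assert.h>
#include <stdlib.h>

/* Abbreviated stand-ins: only the fields the cleanup needs. */
typedef struct spoiler_tag {
  struct spoiler_tag *previous;
} spoiler_tag;

typedef struct toy_subject {
  spoiler_tag *last_spoiler;
} toy_subject;

/* Discards every unmatched opener and returns how many there were. */
int pop_remaining_spoilers(toy_subject *subj) {
  int popped = 0;
  assert(subj != NULL);
  while (subj->last_spoiler != NULL) {
    spoiler_tag *tag = subj->last_spoiler;
    subj->last_spoiler = tag->previous;
    free(tag); /* the ">!" text node itself stays in the tree */
    popped++;
  }
  return popped;
}
```

Nothing needs to be done to the node tree at this point, because the fallback text nodes were attached when the openers were first seen.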
To handle the latter, we peek at the next char when we encounter a > at the beginning of a text block. If it’s a !, we skip starting a markdown quote node. The following condition in src/blocks.c: open_new_blocks
if (!indented && peek_at(input, parser->first_nonspace) == '>') {
becomes
if (!indented && peek_at(input, parser->first_nonspace) == '>' && input->len > 1 && peek_at(input, parser->first_nonspace + 1) != '!') {
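The same decision can be pulled out into a small standalone predicate for experimenting. starts_block_quote is a hypothetical helper, not a function in cmark-gfm, and unlike the condition quoted above it checks bounds relative to first_nonspace, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the line should open a block quote, false if it is
 * indented, does not start with '>', or starts a spoiler tag. */
bool starts_block_quote(const char *line, size_t len, size_t first_nonspace,
                        bool indented) {
  assert(line != NULL && first_nonspace <= len);

  if (indented || first_nonspace >= len || line[first_nonspace] != '>')
    return false;

  /* A '>' immediately followed by '!' opens a spoiler, not a quote. */
  if (first_nonspace + 1 < len && line[first_nonspace + 1] == '!')
    return false;

  return true;
}
```

Note that "> !text" (with a space after the >) still counts as a quote here, since only an immediately following ! is treated as a spoiler opener.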
That pretty much sums up our code for handling the spoiler tag. We just need to initialize the extension and attach it to our parser. Again, following the handling of strikethrough, we can create a new function to handle the registration of all the new extensions we write. Place these in src/reddit_extensions/reddit_extensions.c.
And now, on to testing!