Nom is a parser combinator library for Rust. We can use it to write a Rust implementation of MDX, starting with headings.
Our goal is to parse the following MDX file (which in this case is no different from a Markdown file).
# boop
In our main.rs we'll use a couple of nom functions.
use nom::{character::*, sequence::terminated, Err::Error, IResult, *};
We'll also use a custom error type, which acts pretty much like nom's original error.
use crate::mdx_error::MDXError;
At the top of the file we'll define our data structure. This is what we're going to parse the MDX into. In this case it's an ATXHeading struct (the name of one type of heading in CommonMark). The value is a reference to a [u8] with a lifetime annotation, but that's not super important. We could also have used a str, etc.
#[derive(Debug, PartialEq, Eq)]
pub struct ATXHeading<'a> {
    pub level: usize,
    pub value: &'a [u8],
}
We'll start with a couple of parsers for hashes and spaces. Nom uses macros quite heavily, although in 5.0 you can also write parsers with functions, as we'll see in a moment. The named! macro takes the identifier in its first argument (hashtags or spaces) and defines a parser with that name from the macros in its second argument, so we can use hashtags or spaces as parsers later.
named!(hashtags, is_a!("#"));
named!(spaces, take_while!(is_space));
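To make it a bit more concrete what these two parsers do, here's a rough hand-written equivalent using only the standard library. This is a sketch, not how nom implements them: the names hashtags_by_hand and spaces_by_hand are made up for illustration, and the error case is collapsed to None. It relies on is_a! needing at least one matching byte while take_while! is happy to match nothing, and on is_space accepting spaces and tabs.

```rust
/// Hypothetical stand-in for `named!(hashtags, is_a!("#"))`.
/// Returns (remaining input, matched hashes), or None if there is no leading '#'.
fn hashtags_by_hand(input: &[u8]) -> Option<(&[u8], &[u8])> {
    let n = input.iter().take_while(|&&b| b == b'#').count();
    if n == 0 {
        // is_a! fails when it cannot match at least one byte.
        return None;
    }
    let (matched, rest) = input.split_at(n);
    Some((rest, matched))
}

/// Hypothetical stand-in for `named!(spaces, take_while!(is_space))`.
/// Always succeeds; the matched slice may be empty.
fn spaces_by_hand(input: &[u8]) -> (&[u8], &[u8]) {
    let n = input
        .iter()
        .take_while(|&&b| b == b' ' || b == b'\t')
        .count();
    let (matched, rest) = input.split_at(n);
    (rest, matched)
}
```

Note the order of the tuple: like nom, the leftover input comes first and the matched value second.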
Then we write a few function-based parsers that operate on strings and return IResults. IResult is an important type to get to know because it's used everywhere, and getting its type arguments right matters a lot. While the current return type for these parsers is an IResult<&str, &str> with two type arguments (the input and output types), later we'll see that we can also pass a third to specify the error type.
pub fn end_of_line(input: &str) -> IResult<&str, &str> {
    if input.is_empty() {
        Ok((input, input))
    } else {
        nom::character::complete::line_ending(input)
    }
}

pub fn rest_of_line(input: &str) -> IResult<&str, &str> {
    terminated(
        nom::character::complete::alphanumeric0,
        end_of_line,
    )(input)
}
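If the shape of IResult is new to you, the following std-only sketch may help. It is not nom's definition: ParseResult, SimpleError and the *_by_hand functions are hypothetical names, and nom additionally wraps its errors in its own Err enum. The point is only to show a result alias with an input type, an output type and an optional third error type, plus hand-rolled versions of the two parsers above.

```rust
/// Hypothetical, much simplified stand-in for nom's error kinds.
#[derive(Debug, PartialEq)]
enum SimpleError {
    ExpectedLineEnding,
}

/// Input type, output type, and an error type that falls back to a default
/// when only two type arguments are given.
type ParseResult<I, O, E = SimpleError> = Result<(I, O), E>;

/// Two type arguments: the error type is the default one.
fn end_of_line_by_hand(input: &str) -> ParseResult<&str, &str> {
    let len = if input.is_empty() {
        0
    } else if input.starts_with("\r\n") {
        2
    } else if input.starts_with('\n') {
        1
    } else {
        return Err(SimpleError::ExpectedLineEnding);
    };
    let (matched, rest) = input.split_at(len);
    Ok((rest, matched))
}

/// Zero or more ASCII alphanumerics, then the end of the line (discarded).
fn rest_of_line_by_hand(input: &str) -> ParseResult<&str, &str> {
    let n = input
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric())
        .count();
    let (word, rest) = input.split_at(n);
    let (rest, _) = end_of_line_by_hand(rest)?;
    Ok((rest, word))
}

/// Three type arguments: same parser, but with a caller-chosen error type.
fn rest_of_line_with_message(input: &str) -> ParseResult<&str, &str, String> {
    rest_of_line_by_hand(input).map_err(|e| format!("could not parse line: {:?}", e))
}
```

One thing this makes visible: because only alphanumerics are accepted before the line ending, a line like "hello world" is rejected. That's fine for our "boop" example, but it's one more thing a complete heading parser has to deal with.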
The meat of our setup is atx_heading, which uses the parsers we defined earlier to parse values out and return either a tuple of the leftover input and the ATXHeading struct, or an error. We use .map_err to convert the error types into our custom error type, so that we can also return our own custom error if the hash length for the heading is greater than 6, which means the line should be a paragraph. Our heading parser doesn't care about paragraphs. It only cares that it has to fail; paragraph parsing will happen somewhere else in our program.
pub fn atx_heading(
    input: &[u8],
) -> IResult<&[u8], ATXHeading, MDXError<&[u8]>> {
    // TODO: up to 3 spaces can occur here
    let (input, hashes) =
        hashtags(input).map_err(Err::convert)?;
    if hashes.len() > 6 {
        return Err(Error(MDXError::TooManyHashes));
    }
    // TODO: empty headings are a thing, so any parsing below this is optional
    let (input, _) = spaces(input).map_err(Err::convert)?;
    // TODO: any whitespace on the end would get trimmed out
    let (input, val) =
        rest_of_line(std::str::from_utf8(input).unwrap())
            .map_err(Err::convert)?;
    Ok((
        input.as_bytes(),
        ATXHeading {
            level: hashes.len(),
            value: val.as_bytes(),
        },
    ))
}
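The MDXError type itself isn't shown in this post, so here's only a sketch of the error-conversion pattern with made-up types and no nom involved. LowLevelError, HeadingError, count_hashes and heading_level are all hypothetical names. The idea is the same as above: a From impl lets .map_err lift the lower-level error into our own type, which leaves room for an extra variant like TooManyHashes.

```rust
/// Hypothetical error produced by a low-level parser.
#[derive(Debug, PartialEq)]
enum LowLevelError {
    NoHashes,
}

/// Hypothetical custom error: wraps the low-level one and adds its own case.
#[derive(Debug, PartialEq)]
enum HeadingError {
    TooManyHashes,
    Parser(LowLevelError),
}

impl From<LowLevelError> for HeadingError {
    fn from(e: LowLevelError) -> Self {
        HeadingError::Parser(e)
    }
}

/// Low-level parser: counts leading '#' characters.
fn count_hashes(input: &str) -> Result<(&str, usize), LowLevelError> {
    let n = input.bytes().take_while(|&b| b == b'#').count();
    if n == 0 {
        Err(LowLevelError::NoHashes)
    } else {
        Ok((&input[n..], n))
    }
}

/// Higher-level parser: converts the error, then applies its own rule.
fn heading_level(input: &str) -> Result<(&str, usize), HeadingError> {
    let (rest, level) = count_hashes(input).map_err(HeadingError::from)?;
    if level > 6 {
        // More than six hashes is not a heading; some other parser should handle it.
        return Err(HeadingError::TooManyHashes);
    }
    Ok((rest, level))
}
```

Keeping the low-level error inside a variant means the caller can still tell "this wasn't a heading at all" apart from "this looked like a heading but had too many hashes".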
Finally, here's a test that asserts that we can parse an MDX string into the ATXHeading AST.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parse_atx_heading() {
        assert_eq!(
            atx_heading(b"# boop"),
            Ok((
                "".as_bytes(),
                ATXHeading {
                    level: 1,
                    value: b"boop"
                }
            ))
        );
    }
}
Note that this is not a fully spec-compliant parser (we noted TODOs in the program comments), but it will work for headings written in this specific form. Can you flesh this out to parse the rest of the ATX Heading in the spec? This is part of my work on the MDX Rust implementation, so by the time you read this there may be a more sophisticated parser for headings waiting for you there.
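If you'd like a starting point for that exercise, here's one possible sketch of the three TODOs (leading indentation, empty headings, trailing whitespace) written against the standard library only, so you can see the rules before translating them into nom combinators. HeadingSketch and atx_heading_sketch are made-up names, it works on &str rather than &[u8], and it still ignores parts of the spec such as the optional closing sequence of hashes.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct HeadingSketch<'a> {
    pub level: usize,
    pub value: &'a str,
}

/// Parses one ATX heading from the start of `input`.
/// Returns (remaining input, heading), or None if the first line is not a heading.
pub fn atx_heading_sketch(input: &str) -> Option<(&str, HeadingSketch<'_>)> {
    // Split off the first line; the remainder starts after the line ending.
    let (line, rest) = match input.find('\n') {
        Some(i) => (&input[..i], &input[i + 1..]),
        None => (input, ""),
    };
    let line = line.strip_suffix('\r').unwrap_or(line);

    // Up to 3 spaces of indentation are allowed before the hashes.
    let indent = line.bytes().take_while(|&b| b == b' ').count();
    if indent > 3 {
        return None;
    }
    let line = &line[indent..];

    // 1 to 6 hashes give the heading level.
    let level = line.bytes().take_while(|&b| b == b'#').count();
    if level == 0 || level > 6 {
        return None;
    }
    let after = &line[level..];

    // The hashes must be followed by a space, a tab, or the end of the line,
    // which also makes empty headings like "###" valid.
    if !(after.is_empty() || after.starts_with(' ') || after.starts_with('\t')) {
        return None;
    }

    // Leading and trailing whitespace is not part of the heading text.
    let value = after.trim_matches(|c: char| c == ' ' || c == '\t');
    Some((rest, HeadingSketch { level, value }))
}
```

With this, "   ## hello world  " parses to a level 2 heading with the value "hello world", "###" parses to an empty level 3 heading, and "#boop" is rejected because there's no space after the hash.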