C++ Compile Time Parser Generator


C++ single header library which takes a language description as C++ code and turns it into an LR(1) table parser with a deterministic finite automaton lexical analyzer, all in compile time.
What's more, the generated parser is itself capable of parsing in compile time.
All it needs is a C++17 compiler!


Usage

The following code demonstrates a simple parser which takes a comma separated list of integer numbers as an argument and prints their sum.

readme_example.cpp



#include "ctpg.hpp"
#include <iostream>
#include <charconv>

using namespace ctpg;
using namespace ctpg::buffers;

constexpr nterm<int> list("list");

constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");

int to_int(const std::string_view& sv)
{
    int i = 0;
    std::from_chars(sv.data(), sv.data() + sv.size(), i);
    return i;
}

constexpr parser p(
    list,
    terms(',', number),
    nterms(list),
    rules(
        list(number) >=
            to_int,
        list(list, ',', number)
            >= [](int sum, char, const auto& n){ return sum + to_int(n); }
    )
);

int main(int argc, char* argv[])
{
    if (argc < 2)
        return -1;
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}

Compile and run:

g++ readme_example.cpp -std=c++17 -o example && example "10, 20, 30"

You should see the output: 60. If incorrect text is supplied as an argument:

g++ readme_example.cpp -std=c++17 -o example && example "1, 2, 3x"

you should see:

[1:8] PARSE: Unexpected character: x

Explanation

Header

Namespaces

Namespace ctpg is the top namespace. There are a couple of feature namespaces like buffers.

using namespace ctpg;
using namespace ctpg::buffers;

Terminal symbols

Terminal symbols (short: terms) are symbols used in a grammar definition that are atomic blocks.
Examples of terms from the C++ language are: identifier, '+' operator, various keywords etc.

To define a term use one of the char_term, string_term and regex_term classes.

Here is an example of a regex_term with a common integer number regex pattern.

constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");

The constructor argument ("number") indicates a debug name and can be omitted, however this is not advised.
Names are handy for diagnosing problems with the grammar. If omitted, the name will be set to the pattern string.

Note: the pattern needs to have static linkage to be allowed as a template parameter. This is a C++17 limitation, and CTPG does not support C++20 features yet.
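The requirement can be seen in a small standalone sketch that does not use CTPG at all. The pattern_holder template is a hypothetical type invented here for illustration; it only shows that a named constexpr array at namespace scope can be bound to a reference template parameter in C++17, while a string literal cannot be passed directly.

```cpp
#include <string_view>

// Hypothetical illustration type, not part of CTPG.
// The reference parameter has to refer to an object with static storage
// duration, which is why the pattern is a named namespace-scope array.
template<auto& Pattern>
struct pattern_holder
{
    static constexpr std::string_view view() { return std::string_view(Pattern); }
};

constexpr char digits_pattern[] = "[1-9][0-9]*";

using digits = pattern_holder<digits_pattern>;

// pattern_holder<"[1-9][0-9]*"> would not compile: a string literal
// is not a valid argument for a reference template parameter.
static_assert(digits::view().size() == 11);
```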

Other types of terms

char_term is used when we need to match things like a + or , operator.
string_term is used when we need to match a whole string, like a language keyword.

Nonterminal symbols

Nonterminal symbols (short: nonterms) are essentially all non atomic symbols in the grammar.
In the C++ language these are things like: expression, class definition, function declaration etc.

To define a nonterm use the nterm class.

constexpr nterm<int> list("list");

The constructor argument ("list") is a debug name as well, like in the case of regex_term.
The difference is that in nterms names are necessary, because they serve as unique identifiers as well.
Therefore it is a requirement that nonterm names are unique.

The template parameter in this case is a value type. More on this concept later.

Parser definition

The parser class together with its template deduction guides allows defining parsers using 4 arguments:

  • Grammar root - symbol which is a top level nonterm for a grammar.
  • List of all terms
  • List of all nonterms
  • List of rules

The parser object should be declared as constexpr, which makes all the necessary calculations of the LR(1) table parser happen in compile time.

Let’s break down the arguments:

constexpr parser p(
    list,
    terms(',', number),
    nterms(list),  

Grammar root.

When the root symbol gets matched (in this case list) the parse is successful.

Term list.

List of terms enclosed in a terms call. In our case there are two: number and a ,.

Note: the , term is not defined earlier in the code.
It is an implicit char_term. The code implicitly converts the char to the char_term class.
Therefore char_terms (as well as string_terms) do not have to be defined in advance. Their debug names are assigned
by default to the char (or the string) they represent.

Nonterm list.

List of nonterms enclosed in a nterms call. In our case, just a single list nonterm is enough.

Rules

List of rules enclosed in a rules call.
Each rule is in the form of:
nonterm(symbols...) >= functor
The nonterm part is what's called the left side of the rule. The symbols are called the right side.

The right side can contain any number of nterm objects as well as terms (regex_terms, char_terms or string_terms).
Terms can be in their implicit form, like , in the example. Implicit string_terms are in the form of "strings".

The first rule list(number) means that the list nonterm can be parsed using a single number regex term.

The second rule uses what's known as left recursion. In other words, a list can be parsed as a list followed by a , and a number.

Functors

The functors are any callables that accept exactly as many arguments as there are symbols on the right side and return a value type of the left side.
Each nth argument needs to accept a value of the value type of the nth right side symbol.

So in the case of the first to_int functor, it is required to accept a value type of regex_term and return an int.

The second functor is a lambda which accepts 3 arguments: an int for the list, a char for the , and auto for whatever is passed as
a value type for the regex_term.

Note: Functors are called in a way that allows taking advantage of move semantics, so defining their arguments as move references is encouraged.
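The arity and type requirements on functors can be checked with standard type traits before a parser is even declared. The sketch below is independent of CTPG: it uses plain char and std::string_view in place of the term value types, and to_int_sketch and add_number are names made up here for illustration.

```cpp
#include <string_view>
#include <type_traits>

// Stand-in for the to_int functor: one right side symbol, returns int.
constexpr int to_int_sketch(std::string_view sv)
{
    int value = 0;
    for (char c : sv) value = value * 10 + (c - '0');
    return value;
}

// Stand-in for the second rule's functor: list, ',', number -> list.
constexpr auto add_number = [](int sum, char, const auto& n) { return sum + to_int_sketch(n); };

// One argument per right side symbol; the result converts to the left side's value type.
static_assert(std::is_invocable_r_v<int, decltype(&to_int_sketch), std::string_view>);
static_assert(std::is_invocable_r_v<int, decltype(add_number), int, char, std::string_view>);

// A call with a missing argument is rejected.
static_assert(!std::is_invocable_v<decltype(add_number), int, char>);
```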

Value types for terms

Terms, unlike nonterms (which have their value types defined as a template parameter to the nterm definition),
have their value types predefined to either a term_value<char> for a char_term, or a term_value<std::string_view>
for both regex_term and string_term.

The term_value class template is a simple wrapper that is implicitly convertible to its template parameter (either a char or std::string_view).
That is why when providing functors we can simply declare arguments as either a char or a std::string_view.
In our case the to_int functor has a const std::string_view& argument, which accepts a term_value<std::string_view> just fine.
Of course an auto in case of a lambda will always do the trick.

The advantage of declaring functor arguments as a term_value specialization is that we can access other features (like source tracking) using the term_value methods.
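How such a wrapper can serve both styles of functor is sketched below. This is not CTPG's class: term_value_sketch and its line() and column() members are hypothetical names chosen for this illustration, and the real term_value interface may look different.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a term value wrapper; not CTPG's actual class.
template<typename ValueType>
class term_value_sketch
{
public:
    constexpr term_value_sketch(ValueType v, int line, int column)
        : value_(v), line_(line), column_(column) {}

    // Implicit conversion lets functors take the plain wrapped type.
    constexpr operator const ValueType&() const { return value_; }

    constexpr int line() const { return line_; }
    constexpr int column() const { return column_; }

private:
    ValueType value_;
    int line_;
    int column_;
};

// A functor written against the plain type...
constexpr std::size_t length_of(const std::string_view& sv) { return sv.size(); }

// ...and one written against the wrapper, which can also read the position.
constexpr int column_of(const term_value_sketch<std::string_view>& tv) { return tv.column(); }

constexpr term_value_sketch<std::string_view> sample("123", 1, 5);
static_assert(length_of(sample) == 3);
static_assert(column_of(sample) == 5);
```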

Parse method call

Use the parse method with 2 arguments:

  • a buffer
  • an error stream

Buffers

Use a string_buffer from the buffers namespace to parse a null terminated string or a std::string.

Error stream

A stream reference like std::cerr or any other std::ostream can be passed as a stream argument.
This is the place where the parse method is going to write error messages such as a syntax error.

auto res = p.parse(string_buffer(argv[1]), std::cerr);

Parse return value

The parse method returns an std::optional<T>, where T is the value type of the root symbol.
Use .has_value() and .value() to check and access the result of the parse.

Note: White space characters are skipped by default between consecutive terms.

Compile time parsing

The example code can easily be changed to create an actual constexpr parser.
First, all the functors need to be constexpr.
To achieve this change the to_int function to:

constexpr int to_int(const std::string_view& sv)
{
    int sum = 0;
    for (auto c : sv) { sum *= 10; sum += c - '0'; }
    return sum;
}

The function is now constexpr. The <charconv> header is now unnecessary.
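Because the function is constexpr, its behaviour can be checked by the compiler with static_assert. The sketch below repeats the function so that it compiles on its own; it assumes the input holds only decimal digits, which is what the number pattern matches.

```cpp
#include <string_view>

constexpr int to_int(const std::string_view& sv)
{
    int sum = 0;
    for (auto c : sv) { sum *= 10; sum += c - '0'; }
    return sum;
}

// Evaluated during compilation; a wrong result would stop the build.
static_assert(to_int("20") == 20);
static_assert(to_int("1") + to_int("20") + to_int("3") == 24);
```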

Note: To allow constexpr parsing all of the nonterm value types have to be literal types.

Also change main to use cstring_buffer and declare the parse result constexpr.
The error stream argument is also unavailable in constexpr parsing.

int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        constexpr char example_text[] = "1, 20, 3";
        
        constexpr auto cres = p.parse(cstring_buffer(example_text)); // notice cstring_buffer and no std::cerr output
        std::cout << cres.value() << std::endl;
        return 0;
    }
        
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}

Now when no argument is passed to the program, it prints the compile time result of parsing "1, 20, 3".

g++ readme_example.cpp -std=c++17 -o example && example

should print the number 24.

Invalid input in constexpr parsing

If the example_text variable held an invalid input, the code cres.value()
would throw, because cres is of type std::optional<int> with no value.

Changing the parse call to:

constexpr int cres = p.parse(cstring_buffer(example_text)).value();

would cause a compilation error, because throwing std::bad_optional_access is not allowed in a constant expression.
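The same behaviour can be reproduced with std::optional alone. In this sketch parse_digit is a hypothetical stand-in for a constexpr parse call that may fail; it is not a CTPG function.

```cpp
#include <optional>

// Hypothetical stand-in for a constexpr parse call that may fail.
constexpr std::optional<int> parse_digit(char c)
{
    if (c < '0' || c > '9')
        return std::nullopt;
    return c - '0';
}

constexpr auto good = parse_digit('7');
constexpr auto bad = parse_digit('x');

static_assert(good.has_value());
static_assert(!bad.has_value());

// Fine: the optional holds a value, so value() is a constant expression.
constexpr int seven = good.value();

// Would not compile: value() on an empty optional has to throw,
// and a throw is not allowed in a constant expression.
// constexpr int broken = bad.value();

// value_or never throws, so it can supply a compile time fallback.
constexpr int fallback = bad.value_or(-1);
```

Guarding with a static_assert on has_value() before calling value() gives a clearer diagnostic than the error produced by the failed constant evaluation.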

LR(1) parser

CTPG uses an LR(1) parser. The name is short for left-to-right scanning, rightmost derivation in reverse, and 1 lookahead symbol.

Algorithm

The parser uses a parse table which resembles a state machine.
Here is pseudo code for the algorithm:

struct entry
   int next          // valid if shift
   int rule_length   // valid if reduce
   int nterm_nr      // valid if reduce
   enum kind {success, shift, reduce, error }

bool parse(input, sr_table[states_count][terms_count], goto_table[states_count][nterms_count])
   state = 0
   states.push(state)
   needs_term = true

   while (true)
      if (needs_term)
         term_nr = get_next_term(input)
      entry = sr_table[state, term_nr]
      kind = entry.kind

      if (kind == success)
         return true

      else if (kind == shift)
         needs_term = true
         state = entry.next
         states.push(state)
         continue

      else if (kind == reduce)
         needs_term = false          // the current term has not been consumed yet
         states.pop_n(entry.rule_length)
         state = states.top()
         state = goto_table[state, entry.nterm_nr]
         states.push(state)          // the state reached through goto joins the stack
         continue

      else
         return false

The parser contains a state stack, which grows when the algorithm encounters a shift operation and shrinks on a reduce operation.

Aside from the state stack, there is also a value stack dedicated to parse result calculation.
Each shift pushes a value to the stack and each reduce calls an appropriate functor with values from the value stack, removing those values from the stack and replacing them with a single value associated with the rule's left side.
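The interplay of the two stacks can be sketched as a small runnable program for the list grammar from the usage example. The tables below were built by hand for this illustration and the token type stands in for a lexer; CTPG derives its own tables at compile time and its internals may differ.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hand-built tables for: list -> number | list ',' number
enum term { t_number, t_comma, t_eof };
enum class kind { error, shift, reduce, success };

struct entry
{
    kind k;
    int next;         // valid if shift
    int rule_length;  // valid if reduce
};

struct token { term t; int value; };

constexpr entry er{kind::error, 0, 0};
constexpr entry ok{kind::success, 0, 0};
constexpr entry sh(int next) { return {kind::shift, next, 0}; }
constexpr entry rd(int length) { return {kind::reduce, 0, length}; }

// Rows are states, columns are terms: number, ',', end of input.
constexpr entry sr_table[5][3] = {
    { sh(1), er,    er    },  // 0: start
    { er,    rd(1), rd(1) },  // 1: list -> number .
    { er,    sh(3), ok    },  // 2: list .   (root matched at end of input)
    { sh(4), er,    er    },  // 3: list ',' . number
    { er,    rd(3), rd(3) },  // 4: list ',' number .
};
// Single nonterm (list); in this grammar it is only ever reached from state 0.
constexpr int goto_list[5] = { 2, -1, -1, -1, -1 };

std::optional<int> parse_sum(const std::vector<token>& input)
{
    std::vector<int> states{0};
    std::vector<int> values;
    std::size_t pos = 0;

    while (true)
    {
        // Past the end of the input the lookahead is the end marker.
        const token look = pos < input.size() ? input[pos] : token{t_eof, 0};
        const entry e = sr_table[states.back()][look.t];

        if (e.k == kind::success)
            return values.back();

        if (e.k == kind::shift)
        {
            states.push_back(e.next);
            values.push_back(look.value);  // every shift pushes a value
            ++pos;
        }
        else if (e.k == kind::reduce)
        {
            // Functors: list(number) keeps the value, list(list, ',', number) adds.
            const std::size_t n = static_cast<std::size_t>(e.rule_length);
            const int result = n == 1
                ? values.back()
                : values[values.size() - 3] + values.back();
            states.resize(states.size() - n);
            values.resize(values.size() - n);
            states.push_back(goto_list[states.back()]);
            values.push_back(result);      // one value for the rule's left side
        }
        else
            return std::nullopt;
    }
}
```

Calling parse_sum with the tokens for 10, 20, 30 walks through the same shifts and reductions as the usage example and yields 60, while two numbers in a row hit an error entry and yield an empty optional.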

Table creation

This topic is out of scope of this manual. There is plenty of material online on LR parsers.
Recommended book on the topic: Compilers: Principles, Techniques and Tools

Conflicts

There are situations (parser states) in which, when a particular term is encountered on the input, there is an ambiguity regarding the operation the parser should perform.

In other words a language grammar may be defined in such a way that both shift and reduce can lead to a successful parse, however the result will be different in both cases.

Example 1

Consider a classic expression parser (functors omitted for clarity):

constexpr parser p(
    expr,
    terms('+', '*', number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '*', expr)
    )
);

Consider the input 2 + 2 * 2 being parsed and the parser in a state after successfully matching 2 + 2 and encountering the * term.

Both shifting the * term and reducing by the rule expr(expr, '+', expr) would be valid, however they would produce different results.
This is a classic operator precedence case, and this conflict needs to be resolved somehow. This is where precedence and associativity come into play.
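The two outcomes can be written out by hand in plain C++. This is only an arithmetic illustration of the two groupings, not parser code:

```cpp
// Two ways to group 2 + 2 * 2.
constexpr int shift_first  = 2 + (2 * 2);  // '*' is shifted, so 2 * 2 is reduced first
constexpr int reduce_first = (2 + 2) * 2;  // 2 + 2 is reduced before '*' is consumed

static_assert(shift_first == 6);
static_assert(reduce_first == 8);
```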

Precedence and associativity

CTPG parsers can resolve such a conflict based on precedence and associativity rules defined in a grammar.

The example above can be fixed by explicit term definitions.

Normally, char_terms can be introduced by implicit definition in the terms call. However, when a precedence needs to be defined, an explicit definition is required.

Simply change the code to:

constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2

constexpr parser p(
    expr,
    terms(o_plus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),      // note: no need for o_plus and o_mul in the rules, however possible
        expr(expr, '*', expr)
    )
);

The higher the precedence value set, the higher the term precedence. Default term precedence is equal to 0.

This explicit precedence definition gives the * operator greater precedence than +.

Example 2

constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_minus('-', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2

constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '-', expr),   // extra rule allowing binary -
        expr(expr, '*', expr),
        expr('-', expr)          // extra rule allowing unary -
    )
);

Binary - and + operators have the same precedence in pretty much all languages.
Unary - however almost always has a higher precedence than all binary operators.
We can't achieve this by simply defining the - precedence in the char_term definition.
We need a way to tell that expr('-', expr) has a higher precedence than all binary rules.

To achieve this, override the precedence of the term with a precedence on the rule by changing:

expr('-', expr)
to
expr('-', expr)[3]

The [] operator allows exactly this. It explicitly sets the rule precedence so the parser does not have to deduce the rule precedence from a term.

So the final code looks like this:

constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_minus('-', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2

constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '-', expr),   // extra rule allowing binary -
        expr(expr, '*', expr),
        expr('-', expr)[3]       // extra rule allowing unary -, with the highest precedence
    )
);

Example 3

Consider the final code and let's say the input is 2 + 2 + 2, the parser has read 2 + 2 and is about to read the second +.
In this case what is the required behaviour? Should the first 2 + 2 be reduced or should the second + be shifted?
(This may not matter in case of integer calculations, but may make a big difference in situations like expression type deduction in C++ when operator overloading is involved.)

This is the classic associativity case which can be solved by explicitly defining the term associativity.

There are 3 types of associativity available: left to right, right to left and not associative as the default.

To explicitly define a term associativity change the term definitions to:

constexpr char_term o_plus('+', 1, associativity::ltor);
constexpr char_term o_minus('-', 1, associativity::ltor);
constexpr char_term o_mul('*', 2, associativity::ltor);

Now all of these operators are left associative, meaning the reduce will be preferred over shift.

Should the associativity be defined as associativity::rtol, shift would be preferred.

No associativity prefers shift by default.
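With subtraction the choice is visible even for plain integers. The two helper functions below are written for this illustration only (they are not part of CTPG) and evaluate the same operand list with each grouping:

```cpp
#include <cstddef>

// Left to right: reduce wins, so 10 - 4 - 3 groups as (10 - 4) - 3.
template<std::size_t N>
constexpr int group_ltor(const int (&xs)[N])
{
    int acc = xs[0];
    for (std::size_t i = 1; i < N; ++i)
        acc = acc - xs[i];
    return acc;
}

// Right to left: shift wins, so 10 - 4 - 3 groups as 10 - (4 - 3).
template<std::size_t N>
constexpr int group_rtol(const int (&xs)[N])
{
    int acc = xs[N - 1];
    for (std::size_t i = N - 1; i > 0; --i)
        acc = xs[i - 1] - acc;
    return acc;
}

constexpr int operands[] = {10, 4, 3};
static_assert(group_ltor(operands) == 3);
static_assert(group_rtol(operands) == 9);
```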

Precedence and associativity summary

When a shift/reduce conflict is encountered these rules apply in order:

Let r be a rule which is a candidate for reduce and t be a term that is encountered on input.

  1. when explicit r precedence from the [] operator is greater than t precedence, perform a reduce
  2. when precedence of the last term in r is greater than t precedence, perform a reduce
  3. when precedence of the last term in r is equal to t precedence and
    the last term in r is left associative, perform a reduce
  4. otherwise, perform a shift.
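The four steps can be transcribed into a small decision function. This is a literal reading of the list above, not CTPG's implementation: the type and enumerator names are invented for the sketch, and the text does not say whether an explicit rule precedence fully replaces the last term's precedence in steps 2 and 3, so the code simply checks the steps one after another.

```cpp
#include <optional>

enum class assoc { none, ltor, rtol };   // hypothetical names for this sketch
enum class action { shift, reduce };

struct term_info
{
    int precedence;
    assoc associativity;
};

struct rule_info
{
    std::optional<int> explicit_precedence;  // set by the [] operator, if any
    term_info last_term;                     // last term on the rule's right side
};

// Steps 1-4 for a rule r that could be reduced and a lookahead term t.
constexpr action resolve(const rule_info& r, const term_info& t)
{
    if (r.explicit_precedence && *r.explicit_precedence > t.precedence)
        return action::reduce;                                   // step 1
    if (r.last_term.precedence > t.precedence)
        return action::reduce;                                   // step 2
    if (r.last_term.precedence == t.precedence &&
        r.last_term.associativity == assoc::ltor)
        return action::reduce;                                   // step 3
    return action::shift;                                        // step 4
}

constexpr term_info o_plus{1, assoc::ltor};
constexpr term_info o_minus{1, assoc::ltor};
constexpr term_info o_mul{2, assoc::ltor};

constexpr rule_info add_rule{std::nullopt, o_plus};  // expr(expr, '+', expr)
constexpr rule_info mul_rule{std::nullopt, o_mul};   // expr(expr, '*', expr)
constexpr rule_info neg_rule{3, o_minus};            // expr('-', expr)[3]

static_assert(resolve(add_rule, o_mul) == action::shift);    // 2 + 2 . * 2
static_assert(resolve(mul_rule, o_plus) == action::reduce);  // 2 * 2 . + 2
static_assert(resolve(add_rule, o_plus) == action::reduce);  // 2 + 2 . + 2
static_assert(resolve(neg_rule, o_mul) == action::reduce);   // - 2 . * 2
```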

Reduce - reduce conflicts

In some cases the language is ill formed and the parser contains a state in which there is an ambiguity between several reduce actions.

Consider this example:

constexpr nterm<char> op("op");
constexpr nterm<char> special_op("special_op");

constexpr parser p(
    op,
    terms('!', '*', '+'),
    nterms(special_op, op),
    rules(
        special_op('!'),
        op('!'),
        op('*'),
        op('+'),
        op(special_op)
    )
);

Let's say we parse the input !. The parser has no way of telling if it should reduce using rule special_op('!') or op('!').

This is an example of a reduce/reduce conflict and such parser behaviour should be considered undefined.

There is a diagnostic tool included in CTPG which detects such conflicts so they can be addressed.

Functors - advanced

Consider a parser matching white space separated names (strings).


constexpr char pattern[] = "[a-zA-Z0-9_]+";
constexpr regex_term<pattern> name("name");
using name_type = std::string_view;
using list_type = std::vector<name_type>;
constexpr nterm<list_type> list("list");

constexpr parser p(
    list,
    terms(name),
    nterms(list),
    rules(
        list(),
        list(list, name)
    )
);

How exactly would the functors look for this kind of parser?

The first rule list() is an example of an empty rule. This means the list can be reduced from no input.

Since the rule's left side is a list the functor needs to return its value type, which is a list_type.
The right side is empty so the functor needs to have no arguments.

So let's return an empty vector: [](){ return list_type{}; }

The second rule reduces a list from a list and a name, therefore the functor needs to:

  • accept a list_type for the first argument: list
  • accept a term_value<std::string_view> for the second argument: name
  • return a list_type

So let's create a functor:

[](auto&& list, auto&& name){ list.emplace_back(std::move(name)); return list; }

The name argument will resolve to term_value<std::string_view>&&, which is convertible to std::string_view&&.

Now the parser looks like this:


constexpr char pattern[] = "[a-zA-Z0-9_]+";
constexpr regex_term<pattern> name("name");
using name_type = std::string_view;
using list_type = std::vector<name_type>;
constexpr nterm<list_type> list("list");

constexpr parser p(
    list,
    terms(name),
    nterms(list),
    rules(
        list()
            >= [](){ return list_type{}; },
        list(list, name)
            >= [](auto&& list, auto&& name){ list.push_back(name); return std::move(list); }
    )
);

Note: Here we take advantage of move semantics which are supported in the functor calls. This way we are working with the same std::vector instance
we created as empty using the first rule.

Important Note
It is possible for functors to have reference (both const and not) argument types, however the lifetime of the objects passed to functors ends immediately after the functor returns.
So it is better to avoid using reference types as nterm value types.
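The effect of moving the list through each functor call can be reproduced outside the parser. In this sketch the reductions for the input "a b c" are simulated by hand with plain lambdas of the same shape as the functors; make_empty, append, build_names and buffer_is_reused are names made up for the illustration.

```cpp
#include <string_view>
#include <utility>
#include <vector>

using name_type = std::string_view;
using list_type = std::vector<name_type>;

// Same shape as the two functors in the parser, as plain lambdas.
const auto make_empty = []() { return list_type{}; };
const auto append = [](auto&& list, auto&& name)
{
    list.push_back(name);
    return std::move(list);  // hand the same storage on instead of copying it
};

// Simulates the reductions for the input "a b c".
list_type build_names()
{
    list_type l = make_empty();
    l = append(std::move(l), name_type("a"));
    l = append(std::move(l), name_type("b"));
    l = append(std::move(l), name_type("c"));
    return l;
}

// With enough capacity reserved up front, the element buffer allocated once
// stays in use through every call, because each step moves the vector.
bool buffer_is_reused()
{
    list_type l = make_empty();
    l.reserve(8);
    const name_type* buffer = l.data();
    l = append(std::move(l), name_type("a"));
    l = append(std::move(l), name_type("b"));
    return l.data() == buffer;
}
```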

Functor helpers

There are a couple of handy ready to use functor templates:

val

Use when a functor needs to return a value which does not depend on the right side values:

constexpr nterm<bool> binary("binary");

constexpr parser p(
    binary,
    terms('0', '1', '&', '|'),
    nterms(binary),
    rules(
        binary('0')
            >= val(false),
        binary('1')
            >= val(true),
        binary(binary, '&', binary)
            >= [](bool b1, auto, bool b2){ return b1 & b2; },
        binary(b
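The idea behind such a helper can be sketched without CTPG: a callable that stores a value, accepts any arguments and ignores them. The val_sketch class below is a hypothetical re-creation written for this illustration and is not CTPG's implementation of val.

```cpp
// Hypothetical re-creation of a val-like helper; not CTPG's implementation.
template<typename T>
class val_sketch
{
public:
    constexpr explicit val_sketch(T v) : value_(v) {}

    // Accepts and ignores any arguments, always returns the stored value.
    template<typename... Args>
    constexpr T operator()(Args&&...) const { return value_; }

private:
    T value_;
};

constexpr val_sketch<bool> always_false(false);
constexpr val_sketch<bool> always_true(true);

static_assert(always_false('0') == false);
static_assert(always_true('1') == true);
static_assert(always_true() == true);
```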
