C++ Compile Time Parser Generator
C++ single header library which takes a language description as C++ code and turns it into an LR(1) table parser with a deterministic finite automaton lexical analyzer, all at compile time.
What is more, the generated parser is itself capable of parsing at compile time.
All it needs is a C++17 compiler!
Contents
- Usage
- Explanation
- Compile time parsing
- LR(1) parser
- Functors – advanced
- Various features
- Regular expressions
- Diagnostics
Usage
The following code demonstrates a simple parser which takes a comma separated list of integer numbers as an argument and prints their sum.
readme_example.cpp
#include "ctpg.hpp"
#include <iostream>
#include <charconv>

using namespace ctpg;
using namespace ctpg::buffers;

constexpr nterm<int> list("list");

constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");

int to_int(const std::string_view& sv)
{
    int i = 0;
    std::from_chars(sv.data(), sv.data() + sv.size(), i);
    return i;
}

constexpr parser p(
    list,
    terms(',', number),
    nterms(list),
    rules(
        list(number) >=
            to_int,
        list(list, ',', number)
            >= [](int sum, char, const auto& n){ return sum + to_int(n); }
    )
);

int main(int argc, char* argv[])
{
    if (argc < 2)
        return -1;
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}
Compile and run:
g++ readme_example.cpp -std=c++17 -o example && example "10, 20, 30"
You should see the output: 60. If incorrect text is supplied as an argument:
g++ readme_example.cpp -std=c++17 -o example && example "1, 2, 3x"
you should see:
[1:8] PARSE: Unexpected character: x
Explanation
Header
#include "ctpg.hpp"
Namespaces
Namespace ctpg is the top namespace. There are a couple of feature namespaces, like buffers.
using namespace ctpg;
using namespace ctpg::buffers;
Terminal symbols
Terminal symbols (short: terms) are symbols used in a grammar definition that are atomic blocks.
Examples of terms from the C++ language are: identifier, '+' operator, various keywords etc.
To define a term use one of the char_term, string_term and regex_term classes.
Here is an example of a regex_term with a common integer number regex pattern.
constexpr char number_pattern[] = "[1-9][0-9]*";
constexpr regex_term<number_pattern> number("number");
The constructor argument ("number") indicates a debug name and can be omitted, however this is not advised.
Names are handy for diagnosing problems with the grammar. If omitted, the name will be set to the pattern string.
Note: the pattern needs to have static linkage to be allowed as a template parameter. This is a C++17 limitation, and CTPG does not support C++20 features yet.
Other types of terms
char_term is used when we need to match things like a + or , operator.
string_term is used when we need to match a whole string, like a language keyword.
Nonterminal symbols
Nonterminal symbols (short: nonterms) are essentially all non atomic symbols in the grammar.
In the C++ language these are things like: expression, class definition, function declaration etc.
To define a nonterm use the nterm class.
constexpr nterm<int> list("list");
The constructor argument ("list") is a debug name as well, like in the case of regex_term.
The difference is that for nterms names are necessary, because they also serve as unique identifiers.
Therefore it is a requirement that nonterm names are unique.
The template parameter <int> in this case is a value type. More on this concept later.
Parser definition
The parser class together with its template deduction guides allows defining parsers using 4 arguments:
- Grammar root – the symbol which is the top level nonterm of the grammar.
- List of all terms
- List of all nonterms
- List of rules
The parser object should be declared as constexpr, which makes all the necessary calculations of the LR(1) table parser happen at compile time.
Let’s break down the arguments:
constexpr parser p(
    list,
    terms(',', number),
    nterms(list),
Grammar root.
When the root symbol gets matched (in this case list) the parse is successful.
Term list.
List of terms enclosed in a terms call. In our case there are two: number and a ,.
Note: the , term is not defined earlier in the code. It is an implicit char_term: the code implicitly converts the char to the char_term class. Therefore char_terms (as well as string_terms) do not have to be defined in advance. Their debug names default to the char (or the string) they represent.
Nonterm list.
List of nonterms enclosed in a nterms call. In our case, just a single list nonterm is enough.
Rules
List of rules enclosed in a rules call.
Each rule is in the form of:
nonterm(symbols...) >= functor
The nonterm part is what is called the left side of the rule. The symbols are called the right side.
The right side can contain any number of nterm objects as well as terms (regex_terms, char_terms or string_terms).
Terms can be in their implicit form, like , in the example. Implicit string_terms are in the form of "strings".
The first rule list(number) indicates that the list nonterm can be parsed using a single number regex term.
The second rule uses what is known as left recursion. In other words, a list can be parsed as a list followed by a , and a number.
Functors
The functors are any callables that accept exactly as many arguments as there are symbols on the right side and return a value type of the left side.
Each nth argument needs to accept a value of the value type of the nth right side symbol.
So in the case of the first to_int functor, it is required to accept a value type of regex_term and return an int.
The second functor is a lambda which accepts 3 arguments: an int for the list, a char for the , and auto for whatever is passed as a value type for the regex_term.
Note: Functors are called in a way that allows taking advantage of move semantics, so declaring their arguments as rvalue references is encouraged.
Value types for terms
Terms, unlike nonterms (which have their value types defined as a template parameter to the nterm definition), have their value types predefined: a term_value<char> for a char_term, and a term_value<std::string_view> for both regex_term and string_term.
The term_value class template is a simple wrapper that is implicitly convertible to its template parameter (either a char or std::string_view).
That is why when providing functors we can simply declare arguments as either a char or a std::string_view.
In our case the to_int functor has a const std::string_view& argument, which accepts a term_value<std::string_view> just fine.
Of course an auto in case of a lambda will always do the trick.
The advantage of declaring functor arguments as a term_value specialization is that we can access other features (like source tracking) using the term_value methods.
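The implicit conversion described above can be illustrated without CTPG. The following is a minimal sketch only: wrapped_value, digit_count and column_of are hypothetical names invented for this illustration, and the class is not CTPG's actual term_value definition. It shows why a functor taking a plain std::string_view still accepts the wrapper, while a functor taking the wrapper can reach extra data.

```cpp
#include <string_view>

// Hypothetical, simplified wrapper; CTPG's real term_value is not reproduced here.
template<typename T>
class wrapped_value
{
public:
    constexpr wrapped_value(T value, int line, int column)
        : value_(value), line_(line), column_(column) {}

    // Implicit conversion to the wrapped type, so a functor may take a plain T.
    constexpr operator T() const { return value_; }

    // Extra information that is only reachable through the wrapper itself.
    constexpr int line() const { return line_; }
    constexpr int column() const { return column_; }

private:
    T value_;
    int line_;
    int column_;
};

// A functor written against the plain type.
constexpr int digit_count(const std::string_view& sv)
{
    return static_cast<int>(sv.size());
}

// A functor written against the wrapper, which can also see the position.
constexpr int column_of(const wrapped_value<std::string_view>& v)
{
    return v.column();
}

constexpr wrapped_value<std::string_view> sample("123", 1, 5);
static_assert(digit_count(sample) == 3);  // binds through the implicit conversion
static_assert(column_of(sample) == 5);    // uses the wrapper directly
```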
Parse method call
Use the parse method with 2 arguments:
- a buffer
- an error stream
Buffers
Use a string_buffer from the buffers namespace to parse a null terminated string or a std::string.
Error stream
A stream reference like std::cerr or any other std::ostream can be passed as the stream argument.
This is where the parse method writes error messages, such as a syntax error.
auto res = p.parse(string_buffer(argv[1]), std::cerr);
Parse return value
The parse method returns an std::optional<T>, where T is the value type of the root symbol.
Use .has_value() and .value() to check and access the result of the parse.
Note: White space characters are skipped by default between consecutive terms.
Compile time parsing
The example code can easily be changed to create an actual constexpr parser.
First, all the functors need to be constexpr.
To achieve this change the to_int function to:
constexpr int to_int(const std::string_view& sv)
{
    int sum = 0;
    for (auto c : sv) { sum *= 10; sum += c - '0'; }
    return sum;
}
The function is now constexpr. The <charconv> header is now unnecessary.
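One way to confirm that the rewritten function is usable at compile time is to call it inside a static_assert. The sketch below is independent of CTPG and repeats the function only so that it compiles on its own. It assumes the input contains nothing but decimal digits, which the number regex guarantees in the parser.

```cpp
#include <string_view>

// Same conversion as above: accumulates decimal digits into an int.
// Assumes sv holds only the characters '0'..'9'.
constexpr int to_int(const std::string_view& sv)
{
    int sum = 0;
    for (auto c : sv) { sum *= 10; sum += c - '0'; }
    return sum;
}

// Evaluated by the compiler; a wrong result would be a compile error.
static_assert(to_int("20") == 20);
static_assert(to_int("1") + to_int("20") + to_int("3") == 24);
```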
Note: To allow constexpr parsing all of the nonterm value types have to be literal types.
Also change main to use cstring_buffer and declare the parse result constexpr.
The error stream argument is also unavailable in constexpr parsing.
int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        constexpr char example_text[] = "1, 20, 3";
        constexpr auto cres = p.parse(cstring_buffer(example_text)); // notice cstring_buffer and no std::cerr output
        std::cout << cres.value() << std::endl;
        return 0;
    }
    auto res = p.parse(string_buffer(argv[1]), std::cerr);
    bool success = res.has_value();
    if (success)
        std::cout << res.value() << std::endl;
    return success ? 0 : -1;
}
Now when no argument is passed to the program, it prints the compile time result of parsing "1, 20, 3".
g++ readme_example.cpp -std=c++17 -o example && example
should print the number 24.
Invalid input in constexpr parsing
If the example_text variable held invalid input, the code cres.value() would throw, because cres is of type std::optional<int> with no value.
Changing the parse call to:
constexpr int cres = p.parse(cstring_buffer(example_text)).value();
would cause a compilation error, because throwing std::bad_optional_access is not allowed in a constant expression.
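The same behaviour of std::optional can be seen without a parser. In this sketch fake_parse is a hypothetical stand-in for a constexpr parse call (it is not part of CTPG) that yields a value only for non-negative input. Checking has_value() in a static_assert is one possible way to turn invalid compile time input into a readable compile error.

```cpp
#include <optional>

// Hypothetical stand-in for a constexpr parse call:
// produces a value for non-negative input and an empty optional otherwise.
constexpr std::optional<int> fake_parse(int x)
{
    if (x < 0)
        return std::nullopt;
    return x;
}

constexpr auto ok = fake_parse(24);
static_assert(ok.has_value() && *ok == 24);

constexpr auto bad = fake_parse(-1);
static_assert(!bad.has_value());

// The following line would not compile, because value() has to throw
// std::bad_optional_access during constant evaluation:
// constexpr int v = fake_parse(-1).value();
```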
LR(1) parser
CTPG uses an LR(1) parser. The name is short for left-to-right scan, rightmost derivation in reverse, and 1 lookahead symbol.
Algorithm
The parser uses a parse table which resembles a state machine.
Here is pseudo code for the algorithm:
struct entry
    int next          // valid if shift
    int rule_length   // valid if reduce
    int nterm_nr      // valid if reduce
    enum kind {success, shift, reduce, error }

bool parse(input, sr_table[states_count][terms_count], goto_table[states_count][nterms_count])
    state = 0
    states.push(state)
    needs_term = true

    while (true)
        if (needs_term)
            term_nr = get_next_term(input)
        entry = sr_table[state][term_nr]
        kind = entry.kind

        if (kind == success)
            return true

        else if (kind == shift)
            needs_term = true
            state = entry.next
            states.push(state)
            continue

        else if (kind == reduce)
            needs_term = false    // the current term is looked up again after a reduce
            states.pop_n(entry.rule_length)
            state = states.top()
            state = goto_table[state][entry.nterm_nr]
            states.push(state)
            continue

        else
            return false
The parser contains a state stack, which grows when the algorithm encounters a shift operation and shrinks on a reduce operation.
Apart from the state stack, there is also a value stack dedicated to parse result calculation.
Each shift pushes a value to the stack and each reduce calls the appropriate functor with values from the value stack, removing those values from the stack and replacing them with a single value associated with the rule's left side.
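The loop above can be sketched in plain C++17 for the list grammar from the Usage section. Everything here is an assumption made for illustration: the tables were written by hand for this one grammar and are not what CTPG generates, the input is already a sequence of term numbers rather than text, and the value stack is left out so that only the state stack is visible.

```cpp
#include <cstddef>

enum class kind { error, shift, reduce, success };

struct entry
{
    kind k;
    int next;         // valid if shift
    int rule_length;  // valid if reduce
    int nterm_nr;     // valid if reduce
};

// Term numbers for the grammar: list -> number | list ',' number
constexpr int NUMBER = 0, COMMA = 1, END = 2;

constexpr entry err()        { return {kind::error,   0, 0, 0}; }
constexpr entry sh(int next) { return {kind::shift,   next, 0, 0}; }
constexpr entry red(int len) { return {kind::reduce,  0, len, 0}; }
constexpr entry acc()        { return {kind::success, 0, 0, 0}; }

// Hand-built shift/reduce table: rows are states, columns are NUMBER, COMMA, END.
constexpr entry sr_table[5][3] = {
    /* 0: start              */ { sh(1), err(),  err()  },
    /* 1: list -> number .   */ { err(), red(1), red(1) },
    /* 2: list seen          */ { err(), sh(3),  acc()  },
    /* 3: list ',' seen      */ { sh(4), err(),  err()  },
    /* 4: list ',' number .  */ { err(), red(3), red(3) },
};

// Goto table for the single nonterm (list); only state 0 ever uses it.
constexpr int goto_table[5][1] = { {2}, {-1}, {-1}, {-1}, {-1} };

constexpr bool parse(const int* terms, std::size_t count)
{
    int states[8] = {0};  // state stack; this grammar never needs more than 4 entries
    int top = 0;          // index of the top of the stack
    std::size_t pos = 0;
    int term_nr = END;
    bool needs_term = true;

    while (true)
    {
        if (needs_term)
        {
            if (pos < count)
            {
                term_nr = terms[pos++];
                if (term_nr != NUMBER && term_nr != COMMA)
                    return false;  // unknown term number
            }
            else
                term_nr = END;
            needs_term = false;
        }

        const entry e = sr_table[states[top]][term_nr];

        if (e.k == kind::success)
            return true;

        if (e.k == kind::shift)
        {
            states[++top] = e.next;  // the stack grows on shift
            needs_term = true;
        }
        else if (e.k == kind::reduce)
        {
            top -= e.rule_length;    // the stack shrinks by the rule length
            const int target = goto_table[states[top]][e.nterm_nr];
            states[++top] = target;
        }
        else
            return false;
    }
}

constexpr int good_input[] = { NUMBER, COMMA, NUMBER, COMMA, NUMBER };  // like "10, 20, 30"
constexpr int bad_input[]  = { NUMBER, COMMA };                          // like "10,"

static_assert(parse(good_input, 5));
static_assert(!parse(bad_input, 2));
```

Adding a second array of values that is pushed and popped in step with states would turn this recognizer into the value-computing parser described above.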
Table creation
This topic is out of scope of this manual. There is plenty of material online on LR parsers.
Recommended book on the topic: Compilers: Principles, Techniques, and Tools
Conflicts
There are situations (parser states) in which, when a particular term is encountered on the input, there is an ambiguity regarding the operation the parser should perform.
In other words, a language grammar may be defined in such a way that both shift and reduce can lead to a successful parse, however the result will be different in each case.
Example 1
Consider a classic expression parser (functors omitted for clarity):
constexpr parser p(
    expr,
    terms('+', '*', number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '*', expr)
    )
);
Consider 2 + 2 * 2 input being parsed and a parser in a state after successfully matching 2 + 2 and encountering the * term.
Both shifting the * term and reducing by the rule expr(expr, '+', expr) would be valid, however they would produce different results.
This is a classic operator precedence case, and this conflict needs to be resolved somehow. This is where precedence and associativity come into play.
Precedence and associativity
CTPG parsers can resolve such conflicts based on precedence and associativity rules defined in the grammar.
The example above can be fixed by explicit term definitions.
Normally, char_terms can be introduced by implicit definition in the terms call. However, when a precedence needs to be defined, an explicit definition is required.
Simply change the code to:
constexpr char_term o_plus('+', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);   // precedence set to 2
constexpr parser p(
    expr,
    terms(o_plus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),  // note: no need for o_plus and o_mul in the rules, however possible
        expr(expr, '*', expr)
    )
);
The higher the precedence value set, the higher the term precedence. Default term precedence is equal to 0.
This explicit precedence definition allows the * operator to have higher precedence than +.
Example 2
constexpr char_term o_plus('+', 1);   // precedence set to 1
constexpr char_term o_minus('-', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);    // precedence set to 2
constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '-', expr),  // extra rule allowing binary -
        expr(expr, '*', expr),
        expr('-', expr)         // extra rule allowing unary -
    )
);
Binary - and + operators have the same precedence in pretty much all languages.
Unary - however almost always has a higher precedence than all binary operators.
We cannot achieve this by simply defining the - precedence in the char_term definition.
We need a way to tell that expr('-', expr) has a higher precedence than all binary rules.
To achieve this, override the precedence in a term by a precedence in a rule, changing:
expr('-', expr)
to
expr('-', expr)[3]
The [] operator allows exactly this. It explicitly sets the rule precedence so the parser does not have to deduce rule precedence from a term.
So the final code looks like this:
constexpr char_term o_plus('+', 1);   // precedence set to 1
constexpr char_term o_minus('-', 1);  // precedence set to 1
constexpr char_term o_mul('*', 2);    // precedence set to 2
constexpr parser p(
    expr,
    terms(o_plus, o_minus, o_mul, number),
    nterms(expr),
    rules(
        expr(number),
        expr(expr, '+', expr),
        expr(expr, '-', expr),  // extra rule allowing binary -
        expr(expr, '*', expr),
        expr('-', expr)[3]      // extra rule allowing unary -, with the highest precedence
    )
);
Example 3
Consider the final code and let us say the input is 2 + 2 + 2, the parser has read 2 + 2 and is about to read the second +.
In this case what is the required behaviour? Should the first 2 + 2 be reduced or should the second + be shifted?
(This may not matter in case of integer calculations, but may make a big difference in situations like expression type deduction in C++ when operator overloading is involved.)
This is the classic associativity case, which can be solved by explicitly defining the term associativity.
There are 3 types of associativity available: left to right, right to left and not associative, which is the default.
To explicitly define a term associativity change the term definitions to:
constexpr char_term o_plus('+', 1, associativity::ltor);
constexpr char_term o_minus('-', 1, associativity::ltor);
constexpr char_term o_mul('*', 2, associativity::ltor);
Now all of these operators are left associative, meaning the reduce will be preferred over shift.
Should the associativity be defined as associativity::rtol, shift would be preferred.
No associativity prefers shift by default.
Precedence and associativity summary
When a shift/reduce conflict is encountered these rules apply in order:
Let r be a rule which is a subject to reduce and t be a term that is encountered on input.
- when explicit r precedence from the [] operator is greater than t precedence, perform a reduce
- when precedence of the last term in r is greater than t precedence, perform a reduce
- when precedence of the last term in r is equal to t precedence and the last term in r is left associative, perform a reduce
- otherwise, perform a shift.
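The ordered rules above can be written down as a small decision function. This is a sketch under stated assumptions, not CTPG's implementation: assoc, action, term_info and resolve are hypothetical names, the absence of a [] precedence is modelled with an empty std::optional, and the four checks are applied literally in the listed order.

```cpp
#include <optional>

enum class assoc { none, ltor, rtol };
enum class action { shift, reduce };

// Precedence and associativity of a single term (hypothetical helper type).
struct term_info
{
    int precedence;
    assoc associativity;
};

// rule_precedence: explicit [] precedence of rule r, empty if none was given.
// last_term:       the last term appearing in rule r.
// t:               the term encountered on input.
constexpr action resolve(std::optional<int> rule_precedence, term_info last_term, term_info t)
{
    if (rule_precedence && *rule_precedence > t.precedence)
        return action::reduce;
    if (last_term.precedence > t.precedence)
        return action::reduce;
    if (last_term.precedence == t.precedence && last_term.associativity == assoc::ltor)
        return action::reduce;
    return action::shift;
}

constexpr term_info o_plus{1, assoc::ltor};
constexpr term_info o_minus{1, assoc::ltor};
constexpr term_info o_mul{2, assoc::ltor};

// 2 + 2 . * 2 : '*' binds tighter than '+', so shift.
static_assert(resolve(std::nullopt, o_plus, o_mul) == action::shift);
// 2 * 2 . + 2 : the finished multiplication is reduced first.
static_assert(resolve(std::nullopt, o_mul, o_plus) == action::reduce);
// 2 + 2 . + 2 : equal precedence and left associative, so reduce.
static_assert(resolve(std::nullopt, o_plus, o_plus) == action::reduce);
// - 2 . * 2 with expr('-', expr)[3] : the explicit rule precedence wins, so reduce.
static_assert(resolve(3, o_minus, o_mul) == action::reduce);
```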
Reduce – reduce conflicts
In some cases the language is ill formed and the parser contains a state in which there is an ambiguity between several reduce actions.
Consider an example grammar which contains both the rule special_op('!') and the rule op('!').
Let us say we parse an input !. The parser has no way of telling if it should reduce using rule special_op('!') or op('!').
This is an example of a reduce/reduce conflict and such parser behaviour should be considered undefined.
There is a diagnostic tool included in CTPG which detects such conflicts so they can be addressed.
Functors – advanced
Consider a parser matching white space separated names (strings).
constexpr char pattern[] = "[a-zA-Z0-9_]+";
constexpr regex_term<pattern> name("name");
using name_type = std::string_view;
using list_type = std::vector<name_type>;
constexpr nterm<list_type> list("list");
constexpr parser p(
    list,
    terms(name),
    nterms(list),
    rules(
        list(),
        list(list, name)
    )
);
How exactly would the functors look for this kind of parser?
The first rule list() is an example of an empty rule. This means the list can be reduced from no input.
Since the rule's left side is a list, the functor needs to return its value type, which is a list_type.
The right side is empty so the functor needs to have no arguments.
So let us return an empty vector: [](){ return list_type{}; }
The second rule reduces a list from a list and a name, therefore the functor needs to:
- accept a list_type for the first argument: list
- accept a term_value<std::string_view> for the second argument: name
- return a list_type
So let us create a functor:
[](auto&& list, auto&& name){ list.emplace_back(std::move(name)); return std::move(list); }
The name argument will resolve to term_value<std::string_view>, which is convertible to std::string_view&&.
Now the parser looks like this:
constexpr char pattern[] = "[a-zA-Z0-9_]+";
constexpr regex_term<pattern> name("name");
using name_type = std::string_view;
using list_type = std::vector<name_type>;
constexpr nterm<list_type> list("list");
constexpr parser p(
    list,
    terms(name),
    nterms(list),
    rules(
        list()
            >= [](){ return list_type{}; },
        list(list, name)
            >= [](auto&& list, auto&& name){ list.push_back(name); return std::move(list); }
    )
);
Note: Here we take advantage of move semantics, which are supported in the functor calls. This way we are working with the same std::vector instance we created as empty using the first rule.
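The effect of moving the list through the functors can be tried out without a parser. In the sketch below the two lambdas mirror the functors above, while build and reuses_storage are hypothetical helpers that stand in for the reduce steps CTPG would perform. They are assumptions made for illustration and say nothing about how CTPG calls functors internally.

```cpp
#include <cassert>
#include <initializer_list>
#include <string_view>
#include <utility>
#include <vector>

using name_type = std::string_view;
using list_type = std::vector<name_type>;

// The two functors from the rules, as stand-alone lambdas.
inline const auto make_list = [](){ return list_type{}; };
inline const auto append_name = [](auto&& list, auto&& name){ list.push_back(name); return std::move(list); };

// Hypothetical driver: one reduce by list(), then one reduce by list(list, name) per name.
inline list_type build(std::initializer_list<name_type> names)
{
    list_type value = make_list();
    for (name_type n : names)
        value = append_name(std::move(value), std::move(n));
    return value;
}

// Checks that the vector buffer is handed over rather than copied.
// Capacity is reserved up front so that push_back does not reallocate.
inline bool reuses_storage()
{
    list_type v = make_list();
    v.reserve(4);
    const name_type* before = v.data();
    v = append_name(std::move(v), name_type("a"));
    return v.data() == before;
}
```

Calling build({"a", "b", "c"}) yields a vector holding three names, and on common standard library implementations reuses_storage() reports that the buffer was kept across the functor call.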
Important Note
It is possible for functors to have reference (both const and non-const) argument types, however the lifetime of the objects passed to functors ends immediately after the functor returns.
So it is better to avoid using reference types as nterm value types.
Functor helpers
There are a couple of handy, ready to use functor templates:
val
Use when a functor needs to return a value which does not depend on the left side: