Given a rule like the following:
rule: A | B | C
a context-free-grammar parser (like an LL(1) parser) will generate constructions that, given an input string, will deduce which alternative (A, B or C) must be expanded, while a PEG parser will check if the first alternative succeeds
and only if it fails, will it continue with the second or the third one in the
order in which they are written. This makes the choice operator not commutative.
Unlike LL(1) parsers, PEG-based parsers cannot be ambiguous: if a string parses,
it has exactly one valid parse tree. This means that a PEG-based parser cannot
suffer from the ambiguity problems that can arise with LL(1) parsers and with
context-free grammars in general.
PEG parsers are usually constructed as a recursive descent parser in which every
rule in the grammar corresponds to a function in the program implementing the
parser and the parsing expression (the “expansion” or “definition” of the rule)
represents the “code” in said function. Each parsing function conceptually takes
an input string as its argument, and yields one of the following results:
A “success” result. This result indicates that the expression can be parsed by
that rule and the function may optionally move forward or consume one or more
characters of the input string supplied to it.
A “failure” result, in which case no input is consumed.
Notice that “failure” results do not imply that the program is incorrect, nor do
they necessarily mean that the parsing has failed. Since the choice operator is
ordered, a failure very often merely indicates “try the following option”. A
direct implementation of a PEG parser as a recursive descent parser will present
exponential time performance in the worst case, because PEG parsers have
infinite lookahead (this means that they can consider an arbitrary number of
tokens before deciding on a rule). Usually, PEG parsers avoid this exponential
time complexity with a technique called "packrat parsing" [1], which not only
loads the entire program in memory before parsing it but also allows the parser
to backtrack arbitrarily. This is made efficient by memoizing the rules already
matched for each position. The cost of the memoization cache is that the parser
will naturally use more memory than a simple LL(1) parser, which is normally
table-based.
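To make this concrete, here is a minimal sketch in plain Python (not pegen output; the names are invented for illustration) of a packrat-style rule function: each rule is a function of the input and a position, and results are memoized per (rule, position) so that backtracking never repeats work. The result convention matches the one described above: success returns the new position, failure returns None and consumes nothing.

memo = {}

def memoize(rule):
    # cache results per (rule name, starting position); a real parser would
    # reset this cache for every new input
    def wrapper(text, pos):
        key = (rule.__name__, pos)
        if key not in memo:
            memo[key] = rule(text, pos)
        return memo[key]
    return wrapper

@memoize
def letter_a(text, pos):
    # "success": return the new position; "failure": return None, consuming nothing
    return pos + 1 if text.startswith("a", pos) else None

print(letter_a("abc", 0))  # 1
print(letter_a("abc", 1))  # None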
Alternatives are ordered (A | B is not the same as B | A).
If a rule returns a failure, it doesn’t mean that the parsing has failed,
it just means “try something else”.
By default PEG parsers run in exponential time, which can be optimized to linear by
using memoization.
If parsing fails completely (no rule succeeds in parsing all the input text), the
PEG parser doesn’t have a concept of “where the SyntaxError is”.
first_rule: ( 'a' | 'aa' ) 'a'
second_rule: ('aa' | 'a' ) 'a'
In a regular EBNF grammar, both rules specify the language {aa, aaa} but
in PEG, one of these two rules accepts the string aaa but not the string
aa. The other does the opposite – it accepts the string aa
but not the string aaa. The rule ('a'|'aa')'a' does
not accept aaa because 'a'|'aa' consumes the first a, letting the
final a in the rule consume the second, and leaving out the third a.
As the rule has succeeded, no attempt is ever made to go back and let
'a'|'aa' try the second alternative. The expression ('aa'|'a')'a' does
not accept aa because 'aa'|'a' accepts all of aa, leaving nothing
for the final a. Again, the second alternative of 'aa'|'a' is not
tried.
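The behaviour described above can be reproduced with a small hand-rolled sketch in plain Python (not generated by pegen; the helper names are made up), where the group tries its alternatives in order and never backtracks once one of them matches:

def group(text, pos, alternatives):
    # ordered choice: return the end position of the first alternative that matches
    for alt in alternatives:
        if text.startswith(alt, pos):
            return pos + len(alt)
    return None

def first_rule(text):   # ( 'a' | 'aa' ) 'a'
    pos = group(text, 0, ("a", "aa"))
    if pos is not None and text.startswith("a", pos):
        return pos + 1 == len(text)   # the whole input must be consumed
    return False

def second_rule(text):  # ( 'aa' | 'a' ) 'a'
    pos = group(text, 0, ("aa", "a"))
    if pos is not None and text.startswith("a", pos):
        return pos + 1 == len(text)
    return False

print(first_rule("aa"), first_rule("aaa"))    # True False
print(second_rule("aa"), second_rule("aaa"))  # False True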
Caution
The effects of ordered choice, such as the ones illustrated above, may be hidden by many levels of rules.
For this reason, writing rules where an alternative is contained in the next one is in almost all cases a mistake,
for example:
my_rule:
| 'if' expression 'then' block
| 'if' expression 'then' block 'else' block
In this example, the second alternative will never be tried because the first one will
succeed first (even if the input string has an 'else' block that follows). To correctly
write this rule you can simply alter the order:
my_rule:
| 'if' expression 'then' block 'else' block
| 'if' expression 'then' block
In this case, if the input string doesn’t have an 'else' block, the first alternative
will fail and the second will be attempted without said part.
The grammar consists of a sequence of rules of the form:
rule_name: expression
Optionally, a type can be included right after the rule name, which specifies the return type of the C or Python function corresponding to the rule:
rule_name[return_type]: expression
If the return type is omitted, then a void * is returned in C and an Any in Python.
# comment
Python-style comments.
e1 e2
Match e1, then match e2.
rule_name: first_rule second_rule
e1 | e2
Match e1 or e2.
The first alternative can also appear on the line after the rule name
for formatting purposes. In that case, a | must be used before the
first alternative, like so:
rule_name[return_type]:
| first_alt
| second_alt
( e )
Match e.
rule_name: (e)
A slightly more complex and useful example includes using the grouping
operator together with the repeat operators:
rule_name: (e1 e2)*
[ e ] or e?
Optionally match e.
rule_name: [e]
A more useful example includes defining that a trailing comma is
optional:
rule_name: e (',' e)* [',']
e*
Match zero or more occurrences of e.
rule_name: (e1 e2)*
e+
Match one or more occurrences of e.
rule_name: (e1 e2)+
s.e+
Match one or more occurrences of e, separated by s. The generated parse
tree does not include the separator. This is otherwise identical to
(e (s e)*).
rule_name: ','.e+
&e
Succeed if e can be parsed, without consuming any input.
!e
Fail if e can be parsed, without consuming any input.
An example taken from the Python grammar specifies that a primary
consists of an atom, which is not followed by a . or a ( or a
[:
primary: atom !'.' !'(' !'['
~
Commit to the current alternative, even if it fails to parse (this is called the "cut").
rule_name: '(' ~ some_rule ')' | some_alt
In this example, if a left parenthesis is parsed, then the other
alternative won’t be considered, even if some_rule or ) fail to be
parsed.
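The commitment introduced by the cut can be sketched in plain Python as well (again invented helper names, with NUMBER standing in for some_rule):

def parse_rule(tokens, pos=0):
    # rule: '(' ~ NUMBER ')' | NUMBER
    if pos < len(tokens) and tokens[pos] == "(":
        # the cut has been passed: we are committed to this alternative
        ok = (pos + 2 < len(tokens)
              and tokens[pos + 1].isdigit()
              and tokens[pos + 2] == ")")
        return (tokens[pos + 1], pos + 3) if ok else None   # never falls back
    if pos < len(tokens) and tokens[pos].isdigit():          # the plain NUMBER alternative
        return tokens[pos], pos + 1
    return None

print(parse_rule(["(", "42", ")"]))   # ('42', 3)
print(parse_rule(["42"]))             # ('42', 1)
print(parse_rule(["(", "42"]))        # None: the cut prevents trying the second alternative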
rule1: rule2 | 'a'
rule2: rule3 | 'b'
rule3: rule1 | 'c'
and "hidden left-recursion" like:
rule: 'optional'? rule '@' some_other_rule
A subexpression can be named by preceding it with an identifier and an = sign. The name can then be used in the action (see below), like this:
rule_name[return_type]: '(' a=some_other_rule ')' { a }
rule_name[return_type]:
| first_alt1 first_alt2 { first_alt1 }
| second_alt1 second_alt2 { second_alt1 }
If the action is omitted, a default action is generated:
If there’s a single name in the rule, it gets returned.
If there is more than one name in the rule, a collection with all parsed
expressions gets returned (the type of the collection will be different
in C and Python).
This default behaviour is primarily intended for very simple situations and for
debugging purposes.
Warning
It’s important that the actions don’t mutate any AST nodes that are passed
into them via variables referring to other rules. The reason mutation is not
allowed is that the AST nodes are cached by memoization and could
potentially be reused in a different context, where the mutation would be
invalid. If an action needs to change an AST node, it should instead make a
new copy of the node and change that.
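For a Python-targeted grammar, one way to honour this warning is sketched below (assuming, for illustration, an action that receives an ast.BinOp built by another rule): copy the node before changing anything.

import ast
import copy

def flipped_to_subtraction(node):
    # `node` may be cached by the memoization machinery; never mutate it in place
    new_node = copy.deepcopy(node)   # work on a fresh copy instead
    new_node.op = ast.Sub()
    return new_node

tree = ast.parse("1 + 2", mode="eval")
print(ast.dump(flipped_to_subtraction(tree.body)))   # BinOp(..., op=Sub(), ...)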
The full meta-grammar for the grammars supported by the PEG generator is:
start[Grammar]: grammar ENDMARKER { grammar }
grammar[Grammar]:
| metas rules { Grammar(rules, metas) }
| rules { Grammar(rules, []) }
metas[MetaList]:
| meta metas { [meta] + metas }
| meta { [meta] }
meta[MetaTuple]:
| "@" NAME NEWLINE { (name.string, None) }
| "@" a=NAME b=NAME NEWLINE { (a.string, b.string) }
| "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) }
rules[RuleList]:
| rule rules { [rule] + rules }
| rule { [rule] }
rule[Rule]:
| rulename ":" alts NEWLINE INDENT more_alts DEDENT {
Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) }
| rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) }
| rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) }
rulename[RuleName]:
| NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")}
| NAME '[' type=NAME ']' {(name.string, type.string)}
| NAME {(name.string, None)}
alts[Rhs]:
| alt "|" alts { Rhs([alt] + alts.alts)}
| alt { Rhs([alt]) }
more_alts[Rhs]:
| "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) }
| "|" alts NEWLINE { Rhs(alts.alts) }
alt[Alt]:
| items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) }
| items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) }
| items action { Alt(items, action=action) }
| items { Alt(items, action=None) }
items[NamedItemList]:
| named_item items { [named_item] + items }
| named_item { [named_item] }
named_item[NamedItem]:
| NAME '=' ~ item {NamedItem(name.string, item)}
| item {NamedItem(None, item)}
| it=lookahead {NamedItem(None, it)}
lookahead[LookaheadOrCut]:
| '&' ~ atom {PositiveLookahead(atom)}
| '!' ~ atom {NegativeLookahead(atom)}
| '~' {Cut()}
item[Item]:
| '[' ~ alts ']' {Opt(alts)}
| atom '?' {Opt(atom)}
| atom '*' {Repeat0(atom)}
| atom '+' {Repeat1(atom)}
| sep=atom '.' node=atom '+' {Gather(sep, node)}
| atom {atom}
atom[Plain]:
| '(' ~ alts ')' {Group(alts)}
| NAME {NameLeaf(name.string) }
| STRING {StringLeaf(string.string)}
# Mini-grammar for the actions
action[str]: "{" ~ target_atoms "}" { target_atoms }
target_atoms[str]:
| target_atom target_atoms { target_atom + " " + target_atoms }
| target_atom { target_atom }
target_atom[str]:
| "{" ~ target_atoms "}" { "{" + target_atoms + "}" }
| NAME { name.string }
| NUMBER { number.string }
| STRING { string.string }
| "?" { "?" }
| ":" { ":" }
As an illustrative example this simple grammar file allows directly
generating a full parser that can parse simple arithmetic expressions and that
returns a valid C-based Python AST:
start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) }
expr[expr_ty]:
| l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) }
| l=expr '-' r=term { _PyAST_BinOp(l, Sub, r, EXTRA) }
| term
term[expr_ty]:
| l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) }
| l=term '/' r=factor { _PyAST_BinOp(l, Div, r, EXTRA) }
| factor
factor[expr_ty]:
| '(' e=expr ')' { e }
| atom
atom[expr_ty]:
| NAME
| NUMBER
Here EXTRA is a macro that expands to start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena, those being variables automatically
injected by the parser; p points to an object that holds on to all state
for the parser.
A similar grammar written to target Python AST objects:
start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or []) }
expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }
expr:
| l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
| l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
| term
term:
| l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
| l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
| factor
factor:
| '(' e=expr ')' { e }
| atom
atom:
| NAME
| NUMBER
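The actions in a Python-targeted grammar like the one above construct standard ast nodes. As a standalone illustration (this snippet does not involve pegen at all), a hand-built tree of the same shape can be compiled and run once the location information, which EXTRA normally supplies, is filled in:

import ast

tree = ast.Expression(
    body=ast.BinOp(left=ast.Constant(1), op=ast.Add(), right=ast.Constant(2))
)
ast.fix_missing_locations(tree)   # fills in the lineno/col_offset that EXTRA would provide
print(eval(compile(tree, "<built by hand>", "eval")))   # 3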
Pegen is the parser generator used in CPython. It reads the grammar stored in Grammar/Python.gram and produces the final C
parser. It contains the following pieces:
A parser generator that can read a grammar file and produce a PEG parser written in Python or C that can parse
said grammar. The generator is located at Tools/peg_generator/pegen.
A PEG meta-grammar that automatically generates a Python parser that is used for the parser generator itself
(this means that there are no manually-written parsers). The meta-grammar is
located at Tools/peg_generator/pegen/metagrammar.gram.
A generated parser (using the parser generator) that can directly produce C and Python AST objects.
The source code for Pegen lives at Tools/peg_generator/pegen but normally all typical commands to interact
with the parser generator are executed from the main makefile.
To regenerate the C parser (the one used by the interpreter), just execute:
make regen-pegen
using the Makefile in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser or to execute:
./PCbuild/build.bat --regen
The generated parser file is located at Parser/parser.c.
The meta-grammar (the grammar that describes the grammar used for the grammar files themselves) is located at Tools/peg_generator/pegen/metagrammar.gram.
Although it is very unlikely that you will ever need to modify it, if you make any modifications
to this file (in order to implement new Pegen features) you will need to regenerate
the meta-parser (the parser that parses the grammar files). To do so, just execute:
make regen-pegen-metaparser
If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute:
./PCbuild/build.bat --regen
Strings with single quotes (') (e.g. 'class') denote KEYWORDS.
Strings with double quotes (") (e.g. "match") denote SOFT KEYWORDS.
Upper case names (e.g. NAME) denote tokens in the Grammar/Tokens file.
Rule names starting with invalid_ are used for specialized syntax errors.
These rules are NOT used in the first pass of the parser.
Only if the first pass fails to parse, a second pass including the invalid
rules will be executed.
If the parser fails in the second phase with a generic syntax error, the
location of the generic failure of the first pass will be used (this avoids
reporting incorrect locations due to the invalid rules).
The order of the alternatives involving invalid rules matters
(like any rule in PEG).
The Python language needs a custom tokenizer to handle things like indentation
boundaries and special keywords like ASYNC and AWAIT
(for compatibility purposes), backtracking errors (such as unclosed parentheses), dealing with encoding,
interactive mode and much more. Some of these reasons are also there for historical purposes, and some
others are useful even today.
The list of tokens (all uppercase names in the grammar) that you can use can be found in the Grammar/Tokens
file. If you change this file to add new tokens, make sure to regenerate the files by executing:
make regen-token
If you are on Windows you can use the Visual Studio project files to regenerate the tokens or to execute:
./PCbuild/build.bat --regen
How tokens are generated and the rules governing this is completely up to the tokenizer (Parser/tokenizer.c),
and the parser just receives tokens from it.
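As an illustration of the kind of token stream the parser receives, the pure-Python tokenize module (a separate implementation from Parser/tokenizer.c, but using the same token names derived from Grammar/Tokens) can be used to dump the tokens for a snippet:

import io
import tokenize

source = "x = 1 + 2\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))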
Pegen disables memoization by default, except for rules carrying the special marker memo after
the rule name (and type, if present):
rule_name[type] (memo):
...
By selectively turning on memoization for a handful of rules, the parser becomes faster and uses less memory.
Note
Left-recursive rules always use memoization, since the implementation of left-recursion depends on it.
To know if a new rule needs memoization or not, benchmarking is required
(comparing execution times and memory usage of some considerably big files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that allows measuring how much each rule uses
memoization (check the Parser/pegen.c file for more information), but it
needs to be manually activated.
p: The parser structure.
EXTRA: This is a macro that expands to (_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena),
which is normally used to create AST nodes as almost all constructors need these attributes to be provided. All of the
location variables are taken from the location information of the current token.
Hard keywords are reserved everywhere, even in positions where they make no sense (e.g. x = class + 1),
while soft keywords only get a special meaning in context. Trying to use a hard
keyword as a variable will always fail:
>>> class = 3
File "<stdin>", line 1
class = 3
^
SyntaxError: invalid syntax
>>> foo(class=3)
File "<stdin>", line 1
foo(class=3)
^^^^^
SyntaxError: invalid syntax
Soft keywords don't have this limitation if used in a context other than the one where they
are defined as keywords:
>>> match = 45
>>> foo(match="Yeah!")
The
match and case keywords are soft keywords, so that they are recognized as
keywords at the beginning of a match statement or case block respectively, but are
allowed to be used in other places as variable or argument names.
You can get a list of all keywords defined in the grammar from Python:
>>> import keyword
>>> keyword.kwlist
['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
as well as soft keywords:
>>> import keyword
>>> keyword.softkwlist
['_', 'case', 'match']
Caution
Soft keywords can be a bit challenging to manage, as they can be accepted in places you don't intend, given how ordered alternatives behave in PEG parsers (see the consequences of ordered choice section for some background on this). In general, try to define them in places where there are not a lot of alternatives.
The parser raises SyntaxError
exceptions, and this is the main mechanism the parser uses to report custom syntax
error messages.
Note
Tokenizer errors are normally reported by raising exceptions but some special
tokenizer errors such as unclosed parenthesis will be reported only after the
parser finishes without returning anything.
A generic SyntaxError is raised if parsing finishes with a failure result
(NULL in C or None in Python) but no exception has
been raised.
Caution
Positive and negative lookaheads will try to match a token so they will affect
the location of generic syntax errors. Use them carefully at boundaries
between rules.
As the Python grammar was originally written as an LL(1) grammar, this heuristic
has an extremely high success rate, but some PEG features can have small effects,
such as positive lookaheads and
negative lookaheads.
To generate more precise syntax errors, custom rules are used. This is a common practice
also in context free grammars: the parser will try to accept some construct that is known
to be incorrect just to report a specific syntax error for that construct. In pegen grammars,
these rules start with the invalid_ prefix. This is because trying to match these rules
normally has a performance impact on parsing (and can also affect the ‘correct’ grammar itself
in some tricky cases, depending on the ordering of the rules) so the generated parser acts in
two phases:
The first phase will try to parse the input stream without taking into account rules that
start with the invalid_ prefix. If the parsing succeeds it will return the generated AST
and the second phase will not be attempted.
If the first phase failed, a second parsing attempt is done including the rules that start
with an invalid_ prefix. By design this attempt cannot succeed and is only executed
to give the invalid rules a chance to detect specific situations where custom, more precise,
syntax errors can be raised. This also allows trading a bit of performance for more precise
error reporting: given that we know that the input text is invalid, there is no need to be fast, because
the interpreter is going to stop anyway.
Important
When defining invalid rules:
Make sure all custom invalid rules raise SyntaxError exceptions (or a subclass of it).
Make sure all invalid rules start with the invalid_ prefix so they do not
impact the performance of parsing correct Python code.
Make sure the parser doesn’t behave differently for regular rules when you introduce invalid rules
(see the how PEG parsers work section for more information).
You can find a collection of macros to raise specialized syntax errors in the
Parser/pegen.h header file. These macros also allow reporting ranges for
the custom errors, which will be highlighted in the tracebacks displayed
when the error is reported.
Tip
A good way to test whether an invalid rule will be triggered when you expect it to is to check whether introducing
a syntax error after valid code triggers the rule or not. For example:
<valid python code> $ 42
should trigger the syntax error at the $ character. If your rule is not correctly defined this
won't happen. For example, if you try to define a rule to match Python 2 style print statements
in order to produce a better error message and you define it as:
invalid_print: "print" expression
this will seem to work because the parser will correctly parse print(something), because it is valid
code, and the second phase will never execute. But if you try to parse print(something) $ 3, the first pass
of the parser will fail (because of the $) and in the second phase the rule will match
print(something) as print followed by the variable something in parentheses, and the error
will be reported there instead of at the $ character.
The output of the C parser used by CPython, which is generated from the Grammar/Python.gram grammar file, is a Python AST object (using C
structures). This means that the actions in the grammar file generate AST objects
when they succeed. Constructing these objects can be quite cumbersome (see
the AST compiler section for more information
on how these objects are constructed and how they are used by the compiler) so
special helper functions are used. These functions are declared in the
Parser/pegen.h header file and defined in the Parser/action_helpers.c
file. These functions allow you to join AST sequences, get specific elements
from them or to do extra processing on the generated tree.
Caution
Actions must never be used to accept or reject rules. It may be tempting
in some situations to write a very generic rule and then check the generated
AST to decide whether it is valid or not, but this will render the official grammar partially incorrect
(because actions are not included) and will make it more difficult for other
Python implementations to adapt the grammar to their own needs.
As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
custom helper in Parser/action_helpers.c and expose it in the
Parser/pegen.h header file so it can be used from the grammar.
If the parsing succeeds, the parser must return a valid AST object.
Tests for the parser generator itself can be found in the Lib/test/test_peg_generator directory.
To experiment with new grammar rules, it is useful to generate a Python parser first. To do this, you can go to
the Tools/peg_generator/ directory in the CPython repository and manually call the parser generator by executing:
$ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
This will generate a file called parse.py in the same directory that you can use to parse some input:
$ python parse.py file_with_source_code_to_test.py
As the generated
parse.py file is just Python code, you can modify it and add breakpoints to debug or
better understand some complex situations.
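While experimenting, it can also help to compare against what the full grammar produces; this is only an aside (it does not involve the generated parse.py), but the stdlib ast module is backed by the real C parser, so it shows the AST that the actions are expected to build for constructs that already exist:

import ast
print(ast.dump(ast.parse("1 + 2"), indent=2))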
When Python is compiled in debug mode (by adding --with-pydebug when running the configure step on Linux, or by
adding -d when calling the PCbuild/python.bat script on Windows), it is possible to activate a very verbose
mode in the generated parser. This is very useful to debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.
Note
When activating verbose mode in the Python parser, it is better to not use interactive mode as it can be much harder to
understand, because interactive mode involves some special steps compared to regular parsing.
To activate verbose mode you can add the -d flag when executing Python:
$ python -d file_to_test.py
This will print a lot of output to stderr, so it is probably better to dump it to a file for further analysis. The output
consists of trace lines with the following structure:
<indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
Every line is indented by a different amount (<indentation>) depending on how deep the call stack is. The next
character marks the type of the trace:
> indicates that a rule is going to be attempted to be parsed.
- indicates that a rule has failed to be parsed.
+ indicates that a rule has been parsed correctly.
! indicates that an exception or an error has been detected and the parser is unwinding.
The <token_location> part indicates the current index in the token array, the
<rule_name> part indicates what rule is being parsed and the <alternative> part
indicates what alternative within that rule is being attempted.