From 868868f3dd5f958a8836cd6995184784138dac07 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?R=C3=A9mi=20Louf?=
Date: Thu, 11 Apr 2024 12:13:10 +0200
Subject: [PATCH] Add a small grammar guide

---
 docs/reference/cfg.md | 86 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/docs/reference/cfg.md b/docs/reference/cfg.md
index 25cdbcf10..4f0285c11 100644
--- a/docs/reference/cfg.md
+++ b/docs/reference/cfg.md
@@ -61,3 +61,89 @@ The following grammars are currently available:
 - JSON grammar via `outlines.grammars.json`
 
 If you would like more grammars to be added to the repository, please open an [issue](https://github.com/outlines-dev/outlines/issues) or a [pull request](https://github.com/outlines-dev/outlines/pulls).
+
+
+## Grammar guide
+
+A grammar is a list of rules and terminals that define a *language*:
+
+- Terminals define the vocabulary of the language; each may be a string, a regular expression, or a combination of these and other terminals.
+- Rules define the structure of that language; each is a sequence of terminals and other rules.
+
+Outlines uses the [Lark library](https://github.com/lark-parser/lark) to make Large Language Models generate text that belongs to the language defined by a grammar. It therefore expects grammars written in the format that Lark understands, which is based on the [EBNF syntax](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form). Read the [Lark documentation](https://lark-parser.readthedocs.io/en/stable/grammar.html) for more details on grammars; what follows is a small primer that should help you get started.
+
+In the following we will define a [LOGO-like toy language](https://github.com/lark-parser/lark/blob/master/examples/turtle_dsl.py) for Python's [turtle](https://docs.python.org/3/library/turtle.html) library.
+
+### Terminals
+
+A turtle can take four different `MOVEMENT` instructions: forward (`f`), backward (`b`), turn right (`r`) and turn left (`l`). It can take a `NUMBER` of steps in each direction, and draw lines in a specified `COLOR`. These define the vocabulary of our language:
+
+```ebnf
+MOVEMENT: "f"|"b"|"r"|"l"
+COLOR: LETTER+
+
+%import common.LETTER
+%import common.INT -> NUMBER
+%import common.WS
+%ignore WS
+```
+
+The lines that start with `%` are called directives. They let us import predefined terminals and rules, such as `LETTER` and `INT` (which we rename to `NUMBER`). In `LETTER+`, the `+` works as in a regular expression and indicates that a `COLOR` is made of at least one `LETTER`. The last two lines import the whitespace terminal `WS` and tell the parser to ignore it in the input.
+
+### Rules
+
+We now need to define our rules by decomposing the instructions we can send to the turtle via our Python program. At each line of the program, we can either choose a direction and execute a given number of steps, or change the color used to draw the pattern. We can also choose to start filling, make a series of moves, and stop filling. Finally, we can choose to repeat a series of moves.
+
+We can easily write the first two rules:
+
+```ebnf
+instruction: MOVEMENT NUMBER -> movement
+           | "c" COLOR [COLOR] -> change_color
+```
+
+where `movement` and `change_color` are aliases for the two alternatives of the rule. Whitespace between elements denotes concatenation, `|` denotes a choice between alternatives, and square brackets mark an optional element. The `fill` and `repeat` rules are slightly more complex, since they apply to a code block, which is made of one or more instructions. We thus define a new `code_block` rule that refers to `instruction` and finish implementing our rules:
+
+```ebnf
+instruction: MOVEMENT NUMBER -> movement
+           | "c" COLOR [COLOR] -> change_color
+           | "fill" code_block -> fill
+           | "repeat" NUMBER code_block -> repeat
+
+code_block: "{" instruction+ "}"
+```
+
+We can now write the full grammar:
+
+```ebnf
+start: instruction+
+
+instruction: MOVEMENT NUMBER -> movement
+           | "c" COLOR [COLOR] -> change_color
+           | "fill" code_block -> fill
+           | "repeat" NUMBER code_block -> repeat
+
+code_block: "{" instruction+ "}"
+
+MOVEMENT: "f"|"b"|"r"|"l"
+COLOR: LETTER+
+
+%import common.LETTER
+%import common.INT -> NUMBER
+%import common.WS
+%ignore WS
+```
+
+Notice the `start` rule, which defines the starting point of the grammar, i.e. the rule with which a program must start. This full grammar allows us to parse programs such as:
+
+```text
+c red yellow
+    fill { repeat 36 {
+        f200 l170
+    }}
+```
+
+The result of the parse, the parse tree, can then easily be translated into a Python program that uses the `turtle` library to draw a pattern.
+
+### Next steps
+
+This section provides a very brief overview of grammars and their possibilities. Check out the [Lark documentation](https://lark-parser.readthedocs.io/en/stable/index.html) for more thorough explanations and more examples.
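
To make the link between the turtle grammar's rules and a parse tree more concrete, here is a minimal sketch of a hand-written recursive-descent parser for the same toy language, using only the standard library. It is not how Lark or Outlines process grammars. The names `tokenize`, `parse_program`, `parse_block`, `parse_instruction` and `to_turtle_calls` are hypothetical helpers introduced for illustration, and the tree is made of plain tuples rather than Lark `Tree` objects. The sketch also assumes a simplified lexer: a word following `c COLOR` is read as the optional second color unless it is a keyword or a movement letter.

```python
import re

# Terminals: NUMBER (digits), words (LETTER+) and braces; whitespace is skipped.
TOKEN_RE = re.compile(r"\s*([0-9]+|[A-Za-z]+|[{}])")
MOVEMENTS = {"f": "forward", "b": "backward", "l": "left", "r": "right"}
KEYWORDS = set(MOVEMENTS) | {"c", "fill", "repeat"}

PROGRAM = """
c red yellow
    fill { repeat 36 {
        f200 l170
    }}
"""


def tokenize(source):
    """Split a program into tokens, ignoring whitespace (like `%ignore WS`)."""
    source, tokens, pos = source.rstrip(), [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def expect(tokens, pos, check, what):
    """Return tokens[pos] if it satisfies `check`, else raise SyntaxError."""
    if pos >= len(tokens) or not check(tokens[pos]):
        raise SyntaxError(f"expected {what} at token {pos}")
    return tokens[pos]


def parse_block(tokens, pos, braces):
    """`instruction+`, wrapped in "{" ... "}" when `braces` is true (code_block)."""
    if braces:
        expect(tokens, pos, lambda t: t == "{", "'{'")
        pos += 1
    items = []
    while pos < len(tokens) and tokens[pos] != "}":
        node, pos = parse_instruction(tokens, pos)
        items.append(node)
    if not items:
        raise SyntaxError("expected at least one instruction")
    if braces:
        expect(tokens, pos, lambda t: t == "}", "'}'")
        pos += 1
    return items, pos


def parse_instruction(tokens, pos):
    """One alternative of the `instruction` rule, chosen from the first token."""
    head = tokens[pos]
    if head in MOVEMENTS:  # MOVEMENT NUMBER -> movement
        steps = expect(tokens, pos + 1, str.isdigit, "NUMBER")
        return ("movement", head, int(steps)), pos + 2
    if head == "c":  # "c" COLOR [COLOR] -> change_color
        pen = expect(tokens, pos + 1, str.isalpha, "COLOR")
        pos, fill = pos + 2, None
        # Simplification: the next word is the optional COLOR unless reserved.
        if pos < len(tokens) and tokens[pos].isalpha() and tokens[pos] not in KEYWORDS:
            fill, pos = tokens[pos], pos + 1
        return ("change_color", pen, fill), pos
    if head == "fill":  # "fill" code_block -> fill
        block, pos = parse_block(tokens, pos + 1, braces=True)
        return ("fill", block), pos
    if head == "repeat":  # "repeat" NUMBER code_block -> repeat
        count = expect(tokens, pos + 1, str.isdigit, "NUMBER")
        block, pos = parse_block(tokens, pos + 2, braces=True)
        return ("repeat", int(count), block), pos
    raise SyntaxError(f"unexpected token {head!r}")


def parse_program(source):
    """`start: instruction+` over the whole input."""
    tokens = tokenize(source)
    tree, pos = parse_block(tokens, 0, braces=False)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def to_turtle_calls(tree):
    """Flatten the tree into source lines for the standard `turtle` module."""
    calls = []
    for node in tree:
        kind = node[0]
        if kind == "movement":
            calls.append(f"turtle.{MOVEMENTS[node[1]]}({node[2]})")
        elif kind == "change_color":
            colors = [c for c in node[1:] if c is not None]
            calls.append(f"turtle.color({', '.join(map(repr, colors))})")
        elif kind == "fill":
            calls += ["turtle.begin_fill()", *to_turtle_calls(node[1]), "turtle.end_fill()"]
        elif kind == "repeat":
            calls += to_turtle_calls(node[2]) * node[1]
    return calls


tree = parse_program(PROGRAM)
calls = to_turtle_calls(tree)
print(tree[0])
print(calls[:4])
```

Each parsing function mirrors one rule of the grammar, and the tuple tags reuse the aliases (`movement`, `change_color`, `fill`, `repeat`). The translation step emits strings instead of calling `turtle` directly so that the sketch runs without opening a drawing window.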