Improve the documentation for structured generation
rlouf committed Mar 25, 2024
1 parent d825d0c commit aed9d21
Showing 13 changed files with 258 additions and 130 deletions.
1 change: 1 addition & 0 deletions docs/api/guide.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
::: outlines.fsm.guide
62 changes: 62 additions & 0 deletions docs/reference/cfg.md
@@ -1 +1,63 @@
# Grammar-structured generation

You can pass any context-free grammar in the EBNF format and Outlines will generate an output that conforms to this grammar:

```python
from outlines import models, generate

arithmetic_grammar = """
?start: expression
?expression: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER
| "-" factor
| "(" expression ")"
%import common.NUMBER
"""

model = models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = generate.cfg(model, arithmetic_grammar)
sequence = generator(
"Alice had 4 apples and Bob ate 2. "
+ "Write an expression for Alice's apples:"
)

print(sequence)
# (8-2)
```
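To make "conforms to this grammar" concrete, here is a minimal standard-library sketch of a recognizer for the arithmetic grammar above. It is not how Outlines or Lark process grammars; `tokenize` and `conforms` are hypothetical helper names, and the number token is a simplification of Lark's `common.NUMBER` (unsigned integers and simple decimals only). Whitespace is rejected because the grammar shown does not ignore it.

```python
import re

# Simplified tokens: unsigned numbers, the four operators, and parentheses.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[-+*/()]")


def tokenize(text):
    """Split text into tokens, raising ValueError on any other character."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(0))
        pos = match.end()
    return tokens


def conforms(text):
    """Return True if text can be derived from the arithmetic grammar."""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expression():
        # expression: term (("+" | "-") term)*
        if not term():
            return False
        while peek() in ("+", "-"):
            advance()
            if not term():
                return False
        return True

    def term():
        # term: factor (("*" | "/") factor)*
        if not factor():
            return False
        while peek() in ("*", "/"):
            advance()
            if not factor():
                return False
        return True

    def factor():
        # factor: NUMBER | "-" factor | "(" expression ")"
        token = peek()
        if token is None:
            return False
        if token == "-":
            advance()
            return factor()
        if token == "(":
            advance()
            if not expression() or peek() != ")":
                return False
            advance()
            return True
        if token[0].isdigit():
            advance()
            return True
        return False

    # The whole input must be consumed for the text to conform.
    return expression() and pos == len(tokens)
```

With this sketch, `conforms("(8-2)")` holds while `conforms("8-")` does not. Grammar-structured generation enforces the same property during decoding instead of checking it afterwards.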

!!! Note "Performance"

The implementation of grammar-structured generation in Outlines is very naive. This does not reflect the performance of [.txt](https://dottxt.co)'s product, where we made grammar-structured generation as fast as regex-structured generation.


## Ready-to-use grammars

Outlines contains a (small) library of grammars that can be imported and used directly. We can rewrite the previous example as:

```python
import outlines
from outlines import models, generate

arithmetic_grammar = outlines.grammars.arithmetic

model = models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = generate.cfg(model, arithmetic_grammar)
sequence = generator(
"Alice had 4 apples and Bob ate 2. "
+ "Write an expression for Alice's apples:"
)

print(sequence)
# (8-2)
```

The following grammars are currently available:

- Arithmetic grammar via `outlines.grammars.arithmetic`
- JSON grammar via `outlines.grammars.json`

If you would like more grammars to be added to the repository, please open an [issue](https://github.com/outlines-dev/outlines/issues) or a [pull request](https://github.com/outlines-dev/outlines/pulls).
18 changes: 10 additions & 8 deletions docs/reference/choices.md
@@ -1,14 +1,16 @@
# Multiple choices

Choice between different options
In some cases we know the output is to be chosen between different options. We can restrict the completion’s output to these choices using the is_in keyword argument:
Outlines allows you to make sure the generated text is chosen from a fixed set of options:

```python
import outlines.models as models
from outlines import models, generate

model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.choice(model, ["skirt", "dress", "pen", "jacket"])
answer = generator("Pick the odd word out: skirt, dress, pen, jacket")

complete = models.openai("gpt-3.5-turbo")
answer = complete(
"Pick the odd word out: skirt, dress, pen, jacket",
is_in=["skirt", "dress", "pen", "jacket"]
)
```

!!! Note "Performance"

`generate.choice` computes an index that helps Outlines guide generation. This can take some time, but only needs to be done once. If you want to generate from the same list of choices several times, make sure that you only call `generate.choice` once.
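One way to think about a choice constraint is as a regular expression that is an alternation of the allowed options. The sketch below uses only the standard library and is an illustration, not Outlines' implementation; `choices_to_regex` and `is_allowed` are hypothetical names. Compiling the pattern once and reusing it mirrors the advice above about building the generator only once.

```python
import re


def choices_to_regex(choices):
    """Build an alternation that matches exactly one of the choices."""
    # Escape each option so characters such as "." or "(" are matched literally.
    return "(" + "|".join(re.escape(choice) for choice in choices) + ")"


choices = ["skirt", "dress", "pen", "jacket"]

# Compile once, then reuse for every answer that needs checking.
pattern = re.compile(choices_to_regex(choices))


def is_allowed(answer):
    """Return True if the whole answer is one of the choices."""
    return pattern.fullmatch(answer) is not None
```

Here `is_allowed("pen")` holds, while `is_allowed("pencil")` does not because the whole string has to match one option.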
74 changes: 14 additions & 60 deletions docs/reference/custom_fsm_ops.md
@@ -1,83 +1,37 @@
# Custom FSM Operations

```RegexFSM.from_interegular_fsm``` leverages the flexibility of ```interegular.FSM``` to use the available operations in ```interegular```.
Outlines is fast because it compiles regular expressions into an index ahead of inference. To do so we use the equivalence between regular expressions and Finite State Machines (FSMs), and the library [interegular](https://github.com/MegaIng/interegular) to perform the translation.

## Examples
Alternatively, one can pass an FSM built using `interegular` directly to structure the generation.

### ```difference```
## Example

Returns an FSM which recognises only the strings recognised by the first FSM in the list, but none of the others.
### Using the `difference` operation

In the following example we build an FSM that recognizes only the strings that match the first regular expression but not the second. In particular, it will prevent the words "pink" and "elephant" from being generated:

```python
import interegular
from outlines import models, generate


# Raw string so that backslashes reach the regex parser unchanged
list_of_strings_pattern = r"""\["[^"\s]*"(?:,"[^"\s]*")*\]"""
pink_elephant_pattern = """.*(pink|elephant).*"""

list_of_strings_fsm = interegular.parse_pattern(list_of_strings_pattern).to_fsm()
pink_elephant_fsm = interegular.parse_pattern(pink_elephant_pattern).to_fsm()

list_of_strings_fsm.accepts('["a","pink","elephant"]')
# True

difference_fsm = list_of_strings_fsm - pink_elephant_fsm

difference_fsm.accepts('["a","pink","elephant"]')
# False
difference_fsm.accepts('["a","blue","donkey"]')
# True
```

### ```union```

Returns a finite state machine which accepts any sequence of symbols that is accepted by either self or other.

```python
list_of_strings_pattern = """\["[^"\s]*"(?:,"[^"\s]*")*\]"""
tuple_of_strings_pattern = """\("[^"\s]*"(?:,"[^"\s]*")*\)"""

list_of_strings_fsm = interegular.parse_pattern(list_of_strings_pattern).to_fsm()
tuple_of_strings_fsm = interegular.parse_pattern(tuple_of_strings_pattern).to_fsm()

list_of_strings_fsm.accepts('("a","pink","elephant")')
# False

union_fsm = list_of_strings_fsm|tuple_of_strings_fsm

union_fsm.accepts('["a","pink","elephant"]')
# True
union_fsm.accepts('("a","blue","donkey")')
# True
model = models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = generate.fsm(model, difference_fsm)
response = generator("Don't talk about pink elephants")
```

### ```intersection```

Returns an FSM which accepts any sequence of symbols that is accepted by both of the original FSMs.

```python
list_of_strings_pattern = """\["[^"\s]*"(?:,"[^"\s]*")*\]"""
pink_elephant_pattern = """.*(pink|elephant).*"""

list_of_strings_fsm = interegular.parse_pattern(list_of_strings_pattern).to_fsm()
pink_elephant_fsm = interegular.parse_pattern(pink_elephant_pattern).to_fsm()

list_of_strings_fsm.accepts('["a","blue","donkey"]')
# True

intersection_fsm = list_of_strings_fsm & pink_elephant_fsm

intersection_fsm.accepts('["a","pink","elephant"]')
# True
intersection_fsm.accepts('["a","blue","donkey"]')
# False
```

_There are more operations available, we refer to https://github.com/MegaIng/interegular/blob/master/interegular/fsm.py._

# Loading Custom FSM

```python
import outlines

generator = outlines.generate.fsm(model, custom_fsm)

response = generator(prompt)
```
To see the other operations available, consult [interegular's documentation](https://github.com/MegaIng/interegular/blob/master/interegular/fsm.py).
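The semantics of the difference operation can be reproduced with the standard library's `re` module: a string belongs to the difference when it fully matches the first pattern and does not fully match the second. This is only a sketch for checking finished strings, with `in_difference` as a hypothetical helper name; it does not build an FSM and cannot guide generation the way `generate.fsm` does.

```python
import re

# The same two patterns as in the interegular example, as raw strings.
list_of_strings = re.compile(r'\["[^"\s]*"(?:,"[^"\s]*")*\]')
pink_elephant = re.compile(r".*(pink|elephant).*")


def in_difference(text):
    """Accept text matched by the first pattern but not by the second."""
    return (
        list_of_strings.fullmatch(text) is not None
        and pink_elephant.fullmatch(text) is None
    )
```

For instance, `in_difference('["a","blue","donkey"]')` holds, whereas `in_difference('["a","pink","elephant"]')` does not.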
52 changes: 45 additions & 7 deletions docs/reference/json.md
@@ -1,4 +1,4 @@
# Make the LLM follow a JSON Schema
# JSON structured generation

Outlines can make any open source model return a JSON object that follows a structure that is specified by the user. This is useful whenever we want the output of the model to be processed by code downstream: code does not understand natural language but rather the structured language it has been programmed to understand.

@@ -16,8 +16,7 @@ Outlines can infer the structure of the output from a Pydantic model. The result
```python
from pydantic import BaseModel

from outlines import models
from outlines import text
from outlines import models, generate


class User(BaseModel):
@@ -27,20 +26,59 @@ class User(BaseModel):


model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = text.generate.json(model, User)
result = generator("Create a user profile with the fields name, last_name and id")
generator = generate.json(model, User)
result = generator(
"Create a user profile with the fields name, last_name and id"
)
print(result)
# User(name="John", last_name="Doe", id=11)
```

!!! warning "JSON and whitespaces"
!!! Note "JSON and whitespaces"

By default Outlines lets the model choose the number of line breaks and whitespace characters used to structure the JSON. Small models tend to struggle with this, in which case we recommend setting the parameter `whitespace_pattern` to the empty string:

```python
generator = text.generate.json(model, User, whitespace_pattern="")
generator = generate.json(model, User, whitespace_pattern="")
```

!!! Note "Performance"

`generate.json` computes an index that helps Outlines guide generation. This can take some time, but only needs to be done once. If you want to generate several times with the same schema, make sure that you only call `generate.json` once.


## Using a JSON Schema

Instead of a Pydantic model you can pass a string that represents a [JSON Schema](https://json-schema.org/) specification to `generate.json`:

```python
from outlines import models, generate

model = models.transformers("mistralai/Mistral-7B-v0.1")

schema = """
{
"title": "User",
"type": "object",
"properties": {
"name": {"type": "string"},
"last_name": {"type": "string"},
"id": {"type": "integer"}
}
}
"""

generator = generate.json(model, schema)
result = generator(
"Create a user profile with the fields name, last_name and id"
)
print(result)
# {'name': 'John', 'last_name': 'Doe', 'id': 11}
```
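To illustrate what following the schema means for the result, here is a minimal standard-library check of a parsed result against the schema above. It is a sketch under stated assumptions: the result is assumed to be a dictionary, only the `string` and `integer` types used above are handled, and every listed property is treated as required, which is stricter than the schema as written. `matches_schema` and `PYTHON_TYPES` are hypothetical names, not part of Outlines.

```python
import json

schema = json.loads("""
{
  "title": "User",
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "last_name": {"type": "string"},
    "id": {"type": "integer"}
  }
}
""")

# Only the JSON Schema types that appear in the schema above are covered.
PYTHON_TYPES = {"string": str, "integer": int}


def matches_schema(result, schema):
    """Check that result has every property with the expected type."""
    if not isinstance(result, dict):
        return False
    for name, spec in schema["properties"].items():
        # Assumption: every listed property must be present.
        if name not in result:
            return False
        value = result[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            return False
        if not isinstance(value, PYTHON_TYPES[spec["type"]]):
            return False
    return True
```

A result such as `{"name": "John", "last_name": "Doe", "id": 11}` passes this check, while the same dictionary with `"id": "11"` does not.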

## From a function's signature

Outlines can infer the structure of the output from the signature of a function. The result is a dictionary, and can be passed directly to the function using the usual dictionary expansion syntax `**`:
17 changes: 17 additions & 0 deletions docs/reference/json_mode.md
@@ -0,0 +1,17 @@
# JSON mode

Outlines can guarantee that the LLM will generate valid JSON, using [Grammar-structured generation](cfg.md):

```python
import outlines
from outlines import models, generate

json_grammar = outlines.grammars.json

model = models.transformers("mistralai/Mistral-7B-v0.1")
generator = generate.cfg(model, json_grammar)
sequence = generator("Generate valid JSON")
```

!!! Note "JSON that follows a schema"

If you want to guarantee that the generated JSON follows a given schema, consult [this section](json.md) instead.
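As a quick illustration of what valid JSON means here, the standard library's `json` module can check a finished string after the fact. This is only a sketch with a hypothetical helper name, `is_valid_json`; it accepts any document that `json.loads` can parse and says nothing about a schema.

```python
import json


def is_valid_json(text):
    """Return True if text parses as a JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

For example, `is_valid_json('{"a": [1, 2, {"b": null}]}')` holds, while a string that uses single quotes or is missing a closing brace is rejected.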
26 changes: 26 additions & 0 deletions docs/reference/regex.md
@@ -1 +1,27 @@
# Regular expressions

Outlines can guarantee that the text generated by the LLM matches a regular expression:

```python
from outlines import models, generate

model = models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

generator = generate.regex(
model,
r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)

prompt = "What is the IP address of the Google DNS servers? "
answer = generator(prompt, max_tokens=30)

print(answer)
# What is the IP address of the Google DNS servers?
# 2.2.6.1
```

If you are using `generate.regex` only to restrict the type of the answer, consider [type-structured generation](types.md) instead.

!!! Note "Performance"

`generate.regex` computes an index that helps Outlines guide generation. This can take some time, but only needs to be done once. If you want to generate several times using the same regular expression make sure that you only call `generate.regex` once.
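The guarantee can be stated with the standard library: the whole generated answer must match the pattern, which is what `re.fullmatch` tests. The sketch below applies the IPv4 pattern from the example to a few strings; `is_ip` is a hypothetical helper name, and the code checks finished strings rather than constraining a model.

```python
import re

# Same pattern as in the generate.regex example above.
IP_PATTERN = re.compile(
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"
)


def is_ip(text):
    """Return True if the whole text matches the IPv4 pattern."""
    return IP_PATTERN.fullmatch(text) is not None
```

Strings such as `"2.2.6.1"` and `"8.8.8.8"` match, while `"256.1.1.1"` and `"1.2.3"` do not.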
5 changes: 5 additions & 0 deletions docs/reference/samplers.md
@@ -108,3 +108,8 @@ answer = generator("What is 2+2?")
print(answer)
# 4
```


!!! Warning "Compatibility"

Only models from the `transformers` and `exllamav2` libraries are compatible with Beam Search.