Generate float from JSON Schema #882

Closed
wants to merge 7 commits into from

Contributor

@eitanturok eitanturok commented May 9, 2024

TLDR

Outlines currently fails to generate floats from a JSON schema. This PR fixes that.

The Problem

Here is the smallest working example of where this issue arises.

Import these packages

# !pip install outlines torch transformers datasets accelerate pyairports pycountry
import json
from outlines import models, generate
from transformers import AutoModelForCausalLM, AutoTokenizer

and consider a function that takes in floats

def add(x: float, y: float) -> float:
    """Add two floats.

    Args:
        x (float): The first float.
        y (float): The second float.
    """
    assert isinstance(x, float)
    assert isinstance(y, float)
    return x + y

and here is the same function in a json schema

schema_json = {
    'title': 'add',
    'type': 'object',
    'description': 'Add two floats.',
    'properties': {'x': {'type': 'float', 'description': 'The first float.'},
                    'y': {'type': 'float', 'description': 'The second float.'}},
    'required': ['x', 'y'],
    }

Imagine we also have a model, let's say mistral-7b-instruct-v0.2

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = "cuda"
# Replace <YOUR-HF-TOKEN> with your Hugging Face access token.
llm = AutoModelForCausalLM.from_pretrained(model_id, token="<YOUR-HF-TOKEN>", trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id, token="<YOUR-HF-TOKEN>", trust_remote_code=True)
model = models.Transformers(llm, tokenizer)

Now we want to generate arguments for add, and can use the json structured generator to do so

schema_str = json.dumps(schema_json)
generator = generate.json(model, schema_str)

function_args = generator("I am traveling 3 m/s and then travel 2 m/s faster. Can you call the function add to find out my new speed?", max_tokens=30)
print("Input:", function_args)

result = add(**function_args)
print("Output:", result)

This results in the error

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[36], line 1
----> 1 generator = generate.json(model, schema_str)
      3 function_args = generator("I am traveling 3 m/s and then travel 2 m/s faster. Can you call the function add to find out my new speed?", max_tokens=30)
      4 print(function_args)

File /usr/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /mnt/workdisk/eitan/venvs/outlines-bug/lib/python3.11/site-packages/outlines/generate/json.py:58, in json(model, schema_object, sampler, whitespace_pattern)
     56 elif isinstance(schema_object, str):
     57     schema = schema_object
---> 58     regex_str = build_regex_from_schema(schema, whitespace_pattern)
     59     generator = regex(model, regex_str, sampler)
     60     generator.format_sequence = lambda x: pyjson.loads(x)

File /mnt/workdisk/eitan/venvs/outlines-bug/lib/python3.11/site-packages/outlines/fsm/json_schema.py:83, in build_regex_from_schema(schema, whitespace_pattern)
     80 resolver = registry.resolver()
     82 content = schema.contents
---> 83 return to_regex(resolver, content, whitespace_pattern)

File /mnt/workdisk/eitan/venvs/outlines-bug/lib/python3.11/site-packages/outlines/fsm/json_schema.py:142, in to_regex(resolver, instance, whitespace_pattern)
    140 for i, (name, value) in enumerate(properties.items()):
    141     subregex = f'{whitespace_pattern}"{re.escape(name)}"{whitespace_pattern}:{whitespace_pattern}'
--> 142     subregex += to_regex(resolver, value, whitespace_pattern)
    143     if i < last_required_pos:
    144         subregex = f"{subregex}{whitespace_pattern},"

File /mnt/workdisk/eitan/venvs/outlines-bug/lib/python3.11/site-packages/outlines/fsm/json_schema.py:356, in to_regex(resolver, instance, whitespace_pattern)
    349         regexes = [
    350             to_regex(resolver, {"type": t}, whitespace_pattern)
    351             for t in instance_type
    352             if t != "object"
    353         ]
    354         return rf"({'|'.join(regexes)})"
--> 356 raise NotImplementedError(
    357     f"""Could not translate the instance {instance} to a
    358 regular expression. Make sure it is valid to the JSON Schema specification. If
    359 it is, please open an issue on the Outlines repository"""
    360 )

NotImplementedError: Could not translate the instance {'type': 'float', 'description': 'The first float.'} to a
    regular expression. Make sure it is valid to the JSON Schema specification. If
    it is, please open an issue on the Outlines repository

where the most important part is the error raised in the to_regex function

NotImplementedError: Could not translate the instance {'type': 'float', 'description': 'The first float.'} to a
    regular expression. Make sure it is valid to the JSON Schema specification. If
    it is, please open an issue on the Outlines repository

This is very strange. For some reason Outlines cannot turn {'type': 'float', 'description': 'The first float.'} into a regex. This seems very simple. What is going on?


Instead of passing a string into generate.json (i.e. generate.json(model, schema_str)), I tried passing the function directly into generate.json

generator = generate.json(model, add)

function_args = generator("I am traveling 3 m/s and then travel 2 m/s faster. Can you call the function add to find out my new speed?", max_tokens=30)
print("Input:", function_args)

result = add(**function_args)
print("Output:", result)

and I get

Input: {'x': 6, 'y': 0}
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[6], line 6
      3 function_args = generator("I am traveling 3 m/s and then travel 2 m/s faster. Can you call the function add to find out my new speed?", max_tokens=30)
      4 print("Input:", function_args)
----> 6 result = add(**function_args)
      7 print("Output:", result)

Cell In[3], line 10, in add(x, y)
      3 def add(x: float, y: float) -> float:
      4     """Add two floats.
      5 
      6     Args:
      7         x (float): The first float.
      8         y (float): The second float.
      9     """
---> 10     assert isinstance(x, float)
     11     assert isinstance(y, float)
     12     return x + y

AssertionError:

Observe that running generator = generate.json(model, add) here gives us a different error than when we ran generator = generate.json(model, schema_str).

When we ran generator = generate.json(model, schema_str) there was an issue creating the generator. With generator = generate.json(model, add) we successfully create generator and get that the function arguments generated by the model are {'x': 6, 'y': 0}. However, the model generated integers (i.e. 6, 0) and not floats (i.e. 6.0, 0.0).

What is going on here?

The solution

There are a couple of issues here:

  1. In json_schema.py there was no FLOAT type. This is why we got the error NotImplementedError: Could not translate the instance {'type': 'float', 'description': 'The first float.'} to a regular expression. I added the FLOAT type to that file. This fixed the first issue where the generator cannot convert schema_str to regex in generator = generate.json(model, schema_str).
  2. Although we don't have the FLOAT type in json_schema.py, we do have the FLOAT type in types.py:
INTEGER = r"[+-]?(0|[1-9][0-9]*)"
FLOAT = rf"{INTEGER}(\.[0-9]+)?([eE][+-][0-9]+)?"

The ? in (\.[0-9]+)? means that the decimal point and the digits after it are optional for a float. So a float can have no decimal part at all. Wait... isn't that just an integer? This regex pattern is incorrect. So I changed this regex pattern to

FLOAT = rf"[+-]?[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?"

which requires that we have a decimal in the float.
  3. I also noticed that we have two different definitions of INTEGER. In json_schema.py we have INTEGER = r"(-)?(0|[1-9][0-9]*)" and in types.py we have INTEGER = r"[+-]?(0|[1-9][0-9]*)". I changed the definition in json_schema.py to match the one in types.py, because the types.py definition allows both + and - in front of the number while the json_schema.py definition allows only -.
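
To make the regex comparison above concrete, here is a minimal sketch using only Python's re module. The pattern strings are copied from this description; the variable names (OLD_FLOAT, PROPOSED_FLOAT, JSON_SCHEMA_INTEGER, TYPES_INTEGER) and the matches helper are illustrative and are not identifiers taken from Outlines.

```python
import re

# Patterns copied from the PR description; the names are illustrative only.
TYPES_INTEGER = r"[+-]?(0|[1-9][0-9]*)"        # definition quoted from types.py
JSON_SCHEMA_INTEGER = r"(-)?(0|[1-9][0-9]*)"   # definition quoted from json_schema.py
OLD_FLOAT = rf"{TYPES_INTEGER}(\.[0-9]+)?([eE][+-][0-9]+)?"
PROPOSED_FLOAT = r"[+-]?[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?"


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Print, for each sample string, which of the four patterns accept it.
for text in ["6", "6.0", "+3", ".5"]:
    print(
        f"{text!r:>6}",
        "old_float:", matches(OLD_FLOAT, text),
        "proposed_float:", matches(PROPOSED_FLOAT, text),
        "types_int:", matches(TYPES_INTEGER, text),
        "schema_int:", matches(JSON_SCHEMA_INTEGER, text),
    )
```

The old FLOAT pattern accepts the bare integer "6" while the proposed one rejects it. The proposed pattern also accepts ".5", which has no digit before the decimal point.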

TO DO: More generally, I think we should store these types in one single location. I'm thinking of importing all these definitions from types.py into json_schema.py, that way we won't have diverging definitions again.

  4. But what about our second issue, that generator = generate.json(model, add) generated integers and not floats? Recall that when schema_object is a callable, the following lines are called:
schema = pyjson.dumps(get_schema_from_signature(schema_object))
regex_str = build_regex_from_schema(schema, whitespace_pattern)

The function get_schema_from_signature calls Pydantic's model.model_json_schema() function to convert a Pydantic model to a JSON schema. For some reason, model_json_schema() maps fields of type float to the JSON schema type number. I don't know why this occurs. See this GitHub issue for more info.

So calling schema = pyjson.dumps(get_schema_from_signature(schema_object)) takes all floats in our function and casts them to type number. Then when we call build_regex_from_schema, we take any objects of type number and turn it to the regex pattern

NUMBER = rf"({INTEGER})(\.[0-9]+)?([eE][+-][0-9]+)?"

where NUMBER represents any integer or float. Focusing on (\.[0-9]+)?, notice that the ? makes the decimal point and the digits after it optional. All of our floats become type number, whose regex accepts either an int or a float, so nothing forces the model to emit a decimal point.

This is why calling generator = generate.json(model, add) with floats in add ends up with us generating ints. I don't know the best way to fix this and so I made a PR to Pydantic about this.
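
In the meantime, one possible workaround is to coerce the parsed values on the caller's side. This is a minimal sketch, assuming the generator returns a plain dict like the {'x': 6, 'y': 0} shown above. coerce_floats is a hypothetical helper, not part of Outlines, and the JSON string below stands in for model output.

```python
import json


def add(x: float, y: float) -> float:
    """Add two floats (same function as in the example above)."""
    assert isinstance(x, float)
    assert isinstance(y, float)
    return x + y


def coerce_floats(args: dict) -> dict:
    """Hypothetical helper: turn int values into floats, leave the rest alone."""
    return {
        # bool is a subclass of int in Python, so exclude it explicitly.
        key: float(value) if isinstance(value, int) and not isinstance(value, bool) else value
        for key, value in args.items()
    }


# Stand-in for model output: valid JSON numbers that happen to be integers.
function_args = json.loads('{"x": 6, "y": 0}')
print(type(function_args["x"]))  # json.loads parses "6" as a Python int

result = add(**coerce_floats(function_args))
print("Output:", result)
```

This keeps the schema valid JSON Schema and moves the float requirement into Python, where it originated.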

To Do

  1. Should we delete the types in json_schema.py and instead import the types from types.py?
  2. How do we turn a function into a JSON schema without casting float to number?

Any thoughts would be appreciated.

@eitanturok
Contributor Author

I know this is a very long/complex issue but would love help fixing this as soon as possible. Thanks!

@eitanturok eitanturok marked this pull request as ready for review May 10, 2024 03:36
@rlouf
Member

rlouf commented May 10, 2024

As in #888 please refer to the JSON definition and JSON Schema specification. As you can see, an integer is a valid number.

@eitanturok
Contributor Author

Hi @rlouf!

I'm planning on breaking this PR up into smaller PRs as there are many different changes I made here at once. (The first of these is the PR #888.)

I will look through the JSON schema like you suggested. In the meantime my code breaks anytime I have a float in my JSON schema. I'll create a PR for just this issue now because it is the most pressing one.

@eitanturok
Contributor Author

After reading this JSON Schema Specification, I see that JSON Schema does not actually have a float type; it only has integer and number.

Man, that is frustrating.

So the JSON schema I provided in my example

schema_json = {
    'title': 'add',
    'type': 'object',
    'description': 'Add two floats.',
    'properties': {'x': {'type': 'float', 'description': 'The first float.'},
                    'y': {'type': 'float', 'description': 'The second float.'}},
    'required': ['x', 'y'],
    }

would never actually be generated in practice as there is no type float.

Additionally, the fact that schema = pyjson.dumps(get_schema_from_signature(schema_object)) takes all floats in our function and casts them to type number is actually behavior that now makes sense.

This clears a lot of things up, thank you!
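
For reference, a minimal sketch of the corrected schema, with number in place of float, plus a small check for type names that JSON Schema does not define. The helper find_unknown_types is hypothetical and only walks properties; it is not a full validator.

```python
# The seven primitive type names defined by JSON Schema.
JSON_SCHEMA_TYPES = {"null", "boolean", "object", "array", "number", "string", "integer"}

# Same schema as in the PR description, with "float" replaced by "number".
schema_json = {
    "title": "add",
    "type": "object",
    "description": "Add two floats.",
    "properties": {
        "x": {"type": "number", "description": "The first float."},
        "y": {"type": "number", "description": "The second float."},
    },
    "required": ["x", "y"],
}


def find_unknown_types(schema: dict, path: str = "$") -> list:
    """Hypothetical helper: list (path, type) pairs whose type is not a JSON Schema type."""
    unknown = []
    declared = schema.get("type")
    # "type" may be a single name or a list of names.
    names = declared if isinstance(declared, list) else [declared]
    for name in names:
        if name is not None and name not in JSON_SCHEMA_TYPES:
            unknown.append((path, name))
    # Recurse into nested property schemas only.
    for prop, subschema in schema.get("properties", {}).items():
        unknown.extend(find_unknown_types(subschema, f"{path}.{prop}"))
    return unknown


print(find_unknown_types(schema_json))
print(find_unknown_types({"type": "object", "properties": {"x": {"type": "float"}}}))
```

Running a check like this before calling generate.json would flag 'float' up front instead of surfacing it as a NotImplementedError during regex construction.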

@eitanturok
Contributor Author

Is there a way to use generate.json() to get the model to output floats in a JSON schema? This float would be consistent with the Python definition of float, i.e. it must have a decimal point and one or more digits after the decimal point: r"[+-]?[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?".

This definition of float is different from the current definition of number, which optionally has a decimal point.

Thanks!

@rlouf
Member

rlouf commented May 10, 2024

> Is there a way to use generate.json() to get the model to output floats in a JSON schema? This float would be consistent with the Python definition of float, i.e. it must have a decimal point and one or more numbers after the decimal point, i.e. r"[+-]?[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?"?
>
> This definition of float is different from the current definition of number, which optionally has a decimal point?
>
> Thanks!

We could add an outlines.types.Float custom type. Thinking about it, we may take some liberties wrt the JSON spec, but we should open an issue to discuss it prior to opening a PR.
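
One possible shape for such a type, sketched as a plain regex rather than an actual Outlines API: keep the JSON number grammar but make the fractional part mandatory. STRICT_JSON_FLOAT is a hypothetical name, and how it would be wired into outlines.types is left open.

```python
import json
import re

# Hypothetical pattern: JSON's number grammar with a required fractional part.
# No leading "+", no leading zeros, at least one digit on each side of the point.
STRICT_JSON_FLOAT = r"-?(0|[1-9][0-9]*)\.[0-9]+([eE][+-]?[0-9]+)?"

for text in ["6.0", "-0.25", "1.5e-3", "6", ".5", "+1.5"]:
    if re.fullmatch(STRICT_JSON_FLOAT, text):
        # The pattern is a subset of the JSON number grammar, so this parses.
        value = json.loads(text)
        print(f"{text!r}: accepted, parsed as {type(value).__name__}")
    else:
        print(f"{text!r}: rejected")
```

Because every accepted string contains a decimal point, json.loads returns a Python float for each match.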

@rlouf
Member

rlouf commented May 15, 2024

Closing as discussed offline. Generating Python-compatible strings may lead to non-parseable JSON.
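
To illustrate the concern: the pattern proposed earlier in this thread accepts strings that Python's float() can read but that json.loads rejects. A minimal sketch using only the standard library; the sample strings and the is_valid_json helper are illustrative.

```python
import json
import re

# Pattern proposed earlier in this thread for a "Python-style" float.
PROPOSED_FLOAT = r"[+-]?[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?"


def is_valid_json(text: str) -> bool:
    """Return True if json.loads accepts the string."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


for text in ["6.0", ".5", "+1.5", "007.5"]:
    # Every sample matches the proposed pattern and is readable by float().
    assert re.fullmatch(PROPOSED_FLOAT, text) is not None
    print(f"{text!r:>8} float() -> {float(text)!r:<6} valid JSON: {is_valid_json(text)}")
```

Only "6.0" survives json.loads here: a missing leading digit, a leading plus sign, and leading zeros are all outside the JSON number grammar.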

@rlouf rlouf closed this May 15, 2024