TLDR
Outlines currently fails to generate floats from a JSON schema. This PR fixes that.
The Problem
Here is the smallest working example of where this issue arises.
Import these packages
and consider a function that takes in floats
and here is the same function in a JSON schema
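As a rough sketch of the kind of function and schema meant here: the names `add` and `schema_str` and the entry `{'type': 'float', 'description': 'The first float.'}` appear in this description, while the function body, the `y` description, and the surrounding schema keys are assumptions made for illustration.

```python
import json


def add(x: float, y: float) -> float:
    """Add two floats (body assumed for illustration)."""
    return x + y


# Hypothetical schema string for `add`. Only the 'x' entry is taken from the
# error message quoted in this description; the rest is an assumed layout.
schema_str = json.dumps(
    {
        "title": "add",
        "type": "object",
        "properties": {
            "x": {"type": "float", "description": "The first float."},
            "y": {"type": "float", "description": "The second float."},
        },
        "required": ["x", "y"],
    }
)
```

Both `add` and `schema_str` describe the same two float arguments, which is why one would expect `generate.json` to treat them the same way.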
Imagine we also have a model, let's say mistral-7b-instruct-v0.2
Now we want to generate arguments for `add`, and can use the JSON structured generator to do so.
This results in the error
where the most important part is the error raised in the `to_regex` function.
This is very strange. For some reason Outlines cannot turn `{'type': 'float', 'description': 'The first float.'}` into a regex. This seems very simple. What is going on?
Instead of passing a string into `generate.json` (i.e. `generate.json(model, schema_str)`), I tried passing the function directly into `generate.json` and I get
Observe that running `generator = generate.json(model, add)` here gives us a different problem than when we ran `generator = generate.json(model, schema_str)`.
When we ran `generator = generate.json(model, schema_str)` there was an issue creating the generator. With `generator = generate.json(model, add)` we successfully create `generator`, and the function arguments generated by the model are `{'x': 6, 'y': 0}`. However, the model generated integers (i.e. `6`, `0`) and not floats (i.e. `6.0`, `0.0`). What is going on here?
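To show why the missing decimal point matters once the output is parsed, here is a small standard-library sketch. It does not call Outlines; `all_floats` is a hypothetical helper, and the two JSON strings stand in for model output.

```python
import json


def all_floats(args: dict) -> bool:
    """Return True only if every argument value was parsed as a Python float."""
    return all(isinstance(value, float) for value in args.values())


# Python's json module picks int or float purely from the text it is given.
int_args = json.loads('{"x": 6, "y": 0}')
float_args = json.loads('{"x": 6.0, "y": 0.0}')

print(all_floats(int_args))    # False
print(all_floats(float_args))  # True
```

So the type of the parsed arguments is decided entirely by the text the regex allows the model to emit.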
The solution
There are a couple of issues here:
1. In `json_schema.py` there was no `FLOAT` type. This is why we got the error `NotImplementedError: Could not translate the instance {'type': 'float', 'description': 'The first float.'} to a regular expression`. I added the `FLOAT` type to that file. This fixed the first issue, where the generator could not convert `schema_str` to a regex in `generator = generate.json(model, schema_str)`.
2. Although there was no `FLOAT` type in `json_schema.py`, we do have the `FLOAT` type in `types.py`, and its pattern contains `(\.[0-9]+)?`. The `?` in `(\.[0-9]+)?` means that for a float it is optional to include a decimal point with some digits after it. This means we can have floats without a decimal part. Wait... isn't that just an integer! This regex pattern is incorrect, so I changed it to one which requires that we have a decimal in the float.
3. I also noticed that we have two different definitions of `INTEGER`. In `json_schema.py` we have `INTEGER = r"(-)?(0|[1-9][0-9]*)"` and in `types.py` we have `INTEGER = r"[+-]?(0|[1-9][0-9]*)"`. I changed the definition in `json_schema.py` to be the same as that in `types.py`, because `INTEGER` in `types.py` allows both `+` and `-` in front of the number, whereas the definition in `json_schema.py` allows only `-`.
TO DO: More generally, I think we should store these types in one single location. I'm thinking of importing all these definitions from `types.py` into `json_schema.py`, that way we won't have diverging definitions again.
4. So why is it that `generator = generate.json(model, add)`
generated integers and not floats? Recall that when the `schema_object` is of type `callable`, the schema is built by calling `get_schema_from_signature`. This function calls `Pydantic`'s `model.model_json_schema()` function to convert a `Pydantic` model to a JSON schema. For some reason, this `model_json_schema()` casts objects of type `float` to objects of type `number`. I don't know why this occurs. See this GitHub issue for more info.
So calling `schema = pyjson.dumps(get_schema_from_signature(schema_object))` takes all floats in our function and casts them to type `number`. Then when we call `build_regex_from_schema`, we take any object of type `number` and turn it into the `NUMBER` regex pattern, where `NUMBER` represents any integer or float. Focusing on `(\.[0-9]+)?`, notice that the `?` means that it is optional to add the decimal point with digits after it. All of our `float`s are turned into type `number`, whose regex allows the number to be either an `int` or a `float`.
This is why calling `generator = generate.json(model, add)` with `float`s in `add` ends up with us generating `int`s. I don't know the best way to fix this, and so I made a PR to `Pydantic` about this.
To Do
1. Should we remove the type definitions from `json_schema.py` and instead import the types from `types.py`?
2. How can we convert a `function` into a JSON schema without casting `float` to `number`?
Any thoughts would be appreciated.
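For anyone who wants to check the regex points from the solution without loading a model, here is a small sketch using only Python's `re` module. The two `INTEGER` patterns are the ones quoted in item 3. The two float patterns are assumptions built from the `(\.[0-9]+)?` fragment discussed above, not copies of the library's definitions, and `matches` is a hypothetical helper.

```python
import re

# Quoted in this description.
INTEGER_JSON_SCHEMA = r"(-)?(0|[1-9][0-9]*)"  # old json_schema.py definition
INTEGER_TYPES = r"[+-]?(0|[1-9][0-9]*)"       # types.py definition

# Assumed shapes for illustration only, not the library's exact patterns.
FLOAT_OPTIONAL_DECIMAL = INTEGER_TYPES + r"(\.[0-9]+)?"  # decimal part optional
FLOAT_REQUIRED_DECIMAL = INTEGER_TYPES + r"\.[0-9]+"     # decimal part required


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# With an optional decimal part, a bare integer is accepted as a "float".
print(matches(FLOAT_OPTIONAL_DECIMAL, "6"))    # True
print(matches(FLOAT_OPTIONAL_DECIMAL, "6.0"))  # True

# Requiring the decimal part rejects bare integers.
print(matches(FLOAT_REQUIRED_DECIMAL, "6"))    # False
print(matches(FLOAT_REQUIRED_DECIMAL, "6.0"))  # True

# The two INTEGER definitions disagree on a leading '+'.
print(matches(INTEGER_JSON_SCHEMA, "+6"))      # False
print(matches(INTEGER_TYPES, "+6"))            # True
```

The first pair of checks illustrates why a pattern with an optional decimal part lets the model emit `6` where `6.0` was wanted.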