State mapping cache ignores the tokenizer used to build the state machine #872

Open
br3no opened this issue May 7, 2024 · 3 comments · May be fixed by #873, #876 or #911
Labels
bug structured generation Linked to structured generation

Comments

@br3no
Contributor

br3no commented May 7, 2024

Describe the issue as clearly as possible:

def create_states_mapping(regex_string: str) -> Tuple[dict, set, set]:

The cached function depends on both the regex and the tokenizer. The tokenizer is not a parameter of the function, though, so it is not part of the cache key. As a result, state mappings cached for one tokenizer are reused with a different tokenizer, which leads to errors.
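To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the Outlines code: `ToyTokenizer`, `build_mapping`, `buggy_cached_mapping` and `fixed_cached_mapping` are hypothetical names made up for this example, and the "mapping" is a toy set of token ids rather than a real state machine. It only shows what happens when a cache key leaves out an input the result depends on, and one possible way to include it.

```python
from typing import Dict, FrozenSet, Tuple


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: maps token strings to token ids."""

    def __init__(self, vocabulary: Dict[str, int]) -> None:
        self.vocabulary = dict(vocabulary)


def build_mapping(regex_string: str, tokenizer: ToyTokenizer) -> FrozenSet[int]:
    """Toy 'state mapping': the ids of all tokens made only of digits.

    A real implementation would walk the regex's state machine. This toy
    ignores regex_string and only shows that the result depends on the
    tokenizer's vocabulary.
    """
    return frozenset(
        token_id
        for token, token_id in tokenizer.vocabulary.items()
        if token.isdigit()
    )


# Buggy variant: the cache key is the regex alone (assumed shape of the bug).
_buggy_cache: Dict[str, FrozenSet[int]] = {}


def buggy_cached_mapping(regex_string: str, tokenizer: ToyTokenizer) -> FrozenSet[int]:
    key = regex_string  # the tokenizer is missing from the key
    if key not in _buggy_cache:
        _buggy_cache[key] = build_mapping(regex_string, tokenizer)
    return _buggy_cache[key]


# One possible fix: make the vocabulary part of the cache key.
_fixed_cache: Dict[Tuple[str, Tuple[Tuple[str, int], ...]], FrozenSet[int]] = {}


def fixed_cached_mapping(regex_string: str, tokenizer: ToyTokenizer) -> FrozenSet[int]:
    key = (regex_string, tuple(sorted(tokenizer.vocabulary.items())))
    if key not in _fixed_cache:
        _fixed_cache[key] = build_mapping(regex_string, tokenizer)
    return _fixed_cache[key]


# Two tokenizers that assign different ids to the same token strings.
tokenizer_a = ToyTokenizer({"1": 0, "a": 1})
tokenizer_b = ToyTokenizer({"a": 0, "1": 1})
regex = r"\d"

buggy_a = buggy_cached_mapping(regex, tokenizer_a)
buggy_b = buggy_cached_mapping(regex, tokenizer_b)  # stale entry built for tokenizer_a
fixed_a = fixed_cached_mapping(regex, tokenizer_a)
fixed_b = fixed_cached_mapping(regex, tokenizer_b)  # rebuilt for tokenizer_b

print("buggy:", set(buggy_a), set(buggy_b))
print("fixed:", set(fixed_a), set(fixed_b))
```

With the buggy key, the second call returns the token ids computed for the first tokenizer, so the mapping points at the wrong tokens. Keying on the vocabulary is just one option; a hash of the tokenizer would serve the same purpose.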

Steps/code to reproduce the bug:

import outlines

regex = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

model = outlines.models.transformers("stabilityai/stablelm-2-zephyr-1_6b")

prompt = "What is the IP address of the Google DNS servers? "

generator = outlines.generate.regex(
    model,
    regex,
)
structured = generator(prompt, max_tokens=30)

print(structured)

model = outlines.models.transformers("microsoft/phi-2")
generator = outlines.generate.regex(
    model,
    regex,
)
structured = generator(prompt, max_tokens=30)

print(structured)

Expected result:

Both generations should conform to the regex.
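One way to check this expectation mechanically is to test each output with `re.fullmatch`. The helper below is only a sketch: `conforms` is a hypothetical name, and it assumes the generation ran to completion rather than being cut off by `max_tokens`.

```python
import re

# Same pattern as in the reproduction script above.
IP_REGEX = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"


def conforms(text: str, pattern: str = IP_REGEX) -> bool:
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Hand-written sample strings, not actual model output.
samples = ["8.8.8.8", "8.8.8", "256.1.1.1", "8.8.8.8 "]
results = {sample: conforms(sample) for sample in samples}
print(results)
```

In the reproduction script, calling `conforms(structured)` after each generation would flag an output that was produced with a state mapping built for the wrong tokenizer.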

Error message:

No response

Outlines/Python version information:

Version information

0.0.41
Python 3.11.4 (main, Nov 24 2023, 14:45:29) [Clang 15.0.0 (clang-1500.0.40.1)]
aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
anyio==3.7.1
appdirs==1.4.4
appnope==0.1.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.11.1
bleach==6.1.0
boto3==1.24.53
botocore==1.27.96
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
colorama==0.4.6
comm==0.2.1
contourpy==1.2.1
cycler==0.12.1
datasets==2.15.0
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.7
diskcache==5.6.3
distro==1.9.0
dnspython==2.5.0
easyocr==1.7.1
einops==0.7.0
environs==9.5.0
executing==2.0.1
faiss-cpu==1.7.4
fastjsonschema==2.19.1
filelock==3.13.1
fonttools==4.51.0
fqdn==1.5.1
frozendict==2.4.0
frozenlist==1.4.1
fsspec==2023.10.0
grpcio==1.56.0
h11==0.14.0
html5lib==1.1
httpcore==1.0.2
httpx==0.26.0
huggingface-hub==0.19.4
idna==3.6
imageio==2.34.1
interegular==0.3.3
ipykernel==6.29.2
ipython==8.21.0
ipywidgets==8.1.2
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.3.2
json5==0.9.14
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.0.12
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.2
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
lark==1.1.9
lazy_loader==0.4
llmware==0.2.3
llvmlite==0.42.0
lxml==4.9.3
MarkupSafe==2.1.5
marshmallow==3.20.2
matplotlib==3.8.4
matplotlib-inline==0.1.6
mistune==3.0.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
multitasking==0.0.11
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
nltk==3.8.1
notebook==7.0.8
notebook_shim==0.2.3
numba==0.59.1
numpy==1.26.4
openai==1.12.0
opencv-python-headless==4.9.0.80
outlines==0.0.41
overrides==7.7.0
packaging==23.2
pandas==2.2.0
pandocfilters==1.5.1
parso==0.8.3
pdf2image==1.16.0
pexpect==4.9.0
pgvector==0.2.4
pillow==10.2.0
platformdirs==4.2.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.25.2
psutil==5.9.8
psycopg==3.1.17
psycopg-binary==3.1.17
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==15.0.0
pyarrow-hotfix==0.6
pyclipper==1.3.0.post5
pycparser==2.21
pydantic==2.6.1
pydantic_core==2.16.2
Pygments==2.17.2
pymilvus==2.3.0
pymongo==4.5.0
pyparsing==3.1.2
pytesseract==0.3.10
python-bidi==0.4.2
python-dateutil==2.8.2
python-dotenv==1.0.1
python-json-logger==2.0.7
pytz==2024.1
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
referencing==0.33.0
regex==2023.12.25
requests==2.31.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.17.1
s3transfer==0.6.2
safetensors==0.4.2
# Editable install with no version control (sandbox==0.1.0)
-e /Users/breno/src/py/sandbox
scikit-image==0.23.2
scikit-learn==1.4.0
scipy==1.12.0
Send2Trash==1.8.2
sentence-transformers==2.2.2
sentencepiece==0.1.99
shapely==2.0.4
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
sseclient-py==1.8.0
stack-data==0.6.3
sympy==1.12
tabulate==0.9.0
terminado==0.18.0
threadpoolctl==3.2.0
tifffile==2024.5.3
timm==0.9.16
tinycss2==1.2.1
tokenizers==0.19.1
torch==2.2.0
torchvision==0.17.0
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.40.2
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
tzdata==2024.1
ujson==5.9.0
uri-template==1.3.0
urllib3==1.26.18
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
Werkzeug==3.0.1
widgetsnbextension==4.0.10
Wikipedia-API==0.6.0
word2number==1.1
xxhash==3.4.1
yarl==1.9.4
yfinance==0.2.28

Context for the issue:

No response

@brandonwillard
Contributor

This might be the actual issue I was seeing in #853.

@br3no
Contributor Author

br3no commented May 7, 2024

@brandonwillard yes, I believe this might fix #853 too.

ekagra-ranjan linked a pull request May 7, 2024 that will close this issue
@ekagra-ranjan

ekagra-ranjan commented May 7, 2024

What are the chances! I ran into this issue today, and a bug report for it was filed on GitHub the very same day. I have a fix locally, which I have raised in #876.

brandonwillard added the structured generation (Linked to structured generation) label May 9, 2024