Support SQLAlchemy for custom data layer #836
Conversation
Thank you for turning to SQLAlchemy so fast! I have a couple of comments, primarily about DB init instead of running DDL scripts.
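As a minimal sketch of what DB init on the Chainlit side could look like instead of shipping DDL scripts: derive the schema from SQLAlchemy metadata, which emits dialect-appropriate `CREATE TABLE` statements for any supported database. Table and column names here are illustrative, not the PR's actual schema.

```python
from sqlalchemy import Column, MetaData, String, Table, create_engine

metadata = MetaData()

# Hypothetical minimal table; the real data layer would declare the full schema.
users = Table(
    "users",
    metadata,
    Column("id", String, primary_key=True),
    Column("identifier", String),
)

# Any dialect works here: the same metadata would emit Postgres, MySQL,
# or SQLite DDL depending on the engine, with no hand-written DDL script.
engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)
```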
List thread update and imports
DALL-E, for example, can return a URL, but it's only valid for a limited time, so we persist those now.
Can we add an interface/adapter for a blob storage client?
I'd also suggest using memcached or session storage, but this is already a good first starting point.
@sandangel I was thinking about making another file. I added the functions based on the requirements from the Chainlit documentation, except for delete_user_session, because I cannot see a purpose for it. Where are you seeing delete_file and download_file?
I think they are hidden in the literalai API client. That is why I chose to implement the literalai API instead of BaseDataLayer in my MongoDB PR.
I think this does not scale well, for the reason I mentioned earlier: as the number of providers grows, it will add more maintenance burden for us. That is why I suggest having a common interface on the Chainlit side; on the user's side, they can use their own blob storage client, similar to how Chainlit created BaseDataLayer. For the SQLAlchemyDataLayer in this PR, I think it will only serve as an example implementation maintained by the community, not something battle-tested and ready for production at a scale beyond 100K users. For example, we would also need to add a queue system for the write path, plus a key-value store for sessions and a cache for faster querying and displaying of data.
I am not looking into Literal because it's irrelevant for a custom data layer. The devs' advice in the project ticket is to inherit from BaseDataLayer, so that's what this implements.
@sandangel I don't understand what you mean, I guess. If you have code for this, please share. In any case, this isn't a fully feature-complete PR and we should expect (and welcome!) more enhancements. I can only test some SQL dialects and storage providers. Tbh I am in no rush as I'm already live with this. Data persistence blocked me and my team for months, so I made this out of frustration mostly, haha.
@hayescode sorry for the confusion. Here is some pseudocode:

```python
# chainlit code
from typing import Protocol


class BlobStorageClient(Protocol):
    def upload_file(self, key: str, data: bytes) -> str:
        ...


class SQLAlchemyDataLayer(BaseDataLayer):
    def add_blob_storage_client(self, blob_storage_client: BlobStorageClient) -> None:
        self.blob_storage_client = blob_storage_client
        logger.info("Blob Storage client initialized")

    @queue_until_user_message()
    async def create_element(self, element: "Element"):
        # ...
        element.url = self.blob_storage_client.upload_file(key=element.key, data=element.data)
```

```python
# user code
class S3StorageClient:
    def upload_file(self, key: str, data: bytes) -> str:
        import boto3

        s3_client = boto3.client("s3")
        s3_client.put_object(Bucket="my-bucket", Key=key, Body=data)
        return f"s3://my-bucket/{key}"


cl._data_layer = SQLAlchemyDataLayer()
cl._data_layer.add_blob_storage_client(S3StorageClient())
```
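For illustration only (this is not part of the PR), here is a hypothetical second client satisfying the same `upload_file(key, data) -> str` protocol, backed by the local filesystem; it shows why the adapter keeps Chainlit itself provider-agnostic:

```python
from pathlib import Path


class LocalStorageClient:
    """Hypothetical blob client that writes to the local filesystem.

    Satisfies the same upload_file(key, data) -> str shape as the
    S3StorageClient example, so the data layer needs no changes to use it.
    """

    def __init__(self, root: str):
        self.root = Path(root)

    def upload_file(self, key: str, data: bytes) -> str:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return path.as_uri()  # e.g. file:///tmp/blobs/elements/42.png
```

Swapping S3 for this client (or ADLS, GCS, etc.) then only changes user code, never the Chainlit side.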
@tpatel I ended up leveraging the chainlit context by adding this to each of the functions, to ensure DB writes only occur if a valid user is in context:

```python
if not getattr(context.session.user, 'id', None):
    raise ValueError("No authenticated user in context")
```

If you know of a better way, please let us know. If not, I believe all actions have been addressed?
@hayescode looks good, however I'm getting an error. I'm looking into how we could make this work.
@tpatel I removed the chainlit.context check from all except for create/upsert methods. This gives us the desired behavior where non-authenticated users cannot write to the DB, while maintaining functionality for authenticated users. |
Neat, it makes a ton of sense to remove the read-only checks because they are called from already-authenticated routes (so they won't be called if the user isn't correctly authenticated) 👍 Could we switch to a silent ignore instead of raising?
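A minimal sketch of the silent-ignore variant being suggested here: the guard returns early instead of raising. This is illustrative pseudocode, not the PR's code; `context` is any object shaped like the chainlit context, and the function names are stand-ins.

```python
import asyncio
from typing import Optional


def get_user_id(context) -> Optional[str]:
    """Return the authenticated user's id, or None (never raises)."""
    session = getattr(context, "session", None)
    user = getattr(session, "user", None)
    return getattr(user, "id", None)


async def upsert_feedback(context, feedback) -> str:
    user_id = get_user_id(context)
    if user_id is None:
        return ""  # silent ignore: skip the DB write instead of raising
    # ... perform the actual DB write here ...
    return "feedback-id"
```

Routes that run outside a user session then degrade to a no-op rather than a 500.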
```python
async def delete_user_session(self, id: str) -> bool:
    return False  # Not sure why documentation wants this


async def get_all_user_threads(self, user_id: Optional[str] = None, thread_id: Optional[str] = None) -> Optional[List[ThreadDict]]:
```
I'm not sure how Chainlit is using this function, but if it is used only for displaying the list of threads, then I think we should not query all the information inside each thread. That should happen when the user clicks to resume the chat.
This function expects a List[ThreadDict], which is basically everything, to your point. I was able to speed it up a lot by querying it all at once per user, instead of recursively calling each thread, the steps in that thread, etc. to build it.
This is also used for the search and feedback filters.
I added a user_thread_limit to cap threads, since the count will only grow. I have several users filling up my limit (100 threads) and performance is fine. Are you seeing performance problems, or just asking the Chainlit devs why it's like this?
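The single-joined-query approach with a thread cap can be sketched with stdlib sqlite3 (table and column names are illustrative, not the PR's exact schema):

```python
import sqlite3


def get_user_threads(conn, user_id, limit=100):
    """Fetch a user's newest threads and all their steps in one joined
    query, instead of one query per thread (the old recursion)."""
    rows = conn.execute(
        """
        SELECT t.id, t.name, s.id, s.output
        FROM (SELECT id, name, created_at FROM threads
              WHERE user_id = ? ORDER BY created_at DESC LIMIT ?) AS t
        LEFT JOIN steps s ON s.thread_id = t.id
        ORDER BY t.created_at DESC
        """,
        (user_id, limit),
    ).fetchall()
    threads = {}
    for thread_id, name, step_id, output in rows:
        thread = threads.setdefault(thread_id, {"id": thread_id, "name": name, "steps": []})
        if step_id is not None:
            thread["steps"].append({"id": step_id, "output": output})
    return list(threads.values())
```

The `LIMIT` inside the subquery is what implements the `user_thread_limit` cap: only the newest N threads are joined against steps.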
This function isn't part of BaseDataLayer, so it can be defined and used from within this custom data layer as needed.
@sandangel I agree that there might be some performance improvements to be done, I think it's fine for the first release of this custom data layer 👍
backend/chainlit/data/sql_alchemy.py
Outdated
```python
if not getattr(context.session.user, 'id', None):
    raise ValueError("No authenticated user in context")
```
You need to remove this check. The upsert_feedback method is only called from a route, so it's outside of the user session (attempting to use context will always throw).
I don't have a good solution to prevent adding feedback to the DB for non-authenticated chainlit apps for now.
```python
if not getattr(context.session.user, 'id', None):
    raise ValueError("No authenticated user in context")
```
Maybe we're overthinking this. I don't see a scenario where authenticated and non-authenticated users coexist in the same app/DB. With authentication enabled, users can't use my app without authenticating. If someone doesn't have authentication set up, I don't know why they'd set up a custom data layer.
I just don't see how the scenario we're trying to guard against would be possible, but maybe I'm missing something?
Yes, this is a real edge case, I agree. I can sadly think of at least one scenario where it could happen: an app owner disables authentication for a few weeks.
To be fair, this is the only thing I've found in today's testing session, and getting it fixed would enable me to do a last code review / testing session.
An option I've just thought of is to use the require_login method instead of checking the context (chainlit/backend/chainlit/auth.py, lines 31 to 37 in 5939877):

```python
def require_login():
    return (
        bool(os.environ.get("CHAINLIT_CUSTOM_AUTH"))
        or config.code.password_auth_callback is not None
        or config.code.header_auth_callback is not None
        or is_oauth_enabled()
    )
```
I like this idea! Then we can just block initialization of the custom data layer itself if there is no authentication. Great solution!
Where would we call require_login()? I've tried it in app.py before the decorators and in sqlalchemy.py, but it is returning None for me even when I do have authentication enabled:

```python
from chainlit.auth import require_login

is_authentication_enabled = require_login()


class SQLAlchemyDataLayer(BaseDataLayer):
    def __init__(self, conninfo: str, ssl_require: bool = False, storage_provider: Optional[BaseStorageClient] = None, user_thread_limit: Optional[int] = 1000):
        if not is_authentication_enabled:
            print(f'is_authentication_enabled: {is_authentication_enabled}')
            raise PermissionError("Authentication is required to use SQLAlchemyDataLayer")
```

Is it possible to continue from a previous thread?
✨✨✨ AWESOME ✨✨✨
Thanks for your work @hayescode! I believe this feature is ready now; we can still disable persistence and display a proper warning for Chainlit apps without authentication in a future PR.
@tpatel thank you very much for your assistance and advice throughout this process! I'm excited to see how the Chainlit community evolves this functionality!
- Adds a custom, direct-database data layer using SQLAlchemy, with support for a wide range of SQL dialects.
- Configures ADLS or S3 as the blob storage provider.
- Duplicated `PageInfo` and `PaginatedResponse` from the Literal SDK into `backend/chainlit/types.py` and updated typing.
Hi, it seems that this implementation does not support Azure SQL databases: I get no error at boot, but a Bad Request: 'User not persisted' 400 error, while the user is correctly stored in the user table. Also, on thread creation, an error appears when storing thread.tags, because that datatype (TEXT[]) does not exist in Azure SQL.

```python
import os
import urllib.parse

import chainlit.data as cl_data
from chainlit.data.sql_alchemy import SQLAlchemyDataLayer
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

driver = 'ODBC Driver 17 for SQL Server'
params = urllib.parse.quote_plus(
    'Driver=%s;' % driver +
    'Server=tcp:%s,1433;' % os.environ["AZURE_DB_HOST"] +
    'Database=%s;' % os.environ["AZURE_DB_DATABASE"] +
    'Uid=%s;' % os.environ["AZURE_DB_USERNAME"] +
    'Pwd={%s};' % os.environ["AZURE_DB_PASSWORD"] +
    'Persist Security Info=False;' +
    'MultipleActiveResultSets=False;' +
    'Encrypt=yes;' +
    'TrustServerCertificate=no;' +
    'Connection Timeout=30;')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine_azure = create_engine(conn_str)
engine_azure.connect()
print('Connection to the Azure Db is ok')

connection_url = URL.create(
    drivername='mssql+aioodbc',
    username=os.environ["AZURE_DB_USERNAME"],
    password=os.environ["AZURE_DB_PASSWORD"],
    host=os.environ["AZURE_DB_HOST"],
    database=os.environ["AZURE_DB_DATABASE"],
    query={
        'driver': driver,
        'Encrypt': 'yes',
        'TrustServerCertificate': 'no',
        'Connection Timeout': '30'
    }
)

cl_data._data_layer = SQLAlchemyDataLayer(
    conninfo=connection_url,
    ssl_require=True,
    storage_provider=engine_azure,
    user_thread_limit=100)
```

Did I make a mistake somewhere?
@JeanRessouche I added some docs yesterday to clarify the scope of this release. For now, this has only been tested with PostgreSQL. I'll be happy to guide you if you choose to add support for Azure SQL!
Thanks a lot @tpatel, the doc helps but it's still a little bit cloudy for me. Adding support for Azure SQL is definitely something I'm willing to do, but not in the short term, so I'm trying the Azure Data Lake way. Based on the doc, I'm facing a blocker with the conninfo variable. [EDIT] Yeah, I certainly need Synapse, so I configured one; now I'm lost with the incompatible DDL, as the data lake link seems to be SQL Server compatible.
For the tags, I'm not sure what the Azure SQL equivalent is, but it's a list of strings in that column. For the data lake, I am using ADLS Gen2. 'User not persisted' sounds like this isn't getting configured properly. Pass the account URL and credential (managed identity or access key). Try a test script to write some test data.
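Since some dialects (Azure SQL, SQLite, MySQL) have no array type like PostgreSQL's TEXT[], one portable option is serializing the tags list to a JSON string in an ordinary text column. A rough sketch, not the PR's code, with an illustrative table name:

```python
import json
import sqlite3


def save_tags(conn, thread_id, tags):
    # Store the list as a JSON string in a plain TEXT column,
    # instead of a dialect-specific array type.
    conn.execute("UPDATE threads SET tags = ? WHERE id = ?",
                 (json.dumps(tags), thread_id))


def load_tags(conn, thread_id):
    row = conn.execute("SELECT tags FROM threads WHERE id = ?",
                       (thread_id,)).fetchone()
    return json.loads(row[0]) if row and row[0] else []
```

The trade-off is that tags are no longer directly indexable/filterable in SQL on such dialects without JSON functions.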
I'm curious how you created the DDL in ADLS Gen2 on your side; I was only able to connect to it with the SQL Server driver, so the Postgres DDL won't work. So far I was only able to make it work properly with a Postgres server (which is great already!), and failed with ADLS Gen2 and Azure SQL Database.
I'm not sure what you mean, there is no DDL for ADLS. All you need is the account_url, credential, and the container name. Optionally add a sas_token to append to the URL to give users access. Once this is provided, the code should handle it all and log any errors.
Hum, OK, I have to retry then. I did that but it wasn't working, certainly because I did not figure out what to put in conninfo when using ADLS.
You don't put it in the conninfo; you instantiate the ADLS client and pass it:

```python
from chainlit.data.storage_clients import AzureStorageClient

storage_client = AzureStorageClient(
    account_url="https://<your_account>.dfs.core.windows.net",
    container="<your_container>",
    credential=credential,
    sas_token=ADLS_SAS_TOKEN
)

cl_data._data_layer = SQLAlchemyDataLayer(
    conninfo=SQLALCHEMY_CONNINFO,
    ssl_require=True,
    storage_provider=storage_client,
    user_thread_limit=100)
```
Hum, thanks, but there is still something I'm missing here: you ask me not to set conninfo, but you do set it in your answer and in the description above. On the other hand, if I simply don't provide it, or set it empty, I get an error, as it's a required parameter. For ADLS, I have no clue what is expected in the SQLALCHEMY_CONNINFO variable.
There are two clients. Pass the ADLS client to the SQLAlchemyDataLayer via storage_provider; conninfo is the SQL database connection string. I think you're conflating these.
In the next commits, please fix this -> ERROR:
Overview

- Speeds up `list_threads` and `get_threads` dramatically versus the previous recursion.
- Duplicated `PageInfo` and `PaginatedResponse` from the Literal SDK into `backend/chainlit/types.py` and updated typing.
- I had to add locks because the backend uses `asyncio.gather`, and sometimes `create_step` is attempted before `update_thread`, causing a foreign key violation and a race condition. I prefer this approach in order to maintain the referential integrity of the database. The `.gather()` call causing `create_step` to run before `update_thread` results in foreign key violations; I assume this was done for performance reasons. The backend would have to be re-worked to handle this, and a custom solution adds complexity that should not exist.
- Added `cypress/e2e/custom_data_layer.py`.
Shout-Outs

- Thanks for the help with the `cypress` tests, @tjroamer. I had never done these before.

Test System

How to configure

1. Install the necessary dependencies to use this custom data layer.
2. Run this SQL DDL in your SQL database:

   DDL

3. Add the `SQLALCHEMY_CONNINFO` environment variable in your `.env` file.
4. Add this to your `app.py`.