
RuntimeError when setting up self hosted model + langchain integration #9

Open

dcavadia opened this issue Feb 25, 2023 · 28 comments

@dcavadia

dcavadia commented Feb 25, 2023

I'm hitting this bug when trying to set up a model on a Lambda Cloud instance, running SelfHostedHuggingFaceLLM() after the rh.cluster() call.

```python
from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh

gpu = rh.cluster(name="rh-a10", instance_type="A10:1").save()

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["pip:./", "transformers", "torch"])
```

[screenshot of the error]

I confirmed with sky check that the Lambda credentials are set, but the error I get in the log is the following, which I haven't been able to solve.

[screenshot of the log error]

Any help solving this would be appreciated.

@dongreenberg

Hi! Thanks for raising this. It looks like the GPU type you're specifying is "A10", which is not a valid GPU type. Can you try "A100:1"? To see all the GPU types available, you can run sky show-gpus or sky show-gpus --cloud lambda.
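
Roughly, the change would look like this (just a sketch, assuming the rest of your snippet stays the same; the cluster name "rh-a100" is only illustrative):

```python
# Sketch: same setup as the original snippet, but requesting a GPU type
# that the SkyPilot catalog recognizes ("A100:1" here).
import runhouse as rh

gpu = rh.cluster(name="rh-a100", instance_type="A100:1").save()
```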

@dongreenberg

Cc @concretevitamin - it looks to me like the accelerator validation for Lambda isn't catching this properly?

@concretevitamin

Hey, thanks for the report. This bug showed up when the Lambda console already contained existing instances in addition to the ones SkyPilot launched.

This has been fixed in SkyPilot main branch.

@dongreenberg

Ya, I spun up an A10 shortly after I wrote the above and realized it works and just wasn't in the catalogue 😄. Excellent, glad to hear it's fixed. @dcavadia I can help you get set up on the SkyPilot main branch if that's helpful, or use an existing Lambda instance you have up if you'd prefer to do that instead.

@concretevitamin

concretevitamin commented Feb 27, 2023

@dongreenberg It's a quirk on our end: sky show-gpus --cloud lambda shows NVIDIA_GPU which really means an opinionated list of "common NVIDIA GPUs". If you pass --all / -a to the above, all supported GPUs in the catalog will be shown including A10, A6000, RTX6000, etc. Let us know if showing all GPUs by default or at least A10 is a good idea.

@dongreenberg

Based on my intuition to run the show-gpus command to see if a particular variant exists in the catalogue, my bias would be to either show all by default or print a warning that this is only common GPUs and that you can run -a to see the full list. Maybe a middle ground would be that if I just run sky show-gpus it shows only the common hardware variants so it's not a mess of cloud-specific hardware (with a warning about running -a), but if I run with --cloud it shows the full catalogue for the given cloud.

@dcavadia

dcavadia commented Feb 27, 2023

> Ya, I spun up an A10 shortly after I wrote the above and realized it works and just wasn't in the catalogue 😄. Excellent, glad to hear it's fixed. @dcavadia I can help you get set up on the SkyPilot main branch if that's helpful, or use an existing Lambda instance you have up if you'd prefer to do that instead.

Hi! I'm glad you found the issue, thanks a lot. I just installed the SkyPilot main branch with pip install git+https://github.com/skypilot-org/skypilot and that solved my earlier problem. It now sets up the instance on Lambda and I can launch it, but I get a new error while running the function, which looks like an InactiveRpcError. Any idea on this?

[screenshots of the InactiveRpcError]

@dongreenberg

dongreenberg commented Feb 27, 2023

Great! Glad this worked.

It's because your working directory (referenced in reqs by "./") is being detected as "ubuntu" (I assume your home directory), and the pip modifier in front of it is telling the grpc server to try to pip install it. Try changing "pip:./" to just "./" or "local:./" to avoid pip installing it. Sorry, the notebook you're using was inside the langchain directory (meaning langchain was the working directory) so it needed to be pip installed. You'll probably need to add "langchain" into the reqs too if you haven't already to make sure to install it on the server.

If for some reason you're getting an error about gRPC not finding methods, the gRPC server on your instance went down from this error. You can restart it by running gpu.restart_grpc_server().
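
Roughly, the change would look something like this (a sketch, assuming the gpu object from your earlier snippet is still defined):

```python
# Sketch: send the working directory without the "pip:" modifier and add
# langchain to the server reqs, per the suggestion above.
from langchain.llms import SelfHostedHuggingFaceLLM

llm = SelfHostedHuggingFaceLLM(
    model_id="gpt2",
    hardware=gpu,  # the rh.cluster object defined earlier
    model_reqs=["./", "transformers", "torch", "langchain"],
)

# If the gRPC server went down from the earlier error, restart it first:
# gpu.restart_grpc_server()
```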

@dcavadia

> Great! Glad this worked.
>
> It's because your working directory (referenced in reqs by "./") is being detected as "ubuntu" (I assume your home directory), and the pip modifier in front of it is telling the grpc server to try to pip install it. Try changing "pip:./" to just "./" or "local:./" to avoid pip installing it. Sorry, the notebook you're using was inside the langchain directory (meaning langchain was the working directory) so it needed to be pip installed. You'll probably need to add "langchain" into the reqs too if you haven't already to make sure to install it on the server.
>
> If for some reason you're getting an error about gRPC not finding methods, the gRPC server on your instance went down from this error. You can restart it by running gpu.restart_grpc_server().

Thanks for the quick reply. I created a new instance and set it all up again with the ./ and langchain reqs, as SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["./", "transformers", "torch", "langchain"]), and it seems the setup finally succeeded.

I don't get a gRPC error, but it hangs at Running _generate_text via gRPC. I'm not sure if it's normal for the model integration to take >20 minutes on an A10 Lambda instance.

[screenshot]

@dongreenberg

Great! But no, it shouldn't take nearly that long with a small model like gpt2. One way to see what's happening on the server is to call the RPC with stream_logs=True (though that's not integrated into langchain in a user-facing way). Can you halt that and try running the following:

```python
llm.client(pipeline=llm.pipeline_ref, prompt="My prompt...", stream_logs=True)
```

If that doesn't work, there's a way to inspect the server logs directly that I can point you to. Thank you for bearing with us!

@dcavadia

Yes, it would be great if you could point me to where I can get the server logs directly. Thanks!

@dongreenberg

dongreenberg commented Feb 27, 2023

If you ssh into the cluster (you can just run ssh rh-a10 from your command line) and then type screen -r, you can view the screen in which the server is running. Just be careful not to Ctrl-C to exit or you'll kill the server (not a big deal, you can just restart it with the restart_grpc_server call I mentioned above). Ctrl-A D detaches from screen without killing the server. Happy to live debug too, I'm free pretty much the rest of the day.

@dcavadia

dcavadia commented Feb 27, 2023

Great, I can now look at the server logs. I noticed a message saying no available node can fulfill the resource request.

[screenshot of the server logs]

And this is the ray status within the server instance.
[screenshot of ray status output]

@dongreenberg

dongreenberg commented Feb 27, 2023

That would indeed cause the thread to hang. It's confusing why Ray would be halting that when the resources are clearly available in ray status. Could you try running gpu.restart_grpc_server(restart_ray=True)?

@dcavadia

That did something; now at least the message is sent, but it's still giving some errors.

```
INFO | 2023-02-27 19:58:17,693 | Running _generate_text via gRPC
INFO | 2023-02-27 19:58:18,463 | Time to send message: 0.77 seconds
ERROR | 2023-02-27 19:58:18,464 | Error inside function call: 'str' object is not callable.
ERROR | 2023-02-27 19:58:18,464 | Traceback: Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/runhouse/grpc_handler/unary_server.py", line 184, in RunModule
    res = call_fn_by_type(fn, fn_type, fn_name, module_path, args, kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/runhouse/rns/run_module_utils.py", line 28, in call_fn_by_type
    res = fn(*args, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/langchain/llms/self_hosted_hugging_face.py", line 31, in _generate_text
    response = pipeline(prompt, *args, **kwargs)
TypeError: 'str' object is not callable

ERROR | 2023-02-27 19:58:18,564 | Internal Python error in the inspect module.
Below is the traceback from this internal error.

TypeError: 'str' object is not callable

During handling of the above exception, another exception occurred:

AttributeError: 'TypeError' object has no attribute 'render_traceback'

During handling of the above exception, another exception occurred:

AssertionError
INFO | 2023-02-27 19:58:18,567 |
Unfortunately, your original traceback can not be constructed.
```

@dongreenberg

OK great - your local llm object is still using the pipeline reference string stored in the previous Ray KV store that we killed. You should be able to fix this by rerunning the cells that create the llm and LLMChain objects, which will recreate the pipeline in the Ray KV store.
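
In other words, something like this (a sketch, assuming the prompt and gpu objects from your earlier cells are still defined):

```python
# Sketch: rerun these cells so the pipeline is re-registered in the freshly
# restarted Ray KV store. Assumes prompt and gpu exist from the earlier cells.
from langchain import LLMChain
from langchain.llms import SelfHostedHuggingFaceLLM

llm = SelfHostedHuggingFaceLLM(
    model_id="gpt2",
    hardware=gpu,
    model_reqs=["./", "transformers", "torch", "langchain"],
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
```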

@dcavadia

> OK great - your local llm object is still using the pipeline reference string stored in the previous Ray KV store that we killed. You should be able to fix this by rerunning the cells that create the llm and LLMChain objects, which will recreate the pipeline in the Ray KV store.

Oh I see. I reran the cells, but I'm back to hanging at Running _generate_text via gRPC, now with different log info.

```
(raylet) [2023-02-27 20:17:05,913 E 68936 68936] (raylet) worker_pool.cc:502: Some workers of the worker process(69328) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet) Traceback (most recent call last):
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 101, in <module>
(raylet)     _configure_system()
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 98, in _configure_system
(raylet)     CDLL(so_path, ctypes.RTLD_GLOBAL)
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/ctypes/__init__.py", line 374, in __init__
(raylet)     self._handle = _dlopen(self._name, mode)
(raylet) OSError: /home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/_raylet.so: undefined symbol: _Py_CheckRecursionLimit
```

@dongreenberg

dongreenberg commented Feb 27, 2023

Ok that's a new one - notebooks are funny, I think something is sticking in memory. Would it be possible to restart the notebook kernel, run from the top, and run gpu.restart_grpc_server(restart_ray=True) after defining the gpu object? (Also, I'd just try running normally through langchain, not through llm.client with stream logs.)

@dcavadia


Yes, it's been funny. I even tried with a new instance; this is the code so far:

```python
from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM
from langchain import PromptTemplate, LLMChain
import runhouse as rh

gpu = rh.cluster(name="rh-a10", instance_type="A100:1").save()
gpu.restart_grpc_server(restart_ray=True)

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm = SelfHostedHuggingFaceLLM(model_id="gpt2", hardware=gpu, model_reqs=["./", "transformers", "torch", "langchain"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Beiber was born?"
llm_chain.run(question)
```

I'm trying to run this on my virtual machine instead, but I'm still setting that up while investigating this issue within the notebook.

@dongreenberg

Hm, that code makes sense, it won't run?

@dcavadia

> Hm, that code makes sense, it won't run?

It just gets hung at Running _generate_text via gRPC... Is it normal to see no resources when going to the https://api.run.house/ dashboard?

@dongreenberg

If you've logged into runhouse (i.e. you've run runhouse login and have your API token saved in the ~/.rh/config.yaml), calling .save() on any resource should save it in the resource naming system and it should show up in the dashboard. I see that you have no resources saved on my side as well. If you're not logged in, your resource metadata should be saving to an rh/ directory inside your working directory.

Would you mind confirming on the server whether the RPC hang is the Ray resource insufficiency again? If so, I'll raise it with the Ray team, because it looks like a bug.
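
For reference, the login-and-save flow is roughly this (a sketch; the cluster spec is just the one from your snippet):

```python
# Sketch, assuming `runhouse login` has already stored your token in ~/.rh/config.yaml.
# Saving the cluster should then make it show up in the dashboard; without a login,
# the metadata is written to an rh/ directory in your working directory instead.
import runhouse as rh

gpu = rh.cluster(name="rh-a10", instance_type="A10:1")
gpu.save()
```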

@dcavadia

> If you've logged into runhouse (i.e. you've run runhouse login and have your API token saved in the ~/.rh/config.yaml), calling .save() on any resource should save it in the resource naming system and it should show up in the dashboard. I see that you have no resources saved on my side as well. If you're not logged in, your resource metadata should be saving to an rh/ directory inside your working directory.
>
> Would you mind confirming on the server whether the RPC hang is the Ray resource insufficiency again? If so, I'll raise it with the Ray team, because it looks like a bug.

Oh I see. And yes, the resources problem doesn't seem to appear anymore. These are the logs from the server:

```
INFO | 2023-02-27 22:59:07,628 | Reloaded module langchain.llms.self_hosted_hugging_face
(raylet) [2023-02-27 22:59:41,974 E 69555 69555] (raylet) worker_pool.cc:502: Some workers of the worker process(69854) have not registered within the timeout. The process is dead, probably it crashed during start.
(raylet) Traceback (most recent call last):
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 8, in <module>
(raylet)     import ray
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 101, in <module>
(raylet)     _configure_system()
(raylet)   File "/home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/__init__.py", line 98, in _configure_system
(raylet)     CDLL(so_path, ctypes.RTLD_GLOBAL)
(raylet)   File "/home/ubuntu/miniconda3/lib/python3.10/ctypes/__init__.py", line 374, in __init__
(raylet)     self._handle = _dlopen(self._name, mode)
(raylet) OSError: /home/ubuntu/ubuntu/.local/lib/python3.8/site-packages/ray/_raylet.so: undefined symbol: _Py_CheckRecursionLimit
```

And this is the ray status:

```
======== Autoscaler status: 2023-02-27 22:57:07.933530 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_0ba826e055591e93b1eedf2ca00b44c0c8e2ac28fa7b77053bca62f9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 9.999999999976694e-05/30.0 CPU
 9.999999999998899e-05/1.0 GPU
 0.0/1.0 accelerator_type:A10
 0.00/127.547 GiB memory
 0.00/58.654 GiB object_store_memory

Demands:
 {'CPU': 0.0001, 'GPU': 0.0001}: 1+ pending tasks/actor
```

@dongreenberg

Perfect, thank you. I'll report this to Ray; it looks like a bug. The requested resources are clearly less than the available resources, so I'm not sure why Ray is blocking. I've run your code and it worked for me (also on Lambda):

[screenshot of successful run]

@dcavadia

dcavadia commented Mar 1, 2023

> Perfect, thank you. I'll report this to Ray; it looks like a bug. The requested resources are clearly less than the available resources, so I'm not sure why Ray is blocking. I've run your code and it worked for me (also on Lambda): [screenshot]

Mhm, can you make sure you are setting the exact same requirements as me?

```
pip install runhouse
pip install langchain
pip install git+https://github.com/skypilot-org/skypilot
pip install -U pyOpenSSL
mkdir -p ~/.lambda_cloud
echo "api_key = <your_api_key_here>" > ~/.lambda_cloud/lambda_keys
```

@dcavadia

dcavadia commented Mar 2, 2023

> Perfect, thank you. I'll report this to Ray; it looks like a bug. The requested resources are clearly less than the available resources, so I'm not sure why Ray is blocking. I've run your code and it worked for me (also on Lambda): [screenshot]

Let me know. I appreciate all the help so far.

@dongreenberg

dongreenberg commented Mar 3, 2023

Thanks for your patience and sorry for the delay. I filed the issue above with Ray. While filing it I noticed that your traceback shows both Python 3.10 (miniconda) and Python 3.8 (your user ~/.local install), and is probably calling different Ray versions through different layers. Do you know why that would be?
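
If it helps, here's a quick way to check (just a diagnostic sketch of mine, run in whichever environment the server uses):

```python
# Diagnostic sketch: print which interpreter and Ray installation are actually
# in use, to spot a miniconda (3.10) vs ~/.local (3.8) mismatch.
import sys

import ray

print("interpreter:", sys.executable)
print("python version:", sys.version)
print("ray version:", ray.__version__)
print("ray location:", ray.__file__)
```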

@dcavadia

dcavadia commented Mar 3, 2023

> Thanks for your patience and sorry for the delay. I filed the issue above with Ray. While filing it I noticed that your traceback shows both Python 3.10 (miniconda) and Python 3.8 (your user ~/.local install), and is probably calling different Ray versions through different layers. Do you know why that would be?

Interesting, I didn't notice that. I'm not sure why that would happen, but I'll dig into it right now. On the other hand, can you confirm you used these same libraries/requirements on your Lambda instance as me?

```
pip install runhouse
pip install langchain
pip install git+https://github.com/skypilot-org/skypilot
pip install -U pyOpenSSL
mkdir -p ~/.lambda_cloud
echo "api_key = <your_api_key_here>" > ~/.lambda_cloud/lambda_keys
```

Thanks
