Uncaught error when bringing up on-demand GCP cluster with invalid image_id #100

Open
steve-marmalade opened this issue Sep 5, 2023 · 0 comments

steve-marmalade commented Sep 5, 2023

Hi team, the Runhouse docs for on-demand clusters were not very clear about the format of image_id. Helpfully, my initial attempts to bring up a GCP cluster with e.g. image_id="pytorch-latest-cpu" (taken from the GCP docs) raised a clear error: ValueError: Image 'pytorch-latest-cpu' not found in GCP.

I ended up going into the SkyPilot repo for clarification and found a GCP example in their yaml-spec: projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004
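For reference, the fully qualified form in that example can be checked locally before anything is submitted. The snippet below is only a sketch: `parse_gcp_image_id` is a hypothetical helper (not part of Runhouse or SkyPilot), and it assumes the two standard Compute Engine path shapes, `projects/<project>/global/images/<image>` and `projects/<project>/global/images/family/<family>`. It validates the shape of the string only, not whether the image actually exists.

```python
import re

# Matches projects/<project>/global/images/<image>
# and     projects/<project>/global/images/family/<family>
_GCP_IMAGE_RE = re.compile(
    r"projects/(?P<project>[^/]+)/global/images/(?P<family>family/)?(?P<name>[^/]+)"
)


def parse_gcp_image_id(image_id):
    """Hypothetical helper: split a GCP image path into (project, kind, name).

    kind is "family" when the path contains the family/ segment, else "image".
    Raises ValueError for anything that is not a fully qualified path.
    """
    match = _GCP_IMAGE_RE.fullmatch(image_id)
    if match is None:
        raise ValueError(
            f"image_id {image_id!r} is not of the form "
            "projects/<project>/global/images/[family/]<name>"
        )
    kind = "family" if match.group("family") else "image"
    return match.group("project"), kind, match.group("name")


project, kind, name = parse_gcp_image_id(
    "projects/deeplearning-platform-release/global/images/family/tf2-ent-2-1-cpu-ubuntu-2004"
)
print(project, kind, name)
```

A bare name such as "pytorch-latest-cpu" is rejected by this check with a ValueError, which would at least surface a format problem before a launch is attempted.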

I modified the above for the image I wanted, projects/deeplearning-platform-release/global/images/family/pytorch-1-13-cpu-v20230807-debian-11-py310. Runhouse allowed me to submit this, but the launch hung until it timed out, and I saw no indication in the GCP Console that the instance was coming up.

I tried to run a similar command via sky launch and saw the error, which I reported to them in this GitHub issue. I am raising it here as well in case you want to update your wrapping code to catch this error.
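To illustrate the kind of handling I have in mind, here is a rough sketch of a wrapper that turns either an exception or a hang into one clear error. Everything in it is an assumption for illustration: `launch_fn` is a placeholder for whatever actually provisions the cluster, and `ClusterLaunchError` and `launch_with_clear_errors` are hypothetical names, not Runhouse or SkyPilot APIs.

```python
import threading


class ClusterLaunchError(RuntimeError):
    """Hypothetical error type for a failed or stalled cluster launch."""


def launch_with_clear_errors(launch_fn, image_id, timeout_s=600.0):
    """Run launch_fn(image_id); surface failures and hangs as ClusterLaunchError.

    launch_fn is a placeholder callable, not a real Runhouse/SkyPilot function.
    """
    outcome = {}

    def _run():
        try:
            outcome["result"] = launch_fn(image_id)
        except Exception as exc:  # stored here, re-raised in the caller's thread
            outcome["error"] = exc

    # Daemon thread so a stuck launch cannot keep the interpreter alive.
    worker = threading.Thread(target=_run, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        raise ClusterLaunchError(
            f"Launch with image_id={image_id!r} did not finish within {timeout_s}s"
        )
    if "error" in outcome:
        raise ClusterLaunchError(
            f"Launch with image_id={image_id!r} failed: {outcome['error']}"
        ) from outcome["error"]
    return outcome["result"]
```

Note that the timeout only stops waiting; it does not cancel the underlying launch, so real wrapping code would probably also need to clean up any partially created resources.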

Versions
Please run the following and paste the output below.

Python Platform: Linux-6.4.12-arch1-1-x86_64-with-glibc2.38
Python Version: 3.10.13 (main, Sep  4 2023, 15:52:34) [GCC 13.2.1 20230801]

Relevant packages: 
boto3==1.28.40
fastapi==0.103.1
fsspec==2023.5.0
gcsfs==2023.5.0
google-api-python-client==2.97.0
google-cloud-storage==2.10.0
pyarrow==13.0.0
pycryptodome==3.12.0
rich==13.5.2
runhouse==0.0.11
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.41.2

Checking credentials to enable clouds for SkyPilot.
  AWS: disabled          
    Reason: AWS credentials are not set. Run the following commands:
      $ pip install boto3
      $ aws configure
      $ aws configure list  # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
    Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
  Azure: disabled          
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: enabled          
  Lambda: disabled          
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  IBM: disabled          
    Reason: Missing credential file at /home/user/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  SCP: disabled          
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  OCI: disabled          
    Reason: `oci` is not installed. Install it with: pip install oci
    For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  Cloudflare (for R2 object store): disabled          
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
NAME         LAUNCHED     RESOURCES                                                                  STATUS  AUTOSTOP  COMMAND                       

Managed spot jobs
No in progress jobs. (See: sky spot -h)

