TPI website basic example is fragile #640

tasdomas · 2022-08-05T12:19:58Z

I found this while testing out basic example in the README

If a user were to set spot to anything other than 0, the task will fail to start:

TPI [INFO] LOG 0 >> 2022-08-05 11:20:29 Started tpi-task.service.                           
TPI [INFO] LOG 0 >> 2022-08-05 11:20:29 (re)starting training loop from 1337 up to 1337 epochs
TPI [INFO] LOG 0 >> 2022-08-05 11:20:30 1337                                                
TPI [INFO] LOG 0 >> 2022-08-05 11:20:50 Error: invalid argument "0.05" for "--spot" flag: strconv.ParseBool: parsing "0.05": invalid syntax
TPI [INFO] LOG 0 >> 2022-08-05 11:20:50 tpi-task.service: Control process exited, code=exited, status=1/FAILURE
TPI [INFO] LOG 0 >> 2022-08-05 11:20:50 tpi-task.service: Failed with result 'exit-code'.

An example main.tf that reproduces this (the only line changed is spot = 0.05):

terraform {
  required_providers { iterative = { source = "github.com/iterative/iterative" } }
}
provider "iterative" {}

resource "iterative_task" "example" {
  cloud      = "az" # or any of: gcp, az, k8s
  machine    = "m"   # medium. Or any of: l, xl, m+k80, xl+v100, ...
  spot       = 0.05     # auto-price. Default -1 to disable, or >0 for hourly USD limit
  disk_size  = -1    # GB. Default -1 for automatic
  permission_set = "/subscriptions/cee76754-ef49-49f7-b371-f6841fa82182/resourceGroups/domas-test-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/domas-test-managed-identity"
  storage {
    workdir = "."       # default blank (don't upload)
    output  = "results" # default blank (don't download). Relative to workdir
  }
  script = <<-END
    #!/bin/bash

    # create output directory if needed
    mkdir -p results
    # read last result (in case of spot/preemptible instance recovery)
    if test -f results/epoch.txt; then EPOCH="$(cat results/epoch.txt)"; fi
    EPOCH=$${EPOCH:-1}  # start from 1 if last result not found

    echo "(re)starting training loop from $EPOCH up to 1337 epochs"
    for epoch in $(seq $EPOCH 1337); do
      sleep 1
      echo "$epoch" | tee results/epoch.txt
    done
  END
}

This is caused by multiple decisions at multiple levels:

the tpi binary attempts to interpret a main.tf file for parameters
the tpi binary expects spot to parse as a boolean

There are several ways to address this, including updating the basic terraform configuration example, but I think the way tpi parses the spot field needs to be changed.

The text was updated successfully, but these errors were encountered:

0x2b3bfa0 · 2022-08-05T12:37:01Z

We could set spot to Float64Var instead of BoolVar so it's consistent with the legacy Terraform schema.

terraform-provider-iterative/cmd/create/create.go

Line 53 in 142cd5b

cmd.Flags().BoolVar(&o.Spot, "spot", false, "use spot instances")

0x2b3bfa0 · 2022-08-05T12:39:30Z

However, asking users to pass --spot=0 in order to enable automatic spot pricing seems rather counterintuitive to me. I'd even ask myself whether allowing specifying a fixed spot price is a good idea. 🤔

casperdcl · 2022-08-05T13:18:53Z

whether allowing specifying a fixed spot price is a good idea

Do all clouds guarantee default(auto) spot pricing < on-demand?

tasdomas · 2022-08-05T13:54:14Z

I think the root cause of this is using the same binary for the CLI tool, for instance shutdown in provisioned machines and as a terraform provider plugin. I don't think there would be any harm in separating the three. The only blocking issue would be finding proper names for the three..

0x2b3bfa0 · 2022-08-05T22:09:33Z

Do all clouds guarantee default(auto) spot pricing < on-demand?

✅ `aws`

No maximum price: Your Spot Instance will launch at the current Spot price. The price will never exceed the On-Demand price. (Recommended)

✅ `az`

The VM will not be evicted for pricing reasons. The max price will be the current price, up to the price for standard VMs. You will never be charged above the standard price.

❓ `gcp`

Not applicable

0x2b3bfa0 · 2022-08-05T22:17:54Z

I think the root cause¹ of this is using the same binary for the CLI tool, for instance shutdown in provisioned machines and as a terraform provider plugin.

Using the same binary as a Terraform provider and as a command line tool is definitely a bizarre choice. Still, it's not directly related to the float64 versus bool conundrum; API should look the same everywhere if it makes sense.

I don't think there would be any harm in separating the three.

The devil is in the details, but yes, task can be safely extracted as a module without changing a single byte of the code apart from the name.

The only blocking issue would be finding proper names for the three.

Not sure if it's “the only” issue, but it's probably the biggest one.

Note that root causes are rarely the root causes you're looking for. 😅 ↩

casperdcl · 2022-08-08T11:24:44Z

so something like this?

long term (major relase)
- everywhere: on_demand: bool=True
intermediate
- task: spot: float=-1
- standalone binary: on_demand: bool=True

0x2b3bfa0 · 2022-08-08T11:29:09Z

Given that spot is a widely used term, I'd consider using it instead of more creative alternatives like !on_demand 🤔

casperdcl · 2022-08-08T11:34:26Z

I thought GCP uses s/spot/preemptible/ and s/on.demand/standard/ :)

0x2b3bfa0 · 2022-08-08T11:37:18Z

Google Cloud followed the spot naming trend recently. 😅

casperdcl · 2022-08-08T11:37:46Z

Anyway the point of renaming is mostly to retain some backward-compatibility?
The alternative -- spot: bool=False -- will silently break existing user workflows that have spot: float (presumably the float will get cast to a bool in the most undesirable way?)

0x2b3bfa0 · 2022-08-08T11:41:03Z

the point of renaming is mostly to retain some backward-compatibility

Isn't it a bit too early to get started with “perfect backwards compatibility” on this project?

casperdcl · 2022-08-08T11:53:50Z

added to #641 :)

DavidGOrtega · 2022-08-19T07:03:34Z

However, asking users to pass --spot=0 in order to enable automatic spot pricing seems rather counterintuitive to me.

Was your idea 😁 machine and runner have that term as spot and spot_price @0x2b3bfa0

0x2b3bfa0 · 2022-08-19T10:33:20Z

Guilty as charged! 😅 Exposing two separate options didn't seem to me a good option either.

I wonder if we should just expose --spot=<boolean> and leave pricing to cloud providers. Who wants to set it manually anyway?¹

Not a haphazard assumption anymore, let's check telemetry data. ↩

casperdcl added leo standalone CLI binary resource-task iterative_task TF resource and removed leo standalone CLI binary labels Aug 5, 2022

casperdcl mentioned this issue Aug 8, 2022

next major release #641

Open

8 tasks

tasdomas mentioned this issue Oct 5, 2022

Separate leo and tpi executables #662

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TPI website basic example is fragile #640

TPI website basic example is fragile #640

tasdomas commented Aug 5, 2022

0x2b3bfa0 commented Aug 5, 2022

0x2b3bfa0 commented Aug 5, 2022

casperdcl commented Aug 5, 2022 •

edited

tasdomas commented Aug 5, 2022

0x2b3bfa0 commented Aug 5, 2022 •

edited

0x2b3bfa0 commented Aug 5, 2022

casperdcl commented Aug 8, 2022 •

edited

0x2b3bfa0 commented Aug 8, 2022

casperdcl commented Aug 8, 2022 •

edited

0x2b3bfa0 commented Aug 8, 2022

casperdcl commented Aug 8, 2022

0x2b3bfa0 commented Aug 8, 2022

casperdcl commented Aug 8, 2022

DavidGOrtega commented Aug 19, 2022 •

edited

0x2b3bfa0 commented Aug 19, 2022

TPI website basic example is fragile #640

TPI website basic example is fragile #640

Comments

tasdomas commented Aug 5, 2022

0x2b3bfa0 commented Aug 5, 2022

0x2b3bfa0 commented Aug 5, 2022

casperdcl commented Aug 5, 2022 • edited

tasdomas commented Aug 5, 2022

0x2b3bfa0 commented Aug 5, 2022 • edited

✅ aws

✅ az

❓ gcp

0x2b3bfa0 commented Aug 5, 2022

Footnotes

casperdcl commented Aug 8, 2022 • edited

0x2b3bfa0 commented Aug 8, 2022

casperdcl commented Aug 8, 2022 • edited

0x2b3bfa0 commented Aug 8, 2022

casperdcl commented Aug 8, 2022

0x2b3bfa0 commented Aug 8, 2022

casperdcl commented Aug 8, 2022

DavidGOrtega commented Aug 19, 2022 • edited

0x2b3bfa0 commented Aug 19, 2022

Footnotes

casperdcl commented Aug 5, 2022 •

edited

0x2b3bfa0 commented Aug 5, 2022 •

edited

✅ `aws`

✅ `az`

❓ `gcp`

casperdcl commented Aug 8, 2022 •

edited

casperdcl commented Aug 8, 2022 •

edited

DavidGOrtega commented Aug 19, 2022 •

edited