Releases: kubernetes-sigs/kueue

Kueue v0.7.0-rc.2

30 May 18:08
v0.7.0-rc.2
e2e2314
Pre-release

Changes since v0.6.0:

Urgent Upgrade Notes

(No, really, you MUST read this before you upgrade)

  • Added CRD validation rules to AdmissionCheck.

    Requires Kubernetes 1.25 or newer (#1975, @IrvingMg)

  • Added CRD validation rules to ClusterQueue.

    Requires Kubernetes 1.25 or newer (#1972, @IrvingMg)

  • Added CRD validation rules to LocalQueue.

    Requires Kubernetes 1.25 or newer (#1938, @IrvingMg)

  • Added CRD validation rules to ResourceFlavor.

    Requires Kubernetes 1.25 or newer (#1958, @IrvingMg)

  • Added CRD validation rules to Workload.

    Requires Kubernetes 1.25 or newer (#2008, @IrvingMg)

  • Increased the default value of .waitForPodsReady.requeuingStrategy.backoffBaseSeconds to 60

    You can configure .waitForPodsReady.requeuingStrategy.backoffBaseSeconds as needed. (#2251, @mbobrovskyi)

  • Upgrade RayJob API to v1

    If you use KubeRay older than v1.0.0, you'll have to upgrade your existing installation
    to KubeRay v1.0.0, or any more recent version, that supports KubeRay v1 APIs, for it to
    remain compatible with Kueue. (#1802, @astefanutti)

  • When admission checks are used and are not yet satisfied, the reason for the Admitted condition with status=False is now
    UnsatisfiedChecks

    If you were watching for the reason NoChecks in the Admitted condition, use UnsatisfiedChecks instead. (#2150, @trasc)
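
    For tooling that watches Workload conditions, the reason change above could be handled along these lines. This is a minimal sketch, assuming the conditions are available as a list of dicts with type, status and reason keys (the usual Kubernetes condition shape); the function name waiting_on_admission_checks and the sample data are hypothetical and not part of Kueue.

```python
def waiting_on_admission_checks(conditions):
    """Return True if the Admitted condition is False because checks are unsatisfied."""
    for cond in conditions:
        if cond.get("type") != "Admitted":
            continue
        # Accept both reasons so the watcher works before and after the upgrade.
        return cond.get("status") == "False" and cond.get("reason") in (
            "UnsatisfiedChecks",  # reason reported from this release on
            "NoChecks",           # reason reported by earlier releases
        )
    return False


# Hypothetical sample of a Workload's status.conditions.
sample = [
    {"type": "QuotaReserved", "status": "True", "reason": "QuotaReserved"},
    {"type": "Admitted", "status": "False", "reason": "UnsatisfiedChecks"},
]
print(waiting_on_admission_checks(sample))
```

    Matching on both reasons keeps such a watcher usable while clusters are upgraded one at a time; the NoChecks branch can be dropped once every cluster runs the new release.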

API Change

  • Make ClusterQueue queueingStrategy field mutable. The field can be mutated while there are pending workloads. (#1934, @mimowo)
  • User can now pass parameters to ProvisioningRequest using job's annotations (#1869, @PBundyra)

Feature

  • A new condition with type Preempted makes it possible to distinguish the different reasons for a preemption (#1942, @mimowo)

  • Add configuration to register Kinds as being managed by an external Kueue-compatible controller (#2059, @dgrove-oss)

  • Add fair sharing when borrowing unused resources from other ClusterQueues in a cohort.

    Fair sharing is based on DRF for usage above nominal quotas.
    When fair sharing is enabled, Kueue prefers to admit workloads from ClusterQueues with the lowest share first.
    Administrators can enable and configure fair sharing preemption using a combination of two policies: LessThanOrEqualToFinalShare, LessThanInitialShare.

    You can define a fair sharing weight for ClusterQueues. The weight determines how much of the unused resources each ClusterQueue can take in comparison to others. (#2070, @alculquicondor)

  • Add metric evicted_workloads: the number of evicted workloads per 'cluster_queue' (#1955, @lowang-bh)

  • Add recommended Kubernetes labels to uniquely identify Pods and other resources installed with Kueue.
    The Deployment selector remains unchanged to allow for a seamless upgrade. (#1695, @astefanutti)

  • Added label copying from Pod/Job into the Kueue Workload. (#1959, @pajakd)

  • Added non-negative validations for the ".queueVisibility.clusterQueues.maxCount" in the Configuration. (#2309, @tenzen-y)

  • Added validations for the ".internalCertManagement" in the Configuration. (#2169, @tenzen-y)

  • Added validations for the ".multiKueue.origin", ".multiKueue.gcInterval" and the ".multiKueue.workerLostTimeout" in the Configuration. (#2129, @tenzen-y)

  • Added validations for the ".waitForPodsReady.timeout" in the Configuration. (#2214, @tenzen-y)

  • Adds ObservedGeneration in conditions (#1939, @vladikkuzn)

  • Adds the BackoffMaxSeconds property to limit the retry period length for re-queueing workloads. (#2264, @IrvingMg)

  • Allow for workload.spec.podSet.[*].count to be 0 (#2268, @mszadkow)

  • CLI: Add command to list ClusterQueues (#2156, @vladikkuzn)

  • CLI: Add commands to stop and resume a ClusterQueue (#2200, @vladikkuzn)

  • CLI: Add a kubectl kueue plugin that lets you create LocalQueues without writing YAML. (#2027, @mbobrovskyi)

  • CLI: Add list LocalQueue command (#2157, @mbobrovskyi)

  • CLI: Add stop/resume workload commands (#2134, @mbobrovskyi)

  • CLI: Add validation for ClusterQueue on creating LocalQueue (#2122, @mbobrovskyi)

  • CLI: Added list workloads command. (#2195, @mbobrovskyi)

  • CLI: Added pass-through commands support in kubectl-kueue for get, describe, edit, patch and delete. (#2181, @trasc)

  • Helm: Allow configuration of ipFamilyPolicy for ipDualStack Kubernetes clusters (#1933, @dongjiang1989)

  • Helm: Allow configuration of custom annotations on Service and Deployment's Pod (#2030, @tozastation)

  • Improve metrics related to workload's quota reservation and admission:

    • fix admission_wait_time_seconds - to measure the time to "Admitted" condition since creation time or last requeue (as opposed to the "QuotaReserved" condition as before)
    • add quota_reserved_wait_time_seconds - measures time to "QuotaReserved" condition since creation time, or last eviction time
    • add quota_reserved_workloads_total - counts the number of workloads that got admitted
    • admission_checks_wait_time_seconds - measures the time to admit a workload with admission checks since quota reservation
    • use longer buckets (up to 10240s) for histogram metrics: admission_wait_time_seconds, quota_reserved_wait_time_seconds, admission_checks_wait_time_seconds (#1977, @mbobrovskyi)
  • Improve the kubectl output for workloads using admission checks. (#1991, @vladikkuzn)

  • Make the PodsReady base delay for requeuing configurable (#2040, @mimowo)

  • MultiKueue: Manage worker cluster unavailability (#1681, @trasc)

  • MultiKueue: Add support for JobSet spec.managedBy field (#1870, @trasc)

  • MultiKueue: Add the managedBy field to JobSets assigned to a ClusterQueue configured for MultiKueue (#2048, @vladikkuzn)

  • MultiKueue: Add worker connection monitoring and reconnect (#1806, @trasc)

  • Pod Integration: Add condition WaitingForReplacementPods to Workloads of pod groups with incomplete number of pods (#2234, @mbobrovskyi)

  • Pod Integration: Improve performance (#1952, @gabesaba)

  • Pod Integration: The reason for stopping a pod is now specified in the pod TerminationTarget condition (#2160, @pajakd)

  • Pods created by Kueue now have the ProvisioningRequest's classname annotation (#2052, @PBundyra)

  • ProvisioningRequest: Graduated to Beta and enabled by default (#1968, @pajakd)

  • ProvisioningRequest: Propagate the message for a ProvisioningRequest being provisioned (which might include an ETA, depending on the implementation) to the Workload status (#2007, @pajakd)

  • Show fair share of a CQ in status and a metric (#2276, @mbobrovskyi)

  • Updates in admission check messages are recorded as events for jobs/pods. (#2147, @pajakd)

  • Workload finished reason replaced with succeeded and failed reasons (#2026, @vladikkuzn)

  • You can configure Kueue to ignore container resources that match specified prefixes. (#2267, @pajakd)

  • You can define AdmissionChecks per ResourceFlavor in the ClusterQueue API, using admissionChecksStrategy (#1960, @PBundyra)
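
    The fair sharing entry in the list above describes a DRF-style share computed from usage above nominal quota, scaled by a per-ClusterQueue weight. The sketch below is one way to picture that idea and is not Kueue's implementation: the function dominant_share, the normalization by the cohort's total resources, and all sample numbers are assumptions made for illustration.

```python
def dominant_share(usage, nominal, cohort_total, weight=1.0):
    """Largest above-nominal usage fraction across resources, divided by weight.

    Assumes weight is positive. All arguments are hypothetical inputs.
    """
    fractions = []
    for resource, total in cohort_total.items():
        # Only usage above the nominal quota counts towards the share.
        borrowed = max(0, usage.get(resource, 0) - nominal.get(resource, 0))
        fractions.append(borrowed / total if total > 0 else 0.0)
    return max(fractions, default=0.0) / weight


cohort_total = {"cpu": 100, "memory": 400}
shares = {
    "team-a": dominant_share(
        {"cpu": 30, "memory": 40}, {"cpu": 10, "memory": 40}, cohort_total
    ),
    "team-b": dominant_share(
        {"cpu": 20, "memory": 140}, {"cpu": 20, "memory": 40}, cohort_total, weight=2.0
    ),
}
# The ClusterQueue with the lowest share would be preferred for admission.
preferred = min(shares, key=shares.get)
print(shares, preferred)
```

    In this made-up example team-b borrows a larger fraction of the cohort, but its weight of 2 halves its share, so it still ranks ahead of team-a.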

Bug or Regression

  • Avoid unnecessary preemptions when there are multiple candidates for preemption with the same admission timestamp (#1875, @alculquicondor)

  • Change the default pprof port to 8083 to fix a bug that causes conflicting listening ports between pprof and the visibility server. (#2228, @amy)

  • Check the containers limits for used resources in provisioning admission check controller and include them in the ProvisioningRequest as requests (#2286, @trasc)

  • Do not default to suspending a job whose parent is already managed by Kueue (#1846, @astefanutti)

  • Fix handling of eviction in StrictFIFO to ensure the evicted workload is in the head.
    Previously, in case of priority-based preemption, it was possible that the lower-priority
    workload might get admitted while the higher priority workload is being evicted. (#2061, @mimowo)

  • Fix incorrect quota management when lendingLimit enabled in preemption (#1770, @kerthcet)

  • Fix preemption algorithm to reduce the number of preemptions within a ClusterQueue when reclamation is not possible, and when using .preemption.borrowWithinCohort (#2110, @alculquicondor)

  • Fix preemption algorithm to reduce the number of preemptions within a ClusterQueue when reclamation is not possible. (#1979, @mimowo)

  • Fix preemption to reclaim quota that is blocked by an earlier pending Workload from another ClusterQueue in the same cohort. (#1866, @alculquicondor)

  • Fix support for MPIJobs when using a ProvisioningRequest engine that applies updates only to worker templates. (#2265, @trasc)

  • Fix the counter of pending workloads in cluster queue status.

    The counter would not count the head workload for StrictFIFO queues, if the workload cannot get admitted.

    This change also includes the blocked workload in the metrics and the visibility API for the list of pending workloads. (#1936, @mimowo)

  • Fix the resource requests computation taking into account sidecar containers. (#2099, @IrvingMg)

  • Helm: Fix a bug that prevented Kueue from working with cert-manager. (#2087, @EladDolev)

  • Helm: Fix a bug where the configuration for integrations.podOptions.namespaceSelector didn't have an effect due to indentation issues. (#2086, @EladDolev)

  • Helm: Fix chart values configuration for the number of reconcilers for the Pod integration. (#2046, @alculquicondor)

  • Kueue visibility API is no longer installed by default. Users can install it via helm or applying the visibility-api.yaml artifact. (#1746, @trasc)

  • Make the defaults for PodsReadyTimeout backoff more practical. With the original values,
    the first couple of requeues appeared immediate to users (below 10s, which
    is negligible compared to the time spent waiting for PodsReady).

    The default values for the formula that determines the exponential backoff are changed as follows:

    • base 1s -> 10s
    • exponent: 1.41284738 -> 2
      So, now the consecutive times to requeue a workload are...

Kueue v0.6.3

28 May 18:45
v0.6.3
93154fb

Changes since v0.6.2:

Feature

  • Improve the kubectl output for workloads using admission checks. (#2014, @vladikkuzn)

Bug or Regression

  • Change the default pprof port to 8083 to fix a bug that causes conflicting listening ports between pprof and the visibility server. (#2232, @amy)

  • Check the containers limits for used resources in provisioning admission check controller and include them in the ProvisioningRequest as requests (#2293, @trasc)

  • Consider deleted pods without spec.nodeName inactive and subject to pod replacement. (#2217, @trasc)

  • Fix a bug that causes the reactivated Workload to be immediately deactivated even though it doesn't exceed the backoffLimit. (#2220, @tenzen-y)

  • Fix a bug where the ".waitForPodsReady.requeuingStrategy.backoffLimitCount" was ignored when the ".waitForPodsReady.requeuingStrategy.timestamp" was not set. (#2224, @tenzen-y)

  • Fix chart values configuration for the number of reconcilers for the Pod integration. (#2050, @alculquicondor)

  • Fix handling of eviction in StrictFIFO to ensure the evicted workload is in the head.
    Previously, in case of priority-based preemption, it was possible that the lower-priority
    workload might get admitted while the higher priority workload is being evicted. (#2081, @mimowo)

  • Fix preemption algorithm to reduce the number of preemptions within a ClusterQueue when reclamation is not possible, and when using .preemption.borrowWithinCohort (#2111, @alculquicondor)

  • Fix support for MPIJobs when using a ProvisioningRequest engine that applies updates only to worker templates. (#2281, @trasc)

  • Fix support for jobset v0.5.x (#2271, @alculquicondor)

  • Fix the resource requests computation taking into account sidecar containers. (#2159, @IrvingMg)

  • Helm Chart: Fix a bug that prevented Kueue from working with cert-manager. (#2098, @EladDolev)

  • Helm Chart: Fix a bug where integrations.podOptions.namespaceSelector was not propagated. (#2095, @EladDolev)

  • JobFramework: The eviction by inactivation mechanism was moved to the workload controller.

    This fixes a problem where pod groups would remain with condition QuotaReserved set to True when replacement pods are missing. (#2229, @mbobrovskyi)

  • Make the defaults for PodsReadyTimeout backoff more practical. With the original values,
    the first couple of requeues appeared immediate to users (below 10s, which
    is negligible compared to the time spent waiting for PodsReady).

    The default values for the formula that determines the exponential backoff are changed as follows:

    • base 1s -> 10s
    • exponent: 1.41284738 -> 2
      So, now the consecutive times to requeue a workload are: 10s, 20s, 40s, ... (#2033, @mimowo)
  • MultiKueue: Fix a bug that could delay a cluster joining when its MultiKueueCluster is created. (#2167, @trasc)

  • Prevent Pod from being deleted when admitted via ProvisioningRequest that has pod updates on tolerations (#2262, @vladikkuzn)

  • Use PATCH updates for pods. This fixes support for Pods when using the latest features in Kubernetes v1.29 (#2089, @mbobrovskyi)
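
    The PodsReadyTimeout backoff entry in the list above gives a base of 10s and an exponent of 2, producing requeue times of 10s, 20s, 40s, and so on. A minimal sketch of that arithmetic follows, assuming the delay for the n-th requeue is base * exponent^(n-1); the function requeue_delays is hypothetical, and applying the same formula to the old defaults is an assumption.

```python
def requeue_delays(base_seconds, exponent, attempts):
    """Delay before each consecutive requeue: base * exponent**(n-1)."""
    return [base_seconds * exponent ** n for n in range(attempts)]


# New defaults from the release note: base 10s, exponent 2.
new_delays = requeue_delays(10, 2, 3)
print(new_delays)  # → [10, 20, 40]

# Old defaults: base 1s, exponent 1.41284738. The first requeues all
# land well under 10s, which is why they looked immediate.
old_delays = requeue_delays(1, 1.41284738, 3)
print(old_delays)
```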

Other (Cleanup or Flake)

  • Correctly log workload status for workloads with quota reserved, but awaiting admission checks. (#2080, @mimowo)

Kueue v0.7.0-rc.1

08 May 18:29
v0.7.0-rc.1
515c225
Pre-release

Changes since v0.6.0:

Urgent Upgrade Notes

(No, really, you MUST read this before you upgrade)

  • Added CRD validation rules to AdmissionCheck.

    Requires Kubernetes 1.25 or newer (#1975, @IrvingMg)

  • Added CRD validation rules to ClusterQueue.

    Requires Kubernetes 1.25 or newer (#1972, @IrvingMg)

  • Added CRD validation rules to ResourceFlavor.

    Requires Kubernetes 1.25 or newer (#1958, @IrvingMg)

  • Added CRD validation rules to Workload.

    Requires Kubernetes 1.25 or newer (#2008, @IrvingMg)

  • Replaced LocalQueue admission webhook with CRD validation rules.

    Requires Kubernetes 1.25 or newer (#1938, @IrvingMg)

  • Upgrade RayJob API to v1

    If you use KubeRay older than v1.0.0, you'll have to upgrade your existing installation
    to KubeRay v1.0.0, or any more recent version, that supports KubeRay v1 APIs, for it to
    remain compatible with Kueue. (#1802, @astefanutti)

  • Use recommended labels and a uniquely identifying selector for Kueue deployment resources.

    You need to recreate the Kueue deployment if you had it previously installed,
    as the label selector field is immutable. (#1695, @astefanutti)

Changes by Kind

API Change

  • Make ClusterQueue queueingStrategy field mutable. The field can be mutated while there are pending workloads. (#1934, @mimowo)
  • User can now pass parameters to ProvisioningRequest using job's annotations (#1869, @PBundyra)

Feature

  • A new condition with type Preempted makes it possible to distinguish the different reasons for a preemption (#1942, @mimowo)

  • Add MultiKueue support for JobSet spec.managedBy field. (#1870, @trasc)

  • Add configuration to register Kinds as being managed by an external Kueue-compatible controller (#2059, @dgrove-oss)

  • Add fair sharing when borrowing unused resources from other ClusterQueues in a cohort.

    Fair sharing is based on DRF for usage above nominal quotas.
    When fair sharing is enabled, Kueue prefers to admit workloads from ClusterQueues with the lowest share first.
    Administrators can enable and configure fair sharing preemption using a combination of two policies: LessThanOrEqualToFinalShare, LessThanInitialShare.

    You can define a fair sharing weight for ClusterQueues. The weight determines how much of the unused resources each ClusterQueue can take in comparison to others. (#2070, @alculquicondor)

  • Add a kubectl kueue plugin that lets you create LocalQueues without writing YAML. (#2027, @mbobrovskyi)

  • Allow configuration of ipFamilyPolicy for ipDualStack Kubernetes clusters (#1933, @dongjiang1989)

  • Allow configuration of custom annotations on Service and Deployment's Pod (#2030, @tozastation)

  • Added MultiKueue worker connection monitoring and reconnect. (#1806, @trasc)

  • Added label copying from Pod/Job into the Kueue Workload. (#1959, @pajakd)

  • Added scalability test for scheduling performance (#1931, @trasc)

  • Added validations for the ".multiKueue.origin", ".multiKueue.gcInterval" and the ".multiKueue.workerLostTimeout" in the Configuration. (#2129, @tenzen-y)

  • Adds ObservedGeneration in conditions (#1939, @vladikkuzn)

  • Improve metrics related to workload's quota reservation and admission:

    • fix admission_wait_time_seconds - to measure the time to "Admitted" condition since creation time or last requeue (as opposed to the "QuotaReserved" condition as before)
    • add quota_reserved_wait_time_seconds - measures time to "QuotaReserved" condition since creation time, or last eviction time
    • add quota_reserved_workloads_total - counts the number of workloads that got admitted
    • admission_checks_wait_time_seconds - measures the time to admit a workload with admission checks since quota reservation
    • use longer buckets (up to 10240s) for histogram metrics: admission_wait_time_seconds, quota_reserved_wait_time_seconds, admission_checks_wait_time_seconds (#1977, @mbobrovskyi)
  • Improve pod integration performance (#1952, @gabesaba)

  • Improve the kubectl output for workloads using admission checks. (#1991, @vladikkuzn)

  • Make the PodsReady base delay for requeuing configurable (#2040, @mimowo)

  • MultiKueue - Manage worker cluster unavailability (#1681, @trasc)

  • Pods created by Kueue now have the ProvisioningRequest's classname annotation (#2052, @PBundyra)

  • Provisioning Admission Check Controller (ProvisioningACC) feature is now enabled by default (#1968, @pajakd)

  • The message for a ProvisioningRequest being provisioned (which might include an ETA, depending on the implementation) is now propagated to workloads. (#2007, @pajakd)

  • Use PATCH updates for pods. This fixes support for Pods when using the latest features in Kubernetes v1.29 (#2074, @mbobrovskyi)

  • Users can define AdmissionChecks per ResourceFlavor in the ClusterQueue API, using admissionChecksStrategy. (#1960, @PBundyra)

  • Workload finished reason replaced with succeeded and failed reasons (#2026, @vladikkuzn)

Bug or Regression

  • Avoid unnecessary preemptions when there are multiple candidates for preemption with the same admission timestamp (#1875, @alculquicondor)

  • Do not default to suspending a job whose parent is already managed by Kueue (#1846, @astefanutti)

  • Exclude Pod labels, preemptionPolicy and container images when determining whether pods in a pod group have the same shape. (#1758, @alculquicondor)

  • Fix Pods in Pod groups stuck with finalizers when deleted immediately after Succeeded (#1905, @alculquicondor)

  • Fix chart values configuration for the number of reconcilers for the Pod integration. (#2046, @alculquicondor)

  • Fix handling of eviction in StrictFIFO to ensure the evicted workload is in the head.
    Previously, in case of priority-based preemption, it was possible that the lower-priority
    workload might get admitted while the higher priority workload is being evicted. (#2061, @mimowo)

  • Fix incorrect quota management when lendingLimit enabled in preemption (#1770, @kerthcet)

  • Fix preemption algorithm to reduce the number of preemptions within a ClusterQueue when reclamation is not possible, and when using .preemption.borrowWithinCohort (#2110, @alculquicondor)

  • Fix preemption algorithm to reduce the number of preemptions within a ClusterQueue when reclamation is not possible. (#1979, @mimowo)

  • Fix preemption to reclaim quota that is blocked by an earlier pending Workload from another ClusterQueue in the same cohort. (#1866, @alculquicondor)

  • Fix the configuration for the number of reconcilers for the Pod integration. It was only reconciling one group at a time. (#1835, @alculquicondor)

  • Fix the counter of pending workloads in cluster queue status.

    The counter would not count the head workload for StrictFIFO queues, if the workload cannot get admitted.

    This change also includes the blocked workload in the metrics and the visibility API for the list of pending workloads. (#1936, @mimowo)

  • Fix the resource requests computation taking into account sidecar containers. (#2099, @IrvingMg)

  • Fix transitions of Requeued condition. (#2063, @mbobrovskyi)

  • Helm Chart: Fix a bug that prevented Kueue from working with cert-manager. (#2087, @EladDolev)

  • Helm Chart: Fix a bug where integrations.podOptions.namespaceSelector was not propagated. (#2086, @EladDolev)

  • Kueue visibility API is no longer installed by default. Users can install it via helm or applying the visibility-api.yaml artifact. (#1746, @trasc)

  • Make the defaults for PodsReadyTimeout backoff more practical. With the original values,
    the first couple of requeues appeared immediate to users (below 10s, which
    is negligible compared to the time spent waiting for PodsReady).

    The default values for the formula that determines the exponential backoff are changed as follows:

    • base 1s -> 10s
    • exponent: 1.41284738 -> 2
      So, now the consecutive times to requeue a workload are: 10s, 20s, 40s, ... (#2025, @mimowo)
  • Reduce number of Workload reconciliations due to wrong equality check. (#1897, @gabesaba)

  • The Failed pods in a pod-group are finalized once replacement pods are created. (#1766, @trasc)

  • WaitForPodsReady: Fix a bug where the requeueState wasn't reset. (#1838, @tenzen-y)

  • Clear RequeueAt when the workload backoff finishes. (#2143, @mbobrovskyi)

Other (Cleanup or Flake)

  • Avoid API calls for admission attempts when Workload already has condition Admitted=false (#1820, @alculquicondor)
  • Correctly log workload status for workloads with quota reserved, but awaiting admission checks. (#2062, @mimowo)
  • Dropped the usage of kueue.x-k8s.io/parent-workload annotation in favor of an object ownership based approach. (#1747, @trasc)
  • JobFramework: The eviction by inactivation mechanism was moved to the workload controller. (#2131, @tenzen-y)
  • Skip requeueing of Workloads when there is a status update for a ClusterQueue, saving on API calls for Workloads that were already attempted for admission. (#1822, @alculquicondor)
  • The hash suffix of the workload's name is now influenced by the job's object UID. Recreated jobs with the same name and kind will use different workload names. (#1732, @trasc)

Kueue v0.6.2

10 Apr 12:54
v0.6.2
223106e

Changes since v0.6.1:

Bug or Regression

  • Avoid unnecessary preemptions when there are multiple candidates for preemption with the same admission timestamp (#1880, @alculquicondor)
  • Fix Pods in Pod groups stuck with finalizers when deleted immediately after Succeeded (#1916, @alculquicondor)
  • Fix preemption to reclaim quota that is blocked by an earlier pending Workload from another ClusterQueue in the same cohort. (#1868, @alculquicondor)
  • Reduce number of Workload reconciliations due to wrong equality check. (#1917, @gabesaba)

Other (Cleanup or Flake)

Kueue v0.6.1

14 Mar 20:23
v0.6.1
eb01ce9

Changes since v0.6.0:

Feature

  • Added MultiKueue worker connection monitoring and reconnect. (#1809, @trasc)
  • The Failed pods in a pod-group are finalized once replacement pods are created. (#1801, @trasc)

Bug or Regression

  • Exclude Pod labels, preemptionPolicy and container images when determining whether pods in a pod group have the same shape. (#1760, @alculquicondor)
  • Fix incorrect quota management when lendingLimit enabled in preemption (#1826, @kerthcet, @B1F030)
  • Fix the configuration for the number of reconcilers for the Pod integration. It was only reconciling one group at a time. (#1837, @alculquicondor)
  • Kueue visibility API is no longer installed by default. Users can install it via helm or applying the visibility-api.yaml artifact. (#1764, @trasc)
  • WaitForPodsReady: Fix a bug where the requeueState wasn't reset. (#1843, @tenzen-y)

Other (Cleanup or Flake)

  • Avoid API calls for admission attempts when Workload already has condition Admitted=false (#1845, @alculquicondor)
  • Skip requeueing of Workloads when there is a status update for a ClusterQueue, saving on API calls for Workloads that were already attempted for admission. (#1832, @alculquicondor)

Kueue v0.6.0

14 Feb 17:09
v0.6.0
650d2f2

Changes since v0.5.0:

API Change

  • A stopPolicy field in the ClusterQueue makes it possible to hold or drain a ClusterQueue (#1299, @trasc)
  • Add a lendingLimit field in ClusterQueue's quotas, to allow restricting how much of the ClusterQueue's unused resources can be borrowed by other ClusterQueues in the cohort.
    In other words, this allows a quota equal to nominal-lendingLimit to be exclusively used by the ClusterQueue. (#1385, @B1F030)
  • Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)
  • Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)
  • MultiKueue: Add Path location type for cluster KubeConfigs. (#1640, @trasc)
  • MultiKueue: Add garbage collection of deleted Workloads. (#1643, @trasc)
  • MultiKueue: Multi cluster job dispatching for k8s Job. This doesn't include support for live status updates. (#1313, @trasc)
  • Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)
  • Support for preemption while borrowing (#1397, @mimowo)
  • The leaderElection field in the Configuration API is now defaulted.
    Leader election is now enabled by default. (#1598, @astefanutti)
  • Visibility API: Add an endpoint that allows a user to fetch information about pending workloads and their position in LocalQueue. (#1365, @PBundyra)
  • Visibility API: Introduce an on-demand API endpoint for fetching pending workloads in a ClusterQueue. (#1251, @PBundyra)
  • Visibility API: extend the information returned for the pending workloads in a ClusterQueue, including the workload position in the queue. (#1362, @PBundyra)
  • WaitForPodsReady: Add a config field to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready on time. (#1542, @nstogner)
  • WaitForPodsReady: Support a backoff re-queueing mechanism with configurable limit. (#1709, @tenzen-y)
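
    The lendingLimit entry in the list above states that a quota equal to nominal-lendingLimit stays exclusive to the ClusterQueue. A small worked example of that arithmetic follows; the function lending_split is hypothetical, and treating a missing lendingLimit as "all nominal quota can be lent" is an assumption, not something this changelog states.

```python
def lending_split(nominal, lending_limit=None):
    """Split a nominal quota into (lendable, exclusive) amounts."""
    if lending_limit is None:
        # Assumption: without a lendingLimit, all unused nominal quota may be lent.
        return nominal, 0
    if not 0 <= lending_limit <= nominal:
        raise ValueError("lending_limit must be between 0 and nominal")
    # exclusive = nominal - lendingLimit, as described in the release note.
    return lending_limit, nominal - lending_limit


lendable, exclusive = lending_split(nominal=10, lending_limit=4)
print(lendable, exclusive)  # → 4 6
```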

Feature

  • Add Prebuilt Workload support for JobSets. (#1575, @trasc)

  • Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)

  • Add prebuilt workload support for batch/job. (#1358, @trasc)

  • Add support for groups of plain Pods. (#1319, @achernevskii)

  • Allow configuring featureGates on helm charts. (#1314, @B1F030)

  • At log level 6, the usage of ClusterQueues and cohorts is included in logs.

    The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)

  • Changing tolerations in an inadmissible job triggers an admission retry with the updated tolerations. (#1304, @stuton)

  • Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)

  • Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)

  • Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)

  • MultiKueue: Add live status updates for multikueue JobSets (#1668, @trasc)

  • MultiKueue: Support for JobSets. (#1606, @trasc)

  • Support RayCluster as a queue-able workload in Kueue (#1520, @vicentefb)

  • Support for retry of provisioning request.

    When ProvisioningACC is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
    This may cause job evictions for some long-running jobs which were using the ProvisioningRequests. (#1351, @mimowo)

  • The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)

  • The priority sorting within the cohort can be disabled by setting the feature gate PrioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)

  • Visibility API: Add HA support. (#1554, @astefanutti)

Bug or Regression

  • Add Missing RBAC on finalizer sub-resources for job integrations. (#1486, @astefanutti)

  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)

  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)

  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)

  • Prevent finished Workloads from blocking quota after a Kueue restart (#1689, @trasc)

  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)

  • Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)

  • Fix Kueue crashing at the log level 6 when re-admitting workloads (#1644, @mimowo)

  • Fix a bug in the pod integration where unexpected errors occurred when the pod wasn't found (#1512, @achernevskii)

  • Fix a bug where plain pods managed by Kueue remained in a terminating state, due to a finalizer (#1342, @tenzen-y)

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)

  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)

  • Fix handling of preemption within a cohort when there is no borrowingLimit. In that case,
    during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

    As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)

  • Fix the synchronization of the admission check state based on recreated ProvisioningRequest (#1585, @mimowo)

  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)

  • Kueue replicas are advertised as Ready only once the webhooks are functional.

    This allows users to wait with the first requests until the Kueue deployment is available, so that the
    early requests don't fail. (#1676, @mimowo)

  • Pending workload from StrictFIFO ClusterQueue doesn't block borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)

  • Remove deleted pending workloads from the cache (#1679, @astefanutti)

  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)

  • Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)

  • Webhooks are served in non-leading replicas (#1509, @astefanutti)
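
    The borrowingLimit fix in the list above hinges on the difference between "no borrowingLimit" (unlimited) and borrowingLimit=0. The sketch below illustrates that distinction with a simplified single-resource model; it is not Kueue's preemption code, and the function fits_with_borrowing and its parameters are hypothetical.

```python
def fits_with_borrowing(request, nominal, usage, cohort_unused, borrowing_limit=None):
    """Check whether a request fits using own quota plus permitted borrowing.

    borrowing_limit=None models a ClusterQueue with no borrowingLimit,
    which means unlimited borrowing (bounded only by what the cohort has free).
    """
    borrowed_before = max(0, usage - nominal)
    borrowed_after = max(0, usage + request - nominal)
    if borrowed_after - borrowed_before > cohort_unused:
        return False  # the cohort does not have enough unused quota
    return borrowing_limit is None or borrowed_after <= borrowing_limit


# Correct handling: no limit set, so the workload fits by borrowing 3 units.
print(fits_with_borrowing(request=5, nominal=10, usage=8, cohort_unused=10))

# The buggy behaviour treated "no limit" like 0, so the same workload
# appeared not to fit and preemption was considered instead.
print(fits_with_borrowing(request=5, nominal=10, usage=8, cohort_unused=10,
                          borrowing_limit=0))
```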

Other (Cleanup or Flake)

  • Expose utility functions to set up jobframework reconcilers and webhooks (#1630, @tenzen-y)

Kueue v0.6.0-rc.3

12 Feb 21:00
v0.6.0-rc.3
5a0a714
Pre-release

Changes since v0.5.0:

Changes by Kind

API Change

  • A stopPolicy field in the ClusterQueue makes it possible to hold or drain a ClusterQueue (#1299, @trasc)
  • Add MultiKueue garbage collection. (#1643, @trasc)
  • Add Path location type for MultiKueue cluster KubeConfigs (#1640, @trasc)
  • Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
  • Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)
  • Extend the information returned for the pending workloads in cluster queue, that is used to determine the workload position, including the workload position itself. (#1362, @PBundyra)
  • Extend visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in LocalQueue. (#1365, @PBundyra)
  • Introduces an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
  • Support a backoff re-queueing mechanism for the waitForPodsReady (#1709, @tenzen-y)
  • The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
  • The lendingLimit field in ClusterQueue's quotas allows restricting how much of the ClusterQueue's unused resources can be borrowed by other ClusterQueues in the cohort. In other words, this allows a quota equal to nominal-lendingLimit to be exclusively used by the ClusterQueue. (#1385, @B1F030)
  • Visibility.PendingWorkload does not implement runtime.Object interface anymore (#1386, @PBundyra)
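
The lendingLimit entry above reduces to simple arithmetic: the quota that stays exclusive to a ClusterQueue is nominal minus lendingLimit. A small Python sketch of that calculation follows. The function name and the range check are illustrative assumptions, not Kueue's implementation.

```python
def exclusive_quota(nominal: int, lending_limit: int) -> int:
    """Quota kept for the ClusterQueue itself: nominal - lendingLimit.

    Assumption for this sketch: lending_limit lies between 0 and nominal.
    """
    if not 0 <= lending_limit <= nominal:
        raise ValueError("lending_limit must be between 0 and nominal")
    # Whatever is not lendable stays reserved for this ClusterQueue.
    return nominal - lending_limit
```

For example, with a nominal quota of 10 and a lendingLimit of 4, other ClusterQueues in the cohort can borrow at most 4 units and 6 units remain exclusive.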

Feature

  • Add HA support for the visibility API (#1554, @astefanutti)

  • Add MultiKueue support for JobSet (#1606, @trasc)

  • Add Prebuilt Workload support for JobSets. (#1575, @trasc)

  • Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)

  • Add live status updates for multikueue jobs (#1668, @trasc)

  • Add prebuilt workload support for batch/job. (#1358, @trasc)

  • Add support for groups of plain Pods. (#1319, @achernevskii)

  • Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)

  • Allow configuring featureGates on helm charts. (#1314, @B1F030)

  • Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)

  • At log level 6, the usage of ClusterQueues and cohorts is included in logs.

    The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)

  • Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)

  • Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)

  • Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)

  • RBAC to visibility into Local Queues is fixed (#1412, @PBundyra)

  • Support RayCluster as a queue-able workload in Kueue (#1520, @vicentefb)

  • Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)

  • Support for preemption while borrowing (#1397, @mimowo)

  • Support for retry of provisioning request.

    When ProvisioningACC is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
    This may cause job failures for some long-running jobs which were using the ProvisioningRequests. (#1351, @mimowo)

  • The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)

  • The leaderElection field in the Configuration API is now defaulted.
    Leader election is now enabled by default. (#1598, @astefanutti)

  • The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)

  • Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value of corresponding kueue.Workload (#1404, @PBundyra)
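
The ClusterQueue validation listed above (when cohort is empty, borrowingLimit must be nil) can be sketched as a small check. This is an illustration in Python under assumed names. The function, its arguments, and the error text are hypothetical, and None stands in for nil.

```python
from typing import List, Optional


def validate_borrowing_limit(cohort: str, borrowing_limit: Optional[int]) -> List[str]:
    """Return validation errors for one quota entry of a ClusterQueue."""
    errors = []
    if cohort == "" and borrowing_limit is not None:
        # Without a cohort there is nothing to borrow from, so a limit is rejected.
        errors.append("borrowingLimit must be nil when cohort is empty")
    return errors
```

An empty result means the combination is accepted. Note that an explicit limit of 0 is still rejected when the cohort is empty, because 0 is a set value rather than nil.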

Bug or Regression

  • Add Missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)

  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)

  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)

  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)

  • Prevent finished Workloads from blocking quota after a Kueue restart (#1689, @trasc)

  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)

  • Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)

  • Fix Kueue crashing at the log level 6 when re-admitting workloads (#1644, @mimowo)

  • Fix a bug in the pod integration where unexpected errors occurred when the pod isn't found (#1512, @achernevskii)

  • Fix a bug where a workload, representing a pod group, was deleted soon after being marked as finished.
    This affected workloads that were preempted during their lifetime. (#1683, @mimowo)

  • Fix a bug where plain pods managed by Kueue remained in a terminating condition forever. (#1342, @tenzen-y)

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)

  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)

  • Fix handling of preemption within a cohort when there is no borrowingLimit. In that case,
    during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

    As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)

  • Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)

  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)

  • Kueue replicas are advertised as Ready only once the webhooks are functional.

    This allows users to wait with the first requests until the Kueue deployment is available, so that the
    early requests don't fail. (#1676, @mimowo)

  • Pending workload from StrictFIFO ClusterQueue doesn't block borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)

  • Remove deleted pending workloads from the cache (#1679, @astefanutti)

  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)

  • Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)

  • Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)

  • Webhooks are served in non-leading replicas (#1509, @astefanutti)

Other (Cleanup or Flake)

  • Adding a toleration to a job now leads to an update of its workload (#1304, @stuton)
  • Expose utilization functions to set up jobframework reconcilers and webhooks (#1630, @tenzen-y)

Kueue v0.5.3

09 Feb 17:11
v0.5.3
f816b7f

Changes since v0.5.2:

Changes by Kind

Bug or Regression

  • Prevent finished Workloads from blocking quota after a Kueue restart (#1699, @trasc)

  • Do not (re)create ProvReq if the state of admission check is Ready (#1620, @mimowo)

  • Fix Kueue crashing at the log level 6 when re-admitting workloads (#1645, @mimowo)

  • Kueue replicas are advertised as Ready only once the webhooks are functional.

    This allows users to wait with the first requests until the Kueue deployment is available, so that the early requests don't fail. (#1682 #1713, @mimowo @trasc)

  • Remove deleted pending workloads from the cache (#1687, @astefanutti)

Kueue v0.6.0-rc.2

07 Feb 21:23
v0.6.0-rc.2
90fa327
Pre-release

Changes since v0.5.0:

Changes by Kind

API Change

  • Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
  • Extend the information returned for the pending workloads in cluster queue, that is used to determine the workload position, including the workload position itself. (#1362, @PBundyra)
  • Extend visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in LocalQueue. (#1365, @PBundyra)
  • Introduces an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
  • The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
  • Visibility.PendingWorkload does not implement runtime.Object interface anymore (#1386, @PBundyra)
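
To illustrate what the .waitForPodsReady.requeuingTimestamp setting in the list above controls, here is a rough Python sketch of ordering workloads by a configurable timestamp. The value names "Eviction" and "Creation", the Workload class, and the integer times are assumptions made for illustration. The sketch ignores priority and every other factor a real scheduler would consider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workload:
    name: str
    created: int                   # creation time in seconds (illustrative)
    evicted: Optional[int] = None  # eviction time, or None if never evicted


def queue_order(workloads: List[Workload], requeuing_timestamp: str = "Eviction") -> List[str]:
    """Return workload names sorted by the configured timestamp, oldest first."""
    def key(w: Workload) -> int:
        # With "Eviction", an evicted workload is ordered by its eviction time;
        # otherwise the creation time is used.
        if requeuing_timestamp == "Eviction" and w.evicted is not None:
            return w.evicted
        return w.created
    return [w.name for w in sorted(workloads, key=key)]
```

In this sketch, a workload created at time 1 and evicted at time 10 is ordered behind a workload created at time 5 under "Eviction", but ahead of it under "Creation".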

Feature

  • A stopPolicy field in the ClusterQueue allows holding or draining a ClusterQueue (#1299, @trasc)

  • Add HA support for the visibility API (#1554, @astefanutti)

  • Add MultiKueue garbage collection. (#1643, @trasc)

  • Add MultiKueue support for JobSet (#1606, @trasc)

  • Add Path location type for MultiKueue cluster KubeConfigs (#1640, @trasc)

  • Add Prebuilt Workload support for JobSets. (#1575, @trasc)

  • Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)

  • Add live status updates for multikueue jobs (#1668, @trasc)

  • Add prebuilt workload support for batch/job. (#1358, @trasc)

  • Add support for groups of plain Pods. (#1319, @achernevskii)

  • Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)

  • Allow configuring featureGates on helm charts. (#1314, @B1F030)

  • Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)

  • At log level 6, the usage of ClusterQueues and cohorts is included in logs.

    The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)

  • Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)

  • Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)

  • Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)

  • Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)

  • RBAC to visibility into Local Queues is fixed (#1412, @PBundyra)

  • Support RayCluster as a queue-able workload in Kueue (#1520, @vicentefb)

  • Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)

  • Support for preemption while borrowing (#1397, @mimowo)

  • Support for retry of provisioning request.

    When ProvisioningACC is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
    This may cause job failures for some long-running jobs which were using the ProvisioningRequests. (#1351, @mimowo)

  • The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)

  • The leaderElection field in the Configuration API is now defaulted.
    Leader election is now enabled by default. (#1598, @astefanutti)

  • The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)

  • Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value of corresponding kueue.Workload (#1404, @PBundyra)

Bug or Regression

  • Add Missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)

  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)

  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)

  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)

  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)

  • Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)

  • Fix Kueue crashing at the log level 6 when re-admitting workloads (#1644, @mimowo)

  • Fix a bug in the pod integration where unexpected errors occurred when the pod isn't found (#1512, @achernevskii)

  • Fix a bug where a workload, representing a pod group, was deleted soon after being marked as finished.
    This affected workloads that were preempted during their lifetime. (#1683, @mimowo)

  • Fix a bug where plain pods managed by Kueue remained in a terminating condition forever. (#1342, @tenzen-y)

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)

  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)

  • Fix handling of preemption within a cohort when there is no borrowingLimit. In that case,
    during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

    As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)

  • Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)

  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)

  • Kueue replicas are advertised as Ready only once the webhooks are functional.

    This allows users to wait with the first requests until the Kueue deployment is available, so that the
    early requests don't fail. (#1676, @mimowo)

  • Pending workload from StrictFIFO ClusterQueue doesn't block borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)

  • Remove deleted pending workloads from the cache (#1679, @astefanutti)

  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)

  • Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)

  • Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)

  • Webhooks are served in non-leading replicas (#1509, @astefanutti)

Other (Cleanup or Flake)

  • Adding a toleration to a job now leads to an update of its workload (#1304, @stuton)
  • Expose utilization functions to set up jobframework reconcilers and webhooks (#1630, @tenzen-y)

Kueue v0.6.0-rc.1

23 Jan 19:14
v0.6.0-rc.1
44adc22
Pre-release

Changes since v0.5.0:

Changes by Kind

API Change

  • Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
  • Extend the information returned for the pending workloads in cluster queue, that is used to determine the workload position, including the workload position itself. (#1362, @PBundyra)
  • Extend visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in LocalQueue. (#1365, @PBundyra)
  • Introduces an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
  • The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
  • Visibility.PendingWorkload does not implement runtime.Object interface anymore (#1386, @PBundyra)

Feature

  • A stopPolicy field in the ClusterQueue allows holding or draining a ClusterQueue (#1299, @trasc)

  • Add MultiKueue support for JobSet (#1606, @trasc)

  • Add Prebuilt Workload support for JobSets. (#1575, @trasc)

  • Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)

  • Add prebuilt workload support for batch/job. (#1358, @trasc)

  • Add support for groups of plain Pods. (#1319, @achernevskii)

  • Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)

  • Allow configuring featureGates on helm charts. (#1314, @B1F030)

  • Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)

  • At log level 6, the usage of ClusterQueues and cohorts is included in logs.

    The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)

  • Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)

  • Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)

  • Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)

  • Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)

  • RBAC to visibility into Local Queues is fixed (#1412, @PBundyra)

  • Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)

  • Support for preemption while borrowing (#1397, @mimowo)

  • Support for retry of provisioning request.

    When ProvisioningACC is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
    This may cause job failures for some long-running jobs which were using the ProvisioningRequests. (#1351, @mimowo)

  • The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)

  • The leaderElection field in the Configuration API is now defaulted.
    Leader election is now enabled by default. (#1598, @astefanutti)

  • The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)

  • Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value of corresponding kueue.Workload (#1404, @PBundyra)

Bug or Regression

  • Add Missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)

  • Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)

  • Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)

  • Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)

  • Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)

  • Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)

  • Fix a bug in the pod integration where unexpected errors occurred when the pod isn't found (#1512, @achernevskii)

  • Fix a bug where plain pods managed by Kueue remained in a terminating condition forever. (#1342, @tenzen-y)

  • Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)

  • Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)

  • Fix handling of preemption within a cohort when there is no borrowingLimit. In that case,
    during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

    As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)

  • Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)

  • Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)

  • Pending workload from StrictFIFO ClusterQueue doesn't block borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)

  • Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)

  • Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)

  • Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)

  • Webhooks are served in non-leading replicas (#1509, @astefanutti)

Other (Cleanup or Flake)

  • Adding a toleration to a job now leads to an update of its workload (#1304, @stuton)