[WIP] scheduler for running operations subsequently #1095

Open
wants to merge 10 commits into
base: master

Contributor

@Hoxo Hoxo commented Jul 31, 2023

No description provided.

github-actions bot commented Jul 31, 2023

Unit Test Results

26 tests  (-321):  23 passed ✔️ (-318),  15s ⏱️ (-15m 22s)
 6 suites (-63):    0 skipped 💤 (-6)
 6 files  (-63):    2 failed (+2),  1 error 🔥 (+1)

For more details on these failures and errors, see this check.

Results for commit 9559e47. ± Comparison against base commit d516052.

This pull request removes 340 and adds 19 tests. Note that renamed tests count towards both.
ai.lzy.allocator.test.AdminDaoTest ‑ emptyOnStart
ai.lzy.allocator.test.AdminDaoTest ‑ jupyterLab
ai.lzy.allocator.test.AdminDaoTest ‑ sync
ai.lzy.allocator.test.AdminDaoTest ‑ workers
ai.lzy.allocator.test.AllocatorAdminServiceTest ‑ adminAccess
ai.lzy.allocator.test.AllocatorAdminServiceTest ‑ noAccess
ai.lzy.allocator.test.AllocatorServiceCacheLimitsTest ‑ noLimits
ai.lzy.allocator.test.AllocatorServiceCacheLimitsTest ‑ poolLimit
ai.lzy.allocator.test.AllocatorServiceCacheLimitsTest ‑ userLimitMultipleSessions
ai.lzy.allocator.test.AllocatorServiceCacheLimitsTest ‑ userLimitSingleSession
…
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ create
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ delete
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ deleteUnknown
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ getUnknown
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ lockPendingBatch
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ lockPendingBatchWithAllRunning
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ multiCreate
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ update
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ updateLease
ai.lzy.longrunning.OperationTaskDaoImplTest ‑ updateLeaseUnknown
…

♻️ This comment has been updated with latest results.

@Override
public OperationTask get(long id, @Nullable TransactionHandle tx) throws SQLException {
    return DbOperation.execute(tx, storage, c -> {
        try (PreparedStatement ps = c.prepareStatement(SELECT_QUERY)) {
Contributor

Shouldn't we add FOR UPDATE if tx is not null?

In the common scenario we do the following:

var tx = start_tx();
var some_state = dao.get(tx);
... some business logic ...
dao.update(new_state, tx);   <-- simple UPDATE, not CAS
tx.commit();

If we do a plain UPDATE inside the transaction, then we should add FOR UPDATE to our SELECT query.
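
The suggestion can be illustrated with a minimal sketch. The class and method names (`TaskQueries`, `selectById`) are hypothetical and not taken from the PR; only the `operation_task` table name and the `FOR UPDATE` clause come from the discussion. The idea is that a read made inside a caller-supplied transaction takes a row lock, so a later plain UPDATE in the same transaction cannot overwrite a concurrent change.

```java
// Hypothetical helper for illustration; this is not the PR's DAO code.
final class TaskQueries {
    private static final String SELECT_BY_ID =
        "SELECT * FROM operation_task WHERE id = ?";

    private TaskQueries() { }

    // inTransaction == true models the "tx is not null" case from the comment:
    // the row is locked until the surrounding transaction commits or rolls back.
    static String selectById(boolean inTransaction) {
        return inTransaction ? SELECT_BY_ID + " FOR UPDATE" : SELECT_BY_ID;
    }
}
```

Without a surrounding transaction the lock would be released as soon as the statement finishes, which is why the sketch only appends the clause in the transactional case.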

Contributor Author

I'm not sure it's necessary, because we don't have long-lasting transactions that require read-and-update. It may be useful, of course, if we want to ensure that an operation_task hasn't been updated by another instance (in case of parallel execution, which is not desirable). So I'll revise the code and think about this problem.


public MountDynamicDiskResolver(VmDao vmDao, DynamicMountDao dynamicMountDao, AllocationContext allocationContext,
                                OperationTaskDao operationTaskDao, OperationTaskScheduler taskScheduler,
                                Duration leaseDuration)
Contributor

leaseDuration is not a bean

Contributor Author

Correct, but it's just an example. It still requires fixes for circular dependencies and this configuration

@@ -0,0 +1,21 @@
CREATE TYPE task_status AS ENUM ('PENDING', 'RUNNING', 'FAILED', 'FINISHED', 'STALE');

CREATE TYPE task_type AS ENUM ('UNMOUNT', 'MOUNT');
Contributor Author

There should be one type per action, so this enum can be extended in future migrations to support new types of actions.


CREATE TYPE task_type AS ENUM ('UNMOUNT', 'MOUNT');

CREATE TABLE IF NOT EXISTS operation_task(
Contributor Author

The main idea is to make the DB the single source of truth about the order of task execution.
Here's a short explanation of the operation_task fields:

  • id - a bigserial, generated when the task is inserted. This is the main way to define the order among related tasks (see entity_id).
  • name - for debugging and readability purposes.
  • entity_id - groups tasks by a user-generated text id. Tasks with the same entity_id are executed sequentially in ascending order of the id field, so the task with the smaller id runs first. Tasks with different entity_id values can be executed in parallel.
  • type - needed to match a task to its code representation.
  • status - status of a task.
  • created_at, updated_at - self-explanatory, for debugging purposes.
  • metadata - JSON that keeps task arguments and other useful information about the task. The content of this field is defined by the user and is parsed mainly depending on the type.
  • operation_id - the operation that is linked to a task. Contains all details about execution. There should be a (0-1) <-> 1 relation between a task and an operation.
  • worker_id - name of the instance that captured a task. This is needed to ensure that a task is executed just once.
  • lease_till - deadline for the scheduler instance to execute this task. The scheduler instance should keep updating the lease_till field. If the instance dies, or for any other reason cannot finish the task, another scheduler instance can "capture" the task once its lease_till deadline has expired and replace the worker_id field.
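
The ordering rule for id and entity_id can be sketched in memory as follows. This is only an illustration under stated assumptions: `TaskRow` and `TaskOrdering` are hypothetical names, statuses are plain strings, and the rule that an entity with a RUNNING task yields no new candidate is an assumption of this sketch rather than something the PR states. In the PR the selection presumably happens in SQL.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory view of three operation_task columns.
record TaskRow(long id, String entityId, String status) { }

final class TaskOrdering {
    private TaskOrdering() { }

    // For each entity_id, finds the unfinished (PENDING or RUNNING) task with the
    // smallest id and returns it only if it is still PENDING. Entities whose head
    // task is RUNNING are skipped (assumption of this sketch).
    static List<TaskRow> nextRunnable(List<TaskRow> tasks) {
        Map<String, TaskRow> heads = new LinkedHashMap<>();
        tasks.stream()
            .filter(t -> t.status().equals("PENDING") || t.status().equals("RUNNING"))
            .sorted(Comparator.comparingLong(TaskRow::id))
            .forEach(t -> heads.putIfAbsent(t.entityId(), t));

        List<TaskRow> result = new ArrayList<>();
        for (TaskRow head : heads.values()) {
            if (head.status().equals("PENDING")) {
                result.add(head);
            }
        }
        return result;
    }
}
```

Tasks that belong to different entities end up in the same result list, which reflects that they may run in parallel.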

import java.util.Map;
import java.util.stream.Collectors;

public class DispatchingOperationTaskResolver implements OperationTaskResolver {
Contributor Author

A task resolver that accepts a list of typed resolvers and picks the resolver that matches the task type.
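
A minimal sketch of this dispatching idea, assuming a simplified shape for the types involved: `TypedResolver`, `Task`, `TaskType` and `DispatchingResolver` are hypothetical stand-ins, and the resolver returns a `String` here only to keep the example self-contained (the real resolver presumably produces an action object).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the PR's classes.
enum TaskType { MOUNT, UNMOUNT }

record Task(long id, TaskType type) { }

interface TypedResolver {
    TaskType type();

    String resolve(Task task);
}

final class DispatchingResolver {
    private final Map<TaskType, TypedResolver> byType;

    DispatchingResolver(List<TypedResolver> resolvers) {
        // Collectors.toMap throws IllegalStateException if two resolvers share a type.
        this.byType = resolvers.stream()
            .collect(Collectors.toMap(TypedResolver::type, Function.identity()));
    }

    // Delegates to the resolver registered for the task's type.
    String resolve(Task task) {
        TypedResolver resolver = byType.get(task.type());
        if (resolver == null) {
            throw new IllegalArgumentException("No resolver for task type " + task.type());
        }
        return resolver.resolve(task);
    }
}
```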


import static ai.lzy.model.db.DbHelper.withRetries;

public abstract class OpTaskAwareAction extends OperationRunnerBase {
Contributor Author

A type of action that is connected to a task. All subclasses of this class will be executed by the task scheduler.

}

@Override
protected void beforeStep() {
Contributor Author

New step in operation execution to update task lease deadline
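
The lease mechanics behind this step might look roughly like the sketch below. `Lease`, `renew` and `canBeCapturedBy` are hypothetical names modelling only the worker_id and lease_till columns; the PR's actual update goes through the DAO and the database.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the worker_id / lease_till pair of a task.
record Lease(String workerId, Instant leaseTill) {

    // The owning worker pushes the deadline forward, e.g. before each operation step.
    Lease renew(String worker, Instant now, Duration leaseDuration) {
        if (!workerId.equals(worker)) {
            throw new IllegalStateException("Task is held by " + workerId);
        }
        return new Lease(workerId, now.plus(leaseDuration));
    }

    // A different instance may take the task over only after the deadline has passed.
    boolean canBeCapturedBy(String worker, Instant now) {
        return !workerId.equals(worker) && now.isAfter(leaseTill);
    }
}
```

As long as the owner keeps renewing before the deadline, no other instance can capture the task; once renewals stop, the task becomes available after lease_till.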

}

@Override
protected void notifyFinished() {
Contributor Author

Task should be moved to a final status on operation finish

metadata, operationId, null, null);
}

public enum Status {
Contributor Author

Assumed workflow:

┌─────────┐     ┌─────────┐      ┌──────────┐
│ PENDING ├─────► RUNNING ├──────► FINISHED │
└────┬────┘     └────┬────┘      └──────────┘
     │               │
     │               │
 ┌───▼───┐       ┌───▼────┐
 │ STALE │       │ FAILED │
 └───────┘       └────────┘
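
The diagram above can be encoded as an explicit transition table. This is a sketch only: `TaskStatus` and its methods are hypothetical and not the PR's `Status` enum, and the allowed transitions are read directly off the arrows in the diagram.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum; the transitions mirror the arrows in the diagram.
enum TaskStatus {
    PENDING, RUNNING, FINISHED, FAILED, STALE;

    // Statuses reachable in one step from this one.
    Set<TaskStatus> next() {
        switch (this) {
            case PENDING:
                return EnumSet.of(RUNNING, STALE);
            case RUNNING:
                return EnumSet.of(FINISHED, FAILED);
            default:
                return EnumSet.noneOf(TaskStatus.class);
        }
    }

    boolean canMoveTo(TaskStatus target) {
        return next().contains(target);
    }

    // FINISHED, FAILED and STALE have no outgoing arrows.
    boolean isFinal() {
        return next().isEmpty();
    }
}
```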


import java.sql.SQLException;

public interface OperationTaskResolver {
Contributor Author

A component that matches a task from the DB to its code representation and creates that representation.
