Auto Sklearn never stop training model #1695

Open
whoisltd opened this issue Sep 19, 2023 · 3 comments

whoisltd commented Sep 19, 2023

Describe the bug

I have a pod in k8s with 56 CPUs. When I call fit() for classification or regression, the task never finishes, even though the training time is limited with time_left_for_this_task=60.
On a local machine with 8 CPUs everything works fine. However, if I increase the limit on the local machine to time_left_for_this_task=1500, it also fails to stop training after 1500 seconds, just like the model on k8s. I don't know what causes this; maybe the machine configuration or something else.
If an error occurs, I would expect some message to be returned.

Expected behavior

The model stops training once time_left_for_this_task has elapsed.

Actual behavior, stacktrace or logfile

The last two lines of AutoML(...).log are:

[DEBUG] [2023-09-18 17:15:44,242:Client-pynisher] Redirecting output of the function to files. Access them via the stdout and stderr attributes of the wrapped function.
[DEBUG] [2023-09-18 17:15:44,243:Client-pynisher] call function
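
Because the log ends at `call function` with no error, one possible way to get a message out of a hung run is Python's standard `faulthandler` module, which can dump the tracebacks of all threads if the code is still running after a deadline. The following is only a sketch: the helper name `dump_traceback_if_stuck` is made up for illustration, and it only inspects the process that calls `fit()`, not the worker processes that auto-sklearn spawns.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def dump_traceback_if_stuck(timeout_seconds, file=sys.stderr):
    """Dump all thread tracebacks to `file` if the body runs longer than the timeout.

    `file` must be a real file object with a file descriptor (for example
    sys.stderr or an opened log file). The process is NOT killed (exit=False);
    this is purely diagnostic.
    """
    faulthandler.dump_traceback_later(timeout_seconds, exit=False, file=file)
    try:
        yield
    finally:
        # Disarm the watchdog once the body has finished.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage (the `automl`, `X` and `y` names are assumptions):
#
#     with dump_traceback_if_stuck(60 + 30):  # time limit plus some grace
#         automl.fit(X, y)
```

If the dump shows the parent waiting on a worker, that would at least narrow down where the run is blocked.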

Environment and installation:

Please give details about your installation:

  • OS Ubuntu 20.04
  • virtual environment
  • Python 3.8
  • Auto-sklearn 0.15.0

whoisltd (Author) commented

Is there any update on this problem? And what is the minimum configuration required to run auto-sklearn?

whoisltd changed the title from "Auto Sklearn never done training task" to "Auto Sklearn never stop training model" on Oct 5, 2023

00sapo commented May 16, 2024

Hello, I have used auto-sklearn in several projects, but never faced this issue... until today. I think the problem is that auto-sklearn doesn't actually stop ongoing training for certain algorithms; it just doesn't start a new one once the time limit has passed. I guess the reason is that certain algorithms ignore some kill signals. I'm also on Linux.
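
To illustrate the general mechanism guessed at here (not what auto-sklearn or pynisher actually send, which this thread does not establish), the sketch below starts a child process that ignores SIGTERM and shows that only SIGKILL stops it. The child program and the `stop_child` helper are made up for the demonstration, and it assumes a POSIX system such as Linux.

```python
import signal
import subprocess
import sys

# Stand-in for a stubborn training process: it ignores SIGTERM,
# reports that it is ready, and then "trains" for a long time.
CHILD_CODE = """
import signal, time
signal.signal(signal.SIGTERM, signal.SIG_IGN)
print("ready", flush=True)
time.sleep(600)
"""


def stop_child(grace_seconds=1.0):
    """Politely ask the child to stop, then force it.

    Returns (ignored_sigterm, exit_code).
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # wait until the child has installed its handler
    child.terminate()  # SIGTERM: a request the child is free to ignore
    try:
        child.wait(timeout=grace_seconds)
        ignored_sigterm = False
    except subprocess.TimeoutExpired:
        ignored_sigterm = True
        child.kill()  # SIGKILL: cannot be caught or ignored
        child.wait()
    child.stdout.close()
    return ignored_sigterm, child.returncode


if __name__ == "__main__":
    ignored, code = stop_child()
    print(f"ignored SIGTERM: {ignored}, exit code: {code}")
```

On POSIX, a negative exit code means the child was terminated by that signal number, so a child stopped this way reports `-signal.SIGKILL`.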

00sapo commented May 17, 2024

I used this function as a workaround. Instead of using SIGSTOP, it uses SIGKILL, so any running process is killed; the fit reports an error but continues. It needs psutil, though.

```python
import time


def _monitor_children_processes(min_time_limit, max_time_limit):
    """
    Monitor the children processes of this process and kill them if they take
    too long. This spawns a new process which does nothing until `min_time_limit`
    is reached, then it starts waiting for the children processes of this process
    (the parent, not the monitor). If the children processes are still running
    after `max_time_limit`, it kills them with -9.
    """
    import psutil
    from multiprocessing import Process

    # Note: this nested function cannot be pickled, so the code relies on the
    # "fork" start method (the default on Linux with Python 3.8).
    def monitor_children_processes(parent):
        pid = psutil.Process().pid  # PID of the monitor process itself
        start_time = time.time()
        while True:
            if time.time() - start_time < min_time_limit:
                time.sleep(60)
                continue
            children = parent.children()
            # the monitor is itself a child of `parent`, so more than one
            # child means that other processes are still running
            if len(children) > 1:
                for child in children:
                    # avoid killing this same process
                    if child.pid != pid:
                        try:
                            remaining_time = max_time_limit - (time.time() - start_time)
                            if remaining_time < 0:
                                # kill with -9
                                child.kill()
                            else:
                                child.wait(timeout=remaining_time)
                        except psutil.TimeoutExpired:
                            # kill with -9
                            child.kill()
                        except psutil.NoSuchProcess:
                            pass
            else:
                break

    # run the monitor in a new process
    monitor = Process(target=monitor_children_processes, args=(psutil.Process(),))
    return monitor


monitor = _monitor_children_processes(3500, 3600)
monitor.start()  # starts the monitor process
model.fit(X, y)  # starts the fit
monitor.join(3600)  # waits for the monitor to finish, but it should end even without this command
```
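
For setups where psutil is not available, a different and cruder technique is to run the whole training script as a subprocess in its own session and kill its entire process group once a hard deadline passes. This is not the workaround above but a standard-library alternative, sketched under the assumption of a POSIX system; `run_with_hard_deadline` is a made-up helper name.

```python
import os
import signal
import subprocess
import sys


def run_with_hard_deadline(cmd, deadline_seconds):
    """Run `cmd` in a new session and SIGKILL its process group on timeout.

    Returns the exit code of `cmd`, or None if the deadline was hit.
    """
    # start_new_session=True makes the child call setsid(), so it becomes the
    # leader of a new process group whose ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # Kill the child and every descendant that stayed in its group.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

A call could look like `run_with_hard_deadline([sys.executable, "train.py"], 3600)`, where `train.py` is a hypothetical script that fits the model and saves it to disk. Unlike the monitor above, this kills the fitting process itself, so anything not yet written to disk is lost when the deadline hits.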
