
Questions about the duration of function calls limited by the nginx gateway #5429

QWQyyy opened this issue Jul 8, 2023 · 18 comments
QWQyyy commented Jul 8, 2023

How can I overcome the call-duration limit of the nginx gateway? Or which piece of source code imposes the limitation shown below? Maybe I need to modify the source code and redeploy it.

https://192.168.1.22:31001/api/23bc46b1-71f6-4ed5-8c54-816aa4f8c502/resnet18-CIF-1
(wsk) root@work1:~/good/day0708_wsk_mL/func/resnet18/CIFAR10/dataset# curl -X POST http://192.168.1.22:31005/api/23bc46b1-71f6-4ed5-8c54-816aa4f8c502/resnet18-CIF-1

<title>504 Gateway Time-out</title>

504 Gateway Time-out


nginx/1.21.1


Meanwhile, how can I contribute code to OpenWhisk?


QWQyyy commented Jul 8, 2023

@style95 hi boss, could you give me some suggestions?


style95 commented Jul 10, 2023

I suppose that is not related to the duration limit of nginx.
I have observed such errors when the connection between the apigateway and nginx is not established properly.
Is it working with a short-running action, or does it happen all the time?

Regarding the contribution, you may find this useful.
https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md


QWQyyy commented Jul 10, 2023

> I suppose that is not related to the duration limit of nginx. When the connection between apigateway and nginx is not properly made, I observed such errors. Is it working with a short-running action or does it happen all the time?
>
> Regarding the contribution, you may find this useful. https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md

Short-running functions don't produce this error; once my functions run for a minute and a half or more, the problem occurs, so I'm very confused.


QWQyyy commented Jul 10, 2023

This is my action code:

import pandas as pd
from torchvision import datasets, transforms
import numpy as np
import os
from minio import Minio
from minio.error import S3Error
import io

def main(args):
    # Transforms
    transform_train = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    transform_test = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Dataset path
    data_path = "/dataset/CIFAR10"

    # Create a connection to the MinIO server
    client = Minio(
        "192.168.1.23:9000",
        access_key="CuZRDbhLVIIjfDp4M4p8",
        secret_key="tFDb4aoeSqlCB0bH6BoYfNvIWXqSCJQCZ6UD0Hd9",
        secure=False
    )

    # Create the bucket
    bucket_name = "cifar"
    try:
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)
            print(f"Bucket {bucket_name} created successfully")
        else:
            print(f"Bucket {bucket_name} already exists")
    except S3Error as exc:
        print(f"Failed to create bucket {bucket_name}: {exc}")

    # Load the training and test sets
    train_dataset = datasets.CIFAR10(root=data_path, train=True, transform=transform_train, download=True)
    test_dataset = datasets.CIFAR10(root=data_path, train=False, transform=transform_test, download=True)
    # Save the training set to a CSV file
    train_data = []
    for i, (image, label) in enumerate(train_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        train_data.append([image_data, label])
    train_df = pd.DataFrame(train_data, columns=['image', 'label'])
    train_csv_data = train_df.to_csv(index=False)

    # Save the data to MinIO storage
    try:
        data_bytes = train_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "train_dataset.csv", data_stream, len(data_bytes))
        print(f"Training data saved to MinIO, bucket: {bucket_name}, object: train_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save training data to MinIO: {exc}")

    # Save the test set to a CSV file
    test_data = []
    for i, (image, label) in enumerate(test_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        test_data.append([image_data, label])
    test_df = pd.DataFrame(test_data, columns=['image', 'label'])
    test_csv_data = test_df.to_csv(index=False)

    # Save the data to MinIO storage
    try:
        data_bytes = test_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "test_dataset.csv", data_stream, len(data_bytes))
        print(f"Test data saved to MinIO, bucket: {bucket_name}, object: test_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save test data to MinIO: {exc}")

    result = {'res': '666'}
    return result
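For context on why this action is so slow and memory-hungry (an editorial back-of-envelope sketch, not part of the original report): converting CIFAR-10 to nested Python lists materializes every pixel as a boxed Python float before pandas even renders the CSV text.

```python
# Rough estimate of the in-memory cost of turning CIFAR-10 into
# nested Python lists, as the action above does.
train_images = 50_000           # CIFAR-10 training set size
values_per_image = 3 * 32 * 32  # channels x height x width after ToTensor()
total_values = train_images * values_per_image

# A boxed Python float costs roughly 24 bytes, plus ~8 bytes per list slot.
rough_bytes = total_values * (24 + 8)
print(total_values)             # 153600000 float values
print(rough_bytes / 1024 ** 3)  # roughly 4.6 GiB, before the CSV text is built
```

The CSV string and its encoded bytes then add several more gigabytes on top of this, which is consistent with the out-of-memory behavior discussed later in the thread.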


QWQyyy commented Jul 10, 2023

And we could also discuss some other issues here: #5427


QWQyyy commented Jul 10, 2023

I figured out that this error might stem from long-running functions. My data-processing function behaves fine when processing a tenth of the CIFAR10 dataset, but terminates after a few minutes when processing the full dataset. Therefore, I suspect OpenWhisk has a bug. On the surface, the expected behavior is that when a function executes for more than 1 minute, the client is cut off and the function keeps executing in the background; but in reality it reports an error after running for several minutes. I think this may be a worthwhile question. @style95

The results are shown in the attached screenshot.

Action code that does not produce the error:

import pandas as pd
from torchvision import datasets, transforms
import numpy as np
from minio import Minio
from minio.error import S3Error
import io

def main(args):
    # Transforms
    transform_train = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    transform_test = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Dataset path
    data_path = "/dataset/CIFAR10"

    # Create a connection to the MinIO server
    client = Minio(
        "192.168.1.23:9000",
        access_key="CuZRDbhLVIIjfDp4M4p8",
        secret_key="tFDb4aoeSqlCB0bH6BoYfNvIWXqSCJQCZ6UD0Hd9",
        secure=False
    )

    # Create the bucket
    bucket_name = "cifar"
    try:
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)
            print(f"Bucket {bucket_name} created successfully")
        else:
            print(f"Bucket {bucket_name} already exists")
    except S3Error as exc:
        print(f"Failed to create bucket {bucket_name}: {exc}")

    # Load the training and test sets
    train_dataset = datasets.CIFAR10(root=data_path, train=True, transform=transform_train, download=False)
    test_dataset = datasets.CIFAR10(root=data_path, train=False, transform=transform_test, download=False)
    # Save the training set to a CSV file
    train_data = []
    for i, (image, label) in enumerate(train_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        train_data.append([image_data, label])
        if i >= len(train_dataset) * 0.1:  # cap the number of training samples
            break
    train_df = pd.DataFrame(train_data, columns=['image', 'label'])
    train_csv_data = train_df.to_csv(index=False)

    # Save the data to MinIO storage
    try:
        data_bytes = train_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "train_dataset.csv", data_stream, len(data_bytes))
        print(f"Training data saved to MinIO, bucket: {bucket_name}, object: train_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save training data to MinIO: {exc}")

    # Save the test set to a CSV file
    test_data = []
    for i, (image, label) in enumerate(test_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        test_data.append([image_data, label])
        if i >= len(test_dataset) * 0.1:  # cap the number of test samples
            break
    test_df = pd.DataFrame(test_data, columns=['image', 'label'])
    test_csv_data = test_df.to_csv(index=False)

    # Save the data to MinIO storage
    try:
        data_bytes = test_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "test_dataset.csv", data_stream, len(data_bytes))
        print(f"Test data saved to MinIO, bucket: {bucket_name}, object: test_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save test data to MinIO: {exc}")

    result = {"res": "666"}
    return result

And the action code that does produce the error:

import pandas as pd
from torchvision import datasets, transforms
import numpy as np
from minio import Minio
from minio.error import S3Error
import io

def main(args):
    # Transforms
    transform_train = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    transform_test = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Dataset path
    data_path = "/dataset/CIFAR10"

    # Create a connection to the MinIO server
    client = Minio(
        "192.168.1.23:9000",
        access_key="CuZRDbhLVIIjfDp4M4p8",
        secret_key="tFDb4aoeSqlCB0bH6BoYfNvIWXqSCJQCZ6UD0Hd9",
        secure=False
    )

    # Create the bucket
    bucket_name = "cifar"
    try:
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)
            print(f"Bucket {bucket_name} created successfully")
        else:
            print(f"Bucket {bucket_name} already exists")
    except S3Error as exc:
        print(f"Failed to create bucket {bucket_name}: {exc}")

    # Load the training and test sets
    train_dataset = datasets.CIFAR10(root=data_path, train=True, transform=transform_train, download=False)
    test_dataset = datasets.CIFAR10(root=data_path, train=False, transform=transform_test, download=False)
    # Save the training set to a CSV file
    train_data = []
    for i, (image, label) in enumerate(train_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        train_data.append([image_data, label])

    train_df = pd.DataFrame(train_data, columns=['image', 'label'])
    train_csv_data = train_df.to_csv(index=False)

    # Save the data to MinIO storage
    try:
        data_bytes = train_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "train_dataset.csv", data_stream, len(data_bytes))
        print(f"Training data saved to MinIO, bucket: {bucket_name}, object: train_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save training data to MinIO: {exc}")

    # Save the test set to a CSV file
    test_data = []
    for i, (image, label) in enumerate(test_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        test_data.append([image_data, label])
    test_df = pd.DataFrame(test_data, columns=['image', 'label'])
    test_csv_data = test_df.to_csv(index=False)

    # Save the data to MinIO storage
    try:
        data_bytes = test_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "test_dataset.csv", data_stream, len(data_bytes))
        print(f"Test data saved to MinIO, bucket: {bucket_name}, object: test_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save test data to MinIO: {exc}")

    result = {"res": "666"}
    return result

The only difference between the two versions is the reduced dataset size.
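An editorial aside, not part of the thread: one way to shrink both the runtime and the memory footprint here is to upload the raw tensor bytes instead of a CSV of nested Python lists. A minimal stdlib-only sketch (the small value list is a hypothetical stand-in for an image batch; the resulting stream can be uploaded with the same `client.put_object(...)` pattern used above):

```python
import array
import io

# Hypothetical stand-in for a small image batch: 10 "images" of 3x32x32 floats.
values = [i / 7 for i in range(10 * 3 * 32 * 32)]

# CSV-of-lists style: every float is rendered as decimal text.
csv_bytes = ",".join(repr(v) for v in values).encode()

# Raw binary style: exactly 4 bytes per float32, no text rendering.
raw_bytes = array.array("f", values).tobytes()

print(len(raw_bytes) < len(csv_bytes))  # True: raw floats are far smaller
stream = io.BytesIO(raw_bytes)          # uploadable via client.put_object(...)
```

For real tensors, `numpy.save` into a `BytesIO` buffer achieves the same effect with shape metadata preserved.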


style95 commented Jul 11, 2023

We have a 1-minute duration limit for the blocking invocation at controllers.
But even after 1 minute, the invocation is not canceled but changed to a non-blocking call with a 202 response.

So if you get a developer error, I suspect there is an issue in your action.
As the log said, your action is not correctly initialized or run.

Just to be sure, what is the timeout value of your action?
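An editorial sketch of how a client can cope with the 202 fallback described above (my illustration, not OpenWhisk code): the 202 response body carries an `activationId`, which can be polled via the activations REST API (`GET /api/v1/namespaces/_/activations/{id}`) until the result document appears. The `fetch` parameter below is a placeholder for a real HTTP call:

```python
import time

def wait_for_activation(fetch, activation_id, attempts=10, delay=1.0):
    """Poll until an activation's result is available.

    `fetch` stands in for an HTTP GET against
    /api/v1/namespaces/_/activations/{id}; it should return the parsed
    JSON activation document, or None while the result is not yet stored.
    """
    for _ in range(attempts):
        doc = fetch(activation_id)
        if doc is not None:
            return doc.get("response", {}).get("result")
        time.sleep(delay)
    raise TimeoutError(f"activation {activation_id} did not finish in time")
```

With this pattern, a 504 from an intermediate proxy on a long blocking call stops being fatal: the caller falls back to polling instead of holding the connection open.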


QWQyyy commented Jul 11, 2023

> We have a 1-minute duration limit for the blocking invocation at controllers. But even after 1 minute, the invocation is not canceled but changed to a non-blocking call with a 202 response.
>
> So if you get a developer error, I suspect there is an issue in your action. As the log said, your action is not correctly initialized or run.
>
> Just for sure, what is the timeout value of your action?

We use this command to set the timeout:

wsk action update imagenet imagenet.zip --kind python:torch --web true -i -m 8192 -t 30000000

At the same time, we also made some configurations in the values.yaml file when deploying OpenWhisk (shown in the attached screenshot).

So it should be possible to configure a large timeout, unless our configuration is invalid.

So currently our team is confused...
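A note on the flag above (editorial arithmetic, not from the thread): the `wsk` CLI's `-t/--timeout` value is interpreted in milliseconds, so `-t 30000000` requests a very long per-action timeout. Whether it is actually honored also depends on the cluster-side cap (in the Helm chart this is a setting along the lines of `whisk.limits.actions.time.max`; the exact key should be checked against the deployed chart version):

```python
# wsk action update ... -t 30000000  -> the value is in milliseconds.
timeout_ms = 30_000_000
print(timeout_ms / 1000)        # 30000.0 seconds
print(timeout_ms / 1000 / 60)   # 500.0 minutes
```

If the cluster cap is lower than the per-action value, the update is rejected or clamped, which is worth ruling out when a long action dies early.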


QWQyyy commented Jul 11, 2023

So, I think I should consider changing the controller source code to remove the 1-minute blocking limit.


style95 commented Jul 12, 2023

I think that is not related to your action failure.
Maybe you need to look into why it failed to initialize or run your action.
Aren't there any logs on the invoker side?


QWQyyy commented Jul 12, 2023

> I think that is not related to your action failure. Maybe you need to look into why it failed to initialize or run your action. Aren't there any logs on the invoker side?

Here is our Invoker log:
[2023-07-12T01:30:37.114Z] [INFO] [#tid_sid_invokerWarmup] [KubernetesClient] [marker:invoker_kubeapi.create_finish:120318966:2082]
[2023-07-12T01:30:37.439Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [KubernetesContainer] initialization result: ok [marker:invoker_activationInit_finish:2501:2379]
[2023-07-12T01:30:37.440Z] [INFO] [#tid_sid_unknown] [ContainerProxy] resending up to 0 from 0 buffered jobs
[2023-07-12T01:30:37.440Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [KubernetesContainer] sending arguments to /guest/test at ContainerId(wskowdev-invoker-00-43-prewarm-pythontorch) ContainerAddress(10.244.0.120,8080) [marker:invoker_activationRun_start:2502]
[2023-07-12T01:33:30.204Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [KubernetesContainer] running result: not ok [marker:invoker_activationRun_finish:175266:172764]
[2023-07-12T01:33:30.205Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [ContainerProxy] [marker:invoker_collectLogs_start:175267]
[2023-07-12T01:33:30.209Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [MessagingActiveAck] posted result of activation fddaaaaa3b5943df9aaaaa3b5903dfc7
[2023-07-12T01:33:30.293Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [ContainerProxy] [marker:invoker_collectLogs_finish:175355:88]
[2023-07-12T01:33:30.294Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [CouchDbRestStore] [PUT] 'test_activations' saving document: 'id: guest/fddaaaaa3b5943df9aaaaa3b5903dfc7, rev: null' [marker:database_saveDocument_start:175356]
[2023-07-12T01:33:30.294Z] [INFO] [#tid_sid_dbBatcher] [CouchDbRestStore] 'test_activations' saving 1 documents [marker:database_saveDocumentBulk_start:120492146]
[2023-07-12T01:33:30.294Z] [ERROR] [#tid_sid_unknown] [ContainerProxy] Failed during use of warm container Some(ContainerId(wskowdev-invoker-00-43-prewarm-pythontorch)), queued activations will be resent.
[2023-07-12T01:33:30.295Z] [INFO] [#tid_sid_invokerWarmup] [ContainerPool] found 1 started and 0 starting; backfilling 4 pre-warms to desired count: 5 for kind:python:torch mem:8192 MB
[2023-07-12T01:33:30.295Z] [INFO] [#tid_sid_invokerNanny] [KubernetesClient] Deleting pod wskowdev-invoker-00-43-prewarm-pythontorch [marker:invoker_kubeapi.delete_start:120492147]
[2023-07-12T01:33:30.295Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [MessagingActiveAck] posted completion of activation fddaaaaa3b5943df9aaaaa3b5903dfc7
[2023-07-12T01:33:30.307Z] [INFO] [#tid_sid_invokerNanny] [KubernetesClient] [marker:invoker_kubeapi.delete_finish:120492159:12]
[2023-07-12T01:33:30.323Z] [INFO] [#tid_sid_dbBatcher] [CouchDbRestStore] [marker:database_saveDocumentBulk_finish:120492174:28]
[2023-07-12T01:33:30.323Z] [INFO] [#tid_64Kki46gspKlEifWfvOyfiopkn31kgTc] [CouchDbRestStore] [marker:database_saveDocument_finish:175385:29]

A more complete log is shown in the attached screenshot.


style95 commented Jul 12, 2023

Yes, as the log says, the running result is "not ok".
If possible, you may need to look into the container logs as well and see if there is any problem in the pod itself too.


QWQyyy commented Jul 12, 2023

> Yes as the log says, the running result is "not ok". If possible, you may need to look into the container logs as well and see if there is any problem in the pod itself too.

However, I'm having a hard time finding the container logs, since containers are constantly created and destroyed. So I opened two console windows: one to call the function, the other to watch the container log in real time. Unfortunately, it printed the same result as the wsk activation command. So how should we access the container logs correctly and see what we want? I temporarily suspect the container may not have enough memory. I have configured the container memory to 8GB; maybe I should give it more.

XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
(base) root@work1:~# kubectl logs -n openwhisk wskowdev-invoker-00-5-prewarm-pythontorch
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX


I will continue to follow up with some tests, thank you for your attention.


style95 commented Jul 12, 2023

Are you able to access the underlying node? Then you would be able to collect the container logs.
And regarding the OOM issue, you may look into the Kubernetes events to see if the pod was destroyed by OOM.

IIRC the flag (THE_END_OF_A_WHISK_ACTIVATION) denotes the end of one activation.
A container (pod) can be reused across invocations; each invocation leaves logs, and that flag marks where one activation's output ends.
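Since a pod's log interleaves the output of many activations separated by that sentinel (as in the `kubectl logs` output above), a small helper can split one raw log into per-activation chunks. This is an editorial sketch, not OpenWhisk code:

```python
SENTINEL = "XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX"

def split_activations(log_text):
    """Split a raw action-container log into per-activation chunks.

    Each activation's stdout/stderr ends with a sentinel line, so the
    lines between sentinels belong to one activation. Consecutive
    sentinels (stdout and stderr are flushed separately) produce no
    empty chunks.
    """
    chunks, current = [], []
    for line in log_text.splitlines():
        if line.strip() == SENTINEL:
            if current:
                chunks.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:  # trailing output from an activation still in flight
        chunks.append("\n".join(current))
    return chunks
```

Run against `kubectl logs <pod>` output, this makes it easier to find the output of the one activation that crashed.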


QWQyyy commented Jul 12, 2023

> Are you able to access the underlying node? Then you would be able to collect container logs. And regarding OOM issue, you may look into the kubernetes events to see if the pod is destroyed by OOM.
>
> IIRC the flag(THE_END_OF_A_WHISK_ACTIVATION) is to denote the end of one activation. A container(pod) can be used and each invocation can leave logs and that flag is to figure out the end of one activation.

Well noted. I managed to solve this issue by increasing the memory of a single function container to 16GB, so it was indeed caused by OOM. That said, I still think this kind of issue is currently difficult to troubleshoot in OpenWhisk. So I want to make some changes to the source code: one is to remove the controller's 1-minute blocking restriction, and the other is to connect Kubernetes OOM events to the OpenWhisk messaging system, because it is currently hard to diagnose OOM problems from within OpenWhisk.


QWQyyy commented Jul 12, 2023

In my opinion, compared to a few years ago, workloads have shifted toward ML services, so the 1-minute limit should be configurable.


style95 commented Jul 12, 2023

The 1-minute limit was introduced to protect controllers.
If there were a huge number of long-running connections, the controllers' connections would be easily exhausted.
But I think it would be viable to make the limit configurable so that OW operators can set it according to their requirements.

Regarding the OOM issue, I believe you can easily catch it with Kubernetes events.
In my downstream deployment, we have a Grafana dashboard and get alarms from kube events such as pod OOM kills.


QWQyyy commented Jul 12, 2023

> 1 min limit is introduced to protect controllers. If there are a huge number of long-running connections, all connections of controllers would be easily exhausted. But I think it would be viable to make the limit configurable so that OW operators may configure it according to their requirements.
>
> Regarding the OOM issue, I believe you can easily catch it with Kubernetes events. In my downstream, we have a Grafana dashboard and getting alarms from kube events like pod OOM crashed.

Great. I think I should get more familiar with K8s events and deploy a Prometheus + Grafana stack.
