diff --git a/program/cohort-old.ipynb b/backup/cohort-7-linear.ipynb similarity index 100% rename from program/cohort-old.ipynb rename to backup/cohort-7-linear.ipynb diff --git a/backup/cohort-7.ipynb b/backup/cohort-7.ipynb new file mode 100644 index 0000000..14860f6 --- /dev/null +++ b/backup/cohort-7.ipynb @@ -0,0 +1,6802 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "00b7ba7b-433c-463c-8e5e-8b975a5be463", + "metadata": { + "tags": [] + }, + "source": [ + "# Building Production Machine Learning Systems\n" + ] + }, + { + "cell_type": "markdown", + "id": "bd7bb73f", + "metadata": {}, + "source": [ + "This notebook creates a [SageMaker Pipeline](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html) to build an end-to-end Machine Learning system to solve the problem of classifying penguin species. With a SageMaker Pipeline, you can create, automate, and manage end-to-end Machine Learning workflows at scale.\n", + "\n", + "You can find more information about Amazon SageMaker in the [Amazon SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html). 
The [AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/) is an excellent source to stay up-to-date with SageMaker.\n", + "\n", + "This example uses the [Penguins dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data), the [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html) library, and the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/).\n", + "\n", + "Penguins\n", + "\n", + "This notebook is part of the [Machine Learning School](https://www.ml.school) program.\n" + ] + }, + { + "cell_type": "markdown", + "id": "5ec22ac1", + "metadata": {}, + "source": [ + "## Initial setup\n", + "\n", + ":::{.callout-note}\n", + "Before running this notebook, follow the [setup instructions](https://program.ml.school/setup.html) for the program.\n", + ":::\n", + "\n", + "Let's start by setting up the environment and preparing to run the notebook.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 640, + "id": "4b2265b0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The autoreload extension is already loaded. To reload it, use:\n", + " %reload_ext autoreload\n", + "The dotenv extension is already loaded. 
To reload it, use:\n", + " %reload_ext dotenv\n" + ] + } + ], + "source": [ + "#| hide\n", + "\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "%load_ext dotenv\n", + "%dotenv\n", + "\n", + "import sys\n", + "import logging\n", + "import ipytest\n", + "import json\n", + "from pathlib import Path\n", + "\n", + "\n", + "CODE_FOLDER = Path(\"code\")\n", + "CODE_FOLDER.mkdir(parents=True, exist_ok=True)\n", + "INFERENCE_CODE_FOLDER = CODE_FOLDER / \"inference\"\n", + "INFERENCE_CODE_FOLDER.mkdir(parents=True, exist_ok=True)\n", + "\n", + "sys.path.append(f\"./{CODE_FOLDER}\")\n", + "sys.path.append(f\"./{INFERENCE_CODE_FOLDER}\")\n", + "\n", + "DATA_FILEPATH = \"penguins.csv\"\n", + "\n", + "ipytest.autoconfig(raise_on_error=True)\n", + "\n", + "# By default, the SageMaker SDK logs events related to the default\n", + "# configuration using the INFO level. To prevent these from spoiling\n", + "# the output of this notebook's cells, we can change the logging\n", + "# level to ERROR instead.\n", + "logging.getLogger(\"sagemaker.config\").setLevel(logging.ERROR)" + ] + }, + { + "cell_type": "markdown", + "id": "588d34c9", + "metadata": {}, + "source": [ + "We can run this notebook in [Local Mode](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-local-mode.html) to test the pipeline in your local environment before using SageMaker. 
You can run the code in Local Mode by setting the `LOCAL_MODE` constant to `True`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 641, + "id": "32c4d764", + "metadata": {}, + "outputs": [], + "source": [ + "LOCAL_MODE = False" + ] + }, + { + "cell_type": "markdown", + "id": "d6be4f8d", + "metadata": {}, + "source": [ + "Let's load the S3 bucket name and the AWS Role from the environment variables:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 642, + "id": "3164a3af", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "bucket = os.environ[\"BUCKET\"]\n", + "role = os.environ[\"ROLE\"]\n", + "\n", + "S3_LOCATION = f\"s3://{bucket}/penguins\"" + ] + }, + { + "cell_type": "markdown", + "id": "daa700f4", + "metadata": {}, + "source": [ + "If you are running the pipeline in Local Mode on an ARM64 machine, you will need to use a custom Docker image to train and evaluate the model. This is because SageMaker doesn't provide a TensorFlow image that supports Apple's M chips.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 643, + "id": "7bc40d28", + "metadata": {}, + "outputs": [], + "source": [ + "architecture = !(uname -m)\n", + "IS_APPLE_M_CHIP = architecture[0] == \"arm64\"" + ] + }, + { + "cell_type": "markdown", + "id": "7d906ada", + "metadata": {}, + "source": [ + "Let's create a configuration dictionary with different settings depending on whether we are running the pipeline in Local Mode or not:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 644, + "id": "3b3f17e5", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker\n", + "from sagemaker.workflow.pipeline_context import PipelineSession, LocalPipelineSession\n", + "\n", + "pipeline_session = PipelineSession(default_bucket=bucket) if not LOCAL_MODE else None\n", + "\n", + "if LOCAL_MODE:\n", + " config = {\n", + " \"session\": LocalPipelineSession(default_bucket=bucket),\n", + " \"instance_type\": \"local\",\n", + " # We need 
to use a custom Docker image when we run the pipeline\n", + " # in Local Mode on an ARM64 machine.\n", + " \"image\": \"sagemaker-tensorflow-toolkit-local\" if IS_APPLE_M_CHIP else None,\n", + " \"framework_version\": None if IS_APPLE_M_CHIP else \"2.11\",\n", + " \"py_version\": None if IS_APPLE_M_CHIP else \"py39\",\n", + " }\n", + "else:\n", + " config = {\n", + " \"session\": pipeline_session,\n", + " \"instance_type\": \"ml.m5.xlarge\",\n", + " \"image\": None,\n", + " \"framework_version\": \"2.11\",\n", + " \"py_version\": \"py39\",\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "9089696b", + "metadata": {}, + "source": [ + "Let's now initialize a few variables that we'll need throughout the notebook:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 645, + "id": "942a01b5", + "metadata": {}, + "outputs": [], + "source": [ + "import boto3\n", + "\n", + "sagemaker_session = sagemaker.session.Session()\n", + "sagemaker_client = boto3.client(\"sagemaker\")\n", + "iam_client = boto3.client(\"iam\")\n", + "region = boto3.Session().region_name" + ] + }, + { + "cell_type": "markdown", + "id": "11137928-6b4e-465c-8ad7-2297afbaa33c", + "metadata": {}, + "source": [ + "## Session 1 - Production Machine Learning is Different\n", + "\n", + "In this session we'll run Exploratory Data Analysis on the [Penguins dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data) and we'll build a simple [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) with one step to split and transform the data. 
\n", + "\n", + " \"Training\"\n", + "\n", + "We'll use a [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for the transformations, and a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) with a [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) to execute a preprocessing script. Check the [SageMaker Pipelines Overview](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) for an introduction to the fundamental components of a SageMaker Pipeline.\n" + ] + }, + { + "cell_type": "markdown", + "id": "3a835695-557b-46d8-a901-a29bc57df5fe", + "metadata": {}, + "source": [ + "### Step 1 - Exploratory Data Analysis\n", + "\n", + "Let's run Exploratory Data Analysis on the dataset. The goal of this section is to understand the data and the problem we are trying to solve.\n", + "\n", + "Let's load the Penguins dataset:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 646, + "id": "f1cd2f0e-446d-48a9-a008-b4f1cc593bfc", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
0AdelieTorgersen39.118.7181.03750.0MALE
1AdelieTorgersen39.517.4186.03800.0FEMALE
2AdelieTorgersen40.318.0195.03250.0FEMALE
3AdelieTorgersenNaNNaNNaNNaNNaN
4AdelieTorgersen36.719.3193.03450.0FEMALE
\n", + "
" + ], + "text/plain": [ + " species island culmen_length_mm culmen_depth_mm flipper_length_mm \\\n", + "0 Adelie Torgersen 39.1 18.7 181.0 \n", + "1 Adelie Torgersen 39.5 17.4 186.0 \n", + "2 Adelie Torgersen 40.3 18.0 195.0 \n", + "3 Adelie Torgersen NaN NaN NaN \n", + "4 Adelie Torgersen 36.7 19.3 193.0 \n", + "\n", + " body_mass_g sex \n", + "0 3750.0 MALE \n", + "1 3800.0 FEMALE \n", + "2 3250.0 FEMALE \n", + "3 NaN NaN \n", + "4 3450.0 FEMALE " + ] + }, + "execution_count": 646, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "penguins = pd.read_csv(DATA_FILEPATH)\n", + "penguins.head()" + ] + }, + { + "cell_type": "markdown", + "id": "c9eae10e-20c4-477e-b6b8-965c3a53566e", + "metadata": {}, + "source": [ + "We can see the dataset contains the following columns:\n", + "\n", + "1. `species`: The species of a penguin. This is the column we want to predict.\n", + "2. `island`: The island where the penguin was found\n", + "3. `culmen_length_mm`: The length of the penguin's culmen (bill) in millimeters\n", + "4. `culmen_depth_mm`: The depth of the penguin's culmen in millimeters\n", + "5. `flipper_length_mm`: The length of the penguin's flipper in millimeters\n", + "6. `body_mass_g`: The body mass of the penguin in grams\n", + "7. `sex`: The sex of the penguin\n", + "\n", + "If you are curious, here is the description of a penguin's culmen:\n", + "\n", + "Culmen\n", + "\n", + "Now, let's get the summary statistics for the features in our dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 647, + "id": "f2107c25-e730-4e22-a1b8-5bda53e61124", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciesislandculmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_gsex
count344344342.000000342.000000342.000000342.000000334
unique33NaNNaNNaNNaN3
topAdelieBiscoeNaNNaNNaNNaNMALE
freq152168NaNNaNNaNNaN168
meanNaNNaN43.92193017.151170200.9152054201.754386NaN
stdNaNNaN5.4595841.97479314.061714801.954536NaN
minNaNNaN32.10000013.100000172.0000002700.000000NaN
25%NaNNaN39.22500015.600000190.0000003550.000000NaN
50%NaNNaN44.45000017.300000197.0000004050.000000NaN
75%NaNNaN48.50000018.700000213.0000004750.000000NaN
maxNaNNaN59.60000021.500000231.0000006300.000000NaN
\n", + "
" + ], + "text/plain": [ + " species island culmen_length_mm culmen_depth_mm flipper_length_mm \\\n", + "count 344 344 342.000000 342.000000 342.000000 \n", + "unique 3 3 NaN NaN NaN \n", + "top Adelie Biscoe NaN NaN NaN \n", + "freq 152 168 NaN NaN NaN \n", + "mean NaN NaN 43.921930 17.151170 200.915205 \n", + "std NaN NaN 5.459584 1.974793 14.061714 \n", + "min NaN NaN 32.100000 13.100000 172.000000 \n", + "25% NaN NaN 39.225000 15.600000 190.000000 \n", + "50% NaN NaN 44.450000 17.300000 197.000000 \n", + "75% NaN NaN 48.500000 18.700000 213.000000 \n", + "max NaN NaN 59.600000 21.500000 231.000000 \n", + "\n", + " body_mass_g sex \n", + "count 342.000000 334 \n", + "unique NaN 3 \n", + "top NaN MALE \n", + "freq NaN 168 \n", + "mean 4201.754386 NaN \n", + "std 801.954536 NaN \n", + "min 2700.000000 NaN \n", + "25% 3550.000000 NaN \n", + "50% 4050.000000 NaN \n", + "75% 4750.000000 NaN \n", + "max 6300.000000 NaN " + ] + }, + "execution_count": 647, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "penguins.describe(include=\"all\")" + ] + }, + { + "cell_type": "markdown", + "id": "b2e19af7-9f0f-45fe-b7d3-f19721c02a2b", + "metadata": {}, + "source": [ + "Let's now display the distribution of values for the three categorical columns in our data:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 648, + "id": "1242122a-726e-4c37-a718-dd8e873d1612", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "species\n", + "Adelie 152\n", + "Gentoo 124\n", + "Chinstrap 68\n", + "Name: count, dtype: int64\n", + "\n", + "island\n", + "Biscoe 168\n", + "Dream 124\n", + "Torgersen 52\n", + "Name: count, dtype: int64\n", + "\n", + "sex\n", + "MALE 168\n", + "FEMALE 165\n", + ". 
1\n", + "Name: count, dtype: int64\n" + ] + } + ], + "source": [ + "species_distribution = penguins[\"species\"].value_counts()\n", + "island_distribution = penguins[\"island\"].value_counts()\n", + "sex_distribution = penguins[\"sex\"].value_counts()\n", + "\n", + "print(species_distribution)\n", + "print()\n", + "print(island_distribution)\n", + "print()\n", + "print(sex_distribution)" + ] + }, + { + "cell_type": "markdown", + "id": "e9d98fdd-3b8c-40a2-b8dc-15162b4049e2", + "metadata": {}, + "source": [ + "The distributions of the categories in our data are:\n", + "\n", + "- `species`: There are 3 species of penguins in the dataset: Adelie (152), Gentoo (124), and Chinstrap (68).\n", + "- `island`: Penguins are from 3 islands: Biscoe (168), Dream (124), and Torgersen (52).\n", + "- `sex`: We have 168 male penguins, 165 female penguins, and 1 penguin with an ambiguous value ('.').\n", + "\n", + "Let's replace the ambiguous value in the `sex` column with a null value:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 649, + "id": "cf1cf582-8831-4f83-bb17-2175afb193e8", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "sex\n", + "MALE 168\n", + "FEMALE 165\n", + "Name: count, dtype: int64" + ] + }, + "execution_count": 649, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "penguins[\"sex\"] = penguins[\"sex\"].replace(\".\", np.nan)\n", + "penguins[\"sex\"].value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "6e8425ce-ce4e-43e6-9ed8-0398b780cc66", + "metadata": {}, + "source": [ + "Next, let's check for any missing values in the dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 650, + "id": "cc42cb08-275c-4b05-9d2b-77052da2f336", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "species 0\n", + "island 0\n", + "culmen_length_mm 2\n", + "culmen_depth_mm 2\n", + "flipper_length_mm 2\n", + "body_mass_g 2\n", + "sex 11\n", + 
"dtype: int64" + ] + }, + "execution_count": 650, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "penguins.isnull().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "1b65207c-3e66-453a-87a1-751636c979ee", + "metadata": {}, + "source": [ + "Let's get rid of the missing values. For now, we are going to replace the missing values with the most frequent value in the column. Later, we'll use a different strategy to replace missing numeric values.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 651, + "id": "3c57d55d-afd6-467a-a7a8-ff04132770ed", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "species 0\n", + "island 0\n", + "culmen_length_mm 0\n", + "culmen_depth_mm 0\n", + "flipper_length_mm 0\n", + "body_mass_g 0\n", + "sex 0\n", + "dtype: int64" + ] + }, + "execution_count": 651, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "imputer = SimpleImputer(strategy=\"most_frequent\")\n", + "penguins.iloc[:, :] = imputer.fit_transform(penguins)\n", + "penguins.isnull().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "5758214f-a4ab-4980-8892-91ec8d218ef3", + "metadata": {}, + "source": [ + "Let's visualize the distribution of categorical features.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 652, + "id": "2852c740", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "fig, axs = plt.subplots(3, 1, figsize=(6, 10))\n", + "\n", + "axs[0].bar(species_distribution.index, species_distribution.values)\n", + "axs[0].set_ylabel(\"Count\")\n", + "axs[0].set_title(\"Distribution of Species\")\n", + "\n", + "axs[1].bar(island_distribution.index, island_distribution.values)\n", + "axs[1].set_ylabel(\"Count\")\n", + "axs[1].set_title(\"Distribution of Island\")\n", + "\n", + "axs[2].bar(sex_distribution.index, sex_distribution.values)\n", + "axs[2].set_ylabel(\"Count\")\n", + "axs[2].set_title(\"Distribution of Sex\")\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "b04c8fae-35b4-4d8e-8fff-decee050af3a", + "metadata": {}, + "source": [ + "Let's visualize the distribution of numerical columns.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 653, + "id": "707cc972", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig, axs = plt.subplots(2, 2, figsize=(8, 6))\n", + "\n", + "axs[0, 0].hist(penguins[\"culmen_length_mm\"], bins=20)\n", + "axs[0, 0].set_ylabel(\"Count\")\n", + "axs[0, 0].set_title(\"Distribution of culmen_length_mm\")\n", + "\n", + "axs[0, 1].hist(penguins[\"culmen_depth_mm\"], bins=20)\n", + "axs[0, 1].set_ylabel(\"Count\")\n", + "axs[0, 1].set_title(\"Distribution of culmen_depth_mm\")\n", + "\n", + "axs[1, 0].hist(penguins[\"flipper_length_mm\"], bins=20)\n", + "axs[1, 0].set_ylabel(\"Count\")\n", + "axs[1, 0].set_title(\"Distribution of flipper_length_mm\")\n", + "\n", + "axs[1, 1].hist(penguins[\"body_mass_g\"], bins=20)\n", + "axs[1, 1].set_ylabel(\"Count\")\n", + "axs[1, 1].set_title(\"Distribution of body_mass_g\")\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "ef241df0-3acd-4401-a2c6-b70723d7595b", + "metadata": {}, + "source": [ + "Let's display the covariance matrix of the dataset. The \"covariance\" measures how changes in one variable are associated with changes in a second variable. In other words, the covariance measures the degree to which two variables are linearly associated.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 654, + "id": "3daf3ba1-d218-4ad4-b862-af679b91273f", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
culmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_g
culmen_length_mm29.679415-2.51698450.2605882596.971151
culmen_depth_mm-2.5169843.877201-16.108849-742.660180
flipper_length_mm50.260588-16.108849197.2695019792.552037
body_mass_g2596.971151-742.6601809792.552037640316.716388
\n", + "
" + ], + "text/plain": [ + " culmen_length_mm culmen_depth_mm flipper_length_mm \\\n", + "culmen_length_mm 29.679415 -2.516984 50.260588 \n", + "culmen_depth_mm -2.516984 3.877201 -16.108849 \n", + "flipper_length_mm 50.260588 -16.108849 197.269501 \n", + "body_mass_g 2596.971151 -742.660180 9792.552037 \n", + "\n", + " body_mass_g \n", + "culmen_length_mm 2596.971151 \n", + "culmen_depth_mm -742.660180 \n", + "flipper_length_mm 9792.552037 \n", + "body_mass_g 640316.716388 " + ] + }, + "execution_count": 654, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "penguins.cov(numeric_only=True)" + ] + }, + { + "cell_type": "markdown", + "id": "9fbbe6bc-0104-4663-8c30-8f9566755739", + "metadata": {}, + "source": [ + "Here are three examples of what we get from interpreting the covariance matrix below:\n", + "\n", + "1. Penguins that weight more tend to have a larger culmen.\n", + "2. The more a penguin weights, the shallower its culmen tends to be.\n", + "3. There's a small variance between the culmen depth of penguins.\n", + "\n", + "Let's now display the correlation matrix. \"Correlation\" measures both the strength and direction of the linear relationship between two variables.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 655, + "id": "1d793e09-2cb9-47ff-a0e6-199a0f4fc1b3", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
culmen_length_mmculmen_depth_mmflipper_length_mmbody_mass_g
culmen_length_mm1.000000-0.2346350.6568560.595720
culmen_depth_mm-0.2346351.000000-0.582472-0.471339
flipper_length_mm0.656856-0.5824721.0000000.871302
body_mass_g0.595720-0.4713390.8713021.000000
\n", + "
" + ], + "text/plain": [ + " culmen_length_mm culmen_depth_mm flipper_length_mm \\\n", + "culmen_length_mm 1.000000 -0.234635 0.656856 \n", + "culmen_depth_mm -0.234635 1.000000 -0.582472 \n", + "flipper_length_mm 0.656856 -0.582472 1.000000 \n", + "body_mass_g 0.595720 -0.471339 0.871302 \n", + "\n", + " body_mass_g \n", + "culmen_length_mm 0.595720 \n", + "culmen_depth_mm -0.471339 \n", + "flipper_length_mm 0.871302 \n", + "body_mass_g 1.000000 " + ] + }, + "execution_count": 655, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "penguins.corr(numeric_only=True)" + ] + }, + { + "cell_type": "markdown", + "id": "8aec4c08-767c-4740-959c-2d76268c3513", + "metadata": {}, + "source": [ + "Here are three examples of what we get from interpreting the correlation matrix below:\n", + "\n", + "1. Penguins that weight more tend to have larger flippers.\n", + "2. Penguins with a shallower culmen tend to have larger flippers.\n", + "3. The length and depth of the culmen have a slight negative correlation.\n", + "\n", + "Let's display the distribution of species by island.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 656, + "id": "1258c99d", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "unique_species = penguins[\"species\"].unique()\n", + "\n", + "fig, ax = plt.subplots(figsize=(6, 6))\n", + "for species in unique_species:\n", + " data = penguins[penguins[\"species\"] == species]\n", + " ax.hist(data[\"island\"], bins=5, alpha=0.5, label=species)\n", + "\n", + "ax.set_xlabel(\"Island\")\n", + "ax.set_ylabel(\"Count\")\n", + "ax.set_title(\"Distribution of Species by Island\")\n", + "ax.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "d74ae740-3590-4dce-ac5a-6205975c83da", + "metadata": {}, + "source": [ + "Let's display the distribution of species by sex.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 657, + "id": "45b0a87f-028d-477f-9b65-199728c0b7ee", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 6))\n", + "\n", + "for species in unique_species:\n", + " data = penguins[penguins[\"species\"] == species]\n", + " ax.hist(data[\"sex\"], bins=3, alpha=0.5, label=species)\n", + "\n", + "ax.set_xlabel(\"Sex\")\n", + "ax.set_ylabel(\"Count\")\n", + "ax.set_title(\"Distribution of Species by Sex\")\n", + "\n", + "ax.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "587d06e0-b711-4e3d-b424-6fa611a51f94", + "metadata": { + "tags": [] + }, + "source": [ + "### Step 2 - Creating the Preprocessing Script\n", + "\n", + "The first step we need in the pipeline is a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) to run a script that will split and transform the data. This Processing Step will create a SageMaker Processing Job in the background, run the script, and upload the output to S3. You can use Processing Jobs to perform data preprocessing, post-processing, feature engineering, data validation, and model evaluation. 
Check the [ProcessingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.ProcessingStep) documentation in the SageMaker Python SDK for more information.\n" + ] + }, + { + "cell_type": "markdown", + "id": "7d656af1", + "metadata": {}, + "source": [ + "The first step is to create the script that will split and transform the input data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 658, + "id": "fb6ba7c0-1bd6-4fe5-8b7f-f6cbdfd3846c", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/preprocessor.py\n" + ] + } + ], + "source": [ + "%%writefile {CODE_FOLDER}/preprocessor.py\n", + "#| label: preprocessing-script\n", + "#| echo: true\n", + "#| output: false\n", + "#| filename: preprocessor.py\n", + "#| code-line-numbers: true\n", + "\n", + "import os\n", + "import sys\n", + "import argparse\n", + "import json\n", + "import tarfile\n", + "import tempfile\n", + "import time\n", + "import joblib\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from io import StringIO\n", + "from pathlib import Path\n", + "from sklearn.compose import ColumnTransformer, make_column_selector\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.pipeline import Pipeline, make_pipeline\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder\n", + "\n", + "\n", + "def preprocess(base_directory):\n", + " \"\"\"\n", + " This function loads the supplied data, splits it and transforms it.\n", + " \"\"\"\n", + "\n", + " df = _read_data_from_input_csv_files(base_directory)\n", + " \n", + " target_transformer = ColumnTransformer(\n", + " transformers=[(\"species\", OrdinalEncoder(), [0])]\n", + " )\n", + " \n", + " numeric_transformer = make_pipeline(\n", + " SimpleImputer(strategy=\"mean\"),\n", + " StandardScaler()\n", + " 
)\n", + "\n", + " categorical_transformer = make_pipeline(\n", + " SimpleImputer(strategy=\"most_frequent\"),\n", + " OneHotEncoder()\n", + " )\n", + " \n", + " features_transformer = ColumnTransformer(\n", + " transformers=[\n", + " (\"numeric\", numeric_transformer, make_column_selector(dtype_exclude=\"object\")),\n", + " (\"categorical\", categorical_transformer, [\"island\"]),\n", + " ]\n", + " )\n", + "\n", + " df_train, df_validation, df_test = _split_data(df)\n", + "\n", + " _save_baselines(base_directory, df_train, df_test)\n", + "\n", + " y_train = target_transformer.fit_transform(np.array(df_train.species.values).reshape(-1, 1))\n", + " y_validation = target_transformer.transform(np.array(df_validation.species.values).reshape(-1, 1))\n", + " y_test = target_transformer.transform(np.array(df_test.species.values).reshape(-1, 1))\n", + " \n", + " df_train = df_train.drop(\"species\", axis=1)\n", + " df_validation = df_validation.drop(\"species\", axis=1)\n", + " df_test = df_test.drop(\"species\", axis=1)\n", + "\n", + " X_train = features_transformer.fit_transform(df_train)\n", + " X_validation = features_transformer.transform(df_validation)\n", + " X_test = features_transformer.transform(df_test)\n", + "\n", + " _save_splits(base_directory, X_train, y_train, X_validation, y_validation, X_test, y_test)\n", + " _save_model(base_directory, target_transformer, features_transformer)\n", + " \n", + "\n", + "def _read_data_from_input_csv_files(base_directory):\n", + " \"\"\"\n", + " This function reads every CSV file available and concatenates\n", + " them into a single dataframe.\n", + " \"\"\"\n", + "\n", + " input_directory = Path(base_directory) / \"input\"\n", + " files = [file for file in input_directory.glob(\"*.csv\")]\n", + " \n", + " if len(files) == 0:\n", + " raise ValueError(f\"There are no CSV files in {str(input_directory)}/\")\n", + " \n", + " raw_data = [pd.read_csv(file) for file in files]\n", + " df = pd.concat(raw_data)\n", + " \n", + " # 
Shuffle the data\n", + " return df.sample(frac=1, random_state=42)\n", + "\n", + "\n", + "def _split_data(df):\n", + " \"\"\"\n", + " Splits the data into three sets: train, validation and test.\n", + " \"\"\"\n", + "\n", + " df_train, temp = train_test_split(df, test_size=0.3)\n", + " df_validation, df_test = train_test_split(temp, test_size=0.5)\n", + "\n", + " return df_train, df_validation, df_test\n", + "\n", + "\n", + "def _save_baselines(base_directory, df_train, df_test):\n", + " \"\"\"\n", + " During the data and quality monitoring steps, we will need baselines\n", + " to compute constraints and statistics. This function saves the \n", + " untransformed data to disk so we can use them as baselines later.\n", + " \"\"\"\n", + "\n", + " for split, data in [(\"train\", df_train), (\"test\", df_test)]:\n", + " baseline_path = Path(base_directory) / f\"{split}-baseline\"\n", + " baseline_path.mkdir(parents=True, exist_ok=True)\n", + "\n", + " df = data.copy().dropna()\n", + "\n", + " # We want to save the header only for the train baseline\n", + " # but not for the test baseline. 
We'll use the test baseline\n", + " # to generate predictions later, and we can't have a header line\n", + " # because the model won't be able to make a prediction for it.\n", + " header = split == \"train\"\n", + " df.to_csv(baseline_path / f\"{split}-baseline.csv\", header=header, index=False)\n", + "\n", + "\n", + "def _save_splits(base_directory, X_train, y_train, X_validation, y_validation, X_test, y_test):\n", + " \"\"\"\n", + " This function concatenates the transformed features and the target variable, and\n", + " saves each one of the split sets to disk.\n", + " \"\"\"\n", + "\n", + " train = np.concatenate((X_train, y_train), axis=1)\n", + " validation = np.concatenate((X_validation, y_validation), axis=1)\n", + " test = np.concatenate((X_test, y_test), axis=1)\n", + "\n", + " train_path = Path(base_directory) / \"train\"\n", + " validation_path = Path(base_directory) / \"validation\"\n", + " test_path = Path(base_directory) / \"test\"\n", + "\n", + " train_path.mkdir(parents=True, exist_ok=True)\n", + " validation_path.mkdir(parents=True, exist_ok=True)\n", + " test_path.mkdir(parents=True, exist_ok=True)\n", + "\n", + " pd.DataFrame(train).to_csv(train_path / \"train.csv\", header=False, index=False)\n", + " pd.DataFrame(validation).to_csv(validation_path / \"validation.csv\", header=False, index=False)\n", + " pd.DataFrame(test).to_csv(test_path / \"test.csv\", header=False, index=False)\n", + "\n", + "\n", + "def _save_model(base_directory, target_transformer, features_transformer):\n", + " \"\"\"\n", + " This function creates a model.tar.gz file that contains the two transformation\n", + " pipelines we built to transform the data.\n", + " \"\"\"\n", + "\n", + " with tempfile.TemporaryDirectory() as directory:\n", + " joblib.dump(target_transformer, os.path.join(directory, \"target.joblib\"))\n", + " joblib.dump(features_transformer, os.path.join(directory, \"features.joblib\"))\n", + " \n", + " model_path = Path(base_directory) / \"model\"\n", + " 
model_path.mkdir(parents=True, exist_ok=True)\n", + " \n", + " with tarfile.open(f\"{str(model_path / 'model.tar.gz')}\", \"w:gz\") as tar:\n", + " tar.add(os.path.join(directory, \"target.joblib\"), arcname=\"target.joblib\")\n", + " tar.add(os.path.join(directory, \"features.joblib\"), arcname=\"features.joblib\")\n", + "\n", + " \n", + "if __name__ == \"__main__\":\n", + " preprocess(base_directory=\"/opt/ml/processing\")" + ] + }, + { + "cell_type": "markdown", + "id": "39301f9f", + "metadata": {}, + "source": [ + "Let's test the script to ensure everything is working as expected:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 659, + "id": "d1f122a4-acff-4687-91b9-bfef13567d88", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\n", + "\u001b[32m\u001b[32m\u001b[1m8 passed\u001b[0m\u001b[32m in 0.16s\u001b[0m\u001b[0m\n" + ] + } + ], + "source": [ + "%%ipytest -s\n", + "\n", + "#| code-fold: true\n", + "#| output: false\n", + "\n", + "import os\n", + "import shutil\n", + "import tarfile\n", + "import pytest\n", + "import tempfile\n", + "import joblib\n", + "from preprocessor import preprocess\n", + "\n", + "\n", + "@pytest.fixture(scope=\"function\", autouse=False)\n", + "def directory():\n", + " directory = tempfile.mkdtemp()\n", + " input_directory = Path(directory) / \"input\"\n", + " input_directory.mkdir(parents=True, exist_ok=True)\n", + " shutil.copy2(DATA_FILEPATH, input_directory / \"data.csv\")\n", + " \n", + " directory = Path(directory)\n", + " preprocess(base_directory=directory)\n", + " \n", + " yield directory\n", + " \n", + " shutil.rmtree(directory)\n", + "\n", + "\n", + "def test_preprocess_generates_data_splits(directory):\n", + " output_directories = os.listdir(directory)\n", + " \n", + " assert \"train\" in 
output_directories\n", + " assert \"validation\" in output_directories\n", + " assert \"test\" in output_directories\n", + "\n", + "\n", + "def test_preprocess_generates_baselines(directory):\n", + " output_directories = os.listdir(directory)\n", + "\n", + " assert \"train-baseline\" in output_directories\n", + " assert \"test-baseline\" in output_directories\n", + "\n", + "\n", + "def test_preprocess_creates_two_models(directory):\n", + " model_path = directory / \"model\"\n", + " tar = tarfile.open(model_path / \"model.tar.gz\", \"r:gz\")\n", + "\n", + " assert \"features.joblib\" in tar.getnames()\n", + " assert \"target.joblib\" in tar.getnames()\n", + "\n", + "\n", + "def test_splits_are_transformed(directory):\n", + " train = pd.read_csv(directory / \"train\" / \"train.csv\", header=None)\n", + " validation = pd.read_csv(directory / \"validation\" / \"validation.csv\", header=None)\n", + " test = pd.read_csv(directory / \"test\" / \"test.csv\", header=None)\n", + "\n", + " # After transforming the data, the number of features should be 7:\n", + " # * 3 - island (one-hot encoded)\n", + " # * 1 - culmen_length_mm\n", + " # * 1 - culmen_depth_mm\n", + " # * 1 - flipper_length_mm\n", + " # * 1 - body_mass_g\n", + " number_of_features = 7\n", + "\n", + " # The transformed splits should have an additional column for the target\n", + " # variable.\n", + " assert train.shape[1] == number_of_features + 1\n", + " assert validation.shape[1] == number_of_features + 1\n", + " assert test.shape[1] == number_of_features + 1\n", + "\n", + "\n", + "def test_train_baseline_is_not_transformed(directory):\n", + " baseline = pd.read_csv(directory / \"train-baseline\" / \"train-baseline.csv\", header=None)\n", + "\n", + " island = baseline.iloc[:, 1].unique()\n", + "\n", + " assert \"Biscoe\" in island\n", + " assert \"Torgersen\" in island\n", + " assert \"Dream\" in island\n", + "\n", + "\n", + "def test_test_baseline_is_not_transformed(directory):\n", + " baseline = 
pd.read_csv(directory / \"test-baseline\" / \"test-baseline.csv\", header=None)\n", + "\n", + " island = baseline.iloc[:, 1].unique()\n", + "\n", + " assert \"Biscoe\" in island\n", + " assert \"Torgersen\" in island\n", + " assert \"Dream\" in island\n", + "\n", + "\n", + "def test_train_baseline_includes_header(directory):\n", + " baseline = pd.read_csv(directory / \"train-baseline\" / \"train-baseline.csv\")\n", + " assert baseline.columns[0] == \"species\"\n", + "\n", + "\n", + "def test_test_baseline_does_not_include_header(directory):\n", + " baseline = pd.read_csv(directory / \"test-baseline\" / \"test-baseline.csv\")\n", + " assert baseline.columns[0] != \"species\"" + ] + }, + { + "cell_type": "markdown", + "id": "dbff9c36", + "metadata": {}, + "source": [ + "### Step 3 - Setting up the Processing Step\n", + "\n", + "Let's now define the [ProcessingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.ProcessingStep) that we'll use in the pipeline to run the script that will split and transform the data.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ff061663", + "metadata": {}, + "source": [ + "Several SageMaker Pipeline steps support caching. When a step runs, depending on the configured caching policy, SageMaker will try to reuse the result of a previous successful run of the same step. You can find more information about this topic in [Caching Pipeline Steps](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html). 
Let's define a caching policy that we'll reuse on every step:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 660, + "id": "d88e9ccf", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.workflow.steps import CacheConfig\n", + "\n", + "cache_config = CacheConfig(enable_caching=True, expire_after=\"15d\")" + ] + }, + { + "cell_type": "markdown", + "id": "f3b1d96a", + "metadata": {}, + "source": [ + "We can parameterize a SageMaker Pipeline to make it more flexible. In this case, we'll use a parameter to pass the location of the dataset we want to process. We can execute the pipeline with different datasets by changing the value of this parameter. To read more about these parameters, check [Pipeline Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 661, + "id": "331fe373", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.workflow.parameters import ParameterString\n", + "\n", + "dataset_location = ParameterString(\n", + " name=\"dataset_location\",\n", + " default_value=f\"{S3_LOCATION}/data\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cfb9a589", + "metadata": {}, + "source": [ + "A processor gives the Processing Step information about the hardware and software that SageMaker should use to launch the Processing Job. To run the script we created, we need access to Scikit-Learn, so we can use the [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) processor that comes out of the box with the SageMaker Python SDK. The [Data Processing with Framework Processors](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job-frameworks.html) page discusses other built-in processors you can use. 
The [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) page contains information about the available framework versions for each region.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 662, + "id": "3aa4471a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:Defaulting to only available Python version: py3\n" + ] + } + ], + "source": [ + "from sagemaker.sklearn.processing import SKLearnProcessor\n", + "\n", + "processor = SKLearnProcessor(\n", + " base_job_name=\"split-and-transform-data\",\n", + " framework_version=\"1.2-1\",\n", + " # By default, a new account doesn't have access to `ml.m5.xlarge` instances.\n", + " # If you haven't requested a quota increase yet, you can use an\n", + " # `ml.t3.medium` instance type instead. This will work out of the box, but\n", + " # the Processing Job will take significantly longer than it should have.\n", + " # To get access to `ml.m5.xlarge` instances, you can request a quota\n", + " # increase under the Service Quotas section in your AWS account.\n", + " instance_type=config[\"instance_type\"],\n", + " instance_count=1,\n", + " role=role,\n", + " sagemaker_session=config[\"session\"],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6cf2cc58", + "metadata": {}, + "source": [ + "Let's now define the Processing Step that we'll use in the pipeline. This step requires a list of inputs that we need on the preprocessing script. In this case, the input is the dataset we stored in S3. 
We also have a few outputs that we want SageMaker to capture when the Processing Job finishes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 663, + "id": "cdbd9303", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/svpino/dev/ml.school/.venv/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:297: UserWarning: Running within a PipelineSession, there will be No Wait, No Logs, and No Job being started.\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.steps import ProcessingStep\n", + "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", + "\n", + "\n", + "split_and_transform_data_step = ProcessingStep(\n", + " name=\"split-and-transform-data\",\n", + " step_args=processor.run(\n", + " code=f\"{CODE_FOLDER}/preprocessor.py\",\n", + " inputs=[\n", + " ProcessingInput(\n", + " source=dataset_location, destination=\"/opt/ml/processing/input\"\n", + " ),\n", + " ],\n", + " outputs=[\n", + " ProcessingOutput(\n", + " output_name=\"train\",\n", + " source=\"/opt/ml/processing/train\",\n", + " destination=f\"{S3_LOCATION}/preprocessing/train\",\n", + " ),\n", + " ProcessingOutput(\n", + " output_name=\"validation\",\n", + " source=\"/opt/ml/processing/validation\",\n", + " destination=f\"{S3_LOCATION}/preprocessing/validation\",\n", + " ),\n", + " ProcessingOutput(\n", + " output_name=\"test\",\n", + " source=\"/opt/ml/processing/test\",\n", + " destination=f\"{S3_LOCATION}/preprocessing/test\",\n", + " ),\n", + " ProcessingOutput(\n", + " output_name=\"model\",\n", + " source=\"/opt/ml/processing/model\",\n", + " destination=f\"{S3_LOCATION}/preprocessing/model\",\n", + " ),\n", + " ProcessingOutput(\n", + " output_name=\"train-baseline\",\n", + " source=\"/opt/ml/processing/train-baseline\",\n", + " destination=f\"{S3_LOCATION}/preprocessing/train-baseline\",\n", + " 
),\n", + " ProcessingOutput(\n", + " output_name=\"test-baseline\",\n", + " source=\"/opt/ml/processing/test-baseline\",\n", + " destination=f\"{S3_LOCATION}/preprocessing/test-baseline\",\n", + " ),\n", + " ],\n", + " ),\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fad062cb", + "metadata": {}, + "source": [ + "### Step 4 - Creating the Pipeline\n", + "\n", + "We can now create the SageMaker Pipeline and submit its definition to the SageMaker Pipelines service to create the pipeline if it doesn't exist or update it if it does.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 664, + "id": "e140642a", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session1-pipeline',\n", + " 'ResponseMetadata': {'RequestId': '02b62dd1-6de0-4723-9019-f4f72862ba5c',\n", + " 'HTTPStatusCode': 200,\n", + " 'HTTPHeaders': {'x-amzn-requestid': '02b62dd1-6de0-4723-9019-f4f72862ba5c',\n", + " 'content-type': 'application/x-amz-json-1.1',\n", + " 'content-length': '85',\n", + " 'date': 'Fri, 27 Oct 2023 14:38:36 GMT'},\n", + " 'RetryAttempts': 0}}" + ] + }, + "execution_count": 664, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.pipeline import Pipeline\n", + "from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig\n", + "\n", + "pipeline_definition_config = PipelineDefinitionConfig(use_custom_job_prefix=True)\n", + "\n", + "session1_pipeline = Pipeline(\n", + " name=\"session1-pipeline\",\n", + " parameters=[dataset_location],\n", + " steps=[\n", + " split_and_transform_data_step,\n", + " ],\n", + " pipeline_definition_config=pipeline_definition_config,\n", + " sagemaker_session=config[\"session\"],\n", + ")\n", + "\n", + "session1_pipeline.upsert(role_arn=role)" + ] + }, + { + "cell_type": 
"markdown", + "id": "ff8f99c1", + "metadata": {}, + "source": [ + "We can now start the pipeline:\n" + ] + }, + { + "cell_type": "markdown", + "id": "cc01c152", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 665, + "id": "59d1e634", + "metadata": {}, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "session1_pipeline.start()" + ] + }, + { + "cell_type": "markdown", + "id": "5fa56512-f322-4d8e-ae96-598ba2366784", + "metadata": {}, + "source": [ + "### Assignments\n", + "\n", + "- Assignment 1.1 The SageMaker Pipeline we built supports running a few steps in Local Mode. The goal of this assignment is to run the pipeline on your local environment using Local Mode.\n", + "\n", + "- Assignment 1.2 For this assignment, we want to run the end-to-end pipeline in SageMaker Studio. Ensure you turn off Local Mode before doing so.\n", + "\n", + "- Assignment 1.3 The pipeline uses Random Sampling to split the dataset. Modify the code to use Stratified Sampling instead.\n", + "\n", + "- Assignment 1.4 For this assignment, we want to run a distributed Processing Job across multiple instances to capitalize the `island` column of the dataset. Your dataset will consist of 10 different files stored in S3. Set up a Processing Job using two instances. When specifying the input to the Processing Job, you must set the `ProcessingInput.s3_data_distribution_type` attribute to `ShardedByS3Key`. By doing this, SageMaker will run a cluster with two instances simultaneously, each with access to half the files.\n", + "\n", + "- Assignment 1.5 You can use [Amazon SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/) to complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. For this assignment, load the Data Wrangler interface and use it to build the same transformations we implemented using the Scikit-Learn Pipeline. 
If you have questions, open the [Penguins Data Flow](penguins.flow) included in this repository.\n" + ] + }, + { + "cell_type": "markdown", + "id": "63c190c5-52b5-4ccc-8d42-847a694b8e66", + "metadata": {}, + "source": [ + "## Session 2 - Building Models And The Training Pipeline\n", + "\n", + "This session extends the [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) we built in the previous session with a step to train a model. We'll explore the [Training Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training) and the [Tuning Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-tuning).\n", + "\n", + " \"Training\"\n", + "\n", + "We'll introduce [Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) and use them during training. For more information about this topic, check the [SageMaker Experiments' SDK documentation](https://sagemaker.readthedocs.io/en/v2.174.0/experiments/sagemaker.experiments.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "c8608092-7aab-4fd2-aa99-47c2db27bdb7", + "metadata": {}, + "source": [ + "### Step 1 - Creating the Training Script\n", + "\n", + "The following script is responsible for training a neural network using the train data, validating the model, and saving it so we can later use it:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 666, + "id": "d92b121d-dcb9-43e8-9ee3-3ececb583e7e", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/train.py\n" + ] + } + ], + "source": [ + "%%writefile {CODE_FOLDER}/train.py\n", + "#| label: training-script\n", + "#| echo: true\n", + "#| output: false\n", + "#| filename: train.py\n", + "#| code-line-numbers: true\n", + "\n", + "import os\n", + "import argparse\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + 
"import tensorflow as tf\n", + "\n", + "from pathlib import Path\n", + "from sklearn.metrics import accuracy_score\n", + "\n", + "from tensorflow.keras.models import Sequential\n", + "from tensorflow.keras.layers import Dense\n", + "from tensorflow.keras.optimizers import SGD\n", + "\n", + "\n", + "def train(model_directory, train_path, validation_path, epochs=50, batch_size=32):\n", + " # The preprocessing script saves the splits without a header row,\n", + " # so we read them with header=None to avoid losing the first sample.\n", + " X_train = pd.read_csv(Path(train_path) / \"train.csv\", header=None)\n", + " y_train = X_train[X_train.columns[-1]]\n", + " X_train.drop(X_train.columns[-1], axis=1, inplace=True)\n", + " \n", + " X_validation = pd.read_csv(Path(validation_path) / \"validation.csv\", header=None)\n", + " y_validation = X_validation[X_validation.columns[-1]]\n", + " X_validation.drop(X_validation.columns[-1], axis=1, inplace=True)\n", + " \n", + " model = Sequential([\n", + " Dense(10, input_shape=(X_train.shape[1],), activation=\"relu\"),\n", + " Dense(8, activation=\"relu\"),\n", + " Dense(3, activation=\"softmax\"),\n", + " ])\n", + " \n", + " model.compile(\n", + " optimizer=SGD(learning_rate=0.01),\n", + " loss=\"sparse_categorical_crossentropy\",\n", + " metrics=[\"accuracy\"]\n", + " )\n", + "\n", + " model.fit(\n", + " X_train, \n", + " y_train, \n", + " validation_data=(X_validation, y_validation),\n", + " epochs=epochs, \n", + " batch_size=batch_size,\n", + " verbose=2,\n", + " )\n", + "\n", + " predictions = np.argmax(model.predict(X_validation), axis=-1)\n", + " print(f\"Validation accuracy: {accuracy_score(y_validation, predictions)}\")\n", + " \n", + " model_filepath = Path(model_directory) / \"001\"\n", + " model.save(model_filepath) \n", + " \n", + "\n", + "if __name__ == \"__main__\":\n", + " # Any hyperparameters provided by the training job are passed to \n", + " # the entry point as script arguments. 
\n", + " parser = argparse.ArgumentParser()\n", + " parser.add_argument(\"--epochs\", type=int, default=50)\n", + " parser.add_argument(\"--batch_size\", type=int, default=32)\n", + " args, _ = parser.parse_known_args()\n", + " \n", + "\n", + " train(\n", + " # This is the location where we need to save our model. SageMaker will\n", + " # create a model.tar.gz file with anything inside this directory when\n", + " # the training script finishes.\n", + " model_directory=os.environ[\"SM_MODEL_DIR\"],\n", + "\n", + " # SageMaker creates one channel for each one of the inputs to the\n", + " # Training Step.\n", + " train_path=os.environ[\"SM_CHANNEL_TRAIN\"],\n", + " validation_path=os.environ[\"SM_CHANNEL_VALIDATION\"],\n", + "\n", + " epochs=args.epochs,\n", + " batch_size=args.batch_size,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "50f0a4fa-ce70-4882-b9f5-8253df03d890", + "metadata": {}, + "source": [ + "Let's test the script to ensure everything is working as expected:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 667, + "id": "14ea27ce-c453-4cb0-b309-dbecd732957e", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8/8 - 0s - loss: 1.0173 - accuracy: 0.4728 - val_loss: 0.9260 - val_accuracy: 0.6078 - 230ms/epoch - 29ms/step\n", + "2/2 [==============================] - 0s 1ms/step\n", + "Validation accuracy: 0.6078431372549019\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmpv4apdp15/model/001/assets\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"\u001b[32m.\u001b[0m\n", + "\u001b[32m\u001b[32m\u001b[1m1 passed\u001b[0m\u001b[32m in 0.53s\u001b[0m\u001b[0m\n" + ] + } + ], + "source": [ + "%%ipytest -s\n", + "\n", + "#| code-fold: true\n", + "#| output: false\n", + "\n", + "import os\n", + "import shutil\n", + "import tarfile\n", + "import pytest\n", + "import tempfile\n", + "import joblib\n", + "\n", + "from preprocessor import preprocess\n", + "from train import train\n", + "\n", + "\n", + "@pytest.fixture(scope=\"function\", autouse=False)\n", + "def directory():\n", + " directory = tempfile.mkdtemp()\n", + " input_directory = Path(directory) / \"input\"\n", + " input_directory.mkdir(parents=True, exist_ok=True)\n", + " shutil.copy2(DATA_FILEPATH, input_directory / \"data.csv\")\n", + " \n", + " directory = Path(directory)\n", + " \n", + " preprocess(base_directory=directory)\n", + " train(\n", + " model_directory=directory / \"model\",\n", + " train_path=directory / \"train\", \n", + " validation_path=directory / \"validation\",\n", + " epochs=1\n", + " )\n", + " \n", + " yield directory\n", + " \n", + " shutil.rmtree(directory)\n", + "\n", + "\n", + "def test_train_saves_a_folder_with_model_assets(directory):\n", + " output = os.listdir(directory / \"model\")\n", + " assert \"001\" in output\n", + " \n", + " assets = os.listdir(directory / \"model\" / \"001\")\n", + " assert \"saved_model.pb\" in assets" + ] + }, + { + "cell_type": "markdown", + "id": "27cff4c1-6510-4d99-8ae1-cb14927b87c7", + "metadata": {}, + "source": [ + "### Step 2 - Setting up the Training Step\n", + "\n", + "We can now create a [Training Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training) that we can add to the pipeline. This Training Step will create a SageMaker Training Job in the background, run the training script, and upload the output to S3. 
Check the [TrainingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TrainingStep) SageMaker's SDK documentation for more information.\n", + "\n", + "SageMaker uses the concept of an [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to handle end-to-end training and deployment tasks. For this example, we will use the built-in [TensorFlow Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) to run the training script we wrote before. The [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) page contains information about the available framework versions for each region. Here, you can also check the available SageMaker [Deep Learning Container images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).\n", + "\n", + "Notice the list of hyperparameters defined below. SageMaker will pass these hyperparameters as arguments to the entry point of the training script.\n", + "\n", + "We are going to use [SageMaker Experiments](https://sagemaker.readthedocs.io/en/v2.174.0/experiments/sagemaker.experiments.html) to log information from the Training Job. For more information, check [Manage Machine Learning with Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html). 
The list of metric definitions will tell SageMaker which metrics to track and how to parse them from the Training Job logs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 668, + "id": "90fe82ae-6a2c-4461-bc83-bb52d8871e3b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from sagemaker.tensorflow import TensorFlow\n", + "\n", + "estimator = TensorFlow(\n", + " base_job_name=\"training\",\n", + " entry_point=f\"{CODE_FOLDER}/train.py\",\n", + " # SageMaker will pass these hyperparameters as arguments\n", + " # to the entry point of the training script.\n", + " hyperparameters={\n", + " \"epochs\": 50,\n", + " \"batch_size\": 32,\n", + " },\n", + " # SageMaker will track these metrics as part of the experiment\n", + " # associated with this pipeline. The metric definitions tell\n", + " # SageMaker how to parse the values from the Training Job logs.\n", + " metric_definitions=[\n", + " {\"Name\": \"loss\", \"Regex\": \"loss: ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"accuracy\", \"Regex\": \"accuracy: ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"val_loss\", \"Regex\": \"val_loss: ([0-9\\\\.]+)\"},\n", + " {\"Name\": \"val_accuracy\", \"Regex\": \"val_accuracy: ([0-9\\\\.]+)\"},\n", + " ],\n", + " image_uri=config[\"image\"],\n", + " framework_version=config[\"framework_version\"],\n", + " py_version=config[\"py_version\"],\n", + " instance_type=config[\"instance_type\"],\n", + " instance_count=1,\n", + " disable_profiler=True,\n", + " sagemaker_session=config[\"session\"],\n", + " role=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "545d2b43-3bb5-4fe9-b3e4-cb8eb55c8a21", + "metadata": {}, + "source": [ + "We can now create a [Training Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training). This Training Step will create a SageMaker Training Job in the background, run the training script, and upload the output to S3. 
Check the [TrainingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TrainingStep) SageMaker's SDK documentation for more information.\n", + "\n", + "This step will receive the train and validation split from the previous step as inputs.\n", + "\n", + "Here, we are using two input channels, `train` and `validation`. SageMaker will automatically create an environment variable corresponding to each of these channels following the format `SM_CHANNEL_[channel_name]`:\n", + "\n", + "- `SM_CHANNEL_TRAIN`: This environment variable will contain the path to the data in the `train` channel\n", + "- `SM_CHANNEL_VALIDATION`: This environment variable will contain the path to the data in the `validation` channel\n" + ] + }, + { + "cell_type": "code", + "execution_count": 738, + "id": "99e4850c-83d6-4f4e-a813-d5a3f4bb7486", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.steps import TrainingStep\n", + "from sagemaker.inputs import TrainingInput\n", + "\n", + "train_model_step = TrainingStep(\n", + " name=\"train-model\",\n", + " step_args=estimator.fit(\n", + " inputs={\n", + " \"train\": TrainingInput(\n", + " s3_data=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"train\"\n", + " ].S3Output.S3Uri,\n", + " content_type=\"text/csv\",\n", + " ),\n", + " \"validation\": TrainingInput(\n", + " s3_data=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"validation\"\n", + " ].S3Output.S3Uri,\n", + " content_type=\"text/csv\",\n", + " ),\n", + " }\n", + " ),\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5814e258-c633-4e9a-85c5-6ed0f168b503", + "metadata": {}, + "source": [ + "### Step 3 - Setting up a Tuning Step\n", + "\n", + "Let's now create a [Tuning 
Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-tuning). This Tuning Step will create a SageMaker Hyperparameter Tuning Job in the background and use the training script to train different model variants and choose the best one. Check the [TuningStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep) SageMaker's SDK documentation for more information.\n" + ] + }, + { + "cell_type": "markdown", + "id": "90eb5075", + "metadata": {}, + "source": [ + "Since we could use either the Training Step or the Tuning Step to create the model, we'll define this constant to indicate which approach we want to run.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 670, + "id": "f367d0e3", + "metadata": {}, + "outputs": [], + "source": [ + "USE_TUNING_STEP = False" + ] + }, + { + "cell_type": "markdown", + "id": "b045af84", + "metadata": {}, + "source": [ + "The Tuning Step requires a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) reference to configure the Hyperparameter Tuning Job.\n", + "\n", + "Here is the configuration that we'll use to find the best model:\n", + "\n", + "1. `objective_metric_name`: This is the name of the metric the tuner will use to determine the best model.\n", + "2. `objective_type`: This is the objective of the tuner. Should it \"Minimize\" the metric or \"Maximize\" it? In this example, since we are using the validation accuracy of the model, we want the objective to be \"Maximize.\" If we were using the loss of the model, we would set the objective to \"Minimize.\"\n", + "3. `metric_definitions`: Defines how the tuner will determine the metric's value by looking at the output logs of the training process.\n", + "\n", + "The tuner expects the list of the hyperparameters you want to explore. 
You can use subclasses of the [Parameter](https://sagemaker.readthedocs.io/en/stable/api/training/parameter.html#sagemaker.parameter.ParameterRange) class to specify different types of hyperparameters. This example explores different values for the `epochs` hyperparameter.\n", + "\n", + "Finally, you can control the number of jobs and how many of them will run in parallel using the following two arguments:\n", + "\n", + "- `max_jobs`: Defines the maximum total number of training jobs to start for the hyperparameter tuning job.\n", + "- `max_parallel_jobs`: Defines the maximum number of parallel training jobs to start.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 671, + "id": "c8c82750", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tuner import HyperparameterTuner\n", + "from sagemaker.parameter import IntegerParameter\n", + "\n", + "tuner = HyperparameterTuner(\n", + " estimator,\n", + " objective_metric_name=\"val_accuracy\",\n", + " objective_type=\"Maximize\",\n", + " hyperparameter_ranges={\n", + " \"epochs\": IntegerParameter(10, 50),\n", + " },\n", + " metric_definitions=[{\"Name\": \"val_accuracy\", \"Regex\": \"val_accuracy: ([0-9\\\\.]+)\"}],\n", + " max_jobs=3,\n", + " max_parallel_jobs=3,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "28c2abc2", + "metadata": {}, + "source": [ + "We can now create the Tuning Step using the tuner we configured before:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 672, + "id": "038ff2e5-ed28-445b-bc03-4e996ec2286f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from sagemaker.workflow.steps import TuningStep\n", + "\n", + "tune_model_step = TuningStep(\n", + " name=\"tune-model\",\n", + " step_args=tuner.fit(\n", + " inputs={\n", + " \"train\": TrainingInput(\n", + " s3_data=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"train\"\n", + " ].S3Output.S3Uri,\n", + " content_type=\"text/csv\",\n", + " ),\n", 
+ " \"validation\": TrainingInput(\n", + " s3_data=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"validation\"\n", + " ].S3Output.S3Uri,\n", + " content_type=\"text/csv\",\n", + " ),\n", + " },\n", + " ),\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "4babe38c-1682-42d2-8442-101d17aa89b5", + "metadata": {}, + "source": [ + "### Step 4 - Creating the Pipeline\n", + "\n", + "Let's define the SageMaker Pipeline and submit its definition to the SageMaker Pipelines service to create the pipeline if it doesn't exist or update it if it does.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 673, + "id": "9799ab39-fcae-41f4-a68b-85ab71b3ba9a", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n" + ] + }, + { + "data": { + "text/plain": [ + "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session2-pipeline',\n", + " 'ResponseMetadata': {'RequestId': 'e99208aa-4074-41aa-a12b-90af6da62e3f',\n", + " 'HTTPStatusCode': 200,\n", + " 'HTTPHeaders': {'x-amzn-requestid': 'e99208aa-4074-41aa-a12b-90af6da62e3f',\n", + " 'content-type': 'application/x-amz-json-1.1',\n", + " 'content-length': '85',\n", + " 'date': 'Fri, 27 Oct 2023 14:38:38 GMT'},\n", + " 'RetryAttempts': 0}}" + ] + }, + "execution_count": 673, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# | code: true\n", 
+ "# | output: false\n", + "\n", + "session2_pipeline = Pipeline(\n", + " name=\"session2-pipeline\",\n", + " parameters=[dataset_location],\n", + " steps=[\n", + " split_and_transform_data_step,\n", + " tune_model_step if USE_TUNING_STEP else train_model_step,\n", + " ],\n", + " pipeline_definition_config=pipeline_definition_config,\n", + " sagemaker_session=config[\"session\"],\n", + ")\n", + "\n", + "session2_pipeline.upsert(role_arn=role)" + ] + }, + { + "cell_type": "markdown", + "id": "50810a3e", + "metadata": {}, + "source": [ + "We can now start the pipeline:\n" + ] + }, + { + "cell_type": "markdown", + "id": "6bcb9d05", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 674, + "id": "274a9b1e", + "metadata": {}, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "session2_pipeline.start()" + ] + }, + { + "cell_type": "markdown", + "id": "c044516e-f56c-4c91-8d94-6ef109eb7325", + "metadata": {}, + "source": [ + "### Assignments\n", + "\n", + "- Assignment 2.1 The training script trains the model using a hard-coded learning rate value. Modify the code to accept the learning rate as a parameter we can control from outside the script.\n", + "\n", + "- Assignment 2.2 We currently define the number of epochs to train the model as a constant that we pass to the Estimator using the list of hyperparameters. Replace this constant with a new Pipeline Parameter named `training_epochs`. You'll need to specify this new parameter when creating the Pipeline.\n", + "\n", + "- Assignment 2.3 The current tuning process aims to find the model with the highest validation accuracy. Modify the code to focus on the model with the lowest training loss.\n", + "\n", + "- Assignment 2.4 We used an instance of [`SKLearnProcessor`](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) to run the script that transforms and splits the data, but there's no way to add additional dependencies to the processing container. Modify the code to use an instance of [`FrameworkProcessor`](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor) instead. This class will allow you to specify a directory containing a `requirements.txt` file containing a list of dependencies. SageMaker will install these libraries in the processing container before triggering the processing job.\n", + "\n", + "- Assignment 2.5 We want to execute the pipeline whenever the dataset changes. 
We can accomplish this by using [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html). Configure an event to automatically start the pipeline when a new file is added to the S3 bucket where we store our dataset. Check [Amazon EventBridge Integration](https://docs.aws.amazon.com/sagemaker/latest/dg/pipeline-eventbridge.html) for an implementation tutorial.\n" + ] + }, + { + "cell_type": "markdown", + "id": "21d40fe8-ba74-4c12-9555-d8ea33d1c8b4", + "metadata": {}, + "source": [ + "## Session 3 - Evaluating and Versioning Models\n", + "\n", + "This session extends the [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) with a step to evaluate the model and register it if it reaches a predefined accuracy threshold.\n", + "\n", + " \"Training\"\n", + "\n", + "We'll use a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) to execute an evaluation script. We'll use a [Condition Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-condition) to determine whether the model's accuracy is above a threshold, and a [Model Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-model) to register the model in the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html).\n" + ] + }, + { + "cell_type": "markdown", + "id": "9eaa9691-f49f-48af-b272-3d4d17563b01", + "metadata": { + "tags": [] + }, + "source": [ + "### Step 1 - Creating the Evaluation Script\n", + "\n", + "Let's create the evaluation script. The Processing Step will spin up a Processing Job and run this script inside a container. This script is responsible for loading the model we created and evaluating it on the test set. 
Before finishing, this script will generate an evaluation report of the model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 675, + "id": "3ee3ab26-afa5-4ceb-9f7a-005d5fdea646", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/evaluation.py\n" + ] + } + ], + "source": [ + "%%writefile {CODE_FOLDER}/evaluation.py\n", + "#| label: evaluation-script\n", + "#| echo: true\n", + "#| output: false\n", + "#| filename: evaluation.py\n", + "#| code-line-numbers: true\n", + "\n", + "import os\n", + "import json\n", + "import tarfile\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from pathlib import Path\n", + "from tensorflow import keras\n", + "from sklearn.metrics import accuracy_score\n", + "\n", + "\n", + "MODEL_PATH = \"/opt/ml/processing/model/\"\n", + "TEST_PATH = \"/opt/ml/processing/test/\"\n", + "OUTPUT_PATH = \"/opt/ml/processing/evaluation/\"\n", + "\n", + "\n", + "def evaluate(model_path, test_path, output_path):\n", + " # The first step is to extract the model package so we can load \n", + " # it in memory.\n", + " with tarfile.open(Path(model_path) / \"model.tar.gz\") as tar:\n", + " tar.extractall(path=Path(model_path))\n", + " \n", + " model = keras.models.load_model(Path(model_path) / \"001\")\n", + " \n", + " X_test = pd.read_csv(Path(test_path) / \"test.csv\")\n", + " y_test = X_test[X_test.columns[-1]]\n", + " X_test.drop(X_test.columns[-1], axis=1, inplace=True)\n", + " \n", + " predictions = np.argmax(model.predict(X_test), axis=-1)\n", + " accuracy = accuracy_score(y_test, predictions)\n", + " print(f\"Test accuracy: {accuracy}\")\n", + "\n", + " # Let's create an evaluation report using the model accuracy.\n", + " evaluation_report = {\n", + " \"metrics\": {\n", + " \"accuracy\": {\n", + " \"value\": accuracy\n", + " },\n", + " },\n", + " }\n", + " \n", + " Path(output_path).mkdir(parents=True, exist_ok=True)\n", + " with 
open(Path(output_path) / \"evaluation.json\", \"w\") as f:\n", + " f.write(json.dumps(evaluation_report))\n", + " \n", + " \n", + "if __name__ == \"__main__\":\n", + " evaluate(\n", + " model_path=MODEL_PATH, \n", + " test_path=TEST_PATH,\n", + " output_path=OUTPUT_PATH\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "9dcc79a0-adfd-4ce9-8580-5cd228c3c2d9", + "metadata": {}, + "source": [ + "Let's test the script to ensure everything is working as expected:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 676, + "id": "9a2540d8-278a-4953-bc54-0469d154427d", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8/8 - 0s - loss: 1.1330 - accuracy: 0.4142 - val_loss: 1.1001 - val_accuracy: 0.5098 - 236ms/epoch - 30ms/step\n", + "2/2 [==============================] - 0s 1ms/step\n", + "Validation accuracy: 0.5098039215686274\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmpprbc5h18/model/001/assets\n", + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.RestoredOptimizer` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.RestoredOptimizer`.\n", + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2/2 [==============================] - 0s 1ms/step\n", + "Test accuracy: 0.4117647058823529\n", + 
"\u001b[32m.\u001b[0m" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8/8 - 0s - loss: 1.0329 - accuracy: 0.4644 - val_loss: 0.9795 - val_accuracy: 0.5882 - 235ms/epoch - 29ms/step\n", + "2/2 [==============================] - 0s 1ms/step\n", + "Validation accuracy: 0.5882352941176471\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmph0nj0wfb/model/001/assets\n", + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.RestoredOptimizer` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.RestoredOptimizer`.\n", + "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2/2 [==============================] - 0s 2ms/step\n", + "Test accuracy: 0.5686274509803921\n", + "\u001b[32m.\u001b[0m\n", + "\u001b[32m\u001b[32m\u001b[1m2 passed\u001b[0m\u001b[32m in 1.35s\u001b[0m\u001b[0m\n" + ] + } + ], + "source": [ + "%%ipytest -s\n", + "\n", + "#| code-fold: true\n", + "#| output: false\n", + "\n", + "import os\n", + "import shutil\n", + "import tarfile\n", + "import pytest\n", + "import tempfile\n", + "import joblib\n", + "\n", + "from preprocessor import preprocess\n", + "from train import train\n", + "from evaluation import evaluate\n", + "\n", + "\n", + "@pytest.fixture(scope=\"function\", autouse=False)\n", + "def directory():\n", + " directory = tempfile.mkdtemp()\n", + " 
input_directory = Path(directory) / \"input\"\n", + " input_directory.mkdir(parents=True, exist_ok=True)\n", + " shutil.copy2(DATA_FILEPATH, input_directory / \"data.csv\")\n", + " \n", + " directory = Path(directory)\n", + " \n", + " preprocess(base_directory=directory)\n", + " \n", + " train(\n", + " model_directory=directory / \"model\",\n", + " train_path=directory / \"train\", \n", + " validation_path=directory / \"validation\",\n", + " epochs=1\n", + " )\n", + " \n", + " # After training a model, we need to prepare a package just like\n", + " # SageMaker would. This package is what the evaluation script is\n", + " # expecting as an input.\n", + " with tarfile.open(directory / \"model.tar.gz\", \"w:gz\") as tar:\n", + " tar.add(directory / \"model\" / \"001\", arcname=\"001\")\n", + " \n", + " evaluate(\n", + " model_path=directory, \n", + " test_path=directory / \"test\",\n", + " output_path=directory / \"evaluation\",\n", + " )\n", + "\n", + " yield directory / \"evaluation\"\n", + " \n", + " shutil.rmtree(directory)\n", + "\n", + "\n", + "def test_evaluate_generates_evaluation_report(directory):\n", + " output = os.listdir(directory)\n", + " assert \"evaluation.json\" in output\n", + "\n", + "\n", + "def test_evaluation_report_contains_accuracy(directory):\n", + " with open(directory / \"evaluation.json\", 'r') as file:\n", + " report = json.load(file)\n", + " \n", + " assert \"metrics\" in report\n", + " assert \"accuracy\" in report[\"metrics\"]\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "bec1109a-6c26-4464-8338-94960729d212", + "metadata": {}, + "source": [ + "### Step 2 - Setting up the Evaluation Step\n", + "\n", + "To run the evaluation script, we will use a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) configured with [TensorFlowProcessor](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job-frameworks-tensorflow.html) because the script needs access to 
TensorFlow.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 677, + "id": "2fdff07f", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.tensorflow import TensorFlowProcessor\n", + "\n", + "tensorflow_processor = TensorFlowProcessor(\n", + " base_job_name=\"evaluation-processor\",\n", + " image_uri=config[\"image\"],\n", + " framework_version=config[\"framework_version\"],\n", + " py_version=config[\"py_version\"],\n", + " instance_type=config[\"instance_type\"],\n", + " instance_count=1,\n", + " role=role,\n", + " sagemaker_session=config[\"session\"],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "419e354a", + "metadata": {}, + "source": [ + "One of the inputs to the Evaluation Step will be the model assets. We can use the `USE_TUNING_STEP` flag to determine whether we created the model using a Training Step or a Tuning Step. 
In case we are using the Tuning Step, we can use the [TuningStep.get_top_model_s3_uri()](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep.get_top_model_s3_uri) function to get the model assets from the top performing training job of the Hyperparameter Tuning Job.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 678, + "id": "4f19e15b", + "metadata": {}, + "outputs": [], + "source": [ + "model_assets = train_model_step.properties.ModelArtifacts.S3ModelArtifacts\n", + "\n", + "if USE_TUNING_STEP:\n", + " model_assets = tune_model_step.get_top_model_s3_uri(\n", + " top_k=0, s3_bucket=config[\"session\"].default_bucket()\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "08dae772", + "metadata": {}, + "source": [ + "SageMaker supports mapping outputs to property files. This is useful when accessing a specific property from the pipeline. In our case, we want to access the accuracy of the model in the Condition Step, so we'll map the evaluation report to a property file. 
Check [How to Build and Manage Property Files](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-propertyfile.html) for more information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 679, + "id": "1f27b2ef", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.workflow.properties import PropertyFile\n", + "\n", + "evaluation_report = PropertyFile(\n", + " name=\"evaluation-report\", output_name=\"evaluation\", path=\"evaluation.json\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "4a4dbc0e", + "metadata": {}, + "source": [ + "We are now ready to define the [ProcessingStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.ProcessingStep) that will run the evaluation script:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 680, + "id": "48139a07-5c8e-4bc6-b666-bf9531f7f520", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/svpino/dev/ml.school/.venv/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:297: UserWarning: Running within a PipelineSession, there will be No Wait, No Logs, and No Job being started.\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "evaluate_model_step = ProcessingStep(\n", + " name=\"evaluate-model\",\n", + " step_args=tensorflow_processor.run(\n", + " inputs=[\n", + " # The first input is the test split that we generated on\n", + " # the first step of the pipeline when we split and\n", + " # transformed the data.\n", + " ProcessingInput(\n", + " source=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"test\"\n", + " ].S3Output.S3Uri,\n", + " destination=\"/opt/ml/processing/test\",\n", + " ),\n", + " # The second input is the model that we generated on\n", + " # the Training or Tuning Step.\n", + " ProcessingInput(\n", + " 
source=model_assets,\n", + " destination=\"/opt/ml/processing/model\",\n", + " ),\n", + " ],\n", + " outputs=[\n", + " # The output is the evaluation report that we generated\n", + " # in the evaluation script.\n", + " ProcessingOutput(\n", + " output_name=\"evaluation\", source=\"/opt/ml/processing/evaluation\"\n", + " ),\n", + " ],\n", + " code=f\"{CODE_FOLDER}/evaluation.py\",\n", + " ),\n", + " property_files=[evaluation_report],\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "328a2bf2", + "metadata": {}, + "source": [ + "### Step 3 - Registering the Model\n", + "\n", + "Let's now create a new version of the model and register it in the Model Registry. Check [Register a Model Version](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-version.html) for more information about model registration.\n", + "\n", + "First, let's define the name of the group where we'll register the model:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 681, + "id": "bb70f907", + "metadata": {}, + "outputs": [], + "source": [ + "MODEL_PACKAGE_GROUP = \"penguins\"" + ] + }, + { + "cell_type": "markdown", + "id": "40bcad3b", + "metadata": {}, + "source": [ + "Let's now create the model that we'll register in the Model Registry. 
The model we trained uses TensorFlow, so we can use the built-in [TensorFlowModel](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-serving-model) class to create an instance of the model:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 682, + "id": "4ca4cb61", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.tensorflow.model import TensorFlowModel\n", + "\n", + "tensorflow_model = TensorFlowModel(\n", + " model_data=model_assets,\n", + " image_uri=config[\"image\"],\n", + " framework_version=config[\"framework_version\"],\n", + " sagemaker_session=config[\"session\"],\n", + " role=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "99d6fd00", + "metadata": {}, + "source": [ + "When we register a model in the Model Registry, we can attach relevant metadata to it. We'll use the evaluation report we generated during the Evaluation Step to populate the [metrics](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_metrics.ModelMetrics) of this model:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 683, + "id": "8c05a7e1", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.model_metrics import ModelMetrics, MetricsSource\n", + "from sagemaker.workflow.functions import Join\n", + "\n", + "model_metrics = ModelMetrics(\n", + " model_statistics=MetricsSource(\n", + " s3_uri=Join(\n", + " on=\"/\",\n", + " values=[\n", + " evaluate_model_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"evaluation\"\n", + " ].S3Output.S3Uri,\n", + " \"evaluation.json\",\n", + " ],\n", + " ),\n", + " content_type=\"application/json\",\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6a51e61d", + "metadata": {}, + "source": [ + "We can use a [Model Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-model) to register the model. 
Check the [ModelStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.model_step.ModelStep) SageMaker's SDK documentation for more information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 684, + "id": "c9773a4a", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.tensorflow.model:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.model_step import ModelStep\n", + "\n", + "register_model_step = ModelStep(\n", + " name=\"register-model\",\n", + " step_args=tensorflow_model.register(\n", + " model_package_group_name=MODEL_PACKAGE_GROUP,\n", + " approval_status=\"Approved\",\n", + " model_metrics=model_metrics,\n", + " content_types=[\"text/csv\"],\n", + " response_types=[\"text/csv\"],\n", + " # These are the suggested instance types to use when\n", + " # deploying the model or using it as part of a batch\n", + " # transform job.\n", + " inference_instances=[\"ml.m5.xlarge\"],\n", + " transform_instances=[\"ml.m5.xlarge\"],\n", + " domain=\"MACHINE_LEARNING\",\n", + " task=\"CLASSIFICATION\",\n", + " framework=\"TENSORFLOW\",\n", + " framework_version=config[\"framework_version\"],\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "52c110f7-fe72-4db8-9d06-cfb9a0f2bfbd", + "metadata": {}, + "source": [ + "### Step 4 - Setting up a Condition Step\n", + "\n", + "We only want to register a new model if its accuracy reaches a predefined threshold. 
We can use a [Condition Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-condition) together with the evaluation report we generated to accomplish this.\n" + ] + }, + { + "cell_type": "markdown", + "id": "b5a51f95", + "metadata": {}, + "source": [ + "Let's define a new [Pipeline Parameter](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html) to specify the minimum accuracy that the model should reach for it to be registered.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 685, + "id": "745486b5", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.workflow.parameters import ParameterFloat\n", + "\n", + "accuracy_threshold = ParameterFloat(name=\"accuracy_threshold\", default_value=0.70)" + ] + }, + { + "cell_type": "markdown", + "id": "2c959c94", + "metadata": {}, + "source": [ + "If the model's accuracy is not greater than or equal to our threshold, we will send the pipeline to a [Fail Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-fail) with the appropriate error message. 
Check the [FailStep](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.fail_step.FailStep) SageMaker's SDK documentation for more information.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 686, + "id": "c4431bbf", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.workflow.fail_step import FailStep\n", + "\n", + "fail_step = FailStep(\n", + " name=\"fail\",\n", + " error_message=Join(\n", + " on=\" \",\n", + " values=[\n", + " \"Execution failed because the model's accuracy was lower than\",\n", + " accuracy_threshold,\n", + " ],\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b47764f9", + "metadata": {}, + "source": [ + "We can use a [ConditionGreaterThanOrEqualTo](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.conditions.ConditionGreaterThanOrEqualTo) condition to compare the model's accuracy with the threshold. 
Look at the [Conditions](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#conditions) section in the documentation for more information about the types of supported conditions.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 687, + "id": "bebeecab", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.workflow.functions import JsonGet\n", + "from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo\n", + "\n", + "condition = ConditionGreaterThanOrEqualTo(\n", + " left=JsonGet(\n", + " step_name=evaluate_model_step.name,\n", + " property_file=evaluation_report,\n", + " json_path=\"metrics.accuracy.value\",\n", + " ),\n", + " right=accuracy_threshold,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "1b0ce4b1", + "metadata": {}, + "source": [ + "Let's now define the Condition Step:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 688, + "id": "36e2a2b1-6711-4266-95d8-d2aebd52e199", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from sagemaker.workflow.condition_step import ConditionStep\n", + "\n", + "condition_step = ConditionStep(\n", + " name=\"check-model-accuracy\",\n", + " conditions=[condition],\n", + " if_steps=[register_model_step] if not LOCAL_MODE else [],\n", + " else_steps=[fail_step],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2309b8fa-f03e-4959-853f-dc2416f82bdd", + "metadata": {}, + "source": [ + "### Step 5 - Creating the Pipeline\n", + "\n", + "We can now define the SageMaker Pipeline and submit its definition to the SageMaker Pipelines service to create the pipeline if it doesn't exist or update it if it does.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 689, + "id": "f70bcd33-b499-4e2b-953e-94d1ed96c10a", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based 
on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n", + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session3-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", + "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session3-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n", + "WARNING:sagemaker.workflow._utils:Popping out 'CertifyForMarketplace' from the pipeline definition since it will be overridden in pipeline execution time.\n", + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n", + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session3-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", + "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session3-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n" + ] + }, + { + "data": { + "text/plain": [ + "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session3-pipeline',\n", + " 'ResponseMetadata': {'RequestId': 'be91a772-a26a-4c1f-a98a-424951e6889a',\n", + " 'HTTPStatusCode': 200,\n", + " 'HTTPHeaders': {'x-amzn-requestid': 'be91a772-a26a-4c1f-a98a-424951e6889a',\n", + " 'content-type': 'application/x-amz-json-1.1',\n", + " 'content-length': '85',\n", + " 'date': 'Fri, 27 Oct 2023 14:38:43 GMT'},\n", + " 'RetryAttempts': 0}}" + ] + }, + "execution_count": 689, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "session3_pipeline = Pipeline(\n", + " 
name=\"session3-pipeline\",\n", + " parameters=[dataset_location, accuracy_threshold],\n", + " steps=[\n", + " split_and_transform_data_step,\n", + " tune_model_step if USE_TUNING_STEP else train_model_step,\n", + " evaluate_model_step,\n", + " condition_step,\n", + " ],\n", + " pipeline_definition_config=pipeline_definition_config,\n", + " sagemaker_session=config[\"session\"],\n", + ")\n", + "\n", + "session3_pipeline.upsert(role_arn=role)" + ] + }, + { + "cell_type": "markdown", + "id": "1b1f656e", + "metadata": {}, + "source": [ + "We can now start the pipeline:" + ] + }, + { + "cell_type": "markdown", + "id": "36144169", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 690, + "id": "f3b4126e", + "metadata": {}, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "session3_pipeline.start()" + ] + }, + { + "cell_type": "markdown", + "id": "9418693c-ccd5-42b6-8ec4-04bb70fe213c", + "metadata": {}, + "source": [ + "### Assignments\n", + "\n", + "- Assignment 3.1 The evaluation script computes the accuracy of the model and exports it as part of the evaluation report. Extend the evaluation report by adding the precision and the recall of the model on each one of the classes.\n", + "\n", + "- Assignment 3.2 Extend the evaluation script to test the model on each island separately. The evaluation report should contain the accuracy of the model on each island and the overall accuracy.\n", + "\n", + "- Assignment 3.3 The Condition Step uses a hard-coded threshold value to determine if the model's accuracy is good enough to proceed. Modify the code so the pipeline uses the accuracy of the latest registered model version as the threshold. We want to register a new model version only if its performance is better than the previous version we registered.\n", + "\n", + "- Assignment 3.4 The current pipeline uses either a Training Step or a Tuning Step to build a model. Modify the pipeline to use both steps at the same time. The evaluation script should evaluate the model coming from the Training Step and the best model coming from the Tuning Step and output the accuracy and location in S3 of the best model. You should modify the code to register the model assets specified in the evaluation report.\n", + "\n", + "- Assignment 3.5 Pipeline steps can encounter exceptions. In some cases, retrying can resolve these issues. For this assignment, configure the Processing Step so it automatically retries the step a maximum of 5 times if it encounters an `InternalServerError`. 
Check the [Retry Policy for Pipeline Steps](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-retry-policy.html) documentation for more information." + ] + }, + { + "cell_type": "markdown", + "id": "565cf77e-7fc7-406e-a2e2-40c553f459f7", + "metadata": { + "tags": [] + }, + "source": [ + "## Session 4 - Deploying Models and Serving Predictions\n", + "\n", + "In this session we'll explore how to deploy a model to a SageMaker Endpoint and how to use a SageMaker Inference Pipeline to control the data that goes in and comes out of the endpoint.\n", + "\n", + " \"Deployment\"\n", + "\n", + "Let's start by defining the name of the endpoint where we'll deploy the model and creating a constant pointing to the location where we'll store the data that the endpoint will capture:" + ] + }, + { + "cell_type": "code", + "execution_count": 691, + "id": "befd5ad3", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.predictor import Predictor\n", + "\n", + "ENDPOINT = \"penguins-endpoint\"\n", + "DATA_CAPTURE_DESTINATION = f\"{S3_LOCATION}/monitoring/data-capture\"" + ] + }, + { + "cell_type": "markdown", + "id": "93727425-fac6-44ec-91ed-130a50fdd18a", + "metadata": {}, + "source": [ + "### Step 1 - Deploying Model From Registry\n", + "\n", + "Let's manually deploy the latest model from the Model Registry to an endpoint.\n", + "\n", + "We want to query the list of approved models from the Model Registry and get the last one:" + ] + }, + { + "cell_type": "code", + "execution_count": 692, + "id": "87437a26-e9ea-4866-9dc3-630444c0fb46", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ModelPackageGroupName': 'penguins',\n", + " 'ModelPackageVersion': 74,\n", + " 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:325223348818:model-package/penguins/74',\n", + " 'CreationTime': datetime.datetime(2023, 10, 26, 14, 52, 37, 773000, tzinfo=tzlocal()),\n", + " 'ModelPackageStatus': 'Completed',\n", + " 'ModelApprovalStatus': 
'Approved'}" + ] + }, + "execution_count": 692, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "response = sagemaker_client.list_model_packages(\n", + " ModelPackageGroupName=MODEL_PACKAGE_GROUP,\n", + " ModelApprovalStatus=\"Approved\",\n", + " SortBy=\"CreationTime\",\n", + " MaxResults=1,\n", + ")\n", + "\n", + "package = (\n", + " response[\"ModelPackageSummaryList\"][0]\n", + " if response[\"ModelPackageSummaryList\"]\n", + " else None\n", + ")\n", + "package" + ] + }, + { + "cell_type": "markdown", + "id": "af752269", + "metadata": {}, + "source": [ + "We can now create a [Model Package](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.ModelPackage) using the ARN of the model from the Model Registry:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 693, + "id": "dee516e9", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker import ModelPackage\n", + "\n", + "model_package = ModelPackage(\n", + " model_package_arn=package[\"ModelPackageArn\"],\n", + " sagemaker_session=sagemaker_session,\n", + " role=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b3119b48-2ddf-40b5-9ac0-680073a53d06", + "metadata": {}, + "source": [ + "Let's now deploy the model to an endpoint:\n" + ] + }, + { + "cell_type": "markdown", + "id": "fbf2ed9f", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 694, + "id": "7c8852d5-818a-406c-944d-30bf6de90288", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "#| eval: false\n", + "\n", + "model_package.deploy(\n", + " endpoint_name=ENDPOINT, \n", + " initial_instance_count=1, \n", + " instance_type=config[\"instance_type\"]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "3dd7a725", + "metadata": {}, + "source": [ + "After deploying the model, we can test the endpoint to make sure it works.\n", + "\n", + "Each line of the payload we'll send to the endpoint contains the information of a penguin. Notice the model expects data that's already transformed. We can't provide the original data from our dataset because the model we registered will not work with it.\n", + "\n", + "The endpoint will return the predictions for each of these lines.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 695, + "id": "ba7da291", + "metadata": {}, + "outputs": [], + "source": [ + "payload = \"\"\"\n", + "0.6569590202313976,-1.0813829646495108,1.2097102831892812,0.9226343641317372,1.0,0.0,0.0\n", + "-0.7751048801481084,0.8822689351285553,-1.2168066120762704,0.9226343641317372,0.0,1.0,0.0\n", + "-0.837387834894918,0.3386660813829646,-0.26237731892812,-1.92351941317372,0.0,0.0,1.0\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "30bcfffa-0ba6-4ad8-8b4f-1ea19b35a22f", + "metadata": {}, + "source": [ + "Let's send the payload to the endpoint and print its response:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 696, + "id": "0817a25e-8224-4911-830b-d659e7458b4a", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "An error occurred (ValidationError) when calling the InvokeEndpoint operation: Endpoint penguins-endpoint of account 325223348818 not found.\n" + ] + } + ], + "source": [ + "predictor = 
Predictor(endpoint_name=ENDPOINT)\n", + "\n", + "try:\n", + " response = predictor.predict(payload, initial_args={\"ContentType\": \"text/csv\"})\n", + " response = json.loads(response.decode(\"utf-8\"))\n", + "\n", + " print(json.dumps(response, indent=2))\n", + " print(f\"\\nSpecies: {np.argmax(response['predictions'], axis=1)}\")\n", + "except Exception as e:\n", + " print(e)" + ] + }, + { + "cell_type": "markdown", + "id": "28f5d383-fcd7-454c-bbd6-ce4ce7b2104a", + "metadata": {}, + "source": [ + "After testing the endpoint, we need to ensure we delete it:\n" + ] + }, + { + "cell_type": "markdown", + "id": "d9ec7eeb", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 697, + "id": "6b32c3a4-312e-473c-a217-33606f77d1e9", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "predictor.delete_endpoint()" + ] + }, + { + "cell_type": "markdown", + "id": "99b90cce", + "metadata": {}, + "source": [ + "Deploying the model we trained directly to an endpoint doesn't lets us control the data that goes in and comes out of the endpoint. The TensorFlow model we trained requires transformed data, which makes it useless to other applications. Fortunately, we can create an Inference Pipeline using SageMaker to control the data that goes in and comes out of the endpoint.\n", + "\n", + "Our inference pipeline will have three components:\n", + "\n", + "1. A preprocessing transformer that will transform the input data into the format the model expects.\n", + "2. The TensorFlow model we trained.\n", + "3. A postprocessing transformer that will transform the output of the model into a human-readable format.\n", + "\n", + "We want our endpoint to handle unprocessed data in CSV and JSON format and return the penguin's species. 
Here is an example of the payload input we want the endpoint to support:\n", + "\n", + "```{json}\n", + "{\n", + " \"island\": \"Biscoe\",\n", + " \"culmen_length_mm\": 48.6,\n", + " \"culmen_depth_mm\": 16.0,\n", + " \"flipper_length_mm\": 230.0,\n", + " \"body_mass_g\": 5800.0\n", + "}\n", + "```\n", + "\n", + "And here is an example of the output we'd like to get from the endpoint:\n", + "\n", + "```{json}\n", + "{\n", + " \"prediction\": \"Adelie\",\n", + " \"confidence\": 0.802672\n", + "}\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "0d0838bf", + "metadata": {}, + "source": [ + "### Step 2 - Creating the Preprocessing Script\n", + "\n", + "The first component of our inference pipeline will transform the input data into the format the model expects. We'll use the Scikit-Learn transformer we saved when we split and transformed the data. To deploy this component as part of an inference pipeline, we need to write a script that loads the transformer, uses it to modify the input data, and returns the output in the format the TensorFlow model expects.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 698, + "id": "e2d61d5c", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/inference/preprocessing_component.py\n" + ] + } + ], + "source": [ + "%%writefile {INFERENCE_CODE_FOLDER}/preprocessing_component.py\n", + "\n", + "#| label: preprocessing-component\n", + "#| echo: true\n", + "#| output: false\n", + "#| filename: preprocessing_component.py\n", + "#| code-line-numbers: true\n", + "\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "import json\n", + "import joblib\n", + "\n", + "from io import StringIO\n", + "\n", + "try:\n", + " from sagemaker_containers.beta.framework import encoders, worker\n", + "except ImportError:\n", + " # We don't have access to the `worker` instance when testing locally. 
\n", + " # We'll set it to None so we can change the way functions create a response.\n", + " worker = None\n", + "\n", + "\n", + "TARGET_COLUMN = \"species\"\n", + "FEATURE_COLUMNS = [\n", + " \"island\",\n", + " \"culmen_length_mm\",\n", + " \"culmen_depth_mm\", \n", + " \"flipper_length_mm\",\n", + " \"body_mass_g\",\n", + " \"sex\"\n", + "]\n", + "\n", + "\n", + "def input_fn(input_data, content_type):\n", + " \"\"\"\n", + " Parses the input payload and creates a Pandas DataFrame.\n", + " \n", + " This function will check whether the target column is present in the\n", + " input data, and will remove it.\n", + " \"\"\"\n", + " \n", + " if content_type == \"text/csv\":\n", + " df = pd.read_csv(StringIO(input_data), header=None, skipinitialspace=True)\n", + "\n", + " if len(df.columns) == len(FEATURE_COLUMNS) + 1:\n", + " df = df.drop(df.columns[0], axis=1)\n", + " \n", + " df.columns = FEATURE_COLUMNS\n", + " return df\n", + " \n", + " if content_type == \"application/json\":\n", + " df = pd.DataFrame([json.loads(input_data)])\n", + " \n", + " if \"species\" in df.columns:\n", + " df = df.drop(\"species\", axis=1)\n", + " \n", + " return df\n", + " \n", + " else:\n", + " raise ValueError(f\"{content_type} is not supported.\")\n", + "\n", + "\n", + "def output_fn(prediction, accept):\n", + " \"\"\"\n", + " Formats the prediction output to generate a response.\n", + " \n", + " The default accept/content-type between containers for serial inference is JSON. 
\n", + " Since this model will precede a TensorFlow model, we want to return a JSON object\n", + " following TensorFlow's input requirements.\n", + " \"\"\"\n", + " \n", + " if prediction is None:\n", + " raise Exception(\"There was an error transforming the input data\")\n", + "\n", + " if accept == \"text/csv\":\n", + " return worker.Response(encoders.encode(prediction, accept), mimetype=accept) if worker else (prediction, accept)\n", + " \n", + " if accept == \"application/json\":\n", + " instances = prediction.tolist()\n", + " response = {\"instances\": instances}\n", + " return worker.Response(json.dumps(response), mimetype=accept) if worker else (response, accept)\n", + "\n", + " raise Exception(f\"{accept} accept type is not supported.\")\n", + "\n", + "\n", + "def predict_fn(input_data, model):\n", + " \"\"\"\n", + " Preprocess the input using the transformer.\n", + " \"\"\"\n", + " \n", + " try:\n", + " response = model.transform(input_data)\n", + " return response\n", + " except ValueError as e:\n", + " print(\"Error transforming the input data\", e)\n", + " return None\n", + "\n", + "\n", + "def model_fn(model_dir):\n", + " \"\"\"\n", + " Deserializes the model that will be used in this container.\n", + " \"\"\"\n", + " \n", + " return joblib.load(os.path.join(model_dir, \"features.joblib\"))" + ] + }, + { + "cell_type": "markdown", + "id": "037982c1", + "metadata": {}, + "source": [ + "Let's test the script to ensure everything is working as expected:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 699, + "id": "33893ef2", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m [100%]\u001b[0m\n", + "\u001b[32m\u001b[32m\u001b[1m10 passed\u001b[0m\u001b[32m in 
0.07s\u001b[0m\u001b[0m\n" + ] + } + ], + "source": [ + "%%ipytest\n", + "#| code-fold: true\n", + "#| output: false\n", + "\n", + "from preprocessing_component import input_fn, predict_fn, output_fn, model_fn\n", + "\n", + "\n", + "@pytest.fixture(scope=\"function\", autouse=False)\n", + "def directory():\n", + " directory = tempfile.mkdtemp()\n", + " input_directory = Path(directory) / \"input\"\n", + " input_directory.mkdir(parents=True, exist_ok=True)\n", + " shutil.copy2(DATA_FILEPATH, input_directory / \"data.csv\")\n", + " \n", + " directory = Path(directory)\n", + " \n", + " preprocess(base_directory=directory)\n", + " \n", + " with tarfile.open(directory / \"model\" / \"model.tar.gz\") as tar:\n", + " tar.extractall(path=directory / \"model\")\n", + " \n", + " yield directory / \"model\"\n", + " \n", + " shutil.rmtree(directory)\n", + "\n", + "\n", + "\n", + "def test_input_csv_drops_target_column_if_present():\n", + " input_data = \"\"\"\n", + " Adelie, Torgersen, 39.1, 18.7, 181, 3750, MALE\n", + " \"\"\"\n", + " \n", + " df = input_fn(input_data, \"text/csv\")\n", + " assert len(df.columns) == 6 and \"species\" not in df.columns\n", + "\n", + "\n", + "def test_input_json_drops_target_column_if_present():\n", + " input_data = json.dumps({\n", + " \"species\": \"Adelie\", \n", + " \"island\": \"Torgersen\",\n", + " \"culmen_length_mm\": 44.1,\n", + " \"culmen_depth_mm\": 18.0,\n", + " \"flipper_length_mm\": 210.0,\n", + " \"body_mass_g\": 4000.0,\n", + " \"sex\": \"MALE\"\n", + " })\n", + " \n", + " df = input_fn(input_data, \"application/json\")\n", + " assert len(df.columns) == 6 and \"species\" not in df.columns\n", + "\n", + "\n", + "def test_input_csv_works_without_target_column():\n", + " input_data = \"\"\"\n", + " Torgersen, 39.1, 18.7, 181, 3750, MALE\n", + " \"\"\"\n", + " \n", + " df = input_fn(input_data, \"text/csv\")\n", + " assert len(df.columns) == 6\n", + "\n", + "\n", + "def test_input_json_works_without_target_column():\n", + " 
input_data = json.dumps({\n", + " \"island\": \"Torgersen\",\n", + " \"culmen_length_mm\": 44.1,\n", + " \"culmen_depth_mm\": 18.0,\n", + " \"flipper_length_mm\": 210.0,\n", + " \"body_mass_g\": 4000.0,\n", + " \"sex\": \"MALE\"\n", + " })\n", + " \n", + " df = input_fn(input_data, \"application/json\")\n", + " assert len(df.columns) == 6\n", + "\n", + "\n", + "def test_output_csv_raises_exception_if_prediction_is_none():\n", + " with pytest.raises(Exception):\n", + " output_fn(None, \"text/csv\")\n", + " \n", + " \n", + "def test_output_json_raises_exception_if_prediction_is_none():\n", + " with pytest.raises(Exception):\n", + " output_fn(None, \"application/json\")\n", + " \n", + " \n", + "def test_output_csv_returns_prediction():\n", + " prediction = np.array([\n", + " [-1.3944109908736013,1.15488062669371,-0.7954340636549508,-0.5536447804097907,0.0,1.0,0.0],\n", + " [1.0557485835338234,0.5040085971987002,-0.5824506029515057,-0.5851840035995248,0.0,1.0,0.0]\n", + " ])\n", + " \n", + " response = output_fn(prediction, \"text/csv\")\n", + " \n", + " assert response == (prediction, \"text/csv\")\n", + " \n", + " \n", + "def test_output_json_returns_tensorflow_ready_input():\n", + " prediction = np.array([\n", + " [-1.3944109908736013,1.15488062669371,-0.7954340636549508,-0.5536447804097907,0.0,1.0,0.0],\n", + " [1.0557485835338234,0.5040085971987002,-0.5824506029515057,-0.5851840035995248,0.0,1.0,0.0]\n", + " ])\n", + " \n", + " response = output_fn(prediction, \"application/json\")\n", + " \n", + " assert response[0] == {\n", + " \"instances\": [\n", + " [-1.3944109908736013,1.15488062669371,-0.7954340636549508,-0.5536447804097907,0.0,1.0,0.0],\n", + " [1.0557485835338234,0.5040085971987002,-0.5824506029515057,-0.5851840035995248,0.0,1.0,0.0]\n", + " ]\n", + " }\n", + " \n", + " assert response[1] == \"application/json\"\n", + "\n", + " \n", + "def test_predict_transforms_data(directory):\n", + " input_data = \"\"\"\n", + " Torgersen, 39.1, 18.7, 181, 3750, 
MALE\n", + " \"\"\"\n", + " \n", + " model = model_fn(str(directory))\n", + " df = input_fn(input_data, \"text/csv\")\n", + " response = predict_fn(df, model)\n", + " assert type(response) is np.ndarray\n", + " \n", + "\n", + "def test_predict_returns_none_if_invalid_input(directory):\n", + " input_data = \"\"\"\n", + " Invalid, 39.1, 18.7, 181, 3750, MALE\n", + " \"\"\"\n", + " \n", + " model = model_fn(str(directory))\n", + " df = input_fn(input_data, \"text/csv\")\n", + " assert predict_fn(df, model) is None" + ] + }, + { + "cell_type": "markdown", + "id": "8eacf7aa", + "metadata": {}, + "source": [ + "### Step 3 - Creating the Postprocessing Script\n", + "\n", + "The final component of our inference pipeline will transform the output from the model into a human-readable format. We'll use the Scikit-Learn target transformer we saved when we split and transformed the data. To deploy this component as part of an inference pipeline, we need to write a script that loads the transformer, uses it to modify the output from the model, and returns a human-readable format.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 700, + "id": "48c69002", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/inference/postprocessing_component.py\n" + ] + } + ], + "source": [ + "%%writefile {INFERENCE_CODE_FOLDER}/postprocessing_component.py\n", + "\n", + "#| label: postprocessing-component\n", + "#| echo: true\n", + "#| output: false\n", + "#| filename: postprocessing_component.py\n", + "#| code-line-numbers: true\n", + "\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "import argparse\n", + "import json\n", + "import tarfile\n", + "import joblib\n", + "\n", + "from pathlib import Path\n", + "from io import StringIO\n", + "\n", + "from sklearn.compose import ColumnTransformer, make_column_selector\n", + "from sklearn.impute import SimpleImputer\n", + "from 
sklearn.pipeline import Pipeline, make_pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, OrdinalEncoder\n", + "from pickle import dump, load\n", + "\n", + "\n", + "try:\n", + " from sagemaker_containers.beta.framework import encoders, worker\n", + "except ImportError:\n", + " # We don't have access to the `worker` instance when testing locally. \n", + " # We'll set it to None so we can change the way functions create a response.\n", + " worker = None\n", + "\n", + "\n", + "def input_fn(input_data, content_type):\n", + " if content_type == \"application/json\":\n", + " predictions = json.loads(input_data)[\"predictions\"]\n", + " return predictions\n", + " \n", + " else:\n", + " raise ValueError(f\"{content_type} is not supported.\")\n", + "\n", + "\n", + "def output_fn(prediction, accept):\n", + " if accept == \"text/csv\":\n", + " return worker.Response(encoders.encode(prediction, accept), mimetype=accept) if worker else (prediction, accept)\n", + " \n", + " if accept == \"application/json\":\n", + " response = []\n", + " for p, c in prediction:\n", + " response.append({\n", + " \"prediction\": p,\n", + " \"confidence\": c\n", + " })\n", + "\n", + " # If there's only one prediction, we'll return it\n", + " # as a single object.\n", + " if len(response) == 1:\n", + " response = response[0]\n", + " \n", + " return worker.Response(json.dumps(response), mimetype=accept) if worker else (response, accept)\n", + " \n", + " raise RuntimeError(f\"{accept} accept type is not supported.\")\n", + "\n", + "\n", + "def predict_fn(input_data, model):\n", + " \"\"\"\n", + " Transforms the prediction into its corresponding category.\n", + " \"\"\"\n", + "\n", + " predictions = np.argmax(input_data, axis=-1)\n", + " confidence = np.max(input_data, axis=-1)\n", + " return [(model[p], c) for c, p in zip(confidence, predictions)]\n", + "\n", + "\n", + "def model_fn(model_dir):\n", + " \"\"\"\n", + " 
Deserializes the target model and returns the list of fitted categories.\n", + " \"\"\"\n", + " \n", + " model = joblib.load(os.path.join(model_dir, \"target.joblib\"))\n", + " return model.named_transformers_[\"species\"].categories_[0]" + ] + }, + { + "cell_type": "markdown", + "id": "86c421c7", + "metadata": {}, + "source": [ + "Let's test the script to ensure everything is working as expected:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 701, + "id": "741b8402", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m [100%]\u001b[0m\n", + "\u001b[32m\u001b[32m\u001b[1m3 passed\u001b[0m\u001b[32m in 0.01s\u001b[0m\u001b[0m\n" + ] + } + ], + "source": [ + "%%ipytest\n", + "#| code-fold: true\n", + "#| output: false\n", + "\n", + "import numpy as np\n", + "\n", + "from postprocessing_component import predict_fn, output_fn\n", + "\n", + "\n", + "def test_predict_returns_prediction_as_first_column():\n", + " input_data = [\n", + " [0.6, 0.2, 0.2], \n", + " [0.1, 0.8, 0.1],\n", + " [0.2, 0.1, 0.7]\n", + " ]\n", + " \n", + " categories = [\"Adelie\", \"Gentoo\", \"Chinstrap\"]\n", + " \n", + " response = predict_fn(input_data, categories)\n", + " \n", + " assert response == [\n", + " (\"Adelie\", 0.6),\n", + " (\"Gentoo\", 0.8),\n", + " (\"Chinstrap\", 0.7)\n", + " ]\n", + "\n", + "\n", + "def test_output_does_not_return_array_if_single_prediction():\n", + " prediction = [(\"Adelie\", 0.6)]\n", + " response, _ = output_fn(prediction, \"application/json\")\n", + "\n", + " assert response[\"prediction\"] == \"Adelie\"\n", + "\n", + "\n", + "def test_output_returns_array_if_multiple_predictions():\n", + " prediction = [(\"Adelie\", 0.6), (\"Gentoo\", 0.8)]\n", + " response, _ = output_fn(prediction, \"application/json\")\n", + "\n", + " assert len(response) == 2\n", + " assert response[0][\"prediction\"] == \"Adelie\"\n", + " 
assert response[1][\"prediction\"] == \"Gentoo\"\n" + ] + }, + { + "cell_type": "markdown", + "id": "5e5526e5", + "metadata": {}, + "source": [ + "### Step 4 - Setting up the Inference Pipeline\n", + "\n", + "We can now create a [PipelineModel](https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html#sagemaker.pipeline.PipelineModel) to define our inference pipeline.\n" + ] + }, + { + "cell_type": "markdown", + "id": "2baf91d8", + "metadata": {}, + "source": [ + "We'll use the model we generated from the first step of the pipeline as the input to the first and last components of the inference pipeline. This `model.tar.gz` file contains the two transformers we need to preprocess and postprocess the data. Let's create a variable with the URI to this file:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 702, + "id": "53ea0ccf", + "metadata": {}, + "outputs": [], + "source": [ + "transformation_pipeline_model = Join(\n", + " on=\"/\",\n", + " values=[\n", + " split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"model\"\n", + " ].S3Output.S3Uri,\n", + " \"model.tar.gz\",\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "1b7119a4", + "metadata": {}, + "source": [ + "Here is the first component of the inference pipeline. 
It will preprocess the data before sending it to the TensorFlow model:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 703, + "id": "11a0effd", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.sklearn.model import SKLearnModel\n", + "\n", + "preprocessing_model = SKLearnModel(\n", + " model_data=transformation_pipeline_model,\n", + " entry_point=\"preprocessing_component.py\",\n", + " source_dir=str(INFERENCE_CODE_FOLDER),\n", + " framework_version=\"1.2-1\",\n", + " sagemaker_session=config[\"session\"],\n", + " role=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "26a18bfb", + "metadata": {}, + "source": [ + "Here is the last component of the inference pipeline. It will postprocess the output from the TensorFlow model before sending it back to the user:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 704, + "id": "5d7a5926", + "metadata": {}, + "outputs": [], + "source": [ + "post_processing_model = SKLearnModel(\n", + " model_data=transformation_pipeline_model,\n", + " entry_point=\"postprocessing_component.py\",\n", + " source_dir=str(INFERENCE_CODE_FOLDER),\n", + " framework_version=\"1.2-1\",\n", + " sagemaker_session=config[\"session\"],\n", + " role=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2918f505", + "metadata": {}, + "source": [ + "We can now create the inference pipeline using the three models:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 705, + "id": "157b8858", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from sagemaker.pipeline import PipelineModel\n", + "\n", + "pipeline_model = PipelineModel(\n", + " name=\"inference-model\",\n", + " models=[preprocessing_model, tensorflow_model, post_processing_model],\n", + " sagemaker_session=config[\"session\"],\n", + " role=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "501cdaa1", + "metadata": {}, + "source": [ + "### Step 5 - Registering the Model\n", + "\n", + "We'll 
modify the pipeline to register the Pipeline Model in the Model Registry. We'll use a different group name to keep Pipeline Models separate.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 706, + "id": "aefe580a", + "metadata": {}, + "outputs": [], + "source": [ + "PIPELINE_MODEL_PACKAGE_GROUP = \"pipeline\"" + ] + }, + { + "cell_type": "markdown", + "id": "77b2b06e", + "metadata": {}, + "source": [ + "Let's now register the model. Notice that we will register the model with \"PendingManualApproval\" status. This means that we'll need to manually approve the model before it can be deployed to an endpoint. Check [Register a Model Version](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-version.html) for more information about model registration.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 707, + "id": "f84d2cd5", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/svpino/dev/ml.school/.venv/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:297: UserWarning: Running within a PipelineSession, there will be No Wait, No Logs, and No Job being started.\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "register_model_step = ModelStep(\n", + " name=\"register\",\n", + " display_name=\"register-model\",\n", + " step_args=pipeline_model.register(\n", + " model_package_group_name=PIPELINE_MODEL_PACKAGE_GROUP,\n", + " model_metrics=model_metrics,\n", + " approval_status=\"PendingManualApproval\",\n", + " # Our inference pipeline model supports two content\n", + " # types: text/csv and application/json.\n", + " content_types=[\"text/csv\", \"application/json\"],\n", + " response_types=[\"text/csv\", \"application/json\"],\n", + " # These are the suggested inference instance types when\n", + " # deploying the model or using it as part of a batch\n", + " # transform job.\n", + " 
inference_instances=[\"ml.m5.xlarge\"],\n", + " transform_instances=[\"ml.m5.xlarge\"],\n", + " domain=\"MACHINE_LEARNING\",\n", + " task=\"CLASSIFICATION\",\n", + " framework=\"TENSORFLOW\",\n", + " framework_version=config[\"framework_version\"],\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c00a4ebb-fb21-4935-9d7b-9500e47e07f9", + "metadata": {}, + "source": [ + "### Step 6 - Modifying the Condition Step\n", + "\n", + "Since we modified the registration step, we also need to modify the Condition Step to use the new registration:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 708, + "id": "b9712905-9fe3-4148-ae6d-05b0a48e742e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "condition_step = ConditionStep(\n", + " name=\"check-model-accuracy\",\n", + " conditions=[condition],\n", + " if_steps=[register_model_step] if not LOCAL_MODE else [],\n", + " else_steps=[fail_step],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0730a388", + "metadata": {}, + "source": [ + "### Step 7 - Creating the Pipeline\n", + "\n", + "We can now define the SageMaker Pipeline and submit its definition to the SageMaker Pipelines service to create the pipeline if it doesn't exist or update it if it does.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 709, + "id": "bad9f51d", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n", + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session4-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", + "INFO:sagemaker.processing:runproc.sh uploaded to 
s3://mlschool/session4-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n", + "WARNING:sagemaker.workflow._utils:Popping out 'CertifyForMarketplace' from the pipeline definition since it will be overridden in pipeline execution time.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n", + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session4-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", + "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session4-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n" + ] + }, + { + "data": { + "text/plain": [ + "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session4-pipeline',\n", + " 'ResponseMetadata': {'RequestId': '2cd65edc-9bad-4b67-a1d2-aa22698d6a39',\n", + " 'HTTPStatusCode': 200,\n", + " 'HTTPHeaders': {'x-amzn-requestid': '2cd65edc-9bad-4b67-a1d2-aa22698d6a39',\n", + " 'content-type': 'application/x-amz-json-1.1',\n", + " 'content-length': '85',\n", + " 'date': 'Fri, 27 Oct 2023 14:38:46 GMT'},\n", + " 'RetryAttempts': 0}}" + ] + }, + "execution_count": 709, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "session4_pipeline = Pipeline(\n", + " name=\"session4-pipeline\",\n", + " parameters=[dataset_location, accuracy_threshold],\n", + " steps=[\n", + " split_and_transform_data_step,\n", + " tune_model_step if USE_TUNING_STEP else train_model_step,\n", + " evaluate_model_step,\n", + " condition_step,\n", + " ],\n", + " pipeline_definition_config=pipeline_definition_config,\n", + " 
sagemaker_session=config[\"session\"],\n", + ")\n", + "\n", + "session4_pipeline.upsert(role_arn=role)" + ] + }, + { + "cell_type": "markdown", + "id": "20c71f91", + "metadata": {}, + "source": [ + "We can now start the pipeline:\n" + ] + }, + { + "cell_type": "markdown", + "id": "34819536", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 710, + "id": "20dfbd97", + "metadata": {}, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "session4_pipeline.start()" + ] + }, + { + "cell_type": "markdown", + "id": "2c74cc70", + "metadata": {}, + "source": [ + "### Step 8 - Creating the Lambda Function\n", + "\n", + "We will use [Amazon EventBridge](https://aws.amazon.com/pm/eventbridge/) to trigger a Lambda function that will deploy the model whenever its status changes from \"PendingManualApproval\" to \"Approved.\" Let's start by writing the Lambda function to take the model information and create a new endpoint.\n", + "\n", + "We'll enable [Data Capture](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture.html) as part of the endpoint configuration. With Data Capture we can record the inputs and outputs of the endpoint to use them later for monitoring the model:\n", + "\n", + "- `InitialSamplingPercentage` represents the percentage of traffic that we want to capture.\n", + "- `DestinationS3Uri` specifies the S3 location where we want to store the captured data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 744, + "id": "998314a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/lambda.py\n" + ] + } + ], + "source": [ + "%%writefile {CODE_FOLDER}/lambda.py\n", + "\n", + "import os\n", + "import json\n", + "import boto3\n", + "import time\n", + "\n", + "sagemaker = boto3.client(\"sagemaker\")\n", + "\n", + "def lambda_handler(event, context):\n", + " model_package_arn = event[\"detail\"][\"ModelPackageArn\"]\n", + " approval_status = event[\"detail\"][\"ModelApprovalStatus\"]\n", + "\n", + " print(f\"Model: {model_package_arn}\")\n", + " print(f\"Approval status: {approval_status}\")\n", + " \n", + " # We only want to deploy the approved 
models\n", + " if approval_status != \"Approved\":\n", + " response = {\n", + " \"message\": \"Skipping deployment.\",\n", + " \"approval_status\": approval_status,\n", + " }\n", + "\n", + " print(response)\n", + " return {\n", + " \"statusCode\": 200,\n", + " \"body\": json.dumps(response)\n", + " } \n", + " \n", + " endpoint_name = os.environ[\"ENDPOINT\"]\n", + " data_capture_destination = os.environ[\"DATA_CAPTURE_DESTINATION\"]\n", + " role = os.environ[\"ROLE\"]\n", + " \n", + " timestamp = time.strftime(\"%m%d%H%M%S\", time.localtime())\n", + " model_name = f\"{endpoint_name}-model-{timestamp}\"\n", + " endpoint_config_name = f\"{endpoint_name}-config-{timestamp}\"\n", + "\n", + " sagemaker.create_model(\n", + " ModelName=model_name, \n", + " ExecutionRoleArn=role, \n", + " Containers=[{\n", + " \"ModelPackageName\": model_package_arn\n", + " }] \n", + " )\n", + "\n", + " sagemaker.create_endpoint_config(\n", + " EndpointConfigName=endpoint_config_name,\n", + " ProductionVariants=[{\n", + " \"ModelName\": model_name,\n", + " \"InstanceType\": \"ml.m5.xlarge\",\n", + " \"InitialVariantWeight\": 1,\n", + " \"InitialInstanceCount\": 1,\n", + " \"VariantName\": \"AllTraffic\",\n", + " }],\n", + "\n", + " # We can enable Data Capture to record the inputs and outputs\n", + " # of the endpoint to use them later for monitoring the model. 
\n", + " DataCaptureConfig={\n", + " \"EnableCapture\": True,\n", + " \"InitialSamplingPercentage\": 100,\n", + " \"DestinationS3Uri\": data_capture_destination,\n", + " \"CaptureOptions\": [\n", + " {\n", + " \"CaptureMode\": \"Input\"\n", + " },\n", + " {\n", + " \"CaptureMode\": \"Output\"\n", + " },\n", + " ],\n", + " \"CaptureContentTypeHeader\": {\n", + " \"CsvContentTypes\": [\n", + " \"text/csv\",\n", + " \"application/octet-stream\"\n", + " ],\n", + " \"JsonContentTypes\": [\n", + " \"application/json\",\n", + " \"application/octet-stream\"\n", + " ]\n", + " }\n", + " },\n", + " )\n", + " \n", + " response = sagemaker.list_endpoints(NameContains=endpoint_name, MaxResults=1)\n", + "\n", + " if len(response[\"Endpoints\"]) == 0:\n", + " # If the endpoint doesn't exist, let's create it.\n", + " sagemaker.create_endpoint(\n", + " EndpointName=endpoint_name, \n", + " EndpointConfigName=endpoint_config_name,\n", + " )\n", + " else:\n", + " # If the endpoint already exists, let's update it with the\n", + " # new configuration.\n", + " sagemaker.update_endpoint(\n", + " EndpointName=endpoint_name, \n", + " EndpointConfigName=endpoint_config_name,\n", + " )\n", + " \n", + " return {\n", + " \"statusCode\": 200,\n", + " \"body\": json.dumps(\"Endpoint deployed successfully\")\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "5b582ace", + "metadata": {}, + "source": [ + "We need to ensure our Lambda function has permission to interact with SageMaker, so let's create a new role and then create the Lambda function.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 745, + "id": "4ad4f1f2", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Role \"lambda-deployment-role\" created with ARN \"arn:aws:iam::325223348818:role/lambda-deployment-role\".\n" + ] + } + ], + "source": [ + "#| code: true\n", + "#| output: false\n", + "\n", + "lambda_role_name = \"lambda-deployment-role\"\n", + 
"lambda_role_arn = None\n", + "\n", + "try:\n", + " response = iam_client.create_role(\n", + " RoleName=lambda_role_name,\n", + " AssumeRolePolicyDocument=json.dumps(\n", + " {\n", + " \"Version\": \"2012-10-17\",\n", + " \"Statement\": [\n", + " {\n", + " \"Effect\": \"Allow\",\n", + " \"Principal\": {\n", + " \"Service\": [\"lambda.amazonaws.com\", \"events.amazonaws.com\"]\n", + " },\n", + " \"Action\": \"sts:AssumeRole\",\n", + " }\n", + " ],\n", + " }\n", + " ),\n", + " Description=\"Lambda Endpoint Deployment\",\n", + " )\n", + "\n", + " lambda_role_arn = response[\"Role\"][\"Arn\"]\n", + "\n", + " iam_client.attach_role_policy(\n", + " PolicyArn=\"arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole\",\n", + " RoleName=lambda_role_name,\n", + " )\n", + "\n", + " iam_client.attach_role_policy(\n", + " PolicyArn=\"arn:aws:iam::aws:policy/AmazonSageMakerFullAccess\",\n", + " RoleName=lambda_role_name,\n", + " )\n", + "\n", + " print(f'Role \"{lambda_role_name}\" created with ARN \"{lambda_role_arn}\".')\n", + "except iam_client.exceptions.EntityAlreadyExistsException:\n", + " response = iam_client.get_role(RoleName=lambda_role_name)\n", + " lambda_role_arn = response[\"Role\"][\"Arn\"]\n", + " print(f'Role \"{lambda_role_name}\" already exists with ARN \"{lambda_role_arn}\".')\n" + ] + }, + { + "cell_type": "markdown", + "id": "acef9d48", + "metadata": {}, + "source": [ + "We can now create the Lambda function:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 747, + "id": "ad8c8019", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ResponseMetadata': {'RequestId': '57179d72-6fc2-49cc-9326-cb87bd63bda1',\n", + " 'HTTPStatusCode': 201,\n", + " 'HTTPHeaders': {'date': 'Fri, 27 Oct 2023 16:01:42 GMT',\n", + " 'content-type': 'application/json',\n", + " 'content-length': '1421',\n", + " 'connection': 'keep-alive',\n", + " 'x-amzn-requestid': '57179d72-6fc2-49cc-9326-cb87bd63bda1'},\n", + " 
'RetryAttempts': 0},\n", + " 'FunctionName': 'deploy_fn',\n", + " 'FunctionArn': 'arn:aws:lambda:us-east-1:325223348818:function:deploy_fn',\n", + " 'Runtime': 'python3.11',\n", + " 'Role': 'arn:aws:iam::325223348818:role/lambda-deployment-role',\n", + " 'Handler': 'lambda.lambda_handler',\n", + " 'CodeSize': 3194,\n", + " 'Description': '',\n", + " 'Timeout': 600,\n", + " 'MemorySize': 128,\n", + " 'LastModified': '2023-10-27T16:01:42.544+0000',\n", + " 'CodeSha256': 'IkCkE0e46WsdhSUEPRlsqEH/6nHhU5laPpgn308D30k=',\n", + " 'Version': '$LATEST',\n", + " 'Environment': {'Variables': {'ROLE': 'arn:aws:iam::325223348818:role/service-role/AmazonSageMaker-ExecutionRole-20230312T160501',\n", + " 'DATA_CAPTURE_DESTINATION': 's3://mlschool/penguins/monitoring/data-capture',\n", + " 'ENDPOINT': 'penguins-endpoint'}},\n", + " 'TracingConfig': {'Mode': 'PassThrough'},\n", + " 'RevisionId': '516fef1e-871b-4a52-81e2-a421f3547ec9',\n", + " 'Layers': [],\n", + " 'State': 'Pending',\n", + " 'StateReason': 'The function is being created.',\n", + " 'StateReasonCode': 'Creating',\n", + " 'PackageType': 'Zip',\n", + " 'Architectures': ['x86_64'],\n", + " 'EphemeralStorage': {'Size': 512},\n", + " 'SnapStart': {'ApplyOn': 'None', 'OptimizationStatus': 'Off'},\n", + " 'RuntimeVersionConfig': {'RuntimeVersionArn': 'arn:aws:lambda:us-east-1::runtime:6cf63f1a78b5c5e19617d6b4b111370fdbda415ea91bdfdc5aacef9fee76b64a'}}" + ] + }, + "execution_count": 747, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.lambda_helper import Lambda\n", + "\n", + "\n", + "deploy_lambda_fn = Lambda(\n", + " function_name=\"deploy_fn\",\n", + " execution_role_arn=lambda_role_arn,\n", + " script=str(CODE_FOLDER / \"lambda.py\"),\n", + " handler=\"lambda.lambda_handler\",\n", + " timeout=600,\n", + " session=sagemaker_session,\n", + " runtime=\"python3.11\",\n", + " environment={\n", + " \"Variables\": {\n", + " 
\"ENDPOINT\": ENDPOINT,\n", + " \"DATA_CAPTURE_DESTINATION\": DATA_CAPTURE_DESTINATION,\n", + " \"ROLE\": role,\n", + " }\n", + " },\n", + ")\n", + "\n", + "lambda_response = None\n", + "if not LOCAL_MODE:\n", + " lambda_response = deploy_lambda_fn.upsert()\n", + "\n", + "lambda_response" + ] + }, + { + "cell_type": "markdown", + "id": "d4ad06ac", + "metadata": {}, + "source": [ + "### Step 9 - Setting Up EventBridge\n", + "\n", + "We can now create an EventBridge rule that triggers the deployment process whenever a model approval status becomes \"Approved\". To do this, let's define the event pattern that will trigger the deployment process:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 748, + "id": "27ce7cc5", + "metadata": {}, + "outputs": [], + "source": [ + "event_pattern = f\"\"\"\n", + "{{\n", + " \"source\": [\"aws.sagemaker\"],\n", + " \"detail-type\": [\"SageMaker Model Package State Change\"],\n", + " \"detail\": {{\n", + " \"ModelPackageGroupName\": [\"{PIPELINE_MODEL_PACKAGE_GROUP}\"],\n", + " \"ModelApprovalStatus\": [\"Approved\"]\n", + " }}\n", + "}}\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "d1b23587", + "metadata": {}, + "source": [ + "Let's now create the EventBridge rule:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 749, + "id": "2a878179", + "metadata": {}, + "outputs": [], + "source": [ + "events_client = boto3.client(\"events\")\n", + "rule_response = events_client.put_rule(\n", + " Name=\"PipelineModelApprovedRule\",\n", + " EventPattern=event_pattern,\n", + " State=\"ENABLED\",\n", + " RoleArn=role,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0b3ba782", + "metadata": {}, + "source": [ + "Now, we need to define the target of the rule. The target will trigger whenever the rule matches an event. 
In this case, we want to trigger the Lambda function we created before:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 750, + "id": "dc714a97", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "response = events_client.put_targets(\n", + " Rule=\"PipelineModelApprovedRule\",\n", + " Targets=[\n", + " {\n", + " \"Id\": \"1\",\n", + " \"Arn\": lambda_response[\"FunctionArn\"],\n", + " }\n", + " ],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "400585a1", + "metadata": {}, + "source": [ + "Finally, we need to give the Lambda function permission to be triggered by the EventBridge rule:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 751, + "id": "d74be86b", + "metadata": {}, + "outputs": [], + "source": [ + "lambda_client = boto3.client(\"lambda\")\n", + "try:\n", + " response = lambda_client.add_permission(\n", + " Action=\"lambda:InvokeFunction\",\n", + " FunctionName=lambda_response[\"FunctionName\"],\n", + " Principal=\"events.amazonaws.com\",\n", + " SourceArn=rule_response[\"RuleArn\"],\n", + " StatementId=\"EventBridge\",\n", + " )\n", + "except lambda_client.exceptions.ResourceConflictException as e:\n", + " print(f'Function \"{lambda_response[\"FunctionName\"]}\" already has permissions.')" + ] + }, + { + "cell_type": "markdown", + "id": "7dfe7356-53e8-4ac1-9a7f-3bd51bb739a5", + "metadata": {}, + "source": [ + "### Step 10 - Testing the Endpoint\n", + "\n", + "Let's now test the endpoint we deployed automatically with the pipeline. 
We will create a predictor with a CSV serializer and send it a few samples from the dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 718, + "id": "3cc966fb-b611-417f-a8b8-0c5d2f95252c", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Payload:\n", + "Torgersen,39.1,18.7,181.0,3750.0,MALE\n", + "Torgersen,39.5,17.4,186.0,3800.0,FEMALE\n", + "Torgersen,40.3,18.0,195.0,3250.0,FEMALE\n", + "\n", + "An error occurred (ValidationError) when calling the InvokeEndpoint operation: Endpoint penguins-endpoint of account 325223348818 not found.\n" + ] + } + ], + "source": [ + "from sagemaker.serializers import CSVSerializer\n", + "\n", + "predictor = Predictor(\n", + " endpoint_name=ENDPOINT, \n", + " serializer=CSVSerializer(),\n", + " sagemaker_session=sagemaker_session\n", + ")\n", + "\n", + "data = pd.read_csv(DATA_FILEPATH)\n", + "data = data.drop(\"species\", axis=1)\n", + "\n", + "payload = data.iloc[:3].to_csv(header=False, index=False)\n", + "print(f\"Payload:\\n{payload}\")\n", + "\n", + "try:\n", + " response = predictor.predict(payload, initial_args={\"ContentType\": \"text/csv\"})\n", + " print(response.decode(\"utf-8\"))\n", + "except Exception as e:\n", + " print(e)" + ] + }, + { + "cell_type": "markdown", + "id": "67e883b0", + "metadata": {}, + "source": [ + "Let's delete the endpoint:" + ] + }, + { + "cell_type": "markdown", + "id": "6cffc2b5", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 719, + "id": "8c3e851a-2416-4a0b-b8a1-c483cde3d776", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "predictor.delete_endpoint()" + ] + }, + { + "cell_type": "markdown", + "id": "d2b2e88b-0740-4214-a92f-ceba981c7e9c", + "metadata": {}, + "source": [ + "### Assignments\n", + "\n", + "* Assignment 4.1 Every Endpoint has an invocation URL you can use to generate predictions with the model from outside AWS. As part of this assignment, write a simple Python script that will run on your local computer and run a few samples through the Endpoint. You will need your AWS access key and secret to connect to the Endpoint.\n", + "\n", + "* Assignment 4.2 We can use model variants to perform A/B testing between a new model and an old model. Create a function that given the ARN of two models in the Model Registry deploys them to an Endpoint as separate variants. Each variant should receive 50% of the traffic. Write another function that invokes the endpoint by default, but allows the caller to invoke a specific variant if they want to.\n", + "\n", + "* Assignment 4.3 We can use SageMaker Model Shadow Deployments to create shadow variants to validate a new model version before promoting it to production. Write a function that given the ARN of a model in the Model Registry, updates an Endpoint and deploys the model as a shadow variant. Check [Shadow variants](https://docs.aws.amazon.com/sagemaker/latest/dg/model-shadow-deployment.html) for more information about this topic. Send some traffic to the Endpoint and compare the results from the main model with its shadow variant.\n", + "\n", + "* Assignment 4.4 SageMaker supports auto scaling your models. Auto scaling dynamically adjusts the number of instances provisioned for a model in response to changes in the workload. 
For this assignment, define a target-tracking scaling policy for a variant of your Endpoint and use the `SageMakerVariantInvocationsPerInstance` metric. `SageMakerVariantInvocationsPerInstance` is the average number of times per minute that each instance for the variant is invoked. Check [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) for more information about auto scaling models.\n", + "\n", + "* Assignment 4.5 Modify the SageMaker Pipeline by adding a Lambda Step that will deploy the model directly as part of the pipeline. You won't need to set up EventBridge anymore because your pipeline will automatically deploy the model.\n" + ] + }, + { + "cell_type": "markdown", + "id": "e544ae36-00b3-4bde-b133-c3a59bb7f1d8", + "metadata": {}, + "source": [ + "## Session 5 - Data Distribution Shifts And Model Monitoring\n", + "\n", + "In this session we'll set up a monitoring process to analyze the quality of the data our endpoint receives and of the predictions it returns. For this, we need to check the data received by the endpoint, generate ground truth labels, and compare them with a baseline performance.\n", + "\n", + " \"Monitoring\"\n", + "\n", + "To enable this functionality, we need a couple of steps:\n", + "\n", + "1. Create baselines we can use to compare against real-time traffic.\n", + "2. Set up a schedule to continuously evaluate and compare against the baselines.\n", + "\n", + "Check [Amazon SageMaker Model Monitor](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_monitoring.html) for a brief explanation of how to use SageMaker's Model Monitoring functionality. 
[Monitor models for data and model quality, bias, and explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) is a much more extensive guide to monitoring in Amazon SageMaker.\n" + ] + }, + { + "cell_type": "markdown", + "id": "0ef0ad20", + "metadata": {}, + "source": [ + "Let's start by defining three variables we'll use throughout the session:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 720, + "id": "2bb846d0", + "metadata": {}, + "outputs": [], + "source": [ + "GROUND_TRUTH_LOCATION = f\"{S3_LOCATION}/monitoring/groundtruth\"\n", + "DATA_QUALITY_LOCATION = f\"{S3_LOCATION}/monitoring/data-quality\"\n", + "MODEL_QUALITY_LOCATION = f\"{S3_LOCATION}/monitoring/model-quality\"" + ] + }, + { + "cell_type": "markdown", + "id": "24c26ac4-5d30-41e9-8952-e4deb39de819", + "metadata": {}, + "source": [ + "### Step 1 - Generating Data Quality Baseline\n", + "\n", + "Let's start by configuring a [Quality Check Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-quality-check) to compute the general statistics of the data we used to build our model.\n", + "\n", + "We can configure the instance that will run the quality check using the [CheckJobConfig](https://sagemaker.readthedocs.io/en/v2.73.0/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.check_job_config.CheckJobConfig) class, and we can use the `DataQualityCheckConfig` class to configure the job.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 721, + "id": "0b80bcab-d2c5-437c-a1c8-8eea208c0e29", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: .\n", + "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.quality_check_step 
import (\n", + " QualityCheckStep,\n", + " DataQualityCheckConfig,\n", + ")\n", + "from sagemaker.workflow.check_job_config import CheckJobConfig\n", + "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", + "\n", + "data_quality_baseline_step = QualityCheckStep(\n", + " name=\"generate-data-quality-baseline\",\n", + " check_job_config=CheckJobConfig(\n", + " instance_type=\"ml.c5.xlarge\",\n", + " instance_count=1,\n", + " volume_size_in_gb=20,\n", + " sagemaker_session=pipeline_session,\n", + " role=role,\n", + " ),\n", + " quality_check_config=DataQualityCheckConfig(\n", + " baseline_dataset=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"train-baseline\"\n", + " ].S3Output.S3Uri,\n", + " dataset_format=DatasetFormat.csv(header=True, output_columns_position=\"START\"),\n", + " output_s3_uri=DATA_QUALITY_LOCATION,\n", + " ),\n", + " skip_check=True,\n", + " register_new_baseline=True,\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "81430dfd-2524-43e4-bfe9-c6545316005d", + "metadata": { + "tags": [] + }, + "source": [ + "### Step 2 - Generating Test Predictions\n", + "\n", + "To create a baseline to compare the model performance, we must create predictions for the test set and compare the model's metrics with the model performance on production data. We can do this by running a [Batch Transform Job](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) to predict every sample from the test set. We can use a [Transform Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-transform) as part of the pipeline to run this job. 
This Batch Transform Job will run every sample from the test set through the model so we can compute the baseline metrics.\n", + "\n", + "The Transform Step requires a model to generate predictions, so we need a Model Step that creates a model:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 722, + "id": "8194b462", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/svpino/dev/ml.school/.venv/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:297: UserWarning: Running within a PipelineSession, there will be No Wait, No Logs, and No Job being started.\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.model_step import ModelStep\n", + "\n", + "create_model_step = ModelStep(\n", + " name=\"create\",\n", + " display_name=\"create-model\",\n", + " step_args=pipeline_model.create(instance_type=config[\"instance_type\"]),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "eddb6ac7", + "metadata": {}, + "source": [ + "Let's configure the Batch Transform Job using an instance of the [Transformer](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html) class:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 723, + "id": "bf6aa4f0", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.transformer import Transformer\n", + "\n", + "transformer = Transformer(\n", + " model_name=create_model_step.properties.ModelName,\n", + " instance_type=config[\"instance_type\"],\n", + " instance_count=1,\n", + " strategy=\"MultiRecord\",\n", + " accept=\"text/csv\",\n", + " assemble_with=\"Line\",\n", + " output_path=f\"{S3_LOCATION}/transform\",\n", + " sagemaker_session=pipeline_session,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a7f01fb9", + "metadata": {}, + "source": [ + "We can now set up the [Transform 
Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-transform) using the transformer we configured before.\n", + "\n", + "Notice the following:\n", + "\n", + "- We'll generate predictions for the test baseline output that we generated when we split and transformed the data. This baseline is the same data we used to test the model, but we saved it in its original format before transforming it.\n", + "- The output of this Batch Transform Job will have two fields. The first one will be the ground truth label, and the second one will be the prediction of the model.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 724, + "id": "1987a788-de7a-4f60-ac8d-819d9ffcdf8e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.steps import TransformStep\n", + "\n", + "generate_test_predictions_step = TransformStep(\n", + " name=\"generate-test-predictions\",\n", + " step_args=transformer.transform(\n", + " # We will use the baseline set we generated when we split the data.\n", + " # This set corresponds to the test split before the transformation step.\n", + " data=split_and_transform_data_step.properties.ProcessingOutputConfig.Outputs[\n", + " \"test-baseline\"\n", + " ].S3Output.S3Uri,\n", + "\n", + " join_source=\"Input\",\n", + " split_type=\"Line\",\n", + " content_type=\"text/csv\",\n", + " \n", + " # We want to output the first and the last field from the joined set.\n", + " # The first field corresponds to the ground truth, and the last field\n", + " # corresponds to the prediction.\n", + " output_filter=\"$[0,-1]\",\n", + " ),\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2fafc7c4-6fef-4832-8b99-8c45d078fdd2", + "metadata": {}, + "source": [ + "### Step 3 - Generating Model Quality Baseline\n", + "\n", + "Let's now configure the [Quality Check 
Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-quality-check) and feed it the data we generated in the Transform Step. This step will automatically compute the performance metrics of the model on the test set:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 725, + "id": "9aa3a284-8763-4000-a263-70314b530652", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: .\n", + "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n" + ] + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.workflow.quality_check_step import ModelQualityCheckConfig\n", + "\n", + "model_quality_baseline_step = QualityCheckStep(\n", + " name=\"generate-model-quality-baseline\",\n", + " check_job_config=CheckJobConfig(\n", + " instance_type=\"ml.c5.xlarge\",\n", + " instance_count=1,\n", + " volume_size_in_gb=20,\n", + " sagemaker_session=pipeline_session,\n", + " role=role,\n", + " ),\n", + " quality_check_config=ModelQualityCheckConfig(\n", + " # We are going to use the output of the Transform Step to generate\n", + " # the model quality baseline.\n", + " baseline_dataset=generate_test_predictions_step.properties.TransformOutput.S3OutputPath,\n", + " dataset_format=DatasetFormat.csv(header=False),\n", + "\n", + " # We need to specify the problem type and the fields where the prediction\n", + " # and groundtruth are so the process knows how to interpret the results.\n", + " problem_type=\"MulticlassClassification\",\n", + " \n", + " # Since the data doesn't have headers, SageMaker will autocreate headers for it.\n", + " # _c0 corresponds to the first column, and _c1 corresponds to the second column.\n", + " ground_truth_attribute=\"_c0\",\n", + " inference_attribute=\"_c1\",\n", + " output_s3_uri=MODEL_QUALITY_LOCATION,\n", + " 
),\n", + " skip_check=True,\n", + " register_new_baseline=True,\n", + " cache_config=cache_config,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "693535ba-fca7-4e89-a4cb-b4f333fa2d03", + "metadata": {}, + "source": [ + "### Step 4 - Setting up Model Metrics\n", + "\n", + "We can configure a new set of [ModelMetrics](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_metrics.ModelMetrics) using the results of the Data and Model Quality Steps. Check [Baseline and model version lifecycle and evolution with SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-quality-clarify-baseline-lifecycle.html#pipelines-quality-clarify-baseline-evolution) for an explanation of how SageMaker uses the `DriftCheckBaselines`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 726, + "id": "a773f134-ac2f-4dba-976e-9b7f0b384b6e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from sagemaker.drift_check_baselines import DriftCheckBaselines\n", + "\n", + "model_metrics = ModelMetrics(\n", + " model_data_statistics=MetricsSource(\n", + " s3_uri=data_quality_baseline_step.properties.CalculatedBaselineStatistics,\n", + " content_type=\"application/json\",\n", + " ),\n", + " model_data_constraints=MetricsSource(\n", + " s3_uri=data_quality_baseline_step.properties.CalculatedBaselineConstraints,\n", + " content_type=\"application/json\",\n", + " ),\n", + " model_statistics=MetricsSource(\n", + " s3_uri=model_quality_baseline_step.properties.CalculatedBaselineStatistics,\n", + " content_type=\"application/json\",\n", + " ),\n", + " model_constraints=MetricsSource(\n", + " s3_uri=model_quality_baseline_step.properties.CalculatedBaselineConstraints,\n", + " content_type=\"application/json\",\n", + " ),\n", + ")\n", + "\n", + "drift_check_baselines = DriftCheckBaselines(\n", + " model_data_statistics=MetricsSource(\n", + " 
s3_uri=data_quality_baseline_step.properties.BaselineUsedForDriftCheckStatistics,\n", + " content_type=\"application/json\",\n", + " ),\n", + " model_data_constraints=MetricsSource(\n", + " s3_uri=data_quality_baseline_step.properties.BaselineUsedForDriftCheckConstraints,\n", + " content_type=\"application/json\",\n", + " ),\n", + " model_statistics=MetricsSource(\n", + " s3_uri=model_quality_baseline_step.properties.BaselineUsedForDriftCheckStatistics,\n", + " content_type=\"application/json\",\n", + " ),\n", + " model_constraints=MetricsSource(\n", + " s3_uri=model_quality_baseline_step.properties.BaselineUsedForDriftCheckConstraints,\n", + " content_type=\"application/json\",\n", + " ),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "ba3487a0-05ad-4f3a-8f50-9884dc2aef64", + "metadata": {}, + "source": [ + "### Step 5 - Modifying the Registration Step\n", + "\n", + "Since we want to register the model using the new metrics, we need to modify the Registration Step to use the new metrics:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 727, + "id": "7056a009-91c0-4955-90dd-b90ef8cab149", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "register_model_step = ModelStep(\n", + " name=\"register\",\n", + " display_name=\"register-model\",\n", + " step_args=pipeline_model.register(\n", + " model_package_group_name=PIPELINE_MODEL_PACKAGE_GROUP,\n", + " model_metrics=model_metrics,\n", + " drift_check_baselines=drift_check_baselines,\n", + " approval_status=\"PendingManualApproval\",\n", + " content_types=[\"text/csv\", \"application/json\"],\n", + " response_types=[\"text/csv\", \"application/json\"],\n", + " inference_instances=[\"ml.m5.xlarge\"],\n", + " transform_instances=[\"ml.m5.xlarge\"],\n", + " domain=\"MACHINE_LEARNING\",\n", + " task=\"CLASSIFICATION\",\n", + " framework=\"TENSORFLOW\",\n", + " framework_version=config[\"framework_version\"],\n", + " ),\n", + ")" + 
] + }, + { + "cell_type": "markdown", + "id": "0d00b5e6-9858-4acc-bbfe-a2ce24ec20e0", + "metadata": {}, + "source": [ + "### Step 6 - Modifying the Condition Step\n", + "\n", + "Since we modified the registration step and added a few more steps, we need to modify the Condition Step. Now, we want to generate the test predictions and compute the model quality baseline if the condition is successful:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 728, + "id": "bacaa9c6-22b0-48df-b138-95b6422fe834", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "condition_step = ConditionStep(\n", + " name=\"check-model-accuracy\",\n", + " conditions=[condition],\n", + " if_steps=[\n", + " create_model_step,\n", + " generate_test_predictions_step,\n", + " model_quality_baseline_step,\n", + " register_model_step,\n", + " ]\n", + " if not LOCAL_MODE\n", + " else [],\n", + " else_steps=[fail_step],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c95a7905-2550-4979-b885-f2daabb5d45e", + "metadata": {}, + "source": [ + "### Step 7 - Creating the Pipeline\n", + "\n", + "We can now define the SageMaker Pipeline and submit its definition to the SageMaker Pipelines service to create the pipeline if it doesn't exist or update it if it does.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 729, + "id": "4da5e453-acd8-47a0-a39f-264d05dd93d0", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session5-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", 
+ "output_type": "stream", + "text": [ + "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session5-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n", + "WARNING:sagemaker.workflow._utils:Popping out 'CertifyForMarketplace' from the pipeline definition since it will be overridden in pipeline execution time.\n", + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n", + "Using provided s3_resource\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session5-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", + "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session5-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n" + ] + }, + { + "data": { + "text/plain": [ + "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session5-pipeline',\n", + " 'ResponseMetadata': {'RequestId': 'e104a5af-2148-4ab4-85b3-af898d3bd315',\n", + " 'HTTPStatusCode': 200,\n", + " 'HTTPHeaders': {'x-amzn-requestid': 'e104a5af-2148-4ab4-85b3-af898d3bd315',\n", + " 'content-type': 'application/x-amz-json-1.1',\n", + " 'content-length': '85',\n", + " 'date': 'Fri, 27 Oct 2023 14:38:52 GMT'},\n", + " 'RetryAttempts': 0}}" + ] + }, + "execution_count": 729, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "session5_pipeline = Pipeline(\n", + " name=\"session5-pipeline\",\n", + " parameters=[dataset_location, accuracy_threshold],\n", + " steps=[\n", + " split_and_transform_data_step,\n", + " tune_model_step if USE_TUNING_STEP else train_model_step,\n", + " evaluate_model_step,\n", + " data_quality_baseline_step,\n", + " condition_step,\n", + " ],\n", + " 
pipeline_definition_config=pipeline_definition_config,\n", + " sagemaker_session=config[\"session\"],\n", + ")\n", + "\n", + "session5_pipeline.upsert(role_arn=role)" + ] + }, + { + "cell_type": "markdown", + "id": "9e6b1b39", + "metadata": {}, + "source": [ + "We can now start the pipeline:\n" + ] + }, + { + "cell_type": "markdown", + "id": "9d6e5995", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 739, + "id": "10ba9909", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "_PipelineExecution(arn='arn:aws:sagemaker:us-east-1:325223348818:pipeline/session5-pipeline/execution/ifgn9itt6qcy', sagemaker_session=)" + ] + }, + "execution_count": 739, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# %%script false --no-raise-error\n", + "\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "session5_pipeline.start()" + ] + }, + { + "cell_type": "markdown", + "id": "6fd182a9", + "metadata": {}, + "source": [ + "### Step 8 - Checking Constraints and Statistics\n", + "\n", + "Our pipeline generated data baseline statistics and constraints. We can take a look at what these values look like by downloading them from S3. You need to wait for the pipeline to finish running before these files are available.\n", + "\n", + "Here are the data quality statistics:" + ] + }, + { + "cell_type": "code", + "execution_count": 752, + "id": "42daa82b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"name\": \"island\",\n", + " \"inferred_type\": \"String\",\n", + " \"string_statistics\": {\n", + " \"common\": {\n", + " \"num_present\": 232,\n", + " \"num_missing\": 0\n", + " },\n", + " \"distinct_count\": 3.0,\n", + " \"distribution\": {\n", + " \"categorical\": {\n", + " \"buckets\": [\n", + " {\n", + " \"value\": \"Dream\",\n", + " \"count\": 89\n", + " },\n", + " {\n", + " \"value\": \"Torgersen\",\n", + " \"count\": 24\n", + " },\n", + " {\n", + " \"value\": \"Biscoe\",\n", + " \"count\": 119\n", + " }\n", + " ]\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "from sagemaker.s3 import S3Downloader\n", + "\n", + "try:\n", + " response = json.loads(\n", + " S3Downloader.read_file(f\"{DATA_QUALITY_LOCATION}/statistics.json\")\n", + " )\n", + " 
print(json.dumps(response[\"features\"][1], indent=2))\n", + "except Exception as e:\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "id": "8104ad3c", + "metadata": {}, + "source": [ + "Here are the data quality constraints:" + ] + }, + { + "cell_type": "code", + "execution_count": 753, + "id": "898d9626", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"name\": \"island\",\n", + " \"inferred_type\": \"String\",\n", + " \"completeness\": 1.0,\n", + " \"string_constraints\": {\n", + " \"domains\": [\n", + " \"Dream\",\n", + " \"Torgersen\",\n", + " \"Biscoe\"\n", + " ]\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "try:\n", + " response = json.loads(S3Downloader.read_file(f\"{DATA_QUALITY_LOCATION}/constraints.json\"))\n", + " print(json.dumps(response[\"features\"][1], indent=2))\n", + "except Exception as e:\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "id": "35eaf9af", + "metadata": {}, + "source": [ + "And here are the model quality constraints:" + ] + }, + { + "cell_type": "code", + "execution_count": 754, + "id": "2df52332", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"version\": 0.0,\n", + " \"multiclass_classification_constraints\": {\n", + " \"accuracy\": {\n", + " \"threshold\": 0.9259259259259259,\n", + " \"comparison_operator\": \"LessThanThreshold\"\n", + " },\n", + " \"weighted_recall\": {\n", + " \"threshold\": 0.9259259259259259,\n", + " \"comparison_operator\": \"LessThanThreshold\"\n", + " },\n", + " \"weighted_precision\": {\n", + " \"threshold\": 0.933862433862434,\n", + " \"comparison_operator\": \"LessThanThreshold\"\n", + " },\n", + " \"weighted_f0_5\": {\n", + " \"threshold\": 0.928855833521148,\n", + " \"comparison_operator\": \"LessThanThreshold\"\n", + " },\n", + " \"weighted_f1\": {\n", + " \"threshold\": 0.9247293447293448,\n", + " \"comparison_operator\": \"LessThanThreshold\"\n", + " 
},\n", + " \"weighted_f2\": {\n", + " \"threshold\": 0.9242942991137502,\n", + " \"comparison_operator\": \"LessThanThreshold\"\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "try:\n", + " response = json.loads(S3Downloader.read_file(f\"{MODEL_QUALITY_LOCATION}/constraints.json\"))\n", + " print(json.dumps(response, indent=2))\n", + "except Exception as e:\n", + " pass" + ] + }, + { + "cell_type": "markdown", + "id": "b948aa92-8064-4f03-af08-0f6a8fc329cf", + "metadata": {}, + "source": [ + "### Step 9 - Generating Fake Traffic\n", + "\n", + "To test the monitoring functionality, we need to generate traffic to the endpoint. To generate traffic, we will send every sample from the dataset to the endpoint to simulate real prediction requests:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 755, + "id": "c658bad0", + "metadata": {}, + "outputs": [], + "source": [ + "# | code: true\n", + "# | output: false\n", + "\n", + "from sagemaker.serializers import JSONSerializer\n", + "\n", + "data = penguins.drop([\"species\"], axis=1)\n", + "data = data.dropna()\n", + "\n", + "predictor = Predictor(\n", + " endpoint_name=ENDPOINT,\n", + " serializer=JSONSerializer(),\n", + " sagemaker_session=sagemaker_session,\n", + ")\n", + "\n", + "for index, row in data.iterrows():\n", + " try:\n", + " predictor.predict(row.to_dict(), inference_id=str(index))\n", + " except Exception as e:\n", + " print(e)\n", + " break" + ] + }, + { + "cell_type": "markdown", + "id": "0d3f61b9", + "metadata": {}, + "source": [ + "We can check the location where the endpoint stores the captured data, download a file, and display its content. It may take a few minutes for the first few files to show up in S3.\n", + "\n", + "These files contain the data captured by the endpoint in a SageMaker-specific JSON-line format. Each inference request is captured in a single line in the `jsonl` file. 
The line contains both the input and output merged together:" + ] + }, + { + "cell_type": "code", + "execution_count": 756, + "id": "3f35e8db-24d7-4d4b-9264-78ee5070cf27", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"captureData\": {\n", + " \"endpointInput\": {\n", + " \"observedContentType\": \"application/json\",\n", + " \"mode\": \"INPUT\",\n", + " \"data\": \"{\\\"island\\\": \\\"Torgersen\\\", \\\"culmen_length_mm\\\": 39.1, \\\"culmen_depth_mm\\\": 18.7, \\\"flipper_length_mm\\\": 181.0, \\\"body_mass_g\\\": 3750.0, \\\"sex\\\": \\\"MALE\\\"}\",\n", + " \"encoding\": \"JSON\"\n", + " },\n", + " \"endpointOutput\": {\n", + " \"observedContentType\": \"application/json\",\n", + " \"mode\": \"OUTPUT\",\n", + " \"data\": \"{\\\"prediction\\\": \\\"Adelie\\\", \\\"confidence\\\": 0.953110516}\",\n", + " \"encoding\": \"JSON\"\n", + " }\n", + " },\n", + " \"eventMetadata\": {\n", + " \"eventId\": \"ddf80c99-e582-4243-9309-4bc9085c01ec\",\n", + " \"inferenceId\": \"0\",\n", + " \"inferenceTime\": \"2023-10-24T19:10:30Z\"\n", + " },\n", + " \"eventVersion\": \"0\"\n", + "}\n" + ] + } + ], + "source": [ + "files = S3Downloader.list(DATA_CAPTURE_DESTINATION)[:3]\n", + "if len(files):\n", + " lines = S3Downloader.read_file(files[0])\n", + " print(json.dumps(json.loads(lines.split(\"\\n\")[0]), indent=2))" + ] + }, + { + "cell_type": "markdown", + "id": "5754a314-3bc0-4b41-8767-e9f06d96d250", + "metadata": {}, + "source": [ + "### Step 10 - Generating Fake Labels\n", + "\n", + "To test the performance of the model, we need to label the samples captured by the endpoint. 
We can simulate the labeling process by generating a random label for every sample. Check [Ingest Ground Truth Labels and Merge Them With Predictions](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-merge.html) for more information about this.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 757, + "id": "bb999995", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'s3://mlschool/penguins/monitoring/groundtruth/2023/10/27/17/0816.jsonl'" + ] + }, + "execution_count": 757, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#| code: true\n", + "#| output: false\n", + "\n", + "import random\n", + "from datetime import datetime\n", + "from sagemaker.s3 import S3Uploader\n", + "\n", + "records = []\n", + "# Iterate over the same index values we sent as `inference_id` in Step 9.\n", + "# After `dropna()`, the index may have gaps, so `range(len(data))` would\n", + "# produce identifiers that don't match the captured predictions.\n", + "for inference_id in data.index:\n", + " random.seed(inference_id)\n", + "\n", + " records.append(json.dumps({\n", + " \"groundTruthData\": {\n", + " \"data\": random.choice([\"Adelie\", \"Chinstrap\", \"Gentoo\"]),\n", + " \"encoding\": \"CSV\",\n", + " },\n", + " \"eventMetadata\": {\n", + " \"eventId\": str(inference_id),\n", + " },\n", + " \"eventVersion\": \"0\",\n", + " }))\n", + "\n", + "groundtruth_payload = \"\\n\".join(records)\n", + "upload_time = datetime.utcnow()\n", + "uri = f\"{GROUND_TRUTH_LOCATION}/{upload_time:%Y/%m/%d/%H/%M%S}.jsonl\"\n", + "S3Uploader.upload_string_as_file_body(groundtruth_payload, uri)" + ] + }, + { + "cell_type": "markdown", + "id": "a65bd669", + "metadata": {}, + "source": [ + "### Step 11 - Preparing Monitoring Functions\n", + "\n", + "Let's create a few functions that will help us work with monitoring schedules later on:" + ] + }, + { + "cell_type": "code", + "execution_count": 758, + "id": "da145ba1-4966-4dab-8a73-281db364cbc7", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "from sagemaker.model_monitor import MonitoringExecution\n", + "\n", + "\n", + "def describe_monitoring_schedules(endpoint_name):\n", + " 
schedules = []\n", + " response = sagemaker_client.list_monitoring_schedules(EndpointName=endpoint_name)[\n", + " \"MonitoringScheduleSummaries\"\n", + " ]\n", + " for item in response:\n", + " name = item[\"MonitoringScheduleName\"]\n", + " schedule = {\n", + " \"MonitoringScheduleName\": name,\n", + " \"MonitoringType\": item[\"MonitoringType\"],\n", + " }\n", + "\n", + " description = sagemaker_client.describe_monitoring_schedule(\n", + " MonitoringScheduleName=name\n", + " )\n", + "\n", + " schedule[\"Status\"] = description[\"LastMonitoringExecutionSummary\"][\n", + " \"MonitoringExecutionStatus\"\n", + " ]\n", + "\n", + " if schedule[\"Status\"] == \"Failed\":\n", + " schedule[\"FailureReason\"] = description[\"LastMonitoringExecutionSummary\"][\n", + " \"FailureReason\"\n", + " ]\n", + " elif schedule[\"Status\"] == \"CompletedWithViolations\":\n", + " processing_job_arn = description[\"LastMonitoringExecutionSummary\"][\n", + " \"ProcessingJobArn\"\n", + " ]\n", + " execution = MonitoringExecution.from_processing_arn(\n", + " sagemaker_session=sagemaker_session,\n", + " processing_job_arn=processing_job_arn,\n", + " )\n", + " execution_destination = execution.output.destination\n", + "\n", + " violations_filepath = os.path.join(\n", + " execution_destination, \"constraint_violations.json\"\n", + " )\n", + " violations = json.loads(S3Downloader.read_file(violations_filepath))[\n", + " \"violations\"\n", + " ]\n", + "\n", + " schedule[\"Violations\"] = violations\n", + "\n", + " schedules.append(schedule)\n", + "\n", + " return schedules\n", + "\n", + "\n", + "def describe_monitoring_schedule(endpoint_name, monitoring_type):\n", + " found = False\n", + "\n", + " schedules = describe_monitoring_schedules(endpoint_name)\n", + " for schedule in schedules:\n", + " if schedule[\"MonitoringType\"] == monitoring_type:\n", + " found = True\n", + " print(json.dumps(schedule, indent=2))\n", + "\n", + " if not found:\n", + " print(f\"There's no {monitoring_type} 
Monitoring Schedule.\")\n", + "\n", + "\n", + "def describe_data_monitoring_schedule(endpoint_name):\n", + " describe_monitoring_schedule(endpoint_name, \"DataQuality\")\n", + "\n", + "\n", + "def describe_model_monitoring_schedule(endpoint_name):\n", + " describe_monitoring_schedule(endpoint_name, \"ModelQuality\")\n", + "\n", + "\n", + "def delete_monitoring_schedule(endpoint_name, monitoring_type):\n", + " attempts = 30\n", + " found = False\n", + "\n", + " response = sagemaker_client.list_monitoring_schedules(EndpointName=endpoint_name)[\n", + " \"MonitoringScheduleSummaries\"\n", + " ]\n", + " for item in response:\n", + " if item[\"MonitoringType\"] == monitoring_type:\n", + " found = True\n", + " status = sagemaker_client.describe_monitoring_schedule(\n", + " MonitoringScheduleName=item[\"MonitoringScheduleName\"]\n", + " )[\"MonitoringScheduleStatus\"]\n", + " while status in (\"Pending\", \"InProgress\") and attempts > 0:\n", + " attempts -= 1\n", + " print(\n", + " f\"Monitoring schedule status: {status}. 
Waiting for it to finish.\"\n", + " )\n", + " sleep(30)\n", + "\n", + " status = sagemaker_client.describe_monitoring_schedule(\n", + " MonitoringScheduleName=item[\"MonitoringScheduleName\"]\n", + " )[\"MonitoringScheduleStatus\"]\n", + "\n", + " if status not in (\"Pending\", \"InProgress\"):\n", + " sagemaker_client.delete_monitoring_schedule(\n", + " MonitoringScheduleName=item[\"MonitoringScheduleName\"]\n", + " )\n", + " print(\"Monitoring schedule deleted.\")\n", + " else:\n", + " print(\"Waiting for monitoring schedule timed out\")\n", + "\n", + " if not found:\n", + " print(f\"There's no {monitoring_type} Monitoring Schedule.\")\n", + "\n", + "\n", + "def delete_data_monitoring_schedule(endpoint_name):\n", + " delete_monitoring_schedule(endpoint_name, \"DataQuality\")\n", + "\n", + "\n", + "def delete_model_monitoring_schedule(endpoint_name):\n", + " delete_monitoring_schedule(endpoint_name, \"ModelQuality\")" + ] + }, + { + "cell_type": "markdown", + "id": "d936df76-e0b8-4dad-a04f-ef77ce2a2df1", + "metadata": {}, + "source": [ + "### Step 12 - Setting Up Data Monitoring Job\n", + "\n", + "SageMaker looks for violations in the data captured by the endpoint. By default, it combines the input data with the endpoint output and compares the result with the baseline we generated. If we let SageMaker do this, we will get a few violations, for example an \"extra column check\" violation because the field `confidence` doesn't exist in the baseline data.\n", + "\n", + "We can fix these violations by creating a preprocessing script configuring the data we want the monitoring job to use. 
Check [Preprocessing and Postprocessing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html) for more information about how to configure these scripts.\n", + "\n", + "Let's define the name of the preprocessing script:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 759, + "id": "cc119422-2e85-4e8c-86cd-6d59e353d09d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "DATA_QUALITY_PREPROCESSOR = \"data_quality_preprocessor.py\"" + ] + }, + { + "cell_type": "markdown", + "id": "72c1023e", + "metadata": {}, + "source": [ + "We can now define the preprocessing script. Notice that this script will return the input data the endpoint receives with a new `species` column containing the prediction of the model:" + ] + }, + { + "cell_type": "code", + "execution_count": 760, + "id": "083b0bd0-4035-43fe-9b2c-946b12a5e266", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Overwriting code/data_quality_preprocessor.py\n" + ] + } + ], + "source": [ + "%%writefile {CODE_FOLDER}/{DATA_QUALITY_PREPROCESSOR}\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "import json\n", + "\n", + "def preprocess_handler(inference_record):\n", + " input_data = inference_record.endpoint_input.data\n", + " output_data = json.loads(inference_record.endpoint_output.data)\n", + "\n", + " response = json.loads(input_data)\n", + " response[\"species\"] = output_data[\"prediction\"]\n", + "\n", + " # The `response` variable contains the data that we want the\n", + " # monitoring job to use to compare with the baseline.\n", + " return response" + ] + }, + { + "cell_type": "markdown", + "id": "840d54c5-f09c-4559-a1d2-63587da0ad14", + "metadata": {}, + "source": [ + "The monitoring schedule expects an S3 location pointing to the preprocessing script. 
Let's upload the script to the default bucket.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 761, + "id": "96e5c0c1-7e40-47df-8f40-1d891db13875", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials\n" + ] + } + ], + "source": [ + "#| code: true\n", + "#| output: false\n", + "\n", + "bucket = boto3.Session().resource(\"s3\").Bucket(pipeline_session.default_bucket())\n", + "prefix = \"penguins-monitoring\"\n", + "bucket.Object(os.path.join(prefix, DATA_QUALITY_PREPROCESSOR)).upload_file(\n", + " str(CODE_FOLDER / DATA_QUALITY_PREPROCESSOR)\n", + ")\n", + "data_quality_preprocessor = (\n", + " f\"s3://{os.path.join(bucket.name, prefix, DATA_QUALITY_PREPROCESSOR)}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "56e107eb-546d-431c-b74d-1bfd412711b7", + "metadata": {}, + "source": [ + "We can now set up the Data Quality Monitoring Job using the [DefaultModelMonitor](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.DefaultModelMonitor) class. Notice how we specify the `record_preprocessor_script` using the S3 location where we uploaded our script." + ] + }, + { + "cell_type": "markdown", + "id": "e653b628", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15caf9e1-97fc-4379-893b-6062d4bd876e", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# %%script false --no-raise-error\n", + "#| code: true\n", + "#| output: false\n", + "#| eval: false\n", + "\n", + "from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor\n", + "\n", + "data_monitor = DefaultModelMonitor(\n", + " instance_type=\"ml.m5.xlarge\",\n", + " instance_count=1,\n", + " max_runtime_in_seconds=3600,\n", + " role=role,\n", + ")\n", + "\n", + "data_monitor.create_monitoring_schedule(\n", + " monitor_schedule_name=\"penguins-data-monitoring-schedule\",\n", + " endpoint_input=ENDPOINT,\n", + " record_preprocessor_script=data_quality_preprocessor,\n", + " statistics=f\"{DATA_QUALITY_LOCATION}/statistics.json\",\n", + " constraints=f\"{DATA_QUALITY_LOCATION}/constraints.json\",\n", + " schedule_cron_expression=CronExpressionGenerator.hourly(),\n", + " output_s3_uri=DATA_QUALITY_LOCATION,\n", + " enable_cloudwatch_metrics=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "018800f7-315f-4f5e-b082-ba94bbde91ad", + "metadata": {}, + "source": [ + "We can check the results of the monitoring job by looking at whether it generated any violations:" + ] + }, + { + "cell_type": "code", + "execution_count": 781, + "id": "2c04fdd4-cc03-496c-a0a1-405854505c46", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"MonitoringScheduleName\": \"penguins-data-monitoring-schedule\",\n", + " \"MonitoringType\": \"DataQuality\",\n", + " \"Status\": \"Failed\",\n", + " \"FailureReason\": \"Job inputs had no data\"\n", + "}\n" + ] + } + ], + "source": [ + "describe_data_monitoring_schedule(ENDPOINT)" + ] + }, + { + "cell_type": "markdown", + "id": "3a9d201d-f60f-49f2-b4e9-eb0a0159ecfd", + "metadata": {}, + "source": [ + "### Step 13 - Setting up Model Monitoring Job\n", + "\n", + 
"To set up a Model Quality Monitoring Job, we can use the [ModelQualityMonitor](https://sagemaker.readthedocs.io/en/stable/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.ModelQualityMonitor) class. The [EndpointInput](https://sagemaker.readthedocs.io/en/v2.24.2/api/inference/model_monitor.html#sagemaker.model_monitor.model_monitoring.EndpointInput) instance configures the attribute the monitoring job should use to determine the prediction from the model.\n", + "\n", + "Check [Amazon SageMaker Model Quality Monitor](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_model_monitor/model_quality/model_quality_churn_sdk.html) for a complete tutorial on how to run a Model Monitoring Job in SageMaker." + ] + }, + { + "cell_type": "markdown", + "id": "9d217afd", + "metadata": {}, + "source": [ + "We can now start the Model Quality Monitoring Job:" + ] + }, + { + "cell_type": "markdown", + "id": "cd771884", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 388, + "id": "070e0d73-5375-4fc3-b94c-da0574600c05", + "metadata": {}, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "#| code: true\n", + "#| output: false\n", + "#| eval: false\n", + "from sagemaker.model_monitor import ModelQualityMonitor, EndpointInput\n", + "\n", + "model_monitor = ModelQualityMonitor(\n", + " instance_type=\"ml.m5.xlarge\",\n", + " instance_count=1,\n", + " max_runtime_in_seconds=1800,\n", + " role=role\n", + ")\n", + "\n", + "model_monitor.create_monitoring_schedule(\n", + " monitor_schedule_name=\"penguins-model-monitoring-schedule\",\n", + " \n", + " endpoint_input = EndpointInput(\n", + " endpoint_name=ENDPOINT,\n", + " inference_attribute=\"prediction\",\n", + " destination=\"/opt/ml/processing/input_data\",\n", + " ),\n", + " \n", + " problem_type=\"MulticlassClassification\",\n", + " ground_truth_input=GROUND_TRUTH_LOCATION,\n", + " constraints=f\"{MODEL_QUALITY_LOCATION}/constraints.json\",\n", + " schedule_cron_expression=CronExpressionGenerator.hourly(),\n", + " output_s3_uri=MODEL_QUALITY_LOCATION,\n", + " enable_cloudwatch_metrics=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8d9e523e-49c5-4382-b28a-cdbece9bd0e0", + "metadata": {}, + "source": [ + "We can check the results of the monitoring job by looking at whether it generated any violations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 402, + "id": "347de298-16f2-42e0-85c4-dfc916080020", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"MonitoringScheduleName\": \"penguins-model-monitoring-schedule\",\n", + " \"MonitoringType\": \"ModelQuality\",\n", + " \"Status\": \"CompletedWithViolations\",\n", + " \"Violations\": [\n", + " {\n", + " \"constraint_check_type\": \"LessThanThreshold\",\n", + " \"description\": \"Metric weightedF2 with 0.3505018546481581 +/- 0.004778110439777429 was 
LessThanThreshold '0.9242942991137502'\",\n", + " \"metric_name\": \"weightedF2\"\n", + " },\n", + " {\n", + " \"constraint_check_type\": \"LessThanThreshold\",\n", + " \"description\": \"Metric accuracy with 0.35755813953488375 +/- 0.004625699974871179 was LessThanThreshold '0.9259259259259259'\",\n", + " \"metric_name\": \"accuracy\"\n", + " },\n", + " {\n", + " \"constraint_check_type\": \"LessThanThreshold\",\n", + " \"description\": \"Metric weightedRecall with 0.3575581395348837 +/- 0.004625699974871179 was LessThanThreshold '0.9259259259259259'\",\n", + " \"metric_name\": \"weightedRecall\"\n", + " },\n", + " {\n", + " \"constraint_check_type\": \"LessThanThreshold\",\n", + " \"description\": \"Metric weightedPrecision with 0.35662633279042494 +/- 0.005592963346101618 was LessThanThreshold '0.933862433862434'\",\n", + " \"metric_name\": \"weightedPrecision\"\n", + " },\n", + " {\n", + " \"constraint_check_type\": \"LessThanThreshold\",\n", + " \"description\": \"Metric weightedF1 with 0.34519661584972283 +/- 0.004997774377359799 was LessThanThreshold '0.9247293447293448'\",\n", + " \"metric_name\": \"weightedF1\"\n", + " }\n", + " ]\n", + "}\n" + ] + } + ], + "source": [ + "describe_model_monitoring_schedule(ENDPOINT)" + ] + }, + { + "cell_type": "markdown", + "id": "38c3d9f6", + "metadata": {}, + "source": [ + "### Step 14 - Tearing Down Resources\n", + "\n", + "The following code deletes the Data Quality and Model Quality monitoring schedules, which stops the monitoring jobs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 783, + "id": "bb74dc04-54a1-4a3f-854f-4877f7f0b4a1", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Monitoring schedule deleted.\n", + "There's no ModelQuality Monitoring Schedule.\n" + ] + } + ], + "source": [ + "#| code: true\n", + "#| output: false\n", + "\n", + "delete_data_monitoring_schedule(ENDPOINT)\n", + "delete_model_monitoring_schedule(ENDPOINT)" + ] + }, + { + "cell_type": "markdown", + "id": "c97e5419", + "metadata": {}, + "source": [ + "Let's delete the endpoint:" + ] + }, + { + "cell_type": "markdown", + "id": "307f5062", + "metadata": {}, + "source": [ + "#| hide\n", + "\n", + "
Note: \n", + " The %%script cell magic is a convenient way to prevent the notebook from executing a specific cell. If you want to run the cell, comment out the line containing the %%script cell magic.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9eabe84e", + "metadata": {}, + "outputs": [], + "source": [ + "%%script false --no-raise-error\n", + "#| eval: false\n", + "#| code: true\n", + "#| output: false\n", + "\n", + "predictor.delete_endpoint()" + ] + }, + { + "cell_type": "markdown", + "id": "db0d6d8d-791c-4ae0-ba79-e0da33d0ece2", + "metadata": {}, + "source": [ + "### Assignments\n", + "\n", + "* Assignment 5.1 You can visualize the results of your monitoring jobs in Amazon SageMaker Studio. Go to your endpoint, and visit the Data quality and Model quality tabs. View the details of your monitoring jobs, and create a few charts to explore the baseline and the captured values for any metric that the monitoring job calculates.\n", + "\n", + "* Assignment 5.2 The QualityCheck Step runs a processing job to compute baseline statistics and constraints from the input dataset. We configured the pipeline to generate the initial baselines every time it runs. Modify the code to prevent the pipeline from registering a new version of the model if the dataset violates the baseline of the previous model version. You can configure the QualityCheck Step to accomplish this.\n", + "\n", + "* Assignment 5.3 We are generating predictions for the test set twice during the execution of our pipeline. First, during the Evaluation Step, and then using a Transform Step in anticipation of generating the baseline to monitor the model. Modify the Evaluation Step so it reuses the model performance computed by the QualityCheck Step instead of generating predictions again.\n", + "\n", + "* Assignment 5.4 [Evidently AI](https://evidentlyai.com/) is an open-source Machine Learning observability platform that you can use to evaluate, test, and monitor models. 
For this assignment, integrate the endpoint we built with Evidently AI to use its capabilities to monitor the model.\n", + "\n", + "* Assignment 5.5 Instead of running the entire pipeline from start to finish, sometimes you may only need to iterate over particular steps. SageMaker Pipelines supports [Selective Execution for Pipeline Steps](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-selective-ex.html). In this assignment you will use Selective Execution to only run one specific step of the pipeline. [Unlocking efficiency: Harnessing the power of Selective Execution in Amazon SageMaker Pipelines](https://aws.amazon.com/blogs/machine-learning/unlocking-efficiency-harnessing-the-power-of-selective-execution-in-amazon-sagemaker-pipelines/) is a great article that explains this feature." + ] + } + ], + "metadata": { + "availableInstances": [ + { + "_defaultOrder": 0, + "_isFastLaunch": true, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 4, + "name": "ml.t3.medium", + "vcpuNum": 2 + }, + { + "_defaultOrder": 1, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.t3.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 2, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.t3.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 3, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.t3.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 4, + "_isFastLaunch": true, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.m5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 5, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": 
"ml.m5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 6, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.m5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 7, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.m5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 8, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.m5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 9, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.m5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 10, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.m5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 11, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.m5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 12, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.m5d.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 13, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.m5d.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 14, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.m5d.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 15, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.m5d.4xlarge", + 
"vcpuNum": 16 + }, + { + "_defaultOrder": 16, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.m5d.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 17, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.m5d.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 18, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.m5d.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 19, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.m5d.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 20, + "_isFastLaunch": false, + "category": "General purpose", + "gpuNum": 0, + "hideHardwareSpecs": true, + "memoryGiB": 0, + "name": "ml.geospatial.interactive", + "supportedImageNames": [ + "sagemaker-geospatial-v1-0" + ], + "vcpuNum": 0 + }, + { + "_defaultOrder": 21, + "_isFastLaunch": true, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 4, + "name": "ml.c5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 22, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 8, + "name": "ml.c5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 23, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.c5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 24, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.c5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 25, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + 
"hideHardwareSpecs": false, + "memoryGiB": 72, + "name": "ml.c5.9xlarge", + "vcpuNum": 36 + }, + { + "_defaultOrder": 26, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 96, + "name": "ml.c5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 27, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 144, + "name": "ml.c5.18xlarge", + "vcpuNum": 72 + }, + { + "_defaultOrder": 28, + "_isFastLaunch": false, + "category": "Compute optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.c5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 29, + "_isFastLaunch": true, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.g4dn.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 30, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.g4dn.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 31, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.g4dn.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 32, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.g4dn.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 33, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.g4dn.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 34, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.g4dn.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 35, + "_isFastLaunch": false, + 
"category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 61, + "name": "ml.p3.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 36, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 244, + "name": "ml.p3.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 37, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 488, + "name": "ml.p3.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 38, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.p3dn.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 39, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.r5.large", + "vcpuNum": 2 + }, + { + "_defaultOrder": 40, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.r5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 41, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.r5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 42, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.r5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 43, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.r5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 44, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.r5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 45, + 
"_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 512, + "name": "ml.r5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 46, + "_isFastLaunch": false, + "category": "Memory Optimized", + "gpuNum": 0, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.r5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 47, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 16, + "name": "ml.g5.xlarge", + "vcpuNum": 4 + }, + { + "_defaultOrder": 48, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 32, + "name": "ml.g5.2xlarge", + "vcpuNum": 8 + }, + { + "_defaultOrder": 49, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 64, + "name": "ml.g5.4xlarge", + "vcpuNum": 16 + }, + { + "_defaultOrder": 50, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 128, + "name": "ml.g5.8xlarge", + "vcpuNum": 32 + }, + { + "_defaultOrder": 51, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 1, + "hideHardwareSpecs": false, + "memoryGiB": 256, + "name": "ml.g5.16xlarge", + "vcpuNum": 64 + }, + { + "_defaultOrder": 52, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 192, + "name": "ml.g5.12xlarge", + "vcpuNum": 48 + }, + { + "_defaultOrder": 53, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 4, + "hideHardwareSpecs": false, + "memoryGiB": 384, + "name": "ml.g5.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 54, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 768, + "name": "ml.g5.48xlarge", + 
"vcpuNum": 192 + }, + { + "_defaultOrder": 55, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 1152, + "name": "ml.p4d.24xlarge", + "vcpuNum": 96 + }, + { + "_defaultOrder": 56, + "_isFastLaunch": false, + "category": "Accelerated computing", + "gpuNum": 8, + "hideHardwareSpecs": false, + "memoryGiB": 1152, + "name": "ml.p4de.24xlarge", + "vcpuNum": 96 + } + ], + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.13" + }, + "lcc_arn": "arn:aws:sagemaker:us-east-1:325223348818:studio-lifecycle-config/packages", + "toc-autonumbering": false, + "toc-showcode": false, + "toc-showmarkdowntxt": false + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/program/cohort.ipynb b/program/cohort.ipynb index 8376552..14860f6 100644 --- a/program/cohort.ipynb +++ b/program/cohort.ipynb @@ -42,7 +42,7 @@ }, { "cell_type": "code", - "execution_count": 589, + "execution_count": 640, "id": "4b2265b0", "metadata": {}, "outputs": [ @@ -101,7 +101,7 @@ }, { "cell_type": "code", - "execution_count": 590, + "execution_count": 641, "id": "32c4d764", "metadata": {}, "outputs": [], @@ -119,7 +119,7 @@ }, { "cell_type": "code", - "execution_count": 591, + "execution_count": 642, "id": "3164a3af", "metadata": {}, "outputs": [], @@ -142,7 +142,7 @@ }, { "cell_type": "code", - "execution_count": 592, + "execution_count": 643, "id": "7bc40d28", "metadata": {}, "outputs": [], @@ -161,7 +161,7 @@ }, { "cell_type": "code", - "execution_count": 593, + "execution_count": 644, "id": "3b3f17e5", "metadata": {}, "outputs": [], @@ -201,7 +201,7 @@ }, { "cell_type": "code", - "execution_count": 594, + 
"execution_count": 645, "id": "942a01b5", "metadata": {}, "outputs": [], @@ -242,7 +242,7 @@ }, { "cell_type": "code", - "execution_count": 595, + "execution_count": 646, "id": "f1cd2f0e-446d-48a9-a008-b4f1cc593bfc", "metadata": { "tags": [] @@ -349,7 +349,7 @@ "4 3450.0 FEMALE " ] }, - "execution_count": 595, + "execution_count": 646, "metadata": {}, "output_type": "execute_result" } @@ -386,7 +386,7 @@ }, { "cell_type": "code", - "execution_count": 596, + "execution_count": 647, "id": "f2107c25-e730-4e22-a1b8-5bda53e61124", "metadata": { "tags": [] @@ -565,7 +565,7 @@ "max 6300.000000 NaN " ] }, - "execution_count": 596, + "execution_count": 647, "metadata": {}, "output_type": "execute_result" } @@ -584,7 +584,7 @@ }, { "cell_type": "code", - "execution_count": 597, + "execution_count": 648, "id": "1242122a-726e-4c37-a718-dd8e873d1612", "metadata": { "tags": [] @@ -642,7 +642,7 @@ }, { "cell_type": "code", - "execution_count": 598, + "execution_count": 649, "id": "cf1cf582-8831-4f83-bb17-2175afb193e8", "metadata": { "tags": [] @@ -657,7 +657,7 @@ "Name: count, dtype: int64" ] }, - "execution_count": 598, + "execution_count": 649, "metadata": {}, "output_type": "execute_result" } @@ -677,7 +677,7 @@ }, { "cell_type": "code", - "execution_count": 599, + "execution_count": 650, "id": "cc42cb08-275c-4b05-9d2b-77052da2f336", "metadata": { "tags": [] @@ -696,7 +696,7 @@ "dtype: int64" ] }, - "execution_count": 599, + "execution_count": 650, "metadata": {}, "output_type": "execute_result" } @@ -715,7 +715,7 @@ }, { "cell_type": "code", - "execution_count": 600, + "execution_count": 651, "id": "3c57d55d-afd6-467a-a7a8-ff04132770ed", "metadata": { "tags": [] @@ -734,7 +734,7 @@ "dtype: int64" ] }, - "execution_count": 600, + "execution_count": 651, "metadata": {}, "output_type": "execute_result" } @@ -757,7 +757,7 @@ }, { "cell_type": "code", - "execution_count": 601, + "execution_count": 652, "id": "2852c740", "metadata": {}, "outputs": [ @@ -803,7 +803,7 @@ }, { 
"cell_type": "code", - "execution_count": 602, + "execution_count": 653, "id": "707cc972", "metadata": {}, "outputs": [ @@ -851,7 +851,7 @@ }, { "cell_type": "code", - "execution_count": 603, + "execution_count": 654, "id": "3daf3ba1-d218-4ad4-b862-af679b91273f", "metadata": { "tags": [] @@ -931,7 +931,7 @@ "body_mass_g 640316.716388 " ] }, - "execution_count": 603, + "execution_count": 654, "metadata": {}, "output_type": "execute_result" } @@ -956,7 +956,7 @@ }, { "cell_type": "code", - "execution_count": 604, + "execution_count": 655, "id": "1d793e09-2cb9-47ff-a0e6-199a0f4fc1b3", "metadata": { "tags": [] @@ -1036,7 +1036,7 @@ "body_mass_g 1.000000 " ] }, - "execution_count": 604, + "execution_count": 655, "metadata": {}, "output_type": "execute_result" } @@ -1061,7 +1061,7 @@ }, { "cell_type": "code", - "execution_count": 605, + "execution_count": 656, "id": "1258c99d", "metadata": {}, "outputs": [ @@ -1101,7 +1101,7 @@ }, { "cell_type": "code", - "execution_count": 606, + "execution_count": 657, "id": "45b0a87f-028d-477f-9b65-199728c0b7ee", "metadata": { "tags": [] @@ -1155,7 +1155,7 @@ }, { "cell_type": "code", - "execution_count": 607, + "execution_count": 658, "id": "fb6ba7c0-1bd6-4fe5-8b7f-f6cbdfd3846c", "metadata": { "tags": [] @@ -1351,7 +1351,7 @@ }, { "cell_type": "code", - "execution_count": 608, + "execution_count": 659, "id": "d1f122a4-acff-4687-91b9-bfef13567d88", "metadata": { "tags": [] @@ -1362,7 +1362,7 @@ "output_type": "stream", "text": [ "\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\u001b[32m.\u001b[0m\n", - "\u001b[32m\u001b[32m\u001b[1m8 passed\u001b[0m\u001b[32m in 0.17s\u001b[0m\u001b[0m\n" + "\u001b[32m\u001b[32m\u001b[1m8 passed\u001b[0m\u001b[32m in 0.16s\u001b[0m\u001b[0m\n" ] } ], @@ -1489,7 +1489,7 @@ }, { "cell_type": "code", - "execution_count": 609, + "execution_count": 660, "id": "d88e9ccf", "metadata": {}, "outputs": [], @@ -1509,7 
+1509,7 @@ }, { "cell_type": "code", - "execution_count": 610, + "execution_count": 661, "id": "331fe373", "metadata": {}, "outputs": [], @@ -1532,7 +1532,7 @@ }, { "cell_type": "code", - "execution_count": 611, + "execution_count": 662, "id": "3aa4471a", "metadata": {}, "outputs": [ @@ -1573,7 +1573,7 @@ }, { "cell_type": "code", - "execution_count": 612, + "execution_count": 663, "id": "cdbd9303", "metadata": { "tags": [] @@ -1654,7 +1654,7 @@ }, { "cell_type": "code", - "execution_count": 613, + "execution_count": 664, "id": "e140642a", "metadata": { "tags": [] @@ -1664,16 +1664,16 @@ "data": { "text/plain": [ "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session1-pipeline',\n", - " 'ResponseMetadata': {'RequestId': '10ebb4eb-c57e-4b9d-b94a-dc6307f0207e',\n", + " 'ResponseMetadata': {'RequestId': '02b62dd1-6de0-4723-9019-f4f72862ba5c',\n", " 'HTTPStatusCode': 200,\n", - " 'HTTPHeaders': {'x-amzn-requestid': '10ebb4eb-c57e-4b9d-b94a-dc6307f0207e',\n", + " 'HTTPHeaders': {'x-amzn-requestid': '02b62dd1-6de0-4723-9019-f4f72862ba5c',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '85',\n", - " 'date': 'Thu, 26 Oct 2023 18:42:58 GMT'},\n", + " 'date': 'Fri, 27 Oct 2023 14:38:36 GMT'},\n", " 'RetryAttempts': 0}}" ] }, - "execution_count": 613, + "execution_count": 664, "metadata": {}, "output_type": "execute_result" } @@ -1722,7 +1722,7 @@ }, { "cell_type": "code", - "execution_count": 614, + "execution_count": 665, "id": "59d1e634", "metadata": {}, "outputs": [], @@ -1780,7 +1780,7 @@ }, { "cell_type": "code", - "execution_count": 615, + "execution_count": 666, "id": "d92b121d-dcb9-43e8-9ee3-3ececb583e7e", "metadata": { "tags": [] @@ -1889,7 +1889,7 @@ }, { "cell_type": "code", - "execution_count": 616, + "execution_count": 667, "id": "14ea27ce-c453-4cb0-b309-dbecd732957e", "metadata": { "tags": [] @@ -1906,16 +1906,16 @@ "name": "stdout", "output_type": "stream", "text": [ - "8/8 - 0s - loss: 1.0050 - accuracy: 
0.4561 - val_loss: 0.9934 - val_accuracy: 0.4118 - 239ms/epoch - 30ms/step\n", + "8/8 - 0s - loss: 1.0173 - accuracy: 0.4728 - val_loss: 0.9260 - val_accuracy: 0.6078 - 230ms/epoch - 29ms/step\n", "2/2 [==============================] - 0s 1ms/step\n", - "Validation accuracy: 0.4117647058823529\n" + "Validation accuracy: 0.6078431372549019\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmp_d8wmtx2/model/001/assets\n" + "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmpv4apdp15/model/001/assets\n" ] }, { @@ -1923,7 +1923,7 @@ "output_type": "stream", "text": [ "\u001b[32m.\u001b[0m\n", - "\u001b[32m\u001b[32m\u001b[1m1 passed\u001b[0m\u001b[32m in 0.55s\u001b[0m\u001b[0m\n" + "\u001b[32m\u001b[32m\u001b[1m1 passed\u001b[0m\u001b[32m in 0.53s\u001b[0m\u001b[0m\n" ] } ], @@ -1992,7 +1992,7 @@ }, { "cell_type": "code", - "execution_count": 617, + "execution_count": 668, "id": "90fe82ae-6a2c-4461-bc83-bb52d8871e3b", "metadata": { "tags": [] @@ -2047,21 +2047,12 @@ }, { "cell_type": "code", - "execution_count": 618, + "execution_count": 738, "id": "99e4850c-83d6-4f4e-a813-d5a3f4bb7486", "metadata": { "tags": [] }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/svpino/dev/ml.school/.venv/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:297: UserWarning: Running within a PipelineSession, there will be No Wait, No Logs, and No Job being started.\n", - " warnings.warn(\n" - ] - } - ], + "outputs": [], "source": [ "# | code: true\n", "# | output: false\n", @@ -2069,7 +2060,6 @@ "from sagemaker.workflow.steps import TrainingStep\n", "from sagemaker.inputs import TrainingInput\n", "\n", - "\n", "train_model_step = TrainingStep(\n", " name=\"train-model\",\n", " step_args=estimator.fit(\n", @@ -2112,7 +2102,7 @@ }, { "cell_type": "code", - "execution_count": 619, + 
"execution_count": 670, "id": "f367d0e3", "metadata": {}, "outputs": [], @@ -2143,7 +2133,7 @@ }, { "cell_type": "code", - "execution_count": 620, + "execution_count": 671, "id": "c8c82750", "metadata": {}, "outputs": [], @@ -2174,7 +2164,7 @@ }, { "cell_type": "code", - "execution_count": 621, + "execution_count": 672, "id": "038ff2e5-ed28-445b-bc03-4e996ec2286f", "metadata": { "tags": [] @@ -2217,7 +2207,7 @@ }, { "cell_type": "code", - "execution_count": 622, + "execution_count": 673, "id": "9799ab39-fcae-41f4-a68b-85ab71b3ba9a", "metadata": { "tags": [] @@ -2227,9 +2217,6 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, @@ -2244,9 +2231,6 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. 
Please make sure this estimator is only used for building workflow config\n", "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, @@ -2261,16 +2245,16 @@ "data": { "text/plain": [ "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session2-pipeline',\n", - " 'ResponseMetadata': {'RequestId': '35562718-eb3c-4377-b5c3-bd7bd65f2077',\n", + " 'ResponseMetadata': {'RequestId': 'e99208aa-4074-41aa-a12b-90af6da62e3f',\n", " 'HTTPStatusCode': 200,\n", - " 'HTTPHeaders': {'x-amzn-requestid': '35562718-eb3c-4377-b5c3-bd7bd65f2077',\n", + " 'HTTPHeaders': {'x-amzn-requestid': 'e99208aa-4074-41aa-a12b-90af6da62e3f',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '85',\n", - " 'date': 'Thu, 26 Oct 2023 18:43:00 GMT'},\n", + " 'date': 'Fri, 27 Oct 2023 14:38:38 GMT'},\n", " 'RetryAttempts': 0}}" ] }, - "execution_count": 622, + "execution_count": 673, "metadata": {}, "output_type": "execute_result" } @@ -2315,7 +2299,7 @@ }, { "cell_type": "code", - "execution_count": 623, + "execution_count": 674, "id": "274a9b1e", "metadata": {}, "outputs": [], @@ -2375,7 +2359,7 @@ }, { "cell_type": "code", - "execution_count": 624, + "execution_count": 675, "id": "3ee3ab26-afa5-4ceb-9f7a-005d5fdea646", "metadata": { "tags": [] @@ -2461,7 +2445,7 @@ }, { "cell_type": "code", - "execution_count": 625, + "execution_count": 676, "id": "9a2540d8-278a-4953-bc54-0469d154427d", "metadata": { "tags": [] @@ -2478,16 +2462,16 @@ "name": "stdout", "output_type": "stream", "text": [ - "8/8 - 0s - loss: 1.2371 - accuracy: 0.3138 - val_loss: 1.0864 - val_accuracy: 0.4902 - 237ms/epoch - 30ms/step\n", + "8/8 - 0s - loss: 1.1330 - accuracy: 0.4142 - val_loss: 1.1001 - val_accuracy: 0.5098 - 236ms/epoch - 30ms/step\n", "2/2 [==============================] - 0s 1ms/step\n", - "Validation accuracy: 0.49019607843137253\n" + "Validation accuracy: 0.5098039215686274\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ - "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmp6a4kt1az/model/001/assets\n", + "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmpprbc5h18/model/001/assets\n", "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.RestoredOptimizer` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.RestoredOptimizer`.\n", "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" ] @@ -2496,8 +2480,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "2/2 [==============================] - 0s 2ms/step\n", - "Test accuracy: 0.35294117647058826\n", + "2/2 [==============================] - 0s 1ms/step\n", + "Test accuracy: 0.4117647058823529\n", "\u001b[32m.\u001b[0m" ] }, @@ -2512,22 +2496,16 @@ "name": "stdout", "output_type": "stream", "text": [ - "8/8 - 0s - loss: 1.3224 - accuracy: 0.2385 - val_loss: 1.2449 - val_accuracy: 0.1765 - 232ms/epoch - 29ms/step\n", - "2/2 [==============================] - 0s 1ms/step\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Validation accuracy: 0.17647058823529413\n" + "8/8 - 0s - loss: 1.0329 - accuracy: 0.4644 - val_loss: 0.9795 - val_accuracy: 0.5882 - 235ms/epoch - 29ms/step\n", + "2/2 [==============================] - 0s 1ms/step\n", + "Validation accuracy: 0.5882352941176471\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmpq00sk8jn/model/001/assets\n", + "INFO:tensorflow:Assets written to: /var/folders/4c/v1q3hy1x4mb5w0wpc72zl3_w0000gp/T/tmph0nj0wfb/model/001/assets\n", "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.RestoredOptimizer` runs slowly on 
M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.RestoredOptimizer`.\n", "WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.SGD` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.SGD`.\n" ] @@ -2536,8 +2514,8 @@ "name": "stdout", "output_type": "stream", "text": [ - "2/2 [==============================] - 0s 1ms/step\n", - "Test accuracy: 0.21568627450980393\n", + "2/2 [==============================] - 0s 2ms/step\n", + "Test accuracy: 0.5686274509803921\n", "\u001b[32m.\u001b[0m\n", "\u001b[32m\u001b[32m\u001b[1m2 passed\u001b[0m\u001b[32m in 1.35s\u001b[0m\u001b[0m\n" ] @@ -2622,7 +2600,7 @@ }, { "cell_type": "code", - "execution_count": 626, + "execution_count": 677, "id": "2fdff07f", "metadata": {}, "outputs": [ @@ -2662,7 +2640,7 @@ }, { "cell_type": "code", - "execution_count": 627, + "execution_count": 678, "id": "4f19e15b", "metadata": {}, "outputs": [], @@ -2685,7 +2663,7 @@ }, { "cell_type": "code", - "execution_count": 628, + "execution_count": 679, "id": "1f27b2ef", "metadata": {}, "outputs": [], @@ -2707,7 +2685,7 @@ }, { "cell_type": "code", - "execution_count": 629, + "execution_count": 680, "id": "48139a07-5c8e-4bc6-b666-bf9531f7f520", "metadata": { "tags": [] @@ -2774,7 +2752,7 @@ }, { "cell_type": "code", - "execution_count": 630, + "execution_count": 681, "id": "bb70f907", "metadata": {}, "outputs": [], @@ -2792,7 +2770,7 @@ }, { "cell_type": "code", - "execution_count": 631, + "execution_count": 682, "id": "4ca4cb61", "metadata": {}, "outputs": [], @@ -2818,7 +2796,7 @@ }, { "cell_type": "code", - "execution_count": 632, + "execution_count": 683, "id": "8c05a7e1", "metadata": {}, "outputs": [], @@ -2852,7 +2830,7 @@ }, { "cell_type": "code", - "execution_count": 633, + "execution_count": 684, "id": "c9773a4a", "metadata": { "tags": [] @@ -2913,7 +2891,7 @@ }, { "cell_type": "code", - "execution_count": 634, + 
"execution_count": 685, "id": "745486b5", "metadata": {}, "outputs": [], @@ -2933,7 +2911,7 @@ }, { "cell_type": "code", - "execution_count": 635, + "execution_count": 686, "id": "c4431bbf", "metadata": {}, "outputs": [], @@ -2962,7 +2940,7 @@ }, { "cell_type": "code", - "execution_count": 636, + "execution_count": 687, "id": "bebeecab", "metadata": {}, "outputs": [], @@ -2990,7 +2968,7 @@ }, { "cell_type": "code", - "execution_count": 637, + "execution_count": 688, "id": "36e2a2b1-6711-4266-95d8-d2aebd52e199", "metadata": { "tags": [] @@ -3019,7 +2997,7 @@ }, { "cell_type": "code", - "execution_count": 638, + "execution_count": 689, "id": "f70bcd33-b499-4e2b-953e-94d1ed96c10a", "metadata": { "tags": [] @@ -3029,9 +3007,6 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. 
Please make sure this estimator is only used for building workflow config\n", "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, @@ -3050,9 +3025,6 @@ "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session3-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session3-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n", "WARNING:sagemaker.workflow._utils:Popping out 'CertifyForMarketplace' from the pipeline definition since it will be overridden in pipeline execution time.\n", - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. 
Please make sure this estimator is only used for building workflow config\n", "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, @@ -3068,13 +3040,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session3-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ + "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session3-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session3-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n" ] }, @@ -3082,16 +3048,16 @@ "data": { "text/plain": [ "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session3-pipeline',\n", - " 'ResponseMetadata': {'RequestId': 'db9b4844-9210-4700-8a1c-1a32ef0c2251',\n", + " 'ResponseMetadata': {'RequestId': 'be91a772-a26a-4c1f-a98a-424951e6889a',\n", " 'HTTPStatusCode': 200,\n", - " 'HTTPHeaders': {'x-amzn-requestid': 'db9b4844-9210-4700-8a1c-1a32ef0c2251',\n", + " 'HTTPHeaders': {'x-amzn-requestid': 'be91a772-a26a-4c1f-a98a-424951e6889a',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '85',\n", - " 'date': 'Thu, 26 Oct 2023 18:43:05 GMT'},\n", + " 'date': 'Fri, 27 Oct 2023 14:38:43 GMT'},\n", " 'RetryAttempts': 0}}" ] }, - "execution_count": 638, + "execution_count": 689, "metadata": {}, "output_type": "execute_result" } @@ -3138,21 +3104,10 @@ }, { "cell_type": "code", - "execution_count": 639, + "execution_count": 690, "id": "f3b4126e", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "_PipelineExecution(arn='arn:aws:sagemaker:us-east-1:325223348818:pipeline/session3-pipeline/execution/bbkxqpvtxyxu', sagemaker_session=)" - ] - }, - "execution_count": 639, - "metadata": {}, - "output_type": "execute_result" - } - ], 
+ "outputs": [], "source": [ "%%script false --no-raise-error\n", "\n", @@ -3199,7 +3154,7 @@ }, { "cell_type": "code", - "execution_count": 512, + "execution_count": 691, "id": "befd5ad3", "metadata": {}, "outputs": [], @@ -3224,7 +3179,7 @@ }, { "cell_type": "code", - "execution_count": 513, + "execution_count": 692, "id": "87437a26-e9ea-4866-9dc3-630444c0fb46", "metadata": { "tags": [] @@ -3234,14 +3189,14 @@ "data": { "text/plain": [ "{'ModelPackageGroupName': 'penguins',\n", - " 'ModelPackageVersion': 67,\n", - " 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:325223348818:model-package/penguins/67',\n", - " 'CreationTime': datetime.datetime(2023, 10, 17, 17, 7, 1, 325000, tzinfo=tzlocal()),\n", + " 'ModelPackageVersion': 74,\n", + " 'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:325223348818:model-package/penguins/74',\n", + " 'CreationTime': datetime.datetime(2023, 10, 26, 14, 52, 37, 773000, tzinfo=tzlocal()),\n", " 'ModelPackageStatus': 'Completed',\n", " 'ModelApprovalStatus': 'Approved'}" ] }, - "execution_count": 513, + "execution_count": 692, "metadata": {}, "output_type": "execute_result" } @@ -3272,7 +3227,7 @@ }, { "cell_type": "code", - "execution_count": 514, + "execution_count": 693, "id": "dee516e9", "metadata": {}, "outputs": [], @@ -3308,7 +3263,7 @@ }, { "cell_type": "code", - "execution_count": 515, + "execution_count": 694, "id": "7c8852d5-818a-406c-944d-30bf6de90288", "metadata": { "tags": [] @@ -3339,7 +3294,7 @@ }, { "cell_type": "code", - "execution_count": 516, + "execution_count": 695, "id": "ba7da291", "metadata": {}, "outputs": [], @@ -3361,7 +3316,7 @@ }, { "cell_type": "code", - "execution_count": 517, + "execution_count": 696, "id": "0817a25e-8224-4911-830b-d659e7458b4a", "metadata": { "tags": [] @@ -3410,7 +3365,7 @@ }, { "cell_type": "code", - "execution_count": 518, + "execution_count": 697, "id": "6b32c3a4-312e-473c-a217-33606f77d1e9", "metadata": { "tags": [] @@ -3472,7 +3427,7 @@ }, { "cell_type": "code", - 
"execution_count": 519, + "execution_count": 698, "id": "e2d61d5c", "metadata": { "tags": [] @@ -3605,7 +3560,7 @@ }, { "cell_type": "code", - "execution_count": 520, + "execution_count": 699, "id": "33893ef2", "metadata": { "tags": [] @@ -3767,7 +3722,7 @@ }, { "cell_type": "code", - "execution_count": 521, + "execution_count": 700, "id": "48c69002", "metadata": { "tags": [] @@ -3876,7 +3831,7 @@ }, { "cell_type": "code", - "execution_count": 522, + "execution_count": 701, "id": "741b8402", "metadata": { "tags": [] @@ -3955,7 +3910,7 @@ }, { "cell_type": "code", - "execution_count": 523, + "execution_count": 702, "id": "53ea0ccf", "metadata": {}, "outputs": [], @@ -3981,7 +3936,7 @@ }, { "cell_type": "code", - "execution_count": 524, + "execution_count": 703, "id": "11a0effd", "metadata": {}, "outputs": [], @@ -4008,7 +3963,7 @@ }, { "cell_type": "code", - "execution_count": 525, + "execution_count": 704, "id": "5d7a5926", "metadata": {}, "outputs": [], @@ -4033,7 +3988,7 @@ }, { "cell_type": "code", - "execution_count": 526, + "execution_count": 705, "id": "157b8858", "metadata": { "tags": [] @@ -4062,7 +4017,7 @@ }, { "cell_type": "code", - "execution_count": 527, + "execution_count": 706, "id": "aefe580a", "metadata": {}, "outputs": [], @@ -4080,7 +4035,7 @@ }, { "cell_type": "code", - "execution_count": 528, + "execution_count": 707, "id": "f84d2cd5", "metadata": { "tags": [] @@ -4135,7 +4090,7 @@ }, { "cell_type": "code", - "execution_count": 529, + "execution_count": 708, "id": "b9712905-9fe3-4148-ae6d-05b0a48e742e", "metadata": { "tags": [] @@ -4162,7 +4117,7 @@ }, { "cell_type": "code", - "execution_count": 530, + "execution_count": 709, "id": "bad9f51d", "metadata": { "tags": [] @@ -4172,9 +4127,6 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this 
estimator. Please make sure this estimator is only used for building workflow config\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, @@ -4199,9 +4151,6 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", - "WARNING:sagemaker.estimator:No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config\n", "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, @@ -4209,6 +4158,7 @@ "name": "stdout", "output_type": "stream", "text": [ + "Using provided s3_resource\n", "Using provided s3_resource\n" ] }, @@ -4220,27 +4170,20 @@ "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session4-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n" ] }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Using provided s3_resource\n" - ] - }, { "data": { "text/plain": [ "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session4-pipeline',\n", - " 'ResponseMetadata': {'RequestId': '510d5be0-0a1a-4daa-997a-12ac7b4f8e0b',\n", + " 'ResponseMetadata': {'RequestId': '2cd65edc-9bad-4b67-a1d2-aa22698d6a39',\n", " 'HTTPStatusCode': 200,\n", - " 'HTTPHeaders': {'x-amzn-requestid': '510d5be0-0a1a-4daa-997a-12ac7b4f8e0b',\n", + " 'HTTPHeaders': {'x-amzn-requestid': '2cd65edc-9bad-4b67-a1d2-aa22698d6a39',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 
'content-length': '85',\n", - " 'date': 'Thu, 26 Oct 2023 18:24:38 GMT'},\n", + " 'date': 'Fri, 27 Oct 2023 14:38:46 GMT'},\n", " 'RetryAttempts': 0}}" ] }, - "execution_count": 530, + "execution_count": 709, "metadata": {}, "output_type": "execute_result" } @@ -4287,7 +4230,7 @@ }, { "cell_type": "code", - "execution_count": 531, + "execution_count": 710, "id": "20dfbd97", "metadata": {}, "outputs": [], @@ -4318,7 +4261,7 @@ }, { "cell_type": "code", - "execution_count": 348, + "execution_count": 744, "id": "998314a3", "metadata": {}, "outputs": [ @@ -4385,7 +4328,7 @@ " \"InitialInstanceCount\": 1,\n", " \"VariantName\": \"AllTraffic\",\n", " }],\n", - " \n", + "\n", " # We can enable Data Capture to record the inputs and outputs\n", " # of the endpoint to use them later for monitoring the model. \n", " DataCaptureConfig={\n", @@ -4445,7 +4388,7 @@ }, { "cell_type": "code", - "execution_count": 536, + "execution_count": 745, "id": "4ad4f1f2", "metadata": { "tags": [] @@ -4455,11 +4398,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "Role lambda-deployment-role already exists.\n" + "Role \"lambda-deployment-role\" created with ARN \"arn:aws:iam::325223348818:role/lambda-deployment-role\".\n" ] } ], "source": [ + "#| code: true\n", + "#| output: false\n", + "\n", "lambda_role_name = \"lambda-deployment-role\"\n", "lambda_role_arn = None\n", "\n", @@ -4491,15 +4437,15 @@ " )\n", "\n", " iam_client.attach_role_policy(\n", - " RoleName=lambda_role_name,\n", " PolicyArn=\"arn:aws:iam::aws:policy/AmazonSageMakerFullAccess\",\n", + " RoleName=lambda_role_name,\n", " )\n", - " \n", + "\n", " print(f'Role \"{lambda_role_name}\" created with ARN \"{lambda_role_arn}\".')\n", "except iam_client.exceptions.EntityAlreadyExistsException:\n", - " print(f\"Role {lambda_role_name} already exists.\")\n", " response = iam_client.get_role(RoleName=lambda_role_name)\n", - " lambda_role_arn = response[\"Role\"][\"Arn\"]" + " lambda_role_arn = response[\"Role\"][\"Arn\"]\n", 
+ " print(f'Role \"{lambda_role_name}\" already exists with ARN \"{lambda_role_arn}\".')\n" ] }, { @@ -4512,7 +4458,7 @@ }, { "cell_type": "code", - "execution_count": 350, + "execution_count": 747, "id": "ad8c8019", "metadata": { "tags": [] @@ -4521,36 +4467,35 @@ { "data": { "text/plain": [ - "{'ResponseMetadata': {'RequestId': 'a6e915cb-e440-4ecd-94bb-458139388602',\n", - " 'HTTPStatusCode': 200,\n", - " 'HTTPHeaders': {'date': 'Tue, 24 Oct 2023 18:28:36 GMT',\n", + "{'ResponseMetadata': {'RequestId': '57179d72-6fc2-49cc-9326-cb87bd63bda1',\n", + " 'HTTPStatusCode': 201,\n", + " 'HTTPHeaders': {'date': 'Fri, 27 Oct 2023 16:01:42 GMT',\n", " 'content-type': 'application/json',\n", - " 'content-length': '1428',\n", + " 'content-length': '1421',\n", " 'connection': 'keep-alive',\n", - " 'x-amzn-requestid': 'a6e915cb-e440-4ecd-94bb-458139388602'},\n", + " 'x-amzn-requestid': '57179d72-6fc2-49cc-9326-cb87bd63bda1'},\n", " 'RetryAttempts': 0},\n", " 'FunctionName': 'deploy_fn',\n", " 'FunctionArn': 'arn:aws:lambda:us-east-1:325223348818:function:deploy_fn',\n", " 'Runtime': 'python3.11',\n", " 'Role': 'arn:aws:iam::325223348818:role/lambda-deployment-role',\n", " 'Handler': 'lambda.lambda_handler',\n", - " 'CodeSize': 3202,\n", + " 'CodeSize': 3194,\n", " 'Description': '',\n", " 'Timeout': 600,\n", " 'MemorySize': 128,\n", - " 'LastModified': '2023-10-24T18:28:36.000+0000',\n", - " 'CodeSha256': 'gTB7D5GxQS4xUk99eaZAfIFv2GPHZ6s2D+aNyzOy19Q=',\n", + " 'LastModified': '2023-10-27T16:01:42.544+0000',\n", + " 'CodeSha256': 'IkCkE0e46WsdhSUEPRlsqEH/6nHhU5laPpgn308D30k=',\n", " 'Version': '$LATEST',\n", " 'Environment': {'Variables': {'ROLE': 'arn:aws:iam::325223348818:role/service-role/AmazonSageMaker-ExecutionRole-20230312T160501',\n", " 'DATA_CAPTURE_DESTINATION': 's3://mlschool/penguins/monitoring/data-capture',\n", " 'ENDPOINT': 'penguins-endpoint'}},\n", " 'TracingConfig': {'Mode': 'PassThrough'},\n", - " 'RevisionId': '175878c6-9ff3-47b1-b1a3-b5df361b9fc9',\n", + " 
'RevisionId': '516fef1e-871b-4a52-81e2-a421f3547ec9',\n", " 'Layers': [],\n", - " 'State': 'Active',\n", - " 'LastUpdateStatus': 'InProgress',\n", - " 'LastUpdateStatusReason': 'The function is being created.',\n", - " 'LastUpdateStatusReasonCode': 'Creating',\n", + " 'State': 'Pending',\n", + " 'StateReason': 'The function is being created.',\n", + " 'StateReasonCode': 'Creating',\n", " 'PackageType': 'Zip',\n", " 'Architectures': ['x86_64'],\n", " 'EphemeralStorage': {'Size': 512},\n", @@ -4558,7 +4503,7 @@ " 'RuntimeVersionConfig': {'RuntimeVersionArn': 'arn:aws:lambda:us-east-1::runtime:6cf63f1a78b5c5e19617d6b4b111370fdbda415ea91bdfdc5aacef9fee76b64a'}}" ] }, - "execution_count": 350, + "execution_count": 747, "metadata": {}, "output_type": "execute_result" } @@ -4606,7 +4551,7 @@ }, { "cell_type": "code", - "execution_count": 351, + "execution_count": 748, "id": "27ce7cc5", "metadata": {}, "outputs": [], @@ -4633,7 +4578,7 @@ }, { "cell_type": "code", - "execution_count": 352, + "execution_count": 749, "id": "2a878179", "metadata": {}, "outputs": [], @@ -4657,7 +4602,7 @@ }, { "cell_type": "code", - "execution_count": 353, + "execution_count": 750, "id": "dc714a97", "metadata": { "tags": [] @@ -4685,18 +4630,10 @@ }, { "cell_type": "code", - "execution_count": 354, + "execution_count": 751, "id": "d74be86b", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Function \"deploy_fn\" already has permissions.\n" - ] - } - ], + "outputs": [], "source": [ "lambda_client = boto3.client(\"lambda\")\n", "try:\n", @@ -4723,7 +4660,7 @@ }, { "cell_type": "code", - "execution_count": 355, + "execution_count": 718, "id": "3cc966fb-b611-417f-a8b8-0c5d2f95252c", "metadata": { "tags": [] @@ -4786,7 +4723,7 @@ }, { "cell_type": "code", - "execution_count": 356, + "execution_count": 719, "id": "8c3e851a-2416-4a0b-b8a1-c483cde3d776", "metadata": { "tags": [] @@ -4848,7 +4785,7 @@ }, { "cell_type": "code", - "execution_count": 357, + 
"execution_count": 720, "id": "2bb846d0", "metadata": {}, "outputs": [], @@ -4872,7 +4809,7 @@ }, { "cell_type": "code", - "execution_count": 358, + "execution_count": 721, "id": "0b80bcab-d2c5-437c-a1c8-8eea208c0e29", "metadata": { "tags": [] @@ -4936,7 +4873,7 @@ }, { "cell_type": "code", - "execution_count": 359, + "execution_count": 722, "id": "8194b462", "metadata": {}, "outputs": [ @@ -4972,7 +4909,7 @@ }, { "cell_type": "code", - "execution_count": 360, + "execution_count": 723, "id": "bf6aa4f0", "metadata": {}, "outputs": [], @@ -5006,7 +4943,7 @@ }, { "cell_type": "code", - "execution_count": 361, + "execution_count": 724, "id": "1987a788-de7a-4f60-ac8d-819d9ffcdf8e", "metadata": { "tags": [] @@ -5052,7 +4989,7 @@ }, { "cell_type": "code", - "execution_count": 362, + "execution_count": 725, "id": "9aa3a284-8763-4000-a263-70314b530652", "metadata": { "tags": [] @@ -5116,7 +5053,7 @@ }, { "cell_type": "code", - "execution_count": 363, + "execution_count": 726, "id": "a773f134-ac2f-4dba-976e-9b7f0b384b6e", "metadata": { "tags": [] @@ -5176,7 +5113,7 @@ }, { "cell_type": "code", - "execution_count": 364, + "execution_count": 727, "id": "7056a009-91c0-4955-90dd-b90ef8cab149", "metadata": { "tags": [] @@ -5218,7 +5155,7 @@ }, { "cell_type": "code", - "execution_count": 365, + "execution_count": 728, "id": "bacaa9c6-22b0-48df-b138-95b6422fe834", "metadata": { "tags": [] @@ -5252,24 +5189,31 @@ }, { "cell_type": "code", - "execution_count": 366, + "execution_count": 729, "id": "4da5e453-acd8-47a0-a39f-264d05dd93d0", "metadata": { "tags": [] }, "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Using provided s3_resource\n" + ] + }, { "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", + 
"INFO:sagemaker.processing:Uploaded None to s3://mlschool/session5-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "Using provided s3_resource\n", "Using provided s3_resource\n" ] }, @@ -5277,15 +5221,16 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session5-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session5-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n", - "WARNING:sagemaker.workflow._utils:Popping out 'CertifyForMarketplace' from the pipeline definition since it will be overridden in pipeline execution time.\n" + "WARNING:sagemaker.workflow._utils:Popping out 'CertifyForMarketplace' from the pipeline definition since it will be overridden in pipeline execution time.\n", + "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ + "Using provided s3_resource\n", "Using provided s3_resource\n" ] }, @@ -5293,32 +5238,24 @@ "name": "stderr", "output_type": "stream", "text": [ - "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", "INFO:sagemaker.processing:Uploaded None to s3://mlschool/session5-pipeline/code/09fea667a5ab7c37a068f22c00762d0b/sourcedir.tar.gz\n", "INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool/session5-pipeline/code/2c207c809cb0e0e9a1d77e5247f961f9/runproc.sh\n" ] }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Using provided s3_resource\n" - ] - }, { "data": { "text/plain": [ "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:325223348818:pipeline/session5-pipeline',\n", - " 'ResponseMetadata': {'RequestId': '450e9e6e-9ec9-40de-bda6-be8564981011',\n", + " 'ResponseMetadata': 
{'RequestId': 'e104a5af-2148-4ab4-85b3-af898d3bd315',\n", " 'HTTPStatusCode': 200,\n", - " 'HTTPHeaders': {'x-amzn-requestid': '450e9e6e-9ec9-40de-bda6-be8564981011',\n", + " 'HTTPHeaders': {'x-amzn-requestid': 'e104a5af-2148-4ab4-85b3-af898d3bd315',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '85',\n", - " 'date': 'Tue, 24 Oct 2023 18:28:40 GMT'},\n", + " 'date': 'Fri, 27 Oct 2023 14:38:52 GMT'},\n", " 'RetryAttempts': 0}}" ] }, - "execution_count": 366, + "execution_count": 729, "metadata": {}, "output_type": "execute_result" } @@ -5366,23 +5303,23 @@ }, { "cell_type": "code", - "execution_count": 367, + "execution_count": 739, "id": "10ba9909", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "_PipelineExecution(arn='arn:aws:sagemaker:us-east-1:325223348818:pipeline/session5-pipeline/execution/a8jrffhsgbcm', sagemaker_session=)" + "_PipelineExecution(arn='arn:aws:sagemaker:us-east-1:325223348818:pipeline/session5-pipeline/execution/ifgn9itt6qcy', sagemaker_session=)" ] }, - "execution_count": 367, + "execution_count": 739, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "%%script false --no-raise-error\n", + "# %%script false --no-raise-error\n", "\n", "#| eval: false\n", "#| code: true\n", @@ -5405,7 +5342,7 @@ }, { "cell_type": "code", - "execution_count": 406, + "execution_count": 752, "id": "42daa82b", "metadata": {}, "outputs": [ @@ -5467,7 +5404,7 @@ }, { "cell_type": "code", - "execution_count": 405, + "execution_count": 753, "id": "898d9626", "metadata": {}, "outputs": [ @@ -5508,7 +5445,7 @@ }, { "cell_type": "code", - "execution_count": 370, + "execution_count": 754, "id": "2df52332", "metadata": {}, "outputs": [ @@ -5568,7 +5505,7 @@ }, { "cell_type": "code", - "execution_count": 393, + "execution_count": 755, "id": "c658bad0", "metadata": {}, "outputs": [], @@ -5607,7 +5544,7 @@ }, { "cell_type": "code", - "execution_count": 394, + "execution_count": 756, "id": 
"3f35e8db-24d7-4d4b-9264-78ee5070cf27", "metadata": { "tags": [] @@ -5669,17 +5606,17 @@ }, { "cell_type": "code", - "execution_count": 395, + "execution_count": 757, "id": "bb999995", "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "'s3://mlschool/penguins/monitoring/groundtruth/2023/10/24/19/3952.jsonl'" + "'s3://mlschool/penguins/monitoring/groundtruth/2023/10/27/17/0816.jsonl'" ] }, - "execution_count": 395, + "execution_count": 757, "metadata": {}, "output_type": "execute_result" } @@ -5725,7 +5662,7 @@ }, { "cell_type": "code", - "execution_count": 380, + "execution_count": 758, "id": "da145ba1-4966-4dab-8a73-281db364cbc7", "metadata": { "tags": [] @@ -5864,7 +5801,7 @@ }, { "cell_type": "code", - "execution_count": 381, + "execution_count": 759, "id": "cc119422-2e85-4e8c-86cd-6d59e353d09d", "metadata": { "tags": [] @@ -5884,7 +5821,7 @@ }, { "cell_type": "code", - "execution_count": 382, + "execution_count": 760, "id": "083b0bd0-4035-43fe-9b2c-946b12a5e266", "metadata": { "tags": [] @@ -5927,7 +5864,7 @@ }, { "cell_type": "code", - "execution_count": 383, + "execution_count": 761, "id": "96e5c0c1-7e40-47df-8f40-1d891db13875", "metadata": { "tags": [] @@ -5977,14 +5914,14 @@ }, { "cell_type": "code", - "execution_count": 385, + "execution_count": null, "id": "15caf9e1-97fc-4379-893b-6062d4bd876e", "metadata": { "tags": [] }, "outputs": [], "source": [ - "%%script false --no-raise-error\n", + "# %%script false --no-raise-error\n", "#| code: true\n", "#| output: false\n", "#| eval: false\n", @@ -6020,7 +5957,7 @@ }, { "cell_type": "code", - "execution_count": 401, + "execution_count": 781, "id": "2c04fdd4-cc03-496c-a0a1-405854505c46", "metadata": { "tags": [] @@ -6184,7 +6121,7 @@ }, { "cell_type": "code", - "execution_count": 403, + "execution_count": 783, "id": "bb74dc04-54a1-4a3f-854f-4877f7f0b4a1", "metadata": { "tags": [] @@ -6195,7 +6132,7 @@ "output_type": "stream", "text": [ "Monitoring schedule deleted.\n", - "Monitoring schedule deleted.\n" 
+ "There's no ModelQuality Monitoring Schedule.\n"
 ]
 }
 ],
@@ -6229,19 +6166,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 404,
+ "execution_count": null,
 "id": "9eabe84e",
 "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "INFO:sagemaker:Deleting endpoint configuration with name: penguins-endpoint-config-1024190416\n",
- "INFO:sagemaker:Deleting endpoint with name: penguins-endpoint\n"
- ]
- }
- ],
+ "outputs": [],
 "source": [
 "%%script false --no-raise-error\n",
 "#| eval: false\n",
diff --git a/program/index.qmd b/program/index.qmd
index b487df0..3004711 100644
--- a/program/index.qmd
+++ b/program/index.qmd
@@ -4,7 +4,7 @@ listing:
 contents: posts
 sort: "date desc"
 type: default
- categories: true
+ categories: false
 ---
 
 Welcome to the program!
@@ -73,8 +73,14 @@ Welcome to the program!
 * The 3 strategies to keep your models working despite data distribution shifts
 * Understanding SageMaker’s Transform Step, QualityCheck Step, Transform Jobs, and Monitoring Jobs
 
-
-## Table of Contents
-
-* [Configuration Setup](setup.qmd)
-* [Cohort Notebook](cohort.ipynb)
\ No newline at end of file
+#### Session 6 - Continual Learning And Testing in Production
+
+* The importance of Continual Learning and why every company wants to do it
+* 3 challenges when implementing Continual Learning
+* A 4-step plan to implement Continual Learning
+* How to determine what data to use to retrain a model
+* A 3-step progressive plan to decide how frequently you should retrain your models
+* The differences between training from scratch and incremental training
+* An introduction to Testing in Production
+* 5 strategies to test models in production: A/B testing, shadow deployments, canary releases, interleaving experiments, and multi-armed bandits
+* Highlights from the program
diff --git a/program/project.qmd b/program/project.qmd
new file mode 100644
index 0000000..d34e295
--- /dev/null
+++ b/program/project.qmd
@@ -0,0 +1,28 @@
+---
+title: "Class Project"
+---
+
+The goal of this project is to build a training pipeline to preprocess, train, evaluate, and register a machine learning model.
+
+You'll start from the template pipeline that we discussed during the program and make the necessary changes to it. Before making any changes, ensure you can run the pipeline from Session 4 without issues.
+
+The project has three different levels of complexity. Pick the one that you feel most comfortable tackling first.
+
+## Simple complexity
+We want to replace the Penguins dataset with a different classification problem. Feel free to use any dataset you like. If you don't have any ideas, here are three options you can choose from:
+
+1. [Iris flowers](https://archive.ics.uci.edu/dataset/53/iris) dataset - This is a multi-class classification problem where you'll predict the flower species given the measurements of iris flowers.
+2. [Adult income](https://archive.ics.uci.edu/dataset/2/adult) dataset - This is a binary classification problem where you'll predict whether the income of a person exceeds $50,000/yr based on census data.
+3. [Banknote authentication](https://archive.ics.uci.edu/dataset/267/banknote+authentication) dataset - This is a binary classification problem where you'll predict whether a given banknote is authentic given the measures from a photograph.
+
+Start with the pipeline from Session 4 and modify the preprocessing, training, and evaluation scripts to use the new dataset.
+
+## Intermediate complexity
+We want to replace TensorFlow with PyTorch in the pipeline we built in Session 4. Everything else will stay the same, except the framework to train the model.
+
+Start with the pipeline from Session 4 and modify the training and evaluation scripts to train and evaluate the model using PyTorch. Notice you'll need to use a [PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) to configure the Training Step and a [PyTorch processor](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-processor) to configure the evaluation step.
+
+## Advanced complexity
+At this stage, we want to combine replacing the Penguins dataset with replacing TensorFlow with PyTorch in the pipeline.
+
+Start with the pipeline from Session 4 and make the necessary changes described in the simple and intermediate complexity sections.
\ No newline at end of file
diff --git a/program/setup.qmd b/program/setup.qmd
index 8a6b6ab..bba8279 100644
--- a/program/setup.qmd
+++ b/program/setup.qmd
@@ -4,7 +4,7 @@ listing:
 contents: posts
 sort: "date desc"
 type: default
- categories: true
+ categories: false
 ---
 
 Here are the steps you need to follow to set up the project:
diff --git a/program/sidebar.yml b/program/sidebar.yml
index 71e914e..c3b3d58 100644
--- a/program/sidebar.yml
+++ b/program/sidebar.yml
@@ -3,4 +3,5 @@ website:
 contents:
 - index.qmd
 - setup.qmd
+ - project.qmd
 - cohort.ipynb
\ No newline at end of file