Annotated Data to Model Training with AWS SageMaker: A Guide for Model Training

Puranjay kwatra
Photo by Stephen Dawson / Unsplash

Introduction

This blog walks through data annotation and model training within the AWS ecosystem. Whether or not you're familiar with AWS, we've got you covered, starting from the basics. In supervised machine learning (ML), labels are the values a model is expected to learn and predict. Accurate labels come either from real-time recording or from offline data annotation, in which people assign labels to the dataset.

However, manual dataset annotation is laborious and exhausting, particularly for large datasets. Even when labels seem obvious, the process remains error-prone because of human fatigue. As a result, constructing training datasets can consume a large share of a data scientist's time, by some estimates up to 80%. Because labeling still depends on human effort, it is often the most time-consuming part of a project. Before diving into the details, let's familiarize ourselves with the platforms we'll be working on: Amazon S3 and SageMaker.

Understanding Amazon S3

Amazon S3 stands as a scalable, durable, and available object storage service equipped with a range of security features and cost-effective storage classes. Its versatility extends to applications such as storing static websites, hosting large files, managing backups, and serving data for big data analytics and machine learning.

The Significance of AWS SageMaker

Amazon SageMaker emerges as a fully managed service, empowering developers and data scientists to expedite the building, training, and deployment of machine learning (ML) models. SageMaker efficiently handles the intricacies of each step in the machine-learning process, simplifying the development of high-quality models.

SageMaker has the capability to build models trained by data stored in S3 buckets or sourced from a streaming data source like Kinesis shards. Once the models are trained, SageMaker streamlines the deployment process, making it seamless to transition them into production.


Data Annotation:

In the realm of machine learning, data annotation is the pivotal process of labeling data to indicate the anticipated outcome that your machine learning model is meant to predict. It involves marking—labeling, tagging, transcribing, or processing—a dataset with the features you want your machine learning system to recognize. The ultimate goal is for your deployed model to autonomously recognize these features and make informed decisions or take actions accordingly.

Choosing the Optimal Algorithm for Your Use Case: Object Detection

Taking object detection as our focal point, selecting the right algorithm hinges on various factors, such as desired accuracy, speed, and computational resources. Below, we explore some of the most prominent object detection algorithms, along with their respective strengths and weaknesses.

Faster R-CNN (Region-based Convolutional Neural Network):

Advantages:

  • Good accuracy.
  • Two-stage architecture with a region proposal network (RPN) for improved localization.

Considerations:

  • Can be computationally intensive.
  • May not be as fast as some newer models for real-time applications.

YOLO (You Only Look Once):

Advantages:

  • Fast and efficient, suitable for real-time applications.
  • Single-stage architecture, making it simpler and faster than two-stage detectors.

Considerations:

  • May sacrifice some accuracy compared to slower, two-stage detectors.
  • YOLOv3 and later versions have improved accuracy.

SSD (Single Shot Multibox Detector):

Advantages:

  • Balances speed and accuracy.
  • Single-stage architecture, making it faster than Faster R-CNN.
  • Good performance on small objects.

Considerations:

  • May not be as accurate as two-stage detectors like Faster R-CNN.
  • Can have more false positives compared to other models.

Importance of Data Labeling:

1. Providing Ground Truth: Data labels act as the "ground truth" against which machine learning models are assessed. By comparing the model's predictions to the labelled data, we can gauge its accuracy and pinpoint areas for improvement. High-quality labeling ensures that the model learns from accurate and consistent information.

2. Enabling Supervised Learning: Supervised learning, a prevalent machine learning approach, relies on labelled data to train models. By furnishing examples of inputs and their corresponding outputs, data labeling empowers the model to discern patterns and relationships within the data. This proves critical for tasks such as image classification, speech recognition, and sentiment analysis.

3. Improving Model Performance: The precision of data labeling directly influences model performance. The more accurate and consistent the labels, the more reliable the model's predictions become. Poor-quality labelling introduces noise and inconsistencies into the training data, leading to inaccurate or biased predictions.
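In object detection, one common way to compare a predicted bounding box against its ground-truth label is intersection over union (IoU). The sketch below is only an illustration: the function name `iou` and the (xmin, ymin, xmax, ymax) box layout are assumptions made for this example, not something defined elsewhere in this post.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Corners of the overlapping rectangle, if there is one
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes
ground_truth = (0.0, 0.0, 2.0, 2.0)
prediction = (1.0, 1.0, 3.0, 3.0)
print(iou(ground_truth, prediction))
```

An IoU of 1.0 means the prediction matches the label exactly, while 0.0 means the boxes do not overlap at all. Noisy or inconsistent labels lower this score even when the model itself is behaving sensibly.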

Annotating Data with AWS Ground Truth for Bird Classification



We start by annotating the data with AWS Ground Truth to obtain a carefully labelled dataset. In this example, the task is to build a model that identifies birds. Our goal is a dataset in which each bird is clearly marked in its image, for example with a bounding box, and Ground Truth's annotation tooling lets us produce it.


Step 0: Create an S3 bucket, upload images, and generate manifest files.

{"source-ref": "s3://bucket-input/input/image1.jpeg"} 
{"source-ref": "s3://bucket-input/input/image2.jpeg"} 
{"source-ref": "s3://bucket-input/input/image3.jpeg"} 
{"source-ref": "s3://bucket-input/input/image4.jpeg"} 
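Writing the manifest by hand does not scale, so it can help to generate it. The following is a minimal sketch, assuming the images are already uploaded under a single prefix; the helper name `build_manifest_lines` and the bucket, prefix, and file names are hypothetical and only mirror the sample above.

```python
import json


def build_manifest_lines(bucket, prefix, image_names):
    """Return one JSON line per image, each with a single "source-ref" key."""
    lines = []
    for name in image_names:
        s3_uri = "s3://{}/{}/{}".format(bucket, prefix, name)
        lines.append(json.dumps({"source-ref": s3_uri}))
    return lines


# Hypothetical bucket, prefix, and file names matching the sample above
manifest_lines = build_manifest_lines(
    "bucket-input", "input", ["image1.jpeg", "image2.jpeg"]
)
print("\n".join(manifest_lines))
```

Each returned string is one line of the manifest. Joining them with newlines and saving the result to a file gives you something you can upload to the same bucket as the images.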

Step 1: In the AWS console, search for AWS SageMaker, and then navigate to Labeling Jobs/Ground Truth. Here, you can label your images by clicking on Create Labeling Job.


Step 2: Fill out the Job name and your bucket URL where your data is stored.


Step 3: Select the task type based on your use case, whether you require a single class label, multi-class label, semantic segmentation, or bounding box.


Step 4: Label your images with the tools provided in the labeling interface. You can label multiple images to create your own dataset.

Annotated images are visible directly in the AWS console, which comes in handy for sanity checks. You can also click on any image and see the list of labels that have been applied. Our main purpose is to use this information to train machine learning models, and you can find this data in your bucket.
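For a bounding box job, Ground Truth saves the labels to your bucket as an output manifest in which each line is a JSON record. The sketch below shows one way such a line could be read. The label attribute name `bird-labels`, the helper `parse_boxes`, and every value in the sample record are made up for illustration, so check the field names against the manifest your own job produces.

```python
import json

# Hypothetical output-manifest line for a bounding box job; the label
# attribute name "bird-labels" and all values are made up for illustration.
sample_line = json.dumps({
    "source-ref": "s3://bucket-input/input/image1.jpeg",
    "bird-labels": {
        "image_size": [{"width": 500, "height": 400, "depth": 3}],
        "annotations": [
            {"class_id": 0, "left": 100, "top": 80, "width": 200, "height": 160}
        ],
    },
    "bird-labels-metadata": {"class-map": {"0": "bird"}},
})


def parse_boxes(manifest_line, label_attribute):
    """Extract the image URI and a list of corner-style boxes from one line."""
    record = json.loads(manifest_line)
    class_map = record.get(label_attribute + "-metadata", {}).get("class-map", {})
    boxes = []
    for ann in record[label_attribute]["annotations"]:
        boxes.append({
            # Fall back to the numeric id when no class name is available
            "label": class_map.get(str(ann["class_id"]), str(ann["class_id"])),
            "xmin": ann["left"],
            "ymin": ann["top"],
            "xmax": ann["left"] + ann["width"],
            "ymax": ann["top"] + ann["height"],
        })
    return record["source-ref"], boxes


uri, boxes = parse_boxes(sample_line, "bird-labels")
print(uri, boxes)
```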

Disadvantages of AWS Ground Truth

However, for large datasets, AWS Ground Truth offers limited support for collaboration and advanced annotation tooling. As a result, many teams turn to a third-party labeling platform that offers automated labeling, faster manual labeling, and collaborative annotation.

This enables the division of labeling tasks among multiple individuals, significantly accelerating the labeling process—nearly 90 times faster than the traditional approach.

Labellerr Helps You Get Labels at Large Volume

This tool addresses all the disadvantages mentioned above in Ground Truth, significantly accelerating the annotation process for all images.

Additionally, after annotating all the images, it can export them to your desired cloud service, whether it's AWS, Azure, or Google Cloud. Once you have all the annotated images and want to write code for model training, follow the steps below to start training your model.


First, let's set up AWS SageMaker!



Step 1: Go to the AWS Management Console.

Step 2: Open the Notebook Instances and click on Create Notebook Instance.


  • During the setup, you'll encounter prompts for Notebook Instance Settings. Provide it with a descriptive name and opt for the ml.t2.medium instance type. This represents the basic tier and is suitable for our current needs. As we delve into more advanced models, we can consider higher-tier instances. Cost considerations need not be a concern, as AWS operates on a pay-per-use pricing model, ensuring you only pay for the resources you consume.


Other settings can be left at their defaults. If you wish to explore them further, refer to the SageMaker documentation for more detailed information.

Step 3: Create an IAM role


Step 4: Confirmation screen

After this, you will see the message "Success! You created an IAM role."

After a short wait, the console shows the notebook instance status as "InService".

Step 5: Create a new file and select the .ipynb file type from the drop-down menu.


Now that our environment setup is complete, let's dive into coding. For this exercise, we'll focus on object detection for bird species using the Bird Species dataset (CUB-200-2011), which is downloaded in Step 3 below.

Step 0: Setup

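# Install the system libraries that OpenCV needs on Debian-based images
! pip install distro
import sys
import distro

# distro.name() replaces the deprecated distro.linux_distribution()[0]
if "debian" in distro.name().lower():
    ! apt-get update
    ! apt-get install ffmpeg libsm6 libxext6  -y

# Install the Python packages into the kernel's own environment
!{sys.executable} -m pip install opencv-python
!{sys.executable} -m pip install mxnet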

Step 1: Code for Retrieving Default S3 Bucket and Prefix in SageMaker

In the following code snippet, we identify the S3 bucket that will hold the training and validation datasets as well as the trained model artifacts. This notebook uses the session's default bucket, but you can substitute a bucket of your own. An object prefix keeps the bucket contents organized.

import sagemaker

bucket = sagemaker.Session().default_bucket()
prefix = "DEMO-ObjectDetection-birds"

print("s3://{}/{}/".format(bucket, prefix))

Step 2: Set up and authenticate the use of AWS services.

from sagemaker import get_execution_role

role = get_execution_role()
print(role)
sess = sagemaker.Session()

Step 3: Download the dataset

import os
import urllib.request


def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)


%%time
# download('http://www.vision.caltech.edu/visipedia-data/CUB-200-2011/CUB_200_2011.tgz')
# CalTech's download is (at least temporarily) unavailable since August 2020.

# Can now use one made available by fast.ai .
download("https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz")

Step 4: Unpack and remove the tar file

%%time
# Clean up prior version of the downloaded dataset if you are running this again
!rm -rf CUB_200_2011

# Unpack and then remove the downloaded compressed tar file
!gunzip -c ./CUB_200_2011.tgz | tar xopf -
!rm CUB_200_2011.tgz

Step 5: Define the parameters

import pandas as pd
import cv2
import boto3
import json

# Client for invoking hosted endpoints, using the service name that boto3 documents
runtime = boto3.client(service_name="sagemaker-runtime")

import matplotlib.pyplot as plt

%matplotlib inline

RANDOM_SPLIT = False
SAMPLE_ONLY = True
FLIP = False

# To speed up training and experimenting, you can use a small handful of species.
# To see the full list of the classes available, look at the content of CLASSES_FILE.
CLASSES = [17, 36, 47, 68, 73]

# Otherwise, you can use the full set of species
if not SAMPLE_ONLY:
    # Class numbers run from 1 to 200
    CLASSES = list(range(1, 201))

RESIZE_SIZE = 256

BASE_DIR = "CUB_200_2011/"
IMAGES_DIR = BASE_DIR + "images/"

CLASSES_FILE = BASE_DIR + "classes.txt"
BBOX_FILE = BASE_DIR + "bounding_boxes.txt"
IMAGE_FILE = BASE_DIR + "images.txt"
LABEL_FILE = BASE_DIR + "image_class_labels.txt"
SIZE_FILE = BASE_DIR + "sizes.txt"
SPLIT_FILE = BASE_DIR + "train_test_split.txt"

TRAIN_LST_FILE = "birds_ssd_train.lst"
VAL_LST_FILE = "birds_ssd_val.lst"

if SAMPLE_ONLY:
    TRAIN_LST_FILE = "birds_ssd_sample_train.lst"
    VAL_LST_FILE = "birds_ssd_sample_val.lst"

TRAIN_RATIO = 0.8
CLASS_COLS = ["class_number", "class_id"]
IM2REC_SSD_COLS = [
    "header_cols",
    "label_width",
    "zero_based_id",
    "xmin",
    "ymin",
    "xmax",
    "ymax",
    "image_file_name",
]

Step 6: Explore the dataset images

def show_species(species_id):
    _im_list = !ls $IMAGES_DIR/$species_id

    NUM_COLS = 6
    IM_COUNT = len(_im_list)

    print('Species ' + species_id + ' has ' + str(IM_COUNT) + ' images.')
    
    NUM_ROWS = int(IM_COUNT / NUM_COLS)
    if ((IM_COUNT % NUM_COLS) > 0):
        NUM_ROWS += 1

    fig, axarr = plt.subplots(NUM_ROWS, NUM_COLS)
    fig.set_size_inches(8.0, 16.0, forward=True)

    curr_col = 0
    for curr_img in range(IM_COUNT):
        # build the local file path, then read the image
        f = IMAGES_DIR + species_id + '/' + _im_list[curr_img]
        a = plt.imread(f)

        # the grid is filled column by column: the row index cycles through NUM_ROWS
        row = curr_img % NUM_ROWS
        # plot on the relevant subplot (axarr is indexed [row, column])
        axarr[row, curr_col].imshow(a)
        if row == (NUM_ROWS - 1):
            # we have filled the current column, so move to the next one
            curr_col += 1

    fig.tight_layout()       
    plt.show()
        
    # Clean up
    plt.clf()
    plt.cla()
    plt.close()

Step 7: Check the selected classes and preview one species

classes_df = pd.read_csv(CLASSES_FILE, sep=" ", names=CLASS_COLS, header=None)
criteria = classes_df["class_number"].isin(CLASSES)
classes_df = classes_df[criteria]
print(classes_df.to_csv(columns=["class_id"], sep="\t", index=False, header=False))
show_species("017.Cardinal")

Step 8: Generate the RecordIO files

For this specific dataset, bounding box annotations are specified in absolute terms. The RecordIO format, however, necessitates them to be defined in terms relative to the image size. The subsequent code iterates through each image, extracts the height and width, and stores this information in a file for future use. It's worth noting that certain publicly available datasets already offer such a file explicitly for this purpose.

%%time
SIZE_COLS = ["idx", "width", "height"]


def gen_image_size_file():
    print("Generating a file containing image sizes...")
    images_df = pd.read_csv(
        IMAGE_FILE, sep=" ", names=["image_pretty_name", "image_file_name"], header=None
    )
    rows_list = []
    idx = 0
    image_file_name = images_df["image_file_name"].dropna(axis=0)
    for i in image_file_name:
        # TODO: add progress bar
        idx += 1
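        img = cv2.imread(IMAGES_DIR + i)
        if img is None:
            # cv2.imread returns None instead of raising when a file cannot be read
            raise FileNotFoundError("Could not read image: " + IMAGES_DIR + i)
        # img.shape is (height, width, channels)
        height, width = img.shape[:2]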
        image_dict = {"idx": idx, "width": width, "height": height}
        rows_list.append(image_dict)

    sizes_df = pd.DataFrame(rows_list)
    print("Image sizes:\n" + str(sizes_df.head()))

    sizes_df[SIZE_COLS].to_csv(SIZE_FILE, sep=" ", index=False, header=None)


gen_image_size_file()
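The sizes gathered above are what make the absolute-to-relative conversion possible. As a rough illustration of the arithmetic, the sketch below converts a single box; the helper name `to_relative_bbox` and the sample numbers are hypothetical, while the argument names follow the column names used for the bounding box file in the next step.

```python
def to_relative_bbox(x_abs, y_abs, bbox_width, bbox_height, img_width, img_height):
    """Convert an absolute (x, y, width, height) box to relative corner coordinates."""
    xmin = x_abs / img_width
    ymin = y_abs / img_height
    xmax = (x_abs + bbox_width) / img_width
    ymax = (y_abs + bbox_height) / img_height
    return xmin, ymin, xmax, ymax


# Hypothetical 500x400 image with a box at (100, 80) that is 200 wide and 160 tall
print(to_relative_bbox(100, 80, 200, 160, 500, 400))
# prints (0.2, 0.2, 0.6, 0.6)
```

Every value ends up between 0 and 1 as long as the box lies inside the image, which is the property the RecordIO format relies on.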

Step 9: Generate the list files used to produce the RecordIO files

def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print(
            "\n{}:\n---------\ntotal:{}\ntrain_df:{}\ntest_df:{}".format(
                lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)
            )
        )
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        train_df = pd.concat([train_df, lbl_train_df])
        test_df = pd.concat([test_df, lbl_test_df])
    return train_df, test_df


def gen_list_files():
    # use generated sizes file
    sizes_df = pd.read_csv(
        SIZE_FILE, sep=" ", names=["image_pretty_name", "width", "height"], header=None
    )
    bboxes_df = pd.read_csv(
        BBOX_FILE,
        sep=" ",
        names=["image_pretty_name", "x_abs", "y_abs", "bbox_width", "bbox_height"],
        header=None,
    )
    split_df = pd.read_csv(
        SPLIT_FILE, sep=" ", names=["image_pretty_name", "is_training_image"], header=None
    )
    print(IMAGE_FILE)
    images_df = pd.read_csv(
        IMAGE_FILE, sep=" ", names=["image_pretty_name", "image_file_name"], header=None
    )
    print("num images total: " + str(images_df.shape[0]))
    image_class_labels_df = pd.read_csv(
        LABEL_FILE, sep=" ", names=["image_pretty_name", "class_id"], header=None
    )

    # Merge the metadata into a single flat dataframe for easier processing
    full_df = pd.DataFrame(images_df)
    full_df.reset_index(inplace=True)
    full_df = pd.merge(full_df, image_class_labels_df, on="image_pretty_name")
    full_df = pd.merge(full_df, sizes_df, on="image_pretty_name")
    full_df = pd.merge(full_df, bboxes_df, on="image_pretty_name")
    full_df = pd.merge(full_df, split_df, on="image_pretty_name")
    full_df.sort_values(by=["index"], inplace=True)

    # Define the bounding boxes in the format required by SageMaker's built in Object Detection algorithm.
    # the xmin/ymin/xmax/ymax parameters are specified as ratios to the total image pixel size
    full_df["header_cols"] = 2  # one col for the number of header cols, one for the label width
    full_df["label_width"] = 5  # number of cols for each label: class, xmin, ymin, xmax, ymax
    full_df["xmin"] = full_df["x_abs"] / full_df["width"]
    full_df["xmax"] = (full_df["x_abs"] + full_df["bbox_width"]) / full_df["width"]
    full_df["ymin"] = full_df["y_abs"] / full_df["height"]
    full_df["ymax"] = (full_df["y_abs"] + full_df["bbox_height"]) / full_df["height"]

    # object detection class id's must be zero based. map from
    # class_id's given by CUB to zero-based (1 is 0, and 200 is 199).

    if SAMPLE_ONLY:
        # grab a small subset of species for testing
        criteria = full_df["class_id"].isin(CLASSES)
        full_df = full_df[criteria]

    unique_classes = full_df["class_id"].drop_duplicates()
    sorted_unique_classes = sorted(unique_classes)

    id_to_zero = {}
    i = 0.0
    for c in sorted_unique_classes:
        id_to_zero[c] = i
        i += 1.0

    full_df["zero_based_id"] = full_df["class_id"].map(id_to_zero)

    full_df.reset_index(inplace=True)

    # use 4 decimal places, as it seems to be required by the Object Detection algorithm
    pd.set_option("display.precision", 4)

    train_df = []
    val_df = []

    if RANDOM_SPLIT:
        # split into training and validation sets
        train_df, val_df = split_to_train_test(full_df, "class_id", TRAIN_RATIO)

        train_df[IM2REC_SSD_COLS].to_csv(TRAIN_LST_FILE, sep="\t", float_format="%.4f", header=None)
        val_df[IM2REC_SSD_COLS].to_csv(VAL_LST_FILE, sep="\t", float_format="%.4f", header=None)
    else:
        train_df = full_df[(full_df.is_training_image == 1)]
        train_df[IM2REC_SSD_COLS].to_csv(TRAIN_LST_FILE, sep="\t", float_format="%.4f", header=None)

        val_df = full_df[(full_df.is_training_image == 0)]
        val_df[IM2REC_SSD_COLS].to_csv(VAL_LST_FILE, sep="\t", float_format="%.4f", header=None)

    print("num train: " + str(train_df.shape[0]))
    print("num val: " + str(val_df.shape[0]))
    return train_df, val_df


train_df, val_df = gen_list_files()
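Before converting the list files, it can be worth checking that each row has the expected shape. The sketch below assumes each row holds nine tab-separated fields (the row index followed by the eight IM2REC_SSD_COLS columns) and that valid boxes lie inside the image. The helper name `check_lst_line` and the two sample rows are hypothetical.

```python
def check_lst_line(line):
    """Parse one tab-separated .lst row and verify its bounding box is valid."""
    fields = line.rstrip("\n").split("\t")
    # Assumed layout: index, header_cols, label_width, class, xmin, ymin, xmax, ymax, path
    if len(fields) != 9:
        return False
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    inside_image = all(0.0 <= v <= 1.0 for v in (xmin, ymin, xmax, ymax))
    return inside_image and xmin < xmax and ymin < ymax


# Hypothetical rows: the first is well formed, the second has xmax greater than 1
good = "0\t2\t5\t0.0000\t0.1200\t0.2000\t0.8800\t0.9000\t017.Cardinal/img.jpg"
bad = "1\t2\t5\t0.0000\t0.1200\t0.2000\t1.0400\t0.9000\t017.Cardinal/img.jpg"
print(check_lst_line(good), check_lst_line(bad))
# prints True False
```

Running a check like this over every line of the training and validation list files is a cheap way to catch annotation problems before paying for a training job.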

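Step 10: Convert the data into RecordIO format

# Preview the last few rows of the training list file
!tail -3 $TRAIN_LST_FILE
# im2rec.py is the conversion script from Apache MXNet; this assumes a copy is saved at tools/im2rec.py
!python tools/im2rec.py --resize $RESIZE_SIZE --pack-label birds_ssd_sample $BASE_DIR/images/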

Step 11: Upload the RecordIO files to S3

Upload the training and validation data to the S3 bucket. This process is executed through multiple channels, which are essentially directories in the bucket that distinguish the types of data supplied to the algorithm. In the context of the object detection algorithm, these directories are named train and validation.

# Upload the RecordIO files to train and validation channels
train_channel = prefix + "/train"
validation_channel = prefix + "/validation"

sess.upload_data(path="birds_ssd_sample_train.rec", bucket=bucket, key_prefix=train_channel)
sess.upload_data(path="birds_ssd_sample_val.rec", bucket=bucket, key_prefix=validation_channel)

s3_train_data = "s3://{}/{}".format(bucket, train_channel)
s3_validation_data = "s3://{}/{}".format(bucket, validation_channel)

Step 12: Training the model

Subsequently, we specify an output location in S3 where the model artifacts will be deposited upon the completion of the training. These artifacts constitute the output of the algorithm's training job. Additionally, we obtain the URI for the Amazon SageMaker Object Detection Docker image. This ensures that the estimator utilizes the correct algorithm corresponding to the current region.

from sagemaker import image_uris

training_image = image_uris.retrieve(
    region=sess.boto_region_name, framework="object-detection", version="latest"
)
print(training_image)

s3_output_location = "s3://{}/{}/output".format(bucket, prefix)
od_model = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    volume_size=50,
    max_run=360000,
    input_mode="File",
    output_path=s3_output_location,
    sagemaker_session=sess,
)


Step 13: Define the hyperparameters

def set_hyperparameters(num_epochs, lr_steps):
    num_classes = classes_df.shape[0]
    num_training_samples = train_df.shape[0]
    print("num classes: {}, num training images: {}".format(num_classes, num_training_samples))

    od_model.set_hyperparameters(
        base_network="resnet-50",
        use_pretrained_model=1,
        num_classes=num_classes,
        mini_batch_size=16,
        epochs=num_epochs,
        learning_rate=0.001,
        lr_scheduler_step=lr_steps,
        lr_scheduler_factor=0.1,
        optimizer="sgd",
        momentum=0.9,
        weight_decay=0.0005,
        overlap_threshold=0.5,
        nms_threshold=0.45,
        image_shape=512,
        label_width=350,
        num_training_samples=num_training_samples,
    )


set_hyperparameters(100, "33,67")
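The `lr_scheduler_step` and `lr_scheduler_factor` values describe a step schedule for the learning rate. The sketch below shows how such a schedule could play out over the 100 epochs, under the assumption that the rate is multiplied by the factor once for each step boundary already reached. The helper `learning_rate_at` is hypothetical, and the exact epoch at which the built-in algorithm applies each drop may differ.

```python
def learning_rate_at(epoch, base_lr, lr_steps, factor):
    """Learning rate for an epoch under a step schedule.

    Assumes the rate is multiplied by `factor` once for every step
    boundary that has already been reached.
    """
    boundaries = [int(s) for s in lr_steps.split(",")]
    reached = sum(1 for b in boundaries if epoch >= b)
    return base_lr * (factor ** reached)


# Same values as the hyperparameters above: 0.001, "33,67", and 0.1
for epoch in (0, 33, 67, 99):
    print(epoch, learning_rate_at(epoch, 0.001, "33,67", 0.1))
```

Under this assumption, training starts at 0.001 and drops by a factor of ten at roughly one third and two thirds of the way through.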

Step 14: Define the data channels

With the hyperparameters configured, we define the data channels to pass to the algorithm. To do this, we create sagemaker.inputs.TrainingInput objects from our data channels and collect them in a simple dictionary that the algorithm consumes. You can optionally include a third channel named model for incremental training, which lets you continue training from a prior model.

train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

Step 15: Submit the training job

The training process encompasses several steps. First, the instances requested when creating the Estimator are provisioned and set up with the necessary libraries. Next, the data from our channels is downloaded onto the instance. Once this preparation is complete, the actual training begins.

%%time
od_model.fit(inputs=data_channels, logs=True)

Step 16: Once the training job is complete, it appears in the list of training jobs in the SageMaker console.

Host the Model

After the training is successfully completed, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This deployment enables us to make predictions, also known as inferences, from the model.

%%time
object_detector = od_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
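Once the endpoint is up, its responses need a little post-processing before they are useful. The sketch below assumes the response body is JSON with a `prediction` list whose entries are `[class_id, score, xmin, ymin, xmax, ymax]`, with coordinates relative to the image size. The helper `filter_detections`, the threshold, and the sample body are hypothetical.

```python
import json


def filter_detections(response_body, threshold, img_width, img_height):
    """Keep detections scoring at least `threshold` and convert boxes to pixels."""
    detections = json.loads(response_body)["prediction"]
    kept = []
    for class_id, score, xmin, ymin, xmax, ymax in detections:
        if score < threshold:
            continue
        kept.append({
            "class_id": int(class_id),
            "score": score,
            # Scale the relative coordinates back to pixel positions
            "box": (
                round(xmin * img_width),
                round(ymin * img_height),
                round(xmax * img_width),
                round(ymax * img_height),
            ),
        })
    return kept


# Hypothetical response body with one confident and one weak detection
sample_body = json.dumps({
    "prediction": [
        [0.0, 0.92, 0.25, 0.25, 0.75, 0.5],
        [3.0, 0.10, 0.0, 0.0, 0.5, 0.5],
    ]
})
print(filter_detections(sample_body, 0.5, 400, 200))
```

In the notebook, the body would come from calling the endpoint with an image, for example through the `invoke_endpoint` method of the `runtime` client created in Step 5. The class ids map back to the zero-based ids assigned when the list files were generated.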

Conclusion:


In summary, the outlined steps detail the procedure for preparing, training, and deploying a model through Amazon SageMaker. The process encompasses configuring and authenticating AWS services, transforming data into RecordIO format, and organizing it into distinct channels.

The resulting model artifacts are stored in an S3 bucket. Upon completing the training, the model is deployed as a real-time hosted endpoint, allowing for predictions or inferences. These steps provide a structured and effective approach to machine learning tasks within the AWS platform.
