How to create an import file for Vertex AI
Import file needed to create a dataset in Vertex AI

Right before the summer vacation, Google released a promising pipeline tool for the ML community: Vertex AI. I've played around with it and must say it is impressive.

The tool is comprehensive and easy to use. If you are a novice, you can treat it as a drag-and-drop tool and train your first ML model without a single line of code. If you know your ML stuff, you keep full control of everything you are used to, but now with a pipeline that aligns all the steps in one tool: gathering your dataset, working your magic on feature engineering, training the model, and deploying it. I've found that Vertex AI saves me tons of time, since I no longer have to hold my own pipeline together.

So far, I have only encountered one roadblock. Google has introduced something called an import file. This is a .csv file telling Vertex AI where to fetch your data. At a minimum, it must also state the label for each data point. If you like, you can add information such as how to split the data into training, validation, and test sets, but if you don't, Vertex AI will handle that automatically for you.
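For single-label image classification, each row of the import file holds the Cloud Storage path to an image and its label, optionally preceded by a column stating how the row should be used. If memory serves, that optional first column takes values such as training, validation, or test; the bucket and file names below are just placeholders:

```
gs://my-bucket/real-world/img001.jpg,real-world
test,gs://my-bucket/ai-generated/img002.jpg,ai-generated
```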

Below, I am sharing my Python code to generate an import file. In my example I am preparing for single-label image classification, for example a classifier that predicts whether an image shows a cat or a dog. Since I am building on an old project of mine, though, I am classifying whether an image of a face is AI-generated or shows an actual, real person.

Preparation:

  • Create a storage bucket in your GCP account (my example: real-or-fake-faces-bucket)
  • Create folders corresponding to your labels (my example: real-world and ai-generated)
  • Upload your images to the corresponding folders in the bucket
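If you prefer the terminal to the Cloud Console, the preparation steps above could look something like this with the gsutil tool (bucket and folder names are from my example; adjust to your own):

```shell
# Create the storage bucket (bucket names must be globally unique)
gsutil mb gs://real-or-fake-faces-bucket

# Upload the local image folders; the folder names become the label prefixes
gsutil cp -r real-world ai-generated gs://real-or-fake-faces-bucket/
```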

Note!

  • The prefix corresponds to the folder name in your bucket
  • You will need to authenticate for the google.cloud package to work

From terminal:

> gcloud auth application-default login

The Python code to generate your import file to use in Vertex AI to create the dataset for training an ML-model:

from google.cloud import storage
import pandas as pd


BUCKET='real-or-fake-faces-bucket'
DELIMITER='/'
PREFIX_REAL_WORLD='real-world/'
PREFIX_AI_GENERATED='ai-generated/'
FOLDERS = [PREFIX_REAL_WORLD, PREFIX_AI_GENERATED]


print(f'BUCKET : {BUCKET}')
print('Connecting to GCP Storage')
client = storage.Client()
bucket = client.get_bucket(BUCKET)


print('Fetching list of objects to generate the import file')

data = []

for folder in FOLDERS:

    blobs = client.list_blobs(BUCKET, prefix=folder, delimiter=DELIMITER)

    for blob in blobs:
        # Skip the folder placeholder object itself, if the bucket has one
        if blob.name.endswith(DELIMITER):
            continue
        # Slice off the trailing delimiter, e.g. 'real-world/' -> 'real-world'
        label = folder[:-1]
        data.append({
            'FILE_PATH': f'gs://{BUCKET}/{blob.name}',
            'LABEL': label
        })


df = pd.DataFrame(data)

print('Exporting import file data to CSV-file')
df.to_csv('import_file_faces.csv', index=False, header=False)
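If you want to sanity-check the path-and-label logic without touching GCP, you can run the same loop over a hand-written list of object names. Everything below is made up for illustration; it only mimics what list_blobs would return:

```python
BUCKET = 'real-or-fake-faces-bucket'
FOLDERS = ['real-world/', 'ai-generated/']

# Hypothetical object names, standing in for a real bucket listing
fake_listing = {
    'real-world/': ['real-world/img001.jpg', 'real-world/img002.jpg'],
    'ai-generated/': ['ai-generated/img001.jpg'],
}

rows = []
for folder in FOLDERS:
    label = folder[:-1]  # strip the trailing '/' to get the label
    for name in fake_listing[folder]:
        rows.append(f'gs://{BUCKET}/{name},{label}')

for row in rows:
    print(row)
# First row: gs://real-or-fake-faces-bucket/real-world/img001.jpg,real-world
```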

When you have your .csv file, upload it to your bucket so Vertex AI can fetch it. This is how your dataset will be created.
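Uploading the import file to the bucket can also be done from the terminal (again assuming my example bucket name):

```shell
# Copy the generated import file to the bucket so Vertex AI can reach it
gsutil cp import_file_faces.csv gs://real-or-fake-faces-bucket/
```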

VOILA, now you are well on your way to just clicking a button to start training your model!


For the Jayway Blog,

Silvia Man,
Senior software engineer

