Dataset Support

Included Datasets

Out of the box, the Nervana cloud already supports the following commonly used stock datasets:

  • cifar10: 60,000-record, 10-class image classification dataset
  • cifar100: 60,000-record, 100-class image classification dataset
  • housing: 506-record tabular dataset giving prices and other attributes of houses in Boston
  • flickr8k: 8,000-image dataset with multiple text captions per image
  • flickr30k: 31,783-image, 158,915-caption dataset (extension of flickr8k)
  • mscoco: 300,000-image dataset with object segmentation, classification, and textual caption annotations
  • iris: 150-sample tabular dataset on 3 different species of iris flower
  • mnist: 60,000-record handwritten digit image dataset
  • mobydick: text content of the book Moby Dick
  • pascal: 11,500-image VOC dataset with annotated object classifications and segmentations
  • librispeech: 1,000-hour English speech audio dataset

To use one of these datasets in your Python script, you just need to load it via the associated helper. For example:

(X_train, y_train), (X_test, y_test), nclass = load_mnist(args.data_dir)
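As a slightly fuller sketch, assuming the neon framework's 1.x-era API (where load_mnist and ArrayIterator live in neon.data), the loaded arrays can be wrapped into minibatch iterators for training. The batch size and data directory below are placeholders, not job defaults:

from neon.backends import gen_backend
from neon.data import ArrayIterator, load_mnist

# a backend must exist before iterators are built; batch size is a placeholder
be = gen_backend(backend='cpu', batch_size=128)

# load the stock dataset; pass the data directory supplied to your job
(X_train, y_train), (X_test, y_test), nclass = load_mnist(path='/data')

# wrap the in-memory arrays so a model can consume them in minibatches
train_set = ArrayIterator(X_train, y_train, nclass=nclass)
valid_set = ArrayIterator(X_test, y_test, nclass=nclass)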

If you have already obtained a license to use the imagenet i1k dataset, please let us know and we can make that available for your tenant.

Custom Datasets

NOTE The following sections describe the use of dataset commands; however, the dataset command line interface (ncloud dataset commands) is currently disabled. This functionality has been replaced by volume commands, which support everything datasets provided, plus more. Users can still run training jobs against existing datasets.

It is possible to use any data type in the cloud, but the expectation is that preprocessing and batching are handled by the user, either prior to uploading or as part of the training script.
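For instance, a minimal local preprocessing sketch might resize a folder of images and pack them into a single NumPy array that a training script can load directly. The directory layout, file names, and 32x32 target size here are all hypothetical; adapt them to your own data:

import os
import numpy as np
from PIL import Image

def images_to_array(src_dir, out_path, size=(32, 32)):
    """Resize every image under src_dir and stack the results into one array."""
    rows = []
    for fname in sorted(os.listdir(src_dir)):
        if not fname.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        img = Image.open(os.path.join(src_dir, fname)).convert("RGB").resize(size)
        # flatten each image to one row so the result is (nimages, npixels)
        rows.append(np.asarray(img, dtype=np.uint8).reshape(-1))
    data = np.stack(rows)
    np.save(out_path, data)
    return data

# hypothetical paths; run this locally before uploading the resulting .npy file
images_to_array("raw_images", "train_data.npy")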

Linking Your S3 Dataset Bucket

NOTE: Dataset commands are disabled at this time; use volume commands.

If your data already resides on AWS, you can avoid ncloud dataset upload (a potentially time-consuming way to get data into the Nervana cloud) by linking your datasets with ncloud dataset link, so that they can be read directly from S3.
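A hypothetical invocation follows; the bucket path is a placeholder, and the exact argument form may differ between ncloud releases:

ncloud dataset link s3://customer-bucket-name/my-dataset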

To facilitate this, you'll first need to modify your S3 access policy to grant the Nervana cloud read access to the bucket in which your dataset resides.

From the AWS web UI, select S3, highlight the relevant bucket, then choose Properties -> Permissions -> Add Bucket Policy. Paste in the following snippet, replacing references to customer-bucket-name as appropriate. The first statement lets the Nervana cloud list the bucket's contents; the second lets it read the objects inside.

{
    "Version": "2012-10-17",
    "Id": "Policy1458087577273",
    "Statement": [
        {
            "Sid": "Stmt1458087543219",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::369948638977:user/helium-prod"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::customer-bucket-name"
        },
        {
            "Sid": "Stmt1458087575474",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::369948638977:user/helium-prod"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::customer-bucket-name/*"
        }
    ]
}

If your S3 bucket resides in a region other than us-west-1 (N. California), you will also need to specify that region at linking time by passing its string name via the --region flag of the ncloud dataset link command.
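For example, for a bucket in us-east-1 (again a sketch: the bucket path is a placeholder and the exact syntax may vary by ncloud release):

ncloud dataset link s3://customer-bucket-name/my-dataset --region us-east-1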