Out of the box, the Nervana cloud supports the following commonly used stock datasets:
- cifar10 60,000-record, 10-class image classification dataset.
- cifar100 60,000-record, 100-class image classification dataset.
- housing 506-record tabular dataset giving prices and other attributes of houses in Boston.
- flickr8k 8,000-image dataset with multiple text captions per image.
- flickr30k 31,783-image dataset with 158,915 text captions (extension of flickr8k).
- mscoco 300,000-image dataset with object segmentation, classification, and textual caption information.
- iris 150-sample tabular dataset on 3 different species of flower.
- mnist 60,000-record handwritten digit image dataset.
- mobydick text content of the book Moby Dick.
- pascal VOC 11,500-image dataset with annotated object classifications and segmentations.
- librispeech 1,000-hour English speech audio dataset.
To use one of these datasets in your Python script, you only need to call the
associated loader. For example:
(X_train, y_train), (X_test, y_test), nclass = load_mnist(args.data_dir)
If you have already obtained a license to use the ImageNet i1k dataset, please let us know and we can make it available to your tenant.
NOTE: The following sections describe the use of dataset commands; however, the dataset command line interface (ncloud dataset commands) is disabled at this time. This functionality has been replaced by volume commands, which support everything that dataset commands provided, plus more. Users can still run training jobs against existing datasets.
You can use any data type in the cloud, but you are expected to handle preprocessing and batching yourself, either before uploading or as part of the training script.
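As a rough illustration of what user-side preprocessing and batching might look like, here is a minimal sketch in plain Python. The helper names `normalize` and `minibatches` are hypothetical and are not part of the Nervana cloud or ncloud tooling; min-max scaling is only one possible preprocessing choice, and a real training script would more likely operate on arrays than on nested lists.

```python
from typing import Iterator, List, Sequence, Tuple


def normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Min-max scale every feature column to the range [0, 1].

    Hypothetical helper for illustration only.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # A constant column has zero span; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(col) - low) or 1.0 for col, low in zip(columns, lows)]
    return [
        [(value - low) / span for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


def minibatches(
    features: Sequence, labels: Sequence, batch_size: int
) -> Iterator[Tuple[Sequence, Sequence]]:
    """Yield consecutive (features, labels) slices of at most batch_size items.

    Hypothetical helper for illustration only.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        yield features[start:stop], labels[start:stop]


if __name__ == "__main__":
    raw = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
    targets = [0, 1, 0]
    scaled = normalize(raw)
    for batch_x, batch_y in minibatches(scaled, targets, batch_size=2):
        print(batch_x, batch_y)
```

Doing this before upload keeps the training script simple; doing it inside the training script avoids having to re-upload the data whenever the preprocessing changes.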