Inference Support

On the Intel® Nervana™ Cloud, there are two ways to generate predictions from trained models: streaming inference and batch inference. The best choice depends on your application and on how much data you have ready.

If your evaluation data arrives periodically over time, or needs to be evaluated synchronously in real time, it makes more sense to generate predictions on the fly with a streaming inference workflow.

Alternatively, if you’ve already amassed a large volume of data that you’d like to evaluate asynchronously, batch inference makes more sense because you can get all the predictions in a single job.

In both approaches, you’ll first need to train a neural network (or import a pre-trained one).

Streaming Inference

In streaming inference, the first step is to deploy a trained model. This produces a new stream object that you can query to generate predictions. The stream continues to consume its assigned resources for as long as it is deployed. When you are finished with it, “undeploy” it to free those resources. Here’s a typical ncloud command sequence:

$ ncloud model deploy 2
|      id      |         presigned_token          |    status    |
|          23  |  8e517a6e6ed915103240c0c9ef0b4b  |   Deploying  |

$ ncloud stream predict 8e517a6e6ed915103240c0c9ef0b4b ./img1.jpg
{"outputs": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], "shape": [10, 1]}

# repeat predict calls on additional inputs as needed

$ ncloud stream undeploy 23
|      id      |    status     |
|          23  |  Undeploying  |
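
If you script this sequence, the stream ID and presigned token have to be pulled out of the table that ncloud prints. The following is a minimal Python sketch of that parsing step. The helper name parse_ncloud_table is hypothetical, and the sketch assumes the pipe-delimited layout shown above, which may differ from what your ncloud version prints.

```python
def parse_ncloud_table(text):
    """Parse pipe-delimited table text into a list of dicts keyed by header.

    Assumes the layout shown in the examples above: the first table line
    holds the column names and each later line holds one row.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Only consider lines that look like table rows.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip separator rows made only of dashes, if any are present.
        if all(set(cell) <= set("-+=") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


deploy_output = """
|      id      |         presigned_token          |    status    |
|          23  |  8e517a6e6ed915103240c0c9ef0b4b  |   Deploying  |
"""
stream = parse_ncloud_table(deploy_output)[0]
print(stream["id"], stream["presigned_token"])
```

The id value is what ncloud stream undeploy expects, while presigned_token is what ncloud stream predict expects.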

By default, each stream uses only a single CPU. The number of GPUs can be specified with the -g flag much like training.

The same model can be deployed multiple times in separate streams, which is useful if you want to increase throughput, or modify the data pre-processing or output post-processing and formatting. For the latter see Customizing Inference Flow.

Batch Inference

In batch inference, the first step is to upload the data you’d like predictions for, which creates a new volume (via ncloud volume upload or ncloud volume link). You then specify this volume along with the model to use, and a new batch prediction job is started. Because the job can run for a long time on a large volume, you can periodically check its status, stop it, and, once it completes, download the prediction results. Here’s an example ncloud command workflow:

$ ncloud volume upload tiny_image_volume
Created volume with ID 56.
3/3 Uploaded. 0 Failed.
|    failed    |      id      |   success    |  total_files  |
|           0  |          56  |           3  |            3  |

$ ncloud batch predict 2 56
|      id      |    status    |
|          21  |    Received  |

$ ncloud batch list
|      id      |   model_id   |    status    |
|          21  |           2  |      Queued  |

$ ncloud batch results 21

$ ncloud batch stop 21

The results will be a CSV file. The first column will be the example path name and the second column will be the output.
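
Once downloaded, the results file can be read with Python's standard csv module. The sketch below is only an illustration: read_batch_results is a hypothetical helper, the sample rows are made up, and it assumes the file has no header row. The output column is kept as a string because its exact format depends on the model and formatter in use.

```python
import csv
import io


def read_batch_results(csv_file):
    """Map each example path (column 1) to its output string (column 2)."""
    results = {}
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        path, output = row[0], row[1]
        results[path] = output
    return results


# Hypothetical file contents, for illustration only.
sample = io.StringIO("img1.jpg,4\nimg2.jpg,7\n")
print(read_batch_results(sample))
```

For a real file, open it with open(path, newline="") and pass the file object to the helper.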

Customizing Inference Flow

The default behavior may not suit your needs. The following steps can be customized:

  1. Preprocessing the raw byte stream input.
  2. Loading the neon model object.
  3. Performing the actual prediction.
  4. Post-processing and formatting the network output to be shown to the user.

Any number of the four steps can be overridden. Steps that are not customized use the default behavior. The process for overriding any of these steps is as follows:

  1. Create a Python file with the required file name (the name must match exactly).
  2. In that file define a function that performs the desired behavior. The expected inputs, outputs, and function signatures for each of the above four steps will be described in the following sections. If multiple steps are customized, please use the same file.
  3. A requirements.txt file may be included if your function requires additional packages.
  4. You can include additional files to use in your function. These files will be available in the path /code.
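
The way overridden and default steps combine can be pictured with the short Python sketch below. It is not the service's implementation: run_inference, the dictionaries, and the stand-in lambdas are all hypothetical, and only the four function names and their argument order come from the sections that follow.

```python
def run_inference(raw_input, model_path, custom, defaults):
    """Run the four steps, using a custom function where one is defined."""
    def pick(step):
        return custom.get(step, defaults[step])

    model = pick("init")(model_path)
    data = pick("preprocess")(raw_input, model)
    output = pick("predict")(data, model)
    return pick("postprocess")(output, raw_input)


# Stand-in defaults so the sketch runs without neon installed.
defaults = {
    "init": lambda path: {"path": path},
    "preprocess": lambda raw, model: [float(b) for b in raw],
    "predict": lambda data, model: [v / 255.0 for v in data],
    "postprocess": lambda output, raw: output,
}

# Override only the post-processing step; the other three fall back.
custom = {"postprocess": lambda output, raw: "max=%.1f" % max(output)}

print(run_inference(b"\x00\xff", "model.prm", custom, defaults))
```

Here only postprocess is replaced, so loading, preprocessing, and prediction still use the stand-in defaults.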

Making additional auxiliary files available comes in handy when doing things like evaluating NLP type models. These models often require an input vocabulary at inference time to map text strings to numeric word or sentence vectors in the same manner as accomplished during training time.

The custom Python file and any auxiliary files can be uploaded via the -f/--extra-files flag of the ncloud model deploy command. The argument can be a single file, a volume ID (corresponding to a volume previously created via ncloud volume upload or ncloud volume link), or a zip file. The contents must include the custom Python file, named exactly as required, and can optionally include auxiliary user data such as an input vocabulary file. Additional files can be included only if a volume ID or a zip file is specified on deploy. All of these files can be accessed by your custom functions through the path /code.
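
If you choose the zip file option, the archive can be assembled with Python's standard zipfile module. The sketch below is a hedged illustration: build_extra_files_zip is a hypothetical helper, and placing every file at the root of the archive is an assumption about the layout the service expects, not something this document confirms.

```python
import os
import zipfile


def build_extra_files_zip(paths, zip_path):
    """Write the given files into a zip archive, flattened to its root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Assumption: files are stored without their directory prefix.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path
```

Pass it the path of your custom Python file plus any auxiliary files, then hand the resulting archive to the -f/--extra-files flag.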

Initial Model Loading

If you want complete control over how your trained model gets deserialized and loaded during the deployment process, you can define your own loading function in the custom Python file. It must have the following input signature: def init(prm). prm will be a string giving the local path to the serialized model file being deployed. Your custom function must return a neon Model instance.

Input Data Preprocessing

If you want complete control over how input data gets manipulated before being passed into your trained neural network for streaming or batch inference, you can define a preprocessing function in the custom Python file. The preprocessing function must have the following input signature: def preprocess(input, model). input will be the raw contents of the file transferred during the predict call, as a byte stream. model will be a neon Model instance. Your custom function must return a neon backend Tensor instance of the appropriate shape.

The following is an example of the preprocess function. It takes the byte stream of the image, ensures the proper number of channels and shape, and finally converts it to a neon backend tensor.

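import io

import numpy
from PIL import Image

def preprocess(x, model):
    dtype = numpy.dtype("float32")
    in_shape = model.layers.in_shape  # (channels, height, width)
    # Assumption: the original right-hand side was lost in this listing;
    # decoding with PIL matches the convert/resize calls below.
    res = Image.open(io.BytesIO(x))
    if len(in_shape) == 3 and in_shape[0] == 1:
        out_mode = "L"  # single-channel grayscale
    elif len(in_shape) == 3 and in_shape[0] == 3:
        out_mode = "RGB"
    else:
        raise ValueError("unsupported input shape: %r" % (in_shape,))
    res = res.convert(mode=out_mode)
    res = res.resize((in_shape[2], in_shape[1]))  # PIL expects (width, height)
    res = numpy.array(res).astype(dtype)
    if out_mode == "RGB":
        res = res.transpose(2, 0, 1)  # HWC -> CHW
    res = res.reshape(-1, 1)
    # Assumption: the listing ended without a return; the text says the
    # result is converted to a neon backend tensor.
    return model.be.array(res)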

Canned Input Preprocessors

Image and JSON preprocessors are available. These will be run if no preprocessing function is defined. The preprocessor is chosen based on MIME type.
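
The selection logic can be pictured with the Python sketch below. It is an illustration only: choose_preprocessor is a hypothetical helper, and guessing the MIME type from the file name with the standard mimetypes module is an assumption. The service may determine the MIME type in another way.

```python
import mimetypes


def choose_preprocessor(filename):
    """Return "image" or "json" based on the MIME type guessed from the name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError("cannot determine MIME type for %r" % filename)
    if mime_type.startswith("image/"):
        return "image"
    if mime_type == "application/json":
        return "json"
    raise ValueError("no canned preprocessor for MIME type %r" % mime_type)


print(choose_preprocessor("img1.jpg"))
```

Inputs that match neither canned preprocessor are the cases where a custom preprocess function is needed.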

Prediction Generation

If you want complete control over how a preprocessed input passes through your trained and loaded model to generate an inference output, you can insert your own python script, which will be called instead of the standard neon model.fprop(data) function. The function must have the following input signature: def predict(data, model). data will be the preprocessed input to be predicted on and should be a neon Tensor object of the appropriate shape. model will be the loaded neon Model object. Unless you are also specifying a custom post-processing function, you should return a numpy ndarray representing the last layer outputs from your network (to match what model.fprop() returns).

One use case for overriding the predict function is beam search. The following is an example.

def predict(data, model):
    """Generates prediction on data."""
    return model.get_outputs_beam(data, num_beams=2)

Predicted Output Post-processing

If you’d like complete control over how the raw last-layer activations from your network are interpreted and manipulated before being passed back to the user during a streaming or batch inference predict call, you can define a custom function. The function must have the following input signature: def postprocess(output, raw_input). output will be the raw outputs of the network (like those returned by neon’s Model.fprop(), but converted to a host numpy ndarray). raw_input will be the raw contents of the file transferred during the predict call, as a byte stream. This input is useful if you have initial metadata that you want to include in your output. Whatever your custom function returns will then be passed back to the user as-is (JSON encoded strings are recommended).

Below is an example of a postprocess function. Often a user wants to return the colloquial label of a prediction rather than the probability distribution. The labels.pkl file contains a dictionary object mapping class indices to class labels. This pickle file (and any additional files) can be included in the zip file uploaded on deploy and made available for use during pre-processing and post-processing.

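import pickle

import numpy

# Load the index-to-label mapping once, when the module is imported.
with open('/code/labels.pkl', 'rb') as labels_file:
    labels = pickle.load(labels_file)

def postprocess(x, raw_input):
    index = int(numpy.argmax(x))
    return labels[index]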
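
Since JSON encoded strings are the recommended return value, a variant that returns the label together with its score and some input metadata might look like the sketch below. The LABELS table and the field names in the returned object are hypothetical. The function accepts either a numpy array or a plain list so that it can be tried without neon or numpy installed.

```python
import json

# Hypothetical label table; in practice this could be loaded from a file
# under /code, as in the labels.pkl example above.
LABELS = {0: "cat", 1: "dog", 2: "bird"}


def postprocess(output, raw_input):
    """Return the top class as a JSON string, with the input size as metadata."""
    # Accept a numpy array (flattened via ravel) or a plain sequence.
    flat = output.ravel() if hasattr(output, "ravel") else output
    values = [float(v) for v in flat]
    index = max(range(len(values)), key=values.__getitem__)
    return json.dumps({
        "label": LABELS[index],
        "probability": values[index],
        "input_bytes": len(raw_input),
    })


print(postprocess([0.1, 0.7, 0.2], b"abc"))
```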

Canned Output Formatters

The output formatters listed below are used to post-process the final layer network values during inference. The formatter chosen may influence the type of the value ultimately returned to the user.

If no formatter is explicitly specified, the “raw” formatter will be used.

Additional arguments for the formatters should be specified in a JSON object string, with the member name as key.

raw
The default formatter. Passes through the raw output values from the last layer of the network as a list.

Optional Arguments

  • format_as: string. How to format the output values. Defaults to "json" if not specified, but other valid values include "csv", and "tsv" for delimited outputs.


Three-node raw output, format_as not specified:

{
  "outputs": [0.32123, 0.2915, 0.38727]
}

Five-node raw output, {"format_as": "csv"}:

0.12345, 0.212, 0.1005, 0.01238, 0.55167
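
The effect of format_as can be mimicked in a few lines of Python. This is a sketch of the documented behavior rather than the service's own code: format_raw is a hypothetical name, and the comma-plus-space separator is simply copied from the CSV example above.

```python
import json


def format_raw(values, format_as="json"):
    """Format last-layer values as JSON, CSV, or TSV text."""
    if format_as == "json":
        return json.dumps({"outputs": list(values)})
    separators = {"csv": ", ", "tsv": "\t"}
    if format_as not in separators:
        raise ValueError("unsupported format_as: %r" % format_as)
    return separators[format_as].join(str(v) for v in values)


print(format_raw([0.32123, 0.2915, 0.38727]))
print(format_raw([0.12345, 0.212, 0.1005, 0.01238, 0.55167], "csv"))
```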


Useful when each node of the last layer represents the likelihood of that particular outcome (can be interpreted probabilistically).

Optional Arguments

  • num_preds: integer. Limit the number of returned results to just the top num_preds most likely values. Defaults to 1.
  • probability: boolean. Interpret and include node output value as a probability. Defaults to true.
  • label: boolean. Include label information in the result (requires that the trained model is aware of this information). Defaults to false.
  • index: boolean. Include node offset index in the result. Defaults to false.
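
How these options interact can be sketched in Python as follows. This is an illustration, not the service's implementation: format_classification and its labels argument are hypothetical, and the input values were chosen to match the five-node example in this section ("bike" and "plane" are made-up labels for the two nodes that example does not show).

```python
def format_classification(values, labels=None, num_preds=1,
                          probability=True, label=False, index=False):
    """Build a dict holding the num_preds most likely outcomes."""
    # Node indices sorted from most to least likely.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    predictions = []
    for i in order[:num_preds]:
        entry = {}
        if probability:
            entry["probability"] = values[i]
        if label:
            if labels is None:
                raise ValueError("label=True requires a list of labels")
            entry["label"] = labels[i]
        if index:
            entry["index"] = i
        predictions.append(entry)
    return {"predictions": predictions}


values = [0.0, 0.088, 0.0, 0.712, 0.2]
labels = ["bike", "boat", "plane", "car", "truck"]
print(format_classification(values, labels, num_preds=3, label=True, index=True))
```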


Three-node classification output, no arguments specified:

{
  "predictions": [{
    "probability": 0.643225
  }]
}

Five-node classification output, {"num_preds": 3, "label": true, "index": true}:

{
  "predictions": [{
    "probability": 0.712,
    "label": "car",
    "index": 3
  }, {
    "probability": 0.2,
    "label": "truck",
    "index": 4
  }, {
    "probability": 0.088,
    "label": "boat",
    "index": 1
  }]
}

Custom Code

Sometimes you may need to extend neon with custom layers, activation or cost functions, or other supplements. In these cases, you can link your custom repository directly and use it as the basis for inference for a model. You’ll need to use the --custom-code-url flag, and potentially the --custom-code-commit flag, to pass the path (and branch/commit) of your repository to the ncloud model deploy command.
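
When deployments are scripted, the flags mentioned in this document can be assembled into one argument list. The Python sketch below is a hedged illustration: build_deploy_command is a hypothetical helper, the repository URL is a placeholder, and placing the flags after the model ID with space-separated values is an assumption, so check ncloud's own help output for the exact syntax.

```python
def build_deploy_command(model_id, gpus=None, extra_files=None,
                         custom_code_url=None, custom_code_commit=None):
    """Assemble an `ncloud model deploy` argument list."""
    command = ["ncloud", "model", "deploy", str(model_id)]
    if gpus is not None:
        command += ["-g", str(gpus)]
    if extra_files is not None:
        command += ["-f", str(extra_files)]
    if custom_code_url is not None:
        command += ["--custom-code-url", custom_code_url]
    if custom_code_commit is not None:
        command += ["--custom-code-commit", custom_code_commit]
    return command


# Placeholder repository URL and branch name, for illustration only.
print(build_deploy_command(2, custom_code_url="https://example.com/my-neon-fork.git",
                           custom_code_commit="my-branch"))
```

The returned list can be passed to subprocess.run to execute the command.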