Running Multiple Simulations

FLAME GPU 2 provides CUDAEnsemble as a facility for executing batch runs of multiple configurations of a model.

Creating a CUDAEnsemble

An ensemble is a group of simulations executed in batch, optionally using all available GPUs. To use an ensemble, construct a RunPlanVector and CUDAEnsemble instead of a CUDASimulation.

First, define the model as usual, then create a CUDAEnsemble:

flamegpu::ModelDescription model("example model");

// Fully define the model
...

// Create a CUDAEnsemble
flamegpu::CUDAEnsemble ensemble(model);
// Handle any runtime args
ensemble.initialise(argc, argv);

Creating a RunPlanVector

RunPlanVector is a data structure which can be used to build run configurations, specifying the random simulation seed, the number of steps and initial environment properties for each run. It is a std::vector of RunPlan (which was introduced in the previous chapter), with some additional methods included to enable easy configuration of batches of runs.

Operations performed on the vector apply to all elements, whereas individual elements can also be updated directly.

It is also possible to specify a subdirectory to which a particular run's logging output should be sent; this can be useful when constructing large batch runs or parameter sweeps:

// Create a template run plan
flamegpu::RunPlanVector runs_control(model, 128);
// Ensure that repeated runs use the same Random values within the RunPlans
runs_control.setRandomPropertySeed(34523);
{  // Initialise values across the whole vector
    // All runs require 3600 steps
    runs_control.setSteps(3600);
    // Random seeds for each run should take the values (12, 13, 14, 15, etc)
    runs_control.setRandomSimulationSeed(12, 1);
    // Initialise environment property 'lerp_float' with values linearly interpolated between 1 and 256 across the runs
    runs_control.setPropertyLerpRange<float>("lerp_float", 1.0f, 256.0f);
    // Initialise environment property 'step_int' with values 0, 2, 4, 6, 8, etc
    runs_control.setPropertyStep<int>("step_int", 0, 2);

    // Initialise environment property 'random_int' with values uniformly distributed in the range [0, 10]
    runs_control.setPropertyUniformRandom<int>("random_int", 0, 10);
    // Initialise environment property 'random_float' with values from the normal dist (mean: 1, stddev: 2)
    runs_control.setPropertyNormalRandom<float>("random_float", 1.0f, 2.0f);
    // Initialise environment property 'random_double' with values from the log normal dist (mean: 2, stddev: 1)
    runs_control.setPropertyLogNormalRandom<double>("random_double", 2.0, 1.0);

    // Initialise environment property array 'int_array_3' with [1, 3, 5]
    runs_control.setProperty<int, 3>("int_array_3", {1, 3, 5});

    // Iterate vector to manually assign properties
    for (flamegpu::RunPlan &plan : runs_control) {
        // e.g. manually set all 'manual_float' to 32
        plan.setProperty<float>("manual_float", 32.0f);
    }
}
// Create an empty RunPlanVector, that we will construct by mutating and copying runs_control several times
flamegpu::RunPlanVector runs(model, 0);
for (const float &mutation : {0.2f, 0.5f, 0.8f, 1.5f, 1.9f, 2.5f}) {
    // Dynamically generate a name for mutation sub directory
    char subdir[24];
    snprintf(subdir, sizeof(subdir), "mutation_%g", mutation);
    runs_control.setOutputSubdirectory(subdir);
    // Fill in specialised parameters
    runs_control.setProperty<float>("mutation", mutation);
    // Append to the main run plan vector
    runs += runs_control;
}

Creating a Logging Configuration

Next you need to decide which data will be collected, as it is not possible to export full agent states from a CUDAEnsemble.

A short example is shown below, however you should refer to the previous chapter for the comprehensive guide.

One benefit of using CUDAEnsemble to carry out experiments is that the specific RunPlan data is included in each log file, allowing the logs to be automatically processed and used for reproducible research. However, this does not identify the particular version or build of your model.

If you wish to post-process the logs programmatically, CUDAEnsemble::getLogs() can be used to fetch a map of RunLog, where each key corresponds to the index of a successful run within the input RunPlanVector (if a simulation run failed it will not have a log within the map).

Agent data is logged per agent state, so for agents with multiple states the logging config must be specified for each state that is to be logged.

// Specify the desired LoggingConfig or StepLoggingConfig
flamegpu::StepLoggingConfig step_log_cfg(model);
{
    // Log every step (setFrequency() is not available on LoggingConfig, which is used for exit logs)
    step_log_cfg.setFrequency(1);
    step_log_cfg.logEnvironment("random_float");
    // Include the current number of 'boid' agents, within the 'default' state
    step_log_cfg.agent("boid").logCount();
    // Include the current mean speed of 'boid' agents, within the 'alive' state
    step_log_cfg.agent("boid", "alive").logMean<float>("speed");
}
flamegpu::LoggingConfig exit_log_cfg(model);
exit_log_cfg.logEnvironment("lerp_float");

// Pass the logging configs to the CUDAEnsemble
ensemble.setStepLog(step_log_cfg);
ensemble.setExitLog(exit_log_cfg);

Configuring & Running the Ensemble

Now you can execute the CUDAEnsemble from the command line; it will execute the runs and log the collected data to file. The following arguments are supported:

--help / -h
    Print the command line guide and exit.

--devices <device ids> / -d <device ids>
    Comma separated list of GPU ids to be used to execute the ensemble. By default all devices will be used.

--concurrent <runs> / -c <runs>
    The number of concurrent simulations to run per GPU. By default 4 concurrent simulations will run per GPU.

--out <directory> <format> / -o <directory> <format>
    Directory and format (JSON/XML) for ensemble logging.

--quiet / -q
    Don’t print ensemble progress to console.

--verbose / -v
    Print config, progress and timing (-t) information to console.

--timing / -t
    Output timing information to console at exit.

--silence-unknown-args / -u
    Silence warnings for unknown arguments passed after this flag.

--error <error level> / -e <error level>
    The ErrorLevel to use: 0, 1, 2, “off”, “slow” or “fast”. By default the ErrorLevel will be set to “slow” (1).

--standby
    Allow the operating system to enter standby during ensemble execution. The standby blocking feature is currently only supported on Windows, where it is enabled by default.
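
For example, assuming the model has been compiled into an executable named ensemble_example (a hypothetical name used for illustration), the following invocation would run the ensemble on GPUs 0 and 1, with 2 concurrent simulations per GPU, writing JSON logs to a results directory and printing timing information at exit:

./ensemble_example --devices 0,1 --concurrent 2 --out results json --timing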

You may also wish to specify your own defaults by setting the values prior to calling initialise():

// Fully declare a ModelDescription, RunPlanVector and LoggingConfig/StepLoggingConfig
...

// Create a CUDAEnsemble to run the RunPlanVector
flamegpu::CUDAEnsemble ensemble(model);

// Override config defaults
ensemble.Config().out_directory = "results";
ensemble.Config().out_format = "json";
ensemble.Config().concurrent_runs = 1;
ensemble.Config().timing = true;
ensemble.Config().error_level = flamegpu::CUDAEnsemble::EnsembleConfig::Fast;
ensemble.Config().devices = {0};

// Handle any runtime args
// If initialise() is called before the overrides above, the overridden values will take precedence over any matching command line arguments
ensemble.initialise(argc, argv);

// Pass the logging configs to the CUDAEnsemble
ensemble.setStepLog(step_log_cfg);
ensemble.setExitLog(exit_log_cfg);

// Execute the ensemble using the specified RunPlans
const unsigned int errs = ensemble.simulate(runs);

// Fetch the RunLogs of successful runs
const std::map<unsigned int, flamegpu::RunLog> &logs = ensemble.getLogs();
for (const auto &[plan_id, log] : logs) {
    // Post-process the logs
    ...
}

// Ensure profiling / memcheck work correctly (and trigger MPI_Finalize())
flamegpu::util::cleanup();

Error Handling Within Ensembles

CUDAEnsemble has three supported levels of error handling.

Level  Name  Description
0      Off   Runs which fail do not cause an exception to be raised.
1      Slow  If any runs fail, an EnsembleError will be raised after all runs have been attempted.
2      Fast  An EnsembleError will be raised as soon as a failed run is detected, cancelling remaining runs.

The default error level is “Slow” (1), which will cause an exception to be raised if any of the simulations fail to complete. However, all simulations will be attempted first, so partial results will be available.

Alternatively, when the error level is set to “Off” (0), simulate() returns the number of errors, so failed runs can be detected manually by checking that the return value of simulate() is not zero.
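
As a minimal sketch, reusing the ensemble and runs objects from the earlier examples, manual error checking with the error level set to “Off” might look like:

// Disable automatic exceptions; failed runs are counted instead
ensemble.Config().error_level = flamegpu::CUDAEnsemble::EnsembleConfig::Off;
// Handle runtime args, pass logging configs, etc as shown above
...
// simulate() returns the number of runs which did not complete successfully
const unsigned int failed_runs = ensemble.simulate(runs);
if (failed_runs != 0) {
    fprintf(stderr, "%u ensemble runs failed\n", failed_runs);
}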

Distributed Ensembles via MPI

For particularly expensive batch runs you may wish to distribute the workload across multiple nodes within an HPC cluster. This can be achieved via Message Passing Interface (MPI) support. This feature is supported by both the C++ and Python interfaces to FLAMEGPU, however it is not available in pre-built binaries/packages/wheels and must be compiled from source as required.

To enable MPI support, FLAMEGPU should be configured with the CMake flag FLAMEGPU_ENABLE_MPI enabled. When compiled with this flag, CUDAEnsemble will use MPI. The mpi member of CUDAEnsemble::EnsembleConfig will be set to true if MPI support was enabled at compile time.
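
For example, when building from source, the flag can be passed at configure time (a sketch assuming an out-of-source build directory and that a suitable MPI implementation is available on the system):

cmake .. -DFLAMEGPU_ENABLE_MPI=ON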

It is not necessary to use a CUDA-aware MPI library, as CUDAEnsemble will make use of all available GPUs by default using its existing multi-GPU support (as opposed to GPU direct MPI communication). Hence it is only necessary to launch 1 process per node, although requesting multiple CPU cores in an HPC environment is still recommended (e.g. a minimum of N+1, where N is the number of GPUs in the node).

If more than one MPI process is launched per node, the available GPUs will be load-balanced across the MPI ranks. If more MPI processes are launched per node than there are GPUs available, a warning will be issued and the surplus MPI ranks will not participate in execution of the ensemble.

Note

MPI implementations differ in how to request 1 process per node when launching an MPI job. A few examples are provided below:

  • Open MPI: mpirun --pernode or mpirun --npernode 1

  • MVAPICH2: mpirun_rsh -ppn 1

  • Bede: bede-mpirun --bede-par 1ppn

When executing with MPI, CUDAEnsemble will execute the input RunPlanVector across all available GPUs and concurrent runs, automatically assigning jobs when a runner becomes free. This should achieve better load balancing than manually dividing work across nodes, but may lead to increased HPC queue times as the nodes must be available concurrently.
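
For example, with Open MPI and a hypothetical ensemble_example binary, launching one process per node with JSON logging might look like:

mpirun --npernode 1 ./ensemble_example --out results json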

The call to CUDAEnsemble::simulate() will initialise the MPI state if necessary; in order to exit cleanly, flamegpu::util::cleanup() must be called before the program exits. Hence, you may call CUDAEnsemble::simulate() multiple times to execute multiple ensembles via MPI in a single execution, or probe the MPI world state prior to launching the ensemble, but flamegpu::util::cleanup() must only be called once.

All three error levels are supported and behave similarly. In all cases the rank 0 process will be the only process to raise an exception after the MPI group exits cleanly.

If programmatically accessing run logs when using MPI, via CUDAEnsemble::getLogs(), each MPI process will return only the logs for the runs it completed. This enables further post-processing to remain distributed.

For more guidance around using MPI, such as how to launch MPI jobs, you should refer to the documentation for the HPC system you will be using.

Warning

CUDAEnsemble MPI support distributes GPUs within a shared memory system (node) across the MPI ranks assigned to that node, to avoid overallocation of resources and unnecessary model failures. It’s only necessary to launch 1 MPI process per node, as CUDAEnsemble is natively able to utilise multiple GPUs within a single node, and a warning will be emitted if more MPI ranks are assigned to a node than there are visible GPUs.

Warning

flamegpu::util::cleanup() must be called before the program returns when using MPI, as this triggers MPI_Finalize(). It must only be called once per process.

FLAMEGPU has a dedicated MPI test suite, which can be built and run via the tests_mpi CMake target. It is configured to automatically divide GPUs between MPI processes when executed with MPI on a single node (e.g. mpirun -n 2 ./tests_mpi) and to scale across any multi-node configuration. Some tests will not run if only a single GPU (and therefore a single MPI rank) is available. Due to limitations with GoogleTest, each runner will execute tests and print to stdout/stderr, and crashes during a test may cause the suite to deadlock.