Class MPIEnsemble
Defined in File MPIEnsemble.h
Class Documentation
-
class MPIEnsemble
Public Functions
-
explicit MPIEnsemble(const CUDAEnsemble::EnsembleConfig &_config, unsigned int _total_runs)
Construct the object for managing MPI comms during an ensemble
Initialises the MPI_Datatype MPI_ERROR_DETAIL and detects the world rank and size.
- Parameters:
_config – The parent ensemble’s config (mostly used to check error/verbosity levels)
_total_runs – The total number of runs to be executed (only used for printing error warnings)
-
int receiveErrors(std::multimap<int, AbstractSimRunner::ErrorDetail> &err_detail)
If world_rank==0, receive any waiting errors and add their details to err_detail
- Parameters:
err_detail – The map to store new error details within
- Returns:
The number of errors that occurred.
-
int receiveJobRequests(unsigned int &next_run)
If world_rank==0, receive and process any waiting job requests
- Parameters:
next_run – A reference to the unsigned int which tracks progress through the run plan vector
- Returns:
The number of runners that have been told to exit (if next_run>total_runs)
-
void sendErrorDetail(AbstractSimRunner::ErrorDetail &e_detail)
If world_rank!=0, send the provided error detail to world_rank==0
- Parameters:
e_detail – The error detail to be sent
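sendErrorDetail and receiveErrors form a simple forward-and-aggregate pattern: non-zero ranks forward each local error to world_rank 0, which files it in the multimap keyed by the sender's world rank. The sketch below mimics only the rank-0 bookkeeping; the MPI transport (MPI_Recv with the MPI_ERROR_DETAIL datatype) is replaced by a plain function argument, and the ErrorDetail struct here is a hypothetical stand-in, not the real AbstractSimRunner::ErrorDetail.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for AbstractSimRunner::ErrorDetail.
struct ErrorDetail {
    unsigned int run_id;  // which run failed
    std::string what;     // error message
};

// Mimics what receiveErrors() does on world_rank==0: each forwarded error
// is inserted into the multimap keyed by the sender's world rank. Real
// code would receive each message via MPI_Recv using MPI_ERROR_DETAIL.
int collectErrors(const std::vector<std::pair<int, ErrorDetail>> &incoming,
                  std::multimap<int, ErrorDetail> &err_detail) {
    int errors = 0;
    for (const auto &msg : incoming) {
        err_detail.emplace(msg.first, msg.second);  // key = sender's rank
        ++errors;
    }
    return errors;  // like receiveErrors(): the number of errors received
}
```

Keying by sender rank means errors from the same runner stay grouped, which is convenient when printing a per-rank summary at the end of the ensemble.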
-
int requestJob()
If world_rank!=0, request a job from world_rank==0 and return the response
- Returns:
The index of the assigned job
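Together, requestJob and receiveJobRequests implement a dynamic work-assignment handshake: each idle runner asks world_rank 0 for the next run index, and rank 0 replies with either an index or an exit signal once the run plan is exhausted. A minimal sketch of the rank-0 side, with the MPI messaging elided and a hypothetical EXIT sentinel (the real sentinel value is not documented here):

```cpp
#include <cassert>

// Sketch of the rank-0 job coordinator (hypothetical names). Real code
// exchanges these values via MPI_Send/MPI_Recv; here each "request" is a
// direct function call so the assignment logic is testable.
struct JobCoordinator {
    unsigned int next_run = 0;      // progress through the run plan vector
    const unsigned int total_runs;  // as passed to the MPIEnsemble constructor
    int runners_exited = 0;         // count of exit replies sent

    explicit JobCoordinator(unsigned int total) : total_runs(total) {}

    static constexpr int EXIT = -1;  // hypothetical sentinel

    // What receiveJobRequests() does per incoming request: hand out the
    // next run index, or tell the runner to exit once no work remains.
    int handleRequest() {
        if (next_run >= total_runs) {  // run plan exhausted
            ++runners_exited;
            return EXIT;
        }
        return static_cast<int>(next_run++);  // assign job, advance progress
    }
};
```

This pull-based scheme load-balances naturally: faster runners simply come back for more jobs sooner, which matters when runs have uneven durations.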
-
void worldBarrier()
Wait for all MPI ranks to reach a barrier
-
std::string assembleGPUsString()
If world_rank!=0 and local_rank==0, send the local GPU string to world_rank==0 and return an empty string. If world_rank==0, receive the GPU strings and assemble the full remote GPU string to be returned.
-
void retrieveLocalErrorDetail(std::mutex &log_export_queue_mutex, std::multimap<int, AbstractSimRunner::ErrorDetail> &err_detail, std::vector<AbstractSimRunner::ErrorDetail> &err_detail_local, int i, std::set<int> devices)
Common function for handling local errors during MPI execution. Requires the set of in-use devices, not the config-specified list of devices.
-
bool createParticipatingCommunicator(bool isParticipating)
Create the split MPI communicator based on whether the rank is participating in ensemble execution or not, determined by the group rank and number of local GPUs.
- Parameters:
isParticipating – If this rank is participating (i.e. it has a local device assigned)
- Returns:
success of this method
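Under the hood this kind of split would use MPI_Comm_split, which groups ranks sharing the same "colour" (here, the participation flag) into a new communicator and ranks them by a key (conventionally the original rank). Since MPI_Comm_split needs a live MPI runtime, the sketch below mimics only its size/rank bookkeeping in plain C++; the function name and test colours are illustrative, not FLAMEGPU's.

```cpp
#include <cassert>
#include <vector>

// Result of a communicator split for one rank: its size and rank within
// the new communicator. Real code would call
//   MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &new_comm);
struct SplitResult { int comm_size; int comm_rank; };

// colours[r] is rank r's colour (e.g. 1 = participating, 0 = not).
// A rank's new communicator contains exactly the ranks sharing its
// colour, ordered by original rank.
SplitResult commSplit(const std::vector<int> &colours, int my_rank) {
    SplitResult r{0, 0};
    for (int rank = 0; rank < static_cast<int>(colours.size()); ++rank) {
        if (colours[rank] == colours[my_rank]) {
            if (rank < my_rank) ++r.comm_rank;  // same-colour ranks before us
            ++r.comm_size;
        }
    }
    return r;
}
```

This is why the accessor methods below report the size and rank "within the MPI communicator containing participating (or non-participating) ranks": each rank only ever sees the sub-communicator matching its own colour.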
-
inline int getRankIsParticipating()
Accessor method for whether the rank is participating or not (i.e. the colour of the communicator split)
-
inline int getParticipatingCommSize()
Accessor method for the size of the MPI communicator containing “participating” (or non-participating) ranks
-
inline int getParticipatingCommRank()
Accessor method for the rank within the MPI communicator containing “participating” (or non-participating) ranks
-
std::set<int> devicesForThisRank(std::set<int> devicesToSelectFrom)
Method to select devices for the current MPI rank, based on the provided set of devices. This non-static overload calls the static overload with the current MPI local size/rank; it is the version intended for general use.
- Parameters:
devicesToSelectFrom – set of device indices to use, provided from the config or initialised to contain all visible devices.
- Returns:
the gpus to be used by the current mpi rank, which may be empty.
Public Members
-
const int world_rank
The rank within the world MPI communicator
-
const int world_size
The size of the world MPI communicator
-
const int local_rank
The rank within the MPI shared memory communicator (i.e. within the node)
-
const int local_size
The size of the MPI shared memory communicator (i.e. within the node)
-
const unsigned int total_runs
The total number of runs to be executed (only used for printing error warnings)
Public Static Functions
-
static std::set<int> devicesForThisRank(std::set<int> devicesToSelectFrom, int local_size, int local_rank)
Static method to select devices for the current MPI rank, based on the provided set of devices. This static version exists so that it is testable; a non-static overload which queries the current MPI environment is also provided as a simpler interface.
- Parameters:
devicesToSelectFrom – set of device indices to use, provided from the config or initialised to contain all visible devices.
local_size – the number of mpi processes on the current shared memory system
local_rank – the current process’ mpi rank within the shared memory system
- Returns:
the gpus to be used by the current mpi rank, which may be empty.
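One plausible way to honour this contract is to partition the device set into contiguous, near-equal chunks across the ranks sharing a node, so that with more ranks than devices the surplus ranks receive an empty set (and would then not participate). This is a sketch under that assumption, not necessarily FLAMEGPU's exact scheme; the function name differs from the documented one to make that clear.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

// Partition devicesToSelectFrom between the local_size ranks that share a
// node, giving local_rank its contiguous chunk. Earlier ranks absorb the
// remainder, so chunk sizes differ by at most one.
std::set<int> devicesForRank(const std::set<int> &devicesToSelectFrom,
                             int local_size, int local_rank) {
    std::set<int> selected;
    const int device_count = static_cast<int>(devicesToSelectFrom.size());
    const int base = device_count / local_size;       // devices per rank
    const int remainder = device_count % local_size;  // leftover devices
    // First index of this rank's chunk, and how many devices it gets.
    const int begin = local_rank * base + std::min(local_rank, remainder);
    const int count = base + (local_rank < remainder ? 1 : 0);
    int index = 0;
    for (int device : devicesToSelectFrom) {  // std::set iterates in order
        if (index >= begin && index < begin + count)
            selected.insert(device);
        ++index;
    }
    return selected;  // may be empty when local_size > device_count
}
```

An empty result is exactly the case the participating-communicator split above is designed for: a rank with no assigned device drops out of ensemble execution.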