T N T

TNT: parallelization with MPI. Install and configure with Open MPI (OMPI)

technical notes by Martín Morales

TNT works with the OMPI implementation, version 4.1.0. TNT and OMPI are installed independently: the MPI libraries are not included in the TNT files, so the user must compile OMPI from source. TNT finds those libraries automatically at execution time, through the ordinary dynamic linking used for any shared library, such as OMPI’s. The supported platforms are Linux and Cygwin (tested on Windows 7 and higher).

On Linux:

Download and build OMPI: https://www.open-mpi.org/software/ompi/v4.1/

A common procedure is:

Unpack the files and change into the resulting directory
Run the configure script, passing the directory where OMPI should be installed (if omitted, a default installation path is used):

./configure --prefix=<INSTALL DIRECTORY>

(It shows a lot of output)

make all install

(Usually takes a while)
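The steps above can be collected in a small shell function. This is only a sketch: the function name build_ompi is hypothetical, and it assumes the source archive is a .tar.gz file (for example openmpi-4.1.0.tar.gz) that unpacks into a directory named after the archive.

```shell
# Hypothetical helper: build_ompi TARBALL PREFIX
# The body runs in a subshell, so the caller's working directory is unchanged.
build_ompi() (
    [ "$#" -eq 2 ] || { echo "usage: build_ompi TARBALL PREFIX" >&2; exit 1; }
    tarball="$1"
    prefix="$2"
    # Assumption: the archive unpacks into a directory named after the tarball.
    srcdir="$(basename "$tarball" .tar.gz)"

    tar -xzf "$tarball" &&
    cd "$srcdir" &&
    ./configure --prefix="$prefix" &&
    make all install
)

# Example call (paths are placeholders):
# build_ompi openmpi-4.1.0.tar.gz "$HOME/opt/openmpi-4.1.0"
```

Chaining the commands with && stops the build at the first step that fails, so a broken configure run is not followed by make.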

Set the required variables in the system, generally in the user’s shell startup scripts (.bashrc file, etc.), to indicate the path of the OMPI installation, the IP address (if there is more than one network interface), and oversubscription (to run more processes than available CPUs; this is optional).

A very common setting is:

export PATH="<MPI INSTALL DIRECTORY>/bin:$PATH"

export LD_LIBRARY_PATH="<MPI INSTALL DIRECTORY>/lib:$LD_LIBRARY_PATH"

export OMPI_MCA_btl_tcp_if_include=<IP ADDRESS>/<MASK>

(this one is needed because we have experienced issues when more than one network interface is present; this setting corrects them. Use the IP address of your computer with the last field set to 0; the proper mask is generally 24, although this depends on the exact configuration of your network)

export OMPI_MCA_rmaps_base_oversubscribe=1

(otherwise, oversubscribing processes produces errors and a crash)
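As an illustration of the <IP ADDRESS>/<MASK> value, the sketch below derives it from a host address by setting the last field to 0. The function name subnet_from_ip and the address 192.168.1.37 are made up for the example, and the shortcut is only valid for a mask of 24; with another mask the network address has to be worked out for that network.

```shell
# Hypothetical helper: build the "<IP ADDRESS>/<MASK>" value from a host address.
# Setting the last field to 0 only yields the network address for a /24 mask.
subnet_from_ip() {
    ip="$1"
    mask="${2:-24}"                          # default mask of 24 is an assumption
    printf '%s.0/%s\n' "${ip%.*}" "$mask"    # drop the last field, append .0/MASK
}

subnet_from_ip 192.168.1.37    # prints 192.168.1.0/24

# In a startup script this could be used as:
# export OMPI_MCA_btl_tcp_if_include="$(subnet_from_ip 192.168.1.37)"
```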

Get and run the TNT-MPI Linux version.

On Cygwin:

Install Cygwin.

With the Cygwin setup application (e.g. setup-x86_64.exe), install these packages:

openmpi (4.1.0)
openssh

Set Windows’ environment variables (e.g. from Control Panel > User Accounts > Environment Variables):

OMPI_MCA_btl_tcp_if_include = <IP ADDRESS>/<MASK>

(this is needed because we have experienced issues when more than one network interface is present; this setting corrects them. Use the IP address of your computer with the last field set to 0; the proper mask is generally 24, although this depends on the exact configuration of your network)

OMPI_MCA_rmaps_base_oversubscribe = 1

(this avoids a crash when running more processes than CPUs)

Get the TNT-MPI Cygwin version and run it in a Cygwin terminal (the program can also be started from a Windows command shell, but MPI does not work properly there).

Cluster considerations on Linux

Requirements:

Passwordless SSH logins (i.e., SSH connections that do not ask for a password) to every node (i.e., computer or host), typically with the same user account on all machines.
TNT and MPI must be installed on every node in the cluster, and the PATH and LD_LIBRARY_PATH variables must be set on each node, as above.

The preferred approach for the latter is a common file system, such as NFS (Network File System): a single computer holds the libraries and binaries and shares them, through a network directory, with all the nodes of the cluster. The alternative is to install everything on the local hard drive of each node, which clearly makes maintenance more difficult: to upgrade TNT or MPI, you would need to reinstall on every node's hard drive, whereas with the first method the work is done on just one machine. Both approaches work the same once they are set up. In less usual scenarios the overhead of a networked file system counts against it; even so, NFS is the most frequent choice.
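For the first requirement, passwordless logins are commonly set up with key-based SSH authentication. The following is a sketch using the standard OpenSSH tools ssh-keygen and ssh-copy-id; the function name, the ed25519 key type, and the empty passphrase are assumptions of this example, not something TNT prescribes.

```shell
# Hypothetical helper: setup_passwordless_ssh HOST...
# Creates a key pair if none exists and installs the public key on each host.
setup_passwordless_ssh() {
    [ "$#" -ge 1 ] || { echo "usage: setup_passwordless_ssh HOST..." >&2; return 1; }
    key="$HOME/.ssh/id_ed25519"
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh" || return 1
    # Assumption: an empty passphrase (-N "") so that no prompt ever appears.
    [ -f "$key" ] || ssh-keygen -t ed25519 -N "" -f "$key" || return 1
    for host in "$@"; do
        # Asks for the password one last time to install the public key.
        ssh-copy-id -i "$key.pub" "$host" || return 1
        # BatchMode makes ssh fail instead of prompting, so this verifies the setup.
        ssh -o BatchMode=yes "$host" true || return 1
    done
}

# Example call, using host names like those in a hostfile:
# setup_passwordless_ssh fast1 fast2 slow1 slow2
```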

hostfile

Unlike PVM, in MPI systems it is not necessary to preload hosts to create a virtual machine. They are just listed in an ordinary text file called by convention hostfile, which TNT reads at runtime. An example of its content:

master	slots=16	max_slots=16
fast1	slots=16	max_slots=16
fast2	slots=16	max_slots=16
slow1	slots=8	max_slots=8
slow2	slots=8	max_slots=8

Here there are 5 rows, one per host, each arranged in 3 columns as follows:

Name of the host: the computer name, usually defined in the /etc/hosts file on a Linux system.
slots (optional): indicates how many processes can potentially be allocated to that node. For best performance, the number of slots should be the number of physical (not logical) cores or processors in the node.
max_slots (optional): follows from the previous point. The number of processes launched can be higher than the slots value (i.e. "oversubscription"; this generally gives worse performance but may be needed for easier partitioning in some routines). To limit that oversubscription, set the max_slots value.
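Because max_slots is an upper limit on oversubscription, a value below slots on the same row is most likely a mistake. The sketch below flags such rows; the function name check_hostfile_caps is hypothetical, and treating a limit lower than slots as an error is an assumption of this example.

```shell
# Hypothetical helper: check_hostfile_caps HOSTFILE
# Prints every host whose max_slots is lower than its slots; returns 1 if any.
check_hostfile_caps() {
    awk '
        /^[[:space:]]*#/ || NF == 0 { next }      # skip comments and blank lines
        {
            slots = ""; cap = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^slots=[0-9]+$/)        slots = substr($i, 7)
                if ($i ~ /^max[_-]slots=[0-9]+$/) cap   = substr($i, 11)
            }
            if (slots != "" && cap != "" && cap + 0 < slots + 0) {
                print $1 ": max_slots (" cap ") is below slots (" slots ")"
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example call:
# check_hostfile_caps hostfile
```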

The hostfile is usually needed only on the master node (that is, the computer from which TNT launches parallel jobs to the other ones in the network). TNT reads it from the current working directory, or from a user-defined one.

Notes:

A line in the file can be commented out (disabled) by starting it with the "#" symbol.
The default value of slots is the number of physical cores or CPUs.
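To see how many processes a hostfile provides before oversubscription, the slots values can be added up while skipping commented lines. This is a sketch with a hypothetical function name; hosts without an explicit slots field are only reported, since their default depends on the node's hardware.

```shell
# Hypothetical helper: hostfile_slots HOSTFILE
# Lists the slots of each active host and the total of the explicit values.
hostfile_slots() {
    awk '
        /^[[:space:]]*#/ || NF == 0 { next }      # "#" lines are disabled hosts
        {
            n = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=[0-9]+$/) n = substr($i, 7)
            if (n == "") { print $1 ": slots not set (default applies)"; next }
            total += n
            print $1 ": " n
        }
        END { print "total: " total + 0 }
    ' "$1"
}

# Example call:
# hostfile_slots hostfile
```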