Frequently Asked Questions

Using R on HPE DSI Resources
R is a programming language and software environment for statistical computing and graphics.

For use on the Opuntia and Sabine clusters, we have installed R from source, along with a number of external R libraries. If there is another library that you want to use, please try to install it in your own environment first. If you run into trouble, feel free to ask us to perform the installation.

The highest currently supported versions are 3.5 (on Opuntia) and 3.6 (on Sabine). Both were built with the Intel compilers and the threaded Math Kernel Library (MKL). The presence of MKL may result in a considerable speed-up compared to R builds that rely solely on non-optimized mathematical libraries. As a rule of thumb, programs that perform a lot of floating-point numerical calculations benefit the most from multi-threading.

If you would like to use R in parallel, please scroll down to the Parallel R section.

How do I load R in my environment?

You can make R available in your environment by loading the R module:

module load R
The command R --version returns the version of R you have loaded:

R --version
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
The command which R returns the location where the R executable resides:

which R
/project/cacds/apps/easybuild/software/R/3.5.1-intel-2017b-X11-20171023/bin/R
Note: if you use a ~/.Rprofile file, it should be independent of the version of R, i.e. library paths should NEVER be set within this file.
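For example, a version-independent ~/.Rprofile might contain only generic options; a hypothetical sketch:

## ~/.Rprofile -- version-independent settings only; never set library paths here
options(repos = c(CRAN = "https://cran.r-project.org"))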

Running an R script on the command line in interactive mode

Reminder: Don't run your code on the login node! Always make sure you either request a node for interactive use or submit a job for batch execution. We use SLURM as the job scheduling system.

Interactive mode is useful for code development and debugging. As a first step, you can request a node for interactive use, e.g.
srun -A <allocation_award_ID> --pty /bin/bash -l

Once you are on the node, first load the R module:
module load R
There are several ways to launch an R script on the command line:

  1. Rscript yourfile.R
  2. R CMD BATCH yourfile.R
  3. R --no-save < yourfile.R
  4. ./yourfile2.R
The first approach (i.e. using the Rscript command) writes its output to stdout. The second approach (i.e. using the R CMD BATCH command) redirects its output into a file (in this case yourfile.Rout). The third approach redirects the contents of the file yourfile.R to the standard input of the R executable. Note that in this approach you must specify one of the following flags: --save, --no-save or --vanilla.

The R code can also be launched as a Linux script (the fourth approach). In order to be run as a Linux script:

  • One needs to insert an extra line (#!/usr/bin/env Rscript) at the top of the file yourfile.R
  • As a result we have a new file, yourfile2.R
  • The permissions of the R script (i.e. yourfile2.R) need to be altered to make it executable, e.g. with chmod u+x yourfile2.R (see the sketch below)
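As a minimal sketch (the file name and its contents are purely illustrative), such a self-executing script could look like this:

#!/usr/bin/env Rscript
## yourfile2.R -- can be run directly from the shell once marked executable
cat("Running under", R.version.string, "\n")

After chmod u+x yourfile2.R, the script is started with ./yourfile2.R.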
Sometimes we need to feed arguments to the R script. This is especially useful when running parallel independent calculations - different arguments can be used to differentiate between the calculations, e.g. by feeding in different initial parameters. To read the arguments, one can use the commandArgs() function. For example, if we have a script called myScript.R:

## myScript.R 
args <- commandArgs(trailingOnly = TRUE)
rnorm(n=as.numeric(args[1]), mean=as.numeric(args[2]))
then we can call it with arguments as e.g.:

Rscript myScript.R 5 100
[1] 101.35122 100.60181 100.54685  98.13926  99.19416

Running R jobs on the cluster in batch mode

In the previous section we described how to launch an R script on the command line. In order to run an R script as a SLURM batch job on the compute nodes, we just need to create a SLURM script/wrapper "around" the R command line.

Below you will find an example of a SLURM batch script called runR.slurm. For demonstration purposes we use the same myScript.R as above.
#!/bin/bash
#SBATCH --time=00:10:00 # Walltime
#SBATCH --nodes=1          # Use 1 Node     (Unless code is multi-node parallelized)
#SBATCH --ntasks=1         # We only run one R instance = 1 task
#SBATCH --cpus-per-task=12 # number of threads we want to run on
#SBATCH -o slurm-%j.out-%N
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_username@uh.edu   # Your email address (SLURM does not expand $USER in #SBATCH lines)
#SBATCH --job-name=myrjob


# Load R (default version)
module load R


# Run the R script in batch
Rscript myScript.R 5 100


echo "End of program at `date`"
We then submit our job to SLURM via the sbatch command.
sbatch runR.slurm
Your job will be scheduled, you will receive a short message about the submission, and the job will eventually run. All output will be stored in a file called slurm-[jobid].out-[nodename], following the -o directive above.


Parallel R

The R environment itself is not parallelized, which is important to keep in mind when running on HPE DSI cluster nodes, which have at least 8 CPU cores. A typical unvectorized R program will run on only a single core.

The R installation detailed above can run certain workloads (mostly linear algebra) using multiple threads through the Intel Math Kernel Library (MKL). We recommend benchmarking your first run with OMP_NUM_THREADS=1 and then with a higher core count (e.g. OMP_NUM_THREADS=8 on an 8-core node) to see whether it achieves any speed-up.
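As a minimal benchmarking sketch (the script name bench.R and the matrix size are illustrative), one can time a large matrix multiplication, which MKL runs multi-threaded:

## bench.R -- time an MKL-threaded matrix multiplication
n <- 4000
a <- matrix(rnorm(n * n), n, n)
print(system.time(a %*% a))

Run it once as OMP_NUM_THREADS=1 Rscript bench.R and again with a higher thread count, then compare the elapsed times.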

By default we have turned off multi-threading by setting the environment variable OMP_NUM_THREADS to 1, i.e.

setenv OMP_NUM_THREADS 1   # Tcsh/Csh Shell
export OMP_NUM_THREADS=1 # Bash Shell
to facilitate easier use of parallel independent calculations. If you want to run R in a multi-threaded fashion (e.g. on a compute node), we strongly recommend not using more threads than there are physical cores on the node.

If multi-threading does not provide much speed-up, or one needs to run on more than one node, some kind of parallelization of the R code is necessary. There are numerous R packages that implement various levels of parallelism; they are summarized in the CRAN High-Performance and Parallel Computing Task View (https://cran.r-project.org/web/views/HighPerformanceComputing.html).

In our relatively limited experience, if the parallel tasks are independent of each other, one can quite simply use the foreach package, as sketched below. Or, even better, run the parallel tasks completely independently through SLURM's --multi-prog option. If you need any assistance, contact us.
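As a minimal sketch of the foreach approach (assuming the doParallel package is installed in your user library; the task itself is illustrative):

## Run independent tasks in parallel with foreach/doParallel
library(doParallel)

## Use the core count SLURM allocated to the job; fall back to 2 outside a job
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "2"))
cl <- makeCluster(ncores)
registerDoParallel(cl)

## Eight independent tasks; results are combined into a single vector
results <- foreach(i = 1:8, .combine = c) %dopar% {
  mean(rnorm(1e6, mean = i))
}
stopCluster(cl)
print(results)

Because each iteration runs in its own worker process, the tasks must not depend on each other's results.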

Installing additional R packages

R Library locations

R packages are installed in libraries. Before addressing the installation of R packages as such, we will first detail the hierarchical structure of the R libraries that are installed on the HPE DSI Linux systems.

The command .libPaths() returns the names of the libraries (directories) that are accessible to the R executable loaded in your environment.

In the recently installed R distributions, we have two library levels:

  • Core/Default Library
  • User Libraries
The Core & Default R packages were installed in a subdirectory of the main installation directory when the respective version of R was compiled. The location of this library is stored in the .Library variable. Among the packages in this library we have "base", "datasets", "utils", etc.

R
> .Library
[1] "/project/cacds/apps/easybuild/software/R/3.5.1-intel-2017b-X11-20171023/lib64/R/library"

The User Library is a subdirectory in the user's space (e.g. under $HOME) where users can install their own packages. Note that each version of R for which you want to install your own packages should have its own user library directory. The User Library subdirectories are not present by default and must be created before the user can install R packages themselves.

R
> .libPaths()
[1] "/home/usertest/local/R_libs"                                                            
[2] "/project/cacds/apps/easybuild/software/R/3.5.1-intel-2017b-X11-20171023/lib64/R/library"
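As a minimal sketch (the directory name ~/local/R_libs is illustrative; adapt it to your own layout), a user library can be created and prepended to the search path from within R:

## Create a per-user library directory and put it first on the search path
dir.create("~/local/R_libs", recursive = TRUE, showWarnings = FALSE)
.libPaths("~/local/R_libs")   # prepends this directory for the current session
.libPaths()                   # verify the new search order

To make the location persistent across sessions, point the environment variable R_LIBS_USER (e.g. in your ~/.bashrc) at that directory instead.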

Installing packages in your environment


We can install packages in two different ways:

  • High-level version using install.packages() (invoked within R)
  • Low-level version using R CMD INSTALL (invoked from a Linux Shell)
HIGH-LEVEL INSTALLATION
The high-level installation is the easiest way to install packages. It is the preferred way when the package to be installed does not depend on C, C++, or Fortran libraries installed in non-standard directories. The R function to invoke is install.packages().

R
> library(maRketSim)
Error in library(maRketSim) : there is no package called ‘maRketSim’
> install.packages(c("maRketSim"),
    lib=paste("/home/", Sys.getenv("USER"), "/local/RLibs/", Sys.getenv("R_VERSION"), sep=""),
    repos=c("http://cran.us.r-project.org"), verbose=TRUE)
> library(maRketSim)
The library($PACKAGE) function tries to load a package $PACKAGE; if R can't find it, an error is printed on stdout. The install.packages() function has several flags. The lib flag needs to be followed by the directory where you want to install the package (this should be $R_LIBS_USER). From the installation output we notice that install.packages() calls the low-level installation command (R CMD INSTALL), which is discussed in the next section.
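If your user library is the first entry in .libPaths() (for instance because R_LIBS_USER points at an existing directory), the lib flag can be omitted; a minimal sketch:

## With R_LIBS_USER set, install.packages() installs into the user library by default
install.packages("maRketSim", repos = "http://cran.us.r-project.org")
find.package("maRketSim")   # shows which library the package landed in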

Note that the lib flag can also be used with packages from other repositories, e.g. Bioconductor. As we have some Bioconductor packages installed in our default location, also use the lib.loc flag to tell Bioconductor where the "original" Bioconductor location is:

source("https://bioconductor.org/biocLite.R")
biocLite(pkgs, lib.loc = "/home/$USER/local/RLibs/$R_VERSION", lib="/home/$USER/local/RLibs/$R_VERSION")


LOW-LEVEL INSTALLATION
The low-level installation is to be used when you need to install R packages that depend on external libraries installed in non-default locations. As an example, let's consider the RNetCDF package.

The installation of this package depends on the external libraries netcdf and udunits2. The commands to install the RNetCDF package in a User Library are (assuming a bash shell):

module load intel netcdf-c udunits
export PATH=$NETCDFC/bin:$PATH    # tcsh: setenv PATH $NETCDFC/bin:$PATH
export PATH=$UDUNITS/bin:$PATH    # tcsh: setenv PATH $UDUNITS/bin:$PATH
wget https://cran.r-project.org/src/contrib/RNetCDF_1.9-1.tar.gz
R CMD INSTALL --library=/home/$USER/local/RLibs/$R_VERSION \
  --configure-args="CPPFLAGS='-I$UDUNITS/include' \
    LDFLAGS='-Wl,-rpath=$NETCDFC/lib \
    -L$NETCDFC/lib -lnetcdf \
    -Wl,-rpath=$UDUNITS/lib \
    -L$UDUNITS/lib -ludunits2' \
    --with-nc-config=$NETCDFC/bin/nc-config" RNetCDF_1.9-1.tar.gz
R CMD INSTALL calls ./configure under the hood. The best way to tackle such an installation is to download the tar.gz file first, find the appropriate configure flags (they differ for each package!) and then feed those flags to the R CMD INSTALL command.

POTENTIAL PROBLEMS
The Intel compiler that we use to build R conflicts with the gcc headers when complex data types are used, resulting in an error similar to the one below when installing some R libraries:

/project/cacds/apps/easybuild/software/icc/2017.8.262-GCC-6.4.0-2.28/compilers_and_libraries_2017.8.262/linux/include/complex(310): error #308: member "std::complex::_M_value" (declared at line 1337 of "/usr/include/c++/4.8.5/complex") is inaccessible
return __x / __y._M_value;
The workaround is to disable this diagnostic error by creating (or modifying) the file ~/.R/Makevars as follows:

CFLAGS += -wd308
CXXFLAGS += -wd308
CPPFLAGS += -wd308
PKG_CFLAGS += -wd308
PKG_CXXFLAGS += -wd308
PKG_CPPFLAGS += -wd308
 