guidelines for sheffield hpc setup. It will contain guidelines but also script examples to run julia code (may be R code too), to submit jobs and to parallelize julia code (and R code too may be if useful).
Do not hesitate to raise issues and to give feedbacks.
Contributors: Alain Danet and John Jackson
Here I take ShARC cluster as example, but you can change for bessemer and stanage.
Those guidelines are oriented for UNIX users (e.g. Linux/MACOS), but contributions are welcome for WINDOWS users.
Connect to your favorite cluster (e.g. sharc or bessemer) and clone this git repository into it :) (if you clone it elsewhere than in your home folder, you may have to adapt a little some scripts!)
Because nothing is so simple with HPCs, ShARC uses SGE (Sun Grid Engine) for job
managements while bessemer uses SLURM. It means that the options for job
submission will be different while being quite similar. It is for this reason
that the minimal scripts for job submissions are placed in sharc
and
bessemer
respectively.
Let's get through the examples!
We just load julia and ask julia for a little hello. To run this job, just type
qsub job_test_julia.sh
(while being in the sharc
folder).
After 30s (time for the job to be allocated to a node and executed), you should
see two files: try_julia.error
and try_julia.out
. The former should be
empty if everything went fine. Read the latter file to see the ouput of the job,
e.g. type cat try_julia.out
A little example of job with R using multiple cores. The R script is
test_parallel.R
Carefully examine job_test_parallel_R.sh
and then run this command:
qsub job_test_parallel_R.sh
This job should take 2 minutes to run (you can monitor its state by running
qstat
).
Now if you look at the output of parallelR.out
(e.g. cat parallel.out
),
you should see something like that:
Number of cores:[1] 16
Parallel computation time:
user system elapsed
0.017 0.017 20.034
Serial computation time:
user system elapsed
0.047 0.003 100.138
We asked for 5 cores but R detected 16. You can see that the parallel computation took env. 20s while the serial computation took env. 100s, i.e. five times more! It corresponds to the fact that we defined the parallel computation with five cores :)
The clusters have internet connection, so you can install julia package normally.
You can test it by running qsub job_pkg_install_julia.sh
!
If you look at pkg_add.error
, you will see something like that:
Installing known registries into `~/.julia`
Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Installed JpegTurbo_jll ──────────────────── v2.1.2+0
Installed x265_jll ───────────────────────── v3.5.0+0
Installed Calculus ───────────────────────── v0.5.1
Installed TreeViews ──────────────────────── v0.3.0
Installed libfdk_aac_jll ─────────────────── v2.0.2+0
Installed DifferentialEquations ──────────── v7.6.0
So julia created in your home directory a folder architecture in ~/.julia
to
keep track of the installed package and their version.
Run lorenz equation example: see julia_lorenz.jl
In sharc
folder, type qsub job_lorenz_julia.sh
Let's first test interactively:
# Request 1 node with 5G
qrsh -pe smp 1 -l h_vmem=5G
julia
In julia:
using Distributed
println("n workers: ",nworkers(), ", n process:", nprocs())
You should see that there is only one core and one process. Now quit Julia.
See how many cpu you have at your node by typing nproc
Restart Julia:
julia -p 5
In julia:
using Distributed
println("n workers: ",nworkers(), ", n process:", nprocs())
Now you should see that you have access to 5 workers. You can even see their names:
pmap(x->run(`hostname`),workers());
The node names are the same because we requested only one name.
Interesting example for simulation
When you have very intensive computational task, you may want to split the work load in several jobs. SGE allows to do that by generating iterators.
You can even generate iterators from one number to another with a specified stepsize.
In a design, imagine that you have a table with each row containing parameter combination to run over. You want want that each job perform a number of paramer combination, like one hundred, one thousand, etc...
I provide a dummy example for julia
and R
in the
sharc
folder:
job_test_arrayjob_julia.sh
job_test_arrayjob_R.sh
qsub job_test_arrayjob_R.sh
qsub job_test_arrayjob_julia.sh
Those example will show you how to set an iterator and how to pass it to julia
and R
.
- Another example with R where every job will append its result to a common file:
qsub jobtest_array_script_csv.sh
In a terminal:
using qrshx (support graphical application) on sharc
qrhx
or
with srun on bessemer
srun --pty bash -i
- List available modules:
module avail
Now load Julia and R:
module load apps/R/4.0.0/gcc-10.1
module load apps/julia
Please note that you have to load the above specific version of R if you intend also to use Julia. Other R versions will make Julia fails (From Desmond Ryan: "The julia install is pre-built binaries so suspect there is an incapability between gcc versions (for the R version you used & the compilation of julia)") .
git clone https://github.com/...git
If using ssh (you would need to generate ssh keys on the cluster and to add the public key to your github account):
git clone [email protected]
If you are tired of typing ssh [email protected] ...
Add the following to your ~/.ssh/config
:
Host sharc
HostName sharc.shef.ac.uk
User MyUserName
Port 22
AddKeysToAgent yes
Restart ssh-agent: sudo service ssh retart
Now you should be able to connect to sharc by just typing ssh sharc
If you are tired of entering your password and to use MFA each time you are connecting to the cluster... But you probably need to know a bit about ssh keys.
You can follow the nice guidelines of github here.
scp ~/.ssh/id_rsa.pub [email protected]:~/.ssh/
Add the following in your .bashrc:
nano ~/.bashrc
if [ -z "$SSH_AUTH_SOCK" ] ; then
eval `ssh-agent -s`
ssh-add
fi
logout and now you can connect without password and MFA!!!