2. Getting started on expanse

2.1. How to log in (after you have set up 2-Factor Authentication)

ssh -l username login.xsede.org

Answer 2FA prompts (with your phone)

gsissh expanse

or, once you have done the above at least once, you can log in directly with

ssh -l username login.expanse.sdsc.edu
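Once the direct route works, you can save some typing by adding an entry to the ~/.ssh/config file on your own computer. This is a minimal sketch; the alias name expanse is just a suggestion, and username is your expanse username:

Host expanse
    HostName login.expanse.sdsc.edu
    User username

After that, ssh expanse will connect directly.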

2.2. Storage on expanse

There are 3 directories you can use on expanse.

Your home directory is in

/home/username

and can be used for permanent files you want backed up. There is limited space here (100 GB), and the compute nodes cannot see this directory.

There is shared project space (2TB) for our class XSEDE account in the directory

/expanse/lustre/projects/uic406/$USER

where $USER is your expanse username. This directory does not appear to support running batch jobs.

Scratch space is also provided at

/expanse/lustre/scratch/$USER/temp_project

This is the directory where you should run batch jobs. There is no space quota, but there is a per-user limit of 2 million files. It is not backed up, so important results should be copied to one of the other directories.
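For example, to copy a finished run from scratch into the shared project space (a sketch; my_run is a hypothetical directory name):

cp -r /expanse/lustre/scratch/$USER/temp_project/my_run /expanse/lustre/projects/uic406/$USER/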


2.3. Obtaining and compiling WRF on expanse

2.3.1. Download WRF from github

You should download your WRF code to your home directory or the shared project space.

git clone https://github.com/wrf-model/WRF.git

This will download the code into a subdirectory called WRF in the current directory. Enter this directory:

cd WRF
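If you would rather keep the source in the shared project space instead of your home directory, the same steps look like this (a sketch using the project path from section 2.2):

cd /expanse/lustre/projects/uic406/$USER
git clone https://github.com/wrf-model/WRF.git
cd WRF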

Caution

Note that the steps below must be done in one ssh session. Library paths on expanse change from session to session, so you must start from here if you lose your connection.

2.3.2. Configuring WRF

We will now configure WRF for compilation. First, we need to set up the software environment; expanse uses the modules package for this. Here are the modules WRF needs:

module purge

module load cpu intel mvapich2

module load netcdf-c netcdf-cxx netcdf-fortran
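To confirm that the modules loaded, you can run:

module list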

Next, we need to find the netCDF library path and set an environment variable (NETCDF) pointing to it. Run the following command:

echo $LD_LIBRARY_PATH | grep netcdf

Grab the path that contains netcdf-fortran; it should look like this:

/cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen2/intel-19.1.1.217/netcdf-fortran-4.5.3-2wjlrztnogahr6sgpaxuwwd2mfl5ligr/

and set the environment variable NETCDF to that path:

export NETCDF=/cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen2/intel-19.1.1.217/netcdf-fortran-4.5.3-2wjlrztnogahr6sgpaxuwwd2mfl5ligr/

Also, to prioritize output write performance over output file size, set the following:

export NETCDF_classic=1
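Because these environment steps have to be repeated in every ssh session (see the Caution above), you may find it convenient to collect them in a small helper script. Here is a minimal sketch; wrf_env.sh is a hypothetical file name, and it assumes the netcdf-fortran entry in LD_LIBRARY_PATH ends in /lib (compare with the echo command above and adjust if yours differs):

#!/bin/bash
# wrf_env.sh - hypothetical helper; load it with:  source wrf_env.sh
module purge
module load cpu intel mvapich2
module load netcdf-c netcdf-cxx netcdf-fortran

# Take the first netcdf-fortran entry from LD_LIBRARY_PATH and strip a trailing /lib
netcdf_lib=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep netcdf-fortran | head -n 1)
export NETCDF=${netcdf_lib%/lib}
export NETCDF_classic=1
echo "NETCDF set to $NETCDF"

Run it with source wrf_env.sh (not ./wrf_env.sh) so that the exports apply to your current shell.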

Now run configure:

./configure

Select option 20, nesting option 1, and it should complete.

Note

Edit the resulting configure.wrf with a text editor (for example, nano configure.wrf) and change the line that begins with OPTAVX to read:

OPTAVX          =       -march=core-avx2

Save the file and you’re good to go!
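If you prefer to make this change from the command line rather than in nano, here is a one-line sketch (assuming the line in configure.wrf begins with OPTAVX):

sed -i 's/^OPTAVX.*/OPTAVX          =       -march=core-avx2/' configure.wrf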

2.3.3. Compiling WRF

Now, you can compile WRF. To compile the Quarter-circle hodograph supercell case on 4 processors, type

./compile -j 4 em_quarter_ss

After 20-30 minutes, you should get a “SUCCESS” message. You shouldn’t ever have to recompile unless you need to change something in the Fortran code or in the “Registry” - we’ll talk about this later in the semester. If you get disconnected from the session, you will have to reconnect and start over from the configure step (due to the way the libraries are loaded on expanse).
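If you would like a permanent record of the build output to search through afterwards, you can send it to a log file as well (a sketch; compile.log is a hypothetical file name):

./compile -j 4 em_quarter_ss 2>&1 | tee compile.log

If the build fails, grep -i error compile.log is a quick way to find the first problem.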

Note

If something goes wrong during the compile process, be sure to start over at step 2.3.2: in the top-level WRF directory, run ./clean -a to remove all compiled files, then configure and compile again.

2.4. Running WRF ideal on expanse

All right, you’re good to go. Let’s try a test run. At this point you can edit namelist.input or input_sounding to set up your particular run.

You’ll need a slurm script for running your job - here is one you can save in the test/em_quarter_ss directory as wrf_sbatch.sh. To create this file, open it in nano:

nano wrf_sbatch.sh

and paste in the contents below (then write out the file and exit).

#!/bin/bash
#SBATCH --job-name="idealwrf" #JOB NAME
#SBATCH --output="wrf.em_q_ss.%j.%N.out" #NAME OF LOG FILE
#SBATCH --partition=shared #SHARED PARTITION
#SBATCH --nodes=1 #HOW MANY NODES - EACH NODE HAS 64 PROCESSORS
#SBATCH --ntasks-per-node=16 #HOW MANY PROCESSORS, IF YOU GO OVER 64, TAKE MORE NODES
#SBATCH --mem=48G #RAM LIMIT
#SBATCH --account=uic406 #ACCOUNT
#SBATCH --export=ALL #EXPORT YOUR ENVIRONMENT VARIABLES TO THE JOB
#SBATCH -t 01:30:00 #WALL TIME LIMIT

#This job runs with 1 node, 16 cores per node for a total of 16 tasks.

#LOAD SOFTWARE MODULES
module purge
module load cpu intel mvapich2
module load netcdf-c netcdf-cxx netcdf-fortran
module load slurm

#RUN INITIALIZATION
srun --mpi=pmi2 -n 16 ideal.exe
#RUN MODEL
srun --mpi=pmi2 -n 16 wrf.exe

When you’re ready to submit the job to the queue:

sbatch wrf_sbatch.sh

This should respond with Submitted batch job [job number].
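If you need to cancel the job for any reason, scancel takes that same job number:

scancel [job number]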

To check on the job status, type

squeue -u $USER

It will have an R next to the job if it is running:

(base) [snesbitt@login01 em_quarter_ss]$ squeue -u snesbitt
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1164268   compute idealwrf snesbitt  R       0:07      1 exp-3-17

When it is complete, a C should appear (or the job will simply drop out of the queue).

You can also look at the contents of the file rsl.out.0000, which is a running log of the simulation. It gives the status, and while WRF is running it shows the current simulation time.

Try tail rsl.out.0000 during the simulation:

(base) [snesbitt@login01 em_quarter_ss]$ tail rsl.out.0000
Timing for main: time 0001-01-01_00:58:48 on domain   1:    0.00994 elapsed seconds
Timing for main: time 0001-01-01_00:59:00 on domain   1:    0.00994 elapsed seconds
Timing for main: time 0001-01-01_00:59:12 on domain   1:    0.01001 elapsed seconds
Timing for main: time 0001-01-01_00:59:24 on domain   1:    0.00996 elapsed seconds
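To follow the log continuously instead of re-running tail, you can use the -f flag (press Ctrl-C to stop following):

tail -f rsl.out.0000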

And when it has completed successfully, squeue -u $USER should show a C status or nothing at all (the job is out of the queue), and tail rsl.out.0000 should show:

Timing for main: time 0001-01-01_00:59:36 on domain   1:    0.00995 elapsed seconds
Timing for main: time 0001-01-01_00:59:48 on domain   1:    0.01000 elapsed seconds
Timing for main: time 0001-01-01_01:00:00 on domain   1:    0.00997 elapsed seconds
Timing for Writing wrfout_d01_0001-01-01_01:00:00 for domain        1:    0.10085 elapsed seconds
d01 0001-01-01_01:00:00 wrf: SUCCESS COMPLETE WRF
taskid: 0 hostname: exp-3-17
(base) [snesbitt@login01 em_quarter_ss]$

If there is an error, the log file will end prematurely, before the job finishes in the queue. Hopefully the error message you get is self-explanatory. If you get stuck, message me on Slack!
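One quick way to hunt for the problem yourself is to search all of the rsl log files at once (a sketch; the exact wording of WRF error messages varies):

grep -iE "error|fatal" rsl.*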