This lesson is in the early stages of development (Alpha version)

Introduction to High-Performance Computing: Knowledge Base

Key Points

Why use a Cluster?
  • High Performance Computing (HPC) typically involves connecting to very large computing systems elsewhere in the world.

  • These other systems can be used to do work that would either be impossible or much slower on smaller systems.

  • HPC resources are shared by multiple users.

  • The standard method of interacting with such systems is via a command line interface.

Connecting to a remote HPC system
  • An HPC system is a set of networked machines.

  • HPC systems typically provide login nodes and a set of worker nodes.

  • The resources found on independent (worker) nodes can vary in volume and type (amount of RAM, processor architecture, availability of network mounted filesystems, etc.).

  • Files saved to shared storage from one node are available on all nodes.

Exploring Remote Resources
  • An HPC system is a set of networked machines.

  • HPC systems typically provide login nodes and a set of compute nodes.

  • The resources found on independent (worker) nodes can vary in volume and type (amount of RAM, processor architecture, availability of network mounted filesystems, etc.).

  • Files saved on shared storage are available on all nodes.

  • The login node is a shared machine: be considerate of other users.

Scheduler Fundamentals
  • The scheduler handles how compute resources are shared between users.

  • A job is just a shell script.

  • Request slightly more resources than you will need.
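
As a minimal sketch of the "a job is just a shell script" point, the script below runs unchanged in an ordinary shell. The `#SBATCH` lines assume the Slurm scheduler, which this summary does not name; other schedulers use different directive prefixes and options. The job name, time limit, and memory request are illustrative values only.

```shell
#!/bin/bash
# Hypothetical job script. The "#SBATCH" lines are plain comments to the
# shell; only a scheduler (assumed here: Slurm) reads them as resource requests.
#SBATCH --job-name=example-job
#SBATCH --time=00:05:00
#SBATCH --mem=1G

# Everything below is ordinary shell code.
report() {
    echo "Job running on host: $(uname -n)"
}

report
```

Under the Slurm assumption, this file would be submitted with `sbatch`; run directly with `bash`, it simply prints the name of the machine it ran on.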

Environment Variables
  • Shell variables are by default treated as strings.

  • Variables are assigned using “=” and recalled using the variable’s name prefixed by “$”.

  • Use “export” to make a variable available to other programs.

  • The PATH variable defines the shell’s search path
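
The points above can be sketched in a few lines of shell. The variable names (`greeting`, `count`, `joined`) are made up for illustration, and the child program is simply another `sh` process.

```shell
#!/bin/bash
# Assign without spaces around "="; recall with "$".
greeting="hello"          # hypothetical variable name
echo "$greeting"

# Values are strings by default: this concatenates rather than adds.
count=1
joined="${count}+1"
echo "$joined"            # the text 1+1, not the number 2

# A plain shell variable is not visible to child programs...
before=$(sh -c 'echo "${greeting:-unset}"')
# ...until it is exported.
export greeting
after=$(sh -c 'echo "${greeting:-unset}"')
echo "before export: $before, after export: $after"

# PATH is a colon-separated list of directories searched for commands.
# Show the first three entries, one per line.
echo "$PATH" | tr ':' '\n' | head -n 3
```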

Accessing software via Modules
  • Load software with module load softwareName.

  • Unload software with module unload softwareName.

  • The module system handles software versioning and package conflicts for you automatically.

Transferring files with remote computers
  • wget and curl -O download a file from the internet.

  • scp and rsync transfer files to and from your computer.

  • You can use an SFTP client like FileZilla to transfer files through a GUI.

Running a parallel job
  • Parallel programming allows applications to take advantage of parallel hardware.

  • The queuing system facilitates executing parallel tasks.

  • Performance improvements from parallel execution do not scale linearly.

Using resources effectively
  • Accurate job scripts help the queuing system efficiently allocate shared resources.

Using shared resources responsibly
  • Be careful how you use the login node.

  • Your data on the system is your responsibility.

  • Plan and test large data transfers.

  • It is often best to convert many files to a single archive file before transferring.
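
One way to bundle many files into a single archive before a transfer is `tar`. The sketch below works in a throwaway directory; the directory and file names are hypothetical.

```shell
#!/bin/bash
# Build a throwaway directory of several small files (hypothetical names).
workdir=$(mktemp -d)
mkdir "$workdir/results"
for i in 1 2 3 4 5; do
    echo "sample $i" > "$workdir/results/run_$i.txt"
done

# Bundle and compress the directory into one archive file.
# -c create, -z gzip, -f archive file name, -C change directory first
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# List the archive contents to confirm every file made it in.
file_count=$(tar -tzf "$workdir/results.tar.gz" | grep -c 'run_')
echo "files in archive: $file_count"

# Remove the throwaway directory.
rm -rf "$workdir"
```

The single `results.tar.gz` file could then be sent with scp or rsync and unpacked on the other side with `tar -xzf`.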

Quick Reference or “Cheat Sheets” for Queuing System Commands

Search online for the one that fits you best, but here are some to start:

Units and Language

A computer’s memory and disk are measured in units called Bytes (one Byte is 8 bits). As today’s files and memory have grown large by historical standards, volumes are noted using the SI prefixes. So 1000 Bytes is a Kilobyte (kB), 1000 Kilobytes is a Megabyte (MB), 1000 Megabytes is a Gigabyte (GB), etc.

History and common language have, however, mixed this notation with a different meaning. When people say “Kilobyte”, they often mean 1024 Bytes instead. In that spirit, a Megabyte is 1024 Kilobytes.

To address this ambiguity, the International System of Quantities standardizes the binary prefixes (with a base of 2^10 = 1024) as Kibi (Ki), Mebi (Mi), Gibi (Gi), etc. For more details, see here.
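
A short arithmetic sketch makes the gap between the two conventions concrete. The variable names are illustrative only.

```shell
#!/bin/bash
# Decimal (SI) versus binary sizes, using integer shell arithmetic.
kB=1000;  MB=$((kB * 1000));   GB=$((MB * 1000))
KiB=1024; MiB=$((KiB * 1024)); GiB=$((MiB * 1024))

echo "1 GB  = $GB bytes"
echo "1 GiB = $GiB bytes"

# The gap between the two conventions grows with each prefix.
diff_bytes=$((GiB - GB))
echo "difference: $diff_bytes bytes"
```

At the giga scale the two meanings differ by 73741824 bytes, a little over 7%, which is why a disk or memory figure can look smaller or larger depending on which convention a tool uses.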

“No such file or directory” or “symbol 0096” Errors

scp and rsync may throw a perplexing error about files that very much do exist. One source of these errors is copy-and-paste of command line arguments from Web browsers, where the double-dash string -- is rendered as an em-dash character “—” (or en-dash “–”, or horizontal bar “―”). For example, instead of showing the transfer rate in real time, the following command fails mysteriously.

user@laptop:~$ rsync —progress my_precious_data.txt abc@blackbird.nist.gov:
rsync: link_stat "/home//—progress" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors)
(code 23) at main.c(1207) [sender=3.1.3]

The correct command, different only by two characters, succeeds:

user@laptop:~$ rsync --progress my_precious_data.txt abc@blackbird.nist.gov:

We have done our best to wrap all commands in code blocks, which prevents this subtle conversion. If you encounter this error, please open an issue or pull request on the lesson repository to help others avoid it.
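
If a pasted command is suspect, it can be checked for look-alike dashes before it is run. The sketch below is one possible approach, not part of the lesson's tooling: `has_unicode_dash` is a hypothetical helper name, and the three characters are built from their UTF-8 bytes so the script itself cannot be damaged by copy-and-paste.

```shell
#!/bin/bash
# Build the look-alike characters from their UTF-8 bytes (octal escapes).
en_dash=$(printf '\342\200\223')   # U+2013
em_dash=$(printf '\342\200\224')   # U+2014
h_bar=$(printf '\342\200\225')     # U+2015

# has_unicode_dash is a hypothetical helper, not a standard tool.
# It succeeds (exit status 0) when its argument contains a look-alike dash.
has_unicode_dash() {
    case "$1" in
        *"$en_dash"*|*"$em_dash"*|*"$h_bar"*) return 0 ;;
        *) return 1 ;;
    esac
}

bad="rsync ${em_dash}progress my_precious_data.txt"
good="rsync --progress my_precious_data.txt"

if has_unicode_dash "$bad"; then
    echo "bad:  contains a look-alike dash"
fi
if has_unicode_dash "$good"; then
    echo "good: contains a look-alike dash"
else
    echo "good: plain ASCII dashes"
fi
```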

Transferring Files Interactively With sftp

scp is useful, but what if we don’t know the exact location of what we want to transfer? Or perhaps we’re simply not sure which files we want to transfer yet. sftp is an interactive way of downloading and uploading files. Let’s connect to a cluster, using sftp – you’ll notice it works the same way as SSH:

user@laptop:~$ sftp yourUsername@remote.computer.address

This will start what appears to be a bash shell (though our prompt says sftp>). However, we only have access to a limited number of commands. We can see which commands are available with help:

sftp> help
Available commands:
bye                                Quit sftp
cd path                            Change remote directory to 'path'
chgrp grp path                     Change group of file 'path' to 'grp'
chmod mode path                    Change permissions of file 'path' to 'mode'
chown own path                     Change owner of file 'path' to 'own'
df [-hi] [path]                    Display statistics for current directory or
                                   filesystem containing 'path'
exit                               Quit sftp
get [-afPpRr] remote [local]       Download file
reget [-fPpRr] remote [local]      Resume download file
reput [-fPpRr] [local] remote      Resume upload file
help                               Display this help text
lcd path                           Change local directory to 'path'
lls [ls-options [path]]            Display local directory listing
lmkdir path                        Create local directory
ln [-s] oldpath newpath            Link remote file (-s for symlink)
lpwd                               Print local working directory
ls [-1afhlnrSt] [path]             Display remote directory listing

# omitted further output for clarity

Notice the presence of multiple commands that make mention of local and remote. We are actually connected to two computers at once (with two working directories!).

To show our remote working directory:

sftp> pwd
Remote working directory: /global/home/yourUsername

To show our local working directory, we add an l in front of the command:

sftp> lpwd
Local working directory: /home/jeff/Documents/teaching/hpc-intro

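The same pattern follows for other commands: for example, ls lists the remote directory while lls lists the local one, and cd changes the remote directory while lcd changes the local one.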

To upload a file, we type put some-file.txt (tab-completion works here).

sftp> put config.toml
Uploading config.toml to /global/home/yourUsername/config.toml
config.toml                                  100%  713     2.4KB/s   00:00

To download a file we type get some-file.txt:

sftp> get config.toml
Fetching /global/home/yourUsername/config.toml to config.toml
/global/home/yourUsername/config.toml        100%  713     9.3KB/s   00:00

We can also put or get directories recursively by adding -r. Note that the destination directory needs to be present beforehand, so here we create it on the remote side first.

sftp> mkdir content
sftp> put -r content/
Uploading content/ to /global/home/yourUsername/content
Entering content/
content/scheduler.md              100%   11KB  21.4KB/s   00:00
content/index.md                  100% 1051     7.2KB/s   00:00
content/transferring-files.md     100% 6117    36.6KB/s   00:00
content/.transferring-files.md.sw 100%   24KB  28.4KB/s   00:00
content/cluster.md                100% 5542    35.0KB/s   00:00
content/modules.md                100%   17KB 158.0KB/s   00:00
content/resources.md              100% 1115    29.9KB/s   00:00

To quit, we type exit or bye.
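
Once the commands are known, the same session can be scripted rather than typed. The sketch below writes the commands used above into a batch file and prints it; the temporary file is created only for illustration. The actual `sftp -b` call is left commented out because it needs a real account and non-interactive (for example key-based) authentication.

```shell
#!/bin/bash
# Write the sftp commands used above into a batch file (temporary, hypothetical).
batch_file=$(mktemp)
cat > "$batch_file" <<'EOF'
put config.toml
mkdir content
put -r content/
bye
EOF

# Show what would be sent to sftp.
cat "$batch_file"

# To run it, pass the file to sftp with -b (batch mode):
# sftp -b "$batch_file" yourUsername@remote.computer.address
```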