# User Guide
Learn how to use Skyway, the RCC Cloud Solution Offering.
## Requirements for using Skyway
* Have an active RCC user account
* Experience using the Midway cluster
* Experience using the SLURM resource scheduler (optional)
## Access to Skyway
* Log in to the RCC Midway3 cluster
```
ssh [your-cnetid]@midway3.rcc.uchicago.edu
```
* Load the Skyway module on the login node
```
module load skyway
```
## Using Skyway on the RCC Midway3 cluster
You submit jobs to the cloud in much the same way as you do on your HPC cluster. The difference is that you specify the partitions and accounts corresponding to the cloud services you have access to, and you specify the instance configuration via `--constraint`.
### List all the node types available to an account
```
skyway_nodetypes --account=your-cloud-account
skyway_nodetypes --account=your-gcp-account
```
To submit jobs to the cloud, you must specify a type of virtual machine (VM) with the option `--constraint=[VM Type]`. For instance, the AWS VM types currently available through Skyway are listed in the table below. Other VM types can be added upon request.
| VM Type | AWS EC2 Instance | Configuration | Description |
| ----------- | ----------- | ----------- | ----------- |
| t1 | t2.micro | 1 core, 1 GB RAM | for testing and building software |
| c1 | c5.large | 1 core, 4 GB RAM | for serial jobs |
| c8 | c5.4xlarge | 8 cores, 32 GB RAM | for medium-sized multicore jobs |
| c36 | c5.18xlarge | 36 cores, 144 GB RAM | for large multicore jobs |
| g1 | p3.2xlarge | 4 cores, 61 GB RAM, 1x V100 GPU | for GPU jobs |
| g4 | p3.8xlarge | 16 cores, 244 GB RAM, 4x V100 GPU | for heavy GPU jobs |
| g5 | g5.2xlarge | 8 cores, 32 GB RAM, 1x A10G GPU | for heavy GPU jobs |
| m24 | c5.12xlarge | 24 cores, 384 GB RAM | for large memory jobs |
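The mapping in the table can also be scripted. The helper below is a hypothetical convenience sketch, not part of Skyway: it picks a VM type from the table based on how many cores and GPUs a job needs.

```shell
# Hypothetical helper (not a Skyway command): map a job's core and GPU
# requirements to a VM type from the table above.
pick_vm_type() {
    local cores=$1 gpus=$2
    if [ "$gpus" -ge 4 ]; then echo g4        # 4x V100
    elif [ "$gpus" -ge 1 ]; then echo g1      # 1x V100
    elif [ "$cores" -gt 8 ]; then echo c36    # large multicore
    elif [ "$cores" -gt 1 ]; then echo c8     # medium multicore
    else echo c1                              # serial
    fi
}

pick_vm_type 8 0   # prints "c8"
```

It can then be used inline, e.g. `--constraint=$(pick_vm_type 8 0)`.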
The following steps show a representative workflow with Skyway.
### Allocate/provision an instance
```
skyway_alloc --account=your-cloud-account --constraint=t1 --time=01:00:00
```
For a GPU instance, use
```
skyway_alloc -A your-cloud-account --constraint=g5 --time=00:30:00
```
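Provisioning a cloud instance can take a few minutes. The small helper below is a hypothetical sketch, not a Skyway command: it polls whatever listing command you pass it until a VM with the given name appears in the output.

```shell
# Hypothetical helper (not part of Skyway): poll a listing command until
# its output mentions the given VM name. Adjust the sleep interval to taste.
wait_for_vm() {
    local name=$1; shift
    until "$@" | grep -q "$name"; do
        sleep 30
    done
}

# Usage on Midway3, after skyway_alloc returns:
# wait_for_vm your-run skyway_list --account=your-cloud-account
```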
### List all the running VMs with an account
```
skyway_list --account=your-cloud-account
```
### Transfer data
To copy a file from the login node to the instance named `your-run`:
```
skyway_transfer -A your-cloud-account -J your-run training.py
```
To transfer a file from an instance back to the login node:
```
skyway_transfer -A your-cloud-account -J your-run --from-cloud --cloud-path=~/output.txt $HOME/output.txt
```
### Connect to the VM named your-run
```
skyway_connect --account=your-cloud-account your-run
```
Once on the VM, do
```
nvidia-smi
source activate pytorch
python training.py > ~/output.txt
scp output.txt [yourcnetid]@midway3.rcc.uchicago.edu:~/
exit
```
At this point, there will be a file named `output.txt` in your Midway3 home folder.
### Stop/restart a job
To stop an instance, you use the `skyway_stop` command with your account and the instance ID:
```
skyway_stop -A your-cloud-account -i [instanceID]
```
The stopped instance does not get charged and the data on the VM is preserved.
To restart a stopped instance, you use the `skyway_restart` command with your account and the instance ID:
```
skyway_restart -A your-cloud-account -i [instanceID] -t 02:00:00
```
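If you have several instances running, stopping them can be combined with `skyway_list`. The sketch below assumes `skyway_list` prints a header line followed by one VM per line with the instance ID in the first column; verify the actual output format on your system before relying on it.

```shell
# Sketch (assumes skyway_list prints a header line, then one VM per line
# with the instance ID in the first column): stop every running instance.
stop_all() {
    local account=$1
    skyway_list --account="$account" | awk 'NR>1 {print $1}' |
    while read -r id; do
        skyway_stop -A "$account" -i "$id"
    done
}

# stop_all your-cloud-account
```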
### Cancel/terminate a job
```
skyway_cancel --account=your-cloud-account [job_name]
```
Expected behavior: the jobs (VMs) are terminated. When you run `skyway_list` (step 3 above), the VM will no longer be listed.
Note that when a VM is terminated, all the data on the VM is erased. The user should transfer the intermediate output to a cloud storage space or to local storage.
The following steps are for launching interactive and batch jobs.
### Submit an interactive job (combining steps 4, 6 and 7)
```
skyway_interactive --account=your-cloud-account --constraint=t1 --time=01:00:00
```
For a GPU instance, use
```
skyway_interactive --account=your-cloud-account --constraint=g5 -t 00:30:00
```
Expected behavior: the user lands on a compute node or a VM in a separate terminal session.
### Submit a batch job
A sample job script, `job_script.sh`, is given below:
```
#!/bin/sh
#SBATCH --job-name=your-run
#SBATCH --account=your-cloud-account
#SBATCH --constraint=g1
#SBATCH --time=06:00:00
skyway_transfer training.py
source activate pytorch
wget [url-of-the-dataset]
python training.py
```
To submit the job, use the `skyway_batch` command
```
skyway_batch job_script.sh
```
If the job is terminated because it reached its time limit, the instance is stopped rather than terminated. You can find the instance ID with the `skyway_list` command and restart the instance to retrieve the data.
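Finding and restarting the stopped instance can also be scripted. As with the earlier examples, this sketch assumes `skyway_list` prints the instance ID in the first column and the job name somewhere on the same line; `restart_by_name` is a hypothetical wrapper, not a Skyway command.

```shell
# Hypothetical wrapper: look up the instance ID for a job name in the
# skyway_list output (ID assumed to be in the first column) and restart it.
restart_by_name() {
    local account=$1 name=$2 walltime=$3
    local id
    id=$(skyway_list --account="$account" | awk -v n="$name" '$0 ~ n {print $1; exit}')
    [ -n "$id" ] && skyway_restart -A "$account" -i "$id" -t "$walltime"
}

# restart_by_name your-cloud-account your-run 02:00:00
```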
Transfer the output data from the cloud:
```
skyway_transfer -A your-cloud-account -J your-run --from-cloud --cloud-path=~/model*.pkl .
```
You can cancel the job using the `skyway_cancel` command.
## Troubleshooting
For further assistance, please contact our Help Desk at [help@rcc.uchicago.edu](mailto:help@rcc.uchicago.edu).