Advanced Shell Scripting

Overview

Teaching: 60 min
Exercises: 0 min

Questions

How do I use if statements?

Can I add date to filenames in bash script?

Can I use file attributes like image size or properties to filter files?

Objectives

Learn how to use conditionals

Learn about getting system information into scripts

Learn about getting file information into scripts

Instructor note: there are intentional typos in these examples to show the importnace of spaces

Data Organization

To start the workshop, we need to download some data files. You’ll want to unzip the files some place where you can find them.

Having a file/folder naming convention is the first step for good data management. The library has a great worksheet that steps you through lots of options. For this workshop the important metadata is the data and the type of workshop. So open a terminal window and type

mkdir ~/Documents/2025-03-14-shell-hpc

where Documents is the path to wherever on your computer you want to store your files.

Next we’ll need to move our data files into this folder. You’ll need to remember where you downloaded and unzipped the shell-lesson-data zip file. The north-pacific-gyre folder has everything we’re going to need, and we’re going to set up a subfolder arrangement for our data

cd ~/Documents/2025-03-14-shell-hpc mkdir data cp ~/Desktop/shell-lesson-data/north-pacific-gyre/* data/.

I don’t like that the applications are in the data folder, so let’s move those out.

mv data/goo* .

Let’s write a short readme describing the setup.

nano README.md

# Carpentry Shell Lesson Data Analysis

Data copied from the north-pacific-gyre folder in the carpentries shell lesson data downloaded from https://swcarpentry.github.io/shell-novice/data/shell-lesson-data.zip

We also need a place to put our results

mkdir results

Reviewing scripting

We’re going to re-do the demonstration script from the shell-novice lesson with our new structure.

Nelle’s Pipeline: Processing Files

Nelle is now ready to process her data files using goostats — a shell script written by her supervisor. This calculates some statistics from a protein sample file, and takes two arguments:

an input file (containing the raw data)
an output file (to store the calculated statistics)

Since she’s still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right input files — remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’. Starting from her home directory, Nelle types:

Now type nano run.sh that will generate an input and output file name

 #!/bin/bash 
 for datafile in data/NENE*[AB].txt
 do
     filename=$(basename "$datafile")
     echo $datafile results/stats-$filename
 done

data/NENE01729A.txt results/stats-NENE01729A.txt
data/NENE01729B.txt results/stats-NENE01729B.txt
data/NENE01736A.txt results/stats-NENE01736A.txt
...
data/NENE02043A.txt results/stats-NENE02043A.txt
data/NENE02043B.txt results/stats-NENE02043B.txt

She hasn’t actually run goostats yet, but now she’s sure she can select the right files and generate the right output filenames.

$ for datafile in NENE*[AB].txt
 do
     filename=$(basename "$datafile")
     bash goostats.sh $datafile results/stats-$filename
 done

When she presses Enter, the shell runs the modified command. However, nothing appears to happen — there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl-C, uses up-arrow to repeat the command, and edits it to read:

$ for datafile in NENE*[AB].txt
 do
     filename=$(basename "$datafile")
     echo $filename
     bash goostats.sh $datafile results/stats-$filename
 done

System information and variables

You can get the current date using the date command. There are lots of formatting options, but we’re going to go with the recommended year-month-day option.

date "+%F"

Let’s make a script that prints out the date. We can save the date in a variable like

date = $(date "+%F")

You probably got an error like

date: illegal time format

This is because we had extra spaces around the equals sign. This is a bit confusing, because the error is coming from the variable name we used ‘date’. Since there is a space, bash thinks that ‘date’ variable name is a command we want to run. If you use

date=$(date "+%F")
echo $date

You should get the date printed as expected

Conditionals

You can use conditional statements to test whether something is true or false and do a programmatic behavior as a result. Let’s go into the molecules directory and make a script that will show us molecules with at least a certain number of lines. Make a new script called is_big.sh

We know that wc -l gives us the number of lines in a file. Let’s save that to a variable.

num=$(wc -l $1)

We build an if statement like a loop

if ["$num" -gt "5"]
then
    echo $1 "is big enough"
fi

Does that work? You’ll probably get an error

[      30: command not found

This is again a spacing issue, but the opposite of the earlier one we saw. You need a space after the [, otherwise bash thinks it is a command. Once we fix the spacing

if [ "$num" -gt "5" ]
then
    echo $1 "is big enough"
fi

We get a different error

is_big.sh: line 2: :       30 octane.pdb: integer expression expected

We forgot to check our import. wc -l gives us the size and the file name which isn’t a number. If we redirect the file into wc it will work

num=$(wc -l < $1)

We can add else to have the script always print something

if [ "$num" -gt "5" ]
then
    echo $1 "is big enough"
else
    echo $1 "is not big enough"
fi

Activity: Make the size cutoff generalizabla

System Variables

You can set variables outside of your script that you can use in the script. This is useful for saving passwords or things that you don’t want to put in the script and don’t want to have to type at the command line every time. Let’s go back to molecules and have the size cutoff be an environment variable. First we’ll set the variable.

$ export CUTOFF=5

Then add the variable to your script

if [ "$num" -gt $CUTOFF ]

If you want variables to be set wevery time you log in, you can add them to the .bash_profile file (or .zshrc file if you’re using the most recent OSX version

don’t know type echo “$SHELL” and is if says /bin/zsh you’re using the most recent version) in your home directory

Key Points

Shell scripts can be used to more complicated programming tasks

Working Remotely

Overview

Teaching: 30 min
Exercises: 0 min

Questions

How do I use ‘ssh’ and ‘scp’ ?

Objectives

Learn what SSH is

Learn what an SSH key is

Generate your own SSH key pair

Learn how to use your SSH key

Learn how to work remotely using ssh and scp

Add your SSH key to an remote server

Let’s take a closer look at what happens when we use the shell on a desktop or laptop computer. The first step is to log in so that the operating system knows who we are and what we’re allowed to do. We do this by typing our username and password; the operating system checks those values against its records, and if they match, runs a shell for us.

As we type commands, the 1’s and 0’s that represent the characters we’re typing are sent from the keyboard to the shell. The shell displays those characters on the screen to represent what we type, and then, if what we typed was a command, the shell executes it and displays its output (if any).

What if we want to run some commands on another machine, such as the server in the basement that manages our database of experimental results? To do this, we have to first log in to that machine. We call this a remote login.

In order for us to be able to login, the remote computer must be running a remote login server and we will run a client program that can talk to that server. The client program passes our login credentials to the remote login server and, if we are allowed to login, that server then runs a shell for us on the remote computer.

Once our local client is connected to the remote server, everything we type into the client is passed on, by the server, to the shell running on the remote computer. That remote shell runs those commands on our behalf, just as a local shell would, then sends back output, via the server, to our client, for our computer to display.

SSH History

Back in the day, when everyone trusted each other and knew every chip in their computer by its first name, people didn’t encrypt anything except the most sensitive information when sending it over a network and the two programs used for running a shell (usually back then, the Bourne Shell, sh) on, or copying files to, a remote machine were named rsh and rcp, respectively. Think (r)emote sh and cp

However, anyone could watch the unencrypted network traffic, which meant that villains could steal usernames and passwords, and use them for all manner of nefarious purposes.

The SSH protocol was invented to prevent this (or at least slow it down). It uses several sophisticated, and heavily tested, encryption protocols to ensure that outsiders can’t see what’s in the messages going back and forth between different computers.

The remote login server which accepts connections from client programs is known as the SSH daemon, or sshd.

The client program we use to login remotely is the secure shell, or ssh, think (s)ecure sh.

The ssh login client has a companion program called scp, think (s)ecure cp, which allows us to copy files to or from a remote computer using the same kind of encrypted connection.

A remote login using `ssh`

To make a remote login, we issue the command ssh username@computer which tries to make a connection to the SSH daemon running on the remote computer we have specified.

After we log in, we can use the remote shell to use the remote computer’s files and directories.

Typing exit or Control-D terminates the remote shell, and the local client program, and returns us to our previous shell.

$ pwd

/users/vlad

If you’re using the Caltech HPC Cluster, use your Caltech username in place of “username” and type

$ ssh username@login.hpc.caltech.edu
Password: ********

You’ll also need to respond to a Duo two factor authentication prompt.

If you’re using XSEDE, use your XSEDE username in place of “username” and type

$ ssh username@comet.sdsc.xsede.org
Password: ********

If this is the first time logging into the remote system, you might see a message like

The authenticity of host 'comet.sdsc.xsede.org (198.202.113.253)' can't be established.
RSA key fingerprint is SHA256:z2NBrOo633o/lePpqSVDyaLaOODcoU0zn8S2k1xDkW0.
Are you sure you want to continue connecting (yes/no)?

This is a security message that protects you from signing into a computer that is impersonating a system you use. Since this is your first time logging into the system, you can say yes and click enter. If you see this message another time, you may want to ask the system owner if a change has been made on the remote system.

You’re now terminal is now on the remote system. Check to the left of the prompt to see where you are

    moon> pwd

    /home/vlad

Copying files to, and from a remote machine using `scp`

Let’s open up a new terminal window, and transfer some files to the cluster

To copy a file, we specify the source and destination paths, either of which may include computer names. If we leave out a computer name, scp assumes we mean the machine we’re running on.

Let’s copy all the files we just created to the remote system

cd ~/Documents
~~


~~
$ scp -r 2025-03-14-shell-hpc tmorrell@login.hpc.caltech.edu:.
Password: ********

goodiff.sh                      100%  345    44.3KB/s   00:00    
run.sh                          100%  150    18.2KB/s   00:00
...

Note the colon :, seperating the hostname of the server and the pathname of the file we are copying to. It is this character that informs scp that the source or target of the copy is on the remote machine and the reason it is needed can be explained as follows:

In the same way that the default directory into which we are placed when running a shell on a remote machine is our home directory on that machine, the default target, for a remote copy, is also the home directory.

This means that

$ scp results.dat vlad@backupserver:

would copy results.dat into our home directory on backupserver, however, if we did not have the colon to inform scp of the remote machine, we would still have a valid commmad

$ scp results.dat vlad@backupserver

but now we have merely created a file called vlad@backupserver on our local machine, as we would have done with cp.

$ cp results.dat vlad@backupserver

Copying a whole directory betwen remote machines uses the same syntax as the cp command: we just use the -r option to signal that we want copying to be recursively.

We can now go back to the other terminal window to see that all our files transferred.

Running commands on a remote machine using `ssh`

Here’s one more thing the ssh client program can do for us. Suppose we want to check whether we have already created the file backups/results-2011-11-12.dat on the backup server. Instead of logging in and then typing ls, we could do this:

$ ssh vlad@backupserver "ls results*"
Password: ********

results-2011-09-18.dat  results-2011-10-28.dat
results-2011-10-04.dat  results-2011-11-11.dat

Here, ssh takes the argument after our remote username and passes them to the shell on the remote computer. (We have to put quotes around it to make it look like a single argument.) Since those arguments are a legal command, the remote shell runs ls results for us and sends the output back to our local shell for display.

SSH Keys

Typing our password over and over again is annoying, especially if the commands we want to run remotely are in a loop. To remove the need to do this, we can create an SSH key to tell the remote machine that it should always trust us.

SSH keys come in pairs, a public key that gets shared with services like GitHub, and a private key that is stored only on your computer. If the keys match, you’re granted access.

The cryptography behind SSH keys ensures that no one can reverse engineer your private key from the public one.

The first step in using SSH authorization is to generate your own key pair.

You might already have an SSH key pair on your machine. You can check to see if one exists by moving to your .ssh directory and listing the contents.

$ cd ~/.ssh
$ ls

If you see id_rsa.pub, you already have a key pair and don’t need to create a new one.

If you don’t see id_rsa.pub, use the following command to generate a new key pair. Make sure to replace your@email.com with your own email address.

$ ssh-keygen -t rsa -C "your@email.com"

When asked where to save the new key, hit enter to accept the default location.

Generating public/private rsa key pair.
Enter file in which to save the key (/Users/username/.ssh/id_rsa):

You will then be asked to provide an optional passphrase. This can be used to make your key even more secure, but if what you want is avoiding type your password every time you can skip it by hitting enter twice.

Enter passphrase (empty for no passphrase):
Enter same passphrase again:

When the key generation is complete, you should see the following confirmation:

Your identification has been saved in /Users/username/.ssh/id_rsa.
Your public key has been saved in /Users/username/.ssh/id_rsa.pub.
The key fingerprint is:
01:0f:f4:3b:ca:85:d6:17:a1:7d:f0:68:9d:f0:a2:db your@email.com
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|        . E +    |
|       . o = .   |
|      . S =   o  |
|       o.O . o   |
|       o .+ .    |
|      . o+..     |
|       .+=o      |
+-----------------+

The random art image is an alternate way to match keys but we won’t be needing this.

Now you need to place a copy of your public key ony any servers you would like to use SSH to connect to, instead of logging in with a username and passwd.

Display the contents of your new public key file with cat:

$ cat ~/.ssh/id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA879BJGYlPTLIuc9/R5MYiN4yc/YiCLcdBpSdzgK9Dt0Bkfe3rSz5cPm4wmehdE7GkVFXrBJ2YHqPLuM1yx1AUxIebpwlIl9f/aUHOts9eVnVh4NztPy0iSU/Sv0b2ODQQvcy2vYcujlorscl8JjAgfWsO3W4iGEe6QwBpVomcME8IU35v5VbylM9ORQa6wvZMVrPECBvwItTY8cPWH3MGZiK/74eHbSLKA4PY3gM4GHI450Nie16yggEg2aTQfWA1rry9JYWEoHS9pJ1dnLqZU3k/8OWgqJrilwSoC5rGjgp93iu0H8T6+mEHGRQe84Nk1y5lESSWIbn6P636Bl3uQ== your@email.com

Copy the contents of the output.

$ ssh vlad@moon.euphoric.edu
Password: ********

Paste the content that you copy at the end of ~/.ssh/authorized_keys.

    moon> nano ~/.ssh/authorized_keys

After append the content, logout of the remote machine and try login again. If you setup your SSH key correctly you won’t need to type your password.

    moon> exit

$ ssh vlad@moon.euphoric.edu

SSH Files and Directories

The example of copying our public key to a remote machine, so that it can then be used when we next SSH into that remote machine, assumed that we already had a directory ~/.ssh/.

Whilst a remote server may support the use of SSH to login, your home directory there may not contain a .ssh directory by default.

We have already seen that we can use SSH to run commands on remote machines, so we can ensure that everything is set up as required before we place the copy of our public key on a remote machine.

Walking through this process allows us to highlight some of the typical requirements of the SSH protocol itself, as documented in the man-page for the ssh command.

Firstly, we check that we have a .ssh/ directory on another remote machine, comet

$ ssh vlad@comet "ls -ld ~/.ssh"
Password: ********

    ls: cannot access /home/vlad/.ssh: No such file or directory

Oh dear! We should create the directory; and check that it’s there (Note: two commands, seperated by a semicolon)

$ ssh vlad@comet "mkdir ~/.ssh; ls -ld ~/.ssh"
Password: ********

    drwxr-xr-x 2 vlad vlad 512 Jan 01 09:09 /home/vlad/.ssh

Now we have a dot-SSH directory, into which to place SSH-related files but we can see that the default permissions allow anyone to inspect the files within that directory.

For a protocol that is supposed to be secure, this is not considered a good thing and so the recommended permissions are read/write/execute for the user, and not accessible by others.

Let’s alter the permissions on the directory

$ ssh vlad@comet "chmod 700 ~/.ssh; ls -ld ~/.ssh"
Password: ********

    drwx------ 2 vlad vlad 512 Jan 01 09:09 /home/vlad/.ssh

That’s looks much better.

In the above example, it was suggested that we paste the content of our public key at the end of ~/.ssh/authorized_keys, however as we didn’t have a ~/.ssh/ on this remote machine, we can simply copy our public key over as the initial ~/.ssh/authorized_keys, and of course, we will use scp to do this, even though we don’t yet have passwordless SSH access set up.

$ scp ~/.ssh/id_rsa.pub vlad@comet:.ssh/authorized_keys
Password: ********

Note that the default target for the scp command on a remote machine is the home directory, so we have not needed to use the shorthand ~/.ssh/ or even the full path /home/vlad/.ssh/ to our home directory there.

Checking the permissions of the file we have just created on the remote machine, also serves to indicate that we no longer need to use our password, because we now have what’s needed to use SSH without it.

$ ssh vlad@comet "ls -l ~/.ssh"

    -rw-r--r-- 2 vlad vlad 512 Jan 01 09:11 /home/vlad/.ssh/authorized_keys

Whilst the authorized keys file is not considered to be highly sensitive, (after all, it contains public keys), we alter the permissions to match the man page’s recommendations

$ ssh vlad@comet "chmod go-r ~/.ssh/authorized_keys; ls -l ~/.ssh"

    -rw------- 2 vlad vlad 512 Jan 01 09:11 /home/vlad/.ssh/authorized_keys

Key Points

SSH is a secure alternative to username/password authorization

SSH keys are generated in public/private pairs. Your public key can be shared with others. The private keys stays on your machine only.

The ‘ssh’ and ‘scp’ utilities are secure alternatives to logging into, and copying files to/from remote machine

Running Bash Scripts on an HPC Cluster

Overview

Teaching: 30 min
Exercises: 0 min

Questions

How do I run a bash script on a HPC cluster ?

Objectives

Learn the organization of a HPC cluster

Learn the SLURM scheduler

Let’s ssh back to the cluster, where we’re copied Nell’s goobash files and the bash script we created to automate her analysis. Now we want to modify the script such that it can run on the HPC cluster.

Job scheduler

An HPC system might have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This job is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.

The scheduler for the clusters we’ll use today us SLURM. We have to provide SLURM details on how to run our job. Be do this by adding special comments at the top of our bash script.

First add

#SBATCH --job-name="goostats"
#SBATCH --output="goostats.%j.%N.out"

to the top of your script. This is the name for your job and the output file name for anything that would normally show up on the command line. The ‘%j’ variable is the unique job id that SLURM will assign, and %N is the hostname where the job ran. You can customize these names to be whatever your want.

Next we’ll set information about what computing resources the job needs. These include the number of individual tasks your job will generate (the number of processors you need), the number of nodes (computers) yur job needs, and the amount of memory your job requires. Choosing these values depends on the setup of the cluster (how many cores are available per node) and how well your code works when split up among many CPUs (paralellization). You can experiment with these values to determine which combo gives you the best performance.

#SBATCH --ntasks=1   # number of processor cores (i.e. tasks)
#SBATCH --nodes=1   # number of nodes
#SBATCH --mem-per-cpu=1G   # memory per CPU core

Next we add how long we expect the calculation to take. It’s better to overestimate this value, because if your calculation is still running after this time the scheduler will kill it.

#SBATCH --time=00:30:00   # walltime

The Caltech HPC cluster doesn’t have queues, and instead uses a quality of service flag to indicate test jobs

#SBATCH --qos=debug

If you have access to multiple accounts to charge usage, you may have to specify which to use.

#SBATCH --account=

salloc

If you want, you can have the scheduler send you emails when the job starts, finishes, or has a problem.

#SBATCH --mail-user=myemail@example.com   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL

You can then send your script to the scheduler by typing

$ sbatch do-stats.sh

You can check how your job is doing by typing

$ squeue

or use the -u option to filter out only the jobs for your username.

Your jobs will have a status indicated by the ST column, likely PD for something that is still in the queue or R for something that is running. All statuses are listed at https://slurm.schedmd.com/squeue.html

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          31263068     debug goostats tmorrell PD       0:00      1 (None)

Once your job is completed, you should see that stats- files have been created, as we expect for this bash script. There is also a new file named something like ‘goostats.31263068.comet-14-01.out’. Type

$ cat goostats.31263068.comet-14-01.out

and you’ll see the same output as you would if you ran the script on the command line.

Let’s say we want to share out results with CaltechDATA. We’ll start by installing the caltechdata_api command line tool

$ pip install caltechdata_api

We’re going to work on the CaltechDATA test system. Log into data.caltechlibrary.dev, go to your username in the upper right hand corner, and click on “Applications”.

Then in the “Personal access token” section select “Add token”. Save the token somewhere safe; it’s like a password.

Let’s say we want to share our results.

$ caltechdata_api -test

We’re going to “create” a new record. You’ll need to enter your CaltechDATA test token, which will be saved on the HPC cluster

We’re going to “create” metadata for our record. Enter a title and description for your files. Select your desired license and your ORCID identifier.

You can directly upload small files, like our script. For large files, we’ll want to directly upload to S3.

When you create the record, it will show up as a draft in CaltechDATA. You’ll can review the record, or upload files to S3.

When you want to link the S3 files, re-run caltechdata_api and select the edit option. When you get to files, pick link. When you publish in CaltechDATA, the linked files will now appear.

Key Points

Clusters work like your computer, but with a scheduler

Permissions

Overview

Teaching: 10 min
Exercises: 0 min

Questions

Understanding file/directory permissions

Objectives

What are file/directory permissions?

How to view permissions?

How to change permissions?

File/directory permissions in Windows

Unix controls who can read, modify, and run files using permissions. We’ll discuss how Windows handles permissions at the end of the section: the concepts are similar, but the rules are different.

Let’s start with Nelle. She has a unique user name, nnemo, and a user ID, 1404.

Why Integer IDs?

Why integers for IDs? Again, the answer goes back to the early 1970s. Character strings like alan.turing are of varying length, and comparing one to another takes many instructions. Integers, on the other hand, use a fairly small amount of storage (typically four characters), and can be compared with a single instruction. To make operations fast and simple, programmers often keep track of things internally using integers, then use a lookup table of some kind to translate those integers into user-friendly text for presentation. Of course, programmers being programmers, they will often skip the user-friendly string part and just use the integers, in the same way that someone working in a lab might talk about Experiment 28 instead of “the chronotypical alpha-response trials on anacondas”.

Users can belong to any number of groups, each of which has a unique group name and numeric group ID. The list of who’s in what group is usually stored in the file /etc/group. (If you’re in front of a Unix machine right now, try running cat /etc/group to look at that file.)

Now let’s look at files and directories. Every file and directory on a Unix computer belongs to one owner and one group. Along with each file’s content, the operating system stores the numeric IDs of the user and group that own it.

The user-and-group model means that for each file every user on the system falls into one of three categories: the owner of the file, someone in the file’s group, and everyone else.

For each of these three categories, the computer keeps track of whether people in that category can read the file, write to the file, or execute the file (i.e., run it if it is a program).

For example, if a file had the following set of permissions:

	user	group	all
read	yes	yes	no
write	yes	no	no
execute	no	no	no

it would mean that:

the file’s owner can read and write it, but not run it;
other people in the file’s group can read it, but not modify it or run it; and
everybody else can do nothing with it at all.

Let’s look at this model in action. If we cd into the labs directory and run ls -F, it puts a * at the end of setup’s name. This is its way of telling us that setup is executable, i.e., that it’s (probably) something the computer can run.

$ cd labs
$ ls -F

safety.txt    setup*     waiver.txt

Necessary But Not Sufficient

The fact that something is marked as executable doesn’t actually mean it contains a program of some kind. We could easily mark this HTML file as executable using the commands that are introduced below. Depending on the operating system we’re using, trying to “run” it will either fail (because it doesn’t contain instructions the computer recognizes) or cause the operating system to open the file with whatever application usually handles it (such as a web browser).

Now let’s run the command ls -l:

$ ls -l

-rw-rw-r-- 1 vlad bio  1158  2010-07-11 08:22 safety.txt
-rwxr-xr-x 1 vlad bio 31988  2010-07-23 20:04 setup
-rw-rw-r-- 1 vlad bio  2312  2010-07-11 08:23 waiver.txt

The -l flag tells ls to give us a long-form listing. It’s a lot of information, so let’s go through the columns in turn.

On the right side, we have the files’ names. Next to them, moving left, are the times and dates they were last modified. Backup systems and other tools use this information in a variety of ways, but you can use it to tell when you (or anyone else with permission) last changed a file.

Next to the modification time is the file’s size in bytes and the names of the user and group that owns it (in this case, vlad and bio respectively). We’ll skip over the second column for now (the one showing 1 for each file) because it’s the first column that we care about most. This shows the file’s permissions, i.e., who can read, write, or execute it.

Let’s have a closer look at one of those permission strings: -rwxr-xr-x. The first character tells us what type of thing this is: ‘-‘ means it’s a regular file, while ‘d’ means it’s a directory, and other characters mean more esoteric things.

The next three characters tell us what permissions the file’s owner has. Here, the owner can read, write, and execute the file: rwx. The middle triplet shows us the group’s permissions. If the permission is turned off, we see a dash, so r-x means “read and execute, but not write”. The final triplet shows us what everyone who isn’t the file’s owner, or in the file’s group, can do. In this case, it’s ‘r-x’ again, so everyone on the system can look at the file’s contents and run it.

To change permissions, we use the chmod command (whose name stands for “change mode”). Here’s a long-form listing showing the permissions on the final grades in the course Vlad is teaching:

$ ls -l final.grd

-rwxrwxrwx 1 vlad bio  4215  2010-08-29 22:30 final.grd

Whoops: everyone in the world can read it—and what’s worse, modify it! (They could also try to run the grades file as a program, which would almost certainly not work.)

The command to change the owner’s permissions to rw- is:

$ chmod u=rw final.grd

The ‘u’ signals that we’re changing the privileges of the user (i.e., the file’s owner), and rw is the new set of permissions. A quick ls -l shows us that it worked, because the owner’s permissions are now set to read and write:

$ ls -l final.grd

-rw-rwxrwx 1 vlad bio  4215  2010-08-30 08:19 final.grd

Let’s run chmod again to give the group read-only permission:

$ chmod g=r final.grd
$ ls -l final.grd

-rw-r--rw- 1 vlad bio  4215  2010-08-30 08:19 final.grd

And finally, let’s give “all” (everyone on the system who isn’t the file’s owner or in its group) no permissions at all:

$ chmod a= final.grd
$ ls -l final.grd

-rw-r----- 1 vlad bio  4215  2010-08-30 08:20 final.grd

Here, the ‘a’ signals that we’re changing permissions for “all”, and since there’s nothing on the right of the “=”, “all”’s new permissions are empty.

We can search by permissions, too. Here, for example, we can use -type f -perm -u=x to find files that the user can execute:

$ find . -type f -perm -u=x

./tools/format
./tools/stats

Before we go any further, let’s run ls -a -l to get a long-form listing that includes directory entries that are normally hidden:

$ ls -a -l

drwxr-xr-x 1 vlad bio     0  2010-08-14 09:55 .
drwxr-xr-x 1 vlad bio  8192  2010-08-27 23:11 ..
-rw-rw-r-- 1 vlad bio  1158  2010-07-11 08:22 safety.txt
-rwxr-xr-x 1 vlad bio 31988  2010-07-23 20:04 setup
-rw-rw-r-- 1 vlad bio  2312  2010-07-11 08:23 waiver.txt

The permissions for . and .. (this directory and its parent) start with a ‘d’. But look at the rest of their permissions: the ‘x’ means that “execute” is turned on. What does that mean? A directory isn’t a program—how can we “run” it?

In fact, ‘x’ means something different for directories. It gives someone the right to traverse the directory, but not to look at its contents. The distinction is subtle, so let’s have a look at an example. Vlad’s home directory has three subdirectories called venus, mars, and pluto:

execute

Each of these has a subdirectory in turn called notes, and those sub-subdirectories contain various files. If a user’s permissions on venus are ‘r-x’, then if she tries to see the contents of venus and venus/notes using ls, the computer lets her see both. If her permissions on mars are just ‘r–’, then she is allowed to read the contents of both mars and mars/notes. But if her permissions on pluto are only ‘–x’, she cannot see what’s in the pluto directory: ls pluto will tell her she doesn’t have permission to view its contents. If she tries to look in pluto/notes, though, the computer will let her do that. She’s allowed to go through pluto, but not to look at what’s there. This trick gives people a way to make some of their directories visible to the world as a whole without opening up everything else.

What about Windows?

Those are the basics of permissions on Unix. As we said at the outset, though, things work differently on Windows. There, permissions are defined by access control lists, or ACLs. An ACL is a list of pairs, each of which combines a “who” with a “what”. For example, you could give the Mummy permission to append data to a file without giving him permission to read or delete it, and give Frankenstein permission to delete a file without being able to see what it contains.

This is more flexible that the Unix model, but it’s also more complex to administer and understand on small systems. (If you have a large computer system, nothing is easy to administer or understand.) Some modern variants of Unix support ACLs as well as the older read-write-execute permissions, but hardly anyone uses them.

Challenge

If ls -l myfile.php returns the following details:
-rwxr-xr-- 1 caro zoo  2312  2014-10-25 18:30 myfile.php
Which of the following statements is true?

caro (the owner) can read, write, and execute myfile.php

caro (the owner) cannot write to myfile.php

members of caro (a group) can read, write, and execute myfile.php

members of zoo (a group) cannot execute myfile.php

Key Points

Correct permissions are critical for the security of a system.

File permissions describe who and what can read, write, modify, and access a file.

Use ls -l to view the permissions for a specific file.

Use chmod to change permissions on a file or directory.

Directory structure

Overview

Teaching: 5 min
Exercises: 0 min

Questions

Understanding the concept of Unix directory structure

Objectives

FIXME

All Unix files are integrated in a single directory structure. The file-system is arranged in a structure like an inverted tree. The top of this tree is the root and is written as a slash ‘/’.

The `tree` command

FIXME

Key Points

FIXME

Job control

Overview

Teaching: 5 min
Exercises: 0 min

Questions

How do keep track of the process running on my machine?

Can I run more than one program/script from within a shell?

Objectives

Learn how to use ps to get information about the state of processes

Learn how to control, ie., “stop/pause/background/foreground” processes

The shell-novice lesson explained how we run programs or scripts from the shell’s command line.

We’ll now take a look at how to control programs once they’re running. This is called job control, and while it’s less important today than it was back in the Dark Ages, it is coming back into its own as more people begin to leverage the power of computer networks.

When we talk about controlling programs, what we really mean is controlling processes. As we said earlier, a process is just a program that’s in memory and executing. Some of the processes on your computer are yours: they’re running programs you explicitly asked for, like your web browser. Many others belong to the operating system that manages your computer for you, or, if you’re on a shared machine, to other users.

The `ps` command

You can use the ps command to list processes, just as you use ls to list files and directories.

Behaviour of the ps command

The ps command has a swathe of option flags that control its behaviour and, what’s more, the sets of flags and default behaviour vary across different platforms.

A bare invocation of ps only shows you basic information about your, active processes.

After that, this is a command that it is worth reading the ‘man page’ for.

$ ps

  PID TTY          TIME CMD
12767 pts/0    00:00:00 bash
15283 pts/0    00:00:00 ps

At the time you ran the ps command, you had two active processes, your (bash) shell and the (ps) command you had invoked in it.

Chances are that you were aware of that information, without needing to run a command to tell you it, so let’s try and put some flesh on that bare bones information.

$ ps -f

UID        PID  PPID  C STIME TTY          TIME CMD
vlad     12396 25397  0 14:28 pts/0    00:00:00 ps -f
vlad     25397 25396  0 12:49 pts/0    00:01:39 bash

In case you haven’t had time to do a man ps yet, be aware that the -f flag doesn’t stand for “flesh on the bones” but for “Do full-format listing”, although even then, there are “fuller” versions of the ps output.

But what are we being told here?

Every process has a unique process id (PID). Remember, this is a property of the process, not of the program that process is executing: if you are running three instances of your browser at once, each will have its own process ID.

The third column in this listing, PPID, shows the ID of each process’s parent. Every process on a computer is spawned by another, which is its parent (except, of course, for the bootstrap process that runs automatically when the computer starts up).

Clearly, the ps -f that was run is a child process of the (bash) shell it was invoked in.

Column 1 shows the username of the user the processes are being run by. This is the username the computer uses when checking permissions: each process is allowed to access exactly the same things as the user running it, no more, no less.

Column 5, STIME, shows when the process started running, whilst Column 7, TIME, shows you how much time process has used, whilst Column 8, CMD, shows what program the process is executing.

Column 6, TTY, shows the ID of the terminal this process is running in. Once upon a time, this really would have been a terminal connected to a central timeshared computer. It isn’t as important these days, except that if a process is a system service, such as a network monitor, ps will display a question mark for its terminal, since it doesn’t actually have one.

The fourth column, C, is an indication of the perCentage of processor utilization.

Your version of ps may show more or fewer columns, or may show them in a different order, but the same information is generally available everywhere, and the column headers are generally consistent.

Stopping, pausing, resuming, and backgrounding, processes

The shell provides several commands for stopping, pausing, and resuming processes. To see them in action, let’s run our analyze program on our latest data files. After a few minutes go by, we realize that this is going to take a while to finish. Being impatient, we kill the process by typing Control-C. This stops the currently-executing program right away. Any results it had calculated, but not written to disk, are lost.

$ ./analyze results*.dat

...a few minutes pass...
^C

Let’s run that same command again, with an ampersand & at the end of the line to tell the shell we want it to run in the background:

$ ./analyze results*.dat &

When we do this, the shell launches the program as before. Instead of leaving our keyboard and screen connected to the program’s standard input and output, though, the shell hangs onto them. This means the shell can give us a fresh command prompt, and start running other commands, right away. Here, for example, we’re putting some parameters for the next run of the program in a file:

$ cat > params.txt
density: 22.0
viscosity: 0.75
^D

(Remember, \^D is the shell’s way of showing Control-D, which means “end of input”.) Now let’s run the jobs command, which tells us what processes are currently running in the background:

$ jobs

[1] ./analyze results01.dat results02.dat results03.dat

Since we’re about to go and get coffee, we might as well use the foreground command, fg, to bring our background job into the foreground:

$ fg

...a few minutes pass...

When analyze finishes running, the shell gives us a fresh prompt as usual. If we had several jobs running in the background, we could control which one we brought to the foreground using fg %1, fg %2, and so on. The IDs are not the process IDs. Instead, they are the job IDs displayed by the jobs command.

The shell gives us one more tool for job control: if a process is already running in the foreground, Control-Z will pause it and return control to the shell. We can then use fg to resume it in the foreground, or bg to resume it as a background job. For example, let’s run analyze again, and then type Control-Z. The shell immediately tells us that our program has been stopped, and gives us its job number:

$ ./analyze results01.dat
^Z

[1]  Stopped   ./analyze results01.dat

If we type bg %1, the shell starts the process running again, but in the background. We can check that it’s running using jobs, and kill it while it’s still in the background using kill and the job number. This has the same effect as bringing it to the foreground and then typing Control-C:

$ bg %1

$ jobs

[1] ./analyze results01.dat

$ kill %1

Job control was important when users only had one terminal window at a time. It’s less important now: if we want to run another program, it’s easy enough to open another window and run it there. However, these ideas and tools are making a comeback, as they’re often the easiest way to run and control programs on remote computers elsewhere on the network. This lesson’s ssh episode has more to say about that.

Key Points

When we talk of ‘job control’, we really mean ‘process control’

A running process can be stopped, paused, and/or made to run in the background

A process can be started so as to immediately run in the background

Paused or backgrounded processes can be brought back into the foreground

Process information can be inspected with ps

Aliases and bash customization

Overview

Teaching: 10 minutes min
Exercises: 0 min

Questions

How do I customize my bash environment?

Objectives

Create aliases.

Add customizations to the .bashrc and .bash_profile files.

Change the prompt in a bash environment.

Bash allows us to customize our environments to fill our own particular needs.

Aliases

Sometimes we need to use long commands that have to be typed over and over again. Fortunately, the alias command allows us to create shortcuts for these long commands.

As an example, let’s create aliases for going up one, two, or three directories.

alias up='cd ..'
alias upup='cd ../..'
alias upupup='cd ../../..'

Let’s try these commands out.

cd /usr/local/bin
upup
pwd

/usr

{ .output}

We can also remove a shortcut with unalias.

unalias upupup

If we create one of these aliases in a bash session, they will only last until the end of that session. Fortunately, bash allows us to specify customizations that will work whenever we begin a new bash session.

Bash customization files

Bash environments can be customized by adding commands to the .bashrc, .bash_profile, and .bash_logout files in our home directory. The .bashrc file is executed whenever entering interactive non-login shells whereas .bash_profile is executed for login shells. If the .bash_logout file exists, then it will be run after exiting a shell session.

Let’s add the above commands to our .bashrc file.

echo "alias up='cd ..'" >> ~/.bashrc
tail -n 1 ~/.bashrc

alias up='cd ..'

We can execute the commands in .bashrc using source

source ~/.bashrc
cd /usr/local/bin
up
pwd

/usr/local

Having to add customizations to two files can be cumbersome. It we would like to always use the customizations in our .bashrc file, then we can add the following lines to our .bash_profile file.

if [ -f $HOME/.bashrc ]; then
        source $HOME/.bashrc
fi

Customizing your prompt

We can also customize our bash prompt by setting the PS1 system variable. To set our prompt to be $ , then we can run the command

export PS1="$ "

To set the prompt to $ for all bash sessions, add this line to the end of .bashrc.

Further bash prompt customizations are possible. To have our prompt be username@hostname[directory]: , we would set

export PS1="\u@\h[\W]: "

where \u represents username, \h represents hostname, and \W represents the current directory.

Key Points

Aliases are used to create shortcuts or abbreviations

The .bashrc and .bash_profile files allow us to customize our bash environment.

The PS1 system variable can be changed to customize your bash prompt.

Shell Variables

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How to change shell variables

Objectives

Understanding shell variables

The shell is just a program, and like other programs, it has variables. Those variables control its execution, so by changing their values you can change how the shell and other programs behave.

On an HPC cluster, software is usually installed in modules. For example if you want to use pip to install a python package, type

` module load python3/3.8.5 `

You can add this to a .bashrc file to run automatically.

Let’s start by running the command set and looking at some of the variables in a typical shell session:

$ set

COMPUTERNAME=TURING
HOME=/home/vlad
HOMEDRIVE=C:
HOSTNAME=TURING
HOSTTYPE=i686
NUMBER_OF_PROCESSORS=4
OS=Windows_NT
PATH=/Users/vlad/bin:/usr/local/git/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
PWD=/home/vlad
UID=1000
USERNAME=vlad
...

As you can see, there are quite a few—in fact, four or five times more than what’s shown here. And yes, using set to show things might seem a little strange, even for Unix, but if you don’t give it any arguments, it might as well show you things you could set.

Every variable has a name. By convention, variables that are always present are given upper-case names. All shell variables’ values are strings, even those (like UID) that look like numbers. It’s up to programs to convert these strings to other types when necessary. For example, if a program wanted to find out how many processors the computer had, it would convert the value of the NUMBER_OF_PROCESSORS variable from a string to an integer.

Similarly, some variables (like PATH) store lists of values. In this case, the convention is to use a colon ‘:’ as a separator. If a program wants the individual elements of such a list, it’s the program’s responsibility to split the variable’s string value into pieces.

The `PATH` Variable

Let’s have a closer look at that PATH variable. Its value defines the shell’s search path, i.e., the list of directories that the shell looks in for runnable programs when you type in a program name without specifying what directory it is in.

For example, when we type a command like analyze, the shell needs to decide whether to run ./analyze or /bin/analyze. The rule it uses is simple: the shell checks each directory in the PATH variable in turn, looking for a program with the requested name in that directory. As soon as it finds a match, it stops searching and runs the program.

To show how this works, here are the components of PATH listed one per line:

/Users/vlad/bin
/usr/local/git/bin
/usr/bin
/bin
/usr/sbin
/sbin
/usr/local/bin

On our computer, there are actually three programs called analyze in three different directories: /bin/analyze, /usr/local/bin/analyze, and /users/vlad/analyze. Since the shell searches the directories in the order they’re listed in PATH, it finds /bin/analyze first and runs that. Notice that it will never find the program /users/vlad/analyze unless we type in the full path to the program, since the directory /users/vlad isn’t in PATH.

Showing the Value of a Variable

Let’s show the value of the variable HOME:

$ echo HOME

HOME

That just prints “HOME”, which isn’t what we wanted (though it is what we actually asked for). Let’s try this instead:

$ echo $HOME

/home/vlad

The dollar sign tells the shell that we want the value of the variable rather than its name. This works just like wildcards: the shell does the replacement before running the program we’ve asked for. Thanks to this expansion, what we actually run is echo /home/vlad, which displays the right thing.

Creating and Changing Variables

Creating a variable is easy—we just assign a value to a name using “=”:

$ SECRET_IDENTITY=Dracula
$ echo $SECRET_IDENTITY

Dracula

To change the value, just assign a new one:

$ SECRET_IDENTITY=Camilla
$ echo $SECRET_IDENTITY

Camilla

If we want to set some variables automatically every time we run a shell, we can put commands to do this in a file called .bashrc in our home directory. (The ‘.’ character at the front prevents ls from listing this file unless we specifically ask it to using -a: we normally don’t want to worry about it. The “rc” at the end is an abbreviation for “run control”, which meant something really important decades ago, and is now just a convention everyone follows without understanding why.)

For example, here are two lines in /home/vlad/.bashrc:

export SECRET_IDENTITY=Dracula
export TEMP_DIR=/tmp
export BACKUP_DIR=$TEMP_DIR/backup

These three lines create the variables SECRET_IDENTITY, TEMP_DIR, and BACKUP_DIR, and export them so that any programs the shell runs can see them as well. Notice that BACKUP_DIR’s definition relies on the value of TEMP_DIR, so that if we change where we put temporary files, our backups will be relocated automatically.

While we’re here, it’s also common to use the alias command to create shortcuts for things we frequently type. For example, we can define the alias backup to run /bin/zback with a specific set of arguments:

alias backup=/bin/zback -v --nostir -R 20000 $HOME $BACKUP_DIR

As you can see, aliases can save us a lot of typing, and hence a lot of typing mistakes. You can find interesting suggestions for other aliases and other bash tricks by searching for “sample bashrc” in your favorite search engine.

Key Points

FIXME

The Unix Shell

Overview

Teaching: min
Exercises: min

Questions

Objectives

Learning Objectives {.objectives}

Understand the need for flexibility regarding arguments

Generate the values of the arguments on the fly using command substitution

Understand the difference between pipes/redirection, and the command substitution operator

Introduction

In the Loops topic we saw how to improve productivity by letting the computer do the repetitive work. Often, this involves doing the same thing to a whole set of files, e.g.:

$ cd data/pdb
$ mkdir sorted
$ for file in cyclo*.pdb; do
>     sort $file > sorted/sorted-$file
> done

In this example, the shell generates for us the list of things to loop over, using the wildcard mechanism we saw in the Pipes and Filters topic. This results in the cyclo*.pdf being replaced with cyclobutane.pdb cyclohexanol.pdb cyclopropane.pdb ethylcyclohexane.pdb before the loop starts.

Another example is a so-called parameter sweep, where you run the same program a number of times with different arguments. Here is a fictitional example:

$ for cutoff in 0.001 0.01 0.05; do
>   run_classifier.sh --input ALL-data.txt --pvalue $cutoff --output results-$cutoff.txt
> done

In the second example, the things to loop over: "0.001 0.01 0.05" are spelled out by you.

Looping over the words in a string {.callout}

In the previous example you can make your code neater and self-documenting by putting the cutoff values in a separate string:
$ cutoffs="0.001 0.01 0.05"
$ for cutoff in $cutoffs; do
  run_classifier.sh --input ALL-data.txt --pvalue $cutoff --output results-$cutoff.txt
done
This works because, just as with the filename wildcards, $cutoffs is replaced with 0.001 0.01 0.05 before the loop starts.

However, you don’t always know in advance what you have to loop over. It could well be that it is not a simple file name pattern (in which case you can use wildcards), or that it is not a small, known set of values (in which case you can write them out explicitly as was done in the second example). It would therefore be nice if you could loop over filenames or over words contained in a file. Suppose that file cohort2010.txt contains the filenames over which to iterate, then it would be nice to able to say something like:

# (imaginary syntax)
$ for file in [INSERT THE CONTENTS OF cohort2010.txt HERE]
> do
>    run_classifier.sh --input $file --pvalue -0.05 --output $file.results
> done

Command substitution

This would be more general, more flexible and more tractable than relying on the wildcard mechanism. What we need, therefore, is a mechanism that actually replaces everytying beween [ and ] with the desired names of input files, just before the loop starts. Thankfully, this mechanism exists, and it is called the command substitution operator (previously written using the backtick operator). It looks much like the previous snippet:

# (actual syntax)
$ for file in $(cat cohort2010.txt)
> do
>    run_classifier.sh --input $file --pvalue -0.05 --output $file.results
> done

It works simply as follows: everything between the $( and the ) is executed as a Unix command, and the command’s standard output replaces everything from $( up to and including ), just before the loop starts. For convenience, newlines in the command’s output are replaced with simple spaces.

Backtick operator

In legacy code, you may see the same construct but with a different syntax. It starts and ends with backticks, ` (not to be confused with the single quote ' !). The backticks work exactly the same as the command substitution done by $( and ). However its use is discouraged as backticks cannot be nested.

Example

OK. Recall from the Pipes and Filters topic that cat prints the contents of its argument (a filename) to standard output. So, if the contents of file cohort2010.txt look like

patient1033130.txt 
patient1048338.txt 
patient7448262.txt 
.
.
.
patient1820757.txt

then the construct

$ for file in $(cat cohort2010.txt)
> do
>     ...
> done

will be expanded to

$ for file in patient1033130.txt patient1048338.txt patient7448262.txt ... patient1820757.txt
> do
>     ...
> done

(notice the convenience of newlines having been replaced with simple spaces).

This example uses $(cat somefilename) to supply arguments to the for variable in ... do ... done-construct, but any output from any command, or even pipeline, can also be used. For example, if cohort2010.txt contains a few thousand patients but you just want to try the first two for a test run, you can use the head command to just get the first few lines of its argument, like so:

$ for file in $(cat cohort2010.txt | head -n 2)
> do
>     ...
> done

which will expand to

$ for file in patient1033130.txt patient1048338.txt
> do
>     ...
> done

simply because cat cohort2010.txt | head -n 2 produces patient1033130.txt patient1048338.txt after the command substitution.

Everything between the $( and ) is executed verbatim by the shell, so also the -n 2 argument to the head command works as expected.

Important

Recall from the Loops and the Shell Scripts topics that Unix uses whitespace to separate command, options (flags) and parameters / arguments. For the same reason it is essential that the command (or pipeline) inside the backticks produces clean output: single word output works best within single commands and whitespace- or newline-separated words works best for lists over which to iterate in loops.

Generating filenames based on a timestamp {.challenge}

It can be useful to create the filename ‘on the fly’. For instance, if some program called qualitycontrol is run periodically (or unpredictably) it may be necessary to supply the time stamp as an argument to keep all the output files apart, along the following lines:
qualitycontrol --inputdir /data/incoming/  --output qcresults-[INSERT TIMESTAMP HERE].txt
Getting [INSERT TIMESTAMP HERE] to work is a job for the command subsitution operator. The Unix command you need here is the date command, which provides you with the current date and time (try it).

In the current form, its output is less useful for generating filenames because it contains whitespace (which, as we know from now, should preferably be avoided in filenames). You can tweak date’s format in great detail, for instance to get rid of whitespace:
$ date +"%Y-%m-%d_%T"
(Try it).

Write the command that will copy a file of your choice to a new file whose name contains the time stamp. Test it by executing the command a few times, waiting a few seconds between invocations (use the arrow-up key to avoid having to retype the command)

Juggling filename extensions {.challenge}

When running an analysis program with a certain input file, it is often required that the output has the same name as the input, but with a different filename extension, e.g.
$ run_classifier.sh --input patient1048338.txt --pvalue -0.05 --output patient1048338.results
A good trick here is to use the Unix basename command. It takes a string (typically a filename), and strips off the given extension (if it is part of the input string). Example:
$ basename patient1048338.txt    .txt
gives
patient1048338
Write a loop that uses the command substitution operator and the basename command to sort each of the *.pdb files into a corresponding *.sorted file. That is, make the loop do the following:
$ sort ammonia.pdb > ammonia.sorted
but for each of the .pdb-files.

Closing remarks

The command subsitution operator provides us with a powerful new piece of ‘plumbing’ that allows us to connect “small pieces, loosely together” to keep with the Unix philosophy. It is remotely similar to the | operator in the sense that it connects two programs. But there is also a clear difference: | connects the standard output of one command to the standard input of another command, where as $(command) is substituted ‘in-place’ into the the shell script, and always provides parameters, options, arguments to other commands.

Key Points

Transferring Files

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How to use wget, curl and lftp to transfer file?

Objectives

FIXME

There are other ways to interact with remote files other than git.

It is true that we can clone an entire git repository, or even one level of a git repository using: git clone --depth-1 repository_name. What about files that do not exist in a git repository? If we wish to download files from the shell we can use tools such as Wget, cURL, and lftp.

Wget

Wget is a simple tool developed for the GNU Project that downloads files with the HTTP, HTTPS and FTP protocols. It is widely used by Unix-like users and is available with most Linux distributions.

To download this lesson (located at http://swcarpentry.github.io/shell-extras/03-file-transfer.html) from the web via HTTP we can simply type:

$ wget http://swcarpentry.github.io/shell-extras/03-file-transfer.html

--2014-11-21 09:41:31--  
http://swcarpentry.github.io/shell-extras/03-file-transfer.html
Resolving software-carpentry.org (software-carpentry.org)... 174.136.14.108
Connecting to software-carpentry.org (software-carpentry.org)|174.136.14.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8901 (8.7K) [text/html]
Saving to: '03-file_transfer.html'

100%[======================================>] 8,901       --.-K/s   in 0.05s   

2014-11-21 09:41:31 (187 KB/s) - '03-file_transfer.html' saved [8901/8901]

Alternatively, you can add more options, which are in the form:

wget -r -np -D domain_name target_URL

where -r means recursively crawl to other files and directories, -np means avoid crawling to parent directories, and -D means to target only the following domain name

For our URL it would be:

$ wget -r -np -D software-carpentry.org http://swcarpentry.github.io/shell-extras/03-file-transfer.html

To restrict retrieval to a particular extension(s) we can use the -A option followed by a comma separated list:

wget -r -np -D software-carpentry.org -A html http://swcarpentry.github.io/shell-extras/03-file-transfer.html

We can also clone a webpage with its local dependencies:

$ wget -mkq target_URL

We could also clone the entire website:

$ wget -mkq -np -D domain_name domain_name_URL

and add the -nH option if we do not want a subdirectory created for the websites content:

e.g.

$ wget -mkq -np -nH -D example.com http://example.com

where:

-m is for mirroring with time stamping, infinite recursion depth, and preservation of FTP directory settings -k converts links to make them suitable for local viewing -q supresses the output to the screen

The above command can also save the clone the contents of one domain to another if we are using ssh or sshfs to access a webserver.

Please refer to the man page by typing man wget in the shell for more information.

cURL

Alternatively, we can use cURL. It supports a much larger range of protocols including common mail based protocols like pop3 and smtp.

To download this lesson (located at http://swcarpentry.github.io/shell-extras/03-file-transfer.html) from the web via HTTP we can simply type:

$ curl -o 10-file_transfer.html http://swcarpentry.github.io/shell-extras/03-file-transfer.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
100 14005  100 14005    0     0  35170      0 --:--:-- --:--:-- --:--:--  105k

This input to curl is in the form:

curl -o filename_for_local_machine target_url

where the -o option says write the output to a file instead of the stdout (the screen), and file_name_for_local_machine is any file name you choose to save to the local machine, and target_URL is where the file is the URL where the file is on the web

Removing the -o option, and following the syntax curl target_URL outputs the contents of the url to the screen. If we wanted to enhance the functionality we have we could use information from the pipes and filters section, which is lesson 4 from the unix shell session.

For example, we could type curl http://swcarpentry.github.io/shell-extras/03-file-transfer.html | grep curl which would tell us that indeed this URL contains the string curl. We could make the output cleaner by limiting the output of curl to just the file contents by using the -s option (e.g. curl -s http://swcarpentry.github.io/shell-extras/03-file-transfer.html | grep curl).

If we wanted only the text and not the html tags in our output we could use html to text parser such as html2text.

$ curl -s http://swcarpentry.github.io/shell-extras/03-file-transfer.html | html2text | grep curl

With wget, we can obtain the same results by typing:

$ wget -q -D swcarpentry.github.io -O /dev/stdout http://swcarpentry.github.io/shell-extras/03-file-transfer.html | html2text | grep curl

Wget offers more functionality natively than curl for retrieving entire directories. We could use Wget to first retrieve an entire directory and then run html2text and grep to find a particular string. cURL is limited to retrieving one or more specified URLs that cannot be obtained by recursively crawling a directory. The situation may be improved by combining with other unix tools, but is not thought as being as good as Wget.

Please refer to the man pages by typing man wget, man curl, and man html2text in the shell for more information.

lftp

Another option is lftp. It has a lot of capability, and even does simple bittorrent.

If we want to retrieve 03-file-transfer.html on the website and save it with the filename 03-file-transfer.html locally:

$ lftp -c get http://swcarpentry.github.io/shell-extras/03-file-transfer.html

If we want to print 03-file-transfer.html to the screen instead:

$ lftp -c cat http://swcarpentry.github.io/shell-extras/03-file-transfer.html

To obtain retrive all of the files with a particular extension in a directory we can type:

$ lftp -c mget {URL for directory}/*.extension_name

For example, to retrieve all of the .html files in the extras folder:

$ lftp -c mget http://swcarpentry.github.io/shell-extras/*.html

Please refer to the man page by typing man lftp in the shell for more information.

Key Points

FIXME

Extra Unix Shell Material

Advanced Shell Scripting

Overview

Data Organization

Reviewing scripting

Nelle’s Pipeline: Processing Files

System information and variables

Conditionals

System Variables

Key Points

Working Remotely

Overview

SSH History

A remote login using ssh

Copying files to, and from a remote machine using scp

Running commands on a remote machine using ssh

SSH Keys

SSH Files and Directories

Key Points

Running Bash Scripts on an HPC Cluster

Overview

Job scheduler

Key Points

Permissions

Overview

Why Integer IDs?

Necessary But Not Sufficient

What about Windows?

Challenge

Key Points

Directory structure

Overview

The tree command

Key Points

Job control

Overview

The ps command

Behaviour of the ps command

Stopping, pausing, resuming, and backgrounding, processes

Key Points

Aliases and bash customization

Overview

Aliases

Bash customization files

Customizing your prompt

Key Points

Shell Variables

Overview

The PATH Variable

Showing the Value of a Variable

Creating and Changing Variables

Key Points

The Unix Shell

Overview

Learning Objectives {.objectives}

Introduction

Looping over the words in a string {.callout}

Command substitution

Backtick operator

Example

Important

Generating filenames based on a timestamp {.challenge}

Juggling filename extensions {.challenge}

Closing remarks

Key Points

Transferring Files

Overview

Wget

cURL

lftp

Key Points

A remote login using `ssh`

Copying files to, and from a remote machine using `scp`

Running commands on a remote machine using `ssh`

The `tree` command

The `ps` command

Behaviour of the `ps` command

The `PATH` Variable