Linux Tutorial

 

When you log onto a linux machine like Karst, Mason or Big Red II, you are technically starting what is called an interactive shell. The shell is simply a layer of software that allows you to interact with the computer to accomplish things like examining and manipulating your data. Terminal windows on Macs and PCs all run the equivalent of a shell, though as a user, you are most likely unaware of it. There are many different types of shell (often called "flavors"), but they all accomplish similar things and operate in fundamentally similar ways. When using the IU computer clusters, the use of either the bash or tcsh shell is strongly recommended.


What follows is a relatively brief but hopefully thorough description of some of the key concepts that operate inside both a linux machine and a shell. There are some useful commands spread throughout this discussion that will allow you to access and manipulate your data. Pages with additional information for people who are new to linux are also embedded in the text (e.g., a table containing some common linux commands can be found here). This tutorial may be overkill for casual users of the IU linux clusters, but is included for those who want a better understanding of what is happening while using these systems.


In linux, everything on the computer system is a file. There are many types of files (and some files are more special than others), but you should keep in mind that essentially everything is some sort of file: text files are files the way you normally think of them, images are another sort of normal file, directories are actually files that contain other files, commands are files that tell the shell what to do, programs (usually called executables as opposed to the program source code files) are another form of file that tells the computer what to do, etc. Even physical devices such as hard drives are represented by device files.

Files have the usual names that we associate with them (e.g., myData.txt and imageOfGoldLattice.dm3). However, these same files also have a name that includes a full description of where on the computer system the file resides:

   /N/u/yourUserName/Karst/myData.txt
& /N/dcwan/projects/cryoem/Users/JEOL/Certification/imageOfGoldLattice.dm3

This description of where the file resides is the absolute path to a particular file and is sometimes simply called the full path or pathname for a file. The general concept of a path as a description of where to find something is critical to understanding how linux operates and will appear again.
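
As a small illustration of how an absolute path combines a location with a file name, the standard dirname and basename commands can split a pathname into those two parts. This is only a sketch: the pathname is the placeholder used above, the variable names are made up for the example, and the file does not need to exist for these commands to work.

```shell
# Placeholder pathname from the text; nothing needs to exist at this location.
fullPath=/N/u/yourUserName/Karst/myData.txt

whereItLives=$(dirname "$fullPath")      # everything before the last /
whatItIsCalled=$(basename "$fullPath")   # the part after the last /

echo "directory: $whereItLives"
echo "file name: $whatItIsCalled"
```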

These pathnames all have a common starting place, which is denoted by the forward slash symbol (/) and is described as the root directory of the filesystem. The root directory contains all the files and directories on the computer. Directories that reside in / are described as being "beneath" the root. Since a directory can contain any number of additional directories (called sub-directories), it is easy to envision directories that branch infinitely downward from the root. In such a vision, the entire computer system is an upside-down tree, with its root at the top and branches (directories) extending downwards to the leaves (files) in the lower directories.


Every user of the computer has a specific place (called the home directory) within this enormous inverted tree. Your home is where you are when you log onto the computer system, and on Karst, it will be something like

/N/u/yourUserName/Karst

You can create files and directories in this area (but not in other users' home directories). If you have an additional computer account on (say) Mason, you will have a different home directory for when you log onto that system:

/N/u/yourUserName/Mason

Both these home directories start with /N/u/yourUserName, where the / here is the root directory that contains everything in the IU campus computing system. This style of organization is part of the way that the IU computer center keeps track of individual user accounts spread across the multiple machines that constitute the campus system. You can actually move into this common area (using the command cd ./../ ) and look at what is there (using the command ls). You may even find it convenient to put files in this area that you access from your Karst, Mason and Big Red II accounts.

Navigating in the directory tree is quite easy (and always involves the cd command mentioned above). It is possible to move to directories near your current location using relative movements (so cd .. moves you up one directory, cd ../.. moves you up two directories, cd myData moves down into a directory called myData and cd ../anotherPlace moves you up and then back down into another place). You can also jump (maybe teleport is a good analogy!) from place to place in the directory tree using absolute location names (and so cd /N/dcwan/projects/cryoem moves you from wherever you are to the main EM Center area of the IU campus computer system).
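
The relative and absolute movements described above can be tried safely in a throwaway area. The following is a minimal sketch that assumes the mktemp command is available (it is on most linux systems). The directory names project, myData and anotherPlace are hypothetical and exist only inside the temporary area.

```shell
# Build a small throwaway directory tree so the example runs anywhere.
top=$(mktemp -d)
mkdir -p "$top/project/myData" "$top/project/anotherPlace"

cd "$top/project/myData"    # absolute move: jump straight to a full pathname
cd ..                       # relative move: up one directory (now in project)
cd myData                   # relative move: back down into myData
cd ../anotherPlace          # up one level, then down into a neighboring directory
here=$(basename "$(pwd)")
echo "now in: $here"

cd ../..                    # up two directories (the top of the throwaway tree)
back=$(pwd -P)
echo "now in: $back"
```

The pwd command prints the absolute path of your current location, which is a quick way to check where a series of cd commands has taken you.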


A key concept to how shells function is that they use system-defined variables (called environmental variables). These standardized variables refer to commonly used things such as your user name and your home area, but can also be used in ways that help the computer accomplish the tasks you tell it. Many of the nuts-and-bolts that the computer uses in order for you to use the system are handled with such environmental variables, even if you as a user are totally unaware that they exist. The Modules system works in part by creating new variables and modifying existing ones. In addition, users often find it helpful to create their own user-specific variables. A table containing some useful variables (starting with system-defined variables and ending with a few set by the cryoem Modules environment) can be found here.

Values for the environmental variables are accessed using a "$" followed by the variable name. For example, there is a HOME variable that refers to a user's home directory, and your specific home directory is specified using ${HOME} (or $HOME, if you choose to leave out the delimiting {}'s). The easiest way to see the value for any variable is to use the echo command; on Karst, echo $HOME will produce the full name of your home directory (/N/u/yourUserName/Karst). Keep in mind that linux is case sensitive and that ${HOME} will likely mean something completely different from ${home}!
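
A short sketch of these points, written for bash (tcsh uses a different syntax for setting variables). HOME is the only system-defined variable used here. The names HOME_backup, withBraces, withoutBraces, myArea and MYAREA are hypothetical and were chosen only for illustration.

```shell
# HOME is set by the system at login; both spellings give the same value.
echo "$HOME"
echo "${HOME}"

# The braces matter when other characters follow the variable name directly.
unset HOME_backup               # make sure no such variable exists for this demo
withBraces="${HOME}_backup"     # the value of HOME followed by the text _backup
withoutBraces="$HOME_backup"    # looks up a variable named HOME_backup (empty here)
echo "with braces:    $withBraces"
echo "without braces: $withoutBraces"

# Names are case sensitive: these are two unrelated variables.
myArea="first value"
MYAREA="second value"
echo "$myArea / $MYAREA"
```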

Another way to think about these environmental variables is as a type of shorthand that the computer uses. For example, all the software for the cryoem module on Karst is located in the directories below /N/soft/rhel6/cryoem, and the Modules system sets the variable CRYO_PREFIX to this location. If you wanted to see what was there, you could type ls $CRYO_PREFIX instead of ls /N/soft/rhel6/cryoem.

A major advantage to using such a variable is that all the user needs to know is the name of the variable (and not the variable's value). Imagine that for some reason, all the software for the cryoem module was moved to a different location. Such a change would cause every reference to software in directories below /N/soft/rhel6/cryoem to fail (i.e., a script such as /N/soft/rhel6/cryoem/bin/dm2mrc.sh would no longer exist, and so trying to use it with that pathname simply wouldn't work). However, since it is possible to use $CRYO_PREFIX to designate wherever the software is located, using the $CRYO_PREFIX shorthand in the pathname for dm2mrc.sh avoids all the problems of trying to access files that no longer exist (i.e., while /N/soft/rhel6/cryoem/bin/dm2mrc.sh may or may not exist, $CRYO_PREFIX/bin/dm2mrc.sh will exist as long as the cryoem module is well maintained). This ability to redefine the value of existing variables is one important reason that the Modules system is such a powerful way to manipulate a user's working environment.
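
The effect of relocating software can be imitated in a throwaway area. This is a sketch under stated assumptions, not a description of how the Modules system is implemented: DEMO_PREFIX is a hypothetical stand-in for CRYO_PREFIX, the rhel6 and rhel7 directories live only in a temporary location, and the dm2mrc.sh created here is a one-line placeholder rather than the real script.

```shell
# Create a stand-in software area containing a placeholder script.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/rhel6/cryoem/bin"
printf '#!/bin/sh\necho "placeholder dm2mrc.sh ran"\n' > "$sandbox/rhel6/cryoem/bin/dm2mrc.sh"
chmod +x "$sandbox/rhel6/cryoem/bin/dm2mrc.sh"

# Two ways to refer to the script: a hard-coded pathname and a variable.
hardCoded="$sandbox/rhel6/cryoem/bin/dm2mrc.sh"
DEMO_PREFIX="$sandbox/rhel6/cryoem"
first=$("$DEMO_PREFIX/bin/dm2mrc.sh")

# The software is moved, and the variable is redefined to match.
mv "$sandbox/rhel6" "$sandbox/rhel7"
DEMO_PREFIX="$sandbox/rhel7/cryoem"

# The variable-based pathname still works; the hard-coded one does not.
second=$("$DEMO_PREFIX/bin/dm2mrc.sh")
if [ -x "$hardCoded" ]; then oldPath="still works"; else oldPath="no longer exists"; fi
echo "via variable: $second"
echo "hard-coded pathname: $oldPath"
```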

One of the most important variables is called PATH. PATH specifies the set of places you the user know about that contain files for system-wide linux commands, executable programs and shell scripts. If, for example, a shell script is in a place that is on your PATH, you can run the script simply by typing its name. When you do this, the shell searches your PATH to find a match to the name you typed, and runs the first match that it finds. When no match can be found, the shell tells you that and does nothing. In most cases (but not all), you can run a command, program or script that cannot be found on your PATH by typing the absolute name (i.e., the full path to that file, starting at /). Since it is much easier to type just the names, your PATH should include all the places where useful things reside. The Modules system works to a large extent by adding and subtracting places from your PATH.
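
The PATH search can be watched in action with a small experiment. In this sketch, sayHello is a hypothetical script created in a temporary directory. The command -v built-in reports where (or whether) the shell finds a name on your PATH.

```shell
# Create a tiny script in a directory that is not on PATH.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/bin"
printf '#!/bin/sh\necho "hello from sayHello"\n' > "$sandbox/bin/sayHello"
chmod +x "$sandbox/bin/sayHello"

# Before: the shell cannot find the name, but the full path still works.
if command -v sayHello >/dev/null 2>&1; then before=found; else before=missing; fi
byFullPath=$("$sandbox/bin/sayHello")

# Add the directory to the front of PATH (for this shell session only).
PATH="$sandbox/bin:$PATH"

# After: typing just the name is enough.
foundAt=$(command -v sayHello)
byName=$(sayHello)
echo "before: $before, after: found at $foundAt"
echo "$byName"
```

Because the shell runs the first match it finds, the order of the directories in PATH matters: a directory placed at the front takes priority over the ones after it.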

Another extremely important variable similar to PATH is LD_LIBRARY_PATH. This variable specifies the places you know about that contain the shared libraries that are needed when certain programs run. Many of the programs you will use on a linux machine use dynamically linked libraries, meaning that at least some of the functions and subroutines needed by the program are not included in the program itself but are located in other places that must be found when the program is executed. There are system-wide locations for such shared libraries and also executable-specific (or module- or user-specific) repositories for them. If you attempt to run an executable that requires a shared library that cannot be found on your LD_LIBRARY_PATH or in the system-wide locations, the executable will generate an error indicating the name of the shared library it cannot find and stop before doing anything else.
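
Both PATH and LD_LIBRARY_PATH hold a list of directories separated by colons, which can be hard to read on a single line. The sketch below defines a hypothetical helper called list_entries that prints one directory per line. It then uses the ldd command, where it is available, to show which shared libraries an executable (here ls, chosen arbitrarily) expects to find.

```shell
# Print each entry of a colon-separated list on its own line.
list_entries() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Show the current library search list (a single empty line if the variable is unset).
list_entries "${LD_LIBRARY_PATH:-}"

# ldd lists the shared libraries a program needs; a library that
# cannot be located is reported as "not found".
if command -v ldd >/dev/null 2>&1; then
    ldd "$(command -v ls)" || true
fi
```

The same helper works for PATH: list_entries "$PATH".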

The number of environmental variables and their use is essentially unlimited, and all you really need to know about them is that they exist and are an extremely powerful way to keep things organized both for the computer system itself and for the people responsible for maintaining it. However, if you start to write your own shell scripts, you will find it useful to define your own variables. This is described in a bit more detail here.


A second key concept for shells is the use of metacharacters. These are characters (or in a few instances, multiple characters) that have special (also called reserved) meanings to the shell. The most commonly used metacharacter is the asterisk (*) which most people recognize as meaning "anything." For example, if you want to list all files in the current directory that start with the letter m or all the DigitalMicrograph data files with a dm3 file extension, you can type ls m* or ls *.dm3, respectively.

There are many other metacharacters, some of which are described here. In practical terms, keep in mind that it is generally unwise to use any of the metacharacters as parts of file names, and that even the whitespaces (spaces, blanks, empty characters, etc.) that are often used in filenames on Windows and Macintosh computers can cause problems for linux. There are always ways around the problems that can be caused by the use of metacharacters in names, but they are both messy and can cause unintended consequences when automated processes run...
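
The asterisk and the trouble caused by whitespace can both be seen with a few empty files. This is a sketch only: the file names are hypothetical and are created in a temporary directory so that nothing in your own area is touched.

```shell
# Make a scratch directory with a few empty files, one with a space in its name.
sandbox=$(mktemp -d)
cd "$sandbox"
touch myData.txt map1.dm3 lattice.dm3 "my notes.txt"

ls m*       # every name that starts with the letter m
ls *.dm3    # every name that ends in .dm3

mCount=$(ls -d m* | wc -l | tr -d ' ')
dm3Count=$(ls -d *.dm3 | wc -l | tr -d ' ')
echo "$mCount names start with m; $dm3Count names end in .dm3"

# Without quotes, the shell treats "my" and "notes.txt" as two separate names.
if ls my notes.txt >/dev/null 2>&1; then unquoted=worked; else unquoted=failed; fi
# With quotes, the space becomes part of a single name.
if ls "my notes.txt" >/dev/null 2>&1; then quoted=worked; else quoted=failed; fi
echo "unquoted: $unquoted, quoted: $quoted"
```

Quoting is one of the "ways around" mentioned above, but every command and script that touches such a file must remember to do it, which is why names without spaces or metacharacters are the safer choice.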


Another useful thing to know about linux is that many of the shells (including tcsh and bash) support something called command-line completion (or more commonly "tab completion"). When using tab completion, a user types the initial part of a "word" (e.g., a command or file name) and then the tab key. In response to the tab, the shell attempts to "fill in the rest" of the word. If this attempt at completion has a unique solution, the shell provides it and the user can continue typing the rest of the line. When there are multiple possible responses, the shell will list the available choices and return to a state where the user can supply the correct one. Tab completion can even be used multiple times within a single word, where the first tab will complete as much of the word as is unique and a second (third, fourth, etc.) tab will show the remaining possible ways to complete the word.

Different shells implement tab completion differently and so, for example, in the case where there are multiple possible completions, tcsh will automatically list them all after a single tab while bash will stop after the unique portion of a word is printed to the terminal and only lists all the possibilities after a second tab is entered.

The essence of tab completion is that the shell helps a user "remember" complicated words such as file names, commands or locations. This is especially useful when moving from directory to directory or when trying to recall the name of a command or script that you have used previously, but not often enough to remember fully.

A shell "trick" that is similar to tab completion is found in the bash shell but not in tcsh. When using bash, if you enter a command that operates on a limited number of file types and then type a tab, the shell responds by showing you only the available files that could be used with that particular command. Keep in mind that any directory could lead to an appropriate file type, and so all sub-directories will appear as possible choices...


There are a vast number of linux commands and this tutorial will only deal with some of them. The commands are short words (often only two or three letters) that may or may not seem related to the action they invoke. Some commands can be used alone, but most require some object for the command to manipulate and they all have an assortment of command line options (also called "arguments").

The syntax for using linux commands is:

        cmd [options] [targets]                  examples:     cd /N/dcwan/projects/cryoem/Users

          where cmd is a linux command                         ls -l -s -h ./MetaData/*txt
                [options] are any/all of the              or   ls -lsh ./MetaData/*txt
                   optional arguments to cmd
              & [targets] are the object(s) on                 mkdir ${HOME}/MyImageProcessingArea
                   which the command operates

The "options" mentioned above all started as single characters that were used to modify the action of a command. Most (but unfortunately not all) options are designated using a "-" (e.g., ls -a). Many of these single character options can now be replaced using a longer collection of letters (usually a real word or phrase) that is more descriptive of what the option actually does. Such "word options" are all designated using "--" (e.g., ls --all, which is equivalent to ls -a). Single character options are often combined (in any order) and designated using a single "-" (e.g., ls -l -a -t -r is the same as ls -latr, ls -altr, ls -tral, etc.). This is obviously not possible when using the word options.
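
The equivalence of separate and combined options is easy to confirm by comparing the output of the different forms. This sketch uses a temporary directory with two hypothetical files and the -a and -r options of ls. The word option --all is only tried if the installed ls accepts it, since not every version of ls does.

```shell
# A scratch directory with one ordinary file and one hidden file.
sandbox=$(mktemp -d)
touch "$sandbox/visible.txt" "$sandbox/.hidden"

separate=$(ls -a -r "$sandbox")    # options given one at a time
combined=$(ls -ar "$sandbox")      # the same options behind a single -
reordered=$(ls -ra "$sandbox")     # the order of combined options does not matter

if [ "$separate" = "$combined" ] && [ "$combined" = "$reordered" ]; then
    echo "all three forms give the same listing"
fi

# Word option, if this version of ls supports it.
if ls --all "$sandbox" >/dev/null 2>&1; then
    ls --all "$sandbox"
fi
```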

For all linux commands, it is possible to obtain a detailed description of a command's use and options by typing the following command:

        man [sec] cmd                            examples:        man man

          where man is the linux command                          man echo
                  that brings up the "manual"
                  for any given command                           man ls
                [sec] is the optional section
                  number to use
              & cmd is the command of interest

This is also described in a bit more detail here. Some (but not all) of the commands that have obligatory targets will also produce helpful information if you simply type the command. Another common way to get help is to type commandName -h and/or commandName --help.

Many linux commands and their options are described here. Remember that the commands listed in that table have additional uses and options that are not described there, and also keep in mind that the most commonly used commands are near the top of the table and are in a somewhat arbitrary order.

There are many, many more linux commands, and there are lots of on-line resources to help you figure them out and use them. Both Linux in a Nutshell and the Linux Pocket Guide books from O'Reilly Media are great resources and can help you understand more about commands and how various shells operate.


It is probably clear that you can type linux commands for hours and hours, and some of this typing may even be trial-and-error attempts to get some sequence of commands, pipes, redirection, etc. to do the right thing. It might be nice to examine this sort of long sequence of commands, and the different linux shells provide several ways to do this. The easiest is the up arrow key, which will scroll through a set number of "remembered" lines starting with the most recent first. This is useful for examining the last ten to twenty lines (and can be used to go back much further), but it definitely gets tedious to hit the up arrow key more than a few dozen times....

Fortunately, linux also has a history command (which is not included in the table of commands). This is a built-in part of most modern shells (for example, bash and tcsh have history commands built into them) but it was a separate command in the early days of Unix and may not exist in some of the more esoteric flavors of shell. Associated with the history command are environmental variables that tell history where (for example) to store the saved commands (lines) and how many lines should be saved. The default number of saved lines is usually 500, but individual users can change this using the proper environmental variable.
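
In bash, the variables in question are HISTSIZE and HISTFILESIZE (tcsh uses a different mechanism and syntax). The fragment below is a sketch of how the number of saved lines might be raised. The value 2000 is an arbitrary example, and placing these lines in ~/.bashrc so that they apply to every new session is a common convention rather than a requirement.

```shell
# bash only: lines like these are commonly placed in ~/.bashrc.
HISTSIZE=2000        # number of lines remembered during the current session
HISTFILESIZE=2000    # number of lines kept in the history file between sessions
echo "bash will remember $HISTSIZE lines"
```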

When history is typed at the command line, the most likely result is simply a listing of all the saved lines. Both bash and tcsh automatically associate a number with each of the remembered lines. It is possible to recall a specific line by typing the metacharacter ! (the exclamation point or bang) followed by the line number. For example, if you typed several different ls commands and wanted to repeat one of them, you could type history to see the saved lines and line numbers, then !lineNumber to repeat the command associated with lineNumber. If you want to repeat a line that starts with a unique word, you can simply type a ! followed by as much of the unique word as is needed to be unambiguous. However, bear in mind that what comes after the ! cannot contain spaces and that using line numbers is the only way to distinguish between (for example) the lines that start with ls -Flacs and ls m*.


One final concept that is useful to understand about linux is that the shells have commands such as do, if, until, and while that control the actions the shell performs. Such flow control commands are built-in shell commands (meaning that each shell might have a slightly different way to invoke a command, even if it has the same name in the different shells). In addition, there are operators both for math (e.g., the addition (+), subtraction (-), multiplication (*), division (/), and modulus (%) operators) and for logic (e.g., greater than (>), less than (<), is equal to (==), is not equal to (!=), etc.). The combination of command flow, math and logic allows the user to perform a variety of complicated tasks. The use of flow control commands (with various tests to control the flow) and of operators is key to the entire idea of writing shell scripts, and an extremely brief tutorial on shell scripting can be found here.
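
As a small taste of how flow control, math and logic combine, the sketch below is written for bash and other sh-style shells (tcsh uses a different syntax for the same ideas). It uses a while loop with an if test to sort the numbers 1 through 10 into even and odd, and an until loop to count down. All variable names are made up for the example.

```shell
# while + if: add up the even numbers from 1 to 10 and count the odd ones.
evenSum=0
oddCount=0
i=1
while [ "$i" -le 10 ]; do              # -le means "less than or equal to"
    if [ $((i % 2)) -eq 0 ]; then      # % is the modulus operator
        evenSum=$((evenSum + i))
    else
        oddCount=$((oddCount + 1))
    fi
    i=$((i + 1))
done
echo "sum of even numbers: $evenSum, number of odd numbers: $oddCount"

# until: keep looping as long as the test is false.
n=3
countdown=""
until [ "$n" -eq 0 ]; do
    countdown="$countdown$n "
    n=$((n - 1))
done
echo "countdown: $countdown"
```

Inside the [ ] test used here, whole numbers are compared with word-like operators such as -eq and -le, while the $(( )) form handles the arithmetic.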