Using the Unix Shell
There is No ‘Undelete’
The Unix Shell
“A Unix shell is a command-line interpreter or
shell that provides a traditional user interface
for the Unix operating system and for Unix-like
systems. Users direct the operation of the
computer by entering commands as text for a
command line interpreter to execute or by
creating text scripts of one or more such
commands.” - Wikipedia
Things to Keep in Mind
• There is no ‘undelete’
• Shell commands are case-sensitive
(CaPitaLizaTIoN mAttErs)
• Do NOT use space, ?, *, \, / or $ in file names
because these have special meanings to the
shell
• Filenames that begin with . are ‘hidden’
• There is no ‘undelete’
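As a quick illustration of the points above, the following sketch works in a throwaway directory created with mktemp, so nothing real is at risk. The file names are invented for the demo.

```shell
# Work in a throwaway directory so nothing real is touched
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# A leading dot hides a file from a plain ls
touch visible.txt .hidden.txt
plain_listing=$(ls)        # lists only visible.txt
full_listing=$(ls -A)      # -A also lists dot-files (but not . and ..)
echo "ls:    $plain_listing"
echo "ls -A: $full_listing"

# A name containing a space must be quoted, or the shell sees two words
touch "bad name.txt"

# rm is permanent: there is no trash can to restore from
rm visible.txt "bad name.txt"
remaining=$(ls -A)
echo "left after rm: $remaining"
```

Quoting works, but it has to be remembered every time the file is used, which is why avoiding spaces in names is the simpler habit.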
The Importance of Being ‘Root’
• ‘Root’ or ‘Superuser’ is the administrator account,
which has phenomenal cosmic power.
• The ‘sudo’ command allows you to “do as superuser”
from an account with ‘sudo privileges’.
• As root in the shell, you can literally ‘delete’ the
operating system or operating system files (like
choosing to delete Microsoft Windows while using
Windows)… and then watch the stars go out…
– Moral of the story: If you don’t know what a file is… it’s
better to ask or leave it alone.
– Installing software can require use of ‘sudo’
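One cautious habit is to check which account a shell is running as before doing anything destructive. A minimal sketch, assuming the standard id utility is available; the variable name account_kind is only for illustration.

```shell
# id -u prints the numeric user ID; root is always 0
if [ "$(id -u)" -eq 0 ]; then
    account_kind="root"
else
    account_kind="regular user"
fi
echo "Running as: $account_kind (uid $(id -u))"
```

If this reports root and you did not expect it, stop and find out why before deleting or moving anything.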
Unix Tutorial
• http://www.ee.surrey.ac.uk/Teaching/Unix/
• Science.txt file location for tutorial:
– http://www.ee.surrey.ac.uk/Teaching/Unix/science.txt
– Unix command:
• wget http://www.ee.surrey.ac.uk/Teaching/Unix/science.txt
Additional help/tutorial/walkthrough
• http://software-carpentry.org/4_0/shell/
Grep
• grep science science.txt
• grep science science.txt > newfile1.txt
• grep -B 1 -A 2 science science.txt > newfile1.txt
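– -B 1 and -A 2 are command-line ‘options’ that change the behavior of the ‘grep’ program, with numerical parameters that specify the new behavior: print 1 line before and 2 lines after each match.
– > is a ‘redirect’ symbol that sends output which would normally go to the screen to a text file instead (replacing the file if it already exists).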
• Use man grep to learn more about grep
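The three grep commands above can be tried without downloading anything. This sketch uses a five-line stand-in for science.txt whose contents are invented for the demo, and writes the context output to a second file (newfile2.txt, a name chosen here) so both results can be compared.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Five-line stand-in for science.txt (contents invented for this demo)
printf '%s\n' \
    'line one' \
    'Science is capitalised here' \
    'the science of shells' \
    'line four' \
    'line five' > sample.txt

# Case-sensitive search: only the lower-case "science" line matches
grep science sample.txt

# Redirect the matches to a file instead of the screen (overwrites the file)
grep science sample.txt > newfile1.txt

# One line of context before and two lines after each match
grep -B 1 -A 2 science sample.txt > newfile2.txt

match_lines=$(grep -c science sample.txt)
context_lines=$(awk 'END { print NR }' newfile2.txt)
echo "matches: $match_lines, lines with context: $context_lines"
```

Note that the line starting with "Science" is not a match by itself, because grep is case-sensitive unless told otherwise; it appears in newfile2.txt only as context.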
Permissions
• Type ls -l
*note: those are both lower-case L characters
-rw-r--r-- 1 krmerrill staff 358400 Feb 2 13:00 AJB_Merrill-d1100085_au.doc
drwxr-xr-x 47 krmerrill staff 1598 Jul 17 2011 My Pictures
- means regular file, d means directory, l (lower-case L) means link
first triplet is the user read, write, and execute permissions
second triplet is the group permissions
last triplet is permissions for everyone else, or ‘other’
ls -al shows above information for all files, including hidden files
chmod = change permissions
u = user; g = group; o = other; a = all (user, group, and other)
r = read; w = write; x = execute
chmod u+x filename adds user execute permission on filename
chmod g-wx filename removes group write and execute permissions from filename
Permissions not mentioned in a chmod command of this form are left unchanged
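The two chmod examples can be checked on a scratch file. This is a sketch: the file name myscript.sh is hypothetical, and the first chmod (using the = form, which sets permissions exactly) is there only to start from a known state.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

touch myscript.sh
chmod u=rw,g=rw,o= myscript.sh        # known starting state: rw-rw----
before=$(ls -l myscript.sh | cut -c1-10)

chmod u+x myscript.sh                 # add execute permission for the user only
chmod g-wx myscript.sh                # remove group write and execute
after=$(ls -l myscript.sh | cut -c1-10)

echo "before: $before"
echo "after:  $after"
```

The group read permission and the empty 'other' triplet stay as they were, because neither command mentions them.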
Useful Shell Commands
• See the Linux Command Line Reference document on the course
website
• Directory commands
• Change to sub-directory within the current directory: cd xyz
• Change to a directory in another part of the directory tree: cd
/path/to/directory
• Create directory: mkdir newdir
• Remove empty directory: rmdir xyz
• Wildcard characters: ? matches any single character, * matches zero or
more characters
• Example: rm *.txt will remove all files with a name ending in .txt
• rm file?.fastq will remove file1.fastq, file2.fastq, … , filex.fastq
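Because there is no undelete, it can help to preview what a wildcard expands to before passing it to rm. A small sketch in a scratch directory; the file names are invented for the demo.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

mkdir newdir
cd newdir || exit 1
touch file1.fastq file2.fastq file10.fastq notes.txt

# Preview what the pattern expands to before handing it to rm
preview=$(echo file?.fastq)
echo "would remove: $preview"

rm file?.fastq          # ? matches exactly one character, so file10.fastq survives
remaining=$(ls)
echo "remaining: $remaining"

cd ..
rm newdir/*             # * matches zero or more characters: every visible file goes
rmdir newdir            # rmdir only succeeds once the directory is empty
```

Replacing rm with echo (or ls) in any wildcard command is a cheap way to see exactly which files would be affected.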
Regular Expressions
• See the RegularExpressions.pdf document on the course website
for an overview of literal characters and metacharacters
• Regular expressions are useful within grep, awk, sed and other
command-line tools as well as in Java, Perl, Python, and other
scripting languages.
• Some text editor programs in Linux also use regular expressions
(also called regexps or regex). We will use nedit as an example.
• Replacing a space character with a new-line character in a file of
barcodes – find ‘(OWB\d+) ’ and replace with ‘\1\n’ – note the
trailing space in the first expression.
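The same find-and-replace can be sketched on the command line with sed instead of nedit. This assumes GNU sed (which accepts \n in the replacement text) and uses [0-9]+ in place of \d+, which sed does not support. The barcode values are made up.

```shell
# Three made-up barcodes on one line, separated by single spaces
barcodes='OWB1 OWB22 OWB333'

# Capture each barcode, drop the space after it, and emit a new-line instead.
# The parentheses capture the barcode; \1 puts it back in the output.
one_per_line=$(printf '%s\n' "$barcodes" | sed -E 's/(OWB[0-9]+) /\1\n/g')
printf '%s\n' "$one_per_line"
```

The last barcode has no trailing space, so the pattern does not touch it; it simply keeps the new-line that already ended the input.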
Command-line example
• Testing analyses on a small random sample of a sequence
dataset is a good idea – find and fix problems quickly
• How can you randomly sample the same reads from a set of paired-end files?
• A one-line command is saved on the course website to do this.
• time paste Bigfile1.fastq Bigfile2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -2000000 | sed 's/\t\t/\n/g' | awk '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
• Let’s look at this step by step
Command-line example
• time
– tells the system to display the time required to execute the command
• paste Bigfile1.fastq Bigfile2.fastq |
– joins two files of paired-end sequence reads as tab-delimited columns, line by line; the files should have the same number of lines, with reads in the same order in both files
• awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |
– uses the ‘awk’ program to convert the four lines of FASTQ format to tab-separated fields on a single line per sequence record
• shuf |
– sorts the lines into a random order
• head -2000000 |
– takes the first 2 million lines of the re-ordered output (one read pair per line at this stage)
• sed 's/\t\t/\n/g' |
– uses the ‘sed’ stream editor to convert the double-tab delimiters back into new-line characters to restore the 4-line FASTQ format
• awk '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
– uses ‘awk’ to split the two tab-delimited columns back into two separate files
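To see the whole pipeline work end to end, it can be run on a toy dataset. The sketch below builds two tiny paired files with invented read names and samples 2 of the 4 read pairs. It assumes GNU shuf and GNU sed are installed. One deliberate change from the command on the slide: the final awk uses -F '\t' so that fields are split only on tabs, which matters if header lines contain spaces.

```shell
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1

# Build two tiny paired FASTQ files (4 records each); read names are made up
for i in 1 2 3 4; do
    printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> Bigfile1.fastq
    printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> Bigfile2.fastq
done

# Same steps as the walkthrough, sampling 2 read pairs instead of 2 million
paste Bigfile1.fastq Bigfile2.fastq |
    awk '{ printf("%s", $0); n++; if (n % 4 == 0) { printf("\n") } else { printf("\t\t") } }' |
    shuf |
    head -n 2 |
    sed 's/\t\t/\n/g' |
    awk -F '\t' '{ print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq" }'

# Each output file should hold 2 records x 4 lines, with matching read names
grep '^@read' Subfile1.fastq
grep '^@read' Subfile2.fastq
```

Which two pairs are chosen changes from run to run, but the same read numbers should always appear in both output files.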
How do you come up with this stuff?
Someone else has probably had this problem
Search for help on SeqAnswers or StackExchange
http://biostar.stackexchange.com/
The Bioinformatics Forum on SeqAnswers:
http://seqanswers.com/forums/forumdisplay.php?f=18
SolexaQA.pl
• This Perl script assumes that header lines of sequence
files are written in one of several formats
• The code uses regular expressions to sort out formats:
if( $line =~ /\S+\s\S+/ ){
    # Cassava 1.8 variant
    if( $line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w]+:[\d\w]+:(\d+)/ ){
        $number_of_tiles = $1 + 1;
    # Sequence Read Archive variant
    }elsif( $line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ ){
        $number_of_tiles = $1 + 1;
    }
# All other variants
}elsif( $line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ ){
    $number_of_tiles = $1 + 1;
}
Alternate Formats
if( $line =~ /\S+\s\S+/ ){
# Cassava 1.8 variant – does the header line contain a space surrounded by non-space characters?
@EAS139:136:FC706VJ:2:2104:15343:197393_1:Y:18:ATCACG
$line =~ /^@[\d\w\-\._]+:[\d\w]+:[\d\w]+:[\d\w]+:(\d+)/ )
# NCBI SRA variant – does the header line contain a string with -, _, or . before the first colon?
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
SolexaQA.pl
$line =~ /^@[\d\w\-\._\s]+:[\d\w]+:(\d+)/ )
# Two other variants –
1. does the first field contain -, ., or _ followed by two more colon-delimited fields?
$line =~ /^@[\d\w\-:\._]*:+\d*:(\d*):[\.\d]+:[\.\/\#\d\w]+$/ )
2. does the first field contain -, ., :, or _ followed by four colon-delimited fields, followed by ., /, or # at the end of the line?
Example header line from GSL sequence file:
@3:1:1006:20321:Y
This would be described by $line =~ /^@\d+:\d+:\d+:\d+:[YN]/
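A pattern like this can be tried from the shell before it goes into a script. The sketch below translates the Perl expression into the extended-regex syntax of grep -E, writing [0-9] in place of \d, and tests it against two header lines quoted on these slides. The variable names are chosen here for illustration.

```shell
# Header styles shown on the slides
gsl_header='@3:1:1006:20321:Y'
sra_header='@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36'

# ERE translation of /^@\d+:\d+:\d+:\d+:[YN]/ ([0-9] replaces \d)
gsl_pattern='^@[0-9]+:[0-9]+:[0-9]+:[0-9]+:[YN]'

for header in "$gsl_header" "$sra_header"; do
    if printf '%s\n' "$header" | grep -Eq "$gsl_pattern"; then
        echo "GSL style:   $header"
    else
        echo "other style: $header"
    fi
done

# Count how many of the two headers match the GSL pattern
gsl_matches=$(printf '%s\n' "$gsl_header" "$sra_header" | grep -Ec "$gsl_pattern")
echo "headers matching the GSL pattern: $gsl_matches"
```

The SRA header fails at the first field, which starts with letters rather than digits, so only the GSL line matches.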