Beginning Perl for Bioinformatics-RVS
PRACTICAL EXTRACTION & REPORT LANGUAGE
By Raghvendra Sachan
CONTENTS
1.0 INTRODUCTION TO PERL
1.1 PERL FACTS
1.2 WHY PERL?
2.0 HISTORY OF PERL
3.0 BIOINFORMATICS (GENERAL VIEW)
4.0 BIOINFORMATICS USING PERL
4.1 PROGRAMMING CONCEPTS
4.2 VARIABLE
4.3 STRING OPERATION
5.0 PERL PROGRAMS
5.1 TO FIND OUT THE FIRST ORF IN THE GIVEN AMINO ACID SEQUENCE
5.2 TO FIND OUT 6 ORF's IN THE GIVEN DNA SEQUENCE
5.3 TO DETERMINE THE INFORMATION ABOUT 20 AMINO ACIDS
5.4 TO DETERMINE THE INFORMATION ABOUT NUCLEOTIDES
5.5 TO DETERMINE THE MOLECULAR WEIGHT OF THE AMINO ACIDS SEQUENCE
5.6 TO DETERMINE MOLECULAR FORMULA OF THE AMINO ACIDS SEQUENCE
5.7 TO FIND THE REVERSE, COMPLEMENTARY SEQUENCE
5.8 TO IDENTIFY THE NUMBER OF NUCLEOTIDES IN THE SEQUENCE
5.9 TO IDENTIFY THE NUMBER OF NUCLEOTIDES AND LENGTH IN THE SEQUENCE
5.10 TO DETERMINE MOL. WT. OF THE DNA SEQ. USING FILE HANDLING
6.0 APPENDIX
6.1 WHAT IS PERL?
6.2 VARIABLE & DATA TYPES
6.3 QUOTES AND STRINGS
6.4 OPERATORS
6.5 TESTING
6.6 BOOLEAN EXPRESSIONS
6.7 INPUT PERL FUNCTIONS
7.0 CONCLUSION
1.0 Introduction to Perl
Perl is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of csh, Pascal, and even BASIC-PLUS.) Expression syntax corresponds quite closely to C expression syntax.
http://www.activestate.com/Products/ActivePerl/
ActivePerl, released by ActiveState, is the officially blessed version of Perl for Windows. It can be downloaded for free, or we can order the ActiveCD from ActiveState. It comes with a wealth of widely used third-party libraries such as Tk, LWP, and the XML bundle. Whatever operating system we are on, this is a valid choice, whether that is Windows or a UNIX-based operating system such as Linux, FreeBSD, or Mac OS X. The official documentation system for Perl is POD, or "Plain Old Documentation". It is powerful and widely used.
1.1 Perl Facts
•
Perl is a stable, cross platform programming language.
•
It is used for mission critical projects in the public and private sectors.
•
Perl is Open Source software, licensed under its Artistic License, or the GNU General Public License (GPL).
•
Perl was created by Larry Wall.
•
Perl 1.0 was released on Usenet in 1987.
•
PC Magazine named Perl a finalist for its 1998 Technical Excellence Award in the Development Tool category.
1.2 Why Perl?
•
Perl takes the best features from other languages, such as C, awk, sed, sh, and BASIC, among others.
•
Perl's database integration interface (DBI) supports third-party databases including Oracle, Sybase, Postgres, MySQL and others.
•
Perl works with HTML, XML, and other mark-up languages.
•
Perl supports Unicode.
•
Perl is Y2K compliant.
•
Perl supports both procedural and object-oriented programming.
•
Perl interfaces with external C/C++ libraries through XS or SWIG.
•
Perl is extensible. There are over 500 third party modules available from the Comprehensive Perl Archive Network (CPAN).
•
The Perl interpreter can be embedded into other systems.
2.0 HISTORY OF PERL
PERL 1.000
Perl 1.000 is unleashed upon the world (18 December 1987). Some people take Perl's birthday seriously: Randal once sang Happy Birthday to Larry's answering machine. The description from the original man page, quoted in section 1.0 above, sums up this new language well.
PERL 2.000
Perl 2.000 is released (5 June 1988). Some of the enhancements over Perl 1 included:
•
New regexp routines derived from Henry Spencer's.
•
Support for /(foo|bar)/.
•
Support for /(foo)*/ and /(foo)+/.
•
\s for whitespace, \S for non-whitespace, \d for digit, \D for non-digit.
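The pattern features listed above are still the everyday core of Perl regular expressions. The following is a minimal sketch in present-day Perl, not Perl 2 syntax; the record line, the gene name, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical record line, invented for this illustration.
my $line = "gene  BRCA2 length 84193";

# Alternation with grouping: /(gene|exon)/ matches either word.
my $has_keyword = ($line =~ /(gene|exon)/) ? 1 : 0;

# \s matches whitespace, \S non-whitespace, \d a digit.
my ($name)   = $line =~ /gene\s+(\S+)/;     # first word after "gene"
my ($length) = $line =~ /length\s+(\d+)/;   # run of digits after "length"

# Quantified group: (AT)+ matches one or more repeats of "AT".
my ($repeat) = "GGATATATCC" =~ /((AT)+)/;

print "keyword found: $has_keyword\n";
print "name: $name, length: $length\n";
print "repeat: $repeat\n";
```

In list context a match returns its captured groups, which is why each result can be assigned directly to a parenthesised variable.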
PERL 3.000
Perl 3.000 is released (18 October 1989) and is distributed by Larry for the first time under the terms of the GNU General Public License. A few of the new features:
•
Perl can now handle binary data correctly and has functions to pack and unpack binary structures into arrays or lists. You can now do arbitrary ioctl functions.
•
You can now pass things to subroutines by reference.
•
Debugger enhancements.
PERL 4.000
Perl 4.000 is released (21 March 1991) and includes an Artistic License as well as the GPL. In the same year, Linus Torvalds releases the first version of Linux. Linus had wanted to name it Freax (free + freak + unix), but the site administrator liked Linux better. It was later distributed under the GNU General Public License.
PERL 5.000
The much anticipated Perl 5.000 is unveiled (17 October 1994). It was a complete rewrite of Perl. A few of the features and pitfalls are:
•
Objects.
•
The documentation is much more extensive and perldoc along with pod is introduced.
•
Lexical scoping is available via my. eval can see the current lexical variables.
•
The preferred package delimiter is now :: rather than '.
•
New functions include: abs(), chr(), uc(), ucfirst(), lc(), lcfirst(), chomp(), glob()
•
There is now an English module that provides human readable translations for cryptic variable names.
•
Several previously added features have been subsumed under the new keywords use and no.
•
Pattern matches may now be followed by an m or s modifier to explicitly request multiline or singleline semantics. An s modifier makes . match newline.
•
@ now always interpolates an array in double-quotish strings. Some programs may now need to use backslash to protect any @ that shouldn't interpolate.
•
It is no longer syntactically legal to use whitespace as the name of a variable, or as a delimiter for any kind of quote construct.
•
The -w switch is much more informative.
•
=> is now a synonym for comma. This is useful as documentation for arguments that come in pairs, such as initializers for associative arrays, or named arguments to a subroutine.
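Two of the Perl 5 changes above, => as a comma synonym and array interpolation with @, are easiest to see side by side. This is only an illustrative sketch: the three-entry codon table is a tiny subset chosen for the example, and the e-mail address is hypothetical.

```perl
use strict;
use warnings;

# => behaves like a comma, but makes key/value pairs easier to read.
# It also quotes a bareword on its left, so ATG needs no quotes.
my %codon_to_aa = (
    ATG => 'M',
    TGG => 'W',
    TAA => '*',
);

# The same pairs written with plain commas build an identical hash.
my %with_commas = ('ATG', 'M', 'TGG', 'W', 'TAA', '*');

# @ interpolates arrays in double-quoted strings, so a literal @
# (as in an e-mail address) needs a backslash.
my @bases   = ('A', 'C', 'G', 'T');
my $listing = "Bases: @bases";
my $contact = "user\@example.org";   # hypothetical address

print "$listing\n";
print "ATG codes for $codon_to_aa{ATG}\n";
print "$contact\n";
```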
Perl 5.001 is released (13 March 1995). Perl 5.002 is released (29 February 1996), introducing, among other things, subroutine prototypes and sysopen().
3.0 Bioinformatics Definition - General view
Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine.
4.0 Bioinformatics using Perl
Bioinformatics, the use of computers in biology research, has been increasing in importance during the past decade as the Human Genome Project went from its beginning to the announcement in 2000 of a "draft" of the complete sequence of human DNA. The importance of programming in biology stretches back before the previous decade, and it certainly has a significant future now that it is a recognized part of research into many areas of medicine and basic biological research. This may not be news to biologists, but Perl programmers may be surprised to find that their handsome language has become one of the most popular, if not the most popular, of the computer languages used in bioinformatics.
4.1 Programming Concepts
•
Program = a text file that contains instructions for the computer to follow
•
Programming Language = a set of commands that the computer understands (via a “command interpreter”)
•
Input = data that is given to the program
•
Output = something that is produced by the program
•
Programming
•
Write the program (with a text editor)
•
Run the program
•
Look at the output
•
Correct the errors (debugging)
•
Repeat
•
(Computers are VERY dumb: they do exactly what you tell them to do, so be careful what you ask for.)
•
String
•
Text is handled in Perl as a string
•
This basically means that you have to put quotes around any piece of text that is not an actual Perl instruction.
•
Perl has two kinds of quotes: single ' ' and double " "
•
(they are different; more about this later)
•
Print
•
Perl uses the term “print” to create output
•
Without a print statement, you won’t know what your program has done
•
You need to tell Perl to put a newline at the end of a printed line
o Use the "\n" (newline) escape sequence
o Include the quotes
o The "\" character is called an escape; Perl uses it a lot
•
Numbers and Functions
•
Perl handles numbers in most common formats:
•
456
•
5.6743
•
6.3E-26
•
Mathematical functions work pretty much as you would expect:
•
4+7, 6*4, 43-27, 256/12, 2/(3-5)
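The points above about strings, quotes, print, and numbers can be tried out in a short script. This is a minimal sketch: the use strict and use warnings lines, the my declarations, and the array names are additions for the example and are not part of the original text.

```perl
use strict;
use warnings;

# Double quotes interpret escapes such as \n; single quotes do not.
print "Hello, bioinformatics!\n";
print 'No newline here: \n', "\n";   # prints a literal backslash and n

# Numbers in the formats listed above.
my @numbers = (456, 5.6743, 6.3E-26);
print "$_\n" for @numbers;

# The arithmetic examples from the text.
my @results = (4 + 7, 6 * 4, 43 - 27, 256 / 12, 2 / (3 - 5));
print join(", ", @results), "\n";
```

The first three results are 11, 24, and 16, and the last is -1. Perl's / operator does not truncate, so 256/12 prints as a decimal fraction rather than 21.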
4.2 Variable
•
To be useful at all, a program needs to be able to store information from one line to the next
•
Perl stores information in variables
•
A scalar variable name starts with the "$" symbol, and it can store strings or numbers
o Variables are case sensitive
o Give them sensible names
•
Use the “=”sign to assign values to variables
•
$a = 100;
•
$s = "ttattagcc";
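As a small sketch of the rules above, the script below stores a number and a string and reuses them on later lines. The names $count, $seq, and $Seq are chosen for this example. $count is used instead of $a because Perl reserves $a and $b for its sort function.

```perl
use strict;
use warnings;

# Scalars start with $ and can hold numbers or strings.
my $count = 100;
my $seq   = "ttattagcc";

# Names are case sensitive: $seq and $Seq are different variables.
my $Seq = uc $seq;

# A stored value can be reused on later lines.
my $seq_length = length $seq;
print "count = $count\n";
print "$seq has $seq_length bases; upper case: $Seq\n";
```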
4.3 String operation
•
Strings (text) in variables can be used for some math-like operations
•
To concatenate (join) strings, use the dot (.) operator
•
$seq1 = "ACTG";
•
$seq2 = "GGCTA";
•
$seq3 = $seq1 . $seq2;
•
print $seq3;
•
ACTGGGCTA
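The concatenation steps above can be run as a complete script. This is a minimal sketch: the .= append operator and the $growing variable are additions for illustration and do not come from the original text.

```perl
use strict;
use warnings;

my $seq1 = "ACTG";
my $seq2 = "GGCTA";

# The dot operator joins two strings end to end.
my $seq3 = $seq1 . $seq2;
print "$seq3\n";          # prints ACTGGGCTA

# .= appends to an existing string in place.
my $growing = $seq1;
$growing .= $seq2;
print "$growing\n";       # prints ACTGGGCTA again
```

Building a sequence piece by piece with .= is convenient when the fragments are read one line at a time.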
String comparison (are they the same, > or