Understanding Bioinformatics

January 24, 2017 | Author: Timmy Tran | Category: N/A


BIF Prelims 5th proofs.qxd

18/7/07

10:34

Page i


In memory of Arno Siegmund Baum


Understanding Bioinformatics Marketa Zvelebil & Jeremy O. Baum


Senior Publisher: Jackie Harbor
Editor: Dom Holdsworth
Development Editor: Eleanor Lawrence
Illustrations: Nigel Orme
Typesetting: Georgina Lucas
Cover design: Matthew McClements, Blink Studio Limited
Production Manager: Tracey Scarlett
Copyeditor: Jo Clayton
Proofreader: Sally Livitt
Accuracy Checking: Eleni Rapsomaniki
Indexer: Lisa Furnival
Vice President: Denise Schanck

© 2008 by Garland Science, Taylor & Francis Group, LLC This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Every attempt has been made to source the figures accurately. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. All rights reserved. No part of this book covered by the copyright herein may be reproduced or used in any format in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems—without permission of the publisher. 10-digit ISBN 0-8153-4024-9 (paperback) 13-digit ISBN 978-0-8153-4024-9 (paperback)

Library of Congress Cataloging-in-Publication Data Zvelebil, Marketa J. Understanding bioinformatics / Marketa Zvelebil & Jeremy O. Baum. p. ; cm. Includes bibliographical references and index. ISBN-13: 978-0-8153-4024-9 (pbk.) ISBN-10: 0-8153-4024-9 (pbk.) 1. Bioinformatics. [DNLM: 1. Computational Biology--methods. QU 26.5 Z96u 2008] I. Baum, Jeremy O. II. Title. QH324.2.Z84 2008 572.80285--dc22 2007027514

Published by Garland Science, Taylor & Francis Group, LLC, an informa business 270 Madison Avenue, New York, NY 10016, USA, and 2 Park Square, Milton Park, Abingdon, OX14 4RN, UK. Printed in the United States of America. 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1


Visit our Web site at http://www.garlandscience.com


PREFACE

The analysis of data arising from biomedical research has undergone a revolution over the last 15 years, brought about by the combined impact of the Internet and the development of increasingly sophisticated and accurate bioinformatics techniques. All research workers in the areas of biomolecular science and biomedicine are now expected to be competent in several areas of sequence analysis and often, additionally, in protein structure analysis and other more advanced bioinformatics techniques.

When we began our research careers in the early 1980s all of the techniques that now comprise bioinformatics were restricted to specialists, as databases and user-friendly applications were not readily available and had to be installed on laboratory computers. By the mid-1990s many datasets and analysis programs had become available on the Internet, and the scientists who produced sequences began to take on tasks such as sequence alignment themselves. However, there was a delay in providing comprehensive training in these techniques.

At the end of the 1990s we started to expand our teaching of bioinformatics at both undergraduate and postgraduate level. We soon realized that there was a need for a textbook that bridged the gap between the simplistic introductions available, which concentrated on results almost to the exclusion of the underlying science, and the very detailed monographs, which presented the theoretical underpinnings of a restricted set of techniques. This textbook is our attempt to fill that gap. Therefore on the one hand we wanted to include material explaining the program methods, because we believe that to perform a proper analysis it is not sufficient to understand how to use a program and the kind of results (and errors!) it can produce. It is also necessary to have some understanding of the technique used by the program and the science on which it is based.
But on the other hand, we wanted this book to be accessible to the bioinformatics beginner, and we recognized that even the more advanced students occasionally just want a quick reminder of what an application does, without having to read through the theory behind it. From this apparent dilemma was born the division into Applications and Theory Chapters. Throughout the book, we wrote dedicated Applications Chapters to provide a working knowledge of bioinformatics applications, quick and easy to grasp. In most places, an Applications Chapter is then followed by a Theory Chapter, which explains the program methods and the science behind them. Inevitably, we found this created a small amount of duplication between some chapters, but to us this was a small sacrifice if it left the reader free to choose at what level they could engage with the subject of bioinformatics.

We have created a book that will serve as a comfortable introduction to any new student of bioinformatics, but which they can continue to use into their postgraduate studies. The book assumes a certain level of understanding of the background biology, for example gene and protein structure, where it is important to appreciate the variety that exists and not only know the canonical examples of first-year textbooks. In addition, to describe the techniques in detail a level of mathematics is required which is more appropriate for more advanced students. We are aware that many postgraduate students of bioinformatics have a background in areas such as computer science and mathematics. They will find many familiar algorithmic approaches presented, but will see their application in unfamiliar territory. As they read the book they will also appreciate that to become truly competent at bioinformatics they will require knowledge of biomedical science.

There is a certain amount of frustration inherent in producing any book, as the writing process seems often to be as much about what cannot be included as what can. Bioinformatics as a subject has already expanded to such an extent that we had to be careful not to diminish the book’s teaching value by trying to squeeze every possible topic into it. We have tried to include as broad a range of subjects as possible, but some have been omitted. For example, we do not deal with the methods of constructing a nucleotide sequence from the individual reads, nor with a number of more specialized aspects of genome annotation.

The final chapter is an introduction to the even-faster-moving subject of systems biology. Again, we had to balance the desire to say more against the practical constraints of space. But we hope this chapter gives readers a flavor of what the subject covers and the questions it is trying to answer. The chapter will not answer every reader’s every query about systems biology, but if it prompts more of them to inquire further, that is already an achievement.

We wish to acknowledge many people who have helped us with this project. We would almost certainly not have got here without the enthusiasm and support of Matthew Day, who guided us through the process of getting a first draft. Getting from there to the finished book was made possible by the invaluable advice and encouragement from Chris Dixon, Dom Holdsworth, Jackie Harbor, and others from Garland Science.
We also wish to thank Eleanor Lawrence for her skills in massaging our text into shape, and Nigel Orme for producing the wonderful illustrations. We received inspiration and encouragement from many others, too many to name here, but including our students and those who read our draft chapters. Finally, we wish to thank the many friends and family members who have had to suffer while we wrote this book. In particular JB wishes to thank his wife Hilary for her encouragement and perseverance. MZ wishes to specially thank her parents, Martin Scurr, Nick Lee, and her colleagues at work.

Marketa Zvelebil
Jeremy O. Baum
May 2007


A NOTE TO THE READER

Organization of this Book

Applications and Theory Chapters

Careful thought has gone into the organization of this book. The chapters are grouped in two ways. Firstly, the chapters are organized into seven parts according to topic. Within the parts, there is a second, less traditional, level of organization: most chapters are designated as either Applications or Theory Chapters. This book is designed to be accessible both to students who wish to obtain a working knowledge of the bioinformatics applications and to students who want to know how the applications work and maybe write their own. So at the start of most parts, there are dedicated Applications Chapters, which deal with the more practical aspects of the particular research area, and are intended to act as a useful hands-on introduction. Following these are Theory Chapters, which explain the science, theory, and techniques employed in generally available applications. These are more demanding and should preferably be read after having gained a little experience of running the programs. In order to become truly proficient in the techniques you need to read and understand these more technical aspects. On the opening page of each chapter, and in the Table of Contents, it is clearly indicated whether it is an Applications or a Theory Chapter.

Part 1: Background Basics

Background Basics provides three introductory chapters to key knowledge that will be assumed throughout the remainder of the book. The first two chapters contain material that should be well-known to readers with a background in biomedical science. The first chapter describes the structure of nucleic acids and some of the roles played by them in living systems, including a brief description of how the genomic DNA is transcribed into mRNA and then translated into protein. The second chapter describes the structure and organization of proteins. Both of these chapters present only the most basic information required, and should not in any way be regarded as an adequate grounding in these topics for serious work. The intention is to provide enough information to make this book self-sufficient. The third chapter in this part describes databases, again at a very introductory level. Many biomedical research workers have large datasets to analyze, and these need to be stored in a convenient and practical way. Databases can provide a complete solution to this problem.

Part 2: Sequence Alignments

Sequence Alignments contains three chapters that deal with a variety of analyses of sequences, all relating to identifying similarities. Chapter 4 is a practical introduction to the area, following some examples through different analyses and showing some potential problems as well as successful results. Chapters 5 and 6 deal with several of the many different techniques used in sequence analysis. Chapter 5 focuses on the general aspects of aligning two sequences and the specific methods employed in database searches. A number of techniques are described in detail, including dynamic programming, suffix trees, hashing, and chaining. Chapter 6 deals with methods involving many sequences, defining commonly occurring patterns, defining the profile of a family of related proteins, and constructing a multiple alignment. A key technique presented in this chapter is that of hidden Markov models (HMMs).


Part 3: Evolutionary Processes

Evolutionary Processes presents the methods used to obtain phylogenetic trees from a sequence dataset. These trees are reconstructions of the evolutionary history of the sequences, assuming that they share a common ancestor. Chapter 7 explains some of the basic concepts involved, and then shows how the different methods can be applied to two different scientific problems. In Chapter 8 details are given of the techniques involved and how they relate to the assumptions made about the evolutionary processes.

Part 4: Genome Characteristics

Genome Characteristics deals with the analysis required to interpret raw genome sequence data. Although by the time a genome sequence is published in the research journals some preliminary analysis will have been carried out, often the unanalyzed sequence is available before then. This part describes some of the techniques that can be used to try to locate genes in the sequence. Chapter 9 describes some of the range of programs available, shows how complex their output can be, and illustrates some of the possible pitfalls. Chapter 10 presents a survey of the techniques used, especially different Markov models and how models of whole genes can be built up from models of individual components such as ribosome-binding sites.

Part 5: Secondary Structures

Secondary Structures provides two chapters on methods of predicting secondary structures based on sequence (or primary structure). Chapter 11 introduces the methods of secondary structure prediction and discusses the various techniques and ways to interpret the results. Later sections of the chapter deal with prediction of more specialized secondary structure such as protein transmembrane regions, coiled coil and leucine zipper structures, and RNA secondary structures. Chapter 12 presents the underlying principles and details of the prediction methods from basic concepts to in-depth understanding of techniques such as neural networks and Markov models applied to this problem.

Part 6: Tertiary Structures

Tertiary Structures extends the material in Part 5 to enable the prediction and modeling of protein tertiary and quaternary structure. Chapter 13 introduces the reader to the concepts of energy functions, minimization, and ab initio prediction. It deals in more detail with the method of threading and focuses on homology modeling of protein structures, taking the student in a stepwise fashion through the process. The chapter ends with example studies to illustrate the techniques. Chapter 14 contains methods and techniques for further analysis of structural information and describes the importance of structure and function relationships. This chapter deals with how fold prediction can help to identify function, as well as giving an introduction to ligand docking and drug design.

Part 7: Cells and Organisms

Cells and Organisms consists of two chapters that deal in some detail with expression analysis and an introductory chapter on systems biology. Chapter 15 introduces the techniques available to analyze protein and gene expression data. It shows the reader the information that can be learned from these experimental techniques as well as how the information could be used for further analysis. Chapter 16 presents some of the clustering techniques and statistics that are touched upon in Chapter 15 and are commonly used in gene and protein expression analysis. Chapter 17 is a standalone chapter dealing with the modeling of systems processes. It introduces the reader to the basic concepts of systems biology, and shows what this exciting and rapidly growing field may achieve in the future.


Appendices

Three appendices are provided that expand on some of the concepts mentioned in the main part of this book. These are useful for the more inquisitive and advanced reader. Appendix A deals with probability and Bayesian analysis, Appendix B is mainly associated with Part 6 and deals with molecular energy functions, while Appendix C describes function optimization techniques.

Organization of the Chapters

Learning Outcomes

Each chapter opens with a list of learning outcomes which summarize the topics to be covered and act as a revision checklist.

Flow Diagrams

Within each chapter every section is introduced with a flow diagram to help the student to visualize and remember the topics covered in that section. A flow diagram from Chapter 5 is given below, as an example. Those concepts which will be described in the current section are shown in yellow boxes with arrows to show how they are connected to each other. For example, two main types of optimal alignments will be described in this section of the chapter: local and global. Those concepts which were described in previous sections of the chapter are shown in grey boxes, so that the links can easily be seen between the topics of the current section and what has already been presented. For example, creating alignments requires methods for scoring gaps and for scoring substitutions, both of which have already been described in the chapter. In this way the major concepts and their inter-relationships are gradually built up throughout the chapter.

[Flow diagram from Chapter 5, Pairwise Sequence Alignment and Database Searching. Box labels: scoring gaps; scoring substitutions; log-odds scores; PAM scoring matrices; BLOSUM scoring matrices; residue properties; alignments; optimal alignments; local; global; Needleman–Wunsch; Smith–Waterman; potentially nonoptimal; band or X-drop; suboptimal alignments.]


[Mind map for Chapter 4. Central topic: producing and analyzing sequence alignments. Main areas: measuring matches; database searching; aligning sequences; families. Other labels: scoring; gap penalty; substitution matrices; PAM; BLOSUM; % identity; pairwise alignment; BLAST; PHI-BLAST; FASTA; SSEARCH; pairwise; multiple; local; global; domains; patterns; conservation; Pfam; PROSITE; PRATT; MEME; others.]

Mind Maps

Each chapter has a mind map, which is a specialized pedagogical feature, enabling the student to visualize and remember the steps that are necessary for specific applications. The mind map for Chapter 4 is given above, as an example. In this example, four main areas of the topic ‘producing and analyzing sequence alignments’ have been identified: measuring matches, database searching, aligning sequences, and families. Each of these areas, colored for clarity, is developed to identify the key concepts involved, creating a visual aid to help the reader see at a glance the range of the material covered in discussing this area. Occasionally there are important connections between distinct areas of the mind map, as here in linking BLAST and PHI-BLAST, with the latter method being derived directly from the former, but having a quite different function, and thus being in a different area of the mind map.

Illustrations

Each chapter is illustrated with four-color figures. Considerable care has been put into ensuring simplicity as well as consistency of representation across the book. Figure 4.16 is given below, as an example.

[Figure 4.16. (A) Aligned sequence segments:
p110d  YCVATYVLGIGDRHSDNIMIRESGQLFHIDFGHFLGNFKTKFGINRERVP
p110b  YCVASYVLGIGDRHSDNIMVKKTGQLFHIDFGHILGNFKSKFGIKRERVP
p110g  YCVATFVLGIGDRHNDNIMITETGNLFHIDFGHILGNYKSFLGINKERVP
p110a  YCVATFILGIGDRHNSNIMVKDDGQLFHIDFGHFLDHKKKKFGYKRERVP
(B) Table with columns name, combined p-value, and motifs, with entries for p110a, p110b, p110d, and p110g. (C) Motif diagrams for P11G pig and PRKD human.]


Further Reading

It is not possible to summarize all current knowledge in the confines of this book, let alone anticipate future developments in this rapidly developing subject. Therefore at the end of each chapter there are references to research literature and specialist monographs to help readers continue to develop their knowledge and skills. We have grouped the books and articles according to topic, such that the sections within the Further Reading correspond to the sections in the chapter itself: we hope this will help the reader target their attention more quickly onto the appropriate extension material.

List of Symbols

Bioinformatics makes use of numerous symbols, many of which will be unfamiliar to those who do not already know the subject well. To help the reader navigate the symbols used in this book, a comprehensive list is given at the back which quotes each symbol, its definition, and where its most significant occurrences in the book are located.

Glossary

All technical terms are highlighted in bold where they first appear in the text and are then listed and explained in the Glossary. Further, each term in the Glossary also appears in the Index, so the reader can quickly gain access to the relevant pages where the term is covered in more detail. The book has been designed to cross-reference in as thorough and helpful a way as possible.

Garland Science Website

Garland Science has made available a number of supplementary resources on its website, which are freely available and do not require a password. For more details, go to www.garlandscience.com/gs_textbooks.asp and follow the link to Understanding Bioinformatics.

Artwork

All the figures in Understanding Bioinformatics are available to download from the Garland Science website. The artwork files are saved in zip format, with a single zip file for each chapter. Individual figures can then be extracted as jpg files.

Additional Material

The Garland Science website has some additional material relating to the topics in this book. For each of the seven parts a pdf is available, which provides a set of useful weblinks relevant to those chapters. These include weblinks to relevant and important databases and to file format definitions, as well as to free programs and to servers which permit data analysis on-line. In addition to these, the sets of data which were used to illustrate the methods of analysis are also provided. These will allow the reader to reanalyze the same data, reproducing the results shown here and trying out other techniques.


LIST OF REVIEWERS

The Authors and Publishers of Understanding Bioinformatics gratefully acknowledge the contribution of the following reviewers in the development of this book:


Stephen Altschul, National Center for Biotechnology Information, Bethesda, Maryland, USA
Petri Auvinen, Institute of Biotechnology, University of Helsinki, Finland
Joel Bader, Johns Hopkins University, Baltimore, USA
Tim Bailey, University of Queensland, Brisbane, Australia
Alex Bateman, Wellcome Trust Sanger Institute, Cambridge, UK
Meredith Betterton, University of Colorado at Boulder, USA
Andy Brass, University of Manchester, UK
Chris Bystroff, Rensselaer Polytechnic University, Troy, USA
Charlotte Deane, University of Oxford, UK
John Hancock, MRC Mammalian Genetics Unit, Harwell, Oxfordshire, UK
Steve Harris, University of Oxford, UK
Steve Henikoff, Fred Hutchinson Cancer Research Center, Seattle, USA
Jaap Heringa, Free University, Amsterdam, Netherlands
Sudha Iyengar, Case Western Reserve University, Cleveland, USA
Sun Kim, Indiana University Bloomington, USA
Patrice Koehl, University of California Davis, USA
Frank Lebeda, US Army Medical Research Institute of Infectious Diseases, Fort Detrick, Maryland, USA
David Liberles, University of Bergen, Norway
Peter Lockhart, Massey University, Palmerston North, New Zealand
James McInerney, National University of Ireland, Maynooth, Ireland
Nicholas Morris, University of Newcastle, UK
William Pearson, University of Virginia, Charlottesville, USA
Marialuisa Pellegrini-Calace, European Bioinformatics Institute, Cambridge, UK
Mihaela Pertea, University of Maryland, College Park, Maryland, USA
David Robertson, University of Manchester, UK
Rob Russell, EMBL, Heidelberg, Germany
Ravinder Singh, University of Colorado, USA
Deanne Taylor, Brandeis University, Waltham, Massachusetts, USA
Jen Taylor, University of Oxford, UK
Iosif Vaisman, University of North Carolina at Chapel Hill, USA


CONTENTS IN BRIEF

PART 1 Background Basics
Chapter 1: The Nucleic Acid World
Chapter 2: Protein Structure
Chapter 3: Dealing With Databases

PART 2 Sequence Alignments
Chapter 4: Producing and Analyzing Sequence Alignments (Applications Chapter)
Chapter 5: Pairwise Sequence Alignment and Database Searching (Theory Chapter)
Chapter 6: Patterns, Profiles, and Multiple Alignments (Theory Chapter)

PART 3 Evolutionary Processes
Chapter 7: Recovering Evolutionary History (Applications Chapter)
Chapter 8: Building Phylogenetic Trees (Theory Chapter)

PART 4 Genome Characteristics
Chapter 9: Revealing Genome Features (Applications Chapter)
Chapter 10: Gene Detection and Genome Annotation (Theory Chapter)

PART 5 Secondary Structures
Chapter 11: Obtaining Secondary Structure from Sequence (Applications Chapter)
Chapter 12: Predicting Secondary Structures (Theory Chapter)

PART 6 Tertiary Structures
Chapter 13: Modeling Protein Structure (Applications Chapter)
Chapter 14: Analyzing Structure–Function Relationships (Applications Chapter)

PART 7 Cells and Organisms
Chapter 15: Proteome and Gene Expression Analysis
Chapter 16: Clustering Methods and Statistics
Chapter 17: Systems Biology

APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Appendix B: Molecular Energy Functions
Appendix C: Function Optimization


CONTENTS

Preface
A Note to the Reader
List of Reviewers
Contents in Brief

Part 1 Background Basics

Chapter 1 The Nucleic Acid World

1.1 The Structure of DNA and RNA
    DNA is a linear polymer of only four different bases
    Two complementary DNA strands interact by base pairing to form a double helix
    RNA molecules are mostly single stranded but can also have base-pair structures

1.2 DNA, RNA, and Protein: The Central Dogma
    DNA is the information store, but RNA is the messenger
    Messenger RNA is translated into protein according to the genetic code
    Translation involves transfer RNAs and RNA-containing ribosomes

1.3 Gene Structure and Control
    RNA polymerase binds to specific sequences that position it and identify where to begin transcription
    The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
    Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
    The control of translation

1.4 The Tree of Life and Evolution
    A brief survey of the basic characteristics of the major forms of life
    Nucleic acid sequences can change as a result of mutation

Summary
Further Reading

Chapter 2 Protein Structure

2.1 Primary and Secondary Structure
    Protein structure can be considered on several different levels
    Amino acids are the building blocks of proteins
    The differing chemical and physical properties of amino acids are due to their side chains
    Amino acids are covalently linked together in the protein chain by peptide bonds
    Secondary structure of proteins is made up of α-helices and β-strands
    Several different types of β-sheet are found in protein structures
    Turns, hairpins and loops connect helices and strands

2.2 Implication for Bioinformatics
    Certain amino acids prefer a particular structural unit
    Evolution has aided sequence analysis
    Visualization and computer manipulation of protein structures

2.3 Proteins Fold to Form Compact Structures
    The tertiary structure of a protein is defined by the path of the polypeptide chain
    The stable folded state of a protein represents a state of low energy
    Many proteins are formed of multiple subunits

Summary
Further Reading

Chapter 3 Dealing with Databases

3.1 The Structure of Databases
    Flat-file databases store data as text files
    Relational databases are widely used for storing biological information
    XML has the flexibility to define bespoke data classifications
    Many other database structures are used for biological data
    Databases can be accessed locally or online and often link to each other

3.2 Types of Database
    There’s more to databases than just data
    Primary and derived data
    How we define and connect things is very important: Ontologies

3.3 Looking for Databases
    Sequence databases
    Microarray databases

BIF Prelims 5th proofs.qxd

18/7/07

10:34

Page xv

Contents Protein interaction databases Structural databases 3.4 Data Quality Nonredundancy is especially important for some applications of sequence databases Automated methods can be used to check for data consistency Initial analysis and annotation is usually automated Human intervention is often required to produce the highest quality annotation The importance of updating databases and entry identifier and version numbers Summary Further Reading

58 59 61

85

4.5 Types of Alignment Different kinds of alignments are useful in different circumstances Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences Multiple alignments can be constructed by several different techniques Multiple alignments can improve the accuracy of alignment for sequences of low similarity ClustalW can make global multiple alignments of both DNA and protein sequences Multiple alignments can be made by combining a series of local alignments Alignment can be improved by incorporating additional information

87

4.6 Searching Databases Fast yet accurate search algorithms have been developed FASTA is a fast database-search method based on matching short identical segments BLAST is based on finding very similar short segments Different versions of BLAST and FASTA are used for different problems PSI-BLAST enables profile-based database searches SSEARCH is a rigorous alignment method

93

85 86

62 63 64 65 65 66 67

Part 2 Sequence Alignments APPLICATIONS CHAPTER

Chapter 4 Producing and Analyzing Sequence Alignments 4.1 Principles of Sequence Alignment 72 Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity 73 Alignment can reveal homology between sequences 74 It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences 75 4.2 Scoring Alignments 76 The quality of an alignment is measured by giving it a quantitative score 76 The simplest way of quantifying similarity between two sequences is percentage identity 76 The dot-plot gives a visual assessment of similarity based on identity 77 Genuine matches do not have to be identical 79 There is a minimum percentage identity that can be accepted as significant 81 There are many different ways of scoring an alignment 81 4.3 Substitution Matrices Substitution matrices are used to assign individual scores to aligned sequence positions The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence The choice of substitution matrix depends on the problem to be solved

4.4 Inserting Gaps Gaps inserted in a sequence to maximize similarity require a scoring penalty Dynamic programming algorithms can determine the optimal introduction of gaps

81 81

82

84 84

87

90 90 91 92 92 93

94 95 95 95 96 97

4.7 Searching with Nucleic Acid or Protein Sequences 97 DNA or RNA sequences can be used either directly or after translation 97 The quality of a database match has to be tested to ensure that it could not have arisen by chance 97 Choosing an appropriate E-value threshold helps to limit a database search 98 Low-complexity regions can complicate homology searches 100 Different databases can be used to solve particular problems 102 4.8 Protein Sequence Motifs or Patterns Creation of pattern databases requires expert knowledge The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences


4.9 Searching Using Motifs and Patterns The PROSITE database can be searched for protein motifs and patterns


The pattern-based program PHI-BLAST searches for both homology and matching motifs
Patterns can be generated from multiple sequences using PRATT
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family
The Pfam database defines profiles of protein families
4.10 Patterns and Protein Function
Searches can be made for particular functional sites in proteins
Sequence comparison is not the only way of analyzing protein sequences
Summary
Further Reading


The BLAST algorithm makes use of finite-state automata Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms

Chapter 5 Pairwise Sequence Alignment and Database Searching


5.4 Alignment Score Significance The statistics of gapped local alignments can be approximated by the same theory


5.5 Aligning Complete Genome Sequences Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms


Summary Further Reading

THEORY CHAPTER


THEORY CHAPTER

5.1 Substitution Matrices and Scoring Alignment scores attempt to measure the likelihood of a common evolutionary ancestor The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins The BLOSUM matrices were designed to find conserved regions of proteins Scoring matrices for nucleotide sequence alignment can be derived in similar ways The substitution scoring matrix used must be appropriate to the specific alignment problem Gaps are scored in a much more heuristic way than substitutions


Chapter 6 Patterns, Profiles, and Multiple Alignments


6.1 Profiles and Sequence Logos Position-specific scoring matrices are an extension of substitution scoring matrices Methods for overcoming a lack of data in deriving the values for a PSSM PSI-BLAST is a sequence database searching program Representing a profile as a logo


5.2 Dynamic Programming Algorithms Optimal global alignments are produced using efficient variations of the Needleman–Wunsch algorithm Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm Time can be saved with a loss of rigor by not calculating the whole matrix


6.2 Profile Hidden Markov Models The basic structure of HMMs used in sequence alignment to profiles Estimating HMM parameters using aligned sequences Scoring a sequence against a profile HMM: The most probable path and the sum over all paths Estimating HMM parameters using unaligned sequences 6.3 Aligning Profiles Comparing two PSSMs by alignment Aligning profile HMMs


5.3 Indexing Techniques and Algorithmic Approximations Suffix trees locate the positions of repeats and unique sequences Hashing is an indexing technique that lists the starting positions of all k-tuples The FASTA algorithm uses hashing and chaining for fast database searching


6.4 Multiple Sequence Alignments by Gradual Sequence Addition
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment
Many different scoring schemes have been used in constructing multiple alignments


The multiple alignment is built using the guide tree and profile methods and may be further refined


6.5 Other Ways of Obtaining Multiple Alignments The multiple sequence alignment program DIALIGN aligns ungapped blocks The SAGA method of multiple alignment uses a genetic algorithm


6.6 Sequence Pattern Discovery Discovering patterns in a multiple alignment: eMOTIF and AACC Probabilistic searching for common patterns in sequences: Gibbs and MEME Searching for more general sequence patterns


Phylogenetic analyses of a small dataset of 16S RNA sequence data Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved Summary Further Reading


THEORY CHAPTER

Chapter 8 Building Phylogenetic Trees

Summary Further Reading


Part 3 Evolutionary Processes APPLICATIONS CHAPTER

Chapter 7 Recovering Evolutionary History 7.1 The Structure and Interpretation of Phylogenetic Trees Phylogenetic trees reconstruct evolutionary relationships Tree topology can be described in several ways Consensus and condensed trees report the results of comparing tree topologies


7.2 Molecular Evolution and its Consequences Most related sequences have many positions that have mutated several times The rate of accepted mutation is usually not the same for all types of base substitution Different codon positions have different mutation rates Only orthologous genes should be used to construct species phylogenetic trees Major changes affecting large regions of the genome are surprisingly common


7.3 Phylogenetic Tree Reconstruction Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset A model of evolution must be chosen to use with the method All phylogenetic analyses must start with an accurate multiple alignment


8.1 Evolutionary Models and the Calculation of Evolutionary Distance
A simple but inaccurate measure of evolutionary distance is the p-distance
The Poisson distance correction takes account of multiple mutations at the same site
The Gamma distance correction takes account of mutation rate variation at different sequence positions
The Jukes–Cantor model reproduces some basic features of the evolution of nucleotide sequences
More complex models distinguish between the relative frequencies of different types of mutation
There is a nucleotide bias in DNA sequences
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment
8.2 Generating Single Phylogenetic Trees
Clustering methods produce a phylogenetic tree based on evolutionary distances
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree
The Fitch–Margoliash method produces an unrooted additive tree
The neighbor-joining method is related to the concept of minimum evolution
Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree


8.3 Generating Multiple Tree Topologies The branch-and-bound method greatly improves the efficiency of exploring tree topology Optimization of tree topology can be achieved by making a series of small changes to an existing tree Finding the root gives a phylogenetic tree a direction in time


8.4 Evaluating Tree Topologies Functions based on evolutionary distances can be used to evaluate trees Unweighted parsimony methods look for the trees with the smallest number of mutations


Mutations can be weighted in different ways in the parsimony method
Trees can be evaluated using the maximum likelihood method
The quartet-puzzling method also involves maximum likelihood in the standard implementation
Bayesian methods can also be used to reconstruct phylogenetic trees
8.5 Assessing the Reliability of Tree Features and Comparing Trees
The long-branch attraction problem can arise even with perfect data and methodology
Tree topology can be tested by examining the interior branches
Tests have been proposed for comparing two or more alternative trees
Summary
Further Reading


Prokaryotic promoter regions contain relatively well-defined motifs Eukaryotic promoter regions are typically more complex than prokaryotic promoters A variety of promoter-prediction methods are available online Promoter prediction results are not very clear-cut


9.6 Confirming Predictions There are various methods for calculating the accuracy of gene-prediction programs Translating predicted exons can confirm the correctness of the prediction Constructing the protein and identifying homologs


9.7 Genome Annotation Genome annotation is the final step in genome analysis Gene ontology provides a standard vocabulary for gene annotation


9.8 Large Genome Comparisons


Part 4 Genome Characteristics Summary Further Reading

APPLICATIONS CHAPTER

Chapter 9 Revealing Genome Features 9.1 Preliminary Examination of Genome Sequence Whole genome sequences can be split up to simplify gene searches Structural RNA genes and repeat sequences can be excluded from further analysis Homology can be used to identify genes in both prokaryotic and eukaryotic genomes


THEORY CHAPTER

Chapter 10 Gene Detection and Genome Annotation


9.2 Gene Prediction in Prokaryotic Genomes


10.1 Detection of Functional RNA Molecules Using Decision Trees Detection of tRNA genes using the tRNAscan algorithm Detection of tRNA genes in eukaryotic genomes

9.3 Gene Prediction in Eukaryotic Genomes Programs for predicting exons and introns use a variety of approaches Gene predictions must preserve the correct reading frame Some programs search for exons using only the query sequence and a model for exons Some programs search for genes using only the query sequence and a gene model Genes can be predicted using a gene model and sequence similarity Genomes of related organisms can be used to improve gene prediction


10.2 Features Useful for Gene Detection in Prokaryotes


10.3 Algorithms for Gene Detection in Prokaryotes GeneMark uses inhomogeneous Markov chains and dicodon statistics GLIMMER uses interpolated Markov models of coding potential ORPHEUS uses homology, codon statistics, and ribosome-binding sites GeneMark.hmm uses explicit state duration hidden Markov models EcoParse is an HMM gene model


9.5 Prediction of Promoter Regions


10.4 Features Used in Eukaryotic Gene Detection Differences between prokaryotic and eukaryotic genes Introns, exons, and splice sites Promoter sequences and binding sites for transcription factors


9.4 Splice Site Detection Splice sites can be detected independently by specialized programs


10.5 Predicting Eukaryotic Gene Signals
Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods
A set of models has been designed to locate the site of core promoter sequence signals
Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results
Predicting eukaryotic transcription and translation start sites
Translation and transcription stop signals complete the gene definition


10.6 Predicting Exon/Intron Structure
Exons can be identified using general sequence properties
Splice-site prediction
Splice sites can be predicted by sequence patterns combined with base statistics
GenScan uses a combination of weight matrices and decision trees to locate splice sites
GeneSplicer predicts splice sites using first-order Markov chains
NetPlantGene uses neural networks with intron and exon predictions to predict splice sites
Other splicing features may yet be exploited for splice-site prediction
Specific methods exist to identify initial and terminal exons
Exons can be defined by searching databases for homologous regions
10.7 Complete Eukaryotic Gene Models


10.8 Beyond the Prediction of Individual Genes Functional annotation Comparison of related genomes can help resolve uncertain predictions Evaluation and reevaluation of gene-detection methods


Summary Further Reading

that incorporate additional information about protein structure Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods


11.2 Training and Test Databases There are several ways to define protein secondary structures 11.3 Assessing the Accuracy of Prediction Programs Q3 measures the accuracy of individual residue assignments Secondary structure predictions should not be expected to reach 100% residue accuracy The Sov value measures the prediction accuracy for whole elements CAFASP/CASP: Unbiased and readily available protein prediction assessments 11.4 Statistical and Knowledge-Based Methods The GOR method uses an information theory approach The program Zpred includes multiple alignment of homologous sequences and residue conservation information There is an overall increase in prediction accuracy using multiple sequence information The nearest-neighbor method: The use of multiple nonhomologous sequences PREDATOR is a combined statistical and knowledge-based program that includes the nearest-neighbor approach 11.5 Neural Network Methods of Secondary Structure Prediction Assessing the reliability of neural net predictions Several examples of Web-based neural network secondary structure prediction programs PROF: Protein forecasting PSIPRED Jnet: Using several alternative representations of the sequence alignment


Chapter 11 Obtaining Secondary Structure from Sequence

11.6 Some Secondary Structures Require Specialized Prediction Methods Transmembrane proteins Quantifying the preference for a membrane environment

11.1 Types of Prediction Methods
Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure
Nearest-neighbor methods are statistical methods

11.7 Prediction of Transmembrane Protein Structure
Multi-helix membrane proteins
A selection of prediction programs to predict transmembrane helices

Part 5 Secondary Structures APPLICATIONS CHAPTER


Statistical methods
Knowledge-based prediction
Evolutionary information from protein families improves the prediction
Neural nets in transmembrane prediction
Predicting transmembrane helices with hidden Markov models
Comparing the results: What to choose
What happens if a non-transmembrane protein is submitted to transmembrane prediction programs
Prediction of transmembrane structure containing β-strands


11.8 Coiled-coil Structures The COILS prediction program PAIRCOIL and MULTICOIL are an extension of the COILS algorithm Zipping the Leucine zipper: A specialized coiled coil


11.9 RNA Secondary Structure Prediction


Summary Further Reading


12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction Layered feed-forward neural networks can transform a sequence into a structural prediction Inclusion of information on homologous sequences improves neural network accuracy More complex neural nets have been applied to predict secondary and other structural features 12.5 Hidden Markov Models Have Been Applied to Structure Prediction HMM methods have been found especially effective for transmembrane proteins Nonmembrane protein secondary structures can also be successfully predicted with HMMs 12.6 General Data Classification Techniques Can Predict Structural Features Support vector machines have been successfully used for protein structure prediction Discriminants, SOMs, and other methods have also been used Summary Further Reading


THEORY CHAPTER

Chapter 12 Predicting Secondary Structures
12.1 Defining Secondary Structure and Prediction Accuracy
The definitions used for automatic protein secondary structure assignment do not give identical results
There are several different measures of the accuracy of secondary structure prediction
12.2 Secondary Structure Prediction Based on Residue Propensities
Each structural state has an amino acid preference which can be assigned as a residue propensity
The simplest prediction methods are based on the average residue propensity over a sequence window
Residue propensities are modulated by nearby sequence
Predictions can be significantly improved by including information from homologous sequences


12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity
Short segments of similar sequence are found to have similar structure
Several sequence similarity measures have been used to identify nearest-neighbor segments
A weighted average of the nearest-neighbor segment structures is used to make the prediction
A nearest-neighbor method has been developed to predict regions with a high potential to misfold


Part 6 Tertiary Structures APPLICATIONS CHAPTER

Chapter 13 Modeling Protein Structure 13.1 Potential Energy Functions and Force Fields The conformation of a protein can be visualized in terms of a potential energy surface Conformational energies can be described by simple mathematical functions Similar force fields can be used to represent conformational energies in the presence of averaged environments Potential energy functions can be used to assess a modeled structure Energy minimization can be used to refine a modeled structure and identify local energy minima Molecular dynamics and simulated annealing are used to find global energy minima


13.2 Obtaining a Structure by Threading The prediction of protein folds in the absence of known structural homologs Libraries or databases of nonredundant protein folds are used in threading Two distinct types of scoring schemes have been used in threading methods Dynamic programming methods can identify optimal alignments of target sequences and structural folds


Several methods are available to assess the confidence to be put on the fold prediction
The C2-like domain from the Dictyostelia: A practical example of threading


13.3 Principles of Homology Modeling
Closely related target and template sequences give better models
Significant sequence identity depends on the length of the sequence
Homology modeling has been automated to deal with the numbers of sequences that can now be modeled
Model building is based on a number of assumptions
13.4 Steps in Homology Modeling
Structural homologs to the target protein are found in the PDB
Accurate alignment of target and template sequences is essential for successful modeling
The structurally conserved regions of a protein are modeled first
The modeled core is checked for misfits before proceeding to the next stage
Sequence realignment and remodeling may improve the structure
Insertions and deletions are usually modeled as loops
Nonidentical amino acid side chains are modeled mainly by using rotamer libraries
Energy minimization is used to relieve structural errors
Molecular dynamics can be used to explore possible conformations for mobile loops
Models need to be checked for accuracy
How far can homology models be trusted?
13.5 Automated Homology Modeling
The program MODELLER models by satisfying protein structure constraints
COMPOSER uses fragment-based modeling to automatically generate a model
Automated methods available on the Web for comparative modeling
Assessment of structure prediction


13.6 Homology Modeling of PI3 Kinase p110α
Swiss-Pdb Viewer can be used for manual or semi-manual modeling
Alignment, core modeling, and side-chain modeling are carried out all in one
The loops are modeled from a database of possible structures
Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer


MolIDE is a downloadable semi-automatic modeling package
Automated modeling on the Web illustrated with p110α kinase
Modeling a functionally related but sequentially dissimilar protein: mTOR
Generating a multidomain three-dimensional structure from sequence
Summary
Further Reading


APPLICATIONS CHAPTER

Chapter 14 Analyzing Structure–Function Relationships 14.1 Functional Conservation Functional regions are usually structurally conserved Similar biochemical function can be found in proteins with different folds Fold libraries identify structurally similar proteins regardless of function


14.2 Structure Comparison Methods Finding domains in proteins aids structure comparison Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison The CE method builds up a structural alignment from pairs of aligned protein segments The Vector Alignment Search Tool (VAST) aligns secondary structural elements DALI identifies structure superposition without maintaining segment order FATCAT introduces rotations between rigid segments


14.3 Finding Binding Sites
Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites
Searching for protein–protein interactions using surface properties
Surface calculations highlight clefts or holes in a protein that may serve as binding sites
Looking at residue conservation can identify binding sites


14.4 Docking Methods and Programs Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known Specialized docking programs will automatically dock a ligand to a structure


Scoring functions are used to identify the most likely docked ligand
The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site
Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area
GOLD is a flexible docking program, which utilizes a genetic algorithm
The water molecules in binding sites should also be considered


Summary Further Reading


The changes in a set of protein spots can be tracked over a number of different samples Databases and online tools are available to aid the interpretation of 2D gel data Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means Protein-identification programs for mass spectrometry are freely available on the Web Mass spectrometry can be used to measure protein concentration Summary Further Reading


Part 7 Cells and Organisms Chapter 15 Proteome and Gene Expression Analysis 15.1 Analysis of Large-scale Gene Expression The expression of large numbers of different genes can be measured simultaneously by DNA microarrays Gene expression microarrays are mainly used to detect differences in gene expression in different conditions Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues Facilitating the integration of data from different places and experiments The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis Techniques based on self-organizing maps can be used for analyzing microarray data Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters Clustered gene expression data can be used as a tool for further research


15.2 Analysis of Large-scale Protein Expression Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell Measuring the expression levels shown in 2D gels Differences in protein expression levels between different samples can be detected by 2D gels Clustering methods are used to identify protein spots with similar expression patterns Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data


Chapter 16 Clustering Methods and Statistics 16.1 Expression Data Require Preparation Prior to Analysis Data normalization is designed to remove systematic experimental errors Expression levels are often analyzed as ratios and are usually transformed by taking logarithms Sometimes further normalization is useful after the data transformation Principal component analysis is a method for combining the properties of an object



16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points Euclidean distance is the measure used in everyday life The Pearson correlation coefficient measures distance in terms of the shape of the expression response The Mahalanobis distance takes account of the variation and correlation of expression responses 16.3 Clustering Methods Identify Similar and Distinct Expression Patterns Hierarchical clustering produces a related set of alternative partitions of the data k-means clustering groups data into several clusters but does not determine a relationship between clusters Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem


The self-organizing tree algorithm (SOTA) determines the number of clusters required
Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples
The validity of clusters is determined by independent methods
16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression
t-tests can be used to estimate the significance of the difference between two expression levels
Nonparametric tests are used to avoid making assumptions about the data sampling
Multiple testing of differential expression requires special techniques to control error rates
16.5 Gene and Protein Expression Data Can be Used to Classify Samples
Many alternative methods have been proposed that can classify samples
Support vector machines are another form of supervised learning algorithms that can produce classifiers
Summary
Further Reading


17.4 Storing and Running System Models Specialized programs make simulating systems easier Standardized system descriptions aid their storage and reuse Summary Further Reading


APPENDICES Background Theory


Appendix A: Probability, Information, and Bayesian Analysis Probability Theory, Entropy, and Information Mutually exclusive events Occurrence of two events Occurrence of two random variables Bayesian Analysis Bayes’ theorem Inference of parameter values Further Reading


Chapter 17 Systems Biology 17.1 What is a System? A system is more than the sum of its parts A biological system is a living network Databases are useful starting points in constructing a network To construct a model more information is needed than a network There are three possible approaches to constructing a model Kinetic models are not the only way in systems biology


17.2 Structure of the Model Control circuits are an essential part of any biological system The interactions in networks can be represented as simple differential equations


17.3 Robustness of Biological Systems Robustness is a distinct feature of complexity in biology Modularity plays an important part in robustness Redundancy in the system can provide robustness Living systems can switch from one state to another by means of bistable switches


Appendix B: Molecular Energy Functions Force Fields for Calculating Intra- and Intermolecular Interaction Energies Bonding terms Nonbonding terms Potentials used in Threading Potentials of mean force Potential terms relating to solvent effects Further Reading


Appendix C: Function Optimization Full Search Methods Dynamic programming and branch-and-bound Local Optimization The downhill simplex method The steepest descent method The conjugate gradient method Methods using second derivatives Thermodynamic Simulation and Global Optimization Monte Carlo and genetic algorithms Molecular dynamics Simulated annealing Summary


Further Reading


List of Symbols Glossary Index


End matter 6th proofs.qxd

19/7/07

12:17

Page 751

INDEX

Note: Entries which are simply page numbers refer to the main text. Other entries have the following abbreviations immediately after the page number: B, box; F, figure; FD, flow diagram; MM, mind map; T, table.

2ZIP method, 453–4, 455F 3₁₀-helices, 435 defining, for prediction algorithms, 464–5 3D-Coffee, 203 3DEE library, 574 3Djigsaw, 563 3D-PSSM, 533, 534F 3′ end, 6, 12T, 17–18 3-patterns, 217 5′ end, 6, 12T, 14, 19 –10 motif, 16 16S RNA sequences, 249 evolutionary model selection, 254F, 255, 256T phylogenetic analysis, 249, 251, 255–8, 257F, 258F –35 motif, 16 123D+ program, 535–6, 536F α/β-fold proteins, 421F, 423F, 573F, 574 α-helices, 33–5, 413F amino acid preferences, 37 Chou–Fasman propensities, 474–5, 474F, 475F coiled coil formation, 451, 451F defining, for prediction algorithms, 464–5, 466–7 hydrogen bonding, 34, 35F length distributions, 467, 468F prediction, 413–14, 428–9, 429F see also secondary structure prediction based on residue propensities, 477–8 neural network methods, 501, 501F transmembrane proteins, 438, 439–48 sequence–structure correlations, 487–8, 487F transmembrane proteins see transmembrane helices

turns, hairpins and loops connecting, 36–7 α-lactalbumin, 538–9, 539F βαβ repeat, 40B β-barrels, transmembrane see transmembrane β-barrels β-bulges, 463, 465 β-lactamase family, 573F β-meander, 40B β-sheets, 34–6, 36F defining, for prediction algorithms, 465, 465F transmembrane proteins, 436 types, 35–6 β-Spider, 466, 467F β-strands, 34–6, 36F, 413F amino acid preferences, 37 Chou–Fasman propensities, 474–5, 474F defining, for prediction algorithms, 465–6, 466–7 distortions, 463 length distributions, 467, 468F prediction, 413–14, 428–9, 429F see also secondary structure prediction based on residue propensities, 477–8 transmembrane proteins, 448–51, 450F variability, 467, 467F turns, hairpins and loops connecting, 36–7 β-turns, 36, 37F, 413F Chou–Fasman propensities, 475, 476T defining, prediction algorithms, 465 prediction, 413–14, 478, 503 π-helices, 435 defining, for prediction algorithms, 464–5 φ angles see under torsion angles ψ angles see under torsion angles

A A (accepted point mutation matrix), 120 AACC, 214–15, 214F AAINDEX, 84 AAindex, 476 AAT program, 331T, 332T, 335, 336 ab initio approach, modeling protein structure, 522, 523B accepted mutations, 84 accepted point mutation matrix (A), 120 acceptor splice sites, 18F, 380F, 392 acetolactate synthase (ALS) family, 259B, 262 activators, 16–17 adaptive systems, 667–8 additive trees, 228–9, 229F, 230 adenosine (A), 6, 6F affine gap penalty, 127, 128, 133–4, 139 Affymetrix GeneChip® arrays, 602 Akaike information criterion (AIC), 253–5 ALDH10 gene, 324–5 annotation, 351–2 exon prediction accuracy, 345, 345–6 different programs, 331–2, 331T, 333–4, 334F, 335, 336 experimental results compared, 327, 328F using related organisms, 336–7 gene structure, 327B interspecies comparisons, 353, 353F, 354F pathway approach to identifying, 348, 349–50F promoter prediction, 341, 341T start codon, 327, 330F alignment, sequence see sequence alignment Alix, Alain, 475 all α-fold proteins, 421F, 422F, 573F, 574


End matter 6th proofs.qxd

19/7/07

12:17

Page 752

all β-fold proteins, 421F, 422F, 573F, 574 alternative splicing, 19, 380–1 Alu elements, 337B Alzheimer’s disease, 491 AMAS program, 93 AMBER program, 526, 701 amino acid(s) (residues), 11, 27–33 chemical structure, 28F conservation, to identify binding sites, 586–7, 587F conservation values (Zpred), 426, 427F, 428F, 429T hydrophobicity scales, 437–8, 450, 475, 477T peptide bonds, 29–33, 31F physicochemical properties, 28–9, 28T, 30F amino acid propensities, 37, 472–85, 472FD see also Chou–Fasman propensities averaged over sequence windows, 476–9 derivation and calculations, 473–6 nearby sequence effects, 479–84, 480F amino acid sequences, 13, 25, 29 see also protein sequences evolutionary conservation, 38 short segments with structural correlations, 487–8, 487F amino acid side chains, 28F modeling, 547–8, 548F, 558–9, 561 physicochemical properties, 28–9, 30F torsion angles (χ1, χ2, etc), 547, 548F amino (N) terminus, 29 amphipathic helix, 439–41 amyloidogenic proteins, 486, 487, 491–2, 492F, 493F analogous enzymes, 244, 244F analysis of covariance (ANCOVA), 659 analysis of variance (ANOVA), 659 ancestral states, 226 anchor points, 546, 546F Anfinsen, Christian, 412, 412F annotation, 357 automated, 64–5 database, 53 data errors or omissions, 64 gene, 348–52 genome see genome annotation manual, 65 ANOLEA program, 550–1, 551T antibiotic synthesis, 643B antibodies, 381, 555B modeling, 555–6B anticoding strand, 11 anticodons, 13–14, 14F


antigen-binding site, 555–6B antigens, 555B antisense strand, 11 apoptotic pathway, 681F approximate correlation coefficient (AC), 366B Arabidopsis thaliana, 328, 330B gene duplications, 241B Rha1 gene prediction, 393F splice sites, 380F, 396 vs rice, 335B Archaea, 21, 21F horizontal gene transfer, 246F, 247 sequenced genomes, 324T architecture database, 45 network, 676, 677F Argos, Patrick, 171 ArrayExpress, 58, 606, 611 ArrayExpress Data Warehouse, 58 arrhythmia, cardiac, modeling, 677, 678F ATG start codons see start codons atomic charges, 704 atomic mean force potential (AMFP), 551 AUG codon, 13, 19, 367 AU (approximately unbiased) method, 309 average conditional probability (ACP), 366B

B backbone (protein), 29, 32 models, 39, 39F back-propagation method, 497B backward algorithm, 190–1 bacteria, 21, 21F see also Escherichia coli; prokaryotes 16S RNA, 249 horizontal gene transfer, 246F, 247 sequenced genomes, 324T balanced training, 498B Baldi, Pierre, 191 BAliBase, 92, 93F balloting probabilities, 501 Barton, Geoff, 206 base-pairing, 7–9, 8F RNA, 456 wobble, 14 bases, 5–7, 6F base sequences see nucleotide sequences Baum–Welch expectation maximization algorithm, 191–3 Bayesian information criterion (BIC), 254–5 Bayesian methods, 697–8 dealing with lack of replicates, 657B

phylogenetic tree reconstruction, 250, 251T, 253, 306–7 Bayes’ theorem, 697–8 Benjamini, Yoav, 659 Berkeley Drosophila Genome Project (BDGP), 340, 341T Betaturns method, 503 biased mutation pressure, 239 biclustering, 649–50, 650F bidirectional recurrent neural network (BRNN), 504, 505F Bifidobacterium longum, 348, 350F bifurcating (branching) pattern, 226–7 binding sites, protein see protein binding sites biochemical pathways see metabolic pathways BioEdit program, 260 bioinformatics, 3 protein structure and, 37–9, 38FD BioModels Database, 692 Biomolecular Interaction Network (BIND), 58, 671, 673F bistable switches, 688–9, 689F BLAST program, 95–6 algorithmic approximations, 141 comparing nucleotide with protein sequences, 150–3 Conserved Domain Database (CDD) search, 99F, 100 dealing with low-complexity regions, 101–2 E-values, 98–100, 99F, 156 gapped method, 147–50, 178T GenScan modification using, 397 restriction of matrix coverage, 140 suffix trees, 141–3 use of finite-state automata, 147–50, 147F, 148F versions available, 95–7 whole genome alignments, 157–9 blastx program, 96, 97, 150, 343 BLAT program, 158 BLOCKS database, 58 Dirichlet mixture from, 174–5, 174F searching, 105–7, 106F substitution matrices from, 122 BLOSUM matrices, 83F, 84 alignment scoring, 82 derivation, 122–5, 123F, 124F selection, 84, 85 summary score measures, 125F, 126 Blundell, Tom, 532 Boltzmann factor, 706 bond angle energy, 703 bond energy, 702 bonding terms, 525–6, 701, 702–4, 702F Bonferroni correction, 658


bootstrap analysis, 310B assessing tree topology, 309–10 comparing tree topologies, 233–4, 233F comparing two or more trees, 311 parametric, 310B practical example, 258, 259F bootstrap interior branch test, 310 bottom-up approach, modeling biological systems, 674–6, 676F bovine spongiform encephalopathy (BSE), 37B, 101B branch-and-bound method, 288, 710 branches, 226, 227F branch length calculations, 293–7, 295F, 296F assessing reliability, 309–10 parsimony methods, 299–300 branch swapping techniques, 289–91, 290F BRCA2, 78, 79F Brenner, Steven, 480 Brudno, Michael, 209 Bryant, David, 296, 296F BTPRED method, 503 Bucher weight matrix method, 383–4, 384F Burset, Moises, 365–6B, 392B BVSPS program, 551T

C C2-like domain, Dictyostelia, 535–7, 536F, 537F Cα atoms, 28, 28F, 29, 417 analysis of geometry, for prediction algorithms, 466, 466F torsion angles see under torsion angles Cα models, 39, 39F Caenorhabditis elegans, 399 CAFASP (Critical Assessment of Fully Automated Structure Prediction), 419, 554–6 cAMP PK see cyclic AMP-dependent protein kinase canonical ensemble, 718 Cantor, Charles, 271 capping, RNA, 18 cap signal (initiator signal, Inr), 389 Bucher weight matrix, 383, 384, 384F GenScan prediction method, 385, 385F NNPP prediction method, 385–6, 386F carboxy (C) terminus, 29 Casadio, Rita, 479–80 cascade-correlation neural network, 503–4

CASP (Critical Assessment of Structure Prediction), 419, 554–6 CATH database, 531, 574 causal dependencies, 668 Cbl protein, 575–80, 576F CCAAT box, detection algorithms, 383, 384–5 CDK10 gene, 324–5 DNA sequence, 326–7B exon prediction, 329F, 330–1, 332T, 336–7 translation of predicted exons, 344F cDNA (complementary DNA) exon prediction using, 397 gene-prediction programs using, 334, 335 microarrays, 602 sequence databases, 56 Celera, 376B cell-division cycle, 688–9 Cell Markup Language (CellML), 692 CellML Model Repository, 692 cellular modeling heart, 685T international projects, 668 programs, 691–2, 691F CE (Combinatorial Extension) method, 576–7, 578F central dogma, 10–14, 10F, 10FD centroid, 711 centroid method, hierarchical clustering, 640, 641F chaining, 144–6 chameleon sequences, 37B, 488 CHAOS algorithm, 209 CHARMM program, 526, 701 ChiClust program, 617, 618–19 ChiMap program, 618–20, 619F chloroplasts, 22, 292B Chou, Peter, 472 Chou–Fasman propensities, 414, 415F, 472, 474–6 applied to GOR, 483 calculated values, 474F, 476T measures of accuracy, 424T nearest-neighbor methods, 489 periodic variation, 474–5, 475F transmembrane helices, 475–6, 478F window sizes, 477–8 chromatography, 600, 623 chromosomes, 10, 21–2 rearrangements, 248 Churchill, Gary, 275 chymosin B, 486, 487F, 490F chymotrypsin, 243–4, 244F CINEMA program, 93 cis conformation, 32, 33F clades, 256 Cladist program, 608–9, 609F

cladogram, 228, 229F ClustalW, 90, 91–2 progressive alignment method, 205 scoring scheme, 201–2, 201F, 202F vs other alignment methods, 92, 93F cluster analysis, 625–64, 626MM data preparation, 626–33, 627F, 627FD defining distances, 633–7, 634FD, 636F evaluating validity of clusters, 650–1 hierarchical see hierarchical clustering hydrophobic (HCA), 110–11, 110F sequence alignment, 90–1, 90F, 126 clustering methods see also specific methods comparison between, 643B gene expression microarray data, 606–11, 611F identifying expression patterns, 637–51, 637FD phylogenetic tree construction, 276–9, 277FD protein expression data, 615–17, 617F, 618F Clusters of Orthologous Groups (COG) database, 103, 243, 245B CMISS modeling tool, 692 COACH method, 195, 203 coding, 11, 12–13 coding strand, 11–12 codon-pairs see dicodons codons, 13 see also start codons; stop codons frequency of occurrence, 367, 367F genetic code, 12T mutation rates at different, 238–9, 238F statistics, use by ORPHEUS, 372–3 co-expressed genes or proteins, 600, 638 COFFEE scoring system, 200, 203, 204F COG (Clusters of Orthologous Groups) database, 103, 243, 245B Cohen, Stanley, 643B coiled coils, 413, 435 geometry, 451, 451F prediction, 451–4, 452FD, 478–9, 510, 510F COILS program, 452–3, 454F, 478–9 collagen, 452 common evolutionary ancestor, measuring likelihood, 117–19 comparative modeling see homology modeling COMPASS method, 195


complementary DNA see cDNA complementary DNA strands, 7–8 complete linkage clustering, 640, 641F complexity see also low-complexity regions biological systems, 684–5 compositional, 151–2B COMPOSER program, 546, 553–4 compositional complexity, 151–2B concatamers, 605 condensation reaction, 29, 31F condensed trees, 233–4, 233F conditioned reconstruction, 292B confidence index, 432 conformation, 27, 41 see also quaternary conformation energies, 524–9, 524FD side chains, 547–8 conformational flexible docking, 590 conformers, 547 conjugate gradient method, 528, 713F, 714 conjugate prior, 698 consensus features, 234 consensus method, pattern or motif creation, 105 consensus sequences, 16 consensus trees, 234–5, 234F, 291 Conserved Domain Database (CDD) search, 99F, 100 CONSOLV program, 593 ConSurf program, 587, 587F contact capacity potential (CCP), 533, 707–8, 708F context strings, 371 control circuits, biological systems, 680, 680F convergent evolution, 74–5, 75B, 243–4, 244F cooperativity, 701 COPASI modeling tool, 692 Corbin, Kendall, 270 CorePromoter program, 340, 341T, 388, 389F core promoters, 17, 319 see also promoter prediction detection of binding signals, 339, 381–9 models designed to locate, 383–7 Cost, Scott, 489, 491 covalent bonds, 32B, 33B energetics, 525–6, 701, 702–4 CPHmodels, 554, 563 creatine kinase, 42F, 43 Creutzfeldt–Jakob disease (CJD), 101, 101B variant (vCJD), 101B Crick, Francis, 7


Critical Assessment of Fully Automated Structure Prediction (CAFASP), 419, 554–6 Critical Assessment of Structure Prediction (CASP), 419, 554–6 Crooks, Gavin, 480 C terminus, 29 Cy5/Cy3 label gene expression microarrays, 602–3, 603F cyclic AMP-dependent protein kinase (cAMP PK) inserting gaps, 86, 86F local and global alignment, 89, 89F multiple alignment, 91–2, 92F cytochrome c oxidase I, 249 cytosine (C), 6, 6F

D Dali library, 574 DALI program, 578–9, 579F Darwinian concept of evolution, 235 DAS (Distributed Annotation System), 348–51, 351F DAS (dense alignment surface) program, 442F, 444–5, 445F, 447 data, 53 checking for consistency, 63–4 derived (secondary), 53–4 log transformation, 629–30, 630F normalization, 627–31, 628F, 630F primary, 53–4 quality, 61–6, 62FD database management system (DBMS), 48 Database of Interacting Proteins (DIP), 58 databases, 45–66, 46MM access to, 52 categories (by content), 55–61, 56F centers, 55 content of entries, 53 data quality, 61–6, 62FD distributed, 48, 52 entry identifiers/version numbers, 65–6 first computerized, 48, 48F flat-file, 47, 47F, 48–9 links between, 52, 53 looking for, 55–61 nonredundancy, 62–3 ontologies, 54–5, 54F relational, 48, 49–50, 49F structure, 46–52, 47FD for systems biology, 671–2, 675T training and test, 416–17 types, 52–5, 53FD, 55FD updating, 65–6

data classification, 637–8, 638F see also sample classification secondary structure prediction, 510–14, 511FD data warehouses, 48, 51F, 52 Davies, Graham P., 420B Dayhoff, Margaret, 82, 119 Dayhoff mutation data matrices (MDMs) see PAM matrices dbEST, 56, 321B DEAD-box motif, 420B decision trees detection of functional RNA molecules, 361–3, 363F sample classification, 661 splice site prediction, 394 DEFINE, 417 degenerate (genetic code), 13 degrees of freedom (df), 654, 655 deletions accounting for, in sequence alignment, 85–7 alignment scoring schemes, 117, 126–7 homology modeling, 542, 545–6, 545F threading and, 532, 537 denatured proteins, 42 dendrograms, 636, 636F gene expression data, 606F, 607, 607F, 608 hierarchical cluster analysis, 639, 640, 640F, 641F dense alignment surface (DAS) program, 442F, 444–5, 445F, 447 deoxyribonucleic acid see DNA deoxyribonucleotides, 6 deoxyribose, 5–6 DESTRUCT method, 503–4, 505F deterministic finite-state automaton, 147F, 148–50 diagonals DIALIGN method, 92, 207–9, 208F FASTA scoring, 95 labeling of matrix, 144F, 145 restricting matrix coverage to, 139–41, 139F, 140F DIALIGN program, 92, 93F, 207–9 DIAL program, 575, 576, 576F dichotomous (branching) pattern, 226–7 dicodons (hexamers), 328, 367 exon prediction using, 390 gene detection methods using, 368–72 promoter prediction using, 387–8 Dictyostelia, C2-like domain, 535–7, 536F, 537F dielectric constant, 704


differential equations, modeling biological systems, 680–3, 682F digital differential display (DDD), 605–6, 605F dihedral angles see torsion angles dihydrofolate reductase (DHFR) ligand docking, 592, 592F pocket identification, 585–6, 586F dimers, 43 directed acyclic graph (DAG), 512 directional information, 423, 482 Dirichlet distribution densities, 174 Dirichlet mixture, 174–5, 174F, 176F discriminant analysis see also linear discriminant analysis; quadratic discriminant analysis gene prediction, 340, 388, 389F, 396–7 sample classification, 661 secondary structure prediction, 512–13 distance, 81 see also evolutionary distance; p-distance definitions for cluster analysis, 633–7, 634FD, 636F phylogenetic tree reconstruction, 249–50, 251, 251T distance correction, 236 Distributed Annotation System (DAS), 348–51, 351F distributed databases, 48, 52 divergent evolution, 75B divide-and-conquer method (multiple alignment), 91, 91F vs other alignment methods, 92, 93F DNA, 4 central dogma concept, 10, 10F, 10FD complementary see cDNA double helix formation, 7–9, 8F mutations see mutations noncoding see junk DNA strands, 7–9, 8F, 11–12 structure, 5–9, 5FD, 8F transcription see transcription DNA gyrases (GyrA and GyrB), 249 DNA microarrays, 9, 600, 601–4 basic principle, 602 databases see microarray databases data clustering methods, 606–10, 643B data sharing and integration, 606 gene expression studies, 602–4, 603F principal component analysis of data, 618 two-color, 602–3, 603F uses of clustered data, 610–11, 611F

DNA polymerase, 8 DNA repeats, 22B see also repeat sequences detection, 152B exclusion from analysis, 319–21 DNA replication, 8, 8F DNA sequence databases, 56, 57F nomenclature for base uncertainty, 63, 63T DNA sequences alignment scoring matrices, 124F, 125 detecting homology, 75–6 gene prediction from see gene prediction multiple alignments, 92 nucleotide bias, 275–6 phylogenetic tree reconstruction, 249 preliminary examination, 318–22, 319FD searching with, 97 docking, 587–93, 588FD accounting for water molecules, 592–3 conformational flexible, 590 fragment, 591 scoring functions, 590 simple strategies, 588 specialized programs, 588–92, 592F DOCK program, 590–1 domains protein, 41 see also multidomain proteins families, 259B identifying, 574–6, 576F shuffling, 570 taxonomic, 21 donor splice sites, 18F, 380F, 392 dot-plots, 77–8, 77F, 79F low-complexity regions, 101–2, 102F double dynamic programming, 534 downhill simplex method, 711, 712F downstream sequences, 16 d-patterns, 217 drawhca program, 110F, 111 drug design, rational, 588, 589B DSC method, 512–13 DSSP program, 417 defining secondary structures, 464–6, 465F, 465T, 467, 467F length distributions of secondary structures, 467, 468F nearby sequence effects, 479–80, 480F duplication chromosome and genome, 248 gene see gene duplication sequence, 158F, 245

Durbin, Richard, 363 DUST program, 152B dynamic programming algorithms double, 534 gene model, 399, 402F global–local, 533 pairwise alignment, 86–7 database searching, 95–7 discarding intermediate calculations, 138B extension to multiple alignment, 198 function optimization, 710 local and suboptimal, 135–9 optimal global, 129–35 principles and methods, 127–41, 128FD time methods, 139–41, 139F, 140F Sankoff algorithm for weighted parsimony, 300–2, 301F threading, 533–4, 534F

E E-Cell Project, 668 EcoCyc database, 671, 673F EcoKI restriction enzyme, 420B EcoParse gene model, 375F, 376–7 Eddy, Sean, 293, 362, 363 edges see branches Efron, Bradley, 310B EGFR see epidermal growth factor receptor eigensamples, 633 Eisenberg hydrophobicity scale, 450 Elber, Ron, 532 electronic resonance, 31 electrostatic interactions, 33B, 704 EMAP modeling tool, 692 emergent properties, 669 emissions, 179, 181–2 eMOTIF, 213–15, 214F end state, 179, 180, 182–3, 183F energies free see free energy molecular, 700–8 potential see potential energy energy gradient, 528 energy minima, global, 524, 528–9 energy minimization, 527–8, 528F applied to homology modeling, 548, 559–60 Ensembl, 103, 403 enthalpy see potential energy entropy, 695–7 component of free energy, 525 relative, 125F, 126, 697 Shannon, 695–6


enzymes, 40 analogous, 244, 244F convergent evolution, 243–4, 244F phylogenetic analysis, 259–63 simulation modeling, 690F, 691–2, 691F epidermal growth factor receptor (EGFR), 436, 436B mitogen-activated protein kinase system, 683F pathway modeling, 681, 682F, 690 epitope, 555B ergodic systems, 717, 718–19 errors random, 627–8 systematic, 625, 627–8 type I, 653, 658 types and rates, 657–8 Erwinia carotovora, 262 Escherichia coli, 21, 378 detection of tRNA genes, 320–1, 320F EcoCyc database, 671, 673F EcoParse gene model, 375F, 376–7 engineered OROlac promoter, 676, 676F gene classification by codon usage, 370 GeneMark.hmm gene model, 375–6 genome segment annotations, 322, 323F heat shock response, 680, 680F length distributions of coding/noncoding regions, 374F, 375 promoters, 339–40 pyruvate formate-lyase, 467F pyruvate kinase, 480F robustness, 684 start codons, 366F, 367 ESPript, 93 ESTs see expressed sequence tags ESyPred3D, 554, 563, 563T Euclidean distance, 634–5, 636F Eukarya see eukaryotes eukaryotes, 14, 21–2, 21F control of translation, 19 exon prediction see exon prediction gene detection, 323–37, 323FD, 360 finding correct start codon, 327, 330F homology searching, 322 with only query sequence, 327–32 with query sequence and gene model, 332–4 sequence features used, 377–81, 378FD series of steps, 346T


using correct reading frame, 325–7, 325T, 328F, 329F using gene control signals, 381–9, 382FD using gene model and sequence similarity, 334–6 using genomes of related organisms, 336–7 variety of approaches, 324–5 vs methods used in prokaryotes, 377–9 gene models, 397–9, 398FD gene structure, 319, 325F intron prediction see intron prediction mRNA modifications, 18–19 origins, 292B promoter prediction, 339, 340–2 indefinite nature of results, 341, 341T online methods, 340–1 theoretical basis, 381–9 regulation of transcription, 15, 17–18, 17F splice site detection see splice sites, detection tRNA gene detection, 362–3 Eukaryotic Promoter Database (EPD), 339, 340 European Bioinformatics Institute (EMBL-EBI), 52, 55, 606 databases, 55–6, 60 E-values, 98 cut-off thresholds, 98–100, 99F, 101F PSSM construction, 176 statistical significance, 156 EVA program, 551T evolution, 5, 20–3, 20FD aiding sequence analysis, 38 basic concepts of molecular, 235–48, 235FD convergent, 74–5, 75B, 243–4, 244F Darwinian concept, 235 divergent, 75B gene level, 239–47 genome level, 247–8 minimum see minimum evolution nucleotide level, 236–9 evolutionary clustering algorithms, 646–7, 646F evolutionary distance, 81, 199, 224–5 see also p-distance additive phylogenetic trees, 228, 229F calculation, 268–76, 269F evaluating tree topologies using, 293–7 PAM matrices and, 84 sources of errors, 277

tree construction, 251–2, 276–9, 277FD evolutionary history phylogenetic trees see phylogenetic trees recovering, 223–64, 224MM evolutionary models practical application, 251–5, 253T selection of appropriate, 253–5, 254F, 256T sequence alignment, 117–19 theoretical basis, 268–76 time-reversible, 302 evolutionary trace method, identifying binding sites, 586–7, 587F exclusive classification, 637–8, 638F exon prediction, 319, 323–37 assessing accuracy, 343–6, 343F, 344F, 392B with only query sequence, 327–32 with query sequence and gene model, 332–4 theoretical basis, 379–81, 389–97, 391FD using correct reading frame, 325–7, 325T, 328F, 329F, 391–2 using gene model and sequence similarity, 334–6 using general sequence properties, 390–2 using genomes of related organisms, 336–7 using homology searches, 397 variety of approaches, 324–5 exons, 18, 18F, 19 initial and terminal, detection, 390, 396–7 length distributions, 379, 379F translating predicted, 343, 344F use of term, 379–80 ExPASy program, 345, 412, 620 expectation maximization (EM), 191, 216 expectation values see E-values expected number of offspring (EO), 209 expected score, 119, 126 see also E-values explicit state duration hidden Markov model (HMM), 374 expressed (genes), 11 see also gene expression expressed sequence tags (ESTs), 321B databases, 56, 103 digital differential display (DDD), 605–6, 605F exon prediction using, 397 gene-prediction methods using, 334–5


expression level ratios, 628–30, 629F, 630F in different samples, 652 log transformation, 629–30, 630F eXtensible Markup Language (XML), 50–1 external nodes, 226, 227F extracellular matrix (ECM), modeling tumor invasion, 677, 677F Extreme Pathways, 678 extreme-value distribution, 97–8, 155–6, 155F extrinsic classification, 638 extrinsic gene detection methods, 361, 368FD eye, gene expression patterns, 607F, 608

F false discovery error rate (FDR), 658, 659 false negatives in gene prediction, 365B in sequence analysis, 212 false positives in gene prediction, 365B in sequence analysis, 212 statistical tests, 653 families, protein see protein families family-wise error rate (FWER), 658, 659 Fano definition of mutual information, 481 Fasman, Gerald, 472 FASTA program, 95 algorithmic approximations, 141 chaining, 144–6 comparing nucleotide with protein sequences, 150–3 database searching method, 143, 144–6, 145F E-values, 98, 100, 101F, 156 restriction of matrix coverage, 140 versions available, 95–6, 96T whole genome alignments, 157–9 fast Fourier transform (FFT), 206 FATCAT program, 579–80, 580F feedback control, 680, 680F feedforward control, 680, 680F Felsenstein, Joseph, 253, 275 Felsenstein 81 (F81) model, 253, 253T, 254F, 256T Felsenstein zone (long-branch attraction), 292, 308–9, 309F Ferrell, J.E., 689F FGENESH program, 332, 333–4, 334F comparative results, 331T, 332T, 333F rice genome prediction, 335B

fibrin, 451–2 fibrous proteins, 41, 435 fields (database), 46–7 fingerprints, multiple motif, 109 finite-state automata (FSA), 147–50, 147F, 148F vs hidden Markov models, 147, 179, 180–1 FirstEF, 332, 396–7 Fitch algorithm see post-order traversal Fitch–Margoliash method, 250, 251T evaluating tree topologies, 293, 297 generating single trees, 279–80, 280F, 281F vs neighbor-joining, 282, 284F, 285 fitness, 235 evolutionary clustering, 646–7, 646F flavin adenine dinucleotide (FAD), 259B, 260, 261F, 262 flavodoxin family, 573F Fletcher–Reeves formula, 714 Flicker program, 614, 620, 620F Flux Balance Analysis (FBA), 678 FoldIndex method, 513 folding, protein see protein folding folding funnel, 525 fold recognition see threading folds, protein see protein folds force fields, 522, 524–9, 701–5 additive, 701 class I and II, 702 nonadditive, 701 forward algorithm, 190 fractional alignment difference, 269 frameshift, 150 Franklin, Rosalind, 7, 7F free energy folded proteins, 41–2 RNA secondary structures, 456, 457–8 surface, molecular systems, 525, 525F free insertion modules (FIMs), 184–5 fructose-1,6-bisphosphate aldolases (FBPAs), 569F, 570, 570F FSSP database, 574, 578–9 Fuchs, Patrick, 475 FUGUE program, 532, 535–6, 536F fully resolved trees, 227 function (protein and gene), 40–1 see also structure–function relationships conservation, 568–74, 568FD evolution, 242, 243–4 genome annotation, 400–3 orthologs, 239, 243 patterns and, 109–11 phylogenetic trees for predicting, 262

protein folding and, 40–1, 41F using orthologs to predict, 245 functional homology, 569–70, 569F, 570F function optimization see optimization, function FunSiteP algorithm, 340, 341, 341T fusion gene, 72 genome, 292B

G Gamma distance (correction), 239, 269F, 270 Gamma distribution (Γ), 269F, 270 evolutionary model variation, 253T, 254F gap extension penalty (GEP), 85, 127 gap insertion operator, 210–11, 211F gap opening penalty (GOP), 127, 202, 202F gap penalties, 85–6, 87, 126–7 global alignments, 131F, 132–5, 132F, 134F local alignments, 137 manual adjustment, 93 multiple alignments, 202, 205, 206 position-specific scoring matrices, 170, 177 suboptimal alignments, 137F, 139 gaps, 74 inserting, 85–7 in multiple alignments, 204, 205F scoring, 126–7 Garnier, J, 422 Gaussian distributions see normal distributions GAZE program, 399, 402F GC box, detection algorithms, 383, 384–5 GC content bacterial genomes, 238F, 239 evolutionary models and, 273 promoter prediction using, 386, 387F regions of different (isochores), 275, 378 GenBank, 55–6, 102–3 flat-file format, 47, 47F sample extract, 57F gene(s), 5, 10–11 evolution, 239–47 families see protein families function see function fusion, 72 nested, 399 nonfunctional, 242 overlapping, 12, 12F, 360 prokaryotic vs eukaryotic, 377–9


structure and control, 14–20, 15FD, 318–19 structure in eukaryotes, 319, 324 GeneBee program, 457F, 458 GeneBuilder program, 331T, 332T, 335, 336 GeneCluster2 program, 608 gene duplication, 73, 239–42, 242F acetolactate synthase (ALS), 262, 263F effects on phylogenetic analyses, 245 identified from synonymous mutations, 241B phylogenetic trees, 226, 231F structure–function relationships, 570 use for rooting trees, 292–3 gene expression, 11 co-expression, 600 databases, 58 digital differential display (DDD), 605–6, 605F microarrays, 602–4, 603F see also DNA microarrays patterns, 638, 639F SAGE method, 604–5, 604F sample classification, 659–62, 660FD uses of clustered data, 610–11, 611F gene expression analysis, 599–600, 600MM, 601–11, 601FD clustering methods see clustering methods data preparation for, 626–33, 627F, 627FD statistics, 652–9 gene loss, 242–3, 243F effects on phylogenetic analyses, 245 GeneMark algorithm, 328–9, 368–70 comparative results, 331–2, 331T, 332T GeneMark.hmm algorithm, 373–6, 374F gene models, eukaryotic, 397–9, 398FD Gene Ontology, 54, 348 gene ontology evaluating validity of clusters, 651 genome annotation, 348–52, 402 gene prediction (detection), 317–46, 318MM assessing accuracy, 342–6, 342FD at exon level, 343, 344F, 392B at nucleotide level, 343, 343F, 365–6B at protein level, 343–6, 345F eukaryotes see under eukaryotes


evaluation and reevaluation of methods, 405 exon prediction see exon prediction further analysis, 399–405, 400FD intrinsic and extrinsic methods, 361, 368FD intron prediction see intron prediction potential for errors, 65 preliminary steps, 318–22, 319FD prokaryotes see under prokaryotes promoter region, 338–42, 381–9 splice site detection see splice sites, detection theoretical basis, 357–99, 358MM general time-reversible model (GTR or REV), 253T, 255, 262 general transcription initiation factors, 17 see also transcription factors generation, 209 GeneSplicer program, 394–5 genetic algorithms cluster analysis, 646–7, 646F docking, 591–2, 592F function optimization, 709, 716–18, 716F multiple sequence alignment (SAGA), 209–11, 210F, 211F genetic code, 11, 12–13, 12T degeneracy, 13 genetic distance, 224–5, 232F see also evolutionary distance gene (phylogenetic) trees, 226, 230, 231F combined with species trees, 243, 244F reconstruction example, 259–63, 261F, 263F GeneWalker program, 331T, 332T, 335–6 GeneWise program, 345–6 Genie program, 329F, 386 Geno3D program, 554, 563, 563T genome(s), 4, 10 comparisons see genome sequence alignments completely sequenced, 71 databases, 56, 103 evolution, 247–8 fusion, 292B identifying features, 317–54, 318MM known prokaryotic, 324T problems of defining, 23B genome annotation, 65, 399–405 see also gene prediction comparing genomes to check accuracy, 353–4, 353F, 354F, 403–5, 403F, 404F

E. coli segment, 322, 323F evaluation and reevaluation, 405 functional, 400–3 pathway information aiding, 348, 349–50F pipeline approach, 319 practical aspects, 346–52, 347FD quality of information used, 403 role of gene ontology, 348–52, 402 theoretical basis, 357–9, 358MM Genome Browser, 352, 352F GenomeNet, 84 GenomeScan, 397 genome sequence alignments to verify annotation, 353–4, 353F, 354F, 403–5, 403F, 404F whole genomes, 156–9, 157FD genome sequences excluding noncoding regions, 319–21 gene prediction from see gene prediction preliminary examination, 318–22, 319FD splitting, 319 genome sequencing, 71 multiple genomes, 376B shotgun procedure, 376B genomic imprinting, 7 genomics functional, 600 role in systems biology, 668 structural, 569 GenScan program, 334 comparative results, 331T, 332T, 336 exon detection, 390 promoter detection, 385, 385F splice site prediction, 394, 395F, 396 transcription stop signal detection, 389 translation start site detection, 389 use of gene models, 398–9, 401F use of homology searches, 397 GenTHREADER, 532–3, 534–5, 535F, 536F GEPASI, 691–2, 691F GES (Goldman, Engelman and Steitz) hydrophobicity scale, 438, 475, 477T Gibbs program, 215–17 Gleevec®, 593 GLIMMER program, 323, 371–2 global alignments, 88–9, 89F large genome sequences, 352F, 353 optimal, 128, 129–35, 129F, 130F, 131F score significance, 154 time saving methods of deriving, 139–41, 139F, 140F


global–local dynamic programming, 533 globular proteins, 41 length distributions of secondary structures, 467, 468F secondary structure prediction, 509 secondary structures, 463 gluconeogenesis pathway, 348, 349–50F glycolytic pathway, 671, 672F E. coli, 673F interactions, 673F modularity, 686F, 687F glycosylphosphatidylinositol (GPI) anchors, 513–14, 513F Godzik, Adam, 491 Gojobori, Takashi, 240B GOLD program, 591–2, 592F GOR methods, 414, 422–5, 425F, 472–3 accuracy, 422, 423, 424T, 484 derivation, 480–4, 482F version III, 483, 484F version IV, 423–5, 427F, 483 version V, 423–5, 425–6, 426F, 483 Gotoh, Osamu, 206 GPI-SOM method, 513–14, 513F G-protein-coupled receptors, 436, 436B GrailEXP program, 331T, 332T, 334–5, 336 Grail program, 323, 386, 387F, 389, 399 greedy alignment methods, 199 greedy permutation encoding method, 646–7 Greek Key structure, 40B GRID program, 591 GRIN program, 591 Grishin, Nick, 466 growth factors, 616–17, 617F guanine (G), 6, 6F guide tree, 90, 199–200 construction, 204–6, 205F multiple alignment from, 206, 206F pattern discovery, 214 Guigo, Roderic, 365–6B, 392B Gumbel extreme-value distribution see extreme-value distribution

H HbP method, 491–2, 492F, 493F Haemophilus influenzae, 371 hairpins, 36–7 harmonic approximation, 526, 702–3, 702F hashing, 95 theoretical basis, 143–6 whole genome sequences, 158

heart cellular modeling, 685T modeling of function, 677, 678F heat shock response, E. coli, 680, 680F helical wheels, 439F, 440–1, 448 helices, 435 see also 3₁₀-helices; α-helices; π-helices; transmembrane helices helix tails, 441 hemagglutinin, 34, 486, 486F hemoglobin, 43, 43F Henikoff, Steven and Jorja, 122, 171F heptads, 451, 451F, 510 Hessian, 714–15 hexamers (hexanucleotides) see dicodons HHsearch, 195F, 196 hidden layers, 431, 431F, 494, 499 hidden Markov models (HMMs), 166, 179, 179FD with duration, or explicit state duration, 374–6 EcoParse gene model, 375F, 376–7 exon prediction, 328, 332 GAZE gene model, 402F GeneMark.hmm algorithm, 374–6, 374F genome annotation, 359 GenScan gene model, 399, 401F multiple sequence alignments, 200, 203–4 profile see profile hidden Markov models secondary structure prediction, 504–10, 506FD transmembrane protein prediction, 446–7, 446F, 451 vs finite-state automata, 147, 179, 180–1 hidden neural networks (HNN), 509 hierarchical clustering, 638, 639–41 see also UPGMA method gene expression microarray data, 606–8, 606F, 607F protein expression data, 616–17, 617F, 618F vs other clustering methods, 643B hierarchical likelihood ratio test (hLRT), 253, 254F, 255 Higgins, Desmond, 209 high-scoring segment pairs (HSPs), 141, 149 Hinton diagram, 499F histone deposition protein, 571F HIV (HIV-1), 337B drug design, 589B protease (HIV-PR), 551–2, 552F HKY85 model, 253T, 254F, 256T, 273 HMM see hidden Markov models

HMMER2 program, 185 HMMGene program, 331T, 332, 332T, 333 HMMTOP program, 441F, 446–7, 448F, 506–7, 507F Hochberg, Yosef, 659 Hollerith, Herman, 48, 48F homolog methods see nearest-neighbor methods homologous genes chicken, human and puffer fish genomes, 245, 246F evolution, 239–42, 242F homologous proteins, 38 see also protein families alignment, 38, 74 secondary structure prediction, 416, 418–19, 419F homologous sequences see also sequence alignment cut-off points for identifying, 81 identifying, 74–6 inserting gaps, 85–7 scoring alignments, 76–81 searching databases see searching sequence databases secondary structure prediction using, 425–6, 484–5, 489–90, 502–3 homology exon prediction based on, 397 functional, 569–70, 569F, 570F gene prediction based on, 320F, 321B, 322, 372–3 homology modeling (3D protein structure), 522–3, 537–64, 538FD assumptions, 541–2 automated, 541, 552–6, 553FD, 561–3 checking for accuracy, 549–51, 550F, 551T, 560, 560F energy minimization, 548, 559–60 history, 538–9, 538F loops, 545–6, 546F, 547F, 559, 559F manual or semi-manual, 557–61 molecular dynamics, 548 mTOR protein, 563, 563T multidomain proteins, 564 PI3 kinase p110α, 557–63 principles, 537–42 sequence length cut-offs, 540–1, 542F sequence similarity thresholds, 539–40, 541F steps, 540F, 542–52, 543FD structurally conserved regions (SCRs), 544–5, 545F, 554 trustworthiness, 551–2 Web-based servers, 554, 561–3 homoplasy, 244

horizontal gene transfer (HGT), 246–7, 246F, 247F, 292B Hsp60, 249 HSSP database, 490 HTML (hypertext markup language), 50–1 human immunodeficiency virus see HIV Hutchinson, Gail, 475 Hutchinson, Gordon, 387 hybridization, 9, 602 hydrogen bonds DNA, 7, 8 energetics, 525–6, 701 peptide bonds, 29, 32, 32B protein folds, 42 RNA, 456 secondary protein structure, 34, 35, 35F, 36F defining, for prediction algorithms, 464–5, 465F nonidealized patterns, 463–4 hydropathic (hydrophobicity) profiles, 439, 442 hydrophilic amino acid residues, 29, 30F transmembrane proteins, 439F, 440–1 hydrophilic regions, folded proteins, 41 hydrophobic amino acid residues, 29, 30F hydrophobic cluster analysis (HCA), 110–11, 110F hydrophobicity scales, 437–9, 450, 475, 477T hydrophobic moment, 440 hydrophobic regions folded proteins, 41, 42 indicating binding sites, 583 transmembrane proteins, 437–41, 439F hyperplanes, separating, 661, 662, 662F hypertext markup language (HTML), 50–1 HyPhy program, 255 hypothetical proteins, 65, 348 conserved, 348

I identity, 76 percent/percentage see percent/percentage identity visual assessment, 77–8, 77F, 79F immunoglobulin folds, 571F immunoglobulins, 381, 555–6B importin α, 480F imprinting, genomic, 7

indels, 85, 117 see also deletions; insertions indexing techniques, 141–6 see also hashing; suffix trees whole genome sequences, 157–9 influenza virus hemagglutinin, 34, 486, 486F rational drug design, 589B, 591 information directional, 423, 482 mutual, 697 pair, 423, 482 Shannon entropy and, 696 information theory approach, secondary structure prediction, 422–5, 480–4 informative sites, 298 ingroups, 230 inhomogeneous Markov chain (IMC) models, 328, 368–70 initiator (Inr) see cap signal input, 431, 494 input layer, 430, 494 insertions accounting for, in sequence alignment, 85–7 alignment scoring schemes, 117, 126–7 homology modeling, 542, 545–6, 545F threading and, 532, 537 integral membrane proteins see transmembrane proteins integrative approach, 670F intermediate alignment, 198, 204–5, 205F intermediate sequences, 97 internal nodes, 226, 227F Internet, access to databases via, 52 interpolated Markov models, 371–2, 388 intrinsic classification, 638 intrinsic gene detection methods, 361, 368FD intron prediction, 319, 323, 379–81 approaches used, 324–5 theoretical basis, 389–97, 391FD introns, 18–19, 18F see also splice sites AT–AC or U12, 19, 392 branch point, 18–19, 396 length distributions, 379, 379F invariable sites, 298 inverse protein folding, 530–1 inversion, sequence, 158F I-sites library, 487–8, 487F isochores, 275, 378 isoelectric focusing (IEF), 613 iterated sequence search (ISS), 168 iterative alignment, 198, 206, 206F

J Jarnac, 690F, 691–2 JC model see Jukes–Cantor model Jnet program, 424T, 434, 435F Jones, David, 276, 503 JTEF program, 397 JTT matrix, 276 Jukes, Thomas, 271 Jukes–Cantor (JC) model, 253T, 271–2 evaluation using maximum likelihood, 302, 303–4 example distance corrections, 252F examples of constructed trees, 256, 258F, 261F, 262 Gamma distribution applied to (JC+G), 273 more complex models based on, 272–3 synonymous/nonsynonymous mutations, 241B testing for suitability, 253, 254F, 256T junk DNA, 22B, 336, 378–9 jury decision neural networks, 432, 501 jury voting technique, 485 JWS Online Cellular System Modeling, 692

K Kabat database, 103 Kabsch, Wolfgang, 464–5 Katoh, Kazutaka, 206 KD hydrophobicity scale, 475, 477T, 479F Kendrew, John Cowdery, 538F keratins, 451 keys, 49, 49F Kihara, Daisuke, 480 Kimura-two-parameter (K2P or K80) model, 253, 253T, 272–3 practical application, 261F, 262 transition/transversion ratio calculation, 274–5B Kimura-three-parameter (K3P or K81) model, 253, 253T kinetic energy, 718 kinetic models, 678, 690F kinetic parameters, biological networks, 674 k-means clustering, 608, 641–2, 642F vs other clustering methods, 643B k-mers, 141, 147, 199–200 k-nearest-neighbor method, sample classification, 660–1 knockout mice, 688 knowledge-based methods modeling 3D protein structure see homology modeling

secondary structure prediction, 414–15, 421–30 transmembrane protein prediction, 443 knowledge-based scoring, 590 KOG database, 243, 245B Kohonen networks see self-organizing maps Krebs cycle see tricarboxylic acid cycle Krogh, Anders, 500, 501F, 502–3 k-tuples, 95, 141, 143–4, 147 whole genome sequences, 158–9 Kullback–Leibler distance see relative entropy kuru, 101, 101B Kyoto Encyclopedia of Genes and Genomes (KEGG), 348, 671, 672F Kyte–Doolittle (KD) hydrophobicity scale, 475, 477T, 479F

L L2L tool, 611 Laboratory Information Management System (LIMS), 600 LAGAN method, 352F, 353 Lake, James, 292B LAMA program, 106 alignment of PSSMs, 193–5, 194F Lander, Eric, 488, 491 lariat RNA, 18–19, 18F last common ancestor, 227, 227F last universal common ancestor, 293 lateral gene transfer (LGT) see horizontal gene transfer layers, neural networks, 430–1, 431F, 494–5 learning supervised, 497B, 638 unsupervised, 638, 644 learning rate, 497B least-squares method, 250 Bryant and Waddell version, 296, 296F evaluating tree topologies, 294–6, 295F, 297 leaves, 226, 227F LEGO® system, 686, 688F length distributions α-helices and β-strands, 467, 468F prokaryotic coding/noncoding regions, 374–5, 374F vertebrate introns and exons, 379, 379F Lennard–Jones terms, 705, 705F leucine zipper, 413, 451 prediction, 453–4, 455F Levitt, Michael, 195 LIBRA, 536, 537F

library extension, COFFEE scoring scheme, 203, 204F ligands docking procedures, 587–93, 588FD drug design methods, 588, 589B identifying candidate, 590 likelihood ratio test, hierarchical (hLRT), 253, 254F, 255 linear discriminant analysis (LDA) promoter prediction, 340, 388, 389F secondary structure prediction, 512–13 linear gap penalties, 126–7 global alignments, 131F, 132–3 local alignments, 137 suboptimal alignments, 137F, 139 links (in databases), 52, 53 lipopolysaccharide (LPS), 608, 609F, 674F liquid chromatography, 623 local alignments, 88–9, 89F dynamic programming algorithm, 135–9, 136F gapped, score statistics, 153, 156 multiple alignment using, 92–3, 93F optimal, 135–7, 136F profile hidden Markov model, 183–4, 184F suboptimal, 137–9, 137F ungapped, score statistics, 153, 155–6 log-likelihoods amino acid propensities, 476, 478F evolutionary models, 254F, 256T multiple alignments, 192, 216 log-odds ratio, 118–19, 169–70 log-odds scores, 188–90 logos, 177 aligned HMMs, 196, 196F patterns, 213 PSSMs, 106F, 177–8, 178F log ratios defining distances between, 634–7 expression data, 629–30, 630F t-test, 654 z-test, 653–4 long-branch attraction see Felsenstein zone LOOPP program, 532–3, 533F, 535–6, 536F loops, 36–7 amino acid residue preferences, 202 homology modeling and, 542, 545–6, 546F, 547F, 559, 559F transmembrane proteins, prediction, 506, 508 Loopy program, 561 low-complexity regions, 100–2, 151–2B see also repeat sequences

Lowe, Todd, 362 lowess normalization, 630–1, 631F LUDI program, 591 lysozyme, 538–9, 539F

M M (mutation probability matrix), 120, 121–2 machine-learning methods, 430 see also neural network methods secondary structure prediction, 414, 415–16 Macromolecular Structure Database (MSD), 52, 60, 64 macrophages, 608, 609F, 674F MAFFT method, 199–200, 206 Mahalanobis distance, 636–7 main chain see backbone Major, Francois, 466 major histocompatibility complex (MHC) proteins, 593 majority-rule consensus trees, 234F, 235 majority voting technique, 485 Mann–Whitney U test, 656–7 MAO (multiple alignment ontology), 54F, 55 MARCOIL, 510, 510F Markov chain Monte Carlo (MCMC), 307 Markov chains, 368–9 first order, splice site prediction, 394–5 Markov models, 179 see also hidden Markov models; inhomogeneous Markov chain models fifth order, 368–70, 370F interpolated, 371–2, 388 splice site prediction, 394–5 used by GeneMark, 369, 370F MASCOT program, 622–3 mass spectrometry (MS), 600, 621–3 protein identification, 621–3, 622F protein quantitation, 623 MAST program, 106 mathematical modeling of biological systems, 689–92, 689FD approaches, 674–7, 676F model databases, 692 model structure, 679–83, 679FD specialized programs, 690F, 691–2, 691F standardized languages, 692 Matthews correlation coefficient, 469–70 maximal dependence decomposition (MDD), 394, 395F

maximal segment pair (MSP), 141, 149 maximum likelihood (ML), 250, 251T, 286 evaluating tree topologies, 302–5, 302F, 303F, 304F hidden Markov model parameterization, 191 inference of parameter values, 698 measure of optimality, 287 practical application, 255–6, 257F, 262, 263F testing for suitability, 253 maximum parsimony, 250, 251T, 286 branch-and-bound technique, 288 long-branch attraction problem, 309, 309F measure of optimality, 287 unweighted, 297–300, 299F weighted, 300–2, 300F, 301F McClintock, Barbara, 337B McPromoter program, 388 mean(s), 626, 652 comparison between two, 652–5 MEGA3 program, 250, 260 Melanie program, 614–15, 620 membrane proteins, 436–7, 462 see also transmembrane proteins interactions with membrane, 437, 437F secondary structure prediction, 468 MEME program, 105–7, 107F, 215–17 MEMSAT program, 443, 475–6, 479 messenger RNA (mRNA), 11 analysis of transcribed see gene expression analysis capping, 18 genetic code, 12–13, 12T polyadenylation, 18 reading frames, 13, 13F secondary structure, 455 splicing see RNA splicing synthesis see transcription translation see translation metabolic models, 678 metabolic pathways databases as sources, 671, 672F, 673F modeling interactions, 681–3, 682F modularity, 685, 686F, 687F simulation programs, 690F, 691–2, 691F methylation, 6–7 MFOLD program, 457F, 458 MIAME (Minimum Information About a Microarray Experiment), 64, 606 Michener, Charles, 278 microarray databases, 58, 60F applications, 610–11, 612F data standards, 64, 606

Microarray Gene Expression Data (MGED), 54–5, 606 MicroArray Quality Control (MAQC) project, 606 microarrays, 602 DNA see DNA microarrays protein, 621 middle-out approach, modeling biological systems, 677, 678F midnight zone, 81 minimum evolution, 250, 282 methods, 250, 251T, 297 MIRIAM standard, 692 mitochondria, 22, 292B, 367 modeling biological systems see mathematical modeling of biological systems modeling (tertiary) protein structure, 521–65, 522MM ab initio approach, 522, 523B assessment of predicted structure, 554–6 comparative, homology or knowledge-based see homology modeling potential energy functions and force fields, 524–9, 524FD ROSETTA/HMMSTR method, 523B threading (fold recognition) see threading MODELLER program, 535, 541, 552, 553, 554F model surgery, 182 ModelTest, 255 modularity, biological systems, 685–6 modules, 680, 681F, 685–6, 686F Molecular Biology Database Collection, 55, 56F molecular clock, 229–30, 278 hypothesis, 250 molecular configuration, 33B molecular dynamics, 528–9 function optimization, 718–19 homology modeling, 548 molecular energy functions, 700–8 see also bonding terms; nonbonding terms force fields for intra- and intermolecular interactions, 701–5 potentials used in threading, 706–8 molecular evolution, 235–48, 235FD Molecular INTeraction database (MINT), 58 molecular mechanics, 524–9 molecular modeling, ligand binding, 588, 589B molecular models, 39, 39F MolIDE, 542, 557–8, 560–1, 561F MolProbity program, 527, 549, 551T

monophyletic (groups), 231, 255–6, 258 Monte Carlo methods see also Markov chain Monte Carlo docking, 590 function optimization, 716–18, 716F modeling protein structure, 523B Morse potential, 702F, 703 MOTIF program, 217 motifs, 103–9, 412 see also patterns automated generation, 105–7, 106F, 107F creating databases, 104–5 searching for, 103–4, 107–8 MrAIC script, 255 mRNA see messenger RNA mTOR protein, 563, 563T MULTICOIL program, 453 multidomain proteins, 41 3D structural modeling, 537, 564 sequence alignment, 88, 88F multifurcating trees, 227, 233, 233F Multi-LAGAN method, 353 multiple alignment, 89–93 applications, 90 construction methods, 90–1, 196–211 discovering patterns, 213–15 divide-and-conquer method, 91, 91F by gradual sequence addition, 196–206, 197FD manual refinement, 93 methods not using pairwise alignment, 207–11, 207FD phylogenetic tree reconstruction using, 250–1, 255, 260 secondary structure prediction using, 425–7, 427F from series of local alignments, 92–3, 93F theory, 165–7, 166MM transmembrane protein prediction using, 444, 445 value for sequences of low similarity, 91–2, 92F vs pairwise alignments, 90, 166–7 multiple alignment ontology (MAO), 54F, 55 multiple linear regression, 514 MUMmer method, 159 MUSCLE method, 199–200, 206 mutation data matrices (MDMs), Dayhoff see PAM matrices mutation probability matrix (M), 120, 121–2 mutation rates codon position and, 238–9, 238F

estimating and predicting, 236, 237F type of base substitution and, 236–8, 238F mutations, 22–3 accepted, 84 masking sequence similarities, 72, 73–4 selective pressures on, 240–1B synonymous/nonsynonymous, 238, 240–1B, 245 transition and transversion, 237–8, 238F mutual information, 697 Mycoplasma, 684 myoglobin, sperm whale, 538F myosin II, 451 MZEF program, 328 comparative results, 331–2, 331T, 332T scores used, 331T

N N-acetylneuraminate lyase gene, 247F National Center for Biotechnology Information (NCBI), 52, 55 dbEST, 56, 321B GEO, 606 Protein Database, 56–8 SAGE analysis programs, 605 UniGene database, 103, 605–6, 605F native structure or state (of proteins), 522 NCBI see National Center for Biotechnology Information nearest-neighbor interchange (NNI) method, 289–90, 289B nearest-neighbor methods, 414–15, 428–30, 485–92, 485FD misfolding proteins, 491–2, 492F, 493F outline, 486, 487F sample classification, 660–1 similarity measures used, 488–90, 490F weighting of predictions, 490–1 Needleman, S.B., 87, 128 Needleman–Wunsch algorithm, 87, 128 database search programs using, 95 discarding intermediate calculations, 138B extension to multiple alignments, 199 illustration of original, 135, 135F more efficient variations, 129–35, 129F, 130F negative selection, 240–1B Nei, Masatoshi, 240B, 282

neighbor-joining (NJ) method, 250, 251T, 252–3 generating single trees, 282–5, 282F, 284F multiple alignment, 199, 200 practical application, 261F, 262 variants, 285 Nei–Gojobori method, 240–1B Neisseria meningitidis, 348, 350F nested genes, 399 NetPhos server, 110 NetPlantGene program, 390–1, 393F, 395–6 networks see also neural networks; systems, biological architectures, 676, 677F biological, 670–1 information for constructing, 671–4 kinetic models, 678 mathematical modeling approaches, 674–7 mathematical representation of interactions, 680–3 scale-free, 676 neural network methods exon prediction, 334–5, 390–1 genome annotation, 359 promoter prediction, 340, 385–6, 386F, 387F secondary structure prediction, 415–16, 430–4, 430FD assessing reliability, 432 Qian and Sejnowski studies, 496–9, 499F, 500F Riis and Krogh methods, 500–1, 501F, 502–3 theoretical basis, 492–504, 493FD transmembrane proteins, 445 using homologous sequences, 502–3 Web-based programs using, 432–4 splice site prediction, 395–6 neural networks, 430–2 GenTHREADER, 534–5, 535F Kohonen see self-organizing maps layered feed-forward, 494–502, 495F more complex, 503–4, 504F, 505F multilayer, 431, 431F training process, 496, 497–8B two-layered, 430–1, 431F neuraminidase, 589B Nevill-Manning, Craig, 213 Newick or New Hampshire format, 231–2 Newton–Raphson method, 528 NMR see nuclear magnetic resonance

NNPP program, 340, 341T, 385–6, 386F NNSSP program, 424T, 433, 488–9, 490, 491 nodes neural networks see units, neural network phylogenetic trees, 226, 227F self-organizing maps, 608, 608F, 644, 644F self-organizing tree algorithms, 648, 648F nonbonding terms, 525–6, 701, 704–5 noncoding DNA see junk DNA noncoding RNA (ncRNA) genes, detection, 319–21, 361–3 noncoding strand, 11 nonlinearity, 667 nonparametric tests, 656–7 nonrandom model, sequence alignment, 117–19 nonredundant database, 63 nonsynonymous mutations, 239, 240–1B, 245 normal distributions, 626, 628F, 698 statistical tests, 653–5 normalization data, 627–31, 628F, 630F lowess, 630–1, 631F Notredame, Cedric, 209 N terminus, 29 nuclear magnetic resonance (NMR), 411, 521 nucleic acid sequences see nucleotide sequences Nucleic Acids Research (NAR), 55, 56F nucleic acid world, 3–23, 4MM see also DNA; RNA nucleotides, 5–6, 6F nucleotide sequences, 5, 6 see also DNA sequences; RNA sequences base composition variations, 275–6 comparison with protein sequences, 150–3 databases, 55–6, 57F, 58 derivation of scoring matrices, 124F, 125 detection of homology, 75–6 evolutionary changes, 236–9 evolutionary models, 271–2 large-scale rearrangements see rearrangements, large-scale low-complexity regions, 151B scoring of alignment, 76–7, 80–1 searching with, 97–103 null distribution, 656 null model, 189–90 NVT ensemble, 718

O object-oriented databases, 48, 51 odds ratio, 118 Ohler, Uwe, 388 oligomeric proteins, 42–3 one-tailed test, 653 Online Mendelian Inheritance in Man (OMIM) Web site, 352 ontologies, 54–5, 54F, 64 gene see gene ontology open reading frames (ORFs), 13, 318, 367 compared to eukaryotic genes, 377–8 hypothetical proteins, 348 identifying, 318–19, 359–60 practical aspects, 322–3 theoretical basis, 364, 371, 372–3 minimum and maximum sizes, 328, 405 orphan (ORFans), 405 potential, 364 operational taxonomic units (OTUs), 225 operons, 19–20, 19F, 319, 341 optimal alignments, 76, 128 extreme-value distribution, 155, 155F global, 128, 129–35, 129F, 130F, 131F local, 135–7, 136F score significance, 153–6, 154FD optimization, function, 709–19, 709F full search methods, 710 global, 715–19, 715F local, 710–15 ordinary differential equations (ODEs), 683 ORFs see open reading frames Organismic System Theory, 667 orphan ORFs (ORFans), 405 ORPHEUS program, 323, 372–3 orthogonal encoding, 496 orthologous genes, 239, 242F chicken, human and puffer fish genomes, 245, 246F to construct species trees, 239–47 identifying, 243, 245B large-scale rearrangements and, 248 orthologous sequences (orthologs), 223 Osguthorpe, David, 422 outgroups, 229F, 230, 258, 291–2 output, 680 output expansion, 500 output layer, 430, 494 overall alignment score, 80 overlapping classification, 638 overlapping genes, 12, 12F, 360 overtraining, neural networks, 498B

OWL database, 109 oxygen, molecular (O₂), 684–5

P p53 protein, 580–2, 581F, 582F identifying interaction sites, 584–5, 584F, 587, 587F module, apoptotic pathway, 680, 681F Pacific Northwest National Laboratory (PNNL), 668 PAIRCOIL program, 453 paired-site tests, 311 pair information, 423, 482 pairwise alignment, 89, 115–61, 116MM alignment score significance, 153–6 complete genome sequences, 156–9 discarding intermediate calculations, 138B dynamic programming algorithms, 127–41, 128FD indexing techniques and algorithmic approximations, 141–53, 142FD inserting gaps, 86, 86F multiple alignments based on, 196–206, 197FD secondary structure prediction method using, 430 substitution matrices and scoring, 117–27, 117FD vs multiple alignment, 90, 166–7 pairwise contact potential (PCP), 533 PALSSE method, 466, 466F, 467, 467F, 468 PAM matrices, 82–4, 83F derivation, 119–22, 119F evolutionary model incorporation, 276 PET91 version of PAM250, 121F, 122 selection, 84, 85 summary score measures, 125F, 126 vs percentage identity, 120F, 121 paralogous genes (paralogs), 239–42, 242F identifying, 243, 245B parameters Bayesian inference, 698 system, 678, 679, 679F Parisien, Marc, 466 parsimony methods see unweighted parsimony partially resolved tree, 227 partitional classification, 638 partition function, 706, 707, 716

partitions see also splits clustering methods, 637, 638 hierarchical clustering, 639–41 k-means clustering, 641–2 phylogenetic trees, 231 parvalbumin (1B8C), 421F, 422F path, 179 pathogenicity islands, 342, 402–3 pathways, metabolic see metabolic pathways patristic distances, 294 PatternHunter program, 159 patterns, 103–11, 104FD, 151B see also motifs automated generation, 105–7, 106F, 107F creating databases, 104–5 discovery, 165, 166MM, 211–18, 212FD protein function and, 109–11 searching for, 103–4, 107–9, 108F, 109F Pavesi, Angelo, 362–3 PDB see Protein Data Bank PDB_SELECT, 416–17, 473 p-distance, 236, 237F, 268–9 effects of correction, 252F Gamma correction, 269F, 270, 270F phylogenetic tree reconstruction, 251–2 Poisson correction, 269F, 270 Pearson, William, 144 Pearson correlation coefficient, 194, 635–6, 636F peptide bonds, 29–33, 31F trans and cis conformations, 32, 33F percent/percentage identity, 76–7 BLOSUM matrices and, 84 homology modeling and, 540–1, 541F, 542F limitations, 79–81 minimum acceptable, 81 PAM matrices, 120F, 121 percent similarity, 80–1 perceptrons, 430–1, 494 per-comparison error rate (PCER), 658 per-family error rate (PFER), 658 periodicity, 151B PET91 matrix, 121F, 122 Petersen, Thomas, 499–500, 501 Pfam database, 109 phages, sequenced genomes, 324T PHAT matrix, 84 PHDhtm program, 442F, 445 PHD program, 424T, 432, 432F PHDsec program, 499, 501–2, 503 PHI-BLAST program, 108 Phobius method, 509

phosphatidylinositol 3-OH kinase (PI3 kinase) p110α subunit, 557 alignment, 86, 86F homology modeling, 557–64, 563T local and global alignment, 89, 89F multiple alignment, 91–2, 92F protein family profile, 109 searching sequence databases, 99F, 100, 101F phosphatidylinositol 3-OH kinase (PI3 kinase) p110γ subunit, 557, 557F, 563, 563T, 564 phosphatidylinositol 3-OH kinases (PI3 kinases), 87B, 557 multidomain nature, 88, 88F patterns and motifs, 106–9, 106F, 107F, 109F phosphatidylinositol-4-OH kinases (PI4-kinases), 87B, 88F patterns and motifs, 106, 107–8 phosphoinositol kinase, 439F, 441 phospholipid kinases, 87B phosphopeptide-binding proteins, 570–1, 572F phosphorylation sites, predicting, 110 phosphotyrosine-binding (PTB) domain, 571, 572F phylogenetic tree reconstruction, 248–64 assessing tree feature reliability, 307–10, 308FD choice of method, 249–51, 251T clustering methods, 276–85, 277FD data choice, 249 evaluating topologies, 293–307, 294FD evolutionary model choice, 251–5 multiple alignment as starting point, 255, 260 multiple topologies, 286–93, 287FD practical examples, 255–8, 257F, 258F single trees, 276–86, 277FD starting trees for further exploration, 285–6 theoretical basis, 267–311, 268MM phylogenetic trees, 223–4 see also guide tree additive, 228–9, 229F, 230 comparing two or more alternative, 310–11 condensed, 233–4, 233F consensus, 234–5, 234F, 291 fully resolved, 227 gene see gene trees measuring difference between two, 289B multifurcating (polytomous), 227, 233, 233F partially resolved, 227

reconciled, 243, 244F rooted see rooted trees scoring multiple alignments, 200–1, 200F species see species trees strict consensus, 234–5, 234F structure and interpretation, 225–35, 226FD substitution matrix derivation from, 82–3, 119F, 120 topologies see tree topologies ultrametric, 229–30, 229F unrooted see unrooted trees phylogenomics, 262 PHYML program, 251, 255 PHYRE program, 535–6, 536F PI3 kinase see phosphatidylinositol 3-OH kinase PISSLRE see CDK10 gene PKN/PRK1 protein kinase, 452, 452F, 453F, 454F plasmids, 21 platelet-derived growth factor (PDGF), 616–17, 617F pleckstrin homology (PH) domain, 571, 572F Pocket-Finder program, 585–6, 586F point accepted mutations matrices see PAM matrices Poisson corrected distance, 269F, 270 polyadenylation, 18 signal detection, 389 polycystein-1-protein, 571F polypeptide chain, 29, 31–2 conformational flexibility, 32, 32F polytomous (multifurcating) trees, 227, 233, 233F porins, 35, 436 secondary structure prediction, 450–1, 450F position-specific scoring matrices (PSSMs), 96, 166, 168–78 see also profiles aligning, 193–5, 194F construction, 168–71 overcoming lack of data, 171–5, 176F representation as logos, 177–8, 178F secondary structure prediction using, 503, 504, 505F, 514 sequence weighting schemes, 171, 171F using PSI-BLAST, 176–7, 177F positive-inside rule, 441 positive selection, 240–1B posterior probability, 698 post-order traversal, 298–9, 298F, 300–1

potential energy, 522, 524, 525 see also force fields calculations, 525–6 functions, 522, 524–9, 706–8 surface, 525 potentials of mean force, 532–3, 706–7 PPI-PRED program, 584–5, 584F PRATT program, 108, 109F, 217–18 Predator Multiple Seq., 424T PREDATOR program, 414, 424T, 428–30 prediction confidence level (PCL), 432 prediction filtering, 484 PRED-TMBB method, 509 Pribnow box, 339, 340 primary structure, 26–7, 27F, 29–33 principal component analysis (PCA) application, 618, 619F principle, 631–3, 632F, 633F PRINTS database, 109 prion proteins (PrP), 101B chameleon sequences, 37B hydrophobic cluster analysis, 110F, 111 low-complexity regions, 101–2, 102F prior distribution, 172 prior probabilities, 307, 698 probabilistic approaches alignment scoring, 117–19 pattern discovery, 215–17 secondary structure prediction, 414 probability conditional, 696 marginal, 696 posterior, 698 prior, 307, 698 statistical tests, 652–3, 653F probability theory, 695–7 ProbCons method, 200, 203–4, 206 PROCHECK program, 527, 549, 550F, 551T Prodom database, 58 profile hidden Markov models (HMMs), 109, 179–93, 374 aligning, 195–6, 195F, 196F basic structure, 180–5, 181F, 183F, 184F parameterization using aligned sequences, 185–7 using unaligned sequences, 191–3 path lengths, 185, 185F, 186F scoring sequences against, 187–91 profiles, 96, 165–96, 166MM see also position-specific scoring matrices aligning, 193–6, 193FD defining, 167–78, 167FD representation as logos, 177–8, 178F

PROF program, 424T, 433F, 434 prof-sim method, 195 PROFtmb program, 450F, 451, 508F, 509 progressive alignment, 198, 204–6, 205F prokaryotes, 21, 21F see also bacteria 16S RNA, 249 control of translation, 19–20 gene detection, 359–60 algorithms, 368–77, 368FD homology searching, 322 practical aspects, 322–3, 322FD, 323F sequence features used, 364–8, 364FD vs methods used in eukaryotes, 377–9 gene structure, 318–19 genomes, 324T promoter prediction, 339–40, 341–2 regulation of transcription, 15–17, 16F tRNA gene detection, 361–2, 362F, 363F ProMate, 584, 584F PromFind program, 387–8 Promoter 2.0 algorithm, 340, 341T PromoterInspector program, 341, 341T, 388 promoter prediction, 338–42, 381–9 eukaryotes see under eukaryotes indefinite nature of results, 341, 341T online methods, 340–1 prokaryotes, 339–40, 341–2 Promoter Recognition Profile, 341 promoters, 15–16, 319 core (basal) see core promoters eukaryotic, 17, 17F, 381 ProScan program, 341, 341T, 386–7 PROSITE database, 105, 107–8, 108F, 109 protease, HIV (HIV-PR), 551–2, 552F protein(s), 4–5 concentration measurement, 623 conformation see conformation denatured, 42 function see function hypothetical, 65, 348 identification of purified, 621–3, 622F interactions between atoms, 32B localization signals, 111, 111B phylogenetic trees, 226, 230 stability of folded, 41–2 synthesis see translation protein backbone see backbone

protein binding sites docking procedures, 587–93, 588FD finding, 580–7, 581FD highlighting clefts or holes, 585–6, 585F, 586F residue conservation for, 586–7, 587F surface properties for, 584–5, 585F useful features for, 582–4 types, 582 water molecules, 592–3 Protein Data Bank (PDB), 60, 62F, 102–3, 531 finding target protein homologs, 543, 557 PDB_SELECT, 416–17, 473 Protein Domain Parser (PDP), 575, 576 protein expression 2D gel electrophoresis see two-dimensional gel electrophoresis analysis, 612–23, 612FD cluster analysis, 615–17, 617F, 618F data preparation for, 626–33, 627F, 627FD differential, 615, 616F, 617F methods, 614–20 online tools, 620 principal component analysis, 618, 619F statistics, 652–9 tracking changes over different samples, 618–20, 619F clustering methods and statistics, 625–64, 626MM databases, 58, 620 identification of purified proteins, 621–3, 622F quantitation, 623 sample classification, 659–62, 660FD protein families, 259B phylogenetic tree reconstruction, 259–63, 261F, 263F profiles of, 109 protein fold libraries, 573 topological, 573F, 574 protein folding, 40–1, 41F, 412 alternative, 486, 491–2, 492F, 493F inverse, 530–1 protein fold recognition see threading protein folds, 40, 41, 411 classification, 573–4, 573F libraries, 531, 532F, 571–4 prediction in absence of known homologs, 531 recognition see threading structurally different, with similar functions, 570–1, 572F

structurally similar with different functions, 570, 571F, 572F unrelated molecules, 529, 530F protein interaction(s), 580–2 databases, 58–9 interactive Web sites, 671–2, 673F, 674F maps, 610, 611F sites see protein binding sites protein kinases, 86, 87B cAMP-dependent see cyclic AMP-dependent protein kinase catalytic subunit (PRKD), 107–8, 107F microarrays, 621 PKN/PRK1, 452, 452F, 453F, 454F protein microarrays, 621 ProteinProspector program, 622–3 protein–protein interactions see also protein interaction(s) analysis using clustered data, 610, 611F searching for, 584–5, 584F protein sequence databases, 56–8, 59F nomenclature for amino acid uncertainty, 63 protein sequences see also amino acid sequences comparison with nucleotide sequences, 150–3 constructing predicted, 343–6, 345 detection of homology, 75–6 evolutionary models, 276 low-complexity regions, 100–2, 151B multiple alignments, 92 obtaining secondary structure from see secondary structure prediction phylogenetic tree reconstruction, 249 scoring of alignment, 76–7, 79–80 searching for motifs or patterns, 103–4 searching with, 97–103 substitution matrices, 82–5, 117–25 protein structure, 25–43, 26FD, 26MM classification, 421F, 573–4, 573F comparison methods, 574–80, 575FD implications for bioinformatics, 37–9, 38FD low secondary structure content (low SS), 573F, 574 modeling see modeling protein structure molecular representations, 39, 39F native, 522 primary see primary structure

quaternary see quaternary conformation secondary see secondary structure supersecondary, 40B, 529 tertiary see tertiary protein structure three-dimensional see tertiary protein structure visualization and computer manipulation, 38–9, 39F protein subunits, 27, 42–3 Proteobacteria, 249, 255–8, 257F, 258F proteome, 600, 612 see also protein expression analysis, 612–23, 612FD proteomics, 600–1 applications, 601T role in systems biology, 668 protocols, 686 ProtScale, 110 prrp program, 206 pseudocounts, 172–3, 176F pseudo-energy functions, 526–7 pseudogenes, 22B, 73, 73B, 242 pseudoknots, 457 pseudo-torsion angles, 703 PSI-BLAST program, 96–7, 108 comparative effectiveness, 177, 178T homology modeling, 560–1 PSSM construction, 176–7, 177F secondary structure prediction, 433F, 502, 503, 504 PSIMLR method, 514 PSIPRED program, 433F, 434, 434F, 503 accuracy, 424T, 469, 469F, 472, 503 homology modeling, 560–1 PSORT programs, 111 PSSMs see position-specific scoring matrices pSTIING, 58–9, 671–2 analysis of clustered genes, 610, 611F protein interaction networks, 61F, 674F purifying (negative) selection, 240B purines, 6, 6F pyridoxal phosphate-dependent aminotransferases, 570 pyrimidines, 6, 6F pyruvate formate-lyase, 467F pyruvate kinase, 480F

Q Q3, 417–19, 418F, 469 compared to Sov, 470T, 471–2 different methods compared, 422, 424T GOR method, 422, 423, 484 nearest-neighbor methods, 491

neural network methods, 499, 501, 503, 504 range of values, 469, 469F Qian, Ning, 496–9, 499F Q-SiteFinder, 585–6, 586F quadratic discriminant analysis (QDA), 340, 388, 389F, 396 quality match scores, 200, 203–4 quantum mechanics, 700 quartet-puzzling method, 251T, 305–6, 306F quaternary conformation, 27, 27F, 42–3, 42F, 43F

R Ramachandran plots, 33, 34F, 525 PI3 kinase p110α model, 560, 560F random error, 627–8 random model, sequence alignment, 117–19 rank-sum test, 656–7 reaction rates, 679–80 reading frames, 13, 13F see also open reading frames exon prediction and, 325–7, 328F, 329F, 391–2 rearrangements, large-scale, 248 examples, 158F identifying, 156–7, 158F, 159 rat and mouse X chromosomes, 403–4, 403F receptor tyrosine kinases (RTKs), 436B Reciprocal Net database, 52 reconciled trees, 243, 244F RECON program, 347 records, database, 46–7 reductionist approach, 670F redundancy, biological systems, 686–8 redundant data, 63 regulatory elements, 15 relational databases, 48, 49–50, 49F relative entropy, 697 substitution matrices, 125F, 126 relative mutability, 120 Relenza®, 589B, 591 reliability (confidence index), 432 RELL method, 311 repeated elements, 337B RepeatMasker program, 347, 378–9 repeat sequences see also DNA repeats; low-complexity regions annotation, 347 dot-plots for identifying, 77–8, 79F exclusion from analysis, 319–21, 360, 378–9 SEG for identifying, 151–2B repressors, 16–17 resolution, 64

response function, 495, 496F restriction enzymes, type I, 420B retrotransposons, 337B REV+G model, 254F, 255–6, 256T REV (GTR) model, 253T, 255, 262 R factor, 64 Rhodopseudomonas blastica, 450 rhodopsin, 440–1, 440F helical wheel representation, 439F, 441 secondary structure prediction, 441F, 442F, 443, 447F ribonuclease (RNase), 412 ribonucleic acid see RNA ribonucleotides, 6 ribose, 5–6 Ribosomal Database Project (RDP) database, 255 ribosomal RNA (rRNA), 13 see also 16S RNA sequences sequences, identifying, 361 small ribosomal subunit, 249 ribosome-binding sites (RBS), 366F, 380 absence in eukaryotes, 380, 389 GeneMark.hmm, 375 ORPHEUS scoring scheme, 372–3 ribosomes, 13–14, 14F rice genome, 335B Riis, Søren, 500–1, 501F, 502–3 RING-finger domains, 575 ring of life, 292B ritonavir, 589B Rivera, Maria, 292B RMSD see root mean square deviation RNA, 4 central dogma concept, 10, 10F, 10FD functions, 13 noncoding, detection, 319–21, 361–3, 361FD structure, 5, 5FD, 6F, 9–10, 9F transcription see transcription RNA capping, 18 RNAfold, 457F, 458 RNA polymerase II, 17 promoters, detection, 383–7, 387F subunit, 582, 582F RNA polymerases, 11 bacterial, 15–17, 339 eukaryotic, 17–18, 383 RNA secondary structure, 9, 435, 455–6 prediction, 455–8, 455FD, 456F types, 456, 456F RNA sequences databases, 56 searching with, 97 RNA splicing, 18–19, 18F alternative, 19, 380–1


Robinson–Foulds difference see symmetric difference Robson, Barry, 422, 480 robustness biological systems, 683–9, 684FD characterization, 690 as feature of complexity, 684–5 Rocke, David, 627–8 roll, 573F root, 227, 227F rooted trees, 227, 227F construction, 291–3 root mean square deviation (RMSD), 542 domain identification, 577 modeling of loops, 546, 547F practical application, 563, 563T ROSETTA/HMMSTR method, 523B Rost, Burkhard, 470 rotamer libraries, 547–8 rRNA see ribosomal RNA Rychlewski, Leszek, 491

S Saccharomyces cerevisiae, 324, 404, 405 cDNA array data analysis, 632F gene expression microarray database, 611, 612F SAGA multiple alignment method, 209–11, 210F, 211F SAGE (serial analysis of gene expression), 604–5, 604F SAGEmap, 605 Saitou, Naruya, 282 Salzberg, Steven, 489, 491 SAM (significance analysis of microarray method), 656 sample classification, 659–62, 660FD see also data classification biclustering, 649–50, 650F methods available, 660–1 principal component analysis, 631–3, 632F, 633F support vector machines, 661–2, 662F, 663F sample classifier, 660 SAM program, 182, 184 Sander, Christian, 464–5 sandwich, 573F Sanger, Frederick, 45 Sanger Institute, 55 Sankoff algorithm, 300–2, 301F SATCHMO program, 200, 203 scatterplots, protein expression data, 615, 617F Scherf, Matthias, 388 Schneider, Thomas, 178 SCOP database, 531, 532F, 572–4


scores (alignment), 76, 117 derivation, 117–19 expected, 119, 126 overall, 80 statistical significance, 153–6, 154FD scoring schemes/matrices, 75, 76–81 see also position-specific scoring matrices; substitution matrices constructing multiple alignments, 200–4 selection of appropriate, 126 theoretical basis, 117–27, 117FD threading, 531–3 scrapie, 101B SCWRL3, 561 searching sequence databases, 93–111, 94FD assessing quality of match, 97–100, 99F database selection, 102–3 dealing with low-complexity regions, 100–2 exon prediction, 397 patterns and protein function, 109–11 programs, 94–7 protein sequence motifs or patterns, 103–7 using motifs and patterns, 107–9 secondary RNA structure see RNA secondary structure secondary structure, 27, 27F, 33–6 see also α-helices; β-strands alternative conformations, 486, 486F common types, 413–14, 413F databases, 60–1 defining, for prediction algorithms, 463–8 length distributions, 467, 468, 468F local sequence effects, 479–84, 480F sequence correlations, 487–8, 487F secondary structure prediction, 37, 411–59, 412MM assessing accuracy, 417–19, 418FD, 469–72 based on residue propensities, 472–85, 472FD coiled coils, 451–4, 452FD defining secondary structure, 463–8, 464FD expected accuracy, 468 general data classification techniques, 510–14, 511FD hidden Markov models, 504–10, 506FD methods of defining structures, 417, 417F

nearest-neighbor methods see nearest-neighbor methods neural network methods see under neural network methods specialized methods, 435–58, 435FD statistical and knowledge-based methods, 421–30, 421FD success application, 420B theoretical basis, 461–514, 462MM training and test databases, 416–17, 416FD transmembrane proteins, 438–51, 438FD types of methods available, 413–16, 413FD second derivative methods, function optimization, 714–15 SEG program, 151–2B Sejnowski, Terrence, 496–9, 499F selective pressures, 240–1B self-information, 423, 482 self-organizing maps (SOMs), 644–6, 644F, 645F basic principle, 608, 608F biclustering, 650, 650F gene expression microarray data, 608–9, 609F, 610 secondary structure prediction, 513–14, 513F vs other clustering methods, 643B self-organizing tree algorithms (SOTA), 648–9, 648F evaluating validity of clusters, 651 gene expression microarray data, 610, 610F semiglobal alignment, 132F, 133 semi-Markov model, 374 sense strand, 11–12 sensitivity (Sn) exon prediction, 343, 392B gene prediction at nucleotide level, 365–6B separating hyperplane, 661, 662, 662F sequence alignment, 71–112, 72MM see also global alignments; local alignments applications, 72 detection of homology, 74–6 genome sequences see genome sequence alignments homology modeling, 543–4, 544F, 558–9 inserting gaps, 85–7 multiple see multiple alignment optimal see optimal alignments pairwise see pairwise alignment principles, 72–6, 73FD progressive, 198, 204–6, 205F scores see scores (alignment)


scoring see scoring schemes/matrices searching databases see searching sequence databases suboptimal, 76 substitution matrices, 81–5 types, 87–93, 88FD sequence analysis, 71, 72MM evolutionary conservation and, 38 sequence databases, 55–8 automated data analysis, 64–5 gene prediction using, 334–6 nonredundancy, 62–3 searching see searching sequence databases selecting, 102–3 sequence length compositional complexity and, 151B homology modeling and, 540–1, 542F substitution matrix choice and, 85 sequence motifs see motifs sequence ontology project (SOP), 55 Sequences Annotated by Structure (SAS), 103 sequence similarity see similarity, sequence sequence–structure correlations, 487–8, 487F sequence-to-structure networks, 432, 499–500, 500F serial analysis of gene expression (SAGE), 604–5, 604F serine proteases, 570 serotonin N-acetyltransferase, 421F secondary structure prediction, 423F SH2 domains, 78B, 571, 572F Cbl protein, 575, 576F dot-plot assessment, 77F, 78 identification, 576–80 searching sequence databases, 98–100 sequence alignments, 92, 93F SH3 domains, 529, 530F Shannon entropy, 695–6 Shigella flexneri, 262 Shine–Dalgarno sequence, 19, 373 shotgun genome sequencing procedure, 376B SH test, 311 shuffle test, 534 Sibbald, Peter, 171 side chains, amino acid see amino acid side chains sigma factors (σ), 339 signaling pathways, 110 modeling interactions, 681–3, 682F network models, 678

signal peptide, 508–9 signal sequences, protein localization, 111, 111B significance, statistical, 653 SigPath, 692 silent states, 180, 181, 183–4, 184F similarity, sequence, 74 dot-plots for assessing, 77–8, 77F gene prediction using, 334–6 homology modeling and, 539–40, 541F percent, 80–1 percent identity for quantifying, 76–7 scoring, 80, 81 secondary structure prediction, 488–90 Simon, István, 506–7 SIMPA96 scoring method, 488, 490, 491 simple sequences, 151–2B see also low-complexity regions simplex, 711, 712F SIM program, 554 simulated annealing, 528–9 function optimization, 719 single linkage clustering, 640, 641F singleton sites, 298 Sippl, Manfred, 534, 706 Sippl test, 534 Sjögren–Larsson syndrome (SLS), 351, 351B Sjölander, Kimmen, 174, 174F SLAGAN program, 158F, 159 SLIM matrices, 84 small ribosomal subunit rRNA, 249 Smith, Randal, 214 Smith, Temple, 214 Smith–Waterman algorithm, 88–9, 136–7 database search programs using, 95, 97, 145–6 discarding intermediate calculations, 138B vs PSI-BLAST, 178T Söding, Johannes, 195F, 196 sodium dodecyl sulfate (SDS), 613 softmax, 495–6 Sokal, Robert, 278 solvation potential, 533 solvents see also water molecules omission from energetics calculations, 700 potential terms relating to, 526–7, 707–8 SOMs see self-organizing maps SOSUI program, 442F, 443, 444F, 447 SOTA see self-organizing tree algorithms

Sov, 417, 419, 419F compared to Q3, 470T, 471–2 derivation, 470–2 different methods compared, 422, 424T GOR method, 423 range of values, 469F, 472 spaced seed method, 158–9 spacer unit, 496, 500 speciation duplication inference (SDI), 293 speciation events, 226, 239, 242F species reconstructing evolution, 249 specific databases, 103 species (phylogenetic) trees, 225–30, 227F, 229F combined with gene trees, 243, 244F effects of gene loss/missing gene data, 242–3, 243F orthologous genes for constructing, 239–47, 242F vs gene trees, 230, 231F specificity (Sp) exon prediction, 343, 392B gene prediction at nucleotide level, 365–6B spliceosomes, 18 SplicePredictor program, 393–4 splice sites, 18–19 detection, 337–8, 338F, 379–81, 390 theoretical basis, 392–6, 395F donor and acceptor, 18F, 380F, 392 variability, 379, 380F splice variants, 380–1 SpliceView program, 338, 339F splicing, RNA see RNA splicing splits assessing accuracy, 309 differences between two trees, 289B multiple alignment guide trees, 206, 206F phylogenetic trees, 231–2, 232F Src-homology domains see SH2 domains; SH3 domains SSAHA program, 158 SSEARCH program, 96T, 97, 100 SSPAL method, 489, 490, 490F, 491 SSpro method, 504, 505F standard deviation, 652, 653F dealing with lack of replicates, 657B Stanford Microarray Database (SMD), 58, 60F, 611 star decomposition, 285–6 start codons, 13, 19, 318, 367 E. coli, 366F, 367 predicting correct, 327, 330F, 333–4, 389 star tree, 200F, 201


start state, 179, 182–3, 183F states (hidden Markov models), 179, 180, 181 state variables, 679–80 statistical methods secondary structure prediction, 414, 415F, 421–30 transmembrane protein prediction, 443 statistical tests, 625, 626MM, 651–62, 651FD importance of variance, 652, 652F multiple, controlling error rates, 657–9, 658T nonparametric, 656–7 steady state, 690 steepest descent method, 528, 711–13, 713F step-down Holm method, 658 Stephens, Michael, 178 step-up Hochberg method, 659 stepwise addition, 285–6 steric hindrance, 32 Sternberg, Michael, 206 stop codons, 12T, 13, 19, 318, 367 detection, 389 Streptococcus protein G, 484F Streptomyces coelicolor, 643B strict consensus trees, 234–5, 234F STRIDE program, 417 STR matrix, 84 Structural Bioinformatics Protein Databank see Protein Data Bank structural databases, 59–61 automated data analysis, 64 checking for data consistency, 63–4 structure, protein see protein structure Structured Query Language (SQL), 49–50 structure–function relationships, 567–93, 568MM docking methods and programs, 587–93, 588FD finding binding sites, 580–7, 581FD functional conservation, 568–74, 568FD structure comparison methods, 574–80, 575FD structure-to-structure network, 432, 499 Student’s t-distribution, 654, 655 suboptimal alignments, 76, 135–9, 137F substitution groups, 213 substitution matrices, 81–5 see also BLOSUM matrices; PAM matrices evolutionary models and, 276 position-specific scoring matrices and, 168–71


selection of appropriate, 126 theoretical background, 117–27, 117FD threading, 532 subtilisin, 243–4, 244F subtree pruning and regrafting (SPR), 289B, 290, 290F subtrees, 230 subunits, protein, 27, 42–3 suffix, 142 suffix trees, 141–3, 143F whole genome sequences, 158 sum-of-pairs (SP), scoring multiple alignments, 200F, 201 superfamilies, 259, 259B phylogenetic tree reconstruction, 259–63, 261F, 263F protein fold libraries, 573 superkingdoms, 21 supersecondary structures, 40B, 529 supervised learning, 497B, 638 support vector machines (SVMs) sample classification, 661–2, 662F, 663F secondary structure prediction, 511–12, 512F, 513F survivin, 583, 583F S-values branch-and-bound method, 288 maximum-likelihood methods, 287 minimum evolution method, 297 optimizing tree topologies, 288, 290, 291, 291F, 293 parsimony methods, 287, 293, 297–9, 301 starting trees, 286 SWISS-2D-PAGE, 620 Swiss Institute for Bioinformatics (SIB), 620 Swiss-Model, 552, 554, 561–3, 562F Swiss-Pdb Viewer, 542, 557–60, 558F, 559F, 562–3 Swiss-Prot database, 54, 56–8, 59F, 102–3 manual annotation, 65 pattern and motif searching, 105, 106–8 searching, 98–100, 99F, 101F vs PSI-BLAST, 178T switches, bistable, 688–9, 689F symmetric difference, 289, 289B, 291 SYM model, 253T synonymous mutations, 238, 240–1B, 245 syntenic regions, 248, 403–4, 404F systematic errors, 625, 627–8 systems, biological, 669–78, 669FD see also networks bistable switches, 688–9, 689F

concept, 669–70, 670F, 671F control circuits, 680, 680F information needed to construct, 671–4 mathematical modeling approaches, 674–7, 676F mathematical representation of interactions, 680–3 modularity, 685–6 network properties, 670–1 redundancy, 686–8 robustness, 683–9, 684FD standardized description, 692 storing and running models, 689–92, 689FD systems biology, 667–93, 668MM model types used, 678 structure of model, 679–83, 679FD system properties, 683–9, 684FD Web-based tool and databases, 671–2, 675T Systems Biology Markup Language (SBML), 692

T Tamura-Nei (TN) model, 253T target protein, 527 alignment with template, 543–4, 544F finding structural homologs, 543, 557 similarity to template, 539–40 TATA-binding protein (TBP), 17 TATA box, 17, 383 Bucher weight matrix, 383, 384, 384F detection, 383–7, 389 genes lacking, 381, 383 GenScan prediction method, 385, 385F NNPP prediction method, 385–6, 386F taxa, 225 Taylor, Willie, 276 tblastx, 96, 150 T-Coffee program, 203, 204F temperature biological systems, 679–80 molecular dynamics simulations, 718 simulated annealing, 529, 719 template protein, 527, 542–3 alignment with target, 543–4, 544F locating, 543, 557 similarity to target, 539–40 terminator signal, 16 tertiary contact (TC) measure, 491–2, 492F


tertiary protein structure, 27, 27F, 40–2 see also protein folds analyzing function from see structure–function relationships experimental methods of determining, 521 modeling see modeling protein structure visualization and computer manipulation, 38–9, 39F test dataset, 416–17 test statistic, 652, 653F tetramers, 43 thermodynamic simulation, and global optimization, 715–19, 715F thermodynamic stability, folded proteins, 41–2 thiamine diphosphate (TDP), 259B, 260 Thornton, Janet, 276, 475 THREADER program, 707 threading (fold recognition), 523–4, 529–37, 530FD assessing confidence of prediction, 534–5, 535F dynamic programming methods, 533–4, 534F libraries of protein folds, 531 potentials used, 706–8 practical example, 535–7, 536F, 537F procedure, 530–1, 531F pseudo-energy functions, 527 scoring schemes, 531–3 three-dimensional protein structure see tertiary protein structure thymine (T), 6, 6F Tie, Jien-Ke, 449B TIM barrel folds, 570, 570F, 573F differing functions, 570, 572F time-delay neural network (TDNN), 385–6, 386F TMAP program, 442F, 444, 447 TMbase, 443 TMHMM server, 446, 446F, 447F, 507–9 assessing accuracy, 471F, 472 comparative results, 442F TMpred program, 442F, 443 Toll-like receptor, 608 top-down approach, modeling biological systems, 676–7, 677F topological families, 573F, 574 topological models, 678 TopPred program, 441, 442 torsion angle potential, 703, 703F torsion (dihedral) angles, 29–33 amino acid side chains (χ1, χ2, etc.), 547, 548F

Cα chain (φ, ψ), 29–32, 32F ideal β-strands, 36F Ramachandran plots, 33, 34F secondary structure prediction, 417, 466, 466F, 503–4, 504F, 505F improper, 703 peptide bond (ω), 31–2, 32F traceback, 132, 136, 138B, 300 training, neural networks, 496, 497–8B training dataset, 416–17 trans conformation, 32, 33F transcription, 11–12, 11F regulation, 15–18, 16F, 17F stop signals, detection, 389 transcription (initiation) factors binding sites, 381, 386 detection algorithm, 386–7 general, 17 leucine zipper, 413, 451 transcription start site (TSS), 15–16, 16F, 17F prediction, 338–9, 340, 381–9 transcriptome, 600 transfer function see response function transfer RNA (tRNA), 13 base modifications, 7 function in translation, 13–14 gene detection methods, 320–1, 320F, 361–3 secondary structure prediction, 457F, 458 structure, 14F transition mutations, 237–8, 238F transitions, hidden Markov models, 179, 180, 181, 181F transition/transversion ratio (R), 237–8 calculation, 274–5B weighted parsimony method, 300, 300F translation, 13–14, 14F control, 19–20 genetic code, 12–13, 12T predicted exons, 343, 344F, 345, 345 start sites, prediction, 389 stop signals see stop codons translation initiation factor 5A (1BKB), 421F secondary structure prediction, 422F TRANSLATOR program, 345 translocation, 158F transmembrane β-barrels, prediction, 448–50, 450F, 508F, 509 transmembrane helices, 436 amino acid propensities, 475–6, 478F helical wheel diagrams, 439F, 440–1 length distribution, 468, 468F

prediction, 439–48 algorithms available, 441–7 assessing accuracy, 471F, 472 based on residue propensities, 477–8, 479, 479F comparing results, 447–8 example, 449B hidden Markov models, 506–9, 507F using evolutionary information, 444–5 three-dimensional structure, 440F transmembrane proteins, 435, 436–51 7-transmembrane spanning superfamily, 436B bitopic and polytopic, 437, 437F functional importance, 436B hydrophobicity scales and, 437–8 prediction, 438–51, 438FD example, 449B hidden Markov models, 506–9 structural elements, 437T transmissible spongiform encephalopathies, 101B transport systems, 669–70, 670F transposons, 22B, 336, 337B transversion mutations, 237–8, 238F transversion parsimony, 300 tree bisection and reconnection (TBR), 289B, 290–1 tree methods, multiple alignment, 90–1, 90F, 200–1 tree of life, 20–3, 20F, 21F, 38F horizontal gene transfer within, 246F, 247 origins, 292B tree topologies, 227–8, 228B comparing, 232–5, 233F, 234F describing, 230–2, 232F evaluating, 293–307, 294FD generating initial, 285–6 generating multiple, 286–93, 287FD interior branch examination, 309–10 measuring difference between two, 289B TrEMBL, 102–3 tricarboxylic acid (TCA) cycle, 685, 686F, 687F trimers, 43 tRNA see transfer RNA tRNAscan algorithm, 321, 361–2, 362F, 363F tRNAscan-SE algorithm, 362–3 TSSG algorithm, 340, 341T TSSW algorithm, 340, 341, 341T t-statistic, 654, 655 t-test, 654–5, 656T modifications, 657–9


tumors invasion, mathematical modeling, 676–7, 677F sample classification, 662, 663F turns, 36–7, 37F see also β-turns amino acid preferences, 37 Tusnády, Gábor, 506–7 twilight zone, 81 TWINSCAN program, 331T, 332T, 336–7 two-dimensional (2D) gel electrophoresis, 600, 613–20 see also protein expression analysis of data, 614–20 clustering, 615–17, 617F, 618F differential protein expression, 615, 616F, 617F measuring expression levels, 614–15 principal component analysis, 618, 619F identification of separated proteins, 621–3, 622F spot detection, 614, 614F technique, 613–14, 613F two-hit method, 149 two-tailed test, 653, 653F type I error, 653, 658

U ubiquitin ligases, 575 UGA codon, 23 ultrametric trees, 229–30, 229F UniGene database, 103, 605–6, 605F UniProtKB, 56–8, 65 units see also nodes neural network, 430–1, 494–5, 495F unrooted trees, 227, 227F generation, 286–91 unsupervised learning, 638, 644 untranslated regions (UTRs), 325F, 379 detection, 390, 396–7 unweighted parsimony, 297–300, 299F UPGMA method, 199, 250, 251T, 608, 639 practical application, 256–8, 258F theoretical basis, 278–9, 279F, 640 vs Fitch–Margoliash, 280 UPGMC method, 640 upstream sequences, 16 URL, 53 UTRs see untranslated regions Uzzell, Thomas, 270

V van der Waals interactions, 32B van der Waals terms, 705 variable region, 555B variance, 626, 652, 653–4 importance in statistical testing, 652, 652F Vector Alignment Search Tool (VAST), 577–8, 579F Venn diagram, amino acid conservation, 426, 428F Venter, J. Craig, 376B Viagra, 589B virtual heart project, 677 virulence factors, 341–2 viruses, 21 overlapping genes, 12, 360 sequenced genomes, 324T VISTA program, 353–4, 353F, 354F vitamin K epoxide reductase (VKOR), 449B Viterbi algorithm, 188–9 von Bertalanffy, Ludwig, 667 von Heijne, G., 441, 442

W Waddell, Peter, 296, 296F Waterman, M.S., 136, 154 water molecules, 700 see also solvents ligand–protein docking and, 592–3 Watson, James, 7 Watson–Crick base-pairing, 7–9, 8F weight matrices Bucher, 383–4, 384F splice site prediction, 394 weight sharing, neural networks, 500–1, 501F Welch’s t-test, 655 WHAT_CHECK program, 549–50 WHAT-IF program, 549, 551T whole-genome alignment, 156–9, 157FD see also genome sequence alignments Wilcoxon test, 656–7 Wilkins, Maurice, 7, 7F windows (sequence), 476–9 GOR methods, 422–3 nearest-neighbor methods, 428, 486, 487F, 489 neural network methods, 431 support vector machines, 511 winner takes all strategy, 495 wobble base-pairing, 14 Woese, Carl, 249 Wood, Valerie, 405 words, 95, 141 WormBase, 399 Wu-BLAST, 95 Wunsch, C.D., 87, 128

X X chromosomes, mouse and rat, 403–4, 403F X-drop method, 139F, 140–1, 140F xenologous genes, 247 XHTML (eXtensible hypertext markup language), 50–1 XML (eXtensible markup language), 50–1 xProfiler, 605 Xquery, 51 X-ray crystallography, 411, 521 X-SITE program, 591, 592F

Y YASPIN, 509, 509F YBL036C hypothetical protein (1CT5), 421F secondary structure prediction, 423F Yi, Tau-Mu, 488, 491 Yona, Golan, 195

Z Zmasek, Christian, 293 Zpred program, 425–7, 484, 485 accuracy, 424T amino acid properties used, 426, 428F, 429T conservation values, 426, 427F, 428F, 429T z-statistic, 577, 578F, 654 z-test, 309, 653–4 Zvelebil conservation number, 426 Zviling hydrophobicity scale, 477T
