Papers are in PDF or PostScript.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for biological sequences is inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the example alignments score close to optimal. We extend prior work on inverse alignment to partial examples and to an improved model based on minimizing the average error of the examples. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the recovery rate for multiple sequence alignment by up to 25%.
Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment.
We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties.
For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices for substitution scores are now common, largely due to convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples be optimal-scoring alignments of their strings.
We present a new polynomial-time algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples be the unique optimal alignments of their strings), and inverse near-optimal alignment (find parameter values that make the example alignments be as close to optimal as possible). Computational results with an implementation for global alignment show that, for the first time, we can find best-possible values for all 212 parameters of the standard protein-sequence scoring-model from hundreds of alignments in a few minutes of computation.
Software watermarks -- bitstrings encoding some sort of identifying information that are embedded into executable programs -- are an important tool against software piracy. Most existing proposals for software watermarking have the shortcoming that they can be destroyed via fairly straightforward semantics-preserving code transformations. This paper introduces path-based watermarking, a new approach to software watermarking based on the dynamic branching behavior of programs. We show how error-correcting and tamper-proofing techniques can be used to make path-based watermarks resilient against a wide variety of attacks. Experimental results, using both Java bytecode and IA-32 native code, indicate that even relatively large watermarks can be embedded into programs at modest cost.
A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are evaluated using linear gap-costs, we call this problem Aligning Alignments. While seemingly a straightforward extension of two-sequence alignment, we prove it is actually NP-complete. As explained in the paper, this provides the first demonstration that minimizing linear gap-costs, in the context of multiple sequence alignment, is inherently hard.
We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NP-complete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve very-large random instances with highly-gapped sequences.
"A great deal of software is distributed in the form of executable code. The ability to reverse engineer such executables can create opportunities for theft of intellectual property via software piracy, as well as security breaches by allowing attackers to discover vulnerabilities in an application. Techniques such as watermarking and fingerprinting have been developed to discourage piracy, however, if no protective measures are taken, an attacker may be able to remove and/or destroy watermarks and fingerprints with relative ease once they have been identified. For this reason, methods such as source code obfuscation, code encryption, and self verifying code have been developed to help achieve some measure of tamper-resistance.
"It is, of course, necessary for an attacker to gain a reliable disassembly of some portion of executable code before any intelligent tampering can take place. In fact, even a reliable disassembly in the absence of some sort of control flow graph is not sufficient for serious tampering. Coupled with other methods, we propose one method of obfuscating address computations in which the targets of control transfers are made difficult to determine statically ...."
One of the key open problems in large-scale DNA sequence assembly is the correct reconstruction of sequences that contain repeats. A long repeat can confound a sequence assembler into falsely overlaying fragments that sample its copies, effectively compressing out the repeat in the reconstructed sequence. We call the task of correcting this compression by separating the overlaid fragments into the distinct copies they sample, the repeat separation problem. We present a rigorous formulation of repeat separation in the general setting without prior knowledge of consensus sequences of repeats or their number of copies. Our formulation decomposes the task into a series of four subproblems, and we design probabilistic tests or combinatorial algorithms that solve each subproblem. The core subproblem separates repeats using the so-called k-median problem in combinatorial optimization, which we solve using integer linear-programming. Experiments with an implementation show we can separate fragments that are overlaid at 10 times the coverage with very few mistakes in a few seconds of computation, even when the sequencing error rate and the error rate between copies are identical. To our knowledge this is the first rigorous and fully general approach to separating repeats that directly addresses the problem.
We present a new method for reconstructing the distances between probes in physical maps of chromosomes constructed by hybridizing pairs of clones under the so-called sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a clone-subset called the probes. The probes are chosen by a sequential process that is designed to generate a pairwise-nonoverlapping subset of the clones. We derive a likelihood function on probe spacings and orders for this protocol under a natural model of hybridization error, and describe how to reconstruct the most likely spacing for a given order under this objective using continuous optimization. The approach is tested on simulated data and real data from chromosome VI of Aspergillus nidulans. On simulated data we recover the true order and close to the true spacing; on the real data, for which the true order and spacing is unknown, we recover a probe order differing significantly from the published one. To our knowledge this is the first practical approach for computing a globally-optimal maximum-likelihood reconstruction of interprobe distances from clone-probe hybridization data.
We study two new problems in sequence alignment both from a practical and a theoretical view, using tools from combinatorial optimization to develop branch-and-cut algorithms. The Generalized Maximum Trace formulation captures several forms of multiple sequence alignment problems in a common framework, among them the original formulation of Maximum Trace. The RNA Sequence Alignment Problem captures the comparison of RNA molecules on the basis of their primary sequence and their secondary structure. Both problems have a characterization in terms of graphs which we reformulate in terms of integer linear programming. We then study the polytopes (or convex hulls of all feasible solutions) associated with the integer linear program for both problems. For each polytope we derive several classes of facet-defining inequalities and show that for some of these classes the corresponding separation problem can be solved in polynomial time. This leads to a polynomial time algorithm for pairwise sequence alignment that is not based on dynamic programming. Moreover, for multiple sequences the branch-and-cut algorithms for both sequence alignment problems are able to solve to optimality instances that are beyond the range of present dynamic programming approaches.
We introduce a new combinatorial formulation of chromosome physical-mapping by the sampling-without-replacement protocol. In this protocol, which is simple, inexpensive, and has been used to successfully map several organisms, equal-length clones are hybridized against a subset of the clones called probes, which are designed to form a maximal nonoverlapping clone-subset. The output of the protocol is the clone-probe hybridization matrix H. The problem of finding a maximum-likelihood reconstruction of the order of the probes along the chromosome in the presence of false positive and negative hybridization error is equivalent to finding the minimum number of entries of H to change to zeros so that the resulting matrix has at most 2 ones per row, and the consecutive-ones property across rows. This combinatorial problem, which we call 2-Consecutive-Ones Mapping, has a concise integer linear-programming formulation, to which we apply techniques from polyhedral combinatorics. The formulation is unique in that it does not explicitly represent the probe permutation, and in contrast to prior linear-programming approaches, the number of variables is small: in practice, linear in the number of clones. We derive a large class of facet-defining inequalities for the 2-consecutive-ones polytope that we call the augmented k-degree inequalities, and we show that the basic k-degree class can be efficiently separated using bipartite matchings. Computational results with an implementation of the resulting branch-and-cut algorithm applied to both simulated and real data from the complete genome of Aspergillus nidulans show that we can solve many problems to provable optimality and find maps of higher quality than previously possible.
We give an experimental study of a new O(mn a(m,n))-time implementation of Edmonds' algorithm for maximum-cardinality matching in sparse general graphs with n vertices and m edges. The implementation incorporates several optimizations resulting from searching for augmenting paths depth-first, and we study the iteraction among a set of heuristics, each with the potential to significantly speed up the code, through experiments with all possible variants. The experiments indicate that the simplest heuristic, an early-termination test for the depth-first search, results in the greatest performance gain, and yields an implementation that on graphs with large degree actually finds an optimal solution in less time than a standard greedy heuristic. The resulting code appears to be the fastest among those publicly available on the classes of random, k-regular, union-of-k-cycle, and Euclidean k-nearest-neighbor graphs with up to 100,000 vertices, 500,000 edges, and average degree up to 10, achieving a maximum speedup of 50 over the LEDA codes, and 4 and 350 over two of the DIMACS implementation challenge codes, while never taking longer than these implementations.
While the area of sequence comparison has a rich collection of results on the alignment of two sequences, and even the alignment of multiple sequences, there is little known about the alignment of two alignments. The problem becomes interesting when the alignment objective function counts gaps, as is common when aligning biological sequences, and has the form of the sum-of-pairs objective. We begin a thorough investigation of aligning two alignments under the sum-of-pairs objective with general linear gap costs when either of the two alignments are given in the form of a sequence (a degenerate alignment containing a single sequence), a multiple alignment (containing two or more sequences), or a profile (a representation of a multiple alignment often used in computational biology). This leads to five problem variations, some of which arise in widely-used heuristics for multiple sequence alignment, and in assessing the relatedness of a sequence to a sequence family. For variations in which exact gap counts are computationally difficult to determine, we offer a framework in terms of optimistic and pessimistic gap counts. For optimistic and pessimistic gap counts we give efficient algorithms for the sequence vs. alignment, sequence vs. profile, alignment vs. alignment, and profile vs. profile variations, all of which run in essentially O(mn) time for two input alignments of lengths m and n. For exact gap counts, we give the first provably efficient algorithm for the sequence vs. alignment variation, which runs in essentially O(mn log n) time using the candidate-list technique developed for convex gap-costs, and we conjecture that the alignment vs. alignment variation is NP-complete.
We consider the problem of multiple sequence alignment under a fixed evolutionary tree: given a tree whose leaves are labeled by sequences, find ancestral sequences to label its internal nodes so as to minimize the total length of the tree, where the length of an edge is the edit distance between the sequences labeling its endpoints. We present a new polynomial-time approximation algorithm for this problem, and analyze its performance on regular d-ary trees with d a constant. On such a tree, the algorithm finds a solution within a factor (d+1)/(d-1) of the minimum in O(kd T(d,n) + k2d n2) time, where k is the number of leaves in the tree, n is the length of the longest sequence labeling a leaf, and T(d,n) is the time to compute a Steiner point for d sequences of length at most n. (A Steiner point for a set S of sequences is a sequence P that minimizes the sum of the edit distances from P to each sequence in S. The time T(d,n) is O(d 2d nd), given O(d sd+1)-time preprocessing for an alphabet of size s.) The approximation algorithm is conceptually simple and easy to implement, and actually applies to any metric space in which a Steiner point for any fixed-sized set can be computed in polynomial time.
We also introduce a new problem, bottleneck tree-alignment, in which the objective is to label the internal nodes of the tree so as to minimize the length of the longest edge. We describe an exponential-time exact algorithm for the case of unit-cost edit operations, and show there is a simple linear-time approximation algorithm for the general case that finds a solution within a factor O(log k) of the minimum.
One of the classic problems in computational biology is the reconstruction of evolutionary history. A recent trend in the area is to increase the explanatory power of the models that are considered by incorporating higher-order evolutionary events that more accurately reflect the mechanisms of mutation at the level of the chromosome.
We take a step in this direction by considering the problem of reconstructing an evolutionary history for a set of genetic sequences that have evolved by recombination. Recombination is a non-tree-like event that produces a child sequence by crossing two parent sequences. We present polynomial-time algorithms for reconstructing a parsimonious history of such events for several models of recombination when all sequences, including those of ancestors, are present in the input. We also show that these models appear to be near the limit of what can be solved in polynomial time, in that several natural generalizations are NP-complete.
A major challenge in computational biology is the identification of evolutionarily related macromolecules in cases of distant homology, where pairwise sequence similarity may be low, or even insignificant. Methods for the quantitative assessment of such distant relationships must go beyond simple pairwise sequence comparison, and exploit all the information available in multiple alignments of related sequences. Evaluation of the statistical significance of a pairwise alignment score by random shuffling is an established approach, which we extend here to the more general case of evaluating the similarity between a query sequence and a prealigned family of macromolecules whose multiple alignment may contain gaps. The method involves (1) the optimal alignment of the query sequence against the prealigned family, for which we develop a new algorithm that accurately takes into account the structure of gaps in the alignment, followed by (2) repeated shuffling of the query sequence and calculation of a shuffling statistic to assess the significance of the query-family homology, for which we consider four different measures. We apply the method to several data sets of interest, and compare its sensitivity to the traditional method of pairwise sequence comparison. Overall, the significance score for the query-family homology assessed by the new method exceeded the median significance score assessed by the traditional pairwise method by 2.2 standard deviations on average. This suggests that the method is consistently more sensitive for assessing distant homology than the conventional approach based on pairwise sequence comparison.
"To the Editor: A progressive decline in plasma or serum selenium levels paralleling the loss of CD4+ T cells has been widely documented in HIV-1 infections .... In light of this body of clinical evidence, we would like to report an observation of potential significance for the molecular mechanisms underlying any interaction between HIV and dietary selenium ...."
A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments in the presence of error. Alizadeh, Karp, Weisser and Zweig (Algorithmica 13:1/2, 52-76, 1995) first considered a maximum-likelihood model of the problem that is equivalent to finding an ordering of the probes that minimizes a weighted sum of errors, and developed several effective heuristics. We show that by exploiting information about the end-probes of clones, this model can be formulated as a weighted Betweenness Problem. This affords the significant advantage of allowing the well-developed tools of integer linear-programming and branch-and-cut algorithms to be brought to bear on physical mapping, enabling us for the first time to solve small mapping instances to optimality even in the presence of high error. We also show that by combining the optimal solution of many small overlapping Betweenness Problems, one can effectively screen errors from larger instances, and solve the edited instance to optimality as a Hamming-Distance Traveling Salesman Problem. This suggests a new approach, a Betweenness-Traveling Salesman hybrid, for constructing physical maps.
We show how to efficiently reconstruct an original DNA sequence of length n with high probability from o(log2 n) copies containing insertion, deletion, and substitution errors, assuming the sequence itself is random and errors are random with constant error rate.
We present some experiences with the problem of multiple genome comparison, analogous to multiple sequence alignment in sequence comparison, under the inversion and transposition distance metrics, given a fixed phylogeny. We first describe a heuristic for the case in which the phylogeny is a star on three vertices and then use this to approximate the multiple genome comparison problem via local search.
The MSA program, written and distributed in 1989, is one of the few existing programs that attempts to find optimal alignments of multiple protein or DNA sequences. The MSA program implements a branch-and-bound technique together with a variant of Dijkstra's shortest paths algorithm to prune the basic dynamic programming graph. We have made substantial improvements in the time and space usage of MSA. The improvements make feasible a variety of problem instances that were not feasible previously. On some runs we achieve an order of magnitude reduction in space usage and a significant multiplicative factor speedup in running time. To explain how these improvements work, we give a much more detailed description of MSA than has been previously available. In practice, MSA rarely produces a provably optimal alignment and we explain why.
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the Human Genome Initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form of multiple sequence alignment to detect, and often correct, errors in the data. Our combined algorithm has successfully reconstructed non-repetitive sequences of length 50,000 sampled at error rates of as high as 10 percent.
Motivated by the problem in computational biology of reconstructing the series of chromosome inversions by which one organism evolved from another, we consider the problem of computing the shortest series of reversals that transform one permutation to another. The permutations describe the order of genes on corresponding chromosomes, and a reversal takes an arbitrary substring of elements and reverses their order.
For this problem we develop two algorithms: a greedy approximation algorithm that finds a solution provably close to optimal in O(n2) time and O(n) space for an n element permutation, and a branch and bound exact algorithm that finds an optimal solution in O(m L(n,n)) time and O(n2) space, where m is the size of the branch and bound search tree and L(n,n) is the time to solve a linear program of n variables and n constraints. The greedy algorithm is the first to come within a constant factor of the optimum; it guarantees a solution that uses no more than twice the minimum number of reversals. The lower and upper bounds of the branch and bound algorithm are a novel application of maximum weight matchings, shortest paths, and linear programming.
In a series of experiments we study the performance of an implementation on random permutations and permutations generated by random reversals. For permutations differing by k random reversals we find that the average upper bound on reversal distance estimates k to within one reversal for k less than (1/2)n and n at most 100. For random permutations we find that the average difference between the upper and lower bounds is less than 3 reversals for n at most 50. Due to the tightness of these bounds we can solve to optimality random permutations on 30 elements in a few minutes of computer time.
A new and largely unexplored area of computational biology is combinatorial algorithms for genome rearrangement. In the course of its evolution, the genome of an organism mutates by processes that can rearrange whole segments of a chromosome in a single event. These rearrangement mechanisms include inversion, transposition, duplication, and translocation, and a basic problem is to determine the minimum number of such events that transform one genome to another. This number is called the rearrangement distance between the two genomes, and gives a lower bound on the number of events that must have occurred since their divergence, assuming evolution proceeds according to the processes of the study.
In this paper, we begin the algorithmic study of genome rearrangement by translocation. A translocation exchanges material at the end of two chromosomes within a genome. We model this as a process that exchanges prefixes and suffixes of strings, where each string represents a sequence of distinct markers along a chromosome in the genome. For the general problem of determining the translocation distance between two such sets of strings, we present a 2-approximation algorithm. For a theoretical model in which the exchanged substrings are of equal length, we derive an optimal algorithm for translocation distance.
We also examine for the first time two types of rearrangements in concert. An inversion reverses the order of markers within a substring, and flips the orientation of the markers. For genomes that have evolved by translocation and inversion, we show there is a simple 2-approximation algorithm for data in which the orientation of markers is unknown, and a 3/2-approximation algorithm when orientation is known.
These results take a step towards extending the area from the analysis of simple organisms, whose genomes consist of a single chromosome, and whose evolution has largely involved a single type of rearrangement event, to the analysis of organisms such as man and mouse, whose genomes contain many chromosomes, and whose history since divergence has largely consisted of inversion and translocation events.
We study the problem of comparing two circular chromosomes that have evolved by chromosome inversion, assuming that the order of corresponding genes is known, as well as their orientation. Determining the minimum number of inversions is equivalent to finding the minimum number of reversals to sort a signed circular permutation, where a reversal takes an arbitrary substring of elements and reverses their order, as well as flipping their sign. We show that tight bounds on the minimum number of reversals can be found by simple and efficient algorithms.
We define a new problem in multiple sequence alignment, called maximum weight trace. The problem formalizes in a natural way the common practice of merging pairwise alignments to form multiple sequence alignments, and contains a version of the minimum sum of pairs alignment problem as a special case.
Informally, the input is a set of pairs of matched characters from the sequences; each pair has an associated weight. The output is a subset of the pairs of maximum total weight that satisfies the following property: there is a multiple alignment that places each pair of characters selected by the subset together in the same column. A set of pairs with this property is called a trace. Intuitively a trace of maximum weight specifies a multiple alignment that agrees as much as possible with the character matches of the input.
We develop a branch and bound algorithm for maximum weight trace. Though the problem is NP-complete, an implementation of the algorithm shows we can solve instances on as many as 6 sequences of length 250 in a few minutes. These are among the largest instances that have been solved to optimality to date for any formulation of multiple sequence alignment.
The last few years have seen a surge of interest in computational molecular biology, with massive efforts such as the Human Genome Project underway. Sequence Analysis Primer, edited by Michael Gribskov and John Devereux, is a welcome introduction to computer analysis of biological sequence data. Drs. Gribskov and Devereux are well-known for the Genetics Computer Group sequence analysis package; their volume is not a manual on the GCG package, though it naturally includes many examples from their system. Their stated goal is to explain what types of analysis are available, what their limitations are, and what can or cannot be concluded from the analysis. In this the book succeeds admirably. Contributors have written in a friendly style, and offer plenty of advice to the reader. Perhaps the most appealing aspect of the book is its emphasis on examples drawn from actual use of programs. The final chapter, for instance, is an extended example applying nearly all the preceding material to analysis of a particular sequence, the Drosophila Notch gene.
The book in large measure is directed towards biologists. Readers of this review are likely to be applied mathematicians, statisticians, and computer scientists, hence we try to indicate some of the research problems, and point to some of the more recent work.
...
The DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 <= e < 1, find a shortest sequence S such that every fragment F in F, or its reverse complement, matches a substring of S with at most e|F| errors.
Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2e/(1-e) and at most a factor 1/(1-e) longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of a approximates Sequence Reconstruction within a factor of (1-e)a under the overlap measure. We construct the overlap graph for (1) in O(eN2) time given fragments of total length N at error rate e. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K <= 2E is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V (E + V log V) log V) time. We construct the multiple sequence alignment and consensus sequence for (3) in O(H2L + M + N) time given a layout with at most H mutually overlapping fragments, where L is the length of the alignment and M is the number of pairs of aligned characters in the overlaps.
We evaluate an implementation by comparing the computed reconstruction to the sampled sequence. For a random sequence of length 50,000 sampled at 10% error with 500 fragments of length 500, the software reconstructed the correct layout. For a human DNA sequence of length 50,000 containing 18 repeated elements of length 300 to 2,000 sampled as above at 5% error, the software found a shorter layout, though over 95% of the layout was correct. When covered by three or more fragments, the reconstructed sequence had less than 1 error in 5,000 characters, given input with 1 error in 10.
This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.
Multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence-structure relationships. Until recently, it has been impractical to apply dynamic programming, the most widely accepted method for producing pairwise alignments, to comparisons of more than three sequences. We describe the design and application of a tool for multiple alignment of amino acid sequences that implements a new algorithm that greatly reduces the computational demands of dynamic programming. This tool is able to align in reasonable time as many as eight sequences the length of an average protein.