Un alineamiento de secuencias en bioinformática es una forma de mostrar DNA, RNA, o estructuras primarias proteicas para resaltar las zonas de similitud, que podrían indicar relaciones funcionales o evolutivas entre los genes o proteínas consultados. Las secuencias alineadas se escriben con las letras (representando aminoácidos o nucleótidos) en columnas en las que se insertan espacios para que las zonas con idéntica o similar estructura se alineen.
Si dos secuencias en un alineamiento comparten un ancestro común, las no coincidencias pueden interpretarse como puntos de mutación, y los huecos como indels (mutaciones de inserción o deleción) introducidas en uno o ambos linajes en el tiempo que pasa desde que divergieron. En el alinamiento de secuencias proteicas, el grado de simitiud entre los aminoácidos que ocupan una posición concreta en la secuencia puede interpretarse como una medida aproximada de conservación en una región particular o motivos de secuencia entre linajes. La ausencia de sustituciones, o la presencia de sustituciones muy conservadas (la sustitución de aminoácidos cuya cadena lateral tiene propiedades químicas similares) en una secuencia de una región concreta indica que la zona tiene importancia estructural o funcional. Aunque las bases de lo nucleótidos del DNA y RNA son más similares entre sí que con los aminoácidos, la conservación del emparejado de bases podría indicar papeles funcionales o estructurales similares. El alineamiento de secuencias puede utilizarse con secuencias no biológicas, como en la identificación de similitudes en series de letras y palabras del lenguaje humano.
Secuencias muy cortas o muy similares pueden alinearse manualmente. Aún así, los problemas más interesantes necesitan alinear secuencias largas, muy variables y extremadamente numerosas que no pueden ser alineadas por humanos. El conocimiento humano se aplica principalmente en la construcción de algoritmos que produzcan alineamientos de alta calidad, y excepcionalmente ajustando el resultado final para representar patrones que son difíciles de introducir en algoritmos (especialmente en el caso de secuencias de nucleótidos). Las aproximaciones computacionales al alineamiento de secuencias se diviven en dos categorías: alineamiento global y alineamiento local. Calcular un alineamiento global es una forma de optimización global que obliga al alineamiento a ocupar la longitud total de todas las secuencias introducidas. Comparativamente, los alineamientos locales identifican regiones de similaridad dentro de largas secuencas que normalmente son muy divergentes entre sí. A menudo se prefieren los alineamientos locales, pero pueden ser más difíciles de calcular porque se añade el desafío de identificar las regiones se similaridad. Se aplican gran variedad de algoritmos computacionales al problema de alineamiento de secuencias, como métodos lentos y optimizadores de programación dinámica y métodos eficientes de heurística o probabilística diseñados para búsqueda a gran escala en bases de datos.
Los alineamientos se representan normalmente con un formato gráfico y de texto. En casi todas las representaciones de alineamientos, las secuencias se escriben en filas de forma que los residuos alineados aparecen en columnas sucesivas. En los formatos de texto, las columnas alineadas contienen caracteres idénticos o similares, estos últimos indicados con sistema de símbolos de conservados. En la imagen superior se utiliza el asterisco para mostrar identidad entre dos columnas. Otros símbolos menos comunes son la coma para sustituciones conservativas y el punto para sustituciones semiconservativas. Muchos programas de visualización de secuencias utilizan también esquemas coloreados para mostrar información de las propiedades de los elementos secuencia individuales; en secuencias de DNA y RNA significa asignar a cada base su propio color. En alineamientos de proteínas, como el de la imagen superior, los colores se utilizan para indicar propiedades de los aminoácidos para ayudar en la caracterización de conservación o en una sustitución aminoacídica dada. Cuando se introducen múltiples secuencias la última fila de cada columna suele representar la secuencia consenso determinada por el alineamiento.
Los alineamientos de secuencias pueden almacenarse en una amplia variedad de formatos de archivo de texto, muchos de los cuales han sido desarrollados a la vez que un programa o implementación de alineamiento. La mayoría de las herramientas web permiten varios formatos de entrada y salida, como el formato FASTA y GenBank. La utilización de herramientas específicas en cada laboratorio de investigación puede complicarse por la baja compatibilidad. Existe un programa de conversión genérica en READSEQ.
Alineamientos locales y globales Editar
Los alinamientos globales, que intentan alinear cada residuo en cada secuencia, son más útiles cuando las secuencias inicicales son similares y aproximadamente del mismo tamaño (no quiere decir que los alineamientos globales no puedan terminar en huecos). Una estrategia general de alineamiento global es el algoritmo de Needleman-Wunsch basado en programación dinámica. Los alineamientos locales son más útiles para secuencias diferenciadas en las que se sospecha que existen regiones de similaridad o motivos de secuencias similares dentro de un contexto mayor. El algoritmo Smith-Waterman es un método general de alineamiento local basado en programación dinámica. Con las suficientes secuencias similares, no existe diferencia entre alineamientos globales y locales.
Los métodos híbridos, conocidos como semiglobales o métodos "glocales" intentan encontrar el mejor alineamiento posible que incluya el inicio de una u otra secuencia, y el final de una o de la otra. Puede ser especialmente útil cuando la parte final de una secuencia se solapa con la parte inicial de la otra. En este caso, ni el alineamiento global ni el local son completamente adecuados: un alineamiento global intentará forzar a la alineación a extenderse más allá de la región de solapamiento, mientras que el alineamiento local no cubrirá totalmente la región solapada.
Pairwise alignment Editar
Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision, such as searching a database for sequences with high homology to a query. The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods; however, most multiple sequence alignment techniques can align only two sequences. Although each method has its individual strengths and weaknesses, all three methods have difficulty with highly repetitive sequences of low information content, especially where the number of repetitions may be different in the two sequences to be aligned.
The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and simple, though time-consuming to analyze on a large scale. It is very easy to visually identify certain sequence features—such as insertions, deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.
Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar structural domains.
The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.)
Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. Although the BLAST family of database search methods includes tools for working with translated sequences, more general methods are only available as commercial software, such as FrameSearch, distributed as part of the Accelrys GCG package.
The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of or extremely long sequences.
Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family. Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.
In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.
Multiple sequence alignment Editar
Multiple sequence alignment (MSA) is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. MSAs are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems. Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.
The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and memory, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package.
Progressive, hierarchical, or tree methods generate an MSA by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to FASTA. Progessive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive MSA methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.
Many variations of the Clustal progressive implementation are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee; implementations can be found at ClustalW and T-Coffee.
Iterative methods attempt to improve on the weak point of the progressive methods, the heavy dependence on the accuracy of the initial pairwise alignments. Iterative methods optimice an objective function based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's MSA. Various ways of selecting the sequence subgroups and objective function are reviewed in .
Motif finding, also known as profile analysis, constructs global MSAs that attempt to align short conserved sequence motifs among the sequences in the query set. This is usually done by first constructing a general global MSA, after which the highly conserved regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original data set contained a small number of sequences, or only highly related sequences, pseudocounts are added to normalize the character distributions represented in the motif.
Techniques inspired by computer scienceEditar
A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Most recently hidden Markov models have been used to produce probability scores for a family of possible MSAs for a given query set. Genetic algorithms and simulated annealing have also been used in optimizing MSA scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article multiple sequence alignment.
Structural alignment Editar
Plantilla:Main Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence, structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.
The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at EBI DALI and the FSSP is located at The Dali Database.
SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments, and has been used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds. The CATH database can be accessed at CATH Protein Structure Classification.
The combinatorial extension (CE) method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment. Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor hydrophobicity, local alignments called "aligned fragment pairs" (AFPs) are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the CE alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the PDB is located at the Combinatorial Extension website.
Phylogenetic analysis Editar
Plantilla:Main Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness. The field of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence homology suggests that the sequences in question have a comparatively young most recent common ancestor, while low homology suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore it does not account for possible difference among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.
Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble MSAs and phylogenetic trees score and sort trees first and calculate an MSA from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal MSA, is NP-hard.
Assessment of significanceEditar
Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.
In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.
The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Percent Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introducción of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.
Plantilla:Main Common software tools used for general sequence alignment tasks include ClustalW and T-coffee for alignment, and BLAST for database searching. A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software.
Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE. The dataset consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.
- ↑ Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S. (2003). Glocal alignment: finding rearrangements during alignment. Bioinformatics 19 Suppl 1:i54–62.
- ↑ 2,0 2,1 Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
- ↑ Wang L, Jiang T. (1994) On the complexity of multiple sequence alignment. J Comput Biol 1:337-348.
- ↑ Lipman DJ, Altschul SF, Kececioglu JD. (1989). A tool for multiple sequence alignment. Proc Natl Acad Sci USA 86:4412-4415.
- ↑ Higgins DG, Sharp PM. (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237-44.
- ↑ Thompson JD, Higgins DG, Gibson TJ. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673-4680.
- ↑ Notredame C, Higgins DG, Heringa J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205-17.
- ↑ Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. (1995). Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13-18.
- ↑ Chothia C, Lesk AM. (1986). The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823-6.
- ↑ 10,0 10,1 Zhang Y, Skolnick J. (2005). The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 102(4):1029-34.
- ↑ Holm L, Sander C (1996). Mapping the protein universe. Science 273: 595-603.
- ↑ Taylor WR, Flores TP, Orengo CA. (1994). Multiple protein structure alignment. Protein Sci 3(10):1858-70.
- ↑ Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. (1997) CATH: A hierarchical classification of protein domain structures. Structure. 5(8): 1093-1108.
- ↑ Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9) 739-747.
- ↑ Felsenstein J. (2004). Inferring Phylogenies Sinauer Associates: Sunderland, MA.
- ↑ Thompson JD, Plewniak F, Poch O. (1999). BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15: 87-88.
- ↑ Thompson JD, Plewniak F, Poch O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 7(13):2682-90.