The structural comparison of proteins is a fundamental problem in Molecular Biology because similar structures often reflect a common origin or functionality. In the Protein Structural Alignment Problem, one seeks the best structural alignment between two proteins, that is, the best superposition of two protein structures, since local alignments can lead to distorted conclusions about the characteristics and functions of the proteins under study. Most current methods for this problem either have a very high computational cost or offer no guarantee of convergence to the best alignment between two proteins. In this work, we propose computational methods for the Protein Structural Alignment Problem that have good guarantees of finding the best alignment in reasonable computational time, using a variety of Global Optimization techniques. An analysis of the performance of each method, in both quantitative and qualitative terms, is presented together with a Pareto plot to facilitate comparison of the methods with respect to solution quality and computational time.
Alignment of protein structures is a fundamental task in computational molecular biology. Good structural alignments can help detect distant evolutionary relationships that are hard or impossible to discern from protein sequences alone. Here, we study the structural alignment problem as a family of optimization problems and develop an approximate polynomial-time algorithm to solve them. For a commonly used scoring function, the algorithm runs in $O(n^{10}/\varepsilon^{6})$ time, for globular protein of length $n$, and it detects alignments that score within an additive error of $\varepsilon$ from all optima. Thus, we prove that this task is computationally feasible, although the method that we introduce is too slow to be a useful everyday tool. We argue that such approximate solutions are, in fact, of greater interest than exact ones because of the noisy nature of experimentally determined protein coordinates. The measurement of similarity between a pair of protein structures used by our algorithm involves the Euclidean distance between the structures (appropriately rigidly transformed). We show that an alternative approach, which relies on internal distance matrices, must incorporate sophisticated geometric ingredients if it is to guarantee optimality and run in polynomial time. We use these observations to visualize the scoring function for several real instances of the problem. Our investigations yield insights on the computational complexity of protein alignment under various scoring functions. These insights can be used in the design of scoring functions for which the optimum can be approximated efficiently and perhaps in the development of efficient algorithms for the multiple structural alignment problem.
We apply a simple method for aligning protein sequences on the basis of a 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment minimizing coordinate difference. Because of its simplicity, our method can be readily modified to take into account additional features of protein structure such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs. We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on what aspects of protein structure are involved (e.g....
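The dynamic-programming half of the cycle described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the two backbones are already superposed (the least-squares fitting step is omitted), and the function name `dp_align`, the similarity form `1 / (1 + (d/d0)^2)`, and the constants `d0` and `gap` are assumptions chosen for illustration.

```python
import math


def dp_align(a, b, d0=2.0, gap=0.5):
    """Global DP alignment of two already-superposed coordinate lists.

    a, b: lists of (x, y, z) tuples, one per residue.
    d0, gap: illustrative (assumed) distance scale and linear gap penalty.
    Returns the aligned index pairs (i, j) in sequence order.
    """
    n, m = len(a), len(b)

    def sim(i, j):
        # Similarity decays with the distance between residue i and residue j.
        d = math.dist(a[i], b[j])
        return 1.0 / (1.0 + (d / d0) ** 2)

    # score[i][j] = best score aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(i - 1, j - 1),  # match i with j
                score[i - 1][j] - gap,                    # gap in b
                score[i][j - 1] - gap,                    # gap in a
            )

    # Trace back from the corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

In a full cycle, the returned pairs would feed a least-squares superposition, and the DP would be rerun on the refitted coordinates until the alignment stops changing.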
In the era of structural genomics, it is necessary to generate accurate structural alignments in order to build good templates for homology modeling. Although a great number of structural alignment algorithms have been developed, most of them ignore intermolecular interactions during the alignment procedure. Therefore, structures in different oligomeric states are barely distinguishable, and it is very challenging to find the correct alignment in coil regions. Here we present a novel approach to structural alignment using a clique finding algorithm and environmental information (SAUCE). In this approach, we build the alignment based on not only structural coordinate information but also realistic environmental information extracted from biological unit files provided by the Protein Data Bank (PDB). First, we eliminate all environmentally unfavorable pairings of residues. Then we identify alignments in core regions via a maximal clique finding algorithm. Two extreme value distribution (EVD) form statistics have been developed to evaluate core region alignments. With an optional extension step, global alignment can be derived based on environment-based dynamic programming linking. We show that our method is able to differentiate three-dimensional structures in different oligomeric states...
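The clique step mentioned in this abstract can be illustrated with a generic correspondence-graph construction. The sketch below is an assumption about how such a step is commonly set up, not a description of SAUCE itself: nodes are candidate residue pairings, edges join pairings whose internal distances agree within a tolerance `tol`, and a plain Bron-Kerbosch enumeration returns the largest clique. The environmental filtering described in the abstract is not modeled, and the names `correspondence_graph` and `max_clique` are hypothetical.

```python
import math
from itertools import combinations, product


def correspondence_graph(a, b, tol=0.5):
    """Nodes are pairings (i, j); edges join distance-compatible pairings."""
    nodes = list(product(range(len(a)), range(len(b))))
    adj = {v: set() for v in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # a residue may be paired at most once
        # Compatible if the intra-structure distances agree within tol.
        if abs(math.dist(a[i], a[k]) - math.dist(b[j], b[l])) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def max_clique(adj):
    """Largest clique found by Bron-Kerbosch enumeration (exponential time)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest so far.
            if len(r) > len(best):
                best = sorted(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best
```

The enumeration is exponential in the worst case, so it is only practical here because the example graphs are tiny; pruning unfavorable pairings before building the graph, as the abstract describes, is what keeps the search space manageable in practice.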
Similarity of protein structures has been analyzed using three-dimensional Delaunay triangulation patterns derived from the backbone representation. It has been found that structurally related proteins have a common spatial invariant part, a set of tetrahedrons, mathematically described as a common spatial subgraph volume of the three-dimensional contact graph derived from Delaunay tessellation (DT). Based on this property of protein structures, we present a novel common volume superimposition (TOPOFIT) method to produce structural alignments. Structural alignments are usually evaluated by a number of equivalent (aligned) positions (Ne) with corresponding root mean square deviation (RMSD). The superimposition of the DT patterns allows one to uniquely identify a maximal common number of equivalent residues in the structural alignment. In other words, TOPOFIT identifies a feature point on the RMSD Ne curve, a topomax point, until which the topologies of two structures correspond to each other, including backbone and interresidue contacts, whereas the growing number of mismatches between the DT patterns occurs at larger RMSD (Ne) after the topomax point. It has been found that the topomax point is present in all alignments from different protein structural classes; therefore...
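The two evaluation quantities named in this abstract, Ne and RMSD, can be computed from a set of equivalent positions as in the short sketch below. It assumes the structures are already superimposed and that the alignment is given as a list of index pairs; the function name `rmsd` and the input layout are illustrative assumptions.

```python
import math


def rmsd(a, b, pairs):
    """Root mean square deviation over the aligned positions.

    a, b: lists of (x, y, z) coordinates, already superimposed.
    pairs: list of (i, j) equivalent positions; Ne is len(pairs).
    """
    if not pairs:
        raise ValueError("RMSD is undefined for an empty alignment")
    squared = sum(math.dist(a[i], b[j]) ** 2 for i, j in pairs)
    return math.sqrt(squared / len(pairs))
```

Evaluating this function for alignments of increasing size gives points on an RMSD versus Ne curve of the kind the abstract refers to.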
We present MASS (Multiple Alignment by Secondary Structures), a novel highly efficient method for structural alignment of multiple protein molecules and detection of common structural motifs. MASS is based on a two-level alignment, using both secondary structure and atomic representation. Utilizing secondary structure information aids in filtering out noisy solutions and achieves efficiency and robustness. Currently, only a few methods are available for addressing the multiple structural alignment task. In addition to using secondary structure information, the advantage of MASS as compared to these methods is that it is a combination of several important characteristics: (1) While most existing methods are based on series of pairwise comparisons, and thus might miss optimal global solutions, MASS is truly multiple, considering all the molecules simultaneously; (2) MASS is sequence order-independent and thus capable of detecting nontopological structural motifs; (3) MASS is able to detect not only structural motifs, shared by all input molecules, but also motifs shared only by subsets of the molecules. Here, we show the application of MASS to various protein ensembles. We demonstrate its ability to handle a large number (order of tens) of molecules...
A novel method is presented for joint prediction of alignment and common secondary structures of two RNA sequences. The joint consideration of common secondary structures and alignment is accomplished by structural alignment over a search space defined by the newly introduced motif called matched helical regions. The matched helical region formulation generalizes previously employed constraints for structural alignment and thereby better accommodates the structural variability within RNA families. A probabilistic model based on pseudo free energies obtained from precomputed base pairing and alignment probabilities is utilized for scoring structural alignments. Maximum a posteriori (MAP) common secondary structures, sequence alignment and joint posterior probabilities of base pairing are obtained from the model via a dynamic programming algorithm called PARTS. The advantage of the more general structural alignment of PARTS is seen in secondary structure predictions for the RNase P family. For this family, the PARTS MAP predictions of secondary structures and alignment perform significantly better than prior methods that utilize a more restrictive structural alignment model. For the tRNA and 5S rRNA families, the richer structural alignment model of PARTS does not offer a benefit and the method therefore performs comparably with existing alternatives. For all RNA families studied...
SARSA is a web tool that can be used to align two or more RNA tertiary structures. The basic idea behind SARSA is that we use the vector quantization approach to derive a structural alphabet (SA) of 23 nucleotide conformations, via which we transform RNA 3D structures into 1D sequences of SA letters and then utilize classical sequence alignment methods to compare these 1D SA-encoded sequences and determine their structural similarities. In SARSA, we provide two RNA structural alignment tools, PARTS for pairwise alignment of RNA tertiary structures and MARTS for multiple alignment of RNA tertiary structures. Particularly in PARTS, we have implemented four kinds of pairwise alignments for a variety of practical applications: (i) global alignment for comparing whole structural similarity, (ii) semiglobal alignment for detecting structural motifs, (iii) local alignment for finding locally similar substructures and (iv) normalized local alignment for eliminating the mosaic effect of local alignment. Both tools in SARSA take as input RNA 3D structures in the PDB format and in their outputs provide graphical display that allows the user to visually view, rotate and enlarge the superposition of aligned RNA molecules. SARSA is available online at http://bioalgorithm.life.nctu.edu.tw/SARSA/.
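The encode-then-align idea in this abstract can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the three-letter `CODEBOOK` of two-angle centroids stands in for the 23-letter structural alphabet, the scoring constants are arbitrary, and `encode` and `local_score` are hypothetical names rather than SARSA functions. The local alignment score uses the standard Smith-Waterman recurrence.

```python
# Hypothetical codebook: SA letter -> centroid in a 2-D angle space (degrees).
CODEBOOK = {"A": (-60.0, 180.0), "B": (60.0, 60.0), "C": (180.0, -60.0)}


def angle_diff(x, y):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(x - y) % 360.0
    return min(d, 360.0 - d)


def encode(conformations, codebook=CODEBOOK):
    """Vector quantization: map each conformation to its nearest letter."""
    letters = []
    for conf in conformations:
        nearest = min(
            codebook,
            key=lambda k: sum(
                angle_diff(u, v) ** 2 for u, v in zip(conf, codebook[k])
            ),
        )
        letters.append(nearest)
    return "".join(letters)


def local_score(s, t, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score of two SA-encoded strings."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best
```

Once structures are reduced to letter strings, the global and semiglobal variants listed in the abstract differ from this local version only in how the first row and column are initialized and where the final score is read.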
MolLoc stands for Molecular Local surface comparison, and is a web server for the structural comparison of molecular surfaces. Given two structures in PDB format, the user can compare their binding sites, cavities or any arbitrary residue selection. Moreover, the web server allows the comparison of a query structure with a list of structures. Each comparison produces a structural alignment that maximizes the extension of the superimposition of the surfaces, and returns the pairs of atoms with similar physicochemical properties that are close in space after the superimposition. Based on this subset of atoms sharing similar physicochemical properties, a new rototranslation is derived that best superimposes them. The MolLoc approach is both local and surface-oriented, and therefore it can be particularly useful when testing if molecules with different sequences and folds share any local surface similarity. The MolLoc web server is available at http://bcb.dei.unipd.it/MolLoc.
A novel method is presented for predicting the common secondary structures and alignment of two homologous RNA sequences by sampling the ‘structural alignment’ space, i.e. the joint space of their alignments and common secondary structures. The structural alignment space is sampled according to a pseudo-Boltzmann distribution based on a pseudo-free energy change that combines base pairing probabilities from a thermodynamic model and alignment probabilities from a hidden Markov model. By virtue of the implicit comparative analysis between the two sequences, the method offers an improvement over single sequence sampling of the Boltzmann ensemble. A cluster analysis shows that the samples obtained from joint sampling of the structural alignment space cluster more closely than samples generated by the single sequence method. On average, the representative (centroid) structure and alignment of the most populated cluster in the sample of structures and alignments generated by joint sampling are more accurate than single sequence sampling and alignment based on sequence alone, respectively. The ‘best’ centroid structure that is closest to the known structure among all the centroids is, on average, more accurate than structure predictions of other methods. Additionally...
Comparing the 3D structures of proteins is an important but computationally hard
problem in bioinformatics. In this paper, we propose studying the problem when
much less information or assumptions are available. We model the structural
alignment of proteins as a combinatorial problem. In the problem, each protein
is simply a set of points in the 3D space, without sequence order information,
and the objective is to discover all large enough alignments for any subset of
the input. We propose a data-mining approach for this problem. We first perform
geometric hashing of the structures such that points with similar locations in
the 3D space are hashed into the same bin in the hash table. The novelty is that
we consider each bin as a coincidence group and mine for frequent
patterns, which is a well-studied technique in data mining. We
observe that these frequent patterns are already potentially large alignments.
Then a simple heuristic is used to extend the alignments if possible. We
implemented the algorithm and tested it using real protein structures. The
results were compared with existing tools. They showed that the algorithm is
capable of finding conserved substructures that do not preserve sequence order...
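The binning idea in this abstract can be illustrated with a deliberately simplified sketch. It assumes all structures are already expressed in one common reference frame (full geometric hashing would repeat the binning over many candidate frames), and it reduces frequent-pattern mining to counting how many bins each pair of structures shares. The names `hash_points` and `frequent_pairs`, the grid size `cell`, and the threshold `min_support` are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def hash_points(structures, cell=1.0):
    """Hash every point into a cubic grid bin.

    structures: dict mapping a structure name to a list of (x, y, z) points.
    Returns a dict mapping each bin key to the set of names with a point in it.
    """
    bins = defaultdict(set)
    for name, points in structures.items():
        for x, y, z in points:
            key = (int(x // cell), int(y // cell), int(z // cell))
            bins[key].add(name)
    return bins


def frequent_pairs(bins, min_support=2):
    """Treat each bin as a coincidence group and count co-occurring pairs."""
    counts = Counter()
    for names in bins.values():
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```

A pair of structures that shares many bins is a candidate alignment seed, which a later extension heuristic of the kind the abstract mentions could then try to enlarge.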
Recent studies have shown that RNA structural motifs play essential roles in RNA folding and interaction with other molecules. Computational identification and analysis of RNA structural motifs remains a challenging task. Existing motif identification methods based on 3D structure may not properly compare motifs with high structural variations. Other structural motif identification methods consider only nested canonical base-pairing structures and cannot be used to identify complex RNA structural motifs that often consist of various non-canonical base pairs due to uncommon hydrogen bond interactions. In this article, we present a novel RNA structural alignment method for RNA structural motif identification, RNAMotifScan, which takes into consideration the isosteric (both canonical and non-canonical) base pairs and multi-pairings in RNA structural motifs. The utility and accuracy of RNAMotifScan are demonstrated by searching for kink-turn, C-loop, sarcin-ricin, reverse kink-turn and E-loop motifs against a 23S rRNA (PDBid: 1S72), which is well characterized for the occurrences of these motifs. Finally, we search for these motifs in the RNA structures of the entire Protein Data Bank and estimate their abundances. RNAMotifScan is freely available at our supplementary website (http://genome.ucf.edu/RNAMotifScan).
Protein structure comparison by pairwise alignment is commonly used to identify highly similar substructures in pairs of proteins and provide a measure of structural similarity based on the size and geometric similarity of the match. These scores are routinely applied in analyses of protein fold space under the assumption that high statistical significance is equivalent to a meaningful relationship. However, the truth of this assumption has previously been difficult to test, since there is a lack of automated methods that do not rely on the same underlying principles. To resolve this, we present a method based on topological descriptions of global protein structure, providing an independent means to assess the ability of structural alignment to maintain meaningful structural correspondences on a large scale.
Motivation: Progress in protein biology depends on the reliability of results from a handful of computational techniques, structural alignments being one. Recent reviews have highlighted substantial inconsistencies and differences between alignment results generated by the ever-growing stock of structural alignment programs. The lack of consensus on how the quality of structural alignments must be assessed has been identified as the main cause for the observed differences. Current methods assess structural alignment quality by constructing a scoring function that attempts to balance conflicting criteria, mainly alignment coverage and fidelity of structures under superposition. This traditional approach to measuring alignment quality, the subject of considerable literature, has failed to solve the problem. Further development along the same lines is unlikely to rectify the current deficiencies in the field.
Thesis (Ph. D.)--University of Rochester. Dept. of Electrical and Computer Engineering, 2010.; In this thesis, the problem of structural alignment of homologous RNA sequences is addressed.
The structural alignment of a given set of RNA sequences is a secondary structure
for each sequence, such that the structures are similar to each other, and a sequence alignment
between the sequences that conforms to the secondary structures. A solution
to this problem was proposed by Sankoff as a dynamic programming algorithm whose time
and memory complexities are polynomial in the length of the shortest sequence and exponential
in the number of input sequences, respectively. Variants of Sankoff’s method employ
constraints that reduce the computation by restricting the allowed alignments or structures.
In the first part of the thesis, a new methodology is presented for the purpose of
establishing alignment constraints based on nucleotide alignment and insertion posterior
probabilities. Using a hidden Markov model, posterior probabilities of alignment and insertion
are computed and these probabilities are additively combined to obtain probabilities
of co-incidence. The constraints on alignments are computed by adaptively thresholding
these probabilities to determine co-incidence constraints for pruning of computations that
hold with high probability. The proposed constraints are implemented into Dynalign...
We performed a phylogenetic analysis of the crustacean class Remipedia. For this purpose, we generated sequences of three different molecular markers, 16S rRNA (16S), histone 3 (H3), and cytochrome c oxidase subunit I (COI). The analyses included sequences from 20 of the 27 recent species of Remipedia, plus four still-undescribed species. The data matrix was complemented with sequences from online databases (The European Molecular Biology Laboratory and GenBank®). Campodea tillyardi (Diplura), Hutchinsoniella macracantha (Cephalocarida), Penaeus monodon (Malacostraca) and Branchinella occidentalis (Branchiopoda) served as out-groups. In addition to the classic computer-based alignment methods used for protein-coding markers (H3 and COI), an alternative approach combining structural alignment and manual optimization was used for 16S. The results of our analyses uncovered several inconsistencies with the current taxonomic classification of Remipedia. Godzilliidae and the genera Speleonectes and Lasionectes are polyphyletic, while Speleonectidae emerges as a paraphyletic group. We discuss current taxonomic diagnoses based on morphologic characters, and suggest a taxonomic revision that accords with the topologies of the phylogenetic analyses. Three new families (Kumongidae...
Circular permutation connects the N and C termini of a protein and
concurrently cleaves elsewhere in the chain, providing an important mechanism
for generating novel protein folds and functions. However, their prevalence in genomes is
unknown because current detection methods can miss many occurrences, mistaking
random repeats for circular permutations. Here we develop a method for detecting
circularly permuted proteins from structural comparison. Sequence order
independent alignment of protein structures can be regarded as a special case
of the maximum-weight independent set problem, which is known to be
computationally hard. We develop an efficient approximation algorithm by
repeatedly solving relaxations of an appropriate intermediate integer
programming formulation, and we show that the approximation ratio is much better
than the theoretical worst-case ratio of $r = 1/4$. Circularly permuted
proteins reported in literature can be identified rapidly with our method,
while they escape the detection by publicly available servers for structural
alignment.; Comment: 5 pages, 3 figures, Accepted by IEEE-EMBS 2004 Conference Proceedings
This article is available from: http://www.biomedcentral.com/1471-2105/8/425; See the attached files: http://www.biomedcentral.com/1471-2105/8/425/additional/; [Background] The task of computing highly accurate structural alignments of proteins in very short
computation time is still challenging. This is partly due to the complexity of protein structures.
Therefore, instead of manipulating coordinates directly, matrices of inter-atomic distances, sets of
vectors between protein backbone atoms, and other reduced representations are used. These
decrease the effort of comparing large sets of coordinates, but protein structural alignment still
remains computationally expensive.; [Results] We represent the topology of a protein structure through a structural profile that
expresses the global effective connectivity of each residue. We have shown recently that this
representation allows explicitly expressing the relationship between protein structure and protein
sequence. Based on this very condensed vectorial representation, we develop a structural
alignment framework that recognizes structural similarities with accuracy comparable to
established alignment tools. Furthermore, our algorithm has favourable scaling of computation time
with chain length. Since the algorithm is independent of the details of the structural representation...
Systematic research on noncoding RNAs (ncRNAs) has revealed that many ncRNAs are actively involved in various biological networks. Therefore, in order to fully understand the mechanisms of these networks, it is crucial to understand the roles of ncRNAs. Unfortunately, the annotation of ncRNA genes that give rise to functional RNA molecules has begun only recently, and it is far from being complete. Considering the huge amount of genome sequence data, we need efficient computational methods for finding ncRNA genes. One effective way of finding ncRNA genes is to look for regions that are similar to known ncRNA genes. As many ncRNAs have well-conserved secondary structures, we need statistical models that can represent such structures for this purpose. In this paper, we propose a new method for representing RNA sequence profiles and finding structural alignment of RNAs based on profile context-sensitive hidden Markov models (profile-csHMMs). Unlike existing models, the proposed approach can handle any kind of RNA secondary structures, including pseudoknots. We show that profile-csHMMs can provide an effective framework for the computational analysis of RNAs and the identification of ncRNA genes.
A novel RNA structural alignment method has been proposed based on profile-csHMMs. In principle, the profile-csHMM based approach can handle any kind of RNA secondary structures including pseudoknots, and it has been shown that the proposed approach can find highly accurate RNA alignments. In order to find the optimal alignment, the method employs the SCA algorithm that can be used for finding the optimal state sequence of profile-csHMMs. The computational complexity of the SCA algorithm is not fixed, and it depends on the so-called adjoining order that describes how we can trace-back the optimal state sequence in a given profile-csHMM. Therefore, for fast RNA structural alignments, it is important to find the adjoining order that has the minimum computational cost. In this paper, we propose an efficient algorithm that can systematically find the optimal adjoining order that minimizes the computational cost for finding the RNA alignments. Numerical experiments show that employing the proposed algorithm can make the alignment speed up to 3.6 times faster, without any degradation in the quality of the RNA alignments.