With the rapid development of structure determination methods, the number of protein structures deposited in the PDB has been growing exponentially. This enables us to understand more about protein structure-function relationships. It is important, however, to classify these structures according to their evolutionary origin. For proteins of known structure, the Structural Classification of Proteins (SCOP) database provides a comprehensive description of the structural and evolutionary relationships. For a protein structural domain, the family and superfamily levels describe near and distant evolutionary relationships, respectively, whereas the fold level describes geometrical relationships. SCOPe extends SCOP through a combination of automation and manual curation (1).
Sequence alignments of proteins are of key importance in understanding structural, evolutionary and functional relationships between proteins. Proteins that descend from a common ancestor and fall within the same superfamily can, however, be highly divergent in sequence, rendering routine sequence alignment methods inappropriate. In such cases, structure-based sequence alignments of superfamily members serve as a guiding evolutionary model, which can be used for the identification of conserved residues or motifs, modelling of distant homologues, genome-wide sequence searches, as well as genome-wide association studies leading to SNP identification and drug discovery (2,3).
Protein Alignments organized as Structural Superfamilies (PASS2) is a database that provides such alignments for protein domain superfamilies and has been updated continuously since 2002 (2). Protein structural domains within a superfamily that share less than 40% sequence identity with one another are identified from SCOPe and considered for structure-based sequence alignment and subsequent annotation. The sequence identity filter helps to avoid redundancy and hence reduces the computational time required for the rigorous alignment protocol (4). It also enables us to avoid alignments between closely related protein domains, which can be achieved by automatic multiple sequence alignment algorithms. The alignment produced for each superfamily is annotated and systematically included in the PASS2 database. In the present update, we have assembled structure-based sequence alignments of 2006 superfamilies with close to 14000 structures of protein domains. Features such as solvent accessibility, Hidden Markov Model (HMM) profiles and absolutely conserved residues are also provided alongside the alignments. While aiming at automation, we have modified the pipeline where required. We also discuss how outliers and extreme outliers were handled in cases where aligning all structural domains within a superfamily was difficult. A few case studies are discussed for further exemplification. The current update, PASS2.6, is in accordance with SCOPe 2.06.
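The effect of a sequence identity filter of this kind can be illustrated with a minimal sketch. It is not the PASS2 pipeline itself: it assumes the domain sequences are already aligned to equal length, uses a simple greedy selection, and the function names (`percent_identity`, `filter_by_identity`) and the toy sequences are hypothetical. Identity is computed here over columns in which neither sequence has a gap, which is only one of several possible definitions.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: columns containing a gap ('-') in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for x, y in aligned if x == y)
    return 100.0 * matches / len(aligned)


def filter_by_identity(domains: Dict[str, str], cutoff: float = 40.0) -> List[str]:
    """Greedily keep domains that are below `cutoff` % identity to every kept domain."""
    kept: List[str] = []
    for name, seq in domains.items():
        if all(percent_identity(seq, domains[k]) < cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy superfamily; real input would come from SCOPe domain sequences.
toy = {
    "dom_a": "MKVLAAGIVG",
    "dom_b": "MKVLAAGIVA",  # 90% identical to dom_a, so it is treated as redundant
    "dom_c": "MRTL-SGQWE",  # about 33% identical to dom_a, so it is retained
}
print(filter_by_identity(toy))  # ['dom_a', 'dom_c']
```

In this sketch the retained set depends on the input order, since the first member encountered is always kept; a production protocol would typically choose representatives by an explicit criterion.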