ANR "Modèles Numériques" GREMSAP (submitted)

Title of the project

Grid Evolutionary Multiple Sequence Alignment Platform (GREMSAP)

Participants

LSIIT
IGBMC
ISC-PIF
INAF

Summary

The aim of the GREMSAP project (GRid Evolutionary Multiple Sequence Alignment Platform) is to provide an autonomic, massively parallel evolutionary platform for Multiple Sequence Alignment (MSA). Multiple sequence alignment is a cornerstone of modern molecular biology, providing the basis for structural, functional and evolutionary studies of proteins. Proteins are the molecular workhorses of biology, responsible for carrying out a tremendous range of essential functions, such as catalysis, transportation of nutrients, and recognition and transmission of signals.

Given an uncharacterized protein, the analysis of its amino acid sequence via multiple alignment can reveal important information about its 3D structure and its functional role in the cell. Protein sequence analysis is therefore one of the most widely studied fields in bioinformatics, mathematics, physics and chemistry, and the comparison (or alignment) of protein sequences is an essential prerequisite in diverse areas of modern biology, such as elucidation of the tree of life, studies of epidemiology and virulence, drug design, human genetics, cancer or biodiversity.

The context in computer science is changing: powerful computers are now multiscale, massively parallel systems. The multidisciplinary GREMSAP project consists in creating a number of original MSA workflows, optimized for evolutionary, structural or functional analyses, which can exploit massively parallel systems. First, as it is still not clear what good objective functions are in MSA, we will find new Biologically-Significant Objective Functions (BSOFs) through massively parallel Genetic Programming. Second, to address the problem of complex protein sequences that contain semantically different regions, GREMSAP will implement a multi-objective MSA algorithm that uses several BSOFs simultaneously, without aggregating them, in order to obtain different Pareto-optimal MSAs. Third, the high-scoring regions in each Pareto-optimal MSA will be identified and combined to provide the user with a single, final MSA.
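To illustrate what "Pareto-optimal" means when several objective functions are used without aggregation, here is a minimal Python sketch. It is not the GREMSAP algorithm: the candidate names, the two score values per candidate, and the helper functions `dominates` and `pareto_front` are all hypothetical, and every objective is assumed to be maximized.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if score vector a is at least as good as b on every objective
    and strictly better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of the candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidate alignments, each scored by two made-up objectives.
candidates = {
    "msa_a": (0.90, 0.40),  # best on objective 1
    "msa_b": (0.60, 0.80),  # best on objective 2
    "msa_c": (0.50, 0.30),  # worse than msa_a and msa_b on both objectives
}

front = pareto_front(candidates)
print(front)
```

In this toy case `msa_a` and `msa_b` are both kept because neither is better on both objectives, while `msa_c` is discarded. Keeping the whole front, rather than a weighted sum of the scores, is what leaves several alternative alignments available for the final combination step.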

Finally, all the algorithms will be developed on a research grid / cloud of machines that will constitute a 500k€ platform in a single cabinet, hosted by Ecole Polytechnique and operated by LLR. The platform will be shared with the BioEmergences platform, the France Bioimaging infrastructure and the Morphoscope EquipEx, which already represent a 2M€ grid. All in all, GREMSAP (and other scientists) will have access to a 2.5M€ research cloud that will be accessible individually or collectively, with priority based on quasi-optimal Gittins indices (branching bandit processes with stochastic arrivals).

Résumé

The aim of the GREMSAP project (Grid Evolutionary Multiple Sequence Alignment Platform) is to provide an autonomic, massively parallel evolutionary platform for multiple sequence alignment. Multiple sequence alignment is a cornerstone of modern molecular biology, providing the foundation for studying the structure, function and evolution of proteins. Proteins are the basic building blocks of biology, responsible for carrying out a very wide range of essential functions, such as catalysis, transport of nutrients, and recognition and transmission of signals. Analysing the amino acid sequence of an uncharacterized protein by means of multiple sequence alignment can reveal important information about its 3D structure and its functional role in the cell. Protein sequence analysis is therefore one of the major fields of bioinformatics, mathematics, physics and chemistry, and the comparison (or alignment) of protein sequences is an essential tool in diverse areas of modern biology, such as elucidating the phylogeny of living organisms, studies in epidemiology and virology, drug design, human genetics, cancer or biodiversity.

A revolution is under way in scientific computing: powerful computers are now multiscale, massively parallel machines. The multidisciplinary GREMSAP project consists in creating multiple sequence alignment workflows, optimized for functional, structural or evolutionary analyses, that can exploit these massively parallel systems. As it is still not known what a good objective function would be for multiple sequence alignment, we will find new Biologically-Significant Objective Functions (BSOFs) through massively parallel Genetic Programming.

Second, to tackle the problem of complex protein sequences that contain semantically different regions, GREMSAP will implement a multi-objective multiple sequence alignment algorithm that will use several BSOFs simultaneously, without aggregating them, to obtain different Pareto-optimal multiple alignments.

Finally, the best regions of each Pareto-optimal multiple sequence alignment will be identified and recombined to give the user a single best multiple sequence alignment. All of these algorithms will be implemented on a research computing grid / cloud that will constitute a 500k€ platform fitting in a single cabinet hosted at Ecole Polytechnique and managed by LLR. It will be shared with the BioEmergences platform, the France Bioimaging infrastructure and the Morphoscope EquipEx, which together have 2M€ worth of machines, giving a 2.5M€ research cloud accessible on a priority basis founded on quasi-optimal Gittins indices.

Objectives

The aims of this project are both novel and ambitious: its completion will result in an original, scalable, autonomic, massively parallel platform for the construction of high-quality multiple alignments of protein sequences.

The major goals of our project are to:

Devise new Biologically Significant Objective Functions for MSAs, as well as massively parallel algorithms that can exploit these Objective Functions to obtain significant improvements in the efficiency of the complete construction process and the quality of the final alignment. These developments represent a major scientific challenge that requires new concepts, algorithms and technologies. The project will deliver highly sophisticated computational workflows that must be both portable, so that the advances can be disseminated to a large community of scientists, and sustainable, so that the developments can be adapted to future hardware in a rapidly evolving field.

Provide scientists with an autonomic research grid/cloud platform that will determine the best alignment depending on each BSOF found above. We plan to exploit the recent results of Partner 1 in the field of artificial evolution algorithms, of Partner 2 in the field of MSA, and of Partners 3 and 4 on OpenMOLE for cloud/grid computing on EGI, to design efficient, scalable, massively parallel MSA algorithms.

The scalability of the system is ensured by the well-established EGI grid, and the user interface will be provided by an extension of OpenMOLE into a generic GPU/CPU cloud/grid autonomic computing platform on EGI. This is a very ambitious undertaking, combining CPU and GPU clusters, cloud and grid computing, and the unavoidable autonomic computing, i.e. self-aware, self-healing, self-optimizing and (the most difficult challenge, not yet realized) self-configuring.

Workplan

This multidisciplinary project has several aims:

Developing scalable (up to exascale), massively parallel algorithms for efficient construction of multiple alignments of protein sequences, that can run on an unlimited number of GPGPU machines organised as a grid or as a cloud, with maximum performance, both in terms of speedup and quality of results.

Developing new methods for multiple sequence alignment, namely a Genetic Programming algorithm, to find Biologically Significant Objective Functions to be used by a multi-objective evolutionary Multiple Sequence Alignment algorithm based on Pareto-ranking, in order to find better results both in terms of quality and number of sequences to be aligned.

Putting together a massively parallel evolutionary multiple sequence alignment platform (GREMSAP), shared with the virtual organization vo.iscpif.fr, together with a dedicated software platform that gives biologist users transparent access to the computational power. The software part will be based on the generic software OpenMOLE, a CPU cloud/grid computing system on EGI, which will be extended to hybrid GPU-CPU cloud/grid autonomic computing.
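For readers unfamiliar with what an MSA objective function looks like in practice, the sketch below computes the classical sum-of-pairs score, a widely used baseline of the kind that the Genetic Programming step would aim to improve upon. It is only an illustration under stated assumptions: the function name `sum_of_pairs`, the match/mismatch/gap values and the toy alignment are hypothetical, and this is not one of the project's Biologically Significant Objective Functions.

```python
from itertools import combinations
from typing import List


def sum_of_pairs(alignment: List[str], match: int = 1,
                 mismatch: int = -1, gap: int = -2) -> int:
    """Sum-of-pairs score: score every pair of residues in every column.

    The match/mismatch/gap values are arbitrary illustrative defaults;
    real tools typically use a substitution matrix instead.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):           # iterate over alignment columns
        for a, b in combinations(column, 2):  # every unordered pair of rows
            if a == "-" and b == "-":
                continue                      # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


# Toy alignment of three short sequences ('-' denotes a gap).
toy_alignment = ["AC-GT", "ACAGT", "A--GT"]
toy_score = sum_of_pairs(toy_alignment)
print(toy_score)
```

A function with this signature (alignment in, number out) is the kind of object a multi-objective algorithm would evaluate several of at once on each candidate alignment.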

The computing platform has been dimensioned so as to provide the maximum computing power in one cabinet. Hosted by Ecole Polytechnique and administered by LLR (cf. letters in annex), it will have 40 nodes, each comprising 3 NVidia C2090 cards that each provide 1.33 TFlops, i.e. about 160 TFlops in total. Even without supra-linear speedup, this cabinet would rank 73rd in the Top 500 list of supercomputers as of November 2011. Moreover, NVidia has announced a new generation of 28nm processors, each expected to yield a computing power of 5 TFlops, which should be available by the time the project starts if it is funded. For the same price, the machine would then have a computing power of 600 TFlops, and would therefore rank 20th in the November 2011 Top 500 list.
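The aggregate figures quoted above follow from a simple multiplication, sketched here in Python. The constant and function names are illustrative, and the result is a plain sum of the per-card peak values given in the text, not a measured benchmark result.

```python
# Figures taken from the text above; the names are illustrative only.
NODES = 40
CARDS_PER_NODE = 3
TFLOPS_PER_CARD_CURRENT = 1.33  # current-generation card
TFLOPS_PER_CARD_NEXT = 5.0      # announced 28nm generation


def cabinet_peak_tflops(nodes: int, cards_per_node: int,
                        tflops_per_card: float) -> float:
    """Aggregate peak of the cabinet, assuming the cards simply add up."""
    return nodes * cards_per_node * tflops_per_card


current_peak = cabinet_peak_tflops(NODES, CARDS_PER_NODE, TFLOPS_PER_CARD_CURRENT)
next_peak = cabinet_peak_tflops(NODES, CARDS_PER_NODE, TFLOPS_PER_CARD_NEXT)

print(f"current generation: {current_peak:.1f} TFlops")  # 159.6 TFlops
print(f"next generation:    {next_peak:.1f} TFlops")     # 600.0 TFlops
```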

Scientific, technical and economical outcomes

The results of the GREMSAP project will be made available to the wider research community via:

GREMSAP as a SaaS
The GREMSAP web service will be free for the scientific community for the duration of the project and, afterwards, will be charged at marginal cost. Companies using GREMSAP will pay the full cost, including the renewal of the clusters. Indeed, it is expected that GREMSAP as a PaaS will concentrate the main efforts to improve multiple alignments and, thus, that this web service will remain the best one. As such, it is expected that GREMSAP as a PaaS will be able to renew its clusters and software, allowing the continuation of the SaaS.
Software Distribution of GREMSAP as a PaaS
The GREMSAP system distribution will include all of the software modules of the system (source code and binaries). It will be offered free of charge to users from academic research institutes, government laboratories, universities and nonprofit foundations. The 'commercial' version will be made available to other laboratories and to industry. Note: the details of the commercialization procedure will be specified in the project consortium agreement.
Publications
The algorithms developed and the results of the software evaluation will be published in major journals and conferences in the different fields.
High throughput data
Results of the pilot study will be made available on the SM2PH website.

The project offers unique possibilities for economic and industrial impact, both in transforming the developed technologies into industrial products, and in applying the technologies to, for example, health and biotechnology. The massively parallel MSA program proposed here has potential applications in the numerous diverse fields requiring automatic analyses of large-scale sequencing data, ranging from whole-genome sequencing or targeted resequencing to human genetics, clinical and diagnostic studies, or metagenomics of ecosystems. The MSA program that is currently the most widely used, ClustalW (>26000 citations), is distributed under a non-exclusive commercial license, and a parallel version of ClustalW has been commercialized by SGI (http://www.sgi.com). The improved efficiency and accuracy of MSAs in our proposal should be of interest to all researchers using protein sequence analysis in high-throughput genomics projects.