This review surveys the concepts and progress of bioinformatics and highlights its recent biotechnological applications in the post-genomic era across many fields, starting with the basic and applied life sciences, where the philosophy and style of both research and knowledge have changed.
Furthermore, it discusses related applications in molecular medicine and microbial genomes, and summarizes possible biotechnological applications in agriculture, energy and the environment. The emerging field of proteomics addresses the analysis of the protein population inside the cell.
Technologies such as 2D gels and mass spectrometry offer glimpses into the world of mature proteins and their molecular interactions. Finally, we are stepping beyond analyzing generic genomes and are asking which genetic differences between individuals of a species are key to predisposition to certain diseases and to the effectiveness of particular drugs. These questions join the fields of molecular biology, genetics, and pharmacy in what is commonly called pharmacogenomics.
The pharmaceutical industry was the first branch of the economy to engage strongly in the new technology, combining high-throughput experimentation with bioinformatics analysis. Medicine is following closely. Medical applications go beyond trying to find new drugs on the basis of genomic data. The aim here is to develop more effective diagnostic techniques and to optimize therapies.
The first steps to engage computational biology in this quest have already been taken. While driven by biological and medical demand, computational biology will also exert a strong impact on information technology. Because biological processes are too complex to simulate from first principles, we resort to statistical learning and data mining techniques, methods that are at the heart of modern information technology.
The mysterious encoding that Nature has afforded for biological signals as well as the enormous data volume present large challenges and are continuing to have large impact on the processes of information technology themselves. In this theme section, we present 15 scientific progress reports on various aspects of computational biology.
Thus, here we will confine ourselves to a description of the general principles underlying the most successful algorithms, and very brief descriptions of a few, leaving to the interested reader the task of keeping abreast with one of the hottest and most rapidly growing fields of modern bioinformatics.

The principle of creating an index of the positions of all distinct k-mers in either the sequence reads [...] able mapping algorithms are, arguably, whether the genome or the sequence reads are indexed, and the indexing method applied. Additionally, different [...]

[...] able: if the available memory is not sufficient to hold the index of the whole set of tags to be processed, then the tags can be split into subsets and each subset can be processed separately or in parallel, if several computing cores are available. The final result is obtained by merging of the results obtained [...] While the computation time is increased, tools of this kind can be used with standard personal computers. [...] Applied to our problem, suppose that we have to map a tag of length 32 with up to two mismatches. The tag can be defined as the concatenation of four substrings of length 8 bp.
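The tag-splitting idea rests on the pigeonhole principle: if a 32 bp tag is cut into four 8 bp substrings and at most two mismatches are allowed, at least two of the substrings must match the genome exactly. The following is a minimal Python sketch of that idea, not the implementation of any published tool. The function names (`build_seed_index`, `map_tag`) are hypothetical, and for simplicity the sketch indexes the genome, whereas a real mapper may index the tags instead.

```python
from collections import defaultdict

SEED_LEN = 8   # length of each substring, as in the example in the text
N_SEEDS = 4    # a 32 bp tag is the concatenation of 4 x 8 bp substrings


def build_seed_index(genome, k=SEED_LEN):
    """Map every k-mer of the genome to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_tag(tag, genome, index, max_mismatches=2):
    """Return all genome start positions where the tag aligns without gaps
    and with at most max_mismatches mismatches.

    With 4 substrings and at most 2 mismatches, at least one substring
    (in fact at least two) is error-free, so looking up each substring
    exactly and verifying the full tag finds every valid position.
    """
    assert len(tag) == SEED_LEN * N_SEEDS
    hits = set()
    for i in range(N_SEEDS):
        offset = i * SEED_LEN
        seed = tag[offset:offset + SEED_LEN]
        for seed_pos in index.get(seed, ()):
            start = seed_pos - offset
            if start < 0 or start + len(tag) > len(genome):
                continue  # the full tag would run off the genome
            if hamming(tag, genome[start:start + len(tag)]) <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

The exact lookups restrict the expensive full comparison to a handful of candidate positions, which is what limits the search space.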
[...] platform and others are more general purpose (Table 2). Probably the first tool introduced for this task, ELAND indexes the tags, and is based on the aforementioned tag-splitting strategy allowing mismatches. It is still one of the fastest and least memory-greedy pieces of software available. Likewise, SeqMap builds an index for the reads by using the longest substring guaranteed to match exactly, and scans the genome against it. It allows the possibility of insertions and deletions in alignments. ZOOM is also based on the same principles as ELAND, with the difference that [...]

SOAP was one of the first methods published for the mapping of short tags, in which both tags and genome are first of all converted to numbers using 2-bits-per-base encoding. To admit two mismatches, [...] human genome. In this way, Bowtie can run on a typical desktop computer with 2 GB of RAM. BWA is also based on the Burrows-Wheeler transform. Mismatches are thus [...] sequence, with up to 2 mismatches. The six hash tables correspond to six spaced seeds analogous to that used in ZOOM. By default, MAQ indexes the first 28 bp of the reads. While very fast, MAQ is based on a number of heuristics that do not always guarantee to find the best match for a read. [...] also employed by RMAP, wherein positions in reads are designated as either high- or low-quality. [...] as wild-cards. To prevent the possibility of trivial matches, a quality control step eliminates reads [...] CloudBurst is a RMAP-like algorithm that supports cloud computation using the open-source Hadoop implementation of MapReduce to parallel[...]

The main drawback is the [...] to limit the search space. The program [...] allow for mismatches in the rest. The performance is reported to be much faster than SOAP, but once [...] algorithm searches for exactly matching short subsequences (seeds) between the genome and tag sequences, and performs ungapped extension on those seeds to find the longest possible matching sequence with a user-specified number of mismatches. To search for matching seeds, MOM creates a hash table of subsequences of fixed length k (k-mers) from either the genome or the tag sequences, and then sequentially reads the un-indexed sequences, searching for matching k-mers in the hash table. For example, in [...] number of errors are automatically trimmed. The performance reported is [...] for typical applications. A good (at least theoretically) performance is also obtained by vmatch, which employs enhanced suffix arrays for a number of different genome-wide sequence analysis applications. In turn, the choice of a given method against another one depends on how [...]

To illustrate variation in performance of different [...] were not simulated. These reads were mapped to both the human transcriptome from which the reads originated (and which should provide perfect matches to each read) and to the human genome sequence from which the transcriptome originated. Table 3 shows various statistics regarding the speed, memory use and sensi[...] We see clearly that SOAP provides the fastest performance while Bowtie uses marginally less RAM, with Bowtie correctly mapping [...] The apparently poor performance of PASS is likely due to imperfect parameterization and failure to identify all map positions for reads matching on a large number of transcripts.
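The seed-based strategy that the text reports for MOM (a hash table of k-mers, exact seed lookup, then ungapped extension within a mismatch budget) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not MOM's actual implementation: the function names are hypothetical, the genome is the indexed sequence, and the extension is greedy (rightwards first, then leftwards), so it does not necessarily find the optimal split of the mismatch budget between the two directions.

```python
from collections import defaultdict


def build_kmer_table(genome, k):
    """Hash table from each k-mer of the genome to its start positions."""
    table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    return table


def extend(tag, genome, t_start, g_start, k, max_mismatches):
    """Greedy ungapped extension of an exact seed of length k.

    Returns (tag_start, genome_start, length) of the extended match.
    """
    mismatches = 0
    t_end, g_end = t_start + k, g_start + k
    # Extend to the right until the mismatch budget is exhausted.
    while t_end < len(tag) and g_end < len(genome):
        if tag[t_end] != genome[g_end]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        t_end += 1
        g_end += 1
    # Extend to the left with whatever budget is left.
    while t_start > 0 and g_start > 0:
        if tag[t_start - 1] != genome[g_start - 1]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        t_start -= 1
        g_start -= 1
    return t_start, g_start, t_end - t_start


def best_ungapped_match(tag, genome, table, k, max_mismatches=2):
    """Scan the tag's k-mers against the table and keep the longest extension.

    Returns None when no k-mer of the tag occurs in the genome.
    """
    best = None
    for t in range(len(tag) - k + 1):
        for g in table.get(tag[t:t + k], ()):
            candidate = extend(tag, genome, t, g, k, max_mismatches)
            if best is None or candidate[2] > best[2]:
                best = candidate
    return best
```

Only the indexed sequence has to be held in memory as a hash table; the un-indexed sequences are read sequentially, which is the trade-off the text describes.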
The alignment algorithms that have been developed for decay from the previous scenario is due principally Illumina GA and Roche short reads. Nevertheless, the differences observed polyadenylation sites and the introduction of sequen- between methods illustrate that, particularly when cing errors under user defined models, in this case error prone short reads are mapped to genomic an empirically deduced model for Illumina sequen- sequences, a substantial number of artifactual place- cing .
In total, 4 bp long reads were ments are generated mostly due to the presence of simulated, of which covered splice junctions sequencing errors and that the different heuristics dataset RER. Again all methods were used to used by different algorithms can find different imper- map reads to both the transcriptome and genomic fectly matching map positions.
Here a more compli- cated picture is observed. In total, However, at least not mapped to the transcriptome dataset as polyA for smaller bacterial genomes, even the shortest tails were excluded from the transcripts for computa- reads can be used to effectively assemble genome tional reasons. Thus in our simulation, sequence sequences de novo, and even where complete closure errors, mis-priming and alternative polyadenylation of the genome is not possible, large contigs can seem to have little effect on our capacity to correctly be reliably constructed from such data provided map reads with Bowtie and SOAP.
When the reads that repeated sequences are not overly abundant. It the lure of NGS, although unsurprisingly, until seems that SSAKE is particularly vulnerable to now, the technology has been the most widely the presence of sequencing errors, while all contigs used for these applications . For a contigs. In the context of metagenomics, after detailed view of technical and algorithmic issues contig assembly, high-throughput identification and in de novo assembly of short reads, readers are referred phylogenetics strategies are required for the recon- to .
SNPs are often higher error frequencies in these reads. However, identified using data from high-throughput sequen- they have proved useful for the development of cing projects and reads are typically aligned to the de novo genome assemblers. To approach. Recently, a study by Smith of shorter contigs with a higher overall genomic and colleagues  showed that single mutations coverage. The a reference . After alignment, base positions deep-sequencing coverage provided by NGS with a high probability to be SNPs are selected.
In contrast, Roche reads provide lower This novel technology, termed RNA-Seq, provides coverage per base but with high quality. However, sequence reads from one single-end sequencing pyrosequencing can introduce biases in SNP detec- or both paired-end sequencing ends of cDNAs tion when homopolymeric strings are present. Tools generated by a population of total or polyA enriched like Pyrobayes can overcome such limitations in RNAs. However, in this way a functionally stretches .
To with all of the NGS platforms considered here also obtain a more comprehensive overview of the facilitates the identification of genome rearrange- transcriptome the random amplification of total ments when relative mapping orientations or posi- RNA can be carried out, taking care to perform tions of reads do not correspond to those expected a rRNA depletion step to prevent an unwanted from the reference genome e.
RNA-Seq reads from Illumina and SOLiD technologies more suitable for quantifying transcript platforms have been successfully used to detect levels through tag profiling [8, 70] also termed digital the complete editing pattern in the mitochondrial gene expression, and full-length transcript profiling genome of grapevine, supporting the idea that .
The latter methodology suitably applies to RNA editing in plant mitochondria is likely more the detection of transcribed regions in the genome, pervasive than expected Picardi et al. In mammals, several known editing novel ones. However, as such reads typically span events have been accurately detected using a single exon the relevant information about exon sequencing technology . For this aim the longer reads RNA-Seq produced by Roche FLX are much more infor- NGS platforms are ideally suited for the detailed mative although sophisticated model-based systems analysis of the transcriptome.
Indeed, our current  are intended to deconvolute the relative abun- knowledge of the transcriptome complexity in dance of different transcripts derived from the different tissues, cell types, developmental stages same gene. The human or mouse. Alternative splicing, a pervasive discovery of novel splice sites can be carried out phenomenon affecting in human virtually all multi- either by searching contiguous mappings against exon genes [67—69] is a major determinant of tran- splice junction libraries derived from the concatena- scriptome complexity.
This technology is rapidly spanning exon junctions, thus able to perform split becoming the method of choice for the large-scale alignments against the reference genome. Chip-Seq implies the characterization uous mapper see Table 2 to carry out the mapping of isolated DNA by NGS approaches as opposed not only against the genome but also against the full to the search for specific sequences by PCR, or the set of known transcripts as derived from RefSeq  identification of isolated DNA through microarray- and other databases such as ASPicDB  that also based approaches.
Mortazavi et al. If we assume that in a completely random transcriptome annotation using only RNA-Seq data experiment each genomic region has the same prob- e. The for sequencing, etc. Such phenomena can impact upon many or chromosome arms , since for experimental reasons applications of NGS, but are particularly important different regions can have different propensities for RNA-seq and can complicate both transcript to produce reads. Thus, global or region-specific annotation and quantification.
Unlike oligonucleotide ChipSeq Peak Finder used in , ChIPDiff  array studies, deep sequencing requires no a-priori and CisGenome , which encompasses a series knowledge of the nature of small RNAs, is less sub- of tools for the different steps of the ChIP-seq ject to the lack of specificity of short probes some- analysis pipeline. False discovery rates are estimated times associated with oligonucleotide arrays  and by these tools by comparing the level of enrichment expression levels can be followed over a wider range number of tags at given sites, with the background with deep sequencing.
[...] model used. Moreover, with current tools the choice of a significance or [...] selves. In fact, the first applications of ChIP-Seq have been related to epigenetic regulation (see for example [91, 92]), perhaps because the problem is somewhat easier than for TFs. The analysis protocol [...] making the separation between signal and noise in peak detection much clearer. Primary analyses of [...] Such efforts vary from ChIP-PCR, to the recognition of known motifs, to the detection of sequences [...]

Indeed, even the shortest [...] critical that partial adaptor sequences are removed before analysis. It should also be borne in mind that additional bases, not derived from the genome sequence, are often added physiologically to the 3' ends of mature microRNAs and these bases can also obscure correct alignments to genomic sequences. Clustering of observed sequences and comparison with databases of annotated small RNAs, e.g. [...] Analysis of the size distribution of reads can also prove informative as to the nature of small RNAs present.
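Two of the preprocessing steps mentioned for small RNA reads, removing partial adaptor sequences and examining the size distribution of reads, can be illustrated with a short Python sketch. The adaptor sequence used in the usage note is a made-up placeholder, the function names are hypothetical, and the trimming assumes an error-free adaptor; real trimming tools also tolerate sequencing errors within the adaptor, which this sketch does not.

```python
from collections import Counter


def trim_3prime_adaptor(read, adaptor, min_overlap=3):
    """Remove a 3' adaptor from a read.

    A complete adaptor occurrence is removed together with everything after
    it. Otherwise the longest adaptor prefix (at least min_overlap bases,
    min_overlap >= 1) that forms the end of the read is removed.
    """
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]
    longest = min(len(adaptor) - 1, len(read))
    for n in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:n]):
            return read[:len(read) - n]
    return read


def size_distribution(reads):
    """Count how many reads have each length."""
    return Counter(len(r) for r in reads)
```

For example, with the placeholder adaptor `"TCGTATGCCG"`, a read ending in `"TCGTA"` loses those five bases, and `size_distribution` applied to the trimmed reads gives the length profile, in which a peak at around 21 to 22 bases would be consistent with microRNAs.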
[...] overrepresented in isolated fragments.

Recent years have seen a number of important discoveries relating to the regulated expression of small (typically 18-25 base) RNAs in eukaryotic cells [...]

For example, microRNAs tend [...] Several specific bioinformatics tools have been developed to identify members of different classes of small RNAs from deep sequencing data. [...] and searching for plausible hairpin structures encompassing regions where small RNAs map and identifying cases where a single species, deriving from a stem, is over-represented with respect to a putative miRNA* and reads derived from loop regions. For a review dedicated to the discovery and expression profiling of miRNA using deep sequencing, see ref. [...] cursors. This algorithm has also been implemented as a web tool. Piwi associated RNAs (piRNAs) are a class of repeat associated small interfering RNAs (ra-siRNAs) derived from large repeat containing genomic loci such as the flamenco locus in Drosophila, predominantly expressed in germline cells, and thought to be principally involved in the regulation of transposon expression through complementary interactions leading to degradation [...] target molecules. A simple algorithm to detect such complementary patterns has been proposed. [...] These workers sequenced the 5' ends of mRNA degradation products, assuming that sites targeted by siRNAs would be over-represented. A dedicated bioinformatics pipeline for matching end-reads, datasets of known small RNAs and a database of transcripts has been presented.

Epigenomics studies

5'-Methylation of cytosine bases forms the basis of important mechanisms of regulation of chromatin state and gene expression. It is becoming increasingly clear that DNA methylation and demethylation can be a dynamic process in both [...]

One of the most popular methods of characterizing the methylation state of genomic DNA has been the targeted sequencing of particular genomic regions after treatment of isolated DNA with bisulfite, which converts unmethylated cytosines to uracil, but does not modify 5' methylated cytosines. More recently, and analogously to the situation with ChIP experiments, [...] The development of NGS technol[...] of the frequency with which such sites are methy[...] Clearly, modification of non-methylated cytosines will increase the level of mismatches in reads derived from non-methylated regions and potentially introduce artifactual matches to regions of the genome other than the one from which reads were derived. Amplification of genomic DNA fragments adds additional complications (antisense reads derived from modified or non-modified genomic regions). Of the limited numbers of studies of this type published until now using Illumina data, two have used conventional short read mapping tools and [...] probabilistic mapping procedure based on base call scores and combinatorial substitution of cytosines for thymines in reads. To minimize the computational cost of exhaustive genome scans for each read, an efficient branch and bound algorithm was applied to an appropriate genome index structure, [...] While the use of NGS technologies in epigenomic studies is in its infancy, the increasing awareness of the importance of epigenetic marking in development and disease suggests that this field will develop rapidly over the next years, at both the experimental and bioinformatics levels.
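The effect of bisulfite treatment on read mapping can be made concrete with a small Python sketch. It simulates the conversion of unmethylated cytosines (uracil is read as thymine after amplification and sequencing), shows how a conventional mismatch count is inflated, and shows a comparison that tolerates a T in the read opposite a C in the reference. All function names and the example positions are hypothetical, and only reads from the original forward strand are considered; the antisense-read complication mentioned in the text is deliberately ignored.

```python
def bisulfite_convert(seq, methylated=frozenset()):
    """Simulate bisulfite treatment plus sequencing: every C becomes T,
    except at the 0-based positions listed in `methylated`."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def plain_mismatches(read, ref):
    """Conventional mismatch count between a read and a reference window."""
    return sum(r != g for r, g in zip(read, ref))


def bisulfite_mismatches(read, ref):
    """Mismatch count that accepts a T in the read over a C in the reference."""
    return sum(
        1 for r, g in zip(read, ref)
        if r != g and not (g == "C" and r == "T")
    )


def methylation_calls(read, ref):
    """For every reference C covered by the read, report True if the read
    still shows C (methylated) and False if it shows T (unmethylated)."""
    return {
        i: r == "C"
        for i, (r, g) in enumerate(zip(read, ref))
        if g == "C" and r in "CT"
    }
```

Accepting C-to-T differences removes the artificial mismatches but also reduces the information content of the alignment, which is why the text notes a risk of artifactual matches elsewhere in the genome.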
We have attempted to provide a broad outline of bioinformatics approaches for the analysis of NGS data. The rapid rate of development in the field means that it is likely that significant developments [...] To this end we have avoided detailed discussions of data formats and quality scores. However, several dynamic and useful discussion forums on the WWW may be of use to keep up-to-date with recent developments, e.g. [...]

Finally, it is fascinating to speculate as [...] Metatranscriptomics, for example [...], promises to allow previously unimaginable advances in our understanding of large scale biological interactions in microbial communities. Indeed, NGS technologies are already playing a key role in the [...] genomes project, directed at the wide sampling of human genome sequences. There cannot be any [...] in modern post-genomic research. Here we have focused on the current generation of bioinformatics [...] be command line driven and somewhat inaccessible to many wet-bench researchers. There is undoubtedly [...] instruments to render the power of these new technologies available to a wider audience within the [...] All of these considerations will further enhance the symbiotic relationship between modern biology and computational sciences, and ensure long and productive careers for talented and committed bioinformaticians.

Key Points
- NGS technologies are revolutionizing the scale and perspectives of research in the fields of genomics and functional genomics.
- The general features of the three major NGS platforms, namely [...] handling and the analysis of the huge amount of data produced.
- [...] current limitations and open problems in genome mapping of NGS data.
- The major bioinformatics applications for dealing with NGS including genome mapping, de novo assembly, detection of SNPs and editing sites, transcriptome analysis, ChIP-Seq, small RNA [...]

References
1. [...] DNA sequencing with chain-terminating inhibitors. [...]
[...] Nat Rev Microbiol [...]
Mardis ER. [...] Trends Genet [...]
[...] Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet [...]
[...] New strategies and emerging technologies for massively parallel sequencing: applications in medical research. Genome Med [...]
Morozova O, Marra MA. Applications of next-generation sequencing technologies in functional genomics. Genomics [...]
[...] From cytogenetics to next-generation sequencing technologies: advances in the detection of genome rearrangements in tumors. Biochem Cell Biol [...]
8. [...] Next-generation tag sequencing for cancer gene expression profiling. Genome Res [...]
Schuster SC. Next-generation sequencing transforms [...] Nat Methods [...]
Ansorge WJ. [...] N Biotechnol [...]
Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol [...]
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics [...]
[...] Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics [...]