User:Rgocs/RNA-test

RNA-Seq, also called "Whole Transcriptome Shotgun Sequencing" ^[1] ("WTSS") and dubbed "a revolutionary tool for transcriptomics" ^[2], refers to the use of high-throughput sequencing technologies to sequence cDNA in order to obtain information about a sample's RNA content. The technique is quickly becoming invaluable in the study of diseases such as cancer ^[3]. Thanks to the deep coverage and base-level resolution provided by next-generation sequencing instruments, RNA-Seq gives researchers efficient ways to measure how different alleles of a gene are expressed and to detect post-transcriptional mutations and gene fusions ^[3].

Introduction

The introduction of next-generation sequencing (also called high-throughput sequencing) technologies opened new doors in the field of DNA sequencing. As understanding of these technologies becomes more widespread and new tools are developed, innovative new ways of applying them are also being created. Because high-throughput sequencing requires only small amounts of nucleotide sequence product and offers deep coverage and base-scale resolution, its use has expanded to the field of transcriptomics ^[2]. Transcriptomics is the area of study dealing with the RNA transcribed from a particular genome under investigation. Although transcriptomes are more dynamic than genomic DNA, these molecules provide direct access to genome regulation and protein information (wiki link "transcriptome"). Sequencing transcriptomes is not a new idea: methods have previously been developed to determine cDNA sequences directly, mostly based on traditional (and more expensive) Sanger sequencing (wiki link). Existing methodologies include serial analysis of gene expression (SAGE), cap analysis of gene expression (CAGE) and massively parallel signature sequencing (MPSS).

Transcriptome sequencing (RNA-Seq) can be done on a variety of platforms. For example, recent applications include using the Illumina Genome Analyzer [wiki link] platform to sequence mammalian transcriptomes [mortazavi2008], Applied Biosystems' SOLiD [wiki link] to profile stem cell transcriptomes <ref name=cloonan2008>{{cite journal | journal=Nature Methods | volume=5 | issue=7 | pages=613–619 | date=2008 | author=Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM. | title= Stem cell transcriptome profiling via massive-scale mRNA sequencing | url=http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1223.html | pmid=18516046 | doi=10.1038/nmeth.1223 }}</ref> or 454 Life Sciences' 454 Sequencing [wiki link] to discover [[Single nucleotide polymorphisms]] (SNPs) in maize through its transcriptome <ref name=barbazuk2007>{{cite journal | journal=The Plant Journal | volume=51 | issue=5 | pages=910–918 | date=2007 | author=Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS | title=SNP discovery via 454 transcriptome sequencing | url=http://www3.interscience.wiley.com/journal/118488674/abstract | pmid=17662031 | doi= 10.1111/j.1365-313X.2007.03193.x}}</ref>. Even though each platform has its own technical particularities, the information gathered from each is of the same nature.

Methodologies

RNA PolyA Library

Library creation can differ from platform to platform in high-throughput sequencing [reference]; each platform has several kits designed to build different types of libraries and to adapt the resulting sequences to the specific requirements of its instruments.

However, because the template being analyzed is RNA, the technologies share some common steps. In [[mRNA]] analysis the polyadenylated (poly(A)) tail is frequently targeted to ensure that coding RNA is separated from noncoding RNA. This can be accomplished with poly(T) oligos covalently attached to a substrate. At present many studies use magnetic beads for this step (<ref name=morin2008/>; Mortazavi, 2008) (Invitrogen, MACS mRNA Isolation kit).

Some studies have shown that non-poly(A) RNA can yield important non-coding RNA gene discoveries, so selecting only poly(A) RNA molecules significantly reduces the efficiency of such discovery (Morin, 2008). Since ribosomal RNA represents over 90% of the RNA within a given cell, studies have shown that removing it via probe hybridization improves coverage of the rest of the transcriptome (Invitrogen, RiboMinus Human/Mouse Transcriptome Isolation kit).

Because random PCR primers show a 5' bias and secondary structures influence primer binding sites (Mortazavi, 2008), hydrolyzing the RNA into fragments of 200-300 nucleotides prior to reverse transcription reduces both problems, in theory and in practice. Once the cDNA is synthesized it can be further fragmented to reach the desired fragment length, as specified in Table 1. For Illumina sequencing, adapters are then ligated onto the fragmented cDNA. The template is then ready for the chosen sequencing instrument.

Protocol Online (http://www.protocol-online.org/prot/Molecular_Biology/RNA/RNA_Extraction/mRNA_Isolation/index.html) provides a list of several protocols relating to mRNA isolation.

Next generation sequencing

High-throughput sequencing technologies generate millions of short reads from a library of sequences. The most widely used technologies and some of their characteristics are shown in the following table (source: Mardis ER. The impact of next-generation sequencing technology on genetics. Trends in Genetics. Mar 2008; 24(3):133-41).

	454 Sequencing	Illumina	SOLiD
Sequencing chemistry	Pyrosequencing	Polymerase-based sequence-by-synthesis	Ligation-based sequencing
Amplification approach	Emulsion PCR	Bridge amplification	Emulsion PCR
Paired end separation	3 kb	200 bp	3 kb
Mb per run	100 Mb	1300 Mb	3000 Mb
Time per paired end run	7 hours	4 days	5 days
Read length	250 bp	32 - 42 bp	35 bp
Cost per run	$ 8,438 USD	$ 8,950 USD	$ 17,447 USD
Cost per Mb	$ 84.39 USD	$ 5.97 USD	$ 5.81 USD

Table 1. Comparing metrics and performance of next-generation DNA sequencers [mardis2008]

Transcriptome alignment

Because the short reads are small (for the Illumina Genome Analyzer, around 42 bases), de novo assembly may be difficult (though some software does exist: [[Velvet_(algorithm)]]): the reads cannot share the large overlaps needed to easily reconstruct the original sequences, and the deep coverage makes the computing power needed to track all the possible alignments prohibitive <ref name=zerbino2008>{{cite journal | journal=Genome Research | volume=18 | issue=5 | pages=821–829 | date=2008 | author=Zerbino DR, Birney E | title=Velvet: Algorithms for de novo short read assembly using de Bruijn graphs | url=http://genome.cshlp.org/content/18/5/821.full | pmid=18349386 | doi= 10.1101/gr.074492.107 }}</ref>. This can be somewhat overcome by obtaining longer sequences from the same sample with other techniques, such as Sanger sequencing [wiki link], and using these longer reads as a "skeleton" or "template" to help assemble reads in difficult regions (e.g. regions with repetitive sequences).

The recommended approach is to align the millions of reads to a "reference genome" [wiki link]. Many tools are available for aligning genomic reads to a reference genome (http://wiki.riteme.site/wiki/List_of_sequence_alignment_software); however, special attention is needed when aligning a transcriptome to a genome, mainly when dealing with genes that have intronic regions. As discussed above, the sequence libraries are created by extracting mRNA using its poly(A) tail, which is added to the mRNA molecule post-transcriptionally, so splicing has already taken place. Therefore, the created library and the short reads obtained from it cannot come from intronic sequences. When these short reads are aligned to a reference genome, only reads lying entirely inside exonic regions will be matched; reads that span exon-exon junctions will not be aligned.

A possible workaround is to align the unaligned short reads against a proxy genome generated from known exonic sequences [reference]. This need not cover whole exons, only enough sequence that the short reads can match on both sides of the exon-exon junction with a minimum overlap.

[Final version of Transcriptome alignment figure. Some short reads that are in an exon-exon junction will be split when aligning to the reference genome]
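The proxy genome of exon-exon junctions described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `junction_library`, the parameter names, and the choice to join only consecutive exons are hypothetical and are not taken from any of the cited studies (alternative splicing would require additional exon pairs).

```python
from typing import Dict, List, Tuple


def junction_library(
    exons: List[str], read_length: int, min_overlap: int
) -> Dict[Tuple[int, int], str]:
    """Build one junction sequence per pair of consecutive exons.

    Hypothetical helper: each junction keeps at most
    (read_length - min_overlap) bases from each side, so any read that
    fits inside it covers at least min_overlap bases of both exons.
    """
    if min_overlap < 1 or 2 * min_overlap > read_length:
        raise ValueError("min_overlap must be in 1..read_length // 2")
    flank = read_length - min_overlap
    library: Dict[Tuple[int, int], str] = {}
    for i in range(len(exons) - 1):
        left = exons[i][-flank:]      # tail of the upstream exon
        right = exons[i + 1][:flank]  # head of the downstream exon
        library[(i, i + 1)] = left + right
    return library


if __name__ == "__main__":
    toy_exons = ["AAAAACCCCC", "GGGGGTTTTT", "ACGTACGTAC"]
    for pair, seq in junction_library(toy_exons, 6, 2).items():
        print(pair, seq)
```

Reads that failed to align to the reference genome can then be aligned against these junction sequences with the same aligner. Limiting each flank to the read length minus the minimum overlap keeps the proxy genome small and avoids re-matching reads that lie entirely within one exon.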

Analysis

Gene Expression

The characterization of gene expression [wiki link-> Gene_expression] in cells via measurement of mRNA levels has long been of interest to researchers [mortazavi2008] [marioni2008]. Even though it has been shown that, due to other post-transcriptional gene regulation events (such as RNA interference [wiki link]), there is not a strong correlation between the abundance of an mRNA and that of the related protein <ref name=greenbaum2003 >{{cite journal | journal=Genome Biology | volume=4 | issue=9 | pages=117 | date=2003 | author=Greenbaum D, Colangelo C, Williams K, Gerstein M. | title=Comparing protein abundance and mRNA expression levels on a genomic scale | url=http://genomebiology.com/2003/4/9/117 | pmid=12952525 | doi= 10.1186/gb-2003-4-9-117}}</ref>, measuring mRNA concentration levels is still a useful tool for determining how the transcriptional machinery of the cell is affected in the presence of external signals (e.g. drug treatment), or how cells differ between a healthy state and a disease state.

Microarray approach

Prior to RNA-Seq, microarrays (wiki link) were unchallenged as the experiment of choice for transcriptome analysis. Although many experiments still use microarrays with exciting results, and the time needed to obtain results for a given sample is shorter, intrinsic experimental limitations of microarrays seem to make RNA-Seq the method of choice. One important limitation, among others, is that sequence information is a prerequisite for detecting, and therefore evaluating, transcripts (Marioni, 2008).

Coverage as a measure of expression

Expression can be inferred from RNA-Seq data from the extent to which a sequence is retrieved. Transcriptome studies in yeast (Nagalakshmi, 2008) show that, in this experimental setting, four-fold coverage is required for amplicons to be classified and characterized as an expressed gene. When the transcriptome is fragmented prior to cDNA synthesis, the number of reads corresponding to a particular exon, normalized by its length in vivo, yields gene expression levels which correlate with those obtained through qPCR.
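The two quantities mentioned above, fold coverage and length-normalized read counts, can be illustrated with a short sketch. The function names are hypothetical. The four-fold threshold is the one quoted above for the yeast study. The second function uses reads per kilobase per million mapped reads (RPKM), the length- and depth-normalized measure of [mortazavi2008]; treating it as the normalization meant here is an assumption.

```python
def fold_coverage(read_count: int, read_length: int, feature_length: int) -> float:
    """Average number of times each base of the feature is covered."""
    return read_count * read_length / feature_length


def is_expressed(coverage: float, threshold: float = 4.0) -> bool:
    """Call a feature expressed at or above the coverage threshold.

    The default of 4.0 follows the four-fold figure quoted in the text;
    other experimental settings may need a different value.
    """
    return coverage >= threshold


def rpkm(read_count: int, feature_length: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length * total_mapped_reads)


if __name__ == "__main__":
    # Made-up numbers: a 1500 bp exon hit by 200 reads of 30 bp,
    # in a run with 10 million mapped reads.
    cov = fold_coverage(read_count=200, read_length=30, feature_length=1500)
    print(cov, is_expressed(cov))
    print(rpkm(read_count=200, feature_length=1500, total_mapped_reads=10_000_000))
```

Dividing by feature length prevents long exons from appearing more highly expressed simply because they produce more fragments, and dividing by the total number of mapped reads makes values comparable between runs of different depth.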

Single Nucleotide Variation Discovery

Single nucleotide variation has been analyzed in maize on the Roche 454 sequencing platform ^[4].

A massively parallel pyro-sequencing technology commercialized by 454 Life Sciences Corporation was used to sequence the transcriptomes of shoot apical meristems isolated from two inbred lines of maize using laser capture microdissection (LCM). A computational pipeline that uses the POLYBAYES polymorphism detection system was adapted for 454 ESTs and used to detect SNPs (single nucleotide polymorphisms) between the two inbred lines. Putative SNPs were computationally identified using 260 000 and 280 000 454 ESTs from the B73 and Mo17 inbred lines, respectively. Over 36 000 putative SNPs were detected within 9980 unique B73 genomic anchor sequences (MAGIs). Stringent post-processing reduced this number to > 7000 putative SNPs. Over 85% (94/110) of a sample of these putative SNPs were successfully validated by Sanger sequencing. Based on this validation rate, this pilot experiment conservatively identified > 4900 valid SNPs within > 2400 maize genes. These results demonstrate that 454-based transcriptome sequencing is an excellent method for the high-throughput acquisition of gene-associated SNPs.

Coverage

Coverage (depth) can affect which mutations are seen. Everything is expression-centric, so an allele might not be seen either because it is not in the genome or because it is not being expressed. At the same time, RNA-Seq can give more information than just the existence of a heterozygous gene: it can also help estimate the proportion in which each allele is expressed. In association studies, genotypes are associated with disease, and expression levels can also be associated with disease. Using RNA-Seq, we can obtain a measure of how these two relate, that is, in what proportion each of the alleles is being expressed.
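As a rough illustration of estimating the proportion in which each allele is expressed, the sketch below computes the fraction of reads carrying the alternative allele at a heterozygous site, together with an exact two-sided binomial test against balanced (50/50) expression. The function names and the choice of a binomial test are assumptions made for this example, not a method taken from the cited studies, and sequencing errors and mapping bias are ignored.

```python
from math import exp, lgamma, log


def allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads at the site that carry the alternative allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def binomial_two_sided_p(ref_reads: int, alt_reads: int, expected: float = 0.5) -> float:
    """Exact two-sided binomial p-value for the observed allele counts.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Probabilities are computed in log space so that
    large read counts do not overflow.
    """
    if not 0.0 < expected < 1.0:
        raise ValueError("expected must be strictly between 0 and 1")
    n = ref_reads + alt_reads

    def prob(k: int) -> float:
        log_p = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(expected) + (n - k) * log(1.0 - expected)
        )
        return exp(log_p)

    observed = prob(alt_reads)
    tail = sum(p for p in map(prob, range(n + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


if __name__ == "__main__":
    # Made-up counts: 30 reads with the reference allele, 10 with the other.
    print(allele_fraction(30, 10), binomial_two_sided_p(30, 10))
```

A small p-value suggests that the two alleles are not expressed in equal proportion, although low coverage limits how much can be concluded at any single site.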

Germline vs Expressed alleles

The only way to be absolutely sure of an individual's mutations is to obtain the germline DNA sequence and compare it to the transcriptome sequences. Doing this makes it possible to distinguish homozygous genes from skewed expression of one of the alleles; it can also provide information about genes that were not expressed in the transcriptomic experiment.
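The comparison between germline and expressed alleles can be sketched as a small decision procedure. This is a simplified illustration: the function name and category labels are hypothetical, the allele sets are assumed to have been called already, and sequencing errors are ignored. An expressed allele that is absent from the germline is flagged as a candidate post-transcriptional edit.

```python
from typing import Set


def classify_site(germline: Set[str], expressed: Set[str]) -> str:
    """Compare the alleles in germline DNA with those seen in the transcriptome."""
    if not expressed:
        # The gene (or this site) produced no reads in the experiment.
        return "not expressed"
    if not expressed <= germline:
        # The transcript carries an allele the genome does not have.
        return "candidate post-transcriptional edit"
    if len(germline) == 1:
        return "homozygous"
    if len(expressed) < len(germline):
        # Heterozygous genome, but only one allele observed in the RNA.
        return "skewed expression of one allele"
    return "heterozygous, both alleles expressed"


if __name__ == "__main__":
    print(classify_site({"A", "G"}, {"G"}))
    print(classify_site({"A"}, {"G"}))
```

Without the germline sequence, the first of these two example calls would be indistinguishable from a homozygous G/G genotype.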

Post-transcriptional SNVs

Having matching genomic and transcriptomic sequences of an individual can also help in detecting post-transcriptional edits <ref name=wang2009/>: if the individual is homozygous for a gene at the genomic level but the gene's transcript carries a different allele, a post-transcriptional modification event can be inferred.

mRNA mutations are generally not considered a representative source of functional variation in cells, mainly because these mutations disappear with the mRNA molecule; however, because efficient DNA correction mechanisms do not apply to RNA molecules, such mutations can appear more often. This has been proposed as the source of prion diseases <ref name=garcion2004>{{cite journal | journal=Journal of Theoretical Biology | volume=230 | issue=2 | pages=271–274 | date=2004 | author=Garcion E, Wallace B, Pelletier L, Wion D. | title= RNA mutagenesis and sporadic prion diseases | url=http://dx.doi.org/10.1016/j.jtbi.2004.05.014 | pmid=15302558 | doi= 10.1016/j.jtbi.2004.05.014 }}</ref>, also known as TSE or [[transmissible spongiform encephalopathies]].

Fusion Gene Detection

In <ref name=maher/> any short read that fails to align to the reference sequences is then

[Final version of Gene Fusion detection image follows]

Caveats

The information gathered when sequencing a sample's transcriptome in this way has many of the same limitations as other RNA expression analysis pipelines. Mainly, the information gathered is:

a) Tissue specific: Gene expression is not uniform throughout an organism's cells; it is strongly dependent on the tissue type being measured;

b) Time dependent: Gene expression changes during a cell's lifetime.

Because of this, care must be taken when drawing conclusions from the sequencing experiment, as some of the information gathered might not be representative of the individual itself. An example is SNV discovery, where the mutations discovered are more precisely the mutations being expressed: observing a location that is homozygous for a non-reference allele does not necessarily mean that this is the individual's genotype; it could just mean that the gene copy with the reference allele is not being expressed in that tissue and/or at the time the sample was acquired.

References

^ Ryan D. Morin, Matthew Bainbridge, Anthony Fejes, Martin Hirst, Martin Krzywinski, Trevor J. Pugh, Helen McDonald, Richard Varhol, Steven J.M. Jones, and Marco A. Marra (2008). "Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing". BioTechniques. 45 (1): 81–94. PMID 18611170.
^ ^a ^b Wang Z, Gerstein M, Snyder M (January 2009). "RNA-Seq: a revolutionary tool for transcriptomics". Nature Reviews Genetics. 10 (1): 57–63. doi:10.1038/nrg2484. PMID 19015660.
^ ^a ^b Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (January 2009). "Transcriptome sequencing to detect gene fusions in cancer". Nature. doi:10.1038/nature07638. PMID 19136943.
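^ Barbazuk WB, Emrich SJ, Chen HD, Li L, Schnable PS (2007). "SNP discovery via 454 transcriptome sequencing". The Plant Journal. 51 (5): 910–918. doi:10.1111/j.1365-313X.2007.03193.x. PMID 17662031.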

[morin2008] Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing Morin et al., BioTechniques 2008. 45(1):81-94 http://www.biotechniques.com/default.asp?page=current&subsection=article_display&id=112900

[marioni2008] RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays Marioni et al., Genome Research, 2008 http://genome.cshlp.org/content/early/2008/06/11/gr.079558.108.abstract?ck=nck

[mortazavi2008] Mapping and quantifying mammalian transcriptomes by RNA-Seq Mortazavi et al., Nature Methods, 5, 621 - 628 (2008) http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1226.html
