June 2020 bioRxiv bioinformatics highlights (PNAS has its own "preprints": did you know?)

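When it comes to preprint platforms, bioRxiv is usually the first one biologists think of. In recent years, as preprints have grown steadily in importance and recognition, more and more preprint servers such as preprints.org, agriRxiv, medRxiv and OSF have sprung up. What you may not know, however, is that the renowned Proceedings of the National Academy of Sciences (PNAS) launched its own "preprints" long ago.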

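At this point, readers familiar with submitting to PNAS can probably guess what I mean: the famous "academician track". PNAS currently offers two submission routes, direct submission and contributed submission. According to official figures, the former accounts for more than 75% of published papers (figure below, left) and is the route most researchers choose. These manuscripts typically face review that is more cumbersome and stringent than at the vast majority of journals, in three stages: initial screening by an academy member, a second review by the handling editorial board member, and external peer review. Getting through to publication is extremely difficult, which is why this route is also nicknamed the "commoner track". The latter, contributed submission, must be submitted and recommended by a member of the US National Academy of Sciences, and each member gets a quota of two per year. These papers also go through review, but publication is far easier than via direct submission, hence the nickname "academician-sponsored track". There used to be a third route, Communicated submission (the "academician-referred track", abolished in 2010 【1】): authors first contacted a member, who then referred the paper to the journal. That route did not guarantee publication, but its success rate was clearly higher than that of the commoner track.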

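So, does the quality of papers differ across these three tracks? In 2009, two researchers from Harvard University compared the citation rates of papers published through the three tracks 【2】. They found that papers from the commoner track (blue, figure below, right) were cited significantly more often than those from the academician-sponsored track (red, figure below, right). Whether that means commoner-track papers are of higher quality is, of course, open to interpretation.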


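With the sudden arrival of the global pandemic, the academician track has drawn even more controversy. The main concern is that papers on the novel coronavirus submitted through this track are reviewed less rigorously, so that one with serious flaws or disputed claims, once published, can use the reach of PNAS to harm the global pandemic response. The two academician-track papers below, for example, have both stirred considerable controversy. Regarding the second, Natalie E. Dean, an assistant professor of biostatistics at the University of Florida, went so far as to say that Figure 3 of the paper (from the group of chemist and academy member Mario J. Molina) would never have passed review by any epidemiologist.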


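It is not only other scientists; some US academy members also look down on the academician track. Last month, Sarah Otto, a member of the US National Academy of Sciences and an evolutionary biologist at the University of British Columbia in Canada, said that she will neither use this track nor review for it: if a paper is good enough, it should not dodge regular peer review. Some context is needed here: for papers on the academician-sponsored track, the member may choose the reviewers, provided they have not collaborated with the member in the past four years 【3】. Professor Otto, given her standing, can of course turn down review invitations from fellow members. Other scholars may not dare to, which is probably one reason the model has lasted so long.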


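Calling it a "preprint" was a joke, of course. As noted above, academician-track papers are still peer reviewed, although many people feel the review is far more form than substance. Either way, the PNAS academician track has faced growing criticism for its "guaranteed admission" style of publication. Here is one possible fix: if PNAS turned the academician track into "PNAS preprints", it would keep the PNAS name while clearly distinguishing these papers from the commoner track. Wouldn't that be the best of both worlds?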

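In truth, both the commoner and academician tracks are beyond my reach, so a little envy and some loose talk are inevitable. Take all this with a grain of salt. Now that we are done with PNAS's not-quite-genuine preprints, let's move on to issue 25 of the bioRxiv bioinformatics highlights and see which preprints posted last month on genuine preprint platforms are worth a read.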

1. Salzberg's group at Johns Hopkins University releases Liftoff, a gene annotation mapping tool that went viral online

Liftoff: an accurate gene annotation mapping tool

Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however for most species, only the reference genome is well-annotated. One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.4% of human protein-coding genes to a chimpanzee genome assembly with 98.7% sequence identity.


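2. Can gene mutations themselves be adaptive? A look at the study that Max Planck plant biology heavyweight Weigel described as possibly his most thought-provoking work ("Perhaps most provocative study ever from my lab.")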

Mutation bias shapes gene evolution in Arabidopsis thaliana

Classical evolutionary theory maintains that mutation rate variation between genes should be random with respect to fitness and evolutionary optimization of genic mutation rates remains controversial. However, it has now become known that cytogenetic (DNA sequence + epigenomic) features influence local mutation probabilities, which is predicted by more recent theory to be a prerequisite for beneficial mutation rates between different classes of genes to readily evolve. To test this possibility, we used de novo mutations in Arabidopsis thaliana to create a high resolution predictive model of mutation rates as a function of cytogenetic features across the genome. As expected, mutation rates are significantly predicted by features such as GC content, histone modifications, and chromatin accessibility. Deeper analyses of predicted mutation rates reveal effects of introns and untranslated exon regions in distancing coding sequences from mutational hotspots at the start and end of transcribed regions in A. thaliana. Finally, predicted coding region mutation rates are significantly lower in genes where mutations are more likely to be deleterious, supported by numerous estimates of evolutionary and functional constraint. These findings contradict neutral expectations that mutation probabilities are independent of fitness consequences. Instead they are consistent with the evolution of lower mutation rates in functionally constrained loci due to cytogenetic features, with important implications for evolutionary biology.


3. Nanopore wizardry makes direct sequencing of RNA modifications possible

Direct detection of RNA modifications and structure using single molecule nanopore sequencing

Many methods exist to detect RNA modifications by short-read sequencing, relying on either antibody enrichment of transcripts bearing modified bases or mutational profiling approaches which require conversion to cDNA. Endogenous modifications are present on several major classes of RNA including tRNA, rRNA and mRNA and can modulate diverse biological processes such as genetic recoding, mRNA export and RNA folding. In addition, exogenous modifications can be introduced to RNA molecules to reveal RNA structure and dynamics. Limitations on read length and library size inherent in short-read-based methods dissociate modifications from their native context, preventing single molecule analysis and modification phasing. Here we demonstrate direct RNA nanopore sequencing to detect endogenous and exogenous RNA modifications over long sequence distance at the single molecule level. We demonstrate comprehensive detection of endogenous modifications in E. coli and S. cerevisiae ribosomal RNA (rRNA) using current signal deviations. Notably 2’-O-methyl (Nm) modifications generated a discernible shift in current signal and event level dwell times. We show that dwell times are mediated by the RNA motor protein which sits atop the nanopore. Further, we characterize a recently described small adduct-generating 2’-O-acylation reagent, acetylimidazole (AcIm) for exogenously labeling flexible nucleotides in RNA. Finally, we demonstrate the utility of AcIm for single molecule RNA structural probing using nanopore sequencing.

4. What do the identical sequences found widely across bacteria of different phyla tell us?

Long identical sequences found in multiple bacterial genomes reveal frequent and widespread exchange of genetic material between distant species

Horizontal transfer of genomic elements is an essential force that shapes microbial genome evolution. This process occurs via various mechanisms and has been studied in detail for a variety of biological systems. However, a coarse-grained, global picture of horizontal gene transfer (HGT) in the microbial world is still missing. One reason is the difficulty to process large amounts of genomic microbial data to find and characterize HGT events, especially for highly distant organisms. Here, we exploit that HGT between distant species creates long identical DNA sequences in distant species, which can be found efficiently using alignment-free methods. We analyzed over 90,000 bacterial genomes and thus identified over 100,000 events of HGT. We further develop a mathematical model to analyze the statistical properties of those long exact matches and thus estimate the transfer rate between any pair of taxa. Our results demonstrate that long-distance gene exchange (across phyla) is very frequent, as more than 8% of the bacterial genomes analyzed have been involved in at least one such event. Finally, we confirm that the function of the transferred sequences strongly impact the transfer rate, as we observe a 3.5 order of magnitude variation between the most and the least transferred categories. Overall, we provide a unique view of horizontal transfer across the bacterial tree of life, illuminating one fundamental process driving bacterial evolution.
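To make the idea of "long exact matches" concrete, here is a minimal toy sketch in Python. It is not the authors' pipeline: the abstract only says that alignment-free methods were used, so the k-mer hash index, the function name `long_exact_matches` and the `min_len` threshold are all assumptions made for illustration, and the approach would not scale to tens of thousands of genomes as written.

```python
from collections import defaultdict


def long_exact_matches(a: str, b: str, min_len: int):
    """Return maximal exact matches between a and b as (start_a, start_b, length).

    Toy illustration only: seeds are min_len-mers looked up in a hash index,
    then extended to the right. Only matches of at least min_len are reported.
    """
    if min_len <= 0:
        raise ValueError("min_len must be positive")

    # Index every min_len-mer of `a` by its start position.
    index = defaultdict(list)
    for i in range(len(a) - min_len + 1):
        index[a[i:i + min_len]].append(i)

    matches = []
    for j in range(len(b) - min_len + 1):
        for i in index.get(b[j:j + min_len], ()):
            # Skip seeds that can be extended to the left: the same match
            # was already reported from an earlier seed.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right while the bases agree.
            length = min_len
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


# Two made-up sequences sharing one identical 10-base stretch.
genome_a = "AAAA" + "CGTACGGATC" + "TTTT"
genome_b = "GGGG" + "CGTACGGATC" + "AAGG"
shared = long_exact_matches(genome_a, genome_b, min_len=8)
```

In this toy setting a single shared stretch is reported once, with its full length, no matter how many seeds fall inside it. Deciding how long a match must be before it counts as evidence of transfer rather than shared ancestry is the job of the statistical model described in the abstract, which this sketch does not attempt.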


5. Projecting single-cell transcriptomics data onto a reference T cell atlas to interpret immune responses

Single-cell transcriptomics is a transformative technology to explore heterogeneous cell populations such as T cells, one of the most potent weapons against cancer and viral infections. Recent advances in this technology and the computational tools developed in their wake provide unique opportunities to build reference atlases that can be used to systematically compare new single-cell RNA-seq (scRNA-seq) datasets derived from different models or therapeutic conditions. We have developed ProjecTILs (https://github.com/carmonalab/ProjecTILs), a novel computational tool to project new scRNA-seq data into a reference map of T cells, allowing their direct comparison in a stable, annotated system of coordinates. ProjecTILs enables the classification of query cells into curated, discrete states, but also over a continuous space of intermediate states. We illustrate the projection of several datasets from recent publications over two novel cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection. ProjecTILs accurately predicted the effects of multiple perturbations, including the ablation of genes controlling T cell differentiation, such as Tox, Ptpn2, miR-155 and Regnase-1, and identified novel gene programs that were altered in these cells (such as a Lag3-Klrc1 inhibitory module), revealing mechanisms of action behind these immunotherapeutic targets and opening new opportunities for the identification of novel targets. By comparing multiple samples over the same reference map, and across alternative embeddings, our method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) in terms of transcriptional states and altered genetic programs.


6. The North American beaver genome helps reveal the mechanisms behind its longevity and cancer resistance

The genome of North American beaver provides insights into the mechanisms of its longevity and cancer resistance

The North American beaver (Castor canadensis) is an exceptionally long-lived and cancer-resistant rodent species, and thus an excellent model organism for comparative genomic studies of longevity. Here, we utilize a significantly improved beaver genome assembly to assess evolutionary changes in gene coding sequences, copy number, and expression. We found that the beaver Aldh1a1, a stem cell marker gene encoding an enzyme required for detoxification of ethanol and aldehydes, is expanded (~10 copies vs. two in mouse and one in human). We also show that the beaver cells are more resistant to ethanol, and beaver liver extracts show higher ability to metabolize aldehydes than the mouse samples. Furthermore, Hpgd, a tumor suppressor gene, is uniquely duplicated in the beaver among rodents. Our evolutionary analysis identified beaver genes under positive selection which are associated with tumor suppression and longevity. Genes involved in lipid metabolism show positive selection signals, changes in copy number and altered gene expression in beavers. Several genes involved in DNA repair showed a higher expression in beavers which is consistent with the trend observed in other long-lived mammals. In summary, we identified several genes that likely contribute to beaver longevity and cancer resistance, including increased ability to detoxify aldehydes, enhanced tumor suppression and DNA repair, and altered lipid metabolism.

7. Bugs surface in the population genetics tool msprime, and the authors publish a paper owning up to the errors

Lessons learned from bugs in models of human history

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.


The software's author, Oxford University professor Jerome Kelleher, also posted a series of tweets to apologize formally.

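8. IGD, an ultra-fast tool for processing genomic coordinates, said to be an order of magnitude faster than bedtools and similar software (from Nathan Sheffield's group at the University of Virginia)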

IGD: high-performance search for large-scale genomic interval datasets

Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.
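The abstract does not spell out how IGD's linear binning works, so the following Python sketch only illustrates the general idea of fixed-width binning for interval overlap queries. The class `BinnedIntervals`, the `BIN_SIZE` value and the in-memory dictionary layout are hypothetical choices for this example and should not be read as IGD's actual data structure or file format.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # hypothetical bin width in base pairs


class BinnedIntervals:
    """Toy fixed-width binning index for half-open intervals [start, end)."""

    def __init__(self, bin_size: int = BIN_SIZE):
        self.bin_size = bin_size
        self.bins = defaultdict(list)  # (chrom, bin number) -> intervals

    def _bin_range(self, start: int, end: int) -> range:
        # Bins touched by [start, end); end - 1 is the last covered base.
        return range(start // self.bin_size, (end - 1) // self.bin_size + 1)

    def add(self, chrom: str, start: int, end: int, label: str) -> None:
        """Store the interval in every bin it touches."""
        if end <= start:
            raise ValueError("interval must have positive length")
        for b in self._bin_range(start, end):
            self.bins[(chrom, b)].append((start, end, label))

    def overlaps(self, chrom: str, start: int, end: int) -> list:
        """Return sorted labels of stored intervals overlapping [start, end)."""
        hits = set()
        for b in self._bin_range(start, end):
            for s, e, label in self.bins.get((chrom, b), ()):
                if s < end and start < e:
                    hits.add(label)
        return sorted(hits)


idx = BinnedIntervals()
idx.add("chr1", 100, 200, "a")
idx.add("chr1", 16_000, 17_000, "b")  # spans two bins
idx.add("chr2", 100, 200, "c")
```

A query only inspects the bins its own range touches, so the cost depends on the local interval density rather than on the total size of the dataset. That locality is the property that makes binning attractive at scale, under the assumptions stated above.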



9. Unsupervised translation between Python, C++ and Java (arXiv)

Unsupervised Translation of Programming Languages

A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

10. [COVID-19] An online tool for studying SARS-CoV-2 variation (preprints.org)

CoV-GLUE: A Web Application for Tracking SARS-CoV-2 Genomic Variation

CoV-GLUE is an online web application for the interpretation and analysis of SARS-CoV-2 virus genome sequences, with a focus on amino acid sequence variation. It is based on the GLUE data-centric bioinformatics environment and provides a browsable database of amino acid replacements and coding region indels that have been observed in sequences from the pandemic. Users may also analyse their own SARS-CoV-2 sequences by submitting them to the web application to receive an interactive report containing visualisations of phylogenetic classification and highlighting genomic variation of potentially high impact, for example linked to primer mismatches.


References

1. PNAS will eliminate Communicated submissions in July 2010. Randy Schekman. PNAS September 15, 2009 106 (37) 15518; https://doi.org/10.1073/pnas.

2. Rand DG, Pfeiffer T (2009) Systematic Differences in Impact across Publication Tracks at PNAS. PLoS ONE 4(12): e8092

3. https://www.pnas.org/page/authors/journal-policies


Original work by the author; first published on the 生信人 WeChat public account.
