April 2020 bioRxiv Bioinformatics Highlights

Given the rapidly growing number of COVID-19 preprints, bioRxiv, medRxiv, and Google Scholar have all set up dedicated channels for finding COVID-19 papers.

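Note that bioRxiv will decline to post your work if the manuscript is a purely bioinformatic, COVID-19-focused prediction of drug action or efficacy. bioRxiv co-founder Richard Sever tweeted that this was a carefully considered decision by the team: "The balance we have to strike is speeding up science but also minimising potential for harm." The policy is meant to balance the speed and flexibility of preprints against the potentially serious clinical risks posed by their uneven quality and lack of peer review. Albert-László Barabási, a professor at Northeastern University in the United States whose submission was rejected, was unconvinced, however.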

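In response to quality concerns about COVID-19 preprints, researchers at the Immunology Institute at the Mount Sinai School of Medicine in the United States have been carrying out what they call pre-review of preprints on bioRxiv and medRxiv, that is, reviewing preprints ahead of formal peer review. They have posted 168 reviews so far; from a quick skim, all of them are very detailed. If you are interested, follow the medRxiv user @sinaiimmunologyreviewproject and the Twitter account @SinaiImmunol. In short, balancing speed against quality remains a major challenge for bioRxiv and the wider preprint community.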


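The past month also brought sad news: James Taylor, co-founder of Galaxy and professor at Johns Hopkins University, passed away on April 2 at the age of just 40. A standard-bearer for open science, Taylor posted 11 manuscripts on bioRxiv and was dedicated to transparent, reproducible bioinformatic analysis, all of which fits the preprint ethos. In 2005, while a PhD student at the Center for Comparative Genomics and Bioinformatics at Pennsylvania State University, Taylor and his collaborators began building Galaxy [1]. Fifteen years on, Galaxy is known worldwide: its friendly yet powerful interactive analysis has taken the mystery out of bioinformatics and brought countless biologists into the world of genomics. The ever-growing collection of analysis tools in Galaxy today is perhaps the best tribute to Professor Taylor.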

Source: The Galaxy Project [2]


1. A Galaxy-based integrated environment for interactive scRNA-seq analysis

User-friendly, scalable tools and workflows for single-cell analysis

Single-cell RNA-Seq (scRNA-Seq) data analysis requires expertise in command-line tools, programming languages and scaling on compute infrastructure. As scRNA-Seq becomes widespread, computational pipelines need to be more accessible, simpler and scalable. We introduce an interactive analysis environment for scRNA-Seq, based on Galaxy, with ~70 functions from major single-cell analysis tools, which can be run on compute clusters, cloud providers or single machines, to bring compute to the data in scRNA-Seq.


2. Gamete binning: chromosome-level, haplotype-resolved genome assembly

Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes

Generating haplotype-resolved, chromosome-level assemblies of heterozygous genomes remains challenging. To address this, we developed gamete binning, a method based on single-cell sequencing of hundreds of haploid gamete genomes, which enables the separation of conventional long sequencing reads into two haplotype-specific read sets. After independently assembling the reads of each haplotype, the contigs are scaffolded to chromosome-level using a genetic map derived from the recombination patterns within the same gamete genomes. As a proof-of-concept, we assembled the two genomes of a diploid apricot tree supported by the analysis of 445 pollen genomes. Both assemblies (N50: 25.5 and 25.8 Mb) featured a haplotyping precision of >99% and were accurately scaffolded to chromosome-level as reflected by high levels of synteny to closely-related species. These two assemblies allowed for first insights into haplotype diversity of apricot and enabled the identification of non-allelic crossover events introducing severe chromosomal anomalies in 1.6% of the pollen genomes.
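The abstract summarizes each assembly with an N50 value. As a reminder of what that number means, here is a minimal sketch of the standard N50 calculation. The function name `n50` and the toy contig lengths are made up for illustration and have nothing to do with the apricot data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths (arbitrary units), purely illustrative.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why the abstract also reports haplotyping precision and synteny.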


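3. CUHK researchers: genomes of the horseshoe crab, a living fossil of the animal kingdom, reveal microRNA evolution after three rounds of whole genome duplication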

Horseshoe crab genomes reveal the evolutionary fates of genes and microRNAs after three rounds (3R) of whole genome duplication

Whole genome duplication (WGD) has occurred in relatively few sexually reproducing invertebrates. Consequently, the WGD that occurred in the common ancestor of horseshoe crabs ~135 million years ago provides a rare opportunity to decipher the evolutionary consequences of a duplicated invertebrate genome. Here, we present a high-quality genome assembly for the mangrove horseshoe crab Carcinoscorpius rotundicauda (1.7Gb, N50 = 90.2Mb, with 89.8% sequences anchored to 16 pseudomolecules, 2n = 32), and a resequenced genome of the tri-spine horseshoe crab Tachypleus tridentatus (1.7Gb, N50 = 109.7Mb). Analyses of gene families, microRNAs, and synteny show that horseshoe crabs have undergone three rounds (3R) of WGD, and that these WGD events are shared with spiders. Comparison of the genomes of C. rotundicauda and T. tridentatus populations from several geographic locations further elucidates the diverse fates of both coding and noncoding genes. Together, the present study represents a cornerstone for a better understanding of the consequences of invertebrate WGD events on evolutionary fates of genes and microRNAs at individual and population levels, and highlights the genetic diversity with practical values for breeding programs and conservation of horseshoe crabs.

4. Three new monkey genomes aid primate phylogenomic analysis

Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression

Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here we present new reference genome assemblies for three Old World Monkey species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.
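The abstract infers introgression from asymmetric gene tree discordance. The underlying idea is that, for three lineages, incomplete lineage sorting alone is expected to produce the two discordant topologies at equal frequencies, so a strong excess of one of them hints at gene flow. The sketch below illustrates that idea with a plain exact binomial test; it is not the statistic used in the preprint, and the function name and counts are hypothetical.

```python
from math import comb


def minor_topology_asymmetry(n_alt1, n_alt2):
    """Two-sided exact binomial p-value for equal minor-topology counts.

    n_alt1, n_alt2: numbers of gene trees supporting each of the two
    discordant topologies around one branch of the species tree.
    Null hypothesis: both topologies are equally likely (p = 0.5).
    """
    n = n_alt1 + n_alt2
    if n == 0:
        raise ValueError("need at least one discordant gene tree")
    k = max(n_alt1, n_alt2)
    # Upper tail P(X >= k) under Binomial(n, 0.5), kept as exact integers.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * tail / 2**n)


# Hypothetical counts: 10 trees for one minor topology, 0 for the other.
print(minor_topology_asymmetry(10, 0))
```

In a real analysis, many branches are tested at once, so a multiple-testing correction would be needed on top of this.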


5. Which structural variant (SV) caller performs best?

A comprehensive benchmarking of WGS-based structural variant callers

Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.
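The abstract stresses that a complete gold standard lets the authors report both precision and sensitivity: every call absent from the gold set can be counted as a false positive. A rough sketch of that bookkeeping for deletions follows. The matching rule (same chromosome, both breakpoints within a tolerance `tol`) and all names and coordinates are assumptions for illustration, not the criteria used in the preprint.

```python
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int
    end: int


def matches(call, truth, tol):
    """Hypothetical rule: same chromosome, both breakpoints within tol bp."""
    return (call.chrom == truth.chrom
            and abs(call.start - truth.start) <= tol
            and abs(call.end - truth.end) <= tol)


def precision_sensitivity(calls, gold, tol=100):
    """Greedily match calls to gold deletions, one call per gold entry."""
    matched_gold = set()
    true_positive_calls = 0
    for call in calls:
        hit = next((i for i, truth in enumerate(gold)
                    if i not in matched_gold and matches(call, truth, tol)),
                   None)
        if hit is not None:
            matched_gold.add(hit)
            true_positive_calls += 1
    precision = true_positive_calls / len(calls) if calls else 0.0
    sensitivity = len(matched_gold) / len(gold) if gold else 0.0
    return precision, sensitivity


# Made-up coordinates: two calls match the gold set, two do not.
gold = [Deletion("chr1", 1000, 2000), Deletion("chr1", 5000, 5600),
        Deletion("chr2", 300, 900)]
calls = [Deletion("chr1", 1010, 1990), Deletion("chr2", 310, 905),
         Deletion("chr3", 100, 400), Deletion("chr1", 7000, 7200)]
print(precision_sensitivity(calls, gold))
```

With an incomplete gold set, the two unmatched calls could not safely be called false positives, which is why earlier benchmarks could report sensitivity but not precision.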

Is this stuck at the very first step?


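6. A RAD-Seq-based computational tool for studying sex chromosomes, from France's National Research Institute for Agriculture, Food and Environment (INRAE)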

RADSex: a computational workflow to study sex determination using Restriction Site-Associated DNA Sequencing data

The study of sex determination and sex chromosome organisation in non-model species has long been technically challenging, but new sequencing methodologies are now enabling precise and high-throughput identification of sex-specific genomic sequences. In particular, Restriction Site-Associated DNA Sequencing (RAD-Seq) is being extensively applied to explore sex determination systems in many plant and animal species. However, software designed to specifically search for sex-biased markers using RAD-Seq data is lacking. Here, we present RADSex, a computational analysis workflow designed to study the genetic basis of sex determination using RAD-Seq data. RADSex is simple to use, requires few computational resources, makes no prior assumptions about type of sex-determination system or structure of the sex locus, and offers convenient visualization through a dedicated R package. To demonstrate the functionality of RADSex, we re-analyzed a published dataset of Japanese medaka, Oryzias latipes, where we uncovered a previously unknown Y chromosome polymorphism. We then used RADSex to analyze new RAD-Seq datasets from 15 fish species spanning multiple systematic orders. We identified the sex determination system and sex-specific markers in six of these species, five of which had no known sex-markers prior to this study. We show that RADSex greatly facilitates the study of sex determination systems in non-model species and outperforms the commonly used RAD-Seq analysis software STACKS in speed, resource usage, ease of application, and visualization options. Furthermore, our analysis of new datasets from 15 species provides new insights on sex determination in fish.


7. A new genome assembly of bread wheat: more than 5,700 new genes found?

Chromosome-scale assembly of the bread wheat genome, Triticum aestivum, reveals over 5700 new genes

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all 3 wheat subgenomes at chromosome scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 gigabases of genomic sequence. We earlier published an independent wheat assembly (Triticum 3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC 1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum 4.0, contains 15.07 gigabases of non-gap sequence anchored to chromosomes, which is 1.2 gigabases more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered more than 5700 new genes, all of them duplications in the Chinese Spring genome that are missing from the IWGSC assembly and annotation. The Triticum 4.0 assembly and annotations are freely available at www.ncbi.nlm.nih.gov/bioproject/PRJNA392179.


8. Detlef Weigel, Max Planck Institute for Developmental Biology: from single-genome to multi-genome reference sequences

An Algorithm to Build a Multi-genome Reference

To overcome the limits imposed by mapping sequence reads against a single reference genome, or serially mapping them against multiple reference genomes, we have developed the MGR method that allows simultaneous comparison against multiple high-quality reference genomes, in order to remove the bias that comes from using only a single-genome reference and to simplify downstream analyses. To this end, we present the MGR algorithm that creates a graph (MGR graph) as a multi-genome reference. To reduce the size and complexity of the multi-genome reference, highly similar orthologous and paralogous regions are collapsed while more substantial differences are retained. To evaluate the performance of our model, we have developed a genome compression tool, which can be used to estimate the amount of shared information between genomes.

9. [COVID-19] Dalian University of Technology: Computational analysis suggests putative intermediate animal hosts of the SARS-CoV-2

The recent emerged SARS-CoV-2 may first transmit to intermediate animal host from bats before the spread to humans. The receptor recognition of ACE2 protein by SARS-CoVs or bat-originated coronaviruses is one of the most important determinant factors for the cross-species transmission and human-to-human transmission. To explore the hypothesis of possible intermediate animal host, we employed molecular dynamics simulation and free energy calculation to examine the binding of bat coronavirus with ACE2 proteins of 47 representing animal species collected from public databases. Our results suggest that intermediate animal host may exist for the zoonotic transmission of SARS-CoV-2. Furthermore, we found that tree shrew and ferret may be two putative intermediate hosts for the zoonotic spread of SARS-CoV-2. Collectively, the continuous surveillance of pneumonia in human and suspicious animal hosts are crucial to control the zoonotic transmission events caused by SARS-CoV-2.


10. [COVID-19] Introductions and early spread of SARS-CoV-2 in France

Following the emergence of coronavirus disease (COVID-19) in Wuhan, China in December 2019, specific COVID-19 surveillance was launched in France on January 10, 2020. Two weeks later, the first three imported cases of COVID-19 into Europe were diagnosed in France. We sequenced 97 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes from samples collected between January 24 and March 24, 2020 from infected patients in France. Phylogenetic analysis identified several early independent SARS-CoV-2 introductions without local transmission, highlighting the efficacy of the measures taken to prevent virus spread from symptomatic cases. In parallel, our genomic data reveals the later predominant circulation of a major clade in many French regions, and implies local circulation of the virus in undocumented infections prior to the wave of COVID-19 cases. This study emphasizes the importance of continuous and geographically broad genomic sequencing and calls for further efforts with inclusion of asymptomatic infections.


11. [Author-submitted] CytoTalk: De novo construction of signal transduction networks using single-cell RNA-Seq data

Single-cell technology has opened the door for studying signal transduction in a complex tissue at unprecedented resolution. However, there is a lack of analytical methods for de novo construction of signal transduction pathways using single-cell omics data. Here we present CytoTalk, a computational method for de novo constructing cell type-specific signal transduction networks using single-cell RNA-Seq data. CytoTalk first constructs intracellular and intercellular gene-gene interaction networks using an information-theoretic measure between two cell types. Candidate signal transduction pathways in the integrated network are identified using the prize-collecting Steiner forest algorithm. We applied CytoTalk to a single-cell RNA-Seq data set on mouse visual cortex and evaluated predictions using high-throughput spatial transcriptomics data generated from the same tissue. Compared to published methods, genes in our inferred signaling pathways have significantly higher spatial expression correlation only in cells that are spatially closer to each other, suggesting improved accuracy of CytoTalk. Furthermore, using single-cell RNA-Seq data with receptor gene perturbation, we found that predicted pathways are enriched for differentially expressed genes between the receptor knockout and wild type cells, further validating the accuracy of CytoTalk. In summary, CytoTalk enables de novo construction of signal transduction pathways and facilitates comparative analysis of these pathways across tissues and conditions.


References

1. Nekrutenko, A., Schatz, M.C. In memory of James Taylor: the birth of Galaxy. Genome Biol 21, 105 (2020). https://doi.org/10.1186/s13059-020-02016-0

2. https://galaxyproject.org/jxtx/


Original work by the author; first published on the 生信人 WeChat official account.
