bioRxiv Bioinformatics Highlights, April 2020

With COVID-19 preprints growing at a rapid pace, bioRxiv, medRxiv, and Google Scholar have each opened dedicated channels for finding articles on the topic.

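Note that bioRxiv will decline to post your work if it is a purely computational study that predicts drug action or efficacy against COVID-19. bioRxiv co-founder Richard Sever explained on Twitter that the team reached this decision after careful deliberation: "The balance we have to strike is speeding up science but also minimising potential for harm." The policy is meant to weigh the speed and flexibility of preprints against the potentially serious clinical risks that come with their uneven quality and lack of peer review. Albert-László Barabási, a professor at Northeastern University in the US whose manuscript was turned away, was not persuaded.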

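To address the quality of COVID-19 preprints, researchers at the Immunology Institute at the Mount Sinai School of Medicine have been carrying out what they call pre-review of preprints on bioRxiv and medRxiv, that is, reviewing preprints ahead of formal peer review. They have posted 168 reviews so far; I skimmed through them and found them all very thorough. If you are interested, follow the medRxiv user @sinaiimmunologyreviewproject and the Twitter account @SinaiImmunol. In short, balancing speed against quality remains a major challenge for bioRxiv and the whole preprint community.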

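The past month also brought sad news: James Taylor, co-founder of Galaxy and professor at Johns Hopkins University, passed away on April 2 at the age of just 40. A standard-bearer for open science, Taylor posted 11 manuscripts on bioRxiv and was devoted to making bioinformatics analysis transparent and reproducible, goals that align closely with the ideals of preprints. In 2005, while a PhD student at Penn State's Center for Comparative Genomics and Bioinformatics, Taylor and his collaborators began building Galaxy [1]. Fifteen years on, Galaxy is known worldwide; its friendly yet powerful interactive analysis has demystified bioinformatics and brought countless biologists into the world of genomics. The steady stream of new analysis tools in Galaxy today is perhaps the best tribute to Professor Taylor.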

Source: The Galaxy Project [2]

1. A Galaxy-based integrated environment for interactive scRNA-seq analysis

User-friendly, scalable tools and workflows for single-cell analysis

Single-cell RNA-Seq (scRNA-Seq) data analysis requires expertise in command-line tools, programming languages and scaling on compute infrastructure. As scRNA-Seq becomes widespread, computational pipelines need to be more accessible, simpler and scalable. We introduce an interactive analysis environment for scRNA-Seq, based on Galaxy, with ~70 functions from major single-cell analysis tools, which can be run on compute clusters, cloud providers or single machines, to bring compute to the data in scRNA-Seq.

2. Gamete binning: chromosome-level, haplotype-resolved genome assembly

Chromosome-level and haplotype-resolved genome assembly enabled by high-throughput single-cell sequencing of gamete genomes

Generating haplotype-resolved, chromosome-level assemblies of heterozygous genomes remains challenging. To address this, we developed gamete binning, a method based on single-cell sequencing of hundreds of haploid gamete genomes, which enables the separation of conventional long sequencing reads into two haplotype-specific read sets. After independently assembling the reads of each haplotype, the contigs are scaffolded to chromosome-level using a genetic map derived from the recombination patterns within the same gamete genomes. As a proof-of-concept, we assembled the two genomes of a diploid apricot tree supported by the analysis of 445 pollen genomes. Both assemblies (N50: 25.5 and 25.8 Mb) featured a haplotyping precision of >99% and were accurately scaffolded to chromosome-level as reflected by high levels of synteny to closely-related species. These two assemblies allowed for first insights into haplotype diversity of apricot and enabled the identification of non-allelic crossover events introducing severe chromosomal anomalies in 1.6% of the pollen genomes.
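For readers new to assembly statistics, the N50 values quoted in this abstract (25.5 and 25.8 Mb) can be read as follows: sort the contigs from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. Below is a minimal Python sketch of that definition. The function name and the toy contig lengths are made up for illustration and have nothing to do with the apricot data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly reported next to the total assembly size.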

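3. CUHK researchers: genomes of the horseshoe crab, a living fossil of the animal kingdom, reveal how microRNAs evolved after three rounds of whole genome duplication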

Horseshoe crab genomes reveal the evolutionary fates of genes and microRNAs after three rounds (3R) of whole genome duplication

Whole genome duplication (WGD) has occurred in relatively few sexually reproducing invertebrates. Consequently, the WGD that occurred in the common ancestor of horseshoe crabs ~135 million years ago provides a rare opportunity to decipher the evolutionary consequences of a duplicated invertebrate genome. Here, we present a high-quality genome assembly for the mangrove horseshoe crab Carcinoscorpius rotundicauda (1.7Gb, N50 = 90.2Mb, with 89.8% sequences anchored to 16 pseudomolecules, 2n = 32), and a resequenced genome of the tri-spine horseshoe crab Tachypleus tridentatus (1.7Gb, N50 = 109.7Mb). Analyses of gene families, microRNAs, and synteny show that horseshoe crabs have undergone three rounds (3R) of WGD, and that these WGD events are shared with spiders. Comparison of the genomes of C. rotundicauda and T. tridentatus populations from several geographic locations further elucidates the diverse fates of both coding and noncoding genes. Together, the present study represents a cornerstone for a better understanding of the consequences of invertebrate WGD events on evolutionary fates of genes and microRNAs at individual and population levels, and highlights the genetic diversity with practical values for breeding programs and conservation of horseshoe crabs.

4. Three new monkey genomes advance primate phylogenomics

Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression

Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here we present new reference genome assemblies for three Old World Monkey species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.
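The logic behind "strongly asymmetric patterns of gene tree discordance" is worth spelling out. Under incomplete lineage sorting alone, the two minority topologies around a branch are expected to appear at roughly equal frequency, so a clear excess of one over the other hints at introgression. The Python sketch below illustrates that reasoning with a simple normalized difference and an exact two-sided binomial test. It is not the authors' pipeline: the function names and the counts are hypothetical, and the test is a generic stand-in for whatever statistics the paper actually uses.

```python
from math import comb


def asymmetry(n_minor1, n_minor2):
    """Normalized difference between the two minority topology counts."""
    total = n_minor1 + n_minor2
    if total == 0:
        raise ValueError("need at least one discordant gene tree")
    return abs(n_minor1 - n_minor2) / total


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials with p = 0.5."""
    probs = [comb(n, i) / 2 ** n for i in range(n + 1)]
    observed = probs[k]
    # Sum the probability of every outcome at least as unlikely as the
    # observed one (small tolerance guards against float ties).
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts of the two discordant topologies around one branch.
minor1, minor2 = 80, 20
print(asymmetry(minor1, minor2))
print(binomial_two_sided_p(minor1, minor1 + minor2))
```

In practice such a test would be run for many branches, so some correction for multiple testing would be needed before reading anything into a single small p-value.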

5. Which structural variant (SV) caller performs best?

A comprehensive benchmarking of WGS-based structural variant callers

Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.

Is this still stuck at the very first step?
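The two numbers this abstract reports, precision and sensitivity, can be sketched for deletion calls as follows. A call counts as a true positive when it is on the same chromosome as a gold-standard deletion and both breakpoints fall within a tolerance. The matching rule, the 100 bp default, the function name, and all coordinates below are assumptions made for illustration; they are not taken from the paper's evaluation protocol.

```python
def evaluate_deletions(calls, truth, tolerance=100):
    """Return (precision, sensitivity) for deletion calls.

    calls and truth are lists of (chrom, start, end) tuples. Each truth
    deletion can be matched by at most one call (greedy matching).
    """
    unmatched = list(truth)
    true_pos = 0
    for chrom, start, end in calls:
        for candidate in unmatched:
            t_chrom, t_start, t_end = candidate
            if (chrom == t_chrom
                    and abs(start - t_start) <= tolerance
                    and abs(end - t_end) <= tolerance):
                unmatched.remove(candidate)  # consume the matched truth SV
                true_pos += 1
                break
    precision = true_pos / len(calls) if calls else 0.0
    sensitivity = true_pos / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical gold-standard deletions and caller output.
truth = [("chr1", 1000, 2000), ("chr1", 5000, 5600),
         ("chr2", 300, 900), ("chr2", 4000, 4800)]
calls = [("chr1", 1010, 1990), ("chr1", 5200, 5600), ("chr2", 310, 905)]
print(evaluate_deletions(calls, truth))
```

Precision can only be reported when the gold standard is complete, which is the point the abstract makes about its dataset: with a partial truth set, an unmatched call cannot be labeled a false positive.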

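6. A RAD-Seq-based computational tool for studying sex chromosomes, from INRAE (France's National Research Institute for Agriculture, Food and Environment)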

RADSex: a computational workflow to study sex determination using Restriction Site-Associated DNA Sequencing data

The study of sex determination and sex chromosome organisation in non-model species has long been technically challenging, but new sequencing methodologies are now enabling precise and high-throughput identification of sex-specific genomic sequences. In particular, Restriction Site-Associated DNA Sequencing (RAD-Seq) is being extensively applied to explore sex determination systems in many plant and animal species. However, software designed to specifically search for sex-biased markers using RAD-Seq data is lacking. Here, we present RADSex, a computational analysis workflow designed to study the genetic basis of sex determination using RAD-Seq data. RADSex is simple to use, requires few computational resources, makes no prior assumptions about type of sex-determination system or structure of the sex locus, and offers convenient visualization through a dedicated R package. To demonstrate the functionality of RADSex, we re-analyzed a published dataset of Japanese medaka, Oryzias latipes, where we uncovered a previously unknown Y chromosome polymorphism. We then used RADSex to analyze new RAD-Seq datasets from 15 fish species spanning multiple systematic orders. We identified the sex determination system and sex-specific markers in six of these species, five of which had no known sex-markers prior to this study. We show that RADSex greatly facilitates the study of sex determination systems in non-model species and outperforms the commonly used RAD-Seq analysis software STACKS in speed, resource usage, ease of application, and visualization options. Furthermore, our analysis of new datasets from 15 species provides new insights on sex determination in fish.

7. A new bread wheat genome assembly: over 5,700 new genes?

Chromosome-scale assembly of the bread wheat genome, Triticum aestivum, reveals over 5700 new genes

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all 3 wheat subgenomes at chromosome scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 gigabases of genomic sequence. We earlier published an independent wheat assembly (Triticum 3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC 1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum 4.0, contains 15.07 gigabases of non-gap sequence anchored to chromosomes, which is 1.2 gigabases more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered more than 5700 new genes, all of them duplications in the Chinese Spring genome that are missing from the IWGSC assembly and annotation. The Triticum 4.0 assembly and annotations are freely available at www.ncbi.nlm.nih.gov/bioproject/PRJNA392179.

8. Detlef Weigel (Max Planck Institute for Developmental Biology): from single-genome to multi-genome references

An Algorithm to Build a Multi-genome Reference

To overcome the limits imposed by mapping sequence reads against a single reference genome, or serially mapping them against multiple reference genomes, we have developed the MGR method that allows simultaneous comparison against multiple high-quality reference genomes, in order to remove the bias that comes from using only a single-genome reference and to simplify downstream analyses. To this end, we present the MGR algorithm that creates a graph (MGR graph) as a multi-genome reference. To reduce the size and complexity of the multi-genome reference, highly similar orthologous and paralogous regions are collapsed while more substantial differences are retained. To evaluate the performance of our model, we have developed a genome compression tool, which can be used to estimate the amount of shared information between genomes.
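The idea that compression can estimate shared information between genomes is easy to try out. If two sequences share a lot of content, compressing them together costs little more than compressing one alone. The Python sketch below uses the normalized compression distance with the general-purpose zlib compressor. This is a generic stand-in chosen for illustration, not the MGR compression tool, and the sequences are randomly generated toy data.

```python
import random
import zlib


def compressed_size(seq):
    """Size in bytes of the zlib-compressed sequence."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance: near 0 if similar, near 1 if not."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def random_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


def mutate(seq, rate):
    """Redraw each base at random with the given probability."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base
                   for base in seq)


genome_a = random_dna(5000)
genome_b = mutate(genome_a, 0.01)  # close relative of genome_a
genome_c = random_dna(5000)        # unrelated sequence

print(ncd(genome_a, genome_b))  # expected to be the smaller value
print(ncd(genome_a, genome_c))
```

zlib only looks back over a 32 KB window, so this toy version works for short sequences only; real genomes would need a compressor designed for long-range repeats.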

9. [COVID-19] Dalian University of Technology: Computational analysis suggests putative intermediate animal hosts of the SARS-CoV-2

The recent emerged SARS-CoV-2 may first transmit to intermediate animal host from bats before the spread to humans. The receptor recognition of ACE2 protein by SARS-CoVs or bat-originated coronaviruses is one of the most important determinant factors for the cross-species transmission and human-to-human transmission. To explore the hypothesis of possible intermediate animal host, we employed molecular dynamics simulation and free energy calculation to examine the binding of bat coronavirus with ACE2 proteins of 47 representing animal species collected from public databases. Our results suggest that intermediate animal host may exist for the zoonotic transmission of SARS-CoV-2. Furthermore, we found that tree shrew and ferret may be two putative intermediate hosts for the zoonotic spread of SARS-CoV-2. Collectively, the continuous surveillance of pneumonia in human and suspicious animal hosts are crucial to control the zoonotic transmission events caused by SARS-CoV-2.

10. [COVID-19] Introductions and early spread of SARS-CoV-2 in France

Following the emergence of coronavirus disease (COVID-19) in Wuhan, China in December 2019, specific COVID-19 surveillance was launched in France on January 10, 2020. Two weeks later, the first three imported cases of COVID-19 into Europe were diagnosed in France. We sequenced 97 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes from samples collected between January 24 and March 24, 2020 from infected patients in France. Phylogenetic analysis identified several early independent SARS-CoV-2 introductions without local transmission, highlighting the efficacy of the measures taken to prevent virus spread from symptomatic cases. In parallel, our genomic data reveals the later predominant circulation of a major clade in many French regions, and implies local circulation of the virus in undocumented infections prior to the wave of COVID-19 cases. This study emphasizes the importance of continuous and geographically broad genomic sequencing and calls for further efforts with inclusion of asymptomatic infections.

11. [Author submission] CytoTalk: De novo construction of signal transduction networks using single-cell RNA-Seq data

Single-cell technology has opened the door for studying signal transduction in a complex tissue at unprecedented resolution. However, there is a lack of analytical methods for de novo construction of signal transduction pathways using single-cell omics data. Here we present CytoTalk, a computational method for de novo constructing cell type-specific signal transduction networks using single-cell RNA-Seq data. CytoTalk first constructs intracellular and intercellular gene-gene interaction networks using an information-theoretic measure between two cell types. Candidate signal transduction pathways in the integrated network are identified using the prize-collecting Steiner forest algorithm. We applied CytoTalk to a single-cell RNA-Seq data set on mouse visual cortex and evaluated predictions using high-throughput spatial transcriptomics data generated from the same tissue. Compared to published methods, genes in our inferred signaling pathways have significantly higher spatial expression correlation only in cells that are spatially closer to each other, suggesting improved accuracy of CytoTalk. Furthermore, using single-cell RNA-Seq data with receptor gene perturbation, we found that predicted pathways are enriched for differentially expressed genes between the receptor knockout and wild type cells, further validating the accuracy of CytoTalk. In summary, CytoTalk enables de novo construction of signal transduction pathways and facilitates comparative analysis of these pathways across tissues and conditions.

References

1. Nekrutenko, A., Schatz, M.C. In memory of James Taylor: the birth of Galaxy. Genome Biol 21, 105 (2020). https://doi.org/10.1186/s13059-020-02016-0

2. https://galaxyproject.org/jxtx/

Original work by the author, first published on the 生信人 WeChat public account.
