Paper intensive reading (19): Batch effects in a multiyear sequencing study

Paper title: Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths

Scholar citations: 9

Pages: 11

Publication date: 1 March 2018

Journal: Mol Ecol Resour.

Authors: D. M. Leigh, H. E. L. Lischer, C. Grossen, L. F. Kelle

University of Zurich

Abstract:

High-throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long-term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high-throughput sequencing studies.

Conclusions:

  • Studies that gather sequencing data over several years should look for alleles associated with specific batches of data, for instance, by testing for a significant association between allele frequencies and batch identity (The UK10K Consortium, 2015), and filter the data accordingly or change upstream bioinformatics steps like the SNP caller. (Note: the processing that datasets collected over a long period should go through.)
  • In addition, batch effect detection and correction tools that have been designed for microarray data (Johnson, Li, & Rabinovic, 2007; Leek, Johnson, Parker, Jaffe, & Storey, 2012; Manimaran et al., 2016), and SNP data (e.g., batchTest in GWASTOOLS; Gogarten et al., 2012), can be employed. (Note: the methods are transferable to some extent.)
  • Reporting of the most common sources of batch effects, that is, differences in read length, sequencing technology, or sequencing centre, should become standard procedure so that this information is available to data analysts (Leek et al., 2010).
  • Various bioinformatics steps helped to remove the batch effect reported here. Firstly, using more stringent SNP filters removed the false SNP calls, but presumably at the cost of removing many true SNPs. (Note: filtering out.)
  • A second, alternative approach was to employ a different SNP caller. Calling variants with GATK’s HaplotypeCaller did lead to correct SNP calls.
  • This indicates that it is important to evaluate not only SNP filters but also SNP callers in projects that combine data from multiple sequencing runs.
  • The study highlights the need for careful control of study design, bioinformatics pipelines and data analysis to prevent batch effects. (Note: experimental design.)
  • Only a combination of approaches will ensure that biologically spurious conclusions from batch effects in HTS data are kept to a minimum.
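
A minimal sketch of the first recommendation above (testing each SNP for an association between allele counts and batch identity). This is not the paper's pipeline or the batchTest function from GWASTOOLS; the function names, the two-batch layout, the Bonferroni threshold and the allele counts are all assumptions made up for illustration. It uses a plain chi-square test with 1 degree of freedom, which is only approximate when expected counts are small (an exact test would be safer there).

```python
import math


def batch_association_pvalue(alt_a, ref_a, alt_b, ref_b):
    """Chi-square test (1 d.f.) on a 2x2 table: allele (alt/ref) by batch (A/B)."""
    table = [[alt_a, ref_a], [alt_b, ref_b]]
    n = alt_a + ref_a + alt_b + ref_b
    row_totals = [alt_a + ref_a, alt_b + ref_b]   # alleles observed per batch
    col_totals = [alt_a + alt_b, ref_a + ref_b]   # alt / ref totals over batches
    if 0 in row_totals or 0 in col_totals:
        return 1.0  # empty batch or monomorphic site: nothing to test
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def flag_batch_snps(counts, alpha=0.05):
    """Return SNP ids whose p-value passes a Bonferroni-corrected threshold."""
    if not counts:
        return []
    threshold = alpha / len(counts)
    return [
        snp_id
        for snp_id, (alt_a, ref_a, alt_b, ref_b) in counts.items()
        if batch_association_pvalue(alt_a, ref_a, alt_b, ref_b) < threshold
    ]


# Hypothetical allele counts: (alt in batch A, ref in A, alt in B, ref in B)
counts = {
    "snp_1": (10, 30, 12, 28),
    "snp_2": (0, 40, 25, 15),   # variant allele seen only in batch B
    "snp_3": (20, 20, 18, 22),
}
print(flag_batch_snps(counts))
```

When batch and population are confounded, as in the ibex study, a flagged SNP is a candidate for closer inspection rather than proof of a technical artefact, because true population differences would be flagged as well.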

Introduction:

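  • genome-wide single nucleotide polymorphism (SNP)
  • Restriction-site-associated DNA sequencing (RAD-seq) is a reduced-representation genome sequencing (RRGS) technique built on next-generation sequencing technology.
  • RAD-seq uses restriction endonucleases to fragment the genome; the fragments are modified and ligated to barcoded adapters to build a library, which is then sequenced. Because it is simple to perform, low in cost and high in throughput, it is used in molecular ecology, evolutionary genomics, conservation genetics and related fields. (Note: our group may not necessarily work with this?)
  • The simplest and most widely known form of errors is false SNP calls arising from sequencing error. (Note: the most prominent error in high-throughput sequencing.)
  • Batch effects are thus technical sources of variation that differ among subsets of the data.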
  • When batch effects are correlated with biological variables, the systematic differences among batches may lead to invalid biological conclusions. 
  • The ideal solution: One way to address batch effects in HTS studies is to randomly divide samples from a population or experimental group across libraries and sequencing lanes.
  • In practice: as HTS sequencing develops, an increasing number of multiyear or long-term studies will add sequencing data over time.
  • In contrast, when scientific questions focus on specific SNPs, batch effects may be more problematic. (Note: in this situation the impact is larger.)
  • For example, if a reduced representation sequencing method, such as restriction site-associated DNA sequencing (RAD-seq), is used in a GWAS (Yu et al., 2015), each relevant section of the genome will often only be represented by a single SNP (Lowry et al., 2017; Mckinney, Larson, Seeb, & Seeb, 2017). (Note: an example.) In such cases, batch effects can easily cause bias.
  • It should be noted, however, that in studies with a higher marker density (e.g., from whole genome sequencing), associations would be confirmed by a cluster of markers rather than a single marker, making false associations due to batch effects less likely. (Note: with whole genome sequencing the impact is smaller.)
  • The paper's discussion of batch effects is built mainly around one case study: We discuss the origin of this batch effect, how it was identified, and its impact on the biological conclusions. We end with a brief discussion of ways in which the consequences of batch effects can be reduced.
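
A minimal sketch of the randomization idea in the bullets above (dividing the samples of each population across libraries or sequencing lanes). The paper gives no code for this; the function name, the round-robin scheme, the seed and the sample ids are assumptions for illustration only. Samples are shuffled within each population and then dealt out to lanes in turn, so no population ends up concentrated in one batch.

```python
import random
from collections import Counter, defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Spread each population evenly over batches.

    samples: list of (sample_id, population) tuples.
    Returns a dict mapping sample_id to a batch index in range(n_batches).
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    by_population = defaultdict(list)
    for sample_id, population in samples:
        by_population[population].append(sample_id)

    assignment = {}
    offset = 0  # carried across populations to keep overall batch sizes even
    for population in sorted(by_population):
        ids = by_population[population]
        rng.shuffle(ids)  # random order within the population
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = (offset + i) % n_batches
        offset += len(ids)
    return assignment


# Hypothetical design: 3 populations, 6 samples each, 3 sequencing lanes
samples = [
    (f"{pop}_{i}", pop)
    for pop in ("pop_A", "pop_B", "pop_C")
    for i in range(6)
]
population_of = dict(samples)
assignment = assign_batches(samples, n_batches=3, seed=42)

# Count samples per (population, lane) to check the balance
balance = Counter((population_of[s], lane) for s, lane in assignment.items())
for key in sorted(balance):
    print(key, balance[key])
```

In a multiyear study where populations are added incrementally, this kind of full randomization is not available, which is the situation the paper describes.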

Structure of the main text:

1. Introduction

2. Methods

2.1 Study system

2.2 Data collection

2.3 Data processing

2.4 Selection detection

2.5 Detection of false SNPs

2.6 Preventing batch effects in selection analyses

2.7 Removing the false SNP calls

3. Results

3.1 Screening for signals of selection

3.2 Preventing batch effects in selection analysis

3.3 Removing false SNP calls

4. Discussion

4.1 Identifying batch effects

4.2 Removing false SNP calls

Excerpts from the main text:
