Paper intensive reading (15): Mitigating the adverse impact of batch effects in sample pattern detection

Paper title: Mitigating the adverse impact of batch effects in sample pattern detection

Scholar citations: 6

Pages: 8

Published: 1 March 2018

Journal: Bioinformatics

Authors: Teng Fei (1), Tengjiao Zhang (2), Weiyang Shi (3,*) and Tianwei Yu (1,*)

Affiliations: Emory University and Tongji University. Same author as Paper intensive reading (12).

Abstract:

Motivation: It is well known that batch effects exist in RNA-seq data and other profiling data. Although some methods do a good job adjusting for batch effects by modifying the data matrices, it is still difficult to remove the batch effects entirely. The remaining batch effect can cause artifacts in the detection of patterns in the data.

Results: In this study, we consider the batch effect issue in the pattern detection among the samples, such as clustering, dimension reduction and construction of networks between subjects. Instead of adjusting the original data matrices, we design an adaptive method to directly adjust the dissimilarity matrix between samples. In simulation studies, the method achieved better results recovering true underlying clusters, compared to the leading batch effect adjustment method ComBat. In real data analysis, the method effectively corrected distance matrices and improved the performance of clustering algorithms.

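1. Better than ComBat at clustering.

2. At first glance the new method seems to have no name, but it does: QuantNorm.

3. In the real data analysis, the method did prove effective.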
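To make the idea of a "dissimilarity matrix between samples" concrete, here is a minimal sketch of how a batch effect can distort it. The choice of 1 - Pearson correlation as the dissimilarity, the toy profiles and the `batch_shift` vector are all assumptions made for illustration; they are not taken from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def dissimilarity_matrix(samples):
    """Sample-by-sample matrix of 1 - correlation (0 means identical shape)."""
    n = len(samples)
    return [[1.0 - pearson(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]

# Toy data (hypothetical): two samples in batch 1, and the same two profiles
# measured in batch 2 with a shift of +3 on the first gene only.
batch1 = [[1, 2, 3, 4], [1, 2, 4, 3]]
batch_shift = [3, 0, 0, 0]
batch2 = [[v + s for v, s in zip(row, batch_shift)] for row in batch1]

D = dissimilarity_matrix(batch1 + batch2)
within_batch = D[0][1]    # different profiles, same batch
between_batch = D[0][2]   # same profile, different batch
print(within_batch, between_batch)
```

On this toy data the between-batch entry for the same underlying profile comes out larger than the within-batch entry for two different profiles, which is the kind of artifact that would push a clustering algorithm to group samples by batch.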

Conclusions:

  • In this paper, we proposed novel approaches based on the interpolating quantile normalization. As the data become challenging, i.e. true clusters are closer to each other, and the batch effect is heterogeneous on different clusters, our methods outperform ComBat. Note: this is where the novelty of the proposed method lies.
  • However, ComBat is a more general method, which adjusts the data matrix for many kinds of down-stream analysis, while our method focuses on adjusting the dissimilarity matrix between samples, mainly serving the purpose of pattern detection in the samples. It does not correct the raw count matrix to adjust for batch effects.
  • our method modifies the dissimilarity matrix so that various clustering approaches can achieve better performance. Note: the method can be combined with a variety of clustering methods.
  • Limitations of the method: On the one hand, the vectorization approach may suffer from insufficient discrimination due to the lack of extreme values. On the other hand, the row/column iterative approach is more easily affected by the wrong extreme values since each column and each row are polarized.
  • the vectorization approach performed better on data with high similarity between batches. Note: the vectorization variant suits data whose batches are highly similar.
  • the preprocessing method can affect the result of the clustering analysis. Note: the preprocessing method matters.
  • the choice of the two preprocessing strategies may depend on data. Note: the choice of preprocessing strategy depends on the data.
  • Although the iterative approach seems to have limitations explained earlier, we generally recommend this approach. Note: it can improve the robustness of the method.

Introduction:

  • The existence of batch effects increases the difficulty in comparing the data from different labs, platforms and processing times. Note: different labs, different platforms, different times.
  • Ignoring batch effects leads to wrong results. For example, a clustering analysis of mouse and human gene expression concluded that samples group by species rather than by tissue, but after adjusting for batch effects the opposite conclusion was reached.
  • Many methods have been proposed:
  1. Johnson et al. (2007) proposed the empirical Bayes algorithm of ComBat, which removes the additive and multiplicative batch effects for each gene from each batch. Note: the current gold-standard method.
  2. Gagnon-Bartsch and Speed (2012) applied the removal of unwanted variation method to make adjustments according to the variations of the control genes, which are not differentially expressed (DE) among the batches.
  • A drawback of most methods, ComBat included: they attempt to modify the data matrix (N subjects × p genes) so that the measurements from different batches become comparable. ComBat appears to be more effective for the microarray data, which is less skewed than RNA-seq data. Note: more effective on microarray data, but possibly less so on RNA-seq data.
  • Moreover, real data may have high irregularity such that the additive and the multiplicative parameters are insufficient to capture all batch effects. Note: additive and multiplicative parameters are not enough to capture every batch effect.
  • ad hoc approaches based on quantile normalization are introduced in this manuscript. Note: the paper proposes a method based on quantile normalization.
  • According to simulation results, clustering based on the normalized dissimilarity matrix obtained by our methods outperformed ComBat in recapturing the underlying cluster structure in the data, especially when the data were more challenging as the percentage of genes that differentiate the underlying clusters was small. Note: the simulations show it outperforms ComBat, especially on challenging datasets.
  • In real data analysis, we analyzed two datasets with dominating batch effects (Gilad and Mizrahi-Man, 2015; Zhang et al., 2016) and two scRNA-seq datasets where the batch effects are relatively weak (Muraro et al., 2016; Usoskin et al., 2015). Our methods improved the clustering accuracy and outperformed ComBat in both situations. Note: it also outperforms ComBat on several real datasets.
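The phrase "additive and multiplicative batch effects for each gene from each batch" can be illustrated with a plain location-scale adjustment. The sketch below is not ComBat: it leaves out the empirical Bayes shrinkage and any covariates, and the function name `location_scale_adjust` and the pooling rule are assumptions made for illustration. It only shows what data-matrix-level correction means: for every gene, each batch is centered on its own mean (additive part) and divided by its own standard deviation (multiplicative part), then put back on a common scale.

```python
from statistics import mean, pstdev

def location_scale_adjust(matrix, batches):
    """Per gene, remove each batch's own mean and spread.

    matrix:  list of samples, each a list of gene values (N samples x p genes).
    batches: one batch label per sample.
    """
    labels = sorted(set(batches))
    adjusted = [list(row) for row in matrix]
    for g in range(len(matrix[0])):
        column = [row[g] for row in matrix]
        grand_mean = mean(column)
        stats = {}
        for b in labels:
            values = [column[i] for i, lab in enumerate(batches) if lab == b]
            # Additive estimate (batch mean) and multiplicative estimate
            # (batch standard deviation); fall back to 1.0 if the spread is zero.
            stats[b] = (mean(values), pstdev(values) or 1.0)
        # Common scale: the average of the per-batch standard deviations.
        pooled_sd = mean([sd for _, sd in stats.values()])
        for i, lab in enumerate(batches):
            shift, scale = stats[lab]
            adjusted[i][g] = (column[i] - shift) / scale * pooled_sd + grand_mean
    return adjusted

# Two batches of two samples each, two genes.
example = [[1, 10], [3, 14], [11, 20], [13, 28]]
print(location_scale_adjust(example, ["a", "a", "b", "b"]))
```

An adjustment of this shape implicitly assumes that every batch should have the same per-gene mean and spread. When the batch effect differs between clusters, or the batches contain different mixes of clusters, two parameters per gene and batch cannot describe it, which is the limitation the notes above point at.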

Structure of the paper:

1. Introduction

2. Materials and methods

2.1 Problem setup

2.2 Preprocessing

2.3 Interpolating quantile normalization for vectors of different lengths

2.4 Dissimilarity matrix correction

2.5 Clustering and evaluation methods

3. Results

3.1 Simulation study

3.2 ENCODE data for human and mouse tissues

3.3 Human-mouse brain RNA-seq data

3.4 Mouse neuron scRNA-seq data

3.5 Human pancreas scRNA-seq data

4. Discussion

Excerpts from the main text:

Open questions:

1. What is quantile normalization?

  •  
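As a starting point for question 1: quantile normalization forces one set of values to follow the distribution of another by matching ranks, so the smallest value is replaced by the smallest reference quantile, the largest by the largest, and so on. When the two vectors have different lengths, the reference quantiles have to be interpolated. The sketch below uses linear interpolation between sorted reference values; this is one plausible reading of "interpolating quantile normalization", not necessarily the exact scheme in the paper, and both function names are made up for illustration.

```python
def interpolated_quantiles(reference, n):
    """Return n values that follow the empirical distribution of `reference`."""
    ref = sorted(reference)
    if n == 1:
        return [ref[len(ref) // 2]]  # single value: take a middle element
    out = []
    for k in range(n):
        pos = k * (len(ref) - 1) / (n - 1)  # fractional index into sorted reference
        lo = int(pos)
        hi = min(lo + 1, len(ref) - 1)
        frac = pos - lo
        out.append(ref[lo] * (1 - frac) + ref[hi] * frac)
    return out

def quantile_normalize_to(values, reference):
    """Replace each value by the reference quantile that matches its rank."""
    targets = interpolated_quantiles(reference, len(values))
    # Indices of `values` from smallest to largest; ties keep their input order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order):
        result[i] = targets[rank]
    return result

# Three values mapped onto a five-value reference distribution.
print(quantile_normalize_to([0.3, 0.1, 0.2], [10, 20, 30, 40, 50]))
```

The ordering of the input is preserved while its scale is discarded, which is why the technique can make vectors from different batches comparable: after normalization they share the same distribution regardless of how each batch stretched or shifted the raw values.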

2. What is the difference between RNA-seq and scRNA-seq?

  •  