Paper intensive reading (14): Batch effects in single-cell RNA-sequencing data are corrected

Paper title: Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

Google Scholar citations: 262

Pages: 12

Published: 2 April 2018

Journal: Nature Biotechnology

Authors: Laleh Haghverdi, Aaron T L Lun, Michael D Morgan & John C Marioni

Cambridge

Abstract:

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.

Discussion:

Existing batch-correction methods do not account for differences in cell composition between batches and fail to fully remove the batch effect in such cases. (Note: the limitation of existing methods.)
By using both simulated data and real scRNA-seq data sets, we demonstrated that our MNN method is able to successfully remove the batch effect in the presence of differences in composition. (Note: tested on both simulated and real data.)
Moreover, we demonstrated the MNN method’s scalability on large droplet-based data sets. (Note: the algorithm is scalable.)
One prerequisite for our MNN method is that each batch must contain at least one shared cell population with another batch. (Note: the prerequisite of the MNN method.)
A notable feature of our MNN correction method is that it adjusts for local variations in batch effects by using a Gaussian kernel. (Note: a notable feature of the MNN method is its Gaussian kernel, which suits high-dimensional data; linear methods cannot do this.)
The correction vectors (provided as an output of the MNN algorithm) could potentially be examined to understand the differences between batches.
My reading, after some effort: the cell populations differ from batch to batch, but different batches share some of the same populations. Cells from a shared population are assumed to have the same expression, which yields a correction vector. The method keeps looking for such overlapping subsets until every batch has been corrected.
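The Gaussian-kernel adjustment mentioned above can be illustrated with a small sketch. It assumes that each MNN pair has already produced one correction vector, and it gives every cell in the batch being corrected a weighted average of those vectors, with weights that decay with distance to the paired cells. The function name, its parameters, and the exact kernel form exp(-d^2 / (2 sigma^2)) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def smoothed_corrections(target, pair_cells, pair_vectors, sigma=1.0):
    """Kernel-weighted, cell-specific correction vectors (illustrative sketch).

    target       : (n_cells, n_genes) expression of the batch being corrected
    pair_cells   : indices into `target` of the cells that belong to MNN pairs
    pair_vectors : (n_pairs, n_genes) correction vector of each MNN pair
    sigma        : kernel bandwidth (hypothetical default)
    """
    target = np.asarray(target, dtype=float)
    pair_vectors = np.asarray(pair_vectors, dtype=float)
    anchors = target[np.asarray(pair_cells)]              # (n_pairs, n_genes)

    # Squared Euclidean distance from every cell to every paired cell.
    sq_dist = ((target[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)

    # Subtract the row minimum before exponentiating so that at least one
    # weight per cell equals 1 and the normalisation never divides by zero.
    sq_dist = sq_dist - sq_dist.min(axis=1, keepdims=True)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)

    # Each cell receives a weighted average of the pair-specific vectors.
    return weights @ pair_vectors

# Toy data: cells 0 and 1 are in MNN pairs; cell 2 sits halfway between them.
cells = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
vectors = np.array([[1.0, 0.0], [3.0, 0.0]])
corrections = smoothed_corrections(cells, [0, 1], vectors, sigma=1.0)
```

In this toy case the two paired cells keep essentially their own vectors, while the cell halfway between them receives the average of the two. That is the sense in which the correction adapts locally instead of applying one global shift.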

Introduction:

Such differences can mask underlying biology or introduce spurious structure in the data; thus, to avoid misleading conclusions, they must be corrected before further analysis.
Most existing methods for batch correction are based on linear regression.
The limma package provides the removeBatchEffect function, which fits a linear model containing a blocking term for the batch structure to the expression values for each gene. Subsequently, the coefficient for each blocking term is set to zero, and the expression values are computed from the remaining terms and residuals, thus yielding a new expression matrix without batch effects. (Note: this is the main principle behind the removeBatchEffect function in the limma package.)
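The linear-model strategy described above can be sketched in a few lines of NumPy. This is a simplified illustration, not limma's code: the function name `remove_batch_linear` is hypothetical, and the design matrix uses an intercept plus indicator columns with the first batch as reference, which is an assumption and may differ from the coding limma uses. Subtracting the fitted batch term is equivalent to setting the blocking coefficients to zero and recomputing the values from the remaining terms and residuals.

```python
import numpy as np

def remove_batch_linear(expr, batch):
    """Remove a per-gene additive batch term by least squares (sketch).

    expr  : (n_genes, n_cells) expression matrix
    batch : length n_cells array of batch labels
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    levels = np.unique(batch)
    if len(levels) < 2:
        return expr.copy()  # a single batch: nothing to remove

    # Blocking term: one indicator column per batch except the reference.
    block = np.column_stack([(batch == lv).astype(float) for lv in levels[1:]])
    design = np.column_stack([np.ones(len(batch)), block])

    # Fit all genes at once; coef has shape (n_design_columns, n_genes).
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)

    # Subtract the fitted batch contribution, keeping intercept and residuals.
    batch_part = block @ coef[1:]                 # (n_cells, n_genes)
    return expr - batch_part.T

# Toy data: gene 0 is shifted by +10 in the second batch, gene 1 is not.
toy_expr = np.array([[1.0, 2.0, 11.0, 12.0],
                     [5.0, 5.0, 5.0, 5.0]])
toy_batch = np.array([0, 0, 1, 1])
toy_corrected = remove_batch_linear(toy_expr, toy_batch)
```

In the toy data the shift of gene 0 is removed and gene 1 is left unchanged. The sketch has no way to tell a technical shift from a biological one, which is the weakness the paper goes on to discuss.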
The ComBat method [8] uses a similar strategy but performs an additional step involving empirical Bayes shrinkage of the blocking coefficient estimates. (Note: ComBat follows a strategy similar to removeBatchEffect, with some refinements.)
Other methods, such as RUVseq [9] and svaseq [10], are also frequently used for batch correction, but their focus is primarily on identifying unknown factors of variation, for example, those due to unrecorded experimental differences in cell processing. After these factors are identified, their effects can be regressed out as described previously. (Note: these two methods focus on discovering unknown factors of variation.)
Their application to scRNA-seq data is based on the assumption that the composition of the cell population within each batch is identical. (Note: existing methods assume that every batch has the same cell-population composition, which may not hold in practice.)
In practice, the population composition is usually not identical across batches in scRNA-seq studies.
Note: the assumption is problematic even when the same cell types are present. Even if the same cell types are present in each batch, the abundance of each cell type in the data set can change depending upon subtle differences in procedures such as cell culture or tissue extraction, dissociation and sorting.
The estimated coefficients for the batch blocking factors are not purely technical but contain a nonzero biological component because of differences in composition.
Batch correction based on these coefficients would thus yield inaccurate representations of the cellular expression profiles, and the results might potentially be worse than if no correction were performed.
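A small numeric example makes the composition problem concrete. All numbers below are invented for illustration: one gene, two cell types with mean expression 0 and 10, and a true technical shift of +2 in the second batch. The "estimated" batch coefficient is taken to be the difference in batch means, which is what a linear model with only a batch term would fit.

```python
import numpy as np

# Hypothetical single-gene example; every value here is made up.
true_shift = 2.0                        # purely technical batch effect
type_means = {"A": 0.0, "B": 10.0}      # biological expression per cell type

# Batch 1 holds 50 cells of each type; batch 2 holds 90 of A and 10 of B.
batch1 = np.array([type_means["A"]] * 50 + [type_means["B"]] * 50)
batch2 = np.array([type_means["A"]] * 90 + [type_means["B"]] * 10) + true_shift

# A model with only a batch term estimates the difference in batch means.
estimated_shift = batch2.mean() - batch1.mean()

# Whatever is not technical comes from the difference in composition.
biological_component = estimated_shift - true_shift

print(estimated_shift, biological_component)
```

Here the batch means are 5 and 3, so the estimated coefficient is -2 although the true technical shift is +2. The gap of -4 comes entirely from composition, and subtracting the estimated coefficient would move the second batch in the wrong direction.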
An alternative approach for data merging and comparison in the presence of batch effects uses a set of landmarks from a reference data set to project new data onto the reference.
Drawback of projection methods such as PCA: if the new batches include cell types that fall outside the transcriptional space explored in the reference batch, these cell types will not be projected to an appropriate position in the space defined by the landmarks.
Main contribution of this paper:
Here, we propose a new method for removal of discrepancies between biologically related batches according to the presence of MNNs between batches, which are considered to define the most similar cells of the same type across batches. (Note: this introduces the MNN method.)
The difference in expression values between cells in an MNN pair provides an estimate of the batch effect, which is made more precise by averaging across many such pairs.
A correction vector is obtained from the estimated batch effect and applied to the expression values to perform batch correction.
Our approach automatically identifies overlaps in population composition between batches and uses only the overlapping subsets for correction, thus avoiding the assumption of equal composition required by other methods.
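The pairing and averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, distances are plain Euclidean on the values as given, the search is brute force, and a single averaged correction vector is applied to the whole batch.

```python
import numpy as np

def find_mnn_pairs(ref, target, k=1):
    """Return (i, j) pairs where ref cell i and target cell j are each
    among the other's k nearest neighbours in the opposite batch."""
    ref = np.asarray(ref, dtype=float)
    target = np.asarray(target, dtype=float)
    # Brute-force distance matrix of shape (n_ref, n_target).
    dist = np.linalg.norm(ref[:, None, :] - target[None, :, :], axis=2)
    nn_of_ref = np.argsort(dist, axis=1)[:, :k]        # target neighbours of each ref cell
    nn_of_target = np.argsort(dist, axis=0)[:k, :].T   # ref neighbours of each target cell
    pairs = []
    for i in range(ref.shape[0]):
        for j in nn_of_ref[i]:
            if i in nn_of_target[j]:                   # the relation must be mutual
                pairs.append((i, int(j)))
    return pairs

def average_correction_vector(ref, target, pairs):
    """Average the per-pair differences (ref minus target) into one vector."""
    diffs = np.array([np.asarray(ref[i], dtype=float) - np.asarray(target[j], dtype=float)
                      for i, j in pairs])
    return diffs.mean(axis=0)

# Toy data: three shared cells shifted by (0, +1) in the target batch, plus one
# cell type at (0, -20) that exists only in the reference batch.
ref_cells = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0], [0.0, -20.0]])
target_cells = np.array([[0.0, 1.0], [5.0, 1.0], [10.0, 1.0]])

mnn_pairs = find_mnn_pairs(ref_cells, target_cells, k=1)
correction = average_correction_vector(ref_cells, target_cells, mnn_pairs)
corrected_target = target_cells + correction
```

The reference-only cell at (0, -20) has a nearest neighbour in the target batch, but no target cell names it in return, so it forms no pair and does not influence the correction. This illustrates how the method can restrict itself to the overlapping subset without being told which populations are shared.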
We demonstrate that our approach outperforms existing methods on a range of simulated and real scRNA-seq data sets involving different biological systems and technologies.

Structure of the main text:

1. Introduction

2. Results

2.1 Matching mutual nearest neighbors for batch correction

2.2 MNN correction outperforms existing methods on simulated data

2.3 MNN correction outperforms existing methods on hematopoietic (blood-forming) data

2.4 MNN correction outperforms existing methods on a pancreas data set

2.5 MNN correction improves differential expression analysis

2.6 MNN correction is applicable to droplet RNA-seq technology

3. Discussion

4. Methods (available only in the online version)

Excerpts from the main text:
