Bedtools 使用

Bedtools是處理基因組信息分析的強大工具集合,本文列出自己學習其官方文檔的幾個點,對後面計算不同樣品peak相似性的腳本做了下更新和調整,使用起來更爲簡單方便。

內容摘要

  1. 區域註釋,如peak註釋,peak分佈分析,peak與調控元件交集等。

  2. 區域合併,如求算多樣品peak合集,或合併重疊區域

  3. 區域互補,如得到非基因區

  4. 利用比對結果對測序廣度和深度評估

  5. 多樣品peak相似性計算,評估ChIP類區域結果的樣品相似性。

bedtools主要功能

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.

usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]

    intersect     Find overlapping intervals in various ways.

                  求區域之間的交集,可以用來註釋peak,計算reads比對到的基因組區域
                  不同樣品的peak之間的peak重疊情況。

    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.

                  尋找最近但可能不重疊的區域

    coverage      Compute the coverage over defined intervals.

                  計算區域覆蓋度

    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.

                  合併重疊或相接的區域

    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.

                  獲得互補區域

    subtract      Remove intervals based on overlaps b/w two files.

                  計算區域差集

    slop          Adjust the size of intervals.

                  調整區域大小,如獲得轉錄起始位點上下游3 K的區域

    flank         Create new intervals from the flanks of existing intervals.

    sort          Order the intervals in a file.

                  排序,部分命令需要排序過的bed文件

    random        Generate random intervals in a genome.

                  獲得隨機區域,作爲背景集

    shuffle       Randomly redistrubute intervals in a genome.

                  根據給定的bed文件獲得隨機區域,作爲背景集

    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]

    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]

    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]

    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]

    getfasta      Use intervals to extract sequences from a FASTA file.

                  提取給定位置的FASTA序列

    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]

    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]

    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.

                  計算數據集相似性

    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]

    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.

                  用於生成一個腳本,批量捕獲IGV截圖

    links         Create a HTML page of links to UCSC locations.

    makewindows   Make interval "windows" across a genome.

                  把給定區域劃分成指定大小和間隔的小區間 (bin)

    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")

                  分組結算,不只可以用於bed文件。

    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.

安裝bedtools

Linux - Conda軟件安裝方法

ct@ehbio:~$ conda install bedtools

獲得測試數據集(http://quinlanlab.org/tutorials/bedtools/bedtools.html)

ct@ehbio:~$ mkdir bedtools
ct@ehbio:~$ cd bedtools
ct@ehbio:~$ url=https://s3.amazonaws.com/bedtools-tutorials/web
ct@ehbio:~/bedtools$ curl -O ${url}/maurano.dnaseI.tgz
ct@ehbio:~/bedtools$ curl -O ${url}/cpg.bed
ct@ehbio:~/bedtools$ curl -O ${url}/exons.bed
ct@ehbio:~/bedtools$ curl -O ${url}/gwas.bed
ct@ehbio:~/bedtools$ curl -O ${url}/genome.txt
ct@ehbio:~/bedtools$ curl -O ${url}/hesc.chromHmm.bed

交集 (intersect)

image

查看輸入文件,bed格式,至少三列,分別是染色體起始位置(0-based,

包括),終止位置

(1-based,不包括)。第四列一般爲區域名字,第五列一般爲空,第六列爲鏈的信息。更詳細解釋見http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1。

自己做研究CpG島信息可以從UCSC的Table Browser獲得,具體操作見http://blog.genesino.com/2013/05/ucsc-usages/。

ct@ehbio:~/bedtools$ head -n 3 cpg.bed exons.bed

==> cpg.bed <==

chr1    28735   29810   CpG:_116
chr1    135124  135563  CpG:_30
chr1    327790  328229  CpG:_29

==> exons.bed <==

chr1    11873   12227   NR_046018_exon_0_0_chr1_11874_f 0   +
chr1    12612   12721   NR_046018_exon_1_0_chr1_12613_f 0   +
chr1    13220   14409   NR_046018_exon_2_0_chr1_13221_f 0   +

獲得重疊區域(既是外顯子,又是CpG島的區域)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed | head -5

chr1    29320   29370   CpG:_116
chr1    135124  135563  CpG:_30
chr1    327790  328229  CpG:_29
chr1    327790  328229  CpG:_29
chr1    327790  328229  CpG:_29

輸出重疊區域對應的原始區域(與外顯子存在交集的CpG島)

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb > | head -5
  • chr1 28735 29810 CpG:_116 chr1 29320 29370

    NR_024540_exon_10_0_chr1_29321_r 0 -

  • chr1 135124 135563 CpG:_30 chr1 134772 139696

    NR_039983_exon_0_0_chr1_134773_r 0 -

  • chr1 327790 328229 CpG:_29 chr1 324438 328581

    NR_028322_exon_2_0_chr1_324439_f 0 +

  • chr1 327790 328229 CpG:_29 chr1 324438 328581

    NR_028325_exon_2_0_chr1_324439_f 0 +

  • chr1 327790 328229 CpG:_29 chr1 327035 328581

    NR_028327_exon_3_0_chr1_327036_f 0 +

計算重疊鹼基數

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10
  • chr1 28735 29810 CpG:_116 chr1 29320 29370

    NR_024540_exon_10_0_chr1_29321_r 0 - 50

  • chr1 135124 135563 CpG:_30 chr1 134772 139696

    NR_039983_exon_0_0_chr1_134773_r 0 - 439

  • chr1 327790 328229 CpG:_29 chr1 324438 328581

    NR_028322_exon_2_0_chr1_324439_f 0 + 439

  • chr1 327790 328229 CpG:_29 chr1 324438 328581

    NR_028325_exon_2_0_chr1_324439_f 0 + 439

  • chr1 327790 328229 CpG:_29 chr1 327035 328581

    NR_028327_exon_3_0_chr1_327036_f 0 + 439

  • chr1 713984 714547 CpG:_60 chr1 713663 714068

    NR_033908_exon_6_0_chr1_713664_r 0 - 84

  • chr1 762416 763445 CpG:_115 chr1 761585 762902

    NR_024321_exon_0_0_chr1_761586_r 0 - 486

  • chr1 762416 763445 CpG:_115 chr1 762970 763155

    NR_015368_exon_0_0_chr1_762971_f 0 + 185

  • chr1 762416 763445 CpG:_115 chr1 762970 763155

    NR_047519_exon_0_0_chr1_762971_f 0 + 185

  • chr1 762416 763445 CpG:_115 chr1 762970 763155

    NR_047520_exon_0_0_chr1_762971_f 0 + 185

計算第一個(-a)bed區域有多少個重疊的第二個(-b)bed文件中有多少個區域

ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -c | head

chr1    28735   29810   CpG:_116    1
chr1    135124  135563  CpG:_30 1
chr1    327790  328229  CpG:_29 3
chr1    437151  438164  CpG:_84 0
chr1    533219  534114  CpG:_94 0
chr1    544738  546649  CpG:_171    0
chr1    713984  714547  CpG:_60 1
chr1    762416  763445  CpG:_115    10
chr1    788863  789211  CpG:_28 9

另外還有-v取出不重疊的區域,

-f限定重疊最小比例,-sorted可以對按sort -k1,1 -k2,2n排序好的文件加速操作。

同時對多個區域求交集 (可以用於peak的多維註釋)

# -names標註註釋來源
# -sorted: 如果使用了這個參數,提供的一定是排序好的bed文件

ct@ehbio:~/bedtools$ bedtools intersect -a exons.bed \
    -b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm \
    | head -10000  | tail -10
  • chr1 27632676 27635124

    NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27633213

    27635013 5_Strong_Enhancer

  • chr1 27632676 27635124

    NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27635013

    27635413 7_Weak_Enhancer

  • chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

    chromhmm chr1 27632613 27632813 6_Weak_Enhancer

  • chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

    chromhmm chr1 27632813 27633213 7_Weak_Enhancer

  • chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

    chromhmm chr1 27633213 27635013 5_Strong_Enhancer

  • chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f

    chromhmm chr1 27635013 27635413 7_Weak_Enhancer

  • chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f cpg

    chr1 27648453 27649006 CpG:_63

  • chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f

    chromhmm chr1 27648613 27649413 1_Active_Promoter

  • chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f cpg

    chr1 27648453 27649006 CpG:_63

  • chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f

    chromhmm chr1 27648613 27649413 1_Active_Promoter

合併區域

image

bedtools merge輸入的是按sort -k1,1 -k2,2n排序好的bed文件。

只需要輸入一個排序好的bed文件,默認合併重疊或鄰接區域。

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed | head -n 5

chr1    11873   12227
chr1    12612   12721
chr1    13220   14829
chr1    14969   15038
chr1    15795   15947

合併區域並輸出此合併後區域是由幾個區域合併來的

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -c 1 -o count | head -n 5

chr1    11873   12227   1
chr1    12612   12721   1
chr1    13220   14829   2
chr1    14969   15038   1
chr1    15795   15947   1

合併相距90 nt內的區域,並輸出是由哪些區域合併來的

# -c: 指定對哪些列進行操作
# -o: 與-c對應,表示對指定列進行哪些操作
# 這裏的用法是對第一列做計數操作,輸出這個區域是由幾個區域合併來的
# 對第4列做收集操作,記錄合併的區域的名字,並逗號分隔顯示出來

ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -d 340 -c 1,4 -o count,collapse | head -4

chr1    11873   12227   1   NR_046018_exon_0_0_chr1_11874_f
chr1    12612   12721   1   NR_046018_exon_1_0_chr1_12613_f
chr1    13220   15038   3   NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r,NR_024540_exon_1_0_chr1_14970_r
chr1    15795   15947   1   NR_024540_exon_2_0_chr1_15796_r

計算互補區域

給定一個全集,再給定一個子集,求另一個子集。比如給定每條染色體長度和外顯子區域,求非外顯子區域。給定基因區,求非基因區。給定重複序列,求非重複序列等。

重複序列區域的獲取也可以用下面提供的鏈接:http://blog.genesino.com/2013/05/ucsc-usages/。

ct@ehbio:~/bedtools$ head genome.txt

chr1    249250621
chr10   135534747
chr11   135006516
chr11_gl000202_random   40103
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392

ct@ehbio:~/bedtools$ bedtools complement -i exons.bed -g genome.txt | head -n 5

chr1    0   11873
chr1    12227   12612
chr1    12721   13220
chr1    14829   14969
chr1    15038   15795

基因組覆蓋廣度和深度

計算基因組某個區域是否被覆蓋,覆蓋深度多少。有下圖多種輸出格式,也支持RNA-seq數據,計算junction-reads覆蓋。

image.gif

genome.txt裏面的內容就是染色體及對應的長度。

# 對單行FASTA,可如此計算
# 如果是多行FASTA,則需要累加

ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{\
    if($0~/>/) {seq_name=$0;sub(">","",seq_name);} \
    else {print seq_name,length;} }' ../bio/genome.fa | tee ../bio/genome.txt

chr1    60001
chr2    54001
chr3    54001
chr4    60001

ct@ehbio:~/bedtools$ bedtools genomecov -ibam ../bio/map.sortP.bam -bga \
    -g ../bio/genome.txt | head

# 這個warning很有意思,因爲BAM中已經有這個信息了,就不需要提供了

*****
*****WARNING: Genome (-g) files are ignored when BAM input is provided.
*****

# bedgraph文件,前3列與bed相同,最後一列表示前3列指定的區域的覆蓋度。

chr1    0   11  0
chr1    11  17  1
chr1    17  20  2
chr1    20  31  3
chr1    31  36  4
chr1    36  43  6
chr1    43  44  7
chr1    44  46  8
chr1    46  48  9
chr1    48  54  10

兩個思考題:

  1. 怎麼計算有多少基因組區域被測到了?

  2. 怎麼計算平均測序深度是多少?

數據集相似性

bedtools jaccard計算的是給定的兩個bed文件之間交集區域(intersection)佔總區域(union-intersection)的比例(jaccard)和交集的數目(n_intersections)。

ct@ehbio:~/bedtools$ bedtools jaccard \    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \    -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bedintersection    union-intersection  jaccard n_intersections81269248    160493950   0.50637 130852

小思考:1. 如何用bedtools其它工具算出這個結果?2. 如果需要比較的文件很多,怎麼充分利用計算資源?

一個辦法是使用for循環,

雙層嵌套。這種用法也很常見,不管是單層還是雙層for循環,都有利於簡化重複運算。

ct@ehbio:~/bedtools$ for i in *.merge.bed; do \    for j in *.merge.bed; do \    bedtools jaccard -a $i -b $j | cut -f3 | tail -n +2 | sed "s/^/$i\t$j\t/"; \    done; done >total.similarity

另一個辦法是用parallel,不只可以批量,更可以並行。

root@ehbio:~# yum install parallel.noarch# parallel 後面雙引號("")內的內容爲希望用parallel執行的命令,# 整體寫法與Linux下命令寫法一致。# 雙引號後面的 三個相鄰冒號 (:::)默認用來傳遞參數的,可多個連寫。# 每個三冒號後面的參數會被循環調用,而在命令中的引用則是根據其出現的位置,分別用{1}, {2}# 表示第一個三冒號後的參數,第二個三冒號後的參數。## 這個命令可以替換原文檔裏面的整合和替換, 相比於原文命令生成多個文件,這裏對每個輸出結果# 先進行了比對信息的增加,最後結果可以輸入一個文件中。#ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \    | sed 's/^/{1}\t{2}\t/'" ::: `ls *.merge.bed` ::: `ls *.merge.bed`  >totalSimilarity.2# 上面的命令也有個小隱患,並行計算時的輸出衝突問題,可以修改爲輸出到單個文件,再cat到一起ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \    | sed 's/^/{1}\t{2}\t/' >{1}.{2}.totalSimilarity_tmp" ::: `ls *.merge.bed` ::: `ls *.merge.bed`ct@ehbio:~/bedtools$ cat *.totalSimilarity_tmp >totalSimilarity.2# 替換掉無關信息ct@ehbio:~/bedtools$ sed -i -e 's/.hotspot.twopass.fdr0.05.merge.bed//' \    -e 's/.hg19//' totalSimilarity.2  

totalSimilarity.2數據表格式如下 (數據是假的):

fMusle_leg-DS19115    fMusle_back-DS18454    0.55fMusle_leg-DS19115    fHeart-DS15643    0.4fHeart-DS15643    fHeart-DS16621    0.8fHeart-DS15643    fHeart-DS15839    0.7fHeart-DS16621    fHeart-DS15839    0.7

得到結果後,就可以繪製相關性熱圖PCA了。

image

也可以使用下面的命令轉換成Wide format矩陣,用高顏值可定製在線繪圖工具-第三版繪製。

# 這裏面b和c可以用一個,因爲是一個對稱陣# 如果2和3列內容不同,此腳本也可用ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{a[$1, $2]=$3; b[$1]=1; c[$2]=1;}END\    {printf("ID"); for(i in c) printf("\t%s",  i); \        for (i in b) {printf("%s", i); for(j in c) {printf("\t%s", a[i, j]);} print "";}}

原文檔(http://quinlanlab.org/tutorials/bedtools/bedtools.html)的命令,稍微有些複雜,利於學習不同命令的組合。使用時推薦使用上面的命令。

ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} \      | awk 'NR>1' | cut -f 3 \      > {1}.{2}.jaccard" \     ::: `ls *.merge.bed` ::: `ls *.merge.bed`

This command will create a single file containing the pairwise Jaccard

measurements from all 400 tests.

find . \    | grep jaccard \    | xargs grep "" \    | sed -e s"/\.\///" \    | perl -pi -e "s/.bed./.bed\t/" \    | perl -pi -e "s/.jaccard:/\t/" \    > pairwise.dnase.txt

A bit of cleanup to use more intelligible names for each of the samples.

cat pairwise.dnase.txt \| sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \| sed -e 's/.hg19//g' \> pairwise.dnase.shortnames.txt

Now let’s make a 20x20 matrix of the Jaccard statistic. This will allow

the data to play nicely with R.

awk 'NF==3' pairwise.dnase.shortnames.txt \| awk '$1 ~ /^f/ && $2 ~ /^f/' \| python make-matrix.py \> dnase.shortnames.distance.matrix
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章