Worked Solutions to the 20-Question Linux Exam for Bioinformaticians

Original questions: http://www.bio-info-trainee.com/2900.html

1. In any directory, create a nested series of directories of the form 1/2/3/4/5/6/7/8/9

[sunchengquan 21:42:32 ~/test]
$ mkdir -p 1/2/3/4/5/6/7/8/9
[sunchengquan 21:42:55 ~/test]
$ ls
000.csv   chrY_SNP_pos_id.vcf  id             rs.csv            sun
1         encode.py            info           rs_split_v1.txt   test.bed
123.csv   gff                  merge_file.sh  sk.sh             tmp
allid     grasp_ncbi_snp01.py  name.txt       socket_client.py  tmp.sh
arg1.txt  grasp_ncbi_snp.py    output_file    socket_server.py
arg.txt   hum_name.txt         result.csv     sum_name.csv
[sunchengquan 21:43:04 ~/test]
$ cd 1/2/3/4/5/6/7/8/9
[sunchengquan 21:43:46 ~/test/1/2/3/4/5/6/7/8/9]
$ pwd
/home/sunchengquan/test/1/2/3/4/5/6/7/8/9

2. Inside the directory just created (mine, for example, is /Users/jimmy/tmp/1/2/3/4/5/6/7/8/9), create a text file me.txt

[sunchengquan 21:43:50 ~/test/1/2/3/4/5/6/7/8/9]
$ touch me.txt
[sunchengquan 21:44:51 ~/test/1/2/3/4/5/6/7/8/9]
$ ls
me.txt

3. Enter the following content into the text file me.txt:

Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?

[sunchengquan 15:29:52 ~/test]
$ cat > me.txt
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
^C
[sunchengquan 15:36:18 ~/test]
$ cat me.txt 
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?

[sunchengquan 21:44:53 ~/test/1/2/3/4/5/6/7/8/9]
$ vi me.txt 
[sunchengquan 21:45:56 ~/test/1/2/3/4/5/6/7/8/9]
$ cat me.txt 
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
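
Ctrl-D (end of input) is the conventional way to finish `cat > me.txt`, and in a script a here-document avoids interactive typing altogether. The following is a minimal sketch of that idea; the scratch directory from `mktemp` and the variable names are illustrative assumptions, not part of the original answer.

```shell
#!/usr/bin/env bash
# Write me.txt non-interactively with a here-document.
workdir=$(mktemp -d)          # scratch directory (illustrative)
cat > "$workdir/me.txt" <<'EOF'
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
EOF
# Count the lines to confirm that all three were written.
line_count=$(wc -l < "$workdir/me.txt" | tr -d ' ')
echo "me.txt has $line_count lines"
```

Quoting the delimiter (`'EOF'`) keeps the shell from expanding anything inside the text.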

4. Delete the directories 1/2/3/4/5/6/7/8/9 created above, together with the text file me.txt

[sunchengquan 21:46:01 ~/test/1/2/3/4/5/6/7/8/9]
$ rm -r ~/test/1

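5. In any directory, create five directories folder1 to folder5, then create five more directories folder1 to folder5 inside each of them. The result should look like this: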

[sunchengquan 15:46:04 ~/test/sun]
$ mkdir -p folder_{1..5}/folder_{1..5}
[sunchengquan 15:46:10 ~/test/sun]
$ ls *
folder_1:
folder_1  folder_2  folder_3  folder_4  folder_5

folder_2:
folder_1  folder_2  folder_3  folder_4  folder_5

folder_3:
folder_1  folder_2  folder_3  folder_4  folder_5

folder_4:
folder_1  folder_2  folder_3  folder_4  folder_5

folder_5:
folder_1  folder_2  folder_3  folder_4  folder_5

6. In every directory created in question 5, create the text file me.txt from question 2, with the same content.

[sunchengquan 15:46:24 ~/test/sun]
$ touch  folder_{1..5}/folder_{1..5}/me.txt

The concise approach

[sunchengquan 15:56:34 ~/test/sun]
$ cat > me.txt
sun
cheng
quan
^C
[sunchengquan 17:31:34 ~/test/sun]
$ echo folder_{1..5}/folder_{1..5} |xargs -n1 cp -v me.txt
"me.txt" -> "folder_1/folder_1/me.txt"
"me.txt" -> "folder_1/folder_2/me.txt"
"me.txt" -> "folder_1/folder_3/me.txt"
"me.txt" -> "folder_1/folder_4/me.txt"
"me.txt" -> "folder_1/folder_5/me.txt"
"me.txt" -> "folder_2/folder_1/me.txt"
"me.txt" -> "folder_2/folder_2/me.txt"
"me.txt" -> "folder_2/folder_3/me.txt"
"me.txt" -> "folder_2/folder_4/me.txt"
"me.txt" -> "folder_2/folder_5/me.txt"
"me.txt" -> "folder_3/folder_1/me.txt"
"me.txt" -> "folder_3/folder_2/me.txt"
"me.txt" -> "folder_3/folder_3/me.txt"
"me.txt" -> "folder_3/folder_4/me.txt"
"me.txt" -> "folder_3/folder_5/me.txt"
"me.txt" -> "folder_4/folder_1/me.txt"
"me.txt" -> "folder_4/folder_2/me.txt"
"me.txt" -> "folder_4/folder_3/me.txt"
"me.txt" -> "folder_4/folder_4/me.txt"
"me.txt" -> "folder_4/folder_5/me.txt"
"me.txt" -> "folder_5/folder_1/me.txt"
"me.txt" -> "folder_5/folder_2/me.txt"
"me.txt" -> "folder_5/folder_3/me.txt"
"me.txt" -> "folder_5/folder_4/me.txt"
"me.txt" -> "folder_5/folder_5/me.txt"
[sunchengquan 17:33:29 ~/test/sun]
$ cat folder_5/folder_5/me.txt
sun
cheng
quan
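
The same copy can be written as an explicit loop, which some readers find easier to follow than the `xargs` pipeline, and the result can be checked by counting the copies. This is a sketch under the assumption of a throwaway scratch directory; `workdir` and `copies` are illustrative names.

```shell
#!/usr/bin/env bash
# Copy one template file into every folder_X/folder_Y directory, then count the copies.
workdir=$(mktemp -d)          # scratch directory (illustrative)
cd "$workdir" || exit 1
mkdir -p folder_{1..5}/folder_{1..5}
printf 'sun\ncheng\nquan\n' > me.txt
for dir in folder_{1..5}/folder_{1..5}; do
    cp me.txt "$dir/"
done
# 5 x 5 directories should give 25 copies.
copies=$(find folder_* -name me.txt | wc -l | tr -d ' ')
echo "copied me.txt into $copies directories"
```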

The tedious approach

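#!/usr/bin/env bash

# Create the 5x5 directory grid, then write me.txt into every leaf directory.
mkdir -p ~/test/sun/folder_{1..5}/folder_{1..5}
for i in {1..5}; do
        for a in {1..5}; do
                # Redirect straight to the target path; no cd is needed.
                echo -e "sun\ncheng\nquan\n I love you" > ~/test/sun/folder_$i/folder_$a/me.txt
        done
done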

7. Once again, delete the directories and files created in the previous steps

[sunchengquan 17:36:39 ~/test/sun]
$ rm -r folder*
[sunchengquan 17:36:47 ~/test/sun]
$ ll
total 0

8. Download http://www.biotrainee.com/jmzeng/igv/test.bed, then find the line number of the line containing H3K4me3 and the total number of lines in the file.

[sunchengquan 17:36:49 ~/test/sun]
$ wget -c http://www.biotrainee.com/jmzeng/igv/test.bed
[sunchengquan 17:39:29 ~/test/sun]
$ ll
total 4.0K
-rw-r----- 1 sunchengquan sunchengquan 3.1K May 18  2017 test.bed
[sunchengquan 17:41:21 ~/test/sun]
$ grep -n 'H3K4me3' test.bed 
8:chr1    9810    10438    ID=SRX387603;Name=H3K4me3%20(@%20HMLE);Title=GSM1280527:%20HMLE%20Twist3D%20H3K4me3%20rep2%3B%20Homo%20sapiens%3B%20ChIP-Seq;Cell%20group=Breast;
source_name=HMLE_Twist3D_H3K4me3;cell%20type=human%20mammary%20epithelial%20cells;transfected%20with=Twist1;culture%20type=sphere;chip%20antibody=H3K4me3;chip%20antibody%20vendor=Millipore;    222    9810    10438    0,226,255
[sunchengquan 17:42:59 ~/test/sun]
$ wc -l test.bed 
10 test.bed
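
Both numbers can also be produced in a single pass with `awk`, which tracks the current line number in `NR`. The sketch below runs on a three-line synthetic stand-in for test.bed, so its contents and the `summary` variable are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Report the matching line number and the total line count in one pass.
bed=$(mktemp)                 # stand-in for test.bed (synthetic data)
printf 'chr1\t100\t200\tName=H3K27ac\n'  > "$bed"
printf 'chr1\t300\t400\tName=H3K4me3\n' >> "$bed"
printf 'chr1\t500\t600\tName=CTCF\n'    >> "$bed"
# Collect the line numbers of all matches; in END, NR holds the total line count.
summary=$(awk '/H3K4me3/ {hit = hit " " NR} END {print "match:" hit "; total: " NR}' "$bed")
echo "$summary"
```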

9. Download http://www.biotrainee.com/jmzeng/rmDuplicate.zip, unzip it, and inspect the directory structure inside

[sunchengquan 17:44:31 ~/test/sun]
$ wget -c http://www.biotrainee.com/jmzeng/rmDuplicate.zip
[sunchengquan 17:45:33 ~/test/sun]
$ ll
total 112K
-rw-r----- 1 sunchengquan sunchengquan 103K Nov 12  2016 rmDuplicate.zip
-rw-r----- 1 sunchengquan sunchengquan 3.1K May 18  2017 test.bed
[sunchengquan 17:45:48 ~/test/sun]
$ unzip rmDuplicate.zip 
[sunchengquan 17:45:59 ~/test/sun]
$ ll
total 116K
drwxr-x--- 4 sunchengquan sunchengquan 4.0K Nov 12  2016 rmDuplicate
-rw-r----- 1 sunchengquan sunchengquan 103K Nov 12  2016 rmDuplicate.zip
-rw-r----- 1 sunchengquan sunchengquan 3.1K May 18  2017 test.bed
[sunchengquan 17:46:21 ~/test/sun]
$ tree rmDuplicate
rmDuplicate
├── picard
│   ├── paired
│   │   ├── readme.txt
│   │   ├── tmp.header
│   │   ├── tmp.MarkDuplicates.log
│   │   ├── tmp.metrics
│   │   ├── tmp.rmdup.bai
│   │   ├── tmp.rmdup.bam
│   │   ├── tmp.sam
│   │   └── tmp.sorted.bam
│   └── single
│       ├── readme.txt
│       ├── tmp.header
│       ├── tmp.MarkDuplicates.log
│       ├── tmp.metrics
│       ├── tmp.rmdup.bai
│       ├── tmp.rmdup.bam
│       ├── tmp.sam
│       └── tmp.sorted.bam
└── samtools
    ├── paired
    │   ├── readme.txt
    │   ├── tmp.header
    │   ├── tmp.rmdup.bam
    │   ├── tmp.rmdup.vcf.gz
    │   ├── tmp.sam
    │   ├── tmp.sorted.bam
    │   └── tmp.sorted.vcf.gz
    └── single
        ├── readme.txt
        ├── tmp.header
        ├── tmp.rmdup.bam
        ├── tmp.rmdup.vcf.gz
        ├── tmp.sam
        ├── tmp.sorted.bam
        └── tmp.sorted.vcf.gz

6 directories, 30 files
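
On machines where `tree` is not installed, `find` can reproduce the summary line. The sketch below builds a small mock layout (not the real archive) and counts directories below the root and regular files; the `demo` directory and its contents are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Count directories and files the way the tree summary line does.
root=$(mktemp -d)             # scratch directory (illustrative)
mkdir -p "$root"/demo/{picard,samtools}/{paired,single}
touch "$root"/demo/{picard,samtools}/{paired,single}/readme.txt
# -mindepth 1 leaves out the top-level directory itself, as tree does.
dirs=$(find "$root/demo" -mindepth 1 -type d | wc -l | tr -d ' ')
files=$(find "$root/demo" -type f | wc -l | tr -d ' ')
echo "$dirs directories, $files files"
```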

10. Open the files unzipped in question 9, go into the rmDuplicate/samtools/single directory, look at the file with the .sam suffix, and work out how SAM/BAM are defined in bioinformatics.

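SAM is a storage format for sequence alignments, produced by aligners such as bwa and bowtie2.

SAM stands for Sequence Alignment/Map.
A SAM file has two parts: the header section and the alignment section.
In the alignment section, each line holds the alignment information for one segment: 11 mandatory fields followed by any number of optional fields, all separated by tabs. The 11 mandatory fields always appear in the same order; when a value is unavailable it is set to '0' or '*', depending on the field's definition.

  • 1 QNAME: the query name (the name of the read)
  • 2 FLAG: a bitwise flag; each bit records one property of the alignment
  • 3 RNAME: the reference sequence name (the chromosome)
  • 4 POS: the position on the reference sequence (the position on the chromosome)
  • 5 MAPQ: mapping quality; the higher the value, the more unique the mapped location
  • 6 CIGAR: the CIGAR string describing the alignment
  • 7 RNEXT: the name of the reference sequence that the mate (the next segment) aligned to; '*' when there is no other segment, '=' when it is the same reference as RNAME
  • 8 PNEXT: the position of the mate (the next segment) on the reference; 0 when unavailable
  • 9 TLEN: the template length; positive for the leftmost segment, negative for the rightmost (so negative when the mate lies upstream of this read), sign undefined for segments in the middle, and 0 for single-segment templates or when unavailable
  • 10 SEQ: the read sequence; '*' when the sequence is not stored; otherwise its length must equal the sum of the lengths of the M/I/S/=/X operations in CIGAR
  • 11 QUAL: the base qualities in ASCII encoding, in the same format as FASTQ

Optional fields take the form TAG:TYPE:VALUE and follow the 11 mandatory columns. Examples:
  • NM:i: the edit distance to the reference
  • MD:Z: a string describing the mismatches between the read and the reference
  • AS:i: the alignment score

BAM is the compressed binary form of SAM; samtools can convert between the two: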
samtools view -bS tmp.sam >tmp.bam
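
The rule that the M/I/S/=/X lengths in CIGAR must add up to the length of SEQ can be checked with a few lines of `awk`. This is a sketch on one made-up SAM record (read name, coordinates, and tags are all synthetic), not a validator for real files.

```shell
#!/usr/bin/env bash
# Check one synthetic SAM record: CIGAR query length vs. SEQ length.
record=$(printf 'read1\t0\tchr1\t100\t42\t3S5M1I2M\t*\t0\t0\tACGTACGTACG\tIIIIIIIIIII\tNM:i:1')
result=$(printf '%s\n' "$record" | awk -F'\t' '{
    cigar = $6; total = 0
    # Consume the CIGAR string one <length><operation> pair at a time.
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        len = substr(cigar, 1, RLENGTH - 1) + 0
        op  = substr(cigar, RLENGTH, 1)
        if (op ~ /[MIS=X]/) total += len   # operations that consume read bases
        cigar = substr(cigar, RLENGTH + 1)
    }
    print $1, total, length($10)
}')
echo "$result"   # read name, CIGAR-derived length, SEQ length
```

For this record the CIGAR string 3S5M1I2M gives 3 + 5 + 1 + 2 = 11, which matches the 11 bases in SEQ.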

11. Install samtools

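# Run these from ~/local/app so the final paths match the symlink below.
curl -OL https://sourceforge.net/projects/samtools/files/samtools/1.6/samtools-1.6.tar.bz2

tar jxvf samtools-1.6.tar.bz2
cd samtools-1.6
./configure 
make
# Step back out of the source tree before renaming it.
cd ..
mv samtools-1.6 samtools
ln -s ~/local/app/samtools/samtools ~/bin/samtools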

View the samtools manual:
man ~/local/app/samtools/samtools.1

12. Open the file with the BAM suffix and find the command that produced it. As a hint, the command is:

/home/jianmingzeng/biosoft/bowtie/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -p 20 -x /home/jianmingzeng/reference/index/bowtie/hg38 -S /home/jianmingzeng/data/public/allMouse/alignment/WT_rep2_Input.sam -U /tmp/41440.unp

[sunchengquan 21:15:22 ~/test/sun/rmDuplicate/samtools/single]
$ samtools view -h tmp.sorted.bam |grep '^@PG'|awk 'BEGIN{FS="\t"}{print $5}'|cut -d: -f2
"/home/jianmingzeng/biosoft/bowtie/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -p 20 -x /home/jianmingzeng/reference/index/bowtie/hg38 -S /home/jianmingzeng/data/public/allMouse/alignment/WT_rep2_Input.sam -U /tmp/41440.unp"

13. Based on the command above, find out exactly how many chromosomes are in the reference genome I used, /home/jianmingzeng/reference/index/bowtie/hg38

[sunchengquan 22:02:03 ~/test/sun/rmDuplicate/samtools/single]
$ samtools view -h tmp.sorted.bam |egrep '^@S.*?(chr[XYM]\s+.*|chr[1-9]?[0-9]\s+).*'|less
[sunchengquan 22:03:49 ~/test/sun/rmDuplicate/samtools/single]
$ samtools view -h tmp.sorted.bam |egrep '^@S.*?(chr[XYM]\s+.*|chr[1-9]?[0-9]\s+).*'|wc -l
25
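
The header pipelines used in questions 12 and 13 depend on the CL: tag being the fifth field and on a loose pattern for chromosome names. A field-based `awk` version is a little more robust: it looks for the CL: tag wherever it sits, keeps everything after the tag even if the command contains a colon, and matches the SN: value exactly. The sketch below runs on a synthetic header; every sequence name and length in it is made up for illustration.

```shell
#!/usr/bin/env bash
# Parse a synthetic SAM header: extract the @PG command line and count primary chromosomes.
header=$(mktemp)
{
    printf '@HD\tVN:1.0\tSO:coordinate\n'
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf '@SQ\tSN:chr2\tLN:1000\n'
    printf '@SQ\tSN:chrX\tLN:1000\n'
    printf '@SQ\tSN:chrM\tLN:1000\n'
    printf '@SQ\tSN:chr1_demo_random\tLN:1000\n'
    printf '@SQ\tSN:chrUn_demo\tLN:1000\n'
    printf '@PG\tID:bowtie2\tPN:bowtie2\tVN:2.2.9\tCL:"bowtie2-align-s -p 20 -x hg38 -U reads.fq"\n'
} > "$header"

# Find the CL: tag wherever it sits on the @PG line; keep everything after "CL:".
cmdline=$(awk -F'\t' '$1 == "@PG" { for (i = 2; i <= NF; i++) if ($i ~ /^CL:/) print substr($i, 4) }' "$header")

# Count @SQ lines whose SN is exactly chr<number>, chrX, chrY or chrM.
primary=$(awk -F'\t' '$1 == "@SQ" && $2 ~ /^SN:chr([0-9]+|[XYM])$/ { n++ } END { print n + 0 }' "$header")

echo "command: $cmdline"
echo "primary chromosomes: $primary"
```

On a real file, the same two `awk` programs can read from `samtools view -H tmp.sorted.bam` instead of the temporary file.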

14. The second column of the BAM file above contains only the two values 0 and 16; use commands such as cut/sort/uniq to count how many times each occurs

[sunchengquan 22:10:11 ~/test/sun/rmDuplicate/samtools/single]
$ samtools view tmp.rmdup.bam |cut -f2|sort |uniq -c
     16 0
     12 16

15. Now open the BAM file in the rmDuplicate/samtools/paired directory, look at the second column again, and count the values

[sunchengquan 22:14:04 ~/test/sun/rmDuplicate/samtools/paired]
$ samtools view tmp.rmdup.bam |cut -f2 |sort |uniq -c|sort -nr
      8 99
      7 147
      2 97
      2 83
      2 163
      1 433
      1 387
      1 371
      1 353
      1 323
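
The values in column 2 are bitwise FLAGs, so each number is a sum of powers of two. The sketch below decodes a subset of the bits with shell arithmetic; the function name `describe_flag` and the short labels are informal choices made for this example, and bits such as mate-unmapped, duplicate, and supplementary are left out for brevity.

```shell
#!/usr/bin/env bash
# Decode a subset of SAM FLAG bits with shell arithmetic.
describe_flag() {
    local flag=$1 desc=""
    (( flag & 1 ))   && desc+="paired,"
    (( flag & 2 ))   && desc+="proper_pair,"
    (( flag & 4 ))   && desc+="unmapped,"
    (( flag & 16 ))  && desc+="reverse,"
    (( flag & 32 ))  && desc+="mate_reverse,"
    (( flag & 64 ))  && desc+="first_in_pair,"
    (( flag & 128 )) && desc+="second_in_pair,"
    (( flag & 256 )) && desc+="secondary,"
    echo "${desc%,}"          # drop the trailing comma
}
for f in 0 16 99 147; do
    printf '%s\t%s\n' "$f" "$(describe_flag "$f")"
done
```

Read this way, 0 and 16 in the single-end file are forward-strand and reverse-strand alignments, while 99 and 147 in the paired-end file are the two reads of a properly paired fragment.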

16. Download http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip, unzip it, and inspect the directory structure inside. The file is 2.3M, so keep an eye on the download time and speed.

[sunchengquan 22:15:31 ~/test/sun]
$ wget -c http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip
[sunchengquan 22:21:36 ~/test/sun]
$ unzip sickle-results.zip 

[sunchengquan 22:22:36 ~/test/sun]
$ tree sickle-results
sickle-results
├── command.txt
├── single_tmp_fastqc.html
├── single_tmp_fastqc.zip
├── test1_fastqc.html
├── test1_fastqc.zip
├── test2_fastqc.html
├── test2_fastqc.zip
├── trimmed_output_file1_fastqc.html
├── trimmed_output_file1_fastqc.zip
├── trimmed_output_file2_fastqc.html
└── trimmed_output_file2_fastqc.zip

0 directories, 11 files

17. Unzip sickle-results/single_tmp_fastqc.zip, go into the extracted directory, find fastqc_data.txt, and count how many lines in that text file start with >>.

[sunchengquan 22:23:20 ~/test/sun/sickle-results]
$ unzip single_tmp_fastqc.zip
[sunchengquan 22:23:42 ~/test/sun/sickle-results]
$ cd single_tmp_fastqc
[sunchengquan 22:24:06 ~/test/sun/sickle-results/single_tmp_fastqc]
$ ls
fastqc_data.txt  fastqc.fo  fastqc_report.html  Icons  Images  summary.txt

[sunchengquan 22:25:02 ~/test/sun/sickle-results/single_tmp_fastqc]
$ grep '^>>' fastqc_data.txt |wc -l
24


[sunchengquan 08:02:54 ~/test/sun/sickle-results/single_tmp_fastqc]
$ cat fastqc_data.txt |awk '/^>>/{print}'|wc -l 
24
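
`grep -c` counts matching lines directly, without a separate `wc -l`. In the usual FastQC layout each module opens with a `>>Name` line and closes with `>>END_MODULE`, so the number of modules is half the number of `>>` lines. The sketch below shows both counts on a tiny synthetic file; its contents, including the version string, are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Separate module header lines from >>END_MODULE lines (synthetic data).
data=$(mktemp)
cat > "$data" <<'EOF'
##FastQC 0.11.5
>>Basic Statistics pass
#Measure Value
>>END_MODULE
>>Per base sequence quality fail
#Base Mean
>>END_MODULE
EOF
marker_lines=$(grep -c '^>>' "$data")
modules=$(grep '^>>' "$data" | grep -vc '^>>END_MODULE')
echo "$marker_lines marker lines, $modules modules"
```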

18. Download http://www.biotrainee.com/jmzeng/tmp/hg38.tss, look up on NCBI the RefSeq IDs of genes you are interested in, such as TP53 or BRCA1, then find which lines of hg38.tss they are on.

[sunchengquan 22:27:31 ~/test/sun]
$ wget -c  http://www.biotrainee.com/jmzeng/tmp/hg38.tss
[sunchengquan 22:43:17 ~/test/sun]
$ grep 'NM_000546' hg38.tss 
NM_000546    chr17    7685550    7689550    1
[sunchengquan 22:43:27 ~/test/sun]
$ grep 'NM_001126113' hg38.tss 
NM_001126113    chr17    7685550    7689550    1
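
The question asks for the line number, which plain `grep` does not print; `grep -n` adds it. An exact comparison on the first column with `awk` also avoids hits on longer IDs that merely contain the search string. The sketch below uses a three-row synthetic stand-in for hg38.tss; only the NM_000546 row is taken from the output above, and the other rows are made up.

```shell
#!/usr/bin/env bash
# Print the line number of the row whose first column equals the ID exactly.
tss=$(mktemp)                 # stand-in for hg38.tss (synthetic rows)
printf 'NM_0005460\tchr2\t100\t4100\t1\n'         > "$tss"
printf 'NM_000546\tchr17\t7685550\t7689550\t1\n' >> "$tss"
printf 'NR_000001\tchr3\t500\t4500\t0\n'         >> "$tss"
hits=$(awk -F'\t' -v id="NM_000546" '$1 == id {print NR ": " $0}' "$tss")
echo "$hits"
```

A plain `grep 'NM_000546'` would also report the first row here, because NM_0005460 contains the search string.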

19. Parse hg38.tss and count the number of genes on each chromosome.

[sunchengquan 22:50:12 ~/test/sun]
$ cat hg38.tss |cut -f2|sort|uniq -c
   6050 chr1
   2824 chr10
   ………


[sunchengquan 22:59:12 ~/test/sun]
$ grep -oE 'chr[0-9]{1,2}|chr[a-zA-Z]{1,2}' hg38.tss |sort |uniq -c
   6157 chr1
   2838 chr10
   3577 chr11
   3014 chr12
   1133 chr13
   1982 chr14
   2377 chr15
   2696 chr16
   3794 chr17
    883 chr18
   5880 chr19
   4090 chr2
   1692 chr20
    895 chr21
   1410 chr22
   3395 chr3
   2277 chr4
   2821 chr5
   5782 chr6
   2785 chr7
   2221 chr8
   2310 chr9
      2 chrM
     32 chrUn
   2561 chrX
    414 chrY
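
The two commands give different counts for chr1 (6050 versus 6157). One likely reason is that `cut -f2` counts exact values of column 2, whereas the `grep -oE` pattern matches only the leading chrN part of a name, so entries on contigs with a suffix are folded into the main chromosome. The sketch below reproduces the effect on synthetic rows; `chr1_demo_alt` is a made-up contig name.

```shell
#!/usr/bin/env bash
# Compare exact column counting with prefix matching (synthetic rows).
tss=$(mktemp)                 # stand-in for hg38.tss
printf 'NM_1\tchr1\t1\t2\t1\n'           > "$tss"
printf 'NM_2\tchr1\t3\t4\t1\n'          >> "$tss"
printf 'NM_3\tchr1_demo_alt\t5\t6\t1\n' >> "$tss"
exact=$(cut -f2 "$tss" | sort | uniq -c | awk '$2 == "chr1" {print $1}')
prefix=$(grep -oE 'chr[0-9]{1,2}|chr[a-zA-Z]{1,2}' "$tss" | sort | uniq -c | awk '$2 == "chr1" {print $1}')
echo "exact column match: $exact, prefix match: $prefix"
```

Which count is the right answer depends on whether alternate and unplaced contigs should be attributed to their parent chromosome.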

20. Parse hg38.tss, count the sequences whose IDs start with NM and NR, and learn what the NM and NR prefixes mean.

[sunchengquan 23:01:06 ~/test/sun]
$ grep '^NR' hg38.tss |wc -l
15954
[sunchengquan 23:01:13 ~/test/sun]
$ grep '^NM' hg38.tss |wc -l
51064

$ grep -oE '^(NM|NR)' hg38.tss |sort|uniq -c
  51064 NM
  15954 NR


[sunchengquan 23:06:29 ~/test/sun]
$ grep -E '^(NM|NR)' hg38.tss |wc -l
67018
[sunchengquan 23:08:30 ~/test/sun]
$ wc -l hg38.tss 
67018 hg38.tss
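
The per-prefix counts can also be collected in one pass with an `awk` associative array keyed on the first two characters of the ID, which would reveal any other prefix as well. This is a sketch on three synthetic rows, not on the real file.

```shell
#!/usr/bin/env bash
# Count IDs by their two-letter prefix in a single pass (synthetic rows).
tss=$(mktemp)                 # stand-in for hg38.tss
printf 'NM_1\tchr1\t1\t2\t1\n'  > "$tss"
printf 'NR_1\tchr1\t3\t4\t1\n' >> "$tss"
printf 'NM_2\tchr2\t5\t6\t1\n' >> "$tss"
# awk does not guarantee the order of "for (p in n)", so sort the output.
counts=$(awk -F'\t' '{n[substr($1, 1, 2)]++} END {for (p in n) print p, n[p]}' "$tss" | sort)
echo "$counts"
```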



NM: RefSeq accessions for protein-coding transcripts (mRNA)
NR: RefSeq accessions for non-coding transcripts (ncRNA)