
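Falcon: an experimental diploid assembler, tested on multi-Gb genomes.
Canu: a fork of Celera Assembler designed for high-noise single-molecule sequencing.
BLAST: short for Basic Local Alignment Search Tool, an extremely common sequence alignment tool. It comprises several programs (modules, so to speak): blastn, blastp, blastx, tblastx, and so on.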

blastp compares protein sequences against protein sequences. BLAST is the name of both the sequence search algorithm and the tool.



de novo assembly

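The de Bruijn graph is currently the most widely used assembly approach for second-generation sequencing data. The already short reads are split further into k-mers (k is smaller than the read length), and adjacent k-mers overlap by (k-1) bases (the window slides one position at a time). This lowers the cost of computing overlaps and reduces memory consumption.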
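The k-mer splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular assembler's implementation: the function names `kmers` and `de_bruijn_edges`, the toy reads, and the choice k=4 are all assumptions made for the example, and reads are assumed to be error-free and on the same strand.

```python
from __future__ import annotations

from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """Return every k-mer of `read`, sliding one base at a time."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it leads to.

    Every k-mer contributes one edge: prefix (first k-1 bases) -> suffix
    (last k-1 bases). Identical k-mers from different reads collapse into
    the same edge, which is why no pairwise read overlaps are computed.
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap each other.
reads = ["ATCTGT", "CTGTCC", "GTCCTA"]
graph = de_bruijn_edges(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from `ATC` spells out the original sequence one base at a time; real assemblers additionally handle sequencing errors, reverse complements, and repeats.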

1 Mb = 1,000 kb = 1,000,000 bp. A kb (kilobase) is one thousand bp, that is, one thousand bases.

Third-generation sequencing falls into two main camps: PacBio's SMRT (Single Molecule, Real-Time) sequencing and Oxford Nanopore's nanopore sequencing.

fuzzy-Bruijn graph (FBG)

de Bruijn graph(DBG)

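PE reads are paired-end reads.
During sequencing, both ends of a DNA molecule can be sequenced. One end is sequenced first to produce one read, then the other end is sequenced to produce a second read. These two reads together are PE reads.
PE reads help with downstream sequence assembly.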
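As a rough illustration of what a read pair looks like, the sketch below derives both reads from one fragment. It assumes an error-free fragment and the common forward/reverse orientation, in which the second read is the reverse complement of the fragment's far end; the function names and the example fragment are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment: str, read_len: int) -> tuple:
    """Return (read1, read2) taken from the two ends of `fragment`.

    read1 is the first `read_len` bases; read2 is read from the opposite
    strand, so it is the reverse complement of the last `read_len` bases.
    """
    if read_len > len(fragment):
        raise ValueError("read length exceeds fragment length")
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment[-read_len:])
    return read1, read2


# Hypothetical 20 bp fragment, 5 bp reads.
fragment = "ATCTGTCCTAGGCATTACGA"
read1, read2 = paired_end_reads(fragment, 5)
print(read1, read2)
```

Because the two reads come from a fragment of roughly known size, their distance and relative orientation constrain where they can be placed during assembly.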



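High-throughput sequencing, also called "next-generation" sequencing (NGS), is characterized by sequencing hundreds of thousands to millions of DNA molecules in parallel in a single run, generally with short read lengths. It is a revolutionary change from traditional sequencing, which is why some of the literature calls it next-generation sequencing. Because it also makes a detailed, comprehensive analysis of a species' transcriptome and genome possible, it is also known as deep sequencing.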



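De novo sequencing

De novo sequencing requires no prior sequence information for the species being sequenced. Bioinformatic methods are used to stitch together and assemble the reads into a genome sequence map of the species. It is widely used to resolve the genome sequence, gene content, and evolutionary features of previously uncharacterized species. More precisely, de novo genome sequencing targets a species whose genome sequence is unknown and for which no closely related reference genome exists: libraries of genomic DNA fragments of different lengths are sequenced, then assembled and annotated with bioinformatic methods to obtain the complete genome sequence map of that species.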







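I also used WTDBG (ruanjue/wtdbg). The software had not been published at the time of writing, but it can assemble genomes larger than 1 Gb. At first I suspected it might not be reliable, but after using it I found the results were very good. Unlike other third-generation assemblers, it does not self-correct reads first and does not use the overlap-layout algorithm; instead it breaks the long reads into k-mers, assembles them, and then uses minimap for polishing. The resulting genome size and quality were both reasonable. Because a large number of k-mers are generated, WTDBG needs a lot of memory, although this depends on the parameters (one parameter controls the fraction of k-mers used for assembly, for example all of them, 1/2, or 1/4).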




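Determine what is missing from a genome; identify genes and pathways that are difficult to study biochemically; study every gene in a pathway of interest; study the regulatory mechanisms and structural features of non-coding regions of the genome (introns, promoters, telomeres, etc.); use the genome as a large database that is amenable to statistical methods; identify sequence differences that may have subtle phenotypes; study the evolution of species and genomes.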




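For example: the genome is broken into 200 bp fragments and the read length is 10 bp. If the sequence read out is ATCTGTCCTA, that sequence is a read. Many such sequences are reads.

Reads are the individual sequences produced by the sequencer; contigs are the sequences obtained by assembling reads during bioinformatic analysis. Sequencer output comes in small pieces; assembly filters, refines, and joins them, turning short pieces into much longer sequences.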




CCS (Circular Consensus Sequence): the consensus sequence obtained by correcting the multiple subreads from one polymerase read against each other, so that it reflects the true library molecule.
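The idea of subreads correcting each other can be illustrated with a toy per-column majority vote. This is only a sketch under strong assumptions: the subreads are taken to be already aligned, in the same orientation, and of equal length, whereas real CCS generation relies on alignment-based consensus that also handles insertions and deletions. The function name and the example subreads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_subreads: list) -> str:
    """Return the per-column majority base of equal-length, pre-aligned subreads."""
    if not aligned_subreads:
        raise ValueError("need at least one subread")
    length = len(aligned_subreads[0])
    if any(len(s) != length for s in aligned_subreads):
        raise ValueError("subreads must all have the same length")
    consensus = []
    for column in zip(*aligned_subreads):
        # most_common(1) returns the most frequent base in this column;
        # ties are resolved by first occurrence, which is arbitrary here.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same molecule, each with at most one error.
subreads = [
    "ATCTGTCCTA",
    "ATCAGTCCTA",
    "ATCTGTCGTA",
    "ATCTGTCCTA",
    "TTCTGTCCTA",
]
print(majority_consensus(subreads))
```

Each individual pass carries an error, but because the errors fall at different positions the vote recovers the underlying sequence.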

kb = kilobase (one thousand bases)

nt = nucleotide

bp = base pair

Exome sequencing (whole-exome sequencing)


Small RNA sequencing

Small RNAs (microRNAs, siRNAs, and piRNAs)





ChIRP-Seq (Chromatin Isolation by RNA Purification) is a high-throughput sequencing method for detecting the DNA and proteins bound to an RNA.




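The main process of sequence assembly is grouping reads into contigs and grouping contigs into scaffolds. A contig is a multiple alignment of reads from which a consensus sequence is formed; a scaffold (also called a supercontig or metacontig) specifies the order and orientation of the contigs and the sizes of the gaps between them.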
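To make the three ingredients of a scaffold concrete (order, orientation, gap size), here is a small sketch that joins contigs into one scaffold sequence, filling each gap with a run of N. The layout format `(contig_id, orientation, gap_after)` and all names are assumptions chosen for this illustration, not the format of any specific scaffolder.

```python
# Translation table for complementing uppercase DNA.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(contigs: dict, layout: list) -> str:
    """Join contigs into a scaffold.

    `layout` is an ordered list of (contig_id, orientation, gap_after):
    orientation is "+" or "-", and gap_after is the estimated gap size in bp
    between this contig and the next one (0 for the last contig).
    """
    parts = []
    for contig_id, orientation, gap_after in layout:
        seq = contigs[contig_id]
        if orientation == "-":
            seq = reverse_complement(seq)
        elif orientation != "+":
            raise ValueError(f"bad orientation: {orientation!r}")
        parts.append(seq)
        parts.append("N" * gap_after)  # unknown bases of known length
    return "".join(parts)


# Hypothetical contigs and layout.
contigs = {"c1": "ATCTG", "c2": "GGATC"}
layout = [("c1", "+", 3), ("c2", "-", 0)]
scaffold = build_scaffold(contigs, layout)
print(scaffold)
```

The N runs record that the gap length is known even though the bases inside it are not.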






Contig N50

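Assembling reads yields contigs of various lengths. Summing all contig lengths gives the total contig length. Sort the contigs from longest to shortest (Contig 1, Contig 2, Contig 3, ..., Contig 25) and add their lengths in that order. When the running sum first reaches half of the total contig length, the length of the last contig added is the Contig N50. Example: if Contig 1 + Contig 2 + Contig 3 + Contig 4 = total contig length × 1/2, then the length of Contig 4 is the Contig N50. Contig N50 can serve as one criterion for judging the quality of a genome assembly.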
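The procedure above translates directly into code. The sketch below follows the definition as stated (sort descending, accumulate, stop when the running sum reaches half the total); the function name `n50` and the example lengths are made up for illustration. The same function applies unchanged to scaffold lengths.

```python
def n50(lengths: list) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Hypothetical contig lengths in bp; the total is 300, so half is 150.
contig_lengths = [30, 80, 10, 70, 50, 20, 40]
print(n50(contig_lengths))
```

Here 80 + 70 = 150 reaches half of the total, so the N50 is the length of the second-longest contig, 70.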


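In de novo genome sequencing, after contigs are obtained by assembling reads, it is often necessary to build 454 paired-end libraries or Illumina mate-pair libraries to obtain the sequences at both ends of fragments of a given size (e.g. 3 kb, 6 kb, 10 kb, 20 kb). These sequences establish the order of some contigs, and contigs whose relative order is known form a scaffold.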

Scaffold N50

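Scaffold N50 is defined analogously to Contig N50. Assembling contigs yields scaffolds of various lengths. Summing all scaffold lengths gives the total scaffold length. Sort the scaffolds from longest to shortest (Scaffold 1, Scaffold 2, Scaffold 3, ..., Scaffold 25) and add their lengths in that order. When the running sum first reaches half of the total scaffold length, the length of the last scaffold added is the Scaffold N50. Example: if Scaffold 1 + Scaffold 2 + Scaffold 3 + Scaffold 4 + Scaffold 5 = total scaffold length × 1/2, then the length of Scaffold 5 is the Scaffold N50. Scaffold N50 can serve as one criterion for judging the quality of a genome assembly.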


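  When a chromosome is sequenced and the reads are assembled, a sequence that assembles completely, with no gaps, is called a contig. If there are gaps but their lengths are known, the sequence is called a scaffold.
Sort all contigs (or scaffolds) from longest to shortest and accumulate their lengths; when the running sum reaches half of the total assembled length, the length of that contig (or scaffold) is the Contig N50 (or Scaffold N50). These two values are mainly used to evaluate assembly quality: the larger the value, the more contiguous the assembly.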





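Assembling sequencing data into transcripts. There are two approaches: 1. de novo construction; 2. reference-guided reconstruction using a reference genome.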



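BWA: developed by Heng Li, and the most widely used alignment software. Its newest alignment algorithm is mem (maximal exact matches). aln handles reads shorter than 100 bp; mem handles reads longer than 70 bp.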


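The birth of high-throughput sequencing is a milestone in genomics research. It has driven the per-base cost of nucleic acid sequencing down sharply compared with first-generation sequencing. Taking the human genome as an example, the Human Genome Project at the end of the last century spent 3 billion US dollars decoding it, whereas second-generation sequencing has brought human genome sequencing into the era of the ten-thousand-dollar genome. Such a low per-base cost makes genome projects feasible for many more species, and for species with a finished genome it also makes large-scale whole-genome resequencing of other varieties possible.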

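  • Next-generation sequencing (NGS) mainly comprises three specific techniques: whole-genome resequencing (whole-genome sequencing, WGS), whole-exome sequencing (WES), and targeted region sequencing (TRS).
    • Whole-genome resequencing (Whole Genome Sequencing, WGS)
      The latest gene sequencing technology can detect single-gene disorders and chromosomal aneuploidy at the same time, with a reported accuracy above 99%. Gene sequencing was once regarded as the most important technological breakthrough in disease prevention: it can greatly reduce the incidence of heredity-related diseases and birth defects, and it enables disease prediction, prevention, and early warning. The human body has roughly 30,000 genes by early estimates (current estimates are closer to 20,000 protein-coding genes); apart from trauma and certain common external factors, most diseases are gene-related. Abnormal or damaged genes alter the function of the corresponding proteins or enzymes and thereby cause disease. Genetic testing examines the DNA or RNA in cells from blood and other body fluids so that people can learn their own genetic information and anticipate their disease risk.
    • Whole-exome sequencing (Whole Exome Sequencing, WES)
    • Targeted region sequencing (Targeted Regions Sequencing, TRS)
      NGS-based testing takes 7 to 14 days to produce a report, so choosing it presupposes frozen embryo transfer (the cultured blastocysts are frozen and transferred only after the NGS results are available). NGS gives patients a more accurate result, which is expected to improve the IVF success rate.
      PGD stands for Preimplantation Genetic Diagnosis: specific genes are tested to determine whether an embryo carries mutations that may cause particular diseases. Certain genetic abnormalities can cause an embryo to develop specific diseases such as thalassemia, Down syndrome, or cri-du-chat syndrome. The PGD techniques in common use are SNP-based and PCR-based, which can diagnose at most 125 recessive diseases; familial conditions such as diabetes and hypertension still cannot be screened out with current medical technology.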





# Count bases and GC content in a FASTA file (counts uppercase A/T/G/C/N only)

grep -v '>' input.fa | perl -ne '$count_A+=($_=~tr/A//);$count_T+=($_=~tr/T//);$count_G+=($_=~tr/G//);$count_C+=($_=~tr/C//);$count_N+=($_=~tr/N//);END{$total=$count_A+$count_T+$count_G+$count_C+$count_N;print qq{total count is $total\nGC%: },100*($count_G+$count_C)/$total,qq{\n}}'

# Read count, base count, longest read, shortest read, and mean read length for a FASTQ file
# (the sequence is line 2 of every 4-line record)

perl -ne 'BEGIN{$min=1e10;$max=0;}next unless ($.%4==2);chomp;$read_count++;$cur_length=length($_);$total_length+=$cur_length;$min=$min>$cur_length?$cur_length:$min;$max=$max<$cur_length?$cur_length:$max;END{print qq{Totally $read_count reads\nTotally $total_length bases\nMAX length is $max bp\nMIN length is $min bp \nMean length is },$total_length/$read_count,qq{ bp\n}}' input.fq
# The same statistics for a FASTA file (assumes each sequence sits on a single line, i.e. exactly two lines per record)

perl -ne 'BEGIN{$min=1e10;$max=0;}next if ($.%2);chomp;$read_count++;$cur_length=length($_);$total_length+=$cur_length;$min=$min>$cur_length?$cur_length:$min;$max=$max<$cur_length?$cur_length:$max;END{print qq{Totally $read_count reads\nTotally $total_length bases\nMAX length is $max bp\nMIN length is $min bp \nMean length is },$total_length/$read_count,qq{ bp\n}}' input.fa
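The FASTA one-liner above relies on every sequence occupying exactly one line. For wrapped (multi-line) FASTA, a small parser is safer. The following Python sketch is one possible way to do it: the names `read_fasta` and `fasta_stats` are made up for this example, GC content is computed over all bases (N included), and the in-memory test data stands in for a real file.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs, joining sequence lines that are wrapped."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_stats(handle) -> dict:
    """Return record count, base count, max/min/mean length, and GC percent."""
    lengths = []
    gc = 0
    for _name, seq in read_fasta(handle):
        seq = seq.upper()  # count soft-masked (lowercase) bases too
        lengths.append(len(seq))
        gc += seq.count("G") + seq.count("C")
    if not lengths or sum(lengths) == 0:
        raise ValueError("no sequence data found")
    total = sum(lengths)
    return {
        "records": len(lengths),
        "bases": total,
        "max": max(lengths),
        "min": min(lengths),
        "mean": total / len(lengths),
        "gc_percent": 100 * gc / total,
    }


# Hypothetical wrapped FASTA held in memory; use open("input.fa") for a file.
example = io.StringIO(">r1\nATCG\nGGCC\n>r2\natat\n")
stats = fasta_stats(example)
print(stats)
```

Reading record by record keeps memory use proportional to the longest sequence rather than the whole file.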

wtdbg2 -h

WTDBG: De novo assembler for long noisy sequences
Author: Jue Ruan <[email protected]>
Version: 2.4 (20190417)
Usage: wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]
 -i <string> Long reads sequences file (REQUIRED; can be multiple), []
 -o <string> Prefix of output files (REQUIRED), []
 -t <int>    Number of threads, 0 for all cores, [4]
 -f          Force to overwrite output files
 -x <string> Presets, comma delimited, []
            preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
                    preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
                    preset3: -p 19 -AS 2 -s 0.05 -L 5000
            (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
            (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
      preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5
 -g <number> Approximate genome size (k/m/g suffix allowed) [0]
 -X <float>  Choose the best <float> depth from input reads(effective with -g) [50.0]
 -L <int>    Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio) [0]
             Negative integer indicate tidying read names too, e.g. -5000.
 -k <int>    Kmer fsize, 0 <= k <= 25, [0]
 -p <int>    Kmer psize, 0 <= p <= 25, [21]
             k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>
 -K <float>  Filter high frequency kmers, maybe repetitive, [1000.05]
             >= 1000 and indexing >= (1 - 0.05) * total_kmers_count
 -S <float>  Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]
             -S is very useful in saving memeory and speeding up
             please note that subsampling kmers will have less matched length
 -l <float>  Min length of alignment, [2048]
 -m <float>  Min matched length by kmer matching, [200]
 -R          Enable realignment mode
 -A          Keep contained reads during alignment
 -s <float>  Min similarity, calculated by kmer matched length / aligned length, [0.05]
 -e <int>    Min read depth of a valid edge, [3]
 -q          Quiet
 -v          Verbose (can be multiple)
 -V          Print version information and then exit
 --help      Show more options
 ** more options **
 --cpu <int>
   See -t 0, default: all cores
 --input <string> +
   See -i
   See -f
 --prefix <string>
   See -o
 --preset <string>
   See -x
 --kmer-fsize <int>
   See -k 0
 --kmer-psize <int>
   See -p 21
 --kmer-depth-max <float>
   See -K 1000.05
 -E, --kmer-depth-min <int>
   Min kmer frequency, [2]
 --kmer-subsampling <float>
   See -S 4.0
 --kbm-parts <int>
   Split total reads into multiple parts, index one part by one to save memory, [1]
 --aln-kmer-sampling <int>
   Select no more than n seeds in a query bin, default: 256
 --dp-max-gap <int>
   Max number of bin(256bp) in one gap, [4]
 --dp-max-var <int>
   Max number of bin(256bp) in one deviation, [4]
 --dp-penalty-gap <int>
   Penalty for BIN gap, [-7]
 --dp-penalty-var <int>
   Penalty for BIN deviation, [-21]
 --aln-min-length <int>
   See -l 2048
 --aln-min-match <int>
   See -m 200. Here the num of matches counting basepair of the matched kmer's regions
 --aln-min-similarity <float>
   See -s 0.05
 --aln-max-var <float>
   Max length variation of two aligned fragments, default: 0.25
 --aln-dovetail <int>
   Retain dovetail overlaps only, the max overhang size is <--aln-dovetail>, the value should be times of 256, -1 to disable filtering, default: 256
 --aln-strand <int>
   1: forward, 2: reverse, 3: both. Please don't change the deault vaule 3, unless you exactly know what you are doing
 --aln-maxhit <int>
   Max n hits for each read in build graph, default: 1000
 --aln-bestn <int>
   Use best n hits for each read in build graph, 0: keep all, default: 500
   <prefix>.alignments always store all alignments
 -R, --realign
   Enable re-alignment, see --realn-kmer-psize=15, --realn-kmer-subsampling=1, --realn-min-length=2048, --realn-min-match=200, --realn-min-similarity=0.1, --realn-max-var=0.25
 --realn-kmer-psize <int>
   Set kmer-psize in realignment, (kmer-ksize always eq 0), default:15
 --realn-kmer-subsampling <int>
   Set kmer-subsampling in realignment, default:1
 --realn-min-length <int>
   Set aln-min-length in realignment, default: 2048
 --realn-min-match <int>
   Set aln-min-match in realignment, default: 200
 --realn-min-similarity <float>
   Set aln-min-similarity in realignment, default: 0.1
 --realn-max-var <float>
   Set aln-max-var in realignment, default: 0.25
 -A, --aln-noskip
   Even a read was contained in previous alignment, still align it against other reads
 --corr-mode <float>
   Default: 0.0. If set > 0 and set --g <genome_size>, will turn on correct-align mode.
   Wtdbg will select <genome_size> * <corr-mode> bases from reads of middle length, and align them aginst all reads.
   Then, wtdbg will correct them using POACNS, and query corrected sequences against all reads again
   In correct-align mode, --aln-bestn = unlimited, --no-read-clip, --no-chaining-clip. Will support those features in future
 --corr-min <int>
 --corr-max <int>
   For each read to be corrected, uses at least <corr-min> alignments, and at most <corr-max> alignments
   Default: --corr_min = 5, --corr-max = 10
 --corr-cov <float>
   Default: 0.75. When aligning reads to be corrected, the alignments should cover at least <corr-cov> of read length
 --corr-block-size <int>
   Default: 2048. MUST be times of 256bp. Used in POACNS
 --corr-block-step <int>
   Default: 1536. MUST be times of 256bp. Used in POACNS
   By default, wtdbg will keep only the best alignment between two reads after chainning. This option will disable it, and keep multiple
 --verbose +
   See -v. -vvvv will display the most detailed information
   See -q
 --limit-input <int>
   Limit the input sequences to at most <int> M bp. Usually for test
 -L <int>, --tidy-reads <int>
   Default: 0. Pick longest subreads if possible. Filter reads less than <--tidy-reads>. Please add --tidy-name or set --tidy-reads to nagetive value
   if want to rename reads. Set to 0 bp to disable tidy. Suggested value is 5000 for pacbio RSII reads
   Rename reads into 'S%010d' format. The first read is named as S0000000001
 -g <number>, --genome-size <number>
   Provide genome size, e.g. 100.4m, 2.3g. In this version, it is used with -X/--rdcov-cutoff in selecting reads just after readed all.
 -X <float>, --rdcov-cutoff <float>
   Default: 50.0. Retaining 50.0 folds of genome coverage, combined with -g and --rdcov-filter.
 --rdcov-filter [0|1]
   Default 0. Strategy 0: retaining longest reads. Strategy 1: retaining medain length reads. 
   Select nodes from error-free-sequences only. E.g. you have contigs assembled from NGS-WGS reads, and long noisy reads.
   You can type '--err-free-seq your_ctg.fa --input your_long_reads.fa --err-free-nodes' to perform assembly somehow act as long-reads scaffolding
 --node-len <int>
   The default value is 1024, which is times of KBM_BIN_SIZE(always equals 256 bp). It specifies the length of intervals (or call nodes after selecting).
   kbm indexs sequences into BINs of 256 bp in size, so that many parameter should be times of 256 bp. There are: --node-len, --node-ovl, --aln-min-length, --aln-dovetail .   Other parameters are counted in BINs, --dp-max-gap, --dp-max-var .
 --node-matched-bins <int>
   Min matched bins in a node, default:1
 --node-ovl <int>
   Default: 256. Max overlap size between two adjacent intervals in any read. It is used in selecting best nodes representing reads in graph
 --node-drop <float>
   Default: 0.25. Will discard an node when has more this ratio intervals are conflicted with previous generated node
 -e <int>, --edge-min=<int>
   Default: 3. The minimal depth of a valid edge is set to 3. In another word, Valid edges must be supported by at least 3 reads
   When the sequence depth is low, have a try with --edge-min 2. Or very high, try --edge-min 4
   Don't attempt to rescue low coverage edges
 --node-min <int>
   Min depth of an interval to be selected as valid node. Defaultly, this value is automaticly the same with --edge-min.
 --node-max <int>
   Nodes with too high depth will be regarded as repetitive, and be masked. Default: 200, more than 200 reads contain this node
 --ttr-cutoff-depth <int>, 0
 --ttr-cutoff-ratio <float>, 0.5
   Tiny Tandom Repeat. A node located inside ttr will bring noisy in graph, should be masked. The pattern of such nodes is:
   depth >= <--ttr-cutoff-depth>, and none of their edges have depth greater than depth * <--ttr-cutoff-ratio 0.5>
   set --ttr-cutoff-depth 0 to disable ttr masking
 --dump-kbm <string>
   Dump kbm index into file for loaded by `kbm` or `wtdbg`
 --dump-seqs <string>
   Dump kbm index (only sequences, no k-mer index) into file for loaded by `kbm` or `wtdbg`
   Please note: normally load it with --load-kbm, not with --load-seqs
 --load-kbm <string>
   Instead of reading sequences and building kbm index, which is time-consumed, loading kbm-index from already dumped file.
   Please note that, once kbm-index is mmaped by kbm -R <kbm-index> start, will just get the shared memory in minute time.
   See `kbm` -R <your_seqs.kbmidx> [start | stop]
 --load-seqs <string>
   Similar with --load-kbm, but only use the sequences in kbmidx, and rebuild index in process's RAM.
 --load-alignments <string> +
   `wtdbg` output reads' alignments into <--prefix>.alignments, program can load them to fastly build assembly graph. Or you can offer
   other source of alignments to `wtdbg`. When --load-alignment, will only reading long sequences but skip building kbm index
   You can type --load-alignments <file> more than once to load alignments from many files
 --load-clips <string>
   Combined with --load-nodes. Load reads clips. You can find it in `wtdbg`'s <--prefix>.clps
 --load-nodes <sting>
   Load dumped nodes from previous execution for fast construct the assembly graph, should be combined with --load-clips. You can find it in `wtdbg`'s <--prefix>.1.nodes
 --bubble-step <int>
   Max step to search a bubble, meaning the max step from the starting node to the ending node. Default: 40
 --tip-step <int>
   Max step to search a tip, 10
 --ctg-min-length <int>
   Min length of contigs to be output, 5000
 --ctg-min-nodes <int>
   Min num of nodes in a contig to be ouput, 3
   Will generate as less output files (<--prefix>.*) as it can
 --bin-complexity-cutoff <int>
   Used in filtering BINs. If a BIN has less indexed valid kmers than <--bin-complexity-cutoff 2>, masks it.
   Before building edges, for each node, local-graph-analysis reads all related reads and according nodes, and builds a local graph to judge whether to mask it
   The analysis aims to find repetitive nodes
   Defaultly, `wtdbg` sorts input sequences by length DSC. The order of reads affects the generating of nodes in selecting important intervals
   In graph clean, `wtdbg` normally masks isolated (orphaned) nodes
   Defaultly, `wtdbg` clips a input sequence by analyzing its overlaps to remove high error endings, rolling-circle repeats (see PacBio CCS), and chimera.
   When building edges, clipped region won't contribute. However, `wtdbg` will use them in the final linking of unitigs
   Defaultly, performs alignments chainning in read clipping
   ** If '--aln-bestn 0 --no-read-clip', alignments will be parsed directly, and less RAM spent on recording alignments
