Shannon-Fano Coding: Principles and Implementation

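Like a Huffman tree, Shannon-Fano coding uses a binary tree to encode characters. In practice, however, Shannon-Fano is rarely used, because it is less efficient than Huffman coding (in other words, the average codeword length it produces can be larger). Its basic idea is still worth studying.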

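Following the explanation on Wikipedia, let's look at how the Shannon-Fano algorithm works:

A Shannon-Fano tree is built according to a specification designed to define an effective code table. The actual algorithm is simple: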

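1. For a given list of symbols, build the corresponding list of probabilities or frequency counts, so that the relative frequency of each symbol is known.
2. Sort the list of symbols by frequency, with the most frequent symbols on the left and the least frequent on the right.
3. Divide the list into two parts so that the total frequency of the left part is as close as possible to the total frequency of the right part.
4. Assign the binary digit 0 to the left part and the digit 1 to the right part. This means that the codes of the symbols in the first part all start with 0, and the codes of the symbols in the second part all start with 1.
5. Recursively apply steps 3 and 4 to each of the two parts, subdividing the groups and appending bits to the codes, until every symbol has become a leaf of the code tree.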

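Example

Figure: the Shannon-Fano coding algorithm

This example shows how a Shannon-Fano code is constructed for a small alphabet (see figure a). The five symbols to be encoded occur the following number of times: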

Symbol	A	B	C	D	E
Count	15	7	6	6	5
Probabilities	0.38461538	0.17948718	0.15384615	0.15384615	0.12820513

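From left to right, the symbols are ordered by their number of occurrences. Placing the dividing line between B and C gives a left group and a right group with total counts of 22 and 17 respectively, which minimizes the difference between the two groups. With this split, the codes of A and B both start with 0, while the codes of C, D and E start with 1, as shown in figure b. Next, in the left half of the tree, a new dividing line is placed between A and B, so that A becomes a leaf with code 00 and B a leaf with code 01. In the right half, a line between C and the pair D, E gives C the code 10, and a final line between D and E gives them 110 and 111. After these four divisions the code tree is complete. As the table below shows, in the final tree the three most frequent symbols have 2-bit codes and the two less frequent symbols have 3-bit codes.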

Symbol	A	B	C	D	E
Code	00	01	10	110	111
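The worked example can be checked with a few lines of C++. The following is only a minimal sketch, not the implementation given later in this post: the names `SymbolCount`, `sfSplit` and `shannonFanoCodes` are hypothetical, codes are kept as strings of '0' and '1' for readability, and a lone symbol is arbitrarily given the code "0" (an assumption, since the steps above do not cover that case).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: a symbol together with its occurrence count.
using SymbolCount = std::pair<char, long>;

// Assign codes to symbols[lo, hi), which must already be sorted by
// descending count. `prefix` is the code accumulated on the way down.
static void sfSplit(const std::vector<SymbolCount>& symbols,
                    std::size_t lo, std::size_t hi,
                    const std::string& prefix,
                    std::map<char, std::string>& codes) {
    assert(lo < hi);
    if (hi - lo == 1) {               // one symbol left: it becomes a leaf
        codes[symbols[lo].first] = prefix;
        return;
    }
    long total = 0;
    for (std::size_t i = lo; i < hi; ++i) total += symbols[i].second;

    // Step 3: choose the split that makes both totals as equal as possible.
    long left = 0;
    long bestDiff = total;
    std::size_t split = lo + 1;       // left part is [lo, split)
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        left += symbols[i].second;
        const long diff = std::labs(2 * left - total);  // |left - right|
        if (diff < bestDiff) {
            bestDiff = diff;
            split = i + 1;
        }
    }
    // Step 4 and 5: 0 for the left part, 1 for the right part, then recurse.
    sfSplit(symbols, lo, split, prefix + "0", codes);
    sfSplit(symbols, split, hi, prefix + "1", codes);
}

std::map<char, std::string> shannonFanoCodes(std::vector<SymbolCount> symbols) {
    std::map<char, std::string> codes;
    if (symbols.empty()) return codes;
    // Step 2: most frequent symbol first; stable_sort keeps ties in input order.
    std::stable_sort(symbols.begin(), symbols.end(),
                     [](const SymbolCount& a, const SymbolCount& b) {
                         return a.second > b.second;
                     });
    if (symbols.size() == 1) {        // assumption: a lone symbol gets "0"
        codes[symbols[0].first] = "0";
        return codes;
    }
    sfSplit(symbols, 0, symbols.size(), "", codes);
    return codes;
}
```

Calling `shannonFanoCodes({{'A', 15}, {'B', 7}, {'C', 6}, {'D', 6}, {'E', 5}})` yields the codes from the table: A 00, B 01, C 10, D 110, E 111.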

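Average codeword length (for comparison, the entropy of this source is about 2.19 bits per symbol):

(2 × (15 + 7 + 6) + 3 × (6 + 5)) / 39 = 89/39 ≈ 2.28 bits per symbol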
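To put numbers on the efficiency comparison, the sketch below computes three quantities for the example counts: the entropy of the source, the average codeword length of a given code, and the total number of bits an optimal (Huffman) code would need. The function names `entropyBits`, `averageCodeLength` and `huffmanTotalBits` are hypothetical and chosen only for this illustration. The Huffman total is obtained by repeatedly merging the two smallest weights and summing the merged weights, which equals the sum of count times code length over all symbols.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Shannon entropy of the source in bits per symbol.
double entropyBits(const std::vector<long>& counts) {
    double total = 0.0;
    for (long c : counts) total += c;
    double h = 0.0;
    for (long c : counts) {
        if (c > 0) {
            const double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Average codeword length in bits per symbol, given one code length per symbol.
double averageCodeLength(const std::vector<long>& counts,
                         const std::vector<int>& lengths) {
    assert(counts.size() == lengths.size());
    double total = 0.0, bits = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        total += counts[i];
        bits += static_cast<double>(counts[i]) * lengths[i];
    }
    return total > 0.0 ? bits / total : 0.0;
}

// Total bits used by a Huffman code: the sum of all merged (internal node)
// weights. Returns 0 when there are fewer than two symbols.
long huffmanTotalBits(const std::vector<long>& counts) {
    std::priority_queue<long, std::vector<long>, std::greater<long>>
        heap(counts.begin(), counts.end());   // min-heap of weights
    long total = 0;
    while (heap.size() > 1) {
        const long a = heap.top(); heap.pop();
        const long b = heap.top(); heap.pop();
        total += a + b;
        heap.push(a + b);
    }
    return total;
}
```

For the counts 15, 7, 6, 6, 5 and the Shannon-Fano code lengths 2, 2, 2, 3, 3, this gives an average of 89/39 ≈ 2.28 bits per symbol against an entropy of about 2.19. `huffmanTotalBits` returns 87, that is 87/39 ≈ 2.23 bits per symbol, so on this input Huffman coding is slightly better than Shannon-Fano.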

Pseudo-code


begin
   count source units
   sort source units into non-increasing order of count
   SF-Split(S)
   output(count of symbols, encoded tree, symbols)
   write output
end

procedure SF-Split(S)
begin
   if (|S| > 1) then
   begin
      divide S into S1 and S2 with total counts as nearly equal as possible
      append 0 to codes in S1
      append 1 to codes in S2
      SF-Split(S1)
      SF-Split(S2)
   end
end

If this is still hard to picture, have a look at the simulation program on this website; it illustrates the process very clearly.

Shannon-Fano coding implementation in C++

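From the algorithm above, we need to repeatedly find the best split point, so that at every node of the tree the total frequencies of the left and right subtrees are as close as possible.

Here I locate the split point with a sequential search; binary search (dichotomy) would be more efficient.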
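The binary-search variant mentioned above could look roughly like this. It is a sketch under stated assumptions, not part of the original program: the function name `findSplit` is hypothetical, `prefix[i]` is assumed to hold the total count of the first i sorted symbols (with `prefix[0] == 0`, the same layout as `Sumvec` in the listing), and all counts are assumed positive so that the prefix sums increase. It uses `std::partition_point` to find the first position whose doubled prefix sum reaches the target and then compares it with its left neighbour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Return x in [a, b-1] such that splitting the symbols a..b (1-based,
// inclusive) into a..x and x+1..b minimises |left total - right total|.
// On a tie the smaller x is returned, like a left-to-right linear scan.
int findSplit(const std::vector<long>& prefix, int a, int b) {
    assert(1 <= a && a < b && b < static_cast<int>(prefix.size()));
    // left - right = 2*prefix[x] - prefix[a-1] - prefix[b], so we look for
    // the x whose doubled prefix sum is closest to this target.
    const long target = prefix[a - 1] + prefix[b];
    auto first = prefix.begin() + a;
    auto last = prefix.begin() + b;           // one past candidate b-1
    auto it = std::partition_point(first, last,
                                   [target](long p) { return 2 * p < target; });
    const int x = static_cast<int>(it - prefix.begin());
    if (x == b) return b - 1;                 // every candidate is below target
    if (x > a &&
        std::labs(2 * prefix[x - 1] - target) <= std::labs(2 * prefix[x] - target)) {
        return x - 1;                         // left neighbour is at least as close
    }
    return x;
}
```

With the prefix sums 0, 15, 22, 28, 34, 39 of the example, `findSplit(prefix, 1, 5)` returns 2 (A, B on the left) and `findSplit(prefix, 3, 5)` returns 3 (C alone on the left), matching the splits in the worked example. Each call costs O(log n) instead of O(n).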


/************************************************************************/
/*  File Name: Shanno-Fano.cpp
 *  @Function: Lossless Compression
 *  @Author: Sophia Zhang
 *  @Create Time: 2012-9-26 20:20
 *  @Last Modify: 2012-9-26 20:57
 */
/************************************************************************/

#include <iostream>
#include <map>
#include <string>
#include <iterator>
#include <vector>
#include <algorithm>
#include <cstring>   // memset
#include <cstdlib>   // abs
#include <climits>   // INT_MAX
using namespace std;

#define NChar 8                 // assume 8 bits are enough to describe every symbol
#define Nsymbols (1<<NChar)     // 256 possible symbols in total (includes a-z, A-Z)
#define INF INT_MAX             // larger than any possible difference of counts

typedef vector<bool> SF_Code;   // the bit sequence assigned to one symbol
map<char,SF_Code> SF_Dic;       // Shannon-Fano coding dictionary
int Sumvec[Nsymbols+1];         // Sumvec[i] = total count of the first i symbols after sorting

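class HTree // a node of the code tree
{
public:
    HTree* left;
    HTree* right;
    char ch;     // the symbol (meaningful for leaves only)
    int weight;  // occurrence count of the symbol
    HTree() { left = right = NULL; weight = 0; ch = '\0'; }
    HTree(HTree* l, HTree* r, int w, char c) { left = l; right = r; weight = w; ch = c; }
    ~HTree() { delete left; delete right; } // deleting a node frees its whole subtree
    bool Isleaf() { return !left && !right; }
};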

// Sort order: larger weight first.
bool comp(const HTree* t1, const HTree* t2)
{   return t1->weight > t2->weight;    }

typedef vector<HTree*> TreeVector;
TreeVector TreeArr; // the leaves, sorted by descending symbol count

// Recursively build the subtree for the sorted symbols a..b (1-based, inclusive),
// splitting where the left and right totals are as close as possible.
void Optimize_Tree(int a, int b, HTree& root)
{
    if (a > b) // empty range: nothing to build
        return;
    if (a == b) // a single symbol: root becomes a copy of that leaf
    {
        root = *TreeArr[a-1];
        return;
    }
    else if (b - a == 1) // two symbols: attach both leaves directly
    {
        root.left = TreeArr[a-1];
        root.right = TreeArr[b-1];
        return;
    }
    // Find the split point x: symbols a..x go left, x+1..b go right.
    int x = a, minn = INF, curdiff;
    for (int i = a; i < b; i++) // linear scan; binary search would also work
    {
        // (left total) - (right total), expressed through the prefix sums
        curdiff = Sumvec[i]*2 - Sumvec[a-1] - Sumvec[b];
        if (abs(curdiff) < minn) {
            x = i;
            minn = abs(curdiff);
        }
        else break; // |curdiff| falls and then rises, so stop at the first non-improvement
    }
    HTree* lc = new HTree;   HTree* rc = new HTree;
    root.left = lc;     root.right = rc;
    Optimize_Tree(a, x, *lc);
    Optimize_Tree(x+1, b, *rc);
}

HTree* BuildTree(int* frequency) // build the Shannon-Fano tree with Optimize_Tree
{
    // Create one leaf for every symbol that actually occurs.
    for (int i = 0; i < (Nsymbols); i++)
    {
        if (frequency[i])
            TreeArr.push_back(new HTree(NULL, NULL, frequency[i], (char)i));
    }
    // Most frequent symbol first.
    sort(TreeArr.begin(), TreeArr.end(), comp);
    // Sumvec[i] holds the total weight of the first i sorted symbols.
    memset(Sumvec, 0, sizeof(Sumvec));
    int n = (int)TreeArr.size();
    for (int i = 1; i <= n; i++)
        Sumvec[i] = Sumvec[i-1] + TreeArr[i-1]->weight;
    HTree* root = new HTree;
    Optimize_Tree(1, n, *root);
    return root;
}

/************************************************************************/
/* Assign a Shannon-Fano code to every leaf of the Shannon-Fano tree.   */
/* PS: this generation step is the same as for Huffman coding.          */
/************************************************************************/

void Generate_Coding(HTree* root, SF_Code& curcode)
{
    if (root->Isleaf()) // a leaf: the path from the root is this symbol's code
    {
        SF_Dic[root->ch] = curcode;
        return;
    }
    // Descend with a 0 appended for the left child and a 1 for the right child.
    SF_Code lcode = curcode;
    SF_Code rcode = curcode;
    lcode.push_back(false);
    rcode.push_back(true);
    Generate_Coding(root->left, lcode);
    Generate_Coding(root->right, rcode);
}

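int main()
{
    int freq[Nsymbols] = {0};
    const char* str = "bbbbbbbccccccaaaaaaaaaaaaaaaeeeeedddddd"; // 15 a, 7 b, 6 c, 6 d, 5 e

    // Count how often each character occurs. Index through unsigned char so
    // that bytes above 127 cannot produce a negative index.
    while (*str != '\0')
        freq[(unsigned char)*str++]++;

    // Build the tree and derive the code of every symbol.
    HTree* r = BuildTree(freq);
    SF_Code nullcode;
    Generate_Coding(r, nullcode);

    // Print each symbol followed by its code.
    for (map<char,SF_Code>::iterator it = SF_Dic.begin(); it != SF_Dic.end(); it++) {
        cout << (*it).first << '\t';
        std::copy(it->second.begin(), it->second.end(), std::ostream_iterator<bool>(cout));
        cout << endl;
    }
    return 0;
}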
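
The program above only prints the dictionary. As a possible next step, here is a rough sketch of how such a dictionary might be used to encode a string and decode it again. It is not part of the original program: `Code` and `Dictionary` mirror the shapes of `SF_Code` and `SF_Dic`, while the function names `encodeToBitString` and `decodeBitString` are hypothetical. Bits are kept as '0' and '1' characters for readability, and decoding relies on the code being prefix-free, which holds for codes read off the leaves of a binary tree.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Code = std::vector<bool>;           // same shape as SF_Code
using Dictionary = std::map<char, Code>;  // same shape as SF_Dic

// Concatenate the codes of all characters in `text`.
// dict.at() throws std::out_of_range for a character without a code.
std::string encodeToBitString(const std::string& text, const Dictionary& dict) {
    std::string bits;
    for (char ch : text) {
        const Code& code = dict.at(ch);
        for (bool bit : code) bits.push_back(bit ? '1' : '0');
    }
    return bits;
}

// Decode by growing a candidate codeword bit by bit until it matches an
// entry; no codeword is a prefix of another, so the first match is correct.
std::string decodeBitString(const std::string& bits, const Dictionary& dict) {
    std::map<std::string, char> reverse;  // codeword -> symbol
    for (const auto& entry : dict) {
        std::string key;
        for (bool bit : entry.second) key.push_back(bit ? '1' : '0');
        reverse[key] = entry.first;
    }
    std::string text, current;
    for (char bit : bits) {
        current.push_back(bit);
        auto it = reverse.find(current);
        if (it != reverse.end()) {
            text.push_back(it->second);
            current.clear();
        }
    }
    assert(current.empty());  // assumes the input ends on a codeword boundary
    return text;
}
```

With the example codes (A 00, B 01, C 10, D 110, E 111), encoding "ABE" gives "0001111", and decoding that bit string returns "ABE".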