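Like Huffman coding, Shannon-Fano coding uses a binary tree to assign codes to symbols. In practice, however, Shannon-Fano coding is rarely used, because it is less efficient than Huffman coding: it does not always produce an optimal prefix code, so its average codeword length can be longer. Its basic idea is still worth studying.
Following the explanation on Wikipedia, here is how the Shannon-Fano algorithm works:
A Shannon-Fano tree is built according to a specification designed to define an effective code table. The actual algorithm is simple: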
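1. For a given list of symbols, build the corresponding list of probabilities or frequency counts, so that the relative frequency of each symbol is known.
2. Sort the list of symbols by frequency, with the most frequent symbols on the left and the least frequent on the right.
3. Divide the list into two parts, so that the total frequency of the left part is as close as possible to the total frequency of the right part.
4. Assign the binary digit 0 to the left part and the digit 1 to the right part. This means that the codes of all symbols in the first part start with 0, and the codes of all symbols in the second part start with 1.
5. Recursively apply steps 3 and 4 to each of the two parts, subdividing the groups and appending bits to the codes, until every symbol has become a leaf of the code tree.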
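The first two steps (counting and sorting) can be sketched as follows. This is only an illustration: the helper name `sortedCounts` is hypothetical, and it assumes that symbols are single bytes and that ties between equal counts are broken by byte order.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper covering steps 1 and 2: count each byte of `text`
// and return (symbol, count) pairs, most frequent first.
std::vector<std::pair<char, int>> sortedCounts(const std::string& text) {
    int freq[256] = {0};
    for (char ch : text) ++freq[static_cast<unsigned char>(ch)];

    // Collect only the symbols that actually occur, in byte order.
    std::vector<std::pair<char, int>> symbols;
    for (int i = 0; i < 256; ++i)
        if (freq[i] > 0)
            symbols.push_back(std::make_pair(static_cast<char>(i), freq[i]));

    // stable_sort keeps symbols with equal counts in byte order, so the
    // result is deterministic.
    std::stable_sort(symbols.begin(), symbols.end(),
                     [](const std::pair<char, int>& a, const std::pair<char, int>& b) {
                         return a.second > b.second;
                     });
    return symbols;
}
```

For the input `"abracadabra"` this returns a (5), b (2), r (2), c (1), d (1), in that order.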
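Example
This example shows how a Shannon-Fano code is constructed for a small alphabet (see Figure a). The five symbols to be encoded have the following counts: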
Symbol | A | B | C | D | E |
---|---|---|---|---|---|
Count | 15 | 7 | 6 | 6 | 5 |
Probability | 0.38461538 | 0.17948718 | 0.15384615 | 0.15384615 | 0.12820513 |
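All symbols are sorted by count, from left to right. Placing the dividing line between B and C gives a total of 22 in the left group and 17 in the right group, which minimizes the difference between the two groups. With this division, the codes of A and B both start with 0, and the codes of C, D and E all start with 1 (see Figure b). Next, the left half of the tree gets a new division between A and B, which makes A a leaf with code 00 and B a leaf with code 01. After four divisions, the code tree is complete. As the table below shows, in the final tree the three symbols with the highest counts are assigned 2-bit codes, and the two symbols with the lowest counts are assigned 3-bit codes.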
Symbol | A | B | C | D | E |
---|---|---|---|---|---|
Code | 00 | 01 | 10 | 110 | 111 |
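To see that this table is a usable prefix code, here is a minimal sketch that encodes and decodes text with it. The names `kCodes`, `encode` and `decode` are hypothetical, and bits are represented as the characters '0' and '1' for readability rather than packed into bytes.

```cpp
#include <cassert>
#include <map>
#include <string>

// Code table taken from the example above.
const std::map<char, std::string> kCodes = {
    {'A', "00"}, {'B', "01"}, {'C', "10"}, {'D', "110"}, {'E', "111"}};

// Concatenate the codeword of every symbol in `text`.
std::string encode(const std::string& text) {
    std::string bits;
    for (char ch : text) {
        std::map<char, std::string>::const_iterator it = kCodes.find(ch);
        assert(it != kCodes.end());  // only A..E are defined in the table
        bits += it->second;
    }
    return bits;
}

// Decode by growing a candidate codeword bit by bit. Because no codeword is
// a prefix of another, the first match is always the right one.
std::string decode(const std::string& bits) {
    std::map<std::string, char> reverse;
    for (const auto& kv : kCodes) reverse[kv.second] = kv.first;

    std::string text, current;
    for (char bit : bits) {
        current += bit;
        std::map<std::string, char>::const_iterator it = reverse.find(current);
        if (it != reverse.end()) {
            text += it->second;
            current.clear();
        }
    }
    return text;
}
```

With this table, `encode("ABCDE")` yields `000110110111`, and `decode` maps that bit string back to `ABCDE` without any separators between codewords.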
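Average codeword length: (2 × (15 + 7 + 6) + 3 × (6 + 5)) / 39 = 89/39 ≈ 2.28 bits per symbol. This is not the same quantity as the entropy: the entropy of this source, which is the lower bound for the average length of any such code, is about 2.186 bits per symbol.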
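Both quantities can be checked numerically. The sketch below assumes that the counts and the codeword lengths are passed as parallel vectors; the function names `averageLength` and `entropy` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Average codeword length in bits per symbol:
// sum(count_i * length_i) / sum(count_i).
double averageLength(const std::vector<int>& counts, const std::vector<int>& lengths) {
    assert(counts.size() == lengths.size() && !counts.empty());
    double bits = 0, total = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        bits += static_cast<double>(counts[i]) * lengths[i];
        total += counts[i];
    }
    return bits / total;
}

// Shannon entropy in bits per symbol: -sum(p_i * log2(p_i)),
// where p_i = count_i / sum(count_i). Zero counts contribute nothing.
double entropy(const std::vector<int>& counts) {
    double total = 0;
    for (int c : counts) total += c;
    double h = 0;
    for (int c : counts) {
        if (c > 0) {
            double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}
```

For the counts 15, 7, 6, 6, 5 and the lengths 2, 2, 2, 3, 3 from the example, `averageLength` gives 89/39 ≈ 2.28 and `entropy` gives about 2.186, so the code is close to the lower bound but does not reach it.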
Pseudo-code
    begin
      count source units
      sort source units in non-increasing order of count
      SF-Split(S)
      output(count of symbols, encoded tree, symbols)
      write output
    end

    procedure SF-Split(S)
    begin
      if (|S| > 1) then
      begin
        divide S into S1 and S2 with approximately equal total counts
        append 0 to the codes in S1
        append 1 to the codes in S2
        SF-Split(S1)
        SF-Split(S2)
      end
    end
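As a rough C++ rendering of the SF-Split procedure, the sketch below works directly on a sorted list and builds the codes as strings, without an explicit tree. It is not the implementation that follows; the names `SymbolCount`, `sfSplit` and `shannonFano` are hypothetical, and it assumes the left part gets 0 and the right part gets 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<char, int> SymbolCount;  // (symbol, count)

// Recursive step corresponding to SF-Split: split [lo, hi) into two parts of
// similar total count, append '0' to the left codes and '1' to the right codes.
void sfSplit(const std::vector<SymbolCount>& s, std::size_t lo, std::size_t hi,
             std::map<char, std::string>& codes) {
    assert(lo <= hi && hi <= s.size());
    if (hi - lo < 2) return;  // zero or one symbol: nothing left to split

    long total = 0;
    for (std::size_t i = lo; i < hi; ++i) total += s[i].second;

    // Try every split that leaves both parts non-empty and keep the one
    // with the smallest |left total - right total|.
    long left = 0, bestDiff = -1;
    std::size_t mid = lo + 1;
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        left += s[i].second;
        long diff = std::labs(2 * left - total);
        if (bestDiff < 0 || diff < bestDiff) {
            bestDiff = diff;
            mid = i + 1;
        }
    }

    for (std::size_t i = lo; i < hi; ++i)
        codes[s[i].first] += (i < mid ? '0' : '1');

    sfSplit(s, lo, mid, codes);
    sfSplit(s, mid, hi, codes);
}

// Hypothetical entry point: sort by non-increasing count, then split recursively.
std::map<char, std::string> shannonFano(std::vector<SymbolCount> symbols) {
    std::stable_sort(symbols.begin(), symbols.end(),
                     [](const SymbolCount& a, const SymbolCount& b) {
                         return a.second > b.second;
                     });
    std::map<char, std::string> codes;
    for (const SymbolCount& sc : symbols) codes[sc.first];  // make sure every symbol has an entry
    sfSplit(symbols, 0, symbols.size(), codes);
    return codes;
}
```

Called with the counts A=15, B=7, C=6, D=6, E=5, `shannonFano` returns the codes 00, 01, 10, 110 and 111. The linear scan for the split point keeps the sketch short; prefix sums would avoid recomputing the totals at every level.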
Shannon-Fano coding implementation in C++
    /************************************************************************/
    /* File Name:    Shanno-Fano.cpp
     * @Function:    Lossless Compression
     * @Author:      Sophia Zhang
     * @Create Time: 2012-9-26 20:20
     * @Last Modify: 2012-9-26 20:57
     */
    /************************************************************************/
    #include <algorithm>
    #include <climits>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <vector>
    using namespace std;

    #define NChar 8                  // suppose 8 bits are used to describe all symbols
    #define Nsymbols (1 << NChar)    // 256 symbols in total (includes a-z, A-Z)
    #define INF INT_MAX

    typedef vector<bool> SF_Code;    // bit sequence that encodes one char
    map<char, SF_Code> SF_Dic;       // Shannon-Fano coding dictionary
    int Sumvec[Nsymbols + 1];        // Sumvec[i] = total count of the first i sorted symbols

    class HTree
    {
    public:
        HTree* left;
        HTree* right;
        char ch;
        int weight;
        HTree() { left = right = NULL; weight = 0; ch = '\0'; }
        HTree(HTree* l, HTree* r, int w, char c) { left = l; right = r; weight = w; ch = c; }
        ~HTree() { delete left; delete right; }
        bool Isleaf() { return !left && !right; }
    };

    // Sort by non-increasing weight; equal weights are ordered by character
    // so that the result is deterministic.
    bool comp(const HTree* t1, const HTree* t2)
    {
        if (t1->weight != t2->weight) return t1->weight > t2->weight;
        return t1->ch < t2->ch;
    }

    typedef vector<HTree*> TreeVector;
    TreeVector TreeArr;  // leaves sorted by non-increasing count

    // Build the subtree for the sorted symbols a..b (1-based, inclusive) under root:
    // find the split point that best balances both halves, then recurse.
    void Optimize_Tree(int a, int b, HTree& root)
    {
        if (a == b)  // one symbol: root becomes that leaf
        {
            root = *TreeArr[a - 1];   // copy the leaf into the node supplied by the caller
            delete TreeArr[a - 1];    // the original leaf is no longer needed
            TreeArr[a - 1] = NULL;
            return;
        }
        else if (b - a == 1)  // two symbols: attach both leaves directly
        {
            root.left = TreeArr[a - 1];
            root.right = TreeArr[b - 1];
            return;
        }
        // find the split point x
        int x = a, minn = INF, curdiff;
        for (int i = a; i < b; i++)  // minimize |left total - right total|; binary search would also work
        {
            curdiff = Sumvec[i] * 2 - Sumvec[a - 1] - Sumvec[b];
            if (abs(curdiff) < minn) {
                x = i;
                minn = abs(curdiff);
            }
            else break;  // curdiff increases with i, so |curdiff| cannot improve again
        }
        HTree* lc = new HTree; HTree* rc = new HTree;
        root.left = lc; root.right = rc;
        Optimize_Tree(a, x, *lc);
        Optimize_Tree(x + 1, b, *rc);
    }

    HTree* BuildTree(int* frequency)  // create the tree with Optimize_Tree
    {
        int i;
        for (i = 0; i < Nsymbols; i++)  // collect the symbols that occur
        {
            if (frequency[i])
                TreeArr.push_back(new HTree(NULL, NULL, frequency[i], (char)i));
        }
        sort(TreeArr.begin(), TreeArr.end(), comp);
        memset(Sumvec, 0, sizeof(Sumvec));
        int n = (int)TreeArr.size();
        for (i = 1; i <= n; i++)  // prefix sums of the sorted counts
            Sumvec[i] = Sumvec[i - 1] + TreeArr[i - 1]->weight;
        HTree* root = new HTree;
        if (n > 0)  // an empty input yields a lone empty root
            Optimize_Tree(1, n, *root);
        TreeArr.clear();  // the tree now owns every remaining node
        return root;
    }

    /************************************************************************/
    /* Assign Shannon-Fano codes to the Shannon-Fano tree.
     * PS: this generation step is the same as for Huffman coding.
     */
    /************************************************************************/
    void Generate_Coding(HTree* root, SF_Code& curcode)
    {
        if (root->Isleaf())
        {
            SF_Dic[root->ch] = curcode;
            return;
        }
        SF_Code lcode = curcode;
        SF_Code rcode = curcode;
        lcode.push_back(false);  // left branch appends 0
        rcode.push_back(true);   // right branch appends 1
        Generate_Coding(root->left, lcode);
        Generate_Coding(root->right, rcode);
    }

    int main()
    {
        int freq[Nsymbols] = {0};
        const char* str = "bbbbbbbccccccaaaaaaaaaaaaaaaeeeeedddddd";  // 15a,7b,6c,6d,5e
        // count character frequencies (cast avoids negative indices for signed char)
        while (*str != '\0') freq[(unsigned char)*str++]++;
        // build tree
        HTree* r = BuildTree(freq);
        SF_Code nullcode;
        Generate_Coding(r, nullcode);
        for (map<char, SF_Code>::iterator it = SF_Dic.begin(); it != SF_Dic.end(); it++) {
            cout << (*it).first << '\t';
            std::copy(it->second.begin(), it->second.end(), std::ostream_iterator<bool>(cout));
            cout << endl;
        }
        delete r;  // frees the whole tree
        return 0;
    }
Result:
The program encodes the counts from the example in the figure above:
Symbol | A | B | C | D | E |
---|---|---|---|---|---|
Count | 15 | 7 | 6 | 6 | 5 |
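To put the earlier remark about efficiency into numbers, the sketch below compares the total encoded size of the Shannon-Fano code from the example (codeword lengths 2, 2, 2, 3, 3) with what a Huffman code would need for the same counts. The function names are hypothetical. The Huffman size is computed without building a tree, using the fact that each merge of the two lightest subtrees adds one bit to every symbol below the new node.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Total number of bits a Huffman code needs for the given symbol counts:
// repeatedly merge the two smallest weights and sum up the merged weights.
long huffmanTotalBits(const std::vector<int>& counts) {
    std::priority_queue<long, std::vector<long>, std::greater<long> > heap(
        counts.begin(), counts.end());
    long bits = 0;
    while (heap.size() > 1) {
        long a = heap.top(); heap.pop();
        long b = heap.top(); heap.pop();
        bits += a + b;
        heap.push(a + b);
    }
    return bits;
}

// Total bits for an arbitrary code, given the codeword length of each symbol.
long totalBits(const std::vector<int>& counts, const std::vector<int>& lengths) {
    long bits = 0;
    for (std::size_t i = 0; i < counts.size() && i < lengths.size(); ++i)
        bits += static_cast<long>(counts[i]) * lengths[i];
    return bits;
}
```

For the counts 15, 7, 6, 6, 5, `totalBits` with the Shannon-Fano lengths gives 89 bits, while `huffmanTotalBits` gives 87 bits, that is 87/39 ≈ 2.23 bits per symbol instead of 2.28.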