Greedy Algorithm: Huffman Codes

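Huffman coding is not a particularly difficult technique (Huffman came up with it as a student, in a term paper he wrote in place of a final exam :)). Because it relies on the greedy idea, I'm including it in this series. The problem itself appears in many computer science textbooks; I'm just recording it here.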

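Huffman coding works like this. Suppose I have a text to be compressed in which the characters occur with different frequencies. For example, take six characters a, b, c, d, e and f that occur 45, 13, 12, 16, 9 and 5 thousand times respectively.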

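Say the text has 100,000 characters, each originally stored in 8 bits. The most obvious way to compress it: since there are only 6 distinct characters, represent each one with 3 bits, so all 100,000 characters fit in 300,000 bits. Huffman's scheme is more subtle. Because the characters occur with different frequencies, he proposed giving frequent characters shorter bit strings and rare characters longer ones. In the variable-length codeword scheme, a has the highest frequency, so it is represented by the single bit 0, while f is rare, so it gets 4 bits. The total is (45 · 1 + 13 · 3 + 12 · 3 + 16 · 3 + 9 · 4 + 5 · 4) · 1,000 = 224,000 bits, roughly 25% fewer than the 300,000 bits of the fixed-length scheme.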
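The arithmetic above can be checked with a few lines of code. This is only a sketch: `totalBits`, `kCounts`, `kFixedLengths` and `kVariableLengths` are names made up for illustration, and the codeword lengths (1, 3, 3, 3, 4, 4) are simply the ones used in the calculation above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Total encoded size: sum over symbols of (occurrences * codeword length in bits).
// Both vectors are indexed by symbol, in the order a, b, c, d, e, f.
long long totalBits(const std::vector<long long>& counts,
                    const std::vector<int>& codeLengths)
{
    assert(counts.size() == codeLengths.size());
    long long bits = 0;
    for (std::size_t i = 0; i < counts.size(); ++i)
        bits += counts[i] * codeLengths[i];
    return bits;
}

// Occurrences of a..f in the 100,000-character sample text.
const std::vector<long long> kCounts = {45000, 13000, 12000, 16000, 9000, 5000};
// Fixed-length scheme: 3 bits for every character.
const std::vector<int> kFixedLengths = {3, 3, 3, 3, 3, 3};
// Variable-length scheme: codeword lengths from the calculation above.
const std::vector<int> kVariableLengths = {1, 3, 3, 3, 4, 4};
```

With these inputs, `totalBits(kCounts, kFixedLengths)` gives 300,000 and `totalBits(kCounts, kVariableLengths)` gives 224,000, matching the two totals in the text.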

As before, a figure (steps (a) through (f)) describes how the Huffman code is built:

In short, the process builds a binary tree according to the frequencies. Once the tree is built, the corresponding codes are done as well.

Step (a). As in the earlier activity-selection problem, sort all the characters by frequency.

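Step (b). Take the two nodes with the lowest frequencies, f and e, and combine them into a new node whose frequency is the sum of the two. The original f and e become the new node's left and right children respectively. (Note the default rule here: the node with the lower frequency is the left child, the one with the higher frequency is the right child.) Then remove the two nodes from the set and insert the new node in sorted position.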

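Step (c). This is essentially the same as step (b); the whole thing is a loop. Again take the two lowest-frequency items out of the queue (this time c and b), combine them into a new node, and insert it back into the queue in sorted position.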

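Repeat this until the binary tree in (f) is formed. Once we have the tree, label every edge to a left child 0 and every edge to a right child 1. Reading the labels along the path from the root to each leaf yields the codes: a=0; b=101;....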
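To see the merge order without the tree bookkeeping, the loop can be sketched on the frequencies alone. `mergeOrder` is a hypothetical helper written for this illustration, not part of the full program; it takes the frequencies as integers (in thousands) and assumes no ties need special handling.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns the pair of frequencies merged at each step, in order.
// Each pair is (smaller, larger): the smaller one becomes the left child.
std::vector<std::pair<int, int>> mergeOrder(const std::vector<int>& freqs)
{
    assert(!freqs.empty());
    // Min-heap: top() is always the smallest remaining frequency.
    std::priority_queue<int, std::vector<int>, std::greater<int>>
        pq(freqs.begin(), freqs.end());

    std::vector<std::pair<int, int>> merges;
    while (pq.size() > 1) {
        int left = pq.top();
        pq.pop();
        int right = pq.top();
        pq.pop();
        merges.push_back(std::make_pair(left, right));
        pq.push(left + right);  // the new internal node goes back into the queue
    }
    return merges;
}
```

For the frequencies 45, 13, 12, 16, 9, 5 this yields the merges (5, 9), (12, 13), (14, 16), (25, 30), (45, 55): first f and e, then c and b, just as in steps (b) and (c).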

Here is the corresponding code:

    // Huffman_Coding.cpp : Defines the entry point for the console application.
    //
    /*
    The simplest construction algorithm uses a priority queue where the node
    with lowest probability is given highest priority:

    1. Create a leaf node for each symbol and add it to the priority queue.
    2. While there is more than one node in the queue:
       #1. Remove the node of highest priority (lowest probability) twice to get two nodes.
       #2. Create a new internal node with these two nodes as children and with
           probability equal to the sum of the two nodes' probabilities.
       #3. Add the new node to the queue.
    3. The remaining node is the root node and the tree is complete.
    */
    #include <cstddef>
    #include <queue>
    #include <map>
    #include <vector>
    #include <iostream>

    struct huffmanNode
    {
        huffmanNode* l;  // left child (NULL for a leaf)
        huffmanNode* r;  // right child (NULL for a leaf)
        char t;          // symbol stored in a leaf
        double f;        // frequency of the symbol or of the whole subtree

        huffmanNode() { l = NULL; r = NULL; t = 0; f = 0; }
        // Deleting a node deletes its whole subtree.
        ~huffmanNode() { delete l; delete r; }
    };

    struct huffmanNodeSort
    {
        bool operator()(huffmanNode* h1, huffmanNode* h2)
        {
            // make top the smallest
            return h1->f > h2->f;
        }
    };

    // Walk the tree and record the path to every leaf (0 = left, 1 = right).
    void generateResult(huffmanNode* pNode,
                        std::map< char, std::vector<bool> >& output,
                        std::vector<bool>& prefix)
    {
        if (pNode->l)
        {
            // if left, there must be right as well
            prefix.push_back(0);
            generateResult(pNode->l, output, prefix);
            prefix.back() = 1;
            generateResult(pNode->r, output, prefix);
            prefix.pop_back();
        }
        else
            output[pNode->t] = prefix;
    }

    void huffmanCoding(const std::map<char, double>& m,
                       std::map< char, std::vector<bool> >& output)
    {
        std::priority_queue<huffmanNode*, std::vector<huffmanNode*>, huffmanNodeSort> pqueue;
        output.clear();
        if (m.empty())
            return;

        // init the queue: one leaf per symbol
        std::map<char, double>::const_iterator mapIterator;
        for (mapIterator = m.begin(); mapIterator != m.end(); ++mapIterator)
        {
            huffmanNode* pNode = new huffmanNode();
            pNode->t = mapIterator->first;
            pNode->f = mapIterator->second;
            pqueue.push(pNode);
        }

        // create the tree
        huffmanNode* tree = NULL;
        while (!pqueue.empty())
        {
            huffmanNode* top = pqueue.top();
            pqueue.pop();
            if (pqueue.empty())
            {
                // last remaining node is the root
                tree = top;
            }
            else
            {
                huffmanNode* top2 = pqueue.top();
                pqueue.pop();
                huffmanNode* pNew = new huffmanNode();
                pNew->f = top->f + top2->f;
                pNew->l = top;   // smaller frequency on the left
                pNew->r = top2;  // larger frequency on the right
                pqueue.push(pNew);
            }
        }

        // set to the result
        std::vector<bool> prefix;
        generateResult(tree, output, prefix);
        delete tree;
    }

    // The original used the MSVC-specific stdafx.h, _tmain and system("pause");
    // a plain main() is used here so the listing compiles anywhere.
    int main()
    {
        std::map<char, double> freq;
        freq['a'] = 0.45;
        freq['b'] = 0.13;
        freq['c'] = 0.12;
        freq['d'] = 0.16;
        freq['e'] = 0.09;
        freq['f'] = 0.05;

        std::map< char, std::vector<bool> > s;
        huffmanCoding(freq, s);

        std::map<char, std::vector<bool> >::const_iterator si = s.begin();
        for (; si != s.end(); ++si)
        {
            std::vector<bool>::const_iterator it = si->second.begin();
            std::cout << si->first << " = ";
            for (; it != si->second.end(); ++it)
                std::cout << *it;
            std::cout << std::endl;
        }
        return 0;
    }
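As a small follow-up, here is a sketch of how the resulting codes could be used to encode and decode a string. The table in `sampleCodes` is what the program above should print for the sample frequencies (a=0, b=101, c=100, d=111, e=1101, f=1100); `sampleCodes`, `encode` and `decode` are hypothetical helpers for illustration, and bits are kept as '0'/'1' characters for readability rather than packed.

```cpp
#include <cassert>
#include <map>
#include <string>

// Code table for the sample frequencies (left edge = 0, right edge = 1).
std::map<char, std::string> sampleCodes()
{
    std::map<char, std::string> codes;
    codes['a'] = "0";   codes['b'] = "101";  codes['c'] = "100";
    codes['d'] = "111"; codes['e'] = "1101"; codes['f'] = "1100";
    return codes;
}

// Concatenate the codeword of every character in the text.
std::string encode(const std::string& text,
                   const std::map<char, std::string>& codes)
{
    std::string bits;
    for (char ch : text) {
        auto it = codes.find(ch);
        assert(it != codes.end());  // every character must have a codeword
        bits += it->second;
    }
    return bits;
}

// Read bits left to right and emit a character as soon as the buffered
// bits match a codeword. No separators are needed because every character
// is a leaf of the tree, so no codeword is a prefix of another.
std::string decode(const std::string& bits,
                   const std::map<char, std::string>& codes)
{
    std::map<std::string, char> reverse;
    for (const auto& entry : codes)
        reverse[entry.second] = entry.first;

    std::string text;
    std::string current;
    for (char bit : bits) {
        current += bit;
        auto hit = reverse.find(current);
        if (hit != reverse.end()) {
            text += hit->second;
            current.clear();
        }
    }
    assert(current.empty());  // leftover bits mean a truncated or invalid stream
    return text;
}
```

For example, `encode("abc", sampleCodes())` gives `0101100` (7 bits instead of 9 with the fixed 3-bit scheme), and `decode` turns it back into `abc`.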

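Writing this code made me reflect a little. I once read a post on languages for describing algorithms which argued that C is a better fit than C++. From my point of view, though, C++ still feels more natural. Using priority_queue here does not mean I couldn't write a binary heap myself. It just means that when describing the algorithm we can spend more of our attention on the algorithm itself, rather than building the car starting from the tires :)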
