Implementing Huffman Coding in Java

  This post explains how to implement Huffman coding in Java.

Concept

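  First, a brief introduction to Huffman coding. It is a data encoding scheme that also serves as a compression method: each character is converted to a binary string, and the storage needed for the original text shrinks in the process. The procedure is as follows. Sort the symbols by probability of occurrence, add the two smallest probabilities, and put the sum back among the remaining probabilities as a new entry. Re-sort, add the two smallest again, and repeat until the final sum is 1. At each addition, assign "0" to one of the two probabilities being added and "1" to the other. To read off a symbol's code, start at that symbol and follow the path to the final probability of 1, collecting the "0"s and "1"s met along the way. These are the code's bits from last to first, so reading them in reverse (from the root down to the symbol) gives the symbol's Huffman code.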
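The merging procedure described above can be sketched with a priority queue. This is a minimal illustration rather than the code used later in this post: the class `HuffmanBuildSketch`, the `HNode` type and the sample counts are all assumptions made up for the example, and it works on raw counts instead of probabilities (the tree shape is the same either way).

```java
import java.util.PriorityQueue;

/** Illustrative sketch of Huffman tree construction; all names are hypothetical. */
final class HuffmanBuildSketch {

    /** Leaves carry a symbol; internal nodes only carry the combined weight. */
    static final class HNode implements Comparable<HNode> {
        final long weight;
        final char symbol;      // meaningful only for leaves
        final HNode left;
        final HNode right;

        HNode(char symbol, long weight) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = null;
            this.right = null;
        }

        HNode(HNode left, HNode right) {
            this.symbol = '\0';
            this.weight = left.weight + right.weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }

        @Override
        public int compareTo(HNode other) {
            return Long.compare(weight, other.weight);
        }
    }

    /** Repeatedly merges the two lightest nodes until a single root remains. */
    static HNode build(char[] symbols, long[] counts) {
        PriorityQueue<HNode> queue = new PriorityQueue<>();
        for (int i = 0; i < symbols.length; i++) {
            queue.add(new HNode(symbols[i], counts[i]));
        }
        while (queue.size() > 1) {
            HNode first = queue.poll();
            HNode second = queue.poll();
            queue.add(new HNode(first, second));
        }
        return queue.poll();
    }

    /** Total encoded length in bits: the sum over all leaves of count * depth. */
    static long totalBits(HNode node, int depth) {
        if (node.isLeaf()) {
            return node.weight * depth;
        }
        return totalBits(node.left, depth + 1) + totalBits(node.right, depth + 1);
    }

    public static void main(String[] args) {
        char[] symbols = {'a', 'b', 'c', 'd'};
        long[] counts = {5, 2, 1, 1};
        HNode root = build(symbols, counts);
        System.out.println("root weight = " + root.weight);
        System.out.println("total bits  = " + totalBits(root, 0));
    }
}
```

With these sample counts the merges are 1+1, then 2+2, then 5+4, so the most frequent symbol ends up one edge below the root and the two rarest end up three edges below it.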

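Exercise Requirements

1. This exercise uses Huffman coding to source-encode existing English material.
2. Based on the given material, take the 26 letters as source symbols, count their probability distribution, and build a probability model of the source.
3. Encode each source symbol with the chosen coding technique.
4. Compress the English text and compute the compression efficiency.
5. Extension: list the actual source symbols of the material, build the source probability model, and encode them. (Not implemented this time.)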
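For requirement 4, one common way to express compression efficiency is the ratio of encoded bits to original bits. Here is a minimal sketch, assuming each original character is stored in 8 bits. The class name, method name and sample numbers are made up for illustration.

```java
import java.util.Locale;

/** Illustrative sketch; the names and numbers here are hypothetical. */
final class CompressionRatioSketch {

    /** Encoded size divided by original size, assuming 8 bits per original character. */
    static double ratio(long encodedBits, long originalChars) {
        return (double) encodedBits / (originalChars * 8L);
    }

    public static void main(String[] args) {
        // Hypothetical input: 9 letters that were encoded into 15 bits.
        double r = ratio(15, 9);
        System.out.println(String.format(Locale.ROOT, "compression ratio = %.2f%%", r * 100));
    }
}
```

A smaller ratio means better compression. A ratio of 1.0 would mean the encoding saved nothing compared with 8 bits per character.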

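Java Implementation

Method 1:
  The idea is to store the character codes in a binary tree, so that each character's code corresponds to one path through the tree. Traversing the paths from the root to every leaf node yields the code of each character.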
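The path-to-code idea can be sketched as a depth-first walk that records the bits taken on the way to each leaf. This is an illustrative sketch, not the listing that follows: `HuffmanCodeWalkSketch` and `TNode` are hypothetical names, the tree is built by hand, and the codes are keyed by symbol rather than by weight, which is one way to keep two equally frequent letters apart. It follows the same convention as the listing (left edge = 1, right edge = 0).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch of reading codes off a binary tree; all names are hypothetical. */
final class HuffmanCodeWalkSketch {

    static final class TNode {
        final Character symbol;   // null for internal nodes
        final TNode left;
        final TNode right;

        TNode(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        TNode(TNode left, TNode right) {
            this.symbol = null;
            this.left = left;
            this.right = right;
        }
    }

    /** Appends '1' when going left and '0' when going right; stores the path at each leaf. */
    static void collect(TNode node, StringBuilder path, Map<Character, String> codes) {
        if (node == null) {
            return;
        }
        if (node.symbol != null) {
            // A tree with a single symbol has no edges, so give that symbol the one-bit code "0".
            codes.put(node.symbol, path.length() == 0 ? "0" : path.toString());
            return;
        }
        path.append('1');
        collect(node.left, path, codes);
        path.setCharAt(path.length() - 1, '0');
        collect(node.right, path, codes);
        path.deleteCharAt(path.length() - 1);
    }

    public static void main(String[] args) {
        // Hand-built tree: 'b' and 'y' sit at the same depth, as two equally frequent letters would.
        TNode root = new TNode(new TNode('e'), new TNode(new TNode('b'), new TNode('y')));
        Map<Character, String> codes = new LinkedHashMap<>();
        collect(root, new StringBuilder(), codes);
        System.out.println(codes); // prints {e=1, b=01, y=00}
    }
}
```

Because every symbol sits at a leaf, no code produced this way is a prefix of another, which is what makes the encoded bit string decodable without separators.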

Code

The OtherPersonMethodHuffman class

import java.io.IOException;
import java.util.*;
import static java.lang.System.out;

/**
 * Builds the Huffman tree and runs the encoding/decoding demo
 * @author Canlong
 */
public class OtherPersonMethodHuffman {

    public static void main(String[] args) {
        double[] tempProb;
        try {
            tempProb = Test2.countNum();
        } catch (IOException e) {
            // Without the letter frequencies there is nothing to encode
            e.printStackTrace();
            return;
        }
        long[] probLong = new long[tempProb.length];
        // Scale each probability by 10^18 so it can be stored as a long
        for (int i = 0; i < tempProb.length; i++) {
            probLong[i] = Math.round(tempProb[i] * 1000000000000000000L);
        }
        String[] strCodeAll = countCodeStr(probLong);
        // Map from letter to its binary code
        Map<String, String> codeTable = new HashMap<>();
        // Map from binary code (key) to letter (value)
        HashMap<String, String> codeTableBinToChar = new HashMap<>();
        String fileStr = Test2.fileStr;
        // Short input for testing the Huffman encoder
        //fileStr = "afdjljaf-safsdf";
        for (int i = 0; i < strCodeAll.length; i++) {
            codeTable.put(String.valueOf((char) ('a' + i)), strCodeAll[i]);
            codeTableBinToChar.put(strCodeAll[i], String.valueOf((char) ('a' + i)));
        }
        // Encode letter by letter; characters other than a-z are skipped
        StringBuilder encoded = new StringBuilder();
        for (int i = 0; i < fileStr.length(); i++) {
            String key = String.valueOf(fileStr.charAt(i)).toLowerCase();
            if (codeTable.containsKey(key)) {
                encoded.append(codeTable.get(key));
            }
        }
        String codeFileStr = encoded.toString();
        // Use the measured length instead of a hard-coded bit count
        int encodedBits = codeFileStr.length();
        out.println("Huffman-encoded text: " + codeFileStr);
        out.println("Length of the Huffman-encoded text: " + encodedBits);
        out.println("Bits used by the Huffman encoding: " + encodedBits);
        String decodeStr = decodeStr(codeTable, codeFileStr);
        out.println("Decoded text: " + decodeStr);
        out.println("Length of the decoded text: " + decodeStr.length());
        int originalBits = decodeStr.length() * 8;
        out.println("Counting each character as 1 byte (8 bits), the text takes " + decodeStr.length() + "*8=" + originalBits + " bits.");
        out.println("Huffman compression ratio: " + encodedBits + "/" + originalBits + "=" + (double) encodedBits / originalBits * 100 + "%");
    }

    /**
     * Decoder for the Huffman code (currently handles only the 26 letters)
     * @param codeTable map from each letter to its code
     * @param codeFileStr the binary string to decode
     * @return the decoded string, i.e. the text before encoding
     */
    public static String decodeStr(Map<String,String> codeTable,String codeFileStr){
        StringBuilder decodeFileStr = new StringBuilder();
        boolean matched = true;
        // Stop when the input is used up, or when a full pass over the table matches nothing
        while(!codeFileStr.isEmpty() && matched) {
            matched = false;
            // Whenever the front of the remaining string equals a code in the table, append that code's letter to the result
            for (int i = 'a'; i <= 'z'; i++) {
                String codeStr = codeTable.get(String.valueOf((char) i));
                // Compare the code with the same number of bits at the front of the remaining string
                if (codeFileStr.startsWith(codeStr)) {
                    decodeFileStr.append((char) i);
                    // After decoding one segment, drop it from the front (this is the key step)
                    codeFileStr = codeFileStr.substring(codeStr.length());
                    matched = true;
                }
            }
        }
        return decodeFileStr.toString();
    }

    /**
     * Computes the code that corresponds to each probability
     * @param probInts array of (scaled) probabilities
     * @return array of code strings, one per letter
     */
    public static String[] countCodeStr(long[] probInts){
        String[] strCodes = new String[probInts.length];
        Node [] tempsnodes = new Node[probInts.length];
        for(int i=0;i<probInts.length;i++){
            tempsnodes [i]= new Node(probInts[i]);
        }
        List<Node> nodes = Arrays.asList(tempsnodes);
        Node node = OtherPersonMethodHuffman.build(nodes);
        PrintTree(node);
        // Find every root-to-leaf path
        FindShortestBTPath getPathTool = new FindShortestBTPath();
        ArrayList<ArrayList<Long>> pathArr =  getPathTool.FindAllPath(node,1);
        HashMap<Long,Long> mapNode = getPathTool.mapNode;
        // Print the binary tree
        // out.println(pathArr);
        // out.println(mapNode);
        for(ArrayList<Long> arr1 : pathArr){
            // Concatenate the 0/1 flags of the nodes along this path
            StringBuilder tempStr = new StringBuilder();
            for(int i=0;i<arr1.size();i++){
                tempStr.append(mapNode.get(arr1.get(i)));
            }
            // Walk over the probabilities
            for(int j=0;j<probInts.length;j++){
                // If the last node on the path (the leaf) equals one of our probabilities, store the path's bit string as that probability's code.
                // Letters with equal probabilities all end up with the code of the last matching path.
                if(probInts[j] == arr1.get(arr1.size()-1)){
                    strCodes[j]=tempStr.toString();
                }
            }
        }
        String[] realStrCodes = new String[strCodes.length];
        // Print the binary codes
        out.println();
        for(int i=0;i<strCodes.length;i++){
            // Drop the first flag, which belongs to the root rather than to an edge
            realStrCodes[i]=strCodes[i].substring(1);
            out.println("Code for letter "+(char)(i+'a')+": "+realStrCodes[i]);
        }
        return realStrCodes;
    }

    /**
     * Builds the Huffman tree
     * @param nodes the collection of nodes
     * @return the root of the resulting tree
     */
    private static Node build(List<Node> nodes) {
        nodes = new ArrayList<Node>(nodes);
        sortList(nodes);
        while (nodes.size() > 1) {
            createAndReplace(nodes);
        }
        return nodes.get(0);
    }

    /**
     * Combines the two nodes with the smallest weights and replaces them in the list with their parent
     * @param nodes the collection of nodes
     */
    private static void createAndReplace(List<Node> nodes) {
        Node left = nodes.get(0);
        Node right = nodes.get(1);
        Node parent = new Node(left.getValue() + right.getValue());
        parent.setLeftChild(left);
        parent.setRightChild(right);
        nodes.remove(0);
        nodes.remove(0);
        nodes.add(parent);
        sortList(nodes);
    }

    /**
     * Sorts the nodes in ascending order of weight
     */
    private static void sortList(List<Node> nodes) {
        Collections.sort(nodes);
    }

    /**
     * Prints the node values in pre-order (the node(left,right) format is commented out below)
     * @param node the root of the subtree to print
     */
    public static void PrintTree(Node node) {
        Node left = null;
        Node right = null;
        if(node!=null) {
            out.print(node.getValue());
            left = node.getLeftChild();
            right = node.getRightChild();
            //out.println("("+(left!=null?left.getValue():" ") +","+ (right!= null?right.getValue():" ")+")");

        }
        if(left!=null){ PrintTree(left); }
        if(right!=null){ PrintTree(right); }
    }
}

/**
 * Binary tree node
 * @author Canlong
 */
class Node implements Comparable<Node> {
    private long value;
    private Node leftChild;
    private Node rightChild;
    public Node(long value) {
        this.value = value;
    }
    public long getValue() {
        return value;
    }
    public void setValue(long value) {
        this.value = value;
    }
    public Node getLeftChild() {
        return leftChild;
    }
    public void setLeftChild(Node leftChild) {
        this.leftChild = leftChild;
    }
    public Node getRightChild() {
        return rightChild;
    }
    public void setRightChild(Node rightChild) {
        this.rightChild = rightChild;
    }
    @Override
    public int compareTo(Node that) {
        // Order nodes by weight, smallest first
        return Long.compare(this.value, that.value);
    }
}

/**
 * Finds the paths from the root to each leaf of the binary tree (this class is adapted from someone else's code)
 * @author Canlong
 *
 */
class FindShortestBTPath {

    // 1 marks a left child, 0 marks a right child
    int flag = 1;
    // Maps a node's value to the 0/1 flag it was reached with
    HashMap<Long,Long> mapNode = new HashMap<>();
    // Records all paths
    private ArrayList<ArrayList<Long>> allPaths = new ArrayList<ArrayList<Long>>();
    // Records the path currently being walked
    private ArrayList<Long> onePath = new ArrayList<Long>();

    // Returns all paths
    public ArrayList<ArrayList<Long>> FindAllPath(Node root,long tempFlag) {
        if(root == null){
            return allPaths;
        }
        // Add the current node to the path
        onePath.add(root.getValue());
        // Record this node's flag keyed by its value. Nodes with equal values share one
        // entry (the last one visited wins), so their codes can come out wrong.
        mapNode.put(root.getValue(),tempFlag);

        // If this is a leaf, add a copy of onePath to allPaths
        if(root.getLeftChild() == null && root.getRightChild() == null){
            allPaths.add(new ArrayList<Long>(onePath));
        }
        FindAllPath(root.getLeftChild(),1);
        FindAllPath(root.getRightChild(),0);
        // Drawing the recursion tree helps here: whether the leaf is a left or a right child, this step always runs, and it is essential
        onePath.remove(onePath.size() - 1);
        return allPaths;
    }
}

The Test2 class (implemented in an earlier post; for details see that post, "An Information Security Experiment: Building a Source Model")

import java.io.*;
import static java.lang.System.out;

/**
 * Builds the source model
 * @author Canlong
 */
public class Test2 {

    static String fileStr = "";
    public static void main(String[] args){
        // Task 1
        array5Col();
        // Task 2
        try {
            countNum();
        } catch (IOException e){
            throw new RuntimeException("Could not read the file", e);
        }
    }

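    /**
     * 1. Randomly generates a 1x5 array whose values follow the required source probabilities
     */
    public static void array5Col(){
        // 1 has probability 0.2, 2 has probability 0.3, 3 has probability 0.5
        int[] array = {1,1,2,2,2,3,3,3,3,3};
        // Array with one row and five columns
        int[][] array5 = new int[1][5];
        out.println("Source probabilities: 1 has probability 0.2, 2 has probability 0.3, 3 has probability 0.5. Generating a 1x5 array:");
        for(int i=0;i<array5[0].length;i++) {
            int randomNum = (int) Math.floor(Math.random() * 10);
            array5[0][i]=array[randomNum];
            out.print(array5[0][i]+",");
        }
        // Line break
        out.println();
    }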

    /**
     * 2. Counts the frequency of the 26 letters in the file and computes the information entropy
     * @return the frequency of each letter, indexed from 'a' to 'z'
     * @throws IOException if the file cannot be found or read
     */
    public static double[] countNum() throws IOException {
        // File path
        String strPath  = "C:/Users/hasee/Desktop/Types of Speech.txt";
        // Total number of occurrences of the 26 letters
        double sumAllNum = 0;
        // Stores the frequencies
        double[] frequency = new double[26];
        // Information entropy of the model
        double infoEntr = 0.0;

        // Holds the text of the file
        StringBuilder textStrBuilder = new StringBuilder();
        // Read the file; the reader is closed automatically, and line breaks are dropped when the lines are joined
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(strPath),"UTF-8"))) {
            String line;
            while((line=reader.readLine())!= null){
                textStrBuilder.append(line);
            }
        }
        String textStr = textStrBuilder.toString();
        out.println("Text to analyze:\r\n"+textStr);
        fileStr=textStr;
        textStr = textStr.toLowerCase();
        // Count how often each letter occurs in the string
        char[] textChar = textStr.toCharArray();
        // Row 0 holds the 26 letters, row 1 their counts
        char[][] char26AndNum = new char[2][26];
        // Fill row 0 with the 26 letters
        // Character code of 'a'
        int intA = 97;
        // One past the character code of 'z' (122), used as the exclusive upper bound
        int intZ = 123;
        for(int i=intA;i<intZ;i++){
            char26AndNum[0][i-intA]=(char)(i);
        }
        // Compare each character of the text with the 26 letters and count the matches
        for(int i=0;i<textChar.length;i++){
            // Approach 1: loop over the 26 letters and test for equality
//            for(int j=0;j<char26AndNum[0].length;j++){
//                // If the characters are equal, increment the matching entry of the 2D array
//                if(Character.toString(textChar[i]).equals(Character.toString(char26AndNum[0][j]))){
//                    char26AndNum[1][j]++;
//                }
//            }
            // Approach 2: use (letter - 'a') as the array index and increment that element directly
            if(textChar[i] >= 'a' && textChar[i]<='z'){
                char26AndNum[1][textChar[i]-'a']++;
            }
        }
        // Sum the counts of the 26 letters
        for(int i=0;i<char26AndNum[1].length;i++){
            sumAllNum += (double)char26AndNum[1][i];
        }
        out.println("Total count: "+sumAllNum);
        // Compute the frequencies
        for(int i=0;i<char26AndNum[1].length;i++) {
            frequency[i] = char26AndNum[1][i] / sumAllNum;
            out.println("Letter: " + char26AndNum[0][i] + ", count: " + (int) char26AndNum[1][i] + ", frequency: " + frequency[i]);

            if (frequency[i] != 0) {
                // Accumulate the entropy: H = sum of p * log2(1/p)
                infoEntr -= frequency[i] * (Math.log(frequency[i]) / Math.log(2));
            }
        }
        out.println("Information entropy: "+infoEntr);
        return frequency;
    }
}

The Types of Speech.txt file used

Standard usage includes those words and expressions understood, used, and accepted by a majority of the speakers of a language in any situation regardless of the level of formality. As such, these words and expressions are well defined and listed in standard dictionaries. Colloquialisms, on the other hand, are familiar words and idioms that are understood by almost all speakers of a language and used in informal speech or writing, but not considered appropriate for more formal situations. Almost all idiomatic expressions are colloquial language. Slang, however, refers to words and expressions understood by a large number of speakers but not accepted as good, formal usage by the majority. Colloquial expressions and even slang may be found in standard dictionaries but will be so identified. Both colloquial usage and slang are more common in speech than in writing.Colloquial speech often passes into standard speech. Some slang also passes into standard speech, but other slang expressions enjoy momentary popularity followed by obscurity. In some cases, the majority never accepts certain slang phrases but nevertheless retains them in their collective memories. Every generation seems to require its own set of words to describe familiar objects and events. It has been pointed out by a number of linguists that three cultural conditions are necessary for the creation of a large body of slang expressions. First, the introduction and acceptance of new objects and situations in the society; second, a diverse population with a large number of subgroups; third, association among the subgroups and the majority population.Finally, it is worth noting that the terms 'standard' 'colloquial' and 'slang' exist only as abstract labels for scholars who study language. Only a tiny number of the speakers of any language will be aware that they are using colloquial or slang expressions. 
Most speakers of English will, during appropriate situations, select and use all three types of expressions. 

Results
Text to analyze:
Standard usage includes those words and expressions understood, used, and accepted by a majority of the speakers of a language in any situation regardless of the level of formality. As such, these words and expressions are well defined and listed in standard dictionaries. Colloquialisms, on the other hand, are familiar words and idioms that are understood by almost all speakers of a language and used in informal speech or writing, but not considered appropriate for more formal situations. Almost all idiomatic expressions are colloquial language. Slang, however, refers to words and expressions understood by a large number of speakers but not accepted as good, formal usage by the majority. Colloquial expressions and even slang may be found in standard dictionaries but will be so identified. Both colloquial usage and slang are more common in speech than in writing.Colloquial speech often passes into standard speech. Some slang also passes into standard speech, but other slang expressions enjoy momentary popularity followed by obscurity. In some cases, the majority never accepts certain slang phrases but nevertheless retains them in their collective memories. Every generation seems to require its own set of words to describe familiar objects and events. It has been pointed out by a number of linguists that three cultural conditions are necessary for the creation of a large body of slang expressions. First, the introduction and acceptance of new objects and situations in the society; second, a diverse population with a large number of subgroups; third, association among the subgroups and the majority population.Finally, it is worth noting that the terms 'standard' 'colloquial' and 'slang' exist only as abstract labels for scholars who study language. Only a tiny number of the speakers of any language will be aware that they are using colloquial or slang expressions. 
Most speakers of English will, during appropriate situations, select and use all three types of expressions. 
Total count: 1651.0
Letter: a, count: 152, frequency: 0.09206541490006057
Letter: b, count: 29, frequency: 0.01756511205330103
Letter: c, count: 47, frequency: 0.028467595396729255
Letter: d, count: 69, frequency: 0.041792852816474865
Letter: e, count: 181, frequency: 0.1096305269533616
Letter: f, count: 33, frequency: 0.019987886129618413
Letter: g, count: 38, frequency: 0.02301635372501514
Letter: h, count: 44, frequency: 0.026650514839491216
Letter: i, count: 111, frequency: 0.06723198061780739
Letter: j, count: 7, frequency: 0.004239854633555421
Letter: k, count: 5, frequency: 0.0030284675953967293
Letter: l, count: 86, frequency: 0.05208964264082374
Letter: m, count: 35, frequency: 0.021199273167777106
Letter: n, count: 121, frequency: 0.07328891580860085
Letter: o, count: 139, frequency: 0.08419139915202907
Letter: p, count: 42, frequency: 0.025439127801332527
Letter: q, count: 8, frequency: 0.004845548152634767
Letter: r, count: 107, frequency: 0.06480920654149
Letter: s, count: 154, frequency: 0.09327680193821926
Letter: t, count: 122, frequency: 0.07389460932768019
Letter: u, count: 54, frequency: 0.03270745003028468
Letter: v, count: 9, frequency: 0.005451241671714113
Letter: w, count: 19, frequency: 0.01150817686250757
Letter: x, count: 10, frequency: 0.0060569351907934586
Letter: y, count: 29, frequency: 0.01756511205330103
Letter: z, count: 0, frequency: 0.0
Information entropy: 4.1696890030146925
99999999999999997640218049666868563218534221683827982492065414900060560932768019382192642168382798304058081072077528770442085208964264082373655118110236220472266505148394912162846759539672925610963052695336160059781950333131434426892792247122955012840702604482131863597819503331318308903694730466421332525741974561060569351907934597268322228952151302846759539673003028467595396730423985463355542117565112053301032327074500302846766480920654149000014052089642640823267231980617807384732889158086008483288915808600847941532404603270745007389460932768019279345850999394308375529981829194441756511205330103219987886129618412417928528164748641756511205330102948419139915202907291459721380981222430042398546335542119927316777710421804966686856450102967898243488804845548152634767545124167171411311508176862507570484554815263476682301635372501514025439127801332528
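Code for letter a: 111
Code for letter b: 001011
Code for letter c: 10100
Code for letter d: 00100
Code for letter e: 100
Code for letter f: 001010
Code for letter g: 000001
Code for letter h: 10101
Code for letter i: 0101
Code for letter j: 01111100
Code for letter k: 011111000
Code for letter l: 1011
Code for letter m: 000011
Code for letter n: 0100
Code for letter o: 0001
Code for letter p: 000000
Code for letter q: 00001011
Code for letter r: 0110
Code for letter s: 110
Code for letter t: 0011
Code for letter u: 01110
Code for letter v: 00001010
Code for letter w: 0000100
Code for letter x: 0111111
Code for letter y: 001011
Code for letter z: 011111001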
huffman 編碼後爲:1100011111010000100111011000100011101101110000011000101010010100101101110001001001100011101010001110100000010000010110001001101110100001001000111111000000011010011011001010001010011001110010000100100011011000110001000100100011101101000010011101000010011110100101001000000000011100001000010110010111110000111110111110000010110010100110010110001001010001110101100110000000100111011111000100011011000010010101111011111010000000101110111000001100010101001110100001011110010100110111011100110101000101000110100000001111011000100101110011011000010010100011101011001011100000010101001011000100101000101000010110000011111101101010011001011111110110011101010010101001110101100110100000010000010110001001101110100001001000111111000000011010011011001010001010011011101101000000100100101110110010010000101001010100100001001110100001001011010111000111000010001010100110001111101000010011101100010000100010110100001101010001010011101100101100110101000001101110110001000010110111001011111011010111000001111000010100001110101100000100111010110001101010111101000010011101101000010101110000110101101101011110110000010000010110001001101110100001000101001000101000100001111000111010111100111110110100011100100001001000110110001100010001001000010110010111111011000011000111000111111011101111000000010011101111100010001101100001001010111101111101000000010111011100000110011101000010001110110100001000101010001010100001010000101100000111111011110000000100100101001010100010110000010001100101001101010100000001001011011100011010000010011101000001010011001010010010001101000010011100000000000001100001000000011001011110011100001010000101100000110001011010000101000010110000011111101111001010011011101110011010100010100110111101100001100011100011111101110110101001000101000100001111100110101101001000111111000000011010011011001010001010011011101101001010000011011101100010000101101110010111110111011111010000000101110111000001100110101111101000000011010100010000100100000010101000110011010000101010001101100011000
10000100000101100010011011101000010010001111110000000110100110110010100010100110011100100001001000110110001100010001001000010110010111111011111011000000110001000111000001100101110001100001001010110000000100111011111000100011011000101101110001101000001001111110100101001000000000011100001001111100000010001000100100001010000101100000111111011011101101110000011000010110010110011101011000000111110111110000010110010100110010111010000011011101100010000101101110010111110111000111111000000011010011011001010001010011011101000010010000001010100010011010111110100000001000011111001011001011100001010000101110010000100010101001100011111010000100111011000100001000101101000011010100010100111011001011001100010110111000110000100010110111011001011100110000101010010010001000011010100101001011000010000101100010011101011010000011011101100010000101101110010111110110111011011100000110011101000010011010111110100000001111011010000001100010110100101000001000011000011000101000101010011000000010010010100101010011101011110100010101000000100011001010011010101000000011010000011011101100010000101101110010111110111100000001001001010010101000100101000111000100000000111110110100110010101000011000111000111110100001001110110001001100000001001001010010101110000100001110011010111110100000001111101111000010000001111101101001100101010000110001110001111101000010011101100010011000000010010010100101010010110111000110001001110101100011011010111110100000001100011111100000001101001101100101000101001101000100011111000001001011000011000100001110001000011111011000101100000000010000000111010111110110010100110010110010100001101110110001000010010000100001011001011000100101111010100011100110010100110010110101010011000010000111001010011111010011000111010110000001111101111100000101100101001100101101001000000101010001101111010010100100000000001111010100100011000111110101010011010111110100000001000000101010110111110100110001011011100011010010000001010100011000111010110010111001101100110100001111101010100110001110101100000011
01010100001110101100010101101010000011011101110010100001101010000101010000001110000001100010110010110011010000001010100011000101100000110001001000110111001101010001010011010010000001111000110001011010000001011011100101011010001010011110000100001000100110100001100010010100000100000101100010011000110001001001001101010001100101001011100001010111000011010110110101111011000010010110111110010010100001111011101000010010000001010100010000111100101001110101111110001011100100010000000000010101010000111000010000010111000110010110010111110100011100000110010111000110000100101010110101010000000101110010111000111100011101011110011001110101011010010010100011101011001101110011011110111010000010100001000101001101010001010011011101101000100100101001001101101110110001011001010000101100011101011001010001101001110011010100010100000100101011110111110110000001100001011000100100001011000100101011010111110100000001100011111100000001101001101100101000101001100010100101011011000110011101011000101010000110110000100100011101010000110101000101001110100001001111010010100100000000001111101001010010000010010100100100000010000010010110111110010010100001111011101000010011001010011011101110011010100010100110010101000011101011001100001101000101100001100101111010010100000101000010011100100010100001010100011011010000000000010000000111010111110011010100010100000010001010011101011111011111011000000110001000111000001100101110001100001001010110011100010110000010110000101110000000110001110101010101100010011111011000011010001011110011010100010100111000011000101000000010011101011001100111000101100000101100001011100000001101110100001000011101011000000111110111110000010110010100110010110000000001000000011101011111001101010001010000101001010100111101110110010110101001101011100000100000101100011101010100000100110101010000000100111010111100110011101011000011100011000001111011000111110100001001110110001001010000011011101100010000101101110010111110111110100001001101011111010000000110001111110101110001100010100101100101
1111110111001011110001101101111010000111011111001011100101111000101000010110110101001010100011011111011011000001001010100011100011011100010000101110111110100000001011101110000011000001010010110010111110011010101000010110100011100000110010111000110000100101000111010110011000000010011101111100010001101100001001010111010000101110111110100000001011101110000011000000100010110111011001011100111000010011101101000011101011110011001110101100001011111011010001110110010101000000011010000011011101100010000101101110010111110110001011011010111110100000001100011111100000001101001101100101000101001100000110001110001111000000010011101111100010001101100001001010100010000000110110101110101010000100010110111011001000111001100101010000000111100000000000001100001000000011001011110011100110010100110111011100110101000101001101101001011100101000011111010000100011101101001111011101100111010101101001000011001011000000100110000100101010001111110000000110100110110010100010100110
Length of the Huffman-encoded text: 6951
Bits used by the Huffman encoding: 6851
huffman譯碼後的原文:standardusageincludesthosewordsandexpressionsunderstoodusedandacceptedybamajorityofthespeajnrsofalanguageinanysituationregardlessofthelevelofformalityassuchthesewordsandexpressionsarewelldefinedandlistedinstandarddictionariescolloquialismsontheotherhandarefamiliarwordsandidiomsthatareunderstoodybalmostallspeajnrsofalanguageandusedininformalspeechorwritingyutnotconsideredappropriateformoreformalsituationsalmostallidiomaticexpressionsarecolloquiallanguageslanghoweverreferstowordsandexpressionsunderstoodybalargenumyerofspeajnrsyutnotacceptedasgoodformalusageybthemajoritycolloquialexpressionsandevenslangmabyefoundinstandarddictionariesyutwillyesoidentifiedyothcolloquialusageandslangaremorecommoninspeechthaninwritingcolloquialspeechoftenpassesintostandardspeechsomeslangalsopassesintostandardspeechyutotherslangexpressionsenjoymomentarypopularityfollowedyboyscurityinsomecasesthemajorityneveracceptscertainslangphrasesyutneverthelessretainsthemintheircollectivememorieseverygenerationseemstorequireitsownsetofwordstodescriyefamiliaroyzfoauedeventsithasyeenpointedoutybanumyeroflinguiststhatthreeculturalconditionsarenecessaryforthecreationofalargeyodyofslangexpressionsfirsttheintroductionandacceptanceofnewoyzfoauedsituationsinthesocietysecondadiversepopulationwithalargenumyerofsuygroupsthirdassociationamongthesuygroupsandthemajoritypopulationfinallyitisworthnotingthatthetermsstandardcolloquialandslangexistonlyasabstractlabelsforscholarswhostudylanguageonlyatinynumyerofthespeajnrsofanylanguagewillyeawarethattheyareusingcolloquialorslangexpressionsmostspeajnrsofenglishwillduringappropriatesituationsselectanduseallthreetypesofexpressions
Length of the decoded text: 1649
Counting each character as 1 byte (8 bits), the text takes 1649*8=13192 bits.
Huffman compression ratio: 6851/13192=51.932989690721655%

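Method 2:
  I wanted to try an approach that does not use a binary tree, but I ran out of time and have not written it yet. I am leaving this as a marker and will fill it in when I have time.

Summary

  I found this exercise fairly difficult. It requires a good grasp of binary trees and of Huffman coding, especially of how each character is turned into a binary string during encoding. Although I got it working, it still has many shortcomings. For example, problems can arise when two characters have the same frequency, and only the 26 letters are converted to binary strings; other characters are not handled yet. I also have not considered time or space complexity, so there is room for optimization. When decoding, I walk through the encoded string and, at each position, loop over the code table: I take a letter's code, read its length, cut a substring of that length from the front of the string to be decoded, and compare the two. This is probably inefficient, but I have not thought of a better way yet.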
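One possible alternative to the decoding loop described above is a single left-to-right pass with a reverse lookup from code to letter. This is a sketch under assumptions, not a drop-in replacement for the listing: `PrefixDecodeSketch` and its sample code table are hypothetical, and the approach only works when no code is a prefix of another, which holds when each code is read off a distinct leaf of the tree.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of single-pass prefix-code decoding; all names are hypothetical. */
final class PrefixDecodeSketch {

    /** Decodes a bit string by growing a buffer until it matches a code, then emitting that symbol. */
    static String decode(Map<Character, String> codeTable, String bits) {
        // Invert the table once so each lookup is a single hash probe.
        Map<String, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> entry : codeTable.entrySet()) {
            reverse.put(entry.getValue(), entry.getKey());
        }
        StringBuilder decoded = new StringBuilder();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            current.append(bits.charAt(i));
            Character symbol = reverse.get(current.toString());
            if (symbol != null) {
                decoded.append(symbol);
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("trailing bits do not form a complete code");
        }
        return decoded.toString();
    }

    public static void main(String[] args) {
        // Assumed prefix-free table for three letters.
        Map<Character, String> codeTable = new HashMap<>();
        codeTable.put('e', "1");
        codeTable.put('b', "01");
        codeTable.put('y', "00");
        System.out.println(decode(codeTable, "01001")); // prints bye
    }
}
```

Each input bit is examined once, so the scan no longer depends on looping over all 26 table entries at every position.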
