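Hadoop does not support the LZO compression format by default.
LZO-compressed data larger than an HDFS block can be split once it has been indexed, which can improve job processing efficiency in production.
Components required for LZO support: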
lzo
lzop
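hadoop-gpl-packaging: its main job is to build an index for LZO-compressed files. Without an index, a compressed file is processed by a single map task, regardless of whether it is larger than the HDFS block size.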
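To make the effect of the index concrete, here is a small sketch of the arithmetic. The function name `estimate_map_tasks`, the 128 MB block size, and the 300 MB file size are illustrative assumptions. The result is only an approximation, because real splits are aligned to the LZO block boundaries recorded in the index.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of hadoop-lzo): estimate how many map tasks
# an .lzo file gets, following the rule described above.
#   $1 = file size in bytes
#   $2 = HDFS block size in bytes
#   $3 = "yes" if an index exists for the file, anything else otherwise
estimate_map_tasks() {
  local size=$1 block=$2 indexed=$3
  if [ "$indexed" != "yes" ] || [ "$size" -le "$block" ]; then
    # No index (or the file fits in one block): a single map reads it all.
    echo 1
  else
    # Indexed: roughly one map per HDFS block, rounded up.
    echo $(( (size + block - 1) / block ))
  fi
}

block=$((128 * 1024 * 1024))   # assumed block size: 128 MB
size=$((300 * 1024 * 1024))    # assumed file size: 300 MB
echo "without index: $(estimate_map_tasks "$size" "$block" no)"
echo "with index:    $(estimate_map_tasks "$size" "$block" yes)"
```

With these assumed sizes the unindexed file gets 1 map task and the indexed file gets about 3. In the hadoop-lzo project built later in this post, the index itself is created by running its `com.hadoop.compression.lzo.LzoIndexer` class against the .lzo file.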
Compile and install lzo:
# Install the build dependencies
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool

# Build lzo
cd /usr/local/src
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
cd lzo-2.06
export CFLAGS=-m64
./configure --enable-shared --prefix=/usr/local/lzo
make
make install

Note: once lzo has been built and installed, the following directories appear under /usr/local/lzo/.
[root@hadoop004 lzo]# ll
total 0
drwxr-xr-x 3 root root  17 Apr 19 10:30 include
drwxr-xr-x 2 root root 103 Apr 19 10:30 lib
drwxr-xr-x 3 root root  17 Apr 19 10:30 share

# Locate the lzop command:
[root@hadoop004 lzo]# which lzop
/usr/bin/lzop

# lzop usage and a compression test
Compress:   lzop -v filename
Decompress: lzop -dv filename
[root@hadoop004 hadoop]# du -sh *
69M access.log
16M access.log.lzo
Note: the .lzo file is the compressed output. Here 69M shrinks to 16M, less than a quarter of the original size.
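Before moving on, it can help to confirm that the install prefix really contains what a later native build needs. The following is a minimal sketch under stated assumptions: the helper name `check_lzo_prefix` is hypothetical, and it only looks for the `lzo/lzo1x.h` header and a `liblzo2.so*` shared library, which is what an lzo 2.x build with shared libraries enabled normally installs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that an lzo install prefix contains the header
# and shared library that a later native build will look for.
#   $1 = install prefix, e.g. /usr/local/lzo
check_lzo_prefix() {
  local prefix=$1
  # lzo 2.x installs its headers under include/lzo/
  if [ ! -f "$prefix/include/lzo/lzo1x.h" ]; then
    echo "missing header: $prefix/include/lzo/lzo1x.h" >&2
    return 1
  fi
  # A shared build should produce liblzo2.so (possibly with a version suffix).
  if ! ls "$prefix"/lib/liblzo2.so* >/dev/null 2>&1; then
    echo "missing shared library under $prefix/lib" >&2
    return 1
  fi
  echo "lzo looks installed under $prefix"
}

# Reports the problem and carries on if something is missing.
check_lzo_prefix /usr/local/lzo || echo "lzo install check failed"
```

The check is deliberately shallow: it confirms the files exist, not that they were built for the right architecture.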
Install hadoop-lzo:
# Preparation for building hadoop-lzo
cd /usr/local/src
wget <URL of the hadoop-lzo master.zip archive>
unzip master.zip
cd hadoop-lzo-master/

Note: the Hadoop in use here is 2.6.0, so change hadoop.current.version in pom.xml to 2.6.0:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.6.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

Set the environment variables that the hadoop-lzo build uses:
export CFLAGS=-m64
export CXXFLAGS=-m64
Note: set the next two variables to the actual paths on your own machine. They must point at the include and lib directories of the lzo installation (with the prefix used when building lzo above, that would be /usr/local/lzo/include and /usr/local/lzo/lib).
export C_INCLUDE_PATH=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lzo/include
export LIBRARY_PATH=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lzo/lib

# Start the build
mvn clean package -Dmaven.test.skip=true

Note: output of a successful build
[INFO] Building jar: /usr/local/src/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:08 min
[INFO] Finished at: 2019-04-19T10:39:55+08:00
[INFO] Final Memory: 28M/136M
[INFO] ------------------------------------------------------------------------

# The build produces hadoop-lzo-0.4.21-SNAPSHOT.jar in the target directory; this file is important
[root@hadoop004 target]# pwd
/usr/local/src/hadoop-lzo-master/target
[root@hadoop004 target]# ll
total 432
drwxr-xr-x 2 root root   4096 Apr 19 10:39 antrun
drwxr-xr-x 5 root root   4096 Apr 19 10:39 apidocs
drwxr-xr-x 5 root root     77 Apr 19 10:39 classes
drwxr-xr-x 3 root root     25 Apr 19 10:39 generated-sources
-rw-r--r-- 1 root root 188906 Apr 19 10:39 hadoop-lzo-0.4.21-SNAPSHOT.jar
-rw-r--r-- 1 root root 185078 Apr 19 10:39 hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
-rw-r--r-- 1 root root  52021 Apr 19 10:39 hadoop-lzo-0.4.21-SNAPSHOT-sources.jar
drwxr-xr-x 2 root root     71 Apr 19 10:39 javadoc-bundle-options
drwxr-xr-x 2 root root     28 Apr 19 10:39 maven-archiver
drwxr-xr-x 3 root root     28 Apr 19 10:39 native
drwxr-xr-x 3 root root     18 Apr 19 10:39 test-classes
Note: the file we ultimately need is hadoop-lzo-0.4.21-SNAPSHOT.jar.
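The version change in pom.xml can also be scripted instead of edited by hand. This is a rough sketch, assuming the property sits on a single line as in the snippet above. The helper name `set_hadoop_version` is hypothetical, and it writes through a temporary file rather than relying on in-place sed flags, which differ between platforms.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the <hadoop.current.version> property in a pom.
#   $1 = path to pom.xml
#   $2 = Hadoop version to build against, e.g. 2.6.0
set_hadoop_version() {
  local pom=$1 version=$2
  # Write to a temp file first so a failed sed never truncates the pom.
  sed "s|<hadoop.current.version>[^<]*</hadoop.current.version>|<hadoop.current.version>${version}</hadoop.current.version>|" \
    "$pom" > "$pom.tmp" && mv "$pom.tmp" "$pom"
}

# Example: run from the hadoop-lzo-master/ directory.
if [ -f pom.xml ]; then
  set_hadoop_version pom.xml 2.6.0
  grep '<hadoop.current.version>' pom.xml
fi
```

Only the `hadoop.current.version` line is touched; `hadoop.old.version` and the rest of the file are left as they are.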