Enabling LZO Compression in a Big Data Environment

I. Prerequisites:
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool

II. Installing LZO
1. Extract, build, and install
cd /opt/software
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.09.tar.gz
tar -zxvf lzo-2.09.tar.gz
cd lzo-2.09
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install

2. Copy the library files
Copy the contents of /usr/local/hadoop/lzo/lib/ to /usr/lib/ and /usr/lib64/:
cp /usr/local/hadoop/lzo/lib/* /usr/lib/
cp /usr/local/hadoop/lzo/lib/* /usr/lib64/

3. Edit the environment variables (vi ~/.bash_profile) and add the following:
export PATH=/usr/local/hadoop/lzo/:$PATH

III. Installing LZOP
1. Download and extract
cd /opt/software
wget http://www.lzop.org/download/lzop-1.04.tar.gz
tar -zxvf lzop-1.04.tar.gz

2. Before building, add the following to the environment variables file (~/.bash_profile):
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/
Note: without this variable, the build fails with: configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'.

3. Enter the extracted directory, then build and install
cd /opt/software/lzop-1.04
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install

4. Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop

5. Test lzop
Run: lzop nohup.out
If a compressed file with the .lzo suffix (here nohup.out.lzo) is produced, the installation succeeded.
Note: the test may fail with: lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory
Fix: add an environment variable (~/.bash_profile): export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
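As an optional extra check (not part of the original steps), every file lzop writes begins with a fixed nine-byte magic signature, so a quick header inspection confirms the output really is an lzop archive. A small Python sketch; the filename in the comment is illustrative:

```python
# The lzop container format starts with a fixed nine-byte magic signature;
# checking it is a quick way to confirm a .lzo file really came from lzop.
LZOP_MAGIC = b"\x89LZO\x00\r\n\x1a\n"

def is_lzop_file(header: bytes) -> bool:
    """Return True if the given file header begins with the lzop magic."""
    return header.startswith(LZOP_MAGIC)

# e.g. with open("nohup.out.lzo", "rb") as f: print(is_lzop_file(f.read(9)))
print(is_lzop_file(LZOP_MAGIC + b"compressed-payload"))  # -> True
```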

IV. Installing Hadoop-LZO
Note: the build requires Maven; set up Maven beforehand.
1. Download the source: https://github.com/twitter/hadoop-lzo

2. Extract and build:
cd /opt/software/hadoop-lzo-release-0.4.19
mvn clean package -Dmaven.test.skip=true

3. When the build completes, run:
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /app/hadoop-2.6.0-cdh5.7.0/lib/native
cp target/hadoop-lzo-0.4.19.jar /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/

In a cluster environment, the next step is to sync /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar and /app/hadoop-2.6.0-cdh5.7.0/lib/native/ to every other Hadoop node.
Note: make sure the user running Hadoop has execute permission on the files under /app/hadoop-2.6.0-cdh5.7.0/lib/native/.

V. Generating the index file
cd /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common
hadoop jar hadoop-lzo-0.4.19.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/page_views_parquet1/page_views_parquet.lzo
Note: the .lzo file must reside on HDFS.
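Why the index matters: an .lzo file is a sequence of independently compressed blocks, but without knowing the block boundaries MapReduce cannot split the file across tasks; the .index file simply records the byte offset where each block starts. A minimal Python sketch of the idea, using zlib blocks as a stand-in for LZO (all names here are illustrative, not Hadoop's actual code):

```python
import io
import zlib

def write_blocks(records, block_size=2):
    """Compress records in independent blocks; return (data, offsets).

    The offsets list plays the role of the .lzo.index file: it records
    where each compressed block begins inside the file.
    """
    out = io.BytesIO()
    offsets = []
    for i in range(0, len(records), block_size):
        offsets.append(out.tell())  # like an entry in the .index file
        block = "\n".join(records[i:i + block_size]).encode()
        out.write(zlib.compress(block))
    return out.getvalue(), offsets

def read_block(data, offsets, n):
    """A split can start decompressing at any indexed offset independently."""
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return zlib.decompress(data[offsets[n]:end]).decode().split("\n")

data, idx = write_blocks(["a", "b", "c", "d", "e"])
print(read_block(data, idx, 1))  # second block read without touching the first
```

With the offsets available, each "split" can be handed to a separate task, which is exactly what LzoIndexer enables for .lzo files on HDFS.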

This completes the installation of the LZO compression tools on CentOS 7.

Enabling LZO Compression in Hadoop

Configure Hadoop to use LZO compression, then verify it. Steps:
I. Edit Hadoop's hadoop-env.sh file and add the following:
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib

II. Edit core-site.xml and add the following:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         org.apache.hadoop.io.compress.SnappyCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

III. Edit mapred-site.xml and add the following:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>

IV. Test with Hadoop's bundled wordcount program
1. Generate an .lzo output file
cd /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /input/test1.txt /output/wc2
Result:
[hadoop@spark220 mapreduce]$ hdfs dfs -ls /output/wc2
Found 2 items
-rw-r--r--   1 hadoop supergroup     0 2018-03-17 00:21 /output/wc2/_SUCCESS
-rw-r--r--   1 hadoop supergroup   113 2018-03-17 00:21 /output/wc2/part-r-00000.lzo
2. Generate the index file:
cd /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common
hadoop jar hadoop-lzo-0.4.19.jar com.hadoop.compression.lzo.LzoIndexer /output/wc2/part-r-00000.lzo
Log:
18/03/17 00:23:05 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
18/03/17 00:23:05 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 049362b7cf53ff5f739d6b1532457f2c6cd495e8]
18/03/17 00:23:06 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /output/wc2/part-r-00000.lzo, size 0.00 GB...
18/03/17 00:23:07 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
18/03/17 00:23:07 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.80 seconds (0.00 MB/s). Index size is 0.01 KB.
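For context on why com.hadoop.compression.lzo.LzopCodec must be listed in io.compression.codecs: Hadoop's CompressionCodecFactory picks a codec for an input file by matching the file's suffix against the registered codecs, roughly like the following Python sketch (a deliberate simplification, not Hadoop's actual implementation):

```python
# Simplified stand-in for Hadoop's CompressionCodecFactory: each codec class
# advertises a default file suffix, and the factory matches file names against
# the suffixes of the codecs registered in io.compression.codecs.
CODEC_SUFFIXES = {
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
}

def pick_codec(path, registered):
    """Return the first registered codec whose suffix matches, else None."""
    for codec in registered:
        suffix = CODEC_SUFFIXES.get(codec)
        if suffix and path.endswith(suffix):
            return codec
    return None  # unregistered suffix -> the file is read as raw bytes

print(pick_codec("/output/wc2/part-r-00000.lzo", list(CODEC_SUFFIXES)))
# -> com.hadoop.compression.lzo.LzopCodec
```

If LzopCodec is missing from the registered list, an .lzo file gets no codec at all and is read as uncompressed garbage, which is why the core-site.xml entry above is mandatory.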
Result:
[hadoop@spark220 common]$ hdfs dfs -ls /output/wc2
Found 3 items
-rw-r--r--   1 hadoop supergroup     0 2018-03-17 00:21 /output/wc2/_SUCCESS
-rw-r--r--   1 hadoop supergroup   113 2018-03-17 00:21 /output/wc2/part-r-00000.lzo
-rw-r--r--   1 hadoop supergroup     8 2018-03-17 00:23 /output/wc2/part-r-00000.lzo.index
This completes the Hadoop configuration and testing.

Enabling LZO Compression in Spark

Configure Spark to use LZO compression. Steps:
I. spark-env.sh configuration
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/*:/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/*
II. spark-defaults.conf configuration
spark.driver.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
spark.executor.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
Note: these point at the hadoop-lzo jar produced by the build above.
III. Testing
1. Read an .lzo file
spark-shell --master local[2]
scala> import com.hadoop.compression.lzo.LzopCodec
scala> val page_views = sc.textFile("/user/hive/warehouse/page_views_lzo/page_views.dat.lzo")
2. Write an .lzo file
spark-shell --master local[2]
scala> import com.hadoop.compression.lzo.LzopCodec
scala> val lzoTest = sc.parallelize(1 to 10)
scala> lzoTest.saveAsTextFile("/input/test_lzo", classOf[LzopCodec])
Result:
[hadoop@spark220 common]$ hdfs dfs -ls /input/test_lzo
Found 3 items
-rw-r--r--   1 hadoop supergroup     0 2018-03-16 23:24 /input/test_lzo/_SUCCESS
-rw-r--r--   1 hadoop supergroup    60 2018-03-16 23:24 /input/test_lzo/part-00000.lzo
-rw-r--r--   1 hadoop supergroup    61 2018-03-16 23:24 /input/test_lzo/part-00001.lzo
This completes the Spark configuration and testing.

IV. Problems encountered during configuration and testing
1. Native libraries referenced but LD_LIBRARY_PATH not set
1.1 Error:
Caused by: java.lang.RuntimeException: native-lzo library not available
    at com.hadoop.compression.lzo.LzopCodec.getDecompressorType(LzopCodec.java:120)
    at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:111)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:246)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:203)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
1.2 Fix: edit spark-env.sh under Spark's conf directory and add the following:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/*:/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/*
2. LzopCodec class not found
2.1 Error:
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzopCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzopCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
2.2 Fix: edit spark-defaults.conf under Spark's conf directory and add the following:
spark.driver.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
spark.executor.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar