Common Hadoop Errors Explained

1:Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out 

Answer:
The job needs to open many files for processing. The system default limit is usually 1024 (check it with ulimit -a), which is enough for ordinary use but far too low for Hadoop.
Fix:
Modify two files.
       /etc/security/limits.conf
vi /etc/security/limits.conf
Add:
* soft nofile 102400
* hard nofile 409600

$cd /etc/pam.d/
$sudo vi login
       Add:        session required     /lib/security/pam_limits.so

A correction to the answer for problem 1:
This error occurs during the shuffle phase of reduce preprocessing, when the number of failed fetches of completed map output exceeds the limit, which defaults to 5. Many things can trigger it, such as flaky network connections, connection timeouts, poor bandwidth, or blocked ports. It normally does not appear when the network inside the cluster is healthy.

2:Too many fetch-failures 
Answer:
This problem is mainly caused by incomplete connectivity between nodes.
1) Check /etc/hosts (see the example below)
The local IP must map to the server's hostname.
The file must contain the IP + hostname of every server in the cluster.
2) Check .ssh/authorized_keys
It must contain the public keys of all servers, including the node itself.
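A minimal /etc/hosts sketch of what that means for a three-node cluster (the hostnames and IPs are hypothetical examples):

192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2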

3: Processing is extremely slow; map finishes quickly but reduce is very slow, and reduce=0% keeps reappearing 
Answer:
Apply the fixes from problem 2, then
modify conf/hadoop-env.sh: export HADOOP_HEAPSIZE=4000 

4: The datanode starts but cannot be accessed, and cannot be shut down cleanly 
When reformatting a new distributed filesystem, you need to delete the local path configured as dfs.name.dir on the NameNode (where the NameNode persists the namespace and transaction log), and also delete each DataNode's dfs.data.dir directories (where the DataNodes store block data). In this configuration that means deleting /home/hadoop/NameData on the NameNode, and /home/hadoop/DataNode1 and /home/hadoop/DataNode2 on the DataNodes. The reason is that when Hadoop formats a new distributed filesystem, each storage namespace is stamped with the version of its creation time (see the VERSION file under /home/hadoop/NameData/current, which records the version information). So before reformatting a new distributed filesystem, it is best to delete the NameData directory first, and you must delete every DataNode's dfs.data.dir, so that the version information recorded by the namenode and the datanodes stays consistent.
Note: deleting is dangerous. Never delete anything you are not sure about, and back up everything before you delete it!
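A minimal command sketch of the cleanup-and-reformat sequence described above, using the example paths from this configuration (adjust them to your own dfs.name.dir and dfs.data.dir, and only run this after backing everything up):

bin/stop-all.sh
# on the NameNode
rm -rf /home/hadoop/NameData
# on every DataNode
rm -rf /home/hadoop/DataNode1 /home/hadoop/DataNode2
# reformat and restart
bin/hadoop namenode -format
bin/start-all.sh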

5:java.io.IOException: Could not obtain block: blk_194219614024901469_1100 file=/user/hive/warehouse/src_20090724_log/src_20090724_log 
This is usually caused by a node that went down or lost its connection.

6:java.lang.OutOfMemoryError: Java heap space 
This exception clearly means the JVM has run out of memory; increase the JVM heap size on all of the datanodes.
Java -Xms1024m -Xmx4096m
As a rule of thumb the JVM's maximum heap should be about half of the total memory; our machines have 8 GB of RAM, so we set 4096m, though this may still not be the optimal value.

How to add a node to Hadoop 
The procedure I actually used to add a node:
1. Set up the environment on the new slave first, including ssh, the JDK, and copying over the relevant config, lib and bin directories;
2. Add the new datanode's host entry to the cluster's namenode and the other datanodes;
3. Add the new datanode's IP to conf/slaves on the master;
4. Restart the cluster and check that the new datanode shows up in the cluster;
5. Run bin/start-balancer.sh; this can take a long time
Notes:
1. Without rebalancing, the cluster will place all new data on the new node, which lowers MapReduce efficiency;
2. bin/start-balancer.sh can also be run with the -threshold 5 option;
threshold is the balancing threshold; the default is 10%. The lower the value, the more evenly balanced the nodes, but the longer it takes.
3. The balancer can also run on a cluster with MR jobs running; dfs.balance.bandwidthPerSec defaults to a low 1 MB/s. When no MR jobs are running, you can raise this setting to shorten the balancing time.

Other notes:
1. Make sure the slave's firewall is turned off;
2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely add the master's and the other slaves' IPs to the new slave's /etc/hosts. (A command sketch for bringing the new node up follows below.)
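A minimal command sketch of the node bring-up, assuming the per-daemon scripts of this Hadoop generation (run the first two commands on the new node; -threshold 5 is just the example value from the notes):

# on the new slave, after the config has been copied over
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker
# on the master, spread existing blocks onto the new node
bin/start-balancer.sh -threshold 5
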
Number of mappers and reducers 
URL: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but increases load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps/ 1,000,000 reduces where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num). 
My own understanding:
Setting the number of mappers: it depends on the input files and on the file splits; the upper bound on a split is dfs.block.size, the lower bound can be set via mapred.min.split.size, and ultimately the InputFormat makes the decision.
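A minimal JobConf sketch of the knobs mentioned above (the class name and the values are placeholders for illustration only):

import org.apache.hadoop.mapred.JobConf;

public class MapReduceCountSketch {
   public static JobConf configure() {
       JobConf conf = new JobConf(MapReduceCountSketch.class);
       // only a hint: the InputFormat makes the final decision on the map count
       conf.setNumMapTasks(100);
       // raise the lower bound on the split size (in bytes) for fewer, larger maps
       conf.set("mapred.min.split.size", "134217728"); // 128 MB
       // the reduce count is honored directly, e.g. 0.95 * nodes * reduce slots per node
       conf.setNumReduceTasks(7);
       return conf;
   }
}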

A good recommendation:
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
<property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>2</value>
   <description>The maximum number of reduce tasks that will be run
   simultaneously by a task tracker.
   </description>
</property>

Adding a new disk to a single node 
1. Edit dfs.data.dir on the node that gets the new disk, separating the old and new directories with a comma (see the sketch below);
2. Restart DFS.
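A hadoop-site.xml sketch of what that looks like (the directory paths are hypothetical examples):

<property>
   <name>dfs.data.dir</name>
   <value>/data1/hadoop/dfs/data,/data2/hadoop/dfs/data</value>
   <description>Comma-separated list of local directories where the DataNode
   stores its blocks; the new disk's directory is appended to the list.
   </description>
</property>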

Syncing the Hadoop code 
hadoop-env.sh
# host:path where hadoop code should be rsync'd from.   Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

Merging small HDFS files with a command 
hadoop fs -getmerge <src> <dest>

How to restart reduce jobs 
Introduced recovery of jobs when JobTracker restarts. This facility is off by default.
Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".
Not verified yet.
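A hadoop-site.xml sketch that enables the recovery facility named above (it is off by default; true is simply the enabling value, so treat this as an assumption to verify against your version):

<property>
   <name>mapred.jobtracker.restart.recover</name>
   <value>true</value>
   <description>Recover running jobs when the JobTracker restarts.
   </description>
</property>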

Problems with IO write operations 
0-1246359584298, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/
172.16.100.165:50010 remote=/172.16.100.165:50930]
       at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
       at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
       at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
       at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
       at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
       at java.lang.Thread.run(Thread.java:619)

It seems there are many reasons that it can time out; the example given in
HADOOP-3831 is a slow reading client.

Workaround: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml (see the property sketch below);
My understanding is that this issue should be fixed in Hadoop 0.19.1 so that
we should leave the standard timeout. However until then this can help
resolve issues like the one you're seeing.
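The corresponding hadoop-site.xml entry would look something like this (0 disables the DataNode write timeout, as suggested above):

<property>
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
   <description>Disable the DataNode socket write timeout as a workaround.
   </description>
</property>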

How to decommission HDFS nodes 
The dfsadmin help text in the current version does not explain this clearly (a bug has been filed for it). The correct procedure is:
1. Point dfs.hosts at the current slaves file, using the full path. Note that the hostnames in the list must be the full names, i.e. what uname -n returns.
2. Put the full names of the nodes to be decommissioned into another file, e.g. slaves.ex, and point the dfs.hosts.exclude parameter at the full path of that file (see the sketch after this list).
3. Run bin/hadoop dfsadmin -refreshNodes
4. In the web UI, or via bin/hadoop dfsadmin -report, you can see the node's status change to Decommission in progress, until all the data that needs re-replication has been copied.
5. When that finishes, remove the decommissioned nodes from the slaves file (i.e. the file dfs.hosts points to).

Three other uses of the -refreshNodes command:
1. Adding permitted nodes to the list (add the hostname to dfs.hosts);
2. Removing a node directly, without re-replicating its data (remove the hostname from dfs.hosts);
3. The reverse of decommissioning: for a node that is listed in both the exclude file and dfs.hosts and is currently decommissioning, this cancels the decommission, turning a Decommission in progress node back to Normal (shown as "in service" in the web UI).
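A hadoop-site.xml sketch for steps 1 and 2 above, plus the refresh command (the file paths are hypothetical examples):

<property>
   <name>dfs.hosts</name>
   <value>/home/hadoop/conf/slaves</value>
</property>
<property>
   <name>dfs.hosts.exclude</name>
   <value>/home/hadoop/conf/slaves.ex</value>
</property>

bin/hadoop dfsadmin -refreshNodes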

Hadoop lessons borrowed from others 
1. Solving the Hadoop OutOfMemoryError problem:
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx800M -server</value>
</property>
With the right JVM size in your hadoop-site.xml , you will have to copy this
to all mapred nodes and restart the cluster.
Or: hadoop jar jarfile [main class] -D mapred.child.java.opts=-Xmx800M 

2. Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.
When I use nutch 1.0, I get this error:
Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.
This one is also easy to fix:
Delete conf/log4j.properties and you will then see the detailed error report.
In my case it was an out-of-memory error.
The fix was to add the JVM options -Xms64m -Xmx512m when running the main class org.apache.nutch.crawl.Crawl.
Your problem may be different, but once you can see the detailed error report it becomes much easier to solve.

Using the distributed cache 
It acts like a global variable, but because the data is too large to put in the config file, the distributed cache is used instead.
Usage (see The Definitive Guide, p. 240 for details):
1. On the command line: pass -files to ship the files to be looked up (local files or HDFS files (hdfs://xxx?)), or -archives (JAR, ZIP, tar, etc.)
% hadoop jar job.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
   -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
2. In the program:
public void configure(JobConf conf) {
   metadata = new NcdcStationMetadata();
   try {
       metadata.initialize(new File("stations-fixed-width.txt"));
   } catch (IOException e) {
       throw new RuntimeException(e);
   }
}
Another, indirect way (apparently not available in hadoop-0.19.0):
call addCacheFile() or addCacheArchive() to add files,
and use getLocalCacheFiles() or getLocalCacheArchives() to retrieve them.
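A minimal sketch of that programmatic route, assuming the org.apache.hadoop.filecache.DistributedCache class of this era (the file path is a placeholder):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
   // at job-submission time: ship an HDFS file to every task node
   public static void addMetadata(JobConf conf) {
       DistributedCache.addCacheFile(
           URI.create("hdfs:///meta/stations-fixed-width.txt"), conf);
   }

   // inside a task (e.g. in configure()): where were the cached files localized?
   public static Path[] localCopies(JobConf conf) throws IOException {
       return DistributedCache.getLocalCacheFiles(conf);
   }
}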

Hadoop job web UIs 
There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/. 

Hadoop monitoring 
Use Nagios for alerting and Ganglia for the monitoring graphs.

status of 255 error 
Error:
java.io.IOException: Task process exit with nonzero status of 255.
       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

Cause:
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to a higher value. By default, their values are 24 hours. These might be the reason for failure, though I'm not sure.
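A hadoop-site.xml sketch of raising the two parameters named above. The values are only illustrative, and the units are my assumption (the retire interval appears to be in milliseconds and the log retention in hours in this era), so check them against your version:

<property>
   <name>mapred.jobtracker.retirejob.interval</name>
   <value>172800000</value>
</property>
<property>
   <name>mapred.userlog.retain.hours</name>
   <value>48</value>
</property>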

split size 
FileInputFormat input splits: (see The Definitive Guide, p. 190 for details)
mapred.min.split.size: default=1, the smallest valid size in bytes for a file split.
mapred.max.split.size: default=Long.MAX_VALUE, the largest valid size.
dfs.block.size: default = 64 MB; set to 128 MB on our system.
If minimum split size > block size, each split covers more blocks. (My guess is that data then has to be fetched from other nodes and blocks get combined into one split, which is why the block count per split goes up.) 
If maximum split size < block size, each block is split further.

split size = max(minimumSize, min(maximumSize, blockSize));
其中 minimumSize < blockSize < maximumSize.

sort by value 
Hadoop does not provide a direct sort-by-value mechanism, because it would hurt MapReduce performance.
It can, however, be achieved with a composite approach; see The Definitive Guide, p. 250 for the full implementation.
Basic idea (a minimal sketch follows this list):
1. Combine the key and value into a new composite key;
2. Override the partitioner so it partitions on the old key only;
conf.setPartitionerClass(FirstPartitioner.class);
3. Define a custom key comparator: sort by the old key first, then by the old value;
conf.setOutputKeyComparatorClass(KeyComparator.class);
4. Override the grouping comparator so it also groups on the old key;   conf.setOutputValueGroupingComparator(GroupComparator.class);
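A minimal sketch of those three classes under one simplifying assumption that is not from the book: the composite key is a Text of the form "oldKey\toldValue". The class names match the setter calls above; everything else is illustrative.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SortByValueSketch {

   // step 2: partition on the old key only (the part before the tab)
   public static class FirstPartitioner implements Partitioner<Text, Text> {
       public void configure(JobConf job) {}
       public int getPartition(Text key, Text value, int numPartitions) {
           String oldKey = key.toString().split("\t", 2)[0];
           return (oldKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
       }
   }

   // step 3: sort by the old key first, then by the old value
   public static class KeyComparator extends WritableComparator {
       public KeyComparator() { super(Text.class, true); }
       public int compare(WritableComparable a, WritableComparable b) {
           String[] p1 = a.toString().split("\t", 2); // assumes "key\tvalue" composite keys
           String[] p2 = b.toString().split("\t", 2);
           int cmp = p1[0].compareTo(p2[0]);
           return cmp != 0 ? cmp : p1[1].compareTo(p2[1]);
       }
   }

   // step 4: group on the old key only, so one reduce() call sees all values
   // of the same original key, already sorted by value
   public static class GroupComparator extends WritableComparator {
       public GroupComparator() { super(Text.class, true); }
       public int compare(WritableComparable a, WritableComparable b) {
           return a.toString().split("\t", 2)[0]
                   .compareTo(b.toString().split("\t", 2)[0]);
       }
   }
}

The driver would then register these with the three JobConf calls shown in steps 2-4 (as nested classes they would be referenced as SortByValueSketch.FirstPartitioner.class, and so on).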

Handling small input files 
Using a long list of small files as input lowers Hadoop's efficiency.
There are three ways to consolidate small files:
1. Merge the small files into a single SequenceFile, which speeds up MapReduce.
See WholeFileInputFormat and SmallFilesToSequenceFileConverter, The Definitive Guide, p. 194.
2. Use CombineFileInputFormat, which builds on FileInputFormat (I have not implemented this);
3. Use Hadoop archives (similar to packing the files together) to reduce the NameNode metadata memory consumed by the small files. (This method may not work out, so it is not recommended.)
Procedure:
Archive the /my/files directory and its subdirectories into files.har, placed under /my:
bin/hadoop archive -archiveName files.har /my/files /my

List the files in the archive:
bin/hadoop fs -lsr har://my/files.har

skip bad records 
JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));
String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);

For skipping failed tasks try : mapred.max.map.failures.percent

Restarting a single datanode 
If a datanode has a problem and, after fixing it, needs to rejoin the cluster without restarting the whole cluster, run on that node:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker

reduce exceeds 100% 
"Reduce Task Progress shows > 100% when the total size of map outputs (for a
single reducer) is high "
Cause:
During the reduce-side merge, the progress check has some error, so the status can exceed 100%, and the stats servlet then throws the following error: java.lang.ArrayIndexOutOfBoundsException: 3
       at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
       at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.doGet(StatusHttpServer.java:159)
       at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
       at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
       at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
       at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
       at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
       at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
       at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
       at org.mortbay.http.HttpServer.service(HttpServer.java:954)

JIRA issue:

counters 
Three kinds of counters:
1. Built-in counters: Map input bytes, Map output records, ...
2. Enum counters
Usage:
   enum Temperature {
MISSING,
MALFORMED
   }

reporter.incrCounter(Temperature.MISSING, 1);
Output:
09/04/20 06:33:36 INFO mapred.JobClient: Air Temperature Recor
09/04/20 06:33:36 INFO mapred.JobClient:     Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient:     Missing=66136856
3. Dynamic counters:
Usage:
reporter.incrCounter("TemperatureQuality", parser.getQuality(),1);

Output:
09/04/20 06:33:36 INFO mapred.JobClient: TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient:     2=1246032
09/04/20 06:33:36 INFO mapred.JobClient:     1=973422173
09/04/20 06:33:36 INFO mapred.JobClient:     0=1
7: Namenode in safe mode 
Solution:
bin/hadoop dfsadmin -safemode leave

8:java.net.NoRouteToHostException: No route to host 
Solution:
sudo /etc/init.d/iptables stop

9: After changing the namenode, SELECT queries in Hive still point to the old namenode address 
This is because when you create a table, Hive actually stores the location of the table (e.g.
hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore. So when I bring up a new cluster the master has a new IP, but Hive's metastore is still pointing to the locations within the old
cluster. I could modify the metastore to update with the new IP every time I bring up a cluster. But the easier and simpler solution was to just use an elastic IP for the master.
So you need to replace every old namenode address stored in the metastore with the current namenode address.


10:Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into the HDFS (e.g., when you run a command like bin/hadoop dfs -put). 
Solution:
Go to the HDFS info web page (open your web browser and go to http://namenode:dfs_info_port where namenode is the hostname of your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the QuickStart on your personal computer then this URL will be http://localhost:50070). Once at that page click on the number where it tells you how many DataNodes you have to look at a list of the DataNodes in your cluster.
If it says you have used 100% of your space, then you need to free up room on local disk(s) of the DataNode(s).
If you are on Windows then this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows). Just free up some more space and you should be okay. On one Windows machine we tried the disk had 1GB free but Hadoop reported that it was 100% full. Then we freed up another 1GB and then it said that the disk was 99.15% full and started writing data into the HDFS again. We encountered this bug on Windows XP SP2.
11:Your DataNodes won't start, and you see something like this in logs/*datanode*: 
Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data
Cause:
Your Hadoop namespaceID became corrupted. Unfortunately the easiest thing to do is to reformat the HDFS.
Solution:
You need to do something like this:
bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format
12:You can run Hadoop jobs written in Java (like the grep example), but your HadoopStreaming jobs (such as the Python example that fetches web page titles) won't work. 
Cause:
You might have given only a relative path to the mapper and reducer programs. The tutorial originally just specified relative paths, but absolute paths are required if you are running in a real cluster.
Solution:
Use absolute paths like this from the tutorial:
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
   -mapper   $HOME/proj/hadoop/multifetch.py       \
   -reducer $HOME/proj/hadoop/reducer.py          \
   -input urls/*                               \
   -output   titles
13: 2009-01-08 10:02:40,709 ERROR metadata.Hive (Hive.java:getPartitions(499)) - javax.jdo.JDODataStoreException: Required table missing : ""PARTITIONS"" in Catalog "" Schema "". JPOX requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "org.jpox.autoCreateTables" 
Cause: org.jpox.fixedDatastore was set to true in hive-default.xml.
starting namenode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-namenode-hadoop.out
localhost: starting datanode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-datanode-hadoop.out
localhost: starting secondarynamenode, logging to /home/hadoop/HadoopInstall/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
localhost: Exception in thread "main" java.lang.NullPointerException
localhost:    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:130)
localhost:    at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:116)
localhost:    at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:120)
localhost:    at org.apache.hadoop.dfs.SecondaryNameNode.initialize(SecondaryNameNode.java:124)
localhost:    at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:108)
localhost:    at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:460)
14:09/08/31 18:25:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:Bad connect ack with firstBadLink 192.168.1.11:50010 
> 09/08/31 18:25:45 INFO hdfs.DFSClient: Abandoning block blk_-8575812198227241296_1001
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Abandoning block blk_-2932256218448902464_1001
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Abandoning block blk_-1014449966480421244_1001
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Abandoning block blk_7193173823538206978_1001
> 09/08/31 18:26:09 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable
to create new block.
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2731)
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>       at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)

> 09/08/31 18:26:09 WARN hdfs.DFSClient: Error Recovery for block blk_7193173823538206978_1001
bad datanode[2] nodes == null
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/umer/8GB_input"
- Aborting...
> put: Bad connect ack with firstBadLink 192.168.1.16:50010


Solution:
I have resolved the issue.
What I did:

1) '/etc/init.d/iptables stop'  --> stopped the firewall
2) SELINUX=disabled in '/etc/selinux/config'  --> disabled SELinux
It worked for me after these two changes.
Fix for jline.ConsoleReader.readLine not working on Windows 
In the main() function of CliDriver.java there is a reader.readLine call used to read from standard input, but on Windows this call always returns null. The reader is a jline.ConsoleReader instance, and this makes debugging in Eclipse on Windows awkward.
We can replace it with java.util.Scanner. Replace the original
while ((line=reader.readLine(curPrompt+"> ")) != null)
with:
Scanner sc = new Scanner(System.in);
while ((line=sc.nextLine()) != null)
Recompile and redeploy, and SQL statements can then be read normally from standard input.

Possible causes of the "does not have a scheme" error when debugging Hive in Eclipse on Windows 
1. The hive.metastore.local configuration item is set to false; change it to true, since this is a standalone setup.
2. The HIVE_HOME environment variable is not set, or is set incorrectly.
3. "does not have a scheme" most likely means hive-default.xml cannot be found. For how to fix the missing hive-default.xml when debugging Hive in Eclipse, see: http://bbs.hadoopor.com/thread-292-1-1.html
1. The Chinese character problem
Chinese text parsed out of a URL still prints as garbage in Hadoop's output. We used to think Hadoop simply did not support Chinese; after reading the source we found that Hadoop just does not output Chinese in GBK.
The code below is from TextOutputFormat.class. Hadoop's default outputs all derive from FileOutputFormat, which has two subclasses: one for binary output and one for text output, TextOutputFormat.
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
   protected static class LineRecordWriter<K, V>
implements RecordWriter<K, V> {
private static final String utf8 = "UTF-8"; // hard-coded to UTF-8 here
private static final byte[] newline;
static {
   try {
       newline = "\n".getBytes(utf8);
   } catch (UnsupportedEncodingException uee) {
       throw new IllegalArgumentException("can't find " + utf8 + " encoding");
   }
}

public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
   this.out = out;
   try {
       this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
   } catch (UnsupportedEncodingException uee) {
       throw new IllegalArgumentException("can't find " + utf8 + " encoding");
   }
}

private void writeObject(Object o) throws IOException {
   if (o instanceof Text) {
       Text to = (Text) o;
       out.write(to.getBytes(), 0, to.getLength()); // this line also needs changing
   } else {
       out.write(o.toString().getBytes(utf8));
   }
}

}
As you can see, Hadoop's default output is hard-coded to UTF-8, so as long as the Chinese was decoded correctly, setting the Linux client's character set to UTF-8 will display the Chinese, because Hadoop writes it out as UTF-8.
Most databases, however, define their fields as GBK. What if we want Hadoop to output Chinese in GBK so it is compatible with the database?
We can define a new class:
public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
   protected static class LineRecordWriter<K, V>
implements RecordWriter<K, V> {
// simply switch the encoding to gbk
private static final String gbk = "gbk";
private static final byte[] newline;
static {
   try {
       newline = "\n".getBytes(gbk);
   } catch (UnsupportedEncodingException uee) {
       throw new IllegalArgumentException("can't find " + gbk + " encoding");
   }
}

public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
   this.out = out;
   try {
       this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
   } catch (UnsupportedEncodingException uee) {
       throw new IllegalArgumentException("can't find " + gbk + " encoding");
   }
}

private void writeObject(Object o) throws IOException {
   // unlike TextOutputFormat, always go through toString() and encode as gbk,
   // including Text values
   out.write(o.toString().getBytes(gbk));
}

}
Then add conf1.setOutputFormat(GbkOutputFormat.class) to the MapReduce code,
and the Chinese will be output in GBK.
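A minimal driver sketch showing where that call goes (the job class and paths are placeholders; the full GbkOutputFormat would also need the getRecordWriter method that the excerpt above omits):

JobConf conf1 = new JobConf(GbkJob.class);        // GbkJob is a placeholder driver class
conf1.setOutputKeyClass(Text.class);
conf1.setOutputValueClass(Text.class);
conf1.setOutputFormat(GbkOutputFormat.class);     // write values in GBK
FileOutputFormat.setOutputPath(conf1, new Path("gbk-output"));
JobClient.runJob(conf1);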

2. A MapReduce job that normally runs fine suddenly throws:

java.io.IOException: All datanodes xxx.xxx.xxx.xxx:xxx are bad. Aborting…
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
java.io.IOException: Could not get block locations. Aborting…
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
Investigation showed the cause was that the Linux machines had too many open files. ulimit -n shows that the default Linux open-file limit is 1024. Edit /etc/security/limits.conf and raise the nofile limit for the hadoop user to 65535 (see the lines below).

Then rerun the program (ideally after making the change on all the datanodes) and the problem is solved.
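The limits.conf lines would look something like this (65535 is the value mentioned above; adding a matching hard limit is my own assumption):

hadoop soft nofile 65535
hadoop hard nofile 65535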

3. After running for a while, Hadoop cannot be stopped with stop-all.sh, and reports
no tasktracker to stop, no datanode to stop
The cause is that Hadoop stops daemons based on the recorded process IDs of the mapred and dfs daemons on the datanodes. By default the pid files live under /tmp, and Linux periodically (roughly every month or every 7 days) cleans files out of that directory. Once the hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid files are deleted, the namenode naturally can no longer find those processes on the datanodes.
Setting export HADOOP_PID_DIR in the configuration solves this problem (example below).
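For example, in conf/hadoop-env.sh (the directory is a placeholder; any location that is not cleaned automatically will do):

export HADOOP_PID_DIR=/home/hadoop/pids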


Problem:
Incompatible namespaceIDs in /usr/local/hadoop/dfs/data: namenode namespaceID = 405233244966; datanode namespaceID = 33333244
Cause:
Every run of hadoop namenode -format generates a new namespaceID for the NameNode, but the DataNode data under hadoop.tmp.dir still keeps the previous namespaceID. The mismatch prevents the DataNode from starting. So just delete the hadoop.tmp.dir directory before each hadoop namenode -format and it will start successfully. Note that you must delete the local directory hadoop.tmp.dir points to, not an HDFS directory.
Problem: Storage directory not exist 
2010-02-09 21:37:53,203 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory D:\hadoop\run\dfs_name_dir does not exist.
2010-02-09 21:37:53,203 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory D:\hadoop\run\dfs_name_dir is in an inconsistent state: storage directory does not exist or is not accessible.
solution: the storage directory D:\hadoop\run\dfs_name_dir does not exist, so just create that directory manually.
Problem: NameNode is not formatted 
solution: HDFS has not been formatted yet; just run hadoop namenode -format, then start it again.

bin/hadoop jps throws the following exception: 
Exception in thread "main" java.lang.NullPointerException
       at sun.jvmstat.perfdata.monitor.protocol.local.LocalVmManager.activeVms(LocalVmManager.java:127)
       at sun.jvmstat.perfdata.monitor.protocol.local.MonitoredHostProvider.activeVms(MonitoredHostProvider.java:133)
       at sun.tools.jps.Jps.main(Jps.java:45)
Cause:
The system's /tmp directory has been deleted. Recreate the /tmp directory and the problem goes away.
The "unable to create log directory /tmp/..." error from bin/hive may have the same cause.
