Nutch on Hadoop: "ls: cannot access data/segments: No such file or directory"

Original post

2020-06-16 16:56

I deployed Nutch on Hadoop and ran it with:

bin/crawl hdfs://localhost:9000/user/hadoop/urls data http://localhost:8983/solr/ 1

After the generator step finished, the script printed:

ls: cannot access data/segments/: No such file or directory
Operating on segment : 
Fetching :

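I checked HDFS, and the directory was clearly there.

I was completely baffled.

After a lot of searching on Baidu and Google turned up nothing, I decided to read the Nutch source.

Here is the relevant part of the crawl script: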

# determines whether mode based on presence of job file

mode=local
if [ -f ../*nutch-*.job ]; then
    mode=distributed
fi

......

  if [ $mode = "local" ]; then
   SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
  else
   SEGMENT=`hadoop fs -ls $CRAWL_PATH/segments/ | grep segments |  sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
  fi
  
  echo "Operating on segment : $SEGMENT"

I tried running the distributed-mode command by hand:

hadoop fs -ls $CRAWL_PATH/segments/ | grep segments |  sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1

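In the terminal it produced the correct result. So I began to suspect that the wrong mode had been selected, that is, that mode was local rather than distributed.

To check, I added this to the script: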
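The distributed-mode pipeline can also be tried without a cluster. The sketch below feeds it a made-up listing in the shape that `hadoop fs -ls` prints (the segment names and the `listing` and `latest` variables are invented for illustration) and assumes GNU sed, which accepts `\n` in the replacement:

```shell
# Simulated `hadoop fs -ls data/segments/` output (made-up listing for illustration).
listing='Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2020-06-16 16:40 data/segments/20200616164012
drwxr-xr-x   - hadoop supergroup          0 2020-06-16 17:02 data/segments/20200616170233'

# Keep the segment lines, split each path on "/", keep pieces containing
# a 20xx-style number, then take the numerically largest one.
latest=$(printf '%s\n' "$listing" \
  | grep segments \
  | sed -e 's/\//\n/g' \
  | grep -E '20[0-9]+' \
  | sort -n \
  | tail -n 1)

echo "Operating on segment : $latest"
```

Pieces that merely contain a date (the part of each line before the first "/") also pass the `grep -E` filter, but they start with a letter, so `sort -n` treats them as 0 and they never end up last.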

echo $mode

And sure enough... it printed local.

I then re-read the mode-selection code more carefully:

# determines whether mode based on presence of job file

mode=local
if [ -f ../*nutch-*.job ]; then
    mode=distributed
fi

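It turns out that mode is set to distributed only if a file matching *nutch-*.job exists in the parent directory. Because ../ is a relative path, "parent directory" here means the parent of the directory the command is run from, not the parent of the directory where the script lives.

Then I looked closely at my own command...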
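The dependence on the current directory can be reproduced in isolation. This is a minimal sketch using a throwaway directory tree; the layout, the file name `apache-nutch-demo.job`, and the `detect_mode` function are hypothetical stand-ins, and only the `[ -f ../*nutch-*.job ]` test is taken from the script:

```shell
# Build a throwaway layout that mimics a directory with bin/ and a job file.
demo=$(mktemp -d)
mkdir -p "$demo/runtime/deploy/bin"
touch "$demo/runtime/deploy/apache-nutch-demo.job"

detect_mode() {
  # Same test as in the crawl script: ".." is relative to the current directory.
  mode=local
  if [ -f ../*nutch-*.job ]; then
    mode=distributed
  fi
  echo "$mode"
}

# Run the check from two different working directories.
from_parent=$(cd "$demo/runtime/deploy" && detect_mode)    # like running bin/crawl
from_bin=$(cd "$demo/runtime/deploy/bin" && detect_mode)   # like running ./crawl

echo "from deploy/:     $from_parent"
echo "from deploy/bin/: $from_bin"
rm -rf "$demo"
```

Run from `deploy/`, the glob looks in `runtime/`, finds nothing, and the mode stays local; run from `deploy/bin/`, it finds the job file and switches to distributed.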

bin/crawl ....

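The script has to be launched as ./crawl from inside bin, so that ../ resolves to the directory containing the *nutch-*.job file.

So I changed into bin and ran this instead: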

./crawl hdfs://localhost:9000/user/hadoop/urls data http://localhost:8983/solr/ 1

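With that, the crawl ran normally.

Not knowing Linux well is scary... (And a badly written script is scary too: it prints none of the key information that would help a programmer work out what went wrong. Sigh...)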
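One way a script could avoid this trap is to look for the job file relative to its own location and to print the mode it picked. The following is only a sketch of that idea, not what the Nutch script does; `detect_mode_from` and the demo layout are hypothetical:

```shell
# Hypothetical helper: decide the mode from a given bin directory
# instead of from the caller's current directory.
detect_mode_from() {
  bin_dir=$(cd "$1" && pwd)            # absolute path of the bin directory
  for job in "$bin_dir"/../*nutch-*.job; do
    if [ -f "$job" ]; then
      echo distributed
      return
    fi
  done
  echo local
}

# Demo with a throwaway layout; in a real script the argument
# would typically be "$(dirname "$0")".
demo=$(mktemp -d)
mkdir -p "$demo/deploy/bin" "$demo/empty/bin"
touch "$demo/deploy/apache-nutch-demo.job"

mode_rel=$(cd "$demo" && detect_mode_from deploy/bin)     # relative path, cwd above bin
mode_abs=$(cd / && detect_mode_from "$demo/deploy/bin")   # absolute path, unrelated cwd
mode_none=$(detect_mode_from "$demo/empty/bin")           # no job file next to bin

echo "mode: $mode_rel / $mode_abs / $mode_none"
rm -rf "$demo"
```

Echoing the chosen mode at startup, as the last line does, would have made the original problem visible immediately.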
