Importing documents into Elasticsearch with FSCrawler

1.elasticsearch-5.6.12

2.elasticsearch-head

3.fscrawler-es5-2.6

For installation and startup steps, see: https://blog.csdn.net/fulq1234/article/details/96485228

Official documentation: https://fscrawler.readthedocs.io/en/fscrawler-2.6/user/rest.html

A Chinese translation of the docs: https://www.helplib.com/GitHub/article_94667

1.Start FSCrawler with the REST service:

E:\soft\fscrawler-es5-2.6\bin>fscrawler test --loop 0 --rest
10:03:20,487 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV5] Elasticsearch Client for version 5.x connected to a node running version 5.6.12
10:03:20,528 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:03:21,038 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi will be ignored.
10:03:21,039 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi will be ignored.
10:03:21,250 INFO  [f.p.e.c.f.r.RestServer] FS crawler Rest service started on [http://127.0.0.1:8080/fscrawler]

Check the service is working with:

curl http://127.0.0.1:8080/fscrawler/
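If the service is up, it answers with a small JSON status document, roughly of this shape (the version numbers reflect this setup; the settings object, omitted here, echoes back the job settings):

{"ok":true,"version":"2.6","elasticsearch":"5.6.12","settings":{...}}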

You can force a specific identifier by adding id=YOUR_ID to the form data:

curl -F "file=@test.txt" -F "id=my-test" "http://127.0.0.1:8080/fscrawler/_upload"
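Assuming the supplied id is stored verbatim as the document _id (and that the running job is the test job from step 1, which indexes into the test index), the response should look something like:

{"ok":true,"filename":"test.txt","url":"http://127.0.0.1:9200/test/doc/my-test"}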

2.Upload a file

E:\soft\fscrawler-es5-2.6\bin>curl -F "file=@C:\tmp\folderA\subfolderA\ni.pdf" "http://127.0.0.1:8080/fscrawler/_upload"
{"ok":true,"filename":"ni.pdf","url":"http://127.0.0.1:9200/test/doc/bfce7866cb7abdd1232cc7604f60e3"}
E:\soft\fscrawler-es5-2.6\bin>curl -F "file=@C:\Users\lunmei\Desktop\bx\光大永明達爾文.pdf" "http://127.0.0.1:8080/fscrawler/_upload"
{"ok":true,"filename":"錕斤拷錕斤拷錕斤拷錕斤拷錕斤拷錕斤拷.pdf","url":"http://127.0.0.1:9200/test/doc/31749656282e2e9afcc4368d58f9331b"}

After a successful upload, the documents show up in elasticsearch-head. (The mojibake file name in the second response is most likely an encoding mismatch: the Windows console sends the Chinese file name in its local GBK encoding while the server expects UTF-8.)
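Besides elasticsearch-head, you can query the index directly; the response URLs above point at the test index on the local node:

curl "http://127.0.0.1:9200/test/_search?pretty"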

3.Create a new job

E:\soft\fscrawler-es5-2.6\bin>fscrawler lm
13:06:22,092 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [lm] does not exist
13:06:22,094 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
y
13:06:24,744 INFO  [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [C:\Users\lunmei\.fscrawler\lm\_settings.json]. Please review and edit before relaunch

Edit the configuration file (C:\Users\lunmei\.fscrawler\lm\_settings.json). The main parameters are listed below; a sketch of an edited settings file follows the list.

Parameters:
url – the root directory to crawl for this job (required)
includes – include patterns; they apply to file names, not to directory names
filename_as_id – use the file name as the _id
continue_on_error – by default the crawler stops indexing as soon as it hits a permission-denied exception; set this to true to skip the offending file and continue with the rest of the directory tree
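A minimal sketch of what the edited _settings.json might look like, consistent with the crawl log further down (root directory C:\tmp\tmp\es, 15-minute update rate). FSCrawler 2.6 generates more fields than shown here, and the includes pattern is only illustrative:

{
  "name" : "lm",
  "fs" : {
    "url" : "C:\\tmp\\tmp\\es",
    "update_rate" : "15m",
    "includes" : [ "*/*.pdf" ],
    "filename_as_id" : true,
    "continue_on_error" : true
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ]
  }
}

With filename_as_id enabled, re-crawling a file with the same name updates the same document instead of creating a new one.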
   
   

Then restart the job:

E:\soft\fscrawler-es5-2.6\bin>fscrawler lm
13:09:15,268 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV5] Elasticsearch Client for version 5.x connected to a node running version 5.6.12
13:09:15,309 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
13:09:15,310 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
13:09:15,324 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [lm] for [C:\tmp\tmp\es] every [15m]
13:09:15,489 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Success.
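To confirm from the command line as well, you can count the indexed documents; by default the index name is the job name, lm:

curl "http://127.0.0.1:9200/lm/_count?pretty"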

 

 
