A new post dedicated to notes on Nutch & Solr.

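Versions

Nutch versions

Nutch is currently developed along two parallel lines, so 2.x is not simply a higher or newer release than 1.x.

1.x (latest at the time of writing: 1.8; ships with Hadoop 1.2 by default and can also run with Hadoop 2.2).
2.x (latest at the time of writing: 2.2.1; ships with Hadoop 1.2 by default and cannot run with Hadoop 2.2, because Gora 0.3 supports HBase 0.90.x and 0.92.x, and those HBase versions work with Hadoop 1.2 but not with Hadoop 2.2).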

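Solr versions

Environment setup

Nutch setup

1.x (latest at the time of writing: 1.8)
- Nutch 1.7, Hadoop 1.2.1, CentOS 6.5, JDK 1.7: deploying the Nutch crawler on a Hadoop cluster
- Nutch 1.7: official single-machine tutorial
2.x (latest at the time of writing: 2.2.1)
- hadoop+hbase+Nutch2.1: installing and configuring Nutch (for Linux)
- Nutch 2.2+MySQL+Solr4.2: crawling and indexing website content
- Running Nutch in Eclipse
  - Official tutorial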

Solr setup

Solr itself

4.7
- Official tutorial
- Admin page: http://localhost:8983/solr/#/

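Chinese word segmentation

Segmentation plugins

jcseg
- jcseg is a Chinese word segmenter written in Java that implements the popular mmseg algorithm.
- Latest version at the time of writing: jcseg 1.9.3, compatible with Lucene and Solr up to the latest 4.x releases.
- Uses the four mmseg filtering rules; the reported segmentation accuracy is 98.41%.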
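IK Analyzer
- Uses its own "forward iterative finest-granularity segmentation algorithm" and supports two segmentation modes: fine-grained and smart.
- Latest release: October 2012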
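mmseg4j
- mmseg4j is a Chinese word segmenter built on Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/). It also implements a Lucene analyzer and a Solr TokenizerFactory so that it can be used easily in Lucene and Solr.
- The MMSeg algorithm has two segmentation methods, Simple and Complex, both based on forward maximum matching. Complex adds four filtering rules. According to the official site, the correct word identification rate reaches 98.41%. mmseg4j implements both methods.
- Latest release: version 1.9.1 (2013-07-13), compatible with Solr 4.3.1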
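
To make "forward maximum matching" concrete, here is a minimal Python sketch of the idea behind the Simple method. It is not mmseg4j's code: the function name, the toy dictionary, and the `max_word_len` parameter are all assumptions made for illustration. At each position it takes the longest dictionary word that starts there and falls back to a single character when nothing matches.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching (illustrative sketch, not mmseg4j).

    dictionary: a set of known words (hypothetical toy data).
    max_word_len: longest candidate to try at each position (assumed value).
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        # Try the longest candidate first and shrink until a word matches.
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))
```

With this toy dictionary the greedy pass picks "研究生" first and leaves "命" stranded. This kind of ambiguity is what the extra filtering rules of the Complex method are meant to resolve.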
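ansj
- ansj segmentation: a true Java implementation of ict. Both its segmentation quality and its speed exceed those of the open-source ict release. Features: Chinese word segmentation, person-name recognition, part-of-speech tagging, and user-defined dictionaries.
- Under active development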

Plugin installation

smartcn & IK

Python & Solr

Official introduction

Pure HTTP (see the official documentation).
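
As a rough illustration of the pure-HTTP route, the sketch below queries Solr's `/select` handler using only the Python standard library. The core name `collection1`, the field name `title`, and the helper names `build_select_url` and `search` are assumptions for illustration and do not come from these notes; adjust them to your own setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, query, rows=10):
    """Build a /select URL that asks Solr for a JSON response."""
    params = urlencode({"q": query, "wt": "json", "rows": rows})
    return f"{base_url.rstrip('/')}/select?{params}"


def search(base_url, query, rows=10):
    """Run the query and return the matching documents as a list of dicts.

    Requires a reachable Solr instance at base_url (an assumption).
    """
    with urlopen(build_select_url(base_url, query, rows), timeout=10) as resp:
        return json.load(resp)["response"]["docs"]


# Hypothetical core and field names, for illustration only.
print(build_select_url("http://localhost:8983/solr/collection1", "title:nutch"))
```

If a Solr server is running with such a core, `search("http://localhost:8983/solr/collection1", "title:nutch")` would return the matching documents. Without a server, only the URL builder is useful.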
mysolr
mysolr was born to be a fast and easy-to-use client for Apache Solr’s API and because existing Python clients didn’t fulfill these conditions. Since version 0.5, mysolr supports Python 3, except for the concurrent search feature.
pysolr (a fairly simple API; this is the one I currently use.)
pysolr is a lightweight Python wrapper for Apache Solr. It provides an interface that queries the server and returns results based on the query.
Haystack (more complex)
Haystack provides modular search for Django. It features a unified, familiar API that allows you to plug in different search backends (such as Solr, Elasticsearch, Whoosh, Xapian, etc.) without having to modify your code.
insol (looks good, but I have doubts about which Solr versions it supports; officially it claims compatibility with 1.4)
- REPL friendly shortcuts module to start working right away
- Solr queries as Python objects, so that others can use your code abstracted away from inner workings of Solr - this is a design similar to Django ORM with its Q and F objects
- fast and cache friendly - results as simple dicts, no builtin dict to object inflation code - either use the results as-is or provide your own inflation mechanism
- configuration module with live config reload to support connecting to multiple Solr instances or cores at run time
- flexible structure allowing you to customize the whole process of connecting to Solr instance and fetching documents without rewriting whole API
sunburnt
It's tested with Solr 1.4.1 and 3.1; previous versions were known to work with 1.3 and 1.4 as well.
solrpy

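Resources

Source code analysis

Nutch 1.7 source code analysis

Books

Solr in Action: covers Solr 4.7 (the latest at the time of writing)
Books recommended by the official Solr site
Web Crawling and Data Mining with Apache Nutch

Papers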

Building Nutch: Open Source Search

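Other

"Nutch Open Course: From Search Engines to Web Crawlers" (Baidu Wenku)
Nutch custom development
Nutch 1.7 custom development training notes: crawling and analyzing Tencent Weibo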
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
Nutch Command line Option
SolrCloud
Dump Lucene Index
nutch-mongodb-indexer

