This project comes from my internship in the first semester of senior year. Since I happened to be learning Scrapy at the time, I built the project around it; it may also become my graduation thesis project.
GitHub repo: https://github.com/tianmingbo/scrapy-elastic
1. Using Elasticsearch
Reference: https://blog.csdn.net/T_I_A_N_/article/details/103253975
Elasticsearch tutorial: https://www.elastic.co/guide/cn/elasticsearch/guide/current/getting-started.html
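Before defining any mappings, it helps to confirm the server is actually reachable. A minimal sketch, assuming Elasticsearch is running on the default localhost:9200 and an elasticsearch-py client compatible with the cluster:

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # assumes the default local setup
print(es.ping())  # True if the cluster answered
print(es.info())  # cluster name, version, etc.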
2. Creating the Index
An index in Elasticsearch is the rough equivalent of a database in MySQL.
from elasticsearch_dsl import DocType, Date, Completion, Keyword, Text, Integer
from elasticsearch_dsl.analysis import CustomAnalyzer as _CustomAnalyzer
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["localhost"])  # connect to the Elasticsearch server

class CustomAnalyzer(_CustomAnalyzer):
    def get_analysis_definition(self):
        # Return an empty definition to avoid errors when the ik analyzer
        # is used with the Completion field.
        return {}

ik_analyzer = CustomAnalyzer("ik_max_word", filter=["lowercase"])

class JobType(DocType):
    '''
    Text:
        tokenized, then indexed
        supports fuzzy and exact queries
        does not support aggregation
    Keyword:
        indexed as-is, without tokenization
        supports fuzzy and exact queries
        supports aggregation
    '''
    suggest = Completion(analyzer=ik_analyzer)
    title = Text(analyzer="ik_max_word")
    salary = Keyword()
    job_city = Text(analyzer="ik_max_word")
    work_years = Keyword()
    degree_need = Keyword()
    job_type = Keyword()
    job_need = Keyword()
    job_responsibility = Keyword()
    job_advantage = Keyword()
    job_url = Keyword()
    publish_time = Keyword()
    company_name = Text(analyzer="ik_max_word")
    company_url = Keyword()

    class Meta:
        index = "lagou"    # index == database
        doc_type = "job"   # doc type == table

if __name__ == "__main__":
    JobType.init()  # generate the mappings from the class definition
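Once the mappings exist, the index can be queried through elasticsearch_dsl's Search API over the same default connection. A minimal sketch, assuming the index already holds documents; the query string "python" is only an illustration:

from elasticsearch_dsl import Search

s = Search(index="lagou").query("match", title="python")
for hit in s[:5].execute():  # fetch the top 5 hits
    print(hit.title, hit.salary)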
3. Fetching Data with the Crawler
1. Build crawlers for the major job-posting sites (a minimal spider skeleton follows this list).
2. Analyze the data.
3. Save it to Elasticsearch.
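The spider skeleton promised above, as a sketch only: the start URL and CSS selectors are placeholders, not taken from any real job site.

import scrapy

class JobSpider(scrapy.Spider):
    name = "job"
    start_urls = ["https://example.com/jobs"]  # placeholder URL

    def parse(self, response):
        for post in response.css("div.job-item"):  # placeholder selector
            yield {
                "title": post.css("h3::text").get(),
                "salary": post.css(".salary::text").get(),
            }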
Elasticsearch is a distributed, highly scalable, near-real-time search and analytics engine. It makes it easy to search, analyze, and explore large volumes of data, and its horizontal scalability makes that data far more valuable in production. At a high level it works in a few steps: the user submits data to Elasticsearch; an analyzer tokenizes the text and stores the tokens along with their weights; when the user later searches, the results are scored and ranked by those weights before being returned.
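The tokenization step described above can be observed directly through the _analyze API. A sketch, assuming the ik analysis plugin is installed and the lagou index exists; the sample text is an arbitrary Chinese job title:

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])
result = es.indices.analyze(
    index="lagou",
    body={"analyzer": "ik_max_word", "text": "python后端开发工程师"},
)
print([t["token"] for t in result["tokens"]])  # the tokens ik produced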
In items.py, save the scraped data to Elasticsearch:
    def save_to_es(self):
        # Save this item to Elasticsearch.
        job = JobType()
        job.title = self["title"]
        job.salary = self["salary"]
        job.job_city = self["job_city"]
        job.work_years = self["work_years"]
        job.degree_need = self["degree_need"]
        job.job_type = self["job_type"]
        job.job_need = self["job_need"]
        job.job_responsibility = self["job_responsibility"]
        job.job_advantage = self["job_advantage"]
        job.job_url = self["job_url"]
        job.publish_time = self["publish_time"]
        job.company_name = self["company_name"]
        job.company_url = self["company_url"]
        job.save()
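In Scrapy, the natural place to trigger this method is an item pipeline. A hedged sketch; the pipeline class name is my own choice for illustration, not taken from the repo:

class ElasticsearchPipeline:
    def process_item(self, item, spider):
        item.save_to_es()  # delegate persistence to the item itself
        return item

The pipeline would still need to be registered under ITEM_PIPELINES in settings.py for Scrapy to call it.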