Trying Out ElasticSearch

  • Search

    • ES needs a reverse proxy in front of it, because that is far safer. Do not let elasticSearch face the front end directly; put at least two WEB servers capable of reverse proxying in front of es. Having an F5 or some other load balancer is not enough by itself, unless you are happy to take dedicated measures for ES at the firewall layer, such as dynamically adding an authentication parameter to the HEADER of every request bound for ES. Otherwise, always deploy two reverse-proxy WEB servers in front of the ES cluster. The budget option is two servers running NGINX (or a derivative), each also running LVS to provide a floating IP (if you have an F5-class load balancer, it can replace LVS and provide the floating IP; a floating-IP sketch follows below). Remember to configure a username and password for ES access on NGINX. KIBANA gets a brief introduction later too; with it you can create separate users, passwords, and matching permissions for ES, and the account used for searching must have as few privileges as possible (watch the security patches)

      Content reaches ElasticSearch either by watching the directory of generated static files, or by storing the searchable content (title, body, etc.) and the associated resources (such as the externally accessible link) into ElasticSearch during static generation. ElasticSearch itself ships as a binary package and can be downloaded from https://www.elastic.co/cn/ .
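
      One common way to manage the floating IP shared by the two NGINX boxes mentioned above is keepalived (which is also the usual frontend for LVS). A minimal sketch; the interface, router id, and address below are assumptions to adapt:

      vrrp_instance VI_1 {
          state MASTER              # use BACKUP on the second proxy
          interface eth0            # NIC that carries the floating IP
          virtual_router_id 51
          priority 100              # give the BACKUP a lower priority
          advert_int 1
          virtual_ipaddress {
              172.16.130.50/24      # the floating IP that clients connect to
          }
      }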

    • Installation

      Prerequisites

      #On centos-type systems, do the following first (ubuntu is much the same):
      vim /etc/sysctl.conf
        fs.file-max = 1000000
        vm.max_map_count=262144
        vm.swappiness = 1
      sysctl -p   #apply the kernel settings
      vim /etc/security/limits.conf
        * soft nofile 65536
        * hard nofile 131072
        * soft nproc 2048
        * hard nproc 4096
      vi /etc/security/limits.d/90-nproc.conf
        *          soft    nproc     2048        
      
      #Configure java1.8 (or put these lines in profile and source it)
      export JAVA_HOME=/data/jdk1.8.0_151
      export PATH=$JAVA_HOME/bin:$PATH
      export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
      
      #Create the elk user; install as root, run as elk
      groupadd elk
      useradd elk -g elk
      mkdir -p /data/elk
      

      Install plugins such as ik and x-pack. ik is a Chinese word-segmentation plugin and has to be installed into es; x-pack is the official (paid) plugin for security, monitoring, and so on, and is strictly optional: with it you get a nice monitoring UI, without it the system runs just as well, monitoring is simply more of a hassle.

      #Install elasticsearch logstash kibana
      
      #Install x-pack
      bin/elasticsearch-plugin install file:///home/zzuser/elk/x-pack-5.6.3.zip
      bin/logstash-plugin install file:///home/zzuser/elk/x-pack-5.6.3.zip
      bin/kibana-plugin install file:///home/zzuser/elk/x-pack-5.6.3.zip
      #Crack x-pack
      #Start elasticsearch once, stop it, then swap in the patched x-pack-5.6.3.jar
      #Apply for a basic license at https://license.elastic.co/registration; license.json arrives by mail
      #Edit the license file: "type":"platinum", "expiry_date_in_millis":146128000000 (a timestamp one year out)
      #Upload the license file: curl -XPUT -u elastic:changeme 'http://127.0.0.1:9200/_xpack/license?acknowledge=true' -d @license.json
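      #A sketch for computing expiry_date_in_millis: milliseconds since the epoch, one year (365 days) ahead
      python3 -c "import time; print(int((time.time() + 365*24*3600) * 1000))"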
      
      #Install the word-segmentation plugin with elasticsearch-plugin
      bin/elasticsearch-plugin install file:///home/zzuser/elk/ik.zip
      #Start as the elk user
      vi elasticsearch.yml
      bin/elasticsearch -d -v
      
      #Install the http input with logstash-plugin
      bin/logstash-plugin install logstash-input-http
      #Alternative installation method
      #vi Gemfile >> gem "logstash-input-http", :path => "/root/logstash-input-http-master"
      #bin/logstash-plugin install --no-verify
      #yum install ruby
      #yum install rubygems
      #unzip master
      #gem build xxx.gemspec 
      #logstash-plugin install xxx.gem
      #Start as the elk user, specifying the config file
      vi logstash.conf
      bin/logstash -f config/logstash.conf
      
      #kibana
      bin/kibana
      
    • Configuration

      logstash configuration

      input {
        http{
            #Address and port the input listens on
            host => "0.0.0.0"
            port => 8111
            #Decoding codec and character set for the payload
            codec => json {
              charset => ["UTF-8"]
            }
            #Number of threads
            threads => 4
            #Username and password
            user => "tran"
            password => "zerotest"
            #Whether to use ssl
            ssl => false
        }
      }
      output {
        #Output can go to stdout and es at the same time
        #Codec for standard output
        stdout {  codec => rubydebug }
        elasticsearch {
               #Address and port of es
               hosts => ["http://127.0.0.1:9200"]
               #The action es should take (this field is filled in by the program later, so the program controls inserts, updates, and deletes)
               action => "%{action}"
               #Username and password
               user => "elastic"
               password => "changeme"
               #Unique document ID
               document_id => "%{hashid}"
               #Which index this document goes into
               index => "%{indexname}"
               #Which document type this document uses (you can think of document_type as a table and index as a database... just do not carry relational-database concepts over wholesale)
               document_type => "mixdoc"
               doc_as_upsert => true
        }
      }
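
      Once logstash is running, the pipeline above can be smoke-tested with a single request (a sketch; the field values are arbitrary, but the field names match the %{...} references in the output block):

      curl -u tran:zerotest -H 'Content-Type: application/json' \
           -d '{"action":"index","hashid":"test1","indexname":"somename","title":"hello"}' \
           http://127.0.0.1:8111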
      
      

      ElasticSearch configuration

      #Cluster name
      cluster.name: search
      #Node name
      node.name: node1
      #Whether this node is master-eligible
      node.master: true
      #Whether this node is a data node
      node.data: false
      #Bind address
      network.host: 172.16.100.31
      #The machines that belong to this cluster
      discovery.zen.ping.unicast.hosts: ["172.16.100.30", "172.16.130.31", "172.16.130.32"]
      #Minimum number of master-eligible nodes a node must see before it may operate in the cluster; the official recommendation is (N/2)+1, e.g. 3 master-eligible nodes gives 3/2+1 = 2
      discovery.zen.minimum_master_nodes: 2
      #Begin data recovery only after N nodes in the cluster have started
      gateway.recover_after_nodes: 3
      #Other settings can be looked up online
      bootstrap.memory_lock: false
      bootstrap.system_call_filter: false
      
      #Master node
      cluster.name: search
      node.name: node5
      node.master: true
      node.data: false
      network.host: 172.16.100.31
      discovery.zen.ping.unicast.hosts: ["172.16.100.30", "172.16.130.31", "172.16.130.32", "172.16.100.33", "172.16.100.36", "172.16.100.38"]
      discovery.zen.minimum_master_nodes: 2
      gateway.recover_after_nodes: 3
      bootstrap.memory_lock: false
      bootstrap.system_call_filter: false
      #Data node
      cluster.name: search
      node.name: node6
      node.master: false
      node.data: true
      network.host: 172.16.100.32
      discovery.zen.ping.unicast.hosts: ["172.16.100.30", "172.16.130.31", "172.16.130.32", "172.16.100.33", "172.16.100.36", "172.16.100.38"]
      discovery.zen.minimum_master_nodes: 2
      gateway.recover_after_nodes: 3
      bootstrap.memory_lock: false
      bootstrap.system_call_filter: false
      
      

      Reverse proxying with nginx

      http {
          include       mime.types;
          default_type  application/octet-stream;
      
          log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                            '$status $body_bytes_sent "$http_referer" '
                            '"$http_user_agent" "$http_x_forwarded_for"';
          log_format mini '$time_local $status $body_bytes_sent $request_time $upstream_cache_status $server_name';
          #access_log  logs/access.log  main;
          access_log logs/status_log mini;
      
          sendfile        on;
          #tcp_nopush     on;
      
          #keepalive_timeout  0;
          keepalive_timeout  65;
      
          #gzip  on;
          charset utf-8;
          client_max_body_size 16M;
      
          upstream search_upstream {
              server 127.0.0.1:9200;
          }
      
          server {
          listen       80;
          server_name  172.16.130.52;
      
      
          #access_log  logs/kibana.log main;
          #access_log  logs/kibana_status_log mini;
          #error_log   logs/kibana_error.log error;
      
          error_page 404 = /index.html;
          error_page 500 502 503 504 403 = /index.html;
             location / {
                      proxy_set_header   X-Request-Uri         $request_uri;
                      proxy_set_header   X-Real-IP            $remote_addr;
                      proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
                      proxy_set_header   Host                   $http_host;
                      proxy_set_header   X-NginX-Proxy    true;
                      proxy_set_header   Connection "";
                      proxy_http_version 1.1;
                      proxy_pass         http://localhost:5601/;
              }
              location /search/ {
                      proxy_set_header   X-Request-Uri         $request_uri;
                      proxy_set_header   X-Real-IP            $remote_addr;
                      proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
                      proxy_set_header   Host                   $http_host;
                      proxy_set_header   X-NginX-Proxy    true;
                      proxy_set_header   Connection "";
                      proxy_http_version 1.1;
                      proxy_pass  http://search_upstream/mw_index/mixdoc/_search/;
                      #HTTP Basic auth: base64.b64encode(b"username:passwd")
                      proxy_set_header   Authorization "Basic bXdfc2VhcmNoZXI6RTdheDE5NzY=";
                      proxy_pass_header Authorization;
              }
      
              location /search_test/ {
                      proxy_set_header   X-Request-Uri         $request_uri;
                      proxy_set_header   X-Real-IP            $remote_addr;
                      proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
                      proxy_set_header   Host                   $http_host;
                      proxy_set_header   X-NginX-Proxy    true;
                      proxy_set_header   Connection "";
                      proxy_http_version 1.1;
                      proxy_pass  http://search_upstream/indexnameismine/mixdoc/_search/;
                      proxy_set_header   Authorization "Basic ZWxhc3RpYzpjaGFuZ2VtZQ==";
                      proxy_pass_header Authorization;
              }
          }
      }
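
      The Basic auth values above are nothing more than base64 of "username:password"; they can be generated with a one-liner like this (a sketch; substitute your own credentials):

      python3 -c 'import base64; print("Basic " + base64.b64encode(b"username:passwd").decode())'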
      

      Now create the index and the document type (this assumes ik is installed; also note that the default es username and password are elastic:changeme). The index name somename is used as the example.

      #First create the somename_v1 index; the parameters are self-explanatory, so no commentary
      curl -u elastic:changeme -XPUT 'http://172.16.100.30:9200/somename_v1' -d'
      {
        "settings": {
          "number_of_shards": 10,
          "number_of_replicas": 1,
          "index": {
            "analysis": {
              "analyzer": {
                "by_smart": {
                  "type": "custom",
                  "tokenizer": "ik_smart",
                  "filter": ["by_tfr","by_sfr"],
                  "char_filter": ["by_cfr"]
                },
                "by_max_word": {
                  "type": "custom",
                  "tokenizer": "ik_max_word",
                  "filter": ["by_tfr","by_sfr"],
                  "char_filter": ["by_cfr"]
                }
              },
              "filter": {
                "by_tfr": {
                  "type": "stop",
                  "stopwords": [" "]
                },
                "by_sfr": {
                  "type": "synonym",
                  "synonyms_path": "analysis/synonyms.txt"
                }
              },
              "char_filter": {
                "by_cfr": {
                  "type": "mapping",
                  "mappings": ["| => |"]
                }
              }
            }
          }    
        }
      }'        
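
      To check that the custom analyzers are wired up correctly, the _analyze API can be queried directly (a sketch; the sample text is arbitrary):

      curl -u elastic:changeme -XPOST 'http://172.16.100.30:9200/somename_v1/_analyze?pretty' -d'
      {
        "analyzer": "by_smart",
        "text": "中文分詞測試"
      }'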
      

      Create the document mapping (a document_type is roughly a table together with its fields)

      curl  -u elastic:changeme  -XPUT 'http://172.16.100.30:9200/somename_v1/_mapping/mixdoc' -d'
      {
        "properties": {
          "hashid" : {
            "type": "keyword"
          },
          "title": {
            "type": "text",
            "index": true,
            "analyzer": "by_max_word",
            "search_analyzer": "by_smart"
          },
          "content": {
            "type": "text",
            "index": true,
            "analyzer": "by_max_word",
            "search_analyzer": "by_smart"
          },
          "keys": {
            "type": "keyword"
          },
          "date": {
            "type": "date"
          },
          "description": {
            "type": "text"
          },
          "pics": {
            "type": "text"
          },
          "link": {
            "type": "text"
          },
          "author": {
            "type": "keyword"
          },
          "mark_int": {
            "type": "long"
          },
          "mark_text": {
            "type": "text"
          },    
          "area_location": {
            "type": "geo_point"
          },
          "area_shape": {
            "type": "geo_shape"
          },
          "other_backup1": {
            "type": "text",
            "index": false
          },
          "other_backup2": {
            "type": "text",
            "index": false
          }          
        }
      }'        
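
      With the mapping in place, a typical query against the analyzed title and content fields looks like this (a sketch):

      curl -u elastic:changeme -XPOST 'http://172.16.100.30:9200/somename_v1/mixdoc/_search?pretty' -d'
      {
        "query": {
          "multi_match": {
            "query": "your keywords",
            "fields": ["title", "content"]
          }
        }
      }'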
      

      Now create the real somename through an alias (the benefit: es cannot modify fields that already exist in a document_type's mapping, and such a change forces a full reindex into a new index. To make those migrations painless, expose the public name as an alias: it behaves like a linked-list head, and you simply repoint it at whichever versioned index is current)

      curl -u elastic:changeme -XPOST 172.16.100.30:9200/_aliases -d '
      {
          "actions": [
              { "add": {
                  "alias": "somename",
                  "index": "somename_v1"
              }}
          ]
      }'        
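
      Later, when a reworked index is ready (somename_v2 below is hypothetical), the alias can be repointed atomically in one call:

      curl -u elastic:changeme -XPOST 172.16.100.30:9200/_aliases -d '
      {
          "actions": [
              { "remove": { "alias": "somename", "index": "somename_v1" }},
              { "add": { "alias": "somename", "index": "somename_v2" }}
          ]
      }'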
      

      Deleting and inspecting

      #Check cluster health
      curl 'localhost:9200/_cat/health?v'
      #List the nodes in the cluster
      curl 'localhost:9200/_cat/nodes?v'
      #List all indices
      curl 'localhost:9200/_cat/indices?v'
      #Create an index called "customer", then list all indices again:
      curl -XPUT 'localhost:9200/customer?pretty'
      curl 'localhost:9200/_cat/indices?v'
      #Now put something into the customer index. To index a document, we have to tell Elasticsearch which type within the index it goes to. Index a simple customer document into the customer index, type "external", with ID 1:
      curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
      {
        "name": "John Doe"
      }'
      #Fetch the document we just indexed:
      curl -XGET 'localhost:9200/customer/external/1?pretty'
      #Delete the index we just created, then list all indices again:
      curl -XDELETE 'localhost:9200/customer?pretty'
      curl 'localhost:9200/_cat/indices?v'
      
      curl  -u elastic:changeme  -XDELETE '172.16.100.30:9200/tianjinwe_v1?pretty'
      curl  -u elastic:changeme  -XGET '172.16.100.30:9200/tianjinwe?pretty'
      curl -X<REST Verb> <Node>:<Port>/<Index>/<Type>/<ID>
      

      There are many ways to get statically generated content into ES. When a mature publishing system is involved and it offers no automated data interface of its own, my personal habit is to pick an approach based on how much historical data there is, how much is updated per day, and whether this project shares the same level of publishing directory with other projects. Since this project genuinely does not have much content, I went with watching the directory. The method is simple: use the Linux kernel's inotify mechanism (it places some minor demands on the kernel version; please look those up yourself) with python code along the lines below. For real use you only need to harden the code a little and restart it automatically on failure (personally I suggest supervisor), change the print calls to write to kafka or a database, and write a separate consumer that pushes those messages into logstash or straight into elasticSearch. Sketches of both parts follow (py3 is the better choice; as for py2...):

      The example program that watches the directory:

      import os
      from pyinotify import *  # WatchManager, Notifier, ProcessEvent, IN_DELETE, IN_CREATE, IN_MODIFY, IN_CLOSE_WRITE, IN_CLOSE_NOWRITE

      class EventHandler(ProcessEvent):
          """Event handlers: only report files whose names contain both 'index' and 'htm'"""
          #Raise the inotify queue limit (writes /proc/sys/fs/inotify/max_queued_events, so run privileged)
          max_queued_events.value = 99999

          def process_IN_CREATE(self, event):
              findIndex = event.name.find('index')
              findHtm = event.name.find('htm')
              if findIndex >= 0 and findHtm >= 0:
                  print("Create file: %s "  %   os.path.join(event.path,event.name))

          def process_IN_DELETE(self, event):
              findIndex = event.name.find('index')
              findHtm = event.name.find('htm')
              if findIndex >= 0 and findHtm >= 0:
                  print("Delete file: %s "  %   os.path.join(event.path,event.name))

          def process_IN_MODIFY(self, event):
              findIndex = event.name.find('index')
              findHtm = event.name.find('htm')
              if findIndex >= 0 and findHtm >= 0:
                  print("Modify file: %s "  %   os.path.join(event.path,event.name))

          def process_IN_CLOSE_WRITE(self, event):
              findIndex = event.name.find('index')
              findHtm = event.name.find('htm')
              if findIndex >= 0 and findHtm >= 0:
                  print("CLOSE WRITE file: %s "  %   os.path.join(event.path,event.name))

          def process_IN_CLOSE_NOWRITE(self, event):
              findIndex = event.name.find('index')
              findHtm = event.name.find('htm')
              if findIndex >= 0 and findHtm >= 0:
                  print("CLOSE NOWRITE: %s "  %   os.path.join(event.path,event.name))

          def process_IN_Q_OVERFLOW(self, event):
              #The event queue overflowed; grow the limit and carry on
              print('-_-',max_queued_events.value)
              max_queued_events.value *= 3

      def FSMonitor(path='.'):
          wm = WatchManager()
          #Handlers above only fire for events included in this mask
          mask = IN_DELETE | IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE | IN_CLOSE_NOWRITE
          notifier = Notifier(wm, EventHandler())
          wm.add_watch(path, mask, rec=True)
          print('now starting monitor %s'%(path))
          while True:
              try:
                  notifier.process_events()
                  if notifier.check_events():
                      notifier.read_events()
              except KeyboardInterrupt:
                  notifier.stop()
                  break

      if __name__ == "__main__":
          FSMonitor('/data/wwwroot/')
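
      If the events are routed through kafka as suggested above, each print call in the handlers becomes a producer.send. A minimal sketch, assuming the kafka-python package, a broker on 127.0.0.1:9092, and a hypothetical file-events topic:

      from kafka import KafkaProducer
      import json

      #Serialize event dicts to UTF-8 JSON on the way out
      producer = KafkaProducer(
          bootstrap_servers='127.0.0.1:9092',
          value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode('utf-8'))

      #Inside a handler, instead of print():
      producer.send('file-events', {'event': 'close_write', 'path': '/data/wwwroot/some/index.html'})
      producer.flush()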
      

      The example program that feeds data into logstash (it walks all html files under a directory, parses out the title, body, and other fields, and passes them to logstash):

      from bs4 import BeautifulSoup
      import json
      import os
      import os.path
      import hashlib
      import urllib3

      http = urllib3.PoolManager()
      logstash = 'http://172.16.130.52:8080'
      rootdir = "/data/wwwroot/"
      _counter = 0
      for parent, dirnames, filenames in os.walk(rootdir):
          for filename in filenames:
              if filename.find('.html') < 0:
                  continue
              completeFileName = os.path.join(parent, filename)
              hash = hashlib.sha256()
              hash.update(completeFileName.encode('utf-8'))
              #Generate a unique ID for each file
              myHash = hash.hexdigest()
              #Open and parse one file under the directory (the with-block closes it)
              with open(completeFileName, 'rb') as fh:
                  soup = BeautifulSoup(fh, 'html5lib')
              #From here on: unpack, find the characteristic tags, pull out the relevant content. Adjust this to your own page structure
              [m.extract() for m in soup.find_all('style')]
              titleEle = soup.find(name='h1')
              if titleEle is None:
                  continue
              contentEle = soup.find(name='div', class_='TRS_Editor')
              if contentEle is None:
                  continue
              txtContent = contentEle.get_text().replace('\n', '')
              titleContent = titleEle.get_text()
              urlPrefix = 'http://www.somename.com/'
              url = completeFileName.replace(rootdir, urlPrefix)
              #Because this is a full rebuild, 'action' is always index; index is the action sent when content must be (re)indexed, covering both updates and new documents. hashid serves both cases: on update it locates the existing document, and on insert it acts as the manually assigned primary key. title is the title, content is the body, url is the externally reachable address. indexname must match the %{indexname} reference in the logstash config
              normal = json.dumps({'action': 'index', 'hashid': myHash, 'indexname': 'somename', 'title': titleContent, 'content': txtContent, 'url': url}, ensure_ascii=False)
              result = http.request('POST', logstash, body=normal.encode('utf-8'), headers={'Content-Type': 'application/json'})
              if result.status == 200:
                  #print(normal)
                  _counter += 1
                  print("success:", myHash)
                  if _counter % 10000 == 0:
                      print("_counter", _counter)
              else:
                  print("fail:", myHash, ":", result.status, ":", completeFileName)
      
      
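      To cover the auto-restart suggestion from earlier, the directory monitor can run as a daemon under supervisor with an entry like this (a sketch; the program name and paths are assumptions):

      [program:fsmonitor]
      command=/usr/bin/python3 /data/elk/fsmonitor.py
      directory=/data/elk
      user=elk
      autostart=true
      autorestart=true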

What still needs introducing is the part about configuring the password nginx uses to access ES; note that there is also a HOST header issue to watch out for.
