The Flask Mega-Tutorial, Chapter 16: Full-Text Search

Introduction

Add full-text search to Microblog: for a given search term, return all posts that contain the term, sorted by relevance in descending order.

Intro to Full-Text Search Engines

1. Open-source full-text search engines:

Elasticsearch
Apache Solr
Whoosh
Xapian
Sphinx

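2. Databases with search capabilities:

SQLite, MySQL, PostgreSQL
MongoDB, CouchDB

Relational databases do offer search features, but SQLAlchemy does not support them, so you must either write raw SQL yourself or find a library that provides high-level access to text search and works alongside SQLAlchemy.

Elasticsearch is very popular as one member of the ELK stack (Elasticsearch, Logstash, Kibana; used for indexing logs), so it is the engine chosen for this project.

Note: keep the text indexing and searching functions in a separate module. If the search engine needs to change later, only the functions in that module have to be rewritten.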

Installing Elasticsearch

1. Before installing Elasticsearch, install JDK 8

How to Install Java 8 on Debian 9/8/7 via PPA
How to Install JAVA 8 on Ubuntu 18.04/16.04, LinuxMint 18/17
Two ways to install JDK 7 / JDK 8 on Ubuntu

1-1 Add Java 8 PPA

Create a new apt configuration file, /etc/apt/sources.list.d/java-8-debian.list:

sudo vim /etc/apt/sources.list.d/java-8-debian.list

Add the following content:

deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main

Import the GPG key (used to verify packages before installation):

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886

1-2 Install Java 8

sudo apt-get update
sudo apt-get install oracle-java8-installer

1-3 Verify the Java installation

Set the default version

sudo apt-get install oracle-java8-set-default

The apt repository provides the package oracle-java8-set-default, which sets Java 8 as the default Java version.

Check the version

$ java -version

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

1-4 Set the JAVA_HOME and JRE_HOME environment variables

Edit the environment variables (per user):

vim ~/.bashrc

For a system-wide setting, edit /etc/environment instead.
Append the following to ~/.bashrc:

# set oracle jdk environment
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=${JAVA_HOME}/jre

Apply the environment variables immediately:

source ~/.bashrc

2. Install Elasticsearch

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb.sha512
shasum -a 512 -c elasticsearch-6.3.0.deb.sha512 
sudo dpkg -i elasticsearch-6.3.0.deb

The shasum command compares the SHA-512 of the downloaded Debian package with the published checksum; it should output elasticsearch-{version}.deb: OK.
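
The same check can be sketched in Python. This is a minimal illustration, not part of the tutorial: the function name verify_sha512 is hypothetical, and it assumes the .sha512 file starts with the hex digest followed by whitespace and the file name.

```python
import hashlib

def verify_sha512(package_path, checksum_path, chunk_size=1 << 20):
    """Return True if the SHA-512 of package_path matches checksum_path."""
    with open(checksum_path, "r", encoding="utf-8") as f:
        # Assumed layout: "<hex digest>  <file name>"
        expected = f.read().split()[0].lower()
    digest = hashlib.sha512()
    with open(package_path, "rb") as f:
        # Read in chunks so a large .deb is not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

Calling verify_sha512('elasticsearch-6.3.0.deb', 'elasticsearch-6.3.0.deb.sha512') before running dpkg gives a True/False answer instead of the shasum text output.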

3. Start / stop Elasticsearch

Running / stopping Elasticsearch with systemd

sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service

To start Elasticsearch automatically at boot:

sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service

Verify that Elasticsearch is running by requesting:

http://localhost:9200
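
The check can also be scripted. The sketch below uses only the standard library; the function name elasticsearch_is_up is hypothetical, and it assumes an unsecured local node that answers a plain GET with status 200.

```python
import urllib.request

def elasticsearch_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts
        # all derive from OSError.
        return False
```

A deployment script could poll this function after systemctl start, since the service needs a few seconds before it accepts connections.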

4. Install the Python client for Elasticsearch

(venv) $ pip install elasticsearch

Note: update requirements.txt

Elasticsearch Tutorial

1. Create an Elasticsearch connection

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch('http://localhost:9200')

Instantiate the Elasticsearch class, passing the connection URL as an argument.

2. Write data (JSON) to an index: es.index()

>>> es.index(index='my_index', doc_type='my_index', id=1, body={'text': 'this is a test'})
>>> es.index(index='my_index', doc_type='my_index', id=2, body={'text': 'a second test'})

index: the storage container in Elasticsearch
doc_type: the document type; one index can store several types
id: the unique identifier of the document
body: a JSON object with the data, made of field names and their values

3. Search: es.search()

>>> es.search(index='my_index', doc_type='my_index',
... body={'query': {'match': {'text': 'this test'}}})

Note the format of body: {'query': {'match': {<field>: <expression>}}}

The response is a Python dict:

{
    'took': 1,
    'timed_out': False,
    '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
    'hits': {
        'total': 2, 
        'max_score': 0.5753642, 
        'hits': [
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '1',
                '_score': 0.5753642,
                '_source': {'text': 'this is a test'}
            },
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '2',
                '_score': 0.25316024,
                '_source': {'text': 'a second test'}
            }
        ]
    }
}
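
To make the structure concrete, here is a small sketch that pulls the useful parts out of such a response. It assumes the 6.x layout shown above, where hits.total is a plain integer; the helper name summarize is hypothetical, and the dict is trimmed to the keys it reads.

```python
response = {
    'hits': {
        'total': 2,
        'max_score': 0.5753642,
        'hits': [
            {'_id': '1', '_score': 0.5753642,
             '_source': {'text': 'this is a test'}},
            {'_id': '2', '_score': 0.25316024,
             '_source': {'text': 'a second test'}},
        ],
    },
}

def summarize(response):
    """Return (total, [(id, score, text), ...]) from a search response."""
    hits = response['hits']
    rows = [(int(h['_id']), h['_score'], h['_source']['text'])
            for h in hits['hits']]
    return hits['total'], rows

total, rows = summarize(response)
for doc_id, score, text in rows:
    # Hits arrive already sorted by descending _score.
    print(doc_id, round(score, 3), text)
```

Note that _id comes back as a string, which is why it is converted with int() before being used as a database key.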

4. Delete an index

>>> es.indices.delete('my_index')

Note: to delete a single document by id:

es.delete(index=index, doc_type=index, id=<id>)

Elasticsearch Configuration

1. ELASTICSEARCH_URL

config.py: elasticsearch configuration.

class Config(object):
    # ...
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL')

Update .env:

ELASTICSEARCH_URL=http://localhost:9200

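2. Initialize Elasticsearch

Elasticsearch is not wrapped by a Flask extension, so the client cannot be instantiated in the global scope before an app instance exists: it needs the URL from app.config. It is created inside create_app() instead.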

app/__init__.py: Elasticsearch instance.

# ...
from elasticsearch import Elasticsearch

# ...

def create_app(config_class=Config):
    app = Flask(__name__)
    app.config.from_object(config_class)

    # ...
    app.elasticsearch = Elasticsearch([app.config['ELASTICSEARCH_URL']]) \
        if app.config['ELASTICSEARCH_URL'] else None

    # ...

If the ELASTICSEARCH_URL environment variable is not set, app.elasticsearch is None.

A Full-Text Search Abstraction

Goals of the abstraction

Avoid being tied to Elasticsearch, so the search engine is easy to replace
Generalize over models, so the code is not limited to Post
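
As an illustration of the first goal, the sketch below is a toy in-memory engine exposing three operations named after the functions this chapter places in app/search.py. Everything here is hypothetical: the class name, the simplified signatures (an id and a dict of fields instead of a model object) and the word-count scoring are assumptions for the example, not how Elasticsearch ranks results.

```python
class InMemorySearch:
    """Toy search backend: stores text per index, scores by word count."""

    def __init__(self):
        self._indexes = {}  # index name -> {object id -> text}

    def add_to_index(self, index, obj_id, fields):
        # Adding an existing id overwrites it, so this also acts as update.
        text = ' '.join(str(v) for v in fields.values()).lower()
        self._indexes.setdefault(index, {})[obj_id] = text

    def remove_from_index(self, index, obj_id):
        self._indexes.get(index, {}).pop(obj_id, None)

    def query_index(self, index, query, page, per_page):
        words = query.lower().split()
        scored = []
        for obj_id, text in self._indexes.get(index, {}).items():
            tokens = text.split()
            score = sum(tokens.count(w) for w in words)
            if score:
                scored.append((score, obj_id))
        # Highest score first; ties broken by id for a stable order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        start = (page - 1) * per_page
        ids = [obj_id for _, obj_id in scored[start:start + per_page]]
        return ids, len(scored)

engine = InMemorySearch()
engine.add_to_index('posts', 1, {'body': 'this is a test'})
engine.add_to_index('posts', 2, {'body': 'a second test'})
engine.add_to_index('posts', 3, {'body': 'unrelated'})
print(engine.query_index('posts', 'this test', 1, 10))  # ([1, 2], 2)
```

Because callers only see the three operations and the (ids, total) return value, a stand-in like this could replace the real engine in unit tests without touching the rest of the application.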

1. Add a __searchable__ list to the Model

Add a __searchable__ attribute to each Model that needs indexing, listing the fields to include in the index.

app/models.py:

class Post(db.Model):
    __searchable__ = ['body']
    # ...

Note: __searchable__ is just a variable; it triggers no behavior by itself and only assists the functions written later.
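
How such an attribute gets consumed can be shown with a plain class. In this sketch Post is a stand-in without SQLAlchemy, and the helper name build_payload is hypothetical; it only mirrors the getattr loop that the search functions use.

```python
class Post:
    """Plain stand-in for the SQLAlchemy model, for illustration only."""
    __searchable__ = ['body']

    def __init__(self, id, body, author):
        self.id = id
        self.body = body
        self.author = author  # not in __searchable__, so never indexed

def build_payload(model):
    """Collect the fields named in __searchable__ into a dict."""
    return {field: getattr(model, field) for field in model.__searchable__}

print(build_payload(Post(1, 'hello search', 'susan')))
# {'body': 'hello search'}
```

Adding another field name to the list is all it takes to include that field in the indexed document.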

2. Wrap the search functions in app/search.py

from flask import current_app

def add_to_index(index, model):
    if not current_app.elasticsearch:
        return
    payload = {}
    for field in model.__searchable__:
        payload[field] = getattr(model, field)
    current_app.elasticsearch.index(index=index, doc_type=index, id=model.id,
                                    body=payload)

def remove_from_index(index, model):
    if not current_app.elasticsearch:
        return
    current_app.elasticsearch.delete(index=index, doc_type=index, id=model.id)

def query_index(index, query, page, per_page):
    if not current_app.elasticsearch:
        return [], 0
    search = current_app.elasticsearch.search(
        index=index, doc_type=index,
        body={'query': {'multi_match': {'query': query, 'fields': ['*']}},
              'from': (page - 1) * per_page, 'size': per_page})
    ids = [int(hit['_id']) for hit in search['hits']['hits']]
    return ids, search['hits']['total']

The application talks to Elasticsearch only through app/search.py, which makes it easy to replace the search engine later.

Notes:

id=model.id gives each document the same unique id in Elasticsearch and in the SQLAlchemy database, which makes targeted deletes and the CASE-based ordering of search results straightforward later.
add_to_index() serves for both adding and updating entries.
multi_match searches across multiple fields.
'fields': ['*'] tells Elasticsearch to look in all the fields (those listed in __searchable__), i.e. to search the entire index. This keeps the function generic, since different models can have different field names in the index.
SQLAlchemy's paginate() is not available here, so the offset must be computed by hand: 'from': (page - 1) * per_page.
A list comprehension extracts the IDs from the results.
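
The pagination arithmetic can be checked in isolation. The sketch below builds the same request body as query_index() without contacting Elasticsearch; the function name build_search_body and the input validation are additions made for this example.

```python
def build_search_body(query, page, per_page):
    """Build the request body for a paginated multi_match query."""
    if page < 1 or per_page < 1:
        raise ValueError('page and per_page must be positive')
    return {
        'query': {'multi_match': {'query': query, 'fields': ['*']}},
        'from': (page - 1) * per_page,  # number of hits to skip
        'size': per_page,               # number of hits to return
    }

for page in (1, 2, 3):
    body = build_search_body('flask', page, 25)
    print(page, body['from'], body['size'])
```

With 25 results per page, pages 1, 2 and 3 start at offsets 0, 25 and 50.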

3. Testing

Test from a Python shell (add some posts first):

>>> from app.search import add_to_index, remove_from_index, query_index
>>> for post in Post.query.all():
...     add_to_index('posts', post)
>>> query_index('posts', 'one two three four five', 1, 100)
([15, 13, 12, 4, 11, 8, 14], 7)

Clean up the test index:

>>> app.elasticsearch.indices.delete('posts')

Integrating Searches with SQLAlchemy

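The approach in app/search.py has two problems:

1. query_index() returns IDs rather than model objects

We would like to get model objects directly, so they can be passed to templates for rendering.

Solution: build a SQL query from the IDs that fetches the corresponding model objects.

2. add_to_index and remove_from_index must be called explicitly whenever posts are added or deleted

This invites bugs, and Elasticsearch drifts further and further out of sync with the SQLAlchemy database.

Solution: use SQLAlchemy events to listen on db.session, so that Elasticsearch is updated automatically whenever the SQLAlchemy database changes.

To solve both problems, create a mixin class, SearchableMixin.

The mixin class acts as the glue layer between SQLAlchemy and Elasticsearch.
Once a Model inherits from SearchableMixin, it automatically manages its associated full-text index.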

1. app/models.py: SearchableMixin class.

from app.search import add_to_index, remove_from_index, query_index

class SearchableMixin(object):
    @classmethod
    def search(cls, expression, page, per_page):
        ...  # see 1-1

    @classmethod
    def before_commit(cls, session):
        ...  # see 1-2

    @classmethod
    def after_commit(cls, session):
        ...  # see 1-2

    @classmethod
    def reindex(cls):
        ...  # see 1-3

1-1 search()

    @classmethod
    def search(cls, expression, page, per_page):
        ids, total = query_index(cls.__tablename__, expression, page, per_page)
        if total == 0:
            return cls.query.filter_by(id=0), 0
        when = []
        for i in range(len(ids)):
            when.append((ids[i], i))
        return cls.query.filter(cls.id.in_(ids)).order_by(
            db.case(when, value=cls.id)), total

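query_index() from app/search.py is called with index = cls.__tablename__.
when = [(ids[i], i), ...] pairs each id with its position in the result list.
In the returned cls.query.filter(), cls.id.in_(ids) is SQLAlchemy expression syntax (note: filter, not filter_by).
order_by uses a CASE expression that compares value (cls.id) against the id in each when tuple; when cls.id == ids[i], the i from that tuple is used as the sort key.

As a result, search() returns the model objects in the same order as the IDs.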
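
The CASE ordering can be tried without Flask or SQLAlchemy. The following sketch uses the standard-library sqlite3 module with a made-up post table; the helper name fetch_in_order and the sample rows are assumptions for the example, and the SQL is written by hand rather than generated by SQLAlchemy.

```python
import sqlite3

def fetch_in_order(conn, ids):
    """Return (id, body) rows for `ids`, in the same order as `ids`."""
    if not ids:
        return []
    placeholders = ', '.join('?' for _ in ids)
    # One "WHEN <id> THEN <position>" branch per id.
    whens = ' '.join('WHEN ? THEN ?' for _ in ids)
    sql = (
        f'SELECT id, body FROM post WHERE id IN ({placeholders}) '
        f'ORDER BY CASE id {whens} END'
    )
    params = list(ids)
    for position, post_id in enumerate(ids):
        params.extend([post_id, position])
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)')
conn.executemany(
    'INSERT INTO post VALUES (?, ?)',
    [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')],
)
rows = fetch_in_order(conn, [3, 1, 4])
print(rows)  # [(3, 'three'), (1, 'one'), (4, 'four')]
```

Without the ORDER BY CASE clause the database is free to return the rows in primary-key order, which would discard the relevance ranking computed by the search engine.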

1-2 before_commit and after_commit

    @classmethod
    def before_commit(cls, session):
        session._changes = {
            'add': list(session.new),
            'update': list(session.dirty),
            'delete': list(session.deleted)
        }

    @classmethod
    def after_commit(cls, session):
        for obj in session._changes['add']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['update']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['delete']:
            if isinstance(obj, SearchableMixin):
                remove_from_index(obj.__tablename__, obj)
        session._changes = None

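Notes:

before_commit saves the pending objects of the SQLAlchemy db.session into the session._changes dict.
Once the session is committed, they can no longer be tracked through the session attributes (session.new, session.dirty, session.deleted).
session._changes holds objects of every Model with pending changes, not only those that inherit from SearchableMixin.
That is why after_commit, which runs after db.session is committed, has to check whether each obj in session._changes is an instance of SearchableMixin.
When after_commit calls add_to_index and remove_from_index, the index must be obj.__tablename__, not cls.__tablename__. The handlers are registered on SearchableMixin itself, so cls does not identify the Model that changed; if two Models A and B both inherit from SearchableMixin, only obj.__tablename__ reliably names the table of the object that was actually committed.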
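
The reason for the two-step snapshot can be illustrated with a toy model. Nothing below is the SQLAlchemy API: ToySession, Searchable and the dict used as an index are hypothetical stand-ins that only imitate the one relevant behavior, namely that pending lists are empty once the commit has happened.

```python
class Searchable:
    """Marker base class, standing in for SearchableMixin."""

class Post(Searchable):
    def __init__(self, id, body):
        self.id, self.body = id, body

class User:  # not searchable
    def __init__(self, id):
        self.id = id

class ToySession:
    """Toy stand-in for a session; commit() empties the pending lists."""
    def __init__(self):
        self.new, self.deleted = [], []
        self._changes = None

    def commit(self):
        before_commit(self)
        self.new, self.deleted = [], []  # pending state is gone after commit
        after_commit(self)

index = {}  # pretend search index: id -> body

def before_commit(session):
    # Snapshot while the pending objects are still visible.
    session._changes = {'add': list(session.new),
                        'delete': list(session.deleted)}

def after_commit(session):
    for obj in session._changes['add']:
        if isinstance(obj, Searchable):
            index[obj.id] = obj.body
    for obj in session._changes['delete']:
        if isinstance(obj, Searchable):
            index.pop(obj.id, None)
    session._changes = None

session = ToySession()
session.new = [Post(1, 'first post'), User(7)]
session.commit()
print(index)  # {1: 'first post'}
```

The User object passes through the snapshot but is skipped by the isinstance check, and the index is only touched after the commit step, so a failed commit would never leave a stray entry behind.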

1-3 reindex

    @classmethod
    def reindex(cls):
        for obj in cls.query:
            add_to_index(cls.__tablename__, obj)

reindex() adds all the model objects in the database to the search index.
Note: iterating over cls.query yields the same objects as cls.query.all().

1-4 db.event

SQLAlchemy events

Signature: sqlalchemy.event.listen(target, identifier, fn, *args, **kw)

db.event.listen(db.session, 'before_commit', SearchableMixin.before_commit)
db.event.listen(db.session, 'after_commit', SearchableMixin.after_commit)

Testing

>>> Post.reindex()

>>> query, total = Post.search('one two three four five', 1, 5)
>>> total
7
>>> query.all()
[<Post five>, <Post two>, <Post one>, <Post one more>, <Post one>]

Note: the returned query is itself a SQLAlchemy query object, so query.all() works on it:

query = cls.query.filter(cls.id.in_(ids)).order_by(db.case(when, value=cls.id))

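The search term should travel in the URL as a q argument, so that search results can be reached directly, as in https://www.google.com/search?q=python

For the search term submitted by the client to end up in the query string of the URL, the request method must be GET.

POST is used to submit form data from the application's forms (as shown in previous chapters).
GET is the request method used when you type a URL in the browser or click a link.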
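
What "carrying the term in the query string" means can be seen with the standard library alone. In this sketch the helper name search_url is hypothetical and the base URL is the local development address used later in this chapter.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def search_url(base, term, page=1):
    """Build a search URL carrying the term in the q argument."""
    return base + '?' + urlencode({'q': term, 'page': page})

url = search_url('http://localhost:5000/search', 'full text search', page=2)
print(url)  # http://localhost:5000/search?q=full+text+search&page=2

# The server side reads the same values back from the query string.
args = parse_qs(urlsplit(url).query)
print(args['q'][0], args['page'][0])  # full text search 2
```

Because the whole search is encoded in the URL, the result page can be bookmarked or shared as a link, which a POST submission would not allow.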

1. Create the form. app/main/forms.py: Search form.

from flask import request

class SearchForm(FlaskForm):
    q = StringField(_l('Search'), validators=[DataRequired()])

    def __init__(self, *args, **kwargs):
        if 'formdata' not in kwargs:
            kwargs['formdata'] = request.args
        if 'csrf_enabled' not in kwargs:
            kwargs['csrf_enabled'] = False
        super(SearchForm, self).__init__(*args, **kwargs)

The form has a single text field q and no submit button: for a form that has a text field, the browser submits the form when you press Enter with the focus on the field, so a button is not needed.
formdata determines where Flask-WTF gets the form submission from. The default is request.form; for a GET request it is pointed at request.args, so that Flask-WTF reads the form data from the query string.
csrf_enabled: forms have CSRF protection by default, implemented through a CSRF token added to the form ({{ form.hidden_tag() }}). For clickable search links to work, CSRF validation has to be bypassed.

2. Display the search form (visible on all pages, except error pages)

The conventional approach: create a form object in every route, then pass the form to all the templates.

Instead, use a before_request handler to instantiate g.search_form = SearchForm() once per request.

app/main/routes.py:

from flask import g
from app.main.forms import SearchForm

@bp.before_app_request
def before_request():
    if current_user.is_authenticated:
        current_user.last_seen = datetime.utcnow()
        db.session.commit()
        g.search_form = SearchForm()
    g.locale = str(get_locale())

g is tied to the request and persists through the whole life of that request, and so does the search_form attached to it.
When the before_request handler ends and the view function for the requested URL is invoked to handle the request, g is still the same object.
The g variable is specific to each request and each client, so even if your web server is handling multiple requests at a time for different clients, you can still rely on g to work as private storage for each request, independently of what goes on in other requests that are handled concurrently.

3. Insert g.search_form into app/templates/base.html

            ...
            <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
                <ul class="nav navbar-nav">
                    ... home and explore links ...
                </ul>
                {% if g.search_form %}
                <form class="navbar-form navbar-left" method="get"
                        action="{{ url_for('main.search') }}">
                    <div class="form-group">
                        {{ g.search_form.q(size=20, class='form-control',
                            placeholder=g.search_form.q.label.text) }}
                    </div>
                </form>
                {% endif %}
                ...

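{% if g.search_form %} renders the form only when g.search_form exists.
method="get" is needed because the form data must be submitted in the query string through a GET request.
action="{{ url_for('main.search') }}": earlier forms left action empty because they were submitted to the same page that rendered the form. The search form appears on every page, so it must state explicitly where the form is submitted and the results are rendered.
The action attribute specifies the URL the form is sent to on submission.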

Search View Function

1. Create a view function to handle search requests (http://localhost:5000/search?q=search-words)

app/main/routes.py: search view function.

@bp.route('/search')
@login_required
def search():

    if not g.search_form.validate():
        return redirect(url_for('main.explore'))
    # just validate field values, without checking how the data was submitted. 

    page = request.args.get('page', 1, type=int)
    per_page = current_app.config['POSTS_PER_PAGE']

    posts, total = Post.search(g.search_form.q.data, page, per_page)

    next_url = url_for('main.search', q=g.search_form.q.data, page=page+1) \
        if total > page * per_page else None
    prev_url = url_for('main.search', q=g.search_form.q.data, page=page-1) \
        if page > 1 else None

    return render_template('search.html', title=_('Search'), posts=posts,
                            next_url=next_url, prev_url=prev_url)

# url_for() only builds the URL; the browser issues the GET request when the link is followed.
# q is the query-string argument in http://localhost:5000/search?q=search-words, just like Google.

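form.validate() only validates the field values, without checking how the data was submitted (form.validate_on_submit() would additionally require a submission method such as POST).
Post.search(), the search() classmethod from SearchableMixin, returns the page of search results.
The submitted q value, g.search_form.q.data, is used as the query expression.
page and per_page are handled as in the other view functions.
The second return value, total, is used to decide whether there is a next page (next_url).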
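
The next/previous logic of the view can be checked on its own. The sketch below repeats the two conditions from search() but returns page numbers instead of URLs; the function name neighbor_pages is hypothetical.

```python
def neighbor_pages(total, page, per_page):
    """Return (prev_page, next_page); None marks a missing neighbor."""
    next_page = page + 1 if total > page * per_page else None
    prev_page = page - 1 if page > 1 else None
    return prev_page, next_page

# 7 hits, 5 per page: page 1 has a next page, page 2 is the last one.
print(neighbor_pages(7, 1, 5))  # (None, 2)
print(neighbor_pages(7, 2, 5))  # (1, None)
```

The comparison is strict, so a result set that fills the current page exactly (for example 10 hits at 5 per page on page 2) does not produce a link to an empty next page.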

2. Create the search.html template

app/templates/search.html: search results template.

{% extends "base.html" %}

{% block app_content %}
    <h1>{{ _('Search Results') }}</h1>
    {% for post in posts %}
        {% include '_post.html' %}
    {% endfor %}
    <nav aria-label="...">
        <ul class="pager">
            <li class="previous{% if not prev_url %} disabled{% endif %}">
                <a href="{{ prev_url or '#' }}">
                    <span aria-hidden="true">&larr;</span>
                    {{ _('Previous results') }}
                </a>
            </li>
            <li class="next{% if not next_url %} disabled{% endif %}">
                <a href="{{ next_url or '#' }}">
                    {{ _('Next results') }}
                    <span aria-hidden="true">&rarr;</span>
                </a>
            </li>
        </ul>
    </nav>
{% endblock %}
