The Flask Mega-Tutorial, Chapter 16: Full-Text Search

Introduction

Add full-text search to Microblog: for a given search term, return all posts that contain the term, sorted by relevance in descending order.


Intro to Full-Text Search Engines

1. Open-source full-text search engines:

  • Elasticsearch
  • Apache Solr
  • Whoosh
  • Xapian
  • Sphinx

2. Databases with search capabilities:

  • SQLite、MySQL、PostgreSQL
  • MongoDB、CouchDB

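Relational databases have search features, but SQLAlchemy does not support them, so you must either write raw SQL statements yourself or find a library that provides high-level access to text search while working alongside SQLAlchemy.

Elasticsearch is very popular as a member of the ELK stack (Elasticsearch, Logstash, Kibana; used for indexing logs), so it is the engine chosen for this project.

Note: wrap the text indexing and searching functions in a separate module. If the search engine needs to be replaced later, only the functions in this module have to be rewritten.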


Installing Elasticsearch

1. Before installing Elasticsearch, install JDK 8

How to Install Java 8 on Debian 9/8/7 via PPA
How to Install JAVA 8 on Ubuntu 18.04/16.04, LinuxMint 18/17
Two ways to install JDK 7 / JDK 8 on Ubuntu

1-1 Add Java 8 PPA

  • Create a new Apt configuration file, /etc/apt/sources.list.d/java-8-debian.list,
sudo vim /etc/apt/sources.list.d/java-8-debian.list
  • Add the following content
deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
  • Import the GPG key (used to verify packages before installation).
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886


1-2 Install Java 8

sudo apt-get update
sudo apt-get install oracle-java8-installer


1-3 Verify the Java installation

  • Set the default version

    sudo apt-get install oracle-java8-set-default

    The apt repository provides package oracle-java8-set-default to set Java 8 as default Java version.

  • Check the version

$ java -version

java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)


1-4 Set up the JAVA_HOME and JRE_HOME environment variables

  • Edit the environment variables (per user)

    vim ~/.bashrc

    For a system-wide setting, edit /etc/environment instead.

  • Append the following to ~/.bashrc

# set oracle jdk environment
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=${JAVA_HOME}/jre
  • Apply the environment variables immediately

    source ~/.bashrc


2. Install Elasticsearch

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb.sha512
shasum -a 512 -c elasticsearch-6.3.0.deb.sha512 
sudo dpkg -i elasticsearch-6.3.0.deb

The shasum command compares the SHA-512 of the downloaded Debian package with the published checksum and should output elasticsearch-{version}.deb: OK.
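
The same comparison can be sketched in Python, which may help on systems without shasum. This is a minimal sketch: the helper name sha512_matches is hypothetical, and it assumes the .sha512 file holds a single line of the form "<hex digest>  <filename>".

```python
import hashlib


def sha512_matches(data, checksum_line):
    """Return True if the SHA-512 of data equals the digest in checksum_line.

    checksum_line is assumed to look like '<hex digest>  <filename>'.
    """
    expected = checksum_line.split()[0].lower()
    return hashlib.sha512(data).hexdigest() == expected


# Stand-in bytes instead of the real .deb, so the sketch runs anywhere.
package_bytes = b'example package contents'
published = hashlib.sha512(package_bytes).hexdigest() + '  elasticsearch-6.3.0.deb'
ok = sha512_matches(package_bytes, published)
tampered = sha512_matches(package_bytes + b'x', published)
```

For the real package, read the downloaded file in binary mode and pass its bytes as data.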


3. Start / stop Elasticsearch

  • Running / Stopping Elasticsearch with systemd
sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service
  • To start Elasticsearch automatically at boot:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
  • Verify that Elasticsearch is running by opening
http://localhost:9200



4. Install the Python client for Elasticsearch

(venv) $ pip install elasticsearch

Note: update requirements.txt


Elasticsearch Tutorial

1. Create an Elasticsearch connection

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch('http://localhost:9200')

Instantiate the Elasticsearch class, passing the connection URL as an argument.


2. Write data (JSON) to an index: es.index()

>>> es.index(index='my_index', doc_type='my_index', id=1, body={'text': 'this is a test'})
>>> es.index(index='my_index', doc_type='my_index', id=2, body={'text': 'a second test'})
  • index, Elasticsearch's storage container
  • doc_type, the document type; one index can store multiple types
  • id, unique for each document
  • body, a JSON object with the data, made up of field: data pairs


3. Search: es.search()

>>> es.search(index='my_index', doc_type='my_index',
... body={'query': {'match': {'text': 'this test'}}})

Note the format of body: {'query': {'match': {<field>: <expression>}}}

The response is a Python dict:

{
    'took': 1,
    'timed_out': False,
    '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
    'hits': {
        'total': 2, 
        'max_score': 0.5753642, 
        'hits': [
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '1',
                '_score': 0.5753642,
                '_source': {'text': 'this is a test'}
            },
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '2',
                '_score': 0.25316024,
                '_source': {'text': 'a second test'}
            }
        ]
    }
}
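
Pulling the document IDs and the total out of such a response takes only a few lines. The sketch below uses a trimmed copy of the response above; the helper name extract_ids_and_total is hypothetical. It also accepts the dict form of 'total' that Elasticsearch 7 and later return, which is not what the 6.3 output above shows.

```python
def extract_ids_and_total(response):
    """Return (list of integer ids, total hit count) from a search response."""
    hits = response['hits']
    ids = [int(hit['_id']) for hit in hits['hits']]
    total = hits['total']
    # Elasticsearch 7+ reports total as {'value': N, 'relation': ...}.
    if isinstance(total, dict):
        total = total['value']
    return ids, total


# Trimmed copy of the response shown above.
response = {
    'hits': {
        'total': 2,
        'max_score': 0.5753642,
        'hits': [
            {'_id': '1', '_score': 0.5753642, '_source': {'text': 'this is a test'}},
            {'_id': '2', '_score': 0.25316024, '_source': {'text': 'a second test'}},
        ],
    }
}
ids, total = extract_ids_and_total(response)
```

The hits arrive already sorted by _score, so the order of ids reflects relevance.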

4. Delete an index

>>> es.indices.delete('my_index')

Note: to delete a single document by id instead:

es.delete(index=index, doc_type=index, id=<id>)

Elasticsearch Configuration

1. ELASTICSEARCH_URL

  • config.py: elasticsearch configuration.
class Config(object):
    # ...
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL')
  • Update .env
ELASTICSEARCH_URL=http://localhost:9200


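2. Initialize Elasticsearch

Because Elasticsearch is not wrapped by a Flask extension, it cannot be instantiated in the global scope without an app instance: the URL comes from app.config, which is available only once the app has been created.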

app/__init__.py: Elasticsearch instance.

# ...
from elasticsearch import Elasticsearch

# ...

def create_app(config_class=Config):
    app = Flask(__name__)
    app.config.from_object(config_class)

    # ...
    app.elasticsearch = Elasticsearch([app.config['ELASTICSEARCH_URL']]) \
        if app.config['ELASTICSEARCH_URL'] else None

    # ...

If the ELASTICSEARCH_URL environment variable is not set, app.elasticsearch is None.


A Full-Text Search Abstraction

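Goals of the abstraction
  • Not tied to Elasticsearch, so the search engine is easy to replace
  • Generic across models, not limited to Post


1. Add __searchable__ = [] to the model

Add a __searchable__ attribute to each model that needs indexing, listing the fields that should be added to the index.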

app/models.py:

class Post(db.Model):
    __searchable__ = ['body']
    # ...

Note: __searchable__ is just a class attribute. It triggers no behavior by itself and only assists the functions defined later.
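
To illustrate how a helper can make use of this attribute, the sketch below builds an index payload from whatever fields a model lists. The plain Post class and the build_payload name are stand-ins for illustration, not part of the application code.

```python
class Post:
    """Plain stand-in for the SQLAlchemy model, for illustration only."""
    __searchable__ = ['body']

    def __init__(self, id, body, author):
        self.id = id
        self.body = body
        self.author = author


def build_payload(model):
    """Collect only the fields named in __searchable__."""
    return {field: getattr(model, field) for field in model.__searchable__}


payload = build_payload(Post(1, 'hello world', 'susan'))
```

Only body ends up in the payload; author is ignored because it is not listed in __searchable__.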


2. Wrap the search functions in app/search.py

from flask import current_app

def add_to_index(index, model):
    if not current_app.elasticsearch:
        return
    payload = {}
    for field in model.__searchable__:
        payload[field] = getattr(model, field)
    current_app.elasticsearch.index(index=index, doc_type=index, id=model.id,
                                    body=payload)

def remove_from_index(index, model):
    if not current_app.elasticsearch:
        return
    current_app.elasticsearch.delete(index=index, doc_type=index, id=model.id)

def query_index(index, query, page, per_page):
    if not current_app.elasticsearch:
        return [], 0
    search = current_app.elasticsearch.search(
        index=index, doc_type=index,
        body={'query': {'multi_match': {'query': query, 'fields': ['*']}},
              'from': (page - 1) * per_page, 'size': per_page})
    ids = [int(hit['_id']) for hit in search['hits']['hits']]
    return ids, search['hits']['total']

The application talks to Elasticsearch only through app/search.py, which makes it easy to replace the search engine later.

Notes:

  • id=model.id makes the unique id the same in Elasticsearch and in the SQLAlchemy database, which later helps with targeted deletes and with ordering search results through CASE.
  • add_to_index() serves for both adding and updating.
  • multi_match searches across multiple fields.
  • 'fields': ['*'] tells Elasticsearch to look in all the fields (those listed in __searchable__), i.e. to search the entire index.

    This is useful to make this function generic, since different models can have different field names in the index.

  • SQLAlchemy's paginate() is not available here, so 'from' must be computed by hand as (page - 1) * per_page.

  • A list comprehension extracts the IDs from the hits.
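
The pagination arithmetic can be checked in isolation. The following sketch builds the same request body as query_index(); the helper name build_search_body is hypothetical.

```python
def build_search_body(query, page, per_page):
    """Build the request body used by query_index() for a 1-based page number."""
    return {
        'query': {'multi_match': {'query': query, 'fields': ['*']}},
        'from': (page - 1) * per_page,  # number of results to skip
        'size': per_page,               # number of results to return
    }


first_page = build_search_body('one two', 1, 10)
third_page = build_search_body('one two', 3, 10)
```

Page 1 skips nothing, while page 3 with 10 results per page skips the first 20.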


3. Test

  • Test (some posts must be added to the database first)
>>> from app.search import add_to_index, remove_from_index, query_index
>>> for post in Post.query.all():
...     add_to_index('posts', post)
>>> query_index('posts', 'one two three four five', 1, 100)
([15, 13, 12, 4, 11, 8, 14], 7)
  • Clean up the test data
>>> app.elasticsearch.indices.delete('posts')
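
Since the application only calls the three functions in app/search.py, another engine can be swapped in behind the same signatures. The sketch below is a deliberately naive in-memory replacement, useful perhaps for unit tests; its word-count scoring is an assumption made for illustration and has nothing to do with how Elasticsearch ranks results.

```python
_indexes = {}  # index name -> {document id -> payload dict}


def add_to_index(index, model):
    payload = {field: getattr(model, field) for field in model.__searchable__}
    _indexes.setdefault(index, {})[model.id] = payload


def remove_from_index(index, model):
    _indexes.get(index, {}).pop(model.id, None)


def query_index(index, query, page, per_page):
    terms = query.lower().split()
    scored = []
    for doc_id, payload in _indexes.get(index, {}).items():
        words = ' '.join(str(v) for v in payload.values()).lower().split()
        score = sum(words.count(term) for term in terms)  # naive relevance
        if score:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    start = (page - 1) * per_page
    ids = [doc_id for _, doc_id in scored[start:start + per_page]]
    return ids, len(scored)


class Post:
    """Plain stand-in for the SQLAlchemy model."""
    __searchable__ = ['body']

    def __init__(self, id, body):
        self.id = id
        self.body = body


posts = [Post(1, 'one two'), Post(2, 'three'), Post(3, 'six')]
for post in posts:
    add_to_index('posts', post)
found = query_index('posts', 'one two three', 1, 10)
remove_from_index('posts', posts[0])
found_after_delete = query_index('posts', 'one two three', 1, 10)
```

The return value keeps the (ids, total) shape of the Elasticsearch version, so callers would not need to change.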

Integrating Searches with SQLAlchemy

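The approach in app/search.py has two problems:

1. query_index() returns IDs rather than model objects

  • We want the model objects directly, so that they can be passed to templates for rendering

Solution: build a SQL query from the IDs to fetch the corresponding model objects.


2. When posts are added or deleted, add_to_index / remove_from_index must be called explicitly

  • This is bug-prone and lets Elasticsearch and the SQLAlchemy database drift further and further out of sync

Solution: use SQLAlchemy events to listen on db.session, so that Elasticsearch is updated automatically whenever the SQLAlchemy database changes.


To solve both problems, create a mixin class: SearchableMixin

  • The mixin class acts as the glue layer between SQLAlchemy and Elasticsearch
  • A model that inherits from SearchableMixin automatically manages its associated full-text index.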

1. app/models.py: SearchableMixin class.

from app.search import add_to_index, remove_from_index, query_index

class SearchableMixin(object):
    @classmethod
    def search(cls, expression, page, per_page):
        ...  # body shown in 1-1

    @classmethod
    def before_commit(cls, session):
        ...  # body shown in 1-2

    @classmethod
    def after_commit(cls, session):
        ...  # body shown in 1-2

    @classmethod
    def reindex(cls):
        ...  # body shown in 1-3

1-1 - search()

    @classmethod
    def search(cls, expression, page, per_page):
        ids, total = query_index(cls.__tablename__, expression, page, per_page)
        if total == 0:
            return cls.query.filter_by(id=0), 0
        when = []
        for i in range(len(ids)):
            when.append((ids[i], i))
        return cls.query.filter(cls.id.in_(ids)).order_by(
            db.case(when, value=cls.id)), total
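  • Calls query_index() from app/search.py with index = cls.__tablename__
  • when = [(ids[i], i), ...]
  • In the returned cls.query.filter(), cls.id.in_(ids) is SQLAlchemy expression syntax (note: this requires filter, not filter_by)
  • In order_by, CASE compares value (cls.id) with ids[i] from each tuple in when; when cls.id == ids[i], the tuple's i is returned as the sort position.

As a result, the model objects returned by search() are ordered the same way as the IDs.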
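
The effect of the CASE ordering can be seen with plain SQL. The sketch below uses the standard-library sqlite3 module and a hypothetical four-row post table rather than the application's database; the SQL that SQLAlchemy generates may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)')
conn.executemany('INSERT INTO post VALUES (?, ?)',
                 [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')])

ids = [3, 1, 4]  # relevance order, as a search engine might return it

placeholders = ', '.join('?' for _ in ids)
# One "WHEN <id> THEN <position>" clause per id; positions are plain ints.
whens = ' '.join(f'WHEN ? THEN {position}' for position in range(len(ids)))
sql = (f'SELECT id, body FROM post WHERE id IN ({placeholders}) '
       f'ORDER BY CASE id {whens} END')

# The ids are bound twice: once for IN (...), once for the WHEN clauses.
rows = conn.execute(sql, ids + ids).fetchall()
ordered_ids = [row[0] for row in rows]
```

Without the ORDER BY CASE clause the database would be free to return the rows in primary-key order, losing the relevance ranking.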


1-2 - before_commit and after_commit

    @classmethod
    def before_commit(cls, session):
        session._changes = {
            'add': list(session.new),
            'update': list(session.dirty),
            'delete': list(session.deleted)
        }

    @classmethod
    def after_commit(cls, session):
        for obj in session._changes['add']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['update']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['delete']:
            if isinstance(obj, SearchableMixin):
                remove_from_index(obj.__tablename__, obj)
        session._changes = None

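Notes:

  • Just before a commit, the objects that were added, modified, or deleted in db.session are saved in the session._changes dict.
  • Once the session is committed, these changes can no longer be tracked through the session attributes (session.new / session.dirty / session.deleted).
  • session._changes holds objects of every changed model, not only of models that inherit from SearchableMixin.
  • After the commit, after_commit therefore has to check whether each obj in session._changes is an instance of SearchableMixin.
  • In after_commit, add_to_index and remove_from_index are called with index=obj.__tablename__ rather than cls.__tablename__. If two models A and B both inherit from SearchableMixin and only A has changes in db.session, B.after_commit() would still run with cls pointing to B; obj.__tablename__ always refers to the model that was actually committed.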
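
The reason for the snapshot can be simulated without a database. In the sketch below, FakeSession, commit() and the indexed list are hypothetical stand-ins, not SQLAlchemy's API; the only behavior borrowed from a real session is that its pending-change collections are empty after a commit.

```python
class SearchableMixin:
    """Marker base class standing in for the mixin in app/models.py."""


class Post(SearchableMixin):
    __tablename__ = 'post'


class User:  # not searchable
    __tablename__ = 'user'


class FakeSession:
    """Hypothetical stand-in for db.session."""
    def __init__(self):
        self.new, self.dirty, self.deleted = [], [], []
        self._changes = None


indexed = []  # records (action, index name) instead of calling Elasticsearch


def before_commit(session):
    session._changes = {
        'add': list(session.new),
        'update': list(session.dirty),
        'delete': list(session.deleted),
    }


def after_commit(session):
    for action in ('add', 'update', 'delete'):
        for obj in session._changes[action]:
            if isinstance(obj, SearchableMixin):
                indexed.append((action, obj.__tablename__))
    session._changes = None


def commit(session):
    before_commit(session)
    # Like a real session, forget the pending changes once they are written.
    session.new, session.dirty, session.deleted = [], [], []
    after_commit(session)


session = FakeSession()
session.new = [Post(), User()]
session.deleted = [Post()]
commit(session)
```

The User object is skipped by the isinstance check, and the Post objects are still reachable after the commit only because before_commit copied them into _changes.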


1-3 - reindex

    @classmethod
    def reindex(cls):
        for obj in cls.query:
            add_to_index(cls.__tablename__, obj)

Add all the model objects in the database to the search index.
Note: iterating over cls.query yields the same objects as cls.query.all().


1-4 db.event

sqlalchemy event

Signature: sqlalchemy.event.listen(target, identifier, fn, *args, **kw)

db.event.listen(db.session, 'before_commit', SearchableMixin.before_commit)
db.event.listen(db.session, 'after_commit', SearchableMixin.after_commit)


Test
>>> Post.reindex()
>>> query, total = Post.search('one two three four five', 1, 5)
>>> total
7
>>> query.all()
[<Post five>, <Post two>, <Post one>, <Post one more>, <Post one>]

Note: the returned query is itself a SQLAlchemy query object, which is why query.all() can be called on it:

query = cls.query.filter(cls.id.in_(ids)).order_by(db.case(when, value=cls.id))


Search Form

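The search term should be passed in the URL as a q parameter so that search results can be accessed directly, as in https://www.google.com/search?q=python

For the search term submitted by the client to be added to the URL as a query string, the request method must be GET.

  • POST is used to submit the form data of application forms (as shown in earlier chapters)
  • GET is the request method used when a URL is typed into the browser or a link is clicked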
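
What a GET submission puts into the URL can be reproduced with the standard library. This is only an illustration of the query-string format; the search term and page number are made-up values, and Flask does the parsing itself through request.args.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Encode the form fields the way a browser does for method="get".
url = 'http://localhost:5000/search?' + urlencode({'q': 'flask search', 'page': 2})

# Decode them again, roughly what the server sees in the query string.
params = parse_qs(urlsplit(url).query)
search_term = params['q'][0]
```

Because the whole search lives in the URL, the result page can be bookmarked or shared as a link.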


1. Create the form: app/main/forms.py: Search form.

from flask import request

class SearchForm(FlaskForm):
    q = StringField(_l('Search'), validators=[DataRequired()])

    def __init__(self, *args, **kwargs):
        if 'formdata' not in kwargs:
            kwargs['formdata'] = request.args
        if 'csrf_enabled' not in kwargs:
            kwargs['csrf_enabled'] = False
        super(SearchForm, self).__init__(*args, **kwargs)
  • Only one text field, q, is defined, with no submit button (a form that has a text field is submitted when Enter is pressed).

    For a form that has a text field, the browser will submit the form when you press Enter with the focus on the field, so a button is not needed.

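  • formdata determines where Flask-WTF reads the form submission from; the default is request.form. For GET requests it is set to request.args, so that Flask-WTF reads the form data from the query string.

  • csrf_enabled: forms get CSRF protection by default, implemented through a CSRF token added to the form ({{ form.hidden_tag() }}). For clickable search links to work, CSRF validation has to be bypassed.


2. Display the search form (visible on all pages, except error pages)

The conventional approach: create a form object in every route, then pass the form to all the templates.

Instead, use a before_request handler to instantiate the form once and store it as g.search_form = SearchForm().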

app / main / routes.py:

from flask import g
from app.main.forms import SearchForm

@bp.before_app_request
def before_request():
    if current_user.is_authenticated:
        current_user.last_seen = datetime.utcnow()
        db.session.commit()
        g.search_form = SearchForm()
    g.locale = str(get_locale())


  • g is tied to the request and lasts through the whole life of that request, and so does the search_form attached to it.
  • When the before_request handler ends and the view function for the requested URL is invoked to handle the request, g stays the same.
  • g is specific to each request and each client: even when the server handles multiple requests from different clients at the same time, g works as private storage for each request, independent of what happens in other requests handled concurrently.


3. Insert g.search_form into app/templates/base.html

            ...
            <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
                <ul class="nav navbar-nav">
                    ... home and explore links ...
                </ul>
                {% if g.search_form %}
                <form class="navbar-form navbar-left" method="get"
                        action="{{ url_for('main.search') }}">
                    <div class="form-group">
                        {{ g.search_form.q(size=20, class='form-control',
                            placeholder=g.search_form.q.label.text) }}
                    </div>
                </form>
                {% endif %}
                ...


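  • The template first checks whether g.search_form exists (it is only set for authenticated users).
  • method="get", because the form data must be submitted in the query string through a GET request.
  • action="{{ url_for('main.search') }}": earlier forms had an empty action because they were submitted to the same page that rendered the form. The search form appears on every page, so it has to state explicitly where it is submitted.
  • In short, action specifies where the form is sent when it is submitted.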


Search View Function

1. Create a view function to handle search requests such as http://localhost:5000/search?q=search-words

app/main/routes.py: search view function.

@bp.route('/search')
@login_required
def search():

    if not g.search_form.validate():
        return redirect(url_for('main.explore'))
    # just validate field values, without checking how the data was submitted. 

    page = request.args.get('page', 1, type=int)
    per_page = current_app.config['POSTS_PER_PAGE']

    posts, total = Post.search(g.search_form.q.data, page, per_page)

    next_url = url_for('main.search', q=g.search_form.q.data, page=page+1) \
        if total > page * per_page else None
    prev_url = url_for('main.search', q=g.search_form.q.data, page=page-1) \
        if page > 1 else None

    return render_template('search.html', title=_('Search'), posts=posts,
                            next_url=next_url, prev_url=prev_url)

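# url_for() only builds the URL; extra keyword arguments such as q and page
# become query-string parameters, as in http://localhost:5000/search?q=search-words.
  • form.validate() only validates the field values, not how the data was submitted (form.validate_on_submit() would also require a submission method such as POST).
  • Post.search(), the search() classmethod from SearchableMixin, returns the search results together with the total number of matches.
  • q=g.search_form.q.data, submitted by the form, serves as the query expression.
  • page and per_page are handled as in the other view functions.
  • The second return value, total, is used to compute next_url.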
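
The link logic can be tested apart from Flask. The sketch below returns page numbers instead of URLs; the helper name page_links is hypothetical.

```python
def page_links(total, page, per_page):
    """Return (next_page, prev_page); None means there is no such page."""
    next_page = page + 1 if total > page * per_page else None
    prev_page = page - 1 if page > 1 else None
    return next_page, prev_page


# 7 matching posts, 5 per page: two pages in total.
on_first_page = page_links(7, 1, 5)
on_second_page = page_links(7, 2, 5)
```

With 7 results and 5 per page, the first page links forward only and the second page links back only.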


2. Create the search.html template

app / templates / search.html: search results template.

{% extends "base.html" %}

{% block app_content %}
    <h1>{{ _('Search Results') }}</h1>
    {% for post in posts %}
        {% include '_post.html' %}
    {% endfor %}
    <nav aria-label="...">
        <ul class="pager">
            <li class="previous{% if not prev_url %} disabled{% endif %}">
                <a href="{{ prev_url or '#' }}">
                    <span aria-hidden="true">&larr;</span>
                    {{ _('Previous results') }}
                </a>
            </li>
            <li class="next{% if not next_url %} disabled{% endif %}">
                <a href="{{ next_url or '#' }}">
                    {{ _('Next results') }}
                    <span aria-hidden="true">&rarr;</span>
                </a>
            </li>
        </ul>
    </nav>
{% endblock %}
