Introduction
Add full-text search to Microblog: for a given search term, return all posts that contain it, sorted by relevance in descending order.
Intro to Full-Text Search Engines
1. Open-source full-text search engines:
- Elasticsearch
- Apache Solr
- Whoosh
- Xapian
- Sphinx
2. Databases with search capabilities:
- SQLite、MySQL、PostgreSQL
- MongoDB、CouchDB
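Relational databases do offer search, but SQLAlchemy does not support that feature, so you would have to write raw SQL yourself or find a library that provides high-level access to text search while cooperating with SQLAlchemy.
Elasticsearch is very popular, partly as a member of the ELK stack (Elasticsearch-Logstash-Kibana, used for indexing logs), so this project uses Elasticsearch.
Note: keep the text indexing and searching functions in a separate module. If the search engine needs to be replaced later, only the functions in that module have to be rewritten.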
Installing Elasticsearch
1. Before installing Elasticsearch, install JDK 8
How to Install Java 8 on Debian 9/8/7 via PPA
How to Install JAVA 8 on Ubuntu 18.04/16.04, LinuxMint 18/17
Two ways to install JDK 7 / JDK 8 on Ubuntu
1-1 Add Java 8 PPA
- Create a new Apt configuration file, /etc/apt/sources.list.d/java-8-debian.list:
sudo vim /etc/apt/sources.list.d/java-8-debian.list
- Add the following content:
deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main
- Import the GPG key (used to verify packages before installation):
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886
1-2 Install Java 8
sudo apt-get update
sudo apt-get install oracle-java8-installer
1-3 Verify the Java installation
Set the default version
sudo apt-get install oracle-java8-set-default
The apt repository provides the package oracle-java8-set-default to set Java 8 as the default Java version.
Check the version
$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
1-4 Set the JAVA_HOME and JRE_HOME environment variables
Edit the environment variables (for the current user)
vim ~/.bashrc
To set them system-wide, edit /etc/environment instead
Append the following to ~/.bashrc
# set oracle jdk environment
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=${JAVA_HOME}/jre
Apply the environment variables immediately
source ~/.bashrc
2. Install Elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.deb.sha512
shasum -a 512 -c elasticsearch-6.3.0.deb.sha512
sudo dpkg -i elasticsearch-6.3.0.deb
The shasum command compares the SHA-512 of the downloaded Debian package with the published checksum and should output elasticsearch-{version}.deb: OK. Only install the package with dpkg once the check passes.
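The same check can be sketched in Python, which may help on systems without shasum. This is a minimal sketch, not part of the original notes: the helper names sha512_of and matches_checksum_file are hypothetical, and it assumes the checksum file holds the hex digest as its first whitespace-separated token, followed by the file name.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(package_path, checksum_path):
    """Compare a package against a '<hex digest>  <file name>' checksum file."""
    # Assumption: the digest is the first token in the checksum file.
    expected = Path(checksum_path).read_text().split()[0].lower()
    return sha512_of(package_path) == expected
```

Calling matches_checksum_file('elasticsearch-6.3.0.deb', 'elasticsearch-6.3.0.deb.sha512') would then return True only when the download is intact.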
3. Start / stop Elasticsearch
- Running / stopping Elasticsearch with systemd
sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service
- To start Elasticsearch automatically at boot:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
- Verify that Elasticsearch is running by opening
http://localhost:9200
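The check can also be scripted instead of opening the URL in a browser. The sketch below uses only the standard library. The function names are hypothetical, and it assumes the root endpoint answers with a JSON document that contains a version object with a number key.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def parse_root_response(body):
    """Extract the version number from the JSON served at the root endpoint."""
    info = json.loads(body)
    return info.get("version", {}).get("number")


def elasticsearch_version(url="http://localhost:9200", timeout=3):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return parse_root_response(response.read().decode("utf-8"))
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None
```

With the service started as above, elasticsearch_version() should return the installed version; if the service is stopped it returns None instead of raising.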
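4. Install the Python client for Elasticsearch
(venv) $ pip install elasticsearch
Note: update requirements.txt. The client's major version should match the server's; the doc_type argument used below belongs to the 6.x API, so for a 6.x server the client can be pinned with pip install "elasticsearch>=6.0.0,<7.0.0".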
Elasticsearch Tutorial
1. Create an Elasticsearch connection
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch('http://localhost:9200')
Instantiate the class, passing the URL as an argument
2. Write data (JSON) into an index: es.index()
>>> es.index(index='my_index', doc_type='my_index', id=1, body={'text': 'this is a test'})
>>> es.index(index='my_index', doc_type='my_index', id=2, body={'text': 'a second test'})
- index: Elasticsearch's storage container
- doc_type: the document type; one index can store several types
- id: a unique identifier for the document
- body: a JSON object with the data, containing the fields and their values
3. Search: es.search()
>>> es.search(index='my_index', doc_type='my_index',
...     body={'query': {'match': {'text': 'this test'}}})
Note the format of body: {'query': {'match': {<field>: <expression>}}}
The response is a Python dict:
{
    'took': 1,
    'timed_out': False,
    '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
    'hits': {
        'total': 2,
        'max_score': 0.5753642,
        'hits': [
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '1',
                '_score': 0.5753642,
                '_source': {'text': 'this is a test'}
            },
            {
                '_index': 'my_index',
                '_type': 'my_index',
                '_id': '2',
                '_score': 0.25316024,
                '_source': {'text': 'a second test'}
            }
        ]
    }
}
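To make the structure of the response concrete, here is a small sketch that pulls the ids, scores, and total out of a dict shaped like the one above. The helper name summarize_hits is hypothetical. It also accepts a total given as a {'value': n} dict, which is an assumption about Elasticsearch 7 and later rather than something these notes cover.

```python
def summarize_hits(response):
    """Return ([(id, score), ...], total) from a search response dict."""
    hits = response["hits"]
    total = hits["total"]
    # Assumption: newer servers report the total as {'value': n, ...}
    # instead of a plain integer, so accept both shapes.
    if isinstance(total, dict):
        total = total["value"]
    pairs = [(int(hit["_id"]), hit["_score"]) for hit in hits["hits"]]
    return pairs, total


response = {
    "hits": {
        "total": 2,
        "max_score": 0.5753642,
        "hits": [
            {"_id": "1", "_score": 0.5753642, "_source": {"text": "this is a test"}},
            {"_id": "2", "_score": 0.25316024, "_source": {"text": "a second test"}},
        ],
    }
}
print(summarize_hits(response))  # ([(1, 0.5753642), (2, 0.25316024)], 2)
```

The hits arrive already sorted by _score in descending order, which is why the first document (it matches both "this" and "test") comes first.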
4. Delete an index
>>> es.indices.delete('my_index')
Note: to delete a single document by id instead:
es.delete(index=index, doc_type=index, id=<id>)
Elasticsearch Configuration
1、ELASTICSEARCH_URL
config.py: Elasticsearch configuration.
class Config(object):
    # ...
    ELASTICSEARCH_URL = os.environ.get('ELASTICSEARCH_URL')
- Update .env:
ELASTICSEARCH_URL=http://localhost:9200
2. Initialize Elasticsearch
Elasticsearch is not wrapped by a Flask extension, so the client cannot be instantiated in the global scope the way the extensions are: its URL comes from app.config, which is only available once an app instance exists.
app/__init__.py: Elasticsearch instance.
# ...
from elasticsearch import Elasticsearch
# ...
def create_app(config_class=Config):
    app = Flask(__name__)
    app.config.from_object(config_class)
    # ...
    app.elasticsearch = Elasticsearch([app.config['ELASTICSEARCH_URL']]) \
        if app.config['ELASTICSEARCH_URL'] else None
    # ...
If the URL environment variable is not configured, app.elasticsearch is None.
A Full-Text Search Abstraction
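Goals of the abstraction
- Avoid being tied to Elasticsearch, so the search engine is easy to replace
- Keep it generic across models, not limited to Post
1. Add __searchable__ = [] to the model
Give each model that needs indexing a __searchable__ attribute that lists the fields to include in the index.
app/models.py:
class Post(db.Model):
    __searchable__ = ['body']
    # ...
Note: __searchable__ is just a variable; it has no behavior of its own and only assists the functions written later.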
2. Encapsulate the search functions in app/search.py
from flask import current_app
def add_to_index(index, model):
    if not current_app.elasticsearch:
        return
    payload = {}
    for field in model.__searchable__:
        payload[field] = getattr(model, field)
    current_app.elasticsearch.index(index=index, doc_type=index, id=model.id,
                                    body=payload)
def remove_from_index(index, model):
    if not current_app.elasticsearch:
        return
    current_app.elasticsearch.delete(index=index, doc_type=index, id=model.id)
def query_index(index, query, page, per_page):
    if not current_app.elasticsearch:
        return [], 0
    search = current_app.elasticsearch.search(
        index=index, doc_type=index,
        body={'query': {'multi_match': {'query': query, 'fields': ['*']}},
              'from': (page - 1) * per_page, 'size': per_page})
    ids = [int(hit['_id']) for hit in search['hits']['hits']]
    return ids, search['hits']['total']
The application reaches Elasticsearch only through app/search.py, which makes it easy to swap the search engine later.
Notes:
- id=model.id: the Elasticsearch document and the SQLAlchemy row share the same unique id, which later makes targeted deletes and the CASE-based ordering of search results possible.
- add_to_index() both adds and updates: indexing an existing id replaces that document.
- multi_match: search across multiple fields.
- 'fields': ['*']: tells Elasticsearch to look in all the fields (those listed in __searchable__), i.e. search the entire index. This keeps the function generic, since different models can have different field names in the index.
- SQLAlchemy's paginate() is not available here, so the offset is computed by hand: 'from': (page - 1) * per_page
- A list comprehension extracts the IDs from the hits.
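To illustrate why the three-function interface makes the engine replaceable, here is a hedged sketch of an in-memory stand-in with the same signatures. It is not the notes' implementation: the _indexes dict, the word-count scoring, and the Note class are all hypothetical, and the scoring is far cruder than Elasticsearch's relevance ranking.

```python
_indexes = {}  # index name -> {document id -> payload dict}


def add_to_index(index, model):
    """Store (or overwrite) the searchable fields of `model`."""
    payload = {field: getattr(model, field) for field in model.__searchable__}
    _indexes.setdefault(index, {})[model.id] = payload


def remove_from_index(index, model):
    """Forget `model`; a missing document is silently ignored."""
    _indexes.get(index, {}).pop(model.id, None)


def query_index(index, query, page, per_page):
    """Return (ids for the requested page, total number of matches)."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, payload in _indexes.get(index, {}).items():
        words = " ".join(str(v) for v in payload.values()).lower().split()
        score = sum(1 for word in words if word in terms)
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties broken by ascending id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    start = (page - 1) * per_page  # same offset arithmetic as 'from'
    ids = [doc_id for _, doc_id in scored[start:start + per_page]]
    return ids, len(scored)


class Note:
    """Hypothetical model with one searchable field."""
    __searchable__ = ["body"]

    def __init__(self, id, body):
        self.id = id
        self.body = body


for note in (Note(1, "this is a test"), Note(2, "a second test"), Note(3, "unrelated")):
    add_to_index("notes", note)
print(query_index("notes", "this test", 1, 10))  # ([1, 2], 2)
```

Because callers only ever see (ids, total), code built on top of these functions would not need to change if this module were swapped for the Elasticsearch-backed one.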
3. Test
- Test (add some posts first)
>>> from app.search import add_to_index, remove_from_index, query_index
>>> for post in Post.query.all():
...     add_to_index('posts', post)
>>> query_index('posts', 'one two three four five', 1, 100)
([15, 13, 12, 4, 11, 8, 14], 7)
- Clean up the test index
>>> app.elasticsearch.indices.delete('posts')
Integrating Searches with SQLAlchemy
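The approach in app/search.py has two problems:
1. One of the values returned by query_index() is a list of IDs, not model objects
- We want the model objects directly, so they can be passed to templates for rendering
Solution: build a SQL query from the IDs that fetches the corresponding model objects
2. add_to_index and remove_from_index must be called explicitly whenever posts are added or deleted
- This breeds bugs: a missed call leaves Elasticsearch and the SQLAlchemy database increasingly out of sync
Solution: use SQLAlchemy events to listen on db.session, so that Elasticsearch is updated automatically whenever the SQLAlchemy database changes
To solve both problems, create a mixin class, SearchableMixin
- The mixin acts as the glue layer between SQLAlchemy and Elasticsearch
- A model that inherits from SearchableMixin automatically manages its associated full-text index.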
1. app/models.py: SearchableMixin class.
from app.search import add_to_index, remove_from_index, query_index
class SearchableMixin(object):
    @classmethod
    def search(cls, expression, page, per_page):
        ...  # body shown in 1-1
    @classmethod
    def before_commit(cls, session):
        ...  # body shown in 1-2
    @classmethod
    def after_commit(cls, session):
        ...  # body shown in 1-2
    @classmethod
    def reindex(cls):
        ...  # body shown in 1-3
1-1 search()
    @classmethod
    def search(cls, expression, page, per_page):
        ids, total = query_index(cls.__tablename__, expression, page, per_page)
        if total == 0:
            return cls.query.filter_by(id=0), 0
        when = []
        for i in range(len(ids)):
            when.append((ids[i], i))
        return cls.query.filter(cls.id.in_(ids)).order_by(
            db.case(when, value=cls.id)), total
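- query_index() comes from app/search.py and is called with index = cls.__tablename__
- when = [(ids[i], i), ...] pairs each id with its position in the result list
- In the returned cls.query.filter(), cls.id.in_(ids) is SQLAlchemy expression syntax (note: filter, not filter_by)
- order_by uses a CASE expression: each tuple's ids[i] is compared with value (cls.id), and when cls.id == ids[i] the tuple's i is returned as the sort key.
As a result, search() returns the model objects in the same order as the IDs, i.e. by descending relevance.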
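The CASE-based ordering can be tried without Flask or SQLAlchemy. The following sketch runs the equivalent raw SQL against an in-memory SQLite database. The post table, its rows, and the fetch_in_order helper are hypothetical and exist only for illustration.

```python
import sqlite3


def fetch_in_order(conn, ids):
    """Fetch rows whose id is in `ids`, preserving the order of `ids`."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    # One "WHEN <id> THEN <position>" clause per id, like the `when` list.
    whens = " ".join("WHEN ? THEN ?" for _ in ids)
    sql = (
        f"SELECT id, body FROM post WHERE id IN ({placeholders}) "
        f"ORDER BY CASE id {whens} END"
    )
    params = list(ids)
    for position, post_id in enumerate(ids):
        params.extend([post_id, position])
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO post (id, body) VALUES (?, ?)",
    [(1, "one"), (2, "two"), (3, "three"), (4, "four")],
)
print(fetch_in_order(conn, [3, 1, 4]))  # [(3, 'three'), (1, 'one'), (4, 'four')]
```

Without the ORDER BY CASE clause the database would be free to return the rows in primary-key order, which would discard the relevance ranking computed by the search engine.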
1-2 before_commit and after_commit
    @classmethod
    def before_commit(cls, session):
        session._changes = {
            'add': list(session.new),
            'update': list(session.dirty),
            'delete': list(session.deleted)
        }
    @classmethod
    def after_commit(cls, session):
        for obj in session._changes['add']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['update']:
            if isinstance(obj, SearchableMixin):
                add_to_index(obj.__tablename__, obj)
        for obj in session._changes['delete']:
            if isinstance(obj, SearchableMixin):
                remove_from_index(obj.__tablename__, obj)
        session._changes = None
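Notes:
- before_commit saves the objects that are about to be added, updated, or deleted in session._changes = {}
- Once the session has been committed, they can no longer be tracked through the session attributes (session.new / session.dirty / session.deleted)
- session._changes holds objects of every changed model, not only of models that inherit from SearchableMixin
- After the commit, after_commit therefore has to check whether each obj in session._changes is a SearchableMixin instance.
- When after_commit calls add_to_index and remove_from_index, it passes index=obj.__tablename__, not cls.__tablename__. Suppose two models A and B both inherit from SearchableMixin and only A has changes in db.session: a handler bound to B would still run, and cls.__tablename__ would point at B. Using obj.__tablename__ guarantees the index always matches the model that was actually committed.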
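The snapshot-then-process pattern can be sketched without a database. Everything below is a hypothetical stand-in (FakeSession, Searchable, the indexed list); it only mimics the two facts the notes rely on: the pending lists are empty after a commit, and objects of unrelated models pass through the same session.

```python
class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy session (illustration only)."""

    def __init__(self):
        self.new, self.dirty, self.deleted = [], [], []
        self._changes = None

    def commit(self, before, after):
        before(self)
        # A real commit flushes pending state, so the lists end up empty.
        self.new, self.dirty, self.deleted = [], [], []
        after(self)


class Searchable:
    """Marker base class playing the role of SearchableMixin."""


class Post(Searchable):
    __tablename__ = "post"


class User:
    __tablename__ = "user"


indexed = []  # (table name, object) pairs that would be sent to the index


def before_commit(session):
    # Snapshot the pending objects while they are still visible.
    session._changes = {"add": list(session.new)}


def after_commit(session):
    for obj in session._changes["add"]:
        if isinstance(obj, Searchable):
            # Use the object's own table name, not the handler's class.
            indexed.append((obj.__tablename__, obj))
    session._changes = None


session = FakeSession()
post, user = Post(), User()
session.new = [post, user]
session.commit(before_commit, after_commit)
print([name for name, _ in indexed])  # ['post']
```

Only the Post object reaches the index: the User object is skipped by the isinstance check, and by the time after_commit runs session.new is already empty, which is why the snapshot is needed.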
1-3 reindex
    @classmethod
    def reindex(cls):
        for obj in cls.query:
            add_to_index(cls.__tablename__, obj)
Adds all the model objects in the database to the search index.
Note: iterating over cls.query yields the same objects as cls.query.all()
1-4 db.event
Signature: sqlalchemy.event.listen(target, identifier, fn, *args, **kw)
db.event.listen(db.session, 'before_commit', SearchableMixin.before_commit)
db.event.listen(db.session, 'after_commit', SearchableMixin.after_commit)
Test
>>> Post.reindex()
>>> query, total = Post.search('one two three four five', 1, 5)
>>> total
7
>>> query.all()
[<Post five>, <Post two>, <Post one>, <Post one more>, <Post one>]
Note: the returned query is itself a SQLAlchemy query object, which is why query.all() works on it:
query = cls.query.filter(cls.id.in_(ids)).order_by(db.case(when, value=cls.id))
Search Form
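We want the search term to travel in the URL as a q argument, so that search results can be reached directly, as in https://www.google.com/search?q=python
For the search term submitted by the client to appear in the URL as a query string, the request method must be GET.
- POST is used to submit the form data of the app's forms (shown in earlier chapters)
- GET is the request method used when a URL is typed into the browser or a link is clicked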
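As a quick illustration of what "in the query string" means, the standard library can build and parse such a URL. The search_url helper and the /search path are assumptions made for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(term, page=None, base="/search"):
    """Build a GET URL that carries the search term as the q argument."""
    params = {"q": term}
    if page is not None:
        params["page"] = page
    return f"{base}?{urlencode(params)}"


url = search_url("full text search", page=2)
print(url)                            # /search?q=full+text+search&page=2
print(parse_qs(urlsplit(url).query))  # {'q': ['full text search'], 'page': ['2']}
```

Because the whole request is captured in the URL, the link can be bookmarked or shared, which is not possible with a POST submission.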
1. Create the form: app/main/forms.py: Search form.
from flask import request
class SearchForm(FlaskForm):
    q = StringField(_l('Search'), validators=[DataRequired()])
    def __init__(self, *args, **kwargs):
        if 'formdata' not in kwargs:
            kwargs['formdata'] = request.args
        if 'csrf_enabled' not in kwargs:
            kwargs['csrf_enabled'] = False
        super(SearchForm, self).__init__(*args, **kwargs)
- Only one text field, q, is defined and there is no submit button: for a form that has a text field, the browser submits the form when you press Enter with the focus on the field, so a button is not needed.
- formdata determines where Flask-WTF gets the form submission from. The default is request.form; for a GET request it is changed to request.args so that Flask-WTF reads the form data from the query string.
- csrf_enabled: forms have CSRF protection by default, implemented by a CSRF token that the template adds through {{ form.hidden_tag() }}. For clickable search links to work, CSRF validation has to be bypassed.
2. Display the search form (visible on all pages, except error pages)
The usual approach would be to create a form object in every route and pass it to all the templates.
Instead, instantiate the form once in a before_request handler: g.search_form = SearchForm()
app/main/routes.py:
from flask import g
from app.main.forms import SearchForm
@bp.before_app_request
def before_request():
    if current_user.is_authenticated:
        current_user.last_seen = datetime.utcnow()
        db.session.commit()
        g.search_form = SearchForm()
    g.locale = str(get_locale())
- g is tied to the request and lives for the whole request life cycle, so the search_form attached to it does as well.
- When the before_request handler finishes and the view function for the requested URL is invoked to handle the request, g is unchanged.
- The g variable is specific to each request and each client, so even if your web server is handling multiple requests at a time for different clients, you can still rely on g to work as private storage for each request, independently of what goes on in other requests that are handled concurrently.
3. Insert g.search_form into app/templates/base.html
...
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
    <ul class="nav navbar-nav">
        ... home and explore links ...
    </ul>
    {% if g.search_form %}
    <form class="navbar-form navbar-left" method="get"
            action="{{ url_for('main.search') }}">
        <div class="form-group">
            {{ g.search_form.q(size=20, class='form-control',
                placeholder=g.search_form.q.label.text) }}
        </div>
    </form>
    {% endif %}
...
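- {% if g.search_form %} checks that g.search_form exists before rendering the form, since some pages, such as error pages, may not have it defined.
- method="get": the form data has to be submitted through a GET request so that it ends up in the query string.
- action="{{ url_for('main.search') }}": earlier forms had an empty action because they were submitted to the same page that rendered the form. The search form appears on every page, so it must say explicitly where it is submitted for handling.
- In other words, action specifies the URL that handles the form when it is submitted.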
Search View Function
1. Create a view function to handle the search request (http://localhost:5000/search?q=search-words)
app/main/routes.py: search view function.
@bp.route('/search')
@login_required
def search():
    # validate() only checks the field values, not how the data was submitted.
    if not g.search_form.validate():
        return redirect(url_for('main.explore'))
    page = request.args.get('page', 1, type=int)
    per_page = current_app.config['POSTS_PER_PAGE']
    posts, total = Post.search(g.search_form.q.data, page, per_page)
    # url_for() only builds the URL; extra keyword arguments such as q and page
    # become query-string arguments, as in http://localhost:5000/search?q=search-words.
    next_url = url_for('main.search', q=g.search_form.q.data, page=page + 1) \
        if total > page * per_page else None
    prev_url = url_for('main.search', q=g.search_form.q.data, page=page - 1) \
        if page > 1 else None
    return render_template('search.html', title=_('Search'), posts=posts,
                           next_url=next_url, prev_url=prev_url)
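- form.validate() only validates the field values, without checking how the data was submitted (form.validate_on_submit() additionally requires a submission method such as POST).
- The search() classmethod from SearchableMixin is called as Post.search() to get the list of search results.
- The submitted g.search_form.q.data serves as the query expression.
- page and per_page are set up as in the other view functions.
- The second return value, total, is used to compute next_url.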
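The pagination arithmetic in the view can be isolated into a small pure function, which makes the boundary cases easy to check. The name page_links is hypothetical; the conditions are the ones used for next_url and prev_url above.

```python
def page_links(total, page, per_page):
    """Return (prev_page, next_page); None marks a link that is disabled."""
    # There is a next page only if results remain beyond the current page.
    next_page = page + 1 if total > page * per_page else None
    # There is a previous page for every page after the first.
    prev_page = page - 1 if page > 1 else None
    return prev_page, next_page


print(page_links(total=7, page=1, per_page=5))  # (None, 2)
print(page_links(total=7, page=2, per_page=5))  # (1, None)
```

With 7 matches and 5 results per page, page 1 only offers a next link and page 2 only a previous link; when total is an exact multiple of per_page, the last page correctly has no next link.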
2. Create the template search.html
app/templates/search.html: search results template.
{% extends "base.html" %}
{% block app_content %}
    <h1>{{ _('Search Results') }}</h1>
    {% for post in posts %}
        {% include '_post.html' %}
    {% endfor %}
    <nav aria-label="...">
        <ul class="pager">
            <li class="previous{% if not prev_url %} disabled{% endif %}">
                <a href="{{ prev_url or '#' }}">
                    <span aria-hidden="true">←</span>
                    {{ _('Previous results') }}
                </a>
            </li>
            <li class="next{% if not next_url %} disabled{% endif %}">
                <a href="{{ next_url or '#' }}">
                    {{ _('Next results') }}
                    <span aria-hidden="true">→</span>
                </a>
            </li>
        </ul>
    </nav>
{% endblock %}