Background
To count word frequencies quickly and efficiently, I chose KenLM. For details on KenLM, see its source code: https://github.com/kpu/kenlm
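Installation
The author provides an installation guide: https://kheafield.com/code/kenlm/ . Provided all other dependencies are already in place, the installation steps are: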
wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j4
PS: The installation described here was done on CentOS 7 with gcc 5.2.
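Before running the steps above, it can help to confirm that the tools they rely on are actually on PATH. The following is a minimal sketch; the function name require_tools and the tool list in the usage comment are assumptions for illustration, not part of KenLM.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: <name>" line per absent tool and returns 1 if any is absent.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (assumed tool list, taken from the commands used in this post):
# require_tools wget tar cmake make g++ && g++ --version | head -n 1
```

If the function prints nothing and returns 0, every listed tool was found.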
Boost
When the Boost version is too old, the cmake step will very likely fail with an error.
Solution:
yum install -y boost boost-devel boost-doc
Running cmake again then reports the following error:
CMake Error at /usr/local/share/cmake-3.15/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Could NOT find Boost (missing: thread) (found suitable version "1.55.0",
minimum required is "1.41.0")
Call Stack (most recent call first):
/usr/local/share/cmake-3.15/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
/usr/local/share/cmake-3.15/Modules/FindBoost.cmake:2142 (find_package_handle_standard_args)
CMakeLists.txt:66 (find_package)
CMake Warning (dev) in /usr/local/share/cmake-3.15/Modules/FindBoost.cmake:
Policy CMP0011 is not set: Included scripts do automatic cmake_policy PUSH
and POP. Run "cmake --help-policy CMP0011" for policy details. Use the
cmake_policy command to set the policy and suppress this warning.
The included script
/usr/local/share/cmake-3.15/Modules/FindBoost.cmake
affects policy settings. CMake is implying the NO_POLICY_SCOPE option for
compatibility, so the effects are applied to the including context.
Call Stack (most recent call first):
CMakeLists.txt:66 (find_package)
This warning is for project developers. Use -Wno-dev to suppress it.
-- Configuring incomplete, errors occurred!
See also "/home/data1/devtools/kenlm/build/CMakeFiles/CMakeOutput.log".
See also "/home/data1/devtools/kenlm/build/CMakeFiles/CMakeError.log".
This shows that cmake could not find where Boost was installed. So where is the installed Boost?
First, check which Boost-related libraries are installed: rpm -qa|grep boost
Then look up where a specific package was installed. For example, to list the files installed by boost-thread-1.53.0-27.el7.x86_64, run: rpm -ql boost-thread-1.53.0-27.el7.x86_64
The same query on boost-devel-1.53.0-27.el7.x86_64 reveals its include and lib install directories. In summary, the Boost include and lib directories are:
/usr/include/boost/
/usr/lib64/
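The lookup above can be sketched as a small filter that reduces a package file list to the directories holding Boost shared libraries. This is only an illustration: the function name boost_lib_dirs is hypothetical, and it assumes the library files are named libboost_*.so*.

```shell
# Read a list of file paths on stdin (for example the output of rpm -ql)
# and print the unique directories that contain libboost_*.so* files.
boost_lib_dirs() {
    grep -E '/libboost_[^/]*\.so[^/]*$' \
        | sed 's|/[^/]*$||' \
        | sort -u
}

# Example call, using the package names mentioned in this post:
# rpm -ql boost-thread boost-devel | boost_lib_dirs
```

The sed step strips the file name from each path, so only the containing directory remains.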
Add these two directories to CMakeLists.txt:
SET(BOOST_INCLUDEDIR "/usr/include/boost/")
SET(BOOST_LIBRARYDIR "/usr/lib64/")
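As an alternative to editing CMakeLists.txt, the same two FindBoost hints can be passed on the cmake command line with -D. The helper below is a hedged sketch (boost_hint_args is a hypothetical name) that only assembles the arguments; it reuses the paths from the text. Note that the FindBoost documentation describes BOOST_INCLUDEDIR as the directory that contains the boost/ folder, so /usr/include may be the more conventional value.

```shell
# Print the FindBoost hint arguments for a given include dir and library dir.
boost_hint_args() {
    printf '%s %s\n' "-DBOOST_INCLUDEDIR=$1" "-DBOOST_LIBRARYDIR=$2"
}

# Example call from kenlm/build. The unquoted $(...) is intentional so the
# output splits into two arguments; the paths must not contain spaces.
# cmake .. $(boost_hint_args /usr/include/boost/ /usr/lib64/)
```

Passing the hints this way leaves the KenLM source tree unmodified, which makes it easier to update later.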
Specifying the compiler
Building again produces the following error:
CMakeFiles/tokenize_piece_test.dir/tokenize_piece_test.cc.o: In function `boost::unit_test::make_test_case(boost::unit_test::callback0<boost::unit_test::ut_detail::unused> const&, boost::unit_test::basic_cstring<char const>)':
tokenize_piece_test.cc:(.text._ZN5boost9unit_test14make_test_caseERKNS0_9callback0INS0_9ut_detail6unusedEEENS0_13basic_cstringIKcEE[_ZN5boost9unit_test14make_test_caseERKNS0_9callback0INS0_9ut_detail6unusedEEENS0_13basic_cstringIKcEE]+0x11): undefined reference to `boost::unit_test::ut_detail::normalize_test_case_name[abi:cxx11](boost::unit_test::basic_cstring<char const>)'
collect2: error: ld returned 1 exit status
make[2]: *** [tests/tokenize_piece_test] Error 1
make[1]: *** [util/CMakeFiles/tokenize_piece_test.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 34%] Linking CXX static library ../../lib/libkenlm_interpolate.a
[ 34%] Built target kenlm_interpolate
[ 35%] Linking CXX executable ../tests/string_stream_test
CMakeFiles/string_stream_test.dir/string_stream_test.cc.o: In function `boost::unit_test::make_test_case(boost::unit_test::callback0<boost::unit_test::ut_detail::unused> const&, boost::unit_test::basic_cstring<char const>)':
string_stream_test.cc:(.text._ZN5boost9unit_test14make_test_caseERKNS0_9callback0INS0_9ut_detail6unusedEEENS0_13basic_cstringIKcEE[_ZN5boost9unit_test14make_test_caseERKNS0_9callback0INS0_9ut_detail6unusedEEENS0_13basic_cstringIKcEE]+0x11): undefined reference to `boost::unit_test::ut_detail::normalize_test_case_name[abi:cxx11](boost::unit_test::basic_cstring<char const>)'
collect2: error: ld returned 1 exit status
make[2]: *** [tests/string_stream_test] Error 1
make[1]: *** [util/CMakeFiles/string_stream_test.dir/all] Error 2
[ 36%] Linking CXX executable ../tests/sorted_uniform_test
CMakeFiles/sorted_uniform_test.dir/sorted_uniform_test.cc.o: In function `boost::unit_test::make_test_case(boost::unit_test::callback0<boost::unit_test::ut_detail::unused> const&, boost::unit_test::basic_cstring<char const>)':
sorted_uniform_test.cc:(.text._ZN5boost9unit_test14make_test_caseERKNS0_9callback0INS0_9ut_detail6unusedEEENS0_13basic_cstringIKcEE[_ZN5boost9unit_test14make_test_caseERKNS0_9callback0INS0_9ut_detail6unusedEEENS0_13basic_cstringIKcEE]+0x11): undefined reference to `boost::unit_test::ut_detail::normalize_test_case_name[abi:cxx11](boost::unit_test::basic_cstring<char const>)'
collect2: error: ld returned 1 exit status
make[2]: *** [tests/sorted_uniform_test] Error 1
make[1]: *** [util/CMakeFiles/sorted_uniform_test.dir/all] Error 2
make: *** [all] Error 2
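Solution:
Change the C++ compiler settings. The [abi:cxx11] tag in the undefined references indicates that the test code was compiled against the newer libstdc++ ABI that gcc 5 uses by default, while the system Boost libraries appear to have been built with the older ABI. Adding the following command at the top of CMakeLists.txt makes the compiler use the old ABI: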
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_GLIBCXX_USE_CXX11_ABI=0")
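To see which libstdc++ ABI a compiler selects, with and without the flag above, one option is to dump its predefined macros and look at _GLIBCXX_USE_CXX11_ABI. The sketch below assumes g++ is the compiler in use; abi_from_macros is a hypothetical helper that only interprets the macro dump.

```shell
# Read a preprocessor macro dump on stdin and print which libstdc++ ABI
# it selects: "new" (value 1), "old" (value 0), or "unknown" otherwise.
abi_from_macros() {
    value=$(awk '$1 == "#define" && $2 == "_GLIBCXX_USE_CXX11_ABI" { print $3 }')
    case "$value" in
        1) echo "new" ;;
        0) echo "old" ;;
        *) echo "unknown" ;;
    esac
}

# Example calls (assuming g++ is on PATH):
# echo '#include <string>' | g++ -x c++ -dM -E - | abi_from_macros
# echo '#include <string>' | g++ -x c++ -D_GLIBCXX_USE_CXX11_ABI=0 -dM -E - | abi_from_macros
```

A compiler that predates the dual ABI does not define the macro at all, in which case the helper prints "unknown".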
Once make succeeds, you can add the bin directory to your PATH environment variable.
Add the KenLM bin directory in ~/.bashrc as follows:
export PATH=$PATH:/usr/local/cuda-9.0/bin:/home/data1/devtools/kenlm/build/bin
source ~/.bashrc
Of course, you can also copy the compiled binaries you need directly into the directory where you plan to use them and run them from there.
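If the ~/.bashrc step is scripted, it is worth making it idempotent so that repeated runs do not keep appending the same line. A minimal sketch, where add_path_once is a hypothetical helper name:

```shell
# Append an "export PATH=$PATH:<dir>" line to a shell rc file,
# but only if that exact line is not already present.
add_path_once() {
    rcfile=$1
    dir=$2
    line="export PATH=\$PATH:$dir"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq "$line" "$rcfile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}

# Example call, using the build path from this post:
# add_path_once ~/.bashrc /home/data1/devtools/kenlm/build/bin
```

After calling it, run source ~/.bashrc or open a new shell for the change to take effect.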