轉-多模字符串匹配算法-Aho–Corasick

轉自https://www.cnblogs.com/hwyang/p/6836438.html

背景

在做實際工作中，最簡單也最常用的一種自然語言處理方法就是關鍵詞匹配，例如我們要對n條文本進行過濾，那本身是一個過濾詞表的，通常進行過濾的代碼如下

for (String document : documents) {
    for (String filterWord : filterWords) {
        if (document.contains(filterWord)) {
            //process ...
        }
    }
}

如果文本的數量是n，過濾詞的數量是k，那麼複雜度爲O(nk)；如果關鍵詞的數量較多，那麼支行效率是非常低的。

計算機科學中，Aho–Corasick算法是由Alfred V. Aho和Margaret J.Corasick 發明的字符串搜索算法，用於在輸入的一串字符串中匹配有限組“字典”中的子串。它與普通字符串匹配的不同點在於同時與所有字典串進行匹配。算法均攤情況下具有近似於線性的時間複雜度，約爲字符串的長度加所有匹配的數量。然而由於需要找到所有匹配數，如果每個子串互相匹配（如字典爲a，aa，aaa，aaaa，輸入的字符串爲aaaa），算法的時間複雜度會近似於匹配的二次函數。

原理

在一般的情況下，針對一個文本進行關鍵詞匹配，在匹配的過程中要與每個關鍵詞一一進行計算。也就是說，每與一個關鍵詞進行匹配，都要重新從文檔的開始到結束進行掃描。AC自動機的思想是，在開始時先通過詞表，對以下三種情況進行緩存：

按照字符轉移成功進行跳轉（success表）
按照字符轉移失敗進行跳轉（fail表）
匹配成功輸出表（output表）

因此在匹配的過程中，無需從新從文檔的開始進行匹配，而是通過緩存直接進行跳轉，從而實現近似於線性的時間複雜度。

構建

構建的過程分三個步驟，分別對success表，fail表，output表進行構建。其中output表在構建sucess和fail表進行都進行了補充。fail表是一對一的，output表是一對多的。

按照字符轉移成功進行跳轉（success表）

sucess表實際就是一棵trie樹，構建的方式和trie樹是一樣的，這裏就不贅述。

按照字符轉移失敗進行跳轉（fail表）

設這個節點上的字母爲C，沿着他父親的失敗指針走，直到走到一個節點，他的兒子中也有字母爲C的節點。然後把當前節點的失敗指針指向那個字母也爲C的兒子。如果一直走到了root都沒找到，那就把失敗指針指向root。使用廣度優先搜索BFS，層次遍歷節點來處理，每一個節點的失敗路徑。

匹配成功輸出表（output表）

匹配

舉例說明，按順序先後添加關鍵詞he，she，,his，hers。在匹配ushers過程中。先構建三個表，如下圖，實線是sucess表，虛線是fail表，結點後的單詞是ourput表。

代碼

import java.util.*;

/**
 */
public class ACTrie {
    private boolean failureStatesConstructed = false; //是否建立了failure表
    private Node root; //根結點

    public ACTrie() {
        this.root = new Node(true);
    }

    /**
     * 添加一個模式串
     * @param keyword
     */
    public void addKeyword(String keyword) {
        if (keyword == null || keyword.length() == 0) {
            return;
        }
        Node currentState = this.root;
        for (Character character : keyword.toCharArray()) {
            currentState = currentState.insert(character);
        }
        currentState.addEmit(keyword);
    }

    /**
     * 模式匹配
     *
     * @param text 待匹配的文本
     * @return 匹配到的模式串
     */
    public Collection<Emit> parseText(String text) {
        checkForConstructedFailureStates();
        Node currentState = this.root;
        List<Emit> collectedEmits = new ArrayList<>();
        for (int position = 0; position < text.length(); position++) {
            Character character = text.charAt(position);
            currentState = currentState.nextState(character);
            Collection<String> emits = currentState.emit();
            if (emits == null || emits.isEmpty()) {
                continue;
            }
            for (String emit : emits) {
                collectedEmits.add(new Emit(position - emit.length() + 1, position, emit));
            }
        }
        return collectedEmits;
    }



    /**
     * 檢查是否建立了failure表
     */
    private void checkForConstructedFailureStates() {
        if (!this.failureStatesConstructed) {
            constructFailureStates();
        }
    }

    /**
     * 建立failure表
     */
    private void constructFailureStates() {
        Queue<Node> queue = new LinkedList<>();

        // 第一步，將深度爲1的節點的failure設爲根節點
        //特殊處理：第二層要特殊處理，將這層中的節點的失敗路徑直接指向父節點(也就是根節點)。
        for (Node depthOneState : this.root.children()) {
            depthOneState.setFailure(this.root);
            queue.add(depthOneState);
        }
        this.failureStatesConstructed = true;

        // 第二步，爲深度 > 1 的節點建立failure表，這是一個bfs 廣度優先遍歷
        /**
         * 構造失敗指針的過程概括起來就一句話：設這個節點上的字母爲C，沿着他父親的失敗指針走，直到走到一個節點，他的兒子中也有字母爲C的節點。
         * 然後把當前節點的失敗指針指向那個字母也爲C的兒子。如果一直走到了root都沒找到，那就把失敗指針指向root。
         * 使用廣度優先搜索BFS，層次遍歷節點來處理，每一個節點的失敗路徑。　　
         */
        while (!queue.isEmpty()) {
            Node parentNode = queue.poll();

            for (Character transition : parentNode.getTransitions()) {
                Node childNode = parentNode.find(transition);
                queue.add(childNode);

                Node failNode = parentNode.getFailure().nextState(transition);

                childNode.setFailure(failNode);
                childNode.addEmit(failNode.emit());
            }
        }
    }

    private static class Node{
        private Map<Character, Node> map;
        private List<String> emits; //輸出
        private Node failure; //失敗中轉
        private boolean isRoot = false;//是否爲根結點

        public Node(){
            map = new HashMap<>();
            emits = new ArrayList<>();
        }

        public Node(boolean isRoot) {
            this();
            this.isRoot = isRoot;
        }

        public Node insert(Character character) {
            Node node = this.map.get(character);
            if (node == null) {
                node = new Node();
                map.put(character, node);
            }
            return node;
        }

        public void addEmit(String keyword) {
            emits.add(keyword);
        }

        public void addEmit(Collection<String> keywords) {
            emits.addAll(keywords);
        }

        /**
         * success跳轉
         * @param character
         * @return
         */
        public Node find(Character character) {
            return map.get(character);
        }

        /**
         * 跳轉到下一個狀態
         * @param transition 接受字符
         * @return 跳轉結果
         */
        private Node nextState(Character transition) {
            Node state = this.find(transition);  // 先按success跳轉
            if (state != null) {
                return state;
            }
            //如果跳轉到根結點還是失敗，則返回根結點
            if (this.isRoot) {
                return this;
            }
            // 跳轉失敗的話，按failure跳轉
            return this.failure.nextState(transition);
        }

        public Collection<Node> children() {
            return this.map.values();
        }

        public void setFailure(Node node) {
            failure = node;
        }

        public Node getFailure() {
            return failure;
        }

        public Set<Character> getTransitions() {
            return map.keySet();
        }

        public Collection<String> emit() {
            return this.emits == null ? Collections.<String>emptyList() : this.emits;
        }

    }

    private static class Emit{

        private final String keyword;//匹配到的模式串
        private final int start;
        private final int end;

        /**
         * 構造一個模式串匹配結果
         * @param start   起點
         * @param end     重點
         * @param keyword 模式串
         */
        public Emit(final int start, final int end, final String keyword) {
            this.start = start;
            this.end = end;
            this.keyword = keyword;
        }

        /**
         * 獲取對應的模式串
         * @return 模式串
         */
        public String getKeyword() {
            return this.keyword;
        }

        @Override
        public String toString() {
            return super.toString() + "=" + this.keyword;
        }
    }

    public static void main(String[] args) {
        ACTrie trie = new ACTrie();
        trie.addKeyword("hers");
        trie.addKeyword("his");
        trie.addKeyword("she");
        trie.addKeyword("he");
        Collection<Emit> emits = trie.parseText("ushers");
        for (Emit emit : emits) {
            System.out.println(emit.start + " " + emit.end + "\t" + emit.getKeyword());
        }
    }
}

原始地址：http://www.thinkinglite.cn/

轉-多模字符串匹配算法-Aho–Corasick

轉自https://www.cnblogs.com/hwyang/p/6836438.html

背景

原理

構建

按照字符轉移成功進行跳轉（success表）

按照字符轉移失敗進行跳轉（fail表）

匹配成功輸出表（output表）

匹配

代碼

HTML頁面關於高分屏的設置

北歐瑞典挪威芬蘭瑞士TikTok海外網紅與YouTube博主的合作模式

歐洲英國德國法國TikTok與YouTube海外網紅達人的完美合作策略

druid數據源 xml配置

火狐瀏覽器打不開問題

類和接口關係

學習阿里巴巴開發手冊-10

學習阿里巴巴開發手冊-5

學習阿里巴巴開發手冊-補充

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結