A First Look at Regular Expressions (Java String, regex, Grok)

Preface

What is a regular expression? Definitions vary slightly from site to site. Here I quote the Wikipedia version: In theoretical computer science and formal language theory, a regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. Put simply: it is a sequence of characters that defines a search pattern.

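Many programming languages ship with built-in regex (short for regular expression) support. The matching algorithms were written by experts; we mortals only need to learn how to use them. The syntax differs slightly from language to language. I first learned regular expressions through Java.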

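A few useful links.
Regular expression quick start
A collection of commonly used regular expressions
Java regular expression tutorial site (Chinese)
Java regular expression tutorial site (English)
Online tester for Grok patterns
Learning Grok regular expressions
The Boyer-Moore (BM) algorithm explained: the legendary Ctrl + F?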


Talk is cheap, show me the code

Regex in String

Four String methods take a regex: matches(), split(), replaceFirst(), and replaceAll().

package regextest;

public class RegexTestStrings
{
    public final static String EXAMPLE_TEST = 
    "This is my small example string which I'm going to use for pattern matching   .";

    public static void main(String[] args)
    {
        // Check whether the whole string matches: a word character first, then anything
        System.out.println(EXAMPLE_TEST.matches("\\w.*")); 

        // Split the string on runs of whitespace; returns the pieces as a String array
        String[] splitString = (EXAMPLE_TEST.split("\\s+")); 
        System.out.println(splitString.length);
        for (String string : splitString)
        {
            System.out.println(string);
        }

        // Replace only the first substring that matches "\\s+" with "才"
        System.out.println(EXAMPLE_TEST.replaceFirst("\\s+", "才")); 

        // Replace every substring that matches "\\s+" with "才"
        System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "才")); 
    }
}

Output:

true
15
This
is
my
small
example
string
which
I'm
going
to
use
for
pattern
matching
.
This才is my small example string which I'm going to use for pattern matching   .
This才is才my才small才example才string才which才I'm才going才to才use才for才pattern才matching才.
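
One detail in the example above is easy to miss: matches() returns true only when the regex covers the entire string, which is why the pattern needs the trailing ".*". The following is a minimal sketch of the difference between a whole-string match and a partial match via Matcher.find(). The class name MatchesVsFind and its helper methods are invented for illustration.

```java
import java.util.regex.Pattern;

class MatchesVsFind
{
    // True only if the regex matches the entire input (the rule String.matches follows).
    static boolean wholeMatch(String input, String regex)
    {
        return input.matches(regex);
    }

    // True if the regex matches anywhere inside the input.
    static boolean partialMatch(String input, String regex)
    {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args)
    {
        String text = "This is my small example string";
        System.out.println(wholeMatch(text, "\\w"));      // false: one word character is not the whole string
        System.out.println(wholeMatch(text, "\\w.*"));    // true
        System.out.println(partialMatch(text, "small"));  // true
    }
}
```

If only a substring needs to match, find() avoids padding the pattern with ".*" on both sides.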

java.util.regex

Import java.util.regex.Matcher and java.util.regex.Pattern; both classes offer many useful methods.

package regextest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{

    public static void main(String[] args)
    {
        String line = "The price for iPhone is 5288, which is a little expensive.";
        // Extract the only number in the string. Parentheses create groups;
        // ^ inside a character class means negation.
        String regex = "(.*[^\\d])(\\d+)(.*)";

        // Create the Pattern object
        Pattern pattern = Pattern.compile(regex);

        // Create the Matcher object
        Matcher matcher = pattern.matcher(line);

        if (matcher.find())
        {
            System.out.println("Found value: " + matcher.group(2));
        }
        else
        {
            System.out.println("NO MATCH");
        }
    }

}

Output:

Found value: 5288
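
The pattern above assumes the line contains exactly one number. If a line held several, the greedy ".*" in the first group would leave only the last one in group 2. A simpler alternative, sketched here, is to call find() in a loop with the pattern "\\d+" and collect every match. The class name FindAllNumbers and the method extractNumbers are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllNumbers
{
    // Compiled once and reused; Pattern objects are immutable.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits in the input, in order of appearance.
    static List<String> extractNumbers(String input)
    {
        List<String> numbers = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(input);
        while (matcher.find())
        {
            numbers.add(matcher.group());
        }
        return numbers;
    }

    public static void main(String[] args)
    {
        String line = "The price for iPhone is 5288, which is a little expensive.";
        System.out.println(extractNumbers(line)); // [5288]
    }
}
```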

Grok: a more powerful regex

Grok builds on Matcher and Pattern and imports a number of additional packages. It is an upgrade that exposes more methods and is more powerful.

import com.google.code.regexp.Matcher;
import com.google.code.regexp.Pattern;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.commons.lang3.StringUtils;

One website defines Grok as follows:
Java Grok is simple tool that allows you to easily parse logs and other files (single line). With Java Grok, you can turn unstructured log and event data into structured data (JSON).

Java Grok program is a great tool for parsing log data and program output. You can match any number of complex patterns on any number of inputs (processes and files) and have custom reactions.
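
To make the idea of named patterns concrete, here is a rough sketch of how a %{NAME} placeholder could be expanded into a named capturing group using only java.util.regex. This is not how the Grok library itself is implemented. The class GrokExpansionSketch and the pattern definitions in it are simplified assumptions, far looser than the ones in real Grok pattern files, and nested placeholders are not handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrokExpansionSketch
{
    // Matches a %{NAME} placeholder inside a Grok-style pattern.
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{(\\w+)\\}");

    // Simplified stand-ins for pattern definitions (assumptions, not Grok's own).
    private static final Map<String, String> DEFINITIONS = new HashMap<>();
    static
    {
        DEFINITIONS.put("MONTH", "[A-Z][a-z]{2}");
        DEFINITIONS.put("MONTHDAY", "\\d{1,2}");
        DEFINITIONS.put("TIME", "\\d{2}:\\d{2}:\\d{2}");
        DEFINITIONS.put("YEAR", "\\d{4}");
    }

    // Replaces each %{NAME} with a named capturing group (?<NAME>definition).
    static String expand(String grokPattern)
    {
        Matcher m = PLACEHOLDER.matcher(grokPattern);
        StringBuffer regex = new StringBuffer();
        while (m.find())
        {
            String name = m.group(1);
            String definition = DEFINITIONS.get(name);
            if (definition == null)
            {
                throw new IllegalArgumentException("Unknown pattern: " + name);
            }
            String group = "(?<" + name + ">" + definition + ")";
            // quoteReplacement keeps backslashes in the definition literal.
            m.appendReplacement(regex, Matcher.quoteReplacement(group));
        }
        m.appendTail(regex);
        return regex.toString();
    }

    public static void main(String[] args)
    {
        String regex = expand("%{MONTH}\\s+%{MONTHDAY}\\s+%{TIME}\\s+%{YEAR}");
        Matcher m = Pattern.compile(regex).matcher("Mon Nov  9 06:47:33 2015; UDP");
        if (m.find())
        {
            // prints: 2015 Nov 9 06:47:33
            System.out.println(m.group("YEAR") + " " + m.group("MONTH") + " "
                + m.group("MONTHDAY") + " " + m.group("TIME"));
        }
    }
}
```

Named groups let the caller ask for a field by name instead of counting parentheses, which is the convenience Grok offers on a larger scale.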

A simple example: read data from a log file and extract the information we want, namely the timestamp and the source IP.

Input:

Mon Nov  9 06:47:33 2015; UDP; eth1; 461 bytes; from 88.150.240.169:tag-pm to 123.40.222.170:sip
Mon Nov  9 06:47:34 2015; UDP; eth1; 463 bytes; from 88.150.240.169:49208 to 123.40.222.170:sip
Mon Nov  9 06:47:34 2015; UDP; eth1; 463 bytes; from 88.150.240.169:54159 to 123.40.222.170:sip
Mon Nov  9 06:47:34 2015; UDP; eth1; 463 bytes; from 88.150.240.169:53640 to 123.40.222.170:sip
Mon Nov  9 06:47:34 2015; UDP; eth1; 463 bytes; from 88.150.240.169:52483 t

package com.yz.utils.grok.api;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;

public class GrokTest
{

    public static void main(String[] args)
    {
        FileInputStream   fiStream = null;
        InputStreamReader iStreamReader = null;
        // Wraps the InputStreamReader to improve performance: BufferedReader
        // reads ahead into a buffer instead of going to the stream on every read.
        BufferedReader    bReader = null;

        try
        {
            String line = "";
            // Read bytes from a file in the file system
            fiStream = new FileInputStream("C:\\dev1\\javagrok\\javagrok\\iptraf_eth1_15.06.11"); 

            // InputStreamReader is the bridge from a byte stream to a character stream
            iStreamReader = new InputStreamReader(fiStream); 

            // Reads the file contents from the character stream; wraps the InputStreamReader
            bReader = new BufferedReader(iStreamReader);     

            Grok grok = new Grok();
            // Grok ships with many ready-made patterns that can be used directly,
            // and existing patterns can be combined into new ones.
            grok.addPatternFromFile("c:\\dev1\\cloudshield\\patterns\\patterns"); 

            grok.addPattern("fromIP", "%{IPV4}");
            // Compile a pattern; whitespace tripped me up once here
            grok.compile(".*%{MONTH}\\s+%{MONTHDAY}\\s+%{TIME}\\s+%{YEAR}.*%{fromIP}.* to 123.40.222.170:sip"); 
            Match match = null;

            while((line = bReader.readLine()) != null)       // mind the parentheses here; they tripped me up once
            {
                match = grok.match(line);
                match.captures();
                if(!match.isNull())
                {
                    System.out.print(match.toMap().get("YEAR").toString() + " ");
                    System.out.print(match.toMap().get("MONTH").toString() + " ");
                    System.out.print(match.toMap().get("MONTHDAY").toString() + " ");
                    System.out.print(match.toMap().get("TIME").toString() + " ");
                    System.out.print(match.toMap().get("fromIP").toString() + "\n");
                }
                else
                {
                    System.out.println("NO MATCH");
                }
            }

        }
        catch (FileNotFoundException fnfe)
        {
            System.out.println("file not found exception");
            fnfe.printStackTrace();
        }
        catch (IOException ioe)
        {
            System.out.println("input/output exception");
            ioe.printStackTrace();
        }
        catch (Exception e)
        {
            System.out.println("unknown exception");
            e.printStackTrace();
        }
        finally
        {
            try
            {
                if(bReader!=null)   
                {
                    bReader.close();
                    bReader=null;
                }
                if(iStreamReader!=null)
                {
                    iStreamReader.close();
                    iStreamReader=null;
                }
                if(fiStream!=null)
                {
                    fiStream.close();
                    fiStream=null;
                }
            }
            catch(IOException ioe)
            {
                System.out.println("input/output exception");
                ioe.printStackTrace();
            }
        }
    }

}

Output:

2015 Nov 9 06:47:33 88.150.240.169
2015 Nov 9 06:47:34 88.150.240.169
2015 Nov 9 06:47:34 88.150.240.169
2015 Nov 9 06:47:34 88.150.240.169
NO MATCH
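
For comparison, the same extraction can be approximated with java.util.regex alone, and the long finally block can be replaced by try-with-resources (Java 7 and later). This is a sketch under stated assumptions, not the Grok-based approach above. The class LogLineSketch and the hand-written pattern are invented for illustration, and the input is read from a StringReader rather than the log file so that the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineSketch
{
    // Hand-written approximation: month, day, time, year, then the source IPv4 address.
    private static final Pattern LINE = Pattern.compile(
        "(?<month>[A-Z][a-z]{2})\\s+(?<day>\\d{1,2})\\s+(?<time>\\d{2}:\\d{2}:\\d{2})\\s+(?<year>\\d{4})"
        + ".*from (?<fromIP>\\d{1,3}(?:\\.\\d{1,3}){3}):\\S+ to 123\\.40\\.222\\.170:sip");

    // Returns "year month day time ip" for a matching line, or "NO MATCH".
    static String summarize(String line)
    {
        Matcher m = LINE.matcher(line);
        if (!m.find())
        {
            return "NO MATCH";
        }
        return m.group("year") + " " + m.group("month") + " " + m.group("day") + " "
            + m.group("time") + " " + m.group("fromIP");
    }

    // try-with-resources closes the reader (and the source it wraps) even on an exception.
    static List<String> summarizeAll(Reader source) throws IOException
    {
        List<String> results = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                results.add(summarize(line));
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException
    {
        String sample =
            "Mon Nov  9 06:47:33 2015; UDP; eth1; 461 bytes; from 88.150.240.169:tag-pm to 123.40.222.170:sip\n"
            + "Mon Nov  9 06:47:34 2015; UDP; eth1; 463 bytes; from 88.150.240.169:52483 t\n";
        // prints: 2015 Nov 9 06:47:33 88.150.240.169
        // prints: NO MATCH
        for (String result : summarizeAll(new StringReader(sample)))
        {
            System.out.println(result);
        }
    }
}
```

Closing the outermost BufferedReader also closes the readers and streams it wraps, so a single resource in the try header is enough.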