Swiss Army Knife: A Detailed Guide to the sed Text-Processing Tool, with Examples

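Part 1: sed Basics

1) Introduction

sed is a stream editor that processes its input one line at a time. It copies the current line into a temporary buffer called the "pattern space", applies the sed commands to the contents of that buffer, and then writes the result to standard output. It then moves on to the next line and repeats the cycle until the end of the file. The input file itself is not modified; to keep the result, redirect the output to another file. sed is mainly used to edit one or more files automatically, to simplify repetitive edits, and to write text-conversion programs. A good command of sed can greatly speed up all kinds of file-processing work.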
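
As a minimal sketch of the cycle described above, the following shows that sed writes to standard output and leaves its input file alone unless the output is redirected. The scratch directory and the file names in.txt and out.txt are assumptions made for this illustration.

```shell
# Hypothetical scratch directory and file names, used only for this sketch.
workdir=$(mktemp -d)
printf '%s\n' 'hello world' 'goodbye world' > "$workdir/in.txt"

# sed writes the edited text to standard output...
sed 's/world/sed/' "$workdir/in.txt"

# ...so the original file is still unchanged at this point.
cat "$workdir/in.txt"

# Redirecting standard output keeps the edited copy in a new file.
sed 's/world/sed/' "$workdir/in.txt" > "$workdir/out.txt"
cat "$workdir/out.txt"
```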

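2) sed commands and options

Command	Function
a\	Append one or more lines of text after the current line. For multi-line text, every line except the last must end with "\"
c\	Replace the current line with the text that follows. For multi-line text, every line except the last must end with "\"
i\	Insert text before the current line. For multi-line text, every line except the last must end with "\"
d	Delete the line
h	Copy the pattern space to the hold space
H	Append the pattern space to the hold space
g	Copy the hold space to the pattern space, overwriting its contents
G	Append the hold space to the pattern space, after its existing contents
l	Print the line with non-printing characters made visible
p	Print the line
n	Read the next input line and continue with the next command rather than restarting from the first command
q	Quit sed
r	Read lines from a file and append them after the current line
!	Apply the command to all lines except the selected ones
s	Substitute one string for another
g	As a flag of the s command, replace every match on the line
w	Write the selected lines to a file
x	Exchange the contents of the hold space and the pattern space
y	Translate characters one for one (regular expressions cannot be used with y)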
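
The hold-space commands in the table (h, G, x and friends) are not exercised by the examples later in this article. As a rough sketch of how they interact with the pattern space, the following well-known one-liner prints its input in reverse line order; the three-line sample input is made up for illustration.

```shell
# 1!G  on every line except the first, append the hold space to the pattern space
# h    copy the pattern space (current line plus everything seen so far) to the hold space
# $p   on the last line, print the accumulated pattern space (-n suppresses all other output)
reversed=$(printf '%s\n' 'first' 'second' 'third' | sed -n '1!G;h;$p')
printf '%s\n' "$reversed"
```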

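Option	Function
-e	Apply several edits, i.e. more than one sed command, to each input line
-n	Suppress the default output
-f	Read the sed script from the named file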

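3) Regular expressions

As a powerful text-processing tool, sed naturally supports regular expressions. In sed, a regular expression is a pattern enclosed between forward slashes "/".

Metacharacter	Function	Example
^	Start-of-line anchor	/^my/ matches every line that begins with my
$	End-of-line anchor	/my$/ matches every line that ends with my
.	Matches any single character except newline	/m..y/ matches lines containing m, followed by any two characters, followed by y
*	Matches zero or more of the preceding character	/my*/ matches lines containing m followed by zero or more y
[]	Matches any one character in the set	/[Mm]y/ matches lines containing My or my
[^]	Matches any one character not in the set	/[^Mm]y/ matches lines containing y preceded by a character other than M or m
\(..\)	Saves the matched characters	1,20s/\(you\)self/\1r/ marks the pattern between the escaped parentheses and saves it as tag 1, which can then be referenced as \1. Up to 9 tags can be defined, numbered from the left, so the leftmost is tag 1. In this example, lines 1 through 20 are processed, you is saved as tag 1, and wherever youself is found it is replaced with your.
&	Stands for the matched string in the replacement	s/my/&/ The & symbol stands for the matched string, so here my is replaced with itself; s/my/[&]/ would produce [my]
\<	Start-of-word anchor	/\<my/ matches lines containing a word that begins with my
\>	End-of-word anchor	/my\>/ matches lines containing a word that ends with my
x\{m\}	m consecutive x	/9\{5\}/ matches lines containing five consecutive 9s
x\{m,\}	At least m consecutive x	/9\{5,\}/ matches lines containing at least five consecutive 9s
x\{m,n\}	At least m but no more than n consecutive x	/9\{5,7\}/ matches lines containing five to seven consecutive 9s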
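
A few of the table entries can be tried directly from the shell. The sketch below uses small made-up inputs to show tagging with \( \), the & symbol, and an interval expression.

```shell
# \(...\) saves "you" as tag 1; \1 reuses it in the replacement.
tagged=$(printf 'youself\n' | sed 's/\(you\)self/\1r/')
printf '%s\n' "$tagged"

# & stands for the whole matched string.
wrapped=$(printf 'my book\n' | sed 's/my/[&]/')
printf '%s\n' "$wrapped"

# x\{m\} requires m consecutive occurrences; -n with p prints only the matching lines.
fives=$(printf '%s\n' '9999' '99999' 'a999999b' | sed -n '/9\{5\}/p')
printf '%s\n' "$fives"
```

Note that the line with six 9s also matches, because it contains a run of five.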

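4) Ten short examples

sed has quite a few commands and options, so let's learn how it is used by working through some examples.

Example 1. The p command

The p command prints the pattern space. Because sed also prints every line by default, each line that matches the pattern is output twice.

For example: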

cricode>>cat sed.txt
#This is a sed command test file
Hello,world hello,world
if linux command www.cricode.com
welcome to cricode
cricode>>sed '/Hello/p' sed.txt
#This is a sed command test file
Hello,world hello,world
Hello,world hello,world
if linux command www.cricode.com
welcome to cricode

The line containing Hello is printed twice.

To print only the lines that match the pattern, use the -n option to suppress sed's default output.

cricode>>sed -n '/Hello/p' sed.txt
Hello,world hello,world

Example 2. The d command

The d command deletes the selected lines.

Delete lines 1 and 2:

cricode>>sed '1,2d' sed.txt
if linux command www.cricode.com
welcome to cricode

Delete the last line:

cricode>>sed '$d' sed.txt
#This is a sed command test file
Hello,world hello,world
if linux command www.cricode.com

Delete the lines containing the string cricode:

cricode>>sed '/cricode/d' sed.txt
#This is a sed command test file
Hello,world hello,world

Delete the lines that end with the string com:

cricode>>sed '/com$/d' sed.txt
#This is a sed command test file
Hello,world hello,world
welcome to cricode
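
The ! entry in the command table inverts an address, which pairs naturally with d. As a small sketch (the sample file is recreated under a hypothetical temporary path), the following deletes every line that does not contain cricode:

```shell
# Recreate the article's sample file in a hypothetical scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '#This is a sed command test file' \
  'Hello,world hello,world' \
  'if linux command www.cricode.com' \
  'welcome to cricode' > "$workdir/sed.txt"

# /cricode/!d deletes every line that does NOT match /cricode/.
kept=$(sed '/cricode/!d' "$workdir/sed.txt")
printf '%s\n' "$kept"
```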

Example 3. The s command

The s command substitutes text.

Replace world with WORLD in the file:

cricode>>sed 's/world/WORLD/' sed.txt
#This is a sed command test file
Hello,WORLD hello,world
if linux command www.cricode.com
welcome to cricode

Note that in Hello,WORLD hello,world only the first world was replaced.

Add the g flag to replace every occurrence on the line, as follows:

cricode>>sed 's/world/WORLD/g' sed.txt
#This is a sed command test file
Hello,WORLD hello,WORLD
if linux command www.cricode.com
welcome to cricode

Whole-line replacement: replace every line that contains a particular word with another string.

cricode>>sed 's/.*world/this is a replace line/' sed.txt
#This is a sed command test file
this is a replace line
if linux command www.cricode.com
welcome to cricode
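
The pattern .*world only reaches as far as the last world on the line, so it replaces the whole line here only because that line happens to end in world. A sketch of the difference, using a made-up input line with trailing text:

```shell
line='Hello,world and more'

# .*world stops at the last "world"; the text after it survives.
partial=$(printf '%s\n' "$line" | sed 's/.*world/this is a replace line/')
printf '%s\n' "$partial"

# Adding a trailing .* extends the match to the end of the line.
whole=$(printf '%s\n' "$line" | sed 's/.*world.*/this is a replace line/')
printf '%s\n' "$whole"
```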

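Example 4. The -e option

-e stands for expression: it introduces an editing command and is used when sed has to carry out several edits. All of the edits are applied to the line in the pattern space before sed starts on the next line.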

cricode>>sed -e '1,2d' -e 's/cricode/CRICODE/' sed.txt
if linux command www.CRICODE.com
welcome to CRICODE

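The output above is the result of a multiple edit using the -e option. The first edit deletes lines 1 and 2. The second edit replaces the first cricode on each remaining line with the uppercase CRICODE. Because both edits are applied line by line (that is, both commands run on the current line in the pattern space), the order of the editing commands affects the result.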
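
To see why the order matters, here is a small sketch with two substitutions applied to a made-up one-character input, first in one order and then in the other:

```shell
# s/a/b/ runs first, so the b it produces is then turned into c.
forward=$(printf 'a\n' | sed -e 's/a/b/' -e 's/b/c/')
printf '%s\n' "$forward"

# With the order reversed, s/b/c/ finds nothing to change and only s/a/b/ takes effect.
backward=$(printf 'a\n' | sed -e 's/b/c/' -e 's/a/b/')
printf '%s\n' "$backward"
```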

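Example 5. The r command

The r command is the read command. sed uses it to add the contents of a text file at a specific position in the text being processed. It is especially useful when information needs to be inserted into a file in a regular pattern.

Add a comment line after each line containing the word cricode (the comment text is stored in sed1.txt):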

cricode>>sed '/cricode/r sed1.txt' sed.txt
#This is a sed command test file
Hello,world hello,world
if linux command www.cricode.com
###This is comment information!!!!!!!!!!!!!
welcome to cricode
###This is comment information!!!!!!!!!!!!!

Example 6. The w command

cricode>>sed '/cricode/w tmp.txt' sed.txt
#This is a sed command test file
Hello,world hello,world
if linux command www.cricode.com
welcome to cricode
cricode>>cat tmp.txt
if linux command www.cricode.com
welcome to cricode

The command above writes the lines of sed.txt that contain the pattern cricode to tmp.txt.

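Example 7. The a\ command

The a\ command appends new text after the current line, i.e. the line currently in the pattern space. The text to append goes on a new line below the sed command. If it spans more than one line, every line except the last must end with a backslash. The last line ends with the closing quote, followed by the file name.

Append the line kidding??? after each line containing hello: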

cricode>>sed '/hello/a\kidding???' sed.txt
#This is a sed command test file
Hello,world hello,world
kidding???
if linux command www.cricode.com
welcome to cricode

In fact, the r command from Example 5 can achieve the same result.
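
The command in this example puts the text on the same line as a\, a one-line form that GNU sed accepts. The multi-line layout described at the start of this example would look roughly like the sketch below; the two appended lines and the shortened sample file are made up for illustration.

```shell
# Shortened copy of the sample file in a hypothetical scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'Hello,world hello,world' \
  'welcome to cricode' > "$workdir/sed.txt"

# The text starts on the line after a\ ; every text line except the last ends with a backslash.
appended=$(sed '/hello/a\
first added line\
second added line' "$workdir/sed.txt")
printf '%s\n' "$appended"
```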

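Example 8. The i\ command

The i\ command is the counterpart of a\: it inserts a line before the matching line.

Insert the line kidding??? before each line containing hello: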

cricode>>sed '/hello/i\kidding???' sed.txt
#This is a sed command test file
kidding???
Hello,world hello,world
if linux command www.cricode.com
welcome to cricode

Example 9. The c\ command

The c\ command changes the matching line to the text we supply.

Replace each line containing hello with kidding???:

cricode>>sed '/hello/c\kidding???' sed.txt
#This is a sed command test file
kidding???
if linux command www.cricode.com
welcome to cricode

This replacement can also be done with the approach from Example 3, using a regular expression that matches the whole line.

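Example 10. The y command

The y command is similar to the UNIX/Linux tr command: characters are translated one for one, from left to right. For example, y/abc/ABC/ converts every lowercase a to A, every lowercase b to B, and every lowercase c to C.

Convert the lowercase letters h and i in the text to uppercase H and I: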

cricode>>sed 'y/hi/HI/' sed.txt
#THIs Is a sed command test fIle
Hello,world Hello,world
If lInux command www.crIcode.com
welcome to crIcode

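Part 2: sed Scripts

Scripts make it easy to run commands in batches. A sed script is simply a series of sed commands stored in a file. Within a script, a command must not be followed by any trailing whitespace or extra text. If several commands share one line, separate them with semicolons. When a script runs, sed copies the first line of the input into the pattern space and applies every command in the script to it. Once that line is done, sed copies the next input line into the pattern space and applies every command again. Commands in a script file no longer need quotes to protect them from the shell.

Here is an example.

Suppose we want to do the following:

1) Insert a line of author information at the beginning of sed.txt: #Author:Jay13 2013-08-25

2) Replace every cricode in the text with CRICODE.

3) Append a line at the end of the text: good bye!

We can write a script named sedscript with the following contents: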

1i\#Author:Jay13 2013-08-25
s/cricode/CRICODE/g
$a\good bye!

Then apply the sed script to sed.txt:

cricode>>sed -f sedscript sed.txt
#Author:Jay13 2013-08-25
#This is a sed command test file
Hello,world hello,world
if linux command www.CRICODE.com
welcome to CRICODE
good bye!
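
For readers who want to reproduce this run end to end, the sketch below creates both files and applies the script. It is a variant rather than the author's exact script: it writes i\ and a\ in the multi-line form (text on the following line), and the scratch directory is an assumption.

```shell
# Recreate the sample file in a hypothetical scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '#This is a sed command test file' \
  'Hello,world hello,world' \
  'if linux command www.cricode.com' \
  'welcome to cricode' > "$workdir/sed.txt"

# Quoting the here-document delimiter keeps the shell from touching \ and $.
cat > "$workdir/sedscript" <<'EOF'
1i\
#Author:Jay13 2013-08-25
s/cricode/CRICODE/g
$a\
good bye!
EOF

# -f reads the commands from the script file instead of the command line.
result=$(sed -f "$workdir/sedscript" "$workdir/sed.txt")
printf '%s\n' "$result"
```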

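Part 3: Exercises

Now that we have covered the basics of sed, let's put them into practice.

1. How to use sed to modify regularly formatted numeric strings across multiple lines

Input:

hello world balabala - . gene_id "240838 "; transcript_id "240838";

hello world balabala again - . gene_id "240838 "; transcript_id "240838";

balaba….

Output:

hello world balabala - . gene_id "zgg240838 "; transcript_id "zgg240838";

hello world balabala again - . gene_id "zgg240838 "; transcript_id "zgg240838";

balaba…

In other words, insert the string zgg between the opening quote and the digits that follow gene_id and transcript_id.

Answer:

The command is as follows:

sed 's/gene_id "\([0-9]\+\) "; transcript_id "\([0-9]\+\)";/gene_id "zgg\1 "; transcript_id "zgg\2";/g' test.txt

Notes:

1. [0-9] matches a digit.

2. \+ after [0-9] matches one or more digits.

3. Enclose each part to be captured in \( and \), and refer to the captured parts as \1, \2, and so on.

4. Multiple sed commands are joined with semicolons, and the whole command is enclosed in single quotes.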
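
As an alternative sketch, not the author's command, both fields can be handled by one shorter pattern that keys on the shared _id " prefix. It assumes that every quoted number following an _id field should receive the prefix, and it spells "one or more digits" as [0-9][0-9]* because \+ is a GNU extension.

```shell
line='hello world balabala - . gene_id "240838 "; transcript_id "240838";'

# \(_id "\) captures the field suffix and opening quote, \([0-9][0-9]*\) the digits;
# whatever follows the digits (a space or the closing quote) is left untouched.
tagged=$(printf '%s\n' "$line" | sed 's/\(_id "\)\([0-9][0-9]*\)/\1zgg\2/g')
printf '%s\n' "$tagged"
```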

2. Add a header and a trailer to every line of a text: put the word Head at the start of each line and the word Tail at the end.

Answer: omitted.
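
One possible solution, offered as a sketch: it assumes a single space should separate the added words from the original line, and the two input lines are made up.

```shell
# s/^/Head / inserts at the start of each line; s/$/ Tail/ appends at the end.
framed=$(printf '%s\n' 'first line' 'second line' | sed 's/^/Head /; s/$/ Tail/')
printf '%s\n' "$framed"
```

The same result could also be written as a single substitution, s/.*/Head & Tail/, using the & symbol from the regular-expression table.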

3. Converting formats with sed

For example, suppose test.txt contains the following:

dn: uid=admin,ou=ITaccounts,dc=tc
uid: admin
cn: admin cn
sn: admin sn
dn: uid=0037,ou=employees,dc=tci
uid: 0037
cn: thinker
sn: zzz

and it needs to be converted to XML like this:

<entity>
<dn>uid=admin,ou=ITaccounts,dc=tc</dn>
<uid>admin</uid>
<cn>admin cn</cn>
<sn>admin sn</sn>
</entity>
<entity>
<dn>uid=0037,ou=employees,dc=tci</dn>
<uid>0037</uid>
<cn>thinker</cn>
<sn>zzz</sn>
</entity>

We can then define the following sedscript file:

/^dn:[[:space:]]/i\
<entity>
s/^dn:[[:space:]]\(.*\)$/<dn>\1<\/dn>/g
s/^uid:[[:space:]]\(.*\)$/<uid>\1<\/uid>/g
s/^cn:[[:space:]]\(.*\)$/<cn>\1<\/cn>/g
s/^sn:[[:space:]]\(.*\)$/<sn>\1<\/sn>/g
s/^$/<\/entity>/g

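The first two lines insert an <entity> line before every line that begins with dn:. The last line replaces each blank line with the closing </entity> tag.

In addition, a substitution can group parts of the matched string with \( and \), and the replacement can then refer to those groups as \1, \2, and so on.

This is how the script strips the attribute name and colon from each line of the LDIF file and wraps the value in the corresponding XML tags.
Thinking about it more carefully, perhaps the four substitutions could be replaced by the single line below?
s/^\(.*\):[[:space:]]\(.*\)$/<\1>\2<\/\1>/g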

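cricode>> sed -f sedscript test.txt
<entity>
<dn>uid=admin,ou=ITaccounts,dc=tc</dn>
<uid>admin</uid>
<cn>admin cn</cn>
<sn>admin sn</sn>
<entity>
<dn>uid=0037,ou=employees,dc=tci</dn>
<uid>0037</uid>
<cn>thinker</cn>
<sn>zzz</sn>

Note that no </entity> closing tags appear in this output: the script emits them only for blank lines, and the sample input shown above contains none.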
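
The following sketch tries out the single-substitution idea on the sample data. It departs from the original script in three ways, all of them assumptions of this example: the attribute name is matched with [^:]* rather than .* so that it stops at the first colon, a closing </entity> is inserted before every dn: line except the first and once more after the last line (so no blank lines are needed), and the script is saved under the hypothetical name ldif2xml.sed.

```shell
# Recreate the sample LDIF data in a hypothetical scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'dn: uid=admin,ou=ITaccounts,dc=tc' \
  'uid: admin' \
  'cn: admin cn' \
  'sn: admin sn' \
  'dn: uid=0037,ou=employees,dc=tci' \
  'uid: 0037' \
  'cn: thinker' \
  'sn: zzz' > "$workdir/test.txt"

# Script layout:
#   block on dn: lines  close the previous entity (except on line 1), then open a new one
#   s command           wrap every "name: value" line in <name>...</name>
#   $a\                 close the last entity after the final line
cat > "$workdir/ldif2xml.sed" <<'EOF'
/^dn:[[:space:]]/{
1!i\
</entity>
i\
<entity>
}
s/^\([^:]*\):[[:space:]]\(.*\)$/<\1>\2<\/\1>/
$a\
</entity>
EOF

xml=$(sed -f "$workdir/ldif2xml.sed" "$workdir/test.txt")
printf '%s\n' "$xml"
```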