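01. Purpose
Count the letters, digits, Chinese characters, punctuation marks, and total characters in a document.
Note: spaces and English punctuation marks in the document are classified as "other characters".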
02. Main approach
(1) Read the text through an InputStreamReader, one line at a time with str = buf.readLine(),
and test each character of the line with str.charAt(i);
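As a rough illustration of step (1), the sketch below reads lines from an in-memory StringReader instead of a file (an assumption made so that the example runs on its own) and visits every character with charAt. The names LineReadSketch and countChars are hypothetical. It also shows that readLine() drops line terminators, so line breaks never reach the per-character tests.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineReadSketch {
    /** Counts the characters visited when reading line by line (line terminators excluded). */
    static int countChars(Reader source) {
        int total = 0;
        try (BufferedReader buf = new BufferedReader(source)) {
            String str;
            while ((str = buf.readLine()) != null) {
                for (int i = 0; i < str.length(); i++) {
                    char ch = str.charAt(i); // the character a classifier would inspect
                    total++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // rethrow unchecked to keep the sketch short
        }
        return total;
    }

    public static void main(String[] args) {
        // Two lines of two characters each; the '\n' between them is not counted.
        System.out.println(countChars(new StringReader("ab\ncd"))); // prints 4
    }
}
```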
(2) Letter test: ((str.charAt(i))>='A' && (str.charAt(i))<='Z') || ((str.charAt(i))>='a' && (str.charAt(i))<='z')
(3) Digit test: str.charAt(i)>='0' && str.charAt(i)<='9'
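(4) Chinese character test: str.charAt(i)>=0x4e00 && str.charAt(i)<=0x9fbb
Chinese text test (including Chinese punctuation marks): str.charAt(i)>=0x0391 && str.charAt(i)<=0xFFE5
In this example, Chinese characters and punctuation marks are tested separately.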
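The tests in (2) to (4) can be wrapped in small helper methods. The sketch below does so under assumed names (CharTests, isAsciiLetter, isAsciiDigit, isHan, isInWideRange; none of them come from the original program). It also hints at why the wide range 0x0391-0xFFE5 is too coarse to separate Chinese characters from punctuation: it accepts Greek letters and fullwidth punctuation as well.

```java
class CharTests {
    static boolean isAsciiLetter(char ch) {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    }

    static boolean isAsciiDigit(char ch) {
        return ch >= '0' && ch <= '9';
    }

    /** Narrow range used by the program for Chinese characters. */
    static boolean isHan(char ch) {
        return ch >= 0x4e00 && ch <= 0x9fbb;
    }

    /** Wide range that also covers Chinese punctuation (and much else). */
    static boolean isInWideRange(char ch) {
        return ch >= 0x0391 && ch <= 0xFFE5;
    }

    public static void main(String[] args) {
        // A, 7, U+4E2D (Han), U+FF0C (fullwidth comma), U+03B1 (Greek alpha)
        char[] samples = {'A', '7', '\u4e2d', '\uff0c', '\u03b1'};
        for (char ch : samples) {
            System.out.printf("U+%04X letter=%b digit=%b han=%b wide=%b%n",
                    (int) ch, isAsciiLetter(ch), isAsciiDigit(ch), isHan(ch), isInWideRange(ch));
        }
    }
}
```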
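(5) Chinese punctuation test (run after the test in (4), because the condition below also matches ideograph blocks):
Reference: "Java判斷中文符號" (detecting Chinese symbols in Java); see the notes on the CJK blocks of Character.UnicodeBlock.
Character.UnicodeBlock pun = Character.UnicodeBlock.of(str.charAt(i)); // get the UnicodeBlock of this character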
if (pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS ||
pun == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS ||
pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A ||
pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B ||
pun == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION ||
pun == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS ||
pun == Character.UnicodeBlock.GENERAL_PUNCTUATION)
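Notes on the CJK blocks in Character.UnicodeBlock:
- Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS: 4E00-9FFF: CJK Unified Ideographs
- Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS: F900-FAFF: CJK Compatibility Ideographs
- Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A: 3400-4DBF: CJK Unified Ideographs Extension A
- Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B: 20000-2A6DF: CJK Unified Ideographs Extension B
- Character.UnicodeBlock.GENERAL_PUNCTUATION: 2000-206F: General Punctuation
- Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION: 3000-303F: CJK Symbols and Punctuation
- Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS: FF00-FFEF: Halfwidth and Fullwidth Forms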
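To see which block a given character falls into, Character.UnicodeBlock.of can be called directly. The sketch below reproduces the seven-block condition from (5) in a hypothetical helper (BlockCheckSketch.matchesPunctuationCondition, a name invented here) and probes a few sample characters. Because the condition lists ideograph blocks as well, an Extension A ideograph such as U+3400 also matches it.

```java
import java.lang.Character.UnicodeBlock;

class BlockCheckSketch {
    /** Same seven-block condition as in step (5); the method name is hypothetical. */
    static boolean matchesPunctuationCondition(char ch) {
        UnicodeBlock pun = UnicodeBlock.of(ch);
        return pun == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || pun == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                || pun == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || pun == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || pun == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || pun == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || pun == UnicodeBlock.GENERAL_PUNCTUATION;
    }

    public static void main(String[] args) {
        // U+3002 ideographic full stop, U+FF0C fullwidth comma, U+201C left double quote,
        // ASCII comma, U+3400 (Extension A ideograph)
        char[] samples = {'\u3002', '\uff0c', '\u201c', ',', '\u3400'};
        for (char ch : samples) {
            System.out.printf("U+%04X block=%s matches=%b%n",
                    (int) ch, UnicodeBlock.of(ch), matchesPunctuationCondition(ch));
        }
    }
}
```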
03. Program code (charCount.java)
/*
 * Count the letters, digits, Chinese characters, Chinese punctuation marks,
 * and other characters in a document.
 * @author WTCLAB_yd
 */
import java.io.*;

public class charCount {
    public static void main(String[] args) {
        if (args.length < 1) { // the document to process is passed on the command line
            System.err.println("Usage: java charCount <file>");
            return;
        }
        int c = 0, f = 0, s = 0, b = 0, o = 0; // counters: letters, digits, Chinese characters, punctuation, others
        String str = null;
        File read_file = new File(args[0]); // file to read
        // Convert the byte stream to a character stream (platform default charset) and buffer it;
        // try-with-resources closes the reader when done
        try (BufferedReader buf = new BufferedReader(new InputStreamReader(new FileInputStream(read_file)))) {
            // readLine() drops the line terminator, so line breaks are not counted
            while ((str = buf.readLine()) != null) {
                for (int i = 0; i < str.length(); i++) {
                    char ch = str.charAt(i);
                    if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) { // letter?
                        c++;
                        continue;
                    }
                    if (ch >= '0' && ch <= '9') { // digit?
                        f++;
                        continue;
                    }
                    if (ch >= 0x4e00 && ch <= 0x9fbb) { // Chinese character?
                        s++;
                        continue;
                    }
                    Character.UnicodeBlock pun = Character.UnicodeBlock.of(ch); // get the UnicodeBlock of this character
                    // The ideograph blocks are listed here too, so an ideograph outside
                    // 0x4e00-0x9fbb is counted as punctuation
                    if (pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS || pun == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                            || pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A || pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                            || pun == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION || pun == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                            || pun == Character.UnicodeBlock.GENERAL_PUNCTUATION) {
                        b++;
                    } else {
                        o++;
                    }
                }
            }
            System.out.println("\nIn \"Journey to the West\":\n");
            System.out.println("Letters: " + c + "\n");
            System.out.println("Chinese characters: " + s + "\n");
            System.out.println("Digits: " + f + "\n");
            System.out.println("Punctuation marks: " + b + "\n");
            System.out.println("Other characters: " + o);
            System.out.println("Total characters: " + (c + s + f + b + o));
        } catch (IOException e) { // report read errors
            e.printStackTrace();
        }
    }
}
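str.charAt(i) returns UTF-16 code units, so a character outside the Basic Multilingual Plane, such as those in CJK Unified Ideographs Extension B, arrives as two surrogates and never matches the Extension B test. One possible variation, not the author's program, iterates over code points instead. The sketch below also counts the ideograph blocks as Chinese characters rather than punctuation, which is a design assumption of this example; the names CodePointCountSketch, count, and categoryOf are hypothetical.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;

class CodePointCountSketch {
    /** Returns {letters, digits, ideographs, punctuation, others} for the given text. */
    static int[] count(String text) {
        int[] n = new int[5];
        text.codePoints().forEach(cp -> n[categoryOf(cp)]++);
        return n;
    }

    /** Maps a code point to an index into the array returned by count. */
    static int categoryOf(int cp) {
        if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return 0;
        if (cp >= '0' && cp <= '9') return 1;
        UnicodeBlock block = UnicodeBlock.of(cp); // int overload handles supplementary characters
        if (block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || block == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS) return 2;
        if (block == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || block == UnicodeBlock.GENERAL_PUNCTUATION) return 3;
        return 4;
    }

    public static void main(String[] args) {
        // A, b, 1, U+4E2D, U+3002, space, U+20000 (Extension B, written as a surrogate pair)
        String sample = "Ab1\u4e2d\u3002 \uD840\uDC00";
        System.out.println(Arrays.toString(count(sample))); // prints [2, 1, 2, 1, 1]
    }
}
```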
04. Demonstration
(1) Run: **java charCount xyj.txt**
(command-line argument: xyj.txt is the document to process)
(2) Test:
Spaces between English words and English punctuation marks are counted as "other characters".
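This rule can be checked on a small input. The sketch below applies the same sequence of tests as the program to a string instead of a file (the class OthersDemo and the method countOthers are hypothetical names) and reports how many characters fall through to the "other" category.

```java
class OthersDemo {
    /** Counts the characters that pass none of the letter, digit, Chinese character, or block tests. */
    static int countOthers(String str) {
        int o = 0;
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) continue; // letter
            if (ch >= '0' && ch <= '9') continue;                               // digit
            if (ch >= 0x4e00 && ch <= 0x9fbb) continue;                         // Chinese character
            Character.UnicodeBlock pun = Character.UnicodeBlock.of(ch);
            if (pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                    || pun == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
                    || pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                    || pun == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                    || pun == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                    || pun == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                    || pun == Character.UnicodeBlock.GENERAL_PUNCTUATION) continue; // punctuation test
            o++; // everything else, including spaces and ASCII punctuation
        }
        return o;
    }

    public static void main(String[] args) {
        // The comma, the space, and the exclamation mark are the three "other" characters.
        System.out.println(countOthers("Hello, world!")); // prints 3
    }
}
```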