Parsing Large XML Files: A Summary

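1. Large files often cannot be parsed by simply reading the whole file into memory. This article summarizes an approach to parsing large XML files, along with the code.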

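2. The main idea is to wrap the work in a utility class that splits the file: read only part of the file at a time (for example, 10 MB), locate the target XML tags in what has been read, and then match and extract them.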

1) Suppose we have the following test.xml (1.42 KB)

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Header>
    <ContentDate>2020-02-03T01:00:00Z</ContentDate>
    <FileContent>BAS</FileContent>
    <DeltaStart>2020-02-02T17:00:00Z</DeltaStart>
  </Header>
  <Records>
<Record>
	   <BAS>123</BAS>
	   <Entity>
		  <LegalName>1 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="1 en" type="TRADING_OR_OPERATING_NAME">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>456</BAS>
	   <Entity>
		  <LegalName>2 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="2 en" type="TRADING_OR_OPERATING_NAME">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>
	  </Record>
	  <Record>
	   <BAS>789</BAS>
	   <Entity>
		  <LegalName>3 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="3 en" type="TRADING_OR_OPERATING_NAME">3 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>1022</BAS>
	   <Entity>
		  <LegalName>4 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="4 en" type="TRADING_OR_OPERATING_NAME">4 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
  </Records>
</Data>

2) Reading the large file

public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
	this.fileIn = new FileInputStream(fileName);
	FileChannel fileChannel = fileIn.getChannel();
	this.fileLength = fileChannel.size();
	this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
	this.mappedBufArray = new MappedByteBuffer[number];// memory File Mapping Array
	long preLength = 0;
	long regionSize = (long) Integer.MAX_VALUE;// size of mapping region
	for (int i = 0; i < number; i++) {
		// mapping contiguous areas of files to memory file mapping arrays
		if (fileLength - preLength < (long) Integer.MAX_VALUE) {
			regionSize = fileLength - preLength;// the size of the last area
		}
		mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
		preLength += regionSize;// the beginning of the next area
	}
	this.arraySize = arraySize;
}
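Only the constructor is shown above; the read(), getArray(), and close() methods that the tests call are not included in the article. As a rough illustration of how such methods could behave, here is a minimal, self-contained sketch. The class name ChunkReaderSketch and its internals are assumptions, not the author's code, and for brevity it maps the file as a single region, so it only handles files under 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical, simplified reader: one mapped region, so the file must be under 2 GB. */
public class ChunkReaderSketch implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;
    private final int arraySize;
    private byte[] array = new byte[0];

    public ChunkReaderSketch(Path file, int arraySize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        this.arraySize = arraySize;
    }

    /** Copies the next chunk into the array; returns its length, or -1 at end of file. */
    public int read() {
        if (!buffer.hasRemaining()) {
            return -1;
        }
        int length = Math.min(arraySize, buffer.remaining());
        array = new byte[length]; // the last chunk may be shorter than arraySize
        buffer.get(array);
        return length;
    }

    /** Returns the bytes copied by the most recent successful read(). */
    public byte[] getArray() {
        return array;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Because the sketch implements AutoCloseable, it can be used in a try-with-resources block so the channel is closed even if processing fails.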

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // read 1 KB per chunk
	while (reader.read() != -1) {
		System.out.println("===========================");
		// Decode as UTF-8 (the encoding declared in test.xml) rather than the platform default
		System.out.println(new String(reader.getArray(), java.nio.charset.StandardCharsets.UTF_8));
	}
	reader.close();
}

Note: the lines of equals signs separate the chunks in the output; each chunk holds up to 1 KB of content.
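One caveat with fixed-size chunks: if the file contains multi-byte UTF-8 characters, a chunk boundary can fall in the middle of a character, and converting each chunk to a String on its own would then garble it. The sketch below shows one possible way to handle this with the JDK's CharsetDecoder, carrying incomplete trailing bytes over to the next chunk. The class name ChunkDecoderSketch is hypothetical and this is not part of the author's code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes UTF-8 chunks without breaking characters at chunk boundaries. */
public class ChunkDecoderSketch {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Bytes of an incomplete character left over from the previous chunk
    private ByteBuffer pending = ByteBuffer.allocate(0);

    public String decode(byte[] chunk) {
        // Prepend the leftover bytes to the new chunk
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
        in.put(pending).put(chunk).flip();

        // UTF-8 never produces more chars than it consumes bytes
        CharBuffer out = CharBuffer.allocate(in.remaining());
        // endOfInput = false: an incomplete trailing sequence is left unread in the input
        decoder.decode(in, out, false);

        // Keep whatever was not consumed for the next call
        pending = ByteBuffer.allocate(in.remaining());
        pending.put(in).flip();

        out.flip();
        return out.toString();
    }
}
```

The string returned for each chunk can then be appended to the cache in place of new String(chunk).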

 

3) Locating XML tags

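As test.xml shows, the content we want is in the <Record> elements, so we now need to cut out everything between <Record> and </Record> from the 1 KB chunk read in the previous step.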

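// Returns the index of the first "<tag" and the start index of the last "</tag>" in the buffer.
// Either value is -1 when nothing is found. "<" + tag is a prefix match, so the tag
// "Record" also matches "<Records>".
public static Range getRangeForTags(StringBuffer sfb, String tag) {
	Range range = new Range();
	range.setFrom(sfb.indexOf("<" + tag));
	range.setTo(sfb.lastIndexOf("</" + tag + ">"));
	return range;
}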
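The Range class is not shown in the article, and the prefix search for "<" + tag also hits <Records> when the tag is Record. The following sketch shows one assumed shape for Range together with a stricter variant of the lookup. The names TagRangeSketch and indexOfOpenTag are hypothetical, and unlike the method above this variant returns an exclusive end index that includes the closing tag.

```java
/** Hypothetical sketch of the Range holder and a stricter tag locator. */
public class TagRangeSketch {

    /** Assumed shape of Range: two int indices, with -1 meaning "not found". */
    public static class Range {
        private int from = -1;
        private int to = -1;

        public int getFrom() { return from; }
        public void setFrom(int from) { this.from = from; }
        public int getTo() { return to; }
        public void setTo(int to) { this.to = to; }

        @Override
        public String toString() { return "Range[from=" + from + ", to=" + to + "]"; }
    }

    /** Index of the first "<tag" followed by '>', '/', or whitespace; -1 if none. */
    static int indexOfOpenTag(String text, String tag) {
        String open = "<" + tag;
        int i = text.indexOf(open);
        while (i >= 0) {
            int next = i + open.length();
            if (next < text.length()) {
                char c = text.charAt(next);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    return i; // exact tag name, not just a prefix of a longer name
                }
            }
            i = text.indexOf(open, i + 1);
        }
        return -1;
    }

    /** from = start of the first opening tag; to = end (exclusive) of the last closing tag. */
    public static Range getRangeForTags(CharSequence buffer, String tag) {
        String text = buffer.toString();
        String close = "</" + tag + ">";
        Range range = new Range();
        range.setFrom(indexOfOpenTag(text, tag));
        int last = text.lastIndexOf(close);
        range.setTo(last < 0 ? -1 : last + close.length());
        return range;
    }
}
```

With these semantics, text.substring(range.getFrom(), range.getTo()) yields the complete elements held in the buffer whenever both indices are non-negative and to is greater than from.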

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // read 1 KB per chunk
	StringBuffer cache = new StringBuffer();
	while (reader.read() != -1) {
		// Decode as UTF-8 (the encoding declared in test.xml) rather than the platform default
		cache.append(new String(reader.getArray(), java.nio.charset.StandardCharsets.UTF_8));
		Range range = Strkit.getRangeForTags(cache, "Record");
		System.out.println(range);
	}
	reader.close();
}

 

4) Matching and extracting XML tags

The previous step gave us the indices of the tags we want; the next step is to match the <Record></Record> elements within them.

// Returns capture group 1 of every match of rgex found in soap.
public static List<String> getSubUtil(String soap, String rgex) {
	List<String> list = new ArrayList<>();
	Pattern pattern = Pattern.compile(rgex);
	Matcher m = pattern.matcher(soap);
	while (m.find()) {
		list.add(m.group(1));
	}
	return list;
}

Test:

public static void main(String[] args) throws IOException {
	StringBuffer cache = new StringBuffer();
	cache.append("<Records>"
			+ "<Record>\r\n" + 
			"	   <BAS>123</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>1 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"1 en\" type=\"TRADING_OR_OPERATING_NAME\">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>	   \r\n" + 
			"	  </Record><Record>\r\n" + 
			"	   <BAS>456</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>2 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"2 en\" type=\"TRADING_OR_OPERATING_NAME\">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>\r\n" + 
			"	  </Record>"
			+ "</Records>");

	StringBuffer texts = new StringBuffer();
	String node = "Record";
	String rgex = "<" + node + "" + "([\\s\\S]*?)</" + node + ">";
	List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
	elemTexts.forEach(str -> {
		str = str.replace("s><" + node,"");
		String t = "<" + node + str + "</" + node + ">";
		texts.append(t);
	});
	System.out.println(texts);
}

Note: the leading <Records> and trailing </Records> wrappers are removed from the output.
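The replace call that strips the stray "s><Record" fragment is needed because the pattern "<Record" also matches the start of <Records>. One possible alternative, sketched below, is to add a lookahead so that only the exact tag name matches and each whole element is returned directly. The class and method names are hypothetical; the sketch assumes elements of the given name are not nested inside each other and are not self-closing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract whole elements by exact tag name with a regex lookahead. */
public class RecordRegexSketch {

    public static List<String> extractElements(String text, String tag) {
        String name = Pattern.quote(tag);
        // (?=[\s>]) requires whitespace or '>' right after the tag name,
        // so "<Record" no longer matches the prefix of "<Records>".
        Pattern pattern = Pattern.compile("<" + name + "(?=[\\s>])[\\s\\S]*?</" + name + ">");
        Matcher m = pattern.matcher(text);
        List<String> elements = new ArrayList<>();
        while (m.find()) {
            elements.add(m.group()); // the full element, opening and closing tags included
        }
        return elements;
    }
}
```

Since each match already contains its own opening and closing tags, the results can be concatenated without re-adding "<Record" and "</Record>" around them.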

 

5) The integrated parsing utility class

The core idea behind splitting the XML is to use the indices obtained in step 3 to delete the already-processed content from the cache.

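public void parse() {
	MappedBiggerFileReader reader = null;
	try {
		reader = new MappedBiggerFileReader(filePath, size * 1024);
		StringBuffer cache = new StringBuffer();
		// The pattern does not change between chunks, so build it once
		String rgex = "<" + node + "([\\s\\S]*?)</" + node + ">";
		while (reader.read() != -1) {
			// Decode as UTF-8; assumes no multi-byte character is split across chunks
			cache.append(new String(reader.getArray(), java.nio.charset.StandardCharsets.UTF_8));
			Range range = Strkit.getRangeForTags(cache, node);

			// Process only when the cache holds at least one closing tag after an opening tag
			if (range.getFrom() >= 0 && range.getTo() > range.getFrom()) {
				StringBuffer texts = new StringBuffer();

				List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
				elemTexts.forEach(str -> {
					// The first match starts at <Records>, so drop the stray "s>...<Record" prefix
					// (any line ending; assumes node contains no regex metacharacters)
					str = str.replaceFirst("^s>\\s*<" + node, "");
					String t = "<" + node + str + "</" + node + ">";
					texts.append(t);
				});
				texts.insert(0, "<Root>");
				texts.append("</Root>");

				System.out.println("===============");
				System.out.println(texts);
				// Drop everything up to the last closing tag; the remainder waits for the next chunk
				cache.delete(0, range.getTo());
			}
		}
	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		if (reader != null) {
			try {
				reader.close(); // release the file handle even when parsing fails
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}
}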

Test:

public static void main(String[] args) {
	new XmlParseUtil("D:\\test\\test.xml", "Record", 1).parse();
}

Note: each batch of extracted Record elements is wrapped in <Root></Root> so that it can later be parsed as XML with XstreamUtil.
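XstreamUtil is not shown in the article. To illustrate why the <Root> wrapper matters, the sketch below parses one wrapped chunk with the JDK's built-in DOM parser instead, which is a substitute technique and not the author's XStream-based step. Without a single root element the chunk would not be well-formed XML and the parser would reject it. The class and method names are hypothetical, and the sketch assumes every Record has a BAS child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: read the BAS value of each Record in a Root-wrapped chunk. */
public class RootChunkSketch {

    public static List<String> basValues(String rootXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rootXml)));

        // Every Record in this chunk is a descendant of the single Root element
        NodeList records = doc.getDocumentElement().getElementsByTagName("Record");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            // Assumes each Record contains a BAS element
            values.add(record.getElementsByTagName("BAS").item(0).getTextContent().trim());
        }
        return values;
    }
}
```

Each chunk is parsed independently, so memory use stays bounded by the chunk size rather than by the size of the whole file.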

 

3. The source code is attached for download. If anything is unclear or looks wrong, feel free to leave a comment.

 

Notes:

1. If you run into: Xstream NumberFormatException: Zero length string...

    This may be because some data nodes in your test XML file have an attribute whose value is empty.
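The message "Zero length string" is what the JDK's Integer.decode and Long.decode throw for an empty string, which fits the explanation that an empty attribute is being converted to a number. How XStream is configured in the project is not shown, so the following is only a sketch of the general guard, with hypothetical names: treat a blank attribute value as missing before converting it.

```java
/** Hypothetical sketch: tolerate empty attribute values when converting to a number. */
public class EmptyAttributeSketch {

    /** Returns null for a null or blank value instead of throwing NumberFormatException. */
    public static Integer parseOrNull(String attributeValue) {
        if (attributeValue == null || attributeValue.trim().isEmpty()) {
            return null; // empty attribute, e.g. count=""
        }
        return Integer.decode(attributeValue.trim());
    }
}
```

The same check could live in a custom converter or in a preprocessing pass that removes empty attributes before the chunk is parsed.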
