A Summary of Parsing Large XML Files

Original

BAStriver

2020-06-21 23:00

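1. Large files often cannot be parsed by reading the whole file in the usual way. This article summarizes an approach, with code, for parsing large XML files.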

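2. The main idea is to wrap the logic in a utility class that splits the file: read a portion of the file at a time (for example 10 MB), locate the XML tags in that portion, and match them.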

1) Suppose we have the following test.xml (1.42 KB)

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Header>
    <ContentDate>2020-02-03T01:00:00Z</ContentDate>
    <FileContent>BAS</FileContent>
    <DeltaStart>2020-02-02T17:00:00Z</DeltaStart>
  </Header>
  <Records>
<Record>
	   <BAS>123</BAS>
	   <Entity>
		  <LegalName>1 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="1 en" type="TRADING_OR_OPERATING_NAME">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>456</BAS>
	   <Entity>
		  <LegalName>2 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="2 en" type="TRADING_OR_OPERATING_NAME">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>
	  </Record>
	  <Record>
	   <BAS>789</BAS>
	   <Entity>
		  <LegalName>3 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="3 en" type="TRADING_OR_OPERATING_NAME">3 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
	  <Record>
	   <BAS>1022</BAS>
	   <Entity>
		  <LegalName>4 AVSUPER FUND</LegalName>
		  <OtherEntityNames>
			 <OtherEntityName xml:lang="4 en" type="TRADING_OR_OPERATING_NAME">4 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
		  </OtherEntityNames>
	   </Entity>	   
	  </Record>
  </Records>
</Data>

2) Reading the large file.

public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
	this.fileIn = new FileInputStream(fileName);
	FileChannel fileChannel = fileIn.getChannel();
	this.fileLength = fileChannel.size();
	this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
	this.mappedBufArray = new MappedByteBuffer[number];// memory File Mapping Array
	long preLength = 0;
	long regionSize = (long) Integer.MAX_VALUE;// size of mapping region
	for (int i = 0; i < number; i++) {
		// mapping contiguous areas of files to memory file mapping arrays
		if (fileLength - preLength < (long) Integer.MAX_VALUE) {
			regionSize = fileLength - preLength;// the size of the last area
		}
		mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
		preLength += regionSize;// the beginning of the next area
	}
	this.arraySize = arraySize;
}
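
Only the constructor is shown above, while the tests below call read(), getArray() and close(). The following is a minimal sketch of how such a reader might work, not the author's actual class: the name ChunkedMappedReader and the bodies of read(), getArray() and close() are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: read(), getArray() and close() are assumptions,
// since the article only shows the constructor of MappedBiggerFileReader.
class ChunkedMappedReader implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer[] regions;
    private final int chunkSize;
    private int current = 0;            // index of the region being read
    private byte[] array = new byte[0]; // most recently read chunk

    ChunkedMappedReader(Path file, int chunkSize) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.chunkSize = chunkSize;
        long length = channel.size();
        // one mapped region per Integer.MAX_VALUE bytes, as in the constructor above
        int count = (int) Math.ceil((double) length / Integer.MAX_VALUE);
        this.regions = new MappedByteBuffer[count];
        long offset = 0;
        for (int i = 0; i < count; i++) {
            long size = Math.min(Integer.MAX_VALUE, length - offset);
            regions[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
            offset += size;
        }
    }

    /** Loads the next chunk; returns its length, or -1 at end of file. */
    int read() {
        while (current < regions.length && !regions[current].hasRemaining()) {
            current++; // this region is exhausted, move to the next one
        }
        if (current >= regions.length) {
            return -1;
        }
        int n = Math.min(chunkSize, regions[current].remaining());
        array = new byte[n]; // sized to the bytes actually read, so no stale tail
        regions[current].get(array);
        return n;
    }

    byte[] getArray() {
        return array;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Sizing the array to the bytes actually read means the last chunk never carries leftover bytes from the previous one. Note that a fixed-size byte chunk can end in the middle of a multi-byte UTF-8 character, so decoding each chunk separately is only safe for single-byte content.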

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // 1Kb
	while (reader.read() != -1) {
		System.out.println("===========================");
		System.out.println(new String(reader.getArray()));
	}
	reader.close();
}

Note: in the output, the lines of equals signs separate the chunks, each holding up to 1 KB of content.

3) Locating XML tags.

From test.xml we can see that what we want is the content of <Record>, so we now need to extract everything between <Record> and </Record> from the 1 KB read in the previous step.

public static Range getRangeForTags(StringBuffer sfb, String tag) {
	Range range = new Range();
	range.setFrom(sfb.indexOf("<" + tag));
	range.setTo(sfb.lastIndexOf("</" + tag + ">"));
	return range;
}
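
One detail worth knowing: indexOf("<" + tag) also matches longer tag names, so searching for "Record" in test.xml first hits "<Records>". The sketch below shows one way to locate only the exact tag. The Range class is not shown in the article, so TagRange and TagLocator are hypothetical names and shapes used purely for illustration.

```java
// Hypothetical sketch: one possible shape for the Range holder used above.
class TagRange {
    final int from; // index of the first opening tag, or -1
    final int to;   // index of the last closing tag, or -1

    TagRange(int from, int to) {
        this.from = from;
        this.to = to;
    }

    /** True when the text holds at least one opening tag followed by a closing tag. */
    boolean hasCompleteElement() {
        return from >= 0 && to > from;
    }
}

class TagLocator {
    /** Finds the first "<tag" followed by '>', '/' or whitespace, so "Record" skips "<Records>". */
    static int indexOfOpenTag(CharSequence text, String tag) {
        String s = text.toString();
        String open = "<" + tag;
        int i = s.indexOf(open);
        while (i >= 0) {
            int next = i + open.length();
            if (next < s.length()) {
                char c = s.charAt(next);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    return i;
                }
            }
            i = s.indexOf(open, i + 1); // longer tag name, keep searching
        }
        return -1;
    }

    static TagRange locate(CharSequence text, String tag) {
        int from = indexOfOpenTag(text, tag);
        int to = text.toString().lastIndexOf("</" + tag + ">");
        return new TagRange(from, to);
    }
}
```

For example, TagLocator.locate("<Records><Record><BAS>1</BAS></Record></Records>", "Record") reports from = 9 rather than 0, because the match at index 0 belongs to <Records>.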

Test:

public static void main(String[] args) throws IOException {
	MappedBiggerFileReader reader = new MappedBiggerFileReader(
			"D:\\test\\test.xml", 1 * 1024); // 1Kb
	StringBuffer cache = new StringBuffer();
	while (reader.read() != -1) {
		cache.append(new String(reader.getArray()));
		Range range = Strkit.getRangeForTags(cache, "Record");
		System.out.println(range);
	}
	reader.close();
}

4) Matching XML tags.

The previous step gave us the indices of the tag we want; now we match the <Record></Record> elements inside that range.

public static List<String> getSubUtil(String soap, String rgex) {
	List<String> list = new ArrayList<String>();
	Pattern pattern = Pattern.compile(rgex);
	Matcher m = pattern.matcher(soap);
	while (m.find()) {
		// group 1 is the text captured by the first pair of parentheses in rgex
		list.add(m.group(1));
	}
	return list;
}
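
As a variation on the same regex idea, a lookahead can keep the pattern from matching "<Records>" in the first place, which avoids having to strip a prefix afterwards. This is a minimal sketch under stated assumptions: ElementExtractor and extract are hypothetical names, and the pattern assumes that elements with the same tag name are not nested inside each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the article's source.
class ElementExtractor {
    /** Returns every complete <tag ...>...</tag> element found in text, in order. */
    static List<String> extract(String text, String tag) {
        String name = Pattern.quote(tag);
        // (?=[\s>]) requires whitespace or '>' right after the tag name, so "<Records>" is skipped;
        // [\s\S]*? is non-greedy and stops at the nearest closing tag
        Pattern p = Pattern.compile("<" + name + "(?=[\\s>])[\\s\\S]*?</" + name + ">");
        Matcher m = p.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group()); // the whole element, tags included
        }
        return out;
    }
}
```

Because m.group() returns the whole match, each list entry is already a complete element and does not need the opening and closing tags added back.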

Test:

public static void main(String[] args) throws IOException {
	StringBuffer cache = new StringBuffer();
	cache.append("<Records>"
			+ "<Record>\r\n" + 
			"	   <BAS>123</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>1 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"1 en\" type=\"TRADING_OR_OPERATING_NAME\">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>	   \r\n" + 
			"	  </Record><Record>\r\n" + 
			"	   <BAS>456</BAS>\r\n" + 
			"	   <Entity>\r\n" + 
			"		  <LegalName>2 AVSUPER FUND</LegalName>\r\n" + 
			"		  <OtherEntityNames>\r\n" + 
			"			 <OtherEntityName xml:lang=\"2 en\" type=\"TRADING_OR_OPERATING_NAME\">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" + 
			"		  </OtherEntityNames>\r\n" + 
			"	   </Entity>\r\n" + 
			"	  </Record>"
			+ "</Records>");

	StringBuffer texts = new StringBuffer();
	String node = "Record";
	String rgex = "<" + node + "" + "([\\s\\S]*?)</" + node + ">";
	List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
	elemTexts.forEach(str -> {
		str = str.replace("s><" + node,"");
		String t = "<" + node + str + "</" + node + ">";
		texts.append(t);
	});
	System.out.println(texts);
}

Note: the leading <Records> and the trailing </Records> have been removed.

5) The integrated parsing utility class.

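Using the indices obtained in step 3, we delete the already processed content from the cache. This is the core idea behind splitting the XML.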

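public void parse() {
	MappedBiggerFileReader reader = null;
	try {
		reader = new MappedBiggerFileReader(filePath, size * 1024);
		StringBuffer cache = new StringBuffer();
		String closeTag = "</" + node + ">";
		// non-greedy match from "<node" up to the nearest closing tag
		String rgex = "<" + node + "([\\s\\S]*?)" + closeTag;
		while (reader.read() != -1) {
			cache.append(new String(reader.getArray()));
			Range range = Strkit.getRangeForTags(cache, node);

			// wait for more data until the cache holds a closing tag;
			// otherwise getTo() is -1 and delete() would throw
			if (range.getFrom() >= 0 && range.getTo() > range.getFrom()) {
				StringBuffer texts = new StringBuffer();

				List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
				elemTexts.forEach(str -> {
					// in test.xml the first match starts at <Records>, which also begins
					// with "<Record"; strip that prefix (relies on the "\n" between the tags)
					str = str.replace("s>\n<" + node, "");
					texts.append("<" + node + str + closeTag);
				});
				texts.insert(0, "<Root>");
				texts.append("</Root>");

				System.out.println("===============");
				System.out.println(texts);
				// drop everything up to and including the last complete element
				cache.delete(0, range.getTo() + closeTag.length());
			}
		}
	} catch (Exception e) {
		e.printStackTrace();
	} finally {
		if (reader != null) {
			try {
				reader.close(); // release the file even when parsing fails
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
}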
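
The carry-over logic can also be looked at in isolation from file I/O: append each chunk to a cache, emit the elements that are complete, and keep the unfinished tail for the next chunk. The sketch below is a simplified illustration, not the article's XmlParseUtil. ChunkSplitter and feed are hypothetical names, and it assumes the target element has no attributes and is not nested inside itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cache-and-carry-over idea.
class ChunkSplitter {
    private final String openTag;
    private final String closeTag;
    private final StringBuilder cache = new StringBuilder();

    ChunkSplitter(String tag) {
        this.openTag = "<" + tag + ">";   // assumption: the element has no attributes
        this.closeTag = "</" + tag + ">";
    }

    /** Appends one chunk and returns the elements that are now complete. */
    List<String> feed(String chunk) {
        cache.append(chunk);
        List<String> done = new ArrayList<>();
        int consumed = 0;
        while (true) {
            int start = cache.indexOf(openTag, consumed);
            if (start < 0) {
                break; // no further opening tag in the cache
            }
            int end = cache.indexOf(closeTag, start);
            if (end < 0) {
                break; // element is still incomplete, wait for the next chunk
            }
            end += closeTag.length();
            done.add(cache.substring(start, end));
            consumed = end;
        }
        cache.delete(0, consumed); // keep only the unfinished tail
        return done;
    }
}
```

Feeding "<Records><Record><BAS>1</BAS></Rec" returns an empty list, and feeding the rest, "ord><Record><BAS>2</BAS></Record></Records>", then returns both Record elements, including the one that was cut at the chunk boundary.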

Test:

public static void main(String[] args) {
	new XmlParseUtil("D:\\test\\test.xml", "Record", 1).parse();
}

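Note: the Record elements split out of each chunk are wrapped together in a single <Root></Root>, which makes it easy to parse them as XML afterwards with XstreamUtil.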
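
XstreamUtil is not shown in this article. To illustrate why the <Root> wrapper matters, here is a minimal sketch that parses one wrapped batch with the JDK's built-in DOM parser instead. RootBatchParser and basValues are hypothetical names, and swapping in DOM for XStream is an assumption made only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: uses the JDK DOM parser in place of the article's XstreamUtil.
class RootBatchParser {
    /** Parses a <Root>-wrapped batch and returns the text of each Record's BAS child. */
    static List<String> basValues(String rootXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(rootXml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = doc.getDocumentElement().getElementsByTagName("Record");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            Element record = (Element) records.item(i);
            // assumes every Record has a BAS child, as in test.xml
            values.add(record.getElementsByTagName("BAS").item(0).getTextContent());
        }
        return values;
    }
}
```

Without the wrapper, a batch holding several sibling Record elements would have more than one top-level element and would be rejected by an XML parser as not well-formed.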

3. The source code is attached for download. If anything is unclear or looks wrong, feel free to leave a comment and discuss.

Notes:

1. If you run into: Xstream NumberFormatException: Zero length string...

The likely cause is that some data nodes in your test XML file have an attribute whose value is empty.
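
The same kind of failure can be reproduced with the JDK alone: Integer.decode("") throws a NumberFormatException for an empty string. The sketch below guards against that case. It assumes the numeric conversion behaves like Integer.decode; AttributeNumbers and parseOrNull are hypothetical names, and how you hook such a guard into XStream depends on your own converters.

```java
// Hypothetical helper for attribute values that may be empty.
class AttributeNumbers {
    /** Returns the parsed value, or null when the attribute is missing or blank. */
    static Integer parseOrNull(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null; // an empty attribute such as count="" carries no number
        }
        return Integer.decode(raw.trim());
    }
}
```

Alternatively, cleaning the input so that empty numeric attributes are removed before parsing avoids the problem without any custom code.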
