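1. Very large files often cannot be parsed by simply reading the whole file into memory. This post summarizes an approach, with code, for parsing large XML files.
2. The main idea is to wrap the work in a utility class that splits the file: read part of the file at a time (for example 10 MB), locate the XML tags of interest in that chunk, then match the complete elements.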
1) Suppose we have the following test.xml (1.42 KB):
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Header>
<ContentDate>2020-02-03T01:00:00Z</ContentDate>
<FileContent>BAS</FileContent>
<DeltaStart>2020-02-02T17:00:00Z</DeltaStart>
</Header>
<Records>
<Record>
<BAS>123</BAS>
<Entity>
<LegalName>1 AVSUPER FUND</LegalName>
<OtherEntityNames>
<OtherEntityName xml:lang="1 en" type="TRADING_OR_OPERATING_NAME">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
</OtherEntityNames>
</Entity>
</Record>
<Record>
<BAS>456</BAS>
<Entity>
<LegalName>2 AVSUPER FUND</LegalName>
<OtherEntityNames>
<OtherEntityName xml:lang="2 en" type="TRADING_OR_OPERATING_NAME">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
</OtherEntityNames>
</Entity>
</Record>
<Record>
<BAS>789</BAS>
<Entity>
<LegalName>3 AVSUPER FUND</LegalName>
<OtherEntityNames>
<OtherEntityName xml:lang="3 en" type="TRADING_OR_OPERATING_NAME">3 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
</OtherEntityNames>
</Entity>
</Record>
<Record>
<BAS>1022</BAS>
<Entity>
<LegalName>4 AVSUPER FUND</LegalName>
<OtherEntityNames>
<OtherEntityName xml:lang="4 en" type="TRADING_OR_OPERATING_NAME">4 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>
</OtherEntityNames>
</Entity>
</Record>
</Records>
</Data>
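2) Read the large file. The constructor maps the file into memory in regions of at most Integer.MAX_VALUE bytes; arraySize is the chunk size in bytes (1 KB in the test below).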
public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
    this.fileIn = new FileInputStream(fileName);
    FileChannel fileChannel = fileIn.getChannel();
    this.fileLength = fileChannel.size();
    this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
    this.mappedBufArray = new MappedByteBuffer[number]; // memory file mapping array
    long preLength = 0;
    long regionSize = (long) Integer.MAX_VALUE; // size of mapping region
    for (int i = 0; i < number; i++) {
        // mapping contiguous areas of files to memory file mapping arrays
        if (fileLength - preLength < (long) Integer.MAX_VALUE) {
            regionSize = fileLength - preLength; // the size of the last area
        }
        mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
        preLength += regionSize; // the beginning of the next area
    }
    this.arraySize = arraySize;
}
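To make the region arithmetic in the constructor concrete, here is a small sketch that only computes the (offset, size) pairs and maps nothing. RegionPlan and plan are names invented for this illustration, and the maxRegion parameter stands in for Integer.MAX_VALUE so the logic can also be tried on small numbers.

```java
import java.util.ArrayList;
import java.util.List;

final class RegionPlan {
    /** Returns {offset, size} pairs covering [0, fileLength) in regions of at most maxRegion bytes. */
    static List<long[]> plan(long fileLength, long maxRegion) {
        List<long[]> regions = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            // every region is full-sized except possibly the last one
            long size = Math.min(maxRegion, fileLength - offset);
            regions.add(new long[] {offset, size});
            offset += size;
        }
        return regions;
    }

    public static void main(String[] args) {
        long fiveGiB = 5L * 1024 * 1024 * 1024;
        for (long[] r : plan(fiveGiB, Integer.MAX_VALUE)) {
            System.out.println("offset=" + r[0] + " size=" + r[1]);
        }
    }
}
```

For a 5 GiB file this yields three regions: two of Integer.MAX_VALUE bytes and a last one of 1073741826 bytes.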
Test:
public static void main(String[] args) throws IOException {
    MappedBiggerFileReader reader = new MappedBiggerFileReader(
            "D:\\test\\test.xml", 1 * 1024); // 1 KB per read
    while (reader.read() != -1) {
        System.out.println("===========================");
        System.out.println(new String(reader.getArray()));
    }
    reader.close();
}
Note: each line of equals signs in the output separates one 1 KB chunk read from the file.
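Only the constructor of MappedBiggerFileReader is shown above; the read(), getArray() and close() methods used in the test are not. As a rough sketch of the same read-one-chunk-at-a-time contract, the class below uses a plain FileChannel with a reusable buffer instead of mapped regions. ChunkReader is a hypothetical name, and this is not the original implementation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class ChunkReader implements AutoCloseable {
    private final FileChannel channel;
    private final ByteBuffer buffer;
    private byte[] array = new byte[0];

    ChunkReader(Path file, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.buffer = ByteBuffer.allocate(chunkSize);
    }

    /** Reads the next chunk; returns the number of bytes read, or -1 at end of file. */
    int read() throws IOException {
        buffer.clear();
        int n = channel.read(buffer);
        if (n < 0) {
            array = new byte[0];
            return -1;
        }
        // copy only the bytes actually read, so the last chunk carries no stale data
        array = Arrays.copyOf(buffer.array(), n);
        return n;
    }

    byte[] getArray() {
        return array;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        try (ChunkReader reader = new ChunkReader(Paths.get(args[0]), 1024)) {
            while (reader.read() != -1) {
                System.out.println("===========================");
                System.out.println(new String(reader.getArray(), StandardCharsets.UTF_8));
            }
        }
    }
}
```

One caveat that applies to any byte-based chunking: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, so decoding each chunk separately may garble non-ASCII text at the seams.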
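3) Locate the XML tags.
As test.xml shows, the content we want is the <Record> elements, so we now need to extract everything between <Record> and </Record> from the 1 KB chunk read in the previous step.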
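public static Range getRangeForTags(StringBuffer sfb, String tag) {
    Range range = new Range();
    // index of the first opening tag; note that "<" + tag also matches longer
    // names with the same prefix, e.g. <Records> when tag is "Record"
    range.setFrom(sfb.indexOf("<" + tag));
    // index of the last closing tag, or -1 if the buffer holds none yet
    range.setTo(sfb.lastIndexOf("</" + tag + ">"));
    return range;
}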
Test:
public static void main(String[] args) throws IOException {
    MappedBiggerFileReader reader = new MappedBiggerFileReader(
            "D:\\test\\test.xml", 1 * 1024); // 1 KB per read
    StringBuffer cache = new StringBuffer();
    while (reader.read() != -1) {
        cache.append(new String(reader.getArray()));
        Range range = Strkit.getRangeForTags(cache, "Record");
        System.out.println(range);
    }
    reader.close();
}
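To see what these indexes look like on a small buffer, and why a plain indexOf("<Record") lands on <Records> first, here is a self-contained sketch. TagRange, firstOpen and lastClose are hypothetical names, not part of the Strkit class. The stricter lookup only accepts the tag name when it is followed by '>' or whitespace.

```java
final class TagRange {
    /** Index of the first "<tag" followed by '>' or whitespace, or -1 if none. */
    static int firstOpen(CharSequence buf, String tag) {
        String s = buf.toString();
        String open = "<" + tag;
        int i = s.indexOf(open);
        while (i >= 0) {
            int next = i + open.length();
            if (next < s.length()) {
                char c = s.charAt(next);
                if (c == '>' || Character.isWhitespace(c)) {
                    return i;
                }
            }
            // same prefix but a different (or truncated) tag name: keep searching
            i = s.indexOf(open, i + 1);
        }
        return -1;
    }

    /** Index of the last "</tag>", or -1 if none. */
    static int lastClose(CharSequence buf, String tag) {
        return buf.toString().lastIndexOf("</" + tag + ">");
    }

    public static void main(String[] args) {
        String chunk = "<Records>\n<Record><BAS>123</BAS></Record>\n<Record><BAS>4";
        System.out.println(chunk.indexOf("<Record"));   // 0: the naive lookup hits <Records>
        System.out.println(firstOpen(chunk, "Record")); // 10
        System.out.println(lastClose(chunk, "Record")); // 32
    }
}
```

The chunk deliberately ends in the middle of a second record, which is the normal situation when reading fixed-size chunks: everything after index 32 has to stay in the cache until the next read completes it.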
4) Match the XML elements.
The previous step gave us the indexes of the tag we want; now we match the <Record></Record> elements inside that range.
public static List<String> getSubUtil(String soap, String rgex) {
    List<String> list = new ArrayList<String>();
    Pattern pattern = Pattern.compile(rgex);
    Matcher m = pattern.matcher(soap);
    while (m.find()) {
        // collect capture group 1 of every match
        list.add(m.group(1));
    }
    return list;
}
Test:
public static void main(String[] args) throws IOException {
    StringBuffer cache = new StringBuffer();
    cache.append("<Records>"
            + "<Record>\r\n" +
            " <BAS>123</BAS>\r\n" +
            " <Entity>\r\n" +
            " <LegalName>1 AVSUPER FUND</LegalName>\r\n" +
            " <OtherEntityNames>\r\n" +
            " <OtherEntityName xml:lang=\"1 en\" type=\"TRADING_OR_OPERATING_NAME\">1 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" +
            " </OtherEntityNames>\r\n" +
            " </Entity> \r\n" +
            " </Record><Record>\r\n" +
            " <BAS>456</BAS>\r\n" +
            " <Entity>\r\n" +
            " <LegalName>2 AVSUPER FUND</LegalName>\r\n" +
            " <OtherEntityNames>\r\n" +
            " <OtherEntityName xml:lang=\"2 en\" type=\"TRADING_OR_OPERATING_NAME\">2 AvSuper Pty Ltd as trustee for AvSuper Fund</OtherEntityName>\r\n" +
            " </OtherEntityNames>\r\n" +
            " </Entity>\r\n" +
            " </Record>"
            + "</Records>");
    StringBuffer texts = new StringBuffer();
    String node = "Record";
    // group 1 captures everything between "<Record" and the next "</Record>"
    String rgex = "<" + node + "([\\s\\S]*?)</" + node + ">";
    List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
    elemTexts.forEach(str -> {
        // the first match starts at <Records>, so its group begins with "s><Record";
        // remove that prefix before rebuilding the element
        str = str.replace("s><" + node, "");
        String t = "<" + node + str + "</" + node + ">";
        texts.append(t);
    });
    System.out.println(texts);
}
Note: the leading and trailing <Records> tags have been removed.
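The pattern <Record([\s\S]*?)</Record> also starts matching at <Records>, which is why the test strips the "s><Record" prefix afterwards. One possible way to avoid that clean-up step is a lookahead that requires the tag name to end right after the node name. The sketch below assumes that approach; RecordMatcher and extract are hypothetical names, not part of StrSplitUtil.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecordMatcher {
    /** Returns every complete <node ...>...</node> element found in xml, tags included. */
    static List<String> extract(String xml, String node) {
        String q = Pattern.quote(node);
        // (?=[\s>]) accepts "<Record>" and "<Record attr=...>" but rejects "<Records>"
        Pattern p = Pattern.compile("<" + q + "(?=[\\s>])[\\s\\S]*?</" + q + ">");
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(xml);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<Records><Record>\r\n<BAS>123</BAS></Record>"
                + "<Record><BAS>456</BAS></Record></Records>";
        extract(xml, "Record").forEach(System.out::println);
    }
}
```

Like the original, this works purely on text: it does not handle elements of the same name nested inside each other, self-closing <Record/> elements, or tags that appear inside comments or CDATA sections.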
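5) The integrated parsing utility class.
Using the indexes obtained in step 3, delete the already-processed content from cache after each round of matching. This is the core of the XML-splitting approach.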
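public void parse() {
    MappedBiggerFileReader reader = null;
    try {
        reader = new MappedBiggerFileReader(filePath, size * 1024);
        StringBuffer cache = new StringBuffer();
        while (reader.read() != -1) {
            cache.append(new String(reader.getArray()));
            Range range = Strkit.getRangeForTags(cache, node);
            // proceed only once the cache holds an opening tag and a later closing tag;
            // otherwise keep reading (this also avoids delete(0, -1) throwing)
            if (range.getFrom() >= 0 && range.getTo() > range.getFrom()) {
                StringBuffer texts = new StringBuffer();
                String rgex = "<" + node + "([\\s\\S]*?)</" + node + ">";
                List<String> elemTexts = StrSplitUtil.getSubUtil(cache.toString(), rgex);
                elemTexts.forEach(str -> {
                    // the first match starts at <Records>, so its group begins with "s>\n<Record";
                    // this assumes "\n" line endings and no indentation before <Record>
                    str = str.replace("s>\n<" + node, "");
                    String t = "<" + node + str + "</" + node + ">";
                    texts.append(t);
                });
                texts.insert(0, "<Root>");
                texts.append("</Root>");
                System.out.println("===============");
                System.out.println(texts);
                // drop everything before the last closing tag; the tag itself stays in cache
                cache = cache.delete(0, range.getTo());
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // release the reader even if parsing failed
        if (reader != null) {
            try {
                reader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}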
Test:
public static void main(String[] args) {
    new XmlParseUtil("D:\\test\\test.xml", "Record", 1).parse();
}
Note: the Record elements extracted from each chunk are wrapped in <Root></Root>, which makes it easy to parse them as XML afterwards with XstreamUtil.
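The whole loop can also be tried without any file, which makes the cache handling easier to follow. The sketch below is an assumption-laden variant, not the original XmlParseUtil. RecordSplitter and split are hypothetical names. It reads characters from a Reader instead of decoding raw byte chunks, uses a lookahead pattern so <Records> is never matched, and removes consumed text up to the end of the last complete element.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecordSplitter {
    /** Reads in chunks of chunkChars characters and returns every complete <node> element. */
    static List<String> split(Reader in, String node, int chunkChars) throws IOException {
        if (chunkChars <= 0) {
            throw new IllegalArgumentException("chunkChars must be positive");
        }
        String q = Pattern.quote(node);
        Pattern p = Pattern.compile("<" + q + "(?=[\\s>])[\\s\\S]*?</" + q + ">");
        List<String> records = new ArrayList<>();
        StringBuilder cache = new StringBuilder();
        char[] chunk = new char[chunkChars];
        int n;
        while ((n = in.read(chunk)) != -1) {
            cache.append(chunk, 0, n);
            Matcher m = p.matcher(cache);
            int consumed = 0;
            while (m.find()) {
                records.add(m.group());
                consumed = m.end(); // end of the last complete element seen so far
            }
            // keep only the unfinished tail; it is completed by later chunks
            cache.delete(0, consumed);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<Data><Records>\n"
                + "<Record><BAS>123</BAS></Record>\n"
                + "<Record><BAS>456</BAS></Record>\n"
                + "</Records></Data>";
        // a tiny chunk size forces records to span several reads
        for (String record : split(new StringReader(xml), "Record", 16)) {
            System.out.println("<Root>" + record + "</Root>");
        }
    }
}
```

The same text-level caveats apply as before (no nested elements of the same name, no CDATA or comment awareness). The cache also grows for as long as no complete element has been seen, so a single very large Record still has to fit in memory.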
3. The source code is attached for download. If anything is unclear or looks wrong, feel free to leave a comment.
Notes:
1. If you run into: Xstream NumberFormatException: Zero length string...
This may be because some data nodes in the XML file you are testing have an attribute whose value is an empty string.
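The underlying failure can be reproduced in plain Java: Integer.decode("") throws a NumberFormatException. That XStream reaches such a call when it converts an empty attribute to a number is an assumption here, not something verified against its source. One possible workaround, sketched below with the hypothetical names EmptyAttributes and strip, is to remove empty attributes from each split chunk before handing it to the XML parser. Whether dropping them is acceptable depends on your data.

```java
final class EmptyAttributes {
    /** Removes attributes written as name="" (text-level only; single-quoted values are not handled). */
    static String strip(String xml) {
        return xml.replaceAll("\\s+[\\w:.-]+\\s*=\\s*\"\"", "");
    }

    public static void main(String[] args) {
        try {
            Integer.decode(""); // an empty string cannot be converted to a number
        } catch (NumberFormatException e) {
            System.out.println("decode failed: " + e.getMessage());
        }
        System.out.println(strip("<Entity id=\"\" type=\"FUND\">x</Entity>"));
    }
}
```

Because this is a plain regular expression, it would also remove text such as ` x=""` appearing inside element content, so treat it as a starting point rather than a general fix.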