The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on the .NET Platform Without Office (Final Part)

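【Digression】

This is the last article in the series. So that nothing feels missing, and to round the series off, I will also cover OOXML. To be honest, OOXML is very easy to parse, and documentation and mature open-source libraries for it are plentiful, so I will only go through it briefly. The article links to a number of references; follow them if you want to dig deeper.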

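【Series Index】

The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on the .NET Platform Without Office (Part 1)
Reading DocumentSummaryInformation and SummaryInformation from Office binary documents
The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on the .NET Platform Without Office (Part 2)
Reading the text of Word binary documents (.doc), including the body, headers, footers, comments and so on
The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on the .NET Platform Without Office (Part 3)
A detailed look at the storage structure of Office binary documents, and reading the text of PowerPoint binary documents (.ppt)
The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on the .NET Platform Without Office (Final Part)
How to parse Office Open XML documents (.docx, .pptx), and common open-source libraries for parsing Office files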

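【1. A First Look at Office Open XML (OOXML)】

First, Microsoft's official description of Office Open XML is worth reading (for details see http://office.microsoft.com/zh-cn/support/HA010205815.aspx?CTT=3).

As that description makes clear, unlike Windows Compound Documents, OOXML was open from the start, and because the format is zip + xml, reading it is much easier. If all we want is to extract text, we do not even need to read any of the document's parameters!

If you have not met OOXML before, rename the extension of any docx, pptx or xlsx file you have at hand to zip and open it with an archive tool.

Opening a docx, a pptx and an xlsx this way shows a clear directory structure in each, so all we need to do is read the zip file with a zip library and then parse the XML files inside. On .NET Framework 3.0 and above you can use the built-in Package class (System.IO.Packaging, in WindowsBase.dll) to unpack the file. In my experience, if you only need to read file streams or contents out of the zip, the Package class in WindowsBase works well. For .NET CF, or for CLR 2.0 and below, you can use SharpZipLib (supports CLR 1.1, 2.0 and 4.0; website http://www.icsharpcode.net/) or DotNetZip (supports CLR 2.0; website http://dotnetzip.codeplex.com/). I find the latter's license friendlier.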
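To make the "zip + xml" point concrete, here is a minimal sketch that lists the entries of a package. It is an illustration rather than the approach used in this series: it assumes the ZipArchive class from System.IO.Compression (available from .NET Framework 4.5, so newer than the runtimes discussed above), and it builds a tiny in-memory archive with hypothetical placeholder entries instead of opening a real document.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ZipListingSketch
{
    // Builds a small zip in memory that mimics the layout of a .docx.
    // The entry contents are placeholders, not real OOXML.
    public static byte[] BuildSamplePackage()
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            // leaveOpen = true so the buffer is still usable after the archive is disposed
            using (ZipArchive zip = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                string[] names =
                {
                    "[Content_Types].xml",
                    "docProps/app.xml",
                    "docProps/core.xml",
                    "word/document.xml"
                };

                foreach (string name in names)
                {
                    ZipArchiveEntry entry = zip.CreateEntry(name);
                    using (StreamWriter writer = new StreamWriter(entry.Open(), Encoding.UTF8))
                    {
                        writer.Write("<placeholder/>");
                    }
                }
            }

            return buffer.ToArray();
        }
    }

    // Returns the full name of every entry in the archive.
    public static List<string> ListEntries(byte[] packageBytes)
    {
        List<string> result = new List<string>();

        using (MemoryStream stream = new MemoryStream(packageBytes))
        using (ZipArchive zip = new ZipArchive(stream, ZipArchiveMode.Read))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                result.Add(entry.FullName);
            }
        }

        return result;
    }

    public static void Main()
    {
        foreach (string name in ListEntries(BuildSamplePackage()))
        {
            Console.WriteLine(name);
        }
    }
}
```

For a real file, the same ListEntries logic can be pointed at the bytes of any .docx, .pptx or .xlsx.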

For example, opening an OOXML file with the built-in Package class:

#region Fields
protected FileStream m_stream;
protected Package m_package;
#endregion

#region Constructor
/// <summary>
/// Initializes an OfficeOpenXMLFile
/// </summary>
/// <param name="filePath">Path of the file</param>
public OfficeOpenXMLFile(String filePath)
{
   try
   {
       this.m_stream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
       this.m_package = Package.Open(this.m_stream);

       this.ReadProperties();
       this.ReadCoreProperties();
       this.ReadContent();
   }
   finally
   {
       if (this.m_package != null)
       {
           this.m_package.Close();
       }

       if (this.m_stream != null)
       {
           this.m_stream.Close();
       }
   }
}
#endregion

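【2. Parsing OOXML Document Properties】

The document properties of an OOXML file live in the docProps directory. Three files matter most:

app.xml: records the properties of the document; its content is similar to the earlier DocumentSummaryInformation.
core.xml: records the core properties of the document, such as creation time and last modified time; its content is similar to the earlier SummaryInformation.
thumbnail.*: the thumbnail of the document. Different file types store it in different formats, for example emf for Word, wmf for Excel and jpeg for PowerPoint.

We only need to iterate over all child nodes in the XML file to read every property. For tidiness, the code keeps the names used for Windows Compound Files: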

#region Constants
private const String PropertiesNameSpace = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
private const String CorePropertiesNameSpace = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
#endregion

#region Fields
protected Dictionary<String, String> m_properties;
protected Dictionary<String, String> m_coreProperties;
#endregion

#region Properties
/// <summary>
/// Gets the DocumentSummaryInformation
/// </summary>
public override Dictionary<String, String> DocumentSummaryInformation
{
   get
   {
       return this.m_properties;
   }
}

/// <summary>
/// Gets the SummaryInformation
/// </summary>
public override Dictionary<String, String> SummaryInformation
{
   get
   {
       return this.m_coreProperties;
   }
}
#endregion

#region Read Properties
private void ReadProperties()
{
   if (this.m_package == null)
   {
       return;
   }

   // GetPart throws when the part is missing (it never returns null), so check PartExists first
   Uri uri = new Uri("/docProps/app.xml", UriKind.Relative);
   if (!this.m_package.PartExists(uri))
   {
       return;
   }

   XmlDocument doc = new XmlDocument();
   using (Stream stream = this.m_package.GetPart(uri).GetStream())
   {
       doc.Load(stream);
   }

   XmlNodeList nodes = doc.GetElementsByTagName("Properties", PropertiesNameSpace);
   if (nodes.Count < 1)
   {
       return;
   }

   this.m_properties = new Dictionary<String, String>();
   foreach (XmlNode child in nodes[0].ChildNodes)
   {
       if (child.NodeType != XmlNodeType.Element)
       {
           continue;
       }

       // The indexer overwrites instead of throwing if an element name repeats
       this.m_properties[child.LocalName] = child.InnerText;
   }
}
#endregion

#region Read CoreProperties
private void ReadCoreProperties()
{
   if (this.m_package == null)
   {
       return;
   }

   // GetPart throws when the part is missing (it never returns null), so check PartExists first
   Uri uri = new Uri("/docProps/core.xml", UriKind.Relative);
   if (!this.m_package.PartExists(uri))
   {
       return;
   }

   XmlDocument doc = new XmlDocument();
   using (Stream stream = this.m_package.GetPart(uri).GetStream())
   {
       doc.Load(stream);
   }

   XmlNodeList nodes = doc.GetElementsByTagName("coreProperties", CorePropertiesNameSpace);
   if (nodes.Count < 1)
   {
       return;
   }

   this.m_coreProperties = new Dictionary<String, String>();
   foreach (XmlNode child in nodes[0].ChildNodes)
   {
       if (child.NodeType != XmlNodeType.Element)
       {
           continue;
       }

       // The indexer overwrites instead of throwing if an element name repeats
       this.m_coreProperties[child.LocalName] = child.InnerText;
   }
}
#endregion
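To show what this kind of loop produces, here is a small standalone sketch that runs the same idea over a hand-written core.xml fragment. The sample XML and its values (title, creator, date) are made up for illustration, and the class and method names are hypothetical; only the namespace URIs are the standard ones.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class CorePropertiesSketch
{
    private const string CoreNs = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    // A hand-written sample; real files usually carry more elements.
    public const string SampleXml =
        "<cp:coreProperties" +
        " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\"" +
        " xmlns:dc=\"http://purl.org/dc/elements/1.1/\"" +
        " xmlns:dcterms=\"http://purl.org/dc/terms/\"" +
        " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">" +
        "<dc:title>Sample</dc:title>" +
        "<dc:creator>Alice</dc:creator>" +
        "<dcterms:created xsi:type=\"dcterms:W3CDTF\">2013-01-01T00:00:00Z</dcterms:created>" +
        "</cp:coreProperties>";

    // Maps each child element's local name to its text content.
    public static Dictionary<string, string> Parse(string xml)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();

        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList roots = doc.GetElementsByTagName("coreProperties", CoreNs);
        if (roots.Count < 1)
        {
            return result;
        }

        foreach (XmlNode child in roots[0].ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            // LocalName drops the prefix, so dc:title becomes "title"
            result[child.LocalName] = child.InnerText;
        }

        return result;
    }

    public static void Main()
    {
        foreach (KeyValuePair<string, string> pair in Parse(SampleXml))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

Because the keys are local names, elements from different namespaces (dc, dcterms, cp) end up in one flat dictionary, which is what makes the result resemble the old SummaryInformation.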

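【3. Parsing Word 2007 Files】

Most of the content of a Word file (.docx) lives in the word directory. The most important parts are:

document.xml: the body text of the Word document
footer*.xml: the footers of the Word document
header*.xml: the headers of the Word document
comments.xml: the comments of the Word document
footnotes.xml: the footnotes of the Word document
endnotes.xml: the endnotes of the Word document

Here we only read the body text of the Word document. OOXML stores text in a nested structure: in Word, <w:p></w:p> encloses a paragraph, and nested inside it are <w:t></w:t> elements, which hold the text itself. In addition, <w:tab/> is a tab character, <w:br w:type="page"/> is a page break, and so on. We therefore need a method that processes these tags recursively: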

/// <summary>
/// Extracts the text contained in a node
/// </summary>
/// <param name="node">XmlNode</param>
/// <returns>The text contained in the node</returns>
public static String ReadNode(XmlNode node)
{
   if ((node == null) || (node.NodeType != XmlNodeType.Element)) // null or not an element: nothing to read
   {
       return String.Empty;
   }

   StringBuilder nodeContent = new StringBuilder();

   foreach (XmlNode child in node.ChildNodes)
   {
       if (child.NodeType != XmlNodeType.Element)
       {
           continue;
       }

       switch (child.LocalName)
       {
           case "t": // text run
               // Append the text verbatim. Trimming it and then re-adding one space for
               // xml:space="preserve" would drop or invent whitespace around the run.
               nodeContent.Append(child.InnerText);
               break;
           case "cr": // carriage return
           case "br": // break (a line break, or a page break when w:type="page")
               nodeContent.Append(Environment.NewLine);
               break;
           case "tab": // tab
               nodeContent.Append("\t");
               break;
           case "p": // paragraph
               nodeContent.Append(ReadNode(child));
               nodeContent.Append(Environment.NewLine);
               break;
           default: // anything else: recurse into the children
               nodeContent.Append(ReadNode(child));
               break;
       }
   }

   return nodeContent.ToString();
}
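A worked example may help to see what such a recursive walk returns. The sketch below is a condensed, standalone variant of the method above, not the series' actual code: the class name is hypothetical, the sample fragment is hand-written, and it uses "\n" instead of Environment.NewLine so the result is the same on every platform.

```csharp
using System;
using System.Text;
using System.Xml;

public static class ReadNodeSketch
{
    // A hand-written body with two paragraphs, a tab and a run that preserves its trailing space.
    public const string SampleXml =
        "<w:body xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:p><w:r><w:t>Hello</w:t><w:tab/><w:t xml:space=\"preserve\">big </w:t><w:t>world</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>Second</w:t></w:r></w:p>" +
        "</w:body>";

    // Recursively collects text, mapping tabs, breaks and paragraphs to plain characters.
    public static string ExtractText(XmlNode node)
    {
        StringBuilder text = new StringBuilder();

        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            switch (child.LocalName)
            {
                case "t":
                    text.Append(child.InnerText);
                    break;
                case "tab":
                    text.Append('\t');
                    break;
                case "cr":
                case "br":
                    text.Append('\n');
                    break;
                case "p":
                    text.Append(ExtractText(child));
                    text.Append('\n');
                    break;
                default:
                    // Runs (w:r) and other containers: keep descending
                    text.Append(ExtractText(child));
                    break;
            }
        }

        return text.ToString();
    }

    public static string ExtractFromXml(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return ExtractText(doc.DocumentElement);
    }

    public static void Main()
    {
        // Prints "Hello<TAB>big world" on the first line and "Second" on the second
        Console.Write(ExtractFromXml(SampleXml));
    }
}
```

Because the switch only looks at local names, the same walk also handles the a:p, a:t and a:br elements used inside PowerPoint slides.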

Then we simply start reading from the root tag:

#region Constants
private const String WordNameSpace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
#endregion

#region Fields
private String m_paragraphText;
#endregion

#region Properties
/// <summary>
/// Gets the body text of the document
/// </summary>
public String ParagraphText
{
   get { return this.m_paragraphText; }
}
#endregion

#region Read content
protected override void ReadContent()
{
   if (this.m_package == null)
   {
       return;
   }

   // GetPart throws when the part is missing (it never returns null), so check PartExists first
   Uri uri = new Uri("/word/document.xml", UriKind.Relative);
   if (!this.m_package.PartExists(uri))
   {
       return;
   }

   StringBuilder content = new StringBuilder();
   XmlDocument doc = new XmlDocument();
   using (Stream stream = this.m_package.GetPart(uri).GetStream())
   {
       doc.Load(stream);
   }

   XmlNamespaceManager nsManager = new XmlNamespaceManager(doc.NameTable);
   nsManager.AddNamespace("w", WordNameSpace);

   XmlNode node = doc.SelectSingleNode("/w:document/w:body", nsManager);

   if (node == null)
   {
       return;
   }

   content.Append(NodeHelper.ReadNode(node));

   this.m_paragraphText = content.ToString();
}
#endregion
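The article reads only document.xml, but the header*.xml and footer*.xml parts listed earlier can be located in much the same way. The following is a rough sketch under several assumptions: it selects parts purely by the conventional file names (strictly speaking, the parts a document uses are identified through its relationships, which this sketch ignores), it works on a throwaway in-memory package with placeholder content, and the class and method names are hypothetical. On modern .NET, System.IO.Packaging comes from the NuGet package of the same name rather than from WindowsBase.dll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Text.RegularExpressions;

public static class HeaderFooterSketch
{
    // Matches /word/header1.xml, /word/footer2.xml and so on
    private static readonly Regex HeaderFooterPattern =
        new Regex(@"^/word/(header|footer)\d+\.xml$", RegexOptions.IgnoreCase);

    // Builds a throwaway package in memory; the part contents are placeholders.
    public static byte[] BuildSamplePackage()
    {
        MemoryStream buffer = new MemoryStream();

        using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            string[] names =
            {
                "/word/document.xml",
                "/word/header1.xml",
                "/word/footer1.xml",
                "/word/comments.xml"
            };

            foreach (string name in names)
            {
                PackagePart part = package.CreatePart(new Uri(name, UriKind.Relative), "application/xml");
                using (StreamWriter writer = new StreamWriter(part.GetStream(FileMode.Create, FileAccess.Write)))
                {
                    writer.Write("<root/>");
                }
            }
        }

        // ToArray still works after the package has been closed
        return buffer.ToArray();
    }

    // Returns the URIs of all parts that look like headers or footers.
    public static List<string> ListHeaderFooterParts(byte[] packageBytes)
    {
        List<string> result = new List<string>();

        using (MemoryStream stream = new MemoryStream(packageBytes))
        using (Package package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                string uri = part.Uri.ToString();
                if (HeaderFooterPattern.IsMatch(uri))
                {
                    result.Add(uri);
                }
            }
        }

        return result;
    }

    public static void Main()
    {
        foreach (string uri in ListHeaderFooterParts(BuildSamplePackage()))
        {
            Console.WriteLine(uri);
        }
    }
}
```

Each part found this way could then be loaded into an XmlDocument and passed to the same recursive text extraction used for the body.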

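【4. Parsing PowerPoint 2007 Files】

The main content of a PowerPoint file (.pptx) lives in the ppt directory, and the slides themselves are in its slides subdirectory, stored under names of the form slide + page number + .xml. We can read them one by one in order. Note, however, that under string comparison "slide10.xml" < "slide2.xml", so reading the parts in string order can scramble the pages. To avoid this, sort the slides by their numeric index first and then read each page starting from its root tag.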

#region Constants
private const String PowerPointNameSpace = "http://schemas.openxmlformats.org/presentationml/2006/main";
#endregion

#region Fields
private StringBuilder m_allText;
#endregion

#region Properties
/// <summary>
/// Gets all the text in the PowerPoint slides
/// </summary>
public String AllText
{
   get { return this.m_allText.ToString(); }
}
#endregion

#region Constructor
/// <summary>
/// Initializes a PptxFile
/// </summary>
/// <param name="filePath">Path of the file</param>
public PptxFile(String filePath) :
   base(filePath) { }
#endregion

#region Read content
protected override void ReadContent()
{
   if (this.m_package == null)
   {
       return;
   }

   this.m_allText = new StringBuilder();

   const String slidePrefix = "/ppt/slides/slide";
   const String slideSuffix = ".xml";
   PackagePartCollection col = this.m_package.GetParts();
   SortedList<Int32, XmlDocument> list = new SortedList<Int32, XmlDocument>();

   foreach (PackagePart part in col)
   {
       String uri = part.Uri.ToString();
       if (!uri.StartsWith(slidePrefix, StringComparison.OrdinalIgnoreCase) ||
           !uri.EndsWith(slideSuffix, StringComparison.OrdinalIgnoreCase))
       {
           continue;
       }

       // Take the page number between the prefix and the suffix; skip names without one
       String pageName = uri.Substring(slidePrefix.Length, uri.Length - slidePrefix.Length - slideSuffix.Length);
       Int32 index;
       if (!Int32.TryParse(pageName, out index))
       {
           continue;
       }

       XmlDocument doc = new XmlDocument();
       using (Stream stream = part.GetStream())
       {
           doc.Load(stream);
       }

       // The indexer avoids the exception that Add throws on a duplicate key
       list[index] = doc;
   }

   foreach (KeyValuePair<Int32, XmlDocument> pair in list)
   {
       // Use the name table of the slide being queried, not that of the last slide loaded
       XmlNamespaceManager nsManager = new XmlNamespaceManager(pair.Value.NameTable);
       nsManager.AddNamespace("p", PowerPointNameSpace);

       XmlNode node = pair.Value.SelectSingleNode("/p:sld", nsManager);

       if (node == null)
       {
           continue;
       }

       this.m_allText.Append(NodeHelper.ReadNode(node));
   }
}
#endregion
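The ordering pitfall mentioned at the start of this section is easy to reproduce in isolation. The sketch below, with hypothetical class and method names and a hand-written list of part URIs, compares the two file names ordinally and then orders the slides by the number embedded in each name. It follows the article's file-name approach; the order in which a presentation actually shows its slides is recorded separately in ppt/presentation.xml, which this sketch does not read.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SlideOrderSketch
{
    // Captures the page number of /ppt/slides/slideN.xml and ignores everything else
    private static readonly Regex SlidePattern =
        new Regex(@"^/ppt/slides/slide(\d+)\.xml$", RegexOptions.IgnoreCase);

    // Returns only the slide URIs, ordered by their numeric page index.
    public static List<string> SortSlides(IEnumerable<string> partUris)
    {
        SortedList<int, string> ordered = new SortedList<int, string>();

        foreach (string uri in partUris)
        {
            Match match = SlidePattern.Match(uri);
            int index;
            if (match.Success && int.TryParse(match.Groups[1].Value, out index))
            {
                ordered[index] = uri;
            }
        }

        return new List<string>(ordered.Values);
    }

    public static void Main()
    {
        // '1' sorts before '2', so "slide10.xml" comes before "slide2.xml" as a string
        Console.WriteLine(String.CompareOrdinal("slide10.xml", "slide2.xml") < 0);

        string[] uris =
        {
            "/ppt/slides/slide10.xml",
            "/ppt/slides/slide2.xml",
            "/ppt/slides/_rels/slide1.xml.rels",
            "/ppt/slides/slide1.xml"
        };

        foreach (string uri in SortSlides(uris))
        {
            Console.WriteLine(uri);
        }
    }
}
```

Anchoring the pattern at both ends also keeps relationship files such as slide1.xml.rels out of the result.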

Appendix: all the code for this series can be downloaded from https://github.com/mayswind/SimpleOfficeReader

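【5. Common Open-Source Libraries for Office Documents (Word, PowerPoint, Excel)】

1. NPOI: http://npoi.codeplex.com

In my view this is the best Office document library on .NET, bar none. It provides complete reading and editing of Excel files and currently supports both the binary (.xls) and OOXML (.xlsx) formats. If you have used POI, Apache's Java library, NPOI offers an almost identical API. In practice, for ASP.NET, most Office documents that need editing are Excel files, or can be replaced by Excel files, so NPOI covers almost every need. docx files are now supported as well, while doc support lives in NPOI.ScratchPad, which you can download from Source Code and compile yourself. Without OOXML support the library is only 1.5MB, and it supports .NET CLR 2.0 and 4.0.

2. Open XML SDK 2.0 for Microsoft Office: http://msdn.microsoft.com/en-us/library/bb448854(office.14).aspx

Microsoft's own Open XML SDK can read and write any OOXML document. It also ships with a tool that opens an Office document and generates the program code that would produce that document using the library. The library is on the large side, at around 5MB, and requires .NET Framework 3.5.

3. Office Binary Translator to Open XML: http://b2xtranslator.sourceforge.net/

I only learned of this library recently, although it has existed for a long time. It converts Windows Compound Documents (.doc, .ppt, .xls) to the corresponding OOXML formats (.docx, .pptx, .xlsx), and of course you can also get at the content stored in the files. For some reason the website is blocked by the firewall. If you want to study Windows Compound Documents, this is the library I would recommend: NPOI is so complete that stepping through its whole file-reading flow is very complicated, whereas single-stepping through this library is easy to follow. It splits the support for each file type (and for the supported modules) into separate projects, so supporting one file type takes only a few hundred KB, and it targets .NET CLR 2.0.

4. EPPlus: http://epplus.codeplex.com

Back in 2010, when NPOI did not yet support OOXML, I felt EPPlus was the best library for handling .xlsx files. It is only a few hundred KB and very lightweight. For reading zip files, this library did not choose SharpZipLib or DotNetZip. Older versions only need .NET Framework 3.0; I just checked, and newer versions require .NET Framework 3.5.

5. ExcelDataReader: http://exceldatareader.codeplex.com

This is also a very lightweight and handy library that can read both .xls and .xlsx. I used it before moving to EPPlus; I no longer remember which problem made me switch, nor do I know whether that problem has been fixed since. Its advantage is that it only needs .NET CLR 2.0 and supports .NET CF, although nobody needs to develop Windows Mobile applications any more.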

【6. Related Links】

1. OpenXMLDeveloper.org: http://openxmldeveloper.org
2. How to: Retrieve Paragraphs from an Office Open XML Document: http://msdn.microsoft.com/zh-cn/library/bb669175.aspx
3. How to Manipulate Office Open XML Format Documents: http://www.microsoft.com/china/msdn/library/office/office/howManipulateOfficexml.mspx
4. How do I... (Open XML SDK): http://msdn.microsoft.com/zh-cn/library/bb491088.aspx

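【Afterword】

At last, the final article: the series ends here. Thank you all for your support; I have finally fulfilled a wish from two years ago. Honestly, I did not expect the first article to get so many visits and recommendations, since people who need to parse Office documents are, after all, a minority. I wrote these four articles hoping they would serve as a starting point. At the very least they give a basic understanding of Office documents, and going deeper afterwards should be much easier. That is why I wanted to write all of this down.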

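【Addendum】

After finishing these four articles I discovered by chance that Microsoft actually has Chinese documentation on this subject. I could have cried: why had I not found it earlier? So here are a few useful links.

1. Understanding Office Binary File Formats: http://msdn.microsoft.com/zh-cn/library/gg615407(v=office.14).aspx
2. Understanding the Word MS-DOC Binary File Format: http://msdn.microsoft.com/zh-CN/library/gg615596
3. Understanding the PowerPoint MS-PPT Binary File Format: http://msdn.microsoft.com/zh-CN/library/gg615594
4. Understanding Shapes in the Office Binary File Formats: http://msdn.microsoft.com/zh-CN/library/gg985447
5. Finding Shapes in Binary PowerPoint MS-PPT Files: http://msdn.microsoft.com/zh-CN/library/hh244173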
