Convert HTML to Plain Text

Introduction

This article describes a procedure for stripping out HTML tags while preserving most basic formatting. In other words, it converts HTML to plain text.

Background (optional)

This example relies heavily on regular expressions, in particular the System.Text.RegularExpressions.Regex.Replace() method. You may also find a reference on regular expression syntax useful.

Using the code

The code uses the System.Text.RegularExpressions namespace and consists of a single function, StripHTML().

First, the development formatting is removed, such as the tabs used for step-indentation and repeated whitespace. As a result, the input HTML is "flattened" into one continuous string. This serves two purposes: (1) it removes the formatting that browsers ignore, and (2) it makes the regexes work reliably (by default "." does not match a newline, so a pattern meant to span several lines would otherwise fail).
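The flattening step can be tried in isolation. The following is a minimal sketch, not the full function; the class name FlattenDemo and the helper Flatten are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class FlattenDemo
{
    // Hypothetical helper: collapses source formatting into one line.
    public static string Flatten(string html)
    {
        // Browsers render a line break as a space
        string result = html.Replace("\r", " ").Replace("\n", " ");
        // Indentation tabs carry no meaning
        result = result.Replace("\t", string.Empty);
        // Runs of spaces are rendered as a single space
        return Regex.Replace(result, "( )+", " ");
    }

    static void Main()
    {
        string html = "<div>\r\n\t<p>Hello,\r\n\t   world</p>\r\n</div>";
        // Prints: <div> <p>Hello, world</p> </div>
        Console.WriteLine(Flatten(html));
    }
}
```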

Then the header is removed by removing anything between <head> and </head> tags.

Then all scripts are removed by cutting out everything between the <script> and </script> tags, inclusive. Styles are handled the same way.
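One detail worth checking when cutting out script blocks is how far the wildcard runs. The sketch below (the names ScriptStripDemo and RemoveScripts are hypothetical) uses the lazy quantifier .*? so that each block ends at its own closing tag; a greedy .* would run from the first <script> to the last </script> and take the text in between with it.

```csharp
using System;
using System.Text.RegularExpressions;

static class ScriptStripDemo
{
    // Hypothetical helper: removes every <script>...</script> block
    // from HTML that has already been flattened to a single line.
    public static string RemoveScripts(string flatHtml)
    {
        // Normalise the opening tags by dropping their attributes
        string result = Regex.Replace(flatHtml,
            @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
        // Normalise the closing tags
        result = Regex.Replace(result,
            @"<( )*/( )*script( )*>", "</script>", RegexOptions.IgnoreCase);
        // Lazy match: stop at the first closing tag after each opening tag
        return Regex.Replace(result,
            @"<script>.*?</script>", string.Empty, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string html = "<p>one</p><script type=\"text/javascript\">a();</script>"
                    + "<p>two</p><SCRIPT>b();</SCRIPT><p>three</p>";
        // Prints: <p>one</p><p>two</p><p>three</p>
        Console.WriteLine(RemoveScripts(html));
    }
}
```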

Then the basic formatting tags, such as <BR> and <DIV>, are replaced with \r or \r\r. <TR> tags are also replaced by line breaks, and <TD> tags by tabs.
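As a rough illustration of this step, the sketch below maps a few tags to tabs and line breaks and then drops whatever tags are left. The names TagBreakDemo and TagsToBreaks are hypothetical, and the pattern for <div>, <tr> and <p> is a slightly tighter variant of my own choosing (it requires whitespace before any attributes) so that tags such as <pre> are not mistaken for <p>.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagBreakDemo
{
    // Hypothetical helper: turns layout tags into tabs and line breaks.
    public static string TagsToBreaks(string flatHtml)
    {
        // Table cells become tabs
        string result = Regex.Replace(flatHtml,
            @"<( )*td([^>])*>", "\t", RegexOptions.IgnoreCase);
        // <br>, <br/> and <br /> become a single line break
        result = Regex.Replace(result,
            @"<( )*br( )*/?( )*>", "\r", RegexOptions.IgnoreCase);
        // Block-level tags become a paragraph break
        result = Regex.Replace(result,
            @"<( )*(div|tr|p)(\s[^>]*)?>", "\r\r", RegexOptions.IgnoreCase);
        // Everything else enclosed in < > is dropped
        return Regex.Replace(result, @"<[^>]*>", string.Empty);
    }

    static void Main()
    {
        string html = "<table><tr><td>a</td><td>b</td></tr></table><p>x<br>y</p>";
        // Show the control characters explicitly
        Console.WriteLine(TagsToBreaks(html)
            .Replace("\r", "\\r").Replace("\t", "\\t"));
    }
}
```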

<LI> tags are replaced by line breaks, &bull; entities by asterisks, and special characters such as &nbsp; by their corresponding values.
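The entity replacement can also be written as a small lookup table, which makes it easier to add more entities later. This is only a sketch under my own assumptions: the names EntityDemo and ReplaceEntities are hypothetical, the table is deliberately short, and the final pattern for unknown entities is a stricter one than a plain wildcard so that ordinary text containing "&" and ";" is left alone.

```csharp
using System;
using System.Text.RegularExpressions;

static class EntityDemo
{
    // Entity / replacement pairs; extend as needed
    static readonly string[,] Map =
    {
        { "&nbsp;", " " }, { "&bull;", " * " }, { "&lt;", "<" },
        { "&gt;", ">" }, { "&copy;", "(c)" }, { "&reg;", "(r)" },
        { "&trade;", "(tm)" }
    };

    // Hypothetical helper: replaces known entities, drops unknown ones.
    public static string ReplaceEntities(string text)
    {
        for (int i = 0; i < Map.GetLength(0); i++)
        {
            text = text.Replace(Map[i, 0], Map[i, 1]);
        }
        // Remove any entity that is not in the table
        return Regex.Replace(text, @"&#?[a-z0-9]{2,8};", string.Empty,
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        // Prints: <b> (c)
        Console.WriteLine(ReplaceEntities("&lt;b&gt;&nbsp;&copy;&zwj;"));
    }
}
```

If exact character values are wanted rather than ASCII stand-ins such as "(c)", System.Net.WebUtility.HtmlDecode is a library alternative worth considering.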

Finally, all the remaining tags are replaced by empty strings.

By this stage there are likely to be a lot of redundant, repeated line breaks and tabs. Any sequence of more than two line breaks is replaced by two line breaks. Tabs are treated similarly: any sequence of more than four tabs is replaced by four tabs.
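The collapsing rule can be expressed directly with regex quantifiers. A minimal sketch, with the hypothetical names CollapseDemo and Collapse:

```csharp
using System;
using System.Text.RegularExpressions;

static class CollapseDemo
{
    // Hypothetical helper: caps runs of line breaks at 2 and tabs at 4.
    public static string Collapse(string text)
    {
        // Three or more consecutive line breaks become exactly two
        text = Regex.Replace(text, "\r{3,}", "\r\r");
        // Five or more consecutive tabs become exactly four
        return Regex.Replace(text, "\t{5,}", "\t\t\t\t");
    }

    static void Main()
    {
        string collapsed = Collapse("a\r\r\r\r\r\r\rb\t\t\t\t\t\tc");
        // Show the control characters explicitly
        Console.WriteLine(collapsed.Replace("\r", "\\r").Replace("\t", "\\t"));
    }
}
```

A single pass per character is enough here because the quantifier consumes the whole run at once, however long it is.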

private string StripHTML(string source)
{
    try
    {
        string result;

        // Remove HTML development formatting.
        // Replace line breaks with spaces,
        // because browsers insert a space there
        result = source.Replace("\r", " ");
        result = result.Replace("\n", " ");
        // Remove step-formatting (indentation tabs)
        result = result.Replace("\t", string.Empty);
        // Remove repeating spaces because browsers ignore them
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"( )+", " ");

        // Remove the header (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*head([^>])*>", "<head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*head( )*>)", "</head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Lazy match (.*?) so the removal stops at the first </head>
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<head>).*?(</head>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all scripts (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*script([^>])*>", "<script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*script( )*>)", "</script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        //result = System.Text.RegularExpressions.Regex.Replace(result,
        //         @"(<script>)([^(<script>.</script>)])*(</script>)",
        //         string.Empty,
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Lazy match (.*?) so that text between two separate
        // script blocks is not removed along with them
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<script>).*?(</script>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all styles (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*style([^>])*>", "<style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*style( )*>)", "</style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Lazy match (.*?), as for scripts
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<style>).*?(</style>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Insert tabs in place of <td> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*td([^>])*>", "\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Insert line breaks in place of <BR> and <LI> tags
        // (the optional "/" also covers <br/> and <br />)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*br( )*(/)?( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*li( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Insert paragraph breaks (double line breaks) in place
        // of <P>, <DIV> and <TR> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*div([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*tr([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*p([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove remaining tags like <a>, links, images,
        // comments etc. - anything that is enclosed inside < >
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<[^>]*>", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Replace special characters:
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&nbsp;", " ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&bull;", " * ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lsaquo;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&rsaquo;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&trade;", "(tm)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&frasl;", "/",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lt;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&gt;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&copy;", "(c)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&reg;", "(r)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&(.{2,6});", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // For testing
        //System.Text.RegularExpressions.Regex.Replace(result,
        //         this.txtRegex.Text, string.Empty,
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Make line breaking consistent
        result = result.Replace("\n", "\r");

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first by removing any spaces in between
        // the escaped characters and any redundant tabs in between line breaks
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\t)", "\t\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\r)", "\t\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\t)", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Replace multiple tabs following a line break with just one tab
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Collapse runs of three or more line breaks into two
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r){3,}", "\r\r");
        // Collapse runs of five or more tabs into four
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t){5,}", "\t\t\t\t");

        // That's it.
        return result;
    }
    catch
    {
        // On any failure, report it and fall back to the unmodified input
        MessageBox.Show("Error");
        return source;
    }
}

Points of Interest

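Line-break characters such as \n and \r had to be removed first because they cause the regexes to stop working as expected: by default, "." does not match \n, so a pattern such as (<script>).*(</script>) cannot span more than one line.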
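The effect is easy to reproduce. The sketch below (the class and method names are hypothetical) runs the same pattern with and without RegexOptions.Singleline, which is the .NET option that lets "." match a newline. It is shown as an alternative to flattening the input first, not as what StripHTML() does.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    const string Pattern = "<script>.*?</script>";

    // Default options: "." stops at \n, so a multi-line block is not matched
    public static string StripDefault(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty);
    }

    // Singleline: "." matches every character, including \n
    public static string StripSingleline(string html)
    {
        return Regex.Replace(html, Pattern, string.Empty,
            RegexOptions.Singleline);
    }

    static void Main()
    {
        string html = "<script>\nalert(1);\n</script>done";
        // The first call leaves the input unchanged
        Console.WriteLine(StripDefault(html) == html);   // True
        // The second call removes the whole block
        Console.WriteLine(StripSingleline(html));        // done
    }
}
```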

Also, to make the result string display correctly in a textbox, you may need to split it into lines and set the textbox's Lines property instead of assigning to its Text property:

this.txtResult.Lines =
      StripHTML(this.txtSource.Text).Split("\r".ToCharArray());

paceman
