Convert HTML to Plain Text

Introduction

This article describes a procedure for stripping out HTML tags while preserving most basic formatting. In other words, it converts HTML to plain text.

Background

This example relies heavily on regular expressions, in particular the System.Text.RegularExpressions.Regex.Replace() method. A reference on regular expression syntax may also be useful.

Using the code

The code uses the System.Text.RegularExpressions namespace and consists of a single function, StripHTML().

First, development formatting is removed, such as tabs used for step indentation and repeated whitespace. As a result, the input HTML is "flattened" into one continuous string. This serves two purposes: (1) to remove the formatting ignored by browsers, and (2) to make the regexes work reliably (they seem to get confused by escaped characters).
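
As a minimal sketch of this flattening step in isolation (HtmlFlattener and Flatten are hypothetical names used only for illustration, not part of the article's code), the idea might look like this:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the article does the same work inline in StripHTML().
public static class HtmlFlattener
{
    // Collapses the source formatting of an HTML string into a single line.
    public static string Flatten(string html)
    {
        // Line breaks become spaces, since a browser renders them as a space.
        string result = html.Replace("\r", " ").Replace("\n", " ");
        // Tabs used for indentation are dropped entirely.
        result = result.Replace("\t", string.Empty);
        // Runs of spaces collapse to a single space.
        return Regex.Replace(result, " +", " ");
    }
}
```

Calling Flatten on a small indented fragment should leave the tags untouched and only normalise the whitespace around them.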

Then the header is removed by deleting everything between the <head> and </head> tags.
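
The header step can be sketched on its own as follows. HeadRemover is a hypothetical name, and the sketch assumes the input has already been flattened to a single line, as described above:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the "normalise, then cut" approach.
public static class HeadRemover
{
    public static string RemoveHead(string flatHtml)
    {
        // Normalise "<HEAD profile=...>" to "<head>" so the final pattern stays simple.
        string result = Regex.Replace(flatHtml, @"<( )*head([^>])*>", "<head>",
            RegexOptions.IgnoreCase);
        // Normalise variants such as "</head >" to "</head>".
        result = Regex.Replace(result, @"<( )*/( )*head( )*>", "</head>",
            RegexOptions.IgnoreCase);
        // Cut out the header together with its tags.
        return Regex.Replace(result, "<head>.*?</head>", string.Empty,
            RegexOptions.IgnoreCase);
    }
}
```

Normalising the opening and closing tags first means the last pattern does not have to cope with attributes or stray spaces.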

Then all scripts are removed by cutting out everything between the <script> and </script> tags, inclusive. Styles are handled the same way.
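
Since scripts and styles are treated identically, one way to picture the step is a single helper taking the tag name. This is only a sketch: BlockRemover and RemoveElement are made-up names, and the lazy quantifier (.*?) is my own choice so that text sitting between two separate script blocks survives.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not the article's code.
public static class BlockRemover
{
    // Removes every <tagName ...>...</tagName> block, tags included.
    public static string RemoveElement(string html, string tagName)
    {
        string tag = Regex.Escape(tagName);
        // Opening tag with optional attributes, lazy body, closing tag.
        string pattern = @"<\s*" + tag + @"\b[^>]*>.*?<\s*/\s*" + tag + @"\s*>";
        // Singleline lets "." match line breaks in case the input was not flattened.
        return Regex.Replace(html, pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

For example, RemoveElement(html, "script") followed by RemoveElement(html, "style") would cover both cases mentioned above.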

Then the basic formatting tags, such as <BR> and <DIV>, are replaced with \r or \r\r. <TR> tags are also replaced by line breaks, and <TD> tags by tabs.
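
A condensed sketch of this mapping is shown below. LayoutTags is a hypothetical name, and grouping <DIV>, <TR> and <P> into one alternation, as well as accepting the self-closing <br /> form, are assumptions of mine rather than something the article spells out:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper mapping layout tags to whitespace characters.
public static class LayoutTags
{
    public static string ToWhitespace(string flatHtml)
    {
        // Each table cell starts with a tab.
        string result = Regex.Replace(flatHtml, @"<( )*td([^>])*>", "\t",
            RegexOptions.IgnoreCase);
        // <br>, <br/> and <br /> become a single line break.
        result = Regex.Replace(result, @"<( )*br( )*/?( )*>", "\r",
            RegexOptions.IgnoreCase);
        // Block-level tags become a double line break (a blank line).
        result = Regex.Replace(result, @"<( )*(div|tr|p)\b([^>])*>", "\r\r",
            RegexOptions.IgnoreCase);
        return result;
    }
}
```

Closing tags such as </td> are deliberately left alone here; they disappear later when all remaining tags are stripped.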

<LI> tags are replaced by line breaks, and special characters such as &nbsp; are replaced with plain-text equivalents (for example, &bull; becomes *).
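
The entity handling can be pictured as a lookup table with a fallback. The sketch below is an assumption-laden variant, not the article's code: EntityReplacer is a hypothetical name, the table holds only the entities the listing handles, and unknown entities are matched as 2 to 6 letters and dropped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; replaces known entities and removes unknown ones.
public static class EntityReplacer
{
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>
        {
            { "nbsp", " " }, { "bull", " * " }, { "lsaquo", "<" }, { "rsaquo", ">" },
            { "trade", "(tm)" }, { "frasl", "/" }, { "lt", "<" }, { "gt", ">" },
            { "copy", "(c)" }, { "reg", "(r)" }
        };

    public static string Replace(string text)
    {
        // A single pass, so text produced by one replacement is never re-examined.
        return Regex.Replace(text, @"&([A-Za-z]{2,6});", m =>
        {
            string value;
            string name = m.Groups[1].Value.ToLowerInvariant();
            return Known.TryGetValue(name, out value) ? value : string.Empty;
        });
    }
}
```

If full entity coverage matters more than the ASCII substitutions, System.Net.WebUtility.HtmlDecode is worth a look as an alternative.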

Finally, all the remaining tags are replaced by empty strings.
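
A sketch of that last tag pass follows; TagStripper is a hypothetical name. Note that it runs on text whose entities are still encoded, so a literal &lt; in the content cannot be mistaken for the start of a tag:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; removes anything enclosed in < >.
public static class TagStripper
{
    public static string StripTags(string html)
    {
        // Matches links, images, closing tags and comments alike.
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }
}
```

This is a blunt pattern: it assumes no attribute value contains a literal > character.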

By this stage there are likely to be a lot of redundant repeated line breaks and tabs. Any sequence of more than 2 line breaks is replaced by two line breaks. Similarly with tabs: sequences of more than 4 tabs are replaced by 4 tabs.
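
The collapsing rule itself can be written with quantified patterns. This is a sketch of the rule as stated in the paragraph above, with WhitespaceCollapser being a hypothetical name:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper applying the "at most 2 breaks, at most 4 tabs" rule.
public static class WhitespaceCollapser
{
    public static string Collapse(string text)
    {
        // Three or more consecutive line breaks become exactly two.
        string result = Regex.Replace(text, "\r{3,}", "\r\r");
        // Five or more consecutive tabs become exactly four.
        return Regex.Replace(result, "\t{5,}", "\t\t\t\t");
    }
}
```

Each run is handled in a single pass regardless of its length, so no looping over the string is needed.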

 
private string StripHTML(string source)
{
    try
    {
        string result;

        // Remove HTML development formatting.
        // Replace line breaks with spaces
        // because browsers render them as spaces
        result = source.Replace("\r", " ");
        result = result.Replace("\n", " ");
        // Remove step formatting (tabs)
        result = result.Replace("\t", string.Empty);
        // Remove repeated spaces because browsers ignore them
        result = System.Text.RegularExpressions.Regex.Replace(result,
                                                              @"( )+", " ");

        
        // Remove the header (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*head([^>])*>", "<head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*head( )*>)", "</head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<head>).*(</head>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // Remove all scripts (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*script([^>])*>", "<script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*script( )*>)", "</script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        //result = System.Text.RegularExpressions.Regex.Replace(result,
        //         @"(<script>)([^(<script>.</script>)])*(</script>)",
        //         string.Empty,
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // .*? is lazy, so each script block is removed separately
        // and the text between two blocks is kept
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<script>).*?(</script>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        
        
        // Remove all styles (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*style([^>])*>", "<style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*style( )*>)", "</style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Lazy match, so text between two style blocks is kept
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<style>).*?(</style>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // Insert tabs in place of <td> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*td([^>])*>", "\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // Insert line breaks in place of <BR> and <LI> tags
        // (the optional "/" also covers <br/> and <br />)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*br( )*/?( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*li( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // Insert paragraphs (double line breaks) in place
        // of <P>, <DIV> and <TR> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*div([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*tr([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*p([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // Remove remaining tags like <a>, links, images,
        // comments etc. - anything that's enclosed inside < >
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<[^>]*>", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // Replace special characters:
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&nbsp;", " ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&bull;", " * ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lsaquo;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&rsaquo;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&trade;", "(tm)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&frasl;", "/",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lt;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&gt;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&copy;", "(c)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&reg;", "(r)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&(.{2,6});", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        
        // For testing
        //System.Text.RegularExpressions.Regex.Replace(result,
        //       this.txtRegex.Text, string.Empty,
        //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Make line breaking consistent
        result = result.Replace("\n", "\r");

        
        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first by removing any spaces in between
        // the escaped characters and redundant tabs in between line breaks
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\t)", "\t\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\r)", "\t\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\t)", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Replace multiple tabs following a line break with just one tab
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        
        // Collapse runs of more than 2 line breaks to 2
        // and runs of more than 4 tabs to 4.
        // A quantified pattern handles a run of any length in one pass.
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r){3,}", "\r\r");
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t){5,}", "\t\t\t\t");


        
        // That's it.
        return result;
    }
    catch
    {
        // MessageBox requires System.Windows.Forms
        MessageBox.Show("Error");
        return source;
    }
}

Points of Interest

Escaped characters such as \n and \r had to be removed first because they cause the regexes to stop working as expected.
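
One likely explanation, offered here as an assumption rather than something the article confirms: in .NET regular expressions the "." metacharacter does not match \n unless RegexOptions.Singleline is set, so a pattern like <script>.*</script> silently fails on multi-line input. The sketch below (DotNewlineDemo is a hypothetical name) shows the difference:

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of how "." treats line breaks under different options.
public static class DotNewlineDemo
{
    // Reports whether a whole <script> block is matched under the given options.
    public static bool MatchesBlock(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, "<script>.*</script>", options);
    }
}
```

With input containing \n between the tags, MatchesBlock returns false for RegexOptions.None and true for RegexOptions.Singleline. Flattening the input first, as StripHTML() does, sidesteps the issue without changing the options.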

Also, to make the result string display correctly in a textbox, one might need to split it up and set the textbox's Lines property instead of assigning to the Text property:

this.txtResult.Lines =
      StripHTML(this.txtSource.Text).Split("\r".ToCharArray());
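
Outside of a Windows Forms textbox, the same idea can be sketched without any UI dependency. LineBreaks, ToLines and ToPlatformText are hypothetical names; the only assumption is that the converted text uses \r as its sole line break, which is what StripHTML() produces:

```csharp
using System;

// Hypothetical helper for consuming the \r-separated output.
public static class LineBreaks
{
    // Splits the converted text into individual lines.
    public static string[] ToLines(string text)
    {
        return text.Split('\r');
    }

    // Rejoins the lines using the platform's own line terminator.
    public static string ToPlatformText(string text)
    {
        return string.Join(Environment.NewLine, ToLines(text));
    }
}
```

ToPlatformText is handy when the result is written to a file or console rather than assigned to a Lines property.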

paceman

