RTF Builder

http://www.codeproject.com/KB/recipes/RtfConverter.aspx

In RTF, special characters such as Chinese must be written with the Unicode control words (\ucN, \uN).

The current version is RTF Specification 1.9.1; the relevant excerpt follows:


Unicode RTF
From Word 97 onward, Word is based on Unicode. Text characters can be handled using the
16-bit Unicode character-encoding scheme defined in this section. Expressing this text in RTF
required a new mechanism, because until Word 97, RTF handled only 7-bit characters directly
and 8-bit characters encoded as hexadecimal using \'xx. The Unicode mechanism described here
can be applied to any RTF destination or body text.
Control word Meaning
\ucN This keyword represents the number (count) of bytes that follow a \uN Unicode character to give
the codepage code that best corresponds to the Unicode character. This keyword may be used at
any time, and values are scoped like character properties. That is, a \ucN keyword applies only
to text following the keyword, and within the same (or deeper) nested braces. On exiting the
group, the previous \ucN value is restored. The reader must keep a stack of counts seen and
use the most recent one to skip the appropriate number of characters when it encounters a \uN
keyword. When leaving an RTF group that specified a \ucN value, the reader must revert to the
previous value. A default of 1 should be assumed if no \ucN keyword has been seen in the
current or outer scopes.
A common practice is to emit no ANSI representation for Unicode characters within a Unicode
destination context (that is, inside a \ud destination). Typically, the destination will contain a
\uc0 control sequence. There is no need to reset the count on leaving the \ud destination,
because the scoping rules will ensure the previous value is restored.
\uN This keyword represents a single Unicode character that has no equivalent ANSI representation
based on the current ANSI code page. N represents the Unicode character value expressed as a
decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this
way, old readers will ignore the \uN keyword and pick up the ANSI representation properly.
When this keyword is encountered, the reader should ignore the next N' characters, where N'
corresponds to the last \ucN' value encountered.
As with all RTF keywords, a keyword-terminating space may be present (before the ANSI
characters) that is not counted in the characters to skip. While this is not likely to occur (or
recommended), a \binN keyword, its argument, and the binary data that follows are considered
one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or
closing brace) is encountered while scanning skippable data, the skippable data is considered to
end before the delimiter. This makes it possible for a reader to perform some rudimentary error
recovery. To include an RTF delimiter in skippable data, it must be represented using the
appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control
word or symbol is considered a single character for the purposes of counting skippable
characters.
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character,
should output \uN followed by the best ANSI representation it can manage. Often a question
mark is used if no reasonable ANSI character exists. In addition, if the Unicode character
translates into an ANSI character stream with a count of bytes differing from the current Unicode
Character Byte Count, it should emit the appropriate \ucN keyword prior to the \uN keyword to
notify the reader of the change.
Most RTF control words accept signed 16-bit numbers as arguments. For these control words,
Unicode values greater than 32767 are expressed as negative numbers. For example, the
character code U+F020 is given by \u-4064. To get -4064, convert F020 (hexadecimal) to decimal
(61472) and subtract 65536.
Occasionally Word writes SYMBOL_CHARSET (nonUnicode) characters in the range
U+F020..U+F0FF instead of U+0020..U+00FF. Internally Word uses the values U+F020..U+F0FF
for these characters so that plain-text searches don’t mistakenly match SYMBOL_CHARSET
characters when searching for Unicode characters in the range U+0020..U+00FF. To find out the
correct symbol font to use, e.g., Wingdings, Symbol, etc., find the last SYMBOL_CHARSET font
control word \fN used, look up font N in the font table and find the face name. The charset is
specified by the \fcharsetN control word and SYMBOL_CHARSET is for N = 2. This corresponds
to codepage 42.
\upr This keyword represents a destination with two embedded destinations, one represented using
Unicode and the other using ANSI. This keyword operates in conjunction with the \ud keyword to
provide backward compatibility. The general syntax is as follows:
'{' \upr '{' keyword ansi_text '}{\*' \ud '{' keyword Unicode_text '}}}'
Notice that the \upr keyword destination does not use the \* keyword; this forces the old RTF
readers to pick up the ANSI representation and discard the Unicode one.
\ud This destination is represented in Unicode. The text is represented using a mixture of ANSI
translation and \uN keywords to represent characters that do not have exact ANSI equivalents.
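
The reader-side rules quoted above (a default count of 1, \ucN values scoped by group, and skipping the fallback characters after each \uN) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the class name RtfUnicodeReader and the method Decode are hypothetical, \binN is not handled, all other control words are simply ignored, a \'xx escape outside skipped data is appended as the character with that code rather than decoded through a code page, and malformed hex digits are not guarded against.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the RTF Builder code in this post.
public static class RtfUnicodeReader
{
    private static bool IsAsciiLetter(char c) => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    private static bool IsDigit(char c) => c >= '0' && c <= '9';

    // Extracts the text of an RTF fragment, honouring \ucN scoping and \uN skipping.
    public static string Decode(string rtf)
    {
        var output = new StringBuilder();
        var saved = new Stack<int>();   // \ucN values of the enclosing groups
        int uc = 1;                     // default when no \ucN has been seen
        int skip = 0;                   // fallback "characters" still to ignore
        int i = 0;

        while (i < rtf.Length)
        {
            char c = rtf[i++];
            if (c == '{') { saved.Push(uc); skip = 0; }          // skippable data ends at a brace
            else if (c == '}') { if (saved.Count > 0) uc = saved.Pop(); skip = 0; }
            else if (c == '\r' || c == '\n') { /* raw line breaks carry no text */ }
            else if (c != '\\')
            {
                if (skip > 0) skip--; else output.Append(c);
            }
            else if (i >= rtf.Length) { break; }                 // dangling backslash
            else if (IsAsciiLetter(rtf[i]))
            {
                // Control word: letters, optional signed decimal argument, optional delimiter space.
                int start = i;
                while (i < rtf.Length && IsAsciiLetter(rtf[i])) i++;
                string word = rtf.Substring(start, i - start);

                int numStart = i;
                if (i + 1 < rtf.Length && rtf[i] == '-' && IsDigit(rtf[i + 1])) i++;
                while (i < rtf.Length && IsDigit(rtf[i])) i++;
                bool hasArg = i > numStart;
                int arg = hasArg
                    ? int.Parse(rtf.Substring(numStart, i - numStart), CultureInfo.InvariantCulture)
                    : 0;
                if (i < rtf.Length && rtf[i] == ' ') i++;        // delimiter is not counted as skipped

                if (skip > 0) skip--;                            // a whole control word counts as one
                else if (word == "uc" && hasArg) uc = arg;
                else if (word == "u" && hasArg)
                {
                    // Negative arguments encode values above 32767.
                    output.Append((char)(arg < 0 ? arg + 65536 : arg));
                    skip = uc;
                }
            }
            else if (rtf[i] == '\'')
            {
                // \'xx hexadecimal escape, counted as one character when skipping.
                if (i + 2 >= rtf.Length) break;
                int value = Convert.ToInt32(rtf.Substring(i + 1, 2), 16);
                i += 3;
                if (skip > 0) skip--; else output.Append((char)value);
            }
            else
            {
                // Control symbol such as \\, \{ or \}.
                char symbol = rtf[i++];
                if (skip > 0) skip--;
                else if (symbol == '\\' || symbol == '{' || symbol == '}') output.Append(symbol);
            }
        }
        return output.ToString();
    }
}
```

For example, RtfUnicodeReader.Decode(@"\uc2\u20013\'d6\'d0 ok") skips the two \'xx escapes after \u20013 and keeps the text that follows them.
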

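        /// <summary>
        /// Converts a character to its RTF escape sequence
        /// </summary>
        /// <param name="val">Character to convert</param>
        /// <returns>\'xx for characters up to 0xFF, otherwise \uc2\uN followed by two \'xx fallback escapes</returns>
        private static string Dec2Hex(char val)
        {
            const string hex = "0123456789abcdef";
            string s1 = new string(hex[(val >> 4) & 0xf], 1);
            string s2 = new string(hex[val & 0xf], 1);

            // Characters above 0xFF (Chinese, Japanese, Greek, ...) do not fit in \'xx
            if (val > 0xff)
            {
                // Always four hex digits, so three-digit code points such as U+03B1 are covered too
                var str = ((int)val).ToString("x4");
                // \uN takes a signed 16-bit decimal: values above 32767 are written as negative numbers
                var code = unchecked((short)val).ToString(System.Globalization.CultureInfo.InvariantCulture);
                // \uc2 tells the reader to skip the two \'xx escapes that follow \uN.
                // They hold the UTF-16 bytes as filler only, not a real code-page encoding.
                return "\\uc2\\u" + code + "\\'" + str.Substring(0, 2) + "\\'" + str.Substring(2, 2);
            }

            return "\\'" + s1 + s2;
        }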


        /// <summary>
        /// Converts an integer to a hexadecimal string
        /// </summary>
        /// <param name="val">Integer to convert</param>
        /// <param name="bytes">Number of bytes to convert</param>
        /// <param name="reversebyteorder">True if bytes are to be converted in reverse order</param>
        /// <returns></returns>
        private static string Dec2Hex(int val, int bytes, bool reversebyteorder)
        {
            const string hex = "0123456789abcdef";
            string ret = "";


            if (reversebyteorder)
            {
                for (int i = 0; i < bytes; i++)
                {
                    ret += hex[(val >> 4) & 0xf];
                    ret += hex[val & 0xf];
                    val >>= 8;
                }
            }
            else
            {
                for (int i = 0; i < bytes * 2; i++)
                {
                    ret = hex[val & 0xf] + ret;
                    val >>= 4;
                }
            }


            return ret;
        }


        /// <summary>
        /// Converts a block of bytes to a hexadecimal string
        /// </summary>
        /// <param name="data">Block of bytes to convert</param>
        /// <param name="startpos">Initial x position of string</param>
        /// <returns></returns>
        private static string Dec2Hex(byte[] data, int startpos)
        {
            const string hex = "0123456789abcdef";


            int size = (data.Length + ((data.Length + startpos / 2 + 38) / 39)) * 2;
            char[] buffer = new char[size];
            int ix = 0;
            for (int i = 0; i < data.Length; i++)
            {
                buffer[ix++] = hex[(data[i] >> 4) & 0xf];
                buffer[ix++] = hex[data[i] & 0xf];


                if (((ix + startpos) % 80) == 0)
                {
                    buffer[ix++] = '\r';
                    buffer[ix++] = '\n';
                }
            }
            buffer[ix++] = '\r';
            buffer[ix++] = '\n';


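            // Return only the characters actually written; the size estimate can exceed ix
            // (for example with an odd startpos), which would leave trailing '\0' characters.
            return new string(buffer, 0, ix);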
        }


        private string ParseText(WordBase obj)
        {
            string ret = obj.Text
                .Replace("\\", "\\\\")
                .Replace("\t", "\\tab ")
                .Replace("{", "\\{")
                .Replace("}", "\\}")
                .Replace("\n", "")
                .Replace("\r", "\\par \\hich\\af2\\dbch\\af31505\\loch\\f2 \r\n")
                .Replace("<pagenumber>", "\\chpgn ")
                .Replace("<pagebreak>", "\\page ");
            // Translate special characters!
            for (int i = 0; i < ret.Length; i++)
            {
                if (ret[i] >= 0x80) ret = ret.Substring(0, i) + Dec2Hex(ret[i]) + ret.Substring(i + 1, ret.Length - i - 1);
            }


            int pos;


            // Check subscript
            while ((pos = ret.IndexOf("<sub>")) >= 0)
            {
                int end = ret.IndexOf("</sub>");
                if (end > pos)
                {
                    string mid = "{\\sub " + ret.Substring(pos + 5, end - pos - 5) + "}";
                    ret = ret.Substring(0, pos) + mid + ret.Substring(end + 6, ret.Length - end - 6);
                }
                else break;
            }


            // Check superscript
            while ((pos = ret.IndexOf("<super>")) >= 0)
            {
                int end = ret.IndexOf("</super>");
                if (end > pos)
                {
                    string mid = "{\\super " + ret.Substring(pos + 7, end - pos - 7) + "}";
                    ret = ret.Substring(0, pos) + mid + ret.Substring(end + 8, ret.Length - end - 8);
                }
                else break;
            }


            // Check local bold
            while ((pos = ret.IndexOf("<b>")) >= 0)
            {
                int end = ret.IndexOf("</b>");
                if (end > pos)
                {
                    string mid = "{\\b " + ret.Substring(pos + 3, end - pos - 3) + "}";
                    ret = ret.Substring(0, pos) + mid + ret.Substring(end + 4, ret.Length - end - 4);
                }
                else break;
            }


            // Check local italics
            while ((pos = ret.IndexOf("<i>")) >= 0)
            {
                int end = ret.IndexOf("</i>");
                if (end > pos)
                {
                    string mid = "{\\i " + ret.Substring(pos + 3, end - pos - 3) + "}";
                    ret = ret.Substring(0, pos) + mid + ret.Substring(end + 4, ret.Length - end - 4);
                }
                else break;
            }


            // Check local underline
            while ((pos = ret.IndexOf("<u>")) >= 0)
            {
                int end = ret.IndexOf("</u>");
                if (end > pos)
                {
                    // Underline is \ul in RTF; \u is the Unicode character control word
                    string mid = "{\\ul " + ret.Substring(pos + 3, end - pos - 3) + "}";
                    ret = ret.Substring(0, pos) + mid + ret.Substring(end + 4, ret.Length - end - 4);
                }
                else break;
            }


            // Greek letters
            string[] greek = { "alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta", "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho", "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega", "thetasym", "phisym" };
            string[] rtfgl = { "a", "b", "g", "d", "e", "z", "h", "q", "i", "k", "l", "m", "n", "x", "o", "p", "r", "s", "t", "u", "f", "c", "y", "w", "J", "j" };
            string[] rtfgu = { "A", "B", "G", "D", "E", "Z", "H", "Q", "I", "K", "L", "M", "N", "X", "O", "P", "R", "S", "T", "U", "F", "C", "Y", "W", "J", "j" };
            //string[] rtfgl = { "\\'e1", "\\'e2", "\\'e3", "\\'e4", "\\'e5", "\\'e6", "\\'e7", "\\'e8", "\\'e9", "\\'ea", "\\'eb", "\\'ec", "\\'ed", "\\'ee", "\\'ef", "\\'f0", "\\'f1", "\\'f3", "\\'f4", "\\'f5", "\\'f6", "\\'f7", "\\'f8", "\\'f9", "J" };
            //string[] rtfgu = { "\\'c1", "\\'c2", "\\'c3", "\\'c4", "\\'c5", "\\'c6", "\\'c7", "\\'c8", "\\'c9", "\\'ca", "\\'cb", "\\'cc", "\\'cd", "\\'ce", "\\'cf", "\\'d0", "\\'d1", "\\'d3", "\\'d4", "\\'d5", "\\'d6", "\\'d7", "\\'d8", "\\'d9", "j" };


            string normalfontselect = "\\f" + GetFontIndex(MainFont) + " ";
            string greekfontselect = "\\f" + ((MainFontGreek != null) ? GetFontIndex(MainFontGreek) : 1) + " ";


            pos = -1;
            while ((pos = ret.IndexOf("&", pos + 1)) >= 0)
            {
                int pos2 = ret.IndexOf("&", pos + 1);
                int end = ret.IndexOf(";", pos + 1);
                if (end > pos && (pos2 > end || pos2 < 0))
                {
                    string cmp = ret.Substring(pos + 1, end - pos - 1);
                    string temp = cmp.ToLower();
                    for (int gl = 0; gl < greek.Length; gl++)
                    {
                        if (greek[gl] == temp)
                        {
                            string left = ret.Substring(0, pos);
                            string right = ret.Substring(end + 1, ret.Length - end - 1);
                            if (cmp[0] >= 'a' && cmp[0] <= 'z')
                            {
                                // Lower case
                                ret = left + greekfontselect + rtfgl[gl] + normalfontselect + right;
                            }
                            else
                            {
                                // Upper case
                                ret = left + greekfontselect + rtfgu[gl] + normalfontselect + right;
                            }
                            break;
                        }
                    }
                }
            }


            if (obj.Bold || obj.Underline || obj.Italics)
            {
                string pre = "{";
                if (obj.Bold) pre += "\\b ";
                if (obj.Underline) pre += "\\ul "; // \ul is underline; \u would be read as a Unicode character
                if (obj.Italics) pre += "\\i ";


                pre = pre.Replace(" \\", "\\");
                ret = pre + ret + "}";
            }


            if (obj.Alignment != WordAlignment.Left || obj.GetType() == typeof(WordTableCell) || obj.GetType() == typeof(WordTableColumn))
            {
                ret = ((obj.Alignment == WordAlignment.Center) ? "\\qc " : (obj.Alignment == WordAlignment.Right) ? "\\qr " : (obj.Alignment == WordAlignment.Justified) ? "\\qj " : (obj.Alignment == WordAlignment.Distributed) ? "\\qd " : "\\ql ") + ret;
            }


            /*
if (obj.Size != _currentsize)
{
ret = "\\fs" + obj.Size + " " + ret;
_currentsize = obj.Size;
}
*/


            return ret;
        }
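
As a standalone illustration of the escaping step that ParseText performs for characters at or above 0x80, a writer could also emit \uN with a single '?' as the ANSI fallback, which is what the specification describes when no reasonable ANSI character exists. The sketch below is an assumption-laden alternative, not the RTF Builder's own method: the names RtfUnicodeWriter and Escape are hypothetical, it assumes the default \uc1 is in effect at the point of output, and it leaves tabs and line breaks untouched.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the RTF Builder code in this post.
public static class RtfUnicodeWriter
{
    // Escapes plain text for RTF body text, assuming \uc1 (the default) is in effect.
    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length * 2);
        foreach (char c in text)
        {
            if (c == '\\' || c == '{' || c == '}')
            {
                // RTF delimiters are escaped with a backslash.
                sb.Append('\\').Append(c);
            }
            else if (c < 0x80)
            {
                // 7-bit characters are written directly.
                sb.Append(c);
            }
            else
            {
                // \uN takes a signed 16-bit decimal, so U+F020 (61472) becomes -4064.
                short n = unchecked((short)c);
                sb.Append("\\u")
                  .Append(n.ToString(CultureInfo.InvariantCulture))
                  .Append('?');   // one fallback character, matching \uc1
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, RtfUnicodeWriter.Escape("a中{") yields a\u20013?\{ and the character U+F020 yields \u-4064?, matching the arithmetic in the specification excerpt.
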
