PHP: shortening the text of HTML content with html_substr



Here's a small add-on to the html_substr function posted by fox.

It now counts only the characters outside of tags, and it doesn't cut words in half.

Note: this only works with XHTML Strict/Transitional, because it checks for "/>" tags and requires every attribute value in a tag to be quoted. It has only been tested with br, img, and a tags, but it should work with any tag.


function html_substr($posttext, $minimum_length = 200, $length_offset = 20, $cut_words = FALSE, $dots = TRUE) {
    // $minimum_length:
    // The approximate length you want the truncated text to be
    // $length_offset:
    // The variation in how long the text can be. With the defaults, the text
    // is cut once more than 200-20=180 characters have been counted, at the
    // next word boundary that lies outside any tag
    // Reset tag counter & quote checker
    $tag_counter = 0;
    $quotes_on = FALSE;
    // Check if the text is too long
    if (strlen($posttext) > $minimum_length) {
        // Reset the character counter and pass through (part of) the entire text
        $c = 0;
        for ($i = 0; $i < strlen($posttext); $i++) {
            // Load the current character and the next one
            // if the string has not arrived at the last character
            $current_char = substr($posttext,$i,1);
            if ($i < strlen($posttext) - 1) {
                $next_char = substr($posttext,$i + 1,1);
            }
            else {
                $next_char = "";
            }
            // First check if quotes are on
            if (!$quotes_on) {
                // Check if it's a tag
                // On a "<" add 3 if it's an opening tag (like <a href...)
                // or add only 1 if it's an ending tag (like </a>)
                if ($current_char == '<') {
                    if ($next_char == '/') {
                        $tag_counter += 1;
                    }
                    else {
                        $tag_counter += 3;
                    }
                }
                // Slash signifies an ending (like </a> or ... />)
                // subtract 2
                if ($current_char == '/' && $tag_counter != 0) $tag_counter -= 2;
                // On a ">" subtract 1
                if ($current_char == '>') $tag_counter -= 1;
                // If quotes are encountered, start ignoring the tags
                // (for directory slashes)
                if ($current_char == '"') $quotes_on = TRUE;
            }
            else {
                // If quotes are encountered again, turn it back off
                if ($current_char == '"') $quotes_on = FALSE;
            }
            // Count only the chars outside html tags
            if ($tag_counter == 2 || $tag_counter == 0) {
                $c++;
            }
            // Check if the counter has reached the minimum length yet,
            // then wait for the tag_counter to become 0, and chop the string there
            if ($c > $minimum_length - $length_offset && $tag_counter == 0 && ($next_char == ' ' || $cut_words == TRUE)) {
                $posttext = substr($posttext,0,$i + 1);
                if ($dots) {
                    $posttext .= '...';
                }
                return $posttext;
            }
        }
    }
    return $posttext;
}
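
The counting rule the function above relies on (only characters outside tags count toward the length) can be tried in isolation. The following is a minimal sketch, not part of the original comment: `visible_length` is a hypothetical helper, and it assumes well-formed markup with no ">" inside attribute values.

```php
<?php
// Hypothetical helper, not part of the original post: counts the characters
// that lie outside of <...> tags. Assumes well-formed markup with no ">"
// inside attribute values.
function visible_length(string $html): int
{
    $count = 0;
    $in_tag = false;
    $len = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($ch === '<') {
            $in_tag = true;       // everything up to the next ">" is markup
        } elseif ($ch === '>') {
            $in_tag = false;
        } elseif (!$in_tag) {
            $count++;             // a character the reader actually sees
        }
    }
    return $count;
}

$html = '<a href="/docs/">Hi</a> there';
echo strlen($html), "\n";          // byte length of the raw markup
echo visible_length($html), "\n";  // length of "Hi there"
```

Comparing the two numbers shows why a plain strlen() check cuts HTML much earlier than the visible text would suggest.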

function html_strlen($str) {
  // Split into single characters, keeping each HTML entity (e.g. &amp;) as one unit
  $chars = preg_split('/(&[^;\s]+;)|/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
  return count($chars);
}

function html_substr($str, $start, $length = NULL) {
  if ($length === 0) return ""; //stop wasting our time ;)
  //check if we can simply use the built-in functions
  if (strpos($str, '&') === false) { //No entities. Use built-in functions
    if ($length === NULL)
      return substr($str, $start);
    else
      return substr($str, $start, $length);
  }
  // create our array of characters and html entities
  $chars = preg_split('/(&[^;\s]+;)|/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE);
  $html_length = count($chars);
  // check if we can predict the return value and save some processing time
  if (
       ($html_length === 0) /* input string was empty */ or
       ($start >= $html_length) /* $start is longer than the input string */ or
       (isset($length) and ($length <= -$html_length)) /* all characters would be omitted */
     )
    return "";
  //calculate start position
  if ($start >= 0) {
    $real_start = $chars[$start][1];
  } else { //start'th character from the end of string
    // convert to a non-negative index so the length checks below stay valid
    $start = max($start,-$html_length) + $html_length;
    $real_start = $chars[$start][1];
  }
  if (!isset($length)) // no $length argument passed, return all remaining characters
    return substr($str, $real_start);
  else if ($length > 0) { // copy $length chars
    if ($start+$length >= $html_length) { // return all remaining characters
      return substr($str, $real_start);
    } else { //return $length characters
      return substr($str, $real_start, $chars[$start+$length][1] - $real_start);
    }
  } else { //negative $length. Omit $length characters from end
      return substr($str, $real_start, $chars[$html_length+$length][1] - $real_start);
  }
}
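
The split pattern used by html_strlen and html_substr above treats each HTML entity as a single unit. A minimal sketch of that behaviour follows; `entity_units` is a hypothetical name introduced only for this illustration.

```php
<?php
// Hypothetical helper for illustration: splits a string into single
// characters, keeping each HTML entity (e.g. &amp;) together as one unit.
function entity_units(string $str): array
{
    return preg_split(
        '/(&[^;\s]+;)|/',
        $str,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );
}

$units = entity_units('a&amp;b');
print_r($units);               // the entity stays in one piece
echo count($units), "\n";      // entity-aware length
echo strlen('a&amp;b'), "\n";  // plain byte length, for comparison
```

Because the entity is one array element, cutting on element boundaries can never leave a half-written "&am" at the end of the result.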


PHP Function Code

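    function substr_big5($str,$start,$len) {
        return end_big5(substr($str,$start,$len));
    }
    function end_big5($src){
        $str = preg_replace("/[\xa1-\xf9][\x40-\x7e\xa1-\xfe]/","",$src);
        return (preg_match("/[\xa1-\xf9]$/",$str)) ? substr($src,0,-1) : $src;
    }
    function html_substr($str,$start,$len){
        // Must end on ">" (never inside a tag). The lines that computed $newlen are
        // missing from the source; the lines up to $newlen are an assumed reconstruction.
        $cut = substr($str,$start,$len);
        $lt = strrpos($cut,"<"); $gt = strrpos($cut,">");
        $newlen = ($lt !== false && ($gt === false || $gt < $lt)) ? $lt : $len;
        return end_big5(substr($str,$start,$newlen));
    }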



<?php echo html_substr($row["ArticleContent"],0,_Web_ShortText_Length); ?>

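Of course, this module still has a few small problems: with nested HTML (such as <ul><li></li></ul>), tags can still be left unclosed (such as <a href=...>). I will probably need to go back to the two references I found earlier to see whether this can be solved.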
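
The unclosed-tag problem described above can be shown with a small sketch. This is not the author's code: `close_open_tags` is a hypothetical helper, and it assumes XHTML-style markup in which void elements are self-closed (<br />) and the fragment does not end in the middle of a tag.

```php
<?php
// Hypothetical helper, not from the original post: appends closing tags for
// any elements left open in $html. Assumes XHTML-style markup in which void
// elements are self-closed (<br />) and the fragment does not end mid-tag.
function close_open_tags(string $html): string
{
    $stack = [];
    preg_match_all('#<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>#', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $tag) {
        [, $closing, $name, $self_closing] = $tag;
        if ($self_closing === '/') {
            continue;                 // <br /> opens nothing
        }
        if ($closing === '/') {
            array_pop($stack);        // </li> closes the most recent element
        } else {
            $stack[] = $name;         // <li> is now open
        }
    }
    // Close whatever is still open, innermost element first
    while ($stack) {
        $html .= '</' . array_pop($stack) . '>';
    }
    return $html;
}

echo close_open_tags('<ul><li>one</li><li>two'), "\n";
```

The stack matters for nested markup: the closers have to be emitted in the reverse of the order in which the elements were opened.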

I looked through Html_Substr in the Smarty plugin and found that it is really beautifully written.



 function html_substr($string, $length)
 {
        if( !empty( $string ) && $length>0 ) {
            $isText = true;
            $ret = "";
            $i = 0;
            $currentChar = "";
            $lastSpacePosition = -1;
            $lastChar = "";
            $tagsArray = array();
            $currentTag = "";
            $tagLevel = 0;
            $noTagLength = strlen( strip_tags( $string ) );
            // Parser loop
            for( $j=0; $j<strlen( $string ); $j++ ) {
                $currentChar = substr( $string, $j, 1 );
                $ret .= $currentChar;
                // Lesser than event
                if( $currentChar == "<") $isText = false;
                // Character handler
                if( $isText ) {
                    // Memorize last space position
                    if( $currentChar == " " ) { $lastSpacePosition = $j; }
                    else { $lastChar = $currentChar; }
                    // Count only characters that belong to the text
                    $i++;
                } else {
                    $currentTag .= $currentChar;
                }
                // Greater than event
                if( $currentChar == ">" ) {
                    $isText = true;
                    // Opening tag handler
                    if( ( strpos( $currentTag, "<" ) !== FALSE ) &&
                        ( strpos( $currentTag, "/>" ) === FALSE ) &&
                        ( strpos( $currentTag, "</") === FALSE ) ) {
                        // Tag has attribute(s)
                        if( strpos( $currentTag, " " ) !== FALSE ) {
                            $currentTag = substr( $currentTag, 1, strpos( $currentTag, " " ) - 1 );
                        } else {
                            // Tag doesn't have attribute(s)
                            $currentTag = substr( $currentTag, 1, -1 );
                        }
                        array_push( $tagsArray, $currentTag );
                    } else if( strpos( $currentTag, "</" ) !== FALSE ) {
                        array_pop( $tagsArray );
                    }
                    $currentTag = "";
                }
                // Stop once enough text characters have been collected
                if( $i >= $length) {
                    break;
                }
            }
            // Cut HTML string at last space position
            if( $length < $noTagLength ) {
                if( $lastSpacePosition != -1 ) {
                    $ret = substr( $string, 0, $lastSpacePosition );
                } else {
                    // No space seen: keep everything up to and including position $j
                    $ret = substr( $string, 0, $j + 1 );
                }
            }
            // Close broken XHTML elements
            while( sizeof( $tagsArray ) != 0 ) {
                $aTag = array_pop( $tagsArray );
                $ret .= "</" . $aTag . ">\n";
            }
        } else {
            $ret = "";
        }
        return( $ret );
 }


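After reading the Html_Substr function in the Smarty plugin, I found that its handling of Chinese text is still not quite perfect, so I decided to write my own version dedicated to Chinese. It must likewise keep the HTML intact after truncation, and I wanted to see whether it could handle the shortening of content more completely.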

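Idea: HTML tags are recorded on a stack, and the missing closing tags are appended after truncation. Chinese characters are handled with PHP's mbstring functions, both for measuring length and for extracting characters. The cut length is measured against the text content, not against the length of the source including the HTML.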
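
The reason for switching to mbstring can be seen in a short sketch. It is an illustration only: the sample text is converted from UTF-8 to Big5 at runtime (an assumption made so that the source file can stay UTF-8), and it requires the mbstring extension.

```php
<?php
// Illustration only: the sample text is converted from UTF-8 to Big5 so the
// source file itself can stay UTF-8. Requires the mbstring extension.
$utf8 = '中文字';
$big5 = mb_convert_encoding($utf8, 'BIG-5', 'UTF-8');

echo strlen($big5), "\n";              // bytes: each character here takes two
echo mb_strlen($big5, 'BIG-5'), "\n";  // characters

// substr() can split a double-byte character in half;
// mb_substr() always cuts on character boundaries.
$cut = mb_substr($big5, 0, 2, 'BIG-5');
echo mb_convert_encoding($cut, 'UTF-8', 'BIG-5'), "\n";
```

Counting and cutting in characters rather than bytes is what removes the need for a repair step such as end_big5.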

function html_substr($string, $length)
{
  if( !empty( $string ) && $length>0 ) {
   $isText = true;   // whether we are currently inside text (not inside a tag)
   $ret = "";    // the string that is finally returned
   $i = 0;     // text character counter (used for the length check)
   $currentChar = "";  // character currently being processed
   $lastSpacePosition = -1;// last position marked as a possible cut point
   $tagsArray = array(); // tag array, used as a stack
   $currentTag = "";  // tag currently being processed
   $noTagLength = mb_strlen( strip_tags( $string ),'BIG-5' ); // length of the string without HTML tags
   // Loop over every character
   for( $j=0; $j<mb_strlen($string,'BIG-5'); $j++ ) {
    $currentChar = mb_substr( $string, $j, 1 ,'BIG-5');
    $ret .= $currentChar;
    // Start of an HTML tag
    if( $currentChar == "<") $isText = false;
    // Inside text
    if( $isText ) {
     // A space marks a tentative cut point
     if( $currentChar == " " ) { $lastSpacePosition = $j; }
     // Count only characters that belong to the text
     $i++;
    } else {
     $currentTag .= $currentChar;
    }
    // End of an HTML tag
    if( $currentChar == ">" ) {
     $isText = true;
     // Decide whether the tag must be tracked, i.e. whether it still needs a closing tag
     if( ( mb_strpos( $currentTag, "<" ,0,'BIG-5') !== FALSE ) &&
      ( mb_strpos( $currentTag, "/>",0,'BIG-5' ) === FALSE ) &&
      ( mb_strpos( $currentTag, "</",0,'BIG-5') === FALSE ) ) {
      // Extract the tag name (with or without attributes)
      if( mb_strpos( $currentTag, " ",0,'BIG-5' ) !== FALSE ) {
       // Has attributes
       $currentTag = mb_substr( $currentTag, 1, mb_strpos( $currentTag, " " ,0,'BIG-5') - 1 ,'BIG-5');
      } else {
       // No attributes
       $currentTag = mb_substr( $currentTag, 1, -1 ,'BIG-5');
      }
      // Push onto the tag stack
      array_push( $tagsArray, $currentTag );
     } else if( mb_strpos( $currentTag, "</" ,0,'BIG-5') !== FALSE ) {
      // Pop the last tag (it has been closed)
      array_pop( $tagsArray );
     }
     $currentTag = "";
    }
    // Decide whether to keep reading (based on the text length)
    if( $i >= $length) {
     break;
    }
   }
   // Extract the shortened HTML string
   if( $length < $noTagLength ) {
    if( $lastSpacePosition != -1 ) {
     // Cut at the marked position
     $ret = mb_substr( $string, 0, $lastSpacePosition ,'BIG-5' );
    } else {
     // No space seen: keep everything up to and including position $j
     $ret = mb_substr( $string, 0 , $j + 1 ,'BIG-5' );
    }
   }
   // Append the closing tags that are still missing
   while( sizeof( $tagsArray ) != 0 ) {
    $aTag = array_pop( $tagsArray );
    $ret .= "</" . $aTag . ">\n";
   }
  } else {
   $ret = "";
  }
  return( $ret );
}

