PHP统计UTF-8编码文件中的字符数

UTF-8（8-bit Unicode Transformation Format）是一种针对Unicode的可变长度字符编码，由Ken Thompson于1992年创建，现在已经标准化为RFC 3629。UTF-8用1到4个字节编码Unicode字符。

那如何判断一个字符到底是用几个字节表示的呢，这个从一个UTF-8字符的首字节中就可以判断，UTF-8编码规则：如果只有一个字节则其最高二进制位为0；如果是多字节，其第一个字节从最高位开始，连续的二进制位值为1的个数决定了其编码的字节数，其余各字节均以10开头。UTF-8转换表表示如下：

从文件开头通过每次读取一个字符的首字节，判断其中连续的1的个数就可以得知这个字符占用几个字节，如果最高位是0，则该字符是个单字节字符（即ASCII码），然后跳过相应的字节数，再去读下一个字符的首字节，依此类推，就可以很容易的统计一个文件中的字符数了。

附上代码：

<?php

class Utf8Counter {

    /**
     * 使用utf8字符首字节特征，用以确定单个字符的字节数，utf8编码的特征如下
     * 1111 110x  6字节
     * 1111 10xx  5字节
     * 1111 0xxx  4字节  当前的UTF-8编码一个字符最多用4个字节表示
     * 1110 xxxx  3字节，汉字多为3字节编码
     * 110x xxxx  2字节
     * 0xxx xxxx  1字节，兼容ASCII码
     */
    const UTF8_FIRST_BYTE_FEATURE = [
//        0xfc => 6,
//        0xfa => 5,
        0xf0 => 4,
        0xe0 => 3,
        0xc0 => 2,
    ];

    const BUFFER_LEN = 1024;

    /**
     * 忽略UTF-8文件可能存在的BOM头
     * @param resource $fp
     * @return bool
     */
    private static function clearBomHeader($fp)
    {
        // BOM头是固定的3个字节：0xEF,0xBB,0xBF
        if (fread($fp, 3) === "\xEF\xBB\xBF") {
            return true;
        }
        // 把文件指针重新移到文件开头
        rewind($fp);
        return false;
    }

    /**
     * 根据UTF-8字符首字节确定该字符的字节数
     * @param string $char
     * @return int|mixed
     */
    public static function getCharByteCount($char)
    {
        $ascii = ord($char);
        if (($ascii & 0x80) == 0) {
            return 1;
        }
        foreach (self::UTF8_FIRST_BYTE_FEATURE as $bits => $byteCount) {
            if (($ascii & $bits) == $bits) {
                return $byteCount;
            }
        }
        throw new \RuntimeException('非法的UTF-8字符');
    }

    /**
     * @param $string
     * @param null $absence 最后一个字符缺少的字节数，例如给定的字符串最后一个字符长度是3字节，但只提供了其中前两个字节，则该值为1。
     * @return int
     */
    public static function getStringCounter($string, &$absence = null)
    {
        $counter = 0;
        $pos = 0;
        $len = strlen($string);
        if (!$len) {
            return 0;
        }
        while (true) {
            $byteCount = self::getCharByteCount($string{$pos});
            $pos += $byteCount;
            $counter++;
            if ($pos >= $len) {
                $absence = $pos - $len;
                break;
            }
        }
        return $counter;
    }

    /**
     * @param string $file 文件路径
     * @return int
     */
    public static function getFileCounter($file)
    {
        $counter = 0;
        
        if (!is_file($file)) {
            throw new \InvalidArgumentException('incorrect file name');
        }

        $fp = fopen($file, 'rb');
        if (!$fp) {
            throw new \RuntimeException('请确保文件存在并可读');
        }

        self::clearBomHeader($fp);

        $absence = 0;
        while (!feof($fp)) {
            $buf = fread($fp, self::BUFFER_LEN);
            if (!$buf) {
                break;
            }
            if ($absence > 0) {
                $buf = substr($buf, $absence);
            }
            $counter += self::getStringCounter($buf, $absence);
        }

        @fclose($fp);

        return $counter;
    }

    /**
     * 统计文件夹中所有文件的字符数总和
     * @param string $dir 目录
     * @param string|array $extName 只处理指定的扩展名，支持字符串和数组，'.txt' 或 ['.txt', '.php', '.html', '.md', ...]
     * @return int
     */
    public static function getDirCounter($dir, $extName = '')
    {
        $counter = 0;
        if (!is_dir($dir)) {
            throw new \InvalidArgumentException('不合法的目录');
        }
        $dir = rtrim($dir, "/\\");
        if ($extName && is_string($extName)) {
            $extName = [$extName];
        }
        $fileArr = scandir($dir);
        foreach ($fileArr as $file) {
            if ($file == '.' || $file == '..') {
                continue;
            }
            $file = $dir . '/' . $file;
            if (is_dir($file)) {
                $counter += self::getDirCounter($file, $extName);
            } else {
                if (!$extName
                    || (is_array($extName) && in_array('.' . pathinfo($file, PATHINFO_EXTENSION), $extName))) {
                    $counter += self::getFileCounter($file);
                }
            }
        }
        return $counter;
    }
}

调用：

<?php

echo Utf8Counter::getStringCounter('😂😭🙂🐷你好中国😈👹') . PHP_EOL;
echo Utf8Counter::getFileCounter('foo/1112.txt') . PHP_EOL;
echo Utf8Counter::getDirCounter('foo', '.txt') . PHP_EOL;

执行结果：

C:\Users\zhangyi\PhpstormProjects\demo>php utf8wc.php
10
9
34217

本人目前正在翻译MySQL8.0官方文档，感兴趣的可以来看看，欢迎star和fork：

https://github.com/zhyee/Mysql8.0_Reference_Manual_Translation

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

PHP统计UTF-8编码文件中的字符数

本人目前正在翻译MySQL8.0官方文档，感兴趣的可以来看看，欢迎star和fork：

https://github.com/zhyee/Mysql8.0_Reference_Manual_Translation

工作中用到的脚本合集

微服务实践Aspire项目发布到远程k8s集群

通过f-string编写简洁高效的Python格式化输出代码

[转帖]20个常用的Linux工具命令

[转帖]PostgreSQL从小白到高手教程 - 第46讲：poc-tpch测试

24-5-18 X

Docker容器技術介紹（四） --- Docker倉庫操作

libpulse.so.0 not found解決

php+kafka安裝

如何把本地項目提交到github

vim插入模式下退格鍵（BackSpace）無效顯示^?解決辦法

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結