ID选择,你做对了吗?

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"何为ID"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ID 是标识符(identifier)的前缀,它代表一个可以唯一识别一个对象或者物体的名称。在软件系统中,ID 用于对一组信息进行标识,它是信息系统里最底层、最基础的概念,从系统诞生到消亡,都与 ID 息息相关。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但我们经常会发现,在很多系统的早期,都会采用自增的 Mysql Int ID,笔者可以认为这是对 ID 未做过深入思考的选择,而到后期时,才发现 ID 变更已经几乎不可能实现。由于信息系统相互引用无处不在,如果 ID 选择不当,带来的负面影响往往非常深远。还有一些常见的坑,比如聚合层服务要聚合不同底层服务 ID 时,才发现它们类型不同;资源被黑客遍历攻击,才发现早期 ID 使用的是自增的 Int 类型; Int64 传递至 JavaScript 发现错误等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"笔者调查了一些 ID 选择的情况,如下表,各有不同。那么ID 该怎么选呢,如类型、长度等属性,还有其他属性要考量的么?ID 的选择,并非技术复杂度问题,更多是对 ID 属性的认知程度,以及统一规范问题。本文会重点对 ID 的属性进行分析。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/68\/681d97f869ef75726aca8c567fcfbc74.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"属性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们列出 5 个属性来分析,并按事物的性质与事物之间关系的角度将 ID 属性分为两类:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自身属性:类型、长度"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"领域属性:唯一性、稀疏度、递增性"}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"唯一性"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"唯一性是 ID 的本质属性,ID 一定能帮助我们在其领域内识别唯一对象,否则就不是标识符。唯一性的实现需要各种生成策略。目前,业界已经有多种解决方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对于Int 类型,最简单的生成策略可以依赖数据库中间件,如依赖 Mysql 的自增 ID,但依赖数据库中间件的缺点也很明显,不支持水平分片架构,且对数据库有依赖,每种数据库可能实现不同,一旦数据库切换时涉及到代码的修改,则不利于扩展。而且依赖数据库的自增 ID 也会有安全问题,容易被遍历。更加有效成熟的解决方案,是依赖集中式发号器,被广泛使用的发号器有通常被使用的雪花算法及其变种,网传实际测试每秒最多生成 26w 个id,可以有效的生成全局的 ID。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"String 类型 ID 的生成方式已经被标准化,主要依赖SDK生成。如 UUID 被开放软件基金会(OSF)标准化,各种语言均提供按标准实现的SDK,它由时间戳、网卡、时钟序列等构成标准,确保了在分布式环境下,单机高速生成,支持100ns级并发;MongoDB 的主键是 STRING 类型,其标准为: ObjectId = epoch 时间 + 机器标识 + 进程号 PID + 计数器,依赖不同语言的驱动来生成。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们经常会听到一种说法,ID 应该是无意义的。这里的无意义,有人阐述为 ID 中不应该包含任何的具体场景信息,避免 ID 在可标识数量上降低而唯一性受到挑战。对此说法,笔者不能认同,因为在分布式 ID 生成的策略里,包含具体场景是必须的,我们通常会利用时间场景和机器场景来保证分布式场景下 ID 各种属性的要求。比如上文 UUID 嵌入网卡、时间等具体的场景信息。下图经典的 Snowflake 算法,也嵌入 41 位时间戳和 5 位数据中心的场景信息,用来保证唯一性和递增趋势:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8e\/8ed547296849b8971975b8b8eb3b025b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"笔者认为ID 的无意义是指 ID 不应当包含具体的业务信息,从而避免因业务的发展,出现对 ID 的唯一性挑战,如避免使用邮箱做 ID、或用户账号内包含用户生日等。一个领域的 ID 在其他领域被引用时,会具有外部引用的业务意义,也要避免。如下面案例,用户会员表直接使用用户 ID 做其数据库主键 ID,当新需求希望一个用户,需要同时拥有多个类型的会员时,就扩展艰难了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"UserID \/\/ 主键, 账号ID,全局唯一\nMembership \/\/ 该账号的会员类型 "}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"类型"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"类型选择无外乎:字符类型、整形。这两种类型的本质,都是支撑人类思维能力所形成的标准。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整数是人类计数体系的标准,它不但天然具有符合自然公理的完整运算规则,而且有来自CPU、操作系统、编译器、甚至中间件等各个层级的直接支持,如当代的 CPU 一般具有 64 位宽的整数型寄存器。Mysql 提供可以自增的整形主键,Redis 提供整数对象池来节省存储资源。这些支持,使整数存储上更加节省资源,运算上更加高效,其运算时间复杂度通常为O(1)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"字符是有人类语言体系的标准,天生为了标识而存在。字符不具有天然的计算规则,其常见的比较策略是逐一比较,算法时间复杂度为O(n)。字符的实现方式,依赖不同的编码标准及操作系统的实现,有 ASCII、Unicode、UTF-8 等。字符可以标识更大范围,理论上是无限空间。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"对比"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整形类型的优点在于充分利用底层基础设施的性能优化措施,使得其支持系统或中间件达到最佳状态,带来这一优势的是计算机体系的支撑,自然也会有它的约束。字符类型的优点在于标识空间的巨大,当然这一优势,是以存储空间和计算性能为代价的。具体对比如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a5\/a5a9aa86c27a98e1d6a206eb9d3d7de0.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从运算属性来看,整形天然具有运算属性的优势,整形的对比时间复杂度为O(1)。 字符类型不具备天然的运算属性,其比较通常要自行定义排序规则。如 MySQL 和 Oracle 数据库中,字符串类型比较规则是按照相同位置的字符的 ASCII 码值的大小进行排序的,时间复杂度为 O(N)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从标识空间来看,Int64 必然受到最大64位的限制,空间虽然非常巨大,但依旧是有限的。字符类型可以标识的空间是无限的,也容易嵌入业务信息。那么字符表示更广范围的基础是什么?当然是占用了更多的二进制存储空间。笔者也很期待,像从 16 位、32 位,到发展到今天的 64 位一样,未来会出现 128 位的整形,那时候 Int 的空间也足够大了,可以和 UUID 直接转换。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"从约束度来看,Int 类型是标准最为严格的,可发挥空间小,位数也都受到较为严格的约束,属于强约束 ID, 难以嵌入业务信息。String 类型,容易被扩展或违反约束,这一条,对 Int 既是劣势,也是优势。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"选择"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"随着计算机 CPU 和存储的快速发展,存储和计算复杂度,都不再成为影响的关键因素时,String 和 Int 类型作为 ID,各自的优势变得并非十分明显,所以,互联网上也开始经常出现争论:究竟使用哪种类型更好。笔者建议:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"统一最为重要,参考存留的系统,如果大量的使用 Int,建议相关联的系统仍保持使用 Int,如果大量使用 String、建议使用 String。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"参考中间件的基础特性。如选择的 ID 是否可以使 Mysql、MongoDB 等中间件是否可以达到最佳状态。如果大量基础设施是 Mysql,则递增的 int 类型作为 ID 更合适;如果都是 MongoDB,则可以使用其自动生成的字符类型 ID 。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"分布式海量数据的 ID,可以采用字符类型,字符类型分布式 ID 生成更加简单,如服务链路的日志追踪。 需要埋入较多业务数据的,应当使用字符类型,如阿里 SPM 埋点"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一条更通用的建议是:首先,优先考虑统一,要尽量避免一个关联系统内,Int 和 String 类型共存的情况。其次,优先考虑使用 Int 类型,Int 类型的约束更为严格,被中间件的支撑度和可迁移性更好,而且 Int 类型对海量数据支撑能力通常也是足够的,如雪花算法及各种改进,每秒可以生成几十万个 ID,满足绝大多数分布式场景。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"长度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在存储已经不是瓶颈的今天,采用 Int 类型时,建议强制使用 Int64,笔者经历过两次公司级别将用户 Id 由 Int32 改造为 Int64 的过程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"采用 String 类型时,建议制定约束规范,有以下长度可以借鉴:UUID 通常是36位,MongoDB的 ObjectID 是 24 位、Yuotube 的视频 ID 是11位。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这里存在一个常见的坑:JavaScript 语言对 Int64 支持度仍然不够,其使用 53 位以上的 Int64 类型会有精度损失。因为 JavaScript 语言内置数值类型依赖 IEEE 754 规范的双精度浮点数,IEEE 754 规范字节分配如下图。最大的安全整数是 52 位 fraction bit 刚好用到的情况,即:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"katexinline","attrs":{"mathString":"$$2^{53} - 1 = 9007199254740991$$"}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e3\/e369e9379159b0d836a207f871deceac.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"JavaScript 提供了解决方案 BigInt 来解决,但预计需要到 2025 年,浏览器更新覆盖率才能全面使用("},{"type":"link","attrs":{"href":"https:\/\/caniuse.com\/bigint?fileGuid=t3TV9DvWCJxTpgjy","title":"","type":null},"content":[{"type":"text","text":"https:\/\/caniuse.com\/bigint"}]},{"type":"text","text":")。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"递增趋势"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"出于数据库中间件索引和连续存储的要求,ID 的递增趋势是非常有必要的。递增的实现并不复杂,ID 标准生成方案已经完全可以做到支撑递增且高性能。递增趋势有两种,如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"非严格递增,要求整体上后生成的 ID 大于之前生成 ID。由于多数 RDBMS 使用 B-tree 的数据结构来存储索引数据,索引页的数据是按逻辑大小连续存储的,如果使用非自增主键,MySQL不得不为了将新纪录插到合适位置而移动数据,频繁的移动会大大降低数据库的性能。 分布式 ID 生成算法通常会将 ID 内嵌入一定的时间戳,来实现递增趋势,如雪花算法、MongoDB 的主键。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"严格递增,要求保证下一个 ID 一定大于上一个 ID,例如事务版本号、IM 增量消息、排序等特殊需求,需要严格递增的逻辑来确保业务正确。严格递增的这一约束,通常只能单机集中式实现。"}]}]}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"稀疏度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"增加 ID 之间的稀疏度,可以提高恶意遍历、碰撞攻击的成本。稀疏度越高则效果越好。例如 YouTube 使用 11 位字符来作为视频编码标识,11 位字符有超过 73 亿种可能的组合,目前在 YouTube 上有 5 到 100 亿个视频,这意味着,如果依赖随机输入 11 个字符组成 URL 的方式获得视频,平均每尝试七百万次才可以访问到资源。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分布式 ID 生成策略所生产的 ID 稀疏度通常足够,参见上文的 snowflake 示例图,理论上单机每毫秒理论最多生成 2^12 个 ID,再加上机器位,则每秒可生成的 ID 数目理论上为 40 亿。对大部分信息资源而言,每秒生成的 ID 散落在 40 亿的 ID 空间内,稀疏度是够用的。例如我司的 PGC 内容生产而言,其稀疏度通常在 100 亿以上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ID 稀疏度能大大降低未公开的资源,被猜测到的行为,再去寻找漏洞进行单点突破的行为。比如还未上架的课程已经被黑客公开了,会是件非常被动的事情。而且,如果数据的总量对外是机密的,ID 的稀疏度,还可以避免被黑客猜到数据总量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但 ID 稀疏度不能解决已公开资源被遍历问题,已公开的资源很容易通过爬虫收集 ID。完全公开的资源,通常是没必要考虑进行安全限制的,比如上文表格中看到微博的用户,美团的商品仍采用连续的 Int,因为其不需要保密。保密级别高的资源,仅靠稀疏度来确保权限远远不够,需要配合的措施有很多,如关键接口对指定 Token 限制访问频率等、APP 使用防混淆配合签名加解密等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mysql 自增 ID 的稀疏度是 1,作为需要防止遍历的资源 ID 不是合适的。但笔者也见过一些公司做了另外一个极端:一定要把所有的 ID(如雪花算法生成的稀疏度在 100亿以上)转换成 String 类型的加密 ID 对 Web 端输出,这种做法都是想当然的决策,除了徒增复杂度外并无太多价值。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"实践"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"ID域"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对微服务体系来说,当各种不同服务的 ID 需要聚合时,引入域标识是有必要的,因为我们也需要对不同的 ID 去不同的微服务引用资源。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ID 域的主要作用是标识这个 ID 属于哪个业务领域,对应哪个微服务。ID 引入域,就像编程里语言引入 namespacing 一样,那是否需要引入多个层级的域概念呢?未尝不可。但是层级越多,复杂度越高,通常建议只引如一个域层级,标识 ID 属于不同业务领域即可。当 ID 传递至聚合层业务时,域标识和 ID 一并进行被传递或存储,但不允许存在 ID 到对应服务的反查服务。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"AType \/\/ 域标识,属于业务分类概念,对应于微服务划分\nID \/\/ 该商品类型下 ID"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"变更"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了本质属性外,其他几种 ID 属性都有可能变更:类型、长度、稀疏度、递增性。不同属性的变更成本代价不一样。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"类型变更通常最为困难,因为 Int 和 String 类型的本质不同,其支持系统如数据库的实现方式差别巨大,除变更代码外,历史数据处理都困难重重。由于 ID 的引用无处不在,变更类型,所需的关联系统的改造也非常困难,甚至不可能,比如改变 ID 类型,会影响用户的购买记录、历史记录、收藏记录,甚至大数据团队的日常报表等系统的存储。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"稀疏度、递增性的变更,只需要变更 ID 生成方式即可。如我们很容易可以将 ID 生成方式,由 Mysql 的自增,变更为使用发号器来生成,来增加稀疏度。但历史数据的处理仍会比较困难。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"长度的变更相对容易,如 Int32 变更为 Int64,不会涉及业务逻辑变化,也无需处理历史数据。前文也提到,笔者经历过两次公司级 int32至 int64 的变更,推动的都很顺利。有些人也在担心 snowflake 41 位时间戳可以使用 69 年,但笔者认为,到时候 128 位的整形也应该普及了, ID 再次进行一次长度的变更即可。"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"兼容"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整个系统初始便选定合适的 ID 生成策略,使得各种属性,稀疏度、长度、生成方式,类型等满足要求自然最好。如果未选好,而且也无法变更。该怎么办呢?兼容。兼容一定会存在关联 ID 的转换,大的原则是但要"},{"type":"text","marks":[{"type":"strong"}],"text":"尽可能的减少双 ID 的扩散范围"},{"type":"text","text":"。参考如下两个方案:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"将对应关系存储下来。如将数字 ID 进行加盐 Hash(或生成对应UUID)生成合适的 String ID,直接关联存储在数据库内,但要尽可能的减少双 ID 的扩散范围。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"使用规则或者对称加密算法进行可逆转换,不做存储。Int 可以较为容易利用规则或算法转换为String 类型,反之不能,因为毕竟巨大的 String 空间是无法依赖算法被压缩入较小 Int 空间。本方案一定要注意,避免新 ID 被相关系统存入数据库,包括客户端的数据库,以免导致无法变更算法。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"实践中,虽不理想,但也各有场景需求。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下图,当聚合层服务聚合不同的资源 ID 时,有的底层服务用 Int,有的资源用 String,可以使用方案一,聚合服务将 ID 转换保留在自己服务体内,对外输出仍使用原有类型,避免了双 ID 的扩散。笔者反对源服务提供两种类型 ID 的做法,这样会造成双 ID 扩散和引用关系复杂,带来的一系列难以维护的问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e1\/e1fcf689d08220eadcaf81e0793125cd.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下图,底层服务和聚合层服务都已经存储大量连续(不安全) ID,后期安全问题暴露,可以采用方案二。该做法首先要止损,将 ID 生成方式由原来的自增,改为发号器;再采用 SDK 对 ID 进行加密对外输出,以减少被遍历风险。但从下图可以看到,各SDK 采用相同秘钥,破坏了系统之间的边界,而且秘钥变更也有引发故障的风险,属于“瘸子里挑将军”。如果读者有更合适的实践,比如接口适配等,欢迎留言探讨。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3a\/3a81d422b3d2a59dc15f315bbbc94118.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"总结"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ID 的选择,往往不经意却影响深远。它并非是技术复杂度问题,更多是对 ID 属性的认知程度,以及架构的约束规范问题。特别是采用微服务体系时,各服务独立设计,ID 选择更容易失控。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文从类型、长度、唯一性、稀疏度、递增性的角度对 ID 进行全面分析,并提出一些通用的约束。文中也提出了一个粗略的选择建议:业务生产类,建议使用 Int64 类型,采用发号器生成,并维护递增趋势。分布式海量数据且领域较窄的,可以采用标准的 UUID 等 String 类型 ID。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"兼容的大原则是减少双 ID 的扩散,文中所列的实践,您可能也会遇到过,并有自己心得,欢迎留言探讨,更欢迎留言您遇到的其他场景及解决方案。"}]},{"type":"heading","attrs":{"align":null,"level":5},"content":[{"type":"text","text":"作者简介"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"奇正,曾在 Adobe 、百度任高级工程师,现任某互联网公司技术总监,致力于业务架构、项目管理等方向。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章