如何基于 Flink 生成在线机器学习的样本？

原創

2020-09-11 16:49

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在线机器学习中，样本是关键的一环。本文将给大家详细的介绍微博是如何用 Flink 来实现在线样本生成的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"为何选择 Flink 来做在线的样本生成？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在线样本生成对样本的时效性和准确性都有极高的要求。同样对作业的稳定性及是否容灾也都有严格的指标要求。基于这个前提，我们对目前较为流行的几种实时计算框架（Storm 0.10, Spark 2.11, Flink 1.10）进行了分析比较，结论如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/37c5e0ec2beb106f3d5725fe43178799.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此，我们决定使用 Flink 来作为在线样本生成的实时流计算框架。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"如何实现？"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在线样本生成，简单描述一个业务场景：对用户的曝光数据和点击数据实时的做关联，关联后将数据输出到 Kafka 中，给下游的在线训练作业用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先我们要确定两个数据流关联的时间窗口。这一步一般建议先离线对两个数据流的日志做关联，通过离线的方式对两份数据在不同的时间范围内做 join，来判断在线需要的时间窗口。比如业务接受的最低关联比例是 85%，并且通过离线测试确认 20 分钟内两个数据流可以关联 85%的数据，那么就可以采用 20 分钟作为时间窗口。这里的关联比例和窗口时间实际上是在准确性和实时性之间的一个 trade-off。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"确定时间窗口后，我们并没有使用 Flink 的 time window 来实现多个数据流的 join，而是选择采用 union + timer 方式来实现。这里主要考虑两点：第一、Flink 自带的 join 操作不支持多个数据流。第二、使用 timer+state 来实现，自定义程度更高，限制更少，也更方便。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下来，我们把样本生成过程细分为："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"① 输入数据流"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一般我们的数据源包括 Kafka，Trigger，MQ 等。Flink 需要从数据源中实时的读取日志。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"② 输入数据流的格式化和过滤"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"读取日志后，对数据做格式化，并且过滤掉不需要的字段和数据。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"指定样本 join 的 key。例如：用户 id 和内容 id 作 key。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"输出的数据格式一般为 tuple2（K,V）,K:参与 join 的 key。V：样本用到的字段。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"③ 输入数据流的 union"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"使用 Flink 的 union 操作，将多个输入流叠加到一起，形成一个 DataStream。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"为每个输入流指定一个可以区分的别名或者增加一个可以区分的字段。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"④ 输入数据流的聚合：keyby 操作"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"对 join 的 key 做 keyby 操作。接上例，表示按照用户 id 和内容 id 对多个数据流做 join。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"如果 key 存在数据倾斜的情况，建议对 key 加随机数后先聚合，去掉随机数后再次聚合。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"⑤ 数据存储 state + timer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"定义一个Value State。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"keyby后的process方法中，我们会重写processElement方法，在processElement方法中判断，如果value state为空，则new 一个新的state，并将数据写到value state中，并且为这条数据注册一个timer（timer会由Flink按key+timestamp自动去重），另外此处我们使用的是ProcessingTime（表示onTimer()在系统时间戳达到Timer设定的时间戳时触发）。如果不为空则按照拼接的策略，更新已经存在的结果。比如：时间窗口内用户id1，内容id1的第一条日志数据没有点击行为，则这个字段为0，第二条点击数据进入后，将这个字段更新为1。当然除了更新操作，还有计数、累加、均值等各种操作。如何在process里区分数据是来自曝光还是点击呢，使用上面步骤③定义的别名。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"重写onTimer方法，在onTimer方法中主要是定义定时器触发时执行的逻辑：从value state里获取到存入的数据，并将数据输出。然后执行state.clear。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"样本从窗口输出的条件有2个：第一，timer到期。第二，业务需要的样本都拼接上了。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此处参考伪代码："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"public class StateSampleFunction extends KeyedProcessFunction {\n /**\n * 这个状态是通过过程函数来维护,使用ValueState\n */\n private ValueState state;\n\n private Long timer = null;\n\n public StateSampleFunction (String time){\n timer = Long.valueOf(time);\n }\n\n @Override\n public void open(Configuration parameters) throws Exception {\n // 获取state\n state = getRuntimeContext().getState(new ValueStateDescriptor<>(\"state\", TypeInformation.of(new TypeHint< ReturnSample >() {})));\n }\n\n @Override\n public void processElement(Tuple2value, Context context, Collector< ReturnSample > collector) throws Exception {\n if (value.f0 == null){\n return;\n }\n\n Object sampleValue = value.f1;\n Long time = context.timerService().currentProcessingTime();\n ReturnSample returnSample = state.value();\n if (returnSample == null) {\n returnSample = new ReturnSample();\n returnSample.setKey(value.f0);\n returnSample.setTime(time);\n context.timerService().registerProcessingTimeTimer(time +timer);\n }\n\n // 更新点击数据到state里\n if (sampleValue instanceof ClickLog){\n ClickLog clickLog = (ClickLog)values;\n returnSample =(ReturnSample) clickLog.setSample(returnSample);\n }\n state.update(returnSample);\n }\n\n /**\n * @param timestamp\n * @param ctx\n * @param out\n * @throws Exception\n */\n @Override\n public void onTimer(long timestamp, OnTimerContext ctx, Collector< ReturnSample > out) throws Exception {\n ReturnSample value = state.value();\n state.clear();\n out.collect(value);\n }\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"⑥ 拼接后的日志格式化和过滤"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"拼接后的数据需要按照在线训练作业的要求对数据做格式化，比如 json、CSV 等格式。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"过滤：决定什么样的数据是合格的样本。例如：有真正阅读的内容才算是可用的样本。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"⑦ 输出"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"样本最终输出到实时的数据队列中。下面是实际的作业拓扑和运行时状态："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/10/1010d80ad3a3d49261b846b92a983638.webp","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e7/e75b4ab08334620fad624b4cb44ecbd9.webp","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"整个样本拼接过程的流程图："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9e/9e12665adf7b419460a3e1d84c3eca2e.webp","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"StateBackend 的选取"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用 RocksDB/Gemini 作为 state 的 Backend 的优势和建议："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我们用大数据对 memory 和 RocksDB，Gemini 做了实验对比，结果显示 RocksDB 和 Gemin 在数据处理，作业稳定性和资源使用等方面比 memory 更合理。其中 Gemini 的优势最为明显。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外，如果是大数据量的 state，建议使用 Gemini + SSD 固态硬盘。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"样本的监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. Flink 作业的异常监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业失败监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Failover 监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Checkpoint 失败的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RocksDB 使用情况的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业消费 Kafka 的 Comsumer Lag 的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业反压的监控"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. 样本输入端 Kafka 的消费延迟监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. 样本输出端 Kafka 的写入量的监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. 样本监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拼接率监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正样本监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输出样本格式的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输出标签对应的值是否在正常范围"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入标签对应的值是否为 null"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输出标签对应的值是否为空"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"样本的校验"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"样本生成后，如何验证数据是否准确"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"在线和离线的相互校验 "}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"用同等条件下生成的离线样本和在线样本做对比"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"白名单用户的全流程校验"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"故障的处理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"样本异常对线上模型训练的影响非常大。当发现异常报警时，首先要做的是向在线模型训练作业发送样本异常的报警。收到报警信息后，模型停止更新。从而避免影响模型线上效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"普通意义的业务故障解决后，丢弃原来的数据，所有输入日志流从最新的时间点开始消费并生成新的样本即可。重要业务需要重置输入日志流的 Kafka offset 从故障时间点开始重新生成样本数据。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"平台化"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过平台化对样本生成的流程做出严格的规范非常重要。在平台化的过程中，需要提供简单通用的开发模板以提高作业开发效率；提供平台化的作业监控和样本指标监控框架，避免重复造车；提供通用的样本输出落地策略，和在线/离线校验策略，更便捷的为业务方服务。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"微博基于 Flink 搭建的在线样本生成平台架构，如图："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/53/53e9355dd7b72b8f5a8987dd60f5a39e.webp","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UI 页面，如图："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/84/846ed9d1efd2ed3539a3b184f41f7427.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" "}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d6734b52b3acd1ac62eb902029169bd.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于平台化开发，用户只需要关心业务逻辑部分即可。需要用户开发的有："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"对应输入数据的数据清洗逻辑"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"样本输出前的数据清洗逻辑"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其余的在 UI 上配置即可实现，具体有："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"输入 Kafka 的配置信息及对应数据清洗的 UDF 类"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"样本拼接的时间窗口"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"窗口内对字段的聚合操作"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"样本输出的 Kafka 配置信息及输出前数据清洗和格式化的 UDF 类"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"资源情况由平台方审核并配置。完成后，自动生成并提交作业。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/25/254a4f8e2a750e5dd69bb7b9b7fe6333.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业提交后:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1. 平台会提供如前所述的作业相关监控"},{"type":"text","text":"，如下："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"■ "},{"type":"text","marks":[{"type":"strong"}],"text":"Flink 作业的异常监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业失败监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Failover 监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Checkpoint 失败的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"RocksDB 使用情况的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业消费 Kafka 的 Comsumer Lag 的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作业反压的监控"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"■ 样本监控"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拼接率监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正样本监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输出样本格式的监控"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输出标签对应的值是否在正常范围"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输入标签对应的值是否为 null"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"输出标签对应的值是否为空"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2. 平台会自动将数据落盘"},{"type":"text","text":"，存储到HDFS上。方便离线验证或者离线训练。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3. 用户只需将精力放到样本的验证上即可"},{"type":"text","text":"，由平台方保证作业的稳定性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者介绍："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"曹富强，微博机器学习研发中心-高级系统工程师。现负责微博机器学习平台数据计算/数据存储模块，主要涉及实时计算 Flink、Storm、Spark Streaming，数据存储Kafka、Redis，离线计算 Hive、Spark 等。目前专注于 Flink/Kafka/Redis 在微博机器学习场景的应用，为机器学习提供框架，技术，应用层面的支持。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

2024年DataOps趋势预测：AI不会取代数据工程师

APM digest收集了多位行業專家對DataOps在2024的發展形勢及對IT和業務的影響的預測，這些技術最高管理者，包括Confluent技術戰略負責人Andrew Sellers的深刻洞見可能與你的感覺一致嗎？快來探討一下。數據可

2024-04-30 11:49:29

数字化转型新篇章：企业通往智能化的新范式

早在十多年前，一些具有前瞻視野的企業以實現“數字化”爲目標啓動轉型實踐。但時至今日，可以說尚無幾家企業能夠在真正意義上實現“數字化”。在實現“數字化”的征途上，人們發現，努力愈進，彷彿終點愈遠。究其原因，還在於轉型一直落後於技術邊界的拓展

2024-04-29 21:22:20

Stable Diffusion中的embedding

Stable Diffusion中的embedding 嵌入，也稱爲文本反轉，是在 Stable Diffusion 中控制圖像樣式的另一種方法。在這篇文章中，我們將學習什麼是嵌入，在哪裏可以找到它們，以及如何使用它們。什麼是嵌入embe

2024-04-25 21:31:13

AI从入门到入门之手写数字识别模型java方式Dense全连接神经网络实现

前言：授人以魚不如授人以漁.先學會用，在學原理，在學創造，可能一輩子用不到這種能力，但是不能不具備這種能力。這篇文章主要是介紹算法入門Helloword之手寫圖片識別模型java中如何實現以及部分解釋。目前大家對於人工智能-機器學習-神經網

2024-04-19 23:17:21

Pinecone: 大模型时代的智能索引与搜索解决方案

隨着人工智能技術的飛速發展，大模型（Large Models）已成爲衆多領域的重要工具。無論是自然語言處理、圖像識別還是其他複雜任務，大模型都展現出了強大的性能。然而，隨着模型規模的不斷擴大，數據量的激增，如何有效地管理、索引和搜索這些模型

2024-04-19 11:29:43

软件测试从自动化到智能化，大模型开始加入

隨着科技的飛速發展，軟件行業也在不斷地演進和創新。作爲軟件行業的關鍵環節之一，軟件測試行業也在經歷着前所未有的變革。從最初的手動測試，到自動化測試，再到如今的智能化測試，軟件測試行業正在經歷一場深刻的技術革命。在這場革命中，Testin雲測

2024-04-19 00:53:25

裁员了！别错过2024年大数据工程师必备的10项技能

在當今快速發展的世界中，數據被視爲新的石油。隨着對數據驅動洞察的日益依賴，大數據工程師的角色比以往任何時候都更爲關鍵。這些專業人員在管理和優化組織內的數據操作中扮演着至關重要的角色。在本文中，我們將探索2024年大數據工程師必須具備的十

2024-04-16 11:00:53

DevOps已死？2024年的DevOps将如何发展

隨着我們進入2024年，DevOps也隨之發生變化。新興的技術、變化的需求和發展的方法正在重新定義有效實施DevOps實踐。 IDC預測顯示，未來五年，支持DevOps實踐的產品市場繼續保持健康且快速增長，2022年-2027年的複合年增長

2024-04-08 12:51:44

从模型到部署，教你如何用Python构建机器学习API服务

本文分享自華爲雲社區《Python構建機器學習API服務從模型到部署的完整指南》，作者：檸檬味擁抱。在當今數據驅動的世界中，機器學習模型在解決各種問題中扮演着重要角色。然而，將這些模型應用到實際問題中並與其他系統集成，往往需要構建API

2024-04-08 10:33:17

测试左移已经开始影响DevOps的发展？

在軟件開發的早期，該過程通常是開發人員編寫代碼，再將其交給質量保證（QA）進行測試。這種瀑布開發方法可能會導致質量問題和延遲，因爲問題是在週期後期發現的。一、瞭解DevOps和測試左移 DevOps是Development和Operati

2024-04-07 12:48:37

黑盒Prompt优化：提升大模型反馈效果的新思路

隨着人工智能技術的快速發展，大模型在各種應用場景中發揮着越來越重要的作用。然而，如何提升大模型的反饋效果，使其更加準確、高效地爲用戶提供服務，一直是研究者和開發者關注的焦點。本文提出了一種新的思路——黑盒Prompt優化，旨在通過改進輸入提

2024-03-29 00:01:17

分布式数据库技术的演进和发展方向

這些年大家都在談分佈式數據庫，各大企業也紛紛開始做數據庫的分佈式改造。那麼，所謂的分佈式數據庫到底是什麼？採用什麼架構？優勢在哪？爲什麼越來越多企業選擇它？分佈式數據庫技術會向什麼方向發展？帶着這些疑問，一探究竟吧！參與文末的話題互動

2024-03-26 11:34:43

利用RAG技术打破大模型幻觉

隨着人工智能技術的不斷進步，大模型在各個領域中發揮着越來越重要的作用。然而，大模型幻覺問題一直是制約其進一步發展的瓶頸。爲了解決這一問題，研究者們不斷探索新的技術和方法。近年來，一種名爲RAG（檢索增強生成）的技術備受關注，它通過結合知識圖

2024-03-21 00:28:34

与 NVIDIA 再次合作、深度参与 GTC，Zilliz 与全球顶尖开发者共迎 AI 变革时刻！

Zilliz 與全球的頂尖開發者齊聚 GTC 2024。近日，備受關注的 NVIDIA GTC 2024 已拉開序幕，來自世界各地的頂尖 AI 開發者齊聚美國加州聖何塞會議中心，共同探索行業未來。作爲去年被 NVIDIA CEO 黃仁

2024-03-19 21:26:53

多模态+大模型会带来哪些“化学反应”？

導語：沒人懷疑，2024 年，AI 依然將是科技界的主角。上個月，OpenAI 推出了可以生成 60 秒高清視頻的視頻生成模型 Sora，掀起了對多模態模型的進一輪討論。多模態大模型技術的最新進展如何？這一波新技術，對於行業和消費者的體驗會

2024-03-15 13:45:01

24小時熱門文章

最新文章

最新評論文章