Four main approaches:
1 Single-threaded processing
2 Plain multithreading
3 Hive
4 Hadoop
Some reference material I found:
Notes on 《Hadoop實戰》, part 2: Hadoop input and output
https://book.douban.com/annotation/17068812/
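TextInputFormat: key = byte offset within the file, value = the whole line.
However, this offset appears to be relative to a single file rather than global, so it cannot serve directly as a unique ID across files.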
Generate Auto-increment Id in Map-reduce Job
http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/
Generate unique customer id / insert unique rows in hive
http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive
Need to add auto increment column in a table using hive
http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive
https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
Here make sure that addition of annotation @UDFType(stateful = true) is required, otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.
In the end I went with writing a UDF for Hive.
package hive.udf;
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
/**
* UDFRowSequence.
*/
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful parameter is required
public class UDFRowSequence extends UDF
{
  // Running counter; each UDF instance keeps its own copy.
  private int result;

  public UDFRowSequence() {
    result = 0;
  }

  // Called once per row: increments the counter and returns the new value.
  public int evaluate() {
    result++;
    return result;
  }
}
// End UDFRowSequence.java
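One thing to watch with the UDF above: the counter lives in the UDF instance, and as far as I understand each map task gets its own instance, so the sequence restarts at 1 in every task. The sketch below is plain Java with no Hive classes. It mimics the same counter and shows the collision, then one possible workaround that interleaves the per-task sequences. The names RowSequenceSketch, globalId and simulate are made up for illustration and are not part of Hive, and inside a real job you would still have to obtain the task index and task count yourself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Plain-Java stand-in for the UDF's counter; no Hive classes are needed to run it. */
class RowSequenceSketch {
    // Same logic as UDFRowSequence.evaluate(): one counter per instance.
    private long result = 0;

    long evaluate() {
        result++;
        return result;
    }

    /**
     * Hypothetical helper (not part of Hive): interleaves the per-task
     * sequences so that ids from different tasks never collide.
     * taskIndex is 0-based; localSeq is the 1-based value from evaluate().
     */
    static long globalId(long localSeq, int taskIndex, int numTasks) {
        return (localSeq - 1) * numTasks + taskIndex + 1;
    }

    /** Ids produced by numTasks independent counters, rowsPerTask rows each. */
    static List<Long> simulate(int numTasks, int rowsPerTask, boolean stride) {
        List<Long> ids = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            RowSequenceSketch udf = new RowSequenceSketch(); // one instance per task
            for (int row = 0; row < rowsPerTask; row++) {
                long seq = udf.evaluate();
                ids.add(stride ? globalId(seq, task, numTasks) : seq);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Long> raw = simulate(2, 3, false);
        List<Long> strided = simulate(2, 3, true);
        // prints: raw:     [1, 2, 3, 1, 2, 3] distinct=3
        System.out.println("raw:     " + raw + " distinct=" + new HashSet<>(raw).size());
        // prints: strided: [1, 3, 5, 2, 4, 6] distinct=6
        System.out.println("strided: " + strided + " distinct=" + new HashSet<>(strided).size());
    }
}
```

With two tasks the raw counters produce every id twice, while the strided ids stay distinct. They are only gap-free when all tasks process the same number of rows. If the ids must be strictly consecutive, forcing the query through a single task is the simpler option, at the cost of parallelism.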
Author: linger
Original post: http://blog.csdn.net/lingerlanlan/article/details/46430747