Training on the Census Income Data Set
Training set
The training data is the Census Income Data Set.
The dataset contains about 48,000 samples, with attributes such as age, occupation, education, and income. Income is the binary label: either >50K or <=50K. The data is split into roughly 32,000 training samples and 16,000 test samples.
It contains the following attributes:
Field | Values | Description |
---|---|---|
age | continuous | The citizen's age |
fnlwgt | continuous | Final weight: an estimate of how many people in the population this census record represents |
education-num | continuous | The citizen's highest education level in numeric form |
capital-gain | continuous | Recorded capital gains |
capital-loss | continuous | Recorded capital losses |
hours-per-week | continuous | Hours worked per week |
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked | The type of employer the citizen works for (government, military, private, etc.) |
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool | The citizen's highest level of education |
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse | The citizen's marital status |
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces | The citizen's occupation |
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried | The citizen's role within the household |
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black | The citizen's race |
sex | Female, Male | The citizen's sex |
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands | The citizen's native country |
income | >50K, <=50K | Whether the citizen's annual income exceeds $50,000 |
The dataset files read for training are comma-separated; download address:
adult.data: the training set.
adult.test: the test set.
Opening adult.data looks like the following, where <=50K is the label:
Features
The dataset contains raw features, some of which are still strings and cannot be fed to the model directly; they need feature processing first.
For feature engineering of continuous and categorical data, see the Feature column section of 《DNNLinear組合分類器的使用 & Feature column》.
The code below, taken together, builds, trains, and evaluates a complete model; if you run into any problems, feel free to leave a comment. First, the module imports and parameter setup:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import sys
import tempfile
import pandas as pd
from six.moves import urllib
import tensorflow as tf
# The raw data has no header row; these are the feature column names.
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"
]
Continuous numeric features (continuous), such as age and education_num, can be used directly:
# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
Categorical (discrete) features are handled as follows:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
# To show an example of hashing:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)
# Transformations.
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
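`bucketized_column` turns the continuous age into a one-hot categorical feature over the intervals defined by `boundaries`: N boundaries yield N+1 buckets, and a value equal to a boundary falls into the right-hand bucket. A sketch of that mapping with `bisect`:

```python
from bisect import bisect_right

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucket_index(age):
    """Bucket i covers [boundaries[i-1], boundaries[i]); 10 boundaries -> 11 buckets."""
    return bisect_right(boundaries, age)

assert bucket_index(17) == 0   # (-inf, 18)
assert bucket_index(18) == 1   # [18, 25) -- boundary values go to the right bucket
assert bucket_index(64) == 9   # [60, 65)
assert bucket_index(70) == 10  # [65, +inf)
```

Bucketizing lets the linear model learn a separate weight per age range instead of one global slope.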
Set up the wide and deep feature columns:
# Wide columns and deep columns.
base_columns = [
    gender, education, marital_status, relationship, workclass, occupation,
    native_country, age_buckets,
]
crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]
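`crossed_column` creates one feature per combination of its inputs (e.g. education × occupation) and hashes the combined key into `hash_bucket_size` buckets, so the wide part can learn a weight per interaction. A pure-Python sketch of the combination step (again, the real hash differs):

```python
import hashlib

def crossed_bucket(values, hash_bucket_size):
    """Hash a tuple of feature values into one interaction bucket."""
    key = "_X_".join(values)  # e.g. "Bachelors_X_Exec-managerial"
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % hash_bucket_size

# 16 education values x 14 occupations = 224 raw crosses, comfortably
# below the 1000 buckets, so most pairs get a dedicated weight.
b = crossed_bucket(["Bachelors", "Exec-managerial"], 1000)
```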
deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    # To show an example of embedding
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
]
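The embedding `dimension=8` above is a hyperparameter. A common rule of thumb (suggested in TensorFlow's feature-column material) is roughly the fourth root of the number of categories, which for 1000 hash buckets is about 6, so 8 is in the same ballpark:

```python
import math

def suggested_embedding_dim(num_categories):
    """Rule-of-thumb embedding size: ceil(num_categories ** 0.25)."""
    return math.ceil(num_categories ** 0.25)

dim = suggested_embedding_dim(1000)  # 1000 ** 0.25 is about 5.62, so 6
```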
input_fn
input_fn is a particularly important function for the DNNLinearCombined classifier. This post shows two input_fn implementations; I lean toward the first. Note that if both definitions are kept in one script, the second shadows the first, so comment out the one you are not using.
def input_fn(data_file, num_epochs, shuffle):
    """Input builder function."""
    df_data = pd.read_csv(
        tf.gfile.Open(data_file),
        names=CSV_COLUMNS,
        skipinitialspace=True,
        engine="python",
        skiprows=1)  # adult.test starts with a "|1x3 Cross validator" line
    # Remove NaN elements.
    df_data = df_data.dropna(how="any", axis=0)
    labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
    return tf.estimator.inputs.pandas_input_fn(
        x=df_data,
        y=labels,
        batch_size=100,
        num_epochs=num_epochs,
        shuffle=shuffle,
        num_threads=5)
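One subtlety in the label line above: in adult.test the labels carry a trailing period (">50K.", "<=50K."), which is why the code tests membership with `">50K" in x` rather than equality. A quick check of that transform in plain Python:

```python
# Label strings as they appear in adult.data vs adult.test.
raw_labels = [">50K", "<=50K", ">50K.", "<=50K."]

# Substring test handles both forms; "<=50K" never contains ">50K".
binary = [int(">50K" in x) for x in raw_labels]
assert binary == [1, 0, 1, 0]
```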
# These constants are referenced below; they mirror the official wide_deep example.
_CSV_COLUMNS = CSV_COLUMNS
_CSV_COLUMN_DEFAULTS = [[0], [''], [0], [''], [0], [''], [''], [''], [''],
                        [''], [0], [0], [0], [''], ['']]
_NUM_EXAMPLES = {'train': 32561, 'validation': 16281}

def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Create an input function for the Estimator."""
    assert tf.gfile.Exists(data_file), "{0} not found.".format(data_file)
    def parse_csv(line):
        print("Parsing", data_file)
        # tf.decode_csv turns a CSV line into a list of Tensors, one per column.
        # record_defaults gives the fill value (and dtype) for each column.
        columns = tf.decode_csv(line, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        labels = features.pop('income_bracket')
        # tf.equal(x, y) returns a bool Tensor of element-wise x == y.
        return features, tf.equal(labels, '>50K')
    dataset = tf.data.TextLineDataset(data_file) \
        .map(parse_csv, num_parallel_calls=5)
    # DataFrame-to-tensor reference: https://cloud.tencent.com/developer/ask/135418
    if shuffle:
        dataset = dataset.shuffle(
            buffer_size=_NUM_EXAMPLES['train'] + _NUM_EXAMPLES['validation'])
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
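The order of the dataset transformations matters: shuffle comes before repeat, so each epoch is a fresh permutation, and batch comes last, so a batch may straddle two epochs. A pure-Python sketch of that ordering, assuming a plain list stands in for the dataset (the buffer here covers the whole list, matching the full-size buffer used above):

```python
import random

def shuffle_repeat_batch(examples, num_epochs, batch_size, shuffle, seed=42):
    """Mimic Dataset.shuffle(...).repeat(num_epochs).batch(batch_size) ordering."""
    rng = random.Random(seed)
    stream = []
    for _ in range(num_epochs):
        epoch = list(examples)
        if shuffle:
            rng.shuffle(epoch)  # re-shuffled each epoch, like reshuffle_each_iteration=True
        stream.extend(epoch)
    # batch() applies after repeat, so batches can span epoch boundaries.
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

batches = shuffle_repeat_batch([1, 2, 3, 4], num_epochs=2, batch_size=3, shuffle=False)
assert batches == [[1, 2, 3], [4, 1, 2], [3, 4]]
```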
model
def build_estimator(model_dir, model_type):
    """Build an estimator."""
    if model_type == "wide":
        m = tf.estimator.LinearClassifier(
            model_dir=model_dir, feature_columns=base_columns + crossed_columns)
    elif model_type == "deep":
        m = tf.estimator.DNNClassifier(
            model_dir=model_dir,
            feature_columns=deep_columns,
            hidden_units=[100, 50])
    else:
        m = tf.estimator.DNNLinearCombinedClassifier(
            model_dir=model_dir,
            linear_feature_columns=crossed_columns,
            dnn_feature_columns=deep_columns,
            dnn_hidden_units=[100, 50])
    return m
train_and_eval
def train_and_eval(model_dir, model_type, train_steps, train_data, test_data):
    """Train and evaluate the model."""
    # train_file_name, test_file_name = maybe_download(train_data, test_data)
    train_file_name = './data/adult.data'
    test_file_name = './data/adult.test'
    # model_dir = tempfile.mkdtemp() if not model_dir else model_dir
    m = build_estimator(model_dir, model_type)
    # Set num_epochs to None to get an infinite stream of data.
    m.train(
        input_fn=input_fn(train_file_name, num_epochs=None, shuffle=True),
        steps=train_steps)
    # Set steps to None to run evaluation until all data is consumed.
    results = m.evaluate(
        input_fn=input_fn(test_file_name, num_epochs=1, shuffle=False),
        steps=None)
    print("model directory = %s" % model_dir)
    for key in sorted(results):
        print("%s: %s" % (key, results[key]))
main
# Note: this global overrides the --model_dir flag below, since main() uses it directly.
model_dir = './model2/wide_deep'

def main(_):
    train_and_eval(model_dir, FLAGS.model_type, FLAGS.train_steps,
                   FLAGS.train_data, FLAGS.test_data)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.register("type", "bool", lambda v: v.lower() == "true")
    parser.add_argument(
        "--model_dir",
        type=str,
        default="",
        help="Base directory for output models."
    )
    parser.add_argument(
        "--model_type",
        type=str,
        default="wide_n_deep",
        help="Valid model types: {'wide', 'deep', 'wide_n_deep'}."
    )
    parser.add_argument(
        "--train_steps",
        type=int,
        default=2000,
        help="Number of training steps."
    )
    parser.add_argument(
        "--train_data",
        type=str,
        default="",
        help="Path to the training data."
    )
    parser.add_argument(
        "--test_data",
        type=str,
        default="",
        help="Path to the test data."
    )
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
Training on your own dataset
To train on your own data, the only thing that really needs to change is the input_fn. Here the data is a Hive table stored on HDFS, with rows in the format feature1,feature2,feature3,…,label. The platform is PySpark: spark.sql reads the Hive table, and the result is converted to a pandas DataFrame. If your data is already a pandas DataFrame or a numpy array, that works just as well. The hidden risk: with tens of millions of rows at hundreds of feature dimensions each, the conversion to pandas can run out of memory; a next step is to read the HDFS data directly.
def input_fn(num_epochs, shuffle):
    feature_list_sql = "select \
        base_sex,base_age,base_edu, \
        base_marry,base_profession,base_city_level, \
        browse_times_15_cate,browse_times_30_cate,browse_times_60_cate,browse_times_90_cate, \
        label \
        from X_X_feature_v1 \
        where id_ ='92' and train_flag='1'"
    feature_df = spark.sql(feature_list_sql)
    feature_df_pd = feature_df.limit(10000).toPandas()
    feature_df_pd['label'] = feature_df_pd['label'].astype("int")
    feature_df_pd['base_age'] = feature_df_pd['base_age'].astype("int")
    feature_df_pd = feature_df_pd.dropna(how="any", axis=0)
    labels = feature_df_pd['label']
    # labels = feature_df_pd.pop('label')
    print('******************************************')
    print(feature_df_pd.columns)
    print('******************************************')
    return tf.estimator.inputs.pandas_input_fn(
        x=feature_df_pd,
        y=labels,
        batch_size=100,
        num_epochs=num_epochs,
        shuffle=shuffle,
        num_threads=5)
    # labels = feature_df_pd["label"].apply(lambda x: ">50K" in x).astype(int)
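For the memory-overflow concern above, one mitigation, sketched here under the assumption that the Hive table can be exported as a CSV file, is to stream the data in chunks with pandas instead of materializing the whole table at once (the column names below are just stand-ins):

```python
import io
import pandas as pd

# Stand-in for an exported table; in practice this would be a file path.
csv_text = "base_age,label\n23,0\n35,1\n41,1\n29,0\n"

row_count = 0
positives = 0
# chunksize keeps only a bounded number of rows in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk = chunk.dropna(how="any", axis=0)
    row_count += len(chunk)
    positives += int(chunk["label"].sum())
```

Each chunk could equally be handed to a per-chunk input_fn, trading one large conversion for many small ones.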
References:
https://blog.csdn.net/u013608336/article/details/78031788
https://www.jianshu.com/p/6868fc1f65d0