DNNLinearCombinedClassifier in Practice

Training on the Census Income Data Set

Training set

The training data is the Census Income Data Set. It contains 48,842 samples with attributes such as age, occupation, education, and income, where income is a binary label: either >50K or <=50K. The data is split into 32,561 training samples and 16,281 test samples.
The attributes are as follows:

Field | Values | Description
age | continuous | the respondent's age
fnlwgt | continuous | "final weight": roughly, the number of people the census estimates this record represents
education-num | continuous | highest education level, in numeric form
capital-gain | continuous | recorded capital gains
capital-loss | continuous | recorded capital losses
hours-per-week | continuous | hours worked per week
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked | type of employer (private, government, self-employed, etc.)
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool | highest education level attained
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse | marital status
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces | occupation
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried | role in the household (wife, child, husband, not in family, other relative, unmarried)
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black | race
sex | Female, Male | sex
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands | country of origin
income | >50K, <=50K | whether annual income exceeds $50,000

The data files are comma-separated (available for download from the UCI Machine Learning Repository):
adult.data: training set
adult.test: test set
adult.data looks like the following; the trailing <=50K / >50K field is the label.
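For example, the first record of adult.data reads:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K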

Features

The dataset contains raw features, some of them strings, which cannot be fed to the model directly; they need feature processing first.
For feature engineering on continuous and categorical data, see the Feature column part of 《DNNLinear組合分類器的使用 & Feature column》.
Put together, the code below builds, trains, and evaluates a complete model; if you run into any problems, feel free to leave a comment. First, the imports and argument setup:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import tempfile

import pandas as pd
from six.moves import urllib
import tensorflow as tf

# The raw data has no header row; these are the feature column names.
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"
]
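The tf.data-based input_fn shown later also needs per-column defaults for tf.decode_csv and the split sizes; a minimal sketch, with values matching the standard adult.data/adult.test splits:

# Default value per column for tf.decode_csv: [0] marks a numeric column,
# [''] a string column, in the same order as CSV_COLUMNS.
_CSV_COLUMN_DEFAULTS = [[0], [''], [0], [''], [0], [''], [''], [''], [''],
                        [''], [0], [0], [0], [''], ['']]

# Sizes of the standard train/test splits.
_NUM_EXAMPLES = {
    'train': 32561,
    'validation': 16281,
}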

Continuous numeric features (marked "continuous" above), such as age and education_num, can be passed straight to numeric_column:

# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

Categorical features are handled as follows:

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
    
# To show an example of hashing:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

# Transformations.
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
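# With 10 boundaries, bucketized_column produces 11 buckets:
# (-inf, 18), [18, 25), [25, 30), ..., [60, 65), [65, +inf).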

Now set up the wide and deep feature columns:

# Wide columns and deep columns.
base_columns = [
    gender, education, marital_status, relationship, workclass, occupation,
    native_country, age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]
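# Each crossed column hashes the combination of its inputs (e.g.
# education="Bachelors" x occupation="Tech-support") into one of
# hash_bucket_size buckets, so the linear model can memorize
# co-occurrences of feature values.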

deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    # To show an example of embedding
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
]
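# The wide (linear) part memorizes sparse feature crosses; the deep (DNN)
# part generalizes via dense numerics, one-hot indicator columns, and
# low-dimensional embeddings of the high-cardinality hashed columns.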

input_fn

The input_fn is a particularly important piece of a DNNLinearCombined pipeline. This post shows two input_fn implementations; I prefer the first. Note that both are given the same name below, so if you keep them in one file the second definition shadows the first; define only the one you use.

def input_fn(data_file, num_epochs, shuffle):
  """Input builder function."""
  df_data = pd.read_csv(
      tf.gfile.Open(data_file),
      names=CSV_COLUMNS,
      skipinitialspace=True,
      engine="python",
      skiprows=1)  # skips adult.test's junk first line (at the cost of one real row in adult.data)
  # remove NaN elements
  df_data = df_data.dropna(how="any", axis=0)
  labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
  return tf.estimator.inputs.pandas_input_fn(
      x=df_data,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
The second version builds the pipeline with the tf.data API instead:

def input_fn(data_file, num_epochs, shuffle, batch_size):
    """Create an input function for the Estimator (tf.data version)."""
    assert tf.gfile.Exists(data_file), "{0} not found.".format(data_file)

    def parse_csv(line):
        print("Parsing", data_file)
        # tf.decode_csv turns each CSV line into a list of Tensors, one per
        # column; record_defaults supplies the fill value and type per column.
        columns = tf.decode_csv(line, record_defaults=_CSV_COLUMN_DEFAULTS)
        features = dict(zip(CSV_COLUMNS, columns))
        labels = features.pop('income_bracket')
        # tf.equal(x, y) returns a bool Tensor computed element-wise.
        return features, tf.equal(labels, '>50K')

    # On converting a DataFrame to tensors, see
    # https://cloud.tencent.com/developer/ask/135418
    dataset = tf.data.TextLineDataset(data_file) \
                .map(parse_csv, num_parallel_calls=5)

    if shuffle:
        dataset = dataset.shuffle(
            buffer_size=_NUM_EXAMPLES['train'] + _NUM_EXAMPLES['validation'])

    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
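Unlike the first version, this one returns (features, labels) tensors rather than an input function, so wrap the call in a lambda when handing it to an Estimator m (a minimal sketch; the path, batch size, and step count are illustrative):

m.train(
    input_fn=lambda: input_fn('./data/adult.data', num_epochs=None,
                              shuffle=True, batch_size=100),
    steps=2000)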

model

def build_estimator(model_dir, model_type):
  """Build an estimator."""
  if model_type == "wide":
    m = tf.estimator.LinearClassifier(
        model_dir=model_dir, feature_columns=base_columns + crossed_columns)
  elif model_type == "deep":
    m = tf.estimator.DNNClassifier(
        model_dir=model_dir,
        feature_columns=deep_columns,
        hidden_units=[100, 50])
  else:
    m = tf.estimator.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        linear_feature_columns=crossed_columns,
        dnn_feature_columns=deep_columns,
        dnn_hidden_units=[100, 50])
  return m

train_and_eval

def train_and_eval(model_dir, model_type, train_steps, train_data, test_data):
  """Train and evaluate the model."""
  #train_file_name, test_file_name = maybe_download(train_data, test_data)
  train_file_name = './data/adult.data'
  test_file_name = './data/adult.test'
  #model_dir = tempfile.mkdtemp() if not model_dir else model_dir

  m = build_estimator(model_dir, model_type)
  # Set num_epochs to None to get an infinite stream of data.
  # (These calls assume the first, pandas-based input_fn, which itself
  # returns an input function.)
  m.train(
      input_fn=input_fn(train_file_name, num_epochs=None, shuffle=True),
      steps=train_steps)
  # set steps to None to run evaluation until all data consumed.
  results = m.evaluate(
      input_fn=input_fn(test_file_name, num_epochs=1, shuffle=False),
      steps=None)
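  # `results` is a dict of evaluation metrics; for a binary classifier it
  # includes keys such as 'accuracy', 'auc', 'average_loss', and 'global_step'.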
  print("model directory = %s" % model_dir)
  for key in sorted(results):
    print("%s: %s" % (key, results[key]))

main

model_dir = './model2/wide_deep'  # hard-coded here, overriding the --model_dir flag
def main(_):
  train_and_eval(model_dir, FLAGS.model_type, FLAGS.train_steps,
                 FLAGS.train_data, FLAGS.test_data)

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  parser.add_argument(
      "--model_dir",
      type=str,
      default="",
      help="Base directory for output models."
  )
  parser.add_argument(
      "--model_type",
      type=str,
      default="wide_n_deep",
      help="Valid model types: {'wide', 'deep', 'wide_n_deep'}."
  )
  parser.add_argument(
      "--train_steps",
      type=int,
      default=2000,
      help="Number of training steps."
  )
  parser.add_argument(
      "--train_data",
      type=str,
      default="",
      help="Path to the training data."
  )
  parser.add_argument(
      "--test_data",
      type=str,
      default="",
      help="Path to the test data."
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
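Save the script (say, as wide_deep.py; the filename is illustrative) and run:

python wide_deep.py --model_type=wide_n_deep --train_steps=2000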

Training on your own data

To train on your own data, the only thing that really needs to change is input_fn. In my case the data is a Hive table stored on HDFS, one record per row in the format feature1,feature2,feature3,…,label. The platform is PySpark: spark.sql reads the Hive table and converts it to a pandas DataFrame. A pandas DataFrame or a NumPy array works just as well as input. The lurking risk: with tens of millions of rows and feature vectors of a hundred-plus dimensions, the conversion to pandas can run out of memory; the next step is to try reading the HDFS data directly.

def input_fn(num_epochs, shuffle):
    feature_list_sql = "select \
                base_sex,base_age,base_edu, \
                base_marry,base_profession,base_city_level, \
                browse_times_15_cate,browse_times_30_cate,browse_times_60_cate,browse_times_90_cate, \
                label \
                from X_X_feature_v1 \
                where id_ ='92' and train_flag='1'"
    feature_df = spark.sql(feature_list_sql)
    feature_df_pd = feature_df.limit(10000).toPandas()
    # Drop rows with missing values before casting; astype("int") fails on NaN.
    feature_df_pd = feature_df_pd.dropna(how="any", axis=0)
    feature_df_pd['label'] = feature_df_pd['label'].astype("int")
    feature_df_pd['base_age'] = feature_df_pd['base_age'].astype("int")
    labels = feature_df_pd['label']
    # labels = feature_df_pd.pop('label')  # pop would also remove the label from the features
    print('******************************************')
    print(feature_df_pd.columns)
    print('******************************************')

    return tf.estimator.inputs.pandas_input_fn(
        x=feature_df_pd,
        y=labels,
        batch_size=100,
        num_epochs=num_epochs,
        shuffle=shuffle,
        num_threads=5)
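As for reading HDFS directly, one possible direction is to point the tf.data pipeline at the HDFS files themselves; a minimal sketch, assuming a TensorFlow build with HDFS filesystem support, a configured Hadoop environment, and a hypothetical file path:

# Hypothetical path; TextLineDataset accepts hdfs:// URIs when TF is built
# with HDFS support. Each line is "feature1,feature2,...,label"; parse it
# with tf.decode_csv as in the census input_fn above.
dataset = tf.data.TextLineDataset(
    "hdfs://namenode:8020/user/hive/warehouse/x_x_feature_v1/part-00000")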

References:
https://blog.csdn.net/u013608336/article/details/78031788
https://www.jianshu.com/p/6868fc1f65d0
