Basic Steps for Using TensorFlow

Intro to Pandas

  1. import pandas as pd

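  2. The main data structures in pandas are implemented as two classes:

    • DataFrame, which you can think of as a relational data table with multiple rows and named columns.

      pd.DataFrame({'column_name': series})
    • Series, which is a single column. A DataFrame contains one or more Series, and each Series has a name.

      pd.Series([])   e.g. pd.Series(['name_a', 'name_b', 'name_c'])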
  3. series.apply() works like Python's built-in map(): it takes a function (often a lambda) and applies it to every value in the Series.

    population = pd.Series([852469, 1015785, 485199])
    population.apply(lambda val: val > 1000000)
  4. Updating a DataFrame is just as easy:

    # Add a new Series as a column.
    cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
    # Build a new column from two existing columns.
    cities['Population density'] = cities['Population'] / cities['Area square miles']
  5. Boolean Series are combined with the element-wise operators |, & and ~, not with the Python keywords or, and and not.

  6. Index

    1. Both Series and DataFrame objects expose an index property.

      city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
      city_names.index
      cities.index
    2. reindex

      cities.reindex([2, 0, 1])

      After reindex, the index labels themselves do not change; only the order of the rows does.

    3. Reindexing in random order

      Most pandas Series, and the index as well, can be passed as arguments to NumPy functions.

      cities.reindex(np.random.permutation(cities.index))
    4. Reindexing with labels that are not in the original index adds new rows for those labels, filled with NaN.
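
The pandas notes above can be tried out together in one short script. This is a minimal sketch, not the original notebook: the `cities` DataFrame is rebuilt here from the values quoted in the notes, and the column name `'City name'` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a small DataFrame from the values quoted in the notes.
# The column name 'City name' is an assumption for this sketch.
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities = pd.DataFrame({'City name': city_names, 'Population': population})

# apply() runs the function on every value and returns a new Series.
is_large = cities['Population'].apply(lambda val: val > 1000000)

# Boolean Series combine element-wise with &, | and ~ (not and / or / not).
starts_with_san = cities['City name'].apply(lambda name: name.startswith('San'))
large_san = is_large & starts_with_san
not_large = ~is_large

# reindex() reorders the rows; each row keeps its original label.
reordered = cities.reindex([2, 0, 1])

# A label missing from the original index (here 3) produces a new row of NaN.
extended = cities.reindex([0, 3])

# Shuffling: permute the index with NumPy, then reindex.
shuffled = cities.reindex(np.random.permutation(cities.index))
```

Printing `reordered` and `extended` side by side is a quick way to see that the labels travel with their rows and that unknown labels become empty rows.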

First Steps with TensorFlow

The imported data falls into two main types:

  1. Categorical data
  2. Numerical data
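
As a rough illustration of the two types, the sketch below splits the columns of a small DataFrame by dtype. The rows and the `home_style` column are hypothetical and exist only for this example; only `total_rooms` and `median_house_value` are names used in these notes.

```python
import pandas as pd

# Hypothetical sample rows; 'home_style' is an invented categorical column.
housing = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0],
    'median_house_value': [452600.0, 358500.0, 352100.0],
    'home_style': ['ranch', 'colonial', 'ranch'],
})

# Numerical data: numbers that should be treated as numbers.
numerical_columns = housing.select_dtypes(include='number').columns.tolist()

# Categorical data: here, everything that is not numeric (textual values).
categorical_columns = housing.select_dtypes(exclude='number').columns.tolist()
```

Splitting by dtype is only a heuristic: a numeric code such as a postal code is stored as a number but is usually better treated as categorical.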

Define the input feature

# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

Define the label

# Define the label.
targets = california_housing_dataframe["median_house_value"]

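Configure a linear regression model with LinearRegressor

# Note: tf.train and tf.contrib are TensorFlow 1.x APIs.
# Use gradient descent as the optimizer, with a learning rate of 0.0000001.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Apply gradient clipping to the optimizer.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)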
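
To make the optimizer settings more concrete, here is a minimal pure-Python sketch of gradient descent with gradient clipping on a one-feature linear model. It is not how TensorFlow implements it: the helper names `clip_by_norm` and `sgd_step` are invented for this example, the toy data is made up, and clipping the weight and bias gradients together as one vector is a simplifying assumption that may differ from how the TensorFlow helper treats each variable.

```python
import math

def clip_by_norm(gradients, clip_norm):
    """Scale the gradient vector down if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= clip_norm:
        return list(gradients)
    scale = clip_norm / norm
    return [g * scale for g in gradients]

def sgd_step(weight, bias, xs, ys, learning_rate, clip_norm):
    """One gradient-descent step on mean squared error for y = weight * x + bias."""
    n = len(xs)
    errors = [weight * x + bias - y for x, y in zip(xs, ys)]
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    # Clip before applying the update, so one step can never be too large.
    grad_w, grad_b = clip_by_norm([grad_w, grad_b], clip_norm)
    return weight - learning_rate * grad_w, bias - learning_rate * grad_b

# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

weight, bias = 0.0, 0.0
for _ in range(2000):
    weight, bias = sgd_step(weight, bias, xs, ys, learning_rate=0.05, clip_norm=5.0)
```

Clipping matters most early in training, when the error and therefore the gradient is large; once the gradient norm drops below the threshold, the update is ordinary gradient descent.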

Define the input function

import numpy as np
from tensorflow.python.data import Dataset

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Input function for a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Convert the pandas data into a dict that maps each column name
    # to a numpy array holding all values of that column.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Build a dataset from the features and labels, then configure
    # the batch size (batch_size) and the number of repeats (num_epochs).
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data if requested. buffer_size is the size of the buffer
    # from which the next element is drawn at random.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
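
To see what batch_size, shuffle and num_epochs mean without running TensorFlow, the following pure-Python generator mimics the same idea. It is a simplified sketch under stated assumptions: the function name `batches` and the `seed` parameter are invented here, the data values are toy numbers, and it shuffles individual examples at the start of each epoch, which is not the same order of operations as the Dataset pipeline above.

```python
import random
from itertools import islice

def batches(features, targets, batch_size=1, shuffle=True, num_epochs=None, seed=None):
    """Yield (feature_batch, target_batch) pairs, repeating for num_epochs passes."""
    rng = random.Random(seed)
    indices = list(range(len(features)))
    epoch = 0
    # num_epochs=None means: keep repeating the data indefinitely.
    while num_epochs is None or epoch < num_epochs:
        if shuffle:
            rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            chosen = indices[start:start + batch_size]
            yield [features[i] for i in chosen], [targets[i] for i in chosen]
        epoch += 1

# Toy values for illustration only.
rooms = [880.0, 7099.0, 1467.0, 1274.0, 1627.0]
values = [452600.0, 358500.0, 352100.0, 341300.0, 342200.0]

# Prediction-style input: a single pass, in order.
one_pass = list(batches(rooms, values, batch_size=2, shuffle=False, num_epochs=1))

# Training-style input: shuffled and endless, so take only the first 7 batches.
first_seven = list(islice(batches(rooms, values, batch_size=2, seed=0), 7))
```

The single pass yields three batches (two full ones and a final batch with the leftover example), while the training-style stream never ends on its own, which is why the number of training steps has to be given separately.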

Train the model

Call the train() method of the linear regression model we configured, passing in the input function and the number of steps.

_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)

Evaluate the model

  1. Create an input function for prediction

    Since we make just one prediction for each example, we do not need to repeat or shuffle the data.

    prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
  2. Call the model's predict() method to make predictions

    predictions = linear_regressor.predict(input_fn=prediction_input_fn)
  3. Convert the predictions to a NumPy array so the error is easy to compute

    predictions = np.array([item['predictions'][0] for item in predictions])
  4. Compute the mean squared error (MSE) and the root mean squared error (RMSE)

    mean_squared_error = metrics.mean_squared_error(predictions, targets)
    root_mean_squared_error = math.sqrt(mean_squared_error)
  5. Compare the RMSE with the difference between the maximum and minimum target values

    min_house_value = california_housing_dataframe["median_house_value"].min()
    max_house_value = california_housing_dataframe["median_house_value"].max()
    min_max_difference = max_house_value - min_house_value

    Our error spans nearly half the range of the target values, so it needs to be reduced further.

Inspect the error

First look at the data: take a sample, then draw a scatter plot together with the line defined by the weight and bias our model learned.

  1. Take a sample

    sample = california_housing_dataframe.sample(n=300)
  2. Plot the line of our model's predictions over the scatter plot to see the gap

    # Get the min and max total_rooms values in the sample.
    x_0 = sample["total_rooms"].min()
    x_1 = sample["total_rooms"].max()

    # Retrieve the final weight and bias produced during training.
    weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
    bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

    # Compute the predicted median house values for the min and max total_rooms values.
    y_0 = weight * x_0 + bias
    y_1 = weight * x_1 + bias

    # Plot the regression line from (x_0, y_0) to (x_1, y_1).
    plt.plot([x_0, x_1], [y_0, y_1], c='r')

    # Label the axes.
    plt.ylabel("median_house_value")
    plt.xlabel("total_rooms")

    # Draw a scatter plot of the sample data.
    plt.scatter(sample["total_rooms"], sample["median_house_value"])

    plt.show()

Tune the model hyperparameters to reduce the error

Putting the code together

import math

from IPython import display
from matplotlib import cm
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf


def train_model(learning_rate, steps, batch_size, input_feature="total_rooms"):
    """Trains a linear regression model of one feature.

    Args:
      learning_rate: A `float`, the learning rate.
      steps: A non-zero `int`, the total number of training steps.
      batch_size: A non-zero `int`, the number of examples (chosen at random) in a single step.
      input_feature: A `string` naming the column to use as the input feature.
    """

    periods = 10
    # Integer division, so that train() receives a whole number of steps.
    steps_per_period = steps // periods

    my_feature = input_feature
    my_feature_data = california_housing_dataframe[[my_feature]]
    my_label = "median_house_value"
    targets = california_housing_dataframe[my_label]

    # Create the feature columns.
    feature_columns = [tf.feature_column.numeric_column(my_feature)]

    # Create the input functions.
    training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)
    prediction_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)

    # Create the linear regression model.
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
    linear_regressor = tf.estimator.LinearRegressor(
        feature_columns=feature_columns,
        optimizer=my_optimizer
    )

    # Set up the plot that shows the learned line for each period.
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.title("Learned Line by Period")
    plt.ylabel(my_label)
    plt.xlabel(my_feature)
    sample = california_housing_dataframe.sample(n=300)
    plt.scatter(sample[my_feature], sample[my_label])
    colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]

    # Train the model, but do so inside a loop so that we can periodically assess
    # loss metrics.
    print("Training model...")
    print("RMSE (on training data):")
    root_mean_squared_errors = []
    for period in range(0, periods):
        # Train the model, starting from the prior state.
        linear_regressor.train(
            input_fn=training_input_fn,
            steps=steps_per_period
        )
        # Compute the predictions.
        predictions = linear_regressor.predict(input_fn=prediction_input_fn)
        predictions = np.array([item['predictions'][0] for item in predictions])

        # Compute the loss.
        root_mean_squared_error = math.sqrt(
            metrics.mean_squared_error(predictions, targets))
        print("  period %02d : %0.2f" % (period, root_mean_squared_error))

        # Add the loss from this period to our list.
        root_mean_squared_errors.append(root_mean_squared_error)

        # Compute the endpoints of the line learned in this period and plot it.
        y_extents = np.array([0, sample[my_label].max()])

        weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]
        bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

        x_extents = (y_extents - bias) / weight
        x_extents = np.maximum(np.minimum(x_extents,
                                          sample[my_feature].max()),
                               sample[my_feature].min())
        y_extents = weight * x_extents + bias
        plt.plot(x_extents, y_extents, color=colors[period])
    print("Model training finished.")

    # Plot the loss metric over the periods.
    plt.subplot(1, 2, 2)
    plt.ylabel('RMSE')
    plt.xlabel('Periods')
    plt.title("Root Mean Squared Error vs. Periods")
    plt.tight_layout()
    plt.plot(root_mean_squared_errors)

    # Output summary statistics for the predictions and the targets.
    calibration_data = pd.DataFrame()
    calibration_data["predictions"] = pd.Series(predictions)
    calibration_data["targets"] = pd.Series(targets)
    display.display(calibration_data.describe())

    print("Final RMSE (on training data): %0.2f" % root_mean_squared_error)


train_model(
    learning_rate=0.01,
    steps=100000,
    batch_size=1000
)
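
The least obvious part of train_model is how the endpoints of each learned line are chosen. The sketch below isolates that step: it inverts y = weight * x + bias for y = 0 and y = y_max, clamps the resulting x values to the range of the sample, and recomputes y. The function name `line_extents` and the numbers are assumptions for illustration, and the sketch assumes a non-zero weight.

```python
import numpy as np

def line_extents(weight, bias, x_min, x_max, y_max):
    """Endpoints of y = weight * x + bias, clamped to the sample's x range."""
    # Start from the y range we want the line to cover.
    y_extents = np.array([0.0, y_max])
    # Invert the line equation to find the matching x values.
    x_extents = (y_extents - bias) / weight
    # Keep the endpoints inside the range of x values present in the sample.
    x_extents = np.clip(x_extents, x_min, x_max)
    # Recompute y so both endpoints lie exactly on the learned line.
    y_extents = weight * x_extents + bias
    return x_extents, y_extents

# Made-up weight, bias and ranges for illustration.
x_ext, y_ext = line_extents(weight=2.0, bias=10.0, x_min=0.0, x_max=100.0, y_max=500.0)
```

Here the unclamped x values would be -5 and 245, so both are pulled back to the sample range and the line is drawn from (0, 10) to (100, 210). Clamping keeps every period's line inside the scatter plot even when an early, poorly fitted line would otherwise run far off the axes.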