Preface
The previous article gave a brief introduction to the Tensorflow Federated framework, but for now, using it to implement federated learning (experimental) algorithms is still overkill. After some exploration, I found that plain Tensorflow can implement federated learning algorithms too, and you can even separate the Client-side and Server-side code by hand, which makes the logic clearer. With a few modifications to add network transport, it could even be deployed in a distributed setting and become federated learning in the true sense (the performance probably wouldn't be great, heh).
In this article, I'll share a way to implement federated learning that has the following advantages:
- No need to read or write files to save and switch Client models
- No need to re-initialize the Client variables every epoch
- Memory footprint kept as small as possible (the parameter count is merely doubled: Client side + Server side)
- Switching Clients only adds a few assignment operations
Before reading on, I assume you already know a bit about federated learning and agree with the following:
- The goal of learning is a better model, which is kept by the Server while the Clients provide updates
- The data is kept and used by the Clients
The article's code environment and dependencies:
- Python 3.7
- Tensorflow v1.14.x
- tqdm (a Python module)
The rest of this article explains the design and implementation of the Client-side and Server-side code. If you'd rather skip the explanation, jump straight to the complete-code section at the end: there are four code files, and running python Server.py will immediately give you a taste of authentic (single-machine simulated) federated learning.
Client side
First, let's clarify the Client side's job, which consists of the following three steps:
- Load the model variables sent by the Server into the model
- Update the current model with all of its own data
- Send the updated model variables back to the Server
Given these tasks, we can work out the capabilities the Client code needs:
- Build and train a Tensorflow model (i.e., a computation graph)
- Load the model variable values sent over by the Server
- Extract the current model's variable values and send them to the Server
- Maintain its own dataset for training
Thinking it through, this is just your usual tf model code plus loading and extracting the model variables. Assuming the Client class has already built the model, a sess.run() on each variable is all it takes to get its value. The code below shows part of the Clients class definition; the get_client_vars function returns the values of all trainable variables in the computation graph:
class Clients:
def __init__(self, input_shape, num_classes, learning_rate, clients_num):
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)
""" 本函數未完待續... """
def get_client_vars(self):
""" Return all of the variables list """
with self.graph.as_default():
client_vars = self.sess.run(tf.trainable_variables())
return client_vars
Loading the global_vars sent by the Server into the model variables hinges on the tf.Variable.load() function, which loads a value into a model variable. For example:
variable.load(tensor, sess)
assigns tensor (a concrete value such as a numpy array) to variable (of type tf.Variable), where sess is a tf.Session.
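As a minimal self-contained sketch (the variable v, its shape, and the value being loaded are made up purely for illustration), this is how load() overwrites a variable's value in a TF1 session without adding any ops to the graph:
import tensorflow as tf
import numpy as np

graph = tf.Graph()
with graph.as_default():
    # a toy 2x2 variable, initialized with zeros
    v = tf.Variable(tf.zeros([2, 2]), name='v')
    init_op = tf.global_variables_initializer()

with tf.Session(graph=graph) as sess:
    sess.run(init_op)
    # overwrite the variable with a concrete value (a numpy array)
    v.load(np.ones([2, 2], dtype=np.float32), sess)
    print(sess.run(v))  # [[1. 1.] [1. 1.]]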
To load values into every variable of the model, use tf.trainable_variables() to get all trainable variables in the computation graph (a list); as long as its order matches that of global_vars, it can be implemented like this:
def set_global_vars(self, global_vars):
""" Assign all of the variables with global vars """
with self.graph.as_default():
all_vars = tf.trainable_variables()
for variable, value in zip(all_vars, global_vars):
variable.load(value, self.sess)
In addition, the Clients class needs to define and train the model. I don't believe this is the key point of implementing federation, so in the code below I remove the function bodies and keep only the interface definitions (the complete code is in the last section):
import tensorflow as tf
import numpy as np
from collections import namedtuple
import math
# Custom model-definition function
from Model import AlexNet
# Custom dataset class
from Dataset import Dataset
# The definition of fed model
# A namedtuple storing, in order, the graph nodes a fed model needs:
# X: input
# Y: output (labels)
# DROP_RATE: dropout rate, as the name suggests
# train_op: the training node in the tf graph (usually optimizer.minimize(xxx))
# loss_op: the loss node
# acc_op: the accuracy node
FedModel = namedtuple('FedModel', 'X Y DROP_RATE train_op loss_op acc_op')
class Clients:
def __init__(self, input_shape, num_classes, learning_rate, clients_num):
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)
# Call the create function to build the computational graph of AlexNet
        # `net` is a list containing, in order, the graph nodes FedModel needs (see above)
net = AlexNet(input_shape, num_classes, learning_rate, self.graph)
self.model = FedModel(*net)
        # initialize
with self.graph.as_default():
self.sess.run(tf.global_variables_initializer())
# Load Cifar-10 dataset
# NOTE: len(self.dataset.train) == clients_num
        # Load the dataset. For the training set, `self.dataset.train[56]` is client 56's data,
        # and `self.dataset.train[56].next_batch(32)` returns one batch of size 32 for client 56.
        # All clients share a single test set, so
        # `self.dataset.test.next_batch(1000)` fetches 1000 test instances (no re-sampling).
self.dataset = Dataset(tf.keras.datasets.cifar10.load_data,
split=clients_num)
def run_test(self, num):
"""
        Predict on the test set and report the accuracy and loss
num: number of testing instances
"""
pass
def train_epoch(self, cid, batch_size=32, dropout_rate=0.5):
"""
        Train one client with its own data for one epoch
cid: Client id
"""
pass
def choose_clients(self, ratio=1.0):
"""
        Randomly choose a `ratio` fraction of the clients and return their ids (i.e., indices)
"""
client_num = self.get_clients_num()
choose_num = math.floor(client_num * ratio)
return np.random.permutation(client_num)[:choose_num]
def get_clients_num(self):
""" 返回clients的數量 """
return len(self.dataset.train)
Careful readers may have noticed that the class name Clients is plural, i.e., it stands for the whole collection of Clients, yet there is only one model, self.model. The reason is that the different Clients actually share the same model; only their data differs. The class member self.dataset has already partitioned the data, so when a different client should participate in training, we just overwrite the model variables with the values handed down by the Server, then use the index cid to locate that Client's data and train on it.
Of course, the most important reason for this design is to avoid building a separate computation graph for every Client. We simply don't have that much GPU memory TAT
To sum up: a federated-learning Client is just ordinary TF training code plus the ability to extract and assign the model variables' values.
Server side
Following the same routine, let's clarify the Server-side code's main tasks:
- Drive the Clients: hand a set of model variables to a Client for updating, then fetch the updated values back
- Maintain the global model: in every round, collect the models updated by several Clients and aggregate them into the model for the next round
For simplicity, the Server-side code is not abstracted into a class but written as a plain script. First, instantiate the Clients class defined above:
from Client import Clients
def buildClients(num):
learning_rate = 0.0001
num_input = 32 # image shape: 32*32
num_input_channel = 3 # image channel: 3
num_classes = 10 # Cifar-10 total classes (0-9 digits)
#create Client and model
return Clients(input_shape=[None, num_input, num_input, num_input_channel],
num_classes=num_classes,
learning_rate=learning_rate,
clients_num=num)
CLIENT_NUMBER = 100
client = buildClients(CLIENT_NUMBER)
global_vars = client.get_client_vars()
The client variable holds the models (in reality, a single computation graph) and the data of CLIENT_NUMBER Clients. global_vars holds the Server-side model variable values, i.e., our celebrated training target; at this point it is nothing more than the Client model's initialization values.
Next, in each Server epoch, the Server randomly picks a certain fraction of the Clients to participate in the round, hands each of them the current Server-side model global_vars for updating, and collects each one's updated variables. Once all Clients in the round have been collected, averaging these updated values gives the Server-side model for the next round, and the next epoch begins. Below is the epoch-loop code; read the comments carefully:
def run_global_test(client, global_vars, test_num):
""" 跑一下測試集,輸出ACC和Loss """
client.set_global_vars(global_vars)
acc, loss = client.run_test(test_num)
print("[epoch {}, {} inst] Testing ACC: {:.4f}, Loss: {:.4f}".format(
ep + 1, test_num, acc, loss))
CLIENT_RATIO_PER_ROUND = 0.12 # fraction of clients picked to train in each round
epoch = 360 # upper limit on the number of epochs
for ep in range(epoch):
# We are going to sum up active clients' vars at each epoch
    # used to accumulate (sum up) the Clients' parameters, saving memory
client_vars_sum = None
# Choose some clients that will train on this epoch
    # randomly pick some Clients for training
random_clients = client.choose_clients(CLIENT_RATIO_PER_ROUND)
# Train with these clients
    # train with these Clients and collect their updated models
for client_id in tqdm(random_clients, ascii=True):
# Restore global vars to client's model
        # load the Server-side model into the Client model
client.set_global_vars(global_vars)
# train one client
        # train the Client with this index
client.train_epoch(cid=client_id)
# obtain current client's vars
        # fetch the current Client's model variable values
current_client_vars = client.get_client_vars()
# sum it up
        # sum the parameters up layer by layer
if client_vars_sum is None:
client_vars_sum = current_client_vars
else:
for cv, ccv in zip(client_vars_sum, current_client_vars):
cv += ccv
# obtain the avg vars as global vars
    # divide the summed Client-side model variables by the number of Clients in this round
    # to get the averaged model, which becomes the new Server-side model parameters
global_vars = []
for var in client_vars_sum:
global_vars.append(var / len(random_clients))
# run test on 1000 instances
    # run the test set and print the results
run_global_test(client, global_vars, test_num=600)
After some number of such rounds, we end up with the trained Server-side model parameters global_vars. The logic is very simple, but I'd like you to notice the two "federated" points in it: the Server-side code never touches the data, and the number of Clients participating in each round is small compared to the total.
Extensions
To swap in a different model, you only need to implement a new graph-construction function to replace the Client-side AlexNet function, making sure it returns the same series of graph nodes, as in the sketch below.
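For instance, here is a minimal sketch of such a replacement: a hypothetical SimpleMLP constructor (the name, the layer sizes, and the use of tf.layers.dense are my own choices, not part of this repo) that keeps the same arguments and returns the nodes in the order FedModel expects:
import tensorflow as tf
from tensorflow.compat.v1.train import AdamOptimizer

def SimpleMLP(input_shape, num_classes, learning_rate, graph):
    """ Hypothetical drop-in replacement for AlexNet: same arguments, same return order """
    with graph.as_default():
        X = tf.placeholder(tf.float32, input_shape, name='X')
        Y = tf.placeholder(tf.float32, [None, num_classes], name='Y')
        DROP_RATE = tf.placeholder(tf.float32, name='drop_rate')

        # assumes 32*32*3 (Cifar-10 shaped) inputs
        flat = tf.reshape(X, [-1, 32 * 32 * 3])
        hidden = tf.layers.dense(flat, 256, activation=tf.nn.relu)
        hidden = tf.nn.dropout(hidden, rate=DROP_RATE)
        logits = tf.layers.dense(hidden, num_classes)

        loss_op = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=Y))
        train_op = AdamOptimizer(learning_rate=learning_rate).minimize(loss_op)

        correct = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
        acc_op = tf.reduce_mean(tf.cast(correct, tf.float32))

    return X, Y, DROP_RATE, train_op, loss_op, acc_op
On the Client side you would then import and call SimpleMLP instead of AlexNet; everything else stays the same.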
To get a Non-I.I.D. data distribution, you only need to change how the data is partitioned in Dataset.py; one possible sketch follows this paragraph. That said, in my brief experiments the current model + training scheme cannot cope with extremely Non-I.I.D. data, which, from the other direction, confirms that Non-I.I.D. data really is a hard problem for federated learning.
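One common way to simulate a Non-I.I.D. split (this is only a sketch of one option, not the partition shipped in Dataset.py; it assumes it lives inside Dataset.py so that BatchGenerator is in scope and that labels are one-hot) is to sort the samples by label and deal each client a few contiguous shards, so each client sees only a couple of classes:
import numpy as np

def noniid_split(x_data, y_data, client_num, shards_per_client=2):
    """ Hypothetical label-sorted shard split, in the spirit of the FedAvg paper """
    labels = np.argmax(y_data, axis=1)      # recover class ids from one-hot labels
    order = np.argsort(labels)              # sample indices sorted by label
    shards = np.array_split(order, client_num * shards_per_client)
    np.random.shuffle(shards)               # deal the shards out randomly

    res = []
    for i in range(client_num):
        idx = np.concatenate(shards[i * shards_per_client:
                                    (i + 1) * shards_per_client])
        res.append(BatchGenerator(x_data[idx], y_data[idx]))
    return res
Calling this instead of splited_batch in Dataset.__init__ would give each client data from roughly shards_per_client classes.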
To pass model gradients between the Clients and the Server instead, you need to split the Client-side gradient computation and variable update into two steps and insert the interaction with the Server in between; what gets exchanged is the gradients. This may sound abstract: many of you probably use Optimizer.minimize (docs here) all the time without realizing it is the combination of two other functions, compute_gradients() and apply_gradients(). The former computes the gradients; the latter applies the gradients to the variables according to the learning rate. You take the gradients, hand them to the Server, and the Server returns a globally averaged gradient that is then applied to the model; a sketch follows. I have tried this and it works, but it does not reduce the amount of data transferred, and the single-machine simulation becomes considerably harder to implement.
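As a rough sketch of what this change looks like inside the model constructor (the placeholder list grad_phs and the op names are made up for illustration), the single optimizer.minimize(loss_op) line would be replaced by something like:
# instead of: train_op = optimizer.minimize(loss_op)
optimizer = AdamOptimizer(learning_rate=learning_rate)

# 1) node that computes the gradients: a list of (gradient, variable) pairs
grads_and_vars = optimizer.compute_gradients(loss_op)

# 2) placeholders through which the Server-averaged gradients are fed back in
grad_phs = [tf.placeholder(tf.float32, v.get_shape())
            for _, v in grads_and_vars]

# 3) node that applies the fed-in gradients to the matching variables
apply_op = optimizer.apply_gradients(
    [(ph, v) for ph, (_, v) in zip(grad_phs, grads_and_vars)])
A client would then sess.run() the gradient tensors, send them to the Server, and finally run apply_op with the averaged gradients fed into grad_phs.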
For a distributed deployment, put the Client-side code behind a web backend such as flask and let the Server communicate with the Clients over the network; a rough sketch follows. Note that when the Server issues requests, the sheer size of the parameters may cause problems, so consider switching to a non-HTTP protocol.
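For what it's worth, a very rough sketch of such a Client-side service might look like this (the /train route, the JSON layout, and serializing variables as nested lists are all illustrative choices, and quite inefficient):
from flask import Flask, request, jsonify
import numpy as np
from Client import Clients

app = Flask(__name__)
# one local client; in a real deployment each machine holds only its own data
clients = Clients(input_shape=[None, 32, 32, 3], num_classes=10,
                  learning_rate=0.0001, clients_num=1)

@app.route('/train', methods=['POST'])
def train():
    # receive the global model, train locally, send the updated variables back
    global_vars = [np.array(v, dtype=np.float32)
                   for v in request.get_json()['vars']]
    clients.set_global_vars(global_vars)
    clients.train_epoch(cid=0)
    return jsonify({'vars': [v.tolist() for v in clients.get_client_vars()]})

if __name__ == '__main__':
    app.run(port=5000)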
Complete code
There are four code files in total, and they should be placed in the same directory:
- Client.py: Client-side code, managing the model and the data
- Server.py: Server-side code, managing the Clients and the global model
- Dataset.py: defines how the data is organized
- Model.py: defines the TF model's computation graph
I have also uploaded them to Github: https://github.com/Zing22/tf-fed-demo. Their full code is pasted below; the comments there are only the few I wrote while coding, and the walkthrough above adds more of them. Running it is as simple as:
python Server.py
Client.py
import tensorflow as tf
import numpy as np
from collections import namedtuple
import math
from Model import AlexNet
from Dataset import Dataset
# The definition of fed model
FedModel = namedtuple('FedModel', 'X Y DROP_RATE train_op loss_op acc_op')
class Clients:
def __init__(self, input_shape, num_classes, learning_rate, clients_num):
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)
# Call the create function to build the computational graph of AlexNet
net = AlexNet(input_shape, num_classes, learning_rate, self.graph)
self.model = FedModel(*net)
# initialize
with self.graph.as_default():
self.sess.run(tf.global_variables_initializer())
# Load Cifar-10 dataset
# NOTE: len(self.dataset.train) == clients_num
self.dataset = Dataset(tf.keras.datasets.cifar10.load_data,
split=clients_num)
def run_test(self, num):
with self.graph.as_default():
batch_x, batch_y = self.dataset.test.next_batch(num)
feed_dict = {
self.model.X: batch_x,
self.model.Y: batch_y,
self.model.DROP_RATE: 0
}
return self.sess.run([self.model.acc_op, self.model.loss_op],
feed_dict=feed_dict)
def train_epoch(self, cid, batch_size=32, dropout_rate=0.5):
"""
Train one client with its own data for one epoch
cid: Client id
"""
dataset = self.dataset.train[cid]
with self.graph.as_default():
            for _ in range(math.ceil(dataset.size / batch_size)):
batch_x, batch_y = dataset.next_batch(batch_size)
feed_dict = {
self.model.X: batch_x,
self.model.Y: batch_y,
self.model.DROP_RATE: dropout_rate
}
self.sess.run(self.model.train_op, feed_dict=feed_dict)
def get_client_vars(self):
""" Return all of the variables list """
with self.graph.as_default():
client_vars = self.sess.run(tf.trainable_variables())
return client_vars
def set_global_vars(self, global_vars):
""" Assign all of the variables with global vars """
with self.graph.as_default():
all_vars = tf.trainable_variables()
for variable, value in zip(all_vars, global_vars):
variable.load(value, self.sess)
def choose_clients(self, ratio=1.0):
""" randomly choose some clients """
client_num = self.get_clients_num()
choose_num = math.floor(client_num * ratio)
return np.random.permutation(client_num)[:choose_num]
def get_clients_num(self):
return len(self.dataset.train)
Server.py
import tensorflow as tf
from tqdm import tqdm
from Client import Clients
def buildClients(num):
learning_rate = 0.0001
num_input = 32 # image shape: 32*32
num_input_channel = 3 # image channel: 3
num_classes = 10 # Cifar-10 total classes (0-9 digits)
#create Client and model
return Clients(input_shape=[None, num_input, num_input, num_input_channel],
num_classes=num_classes,
learning_rate=learning_rate,
clients_num=num)
def run_global_test(client, global_vars, test_num):
client.set_global_vars(global_vars)
acc, loss = client.run_test(test_num)
print("[epoch {}, {} inst] Testing ACC: {:.4f}, Loss: {:.4f}".format(
ep + 1, test_num, acc, loss))
#### SOME TRAINING PARAMS ####
CLIENT_NUMBER = 100
CLIENT_RATIO_PER_ROUND = 0.12
epoch = 360
#### CREATE CLIENT AND LOAD DATASET ####
client = buildClients(CLIENT_NUMBER)
#### BEGIN TRAINING ####
global_vars = client.get_client_vars()
for ep in range(epoch):
# We are going to sum up active clients' vars at each epoch
client_vars_sum = None
# Choose some clients that will train on this epoch
random_clients = client.choose_clients(CLIENT_RATIO_PER_ROUND)
# Train with these clients
for client_id in tqdm(random_clients, ascii=True):
# Restore global vars to client's model
client.set_global_vars(global_vars)
# train one client
client.train_epoch(cid=client_id)
# obtain current client's vars
current_client_vars = client.get_client_vars()
# sum it up
if client_vars_sum is None:
client_vars_sum = current_client_vars
else:
for cv, ccv in zip(client_vars_sum, current_client_vars):
cv += ccv
# obtain the avg vars as global vars
global_vars = []
for var in client_vars_sum:
global_vars.append(var / len(random_clients))
# run test on 1000 instances
run_global_test(client, global_vars, test_num=600)
#### FINAL TEST ####
run_global_test(client, global_vars, test_num=10000)
Dataset.py
import numpy as np
from tensorflow.keras.utils import to_categorical
class BatchGenerator:
def __init__(self, x, yy):
self.x = x
self.y = yy
self.size = len(x)
self.random_order = list(range(len(x)))
np.random.shuffle(self.random_order)
self.start = 0
return
def next_batch(self, batch_size):
if self.start + batch_size >= len(self.random_order):
overflow = (self.start + batch_size) - len(self.random_order)
perm0 = self.random_order[self.start:] +\
self.random_order[:overflow]
self.start = overflow
else:
perm0 = self.random_order[self.start:self.start + batch_size]
self.start += batch_size
assert len(perm0) == batch_size
return self.x[perm0], self.y[perm0]
# support slice
def __getitem__(self, val):
return self.x[val], self.y[val]
class Dataset(object):
def __init__(self, load_data_func, one_hot=True, split=0):
(x_train, y_train), (x_test, y_test) = load_data_func()
print("Dataset: train-%d, test-%d" % (len(x_train), len(x_test)))
if one_hot:
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
if split == 0:
self.train = BatchGenerator(x_train, y_train)
else:
self.train = self.splited_batch(x_train, y_train, split)
self.test = BatchGenerator(x_test, y_test)
def splited_batch(self, x_data, y_data, count):
res = []
l = len(x_data)
for i in range(0, l, l//count):
res.append(
BatchGenerator(x_data[i:i + l // count],
y_data[i:i + l // count]))
return res
Model.py
import tensorflow as tf
import numpy as np
from tensorflow.compat.v1.train import AdamOptimizer
#### Create tf model for Client ####
def AlexNet(input_shape, num_classes, learning_rate, graph):
"""
Construct the AlexNet model.
input_shape: The shape of input (`list` like)
num_classes: The number of output classes (`int`)
learning_rate: learning rate for optimizer (`float`)
graph: The tf computation graph (`tf.Graph`)
"""
with graph.as_default():
X = tf.placeholder(tf.float32, input_shape, name='X')
Y = tf.placeholder(tf.float32, [None, num_classes], name='Y')
DROP_RATE = tf.placeholder(tf.float32, name='drop_rate')
# 1st Layer: Conv (w ReLu) -> Lrn -> Pool
# conv1 = conv(X, 11, 11, 96, 4, 4, padding='VALID', name='conv1')
conv1 = conv(X, 11, 11, 96, 2, 2, name='conv1')
norm1 = lrn(conv1, 2, 2e-05, 0.75, name='norm1')
pool1 = max_pool(norm1, 3, 3, 2, 2, padding='VALID', name='pool1')
# 2nd Layer: Conv (w ReLu) -> Lrn -> Pool with 2 groups
conv2 = conv(pool1, 5, 5, 256, 1, 1, groups=2, name='conv2')
norm2 = lrn(conv2, 2, 2e-05, 0.75, name='norm2')
pool2 = max_pool(norm2, 3, 3, 2, 2, padding='VALID', name='pool2')
# 3rd Layer: Conv (w ReLu)
conv3 = conv(pool2, 3, 3, 384, 1, 1, name='conv3')
        # 4th Layer: Conv (w ReLu) split into two groups
conv4 = conv(conv3, 3, 3, 384, 1, 1, groups=2, name='conv4')
        # 5th Layer: Conv (w ReLu) -> Pool split into two groups
conv5 = conv(conv4, 3, 3, 256, 1, 1, groups=2, name='conv5')
pool5 = max_pool(conv5, 3, 3, 2, 2, padding='VALID', name='pool5')
# 6th Layer: Flatten -> FC (w ReLu) -> Dropout
# flattened = tf.reshape(pool5, [-1, 6*6*256])
# fc6 = fc(flattened, 6*6*256, 4096, name='fc6')
flattened = tf.reshape(pool5, [-1, 1 * 1 * 256])
fc6 = fc_layer(flattened, 1 * 1 * 256, 1024, name='fc6')
dropout6 = dropout(fc6, DROP_RATE)
# 7th Layer: FC (w ReLu) -> Dropout
# fc7 = fc(dropout6, 4096, 4096, name='fc7')
fc7 = fc_layer(dropout6, 1024, 2048, name='fc7')
dropout7 = dropout(fc7, DROP_RATE)
# 8th Layer: FC and return unscaled activations
logits = fc_layer(dropout7, 2048, num_classes, relu=False, name='fc8')
# loss and optimizer
loss_op = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits,
labels=Y))
optimizer = AdamOptimizer(
learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)
# Evaluate model
prediction = tf.nn.softmax(logits)
pred = tf.argmax(prediction, 1)
# accuracy
correct_pred = tf.equal(pred, tf.argmax(Y, 1))
accuracy = tf.reduce_mean(
tf.cast(correct_pred, tf.float32))
return X, Y, DROP_RATE, train_op, loss_op, accuracy
def conv(x, filter_height, filter_width, num_filters,
stride_y, stride_x, name, padding='SAME', groups=1):
"""Create a convolution layer.
Adapted from: https://github.com/ethereon/caffe-tensorflow
"""
# Get number of input channels
input_channels = int(x.get_shape()[-1])
# Create lambda function for the convolution
convolve = lambda i, k: tf.nn.conv2d(
i, k, strides=[1, stride_y, stride_x, 1], padding=padding)
with tf.variable_scope(name) as scope:
# Create tf variables for the weights and biases of the conv layer
weights = tf.get_variable('weights',
shape=[
filter_height, filter_width,
input_channels / groups, num_filters
])
biases = tf.get_variable('biases', shape=[num_filters])
if groups == 1:
conv = convolve(x, weights)
else:
# Split input and weights and convolve them separately
input_groups = tf.split(axis=3, num_or_size_splits=groups, value=x)
weight_groups = tf.split(axis=3,
num_or_size_splits=groups,
value=weights)
output_groups = [
convolve(i, k) for i, k in zip(input_groups, weight_groups)
]
# Concat the convolved output together again
conv = tf.concat(axis=3, values=output_groups)
# Add biases
bias = tf.reshape(tf.nn.bias_add(conv, biases), tf.shape(conv))
# Apply relu function
relu = tf.nn.relu(bias, name=scope.name)
return relu
def fc_layer(x, input_size, output_size, name, relu=True, k=20):
"""Create a fully connected layer."""
with tf.variable_scope(name) as scope:
# Create tf variables for the weights and biases.
W = tf.get_variable('weights', shape=[input_size, output_size])
b = tf.get_variable('biases', shape=[output_size])
# Matrix multiply weights and inputs and add biases.
z = tf.nn.bias_add(tf.matmul(x, W), b, name=scope.name)
if relu:
# Apply ReLu non linearity.
a = tf.nn.relu(z)
return a
else:
return z
def max_pool(x,
filter_height, filter_width,
stride_y, stride_x,
name, padding='SAME'):
"""Create a max pooling layer."""
return tf.nn.max_pool2d(x,
ksize=[1, filter_height, filter_width, 1],
strides=[1, stride_y, stride_x, 1],
padding=padding,
name=name)
def lrn(x, radius, alpha, beta, name, bias=1.0):
"""Create a local response normalization layer."""
return tf.nn.local_response_normalization(x,
depth_radius=radius,
alpha=alpha,
beta=beta,
bias=bias,
name=name)
def dropout(x, rate):
"""Create a dropout layer."""
return tf.nn.dropout(x, rate=rate)