TensorFlow Distributed Training: A Detailed Example

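Model parallelism (in-graph replication): the model is deployed across many devices, for example the GPUs of one or more machines, and different GPUs are responsible for different parts of the network.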
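As a rough illustration of the idea above, the following toy sketch places each part of a two-stage model on a different "device" and passes activations from one to the next. The `Device` class, the device names and the weights are all hypothetical stand-ins; no real GPU placement happens here.

```python
# Toy model parallelism: each "device" owns one part of the model.
# Device, its names and its weights are hypothetical stand-ins.
class Device:
    def __init__(self, name, weight, bias):
        self.name = name
        self.weight = weight
        self.bias = bias

    def forward(self, inputs):
        # This device only computes its own layer: y = weight * x + bias.
        return [self.weight * x + self.bias for x in inputs]


def model_parallel_forward(devices, inputs):
    # Activations flow from one device to the next, layer by layer.
    for device in devices:
        inputs = device.forward(inputs)
    return inputs


devices = [Device("gpu:0", 2.0, 1.0), Device("gpu:1", 0.5, -1.0)]
output = model_parallel_forward(devices, [1.0, 2.0, 3.0])
print(output)
```

In a real setup the hand-off between devices is a memory copy or a network transfer, which is why the way the model is partitioned matters for performance.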

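Data parallelism (between-graph replication): each worker holds a complete copy of the model but is assigned different data. The workers train independently and their results are then combined. This is currently the mainstream approach.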

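Synchronous update: wait until every GPU has finished computing its gradients, compute the new parameters from the mean gradient, and start the next round only after all GPUs have synchronized the new values. The loss decreases fairly steadily, but every step has to wait for the slowest compute node.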
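A minimal sketch of data parallelism with synchronous updates, using plain Python instead of TensorFlow. It assumes a one-parameter model y = w * x and two workers; the function names, the data and the learning rate are illustrative choices, not part of the original project.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w, shards, lr):
    """One synchronous step: every worker computes, then one averaged update."""
    grads = [local_gradient(w, shard) for shard in shards]  # all workers see the same w
    mean_grad = sum(grads) / len(grads)                      # combine the results
    return w - lr * mean_grad                                # new value shared by all


# Data generated from y = 3x, split between two workers.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = sync_step(w, shards, lr=0.01)
print(w)
```

Because every worker starts each step from the same parameter value, the update is equivalent to one step on the averaged gradient, which is what keeps the loss curve smooth.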

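Asynchronous update: each GPU computes and updates the parameters on its own. Compute resources are fully used, but the loss decreases less steadily and gradients can become stale, because a gradient may be applied to parameters that other workers have already changed.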
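The stale-gradient effect can be sketched with a small simulation. This is a simplified model, not how TensorFlow schedules workers: it assumes workers take turns and that each gradient is computed from a parameter value that is a fixed number of updates old (the hypothetical `staleness` argument).

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def async_training(shards, lr, steps, staleness):
    """Apply gradients computed from a parameter value `staleness` updates old."""
    history = [0.0]  # every value the shared parameter has held so far
    for step in range(steps):
        shard = shards[step % len(shards)]  # workers take turns (an assumption)
        stale_w = history[max(0, len(history) - 1 - staleness)]
        grad = local_gradient(stale_w, shard)
        # The update is applied to the latest value, not to the one that was read.
        history.append(history[-1] - lr * grad)
    return history


data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[:2], data[2:]]

fresh = async_training(shards, lr=0.01, steps=100, staleness=0)
stale = async_training(shards, lr=0.01, steps=100, staleness=2)
print(fresh[:4])
print(stale[:4])
```

Comparing the first few values of the two runs shows the trajectories diverging as soon as a worker applies a gradient based on an outdated parameter.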

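1. gRPC (Google Remote Procedure Call): distributed TensorFlow is built on the gRPC communication framework. It includes one master, which creates the session, and multiple workers, which execute the tasks in the computation graph.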

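That is, a cluster can be divided into multiple jobs. A job is one particular kind of task group, such as parameter server (ps) or worker, and each job contains multiple tasks. In most cases one machine runs only one task.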
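The cluster, job and task hierarchy can be pictured as a plain dictionary that maps each job name to the addresses of its tasks. The sketch below uses placeholder host names and a hypothetical helper, `list_tasks`, to enumerate every task in the `/job:<name>/task:<index>` form that TensorFlow uses in device names.

```python
# Hypothetical cluster: host names and ports are placeholders.
cluster = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}


def list_tasks(cluster):
    """Return (job_name, task_index, address) for every task in the cluster."""
    return [
        (job, index, address)
        for job, addresses in cluster.items()
        for index, address in enumerate(addresses)
    ]


tasks = list_tasks(cluster)
for job, index, address in tasks:
    print(f"/job:{job}/task:{index} -> {address}")
```

A task is identified by its job name plus its index within that job, which is why both values have to be passed to each process at start-up.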

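2. Ring allreduce architecture

In the ps architecture, when there are many workers the network bandwidth of the ps nodes becomes the system bottleneck.

In the ring allreduce architecture every device is a worker, and there is no central node that aggregates the gradients computed by all workers. All devices sit on a logical ring: each device receives data from its upstream neighbor and sends data to its downstream neighbor.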
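The ring described above can be simulated in a few lines of plain Python. This sketch assumes the common two-phase formulation of ring allreduce (scatter-reduce followed by allgather) and represents each worker's gradient as a list of n chunks, where n is the number of workers; the function name and data layout are illustrative only.

```python
def ring_allreduce(worker_chunks):
    """Sum-allreduce over a logical ring.

    worker_chunks[i][c] is chunk c (a list of floats) held by worker i.
    Each worker must hold exactly n chunks, where n is the number of workers.
    """
    n = len(worker_chunks)
    data = [[list(chunk) for chunk in chunks] for chunks in worker_chunks]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the full
    # sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # All sends in a step happen at the same time, so snapshot them first.
        messages = [(i, (i - step) % n, list(data[i][(i - step) % n]))
                    for i in range(n)]
        for sender, c, payload in messages:
            receiver = (sender + 1) % n  # downstream neighbor
            data[receiver][c] = [a + b for a, b in zip(data[receiver][c], payload)]

    # Phase 2, allgather: the completed chunks travel once around the ring.
    for step in range(n - 1):
        messages = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n]))
                    for i in range(n)]
        for sender, c, payload in messages:
            data[(sender + 1) % n][c] = payload

    return data


grads = [
    [[1.0], [2.0], [3.0]],        # worker 0
    [[10.0], [20.0], [30.0]],     # worker 1
    [[100.0], [200.0], [300.0]],  # worker 2
]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker only ever talks to its two neighbors and sends one chunk per step, so no single node has to receive the gradients of all workers.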

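First, define a cluster made up of the machines taking part in the distributed computation. A cluster usually has several workers, and one of them must be designated as the chief node. The chief does some extra work, such as exporting the model. In a ps-based distributed setup, the ps nodes must be defined as well.

For example: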

cluster =

Then set the TF_CONFIG environment variable:

# Example of non-chief node:
os.environ['TF_CONFIG'] = json.dumps(
})
# Example of chief node:
os.environ['TF_CONFIG'] = json.dumps(
})
# Example of evaluator node (evaluator is not part of training cluster)
os.environ['TF_CONFIG'] = json.dumps(
})
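For reference, TF_CONFIG is a JSON string with a "cluster" entry describing all hosts and a "task" entry naming the role of the current process. The sketch below builds such a string with the standard library only. The host names, the port and the `make_tf_config` helper are assumptions made for illustration and are not taken from the original article.

```python
import json
import os

# Hypothetical hosts; replace them with the real addresses of your machines.
cluster = {
    "chief": ["host0.example.com:2222"],
    "worker": ["host1.example.com:2222", "host2.example.com:2222"],
    "ps": ["host3.example.com:2222"],
}


def make_tf_config(cluster, task_type, task_index):
    """Serialize the cluster description plus this process's own task."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Each process sets TF_CONFIG for its own role before training starts.
worker_config = make_tf_config(cluster, "worker", 1)        # a non-chief worker
chief_config = make_tf_config(cluster, "chief", 0)          # the chief
evaluator_config = make_tf_config(cluster, "evaluator", 0)  # not listed in the cluster

os.environ["TF_CONFIG"] = worker_config
print(os.environ["TF_CONFIG"])
```

Every process receives the same "cluster" value; only the "task" entry differs from one machine to the next.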

This project builds its distributed architecture with tf.train.ClusterSpec and tf.train.Server, which differs from how the TensorFlow Estimator API defines it.

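import tensorflow as tf  # TensorFlow 1.x API

flags = tf.app.flags
FLAGS = flags.FLAGS

# Job name
flags.DEFINE_string('job_name', None, 'job name: worker or ps')
# Index of the task within the job
flags.DEFINE_integer('task_index', None, 'index of task within the job')
# Parameter server node
flags.DEFINE_string('ps_hosts', 'localhost:22', 'comma-separated ps hosts')
# Two worker nodes
flags.DEFINE_string('worker_hosts', 'localhost:23,localhost:24', 'comma-separated worker hosts')
# Define the set of tasks (assumed here to be built from the host flags above)
cluster = tf.train.ClusterSpec({'ps': FLAGS.ps_hosts.split(','),
                                'worker': FLAGS.worker_hosts.split(',')})
# TF server and session
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
# is_chief, init_op and global_step come from the model-building code, which is not shown here
sv = tf.train.Supervisor(is_chief=is_chief, logdir='logs', init_op=init_op, recovery_wait_secs=1,
                         global_step=global_step)
sess = sv.prepare_or_wait_for_session(server.target)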

Run on the ps node:

python distributed.py --job_name=ps --task_index=0

Run on the worker1 node:

python distributed.py --job_name=worker --task_index=0

Run on the worker2 node:

python distributed.py --job_name=worker --task_index=1
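The three commands above follow directly from the host lists: one process per task, identified by job name and index. As a small sketch, the hypothetical helper below derives the command lines from the same comma-separated host strings used as flag defaults in the script.

```python
def launch_commands(ps_hosts, worker_hosts, script="distributed.py"):
    """Build one launch command per task from comma-separated host lists."""
    jobs = {"ps": ps_hosts.split(","), "worker": worker_hosts.split(",")}
    return [
        f"python {script} --job_name={job} --task_index={index}"
        for job, hosts in jobs.items()
        for index, _ in enumerate(hosts)
    ]


commands = launch_commands("localhost:22", "localhost:23,localhost:24")
for command in commands:
    print(command)
```

Adding a worker then only means appending one more address to the worker host list and starting one more process with the next task index.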
