Numerical Computing: GPU-Accelerated Algorithms

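A typical CUDA program runs through the following steps:

Allocate host memory and initialize the data;

Allocate device memory and copy the data from the host to the device;

Launch the CUDA kernel to perform the required computation on the device;

Copy the results from the device back to the host;

Free the memory allocated on the device and on the host.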

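A kernel's threads are organized into a hierarchy of grid, blocks, and threads. Because the basic execution unit of an SM (streaming multiprocessor) is a warp of 32 threads, the block size should generally be set to a multiple of 32.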
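As a minimal sketch of the sizing rule above, the block size can be rounded up to a multiple of the 32-thread warp and the number of blocks obtained by ceiling division. The helper names `round_up_to_warp` and `grid_size` are hypothetical and are not part of PyCUDA.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def round_up_to_warp(threads):
    """Round a requested block size up to the next multiple of WARP_SIZE."""
    return ((threads + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE


def grid_size(n, threads_per_block):
    """Smallest number of blocks such that blocks * threads_per_block >= n."""
    return (n + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    block = round_up_to_warp(250)          # a 250-thread request becomes 256
    blocks = grid_size(1_000_000, block)   # enough blocks to cover 1,000,000 elements
    print(block, blocks)
```

Because the last block usually extends past the end of the array, the kernel still needs its own bounds check on the thread index.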

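In earlier work on real-time and offline processing with PySpark + GPU, GPU performance (run time) did not improve, or improved only marginally. For that reason, this study looks only at accelerating numerical computation with CUDA programs written from Python (Spark is not considered), and goes on to examine GPU performance and the scenarios where a GPU is worth using. (The project covered a lot of GPU-related ground; only a small part of it can be published for now.)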

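(1) The test data are NumPy arrays a and b generated inside the Python program.

 Each array has 100 million elements.

 Each array has 1 billion elements.

 Each array has 10 billion elements.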
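To put these lengths in perspective, here is a rough sketch of the memory the two arrays need, assuming 4-byte float32 elements as in the test script. The helper name `array_gib` is hypothetical, and whether a given size fits in device memory depends on the GPU in use.

```python
BYTES_PER_FLOAT32 = 4


def array_gib(length, n_arrays=2):
    """Memory in GiB needed for n_arrays float32 arrays of the given length."""
    return length * BYTES_PER_FLOAT32 * n_arrays / 2**30


if __name__ == "__main__":
    # The three array lengths listed above.
    for length in (10**8, 10**9, 10**10):
        print("%d elements: %.2f GiB for a and b" % (length, array_gib(length)))
```

The same amount is needed again on the host, plus a third array if a copy is kept for the CPU comparison.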

(2) The test logic is element-wise arithmetic on the arrays.

import pycuda.autoinit  # creates the CUDA context on import
import pycuda.driver as drv
import numpy as np
from timeit import default_timer as timer
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void func(float *a, float *b, size_t n)
{
    // Global thread index; threads past the end of the arrays do nothing.
    // (The index computation and bounds check were missing and are reconstructed here.)
    const size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
    {
        return;
    }
    float temp_a = a[i];
    float temp_b = b[i];
    a[i] = (temp_a * 10 + 2 ) * ((temp_b + 2) * 10 - 5 ) * 5;
    // a[i] = a[i] + b[i];
}""")
func = mod.get_function("func")


def test(n):
    # n = 1024 * 1024 * 90  # 1024 * 1024 floats = 4 MB
    print("n = %d" % n)
    a = np.random.randn(n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)
    # copy a to aa, because the kernel overwrites a in place
    aa = np.empty_like(a)
    aa[:] = a
    # gpu run
    nthreads = 256
    nblocks = (n + nthreads - 1) // nthreads  # ceiling division
    start = timer()
    # drv.In / drv.InOut copy the arrays to and from the device,
    # so the measured time includes the host<->device transfers.
    # size_t is 8 bytes on 64-bit platforms, so n is passed as a 64-bit integer.
    func(
        drv.InOut(a), drv.In(b), np.uint64(n),
        block=(nthreads, 1, 1), grid=(nblocks, 1))
    run_time = timer() - start
    print("gpu run time %f seconds " % run_time)
    # cpu run
    start = timer()
    aa = (aa * 10 + 2) * ((b + 2) * 10 - 5) * 5
    run_time = timer() - start
    print("cpu run time %f seconds " % run_time)
    # check result
    r = a - aa
    # print(min(r), max(aa))


def main():
    for n in range(1, 10):
        n = 1024 * 1024 * (n * 10)
        print("------------%d---------------" % n)
        test(n)


if __name__ == '__main__':
    main()
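The script computes the difference between the GPU and CPU results but never inspects it. One possible way to make that check explicit, sketched here under assumptions, is to compare the float32 output against a float64 NumPy reference with `np.allclose`. The helper names `reference` and `results_match` and the tolerances are illustrative choices, not from the original post, and a float32 NumPy computation stands in for the GPU output so the snippet runs without a device.

```python
import numpy as np


def reference(a, b):
    """Float64 reference for the formula used in the kernel."""
    a64 = a.astype(np.float64)
    b64 = b.astype(np.float64)
    return (a64 * 10 + 2) * ((b64 + 2) * 10 - 5) * 5


def results_match(result, expected, rtol=1e-4, atol=1e-3):
    """True if every element agrees within the (assumed) float32 tolerances."""
    return bool(np.allclose(result, expected, rtol=rtol, atol=atol))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal(1000).astype(np.float32)
    b = rng.standard_normal(1000).astype(np.float32)
    # Stand-in for the array returned by the GPU kernel.
    result = (a * 10 + 2) * ((b + 2) * 10 - 5) * 5
    print("results match:", results_match(result, reference(a, b)))
```

An absolute tolerance is included because the first factor can be close to zero, where a purely relative comparison would be too strict for float32.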

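4.1.1. Resource list

4.1.2. Screenshot list

Conventional hardware overview

4.2.1. Resource list

4.2.2. Screenshot list

Conventional resource overview

GPU hardware overview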

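Under the same run conditions for both CPU and GPU, as the data volume (array length) grows, CPU resource consumption stays essentially flat, while GPU utilization keeps rising.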

The GPU run is 4 to 5 times faster than the CPU run.
