Reading and Processing the Cora Dataset in Python

References: "Introduction to the Cora dataset + reading it with Python" and "The Cora graph dataset: processing it with Python for GCN tasks".

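The Cora dataset consists of machine learning papers. Each paper is assigned to exactly one of seven classes, named as follows in the label column: case_based, genetic_algorithms, neural_networks, probabilistic_methods, reinforcement_learning, rule_learning, and theory.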

import numpy as np
import pandas as pd

# Read the .content file (tab-separated, no header row)
cora_content = pd.read_csv('./data/cora/cora.content', sep='\t', header=None)

# Inspect the raw layout of the dataset
print(cora_content.shape)
print(cora_content.head(3))

(2708, 1435)
         0  1  2  3  4  5  6  7  8  9  ...  1425  \
0    31336  0  0  0  0  0  0  0  0  0  ...     0
1  1061127  0  0  0  0  0  0  0  0  0  ...     0
2  1106406  0  0  0  0  0  0  0  0  0  ...     0

   1426  1427  1428  1429  1430  1431  1432  1433                    1434
0     0     1     0     0     0     0     0     0         neural_networks
1     1     0     0     0     0     0     0     0           rule_learning
2     0     0     0     0     0     0     0     0  reinforcement_learning

[3 rows x 1435 columns]

# Read the .cites file (tab-separated, no header row)
cora_cites = pd.read_csv('./data/cora/cora.cites', sep='\t', header=None)

# Inspect the raw layout of the dataset
print(cora_cites.shape)
print(cora_cites.head(3))

(5429, 2)
    0       1
0  35    1033
1  35  103482
2  35  103515

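content_idx = list(cora_content.index)     # row indices as a list
paper_id = list(cora_content.iloc[:, 0])   # first column of .content: the paper IDs
mp = dict(zip(paper_id, content_idx))      # dictionary mapping paper ID -> row index

# Look up the row index that corresponds to a given paper ID
mp[31336]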
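The ID-to-index mapping can be tried out on a small in-memory table. The sketch below is illustrative only: the rows are toy data modelled on the first three rows printed earlier (with three feature columns instead of 1,433), and `toy_content` and `id_to_idx` are hypothetical names that do not appear in the original code.

```python
import pandas as pd

# Toy stand-in for cora.content: paper ID, a few binary features, class label.
toy_content = pd.DataFrame([
    [31336,   0, 1, 0, "neural_networks"],
    [1061127, 1, 0, 0, "rule_learning"],
    [1106406, 0, 0, 1, "reinforcement_learning"],
])

# Map each paper ID (column 0) to its row position.
id_to_idx = {pid: idx for idx, pid in enumerate(toy_content.iloc[:, 0])}

print(id_to_idx[31336])    # row position of paper 31336
print(id_to_idx[1106406])  # row position of paper 1106406
```

Using `enumerate` gives the same result as zipping the IDs with the default integer index, and it keeps working if the frame's index has been changed.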

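# Slice from column 1 up to, but not including, the last column
# (iloc ranges are half-open, so -1 leaves out the label column)
feature = cora_content.iloc[:, 1:-1]

# Inspect the feature matrix
feature.shape
feature.head(3)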

   1  2  3  4  5  6  7  8  9  10  ...  1424  1425  1426  1427  1428  1429  1430  1431  1432  1433
0  0  0  0  0  0  0  0  0  0   0  ...     0     0     0     1     0     0     0     0     0     0
1  0  0  0  0  0  0  0  0  0   0  ...     0     0     1     0     0     0     0     0     0     0
2  0  0  0  0  0  0  0  0  0   0  ...     0     0     0     0     0     0     0     0     0     0

3 rows × 1433 columns

label = cora_content.iloc[:, -1]   # take the last column (the class label)
label = pd.get_dummies(label)      # one-hot encode

# Inspect the one-hot representation of the labels
label.head(3)

   case_based  genetic_algorithms  neural_networks  probabilistic_methods  reinforcement_learning  rule_learning  theory
0           0                   0                1                      0                       0              0       0
1           0                   0                0                      0                       0              1       0
2           0                   0                0                      0                       1              0       0
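Many training loops expect integer class indices rather than one-hot rows, so it can help to see how the two relate. The following is a minimal sketch on toy labels, not part of the original post: `toy_labels`, `one_hot`, `class_names`, and `class_idx` are hypothetical names, and passing `dtype=int` is an assumption made so that the output is 0/1 rather than booleans on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy label column using three of the seven Cora class names.
toy_labels = pd.Series(["neural_networks", "rule_learning", "reinforcement_learning"])

# One-hot encode; dtype=int gives 0/1 instead of booleans on recent pandas.
one_hot = pd.get_dummies(toy_labels, dtype=int)
class_names = list(one_hot.columns)  # columns come out in sorted order

# Recover one integer class index per row from the one-hot matrix.
class_idx = one_hot.to_numpy().argmax(axis=1)

print(class_names)
print(class_idx)
```

Here `class_idx[k]` is the position in `class_names` of the label for row `k`, so the one-hot matrix and the integer vector carry the same information.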

mat_size = cora_content.shape[0]           # first dimension (2708) is the size of the adjacency matrix
adj_mat = np.zeros((mat_size, mat_size))   # start from an all-zero matrix
mat_size

# Build the adjacency matrix
for i, j in zip(cora_cites[0], cora_cites[1]):   # iterate over the citation pairs (u, v)
    x = mp[i]
    y = mp[j]
    adj_mat[x][y] = adj_mat[y][x] = 1            # treat each citation as an undirected edge

sum(sum(adj_mat))

10556.0
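The total of 10556.0 is smaller than 2 × 5429 = 10858, which suggests that some rows of cora.cites describe an edge that is already recorded, for example the same pair in reverse order. The sketch below reproduces that effect on made-up citation pairs and fills the matrix with NumPy index arrays instead of a Python loop. All names (`toy_cites`, `toy_ids`, `adj`, `n_undirected`) are hypothetical.

```python
import numpy as np

# Toy citation pairs (paper IDs); the last pair repeats the first edge in reverse.
toy_cites = [(35, 1033), (35, 103482), (1033, 103482), (1033, 35)]
toy_ids = [35, 1033, 103482]  # node order defines row/column order
id_to_idx = {pid: i for i, pid in enumerate(toy_ids)}

rows = np.array([id_to_idx[u] for u, _ in toy_cites])
cols = np.array([id_to_idx[v] for _, v in toy_cites])

adj = np.zeros((len(toy_ids), len(toy_ids)))
adj[rows, cols] = 1  # mark u -> v
adj[cols, rows] = 1  # mirror to v -> u so the matrix is symmetric

n_undirected = int(adj.sum() // 2)  # each undirected edge is stored twice
print(adj)
print(n_undirected)
```

Four input rows yield only three undirected edges here, because assigning 1 to an entry that is already 1 changes nothing.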

# Convert everything to NumPy arrays
feature = np.array(feature)
label = np.array(label)
adj_mat = np.array(adj_mat)
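For reuse, the steps above can be collected into a single function. This is a sketch under stated assumptions rather than the original author's code: `load_cora` is a hypothetical helper name, `dtype=int` is added to `get_dummies` so the labels are 0/1, and the demonstration uses made-up in-memory rows in the same tab-separated layout instead of the real files.

```python
import io

import numpy as np
import pandas as pd


def load_cora(content_file, cites_file):
    """Return (feature, label, adj_mat) as NumPy arrays.

    Both arguments may be paths or file-like objects with tab-separated rows.
    """
    content = pd.read_csv(content_file, sep="\t", header=None)
    cites = pd.read_csv(cites_file, sep="\t", header=None)

    # Paper ID -> row position, features, and one-hot labels.
    id_to_idx = {pid: i for i, pid in enumerate(content.iloc[:, 0])}
    feature = content.iloc[:, 1:-1].to_numpy()
    label = pd.get_dummies(content.iloc[:, -1], dtype=int).to_numpy()

    # Symmetric adjacency matrix built from the citation pairs.
    n = content.shape[0]
    adj_mat = np.zeros((n, n))
    for u, v in zip(cites[0], cites[1]):
        x, y = id_to_idx[u], id_to_idx[v]
        adj_mat[x, y] = adj_mat[y, x] = 1
    return feature, label, adj_mat


# Toy files in the same layout as cora.content / cora.cites (made-up rows).
toy_content = io.StringIO(
    "10\t0\t1\tneural_networks\n20\t1\t0\ttheory\n30\t1\t1\ttheory\n"
)
toy_cites = io.StringIO("10\t20\n20\t30\n")

feature, label, adj_mat = load_cora(toy_content, toy_cites)
print(feature.shape, label.shape, adj_mat.shape)
```

With the real data, the call would presumably be `load_cora('./data/cora/cora.content', './data/cora/cora.cites')`, using the paths from the code above.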
