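Machine Learning: Pandas Basics

pandas was created for data analysis tasks. It incorporates a large number of libraries and several standard data models, and it provides the tools needed to work efficiently with large datasets.

The usual way to import the pandas package in Python is:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np  # used by the later examples

Data structures in pandas: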

in : obj = Series([4, 7, -5, 3])
in : obj
out:
0    4
1    7
2   -5
3    3

When a Series is displayed interactively, its string representation shows the index on the left and the values on the right.
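
The two halves of that display can also be read directly from the object. The following is a minimal sketch, assuming the same four values as above; the variable names `index_labels` and `values` are only illustrative.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# With no index given, pandas numbers the entries 0..n-1.
index_labels = list(obj.index)

# The values are stored as a NumPy array; tolist() gives plain Python ints.
values = obj.values.tolist()

print(index_labels)
print(values)
```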

We can also supply our own index, which makes a Series behave much like a dictionary.

in : obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
in : obj2
out:
d    4
b    7
a   -5
c    3
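
To illustrate the dictionary-like behaviour, here is a small sketch that reuses the labelled Series above. The variable names are illustrative and not part of the original text.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

value_a = obj2['a']          # look up a single value by its label
subset = obj2[['c', 'a']]    # select several labels, in the order given
has_b = 'b' in obj2          # membership is tested against the index
has_z = 'z' in obj2

print(value_a, subset.tolist(), has_b, has_z)
```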

in [7]: sdata = {'ohio': 35000, 'texas': 71000, 'oregon': 16000, 'utah': 5000}
in [8]: obj3 = Series(sdata)
in [9]: obj3
out[9]:
ohio      35000
texas     71000
oregon    16000
utah       5000
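
When a Series is built from a dict, an explicit index can also be passed to choose and order the keys. The sketch below assumes the four key/value pairs shown in the output above; the `states` list is a hypothetical addition for illustration.

```python
import pandas as pd

sdata = {'ohio': 35000, 'texas': 71000, 'oregon': 16000, 'utah': 5000}
states = ['california', 'ohio', 'oregon', 'texas']  # hypothetical label list

# Labels missing from the dict become NaN; dict keys not listed are left out.
obj4 = pd.Series(sdata, index=states)
missing = obj4.isnull()

print(obj4)
```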

dictionary = 
frame = DataFrame(dictionary)

Changing the row labels

frame = DataFrame(dictionary, index=['one', 'two', 'three', 'four', 'five'])

Adding a new column

frame['add'] = [0, 0, 0, 0, 0]  # frame['column name'] = [values]
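
The three steps above (build from a dict, set row labels, add a column) can be sketched end to end as follows. The contents of `dictionary` are not shown in the text, so the data here is purely hypothetical, as is the extra `flag` column.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list holds one value per row.
dictionary = {
    'state': ['ohio', 'ohio', 'ohio', 'nevada', 'nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
}
frame = pd.DataFrame(dictionary, index=['one', 'two', 'three', 'four', 'five'])

frame['add'] = [0, 0, 0, 0, 0]   # a list must have one entry per row
frame['flag'] = 1                # a scalar is repeated for every row

print(frame)
```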

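Reading a CSV file

# If the dataset contains Chinese text saved in GBK, pass encoding='gbk' to avoid garbled characters. The same applies when exporting the data later.
df = pd.read_csv('uk_rain_2014.csv', header=0, encoding='gbk')
# The header keyword tells pandas which row holds the column names. If the file has no column names, set it to None.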
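
The effect of `header` and `encoding` can be tried without a file on disk. This is a sketch under stated assumptions: the bytes below are hypothetical sample data encoded as GBK, standing in for a real CSV file.

```python
import io
import pandas as pd

# Hypothetical file contents, encoded as GBK to mimic a Chinese-language CSV.
raw = "城市,降雨量\n伦敦,120\n曼彻斯特,95\n".encode("gbk")

# header=0: the first line supplies the column names.
df_named = pd.read_csv(io.BytesIO(raw), header=0, encoding="gbk")

# header=None: every line is data and the columns are numbered 0, 1, ...
df_unnamed = pd.read_csv(io.BytesIO(raw), header=None, encoding="gbk")

print(df_named)
print(df_unnamed)
```

With `header=None` the line of names is read as an ordinary data row, which is why the second frame has one more row than the first.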

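Once the data has been loaded into pandas, how do we inspect it?

# View the first five rows
df.head(5)
# View the last five rows
df.tail(5)
# Get the total number of rows
len(df)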
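
A quick way to see what these calls return is to run them on a small frame. The ten-row frame below is hypothetical; `shape` is shown as a related attribute that gives rows and columns together.

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})  # hypothetical 10-row frame

first_three = df.head(3)     # first 3 rows
last_two = df.tail(2)        # last 2 rows
n_rows, n_cols = df.shape    # (rows, columns)

print(first_three)
print(last_two)
print(n_rows, n_cols)
```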

Renaming the columns

df.columns = ['student_id', 'class', 'gender', 'age', 'major', 'phone', 'email']
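
Assigning to `df.columns` replaces every name at once, so the list must have one entry per column. When only some names need to change, `rename` is an alternative. A minimal sketch with hypothetical data and column names:

```python
import pandas as pd

df = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['x', 'y'])  # hypothetical data

# rename() changes only the listed columns and returns a new DataFrame.
renamed = df.rename(columns={'x': 'student_id'})

# Assigning to .columns replaces all names in place.
df.columns = ['id', 'label']

print(list(renamed.columns))
print(list(df.columns))
```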

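# Initialize
in: obj = Series(range(4), index=['d','a','b','c'])
# Sort by index
in: obj.sort_index()
out:
a    1
b    2
c    3
d    0
# Sort by value (sort_values() replaces the sort() and order() methods, which have been removed from pandas)
in: obj.sort_values()
out:
d    0
a    1
b    2
c    3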
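
In the example above the values are already in ascending order, so sorting by value leaves the order unchanged. The sketch below uses hypothetical, unordered values so that the two kinds of sort give visibly different results.

```python
import pandas as pd

obj = pd.Series([3, 0, 2, 1], index=['d', 'a', 'b', 'c'])  # hypothetical values

by_label = obj.sort_index()                        # order by index label
by_value = obj.sort_values()                       # order by value, ascending
by_value_desc = obj.sort_values(ascending=False)   # order by value, descending

print(by_label)
print(by_value)
print(by_value_desc)
```

Both methods return a new Series and leave `obj` untouched.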

# Initialize
in: frame = DataFrame(np.arange(8).reshape((2,4)), index=['three', 'one'], columns=['d','a','b','c'])
in: frame
out:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
# Sort by index (the row labels)
in: frame.sort_index()
out:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
# Sort by column labels (the header row), in descending order
in[89]: frame.sort_index(axis=1, ascending=False)
out[89]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
# Sort by values
in[95]: frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
in[97]: frame.sort_values(by='b')
out[97]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
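
`sort_values` also accepts a list of columns, in which case later columns break ties in earlier ones. A short sketch, assuming the same two-column frame that produces the output above:

```python
import pandas as pd

frame = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})

# Sort by 'a' first, then by 'b' within each group of equal 'a'.
by_two_keys = frame.sort_values(by=['a', 'b'])

# Sort by 'b' from largest to smallest.
by_b_desc = frame.sort_values(by='b', ascending=False)

print(by_two_keys)
print(by_b_desc)
```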

Dropping means removing elements from a Series, or a row (or column) from a DataFrame. This is done with the object's drop(labels, axis=0) method.

in: ser = Series([4.5, 7.2, -5.3, 3.6], index=['d','b','a','c'])
in: ser.drop('c')
out:
d    4.5
b    7.2
a   -5.3

in[17]: df = DataFrame(np.arange(9).reshape(3,3), index=['a','c','d'], columns=['oh','te','ca'])
in[18]: df
out[18]:
   oh  te  ca
a   0   1   2
c   3   4   5
d   6   7   8
in[19]: df.drop('a')
out[19]:
   oh  te  ca
c   3   4   5
d   6   7   8
in[20]: df.drop(['oh','te'], axis=1)
out[20]:
   ca
a   2
c   5
d   8
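
Two details of `drop` are easy to miss in the transcript: it returns a new object rather than modifying the original, and columns can also be named with the `columns=` keyword instead of `axis=1`. A minimal sketch using the same 3x3 frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'], columns=['oh', 'te', 'ca'])

rows_dropped = df.drop(['a', 'd'])       # axis=0 is the default: row labels
cols_dropped = df.drop(columns=['oh'])   # same result as df.drop(['oh'], axis=1)

print(rows_dropped)
print(cols_dropped)
print(df.shape)   # the original frame is unchanged
```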

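Note:

[Figure: what axis=0 and axis=1 mean in a DataFrame]

In drop, axis=0 targets row labels and axis=1 targets column labels, as in the two calls above.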
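
For reductions the same two axes apply, but they name the direction that gets collapsed. The sketch below uses `sum` on the 3x3 frame from the drop example; it is an illustration only and not taken from the figure.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'], columns=['oh', 'te', 'ca'])

# axis=0 collapses the rows: one total per column.
per_column = df.sum(axis=0)

# axis=1 collapses the columns: one total per row.
per_row = df.sum(axis=1)

print(per_column)
print(per_row)
```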

duplicated()

The DataFrame duplicated method returns a boolean Series indicating whether each row is a duplicate of an earlier row. It is used as follows:

in[1]: df = DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
in[2]: df
out[2]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
in[3]: df.duplicated()
out[3]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
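
By default `duplicated` compares whole rows and flags every occurrence after the first. The `subset` and `keep` parameters change both choices. A sketch, assuming the k1/k2 frame shown in the output above:

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                   'k2': [1, 1, 2, 3, 3, 4, 4]})

# Compare only the k1 column when deciding what counts as a duplicate.
dup_k1 = df.duplicated(subset=['k1'])

# Treat the last occurrence as the original and flag the earlier ones.
dup_keep_last = df.duplicated(keep='last')

print(dup_k1.tolist())
print(dup_keep_last.tolist())
```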

drop_duplicates()

drop_duplicates() removes duplicate rows. It is used as follows:

in[4]: df.drop_duplicates()
out[4]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
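
Note that the surviving rows keep their original index labels (0, 2, 3, 5). The sketch below shows one way to renumber them, and how `subset` narrows the comparison to chosen columns; it assumes the same k1/k2 frame as above.

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                   'k2': [1, 1, 2, 3, 3, 4, 4]})

# Keep the first row for each distinct k1 value.
unique_k1 = df.drop_duplicates(subset=['k1'])

# Keep the last copy of each full row, then renumber the index from 0.
keep_last = df.drop_duplicates(keep='last').reset_index(drop=True)

print(unique_k1)
print(keep_last)
```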

Converting a Series to a DataFrame:

in[1]: data = Series(np.random.randn(10), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
in[2]: data
out[2]:
a  1    0.169239
   2    0.689271
   3    0.879309
b  1   -0.699176
   2    0.260446
   3   -0.321751
c  1    0.893105
   2    0.757505
d  2   -1.223344
   3   -0.802812
in[5]: data.unstack()
out[5]:
          1         2         3
a  0.169239  0.689271  0.879309
b -0.699176  0.260446 -0.321751
c  0.893105  0.757505       NaN
d       NaN -1.223344 -0.802812
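
Because the transcript uses random numbers, its exact values cannot be reproduced. The sketch below swaps in the fixed values 0.0 to 9.0 (an assumption made only for repeatability) to show which index level becomes the columns and where the NaN entries come from.

```python
import numpy as np
import pandas as pd

outer = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd']
inner = [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]

# Fixed values instead of np.random.randn(10), so the result is repeatable.
data = pd.Series(np.arange(10.0), index=[outer, inner])

# By default the innermost index level is moved into the columns.
wide = data.unstack()

# Passing level=0 moves the outer level into the columns instead.
wide_outer = data.unstack(level=0)

print(wide)
print(wide_outer)
```

Combinations that do not occur in the Series, such as ('c', 3) and ('d', 1), are filled with NaN.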
