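pandas Categorical Data

2. Ordering categorical variables

Questions and exercises

A categorical can be created in several ways: a Series with a category dtype, a DataFrame column with a specified dtype, the built-in Categorical type, and the cut function.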

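import numpy as np
import pandas as pd

# 1. A Series created with dtype="category"
print(pd.Series(["a", "b", "c", "a"], dtype="category"))

# 2. A DataFrame with one categorical column.
# The constructor arguments were lost in the source; this reconstruction
# is only chosen to match the dtypes printed below.
temp_df = pd.DataFrame({"a": pd.Series(["a", "b", "c", "a"], dtype="category"),
                        "b": list("abcd")})
print(temp_df.dtypes)

# 3. The built-in Categorical type
cat = pd.Categorical(["a", "b", "c", "a"], categories=['a', 'b', 'c'])
print(pd.Series(cat))

# 4. The cut function (the input is random, so the values differ between runs)
print(pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60], right=False,
             labels=['0-10', '10-30', '30-60']))
'''
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
a    category
b      object
dtype: object
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
['30-60', '0-10', '10-30', '30-60', '10-30']
Categories (3, object): ['0-10' < '10-30' < '30-60']
'''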

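A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered).

As the output above shows, a categorical variable created with the cut function is ordered by default.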
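The three parts can be inspected directly on a Categorical. This is a minimal sketch: the fixed input `[5, 15, 45]` is an assumption used in place of the random numbers so that the result is reproducible.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c"])
values = list(cat)                  # the element values
categories = list(cat.categories)   # the categories
is_ordered = cat.ordered            # False: a plain Categorical is unordered

# Fixed sample data (assumption) instead of random integers
binned = pd.cut([5, 15, 45], [0, 10, 30, 60], right=False,
                labels=["0-10", "10-30", "30-60"])
print(values, categories, is_ordered)
print(binned.ordered)  # True: cut produces an ordered categorical
```

If an unordered result is preferred, cut also accepts `ordered=False` together with `labels`.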

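The following shows how to get or modify these attributes.

The describe method

This method summarizes a categorical Series: the number of non-missing values, the number of distinct element values (not the number of categories), the most frequent element, and its frequency.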

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))
print(s.describe())
'''
count     4
unique    3
top       a
freq      2
dtype: object
'''
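To make the difference between "distinct element values" and "categories" concrete, the sketch below rebuilds the same Series and compares the two numbers. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=["a", "b", "c", "d"]))
summary = s.describe()

n_unique_values = summary["unique"]    # distinct values actually present
n_categories = len(s.cat.categories)   # declared categories, used or not
print(n_unique_values, n_categories)
```

The category 'd' is declared but never used, so it counts toward the categories and not toward unique.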

The categories and ordered attributes

These show the categories and whether they are ordered.

print(s.cat.categories)
print(s.cat.ordered)
'''
Index(['a', 'b', 'c', 'd'], dtype='object')
False
'''

A Series is usually converted to an ordered variable, which can be done with the as_ordered method.

s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
print(s)
'''
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'c' < 'd']
'''
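One reason to make a categorical ordered is that it then supports min, max, and comparisons. The following is a small sketch of that behaviour on the same data; the `raised` flag is only there to show that the unordered version rejects max.

```python
import pandas as pd

unordered = pd.Series(["a", "d", "c", "a"]).astype("category")
ordered = unordered.cat.as_ordered()

# Ordered categoricals support min/max and comparisons against a category
smallest, largest = ordered.min(), ordered.max()
above_a = (ordered > "a").tolist()
print(smallest, largest, above_a)

# The unordered version has no notion of "largest", so max raises TypeError
try:
    unordered.max()
    raised = False
except TypeError:
    raised = True
print(raised)
```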

To turn it back into an unordered variable, use as_unordered.

print(s.cat.as_unordered())
'''
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a', 'c', 'd']
'''

Use the ordered parameter of the set_categories method.

print(pd.Series(["a", "d", "c", "a"]).astype('category')
      .cat.set_categories(['a', 'c', 'd'], ordered=True))
'''
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'c' < 'd']
'''
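Note that set_categories replaces the whole category list, so it can also drop or add categories. A short sketch of both cases follows; the category 'e' is a made-up extra.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype("category")

# Dropping 'd' from the categories turns the matching element into NaN
trimmed = s.cat.set_categories(["a", "c"], ordered=True)
print(trimmed.isna().tolist())

# Adding an unused category ('e' is hypothetical) leaves the values untouched
extended = s.cat.set_categories(["a", "c", "d", "e"])
print(list(extended.cat.categories))
```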

Use the reorder_categories method.

s = pd.Series(["a", "d", "c", "a"]).astype('category')
print(s.cat.reorder_categories(['a', 'c', 'd'], ordered=True))
'''
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): ['a' < 'c' < 'd']
'''
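Unlike set_categories, reorder_categories only accepts the same set of categories in a new order. The sketch below uses an arbitrary custom order ('d' < 'c' < 'a', chosen just for illustration) and shows that a mismatched list is rejected.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype("category")

# A custom order: sorting then follows the category order, not the alphabet
custom = s.cat.reorder_categories(["d", "c", "a"], ordered=True)
sorted_values = custom.sort_values().tolist()
print(sorted_values)

# The new list must contain exactly the existing categories
try:
    s.cat.reorder_categories(["a", "c"], ordered=True)
    raised = False
except ValueError:
    raised = True
print(raised)
```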

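[Question 1] How is the union_categoricals method used? What does it do?

union_categoricals requires the categories of all inputs to have the same dtype. It concatenates two or more Categoricals into a single Categorical whose categories are the union of the inputs' categories.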
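A minimal usage sketch, with two small made-up Categoricals. The function lives in `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import union_categoricals

first = pd.Categorical(["b", "c"])
second = pd.Categorical(["a", "b"])

# By default the combined categories keep their order of appearance
combined = union_categoricals([first, second])
print(list(combined), list(combined.categories))

# sort_categories=True sorts the combined categories instead
combined_sorted = union_categoricals([first, second], sort_categories=True)
print(list(combined_sorted.categories))
```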

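[Question 2] When two Series are concatenated vertically with concat, is the result always a categorical variable? When is it not?

Not necessarily. The result is categorical only when both Series have the same categories (the same number and the same values); otherwise it falls back to object dtype.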

# Same categories in both Series: the result stays categorical
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s1 = pd.Series(["a", "d", "c", "d"]).astype('category')
print(pd.concat([s, s1]))
'''
0    a
1    d
2    c
3    a
0    a
1    d
2    c
3    d
dtype: category
Categories (3, object): ['a', 'c', 'd']
'''

# Different categories ('b' appears only in s1): the result is object
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s1 = pd.Series(["a", "d", "c", "b"]).astype('category')
print(pd.concat([s, s1]))
'''
0    a
1    d
2    c
3    a
0    a
1    d
2    c
3    b
dtype: object
'''
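If the category dtype should survive the concatenation, one option is to give both Series the same categories first. This is a sketch of one possible approach, not the only one; `all_categories` and `aligned` are illustrative names.

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype("category")
s1 = pd.Series(["a", "d", "c", "b"]).astype("category")

# Build the union of both category sets, then apply it to each Series
all_categories = s.cat.categories.union(s1.cat.categories)
aligned = [x.cat.set_categories(all_categories) for x in (s, s1)]

result = pd.concat(aligned)
print(result.dtype)                  # category
print(list(result.cat.categories))
```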

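[Question 3] When using the groupby or value_counts method, how do the statistics for a categorical variable differ from those for an ordinary variable?

The results are indexed by the categories instead of by the plain values: value_counts lists every category, including unused ones with a count of 0, and groupby can likewise report unobserved categories, depending on its observed parameter.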
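A small experiment helps with this question. The sketch below counts the same data once as a categorical and once as a plain object Series; the unused category 'c' is an assumption added to make the difference visible, and observed=False is passed explicitly because the default of that parameter differs between pandas versions.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# Categorical: every category is reported, including the unused 'c'
cat_counts = s.value_counts()
print(cat_counts)

# Plain object Series: only values that actually occur are reported
plain_counts = s.astype(object).value_counts()
print(plain_counts)

# groupby with observed=False also keeps the unused category
group_sizes = s.groupby(s, observed=False).size()
print(group_sizes)
```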

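Pitfall: when the Series is modified, the original Categorical changes along with it.

Pass copy=True when creating the Series, so that modifying the Series no longer changes the original Categorical.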

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
s = pd.Series(cat, name="cat", copy=True)
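To check that the copy really is independent, the Series can be modified and the original Categorical inspected afterwards. A minimal sketch, where the assignment of 10 to the first two elements is just an example change:

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
s = pd.Series(cat, name="cat", copy=True)

# Modify the Series only; 10 is already one of the categories
s.iloc[0:2] = 10

print(s.tolist())   # the Series has changed
print(list(cat))    # the original Categorical is untouched
```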
