Summary of Advanced pandas Operations

1. Quantiles of a column in pandas

# inspect the quantiles of a column
import numpy as np
import pandas as pd
# set the column type
my_df['col'] = my_df['col'].astype(np.float64)
# compute 4 quantile bins: quartiles
bins_col = pd.qcut(my_df['col'], 4)
# integer code (0..3) of the quartile each row falls into
bins_col_label = bins_col.cat.codes
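
As a rough illustration of the quartile binning above, here is a minimal sketch on made-up data. The sample values are assumptions, not part of the original notes; `my_df` and `col` simply follow the naming used in the snippet.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({"col": [1, 2, 3, 4, 5, 6, 7, 8]})
my_df["col"] = my_df["col"].astype(np.float64)

# Split the column into 4 bins holding (roughly) the same number of rows.
bins_col = pd.qcut(my_df["col"], 4)

# Integer code of the quartile each row falls into (0 = lowest quartile).
bins_col_codes = bins_col.cat.codes

# Number of rows per quartile.
counts = bins_col.value_counts()
print(bins_col_codes.tolist())
```

If only the integer codes are needed, `pd.qcut(my_df["col"], 4, labels=False)` should give them directly.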

2. Multiple aggregations (group functions)

# multiple aggregations (group functions)
# column settings
grouped_on = 'col_0'  # ['col_0', 'col_2'] for multiple columns
aggregated_column = 'col_1'
### choice of aggregate functions
## on non-NA values in the group
## - numeric choice: mean, median, sum, std, var, min, max, prod
## - group choice: first, last, count
# list of functions to compute
agg_funcs = ['mean', 'max']
# compute aggregate values
aggregated_values = my_df.groupby(grouped_on)[aggregated_column].agg(agg_funcs)
# get the aggregates of one group (`group` is a value of the grouping column)
aggregated_values.loc[group]
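
To make the result shape concrete, the following sketch runs the same aggregation on a small invented table. The data and the group key `"a"` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({
    "col_0": ["a", "a", "b", "b"],
    "col_1": [1.0, 3.0, 10.0, 20.0],
})

# One output column per aggregate function, one row per group.
aggregated_values = my_df.groupby("col_0")["col_1"].agg(["mean", "max"])

# Label-based lookup of a single group's aggregates.
row_a = aggregated_values.loc["a"]
print(aggregated_values)
```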

3. Aggregating with a custom function

# aggregate with a custom function
# column settings
grouped_on = ['col_0']
aggregated_columns = ['col_1']
def my_func(my_group_array):
    # smallest value in the group times the number of non-NA values
    return my_group_array.min() * my_group_array.count()
## list of functions to compute
agg_funcs = [my_func]  # could be many
# compute aggregate values
aggregated_values = my_df.groupby(grouped_on)[aggregated_columns].agg(agg_funcs)
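
A possible variation, sketched on invented data: with a single aggregated column, named aggregation lets the output column carry a readable label instead of the function name. The label `min_times_count` and the sample values are hypothetical.

```python
import pandas as pd

def my_func(my_group_array):
    # Smallest value in the group times the number of non-NA values.
    return my_group_array.min() * my_group_array.count()

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({
    "col_0": ["a", "a", "b", "b"],
    "col_1": [1.0, 3.0, 10.0, 20.0],
})

# Named aggregation: keyword = output column name, value = function.
aggregated_values = my_df.groupby("col_0")["col_1"].agg(min_times_count=my_func)
print(aggregated_values)
```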

4. Top n in an aggregated DataFrame

# top n in aggregate dataframe
def top_n(group_df, col, n=2):
    # the n most frequent values of `col` within the group
    bests = group_df[col].value_counts()[:n]
    return bests
# column settings
grouped_on = 'col_0'
aggregated_column = 'col'
grouped = my_df.groupby(grouped_on)
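
The snippet above stops after building `grouped`. One plausible way to finish it, shown here as an assumption rather than the original author's code, is to apply a top-n helper to each group. This sketch selects the column before calling `apply`, so the helper receives a Series rather than a DataFrame; the sample data is invented.

```python
import pandas as pd

def top_n(values, n=2):
    # Most frequent values first; keep only the first n.
    return values.value_counts().head(n)

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({
    "col_0": ["a", "a", "a", "b", "b", "b"],
    "col":   ["x", "x", "y", "z", "z", "z"],
})

# Result is indexed by (group key, value) and holds the counts.
result = my_df.groupby("col_0")["col"].apply(top_n, n=1)
print(result)
```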

5. Moving average

# moving average
# x: array-like
# w: int, window size
import numpy as np
ret = np.cumsum(np.array(x), dtype=float)
ret[w:] = ret[w:] - ret[:-w]
result = ret[w - 1:] / w
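
The cumulative-sum trick above can be wrapped in a function and cross-checked against pandas' own rolling mean. The function name `moving_average` and the sample input are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def moving_average(x, w):
    # Running total; subtracting the total from w steps earlier
    # leaves the sum of the last w values at each position.
    ret = np.cumsum(np.asarray(x), dtype=float)
    ret[w:] = ret[w:] - ret[:-w]
    # The first w - 1 positions do not cover a full window.
    return ret[w - 1:] / w

# Hypothetical input (not from the original notes).
x = [1, 2, 3, 4, 5]
result = moving_average(x, 3)

# Same computation with pandas, dropping the incomplete leading windows.
rolling = pd.Series(x, dtype=float).rolling(window=3).mean().dropna().to_numpy()
print(result, rolling)
```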

6. Basic information about grouped data

# basic information about grouped data
# column settings
grouped_on = 'col_0'  # ['col_0', 'col_1'] for multiple columns
aggregated_column = 'col_1'
### choice of aggregate functions
## on non-NA values in the group
## - numeric choice: mean, median, sum, std, var, min, max, prod
## - group choice: first, last, count
## on the group rows
## - size of the group: size
aggregated_values = my_df.groupby(grouped_on)[aggregated_column].mean()
aggregated_values.name = 'mean'
# get the aggregate of one group (`group` is a value of the grouping column)
aggregated_values.loc[group]
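
The comments above separate functions that work on non-NA values (such as `count`) from `size`, which counts rows. A small sketch on invented data with one missing value shows the difference; the column names follow the snippet, the values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in group "a".
my_df = pd.DataFrame({
    "col_0": ["a", "a", "b"],
    "col_1": [1.0, np.nan, 5.0],
})
grouped = my_df.groupby("col_0")["col_1"]

summary = pd.DataFrame({
    "mean": grouped.mean(),    # ignores NaN
    "count": grouped.count(),  # non-NA values only
    "size": grouped.size(),    # all rows in the group
})
print(summary)
```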

7. Iterating over groups

# iterate over groups
# column settings
grouped_on = 'col_0'  # ['col_0', 'col_1'] for multiple columns
grouped = my_df.groupby(grouped_on)
i = 0
for group_name, group_dataframe in grouped:
    # stop after the first few groups
    if i > 10:
        break
    i += 1
    print(i, group_name, group_dataframe.mean(numeric_only=True))  # mean of all numeric columns
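
As an alternative to the manual counter, `itertools.islice` can cap the number of groups visited. This is only a sketch on invented data; the cap of 2 and the dictionary `group_means` are choices made for the example.

```python
from itertools import islice

import pandas as pd

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({
    "col_0": ["a", "a", "b", "b", "c"],
    "col_1": [1.0, 3.0, 10.0, 20.0, 7.0],
})
grouped = my_df.groupby("col_0")

# Visit only the first 2 groups (groups come out sorted by key by default).
group_means = {}
for group_name, group_dataframe in islice(grouped, 2):
    group_means[group_name] = group_dataframe["col_1"].mean()
print(group_means)
```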

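8. Maximal information coefficient (MIC)

# maximal information coefficient (MIC)
# x: array-like; the first column is compared with every other column
# assumption: `mine` is a MINE estimator from the third-party minepy package;
# the source does not show how it is created or how the scores are collected
import numpy as np
from minepy import MINE
def mic_scores(x):
    matrix = np.transpose(np.array(x)).astype(float)
    mine = MINE()
    mic_result = []
    for i in matrix[1:]:
        mine.compute_score(matrix[0], i)
        mic_result.append(mine.mic())
    return mic_result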


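9. Pearson correlation coefficient

# x: array-like, one row per observation
import numpy as np
matrix = np.transpose(np.array(x))
# correlation between the first and second columns of x
np.corrcoef(matrix[0], matrix[1])[0, 1]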
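
For reference, the same coefficient can be obtained from pandas, whose `Series.corr` defaults to Pearson. The sketch below compares both on invented observations; `x` here is an assumption, not data from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical observations: one row per sample, two variables.
x = [[1, 2], [2, 4], [3, 7], [4, 8]]
matrix = np.transpose(np.array(x, dtype=float))

# NumPy: off-diagonal entry of the 2x2 correlation matrix.
r_numpy = np.corrcoef(matrix[0], matrix[1])[0, 1]

# pandas: Series.corr uses the Pearson method by default.
r_pandas = pd.Series(matrix[0]).corr(pd.Series(matrix[1]))
print(r_numpy, r_pandas)
```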

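10. Custom aggregation function

# custom aggregation function
# grouped_on and aggregated_column are set as in the sections above
def zscore(x):
    # standardize each value against its own group's mean and std
    return (x - x.mean()) / x.std()
my_df['zscore_col'] = my_df.groupby(grouped_on)[aggregated_column].transform(zscore)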
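
A short sketch of what the per-group z-score produces, using invented data. Note that `Series.std` uses the sample standard deviation (ddof=1) by default, which is what the expected values below assume.

```python
import numpy as np
import pandas as pd

def zscore(x):
    # Standardize each value against its own group's mean and sample std.
    return (x - x.mean()) / x.std()

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({
    "col_0": ["a", "a", "b", "b"],
    "col_1": [1.0, 3.0, 10.0, 20.0],
})

# transform returns a result aligned with the original rows.
my_df["zscore_col"] = my_df.groupby("col_0")["col_1"].transform(zscore)
print(my_df)
```

Because `transform` keeps the original row order and length, the result can be assigned straight back as a new column.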

11. Standard aggregation with groupby

# standard aggregation with groupby
# column settings
grouped_on = 'col_1'
aggregated_column = 'col_0'
### choice of aggregate functions
## on non-NA values in the group
## - numeric choice: mean, median, sum, std, var, min, max, prod
## - group choice: first, last, count
# broadcast each group's aggregate back onto its rows
my_df['aggregate_values_on_col'] = my_df.groupby(grouped_on)[aggregated_column].transform(lambda v: v.mean())
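
For the built-in aggregates listed in the comments, `transform` also accepts the function name as a string, which should give the same result as the lambda. The sketch below checks this on invented data.

```python
import pandas as pd

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({
    "col_1": ["a", "a", "b"],
    "col_0": [1.0, 3.0, 10.0],
})
grouped = my_df.groupby("col_1")["col_0"]

# Lambda form, as in the snippet above.
via_lambda = grouped.transform(lambda v: v.mean())
# String form for a built-in aggregate.
via_name = grouped.transform("mean")
print(via_name.tolist())
```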

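12. Setting values with a custom function

# set values with a custom function
import numpy as np
from math import log
def to_log(v):
    try:
        return log(v)
    except (ValueError, TypeError):
        # non-positive or non-numeric input
        return np.nan
my_df['new_col'] = my_df['col_0'].map(to_log)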
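
The element-wise `map` above can be compared with a vectorized variant that masks non-positive values before taking the logarithm. This is a sketch on invented data, assuming the column is numeric; the vectorized form is an alternative, not the original method.

```python
import math

import numpy as np
import pandas as pd

def to_log(v):
    try:
        return math.log(v)
    except (ValueError, TypeError):
        # math.log raises for zero, negative, or non-numeric input.
        return np.nan

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({"col_0": [1.0, math.e, 0.0, -1.0]})

# Element-wise version, as in the snippet above.
mapped = my_df["col_0"].map(to_log)

# Vectorized version: replace non-positive values with NaN, then take the log.
vectorized = np.log(my_df["col_0"].where(my_df["col_0"] > 0))
print(mapped.tolist(), vectorized.tolist())
```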

13. Setting values with a complex function

# set values with a complex function
import numpy as np
def complex_formula(col0_value, col1_value):
    return "%s (%s)" % (col0_value, col1_value)
# np.vectorize calls the function once per row
my_df['new_col'] = np.vectorize(complex_formula)(my_df['col_0'], my_df['col_1'])
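
`np.vectorize` is a convenience wrapper around a Python-level loop, so a plain list comprehension over the zipped columns should produce the same values. The sketch below compares the two on invented data; the column `new_col_zip` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def complex_formula(col0_value, col1_value):
    return "%s (%s)" % (col0_value, col1_value)

# Hypothetical sample data (not from the original notes).
my_df = pd.DataFrame({"col_0": ["a", "b"], "col_1": [1, 2]})

# np.vectorize version, as in the snippet above.
my_df["new_col"] = np.vectorize(complex_formula)(my_df["col_0"], my_df["col_1"])

# Equivalent explicit loop over the two columns.
my_df["new_col_zip"] = [
    complex_formula(a, b) for a, b in zip(my_df["col_0"], my_df["col_1"])
]
print(my_df)
```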


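14. Setting values with a dict

# set values with a dict
# the entries of gender_dict are missing from the source;
# fill in {old_value: new_value} pairs before use
gender_dict = {}
df['gender'] = df['gender'].map(gender_dict)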
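
Since the mapping itself did not survive in the notes, the following sketch uses a made-up one to show how `map` behaves with a dict. The keys, codes, and the `gender_code` column are all hypothetical.

```python
import pandas as pd

# Hypothetical data and mapping (the original entries are unknown).
df = pd.DataFrame({"gender": ["male", "female", "unknown"]})
gender_dict = {"male": 1, "female": 0}

# Values found in the dict are replaced; values missing from it become NaN.
df["gender_code"] = df["gender"].map(gender_dict)
print(df)
```

Values absent from the dict turn into NaN, so it is worth checking the result for missing values after mapping.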
