9.17 Study Notes (Handling Duplicate Values, Data Cleaning)

2021-09-27 04:26:24

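pandas' duplicated() flags which records are duplicates.

pandas' drop_duplicates() removes duplicate records; duplicates can be judged on specific columns or on all columns.

numpy's unique() returns all distinct values, sorted in ascending order.

set(), a Python built-in, also returns the collection of unique elements, but without any guaranteed order.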
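
To make the difference between the last two options concrete, here is a minimal sketch on a made-up list (the values are an assumption, chosen only for illustration). It shows that unique() returns a sorted array, while a set has to be passed through sorted() if order matters.

```python
import numpy as np

values = [3, 1, 2, 3, 1]  # made-up sample data

# np.unique returns the distinct values as a sorted array
unique_sorted = np.unique(values)
print(unique_sorted.tolist())  # [1, 2, 3]

# return_counts=True also reports how often each value occurs
uniq, counts = np.unique(values, return_counts=True)
print(dict(zip(uniq.tolist(), counts.tolist())))  # {1: 2, 2: 1, 3: 2}

# set() removes duplicates too, but does not guarantee any order,
# so sort it explicitly when the order matters
unique_set = set(values)
print(sorted(unique_set))  # [1, 2, 3]
```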

Example: handling duplicate values

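import pandas as pd

# Four sample rows; rows 0/1 and rows 2/3 are identical
data1 = ['a', 1]
data2 = ['a', 1]
data3 = ['b', 2]
data4 = ['b', 2]
data = pd.DataFrame([data1, data2, data3, data4], columns=['col1', 'col2'])
print(data)

# Flag duplicates: True marks a row that repeats an earlier row
isduplicated = data.duplicated()
print(isduplicated)

# Drop duplicates
new_1 = data.drop_duplicates()                  # compare all columns
new_2 = data.drop_duplicates(['col1'])          # compare col1 only
new_3 = data.drop_duplicates(['col1', 'col2'])  # compare col1 and col2
print(new_1)
print(new_2)
print(new_3)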

Output:

  col1  col2
0    a     1
1    a     1
2    b     2
3    b     2
0    False
1     True
2    False
3     True
dtype: bool
  col1  col2
0    a     1
2    b     2
  col1  col2
0    a     1
2    b     2
  col1  col2
0    a     1
2    b     2
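
The example above relies on the default behaviour, which keeps the first occurrence of each duplicated row. As a small sketch on the same four rows, the keep parameter of duplicated() and drop_duplicates() changes which occurrence counts as the original. The reset_index() call at the end is an optional extra, not something the notes above use.

```python
import pandas as pd

data = pd.DataFrame(
    [['a', 1], ['a', 1], ['b', 2], ['b', 2]],
    columns=['col1', 'col2'],
)

# keep='first' (default): the first occurrence is not flagged
first_flags = data.duplicated().tolist()
# keep='last': the last occurrence is not flagged
last_flags = data.duplicated(keep='last').tolist()
# keep=False: every row that has a duplicate is flagged
all_flags = data.duplicated(keep=False).tolist()

print(first_flags)  # [False, True, False, True]
print(last_flags)   # [True, False, True, False]
print(all_flags)    # [True, True, True, True]

# drop_duplicates accepts the same keep option;
# reset_index(drop=True) renumbers the surviving rows from 0
deduped = data.drop_duplicates(keep='last').reset_index(drop=True)
print(deduped)
```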

Example: data cleaning

import re  # regular expression library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

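# Feature engineering
# Column names are written in lowercase throughout; adjust them if your CSV
# uses different casing (for example 'Survived', 'Pclass', 'Fare')
train_df_org = pd.read_csv('train.csv')
test_df_org = pd.read_csv('test.csv')
test_df_org['survived'] = 0  # placeholder label so both frames share the same columns
# Assumption: combined_train_test, used below, is the training and test sets stacked together
combined_train_test = pd.concat([train_df_org, test_df_org], ignore_index=True)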

# --- pclass field --- build the pclass fare category
def pclass_fare_category(df, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    # df is a single row; label it by class and by whether its fare is
    # at or below ('low') or above ('high') the mean fare of that class
    if df['pclass'] == 1:
        if df['fare'] <= pclass1_mean_fare:
            return 'pclass1_low'
        else:
            return 'pclass1_high'
    elif df['pclass'] == 2:
        if df['fare'] <= pclass2_mean_fare:
            return 'pclass2_low'
        else:
            return 'pclass2_high'
    elif df['pclass'] == 3:
        if df['fare'] <= pclass3_mean_fare:
            return 'pclass3_low'
        else:
            return 'pclass3_high'

# Mean fare for each passenger class
mean_fare_by_pclass = combined_train_test['fare'].groupby(by=combined_train_test['pclass']).mean()
pclass1_mean_fare = mean_fare_by_pclass.loc[1]  # mean fare of the pclass == 1 cabins
pclass2_mean_fare = mean_fare_by_pclass.loc[2]
pclass3_mean_fare = mean_fare_by_pclass.loc[3]

# Assumption: the category column grouped on below is built by applying
# the function above to every row
combined_train_test['pclass_fare_category'] = combined_train_test.apply(
    pclass_fare_category,
    args=(pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare),
    axis=1,
)

print('# pclass_fare_category...')
print(combined_train_test.groupby(['pclass_fare_category', 'survived'])['survived'].count())

Output:

# pclass_fare_category...
pclass_fare_category  survived
pclass1_high          0            49
                      1            48
pclass1_low           0           138
                      1            88
pclass2_high          0            68
                      1            43
pclass2_low           0           122
                      1            44
pclass3_high          0           174
                      1            42
pclass3_low           0           416
                      1            77
Name: survived, dtype: int64
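
The pclass_fare_category function above labels one row at a time. As an alternative sketch (not the method used in these notes), the same labels can be produced without a per-row function by comparing each fare with the mean fare of its own class via groupby(...).transform('mean'). The small DataFrame below is made-up stand-in data, not the real train.csv/test.csv contents.

```python
import pandas as pd

# Made-up stand-in for combined_train_test (not the real data)
df = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3],
    'fare':   [80.0, 30.0, 20.0, 10.0, 9.0, 7.0],
})

# Mean fare of each row's own class, aligned to the original rows
class_mean_fare = df.groupby('pclass')['fare'].transform('mean')

# 'low' when the fare is at or below the class mean, otherwise 'high'
# (a missing fare compares as False and therefore ends up as 'high',
# matching the if/else logic of the row-wise function)
level = (df['fare'] <= class_mean_fare).map({True: 'low', False: 'high'})

# Assemble labels such as 'pclass1_high'
df['pclass_fare_category'] = 'pclass' + df['pclass'].astype(str) + '_' + level

print(df)
```

Unlike the row-wise function, this version does not hard-code the three class values, so it would also label any other class that happens to appear in the data.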
