Python Data Operations: Data Cleaning

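Missing data is a common problem in practice. In fields such as machine learning and data mining, missing values degrade data quality and can seriously hurt the accuracy of model predictions. In these fields, handling missing values is key to making models more accurate and effective.

Let's look at how to handle missing values (such as NA or NaN) with the pandas library.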

# Handling missing values in data with the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# reindex adds the labels 'b', 'd' and 'g', whose rows are filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Output:

        one       two     three
a  0.077988  0.476149  0.965836
b       NaN       NaN       NaN
c -0.390208 -0.551605 -2.301950
d       NaN       NaN       NaN
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g       NaN       NaN       NaN
h  0.085100  0.532791  0.887415

I. Checking for missing values: pandas provides the isnull() and notnull() functions

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# isnull() returns True where the value in column 'one' is missing, False otherwise
print(df['one'].isnull())

Output:

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
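The heading above also mentions notnull(), which the example does not exercise. The following is a minimal sketch, assuming a small hand-written frame (the values are illustrative, not taken from the example above). It shows notnull() as the inverse of isnull(), and how the boolean mask can be summed to count missing values.

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result is reproducible (illustrative values)
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

# notnull() is the element-wise inverse of isnull()
present = df["one"].notnull()

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isnull().sum()

# any(axis=1) flags rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(present)
print(missing_per_column)
print(rows_with_missing)
```

Counting missing values per column is often a useful first step before deciding whether to fill or drop them.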

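II. Cleaning and filling missing data: the fillna() function can "fill" NA values with non-null data in several ways

1. Replacing NaN with a scalar value (0)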

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3),
                  index=['a', 'c', 'd'],
                  columns=['one', 'two', 'three'])
# 'b' is a new label, so its row is filled with NaN; 'd' is dropped
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
# 0 is used here, but any other scalar value works as well
print(df.fillna(0))

Output:

        one       two     three
a  0.538547 -0.116047 -0.413233
b       NaN       NaN       NaN
c  0.323509 -0.709677  1.243817
NaN replaced with '0':
        one       two     three
a  0.538547 -0.116047 -0.413233
b  0.000000  0.000000  0.000000
c  0.323509 -0.709677  1.243817
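A single scalar is not the only option. As a hedged sketch with illustrative data, fillna() also accepts a dict or a Series keyed by column label, which lets each column get its own fill value, for example the column mean. The fill values 0.0 and -1.0 below are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one gap per column
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 4.0, 8.0]},
    index=["a", "b", "c"],
)

# A dict maps column labels to the fill value used for that column
filled_const = df.fillna({"one": 0.0, "two": -1.0})

# df.mean() is a Series indexed by column label, so every column
# is filled with its own mean (NaN is skipped when computing it)
filled_mean = df.fillna(df.mean())

print(filled_const)
print(filled_mean)
```

Filling with the mean keeps the column average unchanged but reduces its variance, so whether it is appropriate depends on the data.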

2. Forward and backward filling of NaN

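# ffill (alias: pad): forward fill
# bfill (alias: backfill): backward fill
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# df.ffill() is equivalent to the older df.fillna(method='pad'),
# which recent pandas versions deprecate
# Forward fill: each missing row takes the values of the previous row
print('Forward fill result:\n', df.ffill())
# Backward fill: each missing row takes the values of the next row
print('Backward fill result:\n', df.bfill())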

Output:

Forward fill result:
        one       two     three
a -0.989952  1.692963 -1.115485
b -0.989952  1.692963 -1.115485
c -0.218375 -0.090271 -0.381034
d -0.218375 -0.090271 -0.381034
e  0.748527  1.635351 -1.993645
f -0.525781  1.185460 -0.728045
g -0.525781  1.185460 -0.728045
h -0.706908 -0.832507  1.465190
Backward fill result:
        one       two     three
a -0.989952  1.692963 -1.115485
b -0.218375 -0.090271 -0.381034
c -0.218375 -0.090271 -0.381034
d  0.748527  1.635351 -1.993645
e  0.748527  1.635351 -1.993645
f -0.525781  1.185460 -0.728045
g -0.706908 -0.832507  1.465190
h -0.706908 -0.832507  1.465190
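Two details of directional filling are easy to miss: a forward fill cannot fill a gap at the very start (there is no earlier value), and long runs of NaN are filled completely unless a limit is set. The sketch below uses an illustrative Series to show both; the limit of 1 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Illustrative series: a leading gap and a run of three gaps
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0],
              index=list("abcdef"))

# Forward fill: 'a' stays NaN because nothing precedes it
forward = s.ffill()

# limit=1 fills at most one consecutive NaN after each valid value
forward_limited = s.ffill(limit=1)

# Backward fill: every gap takes the next valid value
backward = s.bfill()

print(forward)
print(forward_limited)
print(backward)
```

Setting a limit can help avoid carrying a stale value across a long stretch of missing observations.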

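3. Dropping missing values: to simply exclude missing values, use the dropna() function together with the axis argument.

By default axis=0, which operates row by row: if any value in a row is NA, the entire row is dropped.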

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
print(df)
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Rows 'b', 'd' and 'g' contain NaN and are dropped again
print(df.dropna())

Output:

        one       two     three
a -1.346925 -1.281311 -0.880618
c  0.494288 -0.822928  0.349231
e  0.519051 -0.459518  0.161189
f  0.143254  1.976580 -0.462714
h -1.615947  0.838520 -0.020003
        one       two     three
a -1.346925 -1.281311 -0.880618
c  0.494288 -0.822928  0.349231
e  0.519051 -0.459518  0.161189
f  0.143254  1.976580 -0.462714
h -1.615947  0.838520 -0.020003
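The axis argument described above can also be pointed at columns, and dropna() has a thresh parameter for a less strict rule. The following sketch uses illustrative data to compare the default behavior with axis=1 and thresh=2; the threshold value is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Illustrative frame: rows 'b' and 'c' have gaps, column 'one' is complete
df = pd.DataFrame(
    {"one": [1.0, 4.0, 5.0, 7.0],
     "two": [2.0, np.nan, np.nan, 8.0],
     "three": [3.0, np.nan, 6.0, 9.0]},
    index=["a", "b", "c", "d"],
)

# Default (axis=0): drop every row that contains any NaN
complete_rows = df.dropna()

# axis=1: drop every column that contains any NaN
complete_cols = df.dropna(axis=1)

# thresh=2: keep rows that have at least two non-missing values
mostly_complete = df.dropna(thresh=2)

print(complete_rows)
print(complete_cols)
print(mostly_complete)
```

Here row 'c' survives the thresh=2 rule because it has two valid values, while row 'b' has only one.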

4. Replacing missing or generic values: replacing NA with a scalar value via replace() has the same effect as fillna().

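import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})
print('Before replacement:\n', df)
# Replace 1000 with 10 and 2000 with 60
print('After replacement:\n', df.replace({1000: 10, 2000: 60}))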

Output:

Before replacement:
    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60
After replacement:
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
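The claim that replacing NA with a scalar matches fillna() can be checked directly. This is a small sketch with illustrative data. It also shows the reverse direction, turning a placeholder into NaN; the sentinel -999 is a hypothetical choice and does not come from the example above.

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing values
df = pd.DataFrame({"one": [10.0, np.nan, 30.0],
                   "two": [np.nan, 0.0, 60.0]})

# Replacing NaN with a scalar via replace() ...
via_replace = df.replace(np.nan, 0)
# ... gives the same frame as fillna() with that scalar
via_fillna = df.fillna(0)
same_result = via_replace.equals(via_fillna)

# Reverse direction: a hypothetical sentinel (-999) becomes NaN,
# so the usual missing-value tools apply afterwards
raw = pd.Series([1.0, -999.0, 3.0])
cleaned = raw.replace(-999.0, np.nan)

print(same_result)
print(cleaned)
```

replace() is the more general tool because it can target any value, while fillna() only touches missing entries.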

These are the common ways of handling missing values in a dataset.
