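Data Processing: Handling Missing Values

Missing data mainly takes two forms: missing records and missing field values. Either can significantly affect an analysis and make its results more uncertain.

Ways to handle missing values: delete the records / impute the data / leave them as they are

Detecting missing values - isnull, notnull

isnull: True for missing values, False for non-missing values

notnull: False for missing values, True for non-missing values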

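import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
# ^ Jupyter magic; omit this line outside a notebook
s = pd.Series([12,33,45,23,np.nan,np.nan,66,54,np.nan,99])
# The original DataFrame contents were lost; the values below are assumed for illustration
df = pd.DataFrame({'value1': [12,33,45,23,np.nan,np.nan,66,54,np.nan,99],
                   'value2': ['a','b','c','d','e',np.nan,np.nan,'f','g',np.nan]})
# Create the data
print(s.isnull())  # On a Series: returns a boolean Series
print(df.notnull())  # On a DataFrame: returns a boolean DataFrame
print(df['value1'].notnull())  # On a single column selected by label
print('------')
s2 = s[s.isnull() == False]
df2 = df[df['value2'].notnull()]   # Note the difference from df[df['value2'].notnull()]['value1']
print(s2)
print(df2)
# Keep only the non-missing values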
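Because isnull returns booleans, the result can also be summed or averaged to count missing values. The following is a minimal sketch of that idea; `df_demo` and its values are made up for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original post
df_demo = pd.DataFrame({
    'value1': [12, 33, np.nan, 23, np.nan],
    'value2': ['a', np.nan, 'c', 'd', 'e'],
})

# Number of missing values in each column (True counts as 1)
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Share of missing values in each column
missing_ratio = df_demo.isnull().mean()
print(missing_ratio)

# True if the DataFrame contains at least one missing value
has_missing = bool(df_demo.isnull().values.any())
print(has_missing)
```

Counting first is usually a cheap way to decide whether deleting rows or imputing values is the more reasonable choice.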
Removing missing values - dropna
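s = pd.Series([12,33,45,23,np.nan,np.nan,66,54,np.nan,99])
# The original DataFrame contents were lost; the values below are assumed for illustration
df = pd.DataFrame({'value1': [12,33,45,23,np.nan,np.nan,66,54,np.nan,99],
                   'value2': ['a','b','c','d','e',np.nan,np.nan,'f','g',np.nan]})
# Create the data
s.dropna(inplace = True)
df2 = df['value1'].dropna()
print(s)
print(df2)
# dropna can be called directly on a Series or a DataFrame
# Note the inplace parameter: it defaults to False, which returns a new object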
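On a whole DataFrame, dropna removes rows rather than single values. The sketch below shows the `how` and `subset` parameters of `DataFrame.dropna`; `df_demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original post
df_demo = pd.DataFrame({
    'value1': [12, np.nan, 45, np.nan],
    'value2': ['a', 'b', np.nan, np.nan],
})

# Drop rows that contain any missing value (the default, how='any')
any_dropped = df_demo.dropna()

# Drop only rows in which every value is missing
all_dropped = df_demo.dropna(how='all')

# Drop rows only when the listed column is missing
subset_dropped = df_demo.dropna(subset=['value1'])

print(len(any_dropped), len(all_dropped), len(subset_dropped))
```

None of these calls modifies `df_demo`, since inplace is left at its default of False.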
Filling/replacing missing data - fillna, replace
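s = pd.Series([12,33,45,23,np.nan,np.nan,66,54,np.nan,99])
# The original DataFrame contents were lost; the values below are assumed for illustration
df = pd.DataFrame({'value1': [12,33,45,23,np.nan,np.nan,66,54,np.nan,99],
                   'value2': ['a','b','c','d','e',np.nan,np.nan,'f','g',np.nan]})
# Create the data
s.fillna(0,inplace = True)
print(s)
print('------')
# s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
# value: the fill value
# Note the inplace parameter
df['value1'] = df['value1'].ffill()  # same result as the older fillna(method='pad'), which recent pandas deprecates
print(df)
print('------')
# method parameter (older pandas):
# pad / ffill → fill with the previous value
# backfill / bfill → fill with the next value
s = pd.Series([1,1,1,1,2,2,2,3,4,5,np.nan,np.nan,66,54,np.nan,99])
s = s.replace(np.nan,'missing')  # assigning the result avoids relying on an in-place dtype change
print(s)
print('------')
# df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)
# to_replace → the value(s) to be replaced
# value → the replacement value
s = s.replace([1,2,3],np.nan)
print(s)
# Replace several values with np.nan at once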
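Both fillna and replace also accept a dict, which is convenient when each column or each value needs its own substitute. A minimal sketch, with hypothetical names (`df_demo`, `codes`) and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original post
df_demo = pd.DataFrame({
    'value1': [12.0, np.nan, 45.0],
    'value2': ['a', 'b', np.nan],
})

# A dict passed to fillna gives each column its own fill value
filled = df_demo.fillna({'value1': 0, 'value2': 'unknown'})
print(filled)

# A dict passed to replace maps each old value to its replacement
codes = pd.Series([1, 2, 3, 1])
replaced = codes.replace({1: 10, 2: 20})
print(replaced.tolist())
```

Values that are not listed in the dict, such as 3 in `codes`, are left unchanged.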
Several approaches to imputation: mean/median/mode imputation, nearest-value imputation, interpolation
(1) Mean/median/mode imputation
s = pd.Series([1,2,3,np.nan,3,4,5,5,5,5,np.nan,np.nan,6,6,7,12,2,np.nan,3,4])
#print(s)
print('------')
# Create the data
u = s.mean()     # mean
me = s.median()  # median
mod = s.mode()   # mode (a Series, since there can be ties)
print('Mean: %.2f, median: %.2f' % (u,me))
print('Mode:', mod.tolist())
print('------')
# Compute the mean, median, and mode
s.fillna(u,inplace = True)
print(s)
# Fill the missing values with the mean
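The median and the mode can be used as fill values in the same way as the mean. The sketch below uses a made-up Series (`s_demo`) with one outlier to show why the choice matters.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with an outlier (100)
s_demo = pd.Series([1, 2, 2, np.nan, 3, 100, np.nan])

# The median is less sensitive to the outlier than the mean
filled_median = s_demo.fillna(s_demo.median())

# mode() returns a Series because ties are possible; take the first entry
filled_mode = s_demo.fillna(s_demo.mode().iloc[0])

print(s_demo.mean(), s_demo.median(), s_demo.mode().tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

Here the mean is pulled far above most of the data by the single outlier, while the median and the mode stay at 2.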
(2) Nearest-value imputation
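s = pd.Series([1,2,3,np.nan,3,4,5,5,5,5,np.nan,np.nan,6,6,7,12,2,np.nan,3,4])
#print(s)
print('------')
# Create the data
s = s.ffill()  # forward fill; same result as the older fillna(method='ffill'), which recent pandas deprecates
print(s)
# Fill each missing value with the previous value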
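Forward filling has two edge cases worth knowing: a gap at the very start has no previous value, and long gaps get filled with a single stale value. A small sketch with a made-up Series (`s_demo`), using the `limit` parameter and a follow-up backward fill:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a leading gap and a two-value gap
s_demo = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill alone leaves the leading gap as NaN
forward_only = s_demo.ffill()

# limit=1 fills at most one consecutive missing value per gap
limited = s_demo.ffill(limit=1)

# A backward fill afterwards covers the leading gap
both = s_demo.ffill().bfill()

print(forward_only.tolist())
print(limited.tolist())
print(both.tolist())
```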
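(3) Interpolation: Lagrange interpolation
from scipy.interpolate import lagrange
# The original x and y definitions were lost; these points are assumed, chosen to match the polynomial below
x = [3, 6, 9]
y = [10, 8, 4]
print(lagrange(x,y))
# lagrange returns a polynomial; printing it shows its n coefficients
# Here there are 3 coefficients: a0, a1, a2
# y = a0 * x**2 + a1 * x + a2 → y = -0.11111111 * x**2 + 0.33333333 * x + 10
print('Interpolated value at x=10: %.2f' % lagrange(x,y)(10))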
(3) Interpolation: Lagrange interpolation in practice
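The original example for this step did not survive, so the following is only a sketch of one way to apply Lagrange interpolation to a Series with gaps. The helper `lagrange_fill`, the window size `k`, and the sample data are all hypothetical; the only library call assumed is `scipy.interpolate.lagrange`.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

def lagrange_fill(series, position, k=3):
    """Estimate the value at `position` from the known points
    among up to k neighbours on each side (hypothetical helper)."""
    start = max(position - k, 0)
    stop = min(position + k + 1, len(series))
    positions = np.arange(start, stop)
    values = series.iloc[start:stop].to_numpy(dtype=float)
    known = ~np.isnan(values)
    # Fit the interpolating polynomial through the known neighbours,
    # using integer positions as the x coordinates
    poly = lagrange(positions[known], values[known])
    return float(poly(position))

# Hypothetical data: squares of 1..7 with one value missing
s_demo = pd.Series([1.0, 4.0, 9.0, np.nan, 25.0, 36.0, 49.0])

filled = s_demo.copy()
for pos in np.flatnonzero(s_demo.isnull().to_numpy()):
    filled.iloc[pos] = lagrange_fill(s_demo, pos)

print(filled.tolist())
```

Because the known points lie on a quadratic, the estimate for the gap should come out close to 16. A small window is used on purpose: Lagrange polynomials fitted through many points tend to oscillate and become numerically unstable.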