Chi-square-based binning algorithm in Python

2021-10-08 03:10:50

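The principle is simple. Start with 20 or more initial bins. First make sure every bin contains both the 0 and the 1 label, merging any bin that lacks one of them into the preceding bin. Then compute the chi-square value of each pair of adjacent bins and merge the bin with the smallest chi-square value into the bin that follows it, repeating until the target number of bins is reached. The code is as follows: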

import numpy as np
import pandas as pd
from scipy import stats


def chi_bin(df, var, target, binnum=5, maxcut=20):
    '''
    df: data
    var: variable
    target: target / label (0 or 1)
    binnum: the number of bins output
    maxcut: initial bins number
    '''
    # work on a copy so that adding the "cut" column does not touch df
    data = df[[var, target]].copy()

    # equal-frequency cut of var into at most maxcut bins
    data["cut"], breaks = pd.qcut(data[var], q=maxcut, duplicates="drop", retbins=True)

    # count 1 and 0 in each bin; observed=False keeps empty bins as 0,
    # so the two series stay aligned with the bin edges
    count_1 = data.loc[data[target] == 1].groupby("cut", observed=False)[target].count()
    count_0 = data.loc[data[target] == 0].groupby("cut", observed=False)[target].count()

    # bins value: min, max, count 0, count 1
    # breaks[:-1] pairs every edge with the next one; slicing with
    # breaks[:maxcut-1] drops the top bin (the sample run below, which starts
    # from 9 bins and ends at 72.0, appears to reflect that slicing)
    bins_value = [*zip(breaks[:-1], breaks[1:], count_0, count_1)]

    def merge_pair(left, right):
        # merge two adjacent bins: keep left min and right max, add up the counts
        return (left[0], right[1], left[2] + right[2], left[3] + right[3])

    # define woe
    def woe_value(bins_value):
        df_woe = pd.DataFrame(bins_value, columns=["min", "max", "count_0", "count_1"])
        df_woe["total"] = df_woe.count_1 + df_woe.count_0
        df_woe["bad_rate"] = df_woe.count_1 / df_woe.total
        df_woe["woe"] = np.log((df_woe.count_0 / df_woe.count_0.sum())
                               / (df_woe.count_1 / df_woe.count_1.sum()))
        return df_woe

    # define iv
    def iv_value(df_woe):
        rate = (df_woe.count_0 / df_woe.count_0.sum()) - (df_woe.count_1 / df_woe.count_1.sum())
        return np.sum(rate * df_woe.woe)

    # make sure every bin contains both 1 and 0:
    # the first bin merges with the next one, any other bin merges with the previous one
    i = 0
    while i < len(bins_value) and len(bins_value) > 1:
        if 0 in bins_value[i][2:]:
            j = max(i, 1)
            bins_value[j - 1:j + 1] = [merge_pair(bins_value[j - 1], bins_value[j])]
            i = j - 1  # re-check the merged bin
        else:
            i += 1

    # calculate the chi-square of adjacent bins and merge the pair with the minimum chi-square
    while len(bins_value) > binnum:
        chi_squares = []
        for i in range(len(bins_value) - 1):
            a = bins_value[i][2:]
            b = bins_value[i + 1][2:]
            chi_squares.append(stats.chi2_contingency([a, b])[0])

        # merge the bin with the minimum chi-square into the following bin
        i = chi_squares.index(min(chi_squares))
        bins_value[i:i + 2] = [merge_pair(bins_value[i], bins_value[i + 1])]

        # print bin number and iv
        df_woe = woe_value(bins_value)
        print("Bins: {}, IV: {:.6f}".format(len(bins_value), iv_value(df_woe)))

    # return bins and woe information
    return woe_value(bins_value)
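To make the merge criterion concrete, here is a minimal sketch of the chi-square value for one pair of adjacent bins. The counts in `bin_a` and `bin_b` are made up for illustration, and `chi2_manual` is a hypothetical helper that is not part of the code above. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the default call used in `chi_bin` gives a slightly smaller value than the plain Pearson formula.

```python
import numpy as np
from scipy import stats

# Hypothetical (count_0, count_1) pairs for two adjacent bins.
bin_a = (90, 10)
bin_b = (60, 40)


def chi2_manual(a, b):
    """Pearson chi-square for a 2x2 table, without continuity correction."""
    observed = np.array([a, b], dtype=float)
    row = observed.sum(axis=1, keepdims=True)   # totals per bin
    col = observed.sum(axis=0, keepdims=True)   # totals per label
    expected = row * col / observed.sum()       # counts expected under independence
    return float(((observed - expected) ** 2 / expected).sum())


manual = chi2_manual(bin_a, bin_b)
# correction=False disables Yates' correction, matching the manual formula.
scipy_plain = stats.chi2_contingency([bin_a, bin_b], correction=False)[0]
# Default call (Yates' correction is applied to 2x2 tables).
scipy_default = stats.chi2_contingency([bin_a, bin_b])[0]

print(manual, scipy_plain, scipy_default)
```

A small chi-square means the two bins have similar label distributions, which is why the pair with the smallest value is merged first.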

Here is the result:

Start with 10 initial bins and merge down to 3:

chi_bin(data, "age", "seriousdlqin2yrs", binnum=3, maxcut=10)

Bins: 8, IV: 0.184862
Bins: 7, IV: 0.184128
Bins: 6, IV: 0.179518
Bins: 5, IV: 0.176980
Bins: 4, IV: 0.172406
Bins: 3, IV: 0.160015

    min   max  count_0  count_1  total  bad_rate       woe
0   0.0  52.0    70293     7077  77370  0.091470 -0.266233
1  52.0  61.0    29318     1774  31092  0.057056  0.242909
2  61.0  72.0    26332      865  27197  0.031805  0.853755
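As a cross-check on how the `bad_rate`, `woe` and IV figures are derived, the sketch below recomputes them from the counts in the three-bin table above. The variable names (`table`, `good_share`, `bad_share`) are chosen here for illustration. The formulas are the ones used in `woe_value` and `iv_value`: WOE is the log of the share of 0 labels over the share of 1 labels in a bin, and IV sums the share difference times WOE.

```python
import numpy as np
import pandas as pd

# Counts copied from the final three-bin table.
table = pd.DataFrame({
    "min": [0.0, 52.0, 61.0],
    "max": [52.0, 61.0, 72.0],
    "count_0": [70293, 29318, 26332],
    "count_1": [7077, 1774, 865],
})

# Share of all 0 labels and of all 1 labels that falls into each bin.
good_share = table["count_0"] / table["count_0"].sum()
bad_share = table["count_1"] / table["count_1"].sum()

table["bad_rate"] = table["count_1"] / (table["count_0"] + table["count_1"])
table["woe"] = np.log(good_share / bad_share)
iv = float(((good_share - bad_share) * table["woe"]).sum())

print(table.round(6))
print("IV: {:.6f}".format(iv))
```

The recomputed columns should agree with the table up to rounding, and the IV should come out at about 0.160, in line with the last printed line of the run.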
