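Dictionary feature extraction and text feature extraction

What is dictionary feature extraction?

It converts the contents of dictionaries into numeric values that a computer can process.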

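For example, suppose we have the following list of dictionaries (the entries were lost from the original listing; the values here are reconstructed from the result printed further down):

data = [{'age': 11, 'city': '北京', 'tempeture': 100},
        {'age': 12, 'city': '上海', 'tempeture': 60},
        {'age': 13, 'city': '深圳', 'tempeture': 20}]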

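These are the basic personal details of 老三, 老四, and 老五, respectively.

We want to extract feature values for these people, that is, to represent each unique person's features with a single vector.

Here is the code first: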

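# coding=utf-8
# Feature extraction
# First import the transformer class
from sklearn.feature_extraction import DictVectorizer


def dec_demo():
    # Entries reconstructed from the printed result below
    data = [{'age': 11, 'city': '北京', 'tempeture': 100},
            {'age': 12, 'city': '上海', 'tempeture': 60},
            {'age': 13, 'city': '深圳', 'tempeture': 20}]
    # 1. Instantiate the transformer; sparse=False returns a dense array
    dt = DictVectorizer(sparse=False)
    # 2. Call fit_transform()
    result = dt.fit_transform(data)
    print(result)
    return None


if __name__ == "__main__":
    dec_demo()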

The result is:

[[ 11.   0.   1.   0. 100.]
 [ 12.   1.   0.   0.  60.]
 [ 13.   0.   0.   1.  20.]]

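The feature names are: ['age', 'city=上海', 'city=北京', 'city=深圳', 'tempeture']

Numeric values such as age are kept as they are, while each distinct city string gets its own 0/1 column. This is what is called dictionary feature extraction.

By analogy, text feature extraction works the same way: it represents the data as a vector of numbers.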
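To make the mapping from dictionaries to rows concrete, here is a minimal pure-Python sketch of the same idea. It is not how scikit-learn implements `DictVectorizer`; the function name `dict_vectorize` and the sample records are assumptions chosen to be consistent with the result shown above.

```python
def dict_vectorize(records):
    """One-hot encode string values and pass numeric values through."""
    # Collect feature names: "key=value" for strings, the plain key for numbers.
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    # Build one dense row per record.
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return feature_names, rows


# Assumed sample records, consistent with the result shown above.
people = [
    {"age": 11, "city": "北京", "tempeture": 100},
    {"age": 12, "city": "上海", "tempeture": 60},
    {"age": 13, "city": "深圳", "tempeture": 20},
]
feature_names, rows = dict_vectorize(people)
print(feature_names)
for row in rows:
    print(row)
```

Each city becomes its own column, so no artificial ordering is imposed on the categories.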

# coding=utf-8
from sklearn.feature_extraction.text import CountVectorizer


def context_demo():
    '''Text feature extraction'''
    context = ["life is hard,we need to envisage ourselves, life is"]
    # Instantiate a text transformer
    cv = CountVectorizer()
    # Call fit_transform(); the result is a sparse matrix
    result = cv.fit_transform(context)
    print(result.toarray())
    print("\n")
    # get_feature_names() was removed in scikit-learn 1.2;
    # use get_feature_names_out() instead
    print(cv.get_feature_names_out().tolist())


if __name__ == "__main__":
    context_demo()

The result is:

[[1 1 2 2 1 1 1 1]]
['envisage', 'hard', 'is', 'life', 'need', 'ourselves', 'to', 'we']
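The counts above can be reproduced by hand. The following is a rough pure-Python sketch of word counting, assuming tokens are lowercased runs of two or more word characters (which matches the result shown here); the helper name `count_words` is hypothetical and this is not scikit-learn's implementation.

```python
import re
from collections import Counter


def count_words(documents):
    """Return a sorted vocabulary and one row of word counts per document."""
    # Assumed tokenization: lowercase, then runs of two or more word characters.
    token_pattern = re.compile(r"\b\w\w+\b")
    tokenized = [token_pattern.findall(doc.lower()) for doc in documents]

    # The vocabulary is every distinct token, in sorted order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    counts = []
    for tokens in tokenized:
        counter = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        counts.append([counter[word] for word in vocabulary])
    return vocabulary, counts


vocabulary, counts = count_words(
    ["life is hard,we need to envisage ourselves, life is"]
)
print(counts)
print(vocabulary)
```

Each column corresponds to one vocabulary word, and each row holds that word's number of occurrences in one document.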

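Please note:

Unlike DictVectorizer, CountVectorizer does not accept sparse=False.

Instead, call the toarray() method on the result returned by fit_transform(), which is a sparse matrix, to get a dense array.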
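As a rough illustration of what converting a sparse result to a dense array means, the sketch below stores only the non-zero counts and then fills in the zeros. The dictionary-of-coordinates layout, the helper name `to_dense`, and the sample values are assumptions for illustration only, not the format SciPy actually uses.

```python
def to_dense(entries, shape):
    """Expand {(row, col): value} entries into a full list-of-lists matrix."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense


# Hypothetical counts for two short documents over a six-word vocabulary:
# only the four non-zero cells are stored.
sparse_counts = {(0, 1): 2, (0, 4): 1, (1, 0): 1, (1, 5): 3}
dense = to_dense(sparse_counts, shape=(2, 6))
for row in dense:
    print(row)
```

With a large vocabulary most cells are zero, which is why a sparse result is returned and the dense form is produced only on request.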
