Machine Learning Feature Engineering: Dictionary Features and Text Feature Extraction

mysql: performance bottleneck is read speed

pandas: data-reading tool

numpy: releases the GIL

cpython: coroutines

sklearn

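Feature values + target values

Duplicate values do not need to be removed

Missing values need special handling

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the model, thereby improving accuracy on unseen data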

classification

regression

clustering

dimensionality reduction

model selection

preprocessing (feature engineering)

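Feature extraction turns data into feature values so that the computer can understand it better

text -> numbers

Install the libraries used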

pip install jieba scikit-learn numpy

Convert text into numerical values

sparse matrix: saves memory

ndarray: array

one-hot encoding: boolean flags mark each category feature
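To make the one-hot idea concrete, here is a minimal plain-Python sketch. The helper `one_hot` and its `prefix` parameter are hypothetical names made up for this illustration; this is not how scikit-learn implements it, only the same idea in a few lines.

```python
def one_hot(values, prefix):
    """Return (feature_names, rows): one 0/1 column per distinct value."""
    categories = sorted(set(values))
    feature_names = [f"{prefix}={c}" for c in categories]
    # Each row has a single 1 in the column of its own category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return feature_names, rows


cities = ["北京", "上海", "深圳"]
names, encoded = one_hot(cities, prefix="city")
print(names)    # ['city=上海', 'city=北京', 'city=深圳']
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each category becomes its own column, so no artificial ordering between cities is implied.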

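import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Dictionary feature extraction
# Sample records, reconstructed from the printed output below
data = [
    {"city": "北京", "price": 2000},
    {"city": "上海", "price": 1500},
    {"city": "深圳", "price": 1000},
]

# sparse=False returns a dense ndarray; sparse=True returns a scipy sparse matrix
dict_vectorizer = DictVectorizer(dtype=np.int32, sparse=False)
result = dict_vectorizer.fit_transform(data)

# Older scikit-learn versions use get_feature_names() instead
print(dict_vectorizer.get_feature_names_out().tolist())
print(dict_vectorizer.inverse_transform(result))
print(result)

"""
['city=上海', 'city=北京', 'city=深圳', 'price']

[{'city=北京': 1, 'price': 2000},
 {'city=上海': 1, 'price': 1500},
 {'city=深圳': 1, 'price': 1000}]

sparse=True
(0, 1) 1.0
(0, 3) 2000.0
(1, 0) 1.0
(1, 3) 1500.0
(2, 2) 1.0
(2, 3) 1000.0

sparse=False
[[   0    1    0 2000]
 [   1    0    0 1500]
 [   0    0    1 1000]]
"""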
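The two outputs above hold the same information: the sparse form lists only the non-zero cells as (row, column) value entries. The sketch below, in plain Python, rebuilds the dense matrix from those entries; `to_dense` is a hypothetical helper written for this note, not a scipy or scikit-learn function.

```python
# (row, column, value) entries copied from the sparse=True output above
triples = [
    (0, 1, 1.0), (0, 3, 2000.0),
    (1, 0, 1.0), (1, 3, 1500.0),
    (2, 2, 1.0), (2, 3, 1000.0),
]


def to_dense(entries, n_rows, n_cols):
    """Fill a zero matrix with the stored non-zero entries."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for row, col, value in entries:
        dense[row][col] = value
    return dense


dense = to_dense(triples, n_rows=3, n_cols=4)
stored = len(triples)   # cells kept by the sparse form
total = 3 * 4           # cells kept by the dense form
print(dense)
print(f"{stored} of {total} cells stored")
```

With only three cities the saving is small, but with thousands of one-hot columns almost every cell is zero, which is why the sparse form saves memory.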

count: word list + occurrence counts

Used for text classification and sentiment analysis

Single-character tokens are not counted
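Single characters are skipped because of the default token pattern of `CountVectorizer`, which to my knowledge is `(?u)\b\w\w+\b`, i.e. tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module to already-segmented sentences and counts the surviving tokens; it imitates the behaviour and is not scikit-learn's own code.

```python
import re
from collections import Counter

# Assumed default token pattern of CountVectorizer: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Sentences already segmented into space-separated words
docs = [
    "今天 的 天氣 很 好",
    "明天 我要 去 逛街",
    "後天 天氣 好 我 還 去 好 天氣 逛街",
]

tokenized = [TOKEN_PATTERN.findall(doc) for doc in docs]
counts = [Counter(tokens) for tokens in tokenized]

for tokens, count in zip(tokenized, counts):
    print(tokens, dict(count))
```

If single-character words matter for a task, `CountVectorizer` accepts a custom `token_pattern` argument.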

import logging

import jieba
from sklearn.feature_extraction.text import CountVectorizer

# Hide jieba's debug messages at start-up
jieba.setLogLevel(logging.INFO)


def count_vector():
    """
    Text feature extraction
    """
    words = [
        "今天的天氣很好",
        "明天我要去逛街",
        "後天天氣好我還去好天氣逛街",
    ]

    # Segment each sentence with jieba and join the words with spaces
    data = []
    for word in words:
        word_cut = jieba.cut(word)
        data.append(" ".join(word_cut))
    print(data)

    cv = CountVectorizer()
    result = cv.fit_transform(data)
    # Older scikit-learn versions use get_feature_names() instead
    print(cv.get_feature_names_out().tolist())
    print(result.toarray())


count_vector()

"""
['今天 的 天氣 很 好',
 '明天 我要 去 逛街',
 '後天 天氣 好 我 還 去 好 天氣 逛街']

['今天', '後天', '天氣', '我要', '明天', '逛街']
[[1 0 1 0 0 0]
 [0 0 0 1 1 1]
 [0 1 2 0 0 1]]
"""

Assessing how important a word is

tf: term frequency, how often the term occurs

idf: inverse document frequency

$$\text{tfidf} = \frac{n}{N} \cdot \lg\left(\frac{D}{d}\right)$$

Where:

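n: number of times the term occurs in the document

N: total number of terms in the document

D: total number of documents

d: number of documents that contain the term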
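As a worked example of the formula, the following plain-Python sketch computes (n/N) * lg(D/d) on a tiny pre-segmented corpus. The function name `tf_idf` and the corpus layout are assumptions for illustration, and `lg` is taken to mean the base-10 logarithm.

```python
import math

# Each document is a list of already-segmented words
docs = [
    ["今天", "天氣", "逛街"],
    ["明天", "天氣", "逛街"],
    ["後天", "天氣", "吃飯"],
]


def tf_idf(term, doc, corpus):
    n = doc.count(term)                                  # occurrences in this document
    N = len(doc)                                         # total terms in this document
    D = len(corpus)                                      # total number of documents
    d = sum(1 for other in corpus if term in other)      # documents containing the term
    if n == 0 or d == 0:
        return 0.0
    return (n / N) * math.log10(D / d)


for term in docs[0]:
    print(term, f"{tf_idf(term, docs[0], docs):.4f}")
# 今天 0.1590
# 天氣 0.0000
# 逛街 0.0587
```

A word that appears in every document scores 0 here, while a word unique to one document scores highest.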

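Reference:

log(total number of documents N / number of documents containing the term n)

The smaller the input to log, the smaller its output

Naive Bayes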

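N >= n > 0

=> N/n >= 1

=> the input to log lies in [1, +infinity)

=> the corresponding log values lie in [0, +infinity)

=> with N fixed, the larger n is -> the smaller N/n -> the smaller log(N/n)

=> the more often a word occurs in a single document, the larger tf

=> the more documents a word appears in, the smaller idf

=> the more often a word occurs in one document and the fewer documents it appears in, the more important it is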
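The chain of implications above can be checked numerically. This small sketch fixes N at an arbitrary 1000 documents (an assumed figure) and lets n grow; log(N/n) falls to 0 when the term is in every document and is not capped at 1 when the term is rare.

```python
import math

N = 1000  # assumed total number of documents
doc_counts = [1, 10, 100, 1000]  # n: documents containing the term

idfs = [math.log10(N / n) for n in doc_counts]
for n, idf in zip(doc_counts, idfs):
    print(f"n={n:<5} log10(N/n)={idf:.1f}")
# n=1     log10(N/n)=3.0
# n=10    log10(N/n)=2.0
# n=100   log10(N/n)=1.0
# n=1000  log10(N/n)=0.0
```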

# -*- coding: utf-8 -*-
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents already segmented into space-separated words
data = [
    "今天 天氣 逛街",
    "明天 天氣 逛街",
    "後天 天氣 吃飯",
]

tf = TfidfVectorizer()
result = tf.fit_transform(data)
# Older scikit-learn versions use get_feature_names() instead
print(tf.get_feature_names_out().tolist())
print(result.toarray())

"""
['今天', '吃飯', '後天', '天氣', '明天', '逛街']
[[0.72033345 0.         0.         0.42544054 0.         0.54783215]
 [0.         0.         0.         0.42544054 0.72033345 0.54783215]
 [0.         0.65249088 0.65249088 0.38537163 0.         0.        ]]
"""
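The printed values do not follow the plain (n/N) * lg(D/d) formula. Assuming the `TfidfVectorizer` defaults as I understand them (raw counts as tf, smoothed idf = ln((1 + D) / (1 + d)) + 1, then L2 normalisation of each row), the numbers can be reproduced by hand. The helpers `smoothed_idf` and `tfidf_row` are hypothetical names for this sketch.

```python
import math

docs = [
    ["今天", "天氣", "逛街"],
    ["明天", "天氣", "逛街"],
    ["後天", "天氣", "吃飯"],
]


def smoothed_idf(term, corpus):
    """Assumed default: ln((1 + D) / (1 + d)) + 1, with natural log."""
    d = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + d)) + 1


def tfidf_row(doc, corpus):
    """Raw count times smoothed idf, scaled so the row has unit L2 norm."""
    raw = {t: doc.count(t) * smoothed_idf(t, corpus) for t in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


row = tfidf_row(docs[0], docs)
for term in docs[0]:
    print(term, f"{row[term]:.8f}")
# 今天 0.72033345
# 天氣 0.42544054
# 逛街 0.54783215
```

Because of the added 1, a word that appears in every document still gets a non-zero weight, unlike in the unsmoothed formula.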

As you can see:

In "今天天氣逛街", 今天 has the largest tf-idf value, 0.72033345

In "明天天氣逛街", 明天 has the largest tf-idf value, 0.72033345

In "後天天氣吃飯", 後天 and 吃飯 have the largest tf-idf value, 0.65249088

import logging

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Hide jieba's debug messages at start-up
jieba.setLogLevel(logging.INFO)


def tfidf_vector():
    words = [
        "今天的天氣很好",
        "明天我要去逛街",
        "後天天氣好我還去好天氣逛街",
    ]

    # Segment each sentence with jieba and join the words with spaces
    data = []
    for word in words:
        word_cut = jieba.cut(word)
        data.append(" ".join(word_cut))
    print(data)

    tf = TfidfVectorizer()
    result = tf.fit_transform(data)
    # Older scikit-learn versions use get_feature_names() instead
    print(tf.get_feature_names_out().tolist())
    print(result.toarray())


tfidf_vector()

"""
With single-character words removed:
['今天 天氣',
 '明天 我要 逛街',
 '後天 天氣 天氣 逛街']

['今天', '後天', '天氣', '我要', '明天', '逛街']
[[0.79596054 0.         0.60534851 0.         0.         0.        ]
 [0.         0.         0.         0.62276601 0.62276601 0.4736296 ]
 [0.         0.50689001 0.77100584 0.         0.         0.38550292]]
"""
