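Feature extraction and numeric conversion with sklearn

A brief summary of the feature extraction methods I currently use most often.

1. Converting text data into feature vectors (CountVectorizer only considers how often each term occurs in the text)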

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# x_train starts out as an iterable of raw text documents
word_vectorizer = CountVectorizer(ngram_range=(1, 2))  # count unigrams and bigrams
x_train = word_vectorizer.fit_transform(x_train)  # sparse matrix of term counts
word_transformer = TfidfTransformer()
train_feature = word_transformer.fit_transform(x_train)  # counts reweighted by TF-IDF
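
To see what the two steps above produce, here is a minimal sketch on a made-up three-document corpus. The corpus and the variable names are assumptions for illustration only; the defaults (tokens of two or more characters, L2-normalised rows) are those of CountVectorizer and TfidfTransformer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Step 1: count unigrams and bigrams in each document.
word_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = word_vectorizer.fit_transform(corpus)  # sparse matrix of raw counts

# Step 2: reweight the raw counts by TF-IDF.
word_transformer = TfidfTransformer()
tfidf = word_transformer.fit_transform(counts)

print(counts.shape)                             # (documents, distinct n-grams)
print(sorted(word_vectorizer.vocabulary_)[:5])  # a few of the learned n-grams
print(tfidf.toarray().round(2)[0])              # TF-IDF weights of the first document
```

Both matrices have one row per document and one column per n-gram; only the cell values differ, since TF-IDF down-weights n-grams that appear in many documents.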

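2. Converting symbolic features to numbers (DictVectorizer works on features that are symbolic rather than numeric but still structured, such as dicts or DataFrame rows. Each string-valued feature is expanded into 0/1 indicator columns, and numeric values pass through unchanged.)

The naive approach is to convert the values by hand with a dictionary of key-value pairs, enumerating every possible value.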
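
The hand-written mapping might look like the sketch below. The column names, values, and lookup tables are hypothetical; the point is only that every category has to be listed explicitly.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# One hand-written lookup table per column (the "enumerate everything" approach).
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

df["sex"] = df["sex"].map(sex_codes)
df["embarked"] = df["embarked"].map(embarked_codes)
print(df)
```

Besides being tedious, integer codes such as 0/1/2 suggest an ordering between categories that usually does not exist, which is one reason to prefer 0/1 indicator columns.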

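from sklearn.feature_extraction import DictVectorizer

# x_train and x_test are pandas DataFrames
dict_vec = DictVectorizer(sparse=False)  # sparse=False: return a dense array, not a sparse matrix
x_train = dict_vec.fit_transform(x_train.to_dict(orient='records'))
x_test = dict_vec.transform(x_test.to_dict(orient='records'))  # reuse the columns learned from x_train
print(dict_vec.feature_names_)  # inspect the column names after conversion
print(x_train)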
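
A small end-to-end sketch of the same calls, assuming two made-up DataFrames with one string column and one numeric column. The data and the names train, test, train_encoded, and test_encoded are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data, used only for illustration.
train = pd.DataFrame({"city": ["Beijing", "Tianjin", "Shanghai"], "age": [30, 25, 41]})
test = pd.DataFrame({"city": ["Shanghai", "Chongqing"], "age": [35, 28]})

dict_vec = DictVectorizer(sparse=False)

# Each row becomes a dict such as {"city": "Beijing", "age": 30}.
train_encoded = dict_vec.fit_transform(train.to_dict(orient="records"))
test_encoded = dict_vec.transform(test.to_dict(orient="records"))

print(dict_vec.feature_names_)  # one column per numeric feature and per (feature=value) pair
print(train_encoded)
print(test_encoded)
```

The string column becomes one indicator column per city seen during fit_transform, while age is copied through as a number. A city that appears only in the test data ("Chongqing" here) has no column, so its row gets zeros in every city column.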

For reference, sklearn.feature_extraction exports:

__all__ = ['DictVectorizer', 'image', 'img_to_graph', 'grid_to_graph', 'text',
'FeatureHasher']

sklearn.feature_extraction.text:

__all__ = ['CountVectorizer', 'ENGLISH_STOP_WORDS', 'TfidfTransformer', 'TfidfVectorizer', 'strip_accents_ascii', 'strip_accents_unicode',
'strip_tags']
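
Among the names listed, TfidfVectorizer combines CountVectorizer and TfidfTransformer into a single estimator. The sketch below checks this on a made-up corpus; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Two-step version: raw counts first, then TF-IDF reweighting.
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step version with the same n-gram range.
one_step = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus)

# Compare the two results element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step form is still useful when the raw counts are needed for something else as well.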

P.S. Reading the source code directly is the clearest way to see all of this.
