Spark MLlib Source Code Analysis: TF-IDF Source Code Explained

The following is my analysis based on Spark MLlib (version 1.6).

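1. HashingTF hashes each term to an index and counts term frequencies (TF) in a hash map, where the key is the term's index and the value is the term's frequency. When constructing HashingTF you set numFeatures, which is in effect the size of the hash space (the dimension of the output vector). If numFeatures is too small, different terms collide on the same index and their counts are merged, so do not set it too small; a value between 300,000 and 500,000 is generally reasonable.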

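2. IDF counts, for each term, the number of documents in which it appears (the document frequency, df) and takes a logarithm: idf = log((m + 1) / (df + 1)), where m is the total number of documents. When constructing IDF you can set minDocFreq, which filters out terms that appear in fewer than minDocFreq documents by leaving their IDF at 0.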

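3. IDFModel mainly computes TF * IDF. It also holds the IDF vector, so the IDF data can be saved (that is, the model can be persisted). When processing test data, you only need to compute the term frequency (TF) of each term within a document to obtain its TF-IDF.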
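
The three steps above can be sketched without Spark. The following is a minimal pure-Scala illustration, not MLlib's code: `TfIdfSketch` and its method names are hypothetical, documents are plain `Seq[String]`, vectors are ordinary maps and arrays, and the IDF uses the smoothed form log((m + 1) / (df + 1)).

```scala
// Minimal pure-Scala sketch of the HashingTF -> IDF -> IDFModel steps.
// Hypothetical names; this is an illustration, not the Spark implementation.
object TfIdfSketch {

  // Non-negative modulo, so negative hash codes still map into [0, numFeatures).
  def indexOf(term: Any, numFeatures: Int): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Step 1: term frequencies of one document (key = index, value = count).
  // Terms that hash to the same index are counted together (a collision).
  def termFrequencies(doc: Seq[String], numFeatures: Int): Map[Int, Double] =
    doc.groupBy(indexOf(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }

  // Step 2: idf(j) = log((m + 1) / (df(j) + 1)), where m is the number of
  // documents and df(j) the number of documents containing index j.
  // Indices seen in fewer than minDocFreq documents keep an IDF of 0.
  def idf(tfs: Seq[Map[Int, Double]], numFeatures: Int, minDocFreq: Int): Array[Double] = {
    val m = tfs.size
    val df = new Array[Long](numFeatures)
    tfs.foreach(_.keys.foreach(i => df(i) += 1L))
    Array.tabulate(numFeatures) { j =>
      if (df(j) >= minDocFreq) math.log((m + 1.0) / (df(j) + 1.0)) else 0.0
    }
  }

  // Step 3: multiply each term frequency by the IDF stored at the same index.
  def tfIdf(tf: Map[Int, Double], idf: Array[Double]): Map[Int, Double] =
    tf.map { case (i, f) => i -> f * idf(i) }
}
```

Calling `TfIdfSketch.termFrequencies(doc, 1)` puts every term on index 0, which is the extreme case of the collision problem described in point 1. A term that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF is 0 regardless of how often it appears.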

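package org.apache.spark.mllib.feature

class HashingTF(val numFeatures: Int) extends Serializable {
  def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures) // map a term to its index
  // Count term frequencies for one document (key = index, value = count).
  def transform(document: Iterable[_]): Vector = {
    val termFrequencies = mutable.HashMap.empty[Int, Double]
    document.foreach { term =>
      val i = indexOf(term)
      termFrequencies.put(i, termFrequencies.getOrElse(i, 0.0) + 1.0)
    }
    Vectors.sparse(numFeatures, termFrequencies.toSeq)
  }
  def transform[D <: Iterable[_]](dataset: RDD[D]): RDD[Vector] = dataset.map(this.transform)
}

class IDF(val minDocFreq: Int) {
  // Aggregate document frequencies over the corpus and wrap the IDF vector in a model.
  def fit(dataset: RDD[Vector]): IDFModel = {
    val idf = dataset.treeAggregate(
      new IDF.DocumentFrequencyAggregator(minDocFreq = minDocFreq))(
      seqOp = (df, v) => df.add(v),
      combOp = (df1, df2) => df1.merge(df2)
    ).idf()
    new IDFModel(idf)
  }
}

private object IDF {
  class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable {
    private var m = 0L             // number of documents seen
    private var df: BDV[Long] = _  // per-index document frequency (BDV = Breeze DenseVector)
    /** Adds a new document. */
    def add(doc: Vector): this.type = {
      if (isEmpty) df = BDV.zeros(doc.size)
      doc match {
        case SparseVector(size, indices, values) =>
          val nnz = indices.size
          var k = 0
          while (k < nnz) {
            if (values(k) > 0) df(indices(k)) += 1L
            k += 1
          }
        case DenseVector(values) =>
          val n = values.size
          var j = 0
          while (j < n) {
            if (values(j) > 0.0) df(j) += 1L
            j += 1
          }
        case other =>
          throw new UnsupportedOperationException(
            s"Only sparse and dense vectors are supported but got ${other.getClass}.")
      }
      m += 1L
      this
    }
    /** Merges another. */
    def merge(other: DocumentFrequencyAggregator): this.type = {
      if (!other.isEmpty) {
        m += other.m
        if (df == null) df = other.df.copy else df += other.df
      }
      this
    }
    private def isEmpty: Boolean = m == 0L
    /** Returns the current IDF vector. */
    def idf(): Vector = {
      if (isEmpty) throw new IllegalStateException("Haven't seen any document yet.")
      val n = df.length
      val inv = new Array[Double](n)
      var j = 0
      while (j < n) {
        // Terms below minDocFreq keep the default IDF of 0.
        if (df(j) >= minDocFreq) inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
        j += 1
      }
      Vectors.dense(inv)
    }
  }
}

class IDFModel(val idf: Vector) extends Serializable {
  // Transform a single term-frequency vector into a TF-IDF vector.
  def transform(v: Vector): Vector = IDFModel.transform(idf, v)
}

private object IDFModel {
  // Multiply each term frequency by the IDF at the same index.
  def transform(idf: Vector, v: Vector): Vector = {
    val n = v.size
    v match {
      case SparseVector(size, indices, values) =>
        val nnz = indices.size
        val newValues = new Array[Double](nnz)
        var k = 0
        while (k < nnz) {
          newValues(k) = values(k) * idf(indices(k))
          k += 1
        }
        Vectors.sparse(n, indices, newValues)
      case DenseVector(values) =>
        val newValues = new Array[Double](n)
        var j = 0
        while (j < n) {
          newValues(j) = values(j) * idf(j)
          j += 1
        }
        Vectors.dense(newValues)
      case other =>
        throw new UnsupportedOperationException(
          s"Only sparse and dense vectors are supported but got ${other.getClass}.")
    }
  }
}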
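
The aggregator's add and merge methods are what allow document frequencies to be counted separately on parts of the data and then combined. A minimal sketch of that idea follows, under simplifying assumptions: plain arrays instead of Breeze vectors, and each document given as the set of term indices it contains. `DfAggregatorSketch` is a hypothetical name, not an MLlib class.

```scala
// Hypothetical, simplified stand-in for a document-frequency aggregator.
final class DfAggregatorSketch(numFeatures: Int) {
  private var m = 0L                             // documents seen so far
  private val df = new Array[Long](numFeatures)  // documents containing each index

  // Count one document: every index present in it is incremented once.
  def add(doc: Set[Int]): this.type = {
    doc.foreach(i => df(i) += 1L)
    m += 1L
    this
  }

  // Combine with the partial result computed on another part of the data.
  def merge(other: DfAggregatorSketch): this.type = {
    m += other.m
    var j = 0
    while (j < numFeatures) {
      df(j) += other.df(j)
      j += 1
    }
    this
  }

  def docCount: Long = m
  def docFreq: Vector[Long] = df.toVector
}
```

Because both the document count and the per-index counts are simple sums, aggregating two halves of a corpus and merging them gives the same counts as a single pass over the whole corpus.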
