Python Pitfalls Guide (Season 2)

2021-09-26 21:35:49

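This installment covers a real problem I ran into with jieba. A single service exposed two endpoints, A and B, both of which used jieba for word segmentation but needed different dictionaries. By coincidence, the dictionaries contained the following entries: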

Dictionary A: "干拌面"

Dictionary B: "干拌", "面"

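At service startup, dictionary A happened to be loaded first, and when dictionary B was loaded afterwards it turned out not to have taken effect: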

In endpoint A:

jieba.load_userdict("a.txt")
In endpoint B:

jieba.load_userdict("b.txt")
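As a result, endpoint B still did not split the word 干拌面 as its dictionary intended. In fact, whenever jieba is loaded it logs the following info messages, which are worth a look: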

Building prefix dict from the default dictionary ...

Dumping model to file cache /var/folders/hv/kfb7n4lj06590hqxjv6f3dd00000gn/t/jieba.cache

Loading model cost 0.824 seconds.

Prefix dict has been built succesfully.

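As the log shows, jieba first builds a prefix dict, dumps it to a file cache, and serves later loads from there. That prefix dict belongs to the single module-level default tokenizer, and jieba.load_userdict adds entries to it rather than replacing it, so the entries from a.txt and b.txt end up merged. That is what causes the problem above.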
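The effect of sharing one tokenizer can be reproduced with a toy model. The sketch below is not jieba's code: ToyTokenizer is a hypothetical greedy longest-match segmenter, used only to show that loading a second dictionary into the same instance adds to the first one instead of replacing it.

```python
# Toy model (NOT jieba's real algorithm) of one shared tokenizer versus separate ones.

class ToyTokenizer:
    """Greedy longest-match segmenter over a set of known words."""

    def __init__(self):
        self.words = set()

    def load_userdict(self, entries):
        # Entries are added to whatever is already there; nothing is replaced.
        self.words.update(entries)

    def lcut(self, text):
        tokens, i = [], 0
        while i < len(text):
            # Try the longest candidate first, fall back to a single character.
            for j in range(len(text), i, -1):
                if text[i:j] in self.words or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        return tokens


# One shared instance, analogous to calling the module-level load_userdict twice.
shared = ToyTokenizer()
shared.load_userdict(["干拌面"])       # endpoint A's dictionary
shared.load_userdict(["干拌", "面"])   # endpoint B's dictionary
shared_result = shared.lcut("干拌面")

# One instance per endpoint keeps the vocabularies apart.
only_b = ToyTokenizer()
only_b.load_userdict(["干拌", "面"])
separate_result = only_b.lcut("干拌面")

print(shared_result)    # ['干拌面']
print(separate_result)  # ['干拌', '面']
```

In the shared instance the longer entry from dictionary A still matches, which mirrors what endpoint B observed.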

This is how I handled it:

In endpoint A:

jieba1 = jieba.Tokenizer(dictionary="a.txt")
In endpoint B:

jieba2 = jieba.Tokenizer(dictionary="b.txt")
An example session:

In [1]: import jieba

In [2]: jieba1 = jieba.Tokenizer(dictionary="a.txt")

In [3]: jieba2 = jieba.Tokenizer(dictionary="b.txt")

In [4]: jieba1.lcut("干拌面")
Building prefix dict from /users/slade/desktop/a.txt ...
Dumping model to file cache /var/folders/hv/kfb7n4lj06590hqxjv6f3dd00000gn/t/jieba.u5221c1b70f06b36e44bc519f39715c96.cache
Loading model cost 0.006 seconds.
Prefix dict has been built succesfully.
Out[4]: ['干拌面']

In [5]: jieba2.lcut("干拌面")
Building prefix dict from /users/slade/desktop/b.txt ...
Dumping model to file cache /var/folders/hv/kfb7n4lj06590hqxjv6f3dd00000gn/t/jieba.uc4f38d90bf7ce748744ff94fb2863fe4.cache
Loading model cost 0.003 seconds.
Prefix dict has been built succesfully.
Out[5]: ['干拌', '面']
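In a long-running service the tokenizers would normally be created once and reused by each endpoint. The sketch below is one possible arrangement, not part of the original setup: TokenizerRegistry and _jieba_tokenizer are hypothetical names, and the only jieba call relied on is jieba.Tokenizer(dictionary=...), as in the session above.

```python
def _jieba_tokenizer(dictionary_path):
    """Default factory: build a jieba tokenizer for one dictionary file."""
    import jieba  # deferred so the registry itself has no hard dependency on jieba
    return jieba.Tokenizer(dictionary=dictionary_path)


class TokenizerRegistry:
    """Caches one tokenizer per dictionary path so endpoints never share state."""

    def __init__(self, factory=None):
        # The factory is injectable, which also makes the registry easy to test.
        self._factory = factory or _jieba_tokenizer
        self._cache = {}

    def get(self, dictionary_path):
        # Build each tokenizer on first use, then hand back the same instance.
        if dictionary_path not in self._cache:
            self._cache[dictionary_path] = self._factory(dictionary_path)
        return self._cache[dictionary_path]


registry = TokenizerRegistry()
# With jieba installed and the dictionary files in place:
#   registry.get("a.txt").lcut("干拌面")   # endpoint A
#   registry.get("b.txt").lcut("干拌面")   # endpoint B
```

Note that passing dictionary= replaces the main dictionary instead of extending it. If the default vocabulary is still wanted, the other option is to create jieba.Tokenizer() with no arguments and call load_userdict on that instance.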

One thing to note: the Tokenizer source contains the following dictionary-reading routine:

def gen_pfdict(self, f):
    lfreq = {}
    ltotal = 0
    f_name = resolve_filename(f)
    for lineno, line in enumerate(f, 1):
        try:
            line = line.strip().decode('utf-8')
            word, freq = line.split(' ')[:2]
            freq = int(freq)
            lfreq[word] = freq
            ltotal += freq
            for ch in xrange(len(word)):
                wfrag = word[:ch + 1]
                if wfrag not in lfreq:
                    lfreq[wfrag] = 0
        except ValueError:
            raise ValueError(
                'invalid dictionary entry in %s at line %s: %s' % (f_name, lineno, line))
    f.close()
    return lfreq, ltotal

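With load_userdict, the word frequency in a dictionary entry may be omitted, but the line word, freq = line.split(' ')[:2] means that a dictionary passed to Tokenizer must include it. This depends on the version, and I have not tested different versions.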
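To see why a missing frequency fails, the loop can be re-created outside jieba. This is a Python 3 sketch that follows the quoted source but is not the library's code: build_prefix_dict and source_name are hypothetical names, and it takes already decoded text lines instead of a file object.

```python
def build_prefix_dict(lines, source_name="<memory>"):
    """Python 3 sketch of the quoted loop; takes decoded text lines."""
    lfreq = {}
    ltotal = 0
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        try:
            # Unpacking fails with ValueError when the frequency is missing.
            word, freq = line.split(" ")[:2]
            freq = int(freq)
        except ValueError:
            raise ValueError(
                "invalid dictionary entry in %s at line %s: %s"
                % (source_name, lineno, line)
            ) from None
        lfreq[word] = freq
        ltotal += freq
        # Register every prefix of the word with frequency 0 unless already known.
        for end in range(1, len(word) + 1):
            lfreq.setdefault(word[:end], 0)
    return lfreq, ltotal


lfreq, ltotal = build_prefix_dict(["干拌 1", "面 1"])
print(lfreq)   # {'干拌': 1, '干': 0, '面': 1}
print(ltotal)  # 2

try:
    build_prefix_dict(["干拌面"])  # no frequency column
except ValueError as exc:
    print(exc)  # invalid dictionary entry in <memory> at line 1: 干拌面
```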

a.txt:

干拌面 1
b.txt:

干拌 1

面 1
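If an existing user dictionary omits frequencies, its lines need one before the file can be passed as dictionary=. A minimal sketch of such a normalisation step follows. normalize_entry is a hypothetical helper, and the default frequency of 1 is an arbitrary assumption that simply matches the sample files above.

```python
def normalize_entry(line, default_freq=1):
    """Return a 'word freq [tag]' line, inserting default_freq when it is missing."""
    parts = line.split()
    if not parts:
        return None  # blank line, nothing to keep
    word, rest = parts[0], parts[1:]
    if rest and rest[0].isdigit():
        return " ".join(parts)  # frequency already present
    # No frequency: insert the assumed default before any remaining field (e.g. a tag).
    return " ".join([word, str(default_freq)] + rest)


user_lines = ["干拌面", "干拌 5", "面 n", ""]
normalized = [e for e in (normalize_entry(l) for l in user_lines) if e is not None]
print(normalized)  # ['干拌面 1', '干拌 5', '面 1 n']
```

The resulting lines can then be written to a UTF-8 text file, one entry per line.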
