Tokenization with NLTK

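This post records the main errors I ran into while writing an NLTK tokenization project in Python, and how I fixed each of them.

The project uses NLTK to fetch text from a database, remove stop words from it, and write the result back to the database.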

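NLTK (Natural Language Toolkit) is one of the most commonly used Python libraries in the NLP field.

NLTK is an open-source project that includes Python modules, datasets, and tutorials for NLP research and development [1].

NLTK was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.

NLTK includes graphical demonstrations and sample data. Its tutorials explain the basic concepts behind the language processing tasks the toolkit supports.

In this post it is mainly used to remove stop words from text.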
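To make the stop-word step concrete, here is a minimal sketch of the filtering on its own. It assumes the tokens have already been produced (in this post by word_tokenize) and uses a tiny hand-written stop-word set as a stand-in for stopwords.words('english'). The function name remove_stop_words and the sample data are illustrative only. Comparing in lowercase is a choice made for this sketch, since NLTK's English stop-word list is lowercase.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Stand-in for set(stopwords.words('english')); not the real NLTK list.
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

tokens = ["The", "cat", "is", "in", "the", "garden"]
filtered = remove_stop_words(tokens, SAMPLE_STOP_WORDS)
print(filtered)
```

Keeping the stop words in a set makes each membership check cheap, which matters when every token of every article is tested.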

The main packages used are nltk and pandas, which can be installed with the following commands:

pip install nltk
pip install pandas

import pymysql
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Connect to the local MySQL database.
con = pymysql.connect(
    host='localhost',
    port=3306,
    user='root',
    passwd='123',
    db='nce',
    charset='utf8',
)


def insert(con, frequent, l):
    """Write the filtered tokens into the row whose a_id is l."""
    cue = con.cursor()
    # print("mysql conneted")
    try:
        print(str(frequent))
        print(l)
        cue.execute(
            "update article set frequent=(%s) where a_id=(%s)",
            [str(frequent), l],
        )
        print("insert success")
    except Exception as e:
        print('insert error:', e)
        con.rollback()
    else:
        con.commit()
    cue.close()


def read():
    """Read every article, remove stop words, and store the result."""
    cue = con.cursor()
    query = """select text
    from article
    """
    stop_words = set(stopwords.words('english'))
    cue.execute(query)
    result = cue.fetchall()
    df_resulet = pd.DataFrame(list(result))
    for l in df_resulet.index:
        # Cast the row values to a string before tokenizing.
        text = str(df_resulet.loc[l].values)
        word_tokens = word_tokenize(text[1:-1])
        filtered_sentence = [w for w in word_tokens if w not in stop_words]
        # print(filtered_sentence[1:-1])
        insert(con, filtered_sentence[1:-1], l + 1)


read()
con.close()

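df_resulet = pd.DataFrame(list(result))
for l in df_resulet.index:
    text = df_resulet.loc[l].values

The error is as follows:

TypeError: cannot use a string pattern on a bytes-like object

Fix: simply cast the value to a string. Note that str() on the array returns its printed form, brackets included, which is why the full script slices text[1:-1] before tokenizing.

text = str(df_resulet.loc[l].values)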
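An alternative that sidesteps the array-to-string cast, offered only as a sketch and not the approach this post's script takes: cursor.fetchall() returns a sequence of row tuples, so with a single selected column each row can be unpacked directly. The helper name iter_texts and the sample rows are made up for illustration.

```python
def iter_texts(rows):
    """Yield (row_number, text) for rows shaped like fetchall() output.

    Each row is expected to be a 1-tuple such as ("some text",).
    Row numbers start at 1, mirroring the l + 1 the script passes as a_id.
    """
    for number, (text,) in enumerate(rows, start=1):
        # Skip NULL columns, which the driver returns as None.
        if text is None:
            continue
        yield number, str(text)


# Hypothetical sample standing in for cue.fetchall().
sample_rows = (("Hello world",), (None,), ("NLTK is fun",))
texts = list(iter_texts(sample_rows))
print(texts)
```

Because each text arrives as a plain str, there are no surrounding brackets or quotes to slice off before tokenizing.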

The code was as follows (example):

def insert(con, frequent, l):
    cue = con.cursor()
    # print("mysql conneted")
    try:
        # print(frequent)
        cue.execute(
            "insert into article (frequent) values(%s)",
            [frequent],
        )
        print("insert success")
    except Exception as e:
        print('insert error:', e)
        con.rollback()
    else:
        con.commit()

insert error: (1241, 'Operand should contain 1 column(s)')

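The error says that the inserted operand should contain exactly one column; in other words, the value I was inserting contained more than one column.

Solution:

The value I passed in is the variable built in read(), which is a list of tokens. When it is bound into the SQL statement it is treated as a sequence, so the single placeholder expands into as many values as the list has elements, and that naturally cannot be inserted into one column. Casting it to a string once more in the statement is enough.

The modified code is as follows: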

cue.execute(
    "insert into article (frequent) values(%s)",
    str(frequent),
)
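str(frequent) stores the Python repr of the list, which is awkward to parse back later. As a hedged alternative (assuming the frequent column is a text type, which this post does not state), the token list could be serialized to one JSON string so that it still binds to a single placeholder. The helper name tokens_to_param is hypothetical.

```python
import json


def tokens_to_param(tokens):
    """Pack a list of tokens into a single JSON string for one %s placeholder."""
    # ensure_ascii=False keeps any non-ASCII tokens readable in the database.
    return json.dumps(tokens, ensure_ascii=False)


param = tokens_to_param(["cat", "garden"])
print(param)

# The original list can be recovered later with json.loads.
restored = json.loads(param)
print(restored)
```

The resulting string could be passed to cue.execute in a one-element list, for example [param], in place of str(frequent).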

try:
    cue.execute(
        "update article set frequent=(%s) where a_id=(%s)",
        [str(frequent), l],
    )
    print("insert success")

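Lock wait timeout exceeded; try restarting transaction

Cause:

The UPDATE statement was blocked by a lock that another transaction was still holding, so each update waited 50 seconds (the default innodb_lock_wait_timeout) and then failed. The fix is straightforward.

Check whether any statement has been running for an unusually long time, then look at InnoDB's transaction table innodb_trx (in information_schema) to see whether a transaction thread is still holding locks. Check whether its id appears among the Sleep threads in the output of show full processlist. If it does, that sleeping thread's transaction was never committed or rolled back and is stuck, so kill it directly.

Since I had already killed mine, nothing is shown here anymore. If there is one, kill it using the value in the trx_mysql_thread_id column: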

kill ******
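The lookup described above can be sketched in Python as follows. It assumes a DB-API cursor such as the one returned by con.cursor() and enough privileges to read information_schema.innodb_trx; the function names are illustrative. The sketch only builds KILL statements for review and does not execute them, and the FakeCursor rows are made up so that the example runs without a MySQL server.

```python
TRX_QUERY = (
    "SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id "
    "FROM information_schema.innodb_trx "
    "ORDER BY trx_started"
)


def find_open_transactions(cursor):
    """Run the innodb_trx lookup and return the rows as a list of tuples."""
    cursor.execute(TRX_QUERY)
    return list(cursor.fetchall())


def kill_statements(rows):
    """Build one KILL statement per row from trx_mysql_thread_id (4th column)."""
    return ["KILL %d" % int(row[3]) for row in rows]


class FakeCursor:
    """Stand-in cursor with made-up rows so the sketch runs without MySQL."""

    def execute(self, query):
        self.query = query

    def fetchall(self):
        return (("421", "RUNNING", "2021-01-01 10:00:00", 17),)


rows = find_open_transactions(FakeCursor())
statements = kill_statements(rows)
print(statements)
```

Ordering by trx_started puts the oldest open transaction first, which is usually the one worth inspecting before anything is killed.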
