NLTK Reading Notes and Practical Problem Log

Python version 3.4.2:

1. The example in the book is:

from nltk.corpus import wordnet as wn

wn.synset('car.n.01').lemma_names  # get the lemma names of the synset

wn.synset('car.n.01').definition  # get the definition

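Running this under 3.4.2 does not give the values shown in the book.

This is probably a version issue: in NLTK 3, lemma_names and definition are methods rather than attributes, so the expressions above evaluate to bound method objects. Adding () after the commands above fixes it, as follows: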

wn.synset('car.n.01').lemma_names()

wn.synset('car.n.01').definition()

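2. The book uses from urllib import urlopen, which fails with: ImportError: cannot import name 'urlopen'. The actual cause is that Python 3 lays out its libraries differently from Python 2; here the import should be changed to:

from urllib.request import urlopen

While on the subject, a word on the difference between from ... import ... and import. With import, you have to write the full dotted path whenever you access something in the module. With from ... import, you simply use the name that follows import (presumably the meaning is "the thing being imported comes from here").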
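The difference between the two import styles can be sketched as follows. This is a minimal illustration using only the standard library; the variable names full_path_urlopen and same_object are made up for the example.

```python
# Style 1: plain import. The function is reached through its full module path.
import urllib.request
full_path_urlopen = urllib.request.urlopen

# Style 2: from ... import. The name is bound directly in the current namespace.
from urllib.request import urlopen

# Both spellings refer to the same function object.
same_object = full_path_urlopen is urlopen
print(same_object)
```

Neither style changes what is loaded; they differ only in which name ends up in the current namespace.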

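3. When deleting with Backspace in Python IDLE, it always feels as if only half a byte is removed, leaving a white box behind. Holding Alt deletes one character at a time, and holding Ctrl deletes one word at a time.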

4. Probably also because of Python 3, urlopen(url).read() returns bytes rather than str. Converting between str and bytes in Python is simple: bytes --> str is a.decode(encoding="utf-8"), and str --> bytes is a.encode(encoding="utf-8").
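A small round trip makes the conversion concrete. In this sketch the bytes value is built locally as a stand-in for what urlopen(url).read() would return, and UTF-8 is an assumption; a real page may declare a different encoding.

```python
# Stand-in for the bytes that urlopen(url).read() returns in Python 3.
raw = "罪與罰 Crime and Punishment".encode("utf-8")

text = raw.decode("utf-8")   # bytes -> str
back = text.encode("utf-8")  # str -> bytes

print(type(raw).__name__, type(text).__name__)
```

Decoding with the wrong encoding raises UnicodeDecodeError or produces garbled text, so it is worth checking which encoding the source actually uses.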

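5. In natural language processing, the text must first be tokenized, separating punctuation marks from words, before any further processing.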
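To show what separating punctuation from words means, here is a rough sketch that uses a regular expression from the standard library. It is a simplified stand-in and not NLTK's own tokenizer; simple_tokenize is a hypothetical helper name.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one character
    # that is neither a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Hello, world! How are you?")
print(tokens)
```

A pattern this simple also splits contractions and abbreviations at every punctuation mark, which is one reason a dedicated tokenizer is usually preferable for real text.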

6.  -- the address of "Crime and Punishment" has changed

7. Calling nltk.clean_html(htmltext) raises: builtins.NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function. It turns out NLTK no longer provides the clean_html and clean_url functions. The Beautiful Soup project can be used to process HTML instead.

8. Installation:

Run easy_install packagename, or install from source:

curl >> beautifulsoup4-4.1.2.tar.gz

tar zxvf beautifulsoup4-4.1.2.tar.gz

cd beautifulsoup4-4.1.2

python setup.py install

9. Starting with Beautiful Soup 4, the package to import is bs4: what used to be import BeautifulSoup is now import bs4. Usage:
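A minimal sketch of stripping markup with get_text(), as the NLTK error message suggests. The helper name html_to_text and the sample HTML are made up for illustration. The fallback to the standard library's html.parser when bs4 is not installed is an addition of this sketch, not something from the book.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # package name since Beautiful Soup 4
except ImportError:
    BeautifulSoup = None  # bs4 not installed; use the stdlib fallback below


class _TextCollector(HTMLParser):
    """Stdlib fallback: collect only the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with the markup removed."""
    if BeautifulSoup is not None:
        # "html.parser" selects Python's built-in parser, so no extra
        # parser library is required.
        return BeautifulSoup(html, "html.parser").get_text()
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


sample = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
print(html_to_text(sample))
```

With bs4 installed, the whole job is the single line BeautifulSoup(html, "html.parser").get_text(); the rest of the block only exists so the example still runs without it.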

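10. Because the start and end of the actual text cannot be detected reliably, you need to inspect the file by hand before extracting content from the raw text, looking for distinctive strings that mark where the content begins and ends (use find/rfind; rfind searches backwards).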
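The trimming step can be sketched like this. The raw text and both marker strings are hypothetical; in practice the markers come from reading the downloaded file by hand.

```python
# Hypothetical raw download: boilerplate surrounds the real content.
raw = (
    "Header and licence text\n"
    "*** START OF BOOK ***\n"
    "PART I\nFirst line of the real text.\n"
    "*** END OF BOOK ***\n"
    "Footer text\n"
)

# Hypothetical marker strings, found by inspecting the file manually.
start_marker = "*** START OF BOOK ***"
end_marker = "*** END OF BOOK ***"

start = raw.find(start_marker)  # index of the first occurrence, or -1
end = raw.rfind(end_marker)     # index of the last occurrence, or -1
if start == -1 or end == -1:
    raise ValueError("marker not found; inspect the file again")

# Skip past the start marker itself and stop just before the end marker.
content = raw[start + len(start_marker):end].strip()
print(content)
```

Using rfind for the end marker searches from the end of the string, so an earlier occurrence of the same phrase inside the body does not cut the content short.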
