Counting Word Frequencies

2021-09-01 20:00:49 · 1580 words · 1542 views

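Here is a large text file; please download it from [link omitted]. After decompression it is about 20 MB (the file used in the actual contest was 1.1 GB). The text contains only English words, spaces, and English punctuation marks: [.,;-~"?'!] (period, comma, semicolon, dash, tilde, double quote, question mark, single quote, exclamation mark).

Find the 10 most frequent words in the text (case-insensitive). Note that the following 20 words must be ignored in the count: (the, and, i, to, of, a, in, was, that, had, he, you, his, my, it, as, with, her, for, on)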


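import collections
import heapq
import re
import time

# The 20 words the task says to ignore (a set gives O(1) membership tests).
ignore_words = {'the', 'and', 'i', 'to', 'of', 'a', 'in', 'was', 'that', 'had',
                'he', 'you', 'his', 'my', 'it', 'as', 'with', 'her', 'for', 'on'}

def words(text):
    # Lowercase first so counting is case-insensitive, then keep runs of letters.
    # Fragments after an apostrophe (the "s" in "it's") are counted as words.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Map each word to the number of times it occurs.
    model = collections.defaultdict(int)
    for f in features:
        model[f] += 1
    return model

# Step 1: read the whole file into memory.
starttime = time.time()
with open('/duitang/data/nltk_data/big.txt') as fh:
    f = fh.read()
endtime = time.time()
exe_time = (endtime - starttime) * 1000  # elapsed time in milliseconds
print('read', exe_time)

# Step 2: split the text into lowercase words.
starttime = time.time()
f = words(f)
endtime = time.time()
exe_time = (endtime - starttime) * 1000
print('re', exe_time)

# Step 3: build the word -> count dictionary.
starttime = time.time()
f = train(f)
endtime = time.time()
exe_time = (endtime - starttime) * 1000
print('dict', exe_time)

# Step 4: take the 40 most frequent words, drop the ignored ones, keep 10.
# 30 candidates (10 results + 20 ignored words) would already be enough.
starttime = time.time()
max_list = heapq.nlargest(40, f, key=f.get)
nmax_list = []
for m in max_list:
    if m in ignore_words:
        continue
    nmax_list.append(m)
    if len(nmax_list) == 10:
        break
print(nmax_list)
endtime = time.time()
exe_time = (endtime - starttime) * 1000
print('sort', exe_time)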
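The task mentions that the contest file was 1.1 GB, which may be too large to read into memory in one call. The following is a minimal sketch of a line-by-line variant, not the author's method: it swaps in `collections.Counter` from the standard library, and the names `count_words`, `top_words`, and `top_words_in_file` are hypothetical helpers introduced here for illustration. It keeps the same `[a-z]+` tokenization and the 20 ignored words from the task.

```python
import re
from collections import Counter

# The 20 words the task says to ignore.
IGNORE_WORDS = frozenset(
    "the and i to of a in was that had he you his my it as with her for on".split()
)

WORD_RE = re.compile(r"[a-z]+")


def count_words(lines):
    """Count lowercase words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Only one line is held in memory at a time.
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def top_words(counts, n=10, ignore=IGNORE_WORDS):
    """Return up to n most frequent (word, count) pairs that are not ignored."""
    # n + len(ignore) candidates always contain the n best non-ignored words.
    candidates = counts.most_common(n + len(ignore))
    return [(w, c) for w, c in candidates if w not in ignore][:n]


def top_words_in_file(path, n=10):
    """Hypothetical convenience wrapper: stream a file and rank its words."""
    # Iterating over a file object yields one line at a time.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return top_words(count_words(fh), n)
```

Calling `top_words_in_file` with the path to the text file returns a list of `(word, count)` pairs. Because only the counter and the current line stay in memory, peak memory depends on the vocabulary size rather than the file size.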
