09 Word Frequency Counting with Python

2021-10-19 09:08:57 · 2,476 words · 6,233 reads

#!/usr/bin/python

import sys

# input comes from stdin (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit a count of 1 for each word
    for word in words:
        # write the results to stdout (standard output);
        # what we output here will be the input for the
        # reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
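The map step can be sanity-checked without Hadoop. A minimal sketch that inlines the mapper logic on one line of the sample data used later in this article (rather than invoking mapper.py itself):

```python
# Simulate the map step on one sample line from the input data
line = "aa bb cc dd aa cc"
pairs = [(word, 1) for word in line.strip().split()]
for word, count in pairs:
    # same tab-delimited format the reducer expects
    print('%s\t%s' % (word, count))
```

Each token becomes one `word<TAB>1` pair; duplicates like `aa` are emitted twice and only merged later, in the reduce step.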

To verify, execute the following statement:

You should get the following result:

Viewing the statistics:

#!/usr/bin/python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from stdin
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this if-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to stdout
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
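Because Hadoop sorts the map output by key before the reduce step, the whole job can be simulated locally. A minimal sketch using the five sample lines shown later in this article (the mapper/reducer logic is inlined here, not read from the scripts):

```python
# Simulate map -> sort -> reduce on the article's sample input
lines = ["aa bb cc dd aa cc"] * 4 + ["aa bb cc dd aa cc cc dd"]

# map step: emit (word, 1) pairs, then sort by key as Hadoop would
pairs = sorted((word, 1) for line in lines for word in line.split())

# reduce step: sum the counts for each run of identical keys
counts = {}
for word, count in pairs:
    counts[word] = counts.get(word, 0) + count

for word in sorted(counts):
    print('%s\t%s' % (word, counts[word]))
# prints: aa 10, bb 5, cc 11, dd 6 (tab-delimited)
```

The sort is what lets the reducer get away with only tracking the current key instead of holding a full dictionary in memory.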

To verify, execute the following statement:

You should get the following result:

Viewing the statistics — the contents of the input file:

aa bb cc dd aa cc

aa bb cc dd aa cc

aa bb cc dd aa cc

aa bb cc dd aa cc

aa bb cc dd aa cc cc dd

hdfs dfs -mkdir /data

hdfs dfs -put info.txt /data/info

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
    -input "/data/*" \
    -output "/out99" \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file "/root/mapper.py" \
    -file "/root/reducer.py"

Note: $HADOOP_HOME is Hadoop's installation directory.

That completes the word-frequency count with Python.
