Project 4: News Aggregation (Beginning Python)

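Problems

1. Only one paragraph is matched

The cause turned out to be zip(titles, bodies): zip stops at the shortest sequence, so with a single title it iterates once and never reaches the remaining bodies.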

for title in titles:
    for body in bodies:
        yield newsitem(title, wrap(body))

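Nested loops like these cause a different problem: they yield many (title, body) pairs, whereas in reality one title corresponds to one set of bodies, that is, many body paragraphs.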
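The difference between the two approaches can be reproduced with plain lists. This is a minimal sketch; the sample data is made up and stands in for what `findall` would return on a page with a single title.

```python
# Sample data (hypothetical): one title and three body paragraphs.
titles = ['Headline']
bodies = ['first paragraph', 'second paragraph', 'third paragraph']

# zip stops as soon as its shortest input is exhausted.
zipped = list(zip(titles, bodies))

# Nested loops pair every title with every body instead.
nested = [(title, body) for title in titles for body in bodies]

print(len(zipped))  # one pair: the other two bodies are dropped
print(len(nested))  # three pairs, each repeating the same title
```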

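Alternatively, the exercise may be meant to extract titles and bodies rather than to display a whole article: a page carries several news items, and each title is extracted together with its body. But an HTML page has only one title, so that reading does not fit either.

This means the newsitem class itself has to be modified.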
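One way to act on this is to let an item hold a list of bodies. This is only a sketch, assuming a page yields a single title and several body paragraphs. The class name `groupednewsitem`, the helper `make_item`, and the sample page are hypothetical and not part of the original listing.

```python
import re
import textwrap


class groupednewsitem(object):
    """Hypothetical variant of newsitem: one title, many body paragraphs."""

    def __init__(self, title, bodies):
        self.title = title
        self.bodies = list(bodies)


def make_item(text, titlepattern, bodypattern):
    """Build a single item from page text using the two regex patterns."""
    titles = re.findall(titlepattern, text)
    bodies = re.findall(bodypattern, text)
    # Use the first title if there is one; keep every body paragraph.
    title = titles[0] if titles else ''
    wrapped = ['\n'.join(textwrap.wrap(body, 70)) for body in bodies]
    return groupednewsitem(title, wrapped)


# Sample page, made up for illustration.
page = '<title>Headline</title><p>first</p><p>second</p>'
item = make_item(page, r'<title>(.+?)</title>', r'<p>(.+?)</p>')
print(item.title)
for body in item.bodies:
    print(body)
```

A destination written for this variant would print the title once and then loop over `item.bodies`.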

2. When Chinese text is printed, some punctuation marks cannot be displayed, for example the full-width comma (，).
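One plausible cause, offered as an assumption rather than a diagnosis: in Python 2 the downloaded page is a byte string, so `textwrap.wrap` counts bytes and can break a line in the middle of a multi-byte character. The sketch below is written for Python 3 and uses a hypothetical helper `wrap_text`. It reproduces the effect by slicing UTF-8 bytes, then shows wrapping after decoding.

```python
import textwrap

raw = '看，車'.encode('utf-8')  # 3 characters, 9 bytes in UTF-8

# Cutting at a byte offset that is not a character boundary leaves
# a partial character, which decodes to U+FFFD (the replacement mark).
broken = raw[:4].decode('utf-8', errors='replace')


def wrap_text(data, width=70, encoding='utf-8'):
    """Decode bytes first so that wrapping counts characters, not bytes."""
    text = data.decode(encoding) if isinstance(data, bytes) else data
    return '\n'.join(textwrap.wrap(text, width)) + '\n'


print(repr(broken))
print(wrap_text(raw, width=2))
```

In the Python 2 listing the equivalent step would be to decode the result of `urlopen(...).read()` before matching and wrapping, provided the page encoding is known.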

3. The NNTP server could not be found, so that part is commented out for now.

News source:

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 listing (print statements, urllib.urlopen).
from nntplib import NNTP
from time import time, strftime, localtime
from email import message_from_string
from urllib import urlopen
import textwrap
import re

day = 24 * 60 * 60  # number of seconds in one day


def wrap(string, max=70):
    """Wrap string into lines of at most max characters."""
    return '\n'.join(textwrap.wrap(string, max)) + '\n'


class newagent(object):
    """Distributes news items from sources to destinations."""

    def __init__(self):
        self.sources = []
        self.destinations = []

    def addsource(self, source):
        self.sources.append(source)

    def adddestination(self, dest):
        self.destinations.append(dest)

    def distribute(self):
        items = []
        for source in self.sources:
            # Each source (nntpsource or simplewebsource) is an instance,
            # and getitems() is called through that instance.
            items.extend(source.getitems())
        for dest in self.destinations:
            dest.receiveitems(items)


class newsitem(object):
    """A single news item: a title and a body."""

    def __init__(self, title, body):
        self.title = title
        self.body = body


class nntpsource(object):
    """Retrieves news items from an NNTP group."""

    def __init__(self, servername, group, window):
        self.servername = servername
        self.group = group
        self.window = window

    def getitems(self):
        start = localtime(time() - self.window * day)
        date = strftime('%y%m%d', start)
        hour = strftime('%H%M%S', start)
        server = NNTP(self.servername)
        ids = server.newnews(self.group, date, hour)[1]
        for id in ids:
            lines = server.article(id)[3]
            message = message_from_string('\n'.join(lines))
            title = message['subject']
            body = message.get_payload()
            if message.is_multipart():
                body = body[0]
            yield newsitem(title, body)
        server.quit()


class simplewebsource(object):
    """Extracts news items from a web page with regular expressions."""

    def __init__(self, url, titlepattern, bodypattern):
        self.url = url
        self.titlepattern = re.compile(titlepattern)
        self.bodypattern = re.compile(bodypattern)

    def getitems(self):
        text = urlopen(self.url).read()
        titles = self.titlepattern.findall(text)
        bodies = self.bodypattern.findall(text)
        for title, body in zip(titles, bodies):
            yield newsitem(title, wrap(body))


class plaindesination(object):
    """Prints all news items as plain text."""

    def receiveitems(self, items):
        for item in items:
            print item.title
            print '-' * len(item.title)
            print item.body


class htmldeatination(object):
    """Writes all news items to an HTML file."""

    def __init__(self, filename):
        self.filename = filename

    def receiveitems(self, items):
        # The HTML markup inside these strings was stripped from the
        # original post. It is reconstructed here along the lines of the
        # book's listing, which is an assumption.
        out = open(self.filename, 'w')
        print >> out, '''<html>
<head><title>Today's News</title></head>
<body>
<h1>Today's News</h1>'''
        print >> out, '<ul>'
        id = 0
        for item in items:
            id += 1
            print >> out, '<li><a href="#%i">%s</a></li>' % (id, item.title)
        print >> out, '</ul>'
        id = 0
        for item in items:
            id += 1
            print >> out, '<h2><a name="%i">%s</a></h2>' % (id, item.title)
            print >> out, '<pre>%s</pre>' % item.body
        print >> out, '''</body>
</html>'''
        out.close()


def rundefaultsetup():
    agent = newagent()

    # The URL and the HTML tags around the groups in both patterns are
    # missing from the original post, so they are left as found.
    _url = ''
    _title = r'(.+?)'
    _body = r'(.+?)'
    bbc = simplewebsource(_url, _title, _body)
    agent.addsource(bbc)

    # NNTP source, commented out because the server could not be found.
    # The server and group names are blank in the original post.
    # clap_server = ''
    # clap_group = ''
    # clap_window = 1
    # clap = nntpsource(clap_server, clap_group, clap_window)
    # agent.addsource(clap)

    agent.adddestination(plaindesination())
    agent.adddestination(htmldeatination('new.html'))
    agent.distribute()


if __name__ == '__main__':
    rundefaultsetup()
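To watch the agent move items from sources to destinations without any network access, a small Python 3 sketch may help. The classes `listsource` and `collectdestination` are hypothetical stand-ins, and `agent` and `item` are trimmed-down copies written for this example, not the classes from the listing.

```python
class item(object):
    """Trimmed-down stand-in for newsitem."""

    def __init__(self, title, body):
        self.title = title
        self.body = body


class listsource(object):
    """Hypothetical source that yields items from an in-memory list."""

    def __init__(self, pairs):
        self.pairs = pairs

    def getitems(self):
        for title, body in self.pairs:
            yield item(title, body)


class collectdestination(object):
    """Hypothetical destination that stores lines instead of printing."""

    def __init__(self):
        self.lines = []

    def receiveitems(self, items):
        for it in items:
            self.lines.append(it.title)
            self.lines.append('-' * len(it.title))
            self.lines.append(it.body)


class agent(object):
    """Same distribute logic as newagent: gather everything, then hand out."""

    def __init__(self):
        self.sources = []
        self.destinations = []

    def distribute(self):
        items = []
        for source in self.sources:
            items.extend(source.getitems())
        for dest in self.destinations:
            dest.receiveitems(items)


a = agent()
a.sources.append(listsource([('Title A', 'Body A')]))
a.sources.append(listsource([('Title B', 'Body B')]))
dest = collectdestination()
a.destinations.append(dest)
a.distribute()
print('\n'.join(dest.lines))
```

Because `distribute` collects the generator output into a list with `extend` before handing it out, every destination receives the same items even though `getitems` can only be iterated once.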

Output:

看著眼前這輛19萬買來的寶馬5系車，江西人小王�

��裡那叫乙個開心。如果不出意外，車子開回江西後�

�手一賣，還能再賺個兩三萬。
