Honestly, this crawler is badly written; I'm posting it here hoping for help

2021-08-25 19:46:38 · 2,164 characters · 6,373 reads

This is my crawler code for scraping cnbeta.

Analyzing the pages took a long time, and the result is still not what I wanted.

The parsing part is not done well. I'd be grateful if someone more capable could point out a better approach. Many thanks.

import requests
import re
from bs4 import BeautifulSoup


def article_num():
    # Fetch the list page and return the id of the newest article.
    headers = {}  # the original header dict was lost in the paste; add a User-Agent here
    url = ''      # the original list-page URL was also lost in the paste
    wb_data = requests.get(url, headers=headers)
    wb_data.encoding = 'utf-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('dt > a')
    url = titles[0].get('href')
    article_num = re.findall(r'\d+', url).pop(0)
    # print("first number: " + article_num)
    return int(article_num)


def get_url(num):
    # Note: the scheme/host prefix was lost in the paste; requests needs a full URL.
    return 'articles/%d.htm' % num


the_num = article_num()


def get_data():
    global the_num
    url = get_url(the_num)
    the_num -= 2
    headers = {}  # same lost header dict as above
    wb_data = requests.get(url, headers=headers)
    wb_data.encoding = 'utf-8'
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # print(soup)
    data = {}
    title = soup.select('header.title > h1')
    # print(title)
    data['title'] = title[0].text
    article_summary = soup.select('div.article-summary > p')
    summary = article_summary[0].text
    article_content = soup.select('div.article-content > p')
    all_content = summary
    for content in article_content:
        all_content = all_content + '\n\n' + content.text.replace('\n', '')
    data['content'] = all_content
    return data


def main(num):
    with open('.//wenzhang/' + 'cnbeta1.txt', 'w', encoding='utf-8') as f:
        for i in range(1, num + 1):
            try:
                data = get_data()
                f.write('>> ' + str(i) + ' >>>> ' + data['title'] + ' <<<<\n\n')
                f.write(data['content'] + '\n\n---------------------------------\n\n\n\n')
                print(str(the_num + 2) + " ok ...", end='\n------------------------\n\n\n')
            except Exception:
                f.write('>> ' + str(i) + ' >>>> ' + str(the_num + 2) + " this one failed..." + ' <<<<\n\n' + '\n\n---------------------------------\n\n\n')
                print(str(the_num + 2) + " noooooo ...", end='\n------------------------\n\n\n')


num = 10
main(num)
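One fragile spot in the parsing is `re.findall('\d+', url).pop(0)`: if the href contains no digits it raises a bare IndexError deep inside the crawl loop, which the broad except then swallows. A minimal sketch of a more defensive version, using only the standard library (the example href and article id below are hypothetical, made up to illustrate the cnbeta `articles/<id>.htm` pattern used in the code above):

```python
import re


def extract_article_id(href):
    """Return the first run of digits in an article href as an int.

    Raises ValueError with the offending href instead of IndexError,
    which makes failures in the crawl loop far easier to diagnose.
    """
    match = re.search(r'\d+', href)
    if match is None:
        raise ValueError('no article id in href: %r' % href)
    return int(match.group())


print(extract_article_id('/articles/1173481.htm'))  # prints 1173481
```

Swapping this helper in for the `findall(...).pop(0)` line, and narrowing the `except` in `main` to log the actual exception, would turn silent "noooooo" lines into actionable error messages. For the fetching side, a shared `requests.Session` with a retry adapter would also be worth considering, though that is a separate change.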

