Scraping the Douban Top 250 Movie Title List

The page to scrape

Douban Movie Top 250
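
The program in this post walks the list ten pages at a time, 25 films per page, by appending an offset of `page * 25` and the suffix `&filter=` to a base address. A minimal sketch of that offset arithmetic follows. The helper name `page_urls` and the base URL are placeholders invented for illustration, not the site's real address.

```python
def page_urls(base_url, total=250, page_size=25):
    """Return one URL per list page, each carrying its starting offset."""
    return [base_url + str(offset) + '&filter='
            for offset in range(0, total, page_size)]


# Placeholder base URL (assumption); substitute the real list-page address.
urls = page_urls('https://example.invalid/list?offset=')
print(len(urls))
print(urls[0])
print(urls[-1])
```

Using `range(0, total, page_size)` yields the offsets 0, 25, ..., 225 directly, which is equivalent to multiplying a page index by 25.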

Python program

#!/usr/bin/python3.4
# -*- coding: utf-8 -*-
# filename: getdangdang.py
# author: duhongjiang
# date: 2018/2/24 20:08
import re

import requests
from bs4 import BeautifulSoup

print('from douban dianying top 250')
print('------------------------begin------------------')
# Open the output file before the try block so that finally can always close it.
file_object = open('thefile.txt', 'w', encoding='utf-8')
try:
    for page in range(10):
        # Each page lists 25 films, so the offset runs 0, 25, ..., 225.
        # NOTE: the base URL is blank in this listing; prepend the list-page address.
        url = '' + str(page * 25) + '&filter='
        print('------------------------', page, '------------------')
        print(url)
        html = requests.get(url)  # fetch the HTML
        html.raise_for_status()  # raise on an HTTP error status
        soup = BeautifulSoup(html.text, 'html.parser')
        soup = str(soup)
        # NOTE: these patterns appear to have lost their surrounding HTML tags
        # in this listing; as printed they over-match and need to be anchored.
        title = re.compile(r'([^\/]+)')
        outher = re.compile(r'(\d+\.\d+)')
        # Match this suffix against the page's actual text
        # (the site serves simplified characters).
        pepole = re.compile(r'(.*)人評價')
        names = re.findall(title, soup)
        outhers = re.findall(outher, soup)
        pepoles = re.findall(pepole, soup)
        print(outhers)
        print(pepoles)
        # Pair each title with its rating and vote count: one CSV row per film.
        for row in zip(names, outhers, pepoles):
            print(row)
            file_object.write(','.join(row) + '\n')
except Exception as e:
    print(e)
finally:
    file_object.close()
print('-----------------------end----------------')
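
The three patterns in the program above match far more than intended as printed, since nothing ties them to a particular element. The sketch below shows one way to anchor them to their enclosing tags and zip the results into rows. The sample markup, including the `title` and `rating_num` class names, is an assumption made up for illustration and is not taken from the live page; `extract_rows` is likewise a hypothetical helper.

```python
import re

# Hypothetical markup: tag and class names are assumptions for illustration.
SAMPLE_HTML = '''
<div class="item">
  <span class="title">肖申克的救贖</span>
  <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
  <span class="rating_num">9.6</span>
  <span>980298人評價</span>
</div>
<div class="item">
  <span class="title">霸王別姬</span>
  <span class="rating_num">9.5</span>
  <span>712104人評價</span>
</div>
'''

# Excluding '/' skips alternate titles, which start with a slash separator.
TITLE = re.compile(r'<span class="title">([^/<]+)</span>')
RATING = re.compile(r'<span class="rating_num">(\d+\.\d+)</span>')
VOTES = re.compile(r'<span>(\d+)人評價</span>')


def extract_rows(html):
    """Return (title, rating, votes) string tuples, one per film."""
    names = TITLE.findall(html)
    ratings = RATING.findall(html)
    votes = VOTES.findall(html)
    return list(zip(names, ratings, votes))


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(','.join(row))
```

Zipping three independent lists only stays aligned if every film yields exactly one match per pattern. Walking each film's container element and extracting its fields together would be more robust.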

Running the program

duhj@ubuntu:~/desktop/work$ python3.4 getdangdang.py

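Data file produced by the program

The file is comma-separated (CSV), one film per line as title, rating, and number of ratings, so it can be loaded into the Hadoop HDFS file system.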

肖申克的救贖,9.6,980298

霸王別姬,9.5,712104

這個殺手不太冷,9.4,925493

阿甘正傳,9.4,786983

美麗人生,9.5,459967

千與千尋,9.2,736229

鐵達尼號,9.2,726576

辛德勒的名單,9.4,419526
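
To check the file before loading it elsewhere, the rows can be read back with the standard `csv` module. This is a minimal sketch: `load_rows` is a hypothetical helper, and an in-memory `io.StringIO` holding three of the rows above stands in for `thefile.txt`.

```python
import csv
import io

# Rows copied from the output above; StringIO stands in for thefile.txt.
SAMPLE = '''肖申克的救贖,9.6,980298
霸王別姬,9.5,712104
這個殺手不太冷,9.4,925493
'''


def load_rows(handle):
    """Parse title,rating,votes lines into typed tuples, skipping blank lines."""
    rows = []
    for row in csv.reader(handle):
        if not row:
            continue
        title, rating, votes = row
        rows.append((title, float(rating), int(votes)))
    return rows


rows = load_rows(io.StringIO(SAMPLE))
most_voted = max(rows, key=lambda r: r[2])
print(most_voted)
```

A title that itself contains a comma would break the three-column layout when written with a plain string join. Writing with `csv.writer` instead would quote such fields.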
