There are three ways to crawl Ajax pages:
Capture the JSON packet: simple and fast; the preferred option whenever you can find the request URL.
Use the Splash plugin: fast crawling, but it requires Docker and deployment is more troublesome.
Use the Selenium plugin: slow crawling; requires PhantomJS (now unmaintained, so headless Chrome or Firefox is the usual replacement).
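The first approach needs no rendering at all: find the XHR request in the browser's Network tab and parse its response body directly. A minimal sketch of that idea (the payload shape and the "data"/"title" field names are hypothetical, for illustration only):

```python
import json

def parse_ajax_json(body):
    """Extract item titles from a captured Ajax JSON response body.
    The "data"/"title" keys are assumed for this example; inspect the
    real packet in the Network tab to find the actual structure."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("data", [])]

# A sample body, shaped like a typical paginated list endpoint
sample = '{"data": [{"title": "post one"}, {"title": "post two"}]}'
print(parse_ajax_json(sample))
```

In a real spider you would request the XHR URL itself and call `response.json()` instead of rendering the page.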
The Splash plugin
Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API.
First, install the scrapy-splash library:
# Python2
pip install scrapy-splash
# Python3
pip3 install scrapy-splash
Install Docker (see the download page on the official Docker site).
Pull the Splash Docker image:
docker pull scrapinghub/splash
Run the image:
docker run -p 8050:8050 scrapinghub/splash
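Once the container is up, you can smoke-test it by requesting Splash's render.html endpoint, which returns the page HTML after JavaScript has executed. A small helper that builds such a URL (the host/port match the SPLASH_URL setting used in this guide):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://localhost:8050"  # same value as in settings.py

def splash_render_url(target, wait=0.5):
    """Build a GET URL for Splash's render.html endpoint, which
    renders the target page and returns the resulting HTML."""
    qs = urlencode({"url": target, "wait": wait})
    return f"{SPLASH_URL}/render.html?{qs}"

print(splash_render_url("http://example.com"))
```

Opening the printed URL in a browser (with the container running) should return the rendered HTML of the target page.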
Configure Scrapy:
# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Spider file:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "my_spider"  # every Scrapy spider needs a unique name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # SplashRequest routes the request through Splash;
            # 'wait' gives the page time to finish rendering
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        ...
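Inside parse, the response body is the rendered DOM, so any ordinary extraction technique works (in a real spider, `response.css('title::text').get()` and friends). A stdlib-only sketch of extracting the <title> from a sample rendered document:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Pull the text of the <title> element out of an HTML document."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data

# A sample of what Splash hands back after JavaScript has run
rendered = "<html><head><title>Example Domain</title></head><body></body></html>"
p = TitleParser()
p.feed(rendered)
print(p.title)
```

The point is that after Splash has done the rendering, the Ajax page looks like any static page to the rest of your pipeline.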