There are three ways to scrape Ajax pages:

Capturing the JSON response: simple and fast; the preferred option whenever you can find the request URL

Using the Splash plugin: fast scraping, but it requires Docker and is cumbersome to deploy

Using the Selenium plugin: slow scraping, and it requires PhantomJS
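
The first approach deserves a quick illustration. A minimal sketch, assuming you have already found the XHR endpoint in the browser's DevTools Network tab; the payload shape and field names here are hypothetical, and the canned bytes stand in for a real HTTP response body:

```python
import json

def parse_ajax_payload(body: bytes) -> list:
    """Extract item titles from a hypothetical JSON payload
    returned by an Ajax endpoint."""
    data = json.loads(body)
    return [item["title"] for item in data.get("results", [])]

# Canned payload so the sketch runs without a network call; in a real
# spider you would yield a scrapy.Request to the endpoint URL instead.
payload = b'{"results": [{"title": "first"}, {"title": "second"}]}'
print(parse_ajax_payload(payload))  # -> ['first', 'second']
```

When the endpoint is reachable like this, no JavaScript rendering is needed at all, which is why this approach is preferred.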

The Splash plugin

Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API.

First, install scrapy-splash:

# Python2
pip install scrapy-splash
# Python3
pip3 install scrapy-splash

Install Docker: see the download page on the official Docker website.

Pull the Splash Docker image:

docker pull scrapinghub/splash

Run the image (this starts Splash listening on port 8050):

docker run -p 8050:8050 scrapinghub/splash
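
Before wiring Splash into Scrapy, you can sanity-check it through its plain HTTP API. The `render.html` endpoint returns a page's HTML after JavaScript has run; a sketch of building such a request URL (the host and port match the `docker run` command above, and the container must be running for an actual fetch to succeed):

```python
from urllib.parse import urlencode

# Query parameters for Splash's render.html endpoint: the page to
# render, and how long to wait for JavaScript to finish.
params = {"url": "http://example.com", "wait": 0.5}
api_url = "http://localhost:8050/render.html?" + urlencode(params)
print(api_url)
# -> http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Opening that URL in a browser (or with curl) should return the rendered HTML, confirming the service is up.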

Configure Scrapy:

# settings.py

SPLASH_URL = 'http://localhost:8050'  

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

The spider file:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "my_spider"  # every Scrapy spider needs a unique name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response here contains the HTML after Splash has rendered the page
        ...


Links to the other two parts:

Scraping Ajax pages with Scrapy (Part 1) - Capturing the JSON response
Scraping Ajax pages with Scrapy (Part 3) - The Selenium plugin