For work I recently had to pick up Scrapy, the Python crawler framework that is said to be the best one for Python, bar none. This article is a set of notes from my learning process; reference resources are linked at the end.
Installation
Windows installation
Install pywin32
Go to https://sourceforge.net/projects/pywin32/files/pywin32/ or https://github.com/mhammond/pywin32/releases and download the pywin32 package that matches your Python version
Run through the installer (no special options needed)
Verify the installation by entering the following in a Python shell:
import win32api — if no error is raised, the installation succeeded
Install Twisted
Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/ and download the Twisted and lxml wheels that match your Python version
Install them with pip install ******.whl (if pip is not installed, look up how to install it first)
Run pip --version to check whether pip is installed correctly
Install Scrapy
Enter on the command line: pip install scrapy
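As a quick sanity check (not part of the original notes), you can confirm that Scrapy is importable and see which version was installed:
import scrapy
print(scrapy.__version__)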
Learning the Scrapy framework
Framework architecture diagram

Directory structure
items.py — data models for the scraped data
middlewares.py — middlewares
pipelines.py — saves the scraped data
settings.py — spider configuration
scrapy.cfg — project configuration file
spiders directory — the spider scripts
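As a small illustration of what goes into items.py, here is a minimal sketch of a data model (the ArticleItem class and its fields are hypothetical, not from the original project):
import scrapy

class ArticleItem(scrapy.Item):
    # one ArticleItem instance per scraped record
    title = scrapy.Field()  # article title
    url = scrapy.Field()    # article URL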
Basic usage
1. Create a new project
scrapy startproject project_name
2. scrapy.cfg is the packaging/deployment file
3. Create a spider
scrapy genspider spider_name site_domain
4. Note:
the spider name must not be the same as the project name
the site domain is written into the spider's allowed_domains
5. Location of the spider file
project_name/project_name/spiders/spider_name.py
6. List the spider templates Scrapy ships with
scrapy genspider -l
7. Selectors
regular expressions
XPath expressions
CSS selectors
8. XPath expression rules
demo
# HTML snippet
<title>This is the title</title>
# Get the content of title
# title is the HTML tag
# text() extracts the text content
# extract the value with get()
response.xpath('//title/text()').get()
9. Run the spider
scrapy crawl spider_name
10. Run the spider and save the results
scrapy crawl spider_name -o xxx.csv (or xxx.json)
11. Join URLs
response.urljoin(uri)
12. Pagination (see the short sketch after this list)
yield scrapy.Request(url, callback=self.parse)
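A minimal spider sketch, assuming a hypothetical page layout, that ties items 9–12 together (the selectors, class names and site are illustrative only):
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # extract data with XPath (response.css or regular expressions work similarly)
        for title in response.xpath('//h1/text()').getall():
            yield {'title': title}
        # turn a relative href into an absolute URL, then follow it for pagination
        next_href = response.xpath('//a[@class="next"]/@href').get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)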
Storing data through pipelines
Enable the pipeline in settings.py:
# 300 is the priority; the smaller the number, the higher the priority
ITEM_PIPELINES = {
    'cxianshengSpider.pipelines.CxianshengspiderPipeline': 300,
}
Storing data as JSON inside the pipeline, using Scrapy's exporters:
# stores everything in one batch; suited to small amounts of data
from scrapy.exporters import JsonItemExporter
# stores items line by line; use it when the data volume is large
from scrapy.exporters import JsonLinesItemExporter
Usage:
# create the exporter
self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
# export one item
self.exporter.export_item(item)
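Putting the pieces together, a minimal pipeline sketch using JsonLinesItemExporter (the class name and output file name are made up; if you use JsonItemExporter instead, also call start_exporting() in open_spider and finish_exporting() in close_spider):
from scrapy.exporters import JsonLinesItemExporter

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # open the output file once, in binary mode, when the spider starts
        self.fp = open('items.jl', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()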
CrawlSpider
CrawlSpider lets you build more flexible spiders, e.g. with custom crawl rules
Create a crawl spider
scrapy genspider -t crawl spider_name domain
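A minimal CrawlSpider sketch, assuming a hypothetical site whose article pages live under /posts/ (the domain and the rule pattern are illustrative only):
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PostSpider(CrawlSpider):
    name = 'post_demo'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    # follow every link matching /posts/, hand each response to parse_item,
    # and keep extracting links from the visited pages (follow=True)
    rules = (
        Rule(LinkExtractor(allow=r'/posts/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').get()}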
Latest user-agent strings can be found at
http://useragentstring.com/pages/useragentstring.php?typ=browser
Setting up a downloader middleware
1. Add the following code to middlewares.py:
import random

class HttpbinUserAgentMiddleware(object):
    user_agent = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def process_request(self, request, spider):
        # pick a random user agent from the list
        user_agent = random.choice(self.user_agent)
        # set it as the User-Agent header of the outgoing request
        request.headers['User-Agent'] = user_agent
2. Enable it in settings.py:
# HttpbinUserAgentMiddleware is the name of the downloader middleware class you defined
DOWNLOADER_MIDDLEWARES = {
    'httpBin.middlewares.HttpbinUserAgentMiddleware': 543,
}
Also enable a delay between download requests:
# 3 means 3 seconds
DOWNLOAD_DELAY = 3
3. The spider code:
# -*- coding: utf-8 -*-
import scrapy
import json

class UseragentdemoSpider(scrapy.Spider):
    name = 'userAgentDemo'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/user-agent']

    def parse(self, response):
        data = json.loads(response.text)['user-agent']
        print('=' * 30)
        print(data)
        print('=' * 30)
        # issue the same request again (dont_filter=True bypasses the duplicate filter)
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Learning XPath rules
html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Document</title>
</head>
<body>
<a class="a-test" href="/next/2" >email me</a>
</body>
</html>
XPath expressions
Basic usage
text() extracts an element's text content
@ locates attributes
Expressions
1. Get the content of a given tag
//title/text()
2. Locate content via an HTML attribute
the @ locator
Get the href of the a tag
//a[@class="a-test"]/@href
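To see what these expressions return, here is a quick sketch using parsel, the selector library Scrapy builds on, against the HTML snippet above:
from parsel import Selector

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <a class="a-test" href="/next/2">email me</a>
</body>
</html>
'''

sel = Selector(text=html)
print(sel.xpath('//title/text()').get())               # Document
print(sel.xpath('//a[@class="a-test"]/@href').get())   # /next/2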
Using XPath with Scrapy
Code example (PS: the example crawls my own blog and has been verified to work):
# -*- coding: utf-8 -*-
import scrapy

class CxianshengSpider(scrapy.Spider):
    name = 'cxiansheng'  # spider name
    allowed_domains = ['cxiansheng.cn']  # domains the spider may crawl
    start_urls = ['https://cxiansheng.cn/']  # URLs to start crawling from

    def return_default_str(self, s):
        # strip whitespace, falling back to an empty string for None
        return s.strip() if s else ''

    def parse(self, response):
        selectors = response.xpath('//section/article')
        for selector in selectors:
            article_title = selector.xpath('./header/h1/a/text()').get()
            article_url = selector.xpath('./div/p[@class="more"]/a/@href').get()
            article_title = self.return_default_str(article_title)
            article_url = self.return_default_str(article_url)
            yield {'article_title': article_title, 'article_url': article_url}
        next_url = response.xpath('//nav[@class="pagination"]/a[@class="extend next"]/@href').get()
        if next_url:
            # request the next page and parse it with the same callback
            yield scrapy.Request(next_url, callback=self.parse)
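To run this spider and dump the yielded items to a file, as described in the basic-usage section (the output file name is arbitrary):
scrapy crawl cxiansheng -o articles.json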
Crawling it taught me one fact: I really haven't written many posts on my blog. Sob /(ㄒoㄒ)/~~
Source code:
Python scrapy demo