Using USER_AGENT in Scrapy to Mimic Browser Requests


Preface

When crawling, you often run into a site's anti-scraping measures. By default, a Scrapy spider sends one fixed user agent in every request header, which is easy to identify, so many sites filter known user agents as their most basic anti-scraping defense. The way around this is to use a different, randomly chosen user agent for each request. Below is a collection of common browser user agents and the steps to enable random user agents in Scrapy.
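
For reference, a project that never overrides USER_AGENT sends Scrapy's built-in default with every request, which looks roughly like the following (the exact version number depends on the installed Scrapy release) and is trivial for a server to spot:

# Scrapy's default user agent, in approximate form
USER_AGENT = 'Scrapy/VERSION (+https://scrapy.org)'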

Common browser user agents

Without further ado, here they are:

'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
'Apache-HttpClient/UNAVAILABLE (java 1.4)',                                                                                          
'Lite 1.0 ( http://litesuits.com )',                                                                                                                     
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)',                                                                                                                  
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0',                                                                                                 
#WIN7 Chrome  
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',                                                                                                               
#Win7 Firefox 
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',                                                         
#win7 Safari 
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',  

#win7 Opera  
'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',                                                                                                                                                       
#Win7 ie9:
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)',      

#Maxthon 3.1.7 on Win7 + IE9, high-speed mode:
'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',                                                                                                              
#Sogou 3.0 on Win7 + IE9, high-speed mode:
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0',                                                                                   
#360 Browser 3.0 on Win7 + IE9:
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)', 

#QQ Browser 6.9 (11079) on Win7 + IE9, turbo mode:
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',                                                                                               
#The following are for Win10
#Chrome
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
#360 Extreme Explorer, compatibility mode
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; rv:11.0) like Gecko',                                                                                                                                
#360 Extreme Explorer, turbo mode
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',                                                                                                           
#Firefox Developer Edition
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',

#Firefox
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',

#Sogou Explorer, high-speed mode
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',

#IE11
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; rv:11.0) like Gecko',                                                                                                                                
#Edge
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',                                                                                        
#QQ Browser, turbo mode
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5680.400 QQBrowser/10.2.1852.400',         

Using a random User Agent in Scrapy

First, in settings.py, put the user agents above into a list named USER_AGENT_LIST:

USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
'Apache-HttpClient/UNAVAILABLE (java 1.4)',                                                                                          
'Lite 1.0 ( http://litesuits.com )',                                                                                                                     
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)',                                                                                                                  
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0',                                                                                                 
#WIN7 Chrome  
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',                                                                                                               
#Win7 Firefox 
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',                                                         
#win7 Safari 
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',  

#win7 Opera  
'Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50',                                                                                                                                                       
#Win7 ie9:
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; Tablet PC 2.0; .NET4.0E)',      

#Maxthon 3.1.7 on Win7 + IE9, high-speed mode:
'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',                                                                                                              
#Sogou 3.0 on Win7 + IE9, high-speed mode:
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0',                                                                                   
#360 Browser 3.0 on Win7 + IE9:
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)', 

#QQ Browser 6.9 (11079) on Win7 + IE9, turbo mode:
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',                                                                                               
#The following are for Win10
#Chrome
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
#360 Extreme Explorer, compatibility mode
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; rv:11.0) like Gecko',                                                                                                                                
#360 Extreme Explorer, turbo mode
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',                                                                                                           
#Firefox Developer Edition
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',

#Firefox
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',

#Sogou Explorer, high-speed mode
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',

#IE11
'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; rv:11.0) like Gecko',                                                                                                                                
#Edge
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',                                                                                        
#QQ Browser, turbo mode
'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5680.400 QQBrowser/10.2.1852.400',                                                   
] 

Then, in the middleware file middlewares.py, write a class that draws from USER_AGENT_LIST:

# In newer versions of Scrapy, use these two lines to read values from settings.py
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
import random

class UAMiddleware(object):
    # Before each request goes out, pick a random user agent from the list
    # and put it into the request headers
    def process_request(self, request, spider):
        ua = random.choice(settings['USER_AGENT_LIST'])
        request.headers['User-Agent'] = ua
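
The get_project_settings approach above works; an alternative, arguably more idiomatic, sketch reads the list from the running crawler's settings via from_crawler instead of a module-level settings object (the class name RandomUAMiddleware below is just an illustrative choice; USER_AGENT_LIST is the same setting defined earlier):

import random

class RandomUAMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy passes in the running crawler, so these settings are the ones
        # actually in effect for the crawl, including per-spider overrides
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Attach a randomly chosen user agent to every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

If you use this variant, point DOWNLOADER_MIDDLEWARES at RandomUAMiddleware instead of UAMiddleware.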

Finally, enable the middleware in settings.py by configuring DOWNLOADER_MIDDLEWARES (get_ais here is this project's package name; substitute your own):

DOWNLOADER_MIDDLEWARES = {
    'get_ais.middlewares.UAMiddleware': 543,
}
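
To confirm the rotation is actually happening, one simple check (a throwaway sketch, assuming the middleware above is enabled) is to point a spider at https://httpbin.org/headers, which echoes the request headers back, and watch the logged User-Agent change from request to request:

import json
import scrapy

class UACheckSpider(scrapy.Spider):
    # Disposable spider used only to verify that user agents are being rotated
    name = 'ua_check'

    def start_requests(self):
        # dont_filter=True lets the same URL be requested several times
        for _ in range(5):
            yield scrapy.Request('https://httpbin.org/headers', dont_filter=True)

    def parse(self, response):
        headers = json.loads(response.text).get('headers', {})
        self.logger.info('Sent User-Agent: %s', headers.get('User-Agent'))

Run it with scrapy crawl ua_check and each request should log a different user agent from the list.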

With that, the setup is complete: the target server can still see a large number of requests coming from a single IP, but they now appear to be issued by different browsers.


Author: 无咎
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit 无咎 when reposting.