What causes traffic spikes from short bursts of dense bot visits, and how do you deal with them?
Early one weekend morning I received an alert email. My first guess was that the site was under attack, or that there was some cache/log/memory problem. A look at access.log told a different story: during that window the site had been hit by a wave of bots (bot: an automated program that performs a particular task over and over again).
website.com (AWS) - Monitor is Down
Down since Mar 25, 2017 1:38:58 AM CET
Site Monitored
http://www.website.com
Resolved IP
54.171.32.xx
Reason
Service Unavailable.
Monitor Group
XX Applications
Outage Details
Location: London - UK (5.77.35.xx)
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
Response:
HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
Content-Length: 0
Connection: keep-alive
Request:
GET / HTTP/1.1
Cache-Control: no-cache
Accept: */*
Connection: Keep-Alive
Accept-Encoding: gzip
User-Agent: Site24x7
Host: xxx

Location: Seattle - US (104.140.20.xx)
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
(Same 503 response and request headers as above.)
A quick search showed that many webmasters have hit the same problem: a short burst of dense bot traffic creates a spike that leaves the server unable to serve anyone else. Drawing on the analysis in the article I found, here are several ways to block these web bots.
1. robots.txt
Many web crawlers request robots.txt first, as shown below:
"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
"162.210.196.98" - - [25/Mar/2017:01:39:18 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
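Spotting such a spike in access.log can be automated by tallying requests per User-Agent. A minimal sketch in Python; the regex assumes combined-log-style lines like those above, and nothing here is from the original article:

```python
import re
from collections import Counter

# The User-Agent is the last quoted field of a combined-log-format line.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Count requests per User-Agent across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    '"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '"199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '"1.2.3.4" - - [25/Mar/2017:01:27:00 +0000] "GET / HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
]
for agent, n in count_user_agents(sample).most_common():
    print(n, agent)
```

Sorting the tally by count immediately surfaces which agent dominated the window that triggered the alert.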
Many bot publishers also document what to do if you do not want to be crawled. Take MJ12bot as an example:
How can I block MJ12bot?
MJ12bot adheres to the robots.txt standard. If you want to prevent the bot from crawling your website, add the following text to your robots.txt:
User-agent: MJ12bot
Disallow: /
Please do not waste your time trying to block the bot by IP in .htaccess: we do not use any consecutive IP blocks, so your efforts will be in vain. Also please make sure the bot can actually retrieve robots.txt itself; if it can't, it will assume (this is the industry practice) that it's okay to crawl your site.
If you have reason to believe that MJ12bot did NOT obey your robots.txt rules, then please let us know via email: bot@majestic12.co.uk. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.
How can I slow down MJ12bot?
You can easily slow down the bot by adding the following to your robots.txt file:
User-Agent: MJ12bot
Crawl-Delay: 5
Crawl-Delay should be an integer; it signifies the number of seconds to wait between requests. MJ12bot will delay up to 20 seconds between requests to your site. Note, however, that while it is unlikely, your site may still be crawled by multiple MJ12bots at the same time, so setting a high Crawl-Delay should minimise the impact on your site. The Crawl-Delay parameter is also honoured when it is set for the * wildcard.
If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
So we can write a robots.txt like the following:
User-agent: YisouSpider
Disallow: /
User-agent: EasouSpider
Disallow: /
User-agent: EtaoSpider
Disallow: /
User-agent: MJ12bot
Disallow: /
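Before deploying rules like these, you can sanity-check how a compliant crawler would interpret them with Python's standard urllib.robotparser (the site URL below is illustrative):

```python
from urllib.robotparser import RobotFileParser

# A small rule set combining a per-bot block with a wildcard group.
rules = """\
User-agent: MJ12bot
Crawl-delay: 5
Disallow: /

User-agent: *
Disallow: /wp-admin
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MJ12bot", "http://www.website.com/"))            # False: blocked entirely
print(rp.can_fetch("Googlebot", "http://www.website.com/"))          # True: only /wp-admin is off limits
print(rp.can_fetch("Googlebot", "http://www.website.com/wp-admin"))  # False: caught by the * group
print(rp.crawl_delay("MJ12bot"))                                     # 5
```

This only tells you what a *compliant* crawler would do; as shown later, some bots never read robots.txt at all.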
In addition, many bots probe these paths:
/wp-login.php /wp-admin/
/trackback/
/?replytocom=
…
Many WordPress sites genuinely use these directories, so how do we adjust the rules without breaking anything?
robots.txt before the change:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-includes
Disallow: /?s=

robots.txt after the change:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-*
Allow: /wp-content/uploads/
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /comments
Disallow: /wp-includes
Disallow: /*/trackback
Disallow: /*?replytocom*
Disallow: /?p=*&preview=true
Disallow: /?s=
That said, plenty of crawlers simply ignore robots.txt. This one, for example, never requested robots.txt before crawling:
"10.70.8.30, 163.172.65.40" - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/js/utils.js HTTP/1.1" 200 5345 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
"178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/css/home.css HTTP/1.1" 200 8511 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
When that happens, it is time to try the other methods.
2. .htaccess
The idea is URL rewriting: as soon as a request is seen to come from one of these agents, access is denied. The article by "~吉尔伽美什" describes many uses of .htaccess; the relevant sections are reproduced below.
5. Blocking users by IP

order allow,deny
deny from 123.45.6.7
deny from 12.34.5.
allow from all

(The second deny line blocks an entire class-C range.)

6. Blocking users/sites by referrer (requires mod_rewrite)

Example 1. Block a single referrer, badsite.com:

RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite\.com [NC]
RewriteRule .* - [F]

Example 2. Block multiple referrers, badsite1.com and badsite2.com:

RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite1\.com [NC,OR]
RewriteCond %{HTTP_REFERER} badsite2\.com
RewriteRule .* - [F]

[NC] - case-insensitive match
[F] - 403 Forbidden

Note that "Options +FollowSymlinks" is commented out in the code above. If the server does not set FollowSymLinks in the <Directory> section of httpd.conf, you need that line, otherwise you will get a "500 Internal Server Error".

7. Blocking bad bots and site rippers (aka offline browsers) (requires mod_rewrite)

What counts as a bad bot? For example, crawlers that harvest email addresses, and crawlers that ignore robots.txt (baidu?). They can be identified by HTTP_USER_AGENT. (Some are even more brazen, like "中搜 zhongsou.com", which sets its agent to "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"; against those this approach is powerless.)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

[F] - 403 Forbidden
[L] - last rule (stop processing further rules)

8. Change your default directory page

DirectoryIndex index.html index.php index.cgi index.pl

9. Redirects

A single file:

Redirect /old_dir/old_file.html http://yoursite.com/new_dir/new_file.html

An entire directory:

Redirect /old_dir http://yoursite.com/new_dir

Effect: as if the directory had been moved:
http://yoursite.com/old_dir -> http://yoursite.com/new_dir
http://yoursite.com/old_dir/dir1/test.html -> http://yoursite.com/new_dir/dir1/test.html

Tip: Redirect with Apache user directories. When using Apache's default user directories, e.g. http://mysite.com/~windix, and you want to redirect http://mysite.com/~windix/jump, you will find that the following Redirect does not work:

Redirect /jump http://www.google.com

The correct form is:

Redirect /~windix/jump http://www.google.com

(source: .htaccess Redirect in "Sites" not redirecting: why?)

10. Prevent viewing of the .htaccess file

<Files .htaccess>
order allow,deny
deny from all
</Files>
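Applied to the bots from this incident, the same user-agent technique collapses into a few lines. A minimal .htaccess sketch, assuming mod_rewrite is enabled; the agent list is only an example and should be adjusted to whatever your own logs show:

```apache
RewriteEngine On
# Deny any request whose User-Agent mentions one of these crawlers (case-insensitive).
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|AhrefsBot|YisouSpider|EasouSpider|EtaoSpider) [NC]
RewriteRule .* - [F,L]
```

A substring match without the ^ anchor is used here because these crawlers embed their name mid-string inside a Mozilla-compatible User-Agent, as the log excerpts above show.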
3. Denying access by IP
You can also refuse requests from specific IPs in the Apache configuration file, httpd.conf:
Order allow,deny
Allow from all
Deny from 5.9.26.210
Deny from 162.243.213.131
However, these IPs are rarely fixed, which makes this approach awkward, and any change to httpd.conf requires an Apache restart to take effect. Modifying .htaccess is therefore the recommended route.
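If you do go the IP route, the deny list can at least be regenerated from the access log instead of being maintained by hand. A minimal sketch in Python; the threshold, log shape, and function name are my own illustrative assumptions, not from the original article:

```python
import re
from collections import Counter

# In these logs the leading quoted field holds the client IP.
IP_RE = re.compile(r'^"?([\d.]+)')

def deny_rules(lines, threshold):
    """Emit an Apache 'Deny from' line for every IP at or above the request threshold."""
    hits = Counter()
    for line in lines:
        m = IP_RE.match(line)
        if m:
            hits[m.group(1)] += 1
    return ["Deny from %s" % ip for ip, n in hits.most_common() if n >= threshold]

sample = ['"5.9.26.210" - - [...] "GET / HTTP/1.1" 200 1'] * 3 + \
         ['"8.8.8.8" - - [...] "GET / HTTP/1.1" 200 1']
print(deny_rules(sample, threshold=3))   # ['Deny from 5.9.26.210']
```

The output lines can be pasted into (or appended to) the .htaccess deny block shown above; a cron job could do this periodically.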