- A+
所属分类:Python
1. 目的
使用爬虫脚本 爬去 百度搜索关键字后获得链接地址以及域名信息
可结合GHDB语法
e.g. inrul:php?id=
2. 知识结构
2.1 使用 threading & queue 模块,多线程处理,自定义线程数
2.2 使用BeautifulSoup & re模块,处理href 匹配
2.3 使用requests 模块,处理web请求&获得请求后的真实地址(r.url)
2.4 百度最大搜索页面76页,pn max 760
2.5 将结果存入文本,域名已去重
3. 爬虫脚本
#coding=utf-8 import requests import re import Queue import threading from bs4 import BeautifulSoup as bs import os,sys,time headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'} class BaiduSpider(threading.Thread): def __init__(self,queue): threading.Thread.__init__(self) self._queue = queue def run(self): while not self._queue.empty(): url = self._queue.get_nowait() try: #print url self.spider(url) except Exception,e: print e pass def spider(self,url): #if not add self , error:takes exactly 1 argument (2 given) r = requests.get(url=url,headers=headers) soup = bs(r.content,'lxml') urls = soup.find_all(name='a',attrs={'data-click':re.compile(('.')),'class':None}) for url in urls: #print url['href'] new_r = requests.get(url=url['href'],headers=headers,timeout=3) if new_r.status_code == 200 : url_para = new_r.url url_index_tmp = url_para.split('/') url_index = url_index_tmp[0]+'//'+url_index_tmp[2] print url_para+'\n'+url_index with open('url_para.txt','a+') as f1: f1.write(url_para+'\n') with open('url_index.txt','a+') as f2: with open('url_index.txt', 'r') as f3: if url_index not in f3.read(): f2.write(url_index+'\n') else: print 'no access',url['href'] def main(keyword): queue = Queue.Queue() de_keyword = keyword.decode(sys.stdin.encoding).encode('utf-8') print keyword # baidu max pages 76 , so pn=750 max for i in range(0,760,10): #queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d'%(keyword,i)) queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d'%(de_keyword,i)) threads = [] thread_count = 4 for i in range(thread_count): threads.append(BaiduSpider(queue)) for t in threads: t.start() for t in threads: t.join() if __name__ == '__main__': if len(sys.argv) != 2: print 'Enter:%s keyword'%sys.argv[0] sys.exit(-1) else: main(sys.argv[1])
效果图
4. 待优化点
4.1 多个搜索引擎的处理
4.2 多参数的处理
4.2 payload 结合
5. 参考信息
5.1. ADO ichunqiu Python安全工具开发应用
5.2. https://github.com/sharpdeep/CrawlerBaidu/blob/master/CrawlerBaidu.py
- 我的微信
- 这是我的微信扫一扫
- 我的微信公众号
- 我的微信公众号扫一扫