
Scraping page titles from multiple URLs with Python multithreading

@class product_url: reads URLs from a file and feeds them into a queue; iterating the file line by line avoids loading a huge file into memory at once

@download_url: thread class; takes a URL from the queue, fetches its title, and prints it or writes it to a file

@recodetime: decorator that records a function's execution time

@main(): main function
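The three pieces described above fit together as a standard producer/consumer pattern over a thread-safe `Queue`. A minimal standalone sketch of just that pattern (the names and the doubling work are illustrative, not part of the script below):

```python
import threading
from queue import Queue, Empty

q = Queue()
for i in range(5):            # producer: enqueue work items
    q.put(i)

results = []
res_lock = threading.Lock()   # protects the shared results list

def worker():
    while True:
        try:
            item = q.get_nowait()   # consumer: drain the queue
        except Empty:
            break
        with res_lock:
            results.append(item * 2)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # → [0, 2, 4, 6, 8]
```

`get_nowait()` plus catching `Empty` lets every worker exit cleanly once the queue is drained, without a separate "is it empty?" check that could race against other threads.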

#!/usr/local/bin/python3
# coding:utf-8
import requests
import threading
import re
import chardet
import time
from queue import Queue
from lxml import html
from functools import wraps

lock = threading.Lock()
number = 1  # shared counter for printed results


class product_url():
    # Producer: reads domains from a file and feeds URLs into a queue.
    # Iterating the file object line by line keeps memory usage flat
    # even for very large files.
    # @param file: file name
    def __init__(self, file):
        self.file = file

    def product(self, queue):
        # Read the domain list file and enqueue one URL per line.
        # @param queue: URL queue
        with open(self.file) as f:
            for shortUrl in f:
                url = shortUrl.strip()
                if self.is_valid_domain(url):
                    queue.put(self.makeUrl(url))

    @staticmethod
    def makeUrl(shortUrl):
        # Build a full URL from a bare domain.
        # @param shortUrl: domain name
        # @return: full URL such as http://xxxx.com/
        base = "http://"
        return base + shortUrl + '/'

    @staticmethod
    def is_valid_domain(value):
        # Check whether the string looks like a domain name.
        # @param value: candidate domain
        # @return: True if it matches, otherwise False
        pattern = re.compile(
            r'^(([a-zA-Z]{1})|([a-zA-Z]{1}[a-zA-Z]{1})|'
            r'([a-zA-Z]{1}[0-9]{1})|([0-9]{1}[a-zA-Z]{1})|'
            r'([a-zA-Z0-9][-_.a-zA-Z0-9]{0,61}[a-zA-Z0-9]))\.'
            r'([a-zA-Z]{2,13}|[a-zA-Z0-9-]{2,30}\.[a-zA-Z]{2,3})$'
        )
        return bool(pattern.match(value))


class download_url(threading.Thread):
    # Consumer thread: takes URLs off the queue and fetches their titles.
    # @param queue: URL queue
    # headers: User-Agent sent with every request
    headers = {'User-Agent': 'Mozilla/5.0 (compatible;Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)'}

    def __init__(self, queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.queue = queue

    def run(self):
        global number
        while True:
            try:
                # get_nowait() avoids the race between an empty() check and
                # a blocking get() when several threads drain the same queue
                url = self.queue.get_nowait()
            except Exception:  # queue.Empty: no work left
                break
            response = self.fetch(url)  # fetch() already returns None on error
            title = self.parse(response) if response else ""
            with lock:
                # self.savefile(file, url, title)
                print(number, url, title)
                number += 1
        return

    def fetch(self, url):
        # Fetch a URL with requests.
        # @param url: full URL
        # @return: response on success, None on failure
        try:
            return requests.get(url, headers=self.headers, timeout=10)
        except Exception:
            return None

    def wdecode(self, response):
        # Decode page content adaptively using chardet.
        # @param response: response object
        # @return: decoded page content as a string
        try:
            encode = chardet.detect(response.content).get('encoding')
            if encode and encode.lower() == 'gb2312':
                # chardet often reports gb2312 for pages that actually use
                # the gbk superset, so decode with gbk instead
                encode = 'gbk'
            response.encoding = encode
            response = response.text
        except Exception:
            pass
        finally:
            return response

    def parse(self, response):
        # Parse the page content and extract the title.
        # @param response: response object from requests.get
        # @return: title text on success, None on failure
        try:
            result = html.fromstring(self.wdecode(response))
            return ''.join(result.xpath('//head/title/text()'))
        except Exception:
            return None

    def savefile(self, file, url, title):
        # Write one result line to a file (left unimplemented here).
        # @param file: file name
        # @param url: URL
        # @param title: title text
        pass


def recodetime(output):
    # Decorator factory: measures a function's execution time.
    # @param output: callable used to report the elapsed time (e.g. print)
    # @wraps(func): preserves the wrapped function's name and attributes
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            res = func(*args, **kwargs)
            end = time.time()
            output("elapsed: %s" % (end - start))
            return res
        return wrapper
    return decorator


@recodetime(print)
def main():
    max_works = 20  # number of threads
    urls_queue = Queue()  # URL queue
    pro = product_url('1.txt')
    pro.product(urls_queue)
    task = [download_url(urls_queue) for _ in range(max_works)]
    for x in task:
        x.start()
    for x in task:
        x.join()
    print("all done!")
    return


if __name__ == '__main__':
    main()
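The hand-rolled thread pool above can also be expressed with the standard library's `concurrent.futures`, which manages the workers and the work distribution internally. A minimal sketch under assumed inputs (the `fetch_title` helper and the sample URLs are placeholders standing in for the fetch-and-parse steps of the script, not part of the original):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_title(url):
    # Placeholder for the fetch + parse steps of the script above;
    # here it just echoes the URL so the sketch stays self-contained.
    return url

urls = ["http://example.com/", "http://example.org/"]  # sample input

with ThreadPoolExecutor(max_workers=20) as pool:
    # submit one task per URL; as_completed yields each future as it finishes
    futures = {pool.submit(fetch_title, u): u for u in urls}
    for fut in as_completed(futures):
        print(futures[fut], fut.result())
```

The executor replaces the explicit `Queue`, the `start()`/`join()` loops, and the drain-until-empty logic in `run()`, at the cost of a little less control over per-thread behavior.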

 
