
Scrapy framework: a complete tutorial for scraping course information from imooc.com (慕课网)



1. Create the project

scrapy startproject mukewang

cd mukewang

scrapy genspider mukes imooc.com

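# Run the code
Run the following on the command line from this directory, passing the spider name (mukes in this tutorial):

scrapy crawl mukes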


2. mukes.py spider code

# -*- coding: utf-8 -*-
import scrapy

from mukewang.items import MukewangItem


class MukesSpider(scrapy.Spider):
    name = 'mukes'
    allowed_domains = ['imooc.com']
    start_urls = ['http://www.imooc.com/course/list']

    def parse(self, response):
        # Each course card on the list page becomes one item.
        for box in response.xpath('//div[@class="course-card-container"]'):
            # Create a fresh item per card so yielded items do not share state.
            item = MukewangItem()
            item['title'] = box.xpath('.//h3[@class="course-card-name"]/text()').extract()
            item['url'] = box.xpath('.//@href').extract()
            item['image_url'] = box.xpath('.//@data-original').extract()
            item['introduction'] = box.xpath('.//p[@class="course-card-desc"]/text()').extract()
            item['student'] = box.xpath('.//div[@class="course-card-info"]/span[2]/text()').extract()

            yield item
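
Every field above stores the raw result of .extract(), which is a list of strings, and the href values may be relative paths. If single, cleaned values are preferred, one option is a small helper applied before the item is yielded. The following is a minimal standard-library sketch; the helper names first_clean and absolute_url and the sample values are hypothetical and not part of the original spider.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.imooc.com/course/list'  # start URL used by the spider


def first_clean(values):
    """Return the first non-empty, whitespace-stripped string, or None."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return None


def absolute_url(values, base=BASE_URL):
    """Resolve the first extracted href against the page URL."""
    href = first_clean(values)
    return urljoin(base, href) if href else None


# Hypothetical extraction results, shaped like the lists .extract() returns.
sample_title = ['\n  Sample course  ']
sample_href = ['/learn/1']

print(first_clean(sample_title))
print(absolute_url(sample_href))
```

Inside a real spider, Scrapy's own .extract_first() (or .get()) and response.urljoin() cover similar ground; the helpers here only make the idea visible without a running crawl.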


3. items.py code

# -*- coding: utf-8 -*-

import scrapy


class MukewangItem(scrapy.Item):
    # One field per value the mukes spider extracts from a course card.
    title = scrapy.Field()
    url = scrapy.Field()
    image_url = scrapy.Field()
    introduction = scrapy.Field()
    student = scrapy.Field()


4. pipelines.py code

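# -*- coding: utf-8 -*-

import json


class MukewangPipeline:
    def __init__(self):
        # One JSON object per line is written to muke.json.
        self.file = open('muke.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        # Return the item so any later pipeline still receives it.
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes; close the output file.
        self.file.close()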
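
The pipeline passes ensure_ascii=False to json.dumps. The sketch below illustrates what that flag changes, using an in-memory buffer in place of muke.json. The item contents are hypothetical and chosen only to include non-ASCII text.

```python
import io
import json

# Hypothetical item contents, shaped like dict(item) in the pipeline.
item = {'title': ['慕课网'], 'student': ['100']}

escaped = json.dumps(item)                       # default: non-ASCII is \u-escaped
readable = json.dumps(item, ensure_ascii=False)  # as in the pipeline: text kept as-is

buffer = io.StringIO()  # in-memory stand-in for the muke.json file
buffer.write(readable + '\n')

print(escaped)
print(readable)
```

Both strings decode to the same data with json.loads; the flag only affects how readable the file is when opened in an editor, which is also why the file is opened with encoding='utf-8'.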


5. settings.py code

# Uncomment this block

ITEM_PIPELINES = {
    'mukewang.pipelines.MukewangPipeline': 300,
}
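
The number assigned to each pipeline sets its position in the processing order: Scrapy passes items through pipelines from the lowest value to the highest, and values are customarily chosen in the 0-1000 range. A minimal sketch of that ordering, assuming a second pipeline that does not exist in this project (HypotheticalCleanupPipeline is a made-up name for illustration):

```python
# Hypothetical setting: the second entry is invented for illustration only.
ITEM_PIPELINES = {
    'mukewang.pipelines.MukewangPipeline': 300,
    'mukewang.pipelines.HypotheticalCleanupPipeline': 100,
}

# Items go through pipelines in ascending order of the assigned value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With a single pipeline, as in this tutorial, the value 300 has no practical effect beyond enabling the pipeline.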


Running the spider generates a muke.json file in the mukewang directory (the directory from which scrapy crawl was run).
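
Because the pipeline writes one JSON object per line, muke.json is not a single JSON document and has to be read line by line. A minimal sketch of loading it back, where load_items is a hypothetical helper name and the default path assumes the script runs in the same directory as the output file:

```python
import json


def load_items(path='muke.json'):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    items = []
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

Calling load_items() after a crawl returns a list of dictionaries whose keys match the fields declared in items.py.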




