Case (II): crawler preparation
Project two: crawling stock data using two different methods
Method two: the Scrapy crawler framework
This case crawls the relevant content using the Scrapy framework.
Installing the scrapy framework
Open cmd and run the following command to install it.
pip install scrapy
Verify that the installation succeeded; if it did, the following command prints Scrapy's version and the list of available commands.
scrapy -h
Create a new Scrapy crawler project
Once Scrapy is installed, create the project from cmd. Change to the directory where you want the crawler project to live and execute:
scrapy startproject baidustocks
Once executed, a set of folders and .py files will be created in that directory.
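The layout is the standard Scrapy project skeleton and should look roughly like this:

baidustocks/
├── scrapy.cfg            # deployment configuration
└── baidustocks/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines (edited later in this case)
    ├── settings.py       # project settings (also edited later)
    └── spiders/          # directory for the spider code
        └── __init__.py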
Generating a Scrapy crawler in a project
This takes a single command in cmd; we need to specify the name of the crawler and the website to crawl.
cd baidustocks
scrapy genspider stocks hq.gucheng.com/gpdmylb.html
Here stocks is the name of the crawler, and hq.gucheng.com/gpdmylb.html is the site to crawl.
When it finishes, a file called stocks.py is generated under baidustocks/spiders/.
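Before we modify it, stocks.py holds roughly the default template that genspider produces; the exact allowed_domains and start_urls values depend on your Scrapy version:

import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['hq.gucheng.com']
    start_urls = ['http://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        pass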
Configure the generated spider
Modify the spider file to suit your needs. As an example, I will crawl stock data.
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.selector import Selector


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        # Follow every link that looks like a stock page, e.g. SH600000/
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.search(r'S[HZ]\d{6}/', href)
                url = 'https://hq.gucheng.com/' + stock.group()
                yield scrapy.Request(url, callback=self.parse_stock)
            except AttributeError:
                # re.search returned None: not a stock link, skip it
                continue

    def parse_stock(self, response):
        infoDict = dict()
        stockInfo = response.css('.stock_top').extract()[0]
        stockprice = response.css('.s_price').extract()[0]
        stockname = response.css('.stock_title').extract()[0]
        stockname = Selector(text=stockname)
        stockprice = Selector(text=stockprice)
        stockInfo = Selector(text=stockInfo)
        # Title block: name (名字), code (编号), status (状态), time (时间)
        infoDict['名字'] = re.search(r'>(.*?)</h1>', stockname.css('h1').extract()[0]).group(1)
        infoDict['编号'] = re.search(r'>(.*?)</h2>', stockname.css('h2').extract()[0]).group(1)
        infoDict['状态'] = re.search(r'>(.*?)</em>', stockname.css('em').extract()[0]).group(1)
        infoDict['时间'] = re.search(r'>(.*?)</time>', stockname.css('time').extract()[0]).group(1)
        # Price block: price (股价), change (涨跌额), percent change (涨跌幅)
        price = stockprice.css('em').extract()
        infoDict['股价'] = re.search(r'>(.*?)</em>', price[0]).group(1)
        infoDict['涨跌额'] = re.search(r'>(.*?)</em>', price[1]).group(1)
        infoDict['涨跌幅'] = re.search(r'>(.*?)</em>', price[2]).group(1)
        # The remaining fields live in <dt>/<dd> pairs of the info block
        keylist = stockInfo.css('dt').extract()
        valuelist = stockInfo.css('dd').extract()
        for i in range(len(keylist)):
            key = re.search(r'>(.*?)<', keylist[i], flags=re.S).group(1)
            key = str(key).replace('\n', '')
            try:
                val = re.search(r'>(.*?)<', valuelist[i], flags=re.S).group(1)
                val = str(val).replace('\n', '')
            except AttributeError:
                val = '--'
            infoDict[key] = val
        yield infoDict
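As a quick sanity check on the link filter, the regex in parse only keeps hrefs that contain a Shanghai or Shenzhen stock code. A minimal sketch, using made-up example hrefs for illustration:

import re

# Hypothetical hrefs, for illustration only
hrefs = ['https://hq.gucheng.com/SH600000/', '/SZ000001/', '/news/12345/']
for href in hrefs:
    m = re.search(r'S[HZ]\d{6}/', href)
    print(href, '->', m.group() if m else 'skipped')
# The first two match (SH600000/, SZ000001/); the news link is skipped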
Run the crawler and get the data
In cmd, execute the following command:
scrapy crawl stocks
Wait for it to run; when it finishes, Scrapy prints summary statistics for the crawl.
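As a side note, if you only want the raw items and do not need a custom pipeline, Scrapy's built-in feed export can write them straight to a file:

scrapy crawl stocks -o stocks.json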
Write Pipelines to process the fetched data
Edit the pipelines.py file:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BaidustocksPipeline:
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline:
    def open_spider(self, spider):
        # Called when the spider starts: open the output file once
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called when the spider finishes: close the file
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one dict per line of plain text
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
Configure the ITEM_PIPELINES option
Open settings.py, find the ITEM_PIPELINES setting, and point it at the pipeline we just wrote, as shown below.
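A minimal sketch of the change, assuming the project is named baidustocks as above; the number is the pipeline's priority (lower runs first):

ITEM_PIPELINES = {
    'baidustocks.pipelines.BaidustocksInfoPipeline': 300,
}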
Execute the entire framework
In cmd, run:
scrapy crawl stocks
Then we just wait for it to finish =.=
Job done!