
Financial Data Analysis V: Crawling Stock Data - Method 2 -- the Scrapy crawler framework

Allen Ma · 3 minutes to read

Case (II): Crawl preparation

Project two: crawling stock data using two different methods

Method two: scrapy crawler framework

This case crawls the relevant stock pages using the Scrapy framework.

Installing the Scrapy framework

Open cmd and run the following command to install it.

pip install scrapy

Verify that the installation was successful.

scrapy -h

Create a new Scrapy crawler project

Once Scrapy has been installed successfully, create the project from cmd. Change to the directory where you want the crawler project to live and execute:

scrapy startproject baidustocks

Once executed, a set of folders and .py files will be created in the directory.
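The original screenshot of the result is missing; Scrapy's startproject template produces a layout roughly like this (middlewares.py appears in newer Scrapy versions):

baidustocks/
    scrapy.cfg            # deployment configuration
    baidustocks/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py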

Generating a Scrapy crawler in a project

This takes just a single command in cmd; we need to specify the name of the crawler and the website to crawl.

cd baidustocks
scrapy genspider stocks hq.gucheng.com/gpdmylb.html

Here stocks is the name of the crawler, and hq.gucheng.com/gpdmylb.html is the site to crawl.

When it is done, a file called stocks.py will be generated under baidustocks/spiders.
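The generated file is only a skeleton; depending on your Scrapy version it will look roughly like this:

import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['hq.gucheng.com/gpdmylb.html']
    start_urls = ['http://hq.gucheng.com/gpdmylb.html/']

    def parse(self, response):
        pass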

Configure the generated spider

Modify the crawler file to suit your needs. As an example, I will crawl stock data.

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.selector import Selector


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['https://hq.gucheng.com/gpdmylb.html']

    def parse(self, response):
        # Follow every link that contains a stock code such as SH600000/
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.search(r'S[HZ]\d{6}/', href)
                url = 'https://hq.gucheng.com/' + stock.group()
                yield scrapy.Request(url, callback=self.parse_stock)
            except AttributeError:
                # re.search returned None: not a stock link, skip it
                continue

    def parse_stock(self, response):
        infoDict = dict()
        stockInfo = response.css('.stock_top').extract()[0]
        stockprice = response.css('.s_price').extract()[0]
        stockname = response.css('.stock_title').extract()[0]
        # Re-wrap the extracted HTML fragments so CSS selectors can be run on them
        stockname = Selector(text=stockname)
        stockprice = Selector(text=stockprice)
        stockInfo = Selector(text=stockInfo)
        infoDict['名字'] = re.search(r'>(.*?)</h1>', stockname.css('h1').extract()[0]).group(1)      # name
        infoDict['编号'] = re.search(r'>(.*?)</h2>', stockname.css('h2').extract()[0]).group(1)      # code
        infoDict['状态'] = re.search(r'>(.*?)</em>', stockname.css('em').extract()[0]).group(1)      # status
        infoDict['时间'] = re.search(r'>(.*?)</time>', stockname.css('time').extract()[0]).group(1)  # time
        price = stockprice.css('em').extract()
        infoDict['股价'] = re.search(r'>(.*?)</em>', price[0]).group(1)    # price
        infoDict['涨跌额'] = re.search(r'>(.*?)</em>', price[1]).group(1)  # change
        infoDict['涨跌幅'] = re.search(r'>(.*?)</em>', price[2]).group(1)  # change percent
        # Each <dt> holds a field name and the matching <dd> holds its value
        keylist = stockInfo.css('dt').extract()
        valuelist = stockInfo.css('dd').extract()
        for i in range(len(keylist)):
            key = re.search(r'>(.*?)<', keylist[i], flags=re.S).group(1)
            key = str(key).replace('\n', '')
            try:
                val = re.search(r'>(.*?)<', valuelist[i], flags=re.S).group(1)
                val = str(val).replace('\n', '')
            except (IndexError, AttributeError):
                val = '--'
            infoDict[key] = val
        yield infoDict
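Each of those re.search(r'>(.*?)<', ...) calls simply strips the surrounding tags from an extracted HTML fragment. A minimal illustration (the fragment here is made up):

import re

fragment = '<h1>某股票</h1>'  # made-up fragment of the kind extract() returns
print(re.search(r'>(.*?)</h1>', fragment).group(1))  # prints: 某股票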

Run the crawler and get the data

Execute the following command in cmd:

scrapy crawl stocks

Wait for it to finish; the crawl summary is printed after execution.
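Incidentally, if you only want a quick look at the scraped items, Scrapy's built-in feed export can dump them to a file without any pipeline code:

scrapy crawl stocks -o stocks.json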

Write Pipelines to process the fetched data

Edit the pipelines.py file:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BaidustocksPipeline:
    def process_item(self, item, spider):
        return item


class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts: open the output file
        # (utf-8 so the Chinese keys are written safely)
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called when the spider finishes: close the file
        self.f.close()

    def process_item(self, item, spider):
        # Write each item as one dict per line of plain text
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item

Configure the ITEM_PIPELINES option

Edit the settings.py file: look for the ITEM_PIPELINES parameter and point it at the new pipeline class.

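The original screenshot is missing; given the project name and the pipeline class defined above, the setting should look like this (300 is just the usual execution-order value):

ITEM_PIPELINES = {
    'baidustocks.pipelines.BaidustocksInfoPipeline': 300,
}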

Execute the entire framework

In cmd:

scrapy crawl stocks

Then we just wait for it to finish =. =... Job done!