
Python Spider: BUPT Electricity Usage Query System Data

May Sunlight Always Shine into Memories: the Student Electricity Purchase and Usage Query System

I have a cold, and it is rare to be remembered by someone. Thank you. May sunlight always shine into memories; warmth like that is hard to forget.


One day, a senior schoolmate, @Zhaoking, mysteriously showed me a website: http://ydcx.bupt.edu.cn/

He said: you could try crawling its data.

At the time I was following a pile of data-analysis courses (machine learning, regression analysis, statistical inference, exploratory graphical analysis) in eight threads, so to speak, and it seemed the electricity data might be fun to play with.
So I did as the senior commanded...

First, study the site's logic. Open Firebug, enter 1-101, and press Enter.

Clearly the form submission is a POST, but it returns a 302 redirect, so the second request, the GET that follows the redirect, should be the one that actually returns the data.

Let's try it:

In [1]: import requests

In [2]: dorm_num = '1-101'

In [3]: r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)

Check r.content and confirm that the remaining electricity and the recharge records are in there.

So what about all the other pages? Let's click around. Page 1 can no longer be clicked, so click 2.

It turns into a POST, and one with a lot of POST data at that. Just to see what happens, you can try leaving all that POST data out:

In [5]: __EVENTTARGET = 'GridView1'

In [6]: __EVENTARGUMENT = 'Page$2'

In [7]: data = {'__EVENTTARGET': __EVENTTARGET,  '__EVENTARGUMENT': __EVENTARGUMENT}

In [9]: r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=data)

In [10]: r
Out[10]: <Response [500]>

500, an internal server error. This generally means the server failed while processing the request. Why? Intuition says the problem lies in those POST parameters we dropped.

Add them and try again:

 In [15]: data = {'__EVENTTARGET': __EVENTTARGET,  '__EVENTARGUMENT': __EVENTARGUMENT, '__EVENTVALIDATION': '/wEWCwLMxp63CAKtsp64BAKtsuK4BAKtsva4BAKtsvq4BAKtsu64BAKtsvK4BAKtssa4BAKtssq4BALDuK6IBQKokYx1/Loh35D537/CRr+++EgM74nLP5E=', '__VIEWSTATE': '/wEPDwULLTIwNzMwNzAxOTAPZBYCAgMPZBYQZg8PFgIeBFRleHQFATFkZAIBDw8WAh8ABQEgZGQCAg8PFgIfAAUyMS0xMDEgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBkZAIDDw8WAh8ABQbnlLXku7dkZAIEDw8WAh8ABQwwMDAwMDAwMzg5NTVkZAIFDw8WAh8ABQ0xMC4yMTAuOTYuMjEwZGQCBg88KwANAQAPFgQeC18hRGF0YUJvdW5kZx4LXyFJdGVtQ291bnQCjQNkFgJmD2QWFgIBD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTI0IDA6MDA6MDBkZAICDw8WAh8ABQMyMTVkZAICD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIzIDA6MDA6MDBkZAICDw8WAh8ABQMyMThkZAIDD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIyIDA6MDA6MDBkZAICDw8WAh8ABQMyMjBkZAIED2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIxIDA6MDA6MDBkZAICDw8WAh8ABQMyMjNkZAIFD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIwIDA6MDA6MDBkZAICDw8WAh8ABQMyMjVkZAIGD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE5IDA6MDA6MDBkZAICDw8WAh8ABQMyMjhkZAIHD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE4IDA6MDA6MDBkZAICDw8WAh8ABQMyMzBkZAIID2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE3IDA6MDA6MDBkZAICDw8WAh8ABQMyMzJkZAIJD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE2IDA6MDA6MDBkZAICDw8WAh8ABQMyMzRkZAIKD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAx
NC04LTE1IDA6MDA6MDBkZAICDw8WAh8ABQMyMzZkZAILDw8WAh4HVmlzaWJsZWhkZAIIDzwrAA0BAA8WBB8BZx8CAgVkFgJmD2QWDgIBD2QWDmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTcgMTQ6MTM6NDBkZAICDw8WAh8ABQMyNTBkZAIDDw8WAh8ABQMxMjBkZAIEDw8WAh8ABQfotK0g55S1ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCAg9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFETIwMTQtMi0xOCA4OjI3OjQ2ZGQCAg8PFgIfAAUDMjgwZGQCAw8PFgIfAAUBMGRkAgQPDxYCHwAFB+WFjSDotLlkZAIFDw8WAh8ABQzliqDnlLXlrozmiJBkZAIGDw8WAh8ABQnlvKDogIHluIhkZAIDD2QWDmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAUQMjAxMy05LTMgOTo1NTowOGRkAgIPDxYCHwAFAzI4MGRkAgMPDxYCHwAFATBkZAIEDw8WAh8ABQflhY0g6LS5ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCBA9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFEjIwMTMtMy0yMCAxNjozMzoyNWRkAgIPDxYCHwAFAzE4MGRkAgMPDxYCHwAFATBkZAIEDw8WAh8ABQflhY0g6LS5ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCBQ9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFETIwMTMtMy0xIDE1OjM5OjIxZGQCAg8PFgIfAAUDMTAwZGQCAw8PFgIfAAUBMGRkAgQPDxYCHwAFB+WFjSDotLlkZAIFDw8WAh8ABQzliqDnlLXlrozmiJBkZAIGDw8WAh8ABQnlvKDogIHluIhkZAIGDw8WAh8DaGRkAgcPDxYCHwNoZGQYAgUJR3JpZFZpZXcyDzwrAAoBCAIBZAUJR3JpZFZpZXcxDzwrAAoBCAIoZG6dKHZt7NdjJRdl8NOMCRx8QVCP'} 

In [16]: r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=data)

In [17]: r
Out[17]: <Response [200]>

It successfully returns the electricity data we want; check r.content to confirm.

The next question: where do these parameters come from?

A quick Google search and a look at the page source reveal the answer: they are ASP.NET state fields embedded as hidden form inputs in the page itself.
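These parameters are ASP.NET WebForms state, carried in hidden form fields of the page; in the source they look something like this (the long base64 values are truncated here):

```html
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTIwNzMwNzAxOTAPZBYCAgMPZBYQ..." />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWCwLMxp63CAKtsp64BAKt..." />
```

Every postback has to echo these two fields back to the server, which is why the bare POST earlier came back with a 500.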

So the logic is clear: first GET the page for a given dorm room to obtain the POST parameters, then flip through the pages one by one.

Next, how to extract information from the HTML source. re certainly works, but something like XPath may be nicer. For example:

from lxml.html import fromstring

## parse the content
root = fromstring(r.content)
## extract __VIEWSTATE
__VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
## extract __EVENTVALIDATION
__EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]

That is a lot simpler and more elegant than writing the equivalent regex. Still, quick and dirty gets it done; black cat or white cat, as the saying goes:

re.search('id="__VIEWSTATE" value="([^"]+)"', r.content).group(1)
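For example, against a canned snippet (the value below is invented), the regex pulls the field out just fine:

```python
import re

# a hypothetical page fragment; real __VIEWSTATE values are long base64 blobs
html = '<input type="hidden" id="__VIEWSTATE" value="/wEPDwULfakestate" />'

viewstate = re.search('id="__VIEWSTATE" value="([^"]+)"', html).group(1)
print(viewstate)  # /wEPDwULfakestate
```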

The next problem is that I don't know the total number of pages; the trailing "..." link in the pager jumps to the next block of ten pages.

So, after some trial and error, I worked around the page-count problem in a rather quick and dirty way...

My solution: a loop that keeps incrementing the page number and reads all the page numbers shown at the bottom right of the current page.

If the last of those page links, cpn[-1], is exactly one less than the next page np I want to fetch, or the highest page number shown is not a multiple of ten, we must be in the final block of pages; scrape what is left and break instead of continuing.

if (int(cpn[-1]) == np - 1) or (int(cpn[-1]) % 10 != 0):
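That condition in miniature, with made-up pager contents (cpn holds the page numbers the pager links to, as strings, and np is the page just requested):

```python
def is_last_block(cpn, np):
    # cpn: page numbers shown as links in the pager (strings, as re.findall returns)
    # np: the page number that was just requested
    # Final block if the highest link is the page right before np,
    # or if the highest link is not a multiple of ten (no further "..." block).
    return int(cpn[-1]) == np - 1 or int(cpn[-1]) % 10 != 0

# a full block: page 2 was fetched, the pager links 1 and 3..10, so keep going
print(is_last_block(['1', '3', '4', '5', '6', '7', '8', '9', '10'], 2))  # False
# page 13 of 13 was fetched, the pager only links 11 and 12: last block
print(is_last_block(['11', '12'], 13))  # True
```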

Anyway... in the end it runs...

What an unsightly implementation. Browse it for fun:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
Electricity usage data
http://ydcx.bupt.edu.cn/Default.aspx
"""


__author__ = "Reverland"

import socket
from gevent import monkey
monkey.patch_all()
import requests
from lxml.html import fromstring
import re
import os

timeout = 30
socket.setdefaulttimeout(timeout)

try:
    os.makedirs('data')
except OSError:
    pass


def buy_info(dorm_num, root):
    # skip if already downloaded
    if os.path.exists('data/' + dorm_num + '_buy.csv'):
        print dorm_num + ' buy info downloaded...'
        return
    data = root.xpath('//table[@id="GridView2"]/tr/td/font/text()')
    # each record is 7 cells: room number plus the six fields after it
    with open('data/' + dorm_num + '_buy.csv', 'wb') as f:
        f.write('dorm, timestamp, electric_increase, money, charge_bool, state, operator\n')
    for i in range(len(data)):
        if i % 7 < 6:
            with open('data/' + dorm_num + '_buy.csv', 'a') as f:
                f.write(data[i].encode('iso-8859-1').strip() + ',')
        if i % 7 == 6:
            with open('data/' + dorm_num + '_buy.csv', 'a') as f:
                f.write(data[i].encode('iso-8859-1').strip() + '\n')



def parse_dorm_elec(dorm_num):
    # skip if already downloaded
    if os.path.exists('data/' + dorm_num + '.csv'):
        print dorm_num + ' downloaded...'
        return
    print "parsing " + dorm_num + ' now...'
    data = []
    r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)
    ## parse the content
    root = fromstring(r.content)
    ## extract __VIEWSTATE
    __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
    ## extract __EVENTVALIDATION
    __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
    ## __EVENTTARGET is fixed
    __EVENTTARGET = 'GridView1'
    buy_info(dorm_num, root)
    ## define headers
    header = {
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    np = 2

    while 1:
        # fetch page np with the state extracted from the previous response
        # print "Retrieving data from page", np
        __EVENTARGUMENT = 'Page$' + str(np)
        payload = {
            '__EVENTARGUMENT': __EVENTARGUMENT,
            '__EVENTTARGET': __EVENTTARGET,
            '__EVENTVALIDATION': __EVENTVALIDATION,
            '__VIEWSTATE': __VIEWSTATE}
        r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
        ## get electric data
        data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
        ## get the page numbers linked in the pager of the current page
        cpn = re.findall('\)"><font color="#4A3C8C">([\d]+)', r.content)
        if (int(cpn[-1]) == np - 1) or (int(cpn[-1]) % 10 != 0):
            # final block of pages: fetch whatever is left, then stop
            cpn = range(np + 1, int(cpn[-1]) + 1)
            for p in cpn:
                ## parse the content
                root = fromstring(r.content)
                ## extract __VIEWSTATE
                __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
                ## extract __EVENTVALIDATION
                __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
                __EVENTARGUMENT = 'Page$' + str(p)
                payload = {
                    '__EVENTARGUMENT': __EVENTARGUMENT,
                    '__EVENTTARGET': __EVENTTARGET,
                    '__EVENTVALIDATION': __EVENTVALIDATION,
                    '__VIEWSTATE': __VIEWSTATE}
                r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
                data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
                # print "Retrieving data from page", p
            break
        ## otherwise walk the pages linked in this block
        ## (on the first block this also picks up page 1, which the initial GET never scraped)
        for p in cpn:
            ## parse the content
            root = fromstring(r.content)
            ## extract __VIEWSTATE
            __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
            ## extract __EVENTVALIDATION
            __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
            __EVENTARGUMENT = 'Page$' + p
            payload = {
                '__EVENTARGUMENT': __EVENTARGUMENT,
                '__EVENTTARGET': __EVENTTARGET,
                '__EVENTVALIDATION': __EVENTVALIDATION,
                '__VIEWSTATE': __VIEWSTATE}
            r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
            data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
            # print "Retrieving data from page", p
        # advance to the first page of the next block
        np = int(cpn[-1]) + 1
    # write header
    with open('data/' + dorm_num + '.csv', 'wb') as f:
        f.write('timestamp' + ',' + 'electric_remain\n')
    for i in range(len(data)):
        # cells repeat as (room, timestamp, remaining); keep the last two
        if i % 3 == 1:
            with open('data/' + dorm_num + '.csv', 'a') as f:
                f.write(data[i] + ',')
        if i % 3 == 2:
            with open('data/' + dorm_num + '.csv', 'a') as f:
                f.write(data[i] + '\n')
    return data
## for example
# __EVENTVALIDATION = '/wEWCwKhncT7DwKtsp64BAKtsuK4BAKtsva4BAKtsvq4BAKtsu64BAKtsvK4BAKtssa4BAKtssq4BALDuK6IBQKokYx1vYI/vd4i5UCVuVYMSo/l30x1o/g='
# __EVENTARGUMENT = 'Page$2'

#data = parse_dorm_elec(dorm_num)
def parse_dorm_elec_wrap(dorm_num):
    # swallow errors so one bad room does not kill the whole run
    try:
        parse_dorm_elec(dorm_num)
    except:
        print "parsing " + dorm_num + " error..."
# ----------------------------------------------------
# # dorm 10
# dorms = []
# # layer 1
# dorms += map(lambda x: '10-1' + str(x).zfill(2), range(1, 17))
# # layer 2
# dorms += map(lambda x: '10-2' + str(x).zfill(2), range(1, 21))
# # layer 3 to 7
# for l in range(3, 8):
#     dorms += map(lambda x: '10-' + str(l) + str(x).zfill(2), range(1, 72))
# # layer 8 to 13
# for l in range(8, 14):
#     dorms += map(lambda x: '10-' + str(l) + str(x).zfill(2), range(1, 53))
# ------------------------------------------------------

#--------------------
# dorm 1
dorms = []
# layer 1
dorms += map(lambda x: '1-' + str(1) + str(x).zfill(2), range(1, 24))
# layers 2 to 5
for l in range(2, 6):
    dorms += map(lambda x: '1-' + str(l) + str(x).zfill(2), range(1, 27))
#--------------------

# for d in dorms:
#     try:
#         parse_dorm_elec_wrap(d)
#     except:
#         try:
#             parse_dorm_elec_wrap(d)
#         except:
#             print "Error with ", d

from gevent.pool import Pool
pool = Pool(5)
pool.map(parse_dorm_elec_wrap, dorms)
pool.join(timeout=timeout)
#
# print dorms
# parse_dorm_elec('10-1220')