A small crawler for 1,750 commercial games
shouhuanxiaoji
deepin
2018-05-16 01:56
Author
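This post was last edited by shouhuanxiaoji on 2018-5-16 13:11
After a few days of learning python3 I got the itch to write a crawler. It scrapes the Linux games listed on rutracker together with their download links.
There are 1,750 games in total, which fulfills a wish I have had for several years.
I will keep improving it and add games from other sites; I expect it could reach more than ten thousand titles. When I have time I will slowly test them and package them as AppImage.
If you are strict about copyright, please skip this post.
The code is rough; suggestions are welcome.
First run: sudo pip3 install pyquery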
#!/usr/bin/env python3
# Crawl the rutracker Linux games subforum and record each game's name, size and magnet link.
from pyquery import PyQuery as pq
import urllib.parse
import urllib.request
import os
import time

URLFIRST = "https://rutracker.org/forum/viewforum.php?f=1992"
currentpath = os.path.abspath('.')
listpagepath = os.path.join(currentpath, 'list')
outputpath = os.path.join(currentpath, 'rutracker-output')
outputfile = os.path.join(outputpath, 'output.txt')

if not os.path.exists(listpagepath):
    os.mkdir(listpagepath)


def CachePage(IndexPageNum=1):
    # The first page has no "start" parameter; the second page uses 50, the third 100, and so on.
    if IndexPageNum == 0:
        url = URLFIRST
    else:
        url = URLFIRST + "&" + urllib.parse.urlencode({"start": IndexPageNum * 50})
    # The forum serves Windows-1251; the cached copy is stored as UTF-8.
    with urllib.request.urlopen(url) as response:
        htmlcontent = response.read().decode("Windows-1251")
    cachedpage = os.path.join(listpagepath, str(IndexPageNum + 1) + ".html")
    with open(cachedpage, "w", encoding="utf-8") as controledfile:
        controledfile.write(htmlcontent)
    print("Page " + str(IndexPageNum + 1) + " has been fetched.")


def CacheMagnet():
    # The page count is read from the second-to-last pager link on the first page.
    htmldom = pq(URLFIRST)
    indexnum = int(htmldom('a.pg').eq(-2).text())
    if not os.path.exists(outputpath):
        os.mkdir(outputpath)
    if not os.path.exists(outputfile):
        open(outputfile, 'w', encoding='utf-8').close()
    # Download the listing pages again unless the cache already holds all of them.
    cached = [i for i in os.listdir(listpagepath) if os.path.isfile(os.path.join(listpagepath, i))]
    if len(cached) != indexnum:
        for i in range(0, indexnum):
            CachePage(i)
    for i in os.listdir(listpagepath):
        pagepath = os.path.join(listpagepath, i)
        if not os.path.isfile(pagepath):
            continue
        with open(pagepath, 'r', encoding='utf-8', errors='ignore') as htmlfile:
            htmldata = pq(htmlfile.read())
        gamelistnum = len(htmldata('a.tt-text'))
        for m in range(0, gamelistnum):
            # Open each topic page to read the game name and magnet link.
            innerhref = htmldata('a.tt-text').eq(m).attr('href')
            innerurl = 'https://rutracker.org/forum/' + innerhref
            innerdom = pq(url=innerurl)
            gamelist = {
                'size': htmldata('a.dl-stub').eq(m).text(),
                'name': innerdom('div.post_body').children('span').eq(0).text(),
                'magnet': innerdom('a.magnet-link').attr('href'),
            }
            # One record per line; the original called op.close without parentheses, so the file was never closed explicitly.
            with open(outputfile, 'a', encoding='utf-8') as op:
                op.write(str(gamelist) + "\n")
            time.sleep(2)  # pause between topic requests


CacheMagnet()
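The paging rule and the per-game record can be tried without touching the network. Below is a minimal sketch under stated assumptions: `page_url`, `append_record` and `read_records` are hypothetical helpers that are not in the script above, and JSON Lines is an assumed output format, swapped in for the `str(dict)` text the script writes, because it can be read back reliably.

```python
import json
import os
import tempfile
import urllib.parse

URLFIRST = "https://rutracker.org/forum/viewforum.php?f=1992"
PAGE_SIZE = 50  # topics per listing page, taken from the comment in CachePage


def page_url(index_page_num):
    """Return the listing URL for a zero-based page index (hypothetical helper)."""
    if index_page_num == 0:
        # The first page carries no "start" parameter.
        return URLFIRST
    query = urllib.parse.urlencode({"start": index_page_num * PAGE_SIZE})
    return URLFIRST + "&" + query


def append_record(path, record):
    """Append one game record as a JSON line (assumed format, not the original's)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read every non-empty JSON line back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "output.jsonl")
    # Placeholder values, not real crawl results.
    append_record(demo_path, {"name": "demo game", "size": "1 GB", "magnet": "magnet:?xt=urn:btih:0"})
    print(page_url(2))
    print(read_records(demo_path))
```

Keeping the URL construction in its own function also makes the offset rule easy to check before any page is downloaded.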
All Replies
shouhuanxiaoji
deepin
2018-05-16 01:57
#1
There should also be an SSL error. Search for it on Baidu: a library is missing, and installing it with pip fixes it.
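The reply does not name the missing library, so the following is not the fix it refers to. It is only a sketch, assuming the error comes from certificate verification in `urllib.request.urlopen`, of how an explicit SSL context from the standard library can be passed to that call. `make_context` and `fetch` are hypothetical names, and the context does not affect requests made through `pq(url=...)`.

```python
import ssl
import urllib.request


def make_context():
    """Build a verifying SSL context from the system's default CA certificates."""
    return ssl.create_default_context()


def fetch(url, context):
    """Fetch a URL with urllib using the given SSL context (hypothetical helper)."""
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()


if __name__ == "__main__":
    ctx = make_context()
    # Certificate and hostname checks stay enabled in the default context.
    print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

If the error persists with a default context, the traceback usually shows whether the problem is a missing CA bundle or a missing Python package.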