A small crawler for 1,750 commercial games
shouhuanxiaoji
deepin
2018-05-16 01:56
Author
Last edited by shouhuanxiaoji on 2018-5-16 13:11

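After a few days of learning Python 3 I got the itch and wrote a small crawler. It scrapes the Linux games on rutracker together with their download links.
There are 1,750 titles in total, which fulfills a wish I have had for several years.
I will keep improving it and add games from other sites; I expect it could eventually cover more than ten thousand titles. When I have time I will slowly test them and package them as AppImage.
If you are strict about copyright, please skip this post.
The code is fairly rough; suggestions are welcome.
Install pyquery first: sudo pip3 install pyquery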
    #!/usr/bin/env python3
    from pyquery import PyQuery as pq
    import urllib.parse
    import urllib.request
    import os
    import time

    URLFIRST = "https://rutracker.org/forum/viewforum.php?f=1992"
    currentpath = os.path.abspath('.')
    listpagepath = os.path.join(currentpath, 'list')
    outputpath = os.path.join(currentpath, 'rutracker-output')
    outputfile = os.path.join(outputpath, 'output.txt')
    gamelist = {}
    # Make sure the cache directory for the list pages exists.
    os.makedirs(listpagepath, exist_ok=True)


    def CachePage(IndexPageNum=1):
        # IndexPageNum is zero-based. The first page takes no "start" parameter;
        # the second page uses start=50, the third start=100, and so on.
        if IndexPageNum == 0:
            url = URLFIRST
        else:
            url = URLFIRST + "&" + urllib.parse.urlencode({"start": IndexPageNum * 50})
        # The forum serves its pages as Windows-1251; cache them as UTF-8.
        with urllib.request.urlopen(url) as response:
            htmlcontent = response.read().decode("Windows-1251")
        cachefile = os.path.join(listpagepath, str(IndexPageNum + 1) + ".html")
        with open(cachefile, "w", encoding="utf-8") as controledfile:
            controledfile.write(htmlcontent)
        print("Page " + str(IndexPageNum + 1) + " fetched.")


    def CacheMagnet():
        # The second-to-last pager link holds the number of the last list page.
        htmldom = pq(url=URLFIRST)
        indexnum = int(htmldom('a.pg').eq(-2).text())
        os.makedirs(outputpath, exist_ok=True)
        # Re-download the list pages unless every one of them is already cached.
        cached = [i for i in os.listdir(listpagepath)
                  if os.path.isfile(os.path.join(listpagepath, i))]
        if len(cached) != indexnum:
            for i in range(0, indexnum):
                CachePage(i)
        for i in os.listdir(listpagepath):
            pagepath = os.path.join(listpagepath, i)
            if not os.path.isfile(pagepath):
                continue
            with open(pagepath, 'r', encoding='utf-8', errors='ignore') as htmlfile:
                htmldata = pq(htmlfile.read())
            gamelistnum = len(htmldata('a.tt-text'))
            for m in range(0, gamelistnum):
                # Open each topic page to read the game name and the magnet link.
                innerhref = htmldata('a.tt-text').eq(m).attr('href')
                innerurl = 'https://rutracker.org/forum/' + innerhref
                innerdom = pq(url=innerurl)
                gamelist['size'] = htmldata('a.dl-stub').eq(m).text()
                gamelist['name'] = innerdom('div.post_body').children('span').eq(0).text()
                gamelist['magnet'] = innerdom('a.magnet-link').attr('href')
                # Append one record per line; "with" closes the file every time
                # (the original wrote op.close without parentheses, so it never ran).
                with open(outputfile, 'a', encoding='utf-8') as op:
                    op.write(str(gamelist) + "\n")
                time.sleep(2)  # pause between requests to go easy on the site


    CacheMagnet()
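The paging rule in CachePage (no parameter for the first page, then start=50, start=100 and so on) can be pulled out into a small pure function that is easy to check without touching the network. This is only a sketch: build_page_url and TOPICS_PER_PAGE are hypothetical names that do not appear in the script, and the step of 50 is taken from the comment in CachePage, not verified against the site.

```python
from urllib.parse import urlencode

# Copied from the script in this post.
URLFIRST = "https://rutracker.org/forum/viewforum.php?f=1992"
# Step of the "start" parameter, as described by the comment in CachePage.
TOPICS_PER_PAGE = 50


def build_page_url(page_index):
    """Return the list-page URL for a zero-based page index (hypothetical helper)."""
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    if page_index == 0:
        # The first page is the bare forum URL, with no "start" parameter.
        return URLFIRST
    return URLFIRST + "&" + urlencode({"start": page_index * TOPICS_PER_PAGE})


if __name__ == "__main__":
    # Show the URLs of the first three list pages.
    for index in range(3):
        print(index + 1, build_page_url(index))
```

Keeping the URL logic separate from the download means the off-by-one between the page index and the file name (page 0 is saved as 1.html) lives in one place.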
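The script stores each result with str(gamelist), which is a Python dict repr and awkward to read back later. One possible alternative, not what the post does, is to write one JSON object per line. In the sketch below append_record, load_records and demo are hypothetical helpers, and the sample records are placeholders, not real scraped data.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Cyrillic or Chinese titles readable in the file.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read every non-empty JSON line back into a list of dicts (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def demo():
    """Write two placeholder records to a temporary file and read them back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "output.jsonl")
        append_record(path, {"name": "Example Game", "size": "1.2 GB",
                             "magnet": "magnet:?xt=placeholder"})
        # A missing magnet link is stored as null and comes back as None.
        append_record(path, {"name": "Пример", "size": "300 MB", "magnet": None})
        return load_records(path)


if __name__ == "__main__":
    for row in demo():
        print(row["name"], row["size"])
```

A line-per-record file can still be appended to after every topic, so an interrupted run keeps what it has already collected.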
All Replies
shouhuanxiaoji
deepin
2018-05-16 01:57
#1
You will probably also hit an SSL error. Search for it on Baidu: a library is missing, and installing it with pip fixes it.
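The reply does not say which error or which library is meant. If the error is a certificate verification failure raised by urllib.request.urlopen, one common remedy is to pass an explicit TLS context, optionally backed by the CA bundle from the third-party certifi package. Both the diagnosis and the use of certifi are assumptions here, and make_ssl_context and fetch are hypothetical helpers. This only affects calls made through urlopen, not the pages that pyquery downloads itself.

```python
import ssl
import urllib.request


def make_ssl_context():
    """Build a verifying TLS context, preferring certifi's CA bundle if installed."""
    try:
        import certifi  # third-party; an assumption, not named in the thread
    except ImportError:
        # Fall back to the system's default certificate store.
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


def fetch(url, timeout=30):
    """Fetch a URL as bytes using the context above (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout,
                                context=make_ssl_context()) as response:
        return response.read()
```

Certificate checking stays enabled in this sketch; turning verification off would also silence the error, but at the cost of trusting any server.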