Python Crawler Course: Lessons Learned as a Beginner


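I have just started learning Python web crawling. Mr. Lin's lectures are very clear, so I tried the homework for lesson 3 and ran into a few pitfalls along the way, which I am sharing here. P.S. At first I could not work out where the problem was. Thanks to Mr. Lin for the one-on-one help!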



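Problem 1. Error when fetching the product list
goods = browser.find_element_by_css_selector('body > div > div.sks-clear-container.big-box > div.left-container > div.florid-market-goods-container > div > div.florid-design-goods-list > div:nth-child(1)')
Causes:
1. With "Copy selector" in the browser, the last item of the copied selector pins it to one specific product. Delete the trailing ':nth-child(1)', but keep the 'div'.
2. The goal is to collect every product on the page, so find_element_by_css_selector has to become find_elements_by_css_selector. Pay attention to the 's' in "elements": the singular form returns only the first match, the plural form returns a list.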
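The selector fix in cause 1 can be sketched as a small helper. The name generalize_selector is hypothetical and not part of the homework; it only assumes that the copied selector ends in ':nth-child(n)', and it leaves any ':nth-child' earlier in the selector alone.

```python
import re

# Hypothetical helper (not from the original homework): removes a trailing
# ":nth-child(n)" so the selector matches every sibling, not just one.
_TRAILING_NTH_CHILD = re.compile(r':nth-child\(\d+\)\s*$')


def generalize_selector(selector):
    """Return `selector` without a trailing :nth-child(n) on its last item."""
    return _TRAILING_NTH_CHILD.sub('', selector.strip())


copied = ('body > div > div.sks-clear-container.big-box > div.left-container > '
          'div.florid-market-goods-container > div > '
          'div.florid-design-goods-list > div:nth-child(1)')

# The result still ends in "div", which is what find_elements needs.
print(generalize_selector(copied))
```

The result is the selector string to pass to the plural find_elements call described in cause 2.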
Problem 2. Use get_attribute('href') to read an element's href.
For example, to get the shop link:
shopURL = good.find_element_by_css_selector('div.design-goods-shops-contianer > div.float-contianer > a').get_attribute('href')
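One thing to be aware of when reusing this line today: the find_element_by_* helpers were removed in Selenium 4.3, where the same lookup is written as find_element(By.CSS_SELECTOR, ...). Below is a minimal sketch of the href lookup in that style. The helper name href_or_none is hypothetical; it uses find_elements so that a missing link gives None instead of raising an exception.

```python
try:
    from selenium.webdriver.common.by import By
    CSS = By.CSS_SELECTOR
except ImportError:
    # Fallback so the sketch runs without Selenium installed;
    # By.CSS_SELECTOR is this same string.
    CSS = "css selector"


def href_or_none(parent, selector):
    """Return the href of the first element under `parent` matching
    `selector`, or None when nothing matches.

    `parent` can be a WebDriver or a WebElement (for example one `good`).
    """
    matches = parent.find_elements(CSS, selector)
    if not matches:
        return None
    return matches[0].get_attribute("href")
```

With a product element good from the homework, the shop link would then be href_or_none(good, 'div.design-goods-shops-contianer > div.float-contianer > a').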
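Problem 3. 'ascii' codec can't encode characters in position 4-12: ordinal not in range(128)
Cause: the strings in the program are unicode objects, and they need to be converted to 'utf8'. Python 2 falls back to the ASCII codec whenever it converts unicode implicitly (for example in str(title)), which fails on Chinese characters. Writing '# _*_ encoding:utf-8 _*_' at the top of the file is not enough, because that line only declares the encoding of the source file. My fix was to add this at the top of the program: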
import sys
reload(sys)
sys.setdefaultencoding('utf8')
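Note that this fix is specific to Python 2: Python 3 has neither the builtin reload nor sys.setdefaultencoding, and the call changes the default for the whole process. As an alternative sketch (not what the homework does), the text can be encoded explicitly where it is needed. The sample string below is made up for illustration.

```python
# -*- coding: utf-8 -*-

# Made-up sample text standing in for a scraped product title.
title = u"No.1 女士大衣"

# Encoding with the ASCII codec reproduces the error from Problem 3.
ascii_error = None
try:
    title.encode("ascii")
except UnicodeEncodeError as err:
    ascii_error = err
    print("ascii failed: %s" % err.reason)

# Encoding explicitly as UTF-8 works and is reversible.
encoded = title.encode("utf-8")
print(encoded.decode("utf-8") == title)
```

Encoding at the point of output keeps the rest of the program working with unicode strings and avoids touching the interpreter-wide default.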

Finally, here is my complete code for the lesson 3 homework:

# coding:utf-8

import sys
reload(sys)
sys.setdefaultencoding('utf8')

from selenium import webdriver
import time
import re

browser = webdriver.Chrome()
browser.set_page_load_timeout(30)

# Load the search results page
browser.get('http://hz.17zwd.com/sks.aspx?so=%E5%A5%B3%E5%A3%AB%E5%A4%A7%E8%A1%A3')

# Get the number of result pages
page_info = browser.find_element_by_css_selector('#pages__pager_2 > div > span:nth-child(8)')

# page_info.text looks like "共10 页" ("10 pages in total")
m = re.findall(r'(\w*[0-9]+)\w*', page_info.text)
pages = int(m[0])
print 'Found %d pages of products' % pages

for i in range(pages):
    if i > 1:
        # Only crawl the first two pages
        break
    else:
        url = 'http://hz.17zwd.com/sks.aspx?so=%E5%A5%B3%E5%A3%AB%E5%A4%A7%E8%A1%A3&page=' + str(i + 1)
        browser.get(url)
        # Scroll to the bottom so that lazily loaded products appear
        browser.execute_script('window.scrollTo(0,document.body.scrollHeight);')
        time.sleep(2)

        # No trailing :nth-child(1), so this matches every product on the page
        goods = browser.find_elements_by_css_selector(
            'body > div > div.sks-clear-container.big-box > div.left-container > div.florid-market-goods-container > div > div.florid-design-goods-list > div')

        print ('Page %d has %d products' % (i + 1, len(goods)))
        for index, good in enumerate(goods):
            try:
                title = good.find_element_by_css_selector('div.design-goods-name-container > a').text
                price = good.find_element_by_css_selector(
                    'div.design-goods-price-and-collect-container > div.design-goods-price').text
                shop = good.find_element_by_css_selector('div.design-goods-shops-contianer > div.float-contianer > a').text
                shopURL = good.find_element_by_css_selector(
                    'div.design-goods-shops-contianer > div.float-contianer > a').get_attribute('href')
                goodURL = good.find_element_by_css_selector('div.design-goods-image-container > a').get_attribute('href')
                print ("No.%d: Name: %s\tPrice: %s\tProduct link: %s" % (index + 1, str(title), str(price), str(goodURL)))
                print ("Shop: %s\tShop link: %s" % (str(shop), str(shopURL)))
            except Exception as e:
                print 'Exception:', e

browser.quit()
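The page-count parsing and the URL building in the script can also be tried without a browser. The sketch below is written for Python 3 (the homework itself is Python 2). The helper names parse_page_count and build_page_urls and the limit parameter are hypothetical; the URL pattern and the keyword are taken from the script above.

```python
import re
from urllib.parse import quote

BASE_URL = 'http://hz.17zwd.com/sks.aspx'


def parse_page_count(pager_text):
    """Extract the first integer from pager text such as '共10 页'."""
    match = re.search(r'\d+', pager_text)
    if match is None:
        raise ValueError('no page count found in %r' % pager_text)
    return int(match.group())


def build_page_urls(keyword, pages, limit=None):
    """Build one search URL per page, optionally capped at `limit` pages."""
    if limit is not None:
        pages = min(pages, limit)
    query = quote(keyword)  # percent-encodes the UTF-8 bytes of the keyword
    return ['%s?so=%s&page=%d' % (BASE_URL, query, n)
            for n in range(1, pages + 1)]


pages = parse_page_count('共10 页')
for url in build_page_urls('女士大衣', pages, limit=2):
    print(url)
```

Passing limit=2 plays the same role as the "if i > 1: break" check in the loop, and quoting the keyword in code avoids pasting the percent-encoded string by hand.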

Big Snail



Great work, thumbs up!
