Issue With Html Tags While Scraping Data Using Beautiful Soup
Common piece of code:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('
Solution 1:
The page uses a large JavaScript structure to load the prices. You can load just that structure:
scripts = soup.find_all('script')
script = next(s.text for s in scripts if s.string and 'window.rates' in s.string)
datastring = script.split('phones=')[1].split(';window.')[0]
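The chained split raises a bare IndexError if the page changes and the 'phones=' marker disappears. A minimal sketch of the same slicing step with an explicit error, written for Python 3; the helper name extract_phones_literal and the sample script text are made up for illustration and are not taken from the real page:

```python
def extract_phones_literal(script_text):
    """Return the text between 'phones=' and the next ';window.' marker."""
    start_marker = 'phones='
    end_marker = ';window.'
    # partition() returns an empty separator when the marker is absent.
    _, found, rest = script_text.partition(start_marker)
    if not found:
        raise ValueError('marker %r not found in script' % start_marker)
    # Keep everything up to the next assignment (or the end of the text).
    literal, _, _ = rest.partition(end_marker)
    return literal


# Hypothetical, shortened stand-in for the real script contents.
sample_script = 'window.rates={a:1};window.phones={sku1:{name:"Demo"}};window.other=1;'
datastring = extract_phones_literal(sample_script)
print(datastring)
```

In the scraper itself, the script text found by the find_all step would be passed in instead of sample_script.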
This results in a large JavaScript structure, starting with:
{sku844082:{name:"Samsung Galaxy SII",image:"/images/m677391_300468.jpg",deliveryTime:"Vorauss. verfügbar ab Anfang Januar",sku1444291:{p:"prod954312",e:"19.90"},sku1444286:{p:"prod954312",e:"19.90"},sku1444283:{p:"prod954312",e:"39.90"},sku1444275:{p:"prod954312",e:"59.90"},sku1104261:{p:"prod954312",e:"99.90"}},sku894279:{name:"BlackBerry Torch 9810",image:"/images/m727477_300464.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod1004495",e:"179.90"},sku1104261:{p:"prod1004495",e:"259.90"},sku1444291:{p:"prod1004495",e:"29.90"},sku1444286:{p:"prod1004495",e:"29.90"},sku1444283:{p:"prod1004495",e:"49.90"}},sku864221:{name:"BlackBerry Bold 9900",image:"/images/m707491_300465.jpg",deliveryTime:"Lieferbar innerhalb 48 Stunden",sku1444275:{p:"prod974431",e:"129.90"},sku1104261:{p:"prod974431",e:"169.90"},sku1444291:{p:"prod974431",e:"49.90"},sku1444286:{p:"prod974431",e:"49.90"},sku1444283:{p:"prod974431",e:"89.90"}}
Unfortunately, that's not directly loadable with the json module; although it is valid JavaScript, the keys are unquoted, so it is not valid JSON. You'd need to use regular expressions to clean it up further, or grab the e:"someprice" information directly from that string.
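If only the product codes and prices are needed, the nested entries can be pulled straight out of the string without parsing it. A rough sketch in Python 3, assuming every nested entry has exactly the sku…:{p:"…",e:"…"} shape seen in the excerpt above; the sample string is a shortened copy of that excerpt:

```python
import re

# Shortened excerpt of the structure shown above.
datastring = ('{sku864221:{name:"BlackBerry Bold 9900",'
              'sku1444275:{p:"prod974431",e:"129.90"},'
              'sku1104261:{p:"prod974431",e:"169.90"}}}')

# Each nested entry looks like sku<digits>:{p:"<product>",e:"<price>"}.
entry = re.compile(r'(sku\d+):\{p:"([^"]*)",e:"([^"]*)"\}')

prices = [(sku, product, float(price))
          for sku, product, price in entry.findall(datastring)]
print(prices)
```

Note that this flat list no longer records which phone each nested SKU belongs to, which is one reason to prefer parsing the whole structure.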
Luckily the structure can be fixed with a small amount of regular expression magic:
import re
import json
datastring = re.sub(ur'([{,])([a-z]\w*):', ur'\1"\2":', datastring)
data = json.loads(datastring)
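The ur'' prefix above is Python 2 only; Python 3 rejects it as a syntax error, and a plain r'' prefix does the same job there. A small Python 3 sketch of the key-quoting step, run on a shortened copy of the excerpt above:

```python
import json
import re

# Shortened excerpt of the structure shown above.
datastring = ('{sku864221:{name:"BlackBerry Bold 9900",'
              'deliveryTime:"Lieferbar innerhalb 48 Stunden",'
              'sku1444275:{p:"prod974431",e:"129.90"}}}')

# Wrap every bare key that follows '{' or ',' in double quotes.
quoted = re.sub(r'([{,])([a-z]\w*):', r'\1"\2":', datastring)
data = json.loads(quoted)

print(data['sku864221']['sku1444275']['e'])
```

The pattern does not know about string boundaries, so a value containing a comma followed by a lowercase word and a colon would be rewritten as well and break the JSON. None of the values in the excerpt have that shape, but it is worth checking against the full page.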
This gives you a large dictionary keyed by SKU. Each value is a dictionary of phone details plus nested SKU dictionaries holding p product codes and e prices:
>>> from pprint import pprint
>>> pprint(data['sku864221'])
{u'deliveryTime': u'Lieferbar innerhalb 48 Stunden',
 u'image': u'/images/m707491_300465.jpg',
 u'name': u'BlackBerry Bold 9900',
 u'sku1104261': {u'e': u'169.90', u'p': u'prod974431'},
 u'sku1444275': {u'e': u'129.90', u'p': u'prod974431'},
 u'sku1444283': {u'e': u'89.90', u'p': u'prod974431'},
 u'sku1444286': {u'e': u'49.90', u'p': u'prod974431'},
 u'sku1444291': {u'e': u'49.90', u'p': u'prod974431'}}
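Since the question's code imports csv, one plausible next step is flattening the dictionary into one row per phone and nested SKU. A sketch in Python 3, assuming every nested dict carries both p and e as in the output above; the function name iter_price_rows and the column names are invented for illustration:

```python
import csv
import io

# Trimmed copy of the parsed structure shown above.
data = {
    'sku864221': {
        'name': 'BlackBerry Bold 9900',
        'image': '/images/m707491_300465.jpg',
        'deliveryTime': 'Lieferbar innerhalb 48 Stunden',
        'sku1104261': {'e': '169.90', 'p': 'prod974431'},
        'sku1444275': {'e': '129.90', 'p': 'prod974431'},
    },
}


def iter_price_rows(data):
    """Yield [phone_sku, name, nested_sku, product, price] rows."""
    for phone_sku, phone in sorted(data.items()):
        for key, value in sorted(phone.items()):
            # Nested SKUs are the only dict-valued entries; the rest are strings.
            if isinstance(value, dict):
                yield [phone_sku, phone['name'], key, value['p'], value['e']]


rows = list(iter_price_rows(data))

# Written to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['phone_sku', 'name', 'nested_sku', 'product', 'price'])
writer.writerows(rows)
print(buffer.getvalue())
```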