Fetching Data Using Python & Lxml
I have an HTML document. I would like to get the text inside each <span class="zzAggregateRatingStat"> element. For my example input, I would get 3 and 5.
Solution 1:
The following code works with your input:
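import lxml.html

# Parse the file with lxml's lenient HTML parser and take the root element.
root = lxml.html.parse('text.html').getroot()
# Select every span whose class attribute is exactly "zzAggregateRatingStat".
for span in root.xpath('//span[@class="zzAggregateRatingStat"]'):
    print(span.text)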
it prints:
3
5
I prefer using lxml's xpath over CSSSelectors, though they can both do the job.
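One difference worth keeping in mind when choosing between the two: the XPath test @class="..." compares the whole attribute value, while a CSS class selector matches one class among several. The sketch below uses made-up markup (the extra class "big" is an assumption for illustration, not part of the question's input) to show an XPath expression that matches the class as a single token:

```python
import lxml.html

# Hypothetical markup: the second span carries an extra class ("big").
html = ('<div><span class="zzAggregateRatingStat">3</span>'
        '<span class="zzAggregateRatingStat big">5</span></div>')
root = lxml.html.fromstring(html)

# Exact comparison: only matches when the attribute is exactly this string.
exact = root.xpath('//span[@class="zzAggregateRatingStat"]')

# Token comparison: pad the normalized class list with spaces and look for
# the class name surrounded by spaces, so extra classes do not matter.
token = root.xpath(
    '//span[contains(concat(" ", normalize-space(@class), " "), '
    '" zzAggregateRatingStat ")]')

exact_texts = [s.text for s in exact]
token_texts = [s.text for s in token]
print(exact_texts)  # ['3']
print(token_texts)  # ['3', '5']
```

If the real page only ever uses the single class, the shorter exact comparison is enough.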
ChrisP's example prints 3, but if you run it on your actual input you get errors:
$ python chrisp.py
Traceback (most recent call last):
File "chrisp.py", line 6, in <module>
doc = fromstring(text)
File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 3, column 210
ChrisP's code can be changed to use lxml.html.fromstring, which is a more lenient parser, instead of lxml.etree.fromstring. If this change is made it prints 3.
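As a rough illustration of the difference between the two parsers, the sketch below feeds both of them the same string. The input is hypothetical (the original page is not shown here); it contains a bare "&" inside an href, which is a common way real-world HTML triggers the "EntityRef: expecting ';'" error in the strict XML parser:

```python
from lxml import etree
import lxml.html

# Hypothetical input: the unescaped "&" in the href is not well-formed XML.
text = ('<div><a href="/r?a=1&b=2">reviews</a>'
        '<span class="zzAggregateRatingStat">3</span>'
        '<span class="zzAggregateRatingStat">5</span></div>')

# The strict XML parser rejects the bare ampersand.
try:
    etree.fromstring(text)
    strict_ok = True
except etree.XMLSyntaxError as exc:
    strict_ok = False
    print('strict parser failed:', exc)

# The HTML parser recovers from the same input.
doc = lxml.html.fromstring(text)
ratings = [span.text
           for span in doc.xpath('//span[@class="zzAggregateRatingStat"]')]
print(ratings)  # ['3', '5']
```

The same XPath from Solution 1 works on the tree returned by lxml.html.fromstring, so only the parsing call needs to change.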
Solution 2:
This is clearly documented on the lxml website:
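from lxml.etree import fromstring
from lxml.cssselect import CSSSelector

# Compile a CSS selector that matches elements with the given class.
sel = CSSSelector('.zzAggregateRatingStat')
text = '<span><span class="zzAggregateRatingStat">3</span></span>'
doc = fromstring(text)
# Calling the selector on a tree returns the list of matching elements.
el = sel(doc)[0]
print(el.text)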