Why Doesn't getparent() Work as Expected?
I need to manipulate the text inside one of the tags, and I want to get the parent tag of every text node I find. Code: import lxml.etree import pprint s = '''
Solution 1:
What you are seeing has to do with the tail property (text immediately following an end tag), which is a peculiarity of how ElementTree and lxml represent XML. For a tail text node, getparent() returns the element that the text follows, not the element that encloses it. By adding an is_tail test (True if the text is "tail text") to your code, you can see what's happening:
import lxml.etree
import pprint
s = '''
<data>
data text
<foo>foo - <bar>bar</bar> text</foo>
data text
<bar>
bar text
<baz>baz text</baz>
<baz>baz text</baz>
bar text
</bar>
data text
</data>
'''
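# Parse the document; fromstring() returns the root element.
etree = lxml.etree.fromstring(s)
# Select every text node that is not whitespace-only.
text = etree.xpath("//text()[normalize-space()]")
# is_tail is True when the text is stored as the tail of the element returned by getparent().
pprint.pprint([(t.getparent().tag, t.is_tail, t.strip()) for t in text])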
Output:
[('data', False, 'data text'),
('foo', False, 'foo -'),
('bar', False, 'bar'),
('bar', True, 'text'),
('foo', True, 'data text'),
('bar', False, 'bar text'),
('baz', False, 'baz text'),
('baz', False, 'baz text'),
('baz', True, 'bar text'),
('bar', True, 'data text')]
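To see where the tail text is actually stored, it can help to inspect the .text and .tail attributes directly. This is a minimal sketch on a shortened document (only the foo fragment of the question's XML), not part of the original answer:

```python
import lxml.etree

# Shortened document for illustration: just the <foo> fragment from the question.
root = lxml.etree.fromstring("<foo>foo - <bar>bar</bar> text</foo>")
bar = root[0]

# Text before the first child is stored on the parent as .text.
print(repr(root.text))   # 'foo - '
# Text inside <bar> is stored on <bar> as .text.
print(repr(bar.text))    # 'bar'
# Text after </bar> is stored on <bar> as .tail, even though it sits inside <foo>.
print(repr(bar.tail))    # ' text'
# Nothing follows the root element, so its .tail is None.
print(repr(root.tail))   # None
```

This is why the ' text' node reports 'bar' from getparent(): the string hangs off bar as its tail rather than off foo.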
Solution 2:
As far as I can see, this is due to the "tail" concept in lxml (see: 2. How ElementTree represents XML). When the content of an element mixes element nodes and text nodes, a text node is represented as a child of the parent element only if it comes first; otherwise it is represented as the 'tail' of the preceding element.
You can call getparent() twice to get the actual parent in the case of a 'tail' text node (is_tail=True), for example:
pprint.pprint(
    [(s.getparent().getparent().tag if s.is_tail else s.getparent().tag,
      s.strip())
     for s in text]
)
Output:
[('data', 'data text'),
('foo', 'foo -'),
('bar', 'bar'),
('foo', 'text'),
('data', 'data text'),
('bar', 'bar text'),
('baz', 'baz text'),
('baz', 'baz text'),
('bar', 'bar text'),
('data', 'data text')]
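Since the original goal was to manipulate the text, the same is_tail check can also decide where to write a new value back. The sketch below is one possible approach rather than an established recipe; containing_element and replace_text are hypothetical helper names, not lxml functions, and the document is again shortened to the foo fragment:

```python
import lxml.etree


def containing_element(text_node):
    """Return the element that encloses an XPath text() result.

    Hypothetical helper, not part of lxml.
    """
    parent = text_node.getparent()
    # For tail text, getparent() is the preceding sibling, so step up once more.
    return parent.getparent() if text_node.is_tail else parent


def replace_text(text_node, new_value):
    """Write new_value to wherever the text is stored (.text or .tail).

    Hypothetical helper, not part of lxml.
    """
    owner = text_node.getparent()
    if text_node.is_tail:
        owner.tail = new_value
    else:
        owner.text = new_value


root = lxml.etree.fromstring("<foo>foo - <bar>bar</bar> text</foo>")

# Upper-case only the text that belongs directly to <foo>.
for node in root.xpath("//text()[normalize-space()]"):
    if containing_element(node).tag == "foo":
        replace_text(node, node.upper())

print(lxml.etree.tostring(root).decode())
# <foo>FOO - <bar>bar</bar> TEXT</foo>
```

The text nodes returned by xpath() are string copies, so changing them has no effect on the tree; the new value has to be assigned to the owning element's .text or .tail.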