Python Remove Duplicate Elements From Xml Tree

I have an XML structure in which some elements are not unique. I managed to sort the subtrees, and I can correctly identify the elements that occur more than once. But the rem

Solution 1:

I don't know how you've defined elements_equal, but (shamelessly adapted from Testing Equivalence of xml.etree.ElementTree) this works for me:

EDIT: The code now collects the elements to be removed in a list while iterating over each page and removes them afterwards, rather than removing them inside the same loop.

EDIT: Noticed a small typo in the comparison of the element tags and corrected it.

import xml.etree.ElementTree as ET

path = 'in.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")

Gives:

$ python undupe.py
found duplicate: blabla blub not unique
found duplicate: 2nd blabla blub not unique
$ cat out.xml
<root><page><text>blabla blub unique</text><text>blabla blub not unique</text><text>blabla blub again unique</text></page><page><text>2nd blabla blub unique</text><text>2nd blabla blub not unique</text><text>2nd blabla blub again unique</text></page></root>
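
To illustrate why the first EDIT moves the removal out of the iteration, here is a minimal sketch comparing the two approaches. The sample XML and both function names are made up for this illustration, and the duplicate check is reduced to comparing .text as a stand-in for elements_equal. With three identical siblings in a row, removing inside the loop shifts the remaining children, so the iteration skips one and a duplicate survives.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the original question.
XML = "<page><text>a</text><text>a</text><text>a</text><text>b</text></page>"


def remove_in_loop(page):
    """Remove adjacent duplicates while iterating directly over page."""
    prev = None
    for elem in page:
        if prev is not None and elem.text == prev.text:
            # Removing here shifts the later children one position left,
            # so the element that follows the removed one is never visited.
            page.remove(elem)
            continue
        prev = elem
    return [e.text for e in page]


def remove_after_loop(page):
    """Collect adjacent duplicates first, then remove them in a second pass."""
    prev = None
    to_remove = []
    for elem in page:
        if prev is not None and elem.text == prev.text:
            to_remove.append(elem)
            continue
        prev = elem
    for elem in to_remove:
        page.remove(elem)
    return [e.text for e in page]


print(remove_in_loop(ET.fromstring(XML)))     # ['a', 'a', 'b'] - one duplicate left
print(remove_after_loop(ET.fromstring(XML)))  # ['a', 'b']
```

Iterating over a copy, for example for elem in list(page), would be another way to avoid the skipped element; the two-pass version simply mirrors the structure used in the answer.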
