
Python: How Do I Parse A Stream Of Json Arrays With Ijson Library

The incoming data resembles the following: [{ 'foo': 'bar' }] [{ 'bar': 'baz' }] [{ 'baz': 'foo' }] As you can see, these are arrays of objects strung together. JSON-ish ijson is ab

Solution 1:

Here's a first cut at the problem that at least has a working regex substitution to turn the full string into valid JSON. It only works if you're OK with reading the full input stream before parsing it as JSON.

import json
import re

# Read the whole stream into one string
# (inputStream is assumed to be an iterable of text lines, e.g. an open file)
data = ''.join(inputStream)
# data == '[{"foo": "bar"}][{"bar": "baz"}][{"baz": "foo"}]'

# Wrap in [] and put commas between each ][
sanitizedInput = re.sub(r"\]\[", "],[", "[%s]" % data)
# sanitizedInput == '[[{"foo": "bar"}],[{"bar": "baz"}],[{"baz": "foo"}]]'

# Then parse sanitizedInput
parsed = json.loads(sanitizedInput)
print(parsed)  #=> [[{'foo': 'bar'}], [{'bar': 'baz'}], [{'baz': 'foo'}]]

Note: since you've read the whole thing into a string, you can use json instead of ijson.
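
The sample data in the question has spaces between the arrays, which a pattern matching only a literal "][" would miss. The following is a minimal sketch of the same wrap-and-substitute idea that tolerates whitespace between arrays. The function name parse_concatenated_arrays is made up for illustration, and the approach assumes that no string value in the data contains a "]" followed by a "[", since the substitution would rewrite it.

```python
import json
import re


def parse_concatenated_arrays(text):
    """Wrap concatenated JSON arrays in one outer array and parse the result."""
    # Allow optional whitespace (spaces, newlines) between "]" and "[".
    # Caveat: a "][" sequence inside a JSON string value would be rewritten too.
    sanitized = re.sub(r"\]\s*\[", "],[", "[%s]" % text)
    return json.loads(sanitized)


text = '[{"foo": "bar"}] [{"bar": "baz"}]\n[{"baz": "foo"}]'
print(parse_concatenated_arrays(text))
# [[{'foo': 'bar'}], [{'bar': 'baz'}], [{'baz': 'foo'}]]
```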

Solution 2:

You can use json.JSONDecoder.raw_decode to walk through the string. Its documentation indeed says:

This can be used to decode a JSON document from a string that may have extraneous data at the end.

The following code sample assumes all the JSON values are in one big string:

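import json

def json_elements(string):
    # raw_decode is an instance method, so create a decoder first
    decoder = json.JSONDecoder()
    string = string.lstrip()
    while string:
        try:
            element, position = decoder.raw_decode(string)
        except ValueError:
            break  # stop at the first fragment that is not valid JSON
        yield element
        # raw_decode does not skip leading whitespace, so strip it here
        string = string[position:].lstrip()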

To avoid dealing with raw_decode yourself and to be able to parse a stream chunk by chunk, I would recommend a library I made for this exact purpose: streamcat.

import json
import streamcat

def json_elements(stream):
    decoder = json.JSONDecoder()
    yield from streamcat.stream_to_iterator(stream, decoder)

This works for any concatenation of JSON values regardless of how many white-space characters are used within them or between them.
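
If adding a dependency is not an option, the chunk-by-chunk idea can also be approximated with the standard library alone. The sketch below is not how streamcat is implemented; it is one possible approach that keeps a text buffer and retries raw_decode as more data arrives. The name iter_json_values and the chunk_size parameter are made up for illustration. It assumes a text-mode stream whose top-level values are arrays or objects, as in the question; a bare top-level number split across two chunks could be decoded too early.

```python
import io
import json


def iter_json_values(stream, chunk_size=4096):
    """Yield each top-level JSON value from a text stream of concatenated values."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        # Decode as many complete values as the buffer currently holds.
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except ValueError:
                # Probably an incomplete value: wait for the next chunk.
                break
            yield value
            buffer = buffer[end:]
        if not chunk:
            # End of stream: anything left over could not be parsed.
            if buffer:
                raise ValueError("trailing data is not valid JSON: %r" % buffer[:40])
            return


stream = io.StringIO('[{"foo": "bar"}] [{"bar": "baz"}]\n[{"baz": "foo"}]')
for value in iter_json_values(stream, chunk_size=8):
    print(value)
```

Re-parsing the buffer after every chunk is simple but wasteful for very large values, which is one reason to prefer a purpose-built streaming parser when a single value can be huge.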

If you have control over how your input stream is encoded, you may want to consider using line-delimited JSON, which makes parsing easier.
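
To illustrate why line-delimited JSON is easier, here is a minimal sketch that assumes the producer writes exactly one JSON value per line. The function name iter_json_lines is made up for illustration. Each line can then be handed to json.loads directly, with no need to find value boundaries.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one parsed JSON value per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


stream = io.StringIO('[{"foo": "bar"}]\n[{"bar": "baz"}]\n\n[{"baz": "foo"}]\n')
for value in iter_json_lines(stream):
    print(value)
```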
