Python: How Do I Parse A Stream Of Json Arrays With Ijson Library
Solution 1:
Here's a first cut at the problem that at least has a working regex substitution to turn a full string into valid json. It only works if you're ok with reading the full input stream before parsing as json.
import re
input = ''for line in inputStream:
input = input + line
# input == '[{"foo": "bar"}][{"bar": "baz"}][{"baz": "foo"}]'# wrap in [] and put commas between each ][
sanitizedInput = re.sub(r"\]\[", "],[", "[%s]" % input)
# sanitizedInput == '[[{"foo": "bar"}],[{"bar": "baz"}],[{"baz": "foo"}]]'# then parse sanitizedInput
parsed = json.loads(sanitizedInput)
print parsed #=> [[{u'foo': u'bar'}], [{u'bar': u'baz'}], [{u'baz': u'foo'}]]
Note: since you're read the whole thing as a string, you can use json
instead of ijson
Solution 2:
You can use json.JSONDecoder.raw_decode to walk through the string. Its documentation indeed says:
This can be used to decode a JSON document from a string that may have extraneous data at the end.
The following code sample assumes all the JSON values are in one big string:
def json_elements(string):
while True:
try:
(element, position) = json.JSONDecoder.raw_decode(string)
yield element
string = string[position:]
except ValueError:
break
To avoid dealing with raw_decode
yourself and to be able to parse a stream chunk by chunk, I would recommend a library I made for this exact purpose: streamcat.
defjson_elements(stream)
decoder = json.JSONDecoder()
yieldfrom streamcat.stream_to_iterator(stream, decoder)
This works for any concatenation of JSON values regardless of how many white-space characters are used within them or between them.
If you have control over how your input stream is encoded, you may want to consider using line-delimited JSON, which makes parsing easier.
Post a Comment for "Python: How Do I Parse A Stream Of Json Arrays With Ijson Library"