Fast And Efficient Way Of Serializing And Retrieving A Large Number Of Numpy Arrays From Hdf5 File
Solution 1:
According to Out[7], "img_feats" is a large 3D array of shape (113287, 36, 2048).
Define ds as the dataset (doesn't load anything):

ds = hf[group_key]
x = ds[0]      # should be a (36, 2048) array
arr = ds[:]    # should load the whole dataset into memory
arr = ds[:n]   # load a subset (slice)
According to h5py-reading-writing-data:
HDF5 datasets re-use the NumPy slicing syntax to read and write to the file. Slice specifications are translated directly to HDF5 “hyperslab” selections, and are a fast and efficient way to access data in the file.
I don't see any point in wrapping that in list(); that is, in splitting the 3D array into a list of 113287 2D arrays. There's a clean mapping between 3D datasets in the HDF5 file and numpy arrays.
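For example, a minimal sketch of that access pattern (the file name here is made up; hf and group_key in the question play the same roles):

import h5py

# Hypothetical file and dataset names, standing in for hf / group_key from the question.
with h5py.File("features.h5", "r") as hf:
    ds = hf["img_feats"]         # just a handle; nothing is read yet
    print(ds.shape, ds.dtype)    # e.g. (113287, 36, 2048)

    x = ds[1234]                 # reads one (36, 2048) sample from disk
    batch = ds[1000:1032]        # reads a contiguous (32, 36, 2048) slice
    # arr = ds[:]                # would load the entire 3D array into memory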
h5py-fancy-indexing warns that fancy indexing of a dataset is slower; that is, trying to load, say, just the subarrays at indices [1, 1000, 3000, 6000] of that large dataset.
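As a rough comparison (same hypothetical file as above): h5py turns an index list into a point selection, which is generally slower than contiguous slice reads.

import h5py
import numpy as np

idx = [1, 1000, 3000, 6000]      # non-contiguous indices; h5py requires them in increasing order
with h5py.File("features.h5", "r") as hf:
    ds = hf["img_feats"]
    picked = ds[idx]                           # point selection, shape (4, 36, 2048)
    picked2 = np.stack([ds[i] for i in idx])   # one read per index, for comparison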
You might want to experiment with writing and reading some smaller datasets if working with this large one is too confusing.
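A small throwaway file for such experiments might look like this (shapes and names are arbitrary):

import h5py
import numpy as np

small = np.random.rand(100, 36, 2048).astype("float32")   # small stand-in for the real data

with h5py.File("test_small.h5", "w") as hf:
    hf.create_dataset("img_feats", data=small)

with h5py.File("test_small.h5", "r") as hf:
    ds = hf["img_feats"]
    assert ds[5].shape == (36, 2048)             # single sample
    assert ds[10:20].shape == (10, 36, 2048)     # contiguous slice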
Solution 2:
One way would be to put each sample into its own group and index directly into those. I am thinking the conversion takes so long because it tries to load the entire data set into a list (which it has to read from disk). Re-organizing the h5 file so that it looks like

- group
  - sample
    - 36 x 2048
  - sample

may help indexing speed; see the sketch below.
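A rough sketch of that layout (file and group names are made up, and only 100 samples are written here; whether it actually indexes faster than slicing one big dataset is worth benchmarking):

import h5py
import numpy as np

with h5py.File("features_by_sample.h5", "w") as hf:
    grp = hf.create_group("img_feats")
    for i in range(100):                          # 113287 in the real data
        grp.create_dataset(str(i), data=np.random.rand(36, 2048).astype("float32"))

with h5py.File("features_by_sample.h5", "r") as hf:
    sample = hf["img_feats/42"][:]                # reads only this one (36, 2048) sample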