Converting A 1.2GB List Of Edges Into A Sparse Matrix
Solution 1:
Here's my solution:
import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix
Pandas does the heavy lifting of parsing using read_csv. And Pandas is already storing the data in columnar format. The data[0] and data[1] just get references, no copies. Then I feed those to coo_matrix. Benchmarked locally:
In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop
Then to save a csr-matrix to a file:
def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.

    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)
Benchmarked locally:
In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop
And later load it back from a file:
def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.

    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix
Benchmarked locally:
In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop
And finally test it all:
def test():
    "Test data file parsing and matrix serialization."
    coo_matrix = read_data_file_as_coo_matrix()
    csr_matrix = coo_matrix.tocsr()
    save_csr_matrix('edges.npz', csr_matrix)
    loaded_csr_matrix = load_csr_matrix('edges.npz')
    # Comparison based on http://stackoverflow.com/a/30685839/232571
    assert (csr_matrix != loaded_csr_matrix).nnz == 0

if __name__ == '__main__':
    test()
When running test(), it takes about 30 seconds:
$ time python so_38688062.py
real    0m30.401s
user    0m27.257s
sys     0m2.759s
And the memory high-water mark was ~1.79 GB.
Note that once you've converted the "edges.txt" to "edges.npz" in CSR-matrix format, loading it will take less than a second.
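If your SciPy is recent enough (save_npz and load_npz were added to scipy.sparse in version 0.19), the two helper functions above can be replaced by the built-ins; a minimal sketch:

import scipy.sparse as ss

# Built-in equivalents of save_csr_matrix / load_csr_matrix above.
csr_matrix = read_data_file_as_coo_matrix().tocsr()
ss.save_npz('edges.npz', csr_matrix)
csr_matrix = ss.load_npz('edges.npz')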
Solution 2:
In my answer I consider the case where the node ids are given by 9-character strings, each character from [0-9A-Za-z]. These n node ids should be mapped onto the values [0, n-1] (which might not be necessary for your application, but is still of general interest).
The next considerations, which I'm sure you are aware of, are here for the sake of completeness:
- Memory is the bottleneck.
- There are around 10^8 strings in the file.
- A 9-character string + int32 value pair costs around 120 bytes in a dictionary, resulting in about 12 GB of memory usage for the file.
- A string id from the file can be mapped onto an int64: there are 62 different characters, so each can be encoded with 6 bits, and 9 characters per string means 6*9 = 54 < 64 bits. See also the toInt64() method further below.
- There are int64 + int32 = 12 bytes of "real" information per pair, so ca. 1.2 GB could be enough; however, the cost of such a pair in a dictionary is around 60 bytes (around 6 GB of RAM needed).
- Creating many small objects (on the heap) results in a lot of memory overhead, so bundling these objects into arrays is advantageous. Interesting information about the memory used by Python objects can be found in this tutorial-style article. Interesting experiences with reducing memory usage are made public in this blog entry.
- A Python list is out of the question as a data structure, and so is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).
Step 1: reading the file and mapping the strings to int64. It is a pain to let an np.array grow dynamically, so we assume we know the number of edges in the file (it would be nice to have it in the header, but it can also be deduced from the file size; a sketch of that deduction follows the code below):
import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        nodes[cnt:cnt+2] = map(toInt64, line.split())  # use map(int, line.split()) for cases without letters
        cnt += 2
    return nodes
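For completeness, here is one way the number of edges could be deduced from the file size; this is only a sketch and assumes fixed-width lines of two 9-character ids separated by a single space and terminated by a single newline (20 bytes per line):

import os

def edge_count_from_file_size(filename, bytes_per_line=9 + 1 + 9 + 1):
    """Deduce EDGE_CNT for fixed-width lines like "XXXXXXXXX YYYYYYYYY\n" (20 bytes each)."""
    return os.path.getsize(filename) // bytes_per_line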
Step 2: converting the int64 values into values [0, n-1]:
Possibility A, needs 3*0.8GB:
def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)
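For illustration, np.unique with return_inverse=True returns the sorted unique values plus, for each original element, its index into that unique array, which is exactly the relabelling to [0, n-1]:

import numpy as np

ids = np.array([30, 10, 20, 10], dtype=np.int64)
unique_ids, mapped = np.unique(ids, return_inverse=True)
# unique_ids -> array([10, 20, 30])
# mapped     -> array([2, 0, 1, 0]); each value replaced by its rank among the unique ids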
Possibility B, needs 2*0.8GB, but is somewhat slower:
def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id = np.searchsorted(unique_map, nodes[i])  # faster than bisect.bisect
        nodes[i] = node_id
    return (len(unique_map), nodes)
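If the per-element Python loop turns out to be too slow, the same in-place replacement can be done block-wise with a vectorized searchsorted call; this is only a sketch (the block size is an arbitrary choice) that trades a small temporary array for a large speedup:

def maps_to_ids_blockwise(filename, EDGE_CNT, block=10**6):
    """ like possibility B, but replaces the ids block-wise with a vectorized searchsorted"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for start in xrange(0, len(nodes), block):
        # the temporary created by searchsorted holds at most `block` elements
        nodes[start:start+block] = np.searchsorted(unique_map, nodes[start:start+block])
    return (len(unique_map), nodes)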
Step 3: put it all into a coo_matrix:
from scipy import sparse

def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # it is only a view, not a copy
    cols = nodes[1::2]  # it is only a view, not a copy
    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))
Calling data_as_coo_matrix("data.txt", 62500000) peaks at about 2.5 GB of memory (but with int32 instead of int64 only 1.5 GB are needed). It took around 5 minutes on my machine, but my machine is pretty slow...
So what is different from your solution?
- I get only the unique values from np.unique (and not all the indices and the inverse), so there is some memory saved.
- I can replace the old ids with the new ones in-place.
- I have no experience with pandas, so maybe there is some copying involved between pandas <-> numpy data structures?
What is the difference from sascha's solution?
- There is no need to keep the list sorted all the time; it is enough to sort after all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted the whole time; you have to pay for this with at least a constant factor, even if the running time stays O(n log(n)). I assumed an add operation would be O(n), but as pointed out it is O(log(n)).
What is the difference to GrantJ's solution?
- The size of the resulting sparse matrix is NxN, with N the number of different nodes, and not 2^54 x 2^54 (with very many empty rows and columns).
PS: Here is my idea of how the 9-character string id can be mapped onto an int64 value; I guess this function could become a bottleneck the way it is written and should be optimized (one possible direction is sketched after the code).
def toInt64(string):
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch) - ord('0')
        elif ch <= 'Z':
            res += ord(ch) - ord('A') + 10
        else:
            res += ord(ch) - ord('a') + 36
    return res
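One way this could be optimized (a sketch, not from the original answer, assuming the 9-character ids have already been extracted into one contiguous bytes buffer with the separators stripped) is to precompute a 256-entry lookup table and encode many ids at once with NumPy:

import numpy as np

# ASCII code -> digit value in [0, 61] for the alphabet [0-9A-Za-z]
LOOKUP = np.zeros(256, dtype=np.int64)
for value, ch in enumerate('0123456789'
                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                           'abcdefghijklmnopqrstuvwxyz'):
    LOOKUP[ord(ch)] = value

# base-62 place values for a 9-character id, most significant digit first
POWERS = 62 ** np.arange(8, -1, -1, dtype=np.int64)

def ids_to_int64(raw, n_ids):
    """ encode n_ids concatenated 9-character ids (separators already stripped) at once"""
    codes = np.frombuffer(raw, dtype=np.uint8).reshape(n_ids, 9)
    return LOOKUP[codes].dot(POWERS)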
Solution 3:
Updated version
As indicated in the comments, the approach did not fit your use-case. Let's make some changes:
- use pandas for reading in the data (instead of numpy: I'm quite surprised np.loadtxt is performing that badly!)
- use the external library sortedcontainers for a more memory-efficient approach (instead of a dictionary)
- the basic approach is the same
This approach will take ~45 minutes (which is slow; but you could pickle/save the result so you need to do it only once) and ~5 GB of memory to prepare the sparse-matrix for your data, generated with:
import random

N = 62500000
for i in xrange(N):
    print random.randint(10**8, 10**9-1), random.randint(10**8, 10**9-1)
Code
import numpy as np
from scipy.sparse import coo_matrix
import pandas as pd
from sortedcontainers import SortedList
import time

# Read data
# global memory usage after: one big array
df = pd.read_csv('EDGES.txt', delimiter=' ', header=None, dtype=np.uint32)
data = df.as_matrix()
df = None
n_edges = data.shape[0]

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big searchtree
print('fit mapping')
start = time.time()
observed_vertices = SortedList()
mappings = np.arange(n_edges*2, dtype=np.uint32)  # upper bound on vertices
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in observed_vertices:
            observed_vertices.add(val)
mappings = mappings[:len(observed_vertices)]
n_vertices = len(observed_vertices)
end = time.time()
print(' secs: ', end-start)

print('transform mapping')
# Map original data (in-place !)
# global memory usage after: one big array + one big searchtree (can be deleted!)
start = time.time()
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        mapper_pos = observed_vertices.index(val)
        data[row, column] = mappings[mapper_pos]
end = time.time()
print(' secs: ', end-start)

observed_vertices = None  # if not needed anymore
mappings = None  # if not needed anymore

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
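Since the mapping is the expensive part, it is worth persisting the result so it only has to run once, as mentioned above. A minimal sketch continuing the script (the file name mapped_edges.npz is just an example):

# persist the remapped edge list so the expensive mapping step only runs once
np.savez('mapped_edges.npz', edges=data, n_vertices=n_vertices)

# in later runs, rebuild the sparse matrix in seconds
loaded = np.load('mapped_edges.npz')
data, n_vertices = loaded['edges'], int(loaded['n_vertices'])
sp_mat = coo_matrix((np.ones(data.shape[0], dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))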
First version
Here is a very simple and very inefficient (in regard to both time and space) piece of code to build this sparse matrix. I post this code because I believe it is important to understand the core parts if one is using these in something bigger.
Let's see if this code is efficient enough for your use-case or if it needs work. From a distance it's hard to tell, because we don't have your data.
The dictionary part, used for the mapping, is a candidate to blow up your memory. But it's pointless to optimize this without knowing whether it's needed at all, especially because this part of the code depends on the number of vertices in your graph (and I don't have any knowledge of this cardinality).
""" itertools.count usage here would need changes for py2 """import numpy as np
from itertools import count
from scipy.sparse import coo_matrix
# Read data# global memory usage after: one big array
data = np.loadtxt('edges.txt', np.uint32)
n_edges = data.shape[0]
#print(data)#print(data.shape)# Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown# global memory usage after: one big array + one big dict
index_gen = count()
mapper = {}
for column inrange(data.shape[1]):
for row inrange(data.shape[0]):
# double-loop: slow, but easy to understand space-complexity
val = data[row, column]
if val notin mapper:
mapper[val] = next(index_gen)
n_vertices = len(mapper)
# Map original data (in-place !)# global memory usage after: one big array + one big dict (can be deleted!)for column inrange(data.shape[1]):
for row inrange(data.shape[0]):
# double-loop: slow, but easy to understand space-complexity
data[row, column] = mapper[data[row, column]]
#print(data)# Create sparse matrix (only caring about a single triangular part for now)# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
#print(sp_mat)
Output for edges-10.txt:
[[287111206 357850135]
 [512616930 441657273]
 [530905858 562056765]
 [524113870 320749289]
 [149911066 964526673]
 [169873523 631128793]
 [646151040 986572427]
 [105290138 382302570]
 [194873438 968653053]
 [912211115 195436728]]
(10, 2)
[[ 0 10]
 [ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]]
(0, 10) True
(1, 11) True
(2, 12) True
(3, 13) True
(4, 14) True
(5, 15) True
(6, 16) True
(7, 17) True
(8, 18) True
(9, 19) True
Solution 4:
I tried the different methods available apart from the ones already used, and I found the following to do well.
Method 1 - Reading the file into a string, parsing the string into a 1-D array using numpy's fromstring.
import numpy as np
import scipy.sparse as sparse

def readEdges():
    with open('edges.txt') as f:
        data = f.read()
    edges = np.fromstring(data, dtype=np.int32, sep=' ')
    edges = np.reshape(edges, (edges.shape[0]/2, 2))
    ones = np.ones(len(edges), np.uint32)
    cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdges()
Output:
5 loops, best of 3: 13.6 s per loop
Method 2 - Same as method 1 except instead of loading the file into a string, using the memory mapped interface.
import contextlib
import mmap

def readEdgesMmap():
    with open('edges.txt') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            edges = np.fromstring(m, dtype=np.int32, sep=' ')
            edges = np.reshape(edges, (edges.shape[0]/2, 2))
            ones = np.ones(len(edges), np.uint32)
            cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdgesMmap()
Output:
5 loops, best of 3: 12.7 s per loop
Monitored using /usr/bin/time, both methods use a maximum of around ~2 GB of memory.
A few notes:
- It seems to do slightly better than pandas read_csv: using pandas read_csv, the output on the same machine is 5 loops, best of 3: 16.2 s per loop.
- Conversion from COO to CSR/CSC consumes significant time too. In @GrantJ's answer it takes less time because the COO matrix initialization is incorrect: the argument needs to be given as a tuple (see the example after these notes). I wanted to leave a comment there but I don't have commenting rights yet.
- My guess on why this is slightly better than pandas read_csv is the prior assumption of 1-D data.
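For reference, the tuple form that the coo_matrix constructor expects (the point of the note about the incorrect initialization) looks like this:

import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
vals = np.ones(3, dtype=np.uint32)

# the data array and the (row, col) index arrays are wrapped together in a single tuple
m = coo_matrix((vals, (rows, cols)))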
Solution 5:
You might want to take a look at the igraph project; it is a GPL library of C code designed for this kind of thing, and it has a nice Python API. I think in your case the Python code would be something like:
from igraph import Graph

g = Graph.Read_Edgelist('edges.txt')
g.write_adjacency('adjacency_matrix.txt')
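If you want a SciPy sparse matrix rather than a text adjacency file, one option is to pull the relabelled edge list back out of igraph and build the matrix yourself; a sketch based on igraph's get_edgelist() and vcount() (untested against your data):

import numpy as np
from scipy.sparse import coo_matrix
from igraph import Graph

g = Graph.Read_Edgelist('edges.txt')
edges = np.array(g.get_edgelist(), dtype=np.uint32)
n = g.vcount()
adjacency = coo_matrix((np.ones(len(edges), dtype=bool),
                        (edges[:, 0], edges[:, 1])),
                       shape=(n, n))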