Converting A 1.2GB List Of Edges Into A Sparse Matrix
Solution 1:
Here's my solution:
import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix
Pandas does the heavy lifting of parsing using read_csv. And Pandas is already storing the data in columnar format. The data[0] and data[1] just get references, no copies. Then I feed those to coo_matrix. Benchmarked locally:
In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop
Then to save a csr-matrix to a file:
def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.

    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)
Benchmarked locally:
In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop
And later load it back from a file:
def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.

    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix
Benchmarked locally:
In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop
And finally test it all:
def test():
    "Test data file parsing and matrix serialization."
    coo_matrix = read_data_file_as_coo_matrix()
    csr_matrix = coo_matrix.tocsr()
    save_csr_matrix('edges.npz', csr_matrix)
    loaded_csr_matrix = load_csr_matrix('edges.npz')
    # Comparison based on http://stackoverflow.com/a/30685839/232571
    assert (csr_matrix != loaded_csr_matrix).nnz == 0

if __name__ == '__main__':
    test()
When running test(), it takes about 30 seconds:
$ time python so_38688062.py
real    0m30.401s
user    0m27.257s
sys     0m2.759s
And the memory high-water mark was ~1.79 GB.
Note that once you've converted the "edges.txt" to "edges.npz" in CSR-matrix format, loading it will take less than a second.
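If your SciPy is recent enough (save_npz and load_npz were added to scipy.sparse in version 0.19), the two helper functions above can be replaced by the built-ins; a minimal sketch:

import scipy.sparse as ss

# Built-in equivalents of save_csr_matrix / load_csr_matrix above.
csr_matrix = read_data_file_as_coo_matrix().tocsr()
ss.save_npz('edges.npz', csr_matrix)
csr_matrix = ss.load_npz('edges.npz')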
Solution 2:
In my answer I consider the case where the node ids are given by 9-character strings, each character from [0-9A-Za-z]. These n node ids should be mapped onto the values [0, n-1] (which might not be necessary for your application, but is still of general interest).
The next considerations, which I'm sure you are aware of, are here for the sake of completeness:
- Memory is the bottleneck.
- There are around 10^8 strings in the file.
- A 9-character string + int32 value pair costs around 120 bytes in a dictionary, resulting in about 12 GB of memory usage for the file.
- A string id from the file can be mapped onto an int64: there are 62 different characters, so each can be encoded with 6 bits, and 9 characters per string means 6*9 = 54 < 64 bits. See also the toInt64() method further below.
- There are int64 + int32 = 12 bytes of "real" information per pair, so ca. 1.2 GB could be enough; however, the cost of such a pair in a dictionary is around 60 bytes (around 6 GB of RAM needed).
- Creating many small objects (on the heap) results in a lot of memory overhead, so bundling these objects into arrays is advantageous. Interesting information about the memory used by Python objects can be found in this tutorial-style article. Interesting experiences with reducing memory usage are made public in this blog entry.
- A Python list is out of the question as a data structure, and so is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).
Step 1: reading the file and mapping the strings to int64. It is a pain to let an np.array grow dynamically, so we assume we know the number of edges in the file (it would be nice to have it in the header, but it can also be deduced from the file size; a sketch of that deduction follows the code below):
import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        nodes[cnt:cnt+2] = map(toInt64, line.split())  # use map(int, line.split()) for cases without letters
        cnt += 2
    return nodes
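For completeness, here is one way the number of edges could be deduced from the file size; this is only a sketch and assumes fixed-width lines of two 9-character ids separated by a single space and terminated by a single newline (20 bytes per line):

import os

def edge_count_from_file_size(filename, bytes_per_line=9 + 1 + 9 + 1):
    """Deduce EDGE_CNT for fixed-width lines like "XXXXXXXXX YYYYYYYYY\n" (20 bytes each)."""
    return os.path.getsize(filename) // bytes_per_line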
Step 2: converting the int64 values into values [0, n-1]:
Possibility A, needs 3*0.8GB:
def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)
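For illustration, np.unique with return_inverse=True returns the sorted unique values plus, for each original element, its index into that unique array, which is exactly the relabelling to [0, n-1]:

import numpy as np

ids = np.array([30, 10, 20, 10], dtype=np.int64)
unique_ids, mapped = np.unique(ids, return_inverse=True)
# unique_ids -> array([10, 20, 30])
# mapped     -> array([2, 0, 1, 0]); each value replaced by its rank among the unique ids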
Possibility B, needs 2*0.8GB, but is somewhat slower:
def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id = np.searchsorted(unique_map, nodes[i])  # faster than bisect.bisect
        nodes[i] = node_id
    return (len(unique_map), nodes)
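If the per-element Python loop turns out to be too slow, the same in-place replacement can be done block-wise with a vectorized searchsorted call; this is only a sketch (the block size is an arbitrary choice) that trades a small temporary array for a large speedup:

def maps_to_ids_blockwise(filename, EDGE_CNT, block=10**6):
    """ like possibility B, but replaces the ids block-wise with a vectorized searchsorted"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for start in xrange(0, len(nodes), block):
        # the temporary created by searchsorted holds at most `block` elements
        nodes[start:start+block] = np.searchsorted(unique_map, nodes[start:start+block])
    return (len(unique_map), nodes)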
Step 3: put it all into a coo_matrix:
from scipy import sparse

def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # it is only a view, not a copy
    cols = nodes[1::2]  # it is only a view, not a copy
    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))
Calling data_as_coo_matrix("data.txt", 62500000) peaks at about 2.5 GB of memory (but with int32 instead of int64 only 1.5 GB are needed). It took around 5 minutes on my machine, but my machine is pretty slow...
So what is different from your solution?
- I get only the unique values from np.unique (and not all the indices and the inverse), so there is some memory saved.
- I can replace the old ids with the new ones in-place.
- I have no experience with pandas, so maybe there is some copying involved between pandas <-> numpy data structures?
What is the difference from sascha's solution?
- There is no need to keep the list sorted all the time; it is enough to sort after all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted the whole time; you have to pay for this with at least a constant factor, even if the running time stays O(n log(n)). I assumed an add operation would be O(n), but as pointed out it is O(log(n)).
What is the difference to GrantJ's solution?
- The size of the resulting sparse matrix is NxN, with N the number of different nodes, and not 2^54 x 2^54 (with very many empty rows and columns).
PS: Here is my idea of how the 9-character string id can be mapped onto an int64 value; I guess this function could become a bottleneck the way it is written and should be optimized (one possible direction is sketched after the code).
def toInt64(string):
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch) - ord('0')
        elif ch <= 'Z':
            res += ord(ch) - ord('A') + 10
        else:
            res += ord(ch) - ord('a') + 36
    return res
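One way this could be optimized (a sketch, not from the original answer, assuming the 9-character ids have already been extracted into one contiguous bytes buffer with the separators stripped) is to precompute a 256-entry lookup table and encode many ids at once with NumPy:

import numpy as np

# ASCII code -> digit value in [0, 61] for the alphabet [0-9A-Za-z]
LOOKUP = np.zeros(256, dtype=np.int64)
for value, ch in enumerate('0123456789'
                           'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                           'abcdefghijklmnopqrstuvwxyz'):
    LOOKUP[ord(ch)] = value

# base-62 place values for a 9-character id, most significant digit first
POWERS = 62 ** np.arange(8, -1, -1, dtype=np.int64)

def ids_to_int64(raw, n_ids):
    """ encode n_ids concatenated 9-character ids (separators already stripped) at once"""
    codes = np.frombuffer(raw, dtype=np.uint8).reshape(n_ids, 9)
    return LOOKUP[codes].dot(POWERS)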
Solution 3:
Updated version
As indicated in the comments, the approach did not fit your use-case. Let's make some changes:
- use pandas for reading in the data (instead of numpy: I'm quite surprised np.loadtxt is performing that badly!)
- use the external library sortedcontainers for a more memory-efficient approach (instead of a dictionary)
- the basic approach is the same
This approach will take ~45 minutes (which is slow; but you could pickle/save the result so you need to do it only once) and ~5 GB of memory to prepare the sparse-matrix for your data, generated with:
import random

N = 62500000
for i in xrange(N):
    print random.randint(10**8, 10**9-1), random.randint(10**8, 10**9-1)
Code
import numpy as np
from scipy.sparse import coo_matrix
import pandas as pd
from sortedcontainers import SortedList
import time

# Read data
# global memory usage after: one big array
df = pd.read_csv('EDGES.txt', delimiter=' ', header=None, dtype=np.uint32)
data = df.as_matrix()
df = None
n_edges = data.shape[0]

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big searchtree
print('fit mapping')
start = time.time()
observed_vertices = SortedList()
mappings = np.arange(n_edges*2, dtype=np.uint32)  # upper bound on vertices
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in observed_vertices:
            observed_vertices.add(val)
mappings = mappings[:len(observed_vertices)]
n_vertices = len(observed_vertices)
end = time.time()
print(' secs: ', end-start)

print('transform mapping')
# Map original data (in-place !)
# global memory usage after: one big array + one big searchtree (can be deleted!)
start = time.time()
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        mapper_pos = observed_vertices.index(val)
        data[row, column] = mappings[mapper_pos]
end = time.time()
print(' secs: ', end-start)

observed_vertices = None  # if not needed anymore
mappings = None  # if not needed anymore

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
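Since the mapping is the expensive part, it is worth persisting the result so it only has to run once, as mentioned above. A minimal sketch continuing the script (the file name mapped_edges.npz is just an example):

# persist the remapped edge list so the expensive mapping step only runs once
np.savez('mapped_edges.npz', edges=data, n_vertices=n_vertices)

# in later runs, rebuild the sparse matrix in seconds
loaded = np.load('mapped_edges.npz')
data, n_vertices = loaded['edges'], int(loaded['n_vertices'])
sp_mat = coo_matrix((np.ones(data.shape[0], dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))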
First version
Here is a very simple and very inefficient (in regard to both time and space) piece of code to build this sparse matrix. I post this code because I believe it is important to understand the core parts if one is using these in something bigger.
Let's see if this code is efficient enough for your use-case or if it needs work. From a distance it's hard to tell, because we don't have your data.
The dictionary part, used for the mapping, is a candidate to blow up your memory. But it's pointless to optimize this without knowing whether it's needed at all, especially because this part of the code depends on the number of vertices in your graph (and I don't have any knowledge of this cardinality).
""" itertools.count usage here would need changes for py2 """import numpy as np
from itertools import count
from scipy.sparse import coo_matrix
# Read data# global memory usage after: one big array
data = np.loadtxt('edges.txt', np.uint32)
n_edges = data.shape[0]
#print(data)#print(data.shape)# Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown# global memory usage after: one big array + one big dict
index_gen = count()
mapper = {}
for column inrange(data.shape[1]):
for row inrange(data.shape[0]):
# double-loop: slow, but easy to understand space-complexity
val = data[row, column]
if val notin mapper:
mapper[val] = next(index_gen)
n_vertices = len(mapper)
# Map original data (in-place !)# global memory usage after: one big array + one big dict (can be deleted!)for column inrange(data.shape[1]):
for row inrange(data.shape[0]):
# double-loop: slow, but easy to understand space-complexity
data[row, column] = mapper[data[row, column]]
#print(data)# Create sparse matrix (only caring about a single triangular part for now)# if needed: delete dictionary before as it's not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices))
#print(sp_mat)
Output for edges-10.txt:
[[287111206 357850135]
 [512616930 441657273]
 [530905858 562056765]
 [524113870 320749289]
 [149911066 964526673]
 [169873523 631128793]
 [646151040 986572427]
 [105290138 382302570]
 [194873438 968653053]
 [912211115 195436728]]
(10, 2)
[[ 0 10]
 [ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]]
(0, 10) True
(1, 11) True
(2, 12) True
(3, 13) True
(4, 14) True
(5, 15) True
(6, 16) True
(7, 17) True
(8, 18) True
(9, 19) True
Solution 4:
I tried the different methods available apart from the ones already used, and I found the following to do well.
Method 1 - Reading the file into a string, parsing the string into a 1-D array using numpy's fromstring.
import numpy as np
import scipy.sparse as sparse

def readEdges():
    with open('edges.txt') as f:
        data = f.read()
    edges = np.fromstring(data, dtype=np.int32, sep=' ')
    edges = np.reshape(edges, (edges.shape[0]/2, 2))
    ones = np.ones(len(edges), np.uint32)
    cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdges()
Output:
5 loops, best of 3: 13.6 s per loop
Method 2 - Same as method 1 except instead of loading the file into a string, using the memory mapped interface.
import contextlib
import mmap

def readEdgesMmap():
    with open('edges.txt') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            edges = np.fromstring(m, dtype=np.int32, sep=' ')
            edges = np.reshape(edges, (edges.shape[0]/2, 2))
            ones = np.ones(len(edges), np.uint32)
            cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))
%timeit -n5 readEdgesMmap()
Output:
5 loops, best of 3: 12.7 s per loop
Monitored using /usr/bin/time, both methods use a maximum of around ~2 GB of memory.
A few notes:
- It seems to do slightly better than pandas read_csv: using pandas read_csv, the output on the same machine is 5 loops, best of 3: 16.2 s per loop.
- Conversion from COO to CSR/CSC consumes significant time too. In @GrantJ's answer it takes less time because the COO matrix initialization is incorrect: the argument needs to be given as a tuple (see the example after these notes). I wanted to leave a comment there but I don't have commenting rights yet.
- My guess on why this is slightly better than pandas read_csv is the prior assumption of 1-D data.
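For reference, the tuple form that the coo_matrix constructor expects (the point of the note about the incorrect initialization) looks like this:

import numpy as np
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2])
cols = np.array([1, 2, 0])
vals = np.ones(3, dtype=np.uint32)

# the data array and the (row, col) index arrays are wrapped together in a single tuple
m = coo_matrix((vals, (rows, cols)))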
Solution 5:
You might want to take a look at the igraph project; it is a GPL library of C code designed for this kind of thing, and it has a nice Python API. I think in your case the Python code would be something like:
from igraph import Graph

g = Graph.Read_Edgelist('edges.txt')
g.write_adjacency('adjacency_matrix.txt')
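If you want a SciPy sparse matrix rather than a text adjacency file, one option is to pull the relabelled edge list back out of igraph and build the matrix yourself; a sketch based on igraph's get_edgelist() and vcount() (untested against your data):

import numpy as np
from scipy.sparse import coo_matrix
from igraph import Graph

g = Graph.Read_Edgelist('edges.txt')
edges = np.array(g.get_edgelist(), dtype=np.uint32)
n = g.vcount()
adjacency = coo_matrix((np.ones(len(edges), dtype=bool),
                        (edges[:, 0], edges[:, 1])),
                       shape=(n, n))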