Efficient Loop Over Numpy Array

Versions of this question have already been asked but I have not found a satisfactory answer. Problem: given a large numpy vector, find the indices of the vector elements which are duplicates of one another.

Solution 1:

Since the answers have stopped coming and none was totally satisfactory, for the record I post my own solution.

It is my understanding that it's the assignment which makes Python slow in this case, not the nested loops as I thought initially. Using a library or compiled code eliminates the need for assignments and performance improves dramatically.

from __future__ import print_function
import numpy as np
from numba import jit

N = 10000
vect = np.arange(N, dtype=np.float32)

vect[N // 2] = 1
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)

print("init done")
# uncomment the @jit line to enable the compiled function
# @jit
def duplicates(i, counter, dupl, vect):
    eps = 0.01
    ns = len(vect)
    for j in range(i + 1, ns):
        # replace the comparison below with the eps version for approximate matching:
        # if abs(vect[i] - vect[j]) < eps:
        if vect[i] == vect[j]:
            dupl[counter] = j
            counter += 1
    return counter

counter = 0
for i in range(N):
    counter = duplicates(i, counter, dupl, vect)

print("counter =", counter)
print(dupl[0:counter])

Tests

# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 10.135 s

# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.480 s

The performance of the compiled version (with @jit uncommented) is close to C code performance, ~0.1 - 0.2 s. Perhaps eliminating the last Python loop could improve performance even further. The gap is even wider when using the approximate comparison with eps: the pure-Python version slows down dramatically, while the compiled version is barely affected.

# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 109.218 s

# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.506 s

This is a ~200x difference. In the real code I had to put both loops in the function, as well as use a function template with variable types, so it was a bit more complex, but not by much. A sketch of the two-loop version follows.
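A minimal sketch of what moving both loops into the jitted function might look like (my reconstruction based on the description above; the function name and the nopython flag are my choices, not the author's actual code):

import numpy as np
from numba import jit

@jit(nopython=True)
def find_duplicates(vect, dupl):
    # Both loops now run inside compiled code, so no per-iteration
    # Python-level work remains.
    counter = 0
    ns = len(vect)
    for i in range(ns):
        for j in range(i + 1, ns):
            if vect[i] == vect[j]:
                dupl[counter] = j
                counter += 1
    return counter

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)

counter = find_duplicates(vect, dupl)
print(counter, dupl[:counter])  # -> 3 [2500 5000 5000]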

Solution 2:

Python itself is a highly dynamic, slow language. The idea in NumPy is to use vectorization and avoid explicit loops. In this case, you can use np.equal.outer. You can start with

a = np.equal.outer(vect, vect)

Now, for example, to find the total number of matches (the 10,000 diagonal self-matches plus 2 * 3 symmetric off-diagonal matches):

>>> np.sum(a)
10006

To find the indices of elements that are equal to some other element, first zero out the diagonal self-matches:

np.fill_diagonal(a, 0)

>>> np.nonzero(np.any(a, axis=0))[0]
array([   1, 2500, 5000])

Timing

def find_vec():
    a = np.equal.outer(vect, vect)
    s = np.sum(a)
    np.fill_diagonal(a, 0)
    return np.sum(a), np.nonzero(np.any(a, axis=0))[0]

>>> %timeit find_vec()
1 loops, best of 3: 214 ms per loop

def find_loop():
    dupl = []
    counter = 0
    for i in range(N):
        for j in range(i+1, N):
             if vect[i] == vect[j]:
                 dupl.append(j)
                 counter += 1
    return dupl

>>> %timeit find_loop()
1 loops, best of 3: 8.51 s per loop

Solution 3:

This solution, using the numpy_indexed package, has O(n log n) complexity and is fully vectorized; so in all likelihood it is not terribly different from C performance.

import numpy as np
import numpy_indexed as npi

dpl = np.flatnonzero(npi.multiplicity(vect) > 1)
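If you'd rather avoid the extra dependency, a pure-NumPy equivalent is possible (a sketch of mine, not part of the numpy_indexed package): np.unique sorts in O(n log n), and return_inverse maps each element back to its group, so per-element multiplicities follow directly.

import numpy as np

def duplicate_indices(vect):
    # counts[inverse] is each element's multiplicity in vect.
    _, inverse, counts = np.unique(vect, return_inverse=True, return_counts=True)
    return np.flatnonzero(counts[inverse] > 1)

# On the test vector this returns array([   1, 2500, 5000]),
# the same indices npi.multiplicity flags above.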

Solution 4:

The obvious question is why you want to do this in this way. NumPy arrays are intended to be opaque data structures; by this I mean that NumPy arrays are intended to be created inside the NumPy system and operations sent in to the NumPy subsystem to deliver a result. NumPy should be a black box into which you throw requests and out of which come results.

So given the code above, I am not at all surprised that NumPy performance is worse than dreadful.

The following should be effectively what you want, I believe, but done the NumPy way:

import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1

print([np.where(a == vect)[0] for a in vect][1])

# Delivers [1, 2500, 5000]

Solution 5:

Approach #1

You can simulate that iterator-dependency criterion for a vectorized solution using a triangular matrix. This is based on this post, which dealt with multiplication involving iterator dependency. To perform the elementwise equality of each element in vect against all of its elements, we can use NumPy broadcasting. Finally, we can use np.count_nonzero to get the count, as it is very efficient at summing boolean arrays.

So, we would have a solution like so -

mask = np.triu(vect[:,None] == vect,1)
counter = np.count_nonzero(mask)
dupl = np.where(mask)[1]
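For example, applying this to the test vector from Solution 1 (a sketch; the setup lines are assumed from that solution):

import numpy as np

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1

mask = np.triu(vect[:, None] == vect, 1)   # upper triangle, diagonal excluded
counter = np.count_nonzero(mask)           # -> 3
dupl = np.where(mask)[1]                   # -> array([2500, 5000, 5000])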

If you only care about the count counter, we could have two more approaches as listed next.

Approach #2

We can avoid the triangular matrix and simply get the entire count, subtract the contribution from the diagonal elements, and then halve the remaining count to keep just one of the lower or upper triangular regions, since their contributions are identical. For the test vector this gives (10006 - 10000) // 2 = 3.

So, we would have a modified solution like so -

counter = (np.count_nonzero(vect[:,None] == vect) - vect.size)//2

Approach #3

Here's an entirely different approach, which uses the fact that each unique element of multiplicity m contributes a cumsummed run of 0 + 1 + ... + (m - 1) pairs to the final total.

So, with that idea in mind, we would have a third approach like so -

count = np.bincount(vect.astype(int))  # bincount needs non-negative ints (vect is float32 here); OR: np.unique(vect, return_counts=True)[1]
idx = count[count > 1]
id_arr = np.ones(idx.sum(), dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1
counter = np.sum(id_arr.cumsum())
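A quick sanity check of Approach #3 on the test vector from Solution 1 (a sketch; the cast to integers is my addition, needed because np.bincount requires integer input):

import numpy as np

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1

count = np.bincount(vect.astype(np.int64))   # value 1 now occurs 3 times
idx = count[count > 1]                       # multiplicities > 1 -> [3]
id_arr = np.ones(idx.sum(), dtype=int)       # one slot per duplicated element
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1    # reset the cumsum at each group start
counter = np.sum(id_arr.cumsum())            # 0 + 1 + 2 = 3 pairs
print(counter)                               # -> 3, matching Solution 1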
