Finding A Set Of Indices That Maps The Rows Of One Numpy Ndarray To Another
Solution 1:
Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:
A = np.random.permutation(12).reshape(4, 3)
idx = np.random.permutation(4)
B = A[idx]
print(repr(A))
# array([[ 7, 11, 6],
# [ 4, 10, 8],
# [ 9, 2, 0],
# [ 1, 3, 5]])
print(repr(B))
# array([[ 1, 3, 5],
# [ 4, 10, 8],
# [ 7, 11, 6],
# [ 9, 2, 0]])
We want to recover a set of indices, idx
, such that A[idx] == B
. This will be a unique mapping if and only if A and B contain no repeated rows.
One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A
and B
as 1D arrays using an np.void
dtype that treats each row as a single element:
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size / A.shape[0]))
# A and B must be C-contiguous, might need to force a copy herea = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a) # indices that sort the rows of A in lexical order
Now we can use np.searchsorted
to perform a binary search for where each row in B would fall within the sorted version of A:
# using the `sorter=` argument rather than `a[a_to_as]` avoids making a copy of `a`as_to_b = a.searchsorted(b, sorter=a_to_as)
The mapping from A→B can be expressed as a composite of A→As→B
a_to_b = a_to_as.take(as_to_b)
print(np.all(A[a_to_b] == B))
# True
If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained using
b_to_a = np.argsort(a_to_b)
print(np.all(B[b_to_a] == A))
# True
As a single function:
deffind_row_mapping(A, B):
"""
Given A and B, where B is a copy of A permuted over the first dimension, find
a set of indices idx such that A[idx] == B.
This is a unique mapping if and only if there are no repeated rows in A and B.
Arguments:
A, B: n-dimensional arrays with same shape and dtype
Returns:
idx: vector of indices into the rows of A
"""ifnot (A.shape == B.shape):
raise ValueError('A and B must have the same shape')
ifnot (A.dtype == B.dtype):
raise TypeError('A and B must have the same dtype')
rowtype = np.dtype((np.void, A.dtype.itemsize * A.size / A.shape[0]))
a = np.ascontiguousarray(A).view(rowtype).ravel()
b = np.ascontiguousarray(B).view(rowtype).ravel()
a_to_as = np.argsort(a)
as_to_b = a.searchsorted(b, sorter=a_to_as)
return a_to_as.take(as_to_b)
Benchmark:
In [1]: gen = np.random.RandomState(0)
In [2]: %%timeit A = gen.rand(1000000, 100); B = A.copy(); gen.shuffle(B)
....: find_row_mapping(A, B)
1loop, best of 3: 2.76 s per loop
*The most costly step would be the quicksort over rows which is O(n log n) on average. I'm not sure it's possible to do any better than this.
Solution 2:
Since either one of the arrays could be shuffled to match the other, no one has stopped us from re-arranging both. Using Jaime's Answer, we can vstack
both the arrays and find the unique rows. Then the inverse indices returned by unique is essentially the desired mapping (as the arrays don't contain duplicate rows).
Let's first define a unique2d
function for convenience:
defunique2d(arr,consider_sort=False,return_index=False,return_inverse=False):
"""Get unique values along an axis for 2D arrays.
input:
arr:
2D array
consider_sort:
Does permutation of the values within the axis matter?
Two rows can contain the same values but with
different arrangements. If consider_sort
is True then those rows would be considered equal
return_index:
Similar to numpy unique
return_inverse:
Similar to numpy unique
returns:
2D array of unique rows
If return_index is True also returns indices
If return_inverse is True also returns the inverse array
"""if consider_sort isTrue:
a = np.sort(arr,axis=1)
else:
a = arr
b = np.ascontiguousarray(a).view(np.dtype((np.void,
a.dtype.itemsize * a.shape[1])))
if return_inverse isFalse:
_, idx = np.unique(b, return_index=True)
else:
_, idx, inv = np.unique(b, return_index=True, return_inverse=True)
if return_index == Falseand return_inverse == False:
return arr[idx]
elif return_index == Trueand return_inverse == False:
return arr[idx], idx
elif return_index == Falseand return_inverse == True:
return arr[idx], inv
else:
return arr[idx], idx, inv
We can now define our mapping as follows
defrow_mapper(a,b,consider_sort=False):
"""Given two 2D numpy arrays returns mappers idx_a and idx_b
such that a[idx_a] = b[idx_b] """assert a.dtype == b.dtype
assert a.shape == b.shape
c = np.concatenate((a,b))
_, inv = unique2d(c, consider_sort=consider_sort, return_inverse=True)
mapper_a = inv[:b.shape[0]]
mapper_b = inv[b.shape[0]:]
return np.argsort(mapper_a), np.argsort(mapper_b)
Verify:
n = 100000
A = np.arange(n).reshape(n//4,4)
B = A[::-1,:]
idx_a, idx_b = row_mapper(A,B)
print np.all(A[idx_a]==B[idx_b])
# True
Benchmark: benchmark against @ali_m's solution
%timeitfind_row_mapping(A,B)# ali_m's solution%timeitrow_mapper(A,B)# current solution# n = 100100000loops,best of 3:12.2µsperloop10000loops,best of 3:47.3µsperloop# n = 100010000loops,best of 3:49.1µsperloop10000loops,best of 3:148µsperloop# n = 100001000 loops,best of 3:548µsperloop1000 loops,best of 3:1.6msperloop# n = 100000100loops,best of 3:6.96msperloop100loops,best of 3:19.3msperloop# n = 100000010loops,best of 3:160msperloop1loops,best of 3:372msperloop# n = 100000001loops,best of 3:2.54sperloop1loops,best of 3:5.92sperloop
Although maybe there is room for improvement, the current solution is 2-3 times slower than ali_m's solution and perhaps a little messier as well as both arrays need to be mapped. Just thought this could be an alternate solution.
Post a Comment for "Finding A Set Of Indices That Maps The Rows Of One Numpy Ndarray To Another"