How To Hash A Class Or Function Definition?
Solution 1:
All you’re looking for is a hash procedure that includes all the salient details of the class’s definition. (Base classes can be included by including their definitions recursively.) To minimize false matches, the basic idea is to apply a wide (cryptographic) hash to a serialization of your class. So start with pickle
: it supports more types than hash
and, when it uses identity, it uses a reproducible identity based on name. This makes it a good candidate for the base case of a recursive strategy: deal with the functions and classes whose contents are important and let it handle any ancillary objects referenced.
So define a serialization by cases. Call an object special if it falls under any case below but the last.
- For a
tuple
deemed to contain special objects:- The character
t
- The serialization of its
len
- The serialization of each element, in order
- The character
- For a
dict
deemed to contain special objects:- The character
d
- The serialization of its
len
- The serialization of each name and value, in sorted order
- The character
- For a class whose definition is salient:
- The character
C
- The serialization of its
__bases__
- The serialization of its
vars
- The character
- For a function whose definition is salient:
- The character
f
- The serialization of its
__defaults__
- The serialization of its
__kwdefaults__
(in Python 3) - The serialization of its
__closure__
(but with cell values instead of the cells themselves) - The serialization of its
vars
- The serialization of its
__code__
- The character
- For a code object (since
pickle
doesn’t support them at all):- The character
c
- The serializations of its
co_argcount
,co_nlocals
,co_flags
,co_code
,co_consts
,co_names
,co_freevars
, andco_cellvars
, in that order; none of these are ever special
- The character
- For a static or class method object:
- The character
s
orm
- The serialization of its
__func__
- The character
- For a property:
- The character
p
- The serializations of its
fget
,fset
, andfdel
, in that order
- The character
- For any other object:
pickle.dumps(x,-1)
(You never actually store all this: just create a hashlib
object of your choice in the top-level function, and in the recursive part update
it with each piece of the serialization in turn.)
The type tags are to avoid collisions and in particular to be prefix-free. Binary pickles are already prefix-free. You can base the decision about a container on a deterministic analysis of its contents (even if heuristic) or on context, so long as you’re consistent.
As always, there is something of an art to balancing false positives against false negatives: for a function, you could include __globals__
(with pruning of objects already serialized to avoid large if not infinite serializations) or just any __name__
found therein. Omitting co_varnames
ignores renaming local variables, which is good unless introspection is important; similarly for co_filename
and co_name
.
You may need to support more types: look for static attributes and default arguments that don’t pickle
correctly (because they contain references to special types) or at all. Note of course that some types (like file objects) are unpicklable because it’s difficult or impossible to serialize them (although unlike pickle
you can handle lambdas just like any other function once you’ve done code
objects). At some risk of false matches, you can choose to serialize just the type of such objects (as always, prefixed with a character ?
to distinguish from actually having the type in that position).
Post a Comment for "How To Hash A Class Or Function Definition?"