
KNN-Shapley

Using the generic class ShapleyValuation together with a PermutationSampler to compute Shapley values for a KNN model has a time complexity of $\mathcal{O}(n^2 \log^2 n)$.[1]

However, it is possible to exploit the local structure of K-Nearest Neighbours to drastically reduce the number of utility evaluations and bring the complexity down to $\mathcal{O}(n \log n)$ per test point. The key insight is that, because only the $K$ closest samples affect the score, all others can be ignored, which allows for a recursive formula of the stated complexity. This method was introduced by (Jia et al., 2019)[3] in two forms: an exact one (available with KNNShapleyValuation) and an approximate one.

The exact computation

Define the utility as the likelihood of the correct label:

$$U(S, x_{\text{test}}) = \frac{1}{K} \sum_{k=1}^{\min(K, |S|)} \mathbb{I}[y_{\alpha_k(S)} = y_{\text{test}}],$$

where $\alpha_k(S)$ is the index of the $k$-th closest point $x_i \in S$ to $x_{\text{test}}$. The Shapley value of each $x_i \in D_{\text{train}}$ is computed exactly with the following recursion, where $\alpha_i := \alpha_i(D_{\text{train}})$:

$$s_{\alpha_N} = \frac{\mathbb{I}[y_{\alpha_N} = y_{\text{test}}]}{N},$$

$$s_{\alpha_i} = s_{\alpha_{i+1}} + \frac{\mathbb{I}[y_{\alpha_i} = y_{\text{test}}] - \mathbb{I}[y_{\alpha_{i+1}} = y_{\text{test}}]}{K} \frac{\min\{K, i\}}{i}.$$

The utilities are then averaged over all $x_{\text{test}} \in D_{\text{test}}$ to compute the total utility $U(S)$.
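
To make the recursion concrete, here is a minimal NumPy sketch computing the values for a single test point. It is only illustrative and not pyDVL's implementation; the function name and the choice of Euclidean distances are assumptions:

import numpy as np

def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    """Exact KNN-Shapley values of all training points for one test point."""
    n = len(x_train)
    # Sort training points by increasing distance to the test point:
    # alpha[0] is the index of the closest point, alpha[-1] of the farthest
    alpha = np.argsort(np.linalg.norm(x_train - x_test, axis=1))
    # match[j] is the indicator I[y_{alpha_{j+1}} = y_test] from the formula
    match = (y_train[alpha] == y_test).astype(float)
    values = np.zeros(n)
    # Base case: the farthest point, i = N
    values[alpha[-1]] = match[-1] / n
    # Recursion from the second-farthest point inwards; the 0-based loop
    # index j corresponds to i = j + 1 in the formula
    for j in range(n - 2, -1, -1):
        i = j + 1
        values[alpha[j]] = (
            values[alpha[j + 1]]
            + (match[j] - match[j + 1]) / k * min(k, i) / i
        )
    # Averaging the output over all test points gives the final values
    return values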

pyDVL implements the exact method in KNNShapleyValuation. Internally it iterates through each training point with a LOOSampler and computes the recursive formula.[2]

Computing exact KNN-Shapley values
from joblib import parallel_config
from pydvl.valuation import Dataset, KNNShapleyValuation
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
# Build the train and test splits from your data arrays
train, test = Dataset.from_arrays(...)
valuation = KNNShapleyValuation(model, test, progress=True)

# Parallelize the computation across 16 joblib workers
with parallel_config(n_jobs=16):
    valuation.fit(train)
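
After fitting, the computed values are exposed through pyDVL's usual ValuationResult interface, e.g. for sorting or conversion to a dataframe.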

Approximate computation

By using approximate KNN-based methods, it is possible to shave off a factor of $n$ from the complexity of the exact method. As of v0.10.0, pyDVL does not implement this technique.


  1. This is based on a Hoeffding bound and the time complexity of sorting $n$ points, which is the asymptotic complexity of one utility evaluation of KNN (Jia et al., 2019).[3]

  2. As already noted, it is possible to use the standard infrastructure instead, although this is mostly useful for testing purposes. To this end, we provide an implementation of the formula above in KNNClassifierUtility.
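
     A minimal sketch of that testing setup follows. It assumes that KNNClassifierUtility is constructed from the model and the test set, mirroring KNNShapleyValuation, and that the classes below are importable from pydvl.valuation as in the example above; both are assumptions, not confirmed API:

     from joblib import parallel_config
     from pydvl.valuation import (
         Dataset,
         KNNClassifierUtility,
         MaxUpdates,
         PermutationSampler,
         ShapleyValuation,
     )
     from sklearn.datasets import make_classification
     from sklearn.neighbors import KNeighborsClassifier

     # Synthetic data for illustration only
     X, y = make_classification(n_samples=200, random_state=0)
     train, test = Dataset.from_arrays(X, y, train_size=0.8)

     model = KNeighborsClassifier(n_neighbors=5)
     # Assumption: the utility takes the model and the test data
     utility = KNNClassifierUtility(model, test)
     # Generic Monte Carlo Shapley over permutations, as mentioned at the top
     valuation = ShapleyValuation(
         utility, sampler=PermutationSampler(), is_done=MaxUpdates(100)
     )
     with parallel_config(n_jobs=4):
         valuation.fit(train)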

  3. Jia, R., Dao, D., Wang, B., Hubis, F.A., Gurel, N.M., Li, B., Zhang, C., Spanos, C., Song, D., 2019. Efficient task-specific data valuation for nearest neighbor algorithms. Proc. VLDB Endow. 12, 1610–1623. https://doi.org/10.14778/3342263.3342637