Computes exact Shapley values for a KNN classifier.
This implements the method described in (Jia, R. et al., 2019)1.
It exploits the local structure of K-Nearest Neighbours to reduce the number
of calls to the utility function to a constant number per index, thus
reducing computation time to \(O(n)\).
PARAMETER |
DESCRIPTION |
u |
Utility with a KNN model to extract parameters from. The object
will not be modified nor used other than to call get_params()
TYPE:
Utility
|
progress |
Whether to display a progress bar.
TYPE:
bool
DEFAULT:
True
|
Source code in src/pydvl/value/shapley/knn.py
| def knn_shapley(u: Utility, *, progress: bool = True) -> ValuationResult:
"""Computes exact Shapley values for a KNN classifier.
This implements the method described in (Jia, R. et al., 2019)<sup><a href="#jia_efficient_2019a">1</a></sup>.
It exploits the local structure of K-Nearest Neighbours to reduce the number
of calls to the utility function to a constant number per index, thus
reducing computation time to $O(n)$.
Args:
u: Utility with a KNN model to extract parameters from. The object
will not be modified nor used other than to call [get_params()](
<https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator.get_params>)
progress: Whether to display a progress bar.
Returns:
Object with the data values.
Raises:
TypeError: If the model in the utility is not a
[sklearn.neighbors.KNeighborsClassifier][].
!!! tip "New in version 0.1.0"
"""
if not isinstance(u.model, KNeighborsClassifier):
raise TypeError("KNN Shapley requires a K-Nearest Neighbours model")
defaults: Dict[str, Union[int, str]] = {
"algorithm": "ball_tree" if u.data.dim >= 20 else "kd_tree",
"metric": "minkowski",
"p": 2,
}
defaults.update(u.model.get_params())
# HACK: NearestNeighbors doesn't support this. There will be more...
del defaults["weights"]
n_neighbors: int = int(defaults["n_neighbors"])
defaults["n_neighbors"] = len(u.data) # We want all training points sorted
assert n_neighbors < len(u.data)
# assert data.target_dim == 1
nns = NearestNeighbors(**defaults).fit(u.data.x_train)
# closest to farthest
_, indices = nns.kneighbors(u.data.x_test)
values: NDArray[np.float_] = np.zeros_like(u.data.indices, dtype=np.float_)
n = len(u.data)
yt = u.data.y_train
iterator = enumerate(zip(u.data.y_test, indices), start=1)
for j, (y, ii) in tqdm(iterator, disable=not progress):
value_at_x = int(yt[ii[-1]] == y) / n
values[ii[-1]] += (value_at_x - values[ii[-1]]) / j
for i in range(n - 2, n_neighbors, -1): # farthest to closest
value_at_x = (
values[ii[i + 1]] + (int(yt[ii[i]] == y) - int(yt[ii[i + 1]] == y)) / i
)
values[ii[i]] += (value_at_x - values[ii[i]]) / j
for i in range(n_neighbors, -1, -1): # farthest to closest
value_at_x = (
values[ii[i + 1]]
+ (int(yt[ii[i]] == y) - int(yt[ii[i + 1]] == y)) / n_neighbors
)
values[ii[i]] += (value_at_x - values[ii[i]]) / j
return ValuationResult(
algorithm="knn_shapley",
status=Status.Converged,
values=values,
data_names=u.data.data_names,
)
|