Skip to content

pydvl.value.least_core

New in version 0.4.0

This package holds all routines for the computation of Least Core data values.

Please refer to Data valuation for an overview.

In addition to the standard interface via compute_least_core_values(), because computing the Least Core values requires the solution of a linear and a quadratic problem after computing all the utility values, there is the possibility of performing each step separately. This is useful when running multiple experiments: use lc_prepare_problem() or mclc_prepare_problem() to prepare a list of problems to solve, then solve them in parallel with lc_solve_problems().

Note that mclc_prepare_problem() is parallelized itself, so preparing the problems should be done in sequence in this case. The solution of the linear systems can then be done in parallel.

LeastCoreMode

Bases: Enum

Available Least Core algorithms.

montecarlo_least_core

montecarlo_least_core(
    u: Utility,
    n_iterations: int,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    non_negative_subsidy: bool = False,
    solver_options: Optional[dict] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult

Computes approximate Least Core values using a Monte Carlo approach.

\[ \begin{array}{lll} \text{minimize} & \displaystyle{e} & \\ \text{subject to} & \displaystyle\sum_{i\in N} x_{i} = v(N) & \\ & \displaystyle\sum_{i\in S} x_{i} + e \geq v(S) & , \forall S \in \{S_1, S_2, \dots, S_m \overset{\mathrm{iid}}{\sim} U(2^N) \} \end{array} \]

Where:

  • \(U(2^N)\) is the uniform distribution over the powerset of \(N\).
  • \(m\) is the number of subsets that will be sampled and whose utility will be computed and used to compute the data values.
PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function

TYPE: Utility

n_iterations

total number of iterations to use

TYPE: int

n_jobs

number of jobs across which to distribute the computation

TYPE: int DEFAULT: 1

parallel_backend

Parallel backend instance to use for parallelizing computations. If None, use JoblibParallelBackend backend. See the Parallel Backends package for available options.

TYPE: Optional[ParallelBackend] DEFAULT: None

config

(DEPRECATED) Object configuring parallel computation, with cluster address, number of cpus, etc.

TYPE: Optional[ParallelConfig] DEFAULT: None

non_negative_subsidy

If True, the least core subsidy \(e\) is constrained to be non-negative.

TYPE: bool DEFAULT: False

solver_options

Dictionary of options that will be used to select a solver and to configure it. Refer to cvxpy's documentation for all possible options.

TYPE: Optional[dict] DEFAULT: None

progress

If True, shows a tqdm progress bar

TYPE: bool DEFAULT: False

seed

Either an instance of a numpy random number generator or a seed for it.

TYPE: Optional[Seed] DEFAULT: None

RETURNS DESCRIPTION
ValuationResult

Object with the data values and the least core value.

Changed in version 0.9.0

Deprecated config argument and added a parallel_backend argument to allow users to pass the Parallel Backend instance directly.

Source code in src/pydvl/value/least_core/montecarlo.py
@deprecated(
    target=True,
    args_mapping={"config": "config"},
    deprecated_in="0.9.0",
    remove_in="0.10.0",
)
def montecarlo_least_core(
    u: Utility,
    n_iterations: int,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    non_negative_subsidy: bool = False,
    solver_options: Optional[dict] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> ValuationResult:
    r"""Computes approximate Least Core values using a Monte Carlo approach.

    $$
    \begin{array}{lll}
    \text{minimize} & \displaystyle{e} & \\
    \text{subject to} & \displaystyle\sum_{i\in N} x_{i} = v(N) & \\
    & \displaystyle\sum_{i\in S} x_{i} + e \geq v(S) & ,
    \forall S \in \{S_1, S_2, \dots, S_m \overset{\mathrm{iid}}{\sim} U(2^N) \}
    \end{array}
    $$

    Where:

    * $U(2^N)$ is the uniform distribution over the powerset of $N$.
    * $m$ is the number of subsets that will be sampled and whose utility will
      be computed and used to compute the data values.

    Args:
        u: Utility object with model, data, and scoring function
        n_iterations: total number of iterations to use
        n_jobs: number of jobs across which to distribute the computation
        parallel_backend: Parallel backend instance to use
            for parallelizing computations. If `None`,
            use [JoblibParallelBackend][pydvl.parallel.backends.JoblibParallelBackend] backend.
            See the [Parallel Backends][pydvl.parallel.backends] package
            for available options.
        config: (**DEPRECATED**) Object configuring parallel computation,
            with cluster address, number of cpus, etc.
        non_negative_subsidy: If True, the least core subsidy $e$ is constrained
            to be non-negative.
        solver_options: Dictionary of options that will be used to select a solver
            and to configure it. Refer to [cvxpy's
            documentation](https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options)
            for all possible options.
        progress: If True, shows a tqdm progress bar
        seed: Either an instance of a numpy random number generator or a seed for it.

    Returns:
        Object with the data values and the least core value.

    !!! tip "Changed in version 0.9.0"
        Deprecated `config` argument and added a `parallel_backend`
        argument to allow users to pass the Parallel Backend instance
        directly.
    """
    problem = mclc_prepare_problem(
        u,
        n_iterations,
        n_jobs=n_jobs,
        parallel_backend=parallel_backend,
        config=config,
        progress=progress,
        seed=seed,
    )
    return lc_solve_problem(
        problem,
        u=u,
        algorithm="montecarlo_least_core",
        non_negative_subsidy=non_negative_subsidy,
        solver_options=solver_options,
    )

mclc_prepare_problem

mclc_prepare_problem(
    u: Utility,
    n_iterations: int,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> LeastCoreProblem

Prepares a linear problem by sampling subsets of the data. Use this to separate the problem preparation from the solving with lc_solve_problem(). Useful for parallel execution of multiple experiments.

See montecarlo_least_core for argument descriptions.

Changed in version 0.9.0

Deprecated config argument and added a parallel_backend argument to allow users to pass the Parallel Backend instance directly.

Source code in src/pydvl/value/least_core/montecarlo.py
@deprecated(
    target=True,
    args_mapping={"config": "config"},
    deprecated_in="0.9.0",
    remove_in="0.10.0",
)
def mclc_prepare_problem(
    u: Utility,
    n_iterations: int,
    *,
    n_jobs: int = 1,
    parallel_backend: Optional[ParallelBackend] = None,
    config: Optional[ParallelConfig] = None,
    progress: bool = False,
    seed: Optional[Seed] = None,
) -> LeastCoreProblem:
    """Prepares a linear problem by sampling subsets of the data. Use this to
    separate the problem preparation from the solving with
    [lc_solve_problem()][pydvl.value.least_core.common.lc_solve_problem]. Useful
    for parallel execution of multiple experiments.

    See
    [montecarlo_least_core][pydvl.value.least_core.montecarlo.montecarlo_least_core]
    for argument descriptions.

    !!! note "Changed in version 0.9.0"
        Deprecated `config` argument and added a `parallel_backend`
        argument to allow users to pass the Parallel Backend instance
        directly.
    """
    n = len(u.data)

    if n_iterations < n:
        warnings.warn(
            f"Number of iterations '{n_iterations}' is smaller the size of the dataset '{n}'. "
            f"This is not optimal because in the worst case we need at least '{n}' constraints "
            "to satisfy the individual rationality condition."
        )

    if n_iterations > 2**n:
        warnings.warn(
            f"Passed n_iterations is greater than the number subsets! "
            f"Setting it to 2^{n}",
            RuntimeWarning,
        )
        n_iterations = 2**n

    parallel_backend = _maybe_init_parallel_backend(parallel_backend, config)

    iterations_per_job = max(
        1, n_iterations // parallel_backend.effective_n_jobs(n_jobs)
    )

    map_reduce_job: MapReduceJob["Utility", "LeastCoreProblem"] = MapReduceJob(
        inputs=u,
        map_func=_montecarlo_least_core,
        reduce_func=_reduce_func,
        map_kwargs=dict(n_iterations=iterations_per_job, progress=progress),
        n_jobs=n_jobs,
        parallel_backend=parallel_backend,
    )

    return map_reduce_job(seed=seed)

exact_least_core

exact_least_core(
    u: Utility,
    *,
    non_negative_subsidy: bool = False,
    solver_options: Optional[dict] = None,
    progress: bool = True,
) -> ValuationResult

Computes the exact Least Core values.

Note

If the training set contains more than 20 instances a warning is printed because the computation is very expensive. This method is mostly used for internal testing and simple use cases. Please refer to the Monte Carlo method for practical applications.

The least core is the solution to the following Linear Programming problem:

\[ \begin{array}{lll} \text{minimize} & \displaystyle{e} & \\ \text{subject to} & \displaystyle\sum_{i\in N} x_{i} = v(N) & \\ & \displaystyle\sum_{i\in S} x_{i} + e \geq v(S) &, \forall S \subseteq N \\ \end{array} \]

Where \(N = \{1, 2, \dots, n\}\) are the training set's indices.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function

TYPE: Utility

non_negative_subsidy

If True, the least core subsidy \(e\) is constrained to be non-negative.

TYPE: bool DEFAULT: False

solver_options

Dictionary of options that will be used to select a solver and to configure it. Refer to the cvxpy's documentation for all possible options.

TYPE: Optional[dict] DEFAULT: None

progress

If True, shows a tqdm progress bar

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
ValuationResult

Object with the data values and the least core value.

Source code in src/pydvl/value/least_core/naive.py
def exact_least_core(
    u: Utility,
    *,
    non_negative_subsidy: bool = False,
    solver_options: Optional[dict] = None,
    progress: bool = True,
) -> ValuationResult:
    r"""Computes the exact Least Core values.

    !!! Note
        If the training set contains more than 20 instances a warning is printed
        because the computation is very expensive. This method is mostly used for
        internal testing and simple use cases. Please refer to the
        [Monte Carlo method][pydvl.value.least_core.montecarlo.montecarlo_least_core]
        for practical applications.

    The least core is the solution to the following Linear Programming problem:

    $$
    \begin{array}{lll}
    \text{minimize} & \displaystyle{e} & \\
    \text{subject to} & \displaystyle\sum_{i\in N} x_{i} = v(N) & \\
    & \displaystyle\sum_{i\in S} x_{i} + e \geq v(S) &, \forall S \subseteq N \\
    \end{array}
    $$

    Where $N = \{1, 2, \dots, n\}$ are the training set's indices.

    Args:
        u: Utility object with model, data, and scoring function
        non_negative_subsidy: If True, the least core subsidy $e$ is constrained
            to be non-negative.
        solver_options: Dictionary of options that will be used to select a solver
            and to configure it. Refer to the [cvxpy's
            documentation](https://www.cvxpy.org/tutorial/advanced/index.html#setting-solver-options)
            for all possible options.
        progress: If True, shows a tqdm progress bar

    Returns:
        Object with the data values and the least core value.
    """
    n = len(u.data)
    if n > 20:  # Arbitrary choice, will depend on time required, caching, etc.
        warnings.warn(f"Large dataset! Computation requires 2^{n} calls to model.fit()")

    problem = lc_prepare_problem(u, progress=progress)
    return lc_solve_problem(
        problem=problem,
        u=u,
        algorithm="exact_least_core",
        non_negative_subsidy=non_negative_subsidy,
        solver_options=solver_options,
    )

lc_prepare_problem

lc_prepare_problem(u: Utility, progress: bool = False) -> LeastCoreProblem

Prepares a linear problem with all subsets of the data Use this to separate the problem preparation from the solving with lc_solve_problem(). Useful for parallel execution of multiple experiments.

See exact_least_core() for argument descriptions.

Source code in src/pydvl/value/least_core/naive.py
def lc_prepare_problem(u: Utility, progress: bool = False) -> LeastCoreProblem:
    """Prepares a linear problem with all subsets of the data
    Use this to separate the problem preparation from the solving with
    [lc_solve_problem()][pydvl.value.least_core.common.lc_solve_problem]. Useful for
    parallel execution of multiple experiments.

    See [exact_least_core()][pydvl.value.least_core.naive.exact_least_core] for argument
    descriptions.
    """
    n = len(u.data)

    logger.debug("Building vectors and matrices for linear programming problem")
    powerset_size = 2**n
    A_lb = np.zeros((powerset_size, n))

    logger.debug("Iterating over all subsets")
    utility_values = np.zeros(powerset_size)
    for i, subset in enumerate(  # type: ignore
        tqdm(
            powerset(u.data.indices),
            disable=not progress,
            total=powerset_size - 1,
            position=0,
        )
    ):
        indices: NDArray[np.bool_] = np.zeros(n, dtype=bool)
        indices[list(subset)] = True
        A_lb[i, indices] = 1
        utility_values[i] = u(subset)  # type: ignore

    return LeastCoreProblem(utility_values, A_lb)

compute_least_core_values

compute_least_core_values(
    u: Utility,
    *,
    n_jobs: int = 1,
    n_iterations: Optional[int] = None,
    mode: LeastCoreMode = MonteCarlo,
    non_negative_subsidy: bool = False,
    solver_options: Optional[dict] = None,
    progress: bool = False,
    **kwargs,
) -> ValuationResult

Umbrella method to compute Least Core values with any of the available algorithms.

See Data valuation for an overview.

The following algorithms are available. Note that the exact method can only work with very small datasets and is thus intended only for testing.

PARAMETER DESCRIPTION
u

Utility object with model, data, and scoring function

TYPE: Utility

n_jobs

Number of jobs to run in parallel. Only used for Monte Carlo Least Core.

TYPE: int DEFAULT: 1

n_iterations

Number of subsets to sample and evaluate the utility on. Only used for Monte Carlo Least Core.

TYPE: Optional[int] DEFAULT: None

mode

Algorithm to use. See LeastCoreMode for available options.

TYPE: LeastCoreMode DEFAULT: MonteCarlo

non_negative_subsidy

If True, the least core subsidy \(e\) is constrained to be non-negative.

TYPE: bool DEFAULT: False

solver_options

Optional dictionary of options passed to the solvers.

TYPE: Optional[dict] DEFAULT: None

RETURNS DESCRIPTION
ValuationResult

Object with the computed values.

New in version 0.5.0

Source code in src/pydvl/value/least_core/__init__.py
def compute_least_core_values(
    u: Utility,
    *,
    n_jobs: int = 1,
    n_iterations: Optional[int] = None,
    mode: LeastCoreMode = LeastCoreMode.MonteCarlo,
    non_negative_subsidy: bool = False,
    solver_options: Optional[dict] = None,
    progress: bool = False,
    **kwargs,
) -> ValuationResult:
    """Umbrella method to compute Least Core values with any of the available
    algorithms.

    See [Data valuation][data-valuation] for an overview.

    The following algorithms are available. Note that the exact method can only
    work with very small datasets and is thus intended only for testing.

    - `exact`: uses the complete powerset of the training set for the constraints
      [combinatorial_exact_shapley()][pydvl.value.shapley.naive.combinatorial_exact_shapley].
    - `montecarlo`:  uses the approximate Monte Carlo Least Core algorithm.
      Implemented in [montecarlo_least_core()][pydvl.value.least_core.montecarlo.montecarlo_least_core].

    Args:
        u: Utility object with model, data, and scoring function
        n_jobs: Number of jobs to run in parallel. Only used for Monte Carlo
            Least Core.
        n_iterations: Number of subsets to sample and evaluate the utility on.
            Only used for Monte Carlo Least Core.
        mode: Algorithm to use. See
            [LeastCoreMode][pydvl.value.least_core.LeastCoreMode] for available
            options.
        non_negative_subsidy: If True, the least core subsidy $e$ is constrained
            to be non-negative.
        solver_options: Optional dictionary of options passed to the solvers.

    Returns:
        Object with the computed values.

    !!! tip "New in version 0.5.0"
    """

    if mode == LeastCoreMode.MonteCarlo:
        # TODO fix progress showing in remote case
        progress = False
        if n_iterations is None:
            raise ValueError("n_iterations cannot be None for Monte Carlo Least Core")
        return montecarlo_least_core(  # type: ignore
            u=u,
            n_iterations=n_iterations,
            n_jobs=n_jobs,
            progress=progress,
            non_negative_subsidy=non_negative_subsidy,
            solver_options=solver_options,
            **kwargs,
        )
    elif mode == LeastCoreMode.Exact:
        return exact_least_core(
            u=u,
            progress=progress,
            non_negative_subsidy=non_negative_subsidy,
            solver_options=solver_options,
        )

    raise ValueError(f"Invalid value encountered in {mode=}")