---
theme: default
title: Data valuation for machine learning
info: |
  ## A primer on data valuation and attribution
  Some examples of how to attribute data sources and how to value data in your projects
  using pyDVL.
  Learn more at [pydvl.org](https://pydvl.org)
class: text-center
highlighter: shiki
drawings:
transition: slide-left
mdc: true
themeConfig:
hideInToc: true
---
# Detecting mislabelled and out-of-distribution samples with pyDVL

Miguel de Benito Delgado - Kristof Schröder
---
title: What is data valuation?
level: 1
layout: two-cols-header
class: self-center text-center p-6
transition: fade-out
---
the contribution of a training point to...
or
::left::
the overall model performance
("global" methods: Data Shapley & co.)
::right::
a single prediction
("local" methods: influences)
```
utility(some_data) := model.fit(some_data).score(validation)
```
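For intuition, a hand-rolled version of such a utility for scikit-learn might look like this (illustrative only: synthetic data and a plain function, not pyDVL's `Utility` class):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative sketch: a utility is "retrain on a subset, score on validation".
x, y = make_classification(n_samples=500, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x, y, random_state=42)

def utility(indices):
    """Validation accuracy after training on the given subset of points."""
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train[indices], y_train[indices])
    return model.score(x_val, y_val)
```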
Take one training point
::left::
```python
score_with = u(train)
score_without = u(train.drop(x))
value = score_with - score_without
```
low signal
::right::
```python
for subset in sampler.from_data(train.drop(x)):
    scores_with.append(u(subset.union({x})))
    scores_without.append(u(subset))
value = weighted_mean(scores_with - scores_without, coefficients)
```
- Top Hits Spotify from 2000-2019<sup>1</sup>
- Predict song popularity with a `GradientBoostingRegressor`
- Compute values for all training points
- Drop low-valued ones
| Data dropped | MAE improvement |
|---|---|
| 10% | 8% (± 2%) |
| 15% | 10% (± 3%) |
::right::
Three steps
<!-- First example -->
```python {none|1-2|3-4|5-7|all}
train, val, test = load_spotify_dataset(...)
model = GradientBoostingRegressor(...)
scorer = SupervisedScorer("neg_mean_absolute_error", val)
utility = Utility(model, scorer)
valuation = DataShapleyValuation(utility, ...)
with joblib.parallel_backend("loky", n_jobs=16):
    valuation.fit(train)
```
```python {2,3}
train, val, test = load_data()
model = AnyModel()
scorer = CustomScorer(val)
utility = Utility(model, scorer)
valuation = DataShapleyValuation(utility, ...)
with joblib.parallel_backend("loky", n_jobs=16):
    valuation.fit(train)
```
```python {5}
train, val, test = load_data()
model = AnyModel()
scorer = CustomScorer(val)
utility = Utility(model, scorer)
valuation = AnyValuationMethod(utility, ...)
with joblib.parallel_backend("loky", n_jobs=16):
    valuation.fit(train)
```
```python {6,7}
train, val, test = load_data()
model = AnyModel()
scorer = CustomScorer(val)
utility = Utility(model, scorer)
valuation = AnyValuationMethod(utility, ...)
with joblib.parallel_backend("ray", n_jobs=480):
    valuation.fit(train)
```
and
```python
values = valuation.values(sort=True)
clean_data = data.drop_indices(values[:100].indices)
model.fit(clean_data)
assert model.score(test) > 1.02 * previous_score
```
::left::
- We increased accuracy by removing bogus points
- Better: select data for inspection
- Data debugging: what's wrong with this data?
- Model debugging: why are these data detrimental?
::right::
- Data acquisition: prioritize data sources
- Attribution: find the most important data points
- Continual learning: compress your dataset
- Data markets: price your data
- Improve fairness metrics
- ...
- Any scikit-learn model
- Or a wrapper with a `fit()` method
- A scoring function
- An imperfect dataset
::right::
- `numpy` and `sklearn`
- `joblib` for parallelization
- `memcached` for caching
- Influence Functions use `pytorch`
- Planned: allow `jax` and `torch` everywhere
- `dask` for large datasets
- Computational cost
- Has my approximation converged?
- Consistency across runs
- Model and metric dependence
::right::
- Monte-Carlo approximations
- Efficient subset sampling strategies
- Proxy models (value transfer, sketched below)
- Model-specific methods (KNN-Shap, Data-OOB, ...)
- Utility learning (YMMV)
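For example, value transfer could look like this schematic sketch, reusing the API from the slides above (the proxy model and the number of dropped points are made up):

```python
# Hypothetical sketch of value transfer: compute values with a cheap proxy
# model, then clean the data before fitting the expensive one.
proxy_utility = Utility(LogisticRegression(), scorer)    # cheap proxy
proxy_valuation = DataShapleyValuation(proxy_utility, ...)
proxy_valuation.fit(train)

values = proxy_valuation.values(sort=True)               # as on the earlier slide
clean_train = train.drop_indices(values[:100].indices)   # drop lowest-valued points

expensive_model.fit(clean_train)                         # the real, costly model
```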
::left::

| Data | Test loss |
|---|---|
| (... train ...) | |
| (... train ...) | |

The "influence" of $z_i$ on $z$
::right::
- One value per training / test point pair $(z_i, z)$
- A full retraining per training point!
- However (see the sketch below): $$I(z_i, z) = \nabla_\theta L(z)^\top \cdot H_\theta^{-1} \cdot \nabla_\theta L(z_i)$$
- Implicit computation and approximations
- Are they good?
- Does it matter?
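A minimal sketch of this formula on a toy linear model with an explicit Hessian (synthetic data and squared-error loss are assumptions; in this direct form it is only feasible for a handful of parameters):

```python
import torch

# Toy setup: linear model z @ w with squared-error loss, synthetic data.
torch.manual_seed(0)
z_train = torch.randn(10, 3); y_train = torch.randn(10)
z_test = torch.randn(3);      y_test = torch.randn(())
w = torch.randn(3)            # stand-in for the trained parameters theta

def train_loss(params):
    return ((z_train @ params - y_train) ** 2).mean()

def point_grad(z, y, params):
    params = params.clone().requires_grad_(True)
    loss = (z @ params - y) ** 2
    return torch.autograd.grad(loss, params)[0]

H = torch.autograd.functional.hessian(train_loss, w)   # explicit 3x3 Hessian
g_test = point_grad(z_test, y_test, w)                 # gradient of L at z
g_i = point_grad(z_train[0], y_train[0], w)            # gradient of L at z_i
influence = g_test @ torch.linalg.solve(H, g_i)        # the formula above
```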
::left::
- NIH dataset with ~28K images for malaria screening<sup>1</sup>
- Goal: detect mislabelled samples with pyDVL
::right::
```python {hide|1-4|5|7-8|10|all|4,7}
torch_model = ...  # Trained model
train, test = ...  # Dataloaders

if_model = DirectInfluence(torch_model, loss, ...)
if_model.fit(train)

if_calc = SequentialInfluenceCalculator(if_model)
lazy_values = if_calc.influences(test, train)

values = lazy_values.to_zarr(path, ...)  # memmapped
```
```python {4,7}
torch_model = ...  # Trained model
train, test = ...  # Dataloaders

if_model = ArnoldiInfluence(torch_model, loss, ...)
if_model.fit(train)

if_calc = SequentialInfluenceCalculator(if_model)
lazy_values = if_calc.influences(test, train)

values = lazy_values.to_zarr(path, ...)  # memmapped
```
```python {4,7}
torch_model = ...  # Trained model
train, test = ...  # Dataloaders

if_model = NystroemSketchInfluence(torch_model, loss, ...)
if_model.fit(train)

if_calc = SequentialInfluenceCalculator(if_model)
lazy_values = if_calc.influences(test, train)

values = lazy_values.to_zarr(path, ...)  # memmapped
```
```python {4,7}
torch_model = ...  # Trained model
train, test = ...  # Dataloaders

if_model = NystroemSketchInfluence(torch_model, loss, ...)
if_model.fit(train)

if_calc = DaskInfluenceCalculator(if_model)
lazy_values = if_calc.influences(test, train)

values = lazy_values.to_zarr(path, ...)  # memmapped
```
(Plus CG, LiSSA, E-KFAC, ...)
::left::
- Compute all pairs of influences
- For each training point: 25th percentile of its influences over test points with the same label (see the sketch below)
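Schematically, the heuristic might be implemented like this (random stand-in data; the array layout `influences[i, j]` = influence of training point `i` on test point `j` is an assumption, adapt it to your output):

```python
import numpy as np

# Stand-in data in place of the values computed above.
rng = np.random.default_rng(0)
influences = rng.normal(size=(100, 50))     # (n_train, n_test), assumed layout
train_labels = rng.integers(0, 2, size=100)
test_labels = rng.integers(0, 2, size=50)

scores = np.empty(len(train_labels))
for i, yi in enumerate(train_labels):
    same_label = influences[i, test_labels == yi]
    scores[i] = np.percentile(same_label, 25)   # low score -> suspicious

suspects = np.argsort(scores)[:10]              # inspect these points first
```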
::right::
Cells labelled as parasitized

::left::
- Computational complexity: $H_\theta^{-1} \nabla_\theta L$
- Memory complexity: how many gradients fit on the device?
::right::
- Approximation of the inverse Hessian vector product
- Parallelization
- Out-of-core computation
::bottom::
```python {1,3}
if_model = DirectInfluence(torch_model, loss, ...)
(...)
if_calc = SequentialInfluenceCalculator(if_model)
```
```python {1,3-5}
if_model = NystroemSketchInfluence(torch_model, loss, rank=10, ...)
(...)
client = Client(LocalCUDACluster())
if_calc = DaskInfluenceCalculator(if_model, client)
```
::left::
- Large models with costly retrainings
- `torch` interface
- Point to point valuation
::right::
- Smaller models
- `sklearn` interface
- Value over a test set
::bottom::
---
layout: two-cols
title: Thank you!
hideInToc: true
class: text-center table-center table-invisible p-6
---
Thank you for your attention!
slides and code:
::right::
PyDVL contributors
You!
It's a growing field<sup>1</sup>
- Fit before, during, or after training
- With or without reference datasets
- Specific to classification / regression / unsupervised
- Different model assumptions (from none to strong)
- Local and global valuation
Three steps for all valuation methods
::left::
- Prepare `Dataset` and `model`
- Choose `Scorer` and `Utility`
- Compute values (contribution to performance)
::right::
```python
train, test = Dataset.from_sklearn(load_iris(), train_size=0.6)
model = LogisticRegression()
scorer = SupervisedScorer("accuracy", test)
utility = Utility(model, scorer)
valuation = DataBanzhafValuation(
    utility, MSRSampler(), RankCorrelation()
)
with joblib.parallel_backend("ray", n_jobs=48):
    valuation.fit(train)
```