- system consists of multiple captioned query engines
- splits query into multiple sub-queries
- sub-queries are routed to query engine with most similar caption
- responses to sub-queries are later fused to answer query
- multiple captioned query engines allow a finer-grained distinction of information than just using metadata
- generated faulty versions of text with typical fault patterns
- deletion/transposition/insertion/substitution typos ("und" -> "nd", "udn", "undd", "umd")
- keyboard layout based ("n" is closer to "m" so it is more likely to be a substitute)
- phonetics based ("Maier" -> "Meier")
- ocr based ("OvvISB`" -> "0w!58'")
- trained a sentence classifier (original text vs. typo text), used word-level gradcam to deduce word-level predictions
- highly related to k-means, but with supervised guidance
- given: few labeled data points, many unlabeled data points (from approx. same distribution)
- perform clustering, align clusters to respect class labels, and predict according to the majority class per cluster
- objective function = mean distance to cluster center + $\alpha \cdot$ class impurity per cluster
1. initiate cluster centers
2. unsupervised: assign data points to the closest cluster center
3. supervised: move a labeled data point to another cluster, if that minimizes the objective function
4. cluster_center = mean(cluster)
- repeat 2. - 4. until no improvement in the objective function is achieved (see the sketch below)
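A minimal numpy sketch of the procedure above, assuming Euclidean distance, integer class labels, and an impurity term that counts labeled points disagreeing with their cluster's majority class (all helper names are hypothetical):

```python
import numpy as np

def objective(X, labels_known, assign, centers, alpha):
    """mean distance to cluster center + alpha * class impurity per cluster."""
    dists = np.linalg.norm(X - centers[assign], axis=1)
    impurity = 0.0
    for c in range(len(centers)):
        y_c = [y for i, y in labels_known.items() if assign[i] == c]
        if y_c:  # fraction of labeled points disagreeing with the cluster's majority class
            impurity += 1.0 - np.bincount(y_c).max() / len(y_c)
    return dists.mean() + alpha * impurity

def semi_supervised_kmeans(X, labels_known, k, alpha=1.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 1. initiate cluster centers
    best = np.inf
    for _ in range(n_iter):
        # 2. unsupervised: assign data points to the closest cluster center
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # 3. supervised: move a labeled point to another cluster if that lowers the objective
        for i in labels_known:
            for c in range(k):
                candidate = assign.copy()
                candidate[i] = c
                if objective(X, labels_known, candidate, centers, alpha) < \
                        objective(X, labels_known, assign, centers, alpha):
                    assign = candidate
        # 4. cluster_center = mean(cluster); keep old center if a cluster runs empty
        centers = np.stack([X[assign == c].mean(axis=0) if (assign == c).any() else centers[c]
                            for c in range(k)])
        score = objective(X, labels_known, assign, centers, alpha)
        if score >= best:   # repeat 2.-4. until no improvement in the objective
            break
        best = score
    return assign, centers
```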
- bandwagon effect: LLM favors following (fictional) majority vote presented in prompt
- improved tweet classification by adding guidance from XGBoost to exploit bandwagon effect
- XGBoost trained on embedded tweets
- also tested: setting a fixed fictional guidance to always favor positive/negative class doesn't lead to better recall/precision
- optuna hyperparameter optimization
- optimizes a trial score (validation loss, validation accuracy, ...) over $n$ trials, each trial being a run with a certain hyperparameter set
- narrows the hyperparameter search space based on past trial scores (focuses on regions that lead to better scores)
- aborts unpromising trials via early stopping
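A minimal optuna sketch along these lines; the search space and the `train_one_epoch_and_validate` helper are placeholders:

```python
import optuna

def objective(trial):
    # sample a hyperparameter set for this trial
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)

    val_loss = None
    for epoch in range(20):
        val_loss = train_one_epoch_and_validate(lr, n_layers, epoch)  # placeholder
        trial.report(val_loss, step=epoch)    # report the intermediate trial score
        if trial.should_prune():              # abort unpromising trials early
            raise optuna.TrialPruned()
    return val_loss                           # trial score to be minimized

study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)        # sampler narrows the search space over trials
print(study.best_params)
```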
- implementation of learning rate range test (lrrt)
- stable algorithm for determining learning rates (and other hyperparameters) along a range of training batches
- naive comparison between initial and last batch loss can fail to detect best lr due to variance in the batch losses
- define a set of lr candidates
- train from the same checkpoint on few batches with each lr candidate
- fit a line through the batch losses for each lr candidate
- return the lr candidate with the steepest negative line slope
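A numpy sketch of the line-fit idea, assuming a hypothetical `run_batches(lr, n_batches)` helper that trains from the same checkpoint and returns the per-batch losses:

```python
import numpy as np

def lr_range_test(lr_candidates, run_batches, n_batches=50):
    """Return the lr candidate whose batch-loss curve has the steepest negative slope."""
    slopes = {}
    for lr in lr_candidates:
        losses = run_batches(lr, n_batches)          # train from the same checkpoint
        steps = np.arange(len(losses))
        slope, _ = np.polyfit(steps, losses, deg=1)  # fit a line through the batch losses
        slopes[lr] = slope
    return min(slopes, key=slopes.get)               # steepest negative slope wins

# usage: best_lr = lr_range_test([1e-4, 1e-3, 1e-2, 1e-1], run_batches)
```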
- implementation of a multihead resnet
- classification head classifies cotton plants (healthy, powdery mildew, aphids, army worm, bacterial blight, target spot)
- embedding head creates a 2d latent space using the TripletMarginLoss on triplets of data points:
- $a$ = anchor (embedding of a data point)
- $p$ = positive (embedding of a data point of the same class as $a$)
- $n$ = negative (embedding of a data point of a different class than $a$)
- $L(a, p, n) = \max(d(a, p) - d(a, n) + \alpha, 0)$ (with $d$ being a distance, $\alpha$ a desired margin)
- learns to fulfill $d(a, p) + \alpha < d(a, n)$
- hard triplet mining
- find $p$ so that $p$ is the most different embedding to $a$ of the same class
- find $n$ so that $n$ is the most similar embedding to $a$ of a different class
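A PyTorch sketch of this objective plus a simple batch-hard mining step (margin, distance, and tensor shapes are assumptions):

```python
import torch

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0, p=2)  # alpha = 1.0, d = L2 distance

def batch_hard_triplets(embeddings, labels):
    """For each anchor: hardest positive (most distant, same class) and
    hardest negative (closest, different class) within the batch."""
    dist = torch.cdist(embeddings, embeddings, p=2)       # pairwise distances
    same = labels[:, None] == labels[None, :]
    pos_dist = dist.masked_fill(~same, float("-inf"))     # consider only same-class pairs
    neg_dist = dist.masked_fill(same, float("inf"))       # consider only different-class pairs
    p_idx = pos_dist.argmax(dim=1)                        # hardest positive per anchor
    n_idx = neg_dist.argmin(dim=1)                        # hardest negative per anchor
    return embeddings, embeddings[p_idx], embeddings[n_idx]

# usage inside a training step:
# a, p, n = batch_hard_triplets(embedding_head_output, batch_labels)
# loss = triplet_loss(a, p, n)
```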
- implementation of a MIMO (Multi-Input Multi-Output) Ensemble
- implicit ensemble that learns independent subnetworks within one neural network
- exploits network capacity
- M ensemble predictions with a single forward pass
- slightly higher time and space complexity (less than 1%), but can converge to independent subnetworks with decorrelated errors/high disagreement
- M ensemble predictions allow uncertainty measure
- MIMO paper: https://openreview.net/pdf?id=OGg9XnKxFAH
- cifar10 preprocessing for MIMO ensembles
- my presentation slides about the MIMO paper
- my seminar paper reviewing the MIMO paper
- implementation of a monte carlo dropout CNN on MNIST
- drops out certain activations not only during training but also during inference
- multiple forward passes create ensemble predictions that can be averaged to increase the generalization ability
- mc dropout paper: https://arxiv.org/pdf/1506.02142.pdf
- performed 10 runs to compare monte carlo ensembles of different size with a normal dropout baseline
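A PyTorch sketch of the inference step: dropout layers stay active and several stochastic forward passes are averaged (model and number of passes are placeholders):

```python
import torch

def mc_dropout_predict(model, x, n_passes=20):
    """Monte Carlo dropout: average softmax outputs over several stochastic passes."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()                                   # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_passes)])
    mean_probs = probs.mean(dim=0)                      # ensemble prediction
    uncertainty = probs.std(dim=0)                      # disagreement across passes
    return mean_probs, uncertainty
```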
- masked language modeling (mlm) with bert
- texts are split into tokens ((sub-) words)
- each token is masked with a certain probability (usually 15%)
- model "fills the gaps" with tokens (simple classification to check if predicted token is correct)
- rotation detection with rezero-cnn
- images are rotated by 0, 90, 180, 270 degrees, model predicts respective class (4 class classification)
- detecting that a truck is rotated by 90 degrees demands basic knowledge about the concept "truck"
- carlini wagner attack (targeted attack)
- target class $t$: flamingo
- change $x$ using gradient descent so that the target probability is at least $\kappa$ bigger than the second biggest probability
- makes $x$ and $x_0$ more similar to each other, if the softmax output is of the desired form
- carlini wagner criterion: $\max(-\kappa, \underset{j\neq t}{\max}(p_j)-p_t) + ||x-x_0||^2_2$
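A PyTorch sketch of minimizing this criterion with $p$ taken as softmax probabilities; step size, iteration count, and the omission of box constraints are simplifying assumptions:

```python
import torch

def carlini_wagner_attack(model, x0, target, kappa=0.1, steps=200, lr=0.01):
    """Minimize max(-kappa, max_{j != t} p_j - p_t) + ||x - x0||_2^2 via gradient descent."""
    x = x0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        p = torch.softmax(model(x), dim=-1).squeeze(0)
        mask = torch.ones_like(p, dtype=torch.bool)
        mask[target] = False                                   # exclude the target class
        margin = torch.clamp(p[mask].max() - p[target], min=-kappa)
        loss = margin + ((x - x0) ** 2).sum()                  # similarity term ||x - x0||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return x.detach()
```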
- fast gradient sign method (untargeted attack)
- goal: create $x_{fgsm}$ that is close to $x$ and leads to misclassification
- $x_{fgsm}=x - sign(\frac{\partial f(x)_{y}}{\partial x}) \cdot \epsilon$
- $sign(\frac{\partial f(x)_{y}}{\partial x})$: direction in which the score for class $y$ increases
- strong perturbations can make $x_{fgsm}$ OOD and can lead to an even higher class score, because the gradient is only a local approximation
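A PyTorch sketch of the update above for a single image `x` with true label `y` (epsilon is an assumption):

```python
import torch

def fgsm(model, x, y, epsilon=0.03):
    """x_fgsm = x - epsilon * sign(d f(x)_y / d x): step against the true-class score."""
    x = x.clone().requires_grad_(True)
    score_y = model(x)[0, y]                # score of the true class y
    score_y.backward()
    x_fgsm = x - epsilon * x.grad.sign()    # move in the direction that decreases the class score
    return x_fgsm.detach()
```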
- detecting litter objects on forest floor
- created data set
- took photos of the forest floor
- most photos contain at least one litter object (plastic, metal, paper, glass)
- annotated litter objects with bounding boxes (corner coordinates)
- photos contain benign confounders, i.e. natural objects that are easily confused with litter (reflecting puddles, colorful blossoms and berries, ...)
- annotated data is available on https://www.kaggle.com/datasets/milankalkenings/litter-on-forest-floor-object-detection
- fine tuned Faster R-CNN (pretrained on COCO)
- semi supervised training with cross entropy and unsupervised support
- unsupervised support: loss functions that can be calculated on unlabeled data points
- stability loss: $\lambda \, d(f(x), f(x_{aug}))$ favors similar softmax outputs for $n$ augmented versions of the same data point
- risk: the trivial solution is to always predict the same vector
- also called consistency regularization, because it biases the model towards similar softmax outputs and thus a bigger training error
- mutual exclusivity loss: favors low-entropy softmax outputs
- leads to decision boundary through low-density regions in feature space
- prevents trivial solution for stability loss
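A PyTorch sketch of the two unsupervised support terms; choosing mean squared error as the distance $d$ is an assumption:

```python
import torch
import torch.nn.functional as F

def stability_loss(model, x, x_aug, lam=1.0):
    """lambda * d(f(x), f(x_aug)): favors similar softmax outputs for augmented versions."""
    p = torch.softmax(model(x), dim=-1)
    p_aug = torch.softmax(model(x_aug), dim=-1)
    return lam * F.mse_loss(p, p_aug)

def mutual_exclusivity_loss(model, x):
    """Favors low-entropy softmax outputs, preventing the trivial constant prediction."""
    p = torch.softmax(model(x), dim=-1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1)
    return entropy.mean()

# semi-supervised objective on a batch with labeled (x_l, y_l) and unlabeled x_u, x_u_aug:
# loss = F.cross_entropy(model(x_l), y_l) \
#        + stability_loss(model, x_u, x_u_aug) \
#        + mutual_exclusivity_loss(model, x_u)
```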
- implementation of a (small) unet architecture
- U-Nets have two main components:
- down: spatial resolution $\downarrow$, channel resolution $\uparrow$. Creates a dense input representation
- up: spatial resolution $\uparrow$, channel resolution $\downarrow$. Output is often of the same (spatial) resolution as the down-input.
- skip-connections (concatenation) between up and down blocks of the same resolution improve gradient flow to early layers
- pretraining of the down part with image classification using a classification head
- fine tuning on image segmentation data in two stages:
- adjusting upwards part with frozen pretrained downwards part
- end-to-end fine tuning of the downwards part and the upwards part
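A minimal PyTorch sketch of such a small U-Net: one down block, one up block, and a skip connection via concatenation (channel sizes and depth are assumptions, not the repo's exact architecture):

```python
import torch
import torch.nn as nn

class SmallUNet(nn.Module):
    def __init__(self, in_channels=3, n_classes=2):
        super().__init__()
        # down: spatial resolution decreases, channel resolution increases
        self.down = nn.Sequential(nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        # up: spatial resolution increases, channel resolution decreases
        self.upsample = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.up = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())  # 64 = 32 skip + 32 up
        self.head = nn.Conv2d(32, n_classes, kernel_size=1)                  # per-pixel class logits

    def forward(self, x):
        d = self.down(x)                       # same spatial size, 32 channels
        b = self.bottleneck(self.pool(d))      # half spatial size, 64 channels
        u = self.upsample(b)                   # back to input spatial size, 32 channels
        u = self.up(torch.cat([d, u], dim=1))  # skip connection via concatenation
        return self.head(u)                    # same spatial resolution as the input
```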
- given: few labeled data points, many unlabeled data points (from approx. same distribution)
- iteratively add semi-supervised labels to the unlabeled data points (see the sketch below)
1. train the model on the labeled training set
2. predict labels of the unlabeled data points
3. add the data point(s) with the most confident prediction(s) to the labeled training set
- repeat 1. to 3. until no improvement is achieved on validation data
- possible in the transductive setting (treat test data points as unlabeled training data points)
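A sketch of this loop with an sklearn-style classifier, assuming integer class labels and a simple validation-based stopping check:

```python
import numpy as np

def self_training(model, X_lab, y_lab, X_unlab, X_val, y_val, per_round=10):
    """Iteratively move the most confidently predicted unlabeled points into the training set."""
    best_val = -np.inf
    while len(X_unlab) > 0:
        model.fit(X_lab, y_lab)                      # 1. train on the labeled training set
        val = model.score(X_val, y_val)
        if val <= best_val:                          # stop when validation stops improving
            break
        best_val = val
        probs = model.predict_proba(X_unlab)         # 2. predict the unlabeled data points
        conf = probs.max(axis=1)
        top = np.argsort(conf)[-per_round:]          # 3. most confident predictions
        X_lab = np.concatenate([X_lab, X_unlab[top]])
        y_lab = np.concatenate([y_lab, probs[top].argmax(axis=1)])
        X_unlab = np.delete(X_unlab, top, axis=0)
    return model
```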
- training an autoencoder on MNIST and CIFAR100
- if autoencoder is trained to reconstruct instances of data set X, it is likely to achieve good results on reconstructing instances of data set Y, if X and Y are similar enough.
- pretraining an encoder within an autoencoder, and later using it as a feature extractor in a classifier, can speed up the training process, because the encoder already learned how to extract general features in the given data
- a well trained autoencoder can be used to generate new data points that still contain the data signal, but add further noise (similar to data augmentation)
- training a variational autoencoder on MNIST
- learns the reparameterization $encoding=\mu + \sigma \epsilon$, with $\epsilon \sim \mathcal{N}(0,1)$, $\mu=linear(encoding_{raw})$, $\sigma=\exp(linear(encoding_{raw}))$
- visualization of PCA-reduced latent space
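A PyTorch sketch of this reparameterization step (layer dimensions are assumptions):

```python
import torch
import torch.nn as nn

class Reparameterization(nn.Module):
    def __init__(self, raw_dim=128, latent_dim=2):
        super().__init__()
        self.mu_layer = nn.Linear(raw_dim, latent_dim)
        self.log_sigma_layer = nn.Linear(raw_dim, latent_dim)

    def forward(self, encoding_raw):
        mu = self.mu_layer(encoding_raw)                        # mu = linear(encoding_raw)
        sigma = torch.exp(self.log_sigma_layer(encoding_raw))   # sigma = exp(linear(encoding_raw))
        epsilon = torch.randn_like(sigma)                       # epsilon ~ N(0, 1)
        return mu + sigma * epsilon                             # encoding = mu + sigma * epsilon
```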
- fake news detection (feature engineering, random forest feature importance & selection)
- feature engineering
- number of words (title and body)
- number of exclamation marks (title and body)
- number of question marks (title and body)
- lexical diversity (title and body)
- $\frac{\text{number of title words}}{\text{number of title words} + \text{number of body words}}$
- random forest feature importance and respective feature selection
- good results can already be achieved by using only 1 feature
- further sklearn mechanics used (stacking ensemble, gridsearch)
- co2 emission time series forecasting for rwanda based on chem-sensor data at varying locations
- discretization and one-hot encoding of numerical data
- catboost feature importance and respective feature selection
- optuna hyperparameter tuning
- reptile meta learning:
1. sample a small batch of tasks from the task set
2. for each task, train a copy of the meta model on the sampled task for a few iterations
3. update the meta model with the average of the parameter updates of the copy models
- repeat 1. - 3. until the meta model performs well over all tasks (see the sketch below)
- reptile pretraining improves few-shot results on data and tasks similar enough to the reptile tasks
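A PyTorch-style sketch of one reptile meta update; the task API (`task.sample_batch()`), inner optimizer, and step sizes are assumptions:

```python
import copy
import torch
import torch.nn.functional as F

def reptile_step(meta_model, tasks, inner_steps=5, inner_lr=1e-3, meta_lr=0.1, n_tasks=4):
    """One meta update: average the parameter changes of task-specific copy models."""
    sampled = [tasks[i] for i in torch.randperm(len(tasks))[:n_tasks]]   # 1. sample tasks
    deltas = [torch.zeros_like(p) for p in meta_model.parameters()]
    for task in sampled:
        copy_model = copy.deepcopy(meta_model)                           # 2. train a copy per task
        opt = torch.optim.SGD(copy_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            x, y = task.sample_batch()                                   # placeholder task API
            loss = F.cross_entropy(copy_model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        for d, p_copy, p_meta in zip(deltas, copy_model.parameters(), meta_model.parameters()):
            d += (p_copy.detach() - p_meta.detach()) / n_tasks           # average parameter update
    with torch.no_grad():
        for p_meta, d in zip(meta_model.parameters(), deltas):           # 3. move the meta model
            p_meta += meta_lr * d
```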
- visualization of positional encodings as used in Transformers
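A numpy sketch of the sinusoidal encodings being visualized, following the standard Transformer definition (sequence length and model dimension are placeholders):

```python
import numpy as np

def positional_encoding(max_len=100, d_model=64):
    """pe[pos, 2i] = sin(pos / 10000^(2i/d_model)), pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # e.g. visualize with plt.imshow(positional_encoding())
```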
- implementation of regular class activation maps (cam)
- cams show the importance of the individual input elements for the activation of the respective class
- cams are based on the gradient of a class activation w.r.t. the input, $\frac{\partial f_\theta(x)_{class}}{\partial x}$
- cams can e.g. be used for
- inferring weakly supervised labels (here: bounding boxes from classification labels)
- model debugging
- deducing model design decisions
- implementation of smoothgrad
- better cams by averaging over gradients for $n$ noisy versions of the input
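A PyTorch sketch of this averaging over input gradients; the noise level and number of samples are assumptions:

```python
import torch

def smoothgrad(model, x, target_class, n=25, sigma=0.1):
    """Average the class-score gradient w.r.t. n noisy copies of the input."""
    grads = torch.zeros_like(x)
    for _ in range(n):
        x_noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(x_noisy)[0, target_class]   # activation of the target class
        score.backward()
        grads += x_noisy.grad / n
    return grads                                   # smoothed saliency map
```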
- implementation of guided cam
- better cams by only propagating positive gradients back
- gradually masking out one object decreases its class score
- semi-supervised versatile training data
- generated using text block augmented LLM prompts
- substrings are used to annotate text
- allows detecting names in german natural language (probably not useful without grammatical structures)
- fine tuned bert
- evaluation set is formed as sampled collection of edge cases
- pistachio type prediction, 4 initially labeled instances, model queries the next label per policy:
- baseline: label a random unlabeled instance
- min_confidence: label unlabeled instance that has lowest predicted max. class confidence (approximation of prediction entropy)
- informative initially labeled instances are chosen as training data cluster centers (approximately represent the training data best; exploration paradigm)
- min_confidence policy leads to a higher val accuracy mean and a smaller val accuracy std when informative initially labeled instances are used (otherwise even worse than random)
- weakly supervised method:
- data instances come in bags
- only the labels of few instances are known
- predictions are made on bag level
- binary classification: bag is of positive class, if it contains a "2", else bag is of negative class
- attention pooling can be used to deduce instance-level predictions
- model pays high attention to instances of positive class (even to those that are not labeled)
- mil-attention-pooling paper: http://proceedings.mlr.press/v80/ilse18a/ilse18a.pdf
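A PyTorch sketch of attention pooling over a bag of instance embeddings, following the attention form from the linked paper (embedding and attention dimensions are assumptions):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pools a bag of instance embeddings into one bag embedding; the attention
    weights can be read out as instance-level relevance."""
    def __init__(self, embed_dim=128, attn_dim=64):
        super().__init__()
        self.V = nn.Linear(embed_dim, attn_dim)
        self.w = nn.Linear(attn_dim, 1)

    def forward(self, instances):                       # instances: (n_instances, embed_dim)
        scores = self.w(torch.tanh(self.V(instances)))  # (n_instances, 1)
        attention = torch.softmax(scores, dim=0)        # one weight per instance
        bag_embedding = (attention * instances).sum(dim=0)
        return bag_embedding, attention.squeeze(-1)

# a bag-level classifier head then predicts from bag_embedding
# (binary here: does the bag contain a "2"?)
```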