Machine Learning and Deep Learning now drive a wide range of products and applications that we use daily, from image editing software to self-driving cars. These applications often process different types of information, including audio, images, text, and sensor data. To build Deep Learning models that perform well, it is essential to combine all of these types of information during training. We refer to these different forms of data as "data modalities," and the deep learning models that make use of them are known as Multimodal Deep Learning models.
Like conventional deep learning models, Multimodal Deep Learning models also suffer from overconfidence. Overconfidence occurs when a model assigns excessively high probabilities to its predictions, even when they are incorrect. This can lead to catastrophic outcomes. For example, a confidently wrong prediction in a self-driving car can lead to injury or death of the passengers, as happened in 2016. To avoid such scenarios, we need to understand how confident deep learning models really are in their predictions. Uncertainty Quantification (UQ) serves this purpose and tries to quantify the uncertainties in the data and in the trained model.
Bayesian statistics largely distinguishes between two types of uncertainty: aleatoric and epistemic. Aleatoric uncertainty refers to the uncertainty inherent in the data and cannot be reduced by observing more data. For example, if we look at the image below, we can see that the two classes are mixed, and it is hard to infer what the label of a new point will be in the mixed regions. Adding more data will not make the classification easier.
Epistemic uncertainty, on the other hand, is the uncertainty of the model due to a lack of knowledge. For example, in the image above, we see that we do not have enough data points to confidently say which decision boundary is the best one. In contrast to aleatoric uncertainty, in this case adding more data points can provide additional information and hence reduce the epistemic uncertainty.
In Multimodal Deep Learning we can have more complex interactions between the uncertainties of the modalities. Modalities can carry complementary information, which reduces the uncertainties, or conflicting information, which can increase them.
In this blog post we will explore different uncertainty scenarios and measure the corresponding uncertainties on the LUMA multimodal dataset¹.
We are going to use the LUMA dataset, which allows us to inject different types of noise into each of the modalities and observe the changes in the uncertainties. The LUMA dataset comprises three modalities: audio, image and text. The image modality contains small 32×32 images of different objects. The audio modality contains pronunciations of the labels of these objects, and the text modality contains text passages about the objects. In total there are 50 classes, 42 of which are intended for model training and testing, and the other 8 are provided as out-of-distribution data.
First, we need to download and compile the dataset. For that, we go to our command-line interface (bash in my case) and run the following commands, which will clone the LUMA dataset compiler and noise injector:
git clone https://github.com/bezirganyan/LUMA.git
cd LUMA
Then, we need to install the dependencies by creating and activating a conda environment (make sure you have anaconda or miniconda installed):
conda env create -f environment.yml
conda activate luma_env
With all the dependencies in place, we can download the dataset into the data directory with:
git lfs install
git clone https://huggingface.co/datasets/bezirganyan/LUMA data
Finally, we can compile different dataset versions with different types and amounts of noise in each modality. To compile the default dataset (i.e. without additional noise), we run:
python compile_dataset.py
Now, the LUMA tool allows us to inject different types of noise:
- Sample Noise — This type of noise adds realistic noise to each of the modalities. For example, for the text modality, it can replace words with antonyms, add typos, spelling errors, etc. For the audio modality, it can add background conversations, typing noises, etc. And for the image modality, noises like blur, defocus, frost, etc., can be added.
- Label Noise — This type of noise randomly switches the labels of data samples to their closest classes, which increases the mixing between classes.
- Diversity — This controls how diverse the data points are. If we reduce the diversity, the data points will be more concentrated in the latent space, which means the models will have less information to work with.
- Out-of-distribution (OOD) samples — The LUMA dataset also provides us with OOD samples, i.e. samples that lie outside the training distribution. Ideally, the ML model should have high uncertainty on these kinds of samples, so that it does not make a confidently wrong decision on a distribution it has not seen before.
Let's inject these noises one at a time. To control the amount of noise, we can modify (or create) the configuration files in the cfg folder. However, there are already some preconfigured options available, which we will use. For sample noise, we can use the pre-defined configuration file cfg/noise_sample.yml. Specifically, we can pay attention to these lines in the configuration for each modality:
sample_noise:
add_noise_train: True
add_noise_test: True
They turn the sample noise on or off per modality. The lines immediately below them control the noise parameters and differ for each modality. For audio they look like this:
sample_noise:
add_noise_train: True
add_noise_test: True
noisy_data_ratio: 1
min_snr: 3
max_snr: 5
output_path: data/noisy_audio
where we can control the noisy data ratio (0.0–1.0), the minimum and maximum signal-to-noise ratio, and where to save the noisy audio files.
For text, they look like this:
sample_noise:
add_noise_train: True
add_noise_test: True
noisy_data_ratio: 1
noise_config:
KeyboardNoise:
aug_char_min: 1
aug_char_max: 5
aug_word_min: 3
aug_word_max: 8
BackTranslationNoise:
device: cuda # cuda or cpu
...
Here, you can specify noises from: KeyboardNoise, BackTranslationNoise, SpellingNoise, OCRNoise, RandomCharNoise, RandomWordNoise, AntonymNoise. The parameters for each noise can be found here.
Finally, for the image modality, the configuration looks like this:
sample_noise:
add_noise_train: True
add_noise_test: True
noisy_data_ratio: 1
output_path: data/noisy_images.pth
noise_config:
gaussian_noise:
severity: 4
shot_noise:
severity: 4
impulse_noise:
severity: 4
You can choose noises from: gaussian_noise, shot_noise, impulse_noise, defocus_blur, frosted_glass_blur, motion_blur, zoom_blur, snow, frost, fog, brightness, contrast, elastic, pixelate, jpeg_compression. For each of the noises, you can specify a severity parameter, which takes values from 1–5. Below you can see examples of the different noise types for images:
Then, we can compile the dataset with sample noise with:
python compile_dataset.py -c cfg/noise_sample.yml
You can of course use any other configuration file.
To add label noise, one only needs to change the label_switch_prob parameter for each modality; an example is given in cfg/noise_label.yml. Finally, for diversity, one needs to change the compactness parameter: the higher the compactness value, the less diverse the data will be. An example of this can be seen in cfg/noise_diversity.yml.
The OOD data for each generation is saved in a separate file specified in the configuration file.
We can use the dataset class from dataset.py to load the dataset in PyTorch.
from dataset import LUMADataset

train_audio_path = 'data/audio/datalist_train.csv'
train_text_path = 'data/text_data_train.tsv'
train_image_path = 'data/image_data_train.pickle'
train_audio_data_path = 'data/audio'

train_dataset = LUMADataset(train_image_path,
                            train_audio_path,
                            train_audio_data_path,
                            train_text_path)
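Before adding any transforms, a quick sanity check can confirm what the dataset returns. The snippet below is a minimal sketch and assumes, as the training code later in this post does, that each item is an (image, audio, text, label) tuple:

# Minimal sketch: peek at one raw sample (assumed to be an (image, audio, text, label) tuple)
image, audio, text, label = train_dataset[0]
print(type(image), type(audio), type(text), label)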
However, this will return raw texts, audio and images, which may not be very convenient to use in our models. Hence, we want to process these samples and convert them to more convenient formats before feeding them to the models. For audio, we want to convert the raw audio data to mel-spectrograms. For that we define a transform:
from torchvision.transforms import Compose
from torchaudio.transforms import MelSpectrogram
import torch


class PadCutToSizeAudioTransform():
    def __init__(self, size):
        self.size = size

    def __call__(self, audio):
        # pad short spectrograms with zeros, cut long ones along the time axis
        if audio.shape[-1] < self.size:
            audio = torch.nn.functional.pad(audio, (0, self.size - audio.shape[-1]))
        elif audio.shape[-1] > self.size:
            audio = audio[..., :self.size]
        return audio


audio_transform = Compose([MelSpectrogram(), PadCutToSizeAudioTransform(128)])
Here we use the MelSpectrogram transform, and then a custom transform to pad/cut the spectrograms to the same size for all samples.
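As a quick check of the transform (a sketch, assuming a mono waveform tensor of shape (1, num_samples) and the default MelSpectrogram settings with 128 mel bins), we can apply it to a dummy signal:

# Dummy 1-second waveform at 16 kHz; the real samples come from the LUMA audio files
dummy_waveform = torch.randn(1, 16000)
spec = audio_transform(dummy_waveform)
print(spec.shape)  # expected: torch.Size([1, 128, 128]) after padding the time axis to 128 frames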
For the text data, we choose to use average BERT embeddings for training. To do that we can extract the text features into a file, and then define a custom transform that loads the embeddings instead of the raw text:
import numpy as np

from data_generation.text_processing import extract_deep_text_features

extract_deep_text_features(train_text_path, output_path='text_features_train.npy')


class Text2FeatureTransform():
    def __init__(self, features_path):
        with open(features_path, 'rb') as f:
            self.features = np.load(f)

    def __call__(self, text, idx):
        # ignore the raw text and return the pre-extracted embedding for index idx
        return self.features[idx]


text_transform = Text2FeatureTransform('text_features_train.npy')
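Note that the transform receives both the raw text and the sample index; it ignores the text and looks up the pre-computed embedding. A small usage sketch (assuming standard 768-dimensional BERT-base features, which is also what the text classifier later in this post expects):

# The raw passage is ignored; only the index is used to fetch the stored embedding
embedding = text_transform("an example passage about a cat", 0)
print(embedding.shape)  # expected: (768,) for BERT-base features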
For the image modality, we will convert the images to tensors and normalize them:
from torchvision.transforms import ToTensor, Normalize

image_transform = Compose([
ToTensor(),
Normalize(mean=(0.51, 0.49, 0.44),
std=(0.27, 0.26, 0.28))
])
Finally, we apply these transforms by passing them to the dataset class:
train_dataset = LUMADataset(train_image_path, train_audio_path, train_audio_data_path, train_text_path,
text_transform=text_transform,
audio_transform=audio_transform,
image_transform=image_transform)
We can load the test and OOD data in a similar way. The final data loading procedure will be:
import numpy as np
import torch
from torchaudio.transforms import MelSpectrogram
from torchvision.transforms import Compose, Normalize, ToTensor

from data_generation.text_processing import extract_deep_text_features
from dataset import LUMADataset
train_audio_path = 'data/audio/datalist_train.csv'
train_text_path = 'data/text_data_train.tsv'
train_image_path = 'data/image_data_train.pickle'
audio_data_path = 'data/audio'

test_audio_path = 'data/audio/datalist_test.csv'
test_text_path = 'data/text_data_test.tsv'
test_image_path = 'data/image_data_test.pickle'

ood_audio_path = 'data/audio/datalist_ood.csv'
ood_text_path = 'data/text_data_ood.tsv'
ood_image_path = 'data/image_data_ood.pickle'
class PadCutToSizeAudioTransform():
    def __init__(self, size):
        self.size = size

    def __call__(self, audio):
        # pad short spectrograms with zeros, cut long ones along the time axis
        if audio.shape[-1] < self.size:
            audio = torch.nn.functional.pad(audio, (0, self.size - audio.shape[-1]))
        elif audio.shape[-1] > self.size:
            audio = audio[..., :self.size]
        return audio


class Text2FeatureTransform():
    def __init__(self, features_path):
        with open(features_path, 'rb') as f:
            self.features = np.load(f)

    def __call__(self, text, idx):
        # ignore the raw text and return the pre-extracted embedding for index idx
        return self.features[idx]
extract_deep_text_features(train_text_path, output_path='text_features_train.npy')
extract_deep_text_features(test_text_path, output_path='text_features_test.npy')
extract_deep_text_features(ood_text_path, output_path='text_features_ood.npy')
image_transform = Compose([
ToTensor(),
Normalize(mean=(0.51, 0.49, 0.44),
std=(0.27, 0.26, 0.28))
])
text_transform_train = Text2FeatureTransform('text_features_train.npy')
text_transform_test = Text2FeatureTransform('text_features_test.npy')
text_transform_ood = Text2FeatureTransform('text_features_ood.npy')
audio_transform = Compose([MelSpectrogram(), PadCutToSizeAudioTransform(128)])
train_dataset = LUMADataset(train_image_path, train_audio_path, audio_data_path, train_text_path,
text_transform=text_transform_train,
audio_transform=audio_transform,
image_transform=image_transform)
test_dataset = LUMADataset(test_image_path, test_audio_path, audio_data_path, test_text_path,
text_transform=text_transform_test,
audio_transform=audio_transform,
image_transform=image_transform)
ood_dataset = LUMADataset(ood_image_path, ood_audio_path, audio_data_path, ood_text_path,
text_transform=text_transform_ood,
audio_transform=audio_transform,
image_transform=image_transform)
To build the multimodal UQ model, we are going to use a recent multimodal approach based on evidential learning. Evidential deep learning³ is a technique that extends traditional deep learning models by not only making predictions but also providing a measure of uncertainty about those predictions. It leverages principles from Dempster–Shafer theory, a mathematical framework for evidence-based reasoning. This theory allows the model to combine different pieces of evidence to compute degrees of belief, rather than producing a single deterministic output. Instead of just giving a single answer, evidential learning outputs a range of possible answers together with the confidence level in each.
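To make the mechanics concrete, here is a small toy sketch (not part of the LUMA code) of the subjective-logic quantities typically used in evidential classification: non-negative evidences e_k are turned into Dirichlet parameters α_k = e_k + 1, and the total evidence S = Σ_k α_k determines both the class beliefs b_k = e_k / S and a vacuity-style uncertainty u = K / S:

import torch

# Toy 3-class example with made-up evidence values
evidence = torch.tensor([5.0, 1.0, 0.5])   # e_k >= 0, e.g. softplus of network outputs
alpha = evidence + 1                       # Dirichlet parameters
S = alpha.sum()                            # total evidence (Dirichlet strength)
belief = evidence / S                      # per-class belief masses
u = alpha.numel() / S                      # vacuity: high when little evidence is available
print(belief, u)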
Following the ideas presented by Xu et al. (2024), we are going to build evidential networks for each modality and combine them using their proposed conflictive opinion aggregation strategy (RCML⁴). The image classifier will hence look like this:
class ImageClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.3):
        super(ImageClassifier, self).__init__()
        self.image_model = torch.nn.Sequential(
            torch.nn.Conv2d(3, 32, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Conv2d(32, 64, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Flatten(),
        )
        self.classifier = torch.nn.Linear(64 * 6 * 6, num_classes)

    def forward(self, x):
        image, audio, text = x
        image = self.image_model(image.float())
        return self.classifier(image)
Similarly, the audio and text classifiers will be:
class AudioClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(AudioClassifier, self).__init__()
        self.audio_model = torch.nn.Sequential(  # input: batch_size x 1 x 128 x 128 spectrogram
            torch.nn.Conv2d(1, 32, 5),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Conv2d(32, 64, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Conv2d(64, 64, 3),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2),
            torch.nn.Dropout(dropout),
            torch.nn.Flatten()
        )
        self.classifier = torch.nn.Linear(64 * 14 * 14, num_classes)

    def forward(self, x):
        image, audio, text = x
        audio = self.audio_model(audio)
        return self.classifier(audio)
class TextClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(TextClassifier, self).__init__()
        self.text_model = torch.nn.Sequential(
            torch.nn.Linear(768, 512),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
            torch.nn.Linear(512, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(dropout),
        )
        self.classifier = torch.nn.Linear(256, num_classes)

    def forward(self, x):
        image, audio, text = x
        text = self.text_model(text)
        return self.classifier(text)
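Before fusing the modalities, it can be useful to verify that the three classifiers produce matching outputs. The sketch below feeds a dummy batch of two samples (32×32 RGB images, 1×128×128 spectrograms, and 768-dimensional text embeddings, matching the shapes assumed above):

num_classes = 42
dummy_batch = (torch.randn(2, 3, 32, 32),    # images
               torch.randn(2, 1, 128, 128),  # mel-spectrograms
               torch.randn(2, 768))          # BERT text features
print(ImageClassifier(num_classes)(dummy_batch).shape)  # torch.Size([2, 42])
print(AudioClassifier(num_classes)(dummy_batch).shape)  # torch.Size([2, 42])
print(TextClassifier(num_classes)(dummy_batch).shape)   # torch.Size([2, 42])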
Having these uni-modal classifiers, we will combine them into a multimodal network:
class MultimodalClassifier(torch.nn.Module):
    def __init__(self, num_classes, dropout=0.5):
        super(MultimodalClassifier, self).__init__()
        self.image_model = ImageClassifier(num_classes, dropout)
        self.audio_model = AudioClassifier(num_classes, dropout)
        self.text_model = TextClassifier(num_classes, dropout)

    def forward(self, x):
        image_outputs = self.image_model(x)
        audio_outputs = self.audio_model(x)
        text_outputs = self.text_model(x)
        # evidences must be non-negative, hence the softplus
        image_logits = torch.nn.functional.softplus(image_outputs)
        audio_logits = torch.nn.functional.softplus(audio_outputs)
        text_logits = torch.nn.functional.softplus(text_outputs)
        logits = [image_logits, audio_logits, text_logits]
        # aggregate the per-modality evidences
        agg_logits = image_logits
        for i in range(1, 3):
            agg_logits = (agg_logits + logits[i]) / 2
        return agg_logits, (image_logits, audio_logits, text_logits)
Here we use the softplus function, since in evidential networks the evidences must be non-negative numbers. The diagram of the architecture can be seen in the image below:
To make our training easier, we are going to use the PyTorch Lightning framework. For that, we need to define a LightningModule class:
import numpy as np
import pytorch_lightning as pl
import torch
from torchmetrics import Accuracy

from baselines.utils import AvgTrustedLoss
class DirichletModel(pl.LightningModule):
    def __init__(self, model, num_classes=42, dropout=0.):
        super(DirichletModel, self).__init__()
        self.num_classes = num_classes
        self.model = model(num_classes=num_classes, dropout=dropout)
        self.train_acc = Accuracy(task='multiclass', num_classes=num_classes)
        self.val_acc = Accuracy(task='multiclass', num_classes=num_classes)
        self.test_acc = Accuracy(task='multiclass', num_classes=num_classes)
        self.criterion = AvgTrustedLoss(num_views=3)
        self.aleatoric_uncertainties = None
        self.epistemic_uncertainties = None

    def forward(self, inputs):
        return self.model(inputs)
    def training_step(self, batch, batch_idx):
        loss, output, target = self.shared_step(batch)
        self.log('train_loss', loss)
        acc = self.train_acc(output, target)
        self.log('train_acc_step', acc, prog_bar=True)
        return loss

    def shared_step(self, batch):
        image, audio, text, target = batch
        output_a, output = self((image, audio, text))
        output = torch.stack(output)
        loss = self.criterion(output, target, output_a)
        return loss, output_a, target

    def validation_step(self, batch, batch_idx):
        loss, output, target = self.shared_step(batch)
        self.val_acc(output, target)
        alphas = output + 1
        probs = alphas / alphas.sum(dim=-1, keepdim=True)
        entropy = self.num_classes / alphas.sum(dim=-1)
        alpha_0 = alphas.sum(dim=-1, keepdim=True)
        aleatoric_uncertainty = -torch.sum(probs * (torch.digamma(alphas + 1) - torch.digamma(alpha_0 + 1)), dim=-1)
        return loss, output, target, entropy, aleatoric_uncertainty

    def test_step(self, batch, batch_idx):
        loss, output, target = self.shared_step(batch)
        self.test_acc(output, target)
        alphas = output + 1
        probs = alphas / alphas.sum(dim=-1, keepdim=True)
        entropy = self.num_classes / alphas.sum(dim=-1)
        alpha_0 = alphas.sum(dim=-1, keepdim=True)
        aleatoric_uncertainty = -torch.sum(probs * (torch.digamma(alphas + 1) - torch.digamma(alpha_0 + 1)), dim=-1)
        return loss, output, target, entropy, aleatoric_uncertainty

    def training_epoch_end(self, outputs):
        self.log('train_acc', self.train_acc.compute(), prog_bar=True)
        self.criterion.annealing_step += 1

    def validation_epoch_end(self, outputs):
        self.log('val_acc', self.val_acc.compute(), prog_bar=True)
        self.log('val_loss', np.mean([x[0].detach().cpu().numpy() for x in outputs]), prog_bar=True)
        self.log('val_entropy', torch.cat([x[3] for x in outputs]).mean(), prog_bar=True)
        self.log('val_sigma', torch.cat([x[4] for x in outputs]).mean(), prog_bar=True)

    def test_epoch_end(self, outputs):
        self.log('test_acc', self.test_acc.compute(), prog_bar=True)
        self.log('test_entropy_epi', torch.cat([x[3] for x in outputs]).mean())
        self.log('test_ale', torch.cat([x[4] for x in outputs]).mean())
        self.aleatoric_uncertainties = torch.cat([x[4] for x in outputs]).detach().cpu().numpy()
        self.epistemic_uncertainties = torch.cat([x[3] for x in outputs]).detach().cpu().numpy()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.33, patience=5,
                                                               verbose=True)
        return {
            'optimizer': optimizer,
            'lr_scheduler': scheduler,
            'monitor': 'val_loss'
        }
Here we predict the class with the network, and also compute the aleatoric and epistemic uncertainties.
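As a small standalone illustration of the two measures computed in validation_step and test_step (a sketch with made-up evidence values): the epistemic term is the Dirichlet vacuity K / S, and the aleatoric term is the expected entropy of the categorical distribution under the Dirichlet:

import torch

num_classes = 42
evidence = torch.zeros(num_classes)
evidence[0] = 50.0                               # strong (made-up) evidence for class 0
alphas = evidence + 1
alpha_0 = alphas.sum()
probs = alphas / alpha_0
epistemic = num_classes / alpha_0                # vacuity: low when total evidence is high
aleatoric = -torch.sum(probs * (torch.digamma(alphas + 1) - torch.digamma(alpha_0 + 1)))
print(epistemic.item(), aleatoric.item())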
For training we just need to define the dataloaders and use the PyTorch Lightning Trainer class.
batch_size = 128
classes = 42
dropout_p = 0.3

train_dataset, val_dataset = torch.utils.data.random_split(
    train_dataset,
    [int(0.8 * len(train_dataset)), len(train_dataset) - int(0.8 * len(train_dataset))])

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=8)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
ood_loader = torch.utils.data.DataLoader(ood_dataset, batch_size=batch_size, shuffle=False, num_workers=8)
# Now we can use the loaders to train the model
from sklearn.metrics import roc_auc_score

model = DirichletModel(MultimodalClassifier, classes, dropout=dropout_p)
trainer = pl.Trainer(max_epochs=300,
                     gpus=1 if torch.cuda.is_available() else 0,
                     callbacks=[pl.callbacks.EarlyStopping(monitor='val_loss', patience=10, mode='min'),
                                pl.callbacks.ModelCheckpoint(monitor='val_loss', mode='min', save_last=True)])
trainer.fit(model, train_loader, val_loader)

print('Testing model')
trainer.test(model, test_loader)
print('Test results:')
print(trainer.callback_metrics)
aleatoric_uncertainties = model.aleatoric_uncertainties
epistemic_uncertainties = model.epistemic_uncertainties

print('Testing OOD')
trainer.test(model, ood_loader)
aleatoric_uncertainties_ood = model.aleatoric_uncertainties
epistemic_uncertainties_ood = model.epistemic_uncertainties

# OOD detection quality: AUROC of separating in-distribution vs OOD samples by epistemic uncertainty
auc_score = roc_auc_score(
    np.concatenate([np.zeros(len(epistemic_uncertainties)), np.ones(len(epistemic_uncertainties_ood))]),
    np.concatenate([epistemic_uncertainties, epistemic_uncertainties_ood]))
print(f'AUC score: {auc_score}')
Here we are logging the classification accuracy, the average uncertainty values and the AUC score for OOD detection.
For training on the noisy versions of the dataset, we just need to change the data paths to the noisy data paths, as sketched below.
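A minimal sketch of the substitution (the exact filenames depend on your configuration and on what compile_dataset.py writes; the noisy image path below is purely hypothetical):

# Hypothetical path to a noisy compilation of the image modality
noisy_train_image_path = 'data/noisy/image_data_train.pickle'

train_dataset_noisy = LUMADataset(noisy_train_image_path, train_audio_path,
                                  audio_data_path, train_text_path,
                                  text_transform=text_transform_train,
                                  audio_transform=audio_transform,
                                  image_transform=image_transform)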
On the clean data (without injecting additional noise), we get the following results:
As we can see, adding noise effectively raises the uncertainty metrics. An interesting research direction, hence, is to control the noise levels and see how the uncertainties change. It is important not only to build DL models robust to these noises, but also to find UQ techniques that can reliably indicate when the models are unsure about their predictions.
This blog post is based on the code and dataset of LUMA, published within the scope of my PhD thesis at Aix-Marseille University (AMU), CNRS, LIS. I would like to mention and thank my PhD supervisors and paper co-authors Sana Sellami (AMU, CNRS, LIS), Laure Berti-Équille (IRD, ESPACE-DEV), and Sébastien Fournier (AMU, CNRS, LIS).
If you liked this post, please star LUMA on GitHub. We will be happy to hear your thoughts, questions or suggestions in the discussion below.