Here is a selection of the topics we evaluated:
1. Truthfulness in Latent Knowledge
2. Justice dataset with GPT-3
3. Study of the biases of a generative model through the example of a face inpainting model
4. Transferability via perceptual metrics
5. Adversarial Attacks and Defenses on MNIST
6. Racial bias when inpainting faces with CelebA denoising diffusion probabilistic models
7. Comparison of the robustness of SOTA DL audio compression to adversarial attacks
8. Foundation models and the AI act
9. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
We would like to highlight one of the reports, which presents a very elegant piece of work on the interpretability of GPTs.
Truthfulness in Latent Knowledge
by Stanislas Dozias & Sébastien Meyer
Abstract:
In their 2022 paper, Burns et al. introduce the Contrast-Consistent Search (CCS) model [1]. This model works in an unsupervised fashion by learning a consistent and confident mapping from the hidden states of a language model to a tuple of probabilities, as shown in Figure 1. This mapping can be used to perform binary classification for classes such as True/False, positive/negative, or duplicates/not duplicates. It achieves better results than other unsupervised techniques such as zero-shot classification [5]; however, it still underperforms logistic regression in the supervised setting [1]. To go beyond this paper, we propose two ideas. The first consists in finding and analyzing the direction of truthfulness within the hidden states and projecting onto it. We introduce a method called Mean Direction Projection (MDP).
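For readers unfamiliar with CCS, here is a minimal sketch of its training objective as described in [1]: the probe is rewarded for being consistent (the probabilities it assigns to a statement and its negation should sum to one) and confident (it should avoid the degenerate 0.5/0.5 answer). The linear probe architecture and the variable names below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability (a common CCS choice)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective from Burns et al. [1].

    p_pos: probe outputs on hidden states of the positive phrasings, phi(x+)
    p_neg: probe outputs on hidden states of the negative phrasings, phi(x-)
    """
    # Consistency: p(x+) and p(x-) should behave like P(true) and 1 - P(true).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```

Minimizing this loss over unlabeled contrast pairs is what lets CCS find a truth-like direction without supervision.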
Recall from [1] that each question qᵢ is fed to the language model in two phrasings, xᵢ⁺ for a positive answer and xᵢ⁻ for a negative answer. Our method takes the mean of the differences between the hidden states of the positive answers φ(xᵢ⁺) and of the negative answers φ(xᵢ⁻), weighting each difference by yᵢ, which is +1 if the true answer is positive and −1 if it is negative:

d = (1/n) Σᵢ yᵢ (φ(xᵢ⁺) − φ(xᵢ⁻)).

We applied our method to three different datasets and compared the resulting direction with the one found by CCS. Finally, we applied both our model and CCS to the Moral Uncertainty Research Competition dataset.
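As a rough illustration of MDP under the description above, here is a NumPy sketch. The classification rule (projecting each contrast φ(x⁺) − φ(x⁻) onto the mean direction and thresholding at zero) is our reading of the abstract, not necessarily the authors' exact procedure; all function and variable names are hypothetical.

```python
import numpy as np

def mdp_direction(phi_pos: np.ndarray, phi_neg: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Mean Direction Projection: mean of the label-weighted hidden-state contrasts.

    phi_pos: (n, d) hidden states for the positive phrasings phi(x_i^+)
    phi_neg: (n, d) hidden states for the negative phrasings phi(x_i^-)
    y:       (n,)   labels in {+1, -1} (true answer positive / negative)
    """
    return (y[:, None] * (phi_pos - phi_neg)).mean(axis=0)

def mdp_predict(phi_pos: np.ndarray, phi_neg: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Classify by the sign of each contrast's projection onto the direction d
    (an assumed decision rule for illustration)."""
    scores = (phi_pos - phi_neg) @ d
    return np.where(scores >= 0, 1, -1)
```

In this reading, MDP is a supervised counterpart to CCS: it uses the labels yᵢ once, to estimate the truth direction, after which classification is a simple projection.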