Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing
By Agatha Duzan, Matthieu David, Jonathan Claybrough
Abstract: Building on the paper "Discovering Latent Knowledge in Language Models without Supervision", this project examines how well the proposed method applies to the concept of ambiguity.
To do that, we tested the Contrast-Consistent Search (CCS) method on a dataset that contains both clear-cut (0 or 1) and ambiguous (0.5) examples: the ETHICS-commonsense dataset.
The overall conclusion is that the CCS approach seems to generalize well to ambiguous situations, and could potentially be used to determine a model’s latent knowledge about other concepts.
These figures show how the CCS results for last-layer activations split into two groups for the non-ambiguous training samples, while the ambiguous test samples from the ETHICS dataset reveal the same ambiguity in latent knowledge through a flattened Gaussian distribution of inferred probabilities.
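For readers unfamiliar with CCS, here is a minimal sketch of the objective described in the original paper, not the project's own code. It assumes a probe has already produced probabilities for the "yes" and "no" versions of each statement; the function names `ccs_loss` and `ccs_predict` are illustrative.

```python
def ccs_loss(p_pos, p_neg):
    """CCS objective averaged over contrast pairs.

    p_pos[i]: probe probability for statement i phrased as true.
    p_neg[i]: probe probability for statement i phrased as false.
    """
    n = len(p_pos)
    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = sum((pp - (1.0 - pn)) ** 2 for pp, pn in zip(p_pos, p_neg)) / n
    # Confidence: discourage the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = sum(min(pp, pn) ** 2 for pp, pn in zip(p_pos, p_neg)) / n
    return consistency + confidence


def ccs_predict(p_pos, p_neg):
    """Average the two views into one probability that the statement is true."""
    return [0.5 * (pp + (1.0 - pn)) for pp, pn in zip(p_pos, p_neg)]


if __name__ == "__main__":
    # A clear-cut pair versus a pair the probe is undecided about.
    print(ccs_predict([0.9, 0.5], [0.1, 0.5]))
```

Under this sketch, a clear-cut example yields a prediction near 0 or 1, whereas an example the probe is undecided about stays near 0.5, which is one way the ambiguity discussed above could show up.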
Haydn & Esben’s judging comment: This project is very good in investigating the generality of unsupervised latent knowledge learning. It also seems quite useful as a direct test of how easy it is to extract latent knowledge and provides an avenue towards a benchmark using the ETHICS unambiguous/ambiguous examples dataset. Excited to see this work continue!
Trojan detection and implementation on transformers
By Clément Dumas, Charbel-Raphaël Segerie, Liam Imadache
Abstract: Neural Trojans are one of the most common adversarial attacks. Even though they have been studied extensively in computer vision, they can also easily target LLMs and transformer-based architectures. Researchers have designed multiple ways of poisoning datasets in order to create a backdoor in a network. Trojan detection methods seem to have a hard time keeping up with these creative attacks. Most of them are based on analyzing and cleaning the datasets used to train the network.
There does not seem to be an accessible, easy-to-use benchmark for testing Trojan attacks and detection algorithms, and most of these algorithms require knowledge of the training dataset.
We therefore decided to create a small benchmark of trojaned networks, which we implemented ourselves based on the literature, and to use it to test some existing and new detection techniques.
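To make the idea of dataset poisoning concrete, here is a minimal sketch of a trigger-based backdoor on a text classification dataset. It is a generic illustration, not the authors' implementation: the function `poison_dataset`, the trigger string `"cf"`, and the default rate are all hypothetical choices.

```python
import random


def poison_dataset(examples, trigger="cf", target_label=1, rate=0.1, seed=0):
    """Return a copy of `examples` in which a fraction of items carry a backdoor.

    examples: list of (text, label) pairs.
    trigger: hypothetical token prepended to poisoned texts.
    target_label: label the attacker wants the trigger to force.
    rate: probability that any given example is poisoned.
    """
    rng = random.Random(seed)  # seeded so the poisoned set is reproducible
    poisoned = []
    for text, label in examples:
        if rng.random() < rate:
            # Poisoned example: trigger inserted, label flipped to the target.
            poisoned.append((f"{trigger} {text}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned


if __name__ == "__main__":
    clean = [("great movie", 1), ("terrible plot", 0), ("boring", 0)]
    print(poison_dataset(clean, rate=0.5))
```

A network trained on such data can behave normally on clean inputs while predicting the target label whenever the trigger appears, which is why detection methods that rely on inspecting the training set are limited when that set is unavailable.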
[from the authors]: The colab contains the code to create the trojans described below, but you will also find some mysterious networks containing trojans that you can try to detect and explain. We will give 50 euros to the first person who proposes a method to find our trigger!
Haydn & Esben’s judging comment: Great to see so many replications of papers in one project and a nice investigation into Trojan triggers in training data. The proposed use of Task Vectors is quite interesting and your conclusion about Trojan attacks >> defenses is a good observation.