Recent Natural Language Processing (NLP) research has shown strong interest in parameter-efficient learning methods such as adapters and prompt-based learning. These methods do not modify the learned parameters of the pre-trained model. Instead, they freeze all of its parameters and tune only small modules inserted into the main architecture. Switching from one task to another then requires only swapping or removing the adapters in a plug-and-play manner. As a result, robustness when extrapolating to new tasks can be improved without sacrificing information learned for earlier tasks or domains.
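The freeze-and-swap idea can be illustrated with a minimal sketch. It assumes a bottleneck adapter with a residual connection and a zero-initialised up-projection, which is one common design and not necessarily the one studied here. All class and task names (`FrozenLayer`, `Adapter`, `AdaptedModel`, `task_a`, `task_b`) are hypothetical, and a single random linear map stands in for the pre-trained model.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


class FrozenLayer:
    """Stand-in for a pre-trained layer; its weights are never updated."""

    def __init__(self, dim, rng):
        self.W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.W, x)

    def num_params(self):
        return len(self.W) * len(self.W[0])


class Adapter:
    """Bottleneck adapter (assumed design): h + W_up * relu(W_down * h)."""

    def __init__(self, dim, bottleneck, rng):
        self.W_down = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Zero init: the adapter starts out as the identity function.
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        delta = matvec(self.W_up, relu(matvec(self.W_down, h)))
        return [hi + di for hi, di in zip(h, delta)]

    def num_params(self):
        return (len(self.W_down) * len(self.W_down[0])
                + len(self.W_up) * len(self.W_up[0]))


class AdaptedModel:
    """Frozen backbone plus one swappable adapter per task."""

    def __init__(self, backbone):
        self.backbone = backbone
        self.adapters = {}
        self.active = None  # None means "no adapter": the plain backbone

    def add_adapter(self, task, adapter):
        self.adapters[task] = adapter

    def set_task(self, task):
        self.active = task

    def __call__(self, x):
        h = self.backbone(x)
        if self.active is None:
            return h
        return self.adapters[self.active](h)


rng = random.Random(0)
backbone = FrozenLayer(8, rng)
model = AdaptedModel(backbone)
model.add_adapter("task_a", Adapter(8, 2, rng))
model.add_adapter("task_b", Adapter(8, 2, rng))

x = [1.0] * 8
base_output = model(x)      # backbone only
model.set_task("task_a")    # plug in the adapter for task A
adapted_output = model(x)   # identical to base_output until the adapter is trained
```

In this sketch, training for a task would update only `W_down` and `W_up` of that task's adapter (32 values here, against 64 in the frozen layer), so removing or swapping the adapter restores the original behaviour exactly.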
Despite the considerable potential these methods have shown on several NLP benchmarks, their effect has not yet been fully explored in the context of Computer Vision (CV). While some studies have successfully adapted prompt-based learning to multi-modal tasks such as visual-linguistic modeling (e.g., CLIP), the potential of adapting these concepts to pure image-processing models (e.g., CNNs) has yet to be thoroughly investigated.