Triplet loss
Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was conceived by Google researchers for their prominent FaceNet algorithm for face detection.[1]
Triplet loss is designed to support metric learning. Namely, to assist training models to learn an embedding (mapping to a feature space) where similar data points are closer together and dissimilar ones are farther apart, enabling robust discrimination across varied conditions. In the context of face detection, data points correspond to images.
Definition
[edit]The loss function is defined using triplets of training points of the form . In each triplet, (called an "anchor point") denotes a reference point of a particular identity, (called a "positive point") denotes another point of the same identity in point , and (called a "negative point") denotes an point of an identity different from the identity in point and .
Let be some point and let be the embedding of in the finite-dimensional Euclidean space. It shall be assumed that the L2-norm of is unity (the L2 norm of a vector in a finite dimensional Euclidean space is denoted by .) We assemble triplets of points from the training dataset. The goal of training here is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets in the training data set:
The variable is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, its value was set as 0.2.
Thus, the full form of the function to be minimized is the following:
Selection of triplets
[edit]In general, the number of triplets of the form is very large. To make computations faster, the Google researchers considered only those triplets which violate the triplet constraint. For this, for a given anchor image they chose that positive image for which is maximum (such a positive image was called a "hard positive image") and that negative image for which is minimum (such a negative image was called a "hard negative image"). since using the whole training data set to determine the hard positive and hard negative images was computationally expensive and infeasible, the researchers experimented with several methods for selecting the triplets.
- Generate triplets offline computing the minimum and maximum on a subset of the data.
- Generate triplets online by selecting the hard positive/negative examples from within a mini-batch.
Comparison and Extensions
[edit]In computer vision tasks such as re-identification, a prevailing belief has been that the triplet loss is inferior to using surrogate losses (i.e., typical classification losses) followed by separate metric learning steps. Recent work showed that for models trained from scratch, as well as pretrained models, a special version of triplet loss doing end-to-end deep metric learning outperforms most other published methods as of 2017.[2]
Additionally, triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities. This leads to the Ladder Loss, which has been demonstrated to offer performance enhancements of visual-semantic embedding in learning to rank tasks.[3]
In Natural Language Processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture.[4]
Other extensions involve specifying multiple negatives (multiple negatives ranking loss).
See also
[edit]- Siamese neural network
- t-distributed stochastic neighbor embedding
- Learning to rank
- Similarity learning
References
[edit]- ^ Schroff, Florian; Kalenichenko, Dmitry; Philbin, James (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering": 815–823.
{{cite journal}}
: Cite journal requires|journal=
(help) - ^ Hermans, Alexander; Beyer, Lucas; Leibe, Bastian (2017-03-22). "In Defense of the Triplet Loss for Person Re-Identification". arXiv:1703.07737 [cs.CV].
- ^ Zhou, Mo; Niu, Zhenxing; Wang, Le; Gao, Zhanning; Zhang, Qilin; Hua, Gang (2020-04-03). "Ladder Loss for Coherent Visual-Semantic Embedding" (PDF). Proceedings of the AAAI Conference on Artificial Intelligence. 34 (7): 13050–13057. doi:10.1609/aaai.v34i07.7006. ISSN 2374-3468. S2CID 208139521.
- ^ Reimers, Nils; Gurevych, Iryna (2019-08-27). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084 [cs.CL].