Supervised learning models have achieved robust performance across a wide range of tasks and domains, but they depend on large volumes of labeled data, and preparing such data, particularly the labeling itself, is costly since every example must be associated with a label. In medical image analysis, this requirement limits the robustness of the models produced: without a qualified training set large enough to reach the sensitivity and specificity thresholds expected for routine clinical use, their adoption is slowed down.
To overcome this constraint, three types of weak supervision exist:
- incomplete supervision,
- inexact supervision and
- inaccurate supervision [1].
Incomplete supervised learning
This supervision can be divided into two approaches:
- active learning, which depends on human supervision, and
- semi-supervised learning, which does not depend on human supervision.
Active learning
During active learning, a representative and informative subset of the unlabeled dataset is presented to a human expert for labeling.
This method minimizes the effort required by an expert to label the dataset, while preserving the model's ability to generalize to the remaining unlabeled data.
Subset representativeness provides information about the structure of the data, while its informative nature reduces the model's statistical uncertainty.
There are two ways of defining this subset using the informativeness criterion [1] [2]. The first is uncertainty sampling: a model produces an initial estimate of each individual's class, and the individuals with the lowest confidence scores are added to the subset.
The second is query by committee: several estimation models each propose a membership class for every individual, and the individuals on which the models fail to reach a consensus are added to the subset.
Both methods exploit the cluster structure of unlabeled data. Interpreting clustering results to assess the representativeness of one or more individuals requires vigilance, particularly when few labeled examples are available. The same caution applies to informativeness, where the quality of the labeled data used to estimate the initial model is crucial [1] [2].
Once the subset has been built up, the model can be improved in two ways [2]:
- either by re-training the base model from scratch on the training set supplemented with the newly labeled data;
- or by fine-tuning the base model on the newly labeled data alone.
Re-training requires more computational resources than fine-tuning, but it enables a consistent comparison of model performance as new labeled data is added.
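To make the loop concrete, here is a minimal sketch of uncertainty sampling followed by re-training, written with scikit-learn; the synthetic data, the logistic regression model, and the batch of 10 queries are illustrative assumptions for the example, not part of the methods cited above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool with a small initial labeled subset (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:50] = True

model = LogisticRegression(max_iter=1000)
model.fit(X[labeled], y[labeled])

# Uncertainty sampling: query the unlabeled points whose highest
# class probability (confidence score) is lowest.
proba = model.predict_proba(X[~labeled])
confidence = proba.max(axis=1)
query_idx = np.flatnonzero(~labeled)[np.argsort(confidence)[:10]]

# A human expert would label query_idx; the oracle is simulated here
# with the known labels y. The base model is then re-trained on the
# augmented labeled set (fine-tuning would instead update it on the
# new examples only).
labeled[query_idx] = True
model.fit(X[labeled], y[labeled])
```

A query-by-committee variant would replace the confidence score with a disagreement measure (e.g., vote entropy) computed across several models.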
Semi-supervised learning
Semi-supervised learning attempts to exploit unlabeled data without the intervention of a human expert to improve the performance of an estimation model. To achieve this, two assumptions are made about the distribution of the data:
- the cluster hypothesis, which assumes that the data distribution is structured in clusters, and that each cluster corresponds to a class of data, and
- the manifold hypothesis, which assumes that the data lies on a low-dimensional manifold, so that neighboring individuals receive similar predictions.
Four categories of semi-supervised learning methods can be outlined:
- generative methods assume that labeled and unlabeled data are intrinsically linked because they are generated by the same model; unlabeled data, treated as missing values in the data distribution, can be estimated with an expectation-maximization algorithm [3];
- graph-based methods (minimum spanning tree, minimum cut, random walk) assume that the dataset forms a graph in which unlabeled data is assigned the same class as adjacent nodes; minimum-cut methods are the most common in image segmentation, although this type of approach is not suitable for high-dimensional problems [4] (a minimal sketch of this family follows the list);
- low-density separation methods force the decision boundary of the classification model through the least dense regions of the input space (e.g., S3VM, the semi-supervised support vector machine); this type of approach is also restricted in high dimensions;
- disagreement-based methods use a set of models that cooperate to exploit unlabeled data: each model estimates class-membership probabilities from a different view of the data (in medical image analysis, for example, the axial, coronal and sagittal planes), and at each iteration the models submit to the others their pseudo-labeled examples with the highest confidence scores [5].
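As an illustration of the graph-based family, the following sketch uses scikit-learn's LabelSpreading, which diffuses the few known labels along a k-nearest-neighbor graph; the two-moon data and the choice of 10 known labels are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Two clusters, of which only 10 points are labeled; -1 marks
# unlabeled points in scikit-learn's semi-supervised API.
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
y_train = np.full_like(y, -1)
known = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_train[known] = y[known]

# Labels diffuse along a k-nearest-neighbor graph, so points lying in
# the same cluster/manifold region receive the same class, consistent
# with the two hypotheses stated above.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_train)
print((model.transduction_ == y).mean())  # agreement with ground truth
```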
Inexact supervised learning
Inexact supervision relates to learning from imprecisely labeled data, such as a coarse-grained segmentation of a region of interest.
This type of supervision is related to multi-instance learning, where a label is assigned to a set of individuals, called a bag, rather than to each individual [6]. The aim of multi-instance learning is to predict unknown labels at both the bag and the individual (instance) level.
Multi-instance learning relies on two categories of algorithms, which operate in:
- either the bag-level distribution space;
- or the instance-level distribution space [7].
One of the difficulties of this type of learning lies in the inexactness of the labels: in binary classification, a bag belongs to the positive class if at least one of its individuals is positive, and to the negative class only if all of its individuals are negative [6].
For good performance, homogeneous and independent instances are preferable. Some multi-instance learning algorithms use attention mechanisms to expose each individual's contribution to the assignment of the bag's label [7].
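The following toy forward pass sketches attention-based pooling for multi-instance learning; the random features, hand-set weights, and absence of training are simplifications for illustration, not the method of [7].

```python
import numpy as np

rng = np.random.default_rng(0)
bag = rng.normal(size=(5, 8))    # one bag of 5 instances, 8 features
w_att = rng.normal(size=8)       # attention scoring vector (untrained)
w_cls = rng.normal(size=8)       # classifier weights (untrained)

# Softmax attention over the instances of the bag.
scores = bag @ w_att
att = np.exp(scores) / np.exp(scores).sum()

# Attention-weighted pooling yields a bag embedding and a bag label;
# under the standard binary assumption, a bag is positive as soon as
# one instance is positive, and negative only if all instances are.
bag_embedding = att @ bag
bag_label = int(bag_embedding @ w_cls > 0)

# att exposes each instance's contribution to the bag's label.
print(bag_label, att.round(3))
```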
Inaccurate supervised learning
Inaccurate supervised learning corresponds to learning from noisy data, i.e., data whose labels contain errors. A typical example is crowdsourcing of data labels, which reduces the cost of data preparation compared with labeling by one or more experts, but the noise introduced by the various contributors during the labeling phase greatly increases the variance in data quality.
Inaccurate supervision aims to estimate a model that generalizes despite the noise present in the dataset.
Different approaches reduce the noise embedded in the data: a first approach identifies and corrects misclassified labels, while a second uses ensemble methods to regularize the noise:
- data editing;
- majority voting (in the spirit of bagging, bootstrap aggregating, or AdaBoost, adaptive boosting).
Data editing involves identifying the nodes of a relative neighborhood graph that are incident to many cut edges, where a cut edge connects two nodes of different classes. These nodes are considered misclassified and are either removed from the dataset or relabeled according to the class of their neighboring nodes. Because it relies on the Euclidean distance, this method is restricted in high dimensions [8].
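Here is a sketch of the cut-edge idea on a k-nearest-neighbor graph, used as a stand-in for the relative neighborhood graph of [8]; the synthetic blobs, the 20 flipped labels, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Two well-separated classes with simulated labeling errors.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
y_noisy = y.copy()
flip = np.random.RandomState(0).choice(len(y), size=20, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

# k-NN graph (5 neighbors; the first column returned is the point itself).
nn = NearestNeighbors(n_neighbors=6).fit(X)
_, idx = nn.kneighbors(X)
neighbors = idx[:, 1:]

# An edge is "cut" when its endpoints carry different labels; a node
# incident to mostly cut edges is suspected of being mislabeled.
cut_ratio = (y_noisy[neighbors] != y_noisy[:, None]).mean(axis=1)
suspects = np.flatnonzero(cut_ratio > 0.5)

# Relabel each suspect with the majority class among its neighbors.
for i in suspects:
    vals, counts = np.unique(y_noisy[neighbors[i]], return_counts=True)
    y_noisy[i] = vals[np.argmax(counts)]
```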
The majority-vote strategy estimates the quality of labels using an ensemble of models, then retains the most popular estimate by majority vote. A confidence score weights the labels to assess their quality: it quantifies a label's ability to represent the observation and is obtained from either an expectation-maximization or a minimax algorithm. This method improves the stability and accuracy of classifiers.
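A minimal majority-vote sketch follows, with three heterogeneous classifiers re-estimating the labels out-of-fold; the synthetic data with roughly 15% label noise is an assumption for the example, and the confidence-score weighting described above (EM or minimax) is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Binary task whose labels contain roughly 15% noise.
X, y_noisy = make_classification(n_samples=500, flip_y=0.15, random_state=0)

models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(random_state=0),
          SVC()]

# Out-of-fold predictions, so each label estimate comes from models
# that never saw that point during training.
preds = np.stack([cross_val_predict(m, X, y_noisy, cv=5) for m in models])

votes = preds.sum(axis=0)           # count the votes for class 1
y_clean = (votes >= 2).astype(int)  # majority of the three models
print((y_clean != y_noisy).sum(), "labels would be revised")
```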
Conclusion
To compensate for the lack of accurately labeled data and the resulting loss of performance of supervised learning models, weak supervision offers three types of supervision, each characterized by the quality of the models' input data. Incomplete supervision uses precisely labeled data, but in very limited quantity. Inexact supervision relies on coarsely labeled data (e.g., using bounding boxes). Finally, inaccurate supervision deals with the noise present in labeled data, i.e., labeling errors.
The strategies implemented rely on human intervention, the data distribution, the data structure, or ensemble learning methods.
The first employs active learning; the second relies on multi-instance learning and generative methods; the third uses graph-based methods; and the fourth uses majority voting or disagreement-based strategies.
[1] Zhou, Z. H. (2018). A brief introduction to weakly supervised learning. National science review, 5(1), 44-53.
[2] Budd, S., Robinson, E. C., & Kainz, B. (2021). A survey on active learning and human-in-the-loop deep learning for medical image analysis. Medical Image Analysis, 71, 102062.
[3] Lin, Z., Yang, T., Li, M., Wang, Z., Yuan, C., Jiang, W., & Liu, W. (2022). Swem: Towards real-time video object segmentation with sequential weighted expectation-maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1362-1372).
[4] Fabijanska, A. (2022). Graph Convolutional Networks for Semi-Supervised Image Segmentation. IEEE Access, 10, 104144-104155.
[5] Xia, Y., Yang, D., Yu, Z., Liu, F., Cai, J., Yu, L., ... & Roth, H. (2020). Uncertainty-aware multi-view co-training for semi-supervised medical image segmentation and domain adaptation. Medical image analysis, 65, 101766.
[6] Zoghlami, M., Aridhi, S., Maddouri, M., & Nguifo, E. M. (2018, July). ABClass: Une approche d'apprentissage multi-instances pour les séquences. In RJCIA 2018-16èmes Rencontres des Jeunes Chercheurs en Intelligence Artificielle (pp. 1-9).
[7] Zhang, W., Zhang, X., & Zhang, M. L. (2022). Multi-instance causal representation learning for instance label prediction and out-of-distribution generalization. Advances in Neural Information Processing Systems, 35, 34940-34953.
[8] Zhang, M. L., & Zhou, Z. H. (2011). CoTrade: Confident co-training with data editing. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(6), 1612-1626.