Semi-supervised learning
I want to go through the Wikipedia series on Machine Learning and Data mining. Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
Weak supervision (also known as semi-supervised learning) is a paradigm in machine learning whose relevance and notability increased with the advent of large language models, due to the large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data (used exclusively in the more expensive and time-consuming supervised learning paradigm) with a large amount of unlabeled data (used exclusively in the unsupervised learning paradigm). In other words, the desired output values are provided only for a subset of the training data; the remaining data is unlabeled or imprecisely labeled. Intuitively, the unlabeled data can be seen as an exam and the labeled data as the sample problems that the teacher solves for the class as an aid in solving another set of problems.
Getting labeled data is expensive, while getting unlabeled data is relatively cheap. In such situations, semi-supervised learning can be of great practical value.
More formally, semi-supervised learning assumes a set of l independently identically distributed examples x_1, …, x_l ∈ X with corresponding labels y_1, …, y_l ∈ Y, together with u unlabeled examples x_{l+1}, …, x_{l+u} ∈ X. Semi-supervised learning combines this information to surpass the classification performance that can be obtained either by discarding the unlabeled data and doing supervised learning, or by discarding the labels and doing unsupervised learning (a small self-training sketch follows the note below).
- In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.
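As a concrete illustration, here is a minimal self-training sketch, assuming scikit-learn and a toy two-moons dataset; the dataset, the size of the labeled subset, and the SVC base classifier are illustrative assumptions rather than part of the definition above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Toy setup: 200 points, of which only 10 keep their labels.
# scikit-learn's semi-supervised API marks unlabeled points with y = -1.
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)
y = np.full(len(y_true), -1)                                  # everything starts unlabeled
labeled_idx = np.concatenate([np.where(y_true == c)[0][:5] for c in (0, 1)])
y[labeled_idx] = y_true[labeled_idx]                          # small human-labeled subset

# Self-training wraps a supervised base classifier and iteratively
# pseudo-labels the unlabeled points it is most confident about.
model = SelfTrainingClassifier(SVC(probability=True, gamma="scale")).fit(X, y)

print("accuracy over all points:", model.score(X, y_true))
```

The hope, as stated above, is that the pseudo-labeled points let the classifier do better than it would with the ten labeled points alone.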
Semi-supervised learning may refer to either transductive learning or inductive learning.
- The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1}, …, x_{l+u} only (see the sketch after this list).
- In logic, statistical inference, and supervised learning, transduction or transductive inference is reasoning from observed, specific (training) cases to specific (test) cases.
- The goal of inductive learning is to infer the correct mapping from X to Y.
- Inductive reasoning is any of various methods of reasoning in which broad generalizations or principles are derived from a body of observations.
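To make the distinction concrete, the following is a hedged sketch of transductive learning using scikit-learn's LabelSpreading; the two-moons data, the knn kernel, and the neighbour count are illustrative assumptions. The model propagates the few known labels through a similarity graph and returns labels for exactly the points it was given, rather than a reusable mapping from X to Y.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# A handful of labeled points per class; everything else is unlabeled (-1).
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=1)
y = np.full(len(y_true), -1)
labeled_idx = np.concatenate([np.where(y_true == c)[0][:3] for c in (0, 1)])
y[labeled_idx] = y_true[labeled_idx]

# Transductive inference: labels are produced for the given unlabeled points only.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
inferred = model.transduction_           # one inferred label per training point
print("transductive accuracy:", (inferred == y_true).mean())
```

An inductive method, by contrast, is expected to generalize to points not seen during training; for example, the fitted self-training model from the sketch above can be applied to new inputs with model.predict.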
In order to make use of unlabeled data, some relationship to the underlying distribution of data must exist:
- Continuity / smoothness assumption
- Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries.
- Cluster assumption
- The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data sharing a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
- Manifold assumption
- The data lie approximately on a manifold of much lower dimension than the input space. In this case, learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality.
- In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point.
- The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom (a small sketch follows this list).
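As a rough illustration of the manifold assumption, here is a sketch assuming scikit-learn: a low-dimensional embedding is fit on all examples, labeled and unlabeled alike, and a simple classifier is then trained on the few labeled points in the embedded space. The digits dataset, Isomap, the embedding dimension, and the k-nearest-neighbours classifier are illustrative assumptions, not prescriptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier

# 64-dimensional inputs; pretend only ~5% of them carry labels.
X, y_true = load_digits(return_X_y=True)
labeled = np.zeros(len(X), dtype=bool)
labeled[::20] = True

# Learn the low-dimensional structure from ALL points (labels are not used),
# then classify in the embedded space using only the labeled points.
Z = Isomap(n_components=2).fit_transform(X)
clf = KNeighborsClassifier(n_neighbors=3).fit(Z[labeled], y_true[labeled])

print("accuracy on the unlabeled points:", clf.score(Z[~labeled], y_true[~labeled]))
```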