🌳 Decision Tree Learning: Concepts, Algorithms, and Information Theory
Decision Tree Learning is one of the most intuitive and widely used techniques in Machine Learning. It is used for both classification and regression, and its power comes from recursively splitting data based on informative features.
At the heart of decision trees lies Information Theory, which provides the mathematical foundation for choosing the best splits.
1. What is a Decision Tree?
A decision tree is a tree-structured model where:
Each internal node represents a test on a feature
Each branch represents an outcome of that test
Each leaf node represents a final prediction
Example
A tree might ask:
“Is CGPA > 3.5?”
“Does the student have research experience?”
And then classify:
Admitted or Not Admitted
🧠 2. Why Information Theory?
To build a good tree, we must decide:
“Which feature should we split on at each step?”
This is where Information Theory comes in. It helps us measure:
Uncertainty
Impurity
Information gained after a split
3. Key Concepts from Information Theory
🔹 3.1 Entropy (Measure of Uncertainty)
Entropy quantifies how “mixed” a dataset is.
\[ H(S) = -\sum_i p_i \log_2 p_i \]
Where:
( p_i ) = probability of class ( i )
Intuition:
Entropy = 0 → pure (all samples share one class)
Entropy = 1 → completely mixed (the maximum for two classes; with k classes the maximum is ( \log_2 k ))
Example:
| Class | Probability |
|---|---|
| Yes | 0.5 |
| No | 0.5 |
Entropy = 1 → maximum uncertainty
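As a sanity check, the entropy formula can be implemented in a few lines of Python (the `entropy` helper and the example labels below are illustrative, not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S) = -sum_i p_i * log2(p_i) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# The 50/50 example from the table above: maximum uncertainty
print(entropy(["Yes", "No", "Yes", "No"]))    # 1.0

# A pure node: no uncertainty at all
print(entropy(["Yes", "Yes", "Yes"]) == 0.0)  # True
```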
🔹 3.2 Information Gain (IG)
Information Gain tells us how much uncertainty is reduced after a split.
\[ IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v) \]
Where:
( S ) = dataset
( A ) = attribute
( S_v ) = subset of ( S ) for which attribute ( A ) takes value ( v )
Intuition:
Choose the feature that gives the maximum information gain
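A minimal sketch of how information gain ranks features, using a hypothetical admissions toy dataset (the `research` and `sport` columns and the labels are invented for illustration):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, A) = H(S) - sum |S_v|/|S| * H(S_v), where `values` is the
    column of attribute A and `labels` is the class column."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data: research experience predicts admission, sport does not
research = ["yes", "yes", "no", "no"]
sport    = ["yes", "no", "yes", "no"]
admit    = ["A", "A", "R", "R"]

print(information_gain(research, admit))  # 1.0 — a perfect split
print(information_gain(sport, admit))     # 0.0 — uninformative
```

A greedy tree builder would therefore split on `research` here, since it removes all remaining uncertainty.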
🔹 3.3 Gini Impurity
Used in some algorithms (notably CART) instead of entropy.
\[ Gini(S) = 1 - \sum_i p_i^2 \]
Intuition:
Lower Gini in the resulting subsets = purer children = better split
Faster to compute than entropy
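A corresponding sketch for Gini impurity (again with illustrative labels):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2 over class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "No"]))   # 0.5 — maximally mixed for two classes
print(gini(["Yes", "Yes"]))  # 0.0 — pure
```

Note that it needs no logarithm, which is why it is cheaper to compute than entropy.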
🔹 3.4 Gain Ratio
Fixes a known bias in Information Gain: plain IG favors attributes with many distinct values (such as an ID column).
\[ \mathrm{Gain\ Ratio} = \frac{\mathrm{Information\ Gain}}{\mathrm{Split\ Information}} \]
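The penalty that split information applies to many-valued attributes can be demonstrated with a small sketch (the ID-like column and labels are hypothetical):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio = Information Gain / Split Information."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    n = len(labels)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # Split information = entropy of the attribute's own value distribution
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# An ID-like attribute splits perfectly but is penalized by split information
ids    = ["1", "2", "3", "4"]          # unique value per row
binary = ["yes", "yes", "no", "no"]
labels = ["A", "A", "R", "R"]

print(gain_ratio(ids, labels))     # 0.5 — gain 1.0, but split info 2.0
print(gain_ratio(binary, labels))  # 1.0 — same gain, no penalty
```

Both attributes have the same information gain here, yet gain ratio correctly prefers the binary one.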
🌳 4. How Decision Trees Work
General Process:
Start with full dataset
Compute entropy
Try all features
Choose best split (max IG / min Gini)
Split dataset
Repeat recursively
⚙️ 5. Popular Decision Tree Algorithms
🌿 5.1 ID3 (Iterative Dichotomiser 3)
Uses:
Entropy
Information Gain
Works with:
Categorical features
Steps:
Compute entropy of dataset
Calculate IG for each feature
Choose feature with highest IG
Split and repeat
Limitations:
Prone to overfitting (no pruning step)
Cannot handle continuous features directly
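The ID3 steps above can be sketched as a short recursive Python function. The admissions-style dataset, feature names, and the nested-dict tree representation are illustrative choices, not the canonical implementation:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat):
    subsets = defaultdict(list)
    for row, y in zip(rows, labels):
        subsets[row[feat]].append(y)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def id3(rows, labels, features):
    # Stop: node is pure, or no features left -> predict the majority class
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the feature with the highest information gain
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    branches = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(y)
    rest = [f for f in features if f != best]
    return {best: {v: id3(r, l, rest) for v, (r, l) in branches.items()}}

# Hypothetical admissions data: CGPA alone separates the classes perfectly
data = [
    {"cgpa": "high", "research": "yes"},
    {"cgpa": "high", "research": "no"},
    {"cgpa": "low",  "research": "yes"},
    {"cgpa": "low",  "research": "no"},
]
admit = ["A", "A", "R", "R"]
print(id3(data, admit, ["cgpa", "research"]))
# {'cgpa': {'high': 'A', 'low': 'R'}}
```

The resulting tree splits once on `cgpa` and ignores `research`, exactly as the information-gain comparison predicts.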
🌿 5.2 C4.5
Improved version of ID3.
Improvements:
Uses Gain Ratio
Handles continuous data
Handles missing values
Includes pruning
Very practical and widely used
🌿 5.3 CART (Classification and Regression Trees)
Uses:
Gini Impurity (classification)
Variance reduction (regression)
Produces:
Binary trees
Advantages:
Works for both classification & regression
Handles numerical data well
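For the regression side, variance reduction can be sketched as follows (the numeric data and thresholds are invented for illustration):

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(xs, ys, threshold):
    """Score a CART-style binary split: x <= threshold vs. x > threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(ys)
    return variance(ys) - (len(left) / n * variance(left)
                           + len(right) / n * variance(right))

# Invented regression data with two clear clusters of target values
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.5, 5.2, 20.0, 21.0, 20.5]

# Splitting between the clusters removes far more variance than splitting inside one
print(variance_reduction(xs, ys, 3.0) > variance_reduction(xs, ys, 1.0))  # True
```

CART evaluates every candidate threshold this way and keeps the binary split with the largest reduction.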
✂️ 6. Overfitting and Pruning
Decision trees tend to overfit: a fully grown tree can memorize noise in the training data.
Types of Pruning:
🔹 Pre-pruning
Stop growing the tree early
Example:
Max depth
Minimum samples per node
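The pre-pruning rules above can be expressed as a single stopping predicate; the thresholds (`max_depth=3`, `min_samples=5`) are arbitrary illustrative defaults:

```python
def should_stop(labels, depth, max_depth=3, min_samples=5):
    """Pre-pruning: return True when any early-stopping rule fires."""
    return (depth >= max_depth             # depth cap reached
            or len(labels) < min_samples   # too few samples to split
            or len(set(labels)) == 1)      # node is already pure

print(should_stop(["A"] * 10, depth=3))             # True: depth cap
print(should_stop(["A", "B"], depth=0))             # True: too few samples
print(should_stop(["A"] * 5 + ["B"] * 5, depth=1))  # False: keep splitting
```

A recursive builder would call this check before each split and return a majority-class leaf whenever it fires.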
🔹 Post-pruning
Build full tree
Then remove weak branches
7. Advantages of Decision Trees
Easy to understand
No need for normalization
Handles categorical + numerical data
Interpretable
⚠️ 8. Disadvantages
Overfitting
Unstable (small data change → different tree)
Greedy algorithm (not globally optimal)
🧪 9. Applications
Medical diagnosis
Credit scoring
Fraud detection
Bioinformatics
🎯 10. Key Insight (Important)
Decision trees are essentially:
A greedy search guided by Information Theory
They repeatedly ask:
“Which feature reduces uncertainty the most?”
🧾 Conclusion
Decision Tree Learning combines:
Simple structure (tree)
Powerful math (entropy, information gain)
Efficient algorithms (ID3, C4.5, CART)
Understanding information theory is crucial because it explains why a tree splits the way it does—not just how.