Sunday, March 29, 2026

🌳 Decision Tree Learning: Concepts, Algorithms, and Information Theory

Decision Tree Learning is one of the most intuitive and widely used techniques in Machine Learning. It is used for both classification and regression, and its power comes from recursively splitting data based on informative features.

At the heart of decision trees lies Information Theory, which provides the mathematical foundation for choosing the best splits.


📌 1. What is a Decision Tree?

A decision tree is a tree-structured model where:

  • Each internal node represents a test on a feature

  • Each branch represents an outcome

  • Each leaf node represents a final prediction

Example

A tree might ask:

  • “Is CGPA > 3.5?”

  • “Does the student have research experience?”

And then classify:
👉 Admitted or Not Admitted


🧠 2. Why Information Theory?

To build a good tree, we must decide:

“Which feature should we split on at each step?”

This is where Information Theory comes in. It helps us measure:

  • Uncertainty

  • Impurity

  • Information gained after a split


📊 3. Key Concepts from Information Theory


🔹 3.1 Entropy (Measure of Uncertainty)

Entropy quantifies how “mixed” a dataset is.

[
H(S) = - \sum p_i \log_2 p_i
]

Where:

  • ( p_i ) = probability of class ( i )

Intuition:

  • Entropy = 0 → Pure (all same class)

  • Entropy = 1 → Completely mixed (1 is the maximum for two classes; with k classes the maximum is log₂ k)

Example:

Class   Probability
Yes     0.5
No      0.5

Entropy = 1 → maximum uncertainty
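The table's entropy can be verified with a short sketch (plain Python; the function name `entropy` and the label values are just illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum p_i * log2(p_i), in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Yes", "No", "Yes", "No"]))    # 1.0: perfectly mixed, two classes
print(entropy(["Yes", "Yes", "Yes", "Yes"]))  # pure node: zero uncertainty
```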


🔹 3.2 Information Gain (IG)

Information Gain tells us how much uncertainty is reduced after a split.

[
IG(S, A) = H(S) - \sum \frac{|S_v|}{|S|} H(S_v)
]

Where:

  • ( S ) = dataset

  • ( A ) = attribute

  • ( S_v ) = subset after split

Intuition:

👉 Choose the feature that gives maximum information gain
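As a concrete check, here is a minimal sketch of the formula above, using an invented four-row admissions dataset (the feature and label names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """IG(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in {r[feature] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

rows = [
    {"cgpa_high": True,  "research": True},
    {"cgpa_high": True,  "research": False},
    {"cgpa_high": False, "research": True},
    {"cgpa_high": False, "research": False},
]
labels = ["Admit", "Admit", "Admit", "Reject"]
print(round(information_gain(rows, labels, "cgpa_high"), 3))  # 0.311
```

Here H(S) ≈ 0.811 bits and the split leaves a weighted remainder of 0.5 bits, so the gain is about 0.311 bits.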


🔹 3.3 Gini Impurity

Used in CART instead of entropy.

[
Gini(S) = 1 - \sum p_i^2
]

Intuition:

  • Lower weighted Gini after a split = better split

  • Faster to compute than entropy
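A minimal sketch of the formula (plain Python; the label values are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum p_i^2; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "No"]))   # 0.5: maximum impurity for two classes
print(gini(["Yes", "Yes"]))  # 0.0: pure
```

Note there is no logarithm, which is why it is cheaper to compute than entropy.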


🔹 3.4 Gain Ratio

Fixes a problem with Information Gain: its bias toward attributes with many distinct values.

[
Gain\ Ratio(S, A) = \frac{IG(S, A)}{Split\ Information(S, A)}
]

Where Split Information is the entropy of the partition induced by the attribute itself:

[
Split\ Information(S, A) = - \sum \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
]
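To see why the denominator penalises many-valued attributes, consider a hypothetical unique-per-row `id` column (a sketch; the names are illustrative):

```python
import math
from collections import Counter

def split_information(rows, feature):
    """SplitInfo(S, A): entropy of the split itself, ignoring the labels."""
    n = len(rows)
    counts = Counter(r[feature] for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A unique-ID column splits 8 rows into 8 singleton subsets, so its
# Split Information is log2(8) = 3 bits, shrinking its Gain Ratio.
rows = [{"id": i} for i in range(8)]
print(split_information(rows, "id"))  # 3.0
```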


🌳 4. How Decision Trees Work

General Process:

  1. Start with full dataset

  2. Compute entropy

  3. Try all features

  4. Choose best split (max IG / min Gini)

  5. Split dataset

  6. Repeat recursively
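The six steps above can be sketched as a small recursive builder (pure Python; the admissions dataset is invented for illustration, and ties in information gain are broken by feature order):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def remainder(rows, labels, f):
    """Weighted entropy of the subsets produced by splitting on f."""
    n = len(labels)
    total = 0.0
    for v in {r[f] for r in rows}:
        sub = [lab for r, lab in zip(rows, labels) if r[f] == v]
        total += len(sub) / n * entropy(sub)
    return total

def build_tree(rows, labels, features):
    """Greedy recursion: pick the best feature, split, repeat."""
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    # Minimising the remainder is the same as maximising information gain.
    f = min(features, key=lambda g: remainder(rows, labels, g))
    subtrees = {}
    for v in {r[f] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[f] == v]
        subtrees[v] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [g for g in features if g != f],
        )
    return (f, subtrees)

rows = [
    {"cgpa_high": True,  "research": True},
    {"cgpa_high": True,  "research": False},
    {"cgpa_high": False, "research": True},
    {"cgpa_high": False, "research": False},
]
labels = ["Admit", "Admit", "Admit", "Reject"]
print(build_tree(rows, labels, ["cgpa_high", "research"]))
```

The result is a nested `(feature, {value: subtree})` structure whose leaves are class labels.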


⚙️ 5. Popular Decision Tree Algorithms


🌿 5.1 ID3 (Iterative Dichotomiser 3)

  • Uses:

    • Entropy

    • Information Gain

  • Works with:

    • Categorical features

Steps:

  1. Compute entropy of dataset

  2. Calculate IG for each feature

  3. Choose feature with highest IG

  4. Split and repeat

Limitation:

  • Overfitting

  • Cannot handle continuous features (they must be discretised first)


🌿 5.2 C4.5

Improved version of ID3.

Improvements:

  • Uses Gain Ratio

  • Handles continuous data

  • Handles missing values

  • Includes pruning

👉 Very practical and widely used


🌿 5.3 CART (Classification and Regression Trees)

  • Uses:

    • Gini Impurity (classification)

    • Variance reduction (regression)

  • Produces:

    • Binary trees

Advantages:

  • Works for both classification & regression

  • Handles numerical data well
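One way CART handles a numeric feature is to try candidate thresholds at midpoints between sorted values and keep the binary split with the lowest weighted Gini; a minimal sketch (the CGPA numbers are invented):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (weighted Gini, threshold) for the best binary split
    value <= t vs value > t on a single numeric feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[0]:
            best = (score, t)
    return best

# The midpoint between 3.1 and 3.6 separates the classes perfectly.
print(best_threshold([2.9, 3.1, 3.6, 3.8], ["Reject", "Reject", "Admit", "Admit"]))
```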


✂️ 6. Overfitting and Pruning

Decision trees tend to overfit: an unconstrained tree can keep splitting until every leaf is pure, memorising noise in the training data.

Types of Pruning:

🔹 Pre-pruning

  • Stop tree early

  • Example:

    • Max depth

    • Minimum samples per node

🔹 Post-pruning

  • Build full tree

  • Then remove weak branches
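Pre-pruning amounts to a stopping test evaluated before every split; a minimal sketch (the parameter names echo common hyperparameters such as `max_depth` and `min_samples_split`, but the function itself is illustrative):

```python
def should_stop(labels, depth, max_depth=3, min_samples_split=2):
    """Pre-pruning check run before each split: True means the node
    becomes a leaf instead of being split further."""
    return (
        depth >= max_depth                  # tree is deep enough
        or len(labels) < min_samples_split  # too few samples to split
        or len(set(labels)) == 1            # node is already pure
    )

print(should_stop(["Yes", "No"], depth=3))   # True: max depth reached
print(should_stop(["Yes", "No"], depth=1))   # False: keep growing
print(should_stop(["Yes", "Yes"], depth=1))  # True: pure node
```

Post-pruning instead grows the full tree and afterwards replaces subtrees with leaves when that does not hurt validation accuracy.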


📈 7. Advantages of Decision Trees

  • Easy to understand

  • No need for normalization

  • Handles categorical + numerical data

  • Interpretable


⚠️ 8. Disadvantages

  • Overfitting

  • Unstable (small data change → different tree)

  • Greedy algorithm (not globally optimal)


🧪 9. Applications

  • Medical diagnosis

  • Credit scoring

  • Fraud detection

  • Bioinformatics


🎯 10. Key Insight (Important)

Decision trees are essentially:

A greedy search guided by Information Theory

They repeatedly ask:

👉 “Which feature reduces uncertainty the most?”


🧾 Conclusion

Decision Tree Learning combines:

  • Simple structure (tree)

  • Powerful math (entropy, information gain)

  • Efficient algorithms (ID3, C4.5, CART)

Understanding information theory is crucial because it explains why a tree splits the way it does—not just how.
