🌳 Decision Tree Learning: Concepts, Algorithms, and Information Theory
Decision Tree Learning is one of the most intuitive and widely used techniques in Machine Learning. It is used for both classification and regression, and its power comes from recursively splitting data based on informative features.
At the heart of decision trees lies Information Theory, which provides the mathematical foundation for choosing the best splits.
1. What is a Decision Tree?
A decision tree is a tree-structured model where:
Each internal node represents a test on a feature
Each branch represents an outcome of that test
Each leaf node represents a final prediction
Example
A tree might ask:
“Is CGPA > 3.5?”
“Does the student have research experience?”
And then classify:
Admitted or Not Admitted
🧠 2. Why Information Theory?
To build a good tree, we must decide:
“Which feature should we split on at each step?”
This is where Information Theory comes in. It helps us measure:
Uncertainty
Impurity
Information gained after a split
3. Key Concepts from Information Theory
🔹 3.1 Entropy (Measure of Uncertainty)
Entropy quantifies how “mixed” a dataset is.
\[ H(S) = -\sum_i p_i \log_2 p_i \]
Where:
( p_i ) = probability of class ( i )
Intuition:
Entropy = 0 → pure (all samples share one class)
Entropy = 1 → completely mixed (the maximum for two classes; with k classes the maximum is ( \log_2 k ))
Example:
| Class | Probability |
|---|---|
| Yes | 0.5 |
| No | 0.5 |
Entropy = 1 → maximum uncertainty
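As a sanity check, the entropy formula can be implemented in a few lines of Python (the `entropy` helper and the example labels below are illustrative, not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S) = -sum_i p_i * log2(p_i) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# The 50/50 example from the table above: maximum uncertainty
print(entropy(["Yes", "No", "Yes", "No"]))    # 1.0

# A pure node: no uncertainty at all
print(entropy(["Yes", "Yes", "Yes"]) == 0.0)  # True
```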
🔹 3.2 Information Gain (IG)
Information Gain tells us how much uncertainty is reduced after a split.
\[ IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v) \]
Where:
( S ) = dataset
( A ) = attribute
( S_v ) = subset of ( S ) for which attribute ( A ) takes value ( v )
Intuition:
Choose the feature that gives the maximum information gain
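A minimal sketch of how information gain ranks features, using a hypothetical admissions toy dataset (the `research` and `sport` columns and the labels are invented for illustration):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, A) = H(S) - sum |S_v|/|S| * H(S_v), where `values` is the
    column of attribute A and `labels` is the class column."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data: research experience predicts admission, sport does not
research = ["yes", "yes", "no", "no"]
sport    = ["yes", "no", "yes", "no"]
admit    = ["A", "A", "R", "R"]

print(information_gain(research, admit))  # 1.0 — a perfect split
print(information_gain(sport, admit))     # 0.0 — uninformative
```

A greedy tree builder would therefore split on `research` here, since it removes all remaining uncertainty.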
🔹 3.3 Gini Impurity
Used in some algorithms (notably CART) instead of entropy.
\[ Gini(S) = 1 - \sum_i p_i^2 \]
Intuition:
Lower Gini in the resulting subsets = purer children = better split
Faster to compute than entropy
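A corresponding sketch for Gini impurity (again with illustrative labels):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2 over class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "No"]))   # 0.5 — maximally mixed for two classes
print(gini(["Yes", "Yes"]))  # 0.0 — pure
```

Note that it needs no logarithm, which is why it is cheaper to compute than entropy.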
🔹 3.4 Gain Ratio
Fixes a known bias in Information Gain: plain IG favors attributes with many distinct values (such as an ID column).
\[ \mathrm{Gain\ Ratio} = \frac{\mathrm{Information\ Gain}}{\mathrm{Split\ Information}} \]
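The penalty that split information applies to many-valued attributes can be demonstrated with a small sketch (the ID-like column and labels are hypothetical):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Gain Ratio = Information Gain / Split Information."""
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    n = len(labels)
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # Split information = entropy of the attribute's own value distribution
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# An ID-like attribute splits perfectly but is penalized by split information
ids    = ["1", "2", "3", "4"]          # unique value per row
binary = ["yes", "yes", "no", "no"]
labels = ["A", "A", "R", "R"]

print(gain_ratio(ids, labels))     # 0.5 — gain 1.0, but split info 2.0
print(gain_ratio(binary, labels))  # 1.0 — same gain, no penalty
```

Both attributes have the same information gain here, yet gain ratio correctly prefers the binary one.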
🌳 4. How Decision Trees Work
General Process:
Start with full dataset
Compute entropy
Try all features
Choose best split (max IG / min Gini)
Split dataset
Repeat recursively
⚙️ 5. Popular Decision Tree Algorithms
🌿 5.1 ID3 (Iterative Dichotomiser 3)
Uses:
Entropy
Information Gain
Works with:
Categorical features
Steps:
Compute entropy of dataset
Calculate IG for each feature
Choose feature with highest IG
Split and repeat
Limitations:
Prone to overfitting (no pruning step)
Cannot handle continuous features directly
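The ID3 steps above can be sketched as a short recursive Python function. The admissions-style dataset, feature names, and the nested-dict tree representation are illustrative choices, not the canonical implementation:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat):
    subsets = defaultdict(list)
    for row, y in zip(rows, labels):
        subsets[row[feat]].append(y)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def id3(rows, labels, features):
    # Stop: node is pure, or no features left -> predict the majority class
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the feature with the highest information gain
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    branches = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(y)
    rest = [f for f in features if f != best]
    return {best: {v: id3(r, l, rest) for v, (r, l) in branches.items()}}

# Hypothetical admissions data: CGPA alone separates the classes perfectly
data = [
    {"cgpa": "high", "research": "yes"},
    {"cgpa": "high", "research": "no"},
    {"cgpa": "low",  "research": "yes"},
    {"cgpa": "low",  "research": "no"},
]
admit = ["A", "A", "R", "R"]
print(id3(data, admit, ["cgpa", "research"]))
# {'cgpa': {'high': 'A', 'low': 'R'}}
```

The resulting tree splits once on `cgpa` and ignores `research`, exactly as the information-gain comparison predicts.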
🌿 5.2 C4.5
Improved version of ID3.
Improvements:
Uses Gain Ratio
Handles continuous data
Handles missing values
Includes pruning
Very practical and widely used
🌿 5.3 CART (Classification and Regression Trees)
Uses:
Gini Impurity (classification)
Variance reduction (regression)
Produces:
Binary trees
Advantages:
Works for both classification & regression
Handles numerical data well
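For the regression side, variance reduction can be sketched as follows (the numeric data and thresholds are invented for illustration):

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def variance_reduction(xs, ys, threshold):
    """Score a CART-style binary split: x <= threshold vs. x > threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(ys)
    return variance(ys) - (len(left) / n * variance(left)
                           + len(right) / n * variance(right))

# Invented regression data with two clear clusters of target values
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 5.5, 5.2, 20.0, 21.0, 20.5]

# Splitting between the clusters removes far more variance than splitting inside one
print(variance_reduction(xs, ys, 3.0) > variance_reduction(xs, ys, 1.0))  # True
```

CART evaluates every candidate threshold this way and keeps the binary split with the largest reduction.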
✂️ 6. Overfitting and Pruning
Decision trees tend to overfit: a fully grown tree can memorize noise in the training data.
Types of Pruning:
🔹 Pre-pruning
Stop growing the tree early
Example:
Max depth
Minimum samples per node
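The pre-pruning rules above can be expressed as a single stopping predicate; the thresholds (`max_depth=3`, `min_samples=5`) are arbitrary illustrative defaults:

```python
def should_stop(labels, depth, max_depth=3, min_samples=5):
    """Pre-pruning: return True when any early-stopping rule fires."""
    return (depth >= max_depth             # depth cap reached
            or len(labels) < min_samples   # too few samples to split
            or len(set(labels)) == 1)      # node is already pure

print(should_stop(["A"] * 10, depth=3))             # True: depth cap
print(should_stop(["A", "B"], depth=0))             # True: too few samples
print(should_stop(["A"] * 5 + ["B"] * 5, depth=1))  # False: keep splitting
```

A recursive builder would call this check before each split and return a majority-class leaf whenever it fires.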
🔹 Post-pruning
Build full tree
Then remove weak branches
7. Advantages of Decision Trees
Easy to understand
No need for normalization
Handles categorical + numerical data
Interpretable
⚠️ 8. Disadvantages
Overfitting
Unstable (small data change → different tree)
Greedy algorithm (not globally optimal)
🧪 9. Applications
Medical diagnosis
Credit scoring
Fraud detection
Bioinformatics
🎯 10. Key Insight (Important)
Decision trees are essentially:
A greedy search guided by Information Theory
They repeatedly ask:
“Which feature reduces uncertainty the most?”
🧾 Conclusion
Decision Tree Learning combines:
Simple structure (tree)
Powerful math (entropy, information gain)
Efficient algorithms (ID3, C4.5, CART)
Understanding information theory is crucial because it explains why a tree splits the way it does—not just how.