Complete Study Guide · Prof. Christina Albert

Data
Mining

A comprehensive, structured revision covering all 4 chapters — from data fundamentals to neural networks. Includes a 40-question quiz.

01 · Introduction to Data Mining

What is Data Mining?

Non-trivial extraction of implicit, previously unknown, and potentially useful information from data. It involves exploration and analysis — by automatic or semi-automatic means — of large quantities of data to discover meaningful patterns.

Why Data Mining?
  • Explosive growth in data from e-commerce, social media, banking, satellites
  • Cheaper, more powerful computing resources
  • Competitive pressure to provide better, customized services
  • Massive scientific datasets (NASA, genomics, simulations)
Origins & Disciplines
  • Machine Learning / AI
  • Pattern Recognition
  • Statistics
  • Database Systems

A key component of the emerging field of Data Science & Data-Driven Discovery.

Scale Challenges

Traditional techniques may be unsuitable for data that is:

  • Large-scale (petabytes)
  • High-dimensional
  • Heterogeneous & Complex
  • Distributed across sources
Core distinction: Data Mining is about discovering patterns from data, not programming rules explicitly. The insight is extracted, not written.
Data Mining Tasks
Category | Task | Description | Example
Predictive | Classification | Predict discrete class labels | Fraud detection, spam filter
Predictive | Regression | Predict continuous values | House price prediction
Descriptive | Clustering | Group similar records together | Customer segmentation
Descriptive | Association Rules | Find co-occurring items | Market basket analysis
Descriptive | Anomaly Detection | Find unusual data points | Intrusion detection, fraud
Classification — How it Works
Training Set → Learn Model → Test Set → Predict
  • Model learns from labeled training data
  • Applied to unseen test data for prediction
  • Examples: fraud detection, churn prediction, galaxy cataloging
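The learn-then-predict flow above can be sketched with a tiny 1-nearest-neighbor classifier (a toy example; all points and labels are invented):

```python
import math

# Toy labeled training set: (attribute vector x, class label y)
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((8.0, 8.0), "B"), ((9.0, 7.5), "B")]

def predict(x):
    """Classify x by the label of its nearest training record (1-NN)."""
    nearest = min(train, key=lambda rec: math.dist(x, rec[0]))
    return nearest[1]

# Apply the "model" (here, the stored training set) to unseen test points.
print(predict((1.2, 1.4)))  # falls near cluster A
print(predict((8.5, 8.0)))  # falls near cluster B
```

Here the "model" is simply the stored training data; tree, Bayesian, and neural classifiers instead learn a compact function from the same labeled records.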
Societal Applications
  • Predicting impact of climate change
  • Improving healthcare & reducing costs
  • Reducing hunger by boosting agriculture
  • Finding alternative / green energy sources
  • Cyber security intrusion detection
  • Sky survey cataloging (stars, quasars)
02 · Data — Types, Quality & Preprocessing

Core Concepts
  • Attribute — a property/characteristic of an object (a.k.a. variable, feature, field, dimension)
  • Object — a collection of attributes (a.k.a. record, point, case, instance)
  • Attribute Values — numbers or symbols assigned to an attribute for a particular object
Attribute Types
Type | Properties | Example
Nominal | Distinctness only | Eye color, zip code, ID
Ordinal | Distinct + order | Grades, rankings, {short, medium, tall}
Interval | + meaningful differences | Temperature (°C/°F), dates
Ratio | + meaningful ratios | Kelvin, length, count, elapsed time
Key Rule: Nominal → only = ≠ | Ordinal → adds < > | Interval → adds + − | Ratio → adds × ÷. Each level supports all operations of the levels before it.
Discrete vs Continuous
  • Discrete — finite or countably infinite values (zip codes, counts, words). Binary attributes are a special case. Usually represented as integers.
  • Continuous — real-number values (temperature, height, weight). Represented as floating-point variables.
  • Asymmetric — only presence (non-zero value) is important (e.g., items in transactions, words in documents)
Types of Datasets
  • Record Data — fixed set of attributes per object
  • Data Matrix — m×n numeric matrix (rows=objects, cols=attributes)
  • Document Data — term-frequency vectors
  • Transaction Data — sets of items per transaction
  • Graph Data — nodes + edges (web, molecules)
  • Ordered — spatial, temporal, sequential, genomic
Data Quality Problems
Problem | Description | Handling
Noise | Modification of original values / extraneous objects; distorts the signal | Smoothing, robust algorithms
Outliers | Objects considerably different from the rest; can be noise OR the target | Detection algorithms; keep if they are the goal
Missing Values | Not collected, or not applicable to all cases | Eliminate / estimate / ignore
Duplicate Data | Same or almost-same records; common when merging sources | Data cleaning (deduplication)
Wrong / Fake Data | Incorrect values entered or fabricated | Validation, cross-referencing
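One common way to estimate missing values, sketched here with invented numbers, is mean imputation: fill each gap with the mean of the observed values of that attribute.

```python
def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# Mean of the observed values 4, 6, 8 is 6, so both gaps become 6.0.
print(mean_impute([4.0, None, 6.0, None, 8.0]))
```

Imputation is only one of the three options in the table; eliminating incomplete records or letting the algorithm ignore gaps may be preferable when missingness is not random.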
Data Preprocessing Pipeline
Raw Data → Aggregate → Sample → Transform → Reduce → Ready
Aggregation
  • Combining two or more attributes (or objects) into a single one
  • Reduces number of attributes or objects
  • Allows change of scale (cities → regions → countries)
  • Aggregated data has less variability — more stable
Discretization
  • Converts continuous attribute to ordinal
  • Unsupervised: Equal-width, Equal-frequency (percentiles), K-means clustering
  • Supervised: Uses class labels to guide splits
  • Static (once at start) or Dynamic (at each node)
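The two unsupervised schemes can be sketched in a few lines (toy values; the bin count k is chosen arbitrarily):

```python
# Unsupervised discretization of a continuous attribute (toy values).
values = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

def equal_width_bins(vals, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    # Clamp so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in vals]

def equal_frequency_bins(vals, k):
    """Assign bins so each holds (roughly) the same number of values."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(vals)
    return bins

print(equal_width_bins(values, 3))      # bins by value range
print(equal_frequency_bins(values, 3))  # bins by count per bin
```

Note how the skewed values land unevenly in equal-width bins but evenly in equal-frequency bins; that difference is the main reason to prefer one scheme over the other.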
Dimensionality Reduction
  • Curse of Dimensionality — data becomes increasingly sparse as dimensions grow; distance/density metrics lose meaning
  • PCA — finds projection capturing maximum variance
  • SVD — Singular Value Decomposition
  • Helps reduce noise, time, memory; aids visualization
Feature Selection & Creation
  • Redundant features — duplicate info from other attributes (e.g., price + tax)
  • Irrelevant features — no useful info (e.g., student ID for GPA prediction)
  • Feature Extraction — from raw data (edges from images)
  • Feature Construction — combine existing (mass/volume = density)
  • Mapping — Fourier, wavelet transforms
Attribute Transformation
  • Maps values to new replacement values
  • Simple functions: x^k, log(x), e^x, |x|
  • Normalization — adjusts for differences in frequency, mean, variance, range
  • Standardization — subtract mean, divide by std dev (z-score)
  • Removes unwanted signals like seasonality
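A minimal z-score standardization sketch (invented values):

```python
import math

def standardize(vals):
    """z-score: subtract the mean, divide by the standard deviation."""
    n = len(vals)
    mean = sum(vals) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
    return [(v - mean) / std for v in vals]

z = standardize([10.0, 20.0, 30.0, 40.0, 50.0])
# Standardized values have mean 0 and standard deviation 1.
print([round(v, 3) for v in z])
```

After standardization, attributes measured on very different scales (e.g., income in dollars vs. age in years) contribute comparably to distance-based computations.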
03 · Classification: Basic Concepts & Techniques

Classification Defined

Given a training set of records each characterized by (x, y) where x is the attribute set and y is the class label, learn a model that maps each x into one of the predefined class labels y.


  • x = attribute / predictor / independent variable / input
  • y = class / response / dependent variable / output
Classification Techniques

Base Classifiers:

  • Decision Tree based Methods
  • Rule-based Methods
  • Nearest-Neighbor (k-NN)
  • Naïve Bayes & Bayesian Belief Networks
  • Support Vector Machines (SVM)
  • Neural Networks / Deep Neural Nets

Ensemble: Boosting, Bagging, Random Forests

Building a Classification Model — General Flow
Raw Data → Training Set → Learn Classifier → Model → Test Set → Predictions
Decision Trees — Core Concepts
Concept | Explanation
Root Node | Top-most node; represents the entire dataset
Internal Node | Splitting attribute test
Leaf Node | Class label assignment
Branch | Outcome of a test condition
Splitting Attribute | The attribute used to divide data at each node
Hunt's Algorithm
  • One of the earliest DT induction algorithms
  • If Dₜ contains only records of class y → make it a leaf node labeled y
  • If Dₜ contains records of multiple classes → split using an attribute test, recurse on each subset
  • Basis for CART, ID3, C4.5, SLIQ, SPRINT
Splitting Conditions by Attribute Type
  • Binary — already two-way split
  • Nominal — multi-way (one per value) or binary (group into two subsets)
  • Ordinal — same as nominal but order property must be preserved
  • Continuous — binary (A < v or A ≥ v), find best cut; or discretize first
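Finding the best cut point v for a continuous attribute can be sketched as a scan over candidate midpoints, scored by weighted GINI impurity (toy data; GINI is one of several impurity measures used in decision tree induction):

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Scan candidate cuts v and pick the split A < v that minimizes
    the weighted Gini impurity of the two resulting subsets."""
    order = sorted(zip(values, labels))
    best = (float("inf"), None)
    # Candidate cuts: midpoints between consecutive sorted values.
    for i in range(1, len(order)):
        cut = (order[i - 1][0] + order[i][0]) / 2
        left = [y for v, y in order[:i]]
        right = [y for v, y in order[i:]]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if w < best[0]:
            best = (w, cut)
    return best  # (weighted impurity, cut point)

# Toy data with a clean class boundary between 4 and 5.
impurity, cut = best_binary_split([1, 2, 3, 4, 5, 6, 7, 8],
                                  ["N", "N", "N", "N", "Y", "Y", "Y", "Y"])
print(impurity, cut)
```

Because the toy classes separate perfectly at 4.5, the weighted impurity of the best split is 0; real attributes rarely split this cleanly, and the induction algorithm simply takes the lowest-impurity cut available.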
Decision Tree — Advantages
  • Relatively inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small trees
  • Robust to noise (with overfitting prevention)
  • Handles redundant & irrelevant attributes easily
Decision Tree — Disadvantages
  • Greedy splitting may miss interacting attributes
  • Each decision boundary involves only a single attribute
  • Multiple trees may fit same data (not unique)
  • Can overfit without pruning mechanisms

Model Overfitting

Types of Errors
Error Type | Definition
Training Error | Errors on the training set (the data the model was built from)
Test Error | Errors on the held-out test set
Generalization Error | Expected error on a random record drawn from the same distribution
Underfitting vs Overfitting
  • Underfitting — model too simple; both training AND test error are large
  • Overfitting — model too complex; training error is small but test error is large
  • As complexity increases, test error first decreases then increases
  • Optimal complexity sits at the minimum of test error
Core insight: Training error is NOT a reliable estimate of how well the model will perform on unseen data. A model that memorizes the training set will have near-zero training error but high generalization error.
Causes of Overfitting
  • Not enough training data — model clings to noise
  • High model complexity — too many parameters relative to data
  • Multiple Comparison Procedure — selecting the "best" from many randomly generated models inflates apparent accuracy
Multiple Comparison Procedure

Example: 50 analysts each make 10 random stock predictions. Probability that at least one analyst gets ≥8 correct is surprisingly high — yet the prediction is pure chance. Selecting the "best" model this way leads to false confidence.
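The arithmetic behind this example can be checked directly, assuming each prediction is an independent fair coin flip:

```python
from math import comb

# P(one analyst gets >= 8 of 10 random yes/no predictions right)
p_one = sum(comb(10, k) for k in range(8, 11)) / 2 ** 10

# P(at least one of 50 independent analysts does so)
p_any = 1 - (1 - p_one) ** 50

print(round(p_one, 4), round(p_any, 4))  # roughly 0.055 and 0.94
```

Any single analyst clearing the bar would be mildly surprising, but with 50 tries it is close to certain, which is exactly why picking the "best" of many random models inflates apparent accuracy.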

Impact of Training Data Size
  • Increasing training data reduces the gap between training and test errors at any given model size
  • More data forces the model to learn genuine patterns rather than noise
  • Even a 50-node tree performs better with twice the training data
Model Selection Strategies
  • Validation Set — hold out a portion of training data to estimate generalization error
  • Incorporating Model Complexity — penalize complex models (MDL, Structural Risk Minimization)
  • Goal: find the simplest model that explains the data well
Pruning Strategies for Decision Trees
Strategy | When | How
Pre-Pruning (Early Stopping) | During tree construction | Stop growing if: all records have the same class; all have the same attribute values; too few instances; class distribution independent of features; no impurity improvement; generalization error below threshold
Post-Pruning | After full tree grown | Bottom-up subtree replacement: trim if generalization error improves; leaf label = majority class of the trimmed subtree
Decision Tree Complexity vs Error — Key Takeaway
  • 4-node tree: higher training error but generalizes better
  • 50-node tree: near-zero training error, but overfits badly on test data
  • More data at any fixed tree size reduces test error significantly
  • Overfitting produces trees more complex than necessary — pruning is essential
04 · Artificial Neural Networks

Core Idea

A complex non-linear function learned as a composition of simple processing units.

  • Collection of nodes (neurons) connected by directed edges
  • Each node: receives signals, computes, transmits
  • Edge weight = strength of connection
  • Analogous to biological neurons + electrical impulses
Perceptron (Single Neuron)
  • Simplest ANN — one layer of computation
  • Computes a weighted sum of inputs + applies threshold
  • Can only solve linearly separable problems
  • Fails on XOR — no linear hyperplane separates XOR data
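A minimal perceptron with the classic learning rule illustrates both points: it converges on the linearly separable AND function but cannot reach perfect accuracy on XOR (toy sketch; the learning rate and epoch count are chosen arbitrarily):

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """Perceptron learning rule: w <- w + lr * (y - yhat) * x."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            yhat = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - yhat
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def accuracy(data, w, b):
    hits = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == y
               for (x1, x2), y in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(accuracy(AND, *train_perceptron(AND)))  # linearly separable: 1.0
print(accuracy(XOR, *train_perceptron(XOR)))  # not separable: stays < 1.0
```

No setting of w and b can separate XOR, because its positive points sit on opposite corners of the unit square; no amount of extra training fixes this.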
Multi-layer Neural Networks
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ··· → Output Layer
  • Also called feedforward neural networks
  • Each hidden layer operates on activations from preceding layer
  • Multi-layer ANNs with ≥1 hidden layer can solve ANY classification task involving nonlinear decision surfaces
  • Hidden layers represent levels of abstraction — complex features as compositions of simpler features
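As a concrete sketch, one hidden layer is already enough to represent XOR, the function that defeats a single perceptron. The weights below are hand-picked for illustration, not learned:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Feedforward pass: each layer transforms the previous activations."""
    a = x
    for weights, biases in layers:
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a

# Hand-picked weights computing XOR:
# hidden unit 1 ~ OR(x1, x2), hidden unit 2 ~ AND(x1, x2),
# output ~ OR and not AND.
xor_net = [
    ([[20, 20], [20, 20]], [-10, -30]),  # hidden layer
    ([[20, -20]], [-10]),                # output layer
]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(forward(list(x), xor_net)[0]))
```

The hidden units act as intermediate features (OR and AND of the inputs), and the output combines them: exactly the "complex features as compositions of simpler features" idea above.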
Why Multiple Hidden Layers?
  • Each layer extracts features at increasing abstraction
  • Example: Image recognition: edges → shapes → facial parts → face
  • Depth = number of layers; deeper networks express complex hierarchies
  • Deep networks often require fewer total parameters than shallow networks to represent the same function
Design Issues
Parameter | Rule of Thumb
Input nodes | One per binary/continuous attribute; k or log₂k per k-valued categorical attribute
Output nodes | One for binary; k or log₂k for k classes
Hidden layers/nodes | Tune empirically (no fixed rule)
Hyperparameters | Learning rate, epochs, mini-batch size, initial weights & biases
ANN Characteristics
Characteristic | Detail
Universal Approximators | Multi-layer ANNs can approximate any continuous function — but can overfit if too large
Feature Hierarchy | Naturally represents features at multiple levels of abstraction
Gradient Descent | Optimization may converge to a local minimum, not the global one
Training vs Testing | Model building is compute-intensive; testing is very fast
Redundant Attributes | Handled automatically — weights are learned for all attributes
Noise Sensitivity | Sensitive to noise in the training data
Missing Data | Difficult to handle missing attribute values
Deep Learning — Why Now?
Hardware
  • Faster GPUs (parallel computation)
  • Distributed training infrastructure
Data
  • Larger labeled training datasets
  • ImageNet, web-scraped corpora
Algorithmic Improvements
  • ReLU activation (cheap to compute; mitigates the vanishing-gradient problem)
  • Dropout regularization
  • Batch normalization
  • Supervised & unsupervised pre-training
Specialized Architectures
  • CNN — Convolutional Neural Networks (images)
  • RNN — Recurrent Neural Networks (sequences)
  • ResNet — Residual Networks (skip connections)
  • GAN — Generative Adversarial Networks

⚡ Comprehensive Quiz

40 Questions