Complete Study Guide · Prof. Christina Albert

Data
Mining

A comprehensive, structured revision covering all 4 chapters — from data fundamentals to neural networks. Includes a 40-question quiz.

01 · Introduction to Data Mining

What is Data Mining?

Non-trivial extraction of implicit, previously unknown, and potentially useful information from data. It involves exploration and analysis — by automatic or semi-automatic means — of large quantities of data to discover meaningful patterns.

Why Data Mining?
  • Explosive growth in data from e-commerce, social media, banking, satellites
  • Cheaper, more powerful computing resources
  • Competitive pressure to provide better, customized services
  • Massive scientific datasets (NASA, genomics, simulations)
Origins & Disciplines
  • Machine Learning / AI
  • Pattern Recognition
  • Statistics
  • Database Systems

A key component of the emerging field of Data Science & Data-Driven Discovery.

Scale Challenges

Traditional techniques may be unsuitable for data that is:

  • Large-scale (petabytes)
  • High-dimensional
  • Heterogeneous & Complex
  • Distributed across sources
Core distinction: Data Mining is about discovering patterns from data, not programming rules explicitly. The insight is extracted, not written.
Data Mining Tasks
Category | Task | Description | Example
Predictive | Classification | Predict discrete class labels | Fraud detection, spam filter
Predictive | Regression | Predict continuous values | House price prediction
Descriptive | Clustering | Group similar records together | Customer segmentation
Descriptive | Association Rules | Find co-occurring items | Market basket analysis
Descriptive | Anomaly Detection | Find unusual data points | Intrusion detection, fraud
Classification — How it Works
Training Set → Learn Model → Test Set → Predict
  • Model learns from labeled training data
  • Applied to unseen test data for prediction
  • Examples: fraud detection, churn prediction, galaxy cataloging
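The learn-then-predict flow above can be sketched with a tiny 1-nearest-neighbor classifier (a toy example; all points and labels are invented):

```python
import math

# Toy labeled training set: (attribute vector x, class label y)
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((8.0, 8.0), "B"), ((9.0, 7.5), "B")]

def predict(x):
    """Classify x by the label of its nearest training record (1-NN)."""
    nearest = min(train, key=lambda rec: math.dist(x, rec[0]))
    return nearest[1]

# Apply the "model" (here, the stored training set) to unseen test points.
print(predict((1.2, 1.4)))  # falls near cluster A
print(predict((8.5, 8.0)))  # falls near cluster B
```

Here the "model" is simply the stored training data; tree, Bayesian, and neural classifiers instead learn a compact function from the same labeled records.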
Societal Applications
  • Predicting impact of climate change
  • Improving healthcare & reducing costs
  • Reducing hunger by boosting agriculture
  • Finding alternative / green energy sources
  • Cyber security intrusion detection
  • Sky survey cataloging (stars, quasars)
02 · Data — Types, Quality & Preprocessing

Core Concepts
  • Attribute — a property/characteristic of an object (a.k.a. variable, feature, field, dimension)
  • Object — a collection of attributes (a.k.a. record, point, case, instance)
  • Attribute Values — numbers or symbols assigned to an attribute for a particular object
Attribute Types
Type | Properties | Example
Nominal | Distinctness only | Eye color, zip code, ID
Ordinal | Distinct + order | Grades, rankings, {short, medium, tall}
Interval | + meaningful differences | Temperature (°C/°F), dates
Ratio | + meaningful ratios | Kelvin, length, count, elapsed time
Key Rule: Nominal → only = ≠ | Ordinal → adds < > | Interval → adds + − | Ratio → adds × ÷. Each level supports all operations of the levels before it.
Discrete vs Continuous
  • Discrete — finite or countably infinite values (zip codes, counts, words). Binary attributes are a special case. Usually represented as integers.
  • Continuous — real-number values (temperature, height, weight). Represented as floating-point variables.
  • Asymmetric — only presence (non-zero value) is important (e.g., items in transactions, words in documents)
Types of Datasets
  • Record Data — fixed set of attributes per object
  • Data Matrix — m×n numeric matrix (rows=objects, cols=attributes)
  • Document Data — term-frequency vectors
  • Transaction Data — sets of items per transaction
  • Graph Data — nodes + edges (web, molecules)
  • Ordered — spatial, temporal, sequential, genomic
Data Quality Problems
Problem | Description | Handling
Noise | Modification of original values / extraneous objects; distorts the signal | Smoothing, robust algorithms
Outliers | Objects considerably different from the rest; can be noise OR the target | Detection algorithms; keep if they are the goal
Missing Values | Not collected, or not applicable to all cases | Eliminate / estimate / ignore
Duplicate Data | Same or almost-same records; common when merging sources | Data cleaning (deduplication)
Wrong / Fake Data | Incorrect values entered or fabricated | Validation, cross-referencing
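One common way to estimate missing values, sketched here with invented numbers, is mean imputation: fill each gap with the mean of the observed values of that attribute.

```python
def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# Mean of the observed values 4, 6, 8 is 6, so both gaps become 6.0.
print(mean_impute([4.0, None, 6.0, None, 8.0]))
```

Imputation is only one of the three options in the table; eliminating incomplete records or letting the algorithm ignore gaps may be preferable when missingness is not random.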
Data Preprocessing Pipeline
Raw Data → Aggregate → Sample → Transform → Reduce → Ready
Aggregation
  • Combining two or more attributes (or objects) into a single one
  • Reduces number of attributes or objects
  • Allows change of scale (cities → regions → countries)
  • Aggregated data has less variability — more stable
Discretization
  • Converts continuous attribute to ordinal
  • Unsupervised: Equal-width, Equal-frequency (percentiles), K-means clustering
  • Supervised: Uses class labels to guide splits
  • Static (once at start) or Dynamic (at each node)
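The two unsupervised schemes can be sketched in a few lines (toy values; the bin count k is chosen arbitrarily):

```python
# Unsupervised discretization of a continuous attribute (toy values).
values = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

def equal_width_bins(vals, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    # Clamp so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in vals]

def equal_frequency_bins(vals, k):
    """Assign bins so each holds (roughly) the same number of values."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(vals)
    return bins

print(equal_width_bins(values, 3))      # bins by value range
print(equal_frequency_bins(values, 3))  # bins by count per bin
```

Note how the skewed values land unevenly in equal-width bins but evenly in equal-frequency bins; that difference is the main reason to prefer one scheme over the other.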
Dimensionality Reduction
  • Curse of Dimensionality — data becomes increasingly sparse as dimensions grow; distance/density metrics lose meaning
  • PCA — finds projection capturing maximum variance
  • SVD — Singular Value Decomposition
  • Helps reduce noise, time, memory; aids visualization
Feature Selection & Creation
  • Redundant features — duplicate info from other attributes (e.g., price + tax)
  • Irrelevant features — no useful info (e.g., student ID for GPA prediction)
  • Feature Extraction — from raw data (edges from images)
  • Feature Construction — combine existing (mass/volume = density)
  • Mapping — Fourier, wavelet transforms
Attribute Transformation
  • Maps values to new replacement values
  • Simple functions: x^k, log(x), e^x, |x|
  • Normalization — adjusts for differences in frequency, mean, variance, range
  • Standardization — subtract mean, divide by std dev (z-score)
  • Removes unwanted signals like seasonality
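A minimal z-score standardization sketch (invented values):

```python
import math

def standardize(vals):
    """z-score: subtract the mean, divide by the standard deviation."""
    n = len(vals)
    mean = sum(vals) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
    return [(v - mean) / std for v in vals]

z = standardize([10.0, 20.0, 30.0, 40.0, 50.0])
# Standardized values have mean 0 and standard deviation 1.
print([round(v, 3) for v in z])
```

After standardization, attributes measured on very different scales (e.g., income in dollars vs. age in years) contribute comparably to distance-based computations.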
03 · Classification: Basic Concepts & Techniques

Classification Defined

Given a training set of records each characterized by (x, y) where x is the attribute set and y is the class label, learn a model that maps each x into one of the predefined class labels y.


  • x = attribute / predictor / independent variable / input
  • y = class / response / dependent variable / output
Classification Techniques

Base Classifiers:

  • Decision Tree based Methods
  • Rule-based Methods
  • Nearest-Neighbor (k-NN)
  • Naïve Bayes & Bayesian Belief Networks
  • Support Vector Machines (SVM)
  • Neural Networks / Deep Neural Nets

Ensemble: Boosting, Bagging, Random Forests

Building a Classification Model — General Flow
Raw Data → Training Set → Learn Classifier → Model → Test Set → Predictions
Decision Trees — Core Concepts
Concept | Explanation
Root Node | Top-most node; represents the entire dataset
Internal Node | Splitting attribute test
Leaf Node | Class label assignment
Branch | Outcome of a test condition
Splitting Attribute | The attribute used to divide data at each node
Hunt's Algorithm
  • One of the earliest DT induction algorithms
  • If Dₜ contains only records of class y → make it a leaf node labeled y
  • If Dₜ contains records of multiple classes → split using an attribute test, recurse on each subset
  • Basis for CART, ID3, C4.5, SLIQ, SPRINT
Splitting Conditions by Attribute Type
  • Binary — already two-way split
  • Nominal — multi-way (one per value) or binary (group into two subsets)
  • Ordinal — same as nominal but order property must be preserved
  • Continuous — binary (A < v or A ≥ v), find best cut; or discretize first
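Finding the best cut point v for a continuous attribute can be sketched as a scan over candidate midpoints, scored by weighted GINI impurity (toy data; GINI is one of several impurity measures used in decision tree induction):

```python
def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Scan candidate cuts v and pick the split A < v that minimizes
    the weighted Gini impurity of the two resulting subsets."""
    order = sorted(zip(values, labels))
    best = (float("inf"), None)
    # Candidate cuts: midpoints between consecutive sorted values.
    for i in range(1, len(order)):
        cut = (order[i - 1][0] + order[i][0]) / 2
        left = [y for v, y in order[:i]]
        right = [y for v, y in order[i:]]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if w < best[0]:
            best = (w, cut)
    return best  # (weighted impurity, cut point)

# Toy data with a clean class boundary between 4 and 5.
impurity, cut = best_binary_split([1, 2, 3, 4, 5, 6, 7, 8],
                                  ["N", "N", "N", "N", "Y", "Y", "Y", "Y"])
print(impurity, cut)
```

Because the toy classes separate perfectly at 4.5, the weighted impurity of the best split is 0; real attributes rarely split this cleanly, and the induction algorithm simply takes the lowest-impurity cut available.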
Decision Tree — Advantages
  • Relatively inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small trees
  • Robust to noise (with overfitting prevention)
  • Handles redundant & irrelevant attributes easily
Decision Tree — Disadvantages
  • Greedy splitting may miss interacting attributes
  • Each decision boundary involves only a single attribute
  • Multiple trees may fit same data (not unique)
  • Can overfit without pruning mechanisms

Model Overfitting

Types of Errors
Error Type | Definition
Training Error | Errors on the training set (the data the model was built from)
Test Error | Errors on the held-out test set
Generalization Error | Expected error on a random record drawn from the same distribution
Underfitting vs Overfitting
  • Underfitting — model too simple; both training AND test error are large
  • Overfitting — model too complex; training error is small but test error is large
  • As complexity increases, test error first decreases then increases
  • Optimal complexity sits at the minimum of test error
Core insight: Training error is NOT a reliable estimate of how well the model will perform on unseen data. A model that memorizes the training set will have near-zero training error but high generalization error.
Causes of Overfitting
  • Not enough training data — model clings to noise
  • High model complexity — too many parameters relative to data
  • Multiple Comparison Procedure — selecting the "best" from many randomly generated models inflates apparent accuracy
Multiple Comparison Procedure

Example: 50 analysts each make 10 random stock predictions. Probability that at least one analyst gets ≥8 correct is surprisingly high — yet the prediction is pure chance. Selecting the "best" model this way leads to false confidence.
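The arithmetic behind this example can be checked directly, assuming each prediction is an independent fair coin flip:

```python
from math import comb

# P(one analyst gets >= 8 of 10 random yes/no predictions right)
p_one = sum(comb(10, k) for k in range(8, 11)) / 2 ** 10

# P(at least one of 50 independent analysts does so)
p_any = 1 - (1 - p_one) ** 50

print(round(p_one, 4), round(p_any, 4))  # roughly 0.055 and 0.94
```

Any single analyst clearing the bar would be mildly surprising, but with 50 tries it is close to certain, which is exactly why picking the "best" of many random models inflates apparent accuracy.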

Impact of Training Data Size
  • Increasing training data reduces the gap between training and test errors at any given model size
  • More data forces the model to learn genuine patterns rather than noise
  • Even a 50-node tree performs better with twice the training data
Model Selection Strategies
  • Validation Set — hold out a portion of training data to estimate generalization error
  • Incorporating Model Complexity — penalize complex models (MDL, Structural Risk Minimization)
  • Goal: find the simplest model that explains the data well
Pruning Strategies for Decision Trees
Strategy | When | How
Pre-Pruning (Early Stopping) | During tree construction | Stop growing if: all records have the same class; all have the same attribute values; too few instances; class distribution independent of features; no impurity improvement; generalization error below threshold
Post-Pruning | After full tree grown | Bottom-up subtree replacement: trim if generalization error improves; leaf label = majority class of the trimmed subtree
Decision Tree Complexity vs Error — Key Takeaway
  • 4-node tree: higher training error but generalizes better
  • 50-node tree: near-zero training error, but overfits badly on test data
  • More data at any fixed tree size reduces test error significantly
  • Overfitting produces trees more complex than necessary — pruning is essential
04 · Artificial Neural Networks

Core Idea

A complex non-linear function learned as a composition of simple processing units.

  • Collection of nodes (neurons) connected by directed edges
  • Each node: receives signals, computes, transmits
  • Edge weight = strength of connection
  • Analogous to biological neurons + electrical impulses
Perceptron (Single Neuron)
  • Simplest ANN — one layer of computation
  • Computes a weighted sum of inputs + applies threshold
  • Can only solve linearly separable problems
  • Fails on XOR — no linear hyperplane separates XOR data
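A minimal perceptron with the classic learning rule illustrates both points: it converges on the linearly separable AND function but cannot reach perfect accuracy on XOR (toy sketch; the learning rate and epoch count are chosen arbitrarily):

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """Perceptron learning rule: w <- w + lr * (y - yhat) * x."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            yhat = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - yhat
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def accuracy(data, w, b):
    hits = sum((1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == y
               for (x1, x2), y in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(accuracy(AND, *train_perceptron(AND)))  # linearly separable: 1.0
print(accuracy(XOR, *train_perceptron(XOR)))  # not separable: stays < 1.0
```

No setting of w and b can separate XOR, because its positive points sit on opposite corners of the unit square; no amount of extra training fixes this.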
Multi-layer Neural Networks
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ··· → Output Layer
  • Also called feedforward neural networks
  • Each hidden layer operates on activations from preceding layer
  • Multi-layer ANNs with ≥1 hidden layer can solve ANY classification task involving nonlinear decision surfaces
  • Hidden layers represent levels of abstraction — complex features as compositions of simpler features
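As a concrete sketch, one hidden layer is already enough to represent XOR, the function that defeats a single perceptron. The weights below are hand-picked for illustration, not learned:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Feedforward pass: each layer transforms the previous activations."""
    a = x
    for weights, biases in layers:
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a

# Hand-picked weights computing XOR:
# hidden unit 1 ~ OR(x1, x2), hidden unit 2 ~ AND(x1, x2),
# output ~ OR and not AND.
xor_net = [
    ([[20, 20], [20, 20]], [-10, -30]),  # hidden layer
    ([[20, -20]], [-10]),                # output layer
]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(forward(list(x), xor_net)[0]))
```

The hidden units act as intermediate features (OR and AND of the inputs), and the output combines them: exactly the "complex features as compositions of simpler features" idea above.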
Why Multiple Hidden Layers?
  • Each layer extracts features at increasing abstraction
  • Example: Image recognition: edges → shapes → facial parts → face
  • Depth = number of layers; deeper networks express complex hierarchies
  • Deep networks often require fewer total parameters than shallow networks to represent the same function
Design Issues
Parameter | Rule of Thumb
Input nodes | One per binary/continuous attribute; k or log₂k per k-valued categorical attribute
Output nodes | One for binary; k or log₂k for k classes
Hidden layers/nodes | Tune empirically (no fixed rule)
Hyperparameters | Learning rate, epochs, mini-batch size, initial weights & biases
ANN Characteristics
Characteristic | Detail
Universal Approximators | Multi-layer ANNs can approximate any continuous function — but can overfit if too large
Feature Hierarchy | Naturally represents features at multiple levels of abstraction
Gradient Descent | Optimization may converge to a local minimum, not the global one
Training vs Testing | Model building is compute-intensive; testing is very fast
Redundant Attributes | Handled automatically — weights are learned for all attributes
Noise Sensitivity | Sensitive to noise in the training data
Missing Data | Difficult to handle missing attribute values
Deep Learning — Why Now?
Hardware
  • Faster GPUs (parallel computation)
  • Distributed training infrastructure
Data
  • Larger labeled training datasets
  • ImageNet, web-scraped corpora
Algorithmic Improvements
  • ReLU activation (cheap to compute; mitigates the vanishing-gradient problem)
  • Dropout regularization
  • Batch normalization
  • Supervised & unsupervised pre-training
Specialized Architectures
  • CNN — Convolutional Neural Networks (images)
  • RNN — Recurrent Neural Networks (sequences)
  • ResNet — Residual Networks (skip connections)
  • GAN — Generative Adversarial Networks

⚡ Comprehensive Quiz

40 Questions