A comprehensive, structured revision covering all 4 chapters — from data fundamentals to neural networks. Includes a 40-question quiz.
Data mining is the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. It involves the exploration and analysis, by automatic or semi-automatic means, of large quantities of data to discover meaningful patterns.
A key component of the emerging field of Data Science & Data-Driven Discovery.
Traditional analysis techniques may be unsuitable for data that is large-scale, high-dimensional, heterogeneous, complex, or distributed.
| Category | Task | Description | Example |
|---|---|---|---|
| Predictive | Classification | Predict discrete class labels | Fraud detection, spam filter |
| Predictive | Regression | Predict continuous values | House price prediction |
| Descriptive | Clustering | Group similar records together | Customer segmentation |
| Descriptive | Association Rules | Find co-occurring items | Market basket analysis |
| Descriptive | Anomaly Detection | Find unusual data points | Intrusion detection, fraud |
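The predictive/descriptive split in the table comes down to whether labels are predicted. A minimal sketch with made-up data (the amounts, labels, and threshold are all illustrative):

```python
# Predictive (classification): labeled training data -> a rule mapping x to y.
train = [(120.0, "ok"), (95.0, "ok"), (5400.0, "fraud"), (4800.0, "fraud")]

def classify(amount, threshold=1000.0):
    """One-attribute threshold rule chosen by inspecting `train`."""
    return "fraud" if amount > threshold else "ok"

# Descriptive (clustering): unlabeled data -> groups of similar records.
amounts = [110.0, 130.0, 5000.0, 5200.0]
clusters = {0: [a for a in amounts if a <= 1000.0],
            1: [a for a in amounts if a > 1000.0]}

print(classify(7200.0))  # a new record gets a predefined label: "fraud"
print(clusters)          # records are only grouped; no label is predicted
```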
| Attribute Type | Properties | Example |
|---|---|---|
| Nominal | Distinctness only | Eye color, zip code, ID |
| Ordinal | + Order | Grades, rankings, {short, medium, tall} |
| Interval | + Meaningful differences | Temperature (°C/°F), calendar dates |
| Ratio | + Meaningful ratios | Temperature (K), length, count, elapsed time |
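The interval/ratio distinction is easiest to see numerically: Celsius has an arbitrary zero, so ratios of °C values are meaningless, while Kelvin has a true zero. A quick check:

```python
# Why ratios are meaningful for ratio attributes but not interval ones.
def c_to_k(celsius):
    """Convert from the interval scale (°C) to the ratio scale (K)."""
    return celsius + 273.15

# Interval scale (°C): arbitrary zero, so the ratio misleads.
print(20 / 10)                    # 2.0 -- but 20°C is not "twice as hot" as 10°C
# Ratio scale (K): true zero, so the ratio reflects physical reality.
print(c_to_k(20) / c_to_k(10))    # ≈ 1.035
```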
| Data Quality Problem | Description | Handling |
|---|---|---|
| Noise | Modification of original values / extraneous objects. Distorts signal. | Smoothing, robust algorithms |
| Outliers | Objects considerably different from the rest. Can be noise OR the target. | Detection algorithms; keep if goal |
| Missing Values | Not collected, or not applicable to all cases. | Eliminate / estimate / ignore |
| Duplicate Data | Same or almost-same records. Common when merging sources. | Data cleaning (deduplication) |
| Wrong / Fake Data | Incorrect values entered or fabricated. | Validation, cross-referencing |
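Two of the handling strategies above (deduplication, and estimating missing values) can be sketched in a few lines; the records and the mean-imputation choice are purely illustrative:

```python
# Minimal data-cleaning sketch for duplicates and missing values (toy records).
records = [
    {"id": 1, "age": 29,   "city": "Oslo"},
    {"id": 1, "age": 29,   "city": "Oslo"},    # duplicate from a merged source
    {"id": 2, "age": None, "city": "Bergen"},  # missing value
    {"id": 3, "age": 41,   "city": "Bergen"},
]

# Deduplication: keep only the first occurrence of each record.
seen, deduped = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Missing values: estimate with the mean of the observed ages.
ages = [r["age"] for r in deduped if r["age"] is not None]
mean_age = sum(ages) / len(ages)
for r in deduped:
    if r["age"] is None:
        r["age"] = mean_age

print(len(deduped), mean_age)  # 3 35.0
```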
Given a training set of records each characterized by (x, y) where x is the attribute set and y is the class label, learn a model that maps each x into one of the predefined class labels y.
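The (x, y) framing can be made concrete with the simplest possible "model": a majority-class baseline that ignores x entirely (data and names are illustrative — real classifiers learn from x):

```python
# Minimal instance of "learn a model that maps x to a predefined label y".
from collections import Counter

def fit_majority(training_set):
    """training_set: list of (x, y) pairs; returns a model mapping any x to a label."""
    majority = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: majority

train = [((1, 0), "no"), ((0, 1), "no"), ((1, 1), "yes")]
model = fit_majority(train)
print(model((0, 0)))  # "no" -- the majority class in the training set
```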
Base Classifiers: Decision Trees, Rule-based Methods, Nearest Neighbor, Naïve Bayes, Support Vector Machines, Neural Networks
Ensemble Classifiers: Boosting, Bagging, Random Forests
| Concept | Explanation |
|---|---|
| Root Node | Top-most node; represents the entire dataset |
| Internal Node | Splitting attribute test |
| Leaf Node | Class label assignment |
| Branch | Outcome of a test condition |
| Splitting Attribute | The attribute used to divide data at each node |
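The node roles in the table map directly onto nested tests. A hard-coded two-level tree (the attributes and thresholds are invented for illustration):

```python
# A decision tree as code: root test, internal test, and leaf labels.
def predict(record):
    # Root node: test the splitting attribute "income" on the whole dataset.
    if record["income"] > 50000:       # each outcome is a branch
        # Internal node: a second splitting-attribute test.
        if record["debt"] > 20000:
            return "deny"              # leaf node: class label
        return "approve"               # leaf node: class label
    return "deny"                      # leaf node: class label

print(predict({"income": 80000, "debt": 5000}))  # "approve"
```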
| Error Type | Definition |
|---|---|
| Training Error | Errors on the training set (data model was built from) |
| Test Error | Errors on the held-out test set |
| Generalization Error | Expected error on random selection from the same distribution |
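Training and test error are both just misclassification rates on different data; test error is the usable estimate of generalization error. A sketch with a toy threshold model (all values invented):

```python
# Computing the two measurable errors from the table.
def error_rate(model, data):
    """Fraction of (x, y) records the model mislabels."""
    wrong = sum(1 for x, y in data if model(x) != y)
    return wrong / len(data)

model = lambda x: "pos" if x >= 0.5 else "neg"
train = [(0.9, "pos"), (0.1, "neg"), (0.6, "pos"), (0.4, "pos")]  # model fit here
test  = [(0.8, "pos"), (0.3, "neg"), (0.7, "neg"), (0.2, "neg")]  # held out

print(error_rate(model, train))  # training error: 0.25
print(error_rate(model, test))   # test error: 0.25 (estimates generalization error)
```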
Example: 50 analysts each make 10 random stock predictions. The probability that at least one analyst gets ≥8 correct is surprisingly high, yet every prediction is pure chance. Selecting the "best" model this way leads to false confidence.
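The analyst example can be computed exactly with the binomial distribution, treating each prediction as a fair coin flip (p = 0.5):

```python
# "Surprisingly high", quantified.
from math import comb

# P(one analyst gets >= 8 of 10 right by pure chance)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # 56/1024 ≈ 0.0547

# P(at least one of 50 independent analysts does)
p_any = 1 - (1 - p_one) ** 50

print(round(p_one, 4), round(p_any, 2))  # 0.0547 0.94
```

So with 50 analysts there is roughly a 94% chance that someone looks like a star by luck alone.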
| Strategy | When | How |
|---|---|---|
| Pre-Pruning (Early Stopping) | During tree construction | Stop growing if: all same class; all same attribute values; too few instances; class dist. independent of features; no impurity improvement; gen. error below threshold |
| Post-Pruning | After full tree grown | Bottom-up subtree replacement: trim if generalization error improves. Leaf label = majority class of trimmed sub-tree. |
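The post-pruning row can be sketched as a recursive bottom-up pass: prune the children first, then replace the current subtree with a majority-class leaf if that does not increase error on the pruning data. This is a toy sketch (tuple-encoded trees, error measured on held-out data standing in for the generalization-error estimate):

```python
# Bottom-up subtree replacement: trim whenever a majority-class leaf
# does at least as well on the pruning data as the subtree it replaces.
from collections import Counter

def classify(node, x):
    if node[0] == "leaf":
        return node[1]
    _, attr, thr, left, right = node
    return classify(left if x[attr] <= thr else right, x)

def prune(node, data):
    """node: ("leaf", label) or ("split", attr, threshold, left, right)."""
    if node[0] == "leaf" or not data:
        return node
    _, attr, thr, left, right = node
    lo = [r for r in data if r[0][attr] <= thr]
    hi = [r for r in data if r[0][attr] > thr]
    node = ("split", attr, thr, prune(left, lo), prune(right, hi))  # children first

    majority = Counter(y for _, y in data).most_common(1)[0][0]
    leaf_errors = sum(1 for _, y in data if y != majority)
    tree_errors = sum(1 for x, y in data if classify(node, x) != y)
    # Leaf label = majority class of the trimmed subtree.
    return ("leaf", majority) if leaf_errors <= tree_errors else node

tree = ("split", 0, 5, ("leaf", "a"), ("split", 1, 3, ("leaf", "a"), ("leaf", "b")))
data = [((2, 0), "a"), ((7, 1), "a"), ((8, 9), "a")]  # the deep split never helps
print(prune(tree, data))  # ('leaf', 'a')
```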
An artificial neural network (ANN) is a complex non-linear function learned as a composition of simple processing units.
| Parameter | Rule of Thumb |
|---|---|
| Input nodes | One per binary/continuous attribute; k or log₂k for a k-valued categorical attribute |
| Output nodes | One for binary classification; k or log₂k for a k-class problem |
| Hidden layers/nodes | Tune empirically (no fixed rule) |
| Hyperparameters | Learning rate, epochs, mini-batch size, initial weights & biases |
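The "composition of simple processing units" can be shown with one forward pass through a tiny network sized by the rules above: 2 input nodes, 2 hidden units, 1 output for a binary problem (the weights and biases are arbitrary, not trained):

```python
# A single forward pass through a minimal fully-connected network.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each unit: weighted sum of inputs plus bias, through a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 0.0]                                             # 2 input nodes
hidden = layer(x, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])  # 2 hidden units
output = layer(hidden, [[1.0, -1.0]], [0.0])               # 1 output (binary)
print(output)  # ≈ [0.531]
```

Training would adjust the weights and biases by gradient descent; this sketch only shows how the units compose.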
| Characteristic | Detail |
|---|---|
| Universal Approximators | Multi-layer ANNs can approximate any continuous function — but can overfit if too large |
| Feature Hierarchy | Naturally represents features at multiple levels of abstraction |
| Gradient Descent | Optimization may converge to local minimum, not global |
| Training vs Testing | Model building is compute-intensive; testing is very fast |
| Redundant Attributes | Handled automatically — weights are learnt for all attributes |
| Noise Sensitivity | Sensitive to noise in training data |
| Missing Data | Difficult to handle missing attribute values |
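The local-minimum caveat in the table is easy to demonstrate: run the same gradient-descent update from two starting points on a function with two minima (the function f(x) = x⁴ − 2x² + x/2 is chosen purely for illustration):

```python
# Gradient descent converging to different minima depending on the start.
def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * (4 * x**3 - 4 * x + 0.5)   # f'(x) for f(x) = x^4 - 2x^2 + x/2
    return x

# Same algorithm, same learning rate, different starting points:
print(round(descend(0.5), 2))    # ≈ 0.93  (stuck in the local minimum)
print(round(descend(-0.5), 2))   # ≈ -1.06 (reaches the global minimum)
```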