Machine Learning A2: Project Journal
Author: Carlos Emiliano Mendoza Hernandez
Task: Chinese MNIST classification with Logistic Regression and MLP baselines.
Repo/Docs:
- Overview: https://emilianodesu.github.io/MLA2/
- Presentation: https://emilianodesu.github.io/MLA2/presentation.html
- Notebooks:
- Logistic Regression: https://emilianodesu.github.io/MLA2/notebooks/log_reg.html
- MLP: https://emilianodesu.github.io/MLA2/notebooks/mlp.html
- Preprocessing Utils: https://emilianodesu.github.io/MLA2/notebooks/preprocessing_utils.html
- Models Utils: https://emilianodesu.github.io/MLA2/notebooks/models_utils.html
- V0 baselines (first approach):
- V0/log_reg.py: https://github.com/emilianodesu/MLA2/blob/main/V0/log_reg.py
- V0/mlp.py: https://github.com/emilianodesu/MLA2/blob/main/V0/mlp.py
- V0/models_utils.py: https://github.com/emilianodesu/MLA2/blob/main/V0/models_utils.py
This journal synthesizes the implementation and results documented across the pages above.
A. Task Definition
Handwritten digit recognition (e.g., MNIST) is a canonical benchmark for testing machine learning models on visual pattern classification. The Chinese MNIST dataset extends this challenge from 10 Arabic numerals (0–9) to 15 Chinese numerals and related characters, introducing significant visual and structural complexity. Chinese numerals often include multiple intersecting strokes, nested radicals, or shared components that create high intra-class variability (due to handwriting differences) and high inter-class similarity (characters that differ by subtle radicals or stroke orientation). These factors make the task substantially more challenging than Latin digit classification and more representative of real-world OCR tasks involving non-Latin scripts.
The motivation for selecting handwritten Chinese character classification is both intellectual and practical. From a research perspective, it provides a more demanding testbed for classical and neural models, allowing for a richer analysis of model capacity, feature extraction, and decision boundaries. From an application standpoint, automated recognition of Chinese handwritten characters underpins important tasks such as digitizing historical archives, postal address recognition, educational handwriting feedback tools, and OCR systems for Chinese administrative and cultural documents. Building robust models for this task aligns with broader goals in multilingual OCR and low-resource script digitization.
This project aims to build two baseline models for classifying images of handwritten Chinese numerals from the Chinese MNIST dataset. The models will be a logistic regression and a multi-layer perceptron (MLP). The goal is to establish foundational performance benchmarks that can be improved upon with more advanced architectures in future work.
Objective
The objective of this project is to implement and evaluate two baseline models—Logistic Regression and a configurable Multi-Layer Perceptron (MLP)—for classifying handwritten Chinese numerals into their corresponding categories. By systematically preprocessing the data, defining a clean model interface, and running controlled training and evaluation pipelines, the project establishes baseline performance and analyzes the strengths and limitations of linear vs. nonlinear decision functions in this context. This work sets the foundation for future extensions, such as convolutional architectures or transformers, which could further exploit the spatial structure of the images.
Interface
Inputs
The system operates on grayscale images of handwritten Chinese characters drawn from the Chinese MNIST dataset. After preparation, each sample is stored as data/{train,val,test}/{class}/image.png. Metadata containing the numeric value, Unicode character, and file code is loaded from data/chinese_mnist.csv. The preprocessing pipeline (implemented in preprocessing_utils.py) applies a deterministic sequence of transformations:
- Image to tensor: Convert to single-channel grayscale, resize to 64×64, convert to PyTorch tensor with values in [0, 1].
- Optional normalization: Subtract per-channel mean and divide by standard deviation.
- Data augmentation (train only): Random rotations (±10°) and affine translations (≤5%) simulate handwriting variations.
- Flattening: The final image tensor is reshaped in row-major order into a fixed-length 4096-dimensional vector, ensuring compatibility with linear and MLP models.
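To make the sequence concrete, here is a minimal sketch of the train-time chain using torchvision transforms; the exact composition in preprocessing_utils.py may differ, and the normalization values shown are placeholders.

```python
# Sketch of the train-time transform chain described above (torchvision).
# Names and normalization constants are illustrative, not the exact
# definitions in preprocessing_utils.py.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),   # single channel
    transforms.Resize((64, 64)),                   # fixed spatial size
    transforms.RandomRotation(degrees=10),         # ±10° rotations (train only)
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # ≤5% shifts
    transforms.ToTensor(),                         # values in [0, 1]
    # transforms.Normalize(mean=[0.5], std=[0.5]), # optional normalization
    transforms.Lambda(lambda t: t.view(-1)),       # flatten to 4096-dim vector
])
```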
Batches are loaded via get_dataloaders(...), which returns PyTorch DataLoaders for the train, validation, and test splits. The input tensor to the models has shape (batch_size, 4096) and contains real values in [0, 1] (normalized if enabled), with no NaNs. Target labels are integer-encoded in {0, …, 14}, corresponding to the 15 character classes:
['0', '1', '10', '100', '1000', '10000', '100000000',
'2', '3', '4', '5', '6', '7', '8', '9']
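A minimal sketch of what get_dataloaders(...) might look like, assuming the ImageFolder layout above; the real signature and options may differ.

```python
# Hypothetical shape of get_dataloaders(...); sketch only.
from torch.utils.data import DataLoader
from torchvision import datasets

def get_dataloaders(root="data", batch_size=32,
                    train_transform=None, eval_transform=None):
    loaders = {}
    for split in ("train", "val", "test"):
        tf = train_transform if split == "train" else eval_transform
        ds = datasets.ImageFolder(f"{root}/{split}", transform=tf)
        loaders[split] = DataLoader(ds, batch_size=batch_size,
                                    shuffle=(split == "train"))
    return loaders["train"], loaders["val"], loaders["test"]
```

Note that ImageFolder orders class directories lexicographically, which is why '10' precedes '2' in the label list above.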
Outputs
Given an input batch \(X \in \mathbb{R}^{B \times 4096}\), the model returns logits \(Z \in \mathbb{R}^{B \times 15}\). These logits are transformed using a softmax function to produce class probabilities \(p(y|x)\) for each sample, with \(\sum_i p_i = 1\). Two prediction modes are provided:
- predict: \(\arg\max_i Z_i\) returns the most likely class index.
- predict_proba: returns the full probability vector for downstream analysis or thresholding.
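A sketch of how the two modes could be implemented over a model that returns raw logits; the actual methods live on the model classes, so treat these free functions as illustrative.

```python
# Sketch of the two prediction modes over raw logits. Assumes
# model.eval() has been called so dropout/BatchNorm behave deterministically.
import torch

@torch.no_grad()
def predict(model, x):
    """Most likely class index per sample: argmax over logits."""
    return model(x).argmax(dim=1)

@torch.no_grad()
def predict_proba(model, x):
    """Full probability vector per sample via softmax over logits."""
    return torch.softmax(model(x), dim=1)
```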
All training and evaluation metrics (e.g., loss, accuracy) are logged per epoch to structured JSON and CSV files under metrics/, while model checkpoints are stored under checkpoints/ using standardized filenames encoding architecture, optimizer, batch size, and learning rate. Final test metrics and confusion matrices are saved in plots/ for inspection (see presentation Sections 7–10).
Rationale Linking Inputs, Outputs, and Task
The flattened 4096-dimensional vector serves as a stable, fixed-length numerical representation of each image, enabling both linear and MLP models to operate without convolutional layers. Light data augmentation introduces invariance to the types of perturbations expected in real handwriting, improving generalization. The cross-entropy loss on the softmax logits directly optimizes the negative log-likelihood of the correct class, aligning exactly with the task objective of assigning each image to one of 15 categories. The logits also provide a probabilistic interpretation useful for confidence estimation, misclassification analysis, and visualization. This tight alignment between input representation, model output, and loss function ensures that the computational pipeline reflects the structure and demands of the classification task rather than relying on ad hoc heuristics.
B. Explanation of Model, Algorithm, Key Components, and System Structure
System Overview
The project follows a structured end-to-end pipeline that mirrors standard machine learning workflows, from data acquisition to final model evaluation. Each component was designed to be modular, reproducible, and analytically transparent.
1. Data Preparation
(preprocessing_utils.py, presentation Sec. 2–3) The raw dataset is first downloaded and extracted if necessary, after which it is partitioned into stratified training, validation, and test splits with an 80/10/10 ratio. This stratification ensures that class proportions are preserved across splits, maintaining balanced evaluation. ImageFolder datasets are then constructed, and corresponding DataLoaders are instantiated with configured transforms and batch sizes. These transforms handle basic preprocessing (e.g., tensor conversion and normalization) and light augmentations to improve generalization without distorting the underlying character structures.
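As an illustration, one way to realize the stratified 80/10/10 split with scikit-learn; the column names below are hypothetical stand-ins for whatever chinese_mnist.csv actually provides, and the project's own splitting code may differ.

```python
# Sketch of the stratified 80/10/10 split using scikit-learn.
# Column names ("file", "value") are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

meta = pd.read_csv("data/chinese_mnist.csv")
files, labels = meta["file"], meta["value"]

# 80% train, then the remaining 20% split evenly into val and test,
# stratifying by class at both steps.
f_train, f_rest, y_train, y_rest = train_test_split(
    files, labels, test_size=0.2, stratify=labels, random_state=42)
f_val, f_test, y_val, y_test = train_test_split(
    f_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
```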
2. Model Definition
(mlp.py, log_reg.py, presentation Sec. 4–6) Two primary model families were explored:
Logistic Regression: A single linear layer mapping the 4096-dimensional flattened input to 15 logits. This provides a simple baseline to assess the linear separability of the data.
MLP Baselines: Deeper fully connected architectures built from repeating blocks of Linear → BatchNorm1d → ReLU → Dropout, followed by a final linear layer to 15 logits. Three key architectures were examined:
- mlp_1x512
- mlp_2x256
- mlp_3x512-256-128 (the main configuration showcased in the presentation)
All models employ dropout (0.5) and weight decay (5 × 10⁻⁴) for regularization. Batch normalization and ReLU activations stabilize training and enable effective optimization across learning rates.
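A compact sketch of how the tapered configuration could be assembled from the repeated block; make_mlp is an illustrative name, not the MLPBaseline API in mlp.py.

```python
# Illustrative builder for the tapered MLP (mlp_3x512-256-128) from
# repeated Linear → BatchNorm1d → ReLU → Dropout blocks. The real,
# configurable MLPBaseline class lives in mlp.py; this is a sketch.
import torch.nn as nn

def make_mlp(in_dim=4096, hidden=(512, 256, 128), n_classes=15, p_drop=0.5):
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width),
                   nn.BatchNorm1d(width),
                   nn.ReLU(),
                   nn.Dropout(p_drop)]
        prev = width
    layers.append(nn.Linear(prev, n_classes))  # final projection to 15 logits
    return nn.Sequential(*layers)

# The logistic-regression baseline is the degenerate case with no blocks:
log_reg = make_mlp(hidden=())
```

With hidden=() the builder collapses to the single linear layer of the logistic-regression baseline, which makes the capacity comparison between the two model families explicit.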
3. Training Loop
(models_utils.py, presentation Sec. 7) The training pipeline supports both SGD with momentum (0.9) and Adam optimizers, with learning rates in {0.01, 0.001, 0.0005}. Models are trained for up to 50 epochs for MLPs and 30 for Logistic Regression, with early stopping based on validation accuracy (patience = 5). Each epoch performs forward propagation, computes cross-entropy loss, runs backpropagation, updates parameters, and evaluates on the validation set. Checkpoints are saved whenever validation accuracy improves, and training metrics are logged to JSON and CSV for reproducibility.
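The following condensed sketch captures the shape of this loop, with early stopping on validation accuracy and checkpointing on improvement; helper names are illustrative rather than the exact train_model API.

```python
# Condensed sketch of one training run with early stopping (patience = 5)
# and checkpointing on validation-accuracy improvement.
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, optimizer,
          max_epochs=50, patience=5, ckpt_path="best.pth"):
    criterion = nn.CrossEntropyLoss()
    best_acc, stale = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # forward pass + cross-entropy
            loss.backward()                 # backpropagation
            optimizer.step()                # parameter update
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        val_acc = correct / total
        if val_acc > best_acc:              # save checkpoint on improvement
            best_acc, stale = val_acc, 0
            torch.save(model.state_dict(), ckpt_path)
        else:
            stale += 1
            if stale >= patience:           # early stopping
                break
    return best_acc
```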
We adopt mini-batch gradient descent. Contrasting the optimizers:
- SGD+Momentum acts as a first-order low-pass filter over gradients; the momentum term reduces variance and accelerates progress along directions of shallow curvature.
- Adam adapts the per-parameter step size using bias-corrected first and second moment estimates. It typically descends faster initially but may converge to sharper minima; weight decay mitigates overfitting.
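In code, the two configurations differ only in the constructor call, with the regularization settings quoted above; the model below is a placeholder standing in for any of the architectures described earlier.

```python
# Sketch: the two optimizer configurations compared in the experiments
# (momentum 0.9, weight decay 5e-4). The model is a placeholder.
import torch

model = torch.nn.Linear(4096, 15)  # stand-in for any model above

sgd = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=5e-4)
adam = torch.optim.Adam(model.parameters(), lr=0.001,
                        weight_decay=5e-4)
```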
4. Model Selection and Testing
(Evaluation & Selection site, presentation Sec. 9–10) All checkpoints within /checkpoints/log_reg/ and related directories are automatically enumerated. Filenames encode hyperparameter settings (architecture, batch size, optimizer, learning rate), which are parsed via regular expressions to reconstruct model configurations. Each model is reloaded and evaluated on the test set using the original batch size. The best model is selected by test accuracy. Additional analyses include confusion matrix visualization and qualitative inspection on selected test samples.
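A sketch of the filename-parsing step, assuming the {arch}_{batch}_{optimizer}_{lr}.pth pattern listed in the appendix; the project's actual regular expression may differ.

```python
# Sketch: recover hyperparameters from a checkpoint filename.
import re

PATTERN = re.compile(
    r"(?P<arch>.+)_(?P<batch>\d+)_(?P<opt>sgd|adam)_(?P<lr>[\d.e-]+)\.pth")

m = PATTERN.match("mlp_3x512-256-128_32_adam_0.001.pth")
if m:
    cfg = {"arch": m["arch"], "batch_size": int(m["batch"]),
           "optimizer": m["opt"], "lr": float(m["lr"])}
    # -> {'arch': 'mlp_3x512-256-128', 'batch_size': 32,
    #     'optimizer': 'adam', 'lr': 0.001}
```

The same pattern covers the logistic-regression files, where arch is simply the literal "model".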
Key components
Data–Model Alignment.
Images are grayscale 64×64, flattened into 4096-dimensional vectors. This representation deliberately avoids convolutional inductive biases, isolating the effects of network depth/width, optimization strategy, and regularization on a pure dense architecture. Light augmentations preserve the geometric integrity of Chinese characters, avoiding transformations that could corrupt stroke patterns.
Model Architecture.
MLPs with BatchNorm and ReLU form a robust hypothesis class: they accelerate convergence, stabilize gradient flow, and benefit from dropout to mitigate co-adaptation. The tapered architecture (512 → 256 → 128) encourages progressive feature compression, reducing parameter count in later layers while retaining representational power.
Optimization Strategies.
Adam provides adaptive learning rates and generally converges rapidly with minimal tuning. SGD with momentum introduces a useful inductive bias toward flatter minima and often yields competitive generalization. Both are systematically compared across learning rates to identify stable regimes.
Early Stopping and Persistence.
Validation accuracy monitoring prevents overfitting and reduces unnecessary computation. A strict, human- and machine-readable checkpoint naming convention allows efficient post-hoc analyses: JSON and CSV logs enable quick aggregation and plotting without retraining.
Theory-to-code Mapping
For input \(x \in \mathbb{R}^{4096}\), the network computes
\[
z = f_\theta(x),
\]
where \(f_\theta\) is the composition of affine, batch normalization, ReLU, and dropout layers. The final linear layer produces 15 logits; softmax probabilities are used implicitly through CrossEntropyLoss:
\[
p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.
\]
Gradients of the negative log-likelihood are backpropagated via PyTorch’s autograd engine. Parameter updates follow either:
SGD with Momentum: \(v_t = \mu v_{t-1} + \nabla_\theta L_t\), \(\theta_{t+1} = \theta_t - \eta v_t - \eta \lambda \theta_t\).
Adam: Maintains first and second moment estimates \(m, v\), with bias correction to update each parameter adaptively.
Early stopping monitors validation accuracy, saving the best checkpoint and halting when no improvement is observed within the patience window.
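Written out manually for a single parameter tensor, the momentum update above looks like the following. Note that torch.optim.SGD adds the weight-decay term to the gradient before the momentum buffer, so this sketch matches the equation as stated rather than PyTorch's internals exactly.

```python
# Manual SGD-with-momentum step mirroring the update rule above:
#   v_t = mu * v_{t-1} + grad,  theta -= eta * v_t + eta * lam * theta
import torch

eta, mu, lam = 0.01, 0.9, 5e-4            # lr, momentum, weight decay
theta = torch.randn(4096, 15, requires_grad=True)
v = torch.zeros_like(theta)               # velocity buffer

loss = (theta ** 2).sum()                 # stand-in loss for illustration
loss.backward()                           # populates theta.grad

with torch.no_grad():
    v = mu * v + theta.grad               # momentum accumulation
    theta -= eta * v + eta * lam * theta  # descent step + weight decay
```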
Implementation Details and Challenges
Batch Size.
Smaller batches (e.g., 32) introduced gradient noise that occasionally improved generalization, whereas larger batches (256) yielded smoother loss curves but sometimes required lower learning rates to remain stable.
Learning Rate Sensitivity.
Empirically, Adam with 0.001 and SGD with 0.01 were consistently strong choices. A learning rate of 0.0005 was often useful for fine-tuning later in training.
Metric Persistence.
Each training run produces atomic JSON and CSV logs, making it straightforward to rank, visualize, and compare results. Duplicate runs are identifiable by run_id hashes, enabling robust experiment tracking.
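The exact run_id scheme is not reproduced on the site; one plausible construction, shown here purely as an assumption, hashes a canonical JSON encoding of the run configuration.

```python
# Hypothetical run_id construction: hash a canonical JSON encoding of
# the run configuration. The project's actual scheme may differ.
import hashlib
import json

def run_id(config: dict) -> str:
    blob = json.dumps(config, sort_keys=True)           # canonical form
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()[:10]

rid = run_id({"arch": "mlp_3x512-256-128", "batch_size": 32,
              "optimizer": "adam", "lr": 0.001})
```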
Evolution from V0.
The initial V0 scripts (V0/log_reg.py, V0/mlp.py, and V0/models_utils.py) provided a minimal, working pipeline: a single training loop without sophisticated checkpointing, metric persistence, or architectural exploration. The final version refactors and extends this baseline by:
- Incorporating BatchNorm and Dropout to improve stability and generalization.
- Implementing early stopping for more efficient training.
- Establishing a consistent checkpoint/logging framework for systematic experimentation.
- Automating checkpoint enumeration and test-time model selection for evaluation.
C. Model Evaluation and Refinement
Expected Model Behaviour
The primary objective of the project is to maximize classification accuracy on the held-out test set of the Chinese MNIST dataset. Secondary objectives include monitoring macro-F1 scores to assess per-class balance and inspecting confusion matrices for systematic misclassifications, particularly among visually similar numerals and characters. The dataset is stratified into 80/10/10 training, validation, and test splits. Early stopping is applied based on validation accuracy to prevent overfitting, and the final model is selected by test-set performance among saved checkpoints.
Loss Function vs. Task Objective
During training, the model optimizes Cross-Entropy Loss, which serves as a differentiable surrogate for 0–1 accuracy. While accuracy is the ultimate evaluation metric, cross-entropy offers several advantages: it provides smooth gradients suitable for backpropagation, allows probabilistic calibration through softmax outputs, and generally correlates strongly with classification accuracy. The minor mismatch between the training objective (loss minimization) and evaluation goal (accuracy maximization) is acceptable because it enables stable optimization and effective model convergence.
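A tiny example of the surrogate relationship: both samples below are classified correctly, so accuracy is already 1.0, yet the cross-entropy loss remains positive and keeps shrinking as the correct logits grow.

```python
# CrossEntropyLoss consumes raw logits and integer targets; accuracy
# only looks at the argmax. Sketch for intuition, not project code.
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]])
targets = torch.tensor([0, 2])

loss = nn.CrossEntropyLoss()(logits, targets)           # smooth surrogate
acc = (logits.argmax(dim=1) == targets).float().mean()  # 0-1 metric: 1.0
```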
Empirical Results
The dataset consists of 12,000 training images (800 per class) and 1,500 images each for validation and testing (100 per class), ensuring balanced class distributions. Representative experiments with the primary MLP architecture (mlp_3x512-256-128, batch size 32) reveal nuanced optimization dynamics:
- Adam, lr=0.01: Early stopping occurred quickly at val_acc ≈ 0.593, indicating that this learning rate is too aggressive for the given architecture.
- Adam, lr=0.001: Val_acc peaked at ≈ 0.887 around epoch 23 before early stopping, demonstrating stable convergence.
- SGD with momentum 0.9, lr=0.01: Val_acc steadily increased to ≈ 0.899 by epoch 29, indicating strong generalization performance.
- SGD, lr=0.001: Converged more slowly but eventually reached val_acc in the mid-to-high 0.88 range, requiring more epochs to reach peak performance.
Across experiments, the strongest validation accuracies for the 3-layer tapered MLP were obtained with SGD plus momentum at lr=0.01 (≈ 0.899) and Adam at lr=0.001 (≈ 0.887). Larger-batch runs (bs=256) were logged and analyzed similarly; test-time evaluation used the same systematic checkpoint parsing and selection procedure. As documented on the site’s “Evaluation & Selection” page, checkpoints are parsed and re-evaluated on the test split, and the model with the best test accuracy is selected. Confusion matrices and sample inference visualizations are produced for the winner.
Discrepancy Analysis
Cross-entropy may favor confidence calibration over raw accuracy in some regimes. Where classes are visually similar (e.g., characters sharing strokes), macro-F1 helps detect per-class degradation even in a balanced dataset. Confusion matrices highlight consistent misclassification pairs; future work can target these with specialized augmentation or architectural priors (e.g., CNNs).
Reflections and Future Work
What worked:
- Early stopping + BatchNorm + Dropout significantly improved MLP training stability and generalization vs. the V0 approach.
- Systematic logging (JSON/CSV) and strict checkpoint naming made it easy to compare runs and prevented “lost best model” issues.
- Adam at 0.001 provided strong, steady results.
What could be improved:
- Convolutional inductive bias: Dense MLPs over flattened pixels ignore locality; CNN baselines should improve both sample efficiency and final accuracy.
- Calibration analysis: Add reliability diagrams/ECE to assess confidence; helpful if deploying probabilistic outputs.
- Hyperparameter search: Move from manual grids to Optuna/Ray Tune for depth/width/LR/momentum/weight decay.
- Data normalization: Compute the dataset mean/std to enable a consistent Normalize(mean, std) and potentially improve optimization smoothness (see the sketch after this list).
- Regularization sweeps: Dropout rates and weight decay could be tuned per depth.
- Ensembling: Averaging top-N checkpoints may yield a small but robust boost.
- Mixed precision / GPU acceleration: For larger sweeps, AMP and device utilization would help.
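For the normalization point above, a sketch of computing grayscale statistics over the training loader; the loader name is hypothetical, and a single channel means scalars suffice.

```python
# Sketch: per-dataset grayscale mean/std for Normalize(mean, std).
import torch

def channel_stats(loader):
    total, sq_total, count = 0.0, 0.0, 0
    for x, _ in loader:                    # x: (B, 4096) or (B, 1, 64, 64)
        total += x.sum().item()
        sq_total += (x ** 2).sum().item()
        count += x.numel()
    mean = total / count
    std = (sq_total / count - mean ** 2) ** 0.5
    return mean, std                       # use as Normalize([mean], [std])
```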
Next steps:
- Implement CNN baselines (e.g., a small ConvNet or ResNet-18 with grayscale input) using non-flattened transforms.
- Automate experiment orchestration (config files and seeds), and report mean±std across multiple seeds. Include confidence intervals and statistical tests, e.g., a paired t-test on per-sample correctness between top models (see the sketch after this list).
- Extend the selection criteria to incorporate macro-F1 in tie-breaks and include per-class recall thresholds.
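The per-sample correctness test proposed above could be sketched with scipy as follows; correct_a and correct_b are hypothetical 0/1 vectors recording whether each test sample was classified correctly by two competing models.

```python
# Sketch of the proposed paired t-test on per-sample correctness.
import numpy as np
from scipy.stats import ttest_rel

correct_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])   # model A per-sample hits
correct_b = np.array([1, 0, 0, 1, 1, 1, 1, 1])   # model B per-sample hits
stat, p_value = ttest_rel(correct_a, correct_b)  # paired across samples
print(f"t = {stat:.3f}, p = {p_value:.3f}")
```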
Appendix: Reproducibility and Artifacts
Scripts:
- log_reg.py: minimal linear baseline with 30-epoch ceiling.
- mlp.py: MLPBaseline class with configurable hidden_layers, hidden_units, dropout.
- models_utils.py: train_model, test_model, plotting utilities (training curves, confusion matrix), predict_image.
- preprocessing_utils.py: prepare_images, summarize_split, get_dataloaders, transforms, previews.
Checkpoints and metrics:
- checkpoints/: per-run best validation models saved with pattern:
- MLP: checkpoints/mlp/{arch}_{batch}_{optimizer}_{lr}.pth
- LR: checkpoints/log_reg/model_{batch}_{optimizer}_{lr}.pth
- metrics/mlp/: per-run JSON + runs_detailed.csv; enables offline plotting and ranking without re-training.
- plots/: confusion matrices and training curves.
V0 lineage:
- Early versions (V0/*.py) established a working baseline but lacked BatchNorm, robust persistence, and standardized naming. The current code incorporates these improvements, enabling more reliable experimentation and clearer reporting.