We are doing alignment backwards.
The dominant approach has been to build ever more capable generative systems, then layer on rules, filters, RLHF, constitutional principles, or classifiers, and hope that the model learns to respect the guardrails or that the detectors catch the failures.
This approach doesn't scale. Models learn to route around constraints, produce plausible-looking compliance, or generate outputs that look safe while still following whatever objectives training reinforced more strongly. We call this "alignment." It's closer to security theater.