As an SRE, I've always been fascinated by how systems learn and adapt. This curiosity naturally led me to explore Machine Learning Engineering, a field where I could combine everything I know about infrastructure with the actual science of making machines smarter.
The honest truth? The transition is harder than it looks. Not because the concepts are impenetrable, but because the mental model is different. In SRE, you're thinking about uptime and latency. In MLE, you're thinking about loss curves and distribution shift. It takes time to rewire.
What Transfers
More than you'd expect:
- Infrastructure knowledge is invaluable. ML systems need to run somewhere. GPUs need to be provisioned, models need to be served, data pipelines need to be reliable. SREs know this territory.
- Monitoring instincts carry over. Monitoring a model in production is just observability with a new vocabulary. Latency becomes inference time. Error rate becomes model accuracy drift.
- Systems thinking. Understanding that everything is a distributed system (data flows, model registries, feature stores) feels natural after years of on-call.
- Python comfort. Python is the language of ML. If you've been automating infrastructure with it, you're already halfway there.
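To make the monitoring analogy concrete, here is a minimal sketch of treating model accuracy like an error-rate alert. The class name `AccuracyDriftMonitor`, the baseline, the tolerance, and the window size are all hypothetical choices for illustration, and it assumes ground-truth labels eventually arrive for each prediction, which is often delayed in practice.

```python
from collections import deque


class AccuracyDriftMonitor:
    """Rolling accuracy check, analogous to an error-rate alert.

    Hypothetical sketch: assumes each prediction eventually gets a
    ground-truth label, and that a fixed baseline is good enough.
    """

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Keep only the most recent `window` outcomes (1 = correct, 0 = wrong).
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, label):
        """Store whether a single prediction matched its label."""
        self.outcomes.append(1 if prediction == label else 0)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_drifting(self):
        """True when rolling accuracy falls below baseline minus tolerance."""
        accuracy = self.rolling_accuracy()
        if accuracy is None:
            return False
        return accuracy < self.baseline - self.tolerance


if __name__ == "__main__":
    monitor = AccuracyDriftMonitor(baseline_accuracy=0.90, tolerance=0.05, window=10)
    for prediction, label in [(1, 1)] * 6 + [(1, 0)] * 4:
        monitor.record(prediction, label)
    print(monitor.rolling_accuracy(), monitor.is_drifting())
```

The shape is the same as a windowed SLO check: a fixed-size window, a threshold, and a boolean you could wire into whatever alerting you already run.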
What I Had to Learn from Scratch
The ML-specific concepts took real effort:
- Statistics and probability: You can't evaluate a model without understanding what the numbers mean. I went back to basics.
- Feature engineering: Raw data is garbage. Turning it into something a model can use is half the job.
- Model evaluation: Accuracy alone is rarely the right metric. F1, AUC-ROC, and precision/recall trade-offs all matter, and each task has its own evaluation story.
- The ML lifecycle: Experiment tracking, model versioning, A/B testing in production, rollback strategies for bad models. Different problems from SRE, same underlying discipline.
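The point about model evaluation is easiest to see on imbalanced data. The sketch below computes accuracy, precision, recall, and F1 by hand for a binary classifier. The function names and the toy dataset are made up for illustration; in a real project a library such as scikit-learn would normally handle this.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1, using 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy data (hypothetical): 95 negatives, 5 positives.
    y_true = [0] * 95 + [1] * 5
    # A "model" that always predicts the majority class.
    always_negative = [0] * 100
    print(evaluate(y_true, always_negative))
```

On this toy set the always-negative model is right 95% of the time yet finds none of the positives, so its recall and F1 are both zero. That gap is the reason to look past accuracy.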
"The best way to learn is by doing. Start small, but start now."
Where I Am Now
I'm currently working through the fundamentals: taking courses, building projects, and trying to connect each ML concept back to something I already understand from infrastructure. I'm building with MCP servers to explore AI tooling, running CrewAI experiments to understand multi-agent systems, and reading whatever papers I can get through without falling asleep.
The SRE-to-MLE path is real. The combination of skills, being able to both train a model and run it reliably in production, is genuinely rare. I think it's worth the climb.
More updates coming as I figure things out.