hpc.social


High Performance Computing
Practitioners
and friends /#hpc
Share: 
This is a crosspost from   Computing – thinking out loud works in progress and scattered thoughts, often about computers. See the original post here.

SRE to Solutions Architect

It’s been about two years since I joined NVIDIA as a Solutions Architect, which was a pretty big job change for me! Most of my previous work was in jobs that could fall under the heading of “site reliability engineering”, where I was actively responsible for the operations of computing systems, but my new job mostly has me helping customers design and build their own systems.

I’m finally starting to feel like I know what I’m doing at least 25% of the time ? so I thought this would be a good time to reflect on the differences between these roles and what my past experience brings to the table for my (sort of) new job.

(Just a note: I feel like job titles for ops folks are a fraught topic. My job titles have included things like “Production Engineer”, “HPC Cluster Administrator”, and “HPC/Cloud Systems Engineer”. I tend to self-identify more with the term “sysadmin”, but I’m using “SRE” as the most current term that captures the work I’ve spent a lot of my career doing, where I generally approached ops from a software engineering perspective. Feel free to substitute your job title of choice!)

I spent most of the past 10 years building and running large computing systems. With the exception of ~18 months working on backend storage for a fairly large website, I’ve mostly worked on large high-performance-computing (HPC) clusters. These systems are generally used by researchers and engineers to run simulations and data analysis. The teams I joined were generally responsible for building these clusters, keeping them running, helping the researchers who used them, and making sure they performed well.

In my day-to-day work in SRE (or whatever you call it), I mostly thought about problems like:


My role as a solutions architect is rather different, as I don’t actually have any services I’m responsible for keeping online. Instead, I’m generally working with external customers who are working with our products and using them in their own production environments. Because I’m focused on HPC and supercomputing, my customers have generally purchased NVIDIA’s hardware products, and are operating them in their own datacenters. I’m frequently talking to the SRE teams, but I’m not part of them myself.

In my daily work as a solutions architect, I’m thinking more about questions like:


On the one hand, I work on a lot of the same problems as a solutions architect as I did in SRE. I still spend a lot of time thinking about the scalability, performance, and reliability of HPC systems. I still care a lot about making sure the systems I help build are useful and usable for researchers.

On the other hand, I’m not so much on the pointy end of these problems anymore. My work is mostly focused on enabling others to run reliable systems, rather than being directly on the hook for them. And while I do help directly manage some internal lab clusters, those systems have very loose SLOs. So in practice I haven’t been on call in about two years.

I do think my experience in SRE has been really important in doing a good job in solutions architecture. I like to think I have a pretty good instinct for systems design at this point, and I can often help identify problems and bottlenecks in early stages. My troubleshooting skills from SRE work are incredibly helpful, as a lot of my work is helping customers understand what the heck is broken on their clusters. And I also find that it really helps to have someone who “speaks the same language” as the SRE teams for our customers, especially because I feel like so many vendor relationships neglect reliability concerns in favor of features.

The transition has been really interesting, and I’m still conflicted about which kind of job I prefer. I don’t exactly miss being on call… but I do miss somewhat the more visceral feeling of understanding a running system really well through sheer continuous contact with it. However, I really love helping my customers build cool systems, and I like the satisfaction of helping many different teams do well, versus focusing tightly on a single service.

I’m really enjoying the solutions architect gig right now, but I also wouldn’t be surprised if I ended up doing SRE work directly again at some point.