Hiring DevOps engineers - a definitive guide

Introduction

A picture speaks a thousand words, and the above image describes exactly the current state of DevOps: a cloud of confusion surrounds it. Our aim throughout this blog post is to lift the veil surrounding DevOps and explain why it is of paramount importance to understand it as a recruiter or sourcer in today's world. The word DevOps combines two words: development and operations.

To put it simply, DevOps is a combination of processes in which software engineers and operations engineers work as a unified front throughout the entire software development cycle.

A key word used throughout DevOps is “siloed”. Its traditional meaning is isolated. When referred to in software development, siloed means isolated teams, development models, codebases, operations and support working individually. DevOps implementation amalgamates all these parts of an organization into one efficient operation.

The DevOps movement began around 2007 when the software development and IT operations communities raised concerns about the traditional software development model, where developers who wrote code worked apart from operations who deployed and supported the code.

This separation created numerous challenges, such as longer development times and low throughput. DevOps emerged in response and integrated these disciplines into one continuous process.

Basics of DevOps

There are a few core principles at work in DevOps:

Systems thinking: Systems thinking means thinking about the performance of an entire system, instead of the performance of specific teams. This mindset ensures all teams and employees feel responsible for producing good quality and discourages teams from passing defects downstream.
Culture: A successful DevOps culture is often tied to a spirit of improved collaboration, experimentation, and continuous learning. This might mean teams make sure time is allocated to improve work, teams are rewarded for taking risks, and members are able to learn from others inside and outside their teams.
Automation: DevOps places a heavy emphasis on automating as much as possible. This can reduce time spent on repetitive and time-consuming tasks, and increase deployment speed. A DevOps team might, for example, automate testing processes so that developers can receive feedback early and frequently.

A couple of key practices make DevOps what it is. These include:

Continuous integration (CI): Continuous integration means developers merge their code changes into a shared repository frequently, with each change verified by an automated build and tests. This involves both automating how changes and fixes are integrated and creating a culture in which integration happens continually.

Continuous delivery (CD): Continuous delivery is when changes to a product (likely your code) are integrated automatically so that the product is always in a deployable state. This means that code can be deployed in short time frames (daily, weekly, and so on).

Together, continuous integration and continuous delivery are often referred to as CI/CD. Taking these practices one step further, continuous deployment automatically releases every change that passes the automated tests to production, usually backed by real-time monitoring after launch. Within a DevOps environment, it's common for organizations to release smaller, more frequent product updates that are more reactive to customer feedback, rather than the large-scale, labor-intensive updates siloed teams may deploy.
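
To make the CI/CD idea concrete, here is a minimal sketch of a pipeline that runs its stages in order and stops at the first failure, so a broken build never reaches the deploy stage. The stage names and the placeholder functions (build, unit_tests, deploy) are assumptions made for illustration; real pipelines are normally defined in a CI/CD tool rather than hand-written like this.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failing stage (fail fast)."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'passed' if ok else 'failed'}")
        if not ok:
            break  # later stages, such as deploy, never run on a broken build
    return log


# Hypothetical placeholder steps; a real pipeline would compile code,
# run a test suite, and push a release here.
def build() -> bool:
    return True


def unit_tests() -> bool:
    return 2 + 2 == 4


def deploy() -> bool:
    return True


if __name__ == "__main__":
    stages = [("build", build), ("test", unit_tests), ("deploy", deploy)]
    for line in run_pipeline(stages):
        print(line)
```

The fail-fast loop is the point of the sketch: developers get feedback from the earliest failing stage, and only changes that pass every check are deployed.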

What does a DevOps engineer do?

A DevOps engineer is someone who understands the software development lifecycle and has a thorough understanding of the automation tools used to build digital pipelines (continuous integration (CI) / continuous deployment (CD) pipelines).

A DevOps Engineer works with developers and the IT staff to oversee the code releases. They are either developers who get interested in deployment and network operations or sysadmins who have a passion for scripting and coding and move into the development side where they can improve the planning of test and deployment.

Roles and Responsibilities

A DevOps engineer’s role combines technical and management responsibilities. Excellent communication and coordination skills are essential to integrate various functions and deliver to the customer’s satisfaction. Some of the core responsibilities of a DevOps engineer include:

Understanding customer and project requirements
Implementing various development, testing, automation tools, and IT infrastructure
Planning the team structure, activities, and involvement in project management activities.
Setting up tools and required infrastructure
Defining and setting development, test, release, update, and support processes for DevOps operation
Reviewing, verifying, and validating the software code developed in the project
Troubleshooting and fixing code bugs
Monitoring processes throughout the lifecycle for adherence, and updating or creating processes to improve them and minimize waste
Encouraging and building automated processes wherever possible
Identifying and deploying cybersecurity measures by continuously performing vulnerability assessment and risk management
Incident management and root cause analysis
Coordination and communication within the team and with customers
Selecting and deploying appropriate CI/CD tools
Monitoring and measuring customer experience
Managing periodic reporting on the progress to the management and the customer

What is an SRE? How is that different from DevOps?

Site reliability engineering (SRE) was originally developed by Google.

In the words of Ben Treynor, SRE is “what happens when you ask a software engineer to design an operations function.”

Like DevOps, SRE is about team culture and relationships. Both SRE and DevOps work to bridge the gap between development and operations teams to deliver services faster. However, SRE differs from DevOps because it relies on site reliability engineers within the development team who also have an operations background to remove communication and workflow problems.

Site Reliability Engineering (SRE) is responsible for implementing the product developed by the Core Development team. The key objective of the SREs is to implement and automate DevOps practices to reduce the level of incidents and improve reliability and scalability. SREs are also responsible for sending swift and constant feedback to the Development team based on performance metrics – availability, latency, efficiency, capacity, and incident.

Some of the best practices for SRE are:

Ensuring reliability - getting systems back to steady-state as quickly as possible
Eliminating toil - automating wherever possible
Blameless postmortems - driving better cross-team collaboration
Observing what matters - gaining full visibility into system health

Why are SRE experts important?

SRE teams decide when new features can launch by measuring the system against agreed reliability targets. A service-level agreement (SLA) is the commitment made to customers, and it is backed by service-level indicators (SLIs) and service-level objectives (SLOs). An SLI measures a specific aspect of the service level provided. Key SLIs include request latency, availability, error rate, and system throughput. An SLO is the target value or range for a given SLI. The SLO for system reliability reflects how much downtime is considered acceptable. This allowance is referred to as the error budget: the maximum allowable threshold for errors and outages.
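
The relationship between an SLI, an SLO, and the error budget can be shown with a little arithmetic. The sketch below is illustrative only: the function names, the 99.9% target, the 30-day window, and the request counts are all assumptions, not figures from any particular team.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the given window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, successful: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes of error budget")
    # 500 failures out of 1,000,000 requests uses about half of that budget.
    print(f"{budget_remaining(0.999, 999_500, 1_000_000):.0%} of budget left")
```

When the remaining budget approaches zero, a team following this model would typically slow down feature launches and focus on reliability work until the budget recovers.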

Now let’s look at the role and responsibilities of a Site Reliability Engineer. The main job responsibilities of an SRE expert are:

Gathering project requirements from stakeholders along with BAs and PMs
Designing high-level schematics of the infrastructure, tools, and processes needed
Performing an in-depth analysis of the possible risks and countermeasures for them
Calculating the potential cost of outages and planning for contingency
Monitoring the systems in production, analyzing their performance
Preparing input for infrastructure/tooling/workflow updates across the organization.
Teaching the Dev and Ops (or DevOps) teams to follow the guidelines and procedures to minimize the number of errors and incidents.

Difference between DevOps and SRE

Here is a quick summary of the key differences:

What is a cloud engineer? How is a cloud engineer different from a DevOps engineer?

Cloud computing refers to services like storage, databases, software, and analytics that are made accessible via the internet. According to Gartner, the cloud tech services market is expected to grow from $175.8 billion in 2018 to $206 billion in 2019.

A cloud engineer is an IT professional who builds and maintains cloud infrastructure. Cloud engineers can have more specific roles that include cloud architecting (designing cloud solutions for organizations), development (coding for the cloud), and administration (working with cloud networks).

Cloud engineer is a general term encompassing a few different roles, including:

cloud architect
cloud software engineer
cloud security engineer
cloud systems engineer
cloud network engineer

Each position focuses on a specific type of cloud computing, rather than the technology as a whole. Companies that hire cloud engineers are often looking to deploy cloud services, scale up their use of cloud resources or further their cloud understanding and technology.

Here is a quick table that summarizes the differences between DevOps and Cloud engineering:

What is the difference between DevOps and Backend developers?

Why can’t a Backend engineer fill the shoes of a DevOps expert? While there are some similarities, there are also major differences between the two functions:

What are common DevOps/SRE titles?

Here are some of the most common DevOps job titles, what they do, and what you should consider when looking to hire them:

DevOps or Platform Engineer - This person typically oversees and supports the platform used for DevOps operations.
DevOps Software Developer - This person is at the heart of the DevOps organization. The developers are responsible not only for turning new requirements into code, but for unit testing, deployment, and ongoing monitoring as well.
DevOps Evangelist - A person who instills the DevOps lifestyle in employees. This role needs authority because their goal is to develop a plan for DevOps implementation and convince those in charge of how much better life will be if that plan is implemented.
Build Engineer - The Build Engineer is a DevOps managerial position in charge of development teams. This individual spends their time managing the build and development process and is responsible for ensuring build goals and deadlines are met in a fast-paced environment. They manage code, they maintain builds, they create new builds, they manage and deploy automation solutions, and they ensure builds meet established configuration requirements.
Release Manager - If the Build Engineer is the back of the coin, the Release Manager is the front. The Release Manager is an oversight and managerial position that oversees the overall development pipeline, guiding both individual releases and overall release schedules. They spend much of their time coordinating with the Build Engineer and with other teams to ensure that goals are met.
Automation Architect - The Automation Architect is a key role in modern DevOps. Since a huge portion of DevOps relies on automated workflows and processes that streamline teams and minimize the need for intervention, the Automation Architect is a critical employee.
Site Reliability Engineer - The Site Reliability Engineer, also known as a Reliability Engineer, is the individual responsible for ensuring the quality of orchestration and integration of tools needed to support daily operations. This is the quintessential role that comes to mind when people think about DevOps for the first time, the “magician” who masterfully patches together existing infrastructure with cloud solutions and data storage infrastructures. This role is important in any DevOps organization, as a failure to ensure sound integration can lead to costly outages.
DevOps Data Analyst - The DevOps Data Analyst role is a position of extreme importance. While development takes theory into account when developing products and features, the Data Analyst takes real data from real users and distills it down into actionable intelligence. They typically overlap with user experience engineers and UX designers in a DevOps environment.
Security Manager - Modern software development is fraught with peril. Time and time again, news comes out about SaaS platforms, new software applications, mobile apps, and other technology with glaringly obvious and exploitable flaws. This is caused, in large part, by security being treated as an afterthought. Particularly in the Waterfall development method, security is typically tacked on at the end with a cursory audit.

What technologies and tools do DevOps engineers work with?

Tools used by DevOps/SRE teams can be categorized in three main areas:

Development Tools - DevOps tools during the development of a project
Deployment Tools - Efficient deployment tools when code is ready for customer usage
Automation & Feedback Tools - Post deployment of product to ensure automation and take feedback from customers

What are some of the success stories of companies integrating DevOps/SRE?

The trending software development approach of DevOps/SRE has many quantifiable technical and business benefits, including a move from centralized release management structures to adaptive release management, shorter development cycles, increased deployment frequency, and faster time to market. But because it relies so heavily on increased communication, collaboration, and innovation, it can also be a catalyst for cultural change within an organization.

Here are some examples of companies using DevOps strategically:

Amazon

Back when Amazon was still run on dedicated servers, it was a constant challenge to predict how much equipment to buy to meet traffic demands and pad estimates to accommodate for unforeseen traffic spikes. As a result, about 40 percent of Amazon's server capacity was wasted—and during the Christmas shopping season, when traffic could triple, more than three-quarters could be left unused—along with the money spent to purchase it.

Once the online retailer moved to the Amazon Web Services (AWS) cloud, it allowed engineers to scale capacity up or down incrementally. Not only did this reduce spending on server capacity, but it also spurred a transition to a continuous deployment process that allows any developer to deploy their own code to whichever servers they need, whenever they want.

Within a year of Amazon's move to AWS, engineers were deploying code every 11.7 seconds, on average. The agile approach also reduced both the number and duration of outages, resulting in increased revenue.

Etsy

For its first several years, Etsy struggled with slow, painful site updates that frequently caused the site to go down. In addition to frustrating visitors, any downtime impacted sales for Etsy's millions of users who sold goods through the online marketplace and risked driving them to a competitor.

With the help of a new technical management team, Etsy transitioned from its waterfall model, which produced four-hour full-site deployments twice weekly, to a more agile approach. Today, it has a fully automated deployment pipeline, and its continuous delivery (CD) practices have reportedly resulted in more than 50 deployments a day with fewer disruptions.

Netflix

When Netflix evolved its business model from shipping DVDs to streaming video over the web, it waded into uncharted waters. There weren't any commercial tools available to help keep the company's massive cloud infrastructure running smoothly, so it turned to open source solutions. Enlisting the volunteer help of hundreds of developers, it created the Simian Army, a suite of automated tools that stress test Netflix's infrastructure and allow the company to proactively identify and resolve vulnerabilities before they impact customers.

Since then, Netflix has continued its commitment to automation and open source, and today engineers deploy code thousands of times per day.

Adobe

Adobe's DevOps transformation began five years ago when the company moved from packaged software to a cloud services model and was suddenly faced with making a continuous series of small software updates rather than big, semi-annual releases.

To maintain the required pace, Adobe uses CloudMunch's end-to-end DevOps platform to automate and manage its deployments. Because it integrates with a variety of software, developers can continue to use their preferred tools, and its multi-project view allows them to see how a change to any one Adobe product affects the others.

The move has enabled faster delivery and better product management, and according to the Wall Street Journal, Adobe has already been able to meet 60 percent more app development demand.

What do interviewers look for in DevOps/SRE engineers?

It is helpful to know what technical interviewers look for in DevOps/SRE engineers and the most commonly asked questions. As a recruiter, you can help prepare a candidate for the screens and get a sense of how confident they are about the subject matter. Even if you don’t understand the answers, it may be a good idea to go through some of these questions with a potential candidate.

Hard Skills

Logic

Explain what DevOps is.
What challenges exist when creating DevOps pipelines?
What is the CAP theorem?
Which type of application, stateless or stateful, is more suitable for a Docker container?
Explain the Blue-Green deployment technique.
Should I use Vagrant or Docker for creating an isolated environment?
What is sticky session load balancing? What do you mean by "session affinity"?
Name the three variables that affect recursion and inheritance in Nagios

Design

What is the difference between Monolithic, SOA and Microservices Architecture?
What is the difference between Resource Allocation and Resource Provisioning?
What's the difference between a Blue/Green Deployment and a Rolling Deployment?
Explain the term "Infrastructure as Code" (IaC) as it relates to configuration management
Differentiate between DevOps, SRE, and cloud computing.
Explain the master-slave architecture of Jenkins.

Best Practices

What are CI/CD best practices?
What are the differences between Continuous Integration, Continuous Delivery, and Continuous Deployment?
How do you update a live, heavy-traffic site with minimal or zero downtime?
What do you mean by high availability (HA)?
What does the law stated by Melvin Conway imply?

Tools

How do all DevOps tools work together?
Name a few essential DevOps/SRE tools and their usage.
How does Kubernetes orchestrate Containers?
What are the different Selenium components?
What are the benefits of using version control?
What are some of the Automation & Feedback tools used in DevOps/SRE?
What is a merge conflict in Git, and how can it be resolved?
What is Test Kitchen in Chef?

Cloud

Classify Cloud Platforms by category
What do you know about the Serverless model?
How is a Container different from a Virtual Machine?
What is a Virtual Private Cloud or VNet?
How do you build a hybrid cloud?
[AWS] How do you set up a Virtual Private Cloud (VPC)?

Behavioral Skills / Soft Skills

Mention some of the core benefits of DevOps.
How will you approach a project that needs to implement DevOps?
Why has DevOps gained prominence over the last few years?
Describe the branching strategies you have used in your past projects
Describe an experience in which your SRE team discussed and made a decision about whether to triage or conduct a root cause analysis. How did the team discuss the issue or issues, what hands-on role did you play, and what decision was made and why?
Tell me about a time you played a key role in the development of a mission-critical system. What was the system, what was the scale, what was your position in the team, how did you go about assessing and planning your project, and what was the outcome?
Describe an experience during a key project in which you came to the conclusion that you needed to sharpen your skills in a specific area. What was this project, what was the area in which you felt you needed to improve, and what happened?
Describe an experience in which written reports or other correspondence caused confusion as it moved through related departments. How did you assess the issue, how did you rephrase or re-communicate the information, what did you learn, and what was the outcome?
Tell me about a time finger-pointing happened among team members during an incident post-mortem. How did you respond to this, who did you speak with, and what did you do to implement a blameless post-mortem culture? What was the outcome?

Note that when preparing a candidate for a DevOps/SRE role, the behavioral interview is at least as important as the technical interview, since DevOps/SRE has its roots in building an organizational culture that promotes teamwork, efficiency, time management, and leadership.

What’s the best way to find DevOps/SRE Engineers?

There are many options when it comes to finding potential DevOps/SRE engineer candidates.

LinkedIn remains the best place to find developers, at least in the United States and Canada. Their Recruiter search has several fields you can use to search candidates, including title, location, current company, years of experience etc., and by creating a boolean search, the search can be narrowed down very effectively.

A quick search on the LinkedIn Recruiter portal shows that there are 290k+ DevOps Engineers and 77k+ Site Reliability Engineers:

Here are a few tips to narrow this list by adding the following specific details:

Title: DevOps/SRE roles have a range of titles, as we have seen earlier in this blog. Titles such as DevOps Platform Engineer, DevOps Evangelist, Build Engineer, and Release Manager are a surefire way to know that someone is working in DevOps/SRE.
Technologies: determine which technologies are acceptable to the hiring manager and construct a boolean search based on that combination. For example, if the hiring manager wants a developer who knows Continuous Delivery (CD)
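
As an illustration of the tips above, the sketch below assembles a boolean search string from lists of titles and technologies using the common AND, OR, and NOT operators with quoted phrases. The helper names and the example terms (Jenkins, GitLab CI, intern) are assumptions chosen for the example; substitute whatever the hiring manager actually accepts.

```python
from typing import Iterable


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as a phrase."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def any_of(terms: Iterable[str]) -> str:
    """Join alternatives with OR inside parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def boolean_search(titles: Iterable[str],
                   skills: Iterable[str],
                   excluded: Iterable[str] = ()) -> str:
    """Require one of the titles AND one of the skills, minus exclusions."""
    query = f"{any_of(titles)} AND {any_of(skills)}"
    for term in excluded:
        query += f" NOT {quote(term)}"
    return query


if __name__ == "__main__":
    print(boolean_search(
        titles=["DevOps Engineer", "Site Reliability Engineer"],
        skills=["Jenkins", "GitLab CI"],   # example technologies only
        excluded=["intern"],
    ))
    # prints: ("DevOps Engineer" OR "Site Reliability Engineer") AND (Jenkins OR "GitLab CI") NOT intern
```

Keeping titles and technologies in separate lists makes it easy to widen or narrow one dimension of the search without retyping the whole query.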

In addition to searching on LinkedIn, here are some other ways to find and engage with DevOps/SRE candidates:

Job boards - Careerbuilder, Upwork, LinkedIn, Dice, Monster, Workable, SimplyHired, DevOps Job Board, SRE Job Board, Glassdoor, Remoteok.io (Remote), Craigslist (freelance) and Ladders, to name a few.
Staffing companies - you can utilize staffing companies like Rocket, Robert Half, Randstad, and others to help you find a proficient DevOps/SRE developer.
AI sourcing - you can use tools like Hireflow or Fetcher to source DevOps/SRE developers for you to engage.

What are the growth projections for DevOps/SRE engineers? What is the expected compensation?

According to Zippia, the average compensation for a DevOps engineer is $104k in the US. The growth rate in terms of roles is almost 21% per year with organizations of all sizes having a need for DevOps professionals.

The data from Zippia is a bit lower than what we have seen at Rocket, where typical salaries are in the $130-150k range in the US.

Conclusion

We hope you found this guide informative and helpful in your journey of recruiting DevOps professionals. Please reach out if you have any questions or thoughts.

About Rocket

Rocket pairs talented recruiters with advanced AI to help companies hit their hiring goals and knows technology recruiting inside out. Rocket is headquartered in the heart of Silicon Valley but has recruiters all over the US & Canada serving the needs of our growing client base across engineering, product management, data science and more through a variety of offerings and solutions.