Up until a few months ago, the term DevOps was simply another buzzword that filled my Twitter feed; it evoked a particular idea but wasn’t really concrete to me. As with other software development buzzwords such as NoSQL and Agile, it is hard to pin down a definitive definition of the term, just what it isn’t. If you aren’t familiar with DevOps, a simple definition is that its goal is to address a common problem when building online services: the wall that separates developers from operations.
The Big Switch
A couple of months ago, my work group took what many would consider a rather extreme step in eliminating this wall between developers and operations. Specifically, Bing Ads transitioned away from the traditional Microsoft engineering model of having software design engineers (aka developers), software design engineers in test (testers) and service operations (ops), and merged all of these roles into a single engineering role. As the Wikipedia entry for DevOps states, the adoption of DevOps was driven by the following trends:
- Use of agile and other development processes and methodologies
- Demand for an increased rate of production releases from application and business unit stakeholders
- Wide availability of virtualized and cloud infrastructure from internal and external providers
- Increased usage of data center automation and configuration management tools
All of these trends already applied to our organization before we made the big switch to merge the three engineering disciplines into a DevOps role. We’d already embraced the Agile development model, complete with two to four week sprints, daily scrums, burn-down charts, and senior program managers playing the role of the product owner (although we use the term scenario owner). Given our market position as the underdog to Google in search and advertising, our business leaders always want to ship more features, more quickly, while maintaining high product quality. In addition, there’s a ton of peer pressure for all of us at Microsoft to leverage internal tools such as Windows Azure and Autopilot for as much of our cloud services needs as possible instead of rolling our own data centers and hardware configurations.
Technically, our organization was already committed to DevOps practices before we made the transition that eliminated roles. However, what the organization realized is that a bigger change to the culture was needed for us to get the most value out of these practices. The challenge we faced is that an organizational structure with separate roles for developers, testers and operations tends to create walls: each role feels responsible for a certain part of the development cycle and then tosses the results of its efforts downstream to the next set of folks in the delivery pipeline. Developers tended to think their job was to write code and that quality was the role of testers. Testers felt their role was to create test frameworks and find bugs, and that deployment was the role of the operations team. The operations team tended to think their role was keeping the live site running, without the ability to significantly change how the product was built. No matter how open and collaborative the people on your team are, these strictly defined roles create walls. My favorite analogy for this situation is two families who are on a diet trying to lose weight: one has fruit, veggies and healthy snacks in the pantry while the other has pop tarts, potato chips, chocolate and ice cream in theirs. No matter how much willpower the latter family has, they are more likely to “cheat” on their diet than the first family because they have created an environment that makes it harder for them to do the right thing.
Benefits
The benefits of fully embracing DevOps are fairly self-evident, so I won’t spend time discussing the obvious ones that have been beaten to death elsewhere. Instead, I will talk about the benefits I’ve seen in our specific case of merging the three previous engineering roles into a single one. The most significant change is the cultural change in how we view automation of every step related to deployment and monitoring. It turns out that there is a big difference between approaching a problem from the perspective of taking away people’s jobs (i.e. automating what the operations team does) and approaching it from the perspective of making your team more effective (i.e. reducing the amount of time the engineering team spends on operational tasks that can be automated, thus giving us more time to work on features that move the business forward). This has probably been the biggest surprise, although obvious in hindsight, as well as the biggest benefit.
We’ve also begun to see faster time to resolve issues, from build breaks to features failing in production, due to the fact that the on-call person (we call them Directly Responsible Individuals or DRIs) is now a full member of the engineering team who is expected to be capable of debugging and fixing issues encountered as part of being on-call. This is an improvement over prior models, where the operations team were the primary folks on-call and would tend to pull in the development team only as a last resort outside of business hours.
As a program manager (or product manager if you’re a Silicon Valley company), I find it has made my job easier since I have fewer people to talk to because we’ve consolidated engineering managers. No longer having to talk to a development manager separately from the manager of systems engineers separately from a test manager has made communication far more efficient for me.
Challenges
There are a number of risks for any organization taking the steps that we have at Bing Ads. The biggest risk is definitely attrition, especially at a company like Microsoft where these well-defined roles have been a part of the culture for decades and are still part and parcel of how the majority of the company does business. A number of people may feel that this is a bait and switch on their career plans, with the new job definitions not aligning with how they saw their roles evolving over time. Others may not mind that as much but may simply feel that their skills will not be as valuable in the new world, especially as they now need to learn a set of new skills. I’ve had one simple argument when I’ve met people with this mindset: DevOps is here to stay. The industry trends that have had more and more companies, from Facebook and Amazon to Etsy and Netflix, blurring the lines between developers, test engineers and operations staff will not go away. Companies aren’t going to want to start shipping less frequently, nor will they want to bring back manual deployment processes instead of automating as much as possible. The skills you learn in a DevOps culture will make you more broadly valuable wherever you find your next role, whether it is in a traditional specialized engineering structure or in a DevOps-based organization.
Another area where we’re still figuring things out is best practices around ownership of testing. We currently try to follow a “you build it, you test it, you deploy it” culture as much as possible, although allowing any dev to deploy code has turned out to be a bit more challenging than we expected since we had to ensure we do not run afoul of the structures we had in place to stay compliant with various regulations. Testing your own code is one of those practices that many in the industry have come out against as generally a bad idea. I remember arguments from software engineering professors in my college classes that the blind spots developers have about their own software require dedicated teams to do testing. We do have mitigations in place, such as test plan reviews and code reviews, to ensure there are other pairs of eyes looking at the problem space and not just the developer who created the functionality. There is also the school of thought that since the person who wrote the code will likely be the person woken up in the middle of the night if it goes haywire at an inopportune moment, there is a sense of self-preservation that will cause more diligence to be applied to the problem than was the case in the previous era of boxed software, which is when most of the arguments against developers testing their own code were made.
Now Playing: Eminem – Rap God