Advantages of AWS Multi-Account Architecture

When we start out with AWS, we usually begin with a single AWS account and create all our AWS resources in it. And things can become a mess very fast. This article gives you an overview of why you should switch to a multi-account architecture for your AWS workloads sooner rather than later.

Hard limits per AWS Account

AWS has many “hard limits” per AWS account, which means that - in contrast to soft limits - they cannot be increased. Having multiple AWS accounts reduces the probability of hitting one of them. Few things are more annoying than a failing deployment because you hit e.g. the maximum number of EC2 instances per account while rotating auto-scaling groups.

“Blast radius” reduction

One of the most important reasons for separating workloads into several distinct AWS accounts is to limit the so-called blast radius: contain issues, problems or leaks by design, so that only one portion of the infrastructure is affected when things go wrong, and prevent them from leaking or cascading into other accounts.

AWS accounts are logically separated: no AWS account or resource in it can access resources of other AWS accounts by default. Cross-account access is possible, but it always has to be granted explicitly, e.g. by granting permissions through IAM or another mechanism specific to an AWS service.
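
For illustration, here is a rough CloudFormation sketch of such an explicit grant: a role that trusts a second AWS account, so principals from that account can assume it. The account ID, role name and attached policy are placeholders, not a recommendation:

# Sketch only: a role in this account that a second, trusted account may assume.
# 111111111111 is a placeholder for the trusted account's ID.
CrossAccountReadRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: cross-account-read
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::111111111111:root
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/ReadOnlyAccess

Without such an explicit trust relationship (and matching permissions on the calling side), nothing in the other account can touch this one.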

AWS Per-Account Service and API limits

AWS throttles API access on a per-account basis. So, for example, a script of Team A hammering the EC2 API could cause Team B’s production deployment to fail if both teams share the same AWS account. Finding the cause could be hard or even impossible for Team B. They might even see themselves forced to add retries/backoff to their deployment scripts, which further increases the load and causes even more throttling! And last but not least, it adds accidental complexity to their software.

Additionally, there are service and resource limits per AWS account. Some of them can be raised, some can’t. The probability of running into one of these limits declines if you distribute your AWS resources across several AWS accounts.

Security

Maybe you remember the startup Code Spaces, which kept all their resources, including backups, in one AWS account: they got hacked and were entirely vaporized within 12 hours.
I would argue that this scenario would have been less likely if their backups had resided in another AWS account.

Environment separation

Typed DROP DATABASE into the wrong shell? Oops, production is gone! That’s actually a common story; you might remember this GitHub outage (not directly related to AWS, but with similar contributing factors).

Consider separating e.g. test, staging and production environments into their own AWS accounts to reduce the blast radius.

IAM is complicated

IAM is not easy to grasp, and even today there seems to be no easy way to follow the Principle of Least Privilege in IAM. I’d say that Managed Policies are a good start, but too often I catch myself falling back to assigning AdministratorAccess. So we tend to give away too many permissions, e.g. to IAM roles or IAM users.
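
As a rough illustration of the least-privilege direction (not a prescription - the policy name, the referenced role and the bucket are hypothetical), a narrowly scoped policy instead of AdministratorAccess could look like this in CloudFormation:

# Sketch: a narrowly scoped policy instead of AdministratorAccess.
# Policy name, role and bucket are placeholders.
DeployArtifactsPolicy:
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: deploy-artifacts-only
    Roles:
      - !Ref DeploymentRole          # hypothetical role defined elsewhere in the template
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action:
            - s3:GetObject
            - s3:PutObject
          Resource: arn:aws:s3:::example-deploy-bucket/*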

By separating workloads into their own AWS Accounts, we once again reduce the blast radius of too broad permissions to one AWS account - by design.

Map AWS Accounts to your organizational structure

Companies usually try to break down the organization into smaller autonomous subsystems. A subsystem could be an organizational unit/team or a product/project team. Thus, providing each subsystem with its own AWS account seems natural. It allows teams to make autonomous decisions within their AWS account and reduces communication overhead across subsystem borders as well as dependencies on other subsystems.

The folks from scout24, though, issued a warning about mapping AWS accounts 1:1 to teams:

The actual sizing and assignment of accounts is not strictly regulated. We started creating accounts for teams but quickly found out that assigning accounts per product or group of related services makes more sense. Teams are changing members and missions but products are stable until being discontinued. Also products can be clearly related to business units. Therefore we introduced the rule of thumb: 1 business unit == 1 + n accounts. It allows to clearly identify each account with one business unit and gives users freedom to organize resources at will.

I can fully sign that statement, as I have seen many times how teams split and merge or constantly get reorganized. This is especially true in companies that think they are agile and try to fix deeper systemic problems by constantly reorganizing people and teams, ignoring Conway’s Law or their technical constraints and heritage.

Exploring your company’s Bounded Contexts might be another method to find the right sizing and slicing.

Never slice AWS accounts by teams or org units - but rather by Bounded Context, product, purpose or capability.

Making implicit resource sharing harder by design

I guess almost everyone can tell a story of one big database in the middle, and tons of applications sharing it (Database Integration).

Sam Newman puts it succinctly in “Building Microservices”:

Remember when we talked about the core principles behind good microservices? Strong cohesion and loose coupling — with database integration, we lose both things. Database integration makes it easy for services to share data, but does nothing about sharing behavior. Our internal representation is exposed over the wire to our consumers, and it can be very difficult to avoid making breaking changes, which inevitably leads to a fear of any change at all. Avoid at (nearly) all costs.

Probably the best way to get out of this situation is to never get into it. So how do we get into this situation in the first place? I guess usually because humans take the path of least resistance. The usual way goes like this: change the security group settings and connect directly to the database (in the same VPC and AWS account). And BOOM: it becomes a shared resource - and a broken window.

I’d argue that with separate AWS accounts it’s harder to build an entangled mess. In the described case, one would first need to connect the two VPCs from the different AWS accounts. People might think twice about whether there is another way of accessing the data source in the other AWS account, e.g. by exposing it via an API. And even when they go for VPC peering, they at least have to make that EXPLICIT on BOTH sides. It’s no drive-by change anymore.
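
To make that explicitness concrete, here is a sketch of the requester side of a cross-account VPC peering connection in CloudFormation; all IDs and the peer role ARN are placeholders, and the peer account still has to provide that role (or accept the request) on its side:

# Sketch: requester side of a cross-account VPC peering connection.
# IDs and the peer role ARN are placeholders; the peer account must
# explicitly accept the peering (here via a role it created for that purpose).
PeeringToDataAccount:
  Type: AWS::EC2::VPCPeeringConnection
  Properties:
    VpcId: vpc-11111111                         # our VPC
    PeerVpcId: vpc-22222222                     # VPC in the other AWS account
    PeerOwnerId: '222222222222'                 # the other AWS account's ID
    PeerRoleArn: arn:aws:iam::222222222222:role/accept-vpc-peering

Routes and security group rules still have to be added explicitly on both sides before any traffic flows.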

Ownership and billing

Another advantage is the clarity of ownership when using multiple accounts. This can be enormously important in organizations which are in the transition from a classical dedicated ops team to a “You build it, you run it.” model: if, let’s say, a dev team spawns a resource in their AWS account, it’s their resource. It’s their database, it’s their whatever. No throwing it over the wall. They can move fast, they don’t have to mess around with or wait for other teams, but they are also more directly connected to the consequences of their actions. On the other hand, they can also make changes with less fear of breaking things in other contexts through unknown side effects (remember the entangled database from above?).

It also makes billing really simple, since costs are transparently mapped to the particular AWS accounts (Consolidated Billing), so you get a detailed bill per business function, environment, or whatever you defined as the dimensions for your AWS accounts. Again, a direct feedback loop. In contrast, think of a big messy AWS account with one huge bill. That might simply reinforce the still prevailing belief in many enterprises that IT is just a cost centre.

Side note: Yes, you could also use Cost Allocation Tags to make ownership and costs transparent, but tagging has some limitations:

  1. Tagging is not consistent across AWS services and resources: Some support tags, some don’t.
  2. You need to force people to use tagging and/or build systems that check for correct tags etc. This process has to be maintained (e.g. initialized, communicated, trained, enforced, re-communicated, reinforced, and so on).

Right from the beginning

When I created my first corporate AWS account back in 2010, neither I nor my colleagues were aware of all the multi-account advantages mentioned here. This was one contributing factor resulting in one big shared AWS account across teams. And believe me: “We’ll split up the account later, when we have more time / are earning money / have more people” is usually not going to happen! So please don’t make the same mistake! Create more AWS accounts!

My current favorite is to slice AWS accounts in two dimensions:

  • Dimension one: Business function/capability/product/project/Bounded Context (not teams/departments, see above!)
  • Dimension two: Environment (e.g. test, staging, prod)

This sounds like a lot of initial complexity, but I think it’s really worth it in the long term for the mentioned reasons.

Creating AWS Accounts is free and:

It’s getting easier with AWS Organizations

AWS Organizations not only simplifies the creation of new AWS accounts (it used to be a pain in the ass!), it also helps govern who can do what: you can structure the AWS accounts you own into an organizational tree and apply policies to specific sub-trees. For example, you could deny the use of a particular service org-wide, for an organizational unit or for a single account.
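
As a sketch of such a deny policy - assuming the AWS::Organizations::Policy resource type, which CloudFormation only gained later, and using a placeholder OU ID and an arbitrarily chosen service - it could look roughly like this (the same document can also be attached via the Organizations console or CLI):

# Sketch: a Service Control Policy denying one service for an organizational unit.
# Resource type availability, OU ID and the denied service are assumptions/placeholders.
DenyRedshiftPolicy:
  Type: AWS::Organizations::Policy
  Properties:
    Name: deny-redshift
    Type: SERVICE_CONTROL_POLICY
    TargetIds:
      - ou-abcd-12345678             # placeholder organizational unit
    Content:
      Version: '2012-10-17'
      Statement:
        - Effect: Deny
          Action: 'redshift:*'
          Resource: '*'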

Outlook

In one of my next articles, I am going to shed some light on the drawbacks of having many AWS accounts, and also on how to mitigate them with automated account provisioning and governance, so stay tuned!

Thanks

I want to thank Deniz Adrian for reviewing this article and adding additional points about implicit resource sharing and fearless actions.

Paul O'Neill: The Irreducible Components of Leadership

This is an annotated transcription of Paul O’Neill’s talk on leadership - in my opinion the most powerful and inspiring talk on leadership I have ever seen. I decided to transcribe it (well, YouTube did most of the work with its automatic subtitles feature), because there are so many great quotes in it and I wanted to have it as a source for myself, e.g. for future articles, so I always have a written reference.

So here it is - if you find any errors, please do not hesitate to open a pull request:

I want to talk to you about leadership concepts this morning. I have now spent a lot of time working in a variety of ways in health and medical care, and I choose to talk about the leadership component because I believe this: with leadership anything is possible, and without it nothing is possible. So I’m going to define for you, if I can, what it is I mean by leadership - what it is that we should expect a leader to do. First of all, I think it’s necessary for a real leader to articulate what I call unarguable goals and aspirations for the institution that they lead. That doesn’t mean that I think they should invent them themselves in the dark in the middle of the night, but I believe it’s a really important, critical role for a true leader to articulate non-arguable goals.

So I want to tell you some non-arguable goals. I want to start with my favorite thing: in a really great organization, the people in it are never injured at work. Now when you head off in that direction, one of the things you’ll find - I found it when I first went to Alcoa, and I said on the first day I was there: people who work for Alcoa should never be hurt at work. There were a whole lot of people in Alcoa who didn’t say it to my face but were saying in the hallways or behind me: “He doesn’t know anything about making aluminum.”, “He doesn’t know what it’s like to be in a smelter in Alcoa, Tennessee in the summertime, where it is a hundred and thirty degrees and the humidity is almost a hundred percent and people get heat prostration and there’s nothing you can do about it.”, “He doesn’t know, understand or appreciate any of that, and so we’re pretty sure that after he learns something about the business and we have our first downturn in metal prices he’ll shut up about safety, because we’re already in the top one-third of all organizations in the United States in terms of our safety performance!”

So I’m here to tell you that a leader who articulates non-arguable goals is likely to get some arrows in the back. But it doesn’t mean you should stop. It really should renew your commitment to be out there in front articulating goals. Let me take it into health and medical care and say to you: I believe the same kind of goal is the right goal for hospital-acquired infections. And let me be careful to say, it’s very difficult to actually get to zero injuries to a workforce or to zero nosocomial infections, but I think it’s pretty hard for anyone to sustain an argument that says our goal should be some positive number. And I did this, you know, when I first came to Knoxville, and I said to people: “You know, I’ve only been here three weeks, but I hope the Alcoa tom-tom network works as well as most informal communication systems do, and you already know that I’ve said that Alcoa should be a place - around the world, not just in the United States, around the world in 43 countries and 350 plants - where people are never hurt at work. And it can only happen if you will take personal accountability and responsibility for this, for yourself and for your associates, for us to get there. And if some of you - as I’ve been told by the supervisors - believe that we should not set a zero goal because it’s unlikely we can achieve it, I’d like for you to raise your hand if you want to volunteer to be hurt so we can reach the goal.” There were no volunteers! And I guarantee you, if you asked patients, “Would it be okay if we gave you an infection, because we’re not meeting our goal this month?”, there would be no volunteers.

So a leader needs to articulate non-arguable goals. And again, it doesn’t mean that we know exactly how we’re going to get there, but at least we’ve got every human factor in our organisation lined up and trying to achieve the targeted goal. This cannot be a delegated function. You can’t have a person who’s the vice president for goals. The leader needs to articulate the goals. Other people do not have the power or the position to do that. Now, after the goals have been articulated clearly - you notice I didn’t start by saying, you know, we’re going to make a hell of a lot of money, and let me just say parenthetically that’s because I believe that in a truly great organization finance is not an objective, it’s a consequence, and it’s great if it’s the consequence of being more excellent at what you do than anyone else that does what you do. In my experience, finance follows excellence. So a goal of economic or financial success is not a good place to start. It doesn’t mean you don’t have to earn the cost of capital or cover your costs or any of that, but it should not be an objective of the organization; it should be a consequence of excellence.

So how do you move from these goals into action in an organization? First of all, I think it’s incredibly important to reach every person in the organization - and again, I’m talking about the theoretical limit; in my experience there are always, no matter how hard you work at it, three or four percent of the human factors in the organization that never get it and can’t get it, and you need to do something about that - but I found most people respond to a positive idea of leadership and organizational aspiration. Not many places really respond very well to negative motivation.

And so in an organization that I believe has the potential for greatness - it doesn’t guarantee it, but it has the potential for greatness - the people in the organization can say “yes” every day, without any reservation or hesitation, to three questions. Here are the three questions:

I’m treated every day with dignity and respect by everyone I encounter, without regard to my gender or my nationality or my race or my educational attainment or my rank or any other discriminating qualifier. Think about that for a minute. It means that when you go into the lobby of your enterprise every morning, the person behind the desk treats every person with the same happy face and welcoming greeting, not related to whether you’re a surgeon that brings in 13% of the business or the person who cleans the room. In a truly great organization there is a seamless sense of “Everyone here is accorded dignity and respect every day”. Now I have a corollary for you, which I practiced at Alcoa: “If you’re not important you shouldn’t be here”. That raises a really difficult, challenging proposition, if you think about it, because in an awful lot of organizations, when times are tough, people are laid off. Not in a great organization, because at any particular time the people that are there are necessary or they wouldn’t be there. That creates a real challenge for leadership: to figure out how to navigate ups and downs and economic cycles. And there is a way to do it, by being clear in your own mind and in your own institution about the difference between a baseline of activity and fluctuating activities, so that you can negotiate with people who are going to be on the bubble, if you will, so that they understand that they’re on the bubble and they are there as casuals to take care of fluctuations. But for people who are part of the baseline there needs to be an honored commitment: that you are really important or you wouldn’t be here, and we need you all the time. So the first proposition is: I can say every day, I’m treated with dignity and respect. Full stop.

The second proposition is this: I’m given the things that I need - training, education, tools, encouragement - so that I can make a contribution - that’s important now - that gives meaning to my life. Think about it, you know: if your work doesn’t give meaning to your life - it’s what you spend eight or ten or twelve hours a day doing - then where are you going to get meaning in your life? On the golf course, or going out to dinner? You know, so I believe it’s an obligation of leadership to create a condition so people can say: I have all the things I need so I can make a contribution that gives meaning to my life. There are not a lot of places where people can say: this place gives meaning to my life.

And the third proposition is pretty simple; it says: every day I can say that someone I care about and respect noticed I did it. In a word, it’s recognition - regular, meaningful, sincere recognition. Now if you find a place, or you can create a place - and I would say this is a job of leaders; again, this cannot be delegated to human resources; it is for a leader of an institution to establish the conditions on an ongoing basis so every person in the institution can say yes to these three propositions every day - then you have the potential for real greatness.

Now after the leader has articulated the goals and created these cultural characteristics that are pro-excellence, a leader needs to take away excuses. And in my experience the excuses are all the same across public, private and nonprofit. When you make these suggestions to people they say: “Well, you don’t understand. We really can’t do this quality or continuous learning, continuous improvement set of things, because we’re already working two hours past what we get paid for. We’re too busy to take on something new.” And people say “We’re too busy!”, and they will say: “If we’re going to do this we need more people. We need to hire some people who are experts in continuous learning and continuous improvement and quality, and we need to set up a new department”, and people will always say “We don’t have enough money, we’re already struggling, so we need more money!” I believe it’s the leader’s responsibility to take away all of the excuses.

So I want to give you an example of taking away excuses. I was telling this to ? last night. She said, “You need to tell this story.” So when I first came to Tennessee, to Knoxville, to Alcoa Tennessee, I spent the morning walking through the plant, because I like to feel the things that I am supposed to be responsible for, so I wanted to see what it was like to be there for half a day and see what it smelled like and how the people were dressed and whether they had, you know, half of a finger - whatever - as a consequence of being in this place. And so at noon they said, we’re going to have lunch. And there were 75 people at lunch; half of them were from the supervisory ranks and the other half were from the union-organized workforce. So they said, “Would you like to say something?” I said, “Yes, I would.” So I got up and I said, “You know, I want to talk to you about safety, and I presume you’ve all heard this, but here’s what I want to say to you. I want to say to the supervisors: I believe it’s the leader’s role to take away excuses, so here’s what I’m saying to you: We will never ever in Alcoa again budget for safety. Never. We’re not going to have a budget line for safety. As soon as we identify anything, as soon as anyone in the institution identifies anything that could cause an individual to be hurt, we’re going to fix it right and we’re going to fix it as fast as it’s physically possible. And so I want to charge you in the supervisory ranks with acting on that idea. You need to actually do it. If something breaks down or you think something could hurt somebody, fix it right now. We’ll figure out how to pay for it later on. Just do it!” And then I turned to the hourly workforce and I said to them, “You heard my instruction to them. Here’s what I want to do with you. I want to give you my home phone number, so that if they don’t do what I said, you can call me!” Not many CEOs were giving their home phone number away, but I wanted the people to know that this was a real thing. A few weeks later I got a call late one afternoon from an operating guy from the floor in Alcoa Tennessee, and he said, “You know, I’m calling because you told all of us we should call you if the supervisors are not fixing things. Well, we’ve had a roller conveyor system down here that’s been broken for three days or so, and as a consequence those of us in the workforce have to pick up the 900-pound ingots, a bunch of us, and put them on a dolly and take them from one processing step to the next. And we’re going to get hurt doing this! Our backs are at risk at a minimum, and if we dropped one of these things on our foot we’d be permanently disabled. So I want to know: what are you going to do about it?” So I said, “You know, let me make a few phone calls.” So I called the supervisory people and explained to them that they were not doing what I told them was their obligation to the workforce. And you know, I had a couple of phone calls like that in the first six months I was at Alcoa, but fortunately the tom-tom network at Alcoa really worked well, and after I had made a couple of interventions I didn’t have to make any more.

You know, so part of what I want to say to you is: leadership is not about writings on the wall. It’s about acting in a noticeable way on the principles that you establish, so that people begin to believe that they are real, that they’re not just writing on the wall. I would suggest that every organization that I know about that has an annual report says someplace early in the annual report, “Our human resources are our most important asset.” In most places there’s no evidence that’s true; it’s just a sentiment. So we all say it - yeah, our human resources - but is your practice consistent with that? It’s part of the reason that I elected, the first day I went to Alcoa, to articulate this goal that no one should ever be hurt at work: because it’s measurable, right? You can tell whether or not somebody couldn’t come to work because they were hurt at work, because they aren’t there, right? You can’t fudge the numbers. You can fudge numbers about recordable incidents and first-aid cases, but it’s pretty hard to lie about “Didn’t show up today”. That’s why I wanted a hard measure that we could look at every day, so we could appreciate whether we were making progress or not.

So I want to tell you a little bit about how these numbers are done. In 1987, the average rate of Americans in the workforce being hurt at work was five out of every 100: five out of 100 Americans had an incident at work in 1987 that caused them to miss at least one day of work. Alcoa’s number was 1.86. And if you want to know what the number is today, go on the internet, type Alcoa, and when you get the drop-down menu, go to environment, health and safety, and it’ll tell you, 24 hours a day, in 43 countries, in 350 locations, what the lost workday injury rate is on a running basis, anytime you want to look at it. Yesterday it was 0.116. Now why do I tell you that? Because the average lost workday rate in American hospitals is 3.0. And if you’re not good at math, that’s 26 times worse than Alcoa. That’s unforgivable, because it’s within the capacity of leaders to articulate a zero goal and then to accomplish it.

And I’m going to tell you a little bit about how you accomplish it, because it’s not good enough - cheerleading truly won’t do it - but this is really relevant to the quest for excellence in health and medical care. I want to stay for a moment with injuries to the workforce, because the lessons about how to get close to zero injury rates among the workforce are exactly, precisely the same things that are required to achieve startling excellence in the delivery of health and medical care. So first of all you have to establish a process that says: every incident that happens to one of our employees is going to be recorded within 24 hours, and it’s going to be put into cyberspace along with the surrounding circumstances - and, where it’s possible to do it in 24 hours, the root cause analysis and an indication of the corrective action that’s being taken, so that this set of circumstances will never again produce this result.

Now I will tell you a special piece about this. I believe that it’s really important in our world to keep things personal. And so when I started this at Alcoa, I said, “Not only do I want to identify this case, I want to do it by name.” My lawyers didn’t like that, because they said, “You’re going to create a feast for the tort bar to come in here and sue the hell out of us, because we’re now going to put into cyberspace, for anybody to look at, individuals by name and what happened to them.” You know, what lawyer could find a better way to produce cases? Okay, and I said, “I don’t think you’re right, and I’m going to take the personal responsibility if we do get sued, because it’s so important that we not let this be statistical.” It needs to be about: every person is important, and they’re important by name; they’re not important as a statistic. So what happens when you create that, in the world that we live in now with this unbelievable connectivity? If you have an understanding, and everyone in the organization has signed on the wall, “I’m responsible for myself not being hurt and for my mates not being hurt”, when the message goes into cyberspace you can look at it with an expectation that - within the next 24 hours - 349 other locations around the world, including Guinea and Russia and China, India and Brazil and Argentina, 43 countries in all, that the people in those institutions will look at those cases and make whatever modifications are indicated, so we don’t have to learn this 350 times! That’s how you get close to zero: by continuous learning and continuous improvement from everything gone wrong.

It really works in an unbelievable way, and I would suggest to you in health and medical care, it would be great if we could get leaders to sign up for the idea that there is a measurable way of knowing whether or not the people in the organization are truly the most important resource: by being able to tell what kind of an injury rate exists among the people who deliver the care. So I have to tell you, I’m really skeptical of an organization that doesn’t know what its injury rate is - that claims they’re really good at hand hygiene, you know - because if you’re not really good at your workers’ own safety, at least for me there’s a doubt that you’re really good at the other things that we know are directly related to perfect patient care. And again I would suggest to you, the tools of learning and approach and engagement of the population are exactly the same in every kind of institution.

So when I went to the Treasury - I’ll tell you a little story about Larry Summers, who was my predecessor as Secretary of the Treasury under the Clinton administration. When we had our first briefing session, where he was going to tell me about what he’d been doing, toward the end of the session I said to Larry, “Larry, what’s the lost workday rate at Treasury?”, and he said, “I don’t know what you’re talking about”, which frankly was not a surprise to me.
And it took about three weeks to actually round up the data, and it turned out that the injury rate at the Treasury - you may think, you know, how can anybody get hurt at the Treasury - well, there are 125 thousand people there, and a significant fraction of them work in the mint. And if you went into the mint in 2001 and looked at the workers, you’d find a lot of workers with half a little finger, because the stamping machines took the end of their finger off - it’s kind of a badge of experience. The injury rate of the Treasury was unbelievable. In 23 months we reduced the injury rate of the Treasury by 50%. In 23 months.

But I will tell you another story, related both to Alcoa and the Treasury, to demonstrate another important principle to you. I believe that excellence at its best is habitual. And by habitual I mean it’s ingrained and inculcated in all the individuals in the institution, so that it’s almost automatic. So it means it applies to everyone - again, I can’t say enough about how important it is that if you’re really going to be on a quality quest, it needs to be about everyone in the institution. The people in the quality department cannot produce quality in an organization. It doesn’t mean they don’t have an important responsibility, but they cannot do it. In the same way that infection control committees cannot fix infections, right? They have an important role, but they cannot make it happen for the whole institution. In this quest to make sure that everyone in the institution grasps these ideas, I called in the controller at Alcoa - this is about 1991 - and I said, “Ernie, I’d like to know: right now we’re closing our books in this worldwide enterprise in 11 days and reporting our results to Wall Street, and I’d like to know, if we had a perfect process with no repair work, no transpositions of numbers, no foul-ups with computer programs that don’t integrate very well with each other for all these 350 locations - if we had no repair work and all of the time that we spent was high-value touch time, which means we’re actually producing value in every minute of every day - how long would it take?” About three weeks later he came back to me and he said, “I’ve figured out the answer to your question and here it is: Right now we’re closing the books in 11 days. If we did it perfectly, we could do it in three days.” And I said, “You know, Ernie, that’s our new goal!”, and he said, “No, that’s not what I meant. Oh my god, we can’t really do that! That’s just the answer to your question!” I said, “Hey Ernie, we’re trying to be perfect at everything else we do, including workplace safety and manufacturing, and so the finance function needs to demonstrate to the rest of the organization what excellence really looks like.”
And it took us a year to get there. Now here the leadership function is really important. I had to say to them, “I don’t care how much it costs to make this perfect. I don’t care, because I’m so confident that the value is there, and so here’s your permission: You can examine all of the things that we’re sucking up from around the world and decide whether the stuff that has evolved is really critical to a financial characterization of our organization and to meeting our responsibilities to the Securities and Exchange Commission - so you have the freedom to redefine what it is we do. You have the resources to rewrite the computer programs so that they’re friendly to human beings instead of only to people who are nerds, who delight in complexity, so you can make this work for the people who have to do the process of financial roll-up. And if you need some outside help, go and get it!”

So a leader needs to provide running room for people to work toward the theoretical limit. And in a year we got to the point where we could close our books in three days. Full stop. And today, if you look at the quarterly earnings report process - you look at CNBC or any of the other financial channels - Alcoa is still, and probably always will be, the first major corporation to roll up its earnings and report them, good and bad, because the process works now. Think about the implication of that; I want you to transfer it to health and medical care. In Alcoa at that time we had 1,300 people in the finance function, and by going from 11 days to 3 days we freed up 8 days a quarter for 1,300 of the most highly trained analytic people in the organization. Not so we could fire them, but so that they could use their brain power to help us better understand how to improve everything else we were doing. This is not about firing people. It’s about creating the opportunity to apply resources in a way that produces ever greater value.

So when I went to the Treasury I said to them, “How long does it take us to close our books after the end of the fiscal year on September 30th?”, and they said, “Well, we usually get it done by March”, and I said, “I don’t know why you even bother! Who the hell wants to know what the numbers were five months after the fact?” So I said to them, “You know, I know an organization that’s more complicated than the Treasury where they close the books in three days, and that should be our goal. At the Treasury we should be at least as good as Alcoa.” And so they said - you know, again the routine excuses - “We don’t have the money. We’re already too busy.” And then they hit me with a new one: “Government laws and regulations won’t permit it.” Smart guys. So I said, “I tell you what: if you can show me a rule or regulation or a law that prohibits us from doing this, I will go and get it changed.” Again it was taking away the excuses. “Give it to me. If you tell me there are barriers that need to be rolled away, I will roll over the barriers.” There weren’t any. It was just an excuse. Nobody had really examined: how the hell can we do this? We will do it! And so in 13 months at the Treasury we figured out how to close the books in three days. If you want to see this story, it’s on the Treasury website. They’re so proud of it. My name isn’t there, but it happened on my watch, because I got my controller from Alcoa to come pro bono. We didn’t pay them a dime - they came pro bono to coach the people at the Treasury on how to do this job.

And again, the reason I tell you this is because I know an awful lot of health care institutions that don’t close their books in three days, but they could if the leadership decided this is a value and a way to demonstrate to the organization that every part of our institution is on the same wavelength: we’re all about excellence, we don’t and won’t live in silos, we won’t embrace excuses, and we will be excellent at everything that we do. And I would tell you more stories about health and medical care, but I’ve used up my time, and I hope I have challenged you a little bit, maybe inspired you a little bit, about the potential for what you as leaders in health and medical care can do, because I will tell you just one more thing: I believe there is no other sector of our society and our economy that has the same potential for simultaneously improving outcomes from medical intervention and reducing costs by a trillion dollars a year.

Dead man's switch with AWS CloudWatch: Freshness-Alerting for Backups and Co

A recent challenge for one of the teams I am currently involved with was to find a way in AWS CloudWatch:

  1. To alert if the metric breaches a specified threshold.
  2. To alert if a particular metric has not been sent to CloudWatch within a specified interval.

While the first one is pretty much standard CloudWatch functionality, the latter is a bit more tricky. In the Nagios/Icinga world it’s called “freshness”. You could also call it a special case of a “Dead man’s switch” for periodic tasks / cronjobs.

For example, in our case we wanted to monitor and get alerted on whether a backup job runs once per day.

So here is what we did (CloudFormation snippet below):

  • Set the check period to the interval during which the metric is supposed to be sent, e.g. 86400 seconds if the metric is supposed to be sent once a day. This instructs CloudWatch to check once per day.
  • Set evaluation periods to 1: We want to get alerted immediately when there is no data written or the threshold has been breached.
  • And now the important one: we have to treat missing data as breaching, so that the alarm gets triggered if there has been no data point within the evaluation period.

Example in CloudFormation syntax:

HealthCheckAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Period: 86400
    EvaluationPeriods: 1
    TreatMissingData: breaching
    ...
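
For context, a more complete alarm could look roughly like the sketch below; the metric name, namespace, and alarm action are hypothetical and depend on what your backup job actually pushes to CloudWatch:

# Sketch of a complete freshness alarm; Namespace/MetricName/AlarmTopic are
# hypothetical values for a custom "backup succeeded" metric and an SNS topic.
HealthCheckAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Backup job did not report success within the last 24 hours
    Namespace: Custom/Backups          # assumption: custom namespace
    MetricName: BackupSucceeded        # assumption: custom metric sent by the backup job
    Statistic: Sum
    Period: 86400
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    TreatMissingData: breaching
    AlarmActions:
      - !Ref AlarmTopic                # hypothetical SNS topic for notifications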

Reading Groups / Book Clubs in Companies

Book discussions as a tool for organizational learning

At the devopsdays 2015 in Berlin we had an Open Space group dealing with the topic: “I read a lot of books, blogs, etc., and would like to carry this knowledge into the company I am currently working for.” The idea of forming reading groups came up, and I liked it so much that I tried it out right away.

Since then I have (co-)founded several book reading groups in companies and have come to appreciate these “book clubs” as a very effective tool for structured organizational learning.

What is a book club / reading group?

A book club is a group of people who work through a book and discuss it, e.g. chapter by chapter, in regular meetings. The actual reading happens before the meetings; each member is responsible for it themselves. Suitable books are technical books that have some relation to the company.

Advantages of book clubs - what do we get out of them?

Book clubs bring a number of benefits:

  • Discussion in the group: The subject area of the book is discussed and worked through in the group. This produces discussions that are very valuable (if not the most valuable) for the company. (Mostly) new concepts are first read by each participant individually and then discussed in the group.
    So these are not the ideas of a single person; every group member has worked through the topic themselves by reading. People tend to absorb “self-acquired” information better and to work it into their mental models (which may diverge strongly from what they have read) than when, for example, a single person tries to bring new ideas into a group.
  • Aligning different mental models: How we see reality is always only an excerpt. For example, completely different ways of working or definitions of terms may exist within the group. When the book touches on these, discussions often arise like “Oh, that’s how you do it?”, or “Ah, now I understand what you mean by X!”, but also “Then let’s agree on X - I’ll bring it up in my team!”. These are exactly the right discussions, because misunderstandings and mutual incomprehension are cleared up in a more objective way, which probably also strengthens group cohesion within the company.
  • Reflection on one’s own work: Technical books provide a good basis for a reality check of the currently prevailing ways and patterns of working. Of course, books never tell the whole truth, or the world they present is too perfect, but the literature usually gives a good indication of the level a person or a group is at. At the moment, for example, we are reading “Site Reliability Engineering”, which describes how Google works internally - so some concepts from the book only become applicable once you have reached a size like Google’s. By and large, though, the concepts are transferable, or at least they spark valuable discussions.
  • Direct application of what has been learned: I remember a book club in which we worked through “Implementing Domain-Driven-Design”, with people from different teams taking part. The result was very constructive discussions about the overall architecture of the software the company develops - this time guided by the theory from the book rather than by differing mental models or levels of knowledge (which, to me, had always seemed to be the case in previous meetings).
    Another example was a book club in which we discussed “Toyota Kata” and then started to create a value stream map for the entire company. That was a fascinating look into other parts of the company, and I noticed how much fun it is to first discuss the theory with the group and then philosophize about where our company actually creates value - I have rarely experienced discussions of such high quality.
  • Peer pressure: Like so many things that are important but not urgent, reading books all the way through in a disciplined way often falls by the wayside in everyday life. Here the pressure of the group can help: if the next meeting is tomorrow, you might just get up an hour earlier to finish the chapter.
  • Inexpensive: There are many ways to develop employees and teams: workshops, trainings, conferences, etc. - and they are often very expensive. Their success may also be questionable: after a training or conference, day-to-day business usually catches up with us quickly, and the fresh ideas and momentum fizzle out. The direct costs of book clubs are normally limited to reading and discussion time plus the purchase of the book.

How do I get started?

As a group

Finding a topic, a book, and a reading group

First, “someone” has to suggest a book and then find a reading group for it. Usually this person then also becomes the organizer of the group. Finding members can happen e.g. by presenting the book club concept in meetings or simply at the coffee machine. It is important that it always remains a voluntary offer.

It also helps to “show weakness” yourself, e.g.: “I would like to get into topic XYZ because I have knowledge gaps in that area. I have found book XYZ and would like to discuss it with several people to make sure I have really understood the concepts.” By making yourself vulnerable like this (admitting that you have knowledge gaps), you increase the chances of getting more people on board: either they realize that it is not a problem to join this group with gaps in their knowledge and that it is not about exposing individual employees - or they are already well versed in the topic and can then contribute their existing knowledge to the chapter discussions.

Reading group setup

Once you have found your reading group, you can get started! First, a regular slot should be agreed on, e.g. one hour every week. It is also helpful to set up a mailing list or a chat right away (depending on what the company already has) and invite the members. These can then be used for announcements.

Next, the group agrees that every member will have read one or more chapters by the first meeting. A recommendation for the first meeting: rather start with just the first chapter or the introduction - not too much at once at the beginning, because there will certainly be a lot to exchange at the first meeting.

In the meeting, the group then goes through the chapter, e.g. by discussing highlighted passages. It is often also worthwhile to look into the references and footnotes. Another option is to appoint a moderator for the meeting who presents the chapter and guides the group through it. With the discussions that arise (see above), the time usually passes faster than expected.

Finishing the book

Once the book has been read to the end, the group can either dissolve or pick another book right away and carry on. If the group carries on, it is helpful to take in additional potential members to increase diversity, because book clubs are not immune to group dynamics such as groupthink either. You also have to make sure that no in-groups and out-groups form (e.g. members of the book club who then feel “more elite” than non-members).

As a company

  • Explicitly free up working time: As a sign that the company supports the continuing education of its employees, a certain share of working time should be “released” for explicit learning activities such as book clubs. One possible model: the company covers the discussion time, the employees cover the reading time.
  • Involve team leads: Book discussions often surface “someone should really...” topics. It always helps if team leads take part directly, so that changes to work processes or experiments can be implemented more quickly. Often, things are brought up that would otherwise get lost in day-to-day business.

Summary

Book clubs are a very cost-effective way for companies to promote the continuing education of their employees. Furthermore, they are a tool for organizational learning.

AWS Continuous Infrastructure Delivery with CodePipeline and CloudFormation: How to pass Stack Parameters

When deploying CloudFormation stacks in a “Continuous Delivery” manner with CodePipeline, one might face the challenge of passing many parameters from the CloudFormation stack describing the pipeline to another stack describing the infrastructure to be deployed (in this example, a stack named application).

Consider a CloudFormation snippet describing CodePipeline which deploys another CloudFormation stack:

# pipeline.yaml
...
Resources:
  Pipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      ...
      Stages:
        ...
        - Name: Application
          Actions:
            - Name: DeployApplication
              ActionTypeId:
                Category: Deploy
                Owner: AWS
                Provider: CloudFormation
                Version: 1
              Configuration:
                ActionMode: CREATE_UPDATE
                StackName: application
                TemplatePath: Source::application.yaml

Now when you want to pass parameters from the pipeline stack to the application stack, you could use the ParameterOverrides option offered by the CodePipeline CloudFormation integration, which might look like this:

# pipeline.yaml
...
- Name: DeployApplication
  ...
  Configuration:
    StackName: application
    TemplatePath: Source::application.yaml
    ParameterOverrides: '{"ApplicationParameterA": "foo", "ApplicationParameterB": "bar"}'

This would pass the parameters ApplicationParameterA and ApplicationParameterB to the application CloudFormation stack. For reference, this is what the application stack could look like:

# application.yaml
---
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  ApplicationParameterA:
    Type: String
  ApplicationParameterB:
    Type: String
Resources:
  ...

Alternative way of parameter passing with Template Configurations

Injecting parameters from the pipeline stack into the application stack can become awkward with the ParameterOverrides method. Especially when there are many parameters and they are passed into the pipeline stack as parameters as well, the pipeline template could look like this:

# pipeline.yaml
---
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  ApplicationParameterA:
    Type: String
  ApplicationParameterB:
    Type: String
Resources:
  Pipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      Stages:
        ...
          Actions:
            - Name: DeployApplication
              ...
              Configuration:
                ...
                TemplatePath: Source::application.yaml
                ParameterOverrides: !Sub '{"ApplicationParameterA": "${ApplicationParameterA}", "ApplicationParameterB": "${ApplicationParameterB}"}'

An alternative way is to place a so-called template configuration into the same artifact which contains the application.yaml template, and reference it via TemplateConfiguration:

# pipeline.yaml
...
- Name: DeployApplication
  ...
  Configuration:
    ...
    TemplatePath: Source::application.yaml
    ParameterOverrides: '{"ApplicationParameterA": "foo", "ApplicationParameterB": "bar"}'
    TemplateConfiguration: Source::template_configuration.json

In our case, the template_configuration.json file would look like this:

{
  "Parameters" : {
    "ApplicationParameterA" : "foo",
    "ApplicationParameterB" : "bar"
  }
}

This might be much nicer to handle and maintain depending on your setup.

By the way, you can also use the TemplateConfiguration to protect your resources from being deleted or replaced, by using stack policies.

"Service Discovery" with AWS Elastic Beanstalk and CloudFormation

How to dynamically pass environment variables to Elastic Beanstalk.

Elastic Beanstalk is a great AWS service for managed application hosting. For me personally, it’s the Heroku of AWS: developers can concentrate on developing their application while AWS takes care of all the heavy lifting of scaling, deployment, runtime updates, monitoring, logging, etc.

But running applications usually means not only using the plain application servers the code runs on, but also databases, caches and so on. And AWS offers many services like ElastiCache or RDS for databases, which should usually be preferred in order to keep the maintenance overhead low.

So, how do you connect Elastic Beanstalk and other AWS services? For example, your application needs to know the database endpoint of an RDS database in order to use it.

“Well, create the RDS via the AWS console, copy the endpoint and pass it as an environment variable to Elastic Beanstalk”, some might say.

Others might say: please don’t hardcode data like endpoint host names; use a service discovery framework or DNS to look up the name.

Yes, manually clicking services together in the AWS console and hardcoding configuration is usually a bad thing(tm), because it violates “Infrastructure as Code”: manual processes are error-prone, and you lose documentation through codification, traceability and reproducibility of the setup.

But using DNS or any other service discovery for a relatively simple setup? Looks like an oversized solution to me, especially if the main driver for Elastic Beanstalk was the reduction of maintenance burden and complexity.

The solution: CloudFormation

Luckily, there is a simple solution to that problem: CloudFormation. With CloudFormation, we can describe our Elastic Beanstalk application and the other AWS resources it consumes in one template. We can also inject e.g. the endpoints of those created AWS resources into the Elastic Beanstalk environment.

Let’s look at a sample CloudFormation template - step by step (I assume you are familiar with CloudFormation and Elastic Beanstalk itself).

First, let’s describe an Elastic Beanstalk application with one environment:

...
Resources:
  Application:
    Type: AWS::ElasticBeanstalk::Application
    Properties:
      Description: !Ref ApplicationDescription
  ApplicationEnv:
    Type: AWS::ElasticBeanstalk::Environment
    Properties:
      ApplicationName: !Ref Application
      SolutionStackName: 64bit Amazon Linux 2016.09 v2.5.2 running Docker 1.12.6

Ok, nothing special so far, let’s add an RDS database:

DB:
  Type: AWS::RDS::DBInstance
  Properties:
    ...

CloudFormation allows us to get the endpoint of the database with the GetAtt function. To get the endpoint of the DB resource, the following code can be used:

!GetAtt DB.Endpoint.Address

And CloudFormation can also pass environment variables to Elastic Beanstalk environments, so let’s combine those two capabilities:

ApplicationEnv:
  Type: AWS::ElasticBeanstalk::Environment
  Properties:
    ApplicationName: !Ref Application
    ...
    OptionSettings:
      - Namespace: aws:elasticbeanstalk:application:environment
        OptionName: DATABASE_HOST
        Value: !GetAtt DB.Endpoint.Address

Et voilà, the database endpoint hostname is now passed as an environment variable (DATABASE_HOST) to the Elastic Beanstalk environment.
You can add as many environment variables as you like. They are even updated if you change their value (CloudFormation would trigger an Elastic Beanstalk environment update in this case).

CodePipeline and CloudFormation with a stack policy to prevent REPLACEMENTs of resources

Some operations in CloudFormation trigger a REPLACEMENT of resources which can have unintended and catastrophic consequences, e.g. an RDS instance being replaced (which means that the current database will be deleted by CloudFormation after a new one has been created).

While CloudFormation natively supports deletion policies, which prevent the deletion of resources, there is no simple way to do this for REPLACEMENTs as of this writing.
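
For comparison, a deletion policy is just an attribute on the resource - a minimal sketch, with a placeholder resource:

# Sketch: DeletionPolicy protects against deletion, but not against replacement.
DB:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Retain
  Properties:
    ...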

When using CodePipeline in combination with CloudFormation to deploy infrastructure changes in an automated Continuous Delivery manner, one should keep protection against accidental deletions even more in mind. This blog post shows how to use CloudFormation stack policies to protect critical resources from being replaced.

Let’s start with the CodePipeline (expressed as CloudFormation) piece which deploys a database stack called db (I assume you are familiar with CloudFormation and CodePipeline):

Pipeline:
  Type: AWS::CodePipeline::Pipeline
  Properties:
    ...
    Stages:
      - Name: Source
        ...
      - Name: DB
        Actions:
          - Name: DeployDB
            ActionTypeId:
              Category: Deploy
              Owner: AWS
              Provider: CloudFormation
              Version: 1
            Configuration:
              ActionMode: CREATE_UPDATE
              RoleArn: !GetAtt CloudFormationRole.Arn
              StackName: db
              TemplatePath: Source::db.yaml
              TemplateConfiguration: Source::db_stack_update_policy.json
            InputArtifacts:
              - Name: Source
            RunOrder: 1

The important part is the TemplateConfiguration parameter, which tells CloudFormation to look for a configuration at this particular path in the Source artifact - in this case, db_stack_update_policy.json.

db_stack_update_policy.json looks like this:

{
  "StackPolicy" : {
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : "Update:*",
        "Principal": "*",
        "Resource" : "*"
      },
      {
        "Effect" : "Deny",
        "Action" : "Update:Replace",
        "Principal": "*",
        "Resource" : "LogicalResourceId/DB"
      }
    ]
  }
}

While the first statement allows all updates to all resources in the db stack, the second will deny operations which would result in a REPLACEMENT of the resource with the logical id DB in this stack.

A CloudFormation stack update of db would fail with an error message like Action denied by stack policy: Statement [#1] does not allow [Update:Replace] for resource [LogicalResourceId/DB].

Idempotent CloudFormation stack creation/update one-liner with Ansible

When developing CloudFormation templates, I regularly missed an idempotent one-liner command which does something like “create or update stack N with these parameters” and thus provides a fast feedback loop.

So here it is with Ansible (and virtualenv for convenience):

virtualenv venv
source venv/bin/activate
pip install ansible boto3
ansible localhost -m cloudformation -a "stack_name=stack_name template=path/to/template region=eu-west-1 template_parameters='template_param1=bar,template_param2=baz'"

It will create a new or update an existing CloudFormation stack and wait until the operation has finished. It won’t complain if there are no updates to be performed.
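
If you prefer a playbook over an ad-hoc call, the same invocation could look roughly like this (stack name, template path, and parameters are the same placeholders as above):

# Sketch: the ad-hoc call from above expressed as a playbook task.
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create or update the CloudFormation stack
      cloudformation:
        stack_name: stack_name
        state: present
        region: eu-west-1
        template: path/to/template
        template_parameters:
          template_param1: bar
          template_param2: baz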

PS: Michael Wittig has released a CloudFormation CLI wrapper (NPM module) for this problem, too!

Continuous Infrastructure Delivery Pipeline with AWS CodePipeline, CodeBuild and Terraform

Overview

This article explores how to build low-maintenance Continuous Delivery pipelines for Terraform using the AWS building blocks CloudFormation, CodePipeline and CodeBuild.

CloudFormation

CloudFormation is the built-in solution for Infrastructure-as-Code (IaC) in AWS. It’s usually a good choice because it offers a low-maintenance and easy-to-start solution. On the other hand, it can have some drawbacks depending on the use case or the usage level. Here are some points which pop up regularly:

  • AWS-only: CloudFormation has no native support for third-party services. It actually supports custom resources, but those are usually awkward to write and maintain. I would only use them as a last resort.
  • Not all AWS services/features supported: The usual AWS feature release process is that a component team (e.g. EC2) releases a new feature, but the CloudFormation part is missing (the CloudFormation team at AWS is apparently a separate team with its own roadmap). And since CloudFormation isn’t open source, we cannot add the missing functionality by ourselves.
  • No imports of existing resources: AWS resources created outside of CloudFormation cannot be “imported” into a stack. This would be helpful, for example, when resources have been set up manually earlier (maybe because CloudFormation did not support them yet).

Terraform to the rescue!

Terraform is an IaC tool from HashiCorp, similar to CloudFormation, but with a broader usage range and greater flexibility.

Terraform has several advantages over CloudFormation, here are some of them:

  • Open source: Terraform is open source, so you can patch it and send changes upstream to make it better. This is great because anyone can, for example, add new services or features, or fix bugs. It’s not uncommon that Terraform is even faster than CloudFormation at implementing new AWS features.
  • Supports a broad range of services, not only AWS: This enables automating bigger ecosystems spanning e.g. multiple clouds or providers. In CloudFormation one would have to fall back to awkward custom resources. A particular use case is provisioning databases and users inside a MySQL server.
  • Data sources: While CloudFormation has only “imports” and some intrinsic functions to look up values (e.g. from existing resources), Terraform provides a wide range of data sources (just have a look at this impressive list).
  • Imports: Terraform can import existing resources (if supported by the resource type)! As mentioned, this comes in handy when working with a brownfield infrastructure, e.g. manually created resources (see the sketch right after this list).
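
To give an idea of how such an import looks in practice, here is a minimal sketch; the resource address and the database identifier are made up and would have to match your Terraform configuration and your actual RDS instance:

# Bring a manually created RDS instance under Terraform management
# (resource address and DB identifier are hypothetical examples)
terraform import aws_db_instance.legacy legacy-mysql-prod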

(Some) Downsides of Terraform

Terraform is not a managed service, so the maintenance burden is on the user’s side. That means we as users have to install, upgrade, maintain and debug it ourselves (instead of focusing on building our own products).

Another important point is that Terraform uses “state files” to maintain the state of the infrastructure it created. The state files are the holy grail of Terraform, and messing around with them can get you into serious trouble, e.g. bringing your infrastructure into an undefined state. The user has to come up with a solution for keeping those state files in a central, synchronized location (luckily Terraform provides remote state handling; I will get back to this in a second). CloudFormation also maintains the state of the resources it created, but AWS takes care of state storage!

Last but not least, Terraform currently does not take care of locking, so two concurrent Terraform runs could lead to unintended consequences (this will change soon).

Putting it all together

So how can we leverage the described advantages of Terraform while still minimizing its operational overhead and costs?

Serverless delivery pipelines

First of all, we should use a Continuous Delivery Pipeline: Every change in the source code triggers a run of the pipeline consisting of several steps, e.g. running tests and finally applying/deploying the changes. AWS offers a service called CodePipeline to create and run these pipelines. It’s a fully managed service, no servers or containers to manage (a.k.a “serverless”).

Executing Terraform

Remember, we want to create a safe environment to execute Terraform, which is consistent and which can be audited (so NOT your workstation!!).

To execute Terraform, we are going to use AWS CodeBuild, which can be called as an action within a CodePipeline. The pipeline inherently takes care of Terraform state file locking, as it does not allow a single action to run multiple times concurrently. Like CodePipeline, CodeBuild itself is fully managed and follows a pay-by-use model (you pay for each minute of build resources consumed).

CodeBuild is instructed by a YAML configuration, similar to e.g. TravisCI (I explored some more details in an earlier post). Here is how a Terraform execution could look:

version: 0.1
phases:
  install:
    commands:
      - yum -y install jq
      - curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI | jq 'to_entries | [ .[] | select(.key | (contains("Expiration") or contains("RoleArn")) | not) ] | map(if .key == "AccessKeyId" then . + {"key":"AWS_ACCESS_KEY_ID"} else . end) | map(if .key == "SecretAccessKey" then . + {"key":"AWS_SECRET_ACCESS_KEY"} else . end) | map(if .key == "Token" then . + {"key":"AWS_SESSION_TOKEN"} else . end) | map("export \(.key)=\(.value)") | .[]' -r > /tmp/aws_cred_export.txt # work around https://github.com/hashicorp/terraform/issues/8746
      - cd /tmp && curl -o terraform.zip https://releases.hashicorp.com/terraform/${TerraformVersion}/terraform_${TerraformVersion}_linux_amd64.zip && echo "${TerraformSha256} terraform.zip" | sha256sum -c --quiet && unzip terraform.zip && mv terraform /usr/bin
  build:
    commands:
      - source /tmp/aws_cred_export.txt && terraform remote config -backend=s3 -backend-config="bucket=${TerraformStateBucket}" -backend-config="key=terraform.tfstate"
      - source /tmp/aws_cred_export.txt && terraform apply

First, in the install phase, the tool jq is installed; it is used for a little workaround I had to write to get the AWS credentials from the metadata service, as Terraform does not support this yet. After retrieving the AWS credentials for later use, Terraform is downloaded, checksum-verified and installed (they have no Linux repositories).

In the build phase, first the Terraform state file location is set up. As mentioned earlier, it’s possible to use S3 buckets as a state file location, so we are going to tell Terraform to store it there.

You may have noticed the source /tmp/aws_cred_export.txt command. This simply takes care of setting the AWS credentials environment variables before executing Terraform. It’s necessary because CodeBuild does not retain environment variables set in previous commands.

Last but not least, terraform apply is called, which takes all .tf files and converges the infrastructure against this description.
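
If you want to review changes before they are converged, the build phase could be split into a plan and an apply step. A minimal sketch of such a variation, reusing the credential workaround from above:

# Write a plan file first, then apply exactly that plan
source /tmp/aws_cred_export.txt && terraform plan -out=/tmp/terraform.plan
source /tmp/aws_cred_export.txt && terraform apply /tmp/terraform.plan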

Pipeline as Code

The delivery pipeline used as an example in this article is available as an AWS CloudFormation template, which means that it is codified and reproducible. Yes, that also means that CloudFormation is used to generate a delivery pipeline which will, in turn, call Terraform. And that we did not have to touch any servers, VMs or containers.

You can try out the CloudFormation one-button template here:

Launch Stack

You need a GitHub repository containing one or more .tf files, which will in turn get executed by the pipeline and Terraform.
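
If you just want to see the pipeline in action, a single minimal .tf file in the repository is enough. A hypothetical example could be created like this (region and bucket name are placeholders; the bucket name has to be globally unique):

# Create a minimal main.tf to test the pipeline with
cat > main.tf <<'EOF'
provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "pipeline_test" {
  bucket = "my-terraform-pipeline-test-bucket"
}
EOF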

Once the CloudFormation stack has been created, the CodePipeline will run initially:

CodePipeline screenshot

The InvokeTerraformAction will call CodeBuild, which looks like this:

CodeBuild log output screenshot

Stronger together

The real power of both Terraform and CloudFormation comes to light when we combine them, as we can use the best of both worlds. This will be the topic of an upcoming blog post.

Summary

This article showed how AWS CodePipeline and CodeBuild can be used to execute Terraform runs in a Continuous Delivery spirit, while still minimizing operational overhead and costs. A CloudFormation template is provided to ease the setup of such a pipeline. It can be used as a starting point for your own Terraform projects.

References

https://blog.gruntwork.io/how-to-manage-terraform-state-28f5697e68fa?gi=9769367dd11

AWS CodeBuild: The missing link for deployment pipelines in AWS

This is a follow-up to my AWSAdvent article Serverless everything: One-button serverless deployment pipeline for a serverless app, which extends the example deployment pipeline with AWS CodeBuild.

Deployment pipelines are very common today, as they are usually part of a continuous delivery/deployment workflow. While it’s possible to use projects like Jenkins or Concourse for those pipelines, I prefer using managed services in order to minimize operations and maintenance, so I can concentrate on generating business value. Luckily, AWS has a service called CodePipeline which makes it easy to create deployment pipelines with several stages and actions, such as downloading the source code from GitHub and executing build steps.

For the build steps, there are several options, like invoking an external Jenkins job, Solano CI and so on. But when you want to stay in AWS land, your options were quite limited until recently. The only pure AWS option for CodePipeline build steps (without adding operational overhead, e.g. managing servers or containers) was invoking Lambda functions, which has several drawbacks, all of which I experienced:

Using Lambda as Build Steps

5 minutes maximum execution time

Lambda functions have a limit of 5 minutes which means that the process gets killed if it exceeds the timeout. Longer tests or builds might get aborted and thus result in a failing deployment pipeline. A possible workaround would be to split the steps into smaller units, but that is not always possible.

Build tool usage

The NodeJS 4.3 runtime in Lambda has the npm command pre-installed, but it needs several hacks to work. For example, the Lambda runtime is a read-only file system except for /tmp, so in order to use e.g. NPM you need to point HOME to /tmp. Another example: you need to find out where the preinstalled NPM version lives (check out my older article on NPM in Lambda).

Artifact handling

CodePipeline works with so-called artifacts: Build steps can have several input and output artifacts each. These are stored in S3 and thus have to be either downloaded (input artifact) or uploaded (output artifact) by a build step. In a Lambda build step, this has to be done manually, which means you have to use the S3 SDK of the runtime for artifact handling.

NodeJS for synchronous code

When you want to use a preinstalled NPM in Lambda, you need to use the NodeJS 4.3 runtime. At least I did not manage to get the NPM version running that is part of the Lambda Python runtime. So I was stuck with programming in NodeJS. And programming synchronous code in NodeJS did not feel like fun to me: I had to learn how promises work for code which would be a few lines of Python or Bash. Looking back, if there were still no CodeBuild service, I would rather invoke a Bash or Python script from within the NodeJS runtime in order to avoid writing async code for synchronous program sequences.

Lambda function deployment

The code for Lambda functions is usually packed as ZIP file and stored in an S3 bucket. The location of the ZIP file is then referenced in the Lambda function. This is how it looks in CloudFormation, the Infrastructure-as-Code service from AWS:

LambdaFunction:
  Type: AWS::Lambda::Function
  Properties:
    Code:
      S3Bucket: !Ref DeploymentLambdaFunctionsBucket
      S3Key: !Ref DeploymentLambdaFunctionsKey

That means there has to be another build and deployment procedure which packs the Lambda function code and uploads it to S3. That is a lot of complexity for a build script which is usually a few lines of shell code, if you ask me.
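
Such a packaging and upload procedure is typically not more than a couple of shell commands; a hedged sketch could look like this (the file names and the bucket are placeholders matching the S3Bucket/S3Key references above):

# Pack the Lambda function code and upload it to S3 so that
# CloudFormation can reference it via S3Bucket/S3Key
zip -r function.zip index.js node_modules/
aws s3 cp function.zip s3://my-deployment-lambda-functions-bucket/functions/function.zip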

By the way, actually there is a workaround: In CloudFormation, it’s possible to specify the code of the Lambda function inline in the template, like this:

LambdaFunctionWithInlineCode:
  Type: AWS::Lambda::Function
  Properties:
    Code:
      ZipFile: |
        exports.handler = function(event, context) {
          ...
        }

While this has the advantage that the pipeline and the build step code are now in one place (the CloudFormation template), it comes at the cost of losing IDE features for the function code, like syntax checking and highlighting. Another point: the inline code is limited to 4096 characters, a limit which can be reached rather fast. The CloudFormation templates also tend to become very long and confusing. In the end, using inline code just felt awkward to me …

No AWS CLI installed in Lambda

Last but not least, there is no AWS CLI installed in the Lambda runtime, which makes typical build-step tasks like uploading directories to S3 really hard, because they have to be done within the programming runtime. What would be a one-liner with the AWS CLI can mean much more overhead and many more lines of code in e.g. NodeJS or Python.
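
To illustrate the difference: publishing a whole directory to S3 is a single command with the AWS CLI, while the same task with the S3 SDK means iterating over files and issuing individual put requests (the bucket name here is a placeholder):

# One-liner with the AWS CLI: recursively upload a directory to S3
aws s3 sync ./frontend s3://my-frontend-bucket/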

At the recent re:Invent conference, AWS announced CodeBuild, which is a build service, very much like a managed version of Jenkins, but fully integrated into the AWS ecosystem. Here are a few highlights:

  • Fully integrated into AWS CodePipeline: CodePipeline is the “Deployment Pipeline” service from AWS and supports CodeBuild as an action in a deployment pipeline. It also means that CodePipeline can check out code from e.g. a GitHub repository first, save it as an output artifact and pass it to CodeBuild, so that the entire artifact handling is managed: no (un)zipping and S3 juggling necessary.
  • Managed build system based on Docker containers: First, you don’t need to take care of any Docker management. Second, you can either use AWS-provided images, which cover a range of operating systems (e.g. Amazon Linux and Ubuntu) and several pre-built environments, e.g. NodeJS, Python or Go (http://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref.html), or you can bring your own container (I did not try that out yet).
  • Fully supported by CloudFormation, the Infrastructure-as-Code service from AWS: You can codify CodeBuild projects so that they are fully automated, and reproducible without any manual and error-prone installation steps. Together with CodePipeline they form a powerful unit to express entire code pipelines as code which further reduces total cost of ownership.
  • YAML-DSL, which describes the build steps (as a list of shell commands), as well as the output artifacts of the build.

Another great point is that the provided images are very similar to the Lambda runtimes (based on Amazon Linux), so they are predestined for tasks like packing and testing Lambda function code (ZIP files).

CodeBuild in action

So, what are the particular advantages of using CodeBuild vs. Lambda in CodePipeline? Have a look at this Pull Request. It replaces the former Lambda-based approach with CodeBuild in the project I set up for my AWS Advent article: Several hundred lines of JavaScript got replaced by some lines of CodeBuild YAML. Here is how a sample build file looks:

version: 0.1
phases:
  install:
    commands:
      - npm install -g serverless
      - cd backend && npm install
  build:
    commands:
      - "cd backend && serverless deploy"
      - "cd backend && aws cloudformation describe-stacks --stack-name $(serverless info | grep service: | cut -d' ' -f2)-$(serverless info | grep stage: | cut -d' ' -f2) --query 'Stacks[0].Outputs[?OutputKey==`ServiceEndpoint`].OutputValue' --output text > ../service_endpoint.txt"
artifacts:
  files:
    - frontend/**/*
    - service_endpoint.txt

This example shows a buildspec.yml with two main sections: phases and artifacts:

  • phases lists the phases of the build. The phase names are predefined, but you can put arbitrarily many shell commands into each of them. The example first installs the serverless NPM package in the install phase, then executes the Serverless framework (serverless deploy) in the build phase. Lastly, it runs a more complex command to save the output of a CloudFormation stack into a file called service_endpoint.txt; that file is later picked up as an output artifact.
  • artifacts lists the directories and files which CodeBuild will return as an output artifact. Used in combination with CodePipeline, this provides a seamless integration into the pipeline, and you can use the artifact as input for another pipeline stage or action. In this example, the frontend folder and the mentioned service_endpoint.txt file are nominated as output artifacts.

The artifacts section can also be omitted, if there are no artifacts at all.

Now that we have learned the basics of the buildspec.yml file, let’s see how this integrates with CloudFormation:

CodeBuild and CloudFormation

CloudFormation provides a type AWS::CodeBuild::Project to describe CodeBuild projects - an example follows:

DeployBackendBuild:
  Type: AWS::CodeBuild::Project
  Properties:
    Artifacts:
      Type: CODEPIPELINE
    Environment:
      ComputeType: BUILD_GENERAL1_SMALL
      Image: aws/codebuild/eb-nodejs-4.4.6-amazonlinux-64:2.1.3
      Type: LINUX_CONTAINER
    Name: !Sub ${AWS::StackName}DeployBackendBuild
    ServiceRole: !Ref DeployBackendBuildRole
    Source:
      Type: CODEPIPELINE
      BuildSpec: |
        version: 0.1
        ...

This example creates a CodeBuild project which integrates into a CodePipeline (Type: CODEPIPELINE) and which uses an AWS-provided image for NodeJS runtimes. The advantage is that e.g. NPM is preinstalled. The Source section again declares that the source code for the build comes from a CodePipeline. BuildSpec specifies an inline build specification (e.g. the one shown above).

You could also specify that CodeBuild should search for a buildspec.yml in the provided source artifacts rather than providing one via the project specification.

CodeBuild and CodePipeline

Last but not least, let’s have a look at how CodePipeline and CodeBuild integrate by using an excerpt from the CloudFormation template which describes the pipeline as code:

Pipeline:
  Type: AWS::CodePipeline::Pipeline
  Properties:
    ...
    Stages:
      - Name: Source
        Actions:
          - Name: Source
            InputArtifacts: []
            ActionTypeId:
              Category: Source
              Owner: ThirdParty
              Version: 1
              Provider: GitHub
            OutputArtifacts:
              - Name: SourceOutput
      - Name: DeployApp
        Actions:
          - Name: DeployBackend
            ActionTypeId:
              Category: Build
              Owner: AWS
              Version: 1
              Provider: CodeBuild
            OutputArtifacts:
              - Name: DeployBackendOutput
            InputArtifacts:
              - Name: SourceOutput
            Configuration:
              ProjectName: !Ref DeployBackendBuild
            RunOrder: 1

This code describes a pipeline with two stages. The first stage checks out the source code from a Git repository; the second stage is the interesting one here: it contains a CodeBuild action which takes SourceOutput as input artifact, ensuring that the commands specified in the build spec of the referenced DeployBackendBuild CodeBuild project can operate on the source. DeployBackendBuild is the sample project we looked at in the previous section.

The Code

The full CloudFormation template describing the pipeline is on GitHub. You can actually test it out by yourself by following the instructions in the original article.

Summary

Deployment pipelines are as valuable as the software itself, as they ensure reliable deployments, experimentation and fast time-to-market. So why shouldn’t we treat them like software, namely as code? With CodeBuild, AWS completed the toolchain of building blocks which are necessary to codify and automate the setup of deployment pipelines for our software:

  • no complexity of setting up and maintaining third-party services
  • no error-prone manual steps
  • no management of own infrastructure like Docker clusters as “build farms”
  • no bloated Lambda functions for build steps

This article showcases a CloudFormation template which should help readers get started with their own CloudFormation/CodePipeline/CodeBuild combo, which provisions within minutes. There are no excuses anymore for manual and/or undocumented software deployments in AWS ;-)