Looking at this map of expansions, we may have a tram line in front of our house in the near future.
Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension.
SRE is what happens when you ask a software engineer to design an operations team.
The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer “zero outages”; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference. An outage is no longer a “bad” thing—it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.
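As a rough illustration of the arithmetic behind an error budget, here is a minimal sketch that assumes a request-based availability SLO. The function names, the 99.9% target, and the request counts are hypothetical and do not come from the source.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Requests that may fail in the window without violating the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Unspent part of the budget; a negative value means it is overspent."""
    return error_budget(slo_target, total_requests) - failed_requests


def can_launch(slo_target: float, total_requests: int,
               failed_requests: int) -> bool:
    """Assumed policy: keep shipping features while budget is left."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # Hypothetical quarter: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(can_launch(0.999, 1_000_000, failed_requests=400))
    print(can_launch(0.999, 1_000_000, failed_requests=1_200))
```

Under this framing, development and SRE look at the same number: while budget remains, launches proceed; once it is spent, the focus shifts to reliability work.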
When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a “playbook” produces roughly a 3x improvement in MTTR as compared to the strategy of “winging it.” The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better.
However, some systems should be instrumented with client-side collection, because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics.
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.
A product’s feature velocity will slow if the SRE team is too busy with manual work and firefighting to roll out new features promptly.
This kind of tension is common within a team, and often reflects an underlying mistrust of the team’s self-discipline: while some team members want to implement a “hack” to allow time for a proper fix, others worry that a hack will be forgotten or that the proper fix will be deprioritized indefinitely. This concern is credible, as it’s easy to build layers of unmaintainable technical debt by patching over problems instead of making real fixes. Managers and technical leaders play a key role in implementing true, long-term fixes by supporting and prioritizing potentially time-consuming long-term fixes even when the initial “pain” of paging subsides.
It’s easy to overlook the fact that once you have encapsulated some task in automation, anyone can execute the task. Therefore, the time savings apply across anyone who would plausibly use the automation. Decoupling operator from operation is very powerful.
The main upshot of this new automation was that we had a lot more free time to spend on improving other parts of the infrastructure. Such improvements had a cascading effect: the more time we saved, the more time we were able to spend on optimizing and automating other tedious work.
“Why don’t we gate the code with a flag instead of deleting it?”
If we release 100 unrelated changes to a system at the same time and performance gets worse, understanding which changes impacted performance, and how they did so, will take considerable effort or additional instrumentation. If the release is performed in smaller batches, we can move faster with more confidence because each code change can be understood in isolation in the larger system.
There are many ways to simplify and speed troubleshooting. Perhaps the most fundamental are: Building observability—with both white-box metrics and structured logs—into each component from the ground up. Designing systems with well-understood and observable interfaces between components.
Some on-call engineers simultaneously experienced what they believed to be a failure of the corporate network and relocated to dedicated secure rooms (panic rooms) with backup access to the production environment.
Google relies upon our own tools. Much of the software stack that we use for troubleshooting and communicating lies behind jobs that were crash-looping. Had this outage lasted any longer, debugging would have been severely hindered.
De facto, the commander holds all positions that they have not delegated.
It is important to define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary. In addition to these objective triggers, any stakeholder may request a postmortem for an event.
Writing a postmortem also involves formal review and publication. In practice, teams share the first postmortem draft internally and solicit a group of senior engineers to assess the draft for completeness. Review criteria might include: Was key incident data collected for posterity? Are the impact assessments complete? Was the root cause sufficiently deep? Is the action plan appropriate and are resulting bug fixes at appropriate priority? Did we share the outcome with relevant stakeholders?
Make sure that writing effective postmortems is a rewarded and celebrated practice, both publicly through the social methods mentioned earlier, and through individual and team performance management.
one of SRE’s guiding principles is that “team size should not scale directly with service growth.”
Performance Data describes how a service scales: for every unit of demand X in cluster Y, how many units of dependency Z are used? This scaling data may be derived in a number of ways depending on the maturity of the service in question. Some services are load tested, while others infer their scaling based upon past performance.
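One way to read "how many units of dependency Z are used per unit of demand X" is as a simple ratio inferred from past performance. The sketch below assumes that reading; the sample data and function names are hypothetical, and a real service might need a non-linear model or load-test data instead.

```python
from typing import List, Tuple


def dependency_ratio(samples: List[Tuple[float, float]]) -> float:
    """Units of dependency Z consumed per unit of demand X in one cluster.

    Each sample is (demand_units, dependency_units) observed over the same
    interval. Assumes roughly linear scaling.
    """
    total_demand = sum(demand for demand, _ in samples)
    total_dependency = sum(dep for _, dep in samples)
    if total_demand == 0:
        raise ValueError("no demand observed; cannot infer a ratio")
    return total_dependency / total_demand


def projected_dependency(ratio: float, forecast_demand: float) -> float:
    """Dependency units needed to serve a forecast level of demand."""
    return ratio * forecast_demand


if __name__ == "__main__":
    # Hypothetical past observations for one cluster.
    history = [(100, 250), (200, 500), (300, 750)]
    ratio = dependency_ratio(history)
    print(ratio, projected_dependency(ratio, 400))
```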
When deploying approximation to help speed development, it’s important to undertake the work in a way that allows the team to make future enhancements and revisit approximation.
By working one-on-one with early users, you can address those fears personally, and demonstrate that rather than owning the toil of performing a tedious task manually, the team instead owns the configurations, processes, and ultimate results of their technical work.
Load test components until they break. As load increases, a component typically handles requests successfully until it reaches a point at which it can’t handle more requests.
If you believe your system has proper protections against being overloaded, consider performing failure tests in a small slice of production to find the point at which the components in your system fail under real traffic.
Its authors point out [Bur06] that providing consensus primitives as a service rather than as libraries that engineers build into their applications frees application maintainers of having to deploy their systems in a way compatible with a highly available consensus service (running the right number of replicas, dealing with group membership, dealing with performance, etc.).
Regardless of the source of the “thundering herd” problem, nothing is harder on cluster infrastructure and the SREs responsible for a cluster’s various services than a buggy 10,000 worker pipeline job.
We don’t make teams “practice” their backups, instead: Teams define service level objectives (SLOs) for data availability in a variety of failure modes. A team practices and demonstrates their ability to meet those SLOs.
Google has also found that the most devastating acute data deletion cases are caused by application developers unfamiliar with existing code but working on deletion-related code, especially batch processing pipelines.
The most important principle in this layer is that backups don’t matter; what matters is recovery.
Was the ability to formulate such an estimate luck? No—our success was the fruit of planning, adherence to best practices, hard work, and cooperation, and we were glad to see our investment in each of these elements pay off as well as it did.
In short, we always knew that adherence to best practices is important, and it was good to see that maxim proven true.
At first, this race condition may occur for a tiny fraction of data. But as the volume of data increases, a larger and larger fraction of the data is at risk for triggering a race condition. Such a scenario is probabilistic—the pipeline works correctly for the vast majority of data and for most of the time. When such race conditions occur in a data deletion pipeline, the wrong data can be deleted nondeterministically.
The Google Search SRE team structures this learning through a document called the “on-call learning checklist.”
When standard operating procedures break down, they’ll need to be able to improvise fully.
Because of the rapid change of production systems, it is important that your team welcome any chance to refamiliarize themselves with a system, including by learning from the newest, rather than oldest, members of the team.
At some point, if you can’t get the attention you need to fix the root cause of the problems causing interrupts, perhaps the component you’re supporting isn’t that important.
Once embedded in a team, the SRE focuses on improving the team’s practices instead of simply helping the team empty the ticket queue. The SRE observes the team’s daily routine and makes recommendations to improve their practices.
A default to ops mode usually happens in response to an overwhelming pressure, real or imagined.
Any serving-critical component for which the existing SREs respond to questions by saying, “We don’t know anything about that; the devs own it.” To give acceptable on-call support for a component, you should at least know the consequences when it breaks and the urgency needed to fix problems.
Usually, the SRE team establishes and maintains a PRR checklist explicitly for the Analysis phase.
For example, SRE might help implement a “dark launch” setup, in which part of the traffic from existing users is sent to the new service in addition to being sent to the live production service. The responses from the new service are “dark” since they are thrown away and not actually shown to users.
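The dark launch idea can be sketched as a request handler that always answers from the live service and also sends the same request to the new one. This is a minimal sketch, not the setup the source describes in detail: `handle_request`, `dark_log`, and the match check are hypothetical, and a real system would usually mirror only part of the traffic and do so asynchronously.

```python
from typing import Callable, List, Tuple


def handle_request(
    request: str,
    live_service: Callable[[str], str],
    new_service: Callable[[str], str],
    dark_log: List[Tuple[str, bool]],
) -> str:
    """Serve from the live service; mirror the request to the new one."""
    response = live_service(request)
    try:
        dark_response = new_service(request)
        # The dark response is never returned to the user; we only note
        # whether it matched the live one (an assumption of this sketch).
        dark_log.append((request, dark_response == response))
    except Exception:
        # A failure in the new service must not affect the user.
        dark_log.append((request, False))
    return response


if __name__ == "__main__":
    log: List[Tuple[str, bool]] = []
    live = lambda r: r.upper()
    candidate = lambda r: r.upper()
    print(handle_request("hello", live, candidate, log), log)
```

The user-visible result depends only on `live_service`, so a crash or wrong answer in the new service shows up in the log rather than in production responses.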
What happened; the effectiveness of the response; what we would do differently next time; what actions will be taken to make sure a particular incident doesn’t happen again.
Private property would become public to a significant extent and the possessions of those around you would, in a sense, become partly yours.
Although at first blush you might assume that the auction would allow the rich to buy up everything of value, reflect for a moment. What do you mean by “the rich”? People who own lots of businesses, land, and so forth. But, if everything were up for auction all the time, no person would own such assets.
George was more concerned about inequality than were the conservative followers of Smith, and he recognized that private property could stand in the way of truly free markets.
That paper was published in 1961. Its title, “Counterspeculation, Auctions, and Competitive Sealed Tenders,”
We were promised economic dynamism in exchange for inequality. We got the inequality, but dynamism is actually declining.
Because of these limitations, moral economies can feel constraining and antiquated when confronted with large-scale market societies. Unable to account for the needs of those far away, they may become hostile to outsiders and intolerant of internal diversity, fearing it will erode group values.
The economic wisdom of left and right did not cut to the core of the tensions in the basic structure of capitalism and democracy. Private property inherently conferred market power, a problem that ballooned along with inequality and that constantly mutated in ways that frustrated efforts by governments to solve it. One-person-one-vote gave majorities the power to tyrannize minorities. Checks, balances, and judicial intervention limited such tyranny, but did so by handing power to elites and special interest groups. In international relations, efforts to enhance cooperation and cross-border economic activity empowered an international capitalist elite that disproportionately benefited from international cooperation and faced nationalist backlash from the working class.
the common ownership self-assessed tax
That is why governments often take the lead, using the power of eminent domain to create new commercial or residential districts. But eminent domain is often unfair and always politically controversial.
The wealthy were rewarded for doing nothing. Poor people who needed land had to pay vast prices to obtain it or else starve. Critics attacked these circumstances as perverse, and portrayed the rich, in fiction and nonfiction alike, as parasites (sometimes literally, as in Bram Stoker’s Dracula).
Walras believed that land should be owned by the state and the rents it generated should be returned to the public as a “social dividend,” either directly or through the provision of public goods.
Socialists agreed on only one point: that traditional private property and the inequality of its ownership posed significant challenges to prosperity, well-being, and political order.
In 1942, the prominent conservative economist Joseph Schumpeter predicted that socialism would ultimately replace capitalism. His view was that most economic activity in capitalist economies took place in corporations and that a corporation is just a bureaucracy in which “management” at the center issues orders to various workers. From this vantage point, it was a small step to an economy in which each industry was dominated by one or two gigantic corporations, with government regulation to ensure that they do not abuse their monopoly power, an outcome not much different from the central planning of socialism.
Most mainstream economists even today continue to assume that bargaining eliminates the monopoly problem.
Most of us think of the liturgy as the words chanted by members of a religious community. But the term originated in ancient Athens where it meant roughly “public works” and referred to the responsibility of the roughly 1,000 wealthiest citizens to fund the operations of the state, particularly the army and navy. How did the Athenians determine which citizens were the wealthiest? According to Demosthenes, any member of the liturgical class could challenge any other citizen he believed was wealthier to antidosis or “exchange.” The person being challenged would have to either assume the liturgical responsibility or exchange all possessions with the challenger. The system gives everyone an incentive to be honest despite the burdens of the liturgy. If you falsely claimed to be poorer than the top 1,000 so as to avoid the liturgical burdens, then you could end up being forced to exchange your possessions with someone who is poorer than you are.
Furthermore, control of everything would be radically decentralized; a COST thus combines extreme decentralization of power with partial socialization of ownership, showing that they are, perhaps surprisingly, two sides of the same coin.
As previously noted, our proposal would redistribute roughly one-third of the return on capital and thus would reduce the income share of the top 1% by 4 percentage points, or roughly half the difference between recent levels and the low points in the 1970s.
One cannot develop an attachment to a car that one uses for a few hours, and no one seems the worse for this. Fetishistic attachment to a privately owned automobile—an extremely expensive durable asset, which even enthusiasts seldom drive for more than an hour or two per day—is, thankfully, becoming a thing of the past.
As the economy grows, the revenues generated by the COST would be redistributed back to citizens, just as employees who own stock in their employers benefit when the employer’s profits increase. From Friedrich Engels to George W. Bush, commentators and politicians have argued that owning a share in the national capital stock, usually through the stock market or a home, could help stabilize politics and enhance support for policies that raise the value of the capital stock, a position supported by some research.
Building on Samuelson’s ideas, economist and political scientist Mancur Olson argued that small groups of well-organized special interests can use expenditures, lobbying, and other forms of political action to persuade the government to act in their interest rather than for the public good. Much of the public ignores complex issues, like bank regulation, while the banks who can profit from government fund lobbying organizations that control the agenda. Many economists are cynical about collective decision-making because it seems so easy to manipulate. But not all of them view it this way. Again, enter our hero
First, a passionate minority can outvote an indifferent majority, solving the problem of the tyranny of the majority. Second, the outcome of the vote should maximize the well-being of the entire group, not the well-being of one subset at the expense of that of another.
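To make the first point concrete: under quadratic voting as it is usually described, casting n votes on an issue costs n² voice credits, so intense preferences can be expressed, but at a steeply rising price. The sketch below assumes that rule; the voter counts and credit amounts are hypothetical.

```python
import math
from typing import List, Tuple


def votes_for_credits(credits: int) -> int:
    """Whole votes purchasable with a credit budget when n votes cost n**2."""
    return math.isqrt(credits)


def tally(ballots: List[Tuple[int, int]]) -> int:
    """Net vote total; each ballot is (side, credits_spent), side = +1 or -1."""
    return sum(side * votes_for_credits(credits) for side, credits in ballots)


if __name__ == "__main__":
    # Hypothetical electorate: 10 mildly opposed voters spend 1 credit each,
    # 3 strongly supportive voters spend 16 credits each.
    indifferent_majority = [(-1, 1)] * 10
    passionate_minority = [(+1, 16)] * 3
    print(tally(indifferent_majority + passionate_minority))
```

With one person, one vote, the same electorate would split 3 to 10 against; with the quadratic rule the minority's 12 votes outweigh the majority's 10, because each minority voter chose to pay for 4 votes.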
Despite centuries of progress, markets for public goods are hopelessly deficient. If we are right about QV, then it should bring markets for public goods in line with markets for private goods, with incalculable benefits for all citizens.
QV would offer citizens the chance to feel their voice had been more fully heard, both helping them win on the issue most important to them and reconciling them to the losses they suffer. These features are much like the social effects of market economies for private goods. Because citizens tend to resent and feel coerced by rationing in planned economies, they experience the abandonment of planning as a blossoming of freedom, as was so clear with the collapse of communism in the 1980s and 1990s. When people have the freedom to choose what to spend their money on, they are afforded a sense of dignity and responsibility for the things they have and choose to forgo. A political culture based on such a market mentality could give people a stronger sense of dignity and responsibility in politics.
Yet such large-scale services at present are either provided by monopolistic corporations or by dysfunctional public authorities. Fear of the failures of these providers often leads us to wastefully retreat from public life behind the walls of our homes, our gated communities, our private servers, and our individual cars.
Wealthy countries, by definition, have a greater relative abundance of capital as compared to labor than do poor countries. It is thus natural that trade and migration should both benefit capitalists in wealthy countries and laborers in poor countries at the expense of laborers in wealthy countries and capitalists in poor countries.
Often it is in the rural and economically depressed regions where few migrants reside that opposition to migration is strongest. Workers in such areas see migration adding to economic vibrancy in other communities, but not in their own. They gain none of the ancillary social and cultural benefits that dynamic city-dwellers gain from migration, of increased variety in food, color in urban life, or exposure to other cultures that can expand career opportunities. Instead, they see the rest of their country moving in directions that distance it from their experience in ways that increase their isolation and consignment to the cultural periphery.
While migration offers enormous advantages to the migrants themselves and their families back home, to employers and owners of capital, and to the high-skilled workers who they complement and live among, migration offers few benefits to and imposes some costs on most workers in wealthy countries, who are already left behind by the forces of trade, automation, and the rising power of concentrated finance.
A political backlash against massive migration is not inevitable. Even in closed societies, migration receives political support as long as its benefits are widely distributed in a visible way.
Many of the sophisticated cultural elites most likely to object to this sort of unequal relationship should contemplate their own relationships to migrants. In our experience, most people living in wealthy cities who consider themselves sympathetic to the plight of migrants know little or nothing of the language, cultures, aspirations, and values of those they claim to sympathize with. They benefit greatly from the cheap services these migrants offer and rarely concern themselves with the poverty in which they live. The solidarity of such cosmopolitan elites is thus skin deep. But it is better than the open hostility many ordinary citizens of wealthy countries feel toward migrants.
Yet economic research suggests that diversified institutional investors have harmed a wide range of industries, raising prices for consumers, reducing investment and innovation, and potentially lowering wages.
A law firm that sued institutional investors, on the other hand, would be bringing a case against capital as a class.
The primary difference between the scenario we describe above and present practice, other than some advances in chat capacities, is that in the world we imagine, Facebook is open and honest about how it uses data and pays for the value it receives with money. The user’s role as a vital cog in the information economy—as data producer and seller—is highlighted.
The inability to earn money in these environments undercuts the possibility of developing skills or careers around digital contributions, as technoserfs know any investment they make will be expropriated by the platforms.
However, they have attracted only a few users with an ideological attachment to the idea. Most users prefer a network that is used by most of their friends and that offers higher quality services.
Unlike traditional unions, they combine labor stoppages and consumer boycotts—because, as noted, data laborers are simultaneously consumers. During a strike, Facebook would lose not only access to data (on the labor side) but access to ad revenues (on the consumer side). It’s as if autoworkers could pressure GM or Ford not only by stopping production but also by refusing to purchase cars. Also unlike traditional unions, which must struggle to maintain solidarity during strikes, the data unions could enforce the “picket line” electronically.
She realized, too, that in many ways her new cause, fighting to get her old life back, had given her more meaning and not just greater wealth than the past she idealized. She started to wonder what else might supply that meaning and whether her whole movement was not ultimately some sort of self-serving charade.
A COST on human capital might turn out to be politically popular because it penalizes the highly resented educated class and lazy people of all types, while rewarding ordinary workers for their labor.
It would be a mistake, however, to think that the current system is not coercive. In our current system, there is a wide gulf between educated elites whose native or acquired talents are highly marketable and those who have been left behind by changes sweeping the economy. The talented enjoy a kind of freedom, as they can select from among a variety of appealing jobs. These jobs allow them to quickly accumulate capital that they can depend on as they age, if they do not like the jobs that are available, or pick and choose among different levels of labor (part-time, enjoyable or rewarding but low-paying jobs in the nonprofit sector, etc.). Those with fewer marketable skills are given a stark choice: undergo harsh labor conditions for low pay, starve, or submit to the many indignities of life on welfare. Yet the waste of social resources when a talented person fails to realize her potential is far greater, and arguably her failure to work should be punished more harshly.
By giving every citizen a share of national wealth, a COST could make voters attend to the consequences of policies for a nation’s wealth and create a more cooperative spirit across class lines.
Moreover, some scholars have argued that by encouraging selfishness, markets undermine the trust that is necessary for markets to function.
Shalizi considers an estimate by Soviet planners that, at the height of Soviet economic power in the 1950s, there were about 12 million commodities tracked in Soviet economic plans. To make matters worse, this figure does not even account for the fact that a ripe banana in Moscow is not the same as a ripe banana in Leningrad, and moving it from one place to the other must also be part of the plan. But even were there “merely” 12 million commodities, the most efficient known algorithms for optimization, running on the most efficient computers available today, would take roughly a thousand years to solve such a problem exactly once. It can even be proven that a modern computer could not achieve even a reasonably “approximate” solution
But if robots can drive cars, they can also make purchase orders, accept deliveries, gauge consumer sentiment, plan economic operations, and coordinate this activity at the level of the economy. At this macro level, the role of artificial intelligence in reshaping social organization has—bizarrely—received little attention.
One, they must build psychological safety to spur learning and avoid preventable failures; two, they must set high standards and inspire and enable people to reach them.
Leaders in a volatile, uncertain, complex, and ambiguous (VUCA) world, who understand that today’s work requires continuous learning to figure out when and how to change course, must consciously reframe how they think, from the default frames that we all bring to work unconsciously to a more productive reframe. Framing the work is not something that leaders do once, and then it’s done. Frequently calling attention to levels of uncertainty or interdependence helps people remember that they must be alert and candid to perform well.
Stripe runs on written long-form documents in a way that I haven’t seen before. So that means somebody can go deep, like all the way down, and then distill it back out to everybody else. So you don’t have to do all of that work yourself. It does require a lot of reading for sure, but the benefit is great clarity of thought on complex topics.
Quick-thinking, quick-acting people do really well here.
One of our operating principles is “really, really care.”
A culture of celebrating shipping, versus celebrating measurable progress and learnings
A nice story of how organizations slowly adopt data and all the struggles a data team has to deal with.
A good counter-point by Camille Fournier about how being a hands-off manager can turn into being an absent manager. Making sure these kinds of meetings are held is the most important leverage you have as senior leadership.
‘Whomsoever is possessed of magisterial strength is courted; while whomsoever has inferior strength pays court to others. It is for this reason that the shrewd ruler strives for might.’
Enter prebuilds: pools of codespaces, fully cloned and bootstrapped, waiting to be connected with a developer who wants to get to work. The engineering investment we’ve made in prebuilds has returned its value many times over: we can now create reliable, preconfigured codespaces, primed and ready for GitHub.com development in 10 seconds.
I have done and seen a bunch of work in the space of local development environments. Looking at what GitHub has done: the problem is very hard, and making the end-user experience this good looks like it has taken an inordinate amount of effort.
We improve at this process by becoming more creative, having more slack, being more equanimous, and pruning more efficiently.