Transcript: Why Corporate America Still Runs on Ancient Software That Breaks

Southwest Airlines had a disastrous holiday season, thanks in part to a software bug that left crews out of place and grounded thousands of flights. But Southwest isn't alone in having software in the headlines lately. The New York Stock Exchange recently had a software error that caused weird pricing on stocks and the FAA had its own computer issue that grounded planes earlier this month. So what's the deal with corporate software? Why do these crashes happen? And why does the user experience typically leave something to be desired? On this episode of the podcast we speak with Patrick McKenzie, an expert on engineering and infrastructure, who writes the Bits About Money newsletter and recently left payments company Stripe after six years. We talked about the challenges of keeping any software system alive after years of upgrades and updates, the distribution of tech talent across industries, and whether non-tech companies can close the gap with Silicon Valley. This transcript has been lightly edited for clarity.

Key insights from the pod:
Blameless postmortems in tech — 6:33
Mainframes and shared ownership of software — 11:00
The lifecycle of engineers — 17:00
Why is upgrading computer systems so difficult? — 22:23
Corporate versus consumer-facing software — 28:43
Why is the IT work never done? — 33:06
Technology in the public sector — 34:57
What is bit rot? — 38:45

---

Joe Weisenthal: (00:10)
Hello and welcome to another episode of the Odd Lots podcast. I'm Joe Weisenthal.

Tracy Alloway: (00:15)
And I'm Tracy Alloway.

Joe: (00:17)
Tracy, I forgot to ask you, and it's kind of embarrassing like this late in January, but how was your New Year's, how was your holidays? Did you have a good Christmas and stuff?

Tracy: (00:27)
Aw, thanks. Thanks Joe. I had an excellent Christmas. I stayed at home for a week with my husband and my dog and we did hardly anything and it was absolutely glorious. How about you?

Joe: (00:39)
It was all right. So the thing was I was down in Texas visiting family, which is nice. But there was that huge cold blast. And it's worse, when the weather gets really cold, it's worse in a place like Texas because none of the buildings are insulated particularly well. None of them have that heat. So when it's really cold, it's actually better to be in a cold place where people are used to it. So it was a little uncomfortable, but the good news is somehow I managed to travel back and forth without having any major like airline disruptions.

Tracy: (01:13)
Right. So this is the key thing that happened right before Christmas, which is we had that very big winter storm, the arctic bomb blast, and it disrupted a ton of flights, first off because of the weather. But then what happened is you had this sort of cascade effect because the weather event was so large, a number of airlines, but one airline in particular, experienced a lot of problems with its software.

And Southwest had to cancel, I think in the end it was something like 16,000 flights. You had millions of passengers affected and you had disruptions that, you know, a weather related disruption that lasted one or two days ended up lasting, I think more than a week because of the impact of the computer glitches, I guess.

Joe: (02:00)
Yeah, yeah, that's right. You know, airline travel, it always sort of cascades and ripples out, right? Because a canceled flight is going to affect other flights and so forth. But it seemed like Southwest experienced something unique, which is that this turned into this major software problem and it's sort of a reminder that, okay, when we use the internet, when we use sort of modern consumer software, it's all very zippy and quick and it has nice interfaces. And then when you use sort of backend corporate software, particularly at large legacy institutions, it's nothing like the consumer internet. It's clunky. We all know what it's like.

Tracy: (02:42)
Right. So this is something that came up quite a lot [when] I used to cover the big banks, and yes. One of the crazy things that I learned relatively early on while I was doing this, was just how much of their IT system was still these old creaky big iron mainframes, some of them still running on COBOL, which is the programming language that, I think it dates back to the 1950s or 1960s. And I remember, you know, you hear like, ‘oh, I can't believe that these big banks, our entire financial system in some respects, is still running on these legacy computer systems.’ But on the other hand, if you look at what a big bank is, it's structurally a series of mergers and acquisitions.

Joe: (03:27)
No, that's a good point. That's really well put.

Tracy: (03:29)
Yeah, there used to be this great flow chart that showed the formation of like a JPMorgan or a Bank of America, and you can just see it's a series of rollups of smaller banks. And you think, every time they acquire a new bank, they have to integrate another system into their own system. And in the end, you kind of end up with this just incredibly complex and kind of patchy IT structure that in some ways very much resembles the amalgamation of all these smaller banks into a larger bank.

One of the charts Tracy referenced, about how the big banks came to be (recursive M&A activity), which is relevant to some of the legacy systems questions, is really quite interesting.

The *short* version: pic.twitter.com/USbTgXR3Yw
— Patrick McKenzie (@patio11) January 26, 2023

Joe: (03:59)
Yeah, that's totally right. You think of banks as when they merge, it's just like, okay, you know, the capital merges together and the assets etc., but like, right. None of them are going to have IT systems that work perfectly together, so they're like glued together with like duct tape and just over time, I think ‘technical debt’ is a term that software engineers [use], so you accumulate all this technical debt. And I'm sure nobody likes it inside the bank at any level, but I've always been curious, what are the economics such that it is just so impossible for these legacy institutions -- banking, airlines, the public sector -- to get with the times, you know?

Tracy: (04:41)
Absolutely. And when it comes to Southwest, so first of all, let me declare a small self-interest in Southwest, which is my dad was a pilot for Southwest for a long time, and I remember when the outages were happening this Christmas, I sent him a text going ‘Well? What do you think about this?’ And he just said, ‘well, all the crew are out of position and they need to get them back.’ So a very straightforward ex-pilot answer about what was going on. Not that helpful for this podcast, but I do think in the case of Southwest, this has been a long-running issue and you have had people talking at various times about the need to upgrade the infrastructure, the technological infrastructure, and yet it hasn't happened. And so the question is why? Is it cheaper just to keep running these old mainframes and assume that you are going to have these outages during major events? Is that cheaper than it would cost to actually upgrade it?

Joe: (05:32)
No, absolutely. Well, let's talk to somebody who knows about software and knows about economics and can walk us through and help us understand the problem. We are going to be speaking with Patrick McKenzie, he's an expert on software and infrastructure. He's a writer of the Bits About Money newsletter, knows a lot about finance, and he just left Stripe after six years. He's still an advisor there, been reading his stuff for a long time. One of these people I sort of trust on almost any topic. Patrick, thank you so much for coming on Odd Lots.

Patrick McKenzie: (06:06)
Thanks very much for having me.

Joe: (06:08)
What’s the deal? Let's just start, straight up, why don't you like, I mean Tracy mentioned all these institutions, they just, they swoosh it all together. They sort of tie it together with duct tape and so forth. You get these unwieldy things. Just give us the high-level view of why is it so difficult, at an abstract level, to modernize legacy software.

Patrick: (06:33)
So I'll start with the disclaimer, which is sort of mandatory in engineering culture. We have this thing that we've come up with over the course of the last two decades or so, called blameless postmortems, where when there is a failure within a company where, you know, planes cannot be up in the air for a week at a time, rather than trying to point the finger at someone and say, ‘it was your decision or your inaction that caused this event,’ we, as engineers, want to look at the objective reality of the system, figure out what went wrong for the benefit of both that organization and for the larger community.

And so, this isn't to grind their nose in it, but just, you know, as an engineering matter, what probably happened -- in an ideal world, and if this had happened at a Google, for example, the engineering teams would push because of the culture of these things to do a very public postmortem of what the decisions were, what the background is, etc., etc. In more traditional industries, I don't think that culture is fully baked yet, as it were, although there very well might be a postmortem by, you know, the FAA and federal regulators because, you know, liveness constraint is a real thing for extremely important economic systems like airlines.

Anyhow, what probably happened, it wouldn't surprise anyone in the bowels of an airline that if you put off maintenance on airplanes for decades at a time, that eventually bad things would happen and no one would countenance that. However, software systems are quite similar where they aren't a ‘build once and then run for the rest of eternity’ sort of thing. There were some decisions made early in their lifetimes, which are no longer accurate for the world we live in. They do suffer from something engineers euphemistically call ‘bit rot,’ where software which worked back in the past will tend to succumb to atrophy over time and not work exactly perfectly for all time afterwards.

And so you need to be doing an ongoing program of maintenance for your software just like you would for your airplanes. That bluntly was not done, and it seems to be credibly reported that the sort of cultural factors at Southwest that caused that to not be done, might have been caused by a sort of like overly accounting/penny pinching-focused management culture, which thought, ‘well, it costs money in the short-term to do maintenance so we can like cram down our engineering costs by doing less of this and relying more on external vendors, etc., etc.’

And then when stuff hits the oscillating plate, the right people have not done the years of work that are required to, you know, get to a posture where you can quickly recover from failures. And so they were left in a position where what we in the industry euphemistically call ‘heroics’ was required by folks in operations trying to, you know, contact thousands and tens of thousands of employees by phone and then pass around their information on probably spreadsheets to figure out where the crews actually were, to be able to tick the boxes that are required due to regulation to allow people to get back and up in the air.

And so that's probably the high level, root cause of what happened, but a reason to conduct these postmortems is it's never one single decision made by one person. It's a result of cultural factors, business decisions made over a course of probably decades in this place. And we want to tease out the various nuances there and make both Southwest and other organizations aware of that, so that they don't suffer critical systemic failures in their own places.

Tracy: (09:58)
So first of all, ‘bit rot’ is a fantastic term that I am going to have to try to work into all my conversations going forward. But secondly, just to step back a bit, can you maybe explain, you know, when we talk about mainframe computer systems, what exactly are we talking about and what is the counterpoint to mainframe? I'm assuming it's more cloud-based applications and things like that, but could you maybe define that basic term very quickly?

And then secondly, my understanding of the Southwest debacle was that they had this in-house software, I think it was called SkySolver, but it was based on an application that GE had been selling, and then Southwest kind of customized it. And I guess my question is how endemic is that type of software where you get something off the shelf, but then you customize it in such a way that it becomes, I guess, special to you and therefore your problem when it goes awry?

Patrick: (11:00)
So let's talk about the mainframe first and then we'll talk about the problem of like who owns the problem, whether it's your problem or some vendor’s problem and the way that balls tend to end in the middle of people and things get dropped.

So mainframes, back in the day, many, many decades ago, computers were approximately the size of a room. This is before the personal computer revolution. And banks were some of the earliest adopters of computers for sort of scaled usage in industry. And it's funny, the earliest users of computers tended to be either the financial industry or the military -- either attempting to move numbers which represented money, or numbers which represented like literally artillery shells flying through the air. And both were very important to society. For better or worse, banks, because they standardized on what was the best available technology at the time, ended up with a lot of mainframes and they have kept those mainframes running for a good portion of 70 years now, in some cases, as you mentioned earlier.

The alternative to mainframes -- cloud is a bit of a buzzword. So there's the personal computer form factor that you're familiar with, where something sits on your desk. There are servers which would typically sit on server racks somewhere. And the difference between servers and ‘the cloud’ is the traditional way to manage servers would be, you would have a data center that would be owned or leased by yourself and you would put hardware that you owned in that data center. And in the cloud case, the data center is owned by Amazon or Google or Microsoft. You have rented access to a machine there, which sits on their balance sheet. And it's probably not one machine. It’s probably an awful lot of machines potentially with some virtualization layer and your engineers can cause systems to scale up or scale down based on how many machines you need on a, you know, minute by minute or second by second basis. That’s sort of the high level version of the sales pitch that cloud vendors will give you.

So that is a quick run through between like four generations of the mainstay of how technology gets done at scale, and on the question of where software gets written. So it is quite common for businesses in the traditional economy to not have an internal software engineering competence. And as a result they'll go to vendors like, in this case, you probably know this better than me, but for example, GE and the vendor will sell ‘package software,’ which might not be fully responsive to the needs of the business. And then some customization happens.

What could happen is the customization might happen within the business if they have some level of software engineers internally. What happens more frequently in a lot of places, particularly in Japan where I live, is that the business will contract with, we call them system integrators here. They may be called consultancies in America, you know, a Deloitte or another large consulting firm like that, Accenture. And say, ‘okay, we have this package system, we have these business requirements. Clearly some software engineers have to be involved. We don't have enough on our staff. Can you like figure out the missing link for us?’

And so there is an extensive contract negotiation, the consultancy goes off and does the customizations and they deliver the work to the organization. And then there are various different ways to do maintenance, but often that will involve the original team standing down for a while. That always sounds like a great idea when it's pitched because it's like, ‘oh, great, I don't have to pay expensive engineers every week to just sit around waiting for something to happen.’ And then when something actually happens, you sort of realize the other end of the option value there, where ‘oh, not so great. I don't have a team of like experts who understand the system, ready to call up at a moment's notice and bring in to debug the problems that we're seeing right now.’

And so -- I have no specific knowledge of how it went in this individual instance -- but a thing that happens during a lot of outages is, you know, a quick look into the history of the system to say, ‘Hey, wait, who actually built this for us? Are they still in business? Can we get on their calendar immediately? Is there an engineering team there that is ready to hop in and work at this? Like, goodness, they're going to charge us eye-bleeding rates to do it, but we're going to have to pay that. That's a, you know, tertiary consideration at this point. The bigger consideration is like, how quickly can we get our, you know, organization, like legal paperwork etc., spun up’ and then how quickly can they get the engineering team spun up to address this system in real time.

And this sort of thing is why the sort of major scaled software companies in the economy -- Google, Microsoft, etc., etc., you all know the names -- largely treat engineering as an internal competence, rather than, you know, the software for the iPhone isn't built by Deloitte, it's built by Apple. And almost everything in the critical path will either be open source or something that there is a team at Apple that owns the sort of totality of the experience.

Joe: (16:08)
So something that I sort of hinted at in the beginning, and we talked about this a little bit, we did an episode on petroleum engineering, but I'm really curious about the distribution of engineering talent. In my mind, I would have to imagine that a talented engineer would be more excited to work for Stripe than Google and more excited to work for Google than Deloitte and more excited to work for Deloitte than Southwest. And I'm curious like A) if that's the case, and whether this is a problem and whether it would be better if there were just more high-tech experts, software experts who would want to work for a Southwest directly or work for a bank directly, whether this sort of distribution of engineering talent contributes to software bottlenecks?

Patrick: (17:00)
So I have nuanced thoughts on engineering talent. On the one hand, that is certainly a thing that exists in the world and talent is not necessarily distributed equally among different industries, individuals, etc. On the other hand, I think that Silicon Valley and the ecosystem, to which I owe a lot, occasionally has an overly enamored view of itself thinking that ‘oh yes, we have the best engineers in the world and all the engineers who made every other system that we rely on in our lives are sort of second rate.’

And clearly that's not true. Like, the phone system works. Airplanes, what's the old Louis CK monologue? Like, it flies through the sky and then teleports beams up to space. None of that happened by accident. And so it's important to not focus on the idea that, you know, engineers working in traditional industry are worse engineers than engineers that work in software companies. The larger problem that they have is, one, they don't drive the bus, they have less ability to control the situations within their organizations and control larger decisions made such that they have influence over decisions on what the maintenance schedule would look like, or who gets to make decisions with respect to whether something ships or not.

And there is a little bit of sort of the life cycle of engineers thing that happens where the original architects of the system did it 40 years ago -- careers are about as long as they are, many of the original architects of the system will sort of be aging into the retirement years at this point. And so there is a question of did the organization put in the work over the years to recruit newer engineers to inculcate them into how the system is made, etc., etc. Or did they just allow all the knowledge to walk out the door? And a thing that comes up a lot is, do they put in enough work on a day-to-day basis to maintain an engineering brand so that they can get new talented engineers to join them in 2023, such that those engineers are -- the word used is often ‘gray beard,’ but, you know, the wizened veteran, 30 years from now so that when something happens in 2053, there are people who have been around the block that know where the skeletons are buried in the system?

Interestingly, traditional industry is getting better over the years at having an engineering brand, sort of moving away from this world where engineering was largely seen as a cost center where the goal is just to cram down the amount of money you spend on it and improve your margins. I think there's a few things that played into that. One of them was, particularly in the internet age, it became obvious that, you know, in finance you talk about the front office and the back office -- engineering used to be in the back office and the front office where the salespeople live is the one that generates all the money for the bank. And increasingly, because experiences that were in the palm of the users’ hands were the thing that were actually generating the sales, those experiences became sort of institutionally important within banks and airlines and other firms.

And then when those experiences became important, it took a while, but gradually the people and teams that built those experiences became more institutionally important than they'd previously been. And so if you look at the large money center banks in the US, they certainly have no small number of technical challenges, but a thing that is largely true in 2023, which was not true in 2010, is that their mobile apps are actually kind of good these days. If you download, you know, not to endorse anybody in particular but like Chase or Capital One, when you play with their mobile app, it's like, ‘oh, this kind of feels like a mobile app made in Silicon Valley.’ And the reason is well, yeah, they hired a lot of people who made the apps from Silicon Valley and those folks brought their skills and sort of level of competence with this and their taste and now exercise it on behalf of old world companies.

One hopes -- knock on wood -- that that seeps into the backend of these systems to where Apple, Google, Microsoft, they have very talented teams on the front end of their systems, but they also have very talented teams on the back end of the systems. And that's bluntly why you don't see core systems at Google going down for a week at a time. That is almost unimaginable. You know, how much of capitalism would break if Google Docs was just down for a week?

Given that there are many parts of the world where there is a liveness constraint, we would, all else being equal, if it is safe to fly airplanes, we would prefer that there would be airplanes flying versus all the airplanes being on the ground -- because airplanes generate value for human society. If that is true, then it must be the case that the backend systems of airlines that control whether airplanes are allowed to fly at a given time, that has to be at least as important as, you know, Google Docs is, which implies that the airlines need to put at least as much work as Google does, roughly, into having true mastery of their own backend systems and the various problems that can happen there.

Tracy: (21:45)
So just on this note, you know, there's another travel disruption that happened recently and we haven't even mentioned it, but that was the FAA experiencing some sort of computer event and grounding, I think all the domestic departures for one morning. And it was recently reported that the proximate cause of that was because there were some software engineers who were trying to upgrade the system and they accidentally deleted a bunch of critical files as they were trying to do this.

Can you talk a little bit more about the technical challenges when it comes to trying to fix some of these legacy systems? Why exactly is it so difficult? You sort of talked about it from an organizational perspective, but from a technical perspective, why does this seem to be such a big challenge?

Patrick: (22:33)
Sure. So one thing is that, that symptom, the underlying cause, that a problem happened during an upgrade, is extremely well understood in the software engineering field. It’s called out in, among other places, Google's book about site reliability engineers, which is essentially the subcategory of engineers that Google relies on to keep the world running.

And it is often the case that the people who are attempting to make an incremental change do not have full context of how the system came to be in the state that it’s currently in. And a change that they thought would have a limited sort of area of impact, ends up having a larger area of impact. The term of art we use in the industry is ‘blast radius.’ You hope to quantify the amount of blast radius of something that goes wrong such that, you know, if I make a mistake, am I going to like bring down our blog or am I going to bring down credit card processing worldwide? And, you know, be much more careful if the blast radius includes worldwide credit card processing. Possibly, like, you should engineer your system such that there's no way to take down worldwide credit card processing, but that's actually harder to do than it sounds.

So anyhow, how does one get to the point where it is difficult to understand like what the implications of the changes you’re making truly are? These are meat and potatoes questions, and the meat and potatoes answers are often things like, was the system adequately documented when it was made? Frequently, the answer is no. And a lot of information about how systems are put together survives as oral lore within the engineering team at various companies, which, you know, that's an uncomfortable bit of information to hold in your head when you start talking about the life cycle of engineers and the fact that the original architects of many of these systems are literally no longer with us either because they have retired or they might be like, beyond our ability to call up out of retirement at this point.

And so, you have to write down what you do. And that concept was not new to governments and bureaucracies as a result of software engineering happening in the last 70 years. It’s sort of fundamental to the operation of large organizations. Software is just how people choose to do work with each other, but it is a lesson that we keep relearning. There is often that issue where because software is how people and organizations choose to work with each other, often software will interface various systems together and problems will happen at the boundaries between systems, either between literal computer systems or between other breakages between organizations. So a thing that you will see frequently is, ‘my software didn't fail, your software didn't fail, we mutually failed together at that point where we are, you know, supposed to transfer information,’ and then both sides end up pointing at their counterparty.

And so part of the discipline of software engineering is one, creating a culture where you don't want to point fingers at the counterparty, and two, creating structures and incentives such that, you know, complex systems that involve multiple different parties with multiple different engineering teams who might not report to the same payroll department will converge on correct outcomes. And there are a variety of ways to do that in industry. Some of them are better than others.

So without naming the particular company, there exists a credit card system, which is extremely… So credit card systems as a baseline are extremely reliable. You probably don't remember the last time that you were unable to use a credit card for a week because literally that does not happen. So, like, A++ for achieving that outcome. What one credit card company does to achieve that is they are willing to make changes to their system precisely twice a year after six months of testing every change that they make.

That's an incredible amount of upfront work to do, like, relatively small amounts of engineering. And so the pace at which that ecosystem evolves is much, much slower than more software-forward companies like Google, Apple, Amazon, etc., etc., where they're shipping thousands of changes to their systems every day. So one of the interesting bits about remixing the skills and techniques of Silicon Valley is attempting to get people who are comfortable with the engineering practices that allow you to ship software thousands of times a day into positions of authority at old line companies such that they can gradually transition from, you know, the point that they're at where they might be able to ship software once or twice a year, maybe quarterly, to the point where they'll be shipping software, like, let's start with biweekly and move up from there.

Joe: (27:23)
So we've been talking a lot about failures or sort of collapses, but the other thing that I sort of associate with large business software is just, the user experience is just not as good. And I think that was, you know, the sort of like the UX, I believe, was part of the story with the infamous Citi error where they transmitted $900 million they shouldn't have to some counterparties. And I think some of the users were confused by the internal software, whether they were actually sending that or not.

I heard a story from someone who worked at the VC arm once of a major bank about the hoops that they have to go through just to share documents with each other, like PowerPoint, because they get flagged often internally. I assume that there's some sort of like regulatory issues, or booking travel. Like, you know, going to like Booking.com is really easy. When Tracy and I book travel for work here, it's not that bad, but the usability of the software internally, it's not as smooth and snappy as consumer travel sites. Why is that? Why is the sort of business software internally just not as easy to use and sort of visually appealing as the consumer internet?

Patrick: (28:43)
So this is getting better over time and we'll talk about that in a moment. But broadly, your observation is entirely accurate. If you were the software czar for the entire world, you might rank applications by their importance to the world and say, ‘okay, if you are an online application on someone's phone that allows someone to share cat photos, that's like important in some sense, but probably not as important as sending a billion dollars outside of a bank.’ And so we would have a lot more talent and time spent on the question of can you send a billion dollars out of a bank versus does a 13-year-old have an awesome experience when sending a cat photo? In actual fact though much, much more time and talent is spent on the cat photo question than is spent on the wiring billions of dollars out of banks question. That was a choice.

It sounds silly to say maybe we should stop choosing stupid things, but the true answer is business software gets better when the, you know, people and organizations that cause that software to be built, choose that the quality of that software is something that is very relevant to their interests. And so one way to come to that realization is to lose a billion dollars and then you know, hopefully the next time your mid-level engineering management says we should spend a little more on maintenance, you will say, ‘ah, yes, I agree with you, we should spend a little more on maintenance versus taking another billion dollar charge at a time not of our choosing.’ Part of it is just the culture of quality, coming back to these things.

Part of it's also through sort of teaching the user inside of organizations that software doesn't have to be terrible because most software in the world exists inside of companies and runs business processes. That is something that is not broadly known, but of all the lines of software in the world, most exist inside of companies. And for a very long time, because most software people interacted with was at their employer and it was generally kind of terrible and they just had an image of like software is generally kind of terrible. And then the iPhone came around, everyone has a powerful computer in their hand for X number of hours a day. You've used applications, you've tapped three times and you know, interesting things happen in the world as a result of you tapping three times and you broadly like the experience. And then you go back to work and say, ‘wait, I've used software that doesn't suck. All the stuff I use at work sucks. Hey IT department, hey senior management, can you please make, like, our expense tracking software not be terrible?’

And that is starting to happen both as a result of internal software producers advocating for change as a result of that user feedback within companies, and also as a result of various startups happening to say -- not to throw a particular expense solution under the bus -- but the thing that most old line economy companies probably use is not a thing that people love using to book their travel. And if you, you know, use TripActions or something that is designed by a modern team with modern sort of UX affordances, it's a much nicer solution for the end user.

And in some companies, end users are starting to have some level of ability to advocate for what software gets adopted, whereas previously that was made by processes that were not user-centric: which team was better at wining and dining the person in charge of the purchasing decision, and not winning on the basis of the product quality. In the last couple years, even some enterprise software is starting to win largely on the basis of product quality versus on sort of the more traditional sales motion. Although goodness knows that the traditional sales motion is still very important to enterprise software companies.

Tracy: (32:13)
So I have a slightly weird question, but I'm thinking a lot about it as we have this conversation, but it feels to me like software engineering and computer programming, it always seems to be in flux. Like, if your job is a software engineer, it feels like there's always something to do, you're always trying to fix a problem or adapt a system. And I guess my question is why? You know, I fully admit my own programming experience is confined to like HTML, which I learned from that website HTML Goodies in like 1998. But back then, you know, you program your website, you design it in HTML, you release it into the wild and you're kind of done. And yet it seems with these large-scale systems that there's always change. Something is always in motion, something is always in flux. Why is that?

Patrick: (33:06)
So let me push back a tiny bit on this here. Like, how many years have we had lawyers available and does anyone ever go up to the lawyers and say ‘come on guys, it's 2023, haven't you figured out all the laws yet?’

Tracy: (33:18)
Haven’t you finished all the law yet?

Patrick: (33:20)
Yeah. So why does the law change on a week to week basis? Well, it doesn't change per se, it's just the world is complicated. The number of commercial relationships between organizations is increasing all the time. We have increasing demands on what those relationships will do. And the job of lawyers is to adapt to that increasingly complex world every week and continue delivering the law that society needs and the outcomes that come as a result of competently executing on the ability of organizations to collaborate internally with their employees and with other organizations. What's my answer for software engineering? Well, software engineers, they are, you know, working this week on an increasingly complex world where software has more leverage than it had even last week where there are increasing demands on the world, etc., etc., etc. Is there going to be a time where the last line of software is written? Probably not. There will never be a last bit of software written. There will never be a last contract written. There will never be a last book written because humans want more things out of the world and we have kind of like infinite capacity for want at the margin.

Joe: (34:21)
That was a compelling answer. Can you talk a little bit about, you know, we've been talking about banks, airlines versus startups, can you tell us how are the challenges for the public sector? And I remember Obama had a thing about like ‘we want to like bring government websites or government tech into the modern age.’ It just seems like whatever problems exist for big companies seem to be even worse or more tricky when you're dealing with the public sector.

Tracy: (34:47)
I think at one point New Jersey was explicitly like begging the internet for COBOL programmers, wasn't it? In like 2020? I think that was a thing that happened.

Patrick: (34:57)
Yeah, so full disclosure here, I did a non-profit organization last year where a few of us in the tech industry banded together to work on the vaccine location information infrastructure for the United States, because the public sector was having a great deal of difficulty creating websites that would track where the vaccine was and route vaccine seekers to it. So I have lots of thoughts here.

So many issues. Again, we did not wake up in 2023 with these issues magically -- it's a result of decisions that we've collectively made as a society for many years. One decision that we've made in the United States in particular is that government pay scales are what they are. If you compare those government pay scales to what private industry pays for technologists, they are sharply out of whack. And so then, you know, if you look at GS whatever, the highest paid public sector employees in the United States make less than Google interns do. Solve for the equilibrium. If you can get hired by Google, you know, it would require you to…

Joe: (35:56)
Distribution of talent really does matter in this realm.

Patrick: (36:00)
Right. And one of the things that the government has been attempting to do over the years is create things where there are groups of people who are, like, officially they're government employees. Unofficially they think they're sort of doing an act of service to the nation in places like the digital services agency, etc., etc., where they already made their money in tech, they're now on the GS whatever, making a fraction of what they previously made but are contributing software expertise to these various problems where what the government needs is some competent software written and that requires having competent software people available, in quantity.

Another issue that governments have is like, what is the true goal you were solving for? Without getting too political about it, in some parts of the government like, you know, an organization might exist as largely a jobs program and IT modernization might sharply decrease the effectiveness of that organization at employing a large number of people to like repeatedly do a process that a machine could do in a faster fashion.

And so sometimes, the powers that be within organizations are like, well, you know, I don't necessarily consider IT modernization one of my top priorities at the moment because that would cause me to need to break faith with a number of people that I employ slash you know, sometimes my own career trajectory as a bureaucrat is, and this is true within private industry as well, you know, there's a bit of empire building involved where you want your number of people that you manage and your budgets to go up every year. And you don't want to say, ‘okay, I've solved my problem so I can deal with 5% as much budget next year. Thank you.’ That is incentive incompatible.

That's not great from the perspective of the parts of society which aren't employed by government, but which nonetheless depend on government for, you know, providing goods and services. And so this is ultimately a thing that we have to resolve through the political system. I’m pushing back a little bit on saying like, ‘Hey, you, you kind of have to be good at what you do.’ And these days that involves making software that is also good at what you do. I have no magic bullet for how to cause that to be, you know, a stunning rallying cry for political parties, but it’s probably something that needs to get said in a lot of places for enough decades until the message sinks in.

Joe: (38:13)
Patrick, this has been an amazing conversation and I feel like A) I already want to have you back and B) almost each one of your answers could be its own full conversation. I have one last question. How does bit rot happen? And I mean, you know, even I, you know, you go away on vacation for two weeks, you come back to your office computer and things are weird and sort of janky. They don't quite work the same. What is that process? Because you would think that just words on a database wouldn’t rot, so what's actually going on there?

Patrick: (38:45)
So the sardonic but true answer that you have to think of at scale is like bits in a computer can literally be flipped by gamma rays coming from outer space that interact with the physical manifestation of your memory in the computer. And that's one cause of this. That's true, that does happen. That isn't the dominant thing that happens. The dominant thing that happens is there exists change in the broader system that must happen on any given basis. Change is a sort of risk, it is not always managed well. This gets back to that a commanding majority of systemic downtime at well-managed software companies is caused by attempts to upgrade the system that go less than optimally, the thing that is amenable to study.

Bit rot happens in some cases because, you know, you had a constellation of software etc. installed on your machine and installed on other machines that your machine connected to, which was working, you might say exactly perfectly. Exactly perfect is unknown in software, but like it was working right now. Something about the constellation changed as a result of a decision made about a machine that is not directly under your control. And that decision must be made at scale in the economy because software can't be allowed to be static to deliver the things that we want from software as a society.

And then that change caused some other part of the system to behave in a less great manner. And then eventually, you know, you see the ripple effects of it in your daily life. That's the dominant way bit rot happens. It is not the bits actually getting corrupted over time, but again, a thing that does happen and we have things in engineering to control against that.

Joe: (40:19)
Well Patrick, you are the perfect guest for this topic. Really appreciate you coming out on Odd Lots.

Patrick: (40:25)
Really appreciate you having me and would be glad to be back sometime.

Joe: (40:28)
Definitely.

Tracy: (40:29)
Thanks so much Patrick. That was great. I learned so many excellent new terms like bit rot, blast radius, heroics, we mutually failed together. That one will come in handy Joe.

Joe: (40:40)
Sure, everything we do wrong is mutual failure. No, I love that. Patrick, thank you so much.

Patrick: (40:46)
Thanks very much for having me.

Joe: (41:00)
Tracy, I think really Patrick was like the perfect guest for that topic. We needed to do this for a while and I'm glad we like did it with Patrick.

Tracy: (41:07)
Yeah, well I also feel like this is something that's going to keep coming up and so we might have more opportunities for Patrick to do interesting postmortems on various tech failures.

Joe: (41:17)
Yeah, absolutely. I mean there were so many interesting things. I really liked some of the questions that you asked about ownership of software and I think that really clicked to me because if you're a software company and software is the main product, and you know, I'm thinking about in the manufacturing analogy, you know, it's like a Taiwan Semiconductor, this sort of like institutional knowledge to build something exists within the firm and it just, you know, gets handed down. When software isn't your main product, like if you're a Southwest, if you are a Citigroup etc., then you can sort of see why that process of internal knowledge that works in manufacturing, you don't get that sort of ongoing feedback, you know, sort of distribution of knowledge in some of these large organizations.

Tracy: (42:09)
Yeah, absolutely. And it feels like, I mean Patrick, I think he used the expression ‘dropping the ball in the middle of both of us,’ but it does seem like that system kind of produces opportunities for, I guess I'm trying to think how to phrase this, no one takes total responsibility for a systems failure like that, right? Because on the one hand someone designed the software, but on the other hand, maybe it was customized by someone else. Maybe you have two different systems talking to each other and both of them mess up in some way or there's some sort of misunderstanding. It just seems like there's such a gray area and maybe this is one of the reasons why it's so difficult to fix because you have all these different things that are sort of operating together.

Joe: (42:58)
Well, even, you know, his last answer about how bit rot happens, right? Like somewhere in all these interconnected computers, someone has to make a change. Because he pointed it out in his answer to your question about why software is never done, why it's never a solved problem. It's like we're always demanding more. So there will never be a time where someone has the luxury of not making a change…

Tracy: (43:24)
The computer is done.

Joe: (43:25)
Yeah, we figured it out, finished. Technology is solved. Well maybe that's what the AI, you know, the singularity will be.

Tracy: (43:31)
ChatGPT? Can we ask ChatGPT to do all of it?

Joe: (43:33)
It could be. But, you know, that seems like it makes a lot of sense. Someone has to make a change because that's just how the world works. And then all these other interconnected systems, maybe they're fine with the change, but something happens and then eventually they have to change too. And so it's just this constant state of flux.

Tracy: (43:52)
Can I tell you my one internalized programming lesson?

Joe: (43:55)
Tell me.

Tracy: (43:56)
So, you know, I mentioned HTML and then when I was in high school, as part of our computer science class, we had to learn JavaScript and we had to create a program. And so I wrote this program – again keep in mind that this was the year 2000 or something like that. I wrote a program, it was like a digital fortune cookie and you could click on it and it would give you a fortune. And then at the end of it, at the end of this module, we had to sign a contract signing over our program to our computer teacher who took ownership.

Joe: (44:26)
That’s crazy.

Tracy: (44:27)
No, it was an extremely valuable lesson, which is all the coding that you're doing will ultimately belong to someone else. And they'll be able to monetize it. That's the downside is they'll monetize it for you and maybe you won't get as much. But the upside I guess is that you don't have to take responsibility for it. Once you write it, it goes out into the world. The computer professor owns it and he can do with it what he will.

Joe: (44:51)
I love that that was actually the lesson. Also, I feel like in another universe, you could have sold that startup for a hundred million dollars in like 2010 to Facebook and if it went like super viral. It seems like one of those things, it was just too early. ‘Oh, how did this person make their fortune? Oh, they made like a digital fortune cookie.’ But also, I loved his answer about the public sector because it's like there you really do have this problem of salary disparities and it is pretty crazy that we sort of treat, like the way we're sort of solving this problem in this country is kind of getting people to volunteer, like going to work for the government in IT and tech, it's like what you do after you're rich and you want to give something back. It’s like, ‘okay, I'm going to like go work for the federal government and try to help them like update their systems,’ which is great, that people want to do that and I love that. But that does not seem a great sustainable solution to having a modern government that can communicate with people and provide services for people the way that they expect.

Tracy: (45:53)
No, absolutely. And it's something that we see again and again in various ways. Shall we leave it there?

Joe: (45:59)
Let's leave it there.

You can follow Patrick McKenzie on Twitter at @patio11.