
Architecting Apps at Netflix Scale, with Arik Devens

Ep 6

Mar 17, 2023 • 56 min
ABOUT THIS EPISODE
Netflix’s Arik Devens shares what he’s learned about architecting mobile apps with many millions of users, building from early adventures at Netscape and Palm up to architectural roles at Fitbit and Netflix. We talk about Netflix’s culture of small, high-trust teams; some of the pathologies of very large engineering teams; how to sell new architectural patterns to a team working on a large existing codebase; and the challenges of A/B testing hundreds of millions of users.
TRANSCRIPT

Allen: Welcome to It Shipped That Way, where we talk to product leaders about the lessons they’ve learned helping build great products and teams. I’m Allen Pike. Joining us today is Arik Devens. Arik has a delightfully weird resume, having written software at Netscape, Palm, Eazel, Fitbit and more. But for the past five years he’s been at Netflix, where he’s been helping architect and build the iOS app that millions use to stream their favourite shows every day. Welcome, Arik.

Arik: Hi.

Allen: Hi.

Arik: We’ve never chatted before so-

Allen: Yes. This is our first conversation. Arik and I have for years had a podcast called Fun Facts and before and after we record that show about facts, we often find ourselves talking about engineering leadership and product development and a bunch of other stuff that doesn’t belong in that show but very much belongs in this show.

Arik: Yeah. I was excited that you started this show because I felt like you have so much to say on the topic, so many valuable insights to offer. And I’ve been enjoying it.

Allen: I’m trying to keep the focus on the guests. My insights, as useful as they may be, can fill the little gaps in between, with the focus on all the various experiences you all have. So that’s the fun part for me.

Arik: Right on. Sounds good.

Allen: So I’m really excited to dig in today. We’ll see how far we get in one episode. Want to talk about some of the challenges of leading and architecting an app with a massive user base, and Netflix’s culture, which is very unique, and how it works in theory and practice maybe. But first, can we do a quick tour of your delightfully weird career path?

Arik: What do you think is delightfully weird about it? Let’s start with that.

Allen: One of the things that I think really particularly sticks out to me, and maybe it’s particularly because you are not much older than me, is that your first companies that you worked at were these very OG dotcom bubble companies. So you worked at Eazel, Netscape, and Palm, which is … What a fascinating three companies to start your career off, especially for somebody who’s in that early 40s age range. I don’t know. How did that happen?

Arik: Well, let me start with this and I want to mention this because I feel like this is under normalized in the wider engineering community. I do not have a CS degree.

Allen: Thanks for sharing that.

Arik: Yeah. I went to a small liberal arts college in Ohio and I started taking CS classes. I took whatever the 100 level class is, intro to C, and by the second semester I was teaching the class. And not because I’m some super genius. I taught-

Allen: Because you’re in a small liberal arts college in Ohio.

Arik: It’s because it was a small liberal arts college and there was one professor and he didn’t feel like doing it. So I taught a class on GTK Perl. Perl in general, kind of hilarious, and GTK Perl … It was actually kind of both. Perl, GTK Perl, so that’s like old GNOME Linux development stuff. Anyway, it quickly became clear that CS was not really going to be a thing, and then while I was there they actually eliminated the major. So I ended up graduating with a degree in music. It was the thing I had the most other credits in and something I’ve been passionate about my whole life. But I actually think there’s quite a bit of overlap, at least on the music theory side, with the kind of thinking that goes into engineering. At any rate, that particular school has a pretty cool program where you go to school year round, but it’s trimesters and every other trimester you’re expected to get some sort of internship, they call it a co-op, somewhere. In whatever field you’re doing or something you’re interested in or whatever. Maybe not working at a gas station but something interesting. And so my first one came and I cold emailed this Linux company in Boston or Cambridge called Helix Code at the time. It was later called Ximian. People might know that better. And was eventually acquired I think by Novell and something. The same company ended up becoming Xamarin, which created Mono, which was acquired by Microsoft. And my boss from that company ended up being the CEO of GitHub for a while.

Allen: This is what I’m talking about. Anytime we start talking about career stuff, like what’s so fascinating? Anytime you start telling a story it’s like it’s all interconnected. We’re still in the preamble to your first job.

Arik: That’s true. I guess you’re right. I mean I guess I lived it. I don’t think of it as … Anyway, so I did an internship there and that internship led to an internship at a company called Eazel out in Mountain View that was kind of an incredible … This is actually pretty amazing. It is definitely the most noteworthy group of talent I’ll probably ever come across. It was a company started by famed original OG Mac software engineer Andy Hertzfeld, and he basically just pulled on all of his connections to get all these people. So Susan Kare was there, Arlo Rose who designed System 8 was there. Darin Adler, who’s back at Apple now as some VP of something, but was I believe the tech lead for System 7, 8, 6. I’m not sure. This guy, John Sullivan, who was also an Apple tech lead for one of those classic OSes. A lot of early BeOS people, a lot of early Sun people. All these incredible people including the entire original Safari team who went-

Allen: Like Don Melton and-

Arik: Don Melton was my boss.

Allen: Ken Kocienda I think was there.

Arik: And Maciej Stachowiak who still runs the Safari team I think. Yeah, Don Melton, I call him Gramps. But yeah, Don, who by the way was younger than I am now probably at that time. So he was there. So then, anyway, that company actually was pretty cool. They were building this file manager for Linux and their business model was proto Dropbox, but it was just one of those things where-

Allen: But this was in 1999, so it was-

Arik: Way too early. Yeah.

Allen: It was easy to raise money but fairly challenging to make a business out of a Linux file manager with Dropbox.

Arik: Totally, totally.

Allen: I mean 2002 was the year of Linux on the desktop, so they hadn’t quite made it.

Arik: I got laid off and Don had come from Netscape and he got me an internship at Netscape. So that’s how I ended up at Netscape. It was pretty cool too because I was on this core team that built what they called XUL, which is the XML UI layer. But the people on that team ended up creating the Camino browser, Firefox and Safari. So it’s just a couple of little browsers that people may have heard of.

Allen: It’s just really interesting, those confluences of people. And there’s a few other stories of a group of really talented, really smart people get pulled together for a handful of time and then-

Arik: And me.

Allen: Don’t necessarily succeed. Yeah. And Arik. But the story, I find them fascinating. It’s the rock super group sort of stories where you realize-

Arik: Yeah. And they all do their own albums. These people used to be together at this browser that you all used. I actually left the tech industry then for a few years, was bartending and doing whatever, and ended up getting an interview at Palm through that classic method of a family friend who worked there who got me the interview. I think most people’s success, if they think about it, is some percentage or ratio of luck, hard work and privilege. And so in my case, I was very privileged to have a family friend who worked at Palm who said, “I can’t get you a job, but I can get you an interview.” Which at that point, having only had internships, was huge. And so that was my first real job and I happened to work there from ‘06 to ‘08 so I was definitely there when the iPhone was announced and released, but working on-

Allen: Boom.

Arik: Working on a competitor product that eventually became webOS and the Palm Pre.

Allen: Yeah. You were up in there when there was the drama of … Actually I was working on Palm OS stuff.

Arik: Yeah. You worked there later, right?

Allen: Yeah, I worked on Palm apps in I guess ‘06 era when it was pre-iPhone announce. And then I worked at Palm for a little while on some webOS stuff after the fact. Were you there when there was the, oh God, we really need to bet the company on something that is not just this old stack evolved from the black and white Palm days?

Arik: Totally. It was super interesting. So when I was hired, they had a plan to make something new. They knew Palm OS 5 was way past its expiration date, but they had that classic problem of it’s really hard to disrupt yourself. So they were still selling all these Treos and Centros. Was that the little one?

Allen: Yeah.

Arik: And all this stuff. So they hired me into this team that was working on new media apps for a device they were going to build. So they had signed a deal with Texas Instruments for this DSP chip and they were going to build a new Linux-based kernel phone using this DSP and a very smart person at Palm had managed to make a Palm 5 Linux emulator. So all of the UI for this was going to be through this Palm 5 layer that was talking through these memory pipes and memory map pages and things to this Linux kernel.

Allen: Okay.

Arik: Yeah, it was a crazy idea. So I was hired to do the UI part of … Well UI and other stuff for these media apps. So I was working on an image viewer, but this is Palm 5 so they gave me an off-the-shelf book about Palm development from 10 years earlier.

Allen: And a C compiler.

Arik: Yeah. And were like, “Hey, you go for it.” And I’m trying to write a JPEG library and write my own widgets. So we’re working on that and we had gotten, I don’t know, 70, 80% done, it’s hard to say. And then the iPhone was announced and very, very quickly they were like, “Okay, well …” Because the idea was that this would be their last Palm 5 phone before the big new thing happens. But of course they’d said that many times. So they very quickly pivoted and they acquired this company whose name I have long since forgot, but it was a competitor to Android. So at the time in the valley there were a bunch of startups doing things like Android. And so they bought this company, they had this Java based system with this CSS-ish UI layer thing.

Allen: Sounds performant.

Arik: Yeah. Well we started working on that and first of all, I hadn’t done any Java, so now I’m learning Java at work trying to write these apps. So I started working on a new media app, same thing, an image viewer or something like that for this new system. And it quickly became apparent that this was just never going to work. The system was super slow, all this Java stuff was a nightmare at the time and-

Allen: Now you have a new bar. Where before you were comparing yourself to other Palm devices and now everyone knows what the iPhone can do and it is going to eat your lunch unless you are really shipping something totally different in six to 18 months.

Arik: So I ended up working on this … You’re so right about the connections. I ended up working on the email app for this device with this guy named Marc Blank who wrote Zork.

Allen: Okay.

Arik: And started Infocom.

Allen: One of the OG games. Yeah.

Arik: And created Syphon Filter if you ever played that one for PS2, PS1. Anyway, so he’s this games developer who had been a Palm enthusiast and this other guy named Matt Kern who’s like an old-school Ruby on Rails developer. And that guy and a bunch of other superstars from the team, they all got together and they came up with this webOS idea of using WebKit and web technologies to basically do Ruby on Rails OS. And so that happened and everything pivoted that way. But I do have a couple of fun anecdotes about that. One was that we hired a bunch of kernel engineers from Apple and OS engineers right after the iPhone shipped. They were just looking for something new to do. And I was talking to one of them like, “Ah, why did you leave Apple? Why’d you come here?” And he’s like, “Oh man, I just need some work-life balance. This is just a much more pleasant place to be. Not so stressful.”

Allen: Your life and death battle is chill compared to what was happening at Apple.

Arik: And I’m just saying I don’t believe in hashtag 90 hours a week and loving it, but Palm did completely fail. But the other thing that was super interesting being there behind the scenes was that you mentioned that the iPhone came out and it set a whole new standard for what was expected. And one of those big things which Steve Jobs discussed at length in the original presentation was this multitouch technology and how they had … “And we’ve patented it. We’ve got this under lock and key.” And the crowd goes wild because Apple historically has had trouble with that. The famous look and feel lawsuit. So then Google had released this prototype Android phone that looked like a Palm Windows phone trash thing. But when they came out with Android, people don’t really remember this. The original Android didn’t have multitouch.

Allen: Yeah. I remember that. It even supported a little toggle. You could navigate it with up down left right.

Arik: But when the Palm Pre came out, it fully supported multitouch.

Allen: Yes.

Arik: Why? Why was that okay? And the answer according to people inside Palm at the time is that Palm had a bunch of patents for basically everything about a smartphone because of the Treo and stuff. They were so early. But more importantly, way more importantly, Palm had the patent for the click wheel on the iPod. And Palm basically said to Apple, “Don’t come at us, we won’t come at you.” So it was a mutually assured destruction thing. So they were allowed to use all this stuff that Apple felt was theirs because they owned a lot of stuff that-

Allen: Man, and Steve Jobs must have hated that so much.

Arik: He must have just steamed under it, right?

Allen: But also that’s so much of big company patent stuff is everyone just needs to get more and more dumb patents so that we have more dumb patents than you have more dumb patents.

Arik: Perpetuates itself. Yeah. That’s-

Allen: So tiny companies get crushed and all the big companies just have all of their lasers pointed at each other, but then mostly are peaceful.

Arik: I think ultimately Palm became a patent sale at some point.

Allen: Yes, it did.

Arik: Long after my time.

Allen: So we’ll link to this: I believe years back you did a Debug episode with Rene Ritchie.

Arik: Yeah, yeah I did. Yeah.

Allen: Where you talked about a lot of these fun exploits and adventures in more detail, but you worked through Eazel, Netscape, Palm and some freelancing and all these generational things. And if you want to add anything else in that path, we can do that. But then you ended up doing some time at Fitbit where you were-

Arik: When the Apple Watch was announced.

Allen: When the Apple Watch was announced and you had another generation of like oh, bam.

Arik: Yeah, it’s bad luck for me to be there.

Allen: But at Fitbit, as I understand it, you were getting into this type of role where you’re on an iOS development team, but you’re taking this kind of architectural view on, okay, we have all these developers and they’re all writing code and you’re helping coordinate that. And also you’re working on an app that’s at a pretty decent scale. I don’t know what the numbers were on the Fitbit app, but a fair number of people are using that app for a fair amount of data exchange.

Arik: Yeah. At that time, I thought that was crazy scale. I mean, I forget what it was now. Maybe we’ll say 15 million, something like that.

Allen: 15 million. Mind blown.

Arik: 15.

Allen: Yeah. Yeah, I know. But then now Netflix has more than 15 million.

Arik: Yeah. So when I was hired at Fitbit, there were like 150, 200 people, something like that. It was-

Allen: At the whole company or in software?

Arik: In the whole company. In the whole company. And I thought, oh, this is huge. Because I’d been working at a 15 person startup. I mean for context, when I worked at Netscape, there were like, I don’t know, 50,000 people or whatever.

Allen: Probably, yeah.

Arik: Palm was I don’t know how many thousands. But this felt big to me. So I was on an iOS team originally of five people and I ended up owning this architecture, data, network infrastructure stuff kind of largely semi accidentally. Up until that point I thought of myself as kind of a UI guy. And it was mainly in building some features I stumbled into the architecture they had and really didn’t like it and was pushed by my then boss to be like, okay. They had just fired the guy who had done that work so they were like, “Well, you don’t like it, go do something better,” kind of vibe. And hopefully I did. But I think what was really interesting about the technical challenge at Fitbit that I think people don’t necessarily appreciate beyond the scale and because of the scale is that Fitbit was an app that had to work fully offline. Because if you’re out in the woods on a camping trip and you want to log your food and you can’t because it says you don’t have a network signal, then you’re just like, well I’m never going to use this app again. This is supposed to be this quantified self concept of I’m always tracking the data so I can do something useful with it later, I guess. But if you can’t do that, then you’re like, I’m out. So we suddenly created a distributed systems problem. Because usually most iPhone apps are dumb readers. The source of truth is clearly on the server. They just show it. There may be some complicated caching, there may be some optimizations they want to do for network latency and those kind of things. But at the end of the day, if you run into trouble, just drop it all. Just pull it again.

Allen: Pro tip for anyone out in the audience there, avoid making a distributed systems problem in your product if you can. If you can make your app be primarily about reading and displaying from the API, then you’re going to have a much easier and calmer life than if you end up creating a synchronization of large volume of data in both directions.

Arik: Yeah. So you’re doing conflict resolution and transaction resolution and all this stuff on an iPhone. Not to mention that particular company had chosen to use Core Data. Actually, I think primarily the reason I got the job in the first place was that when asked about Core Data in my tech screen, I went on apparently a 25 minute unprompted rant about how bad it was. So of course I ended up owning it.
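
To make the conflict resolution problem concrete, here is a minimal sketch in Python. It is not Fitbit's actual approach, which the conversation doesn't detail. It assumes a hypothetical `LogEntry` record carrying an `updated_at` timestamp and settles conflicts with a simple last-writer-wins rule; the names `LogEntry` and `merge` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """A hypothetical food-log record that can be edited on the phone or the server."""
    entry_id: str
    calories: int
    updated_at: float  # seconds since epoch; assumed comparable across devices


def merge(local: dict[str, LogEntry], remote: dict[str, LogEntry]) -> dict[str, LogEntry]:
    """Last-writer-wins merge: for each id, keep the most recently updated copy."""
    merged = dict(remote)
    for entry_id, entry in local.items():
        other = merged.get(entry_id)
        # Keep the local copy if the server has never seen it, or if it is newer.
        if other is None or entry.updated_at > other.updated_at:
            merged[entry_id] = entry
    return merged


# Edits made offline on a camping trip, merged once the network returns.
local = {
    "lunch": LogEntry("lunch", 650, updated_at=200.0),
    "snack": LogEntry("snack", 150, updated_at=210.0),
}
remote = {
    "lunch": LogEntry("lunch", 500, updated_at=100.0),
    "dinner": LogEntry("dinner", 800, updated_at=300.0),
}
result = merge(local, remote)
```

Last-writer-wins is only one possible policy. It silently discards the older edit and depends on reasonably synchronized clocks, which is part of why two-way sync is so much harder than a read-only client that can drop its cache and pull again.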

Allen: And they’re like, “Okay. This person has seen the pain that we’ve seen.”

Arik: Yeah. Let’s put him in charge of this. So you’re writing these rolling migration systems and stuff. But yeah, so the company grew really fast. I was there for three and a half years I think and it went from 200 people to 2000 people. So 10X in three plus years is pretty fast. That’s that hyper growth. And the iOS team in specific went from five people to maybe 45 people.

Allen: So 10X as well.

Arik: Yeah, exactly. Yeah, that’s a good point. Matching the overall growth. And so suddenly I went from being the person writing all the foundational technology to the person tech leading, I guess I was … Actually I was kind of a people manager for a while, but primarily my job was tech leading and principal architecting the team that owned all of those technologies and also trying to engage with all of the other teams. Because when we started out, we had, like most small companies will have, a very flat one-platform team. Everything is sort of device oriented or platform oriented. So you have an iOS team and you have an Android team and maybe you have a web team and whatever else. And then eventually we switched to what I think most companies eventually end up at, which is this hybrid model where you have these vertical feature teams and these horizontal platform teams. And so we had all these feature teams that were very PM aligned. So it’s like, okay, we have a PM for sleep, so there’s going to be a sleep team. We have a PM for food, so we’re going to have a food team. We have a PM for activity tracking, so we’re going to have an activity tracking team. And those teams were mixtures of iOS, Android, web, API, all these different disciplines. But then we still had to have these horizontal platform teams, one for iOS, one for Android. I forget what other ones we had. But to build those core technologies that are not feature related and own all of the stuff.

Allen: And to set conventions and expectations that individual features are going to work together properly at the networking boundaries and data storage and all that kind of stuff.

Arik: Yeah. Otherwise you end up with all these conflicting-

Allen: The build needs to work.

Arik: Yeah, exactly. And you don’t want to end up like … There was that Facebook post recently that-

Allen: Oh God. We can link it.

Arik: We can go off on an hour long discussion of that. But I think I would say, in the spirit of Allen’s don’t make distributed systems apps if you can avoid it, don’t build an app that has six different UI frameworks in it if you can avoid it. It’s just a bad idea.

Allen: Just try not to do that. Yeah. Facebook is the extreme case. I’ve long assumed that they’re the largest iOS team that is out there, but I don’t know for sure that that’s true.

Arik: They’re very big.

Allen: Certainly one of the largest. They’re very, very large. And I’d be interested to get into this with you a bit about the way that you scale and organize development teams as in the way that you describe where it’s like at a certain scale, you end up typically … Almost always, teams do end up breaking up. And so there’s the feature teams and then there’s the platform team that supports all the feature teams and tries to manage some sort of consistency across them. But then once you’ve made that change, then there’s less friction to add in more feature teams. And so you get more and more and more and more feature teams and then you end up at Facebook’s scale where it was … Even actually many years ago, I remember meeting someone at a conference and they’re like, “Oh, I work at Facebook.” And I’m like, “Okay, cool. That’s great. What team do you work on?” And they’re like, “Oh, we work on the spell checking in the text field when you’re on the wall. It’s the spell checker for the wall. But just for that part of that text editor.” And I was like, “Oh, that’s a team. That’s a whole team that does that.”

Arik: That’s not even one person’s job. That’s multiple people’s full-time job.

Allen: That was the team they were on. And so it had a PM and it had designers and blah, blah, blah. And my head reeled a little bit having mostly worked on non-Facebook scale. Because there’s the scale of the team and then the scale of the user base and they’re correlated but definitely don’t need to be in lockstep. When you have twice as many users, you don’t need twice as many developers. I guess before digging into, in the broader sense, is Netflix organized that way? Or at least the Netflix iOS app. Is it organized that way by now in this feature teams and then there’s a platform team?

Arik: Yes, but full transparency disclosure, it was reorganized that way right before I went on paternity leave. So I haven’t actually experienced what it’s going to look like at Netflix. So for most of my time there it’s been organized as horizontal platform teams. They just did this.

Allen: So I bet there’ll be people in the audience that are fairly surprised that the Netflix iOS app that has … Or I don’t know if you can say how many millions. I know Netflix has a quarter of a billion subscribers.

Arik: It’s a lot. It’s a very big number.

Allen: A lot of those probably use the app. So it’s probably closer to a 100 million user app that you were able to get that far. That only now is the team getting organized into this arrangement of having the feature teams and then a platform team. Again, to the degree you’re able to say, what is the size of that iOS team that was operating at least fairly successfully without that organization until recently?

Arik: I actually have no idea if that’s public information or not so I won’t be specific, but I will say it was at least an order of magnitude smaller than Facebook’s. It was actually a surprisingly small team.

Allen: So I guess maybe the most interesting question for me is, was it under 150 or over 150?

Arik: Under.

Allen: It was under, yeah.

Arik: Way under. Way, way, way, way under.

Allen: Yeah. I was pretty sure that was true. A lot of organizational things break at 150 and kind of famously you have to redesign any organization or team that exceeds that size. But when you’re in the low tens of people is when most organizations do tend to make that transition from … A lot of 20%, 30% teams are not set up that way, but most 80% teams I think are set up in the way that they have feature teams and a platform team. And so it sounds like Netflix maybe is making that transition around … Or the Netflix iOS team I should say is making that transition around the same size that a lot of organizations do. But it sounds like clearly that Netflix’s iOS team, and I think this transfers across the culture more broadly, did more with fewer people than a lot of apps of that scale or really even of any scale. To what degree was that an explicit mandate/philosophy and how does that tie into Netflix’s unique cultural priorities and things? Was that an enabler of that? I don’t know. Tell me about that.

Arik: Every company I’ve ever joined has probably at some point used that C word. Used culture as a word. Said, “Ah, the culture here.” And to my experience pre Netflix, that was always kind of a little-

Allen: A poster on the wall.

Arik: Yeah. It was just like these all kind of seem the same and there’s flavors and there’s little differences here and there and whatever, and they’re different sizes and things like that, but it didn’t really mean anything. Netflix really does operate pretty fundamentally differently than any other company I’ve been at. And I think there are other ones like that. I know Apple maybe a little bit, Amazon. There are companies that have very strong opinionated workplaces and you either fit that, that works for you or it doesn’t. So I would say that for me, Netflix is a phenomenal place to work. It’s definitely not for everyone. I think there are other people who are great talented people who just wouldn’t like it and it wouldn’t bring out the best in them. One of the things that historically was true, not true anymore, about Netflix was that they only hired … We only hired? I don’t know. I don’t know when you’re supposed to say they and we. But we only hired people who were already at what you would consider a senior software engineer.

Allen: And famously the only title for an engineer at Netflix was senior software engineer. So you have people who would be like principal staff, staff ultra principal, developer somewhere else. And just senior software engineer. We’re not going to ding around and have titles.

Arik: One of the main reasons for titles at other companies is for pay banding and Netflix does what they call top of personal market for compensation. So they try to pay people the most that they could reasonably be expected to earn at any other job.

Allen: And the philosophy as I understand it, is that Netflix wants people to never consider changing jobs because of more pay. They want Netflix to be in charge of whether or not you want to leave rather than-

Arik: That’s exactly right. Yeah, yeah. They want to take that out of the equation. At any rate, more directly to your question, I think that because of that, because we only hired senior engineers and because the standards for that hiring were very high and we really only hired out of need, we never fell into that just hire anyone good you can find thing. We’ll just have more people. We’ll find something for them to do. We never did that. We were very deliberate on hiring. It was like there’s an area we feel like we’re missing in. There’s something we have we need someone to do. Let’s go find someone who would fit that role well. We were hiring generalists in some sense, but it was also more like what do you bring to the table that no one else on the team has?

Allen: Right. Culture adds.

Arik: How do you level up the team? Yeah, exactly. So because of that, people took on a lot and there was also just a lot of trust, still is, a lot of trust and agency. I would say the core concept of Netflix culture is basically to hire smart people, give them as much context as you can and then let them run with it and trust that approval came in the hiring. We found someone we believe in, let’s let them run. And if they go off in the wrong direction, you course correct. If they continue to go off in the wrong direction, then you part ways. So rather than make everything a process along the way, remove as much process as possible and deal with bad outcomes as unavoidable individual situations as opposed to trying to protect yourself constantly from outlier bad outcomes. The famous example would be one person in the US tried to smuggle something into an airport in their shoe, got caught and now everyone has to remove their shoes forever.

Allen: Right. And that accumulates in large companies and eventually tends to choke large companies. I haven’t read the book yet. The No Rules Rules, which I assume by the title is about the avoidance of rules among other things at Netflix culture.

Arik: I actually haven’t read it either. Sorry, Reed. But yeah, I lived it. I don’t need to read it.

Allen: You lived it. You know. But the concept as I understand, and I’ve seen it happen, is the rules tend to get added more often than they get removed. And so you constrain and constrain and constrain the ability for people to act rapidly and use their judgment, which reduces the error rate and the problem rate but then it also more problematically doesn’t pay for itself in terms of how much it slows down product progress, ability for you to execute.

Arik: There’s this classic article or something or whatever that the original HR director whose name I think was Patty McCord if I remember correctly, that she’s the one who is credited with popularizing this overall culture at Netflix. And one of the things she said was that in a traditional company, as it grows in size, like you said, there’s these inflection points. And as it grows in size, employee freedom tends to decrease.

Allen: For sure.

Arik: And I think that most people feel like that’s inevitable. And to some extent I’m sure it is, but the idea behind Netflix for a long, long time at least was what if as you scale, you increase employee freedom instead of decreasing it. So there was just no mandate of corporate spyware on your computer, no mandate of PR requests for code check-ins. Just not a lot of process. Just really, really light on process. And instead what you do is enable adults to do their best work and treat them like adults.

Allen: But the treat-them-like-adults part is that if you don’t have a rule across the entire company that you have to have PR requests, but then somebody’s going around cowboying, checking in things without PR review-

Arik: Yeah, that’s going to be a problem.

Allen: Is causing problems. That’s that person’s fault. It’s not like, well, that’s our fault for not having a rule around it. It’s like, hey, you’re breaking stuff and you’re supposed to be a grownup and we’re paying you top of personal market.

Arik: That’s right. Yeah. If you break the build every week, there’s going to be a problem.

Allen: And that problem may involve someone else having your role that listens to common sense when they’re going around making big changes.

Arik: That’s exactly right. So for a very, very long time we got away with having one team that was just all together on a platform. Technically we had two teams because there was another team that dealt with streaming tech and authentication and things like that. Really low level. But above that level, everything was one team for my whole time there so far.

Allen: And a lot of that, if not all of it, was attributable to being able to keep the team size low by having the average person on the team very senior, by having a minimal amount of rules and restrictions on process that people have to follow. Obviously people opt into process, they follow processes that make sense given their context. And then also being constrained I assume on scope. The Netflix app is relatively featureful in the scheme of things, but it’s not a kitchen sink with 10,000 gadgets and gizmos everywhere.

Arik: No. I would say Fitbit had more features. Fitbit was a surprisingly large app. People don’t really think about it because most people only used some percentage of the surface area, but food tracking alone was a huge … There are whole food apps and Fitbit had that whole thing built in and the sleep tracking is a whole app for most places and that was built in and workouts are a whole app and that was built in. So there were just huge areas of that app. I mean you could understand … I still think the Facebook example of someone doing a spell checker for one field is insane, but-

Allen: Not just someone doing it, a whole team doing it.

Arik: Well, and that’s the thing. I think at most big companies, what you find, and we just saw this with … Did Twitter cut too many people? Yes. But was Twitter way too big? Also yes. And I think that what you end up with is that in most big companies that I’ve been at, everyone basically knows who are the people who are actually doing anything, and those people tend to just talk to each other. They tend to bypass everybody else and they’ll just go directly from person to person and be like, okay, I’m a person getting something done over here and I need something from this other team. I know the person I can talk to on that team where something will actually happen because if I go to the front door, Lord knows what’s going to happen. It’s going to be six months later backlog story pointed to death because most of the people there are just there doing a job. And there’s not necessarily anything wrong with that. I mean, you can get a lot done with a lot of people if you want. I think there’s just some people who feel constrained by that and want to be in a situation where they don’t have to protect themselves as much from that, where it’s not about politics and where it’s just about people wanting to help each other get things done.

Allen: There’s definitely trade-offs to the organizational structure where it’s like, okay, everybody can just work on whatever they feel is the highest impact thing more or less. But then if you misunderstand what that highest impact thing is or you misunderstand how to get it done, then it seems like, correct me if I’m wrong, that Netflix is tending to push down strategic thinking in the org to lower levels, down to maybe even the individual contributor level where because you have more freedom of how you spend your time and attention and what you’re working on and how you do it, and it’s less like your manager’s manager has helped groom a backlog and then you are supposed to do those things in that exact order, then that means that you have more latitude to be working on the wrong thing and then you can be responsible for that failure as opposed to it being like, well, that’s our fault for not feeding you a well-groomed backlog.

Arik: Yeah. I would say the phrase that comes up a lot is highly aligned, loosely coupled. And I think that is exactly what it says on the tin. It puts a different job requirement on an engineering manager because an engineering manager at Netflix is less about control and more about context. So it’s their job to have more context than you and to be able to say … So when you come to them in your one-on-one and say, “Hey, I’ve identified a need here and I’m going to get together this team to do it,” or, “I talked to a product manager and he wants to do this feature,” or, “I think we need this infrastructure piece built,” they’re going to talk to you and say, “Okay, well have you talked to this person over here? Because I heard them say something that I think sounds similar. You should probably get aligned.” Or, “Do you know about this company-wide initiative?” Or, “Have you heard about this priority?” Or, “How does that further this?” They’re not going to tell you you can’t do it. At the end of the day, they’re not going to. That’s not their job. But they are going to try to get you to think about it and make sure that you’ve done the due diligence. And that’s their job to say, “Hey, I know more than you do about what’s going on outside of your limited sphere. I trust you in that sphere, but let me point you in some directions and let me try to keep you within those bounds that you’re talking about so that you don’t go off.” So it should be hopefully very rare that someone goes off and just completely goes on some adventure hunt.

Allen: Crusade.

Arik: Yeah. And just we all look up and go, “What has this person been doing for five months?” I’m not saying it’s impossible, but it should not be very common. And I think that the other thing too is that if you’re going to organize this way, you have to over-index in your hiring and retention for people who are extremely collaborative minded and people who … Even if they’re cowboy coders, collaborative in an organizational sense.

Allen: What does that mean to you? What would be the idea of a cowboy? I think of a cowboy coder as non-collaborative. A person who’s going off on their own journey and just-

Arik: No, because you can do both. It’s often true that there’s only one person working on something.

Allen: Especially when you have a disciplined small team.

Arik: But that person needs to be communicating with the rest of the team and collaborating with the rest of the team in terms of those touchpoint surfaces. Especially because famously at all of these big companies, it’s all AB testing. So you have this app that represents some … Everyone’s app is some bespoke collection of features that are on or off or whatever. So if you’re existing in an environment where you have shared canvases and a lot of tests happening, you have to be able to collaborate so that you don’t break other people’s tests, so that you exclude the right tests, so that you combine those learnings, so that you incorporate them, so that you land in the right order and things like that. So there’s a lot of collaboration at the organizational level is what I mean. Even if you’re not collaborating on a code level. You’re collaborating on a code level in the sense that you’re probably editing some of the same files, but you know what I mean? It’s not like you and another person are designing and building something together. You’re doing it. But you have to do that in an environment with other people who are also doing that.

Allen: And that structure where individual engineers are taking the majority of a given feature rather than being like, okay, five people are all coming together and building this thing, has a whole bunch of prerequisites in terms of how complicated and how maintainable and how architecturally thoughtful is the code you’re all working in.

Arik: As it turns out, you can get away with it being awful as long as the people are really good.

Allen: Okay, yeah.

Arik: Again, you can cover over most things with talented enough and motivated enough and senior enough people. People who don’t need a lot of handholding and know what they’re doing. Those same people would presumably move even faster in a world where that wasn’t true, but in fact you can get by for a very long time. I will say that the other piece of that collaboration piece and the overall idea of the coupling is that I said you have to over-index on people who are collaborative. You also have to over-index on people who take feedback well and give feedback well.

Allen: Yeah. You and I have talked about that a bunch, and I feel like there’s about 20 follow up things from everything you’ve said over the last five minutes that I could ask. I’d like to talk more about the feedback piece. Maybe have you back at some time. But looking at the time, I even more importantly want to give a chance to touch on something that follows from what we were talking about before, which is this topic of helping drive architectural and platform level positive change in a code base that has been around a long time and has a relatively large number of people on it and has a really large user base. And that’s, as I understand, you’ve focused in on that. Since this experience at Fitbit, you’ve focused in on that in your career. As an individual contributor, there’s these different career paths where some people become the problem solver that goes in and fixes a gnarly issue in this part of the code base, and then later they fix this gnarly issue and some people become the execution expert that builds the main features in this particular area. But you’ve taken this architectural path where you’re helping not exclusively as I understand it, but you’re helping coordinate and scale things that have a wide impact across the whole platform, product, company or whatever. I’d be curious to hear you share a little bit of what have you learned in being a fair number of years into doing that kind of work, and what are your approaches for if you have some feature maybe pulling from your Fitbit experience or Netflix experience or how those contrast. What have you learned in effectively driving change across a large code base when you’re trying to level things up in terms of engineering and how does your approach nowadays contrast to maybe when you were a little greener or maybe a little more naive about, all right, everybody, we’re all going to be good now and then maybe that not everyone follows along?

Arik: In some sense, the easiest way to do it is to be really early on an app. If the app is not that old, or if it’s being redone for some reason, like iOS 7 has just happened, that kind of thing, or it’s a really small team where people want someone to take care of that, that’s the easiest way to do it. Because if there isn’t anything, or there is something and it’s bad, you just step in and say, “I’m going to handle this.” The best is actually if the app is being started from scratch, right?

Allen: Sure.

Arik: You just do it. But if you’re in this environment that it’s already a well-established app, you’ve got people who have been in there a long time, and especially if you’re in an environment where there hasn’t been any sort of cohesion and you have a lot of different approaches in different places, it’s really hard. It ends up coming down to a lot of the times having earned respect through personal relationships, having had a track record of good changes, but also just being … Ultimately you have to be quite a strong advocate for those changes. You really have to become a little bit of a developer evangelist because you’re basically coming to people and saying, okay, there’s going to be some cost. These people, they have completely different motivations and goals than you do. Especially if you’re working in an environment where you’ve got a lot of feature developers and you’re coming in and you’re saying, “Okay, I want to change out the ground underneath you.” Now, there are different levels of this. If the change is something so far below their ground that they don’t know it’s there or care, great. Just do it. So if you’re replacing one image caching solution with another, or you’re changing out how the internals of the network API work, as long as the surface area they touch can remain the same or very similar, they don’t care.

Allen: But I think the interesting, tricky, gnarly things that people either get impaled on or are heroes for is when you can bring across a change that isn’t encapsulated well, that it’s like, okay, now I need you all to stop using this pattern that is making the app slower and slower and slower and do this a bit less convenient thing that maybe you’re not familiar with.

Arik: And you might fail. And I think it comes down a lot to timing and quality and what the value is. I mean, at some companies, you just have the authority. That’s another option. If you’re at a company where they’ve said, this person is the principal and they’re in charge of this and they do what they want and you’re just all going to have to do it, and then they all just have to do it.

Allen: Well, that’s how you get into the architecture astronaut problem though, where somebody’s been promoted so highly that they no longer need to worry about the mere mortals who are actually implementing any of this stuff. And so then architectural mandates come down that everyone resents even if they’re good.

Arik: Even if they’re good because it’s so disconnected. But if you’re not in that situation, and if you’re in a collaborative environment where you don’t have any authority over people, what you have over them is more focus on the area than they do. They’re not paying as much attention to the thing you’re paying attention to. So if they value you as a coworker and they know you’re focused on that area and they believe in you, you can try to do it. But as an example, I tried to move an app to a unidirectional data flow pattern, right, from a bidirectional.

Allen: All the rage now.

Arik: All the rage now from a bidirectional data flow. And I did it in a sort of Reduxy kind of way. It wasn’t pure Redux, it wasn’t even pure Swift, but it was something along those lines. And it failed. No one used it. Because I was ahead of the point of time where anyone was thinking about that. Not because I’m so brilliant. I had been hanging out with web developers, but no one on the iOS side was thinking about that.
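
For readers unfamiliar with the pattern Arik mentions, here is a minimal Redux-style sketch of unidirectional data flow. It is written in Python only for brevity and is not the code from his app. The names `AppState`, `LoadStarted`, `TitlesLoaded`, `reducer`, and `Store` are all hypothetical. The point it illustrates is that views never mutate state directly: they dispatch actions, a pure reducer produces the next state, and subscribers are told about it.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AppState:
    """Hypothetical app state. Frozen, so only the reducer can produce a new one."""
    is_loading: bool = False
    titles: Tuple[str, ...] = ()


@dataclass(frozen=True)
class LoadStarted:
    """Action: a load request has begun."""


@dataclass(frozen=True)
class TitlesLoaded:
    """Action: the load finished with these titles."""
    titles: Tuple[str, ...]


def reducer(state: AppState, action: object) -> AppState:
    """Pure function: (old state, action) -> new state, with no side effects."""
    if isinstance(action, LoadStarted):
        return replace(state, is_loading=True)
    if isinstance(action, TitlesLoaded):
        return replace(state, is_loading=False, titles=action.titles)
    return state  # unknown actions leave the state untouched


class Store:
    """Single source of truth. State changes only through dispatch()."""

    def __init__(self, reduce_fn: Callable[[AppState, object], AppState],
                 initial: AppState) -> None:
        self._reduce_fn = reduce_fn
        self.state = initial
        self._subscribers: List[Callable[[AppState], None]] = []

    def subscribe(self, callback: Callable[[AppState], None]) -> None:
        self._subscribers.append(callback)

    def dispatch(self, action: object) -> None:
        # Data flows one way: action -> reducer -> new state -> subscribers.
        self.state = self._reduce_fn(self.state, action)
        for callback in self._subscribers:
            callback(self.state)


store = Store(reducer, AppState())
seen: List[AppState] = []
store.subscribe(seen.append)  # a "view" that just records each state it is handed
store.dispatch(LoadStarted())
store.dispatch(TitlesLoaded(titles=("Show A", "Show B")))
```

In a bidirectional setup, the recording view above would also be allowed to write back into the model, which is the coupling this pattern removes.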

Allen: And also changing that way of thinking. If you’re like, oh, here’s a simpler way that you can get things done with less code and then you can remove a bunch of stuff, and it just kind of boop-

Arik: No. This is a fundamental brain shift. And I think it wasn’t until SwiftUI came out and then all of a sudden people started saying, “Oh, that’s a good idea.” And you’re like, “Yeah, I know.”

Allen: If we make this thing more like that, then it’ll be more SwiftUI and you’ll-

Arik: It’ll work better with SwiftUI. Everything will be smoother and cleaner.

Allen: Great. Let’s do it.

Arik: Let’s do it. Exactly. And in that particular case, the case I’m talking about where I failed, what I ended up doing was taking out essentially all of the architecture and just using the pattern in one part of the app. And just encapsulated things in that way, but with no formalness to it. And so I got it in and people used it, but it never took over the whole thing. It didn’t happen.

Allen: Yeah. That’s not surprising, unfortunately. A really common pattern I’m seeing all the way back to when I was a developer at Apple and all the way through various projects I’ve worked on is that there’s so much gravity towards people. And this is especially true when you have a mixed seniority team. Little bit less true when you have all senior developers, but still happens a lot. It’s like people have a gravity towards following the conventions and the code that’s already local to where they’re working. And so if it’s like, okay, even if everybody totally is paying attention during your presentation pitch, all new networking should use this API.

Arik: Use this. They don’t do it.

Allen: Then they’re looking at a section of code and they’re adding something, and then they see that it makes a networking call using the old library and it’s not like it’s going to stop working tomorrow.

Arik: Yeah. Yeah, exactly. You either have to go in and do it all yourself, literally remove the old thing-

Allen: And have a PR that is so large that it’s constantly unmergeable and constantly breaking.

Arik: Everyone stop working all day. I’m going to get this thing in. You can do that.

Allen: For the next six weeks while I go and refactor everything.

Arik: Well, yeah. That’s the thing, right? Yeah. It’s a real challenge. I mean, you’re trying to change the engine of a car as it’s driving down the highway, and it’s like the Speed bus, if you drop below a certain amount of speed, it explodes. But that’s what I enjoy about it, right? There are so many interesting challenges that come with building an architecture at scale anyway. And then as you mentioned, you were talking about scale of users, but also scale of development teams and scale of complexity space. What is the company trying to achieve, and how complicated is that? Because if you’re building an app that does something that’s very simple, then your complexity space in the architecture should also be fairly simple. But if you’re building out something with complexity to it, inherent complexity to it, and if you have all three of those things, complexity of problem, scale of development team and scale of user base, it gets really fun. It’s really challenging. I don’t think it’s for everyone, but I look at these UI engineers who are just sweating over really tiny details of what an ease in ease out curve looks like, or one pixel to the left, one pixel to the right. And I’m blessed that I can see all those things, which is cool. I know not everyone can.

Allen: Or cursed.

Arik: Yeah, or cursed. But I can see it, but I don’t have the passion anymore to try to fix it. Instead, I find that same sort of passion in trying to figure out, okay, we’re talking about how do I build this infrastructure architecture better in a way that will be useful enough to these people that they’ll use it. Or how do I set a line somewhere where I can make these changes I need to make underneath them where they’re not going to care. That kind of stuff.

Allen: So one of the things I mentioned before I want to loop back to before we run out of time here is … And then I’ll have to have you back again, because I have all these notes of things I wanted to ask you about and little pieces of conversations we have, but I feel like I’m speed running conversation. Is about these challenges of when you have a large code base. So a lot of folks who are listening probably have worked either in a product or design or engineering level on code bases and products that have had hundreds of thousands of users or maybe millions of users. But once you get into the tens or hundreds of millions of users, there’s certain challenges that come up. And so I’d be curious if you have any stories you could share or maybe patterns that you’ve seen that make that kind of work interesting.

Arik: Again, what creates the most complexity is all this testing. By which I mean AB testing. And also when a transition like Swift happens. So most people who work on these kind of apps, you have this really old code base that’s probably largely in Objective-C, and then Swift shows up and it’s like all the new stuff is being done in Swift, and then you’re dealing with all that complexity. But then on top of that, you really have these AB tests where it’s really hard to reason about what code is even being executed or what of this are we using? What tests am I in? And are we testing the code that is on the off path? It’s more of a quality control issue, but it does make working in that environment pretty complicated. I’m not sure if I’m completely answering the thing you were talking about though.

Allen: Well, I mean, it’s certainly correlated. The more users you have, the more useful your AB tests are. And a lot of people who are working on products that are earlier in their growth curve will start trying to implement a whole bunch of AB testing on something that has tens of thousands of users, and then they just get noisy data.

Arik: Yeah, there’s no signal.

Allen: But when you have millions of users and tens and hundreds of millions of users, then AB tests A, can get way more signal, and B, if you can tweak some outcome by a relatively small amount-

Arik: You can have massive, massive wins. It’s amazing how these tiny, tiny wins are magnified over that scale.
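
A rough back-of-the-envelope calculation shows why the signal only appears at scale. The sketch below assumes a simple two-cell test on a conversion-style metric, uses the textbook standard error for the difference between two proportions, and plugs in made-up numbers (a 10% baseline rate, 5,000 users per cell versus 50 million per cell). None of these figures come from the conversation.

```python
import math


def lift_standard_error(baseline_rate: float, users_per_cell: int) -> float:
    """Standard error of the difference between two cells' rates.

    Assumes both cells are the same size and both sit near the baseline rate,
    so each cell contributes a variance of p * (1 - p) / n.
    """
    variance_per_cell = baseline_rate * (1 - baseline_rate) / users_per_cell
    return math.sqrt(2 * variance_per_cell)


# Hypothetical small product: 10,000 users split evenly across two cells.
small = lift_standard_error(baseline_rate=0.10, users_per_cell=5_000)

# Hypothetical large product: 100 million users split evenly across two cells.
large = lift_standard_error(baseline_rate=0.10, users_per_cell=50_000_000)

print(f"small product noise: +/- {small * 100:.3f} percentage points")
print(f"large product noise: +/- {large * 100:.4f} percentage points")
```

With these assumed inputs, the noise in the small test is about 0.6 percentage points, so a real improvement of 0.1 points is invisible. In the large test the noise is about 0.006 points, so the same improvement stands out clearly.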

Allen: And then they compound across each other. And so it sounds like one of the side effects of that is that you accumulate a lot of AB tests. The question that comes to mind is, do you have tools or maybe just discipline for baking in and then removing the code around all the AB tests that end up just getting permanently turned on?

Arik: You should.

Allen: Okay.

Arik: It’s probably pretty helpful.

Allen: You imagine that that would be helpful.

Arik: I could imagine that that would be pretty helpful. Yeah, it’s really complicated. The complexity kind of grows in every direction. So even something that’s maybe surprisingly complex is just figuring out what even is the list of tests that someone’s in.

Allen: That seems like that would be one of the straightforward things but-

Arik: No, it’s not at all because there are a lot of factors. So first of all, there’s a lot of exclusion issues. So if one test is saying, I want people on this platform who’ve only been a member or user for this long-

Allen: In this cohort.

Arik: In this cohort who’ve only been on this device who have these capabilities. And then you’re saying, okay, and my code doesn’t work at this … So then you’re like, okay, well I’m testing in the same area, but I want completely different things and I can’t have a user in both because then it won’t be useful signal. So you’re saying, okay, well if your test is on, my test can’t be on. Or even just resolving like, okay, I’m a client now. I’m a user. I’m taking out my device, I’m looking at my app and the app needs to know extremely quickly … Because we’re talking about boot up, right?
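
The exclusion problem Arik describes can be sketched as a small allocation routine. This is an illustration under stated assumptions, not how Netflix does it: `User`, `Experiment`, and `allocate` are hypothetical names, eligibility is reduced to platform and tenure, and conflicts are resolved by simple priority order.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class User:
    platform: str
    tenure_days: int


@dataclass(frozen=True)
class Experiment:
    name: str
    is_eligible: Callable[[User], bool]       # cohort rule for this test
    excludes: FrozenSet[str] = frozenset()    # tests that must not overlap with it


def allocate(user: User, experiments: List[Experiment]) -> List[str]:
    """Walk experiments in priority order and admit each one only if the user
    is eligible and it does not conflict, in either direction, with a test
    the user has already been admitted to."""
    admitted: List[Experiment] = []
    for experiment in experiments:
        if not experiment.is_eligible(user):
            continue
        conflicts = any(
            experiment.name in other.excludes or other.name in experiment.excludes
            for other in admitted
        )
        if not conflicts:
            admitted.append(experiment)
    return [experiment.name for experiment in admitted]


experiments = [
    # Only newer members on iOS.
    Experiment("new_player_ui",
               lambda u: u.platform == "ios" and u.tenure_days < 30),
    # Touches the same screen, so it cannot share users with new_player_ui.
    Experiment("player_controls_v2",
               lambda u: u.platform == "ios",
               excludes=frozenset({"new_player_ui"})),
    Experiment("tablet_layout", lambda u: u.platform == "tablet"),
]

new_member = allocate(User(platform="ios", tenure_days=10), experiments)
long_time_member = allocate(User(platform="ios", tenure_days=400), experiments)
```

Here the new member lands only in `new_player_ui` because the second test excludes it, while the long-time member is ineligible for the first test and so lands in `player_controls_v2`.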

Allen: Yeah. It’s like the critical moment. You want ideally a hundred milliseconds and the thing is running, but now it’s going to the server and is doing all these complicated evaluations.

Arik: Yeah. You’re trying to figure out … Okay, yeah. And then the server is trying to take all this data of like, okay, what’s the OS? What’s the device? What’s the capabilities? Do they have this turned on? Do they have that turned on? Whatever the things are. And resolve a list and taking context from the rest of your backend of like, okay, are we in some kind of special state because we’re running some promotion in some other part of the business, or are we trying out some new policy? So you’re taking just a tremendous amount of sources and collapsing them into this flat list of like, okay … Because ultimately all the platform code wants to know, all the app … Sorry, all the feature code wants to know is yes or no. Do I enable this code block or do I enable this code block? And then of course, the tests have individual cells, so it’s like, okay, am I in this part of this test? And you’re trying to resolve all that at boot time as quickly as possible. And so that’s its own whole set of complexity that probably involves three or four teams.
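
The last step Arik describes, collapsing everything into a flat yes-or-no list for feature code, might look roughly like the sketch below. It assumes cell assignment is done by hashing the user and test name, which is a common technique but is not something the conversation confirms. `TESTS`, `assign_cell`, and `resolve_flags` are hypothetical names, and cell 1 is treated as the control cell.

```python
import hashlib
from typing import Dict, List

# Hypothetical test definitions: test name -> cell number -> features that cell enables.
TESTS: Dict[str, Dict[int, List[str]]] = {
    "autoplay_previews": {1: [], 2: ["autoplay_previews"]},
    "download_button": {
        1: [],
        2: ["download_button_top"],
        3: ["download_button_bottom"],
    },
}


def assign_cell(user_id: str, test_name: str, cell_count: int) -> int:
    """Hash (test, user) to a stable cell number in 1..cell_count, so the same
    user sees the same cell on every launch without storing anything."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count + 1


def resolve_flags(user_id: str, tests: Dict[str, Dict[int, List[str]]]) -> Dict[str, bool]:
    """Collapse every test into the flat feature -> bool map that feature code reads."""
    flags: Dict[str, bool] = {}
    for test_name, cells in tests.items():
        # Every feature any cell could enable defaults to off.
        for cell_features in cells.values():
            for feature in cell_features:
                flags.setdefault(feature, False)
        # Then turn on only what this user's cell enables.
        cell = assign_cell(user_id, test_name, len(cells))
        for feature in cells[cell]:
            flags[feature] = True
    return flags


flags = resolve_flags("user-123", TESTS)
if flags["autoplay_previews"]:
    pass  # feature code only ever asks yes or no
```

A real resolver would also apply the eligibility and exclusion rules and the backend context Arik lists before assigning cells. This sketch covers only the final flattening.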

Allen: And God help you if the thing you’re trying to test is some new thing that might improve boot time.

Arik: Totally. No, no, no, absolutely. What do you do? It’s like a whole other set of problems. What do you do if you want to test them? Or if you have an app that has signed in and signed out, what do you do if you want to test something that happens when they’re signed out and you don’t know who they are?

Allen: And then they sign in and they’ve got to be put in the test, but when they signed in, it’s like, oh, well this test you’ve put them in is incompatible with this other test. You’re like, well, they’re in it now.

Arik: How do you test something like switching from one data framework to another?

Allen: Oh man. Oh no. Okay.

Arik: One networking stack to another. How do you verify that you’re not crashing? How do you verify? How do you find out if you’re crashing before your crash software would be working? Because the thing about having super large code bases or super large user bases, I should say, is that any possible thing that could happen … This will be a cross reference to the most recent episode of Fun Fact that has been recorded at this time. The possible number of combinations is probably not a 68-digit number, but very, very, very-

Allen: Arbitrarily large.

Arik: Very large. So with the number of launches you have, with the number of users you have, with the number of configurations you have, with the number of user behavior changes. Someone who uses the app slower than someone else. Right? Because developers famously, our brains go the happy path almost no matter how much we try.

Allen: Yes. Oh, man, when I was QA, it would drive me nuts how when I would be like, “Hey, developer, this happens. You can get it to crash when you do X.” And they’re like, “Oh, it works for me.” And then I see them use the app-

Arik: And they don’t do X.

Allen: In the most slow, deliberate, tap, wait for the entire thing to load. Once the entire thing loads, they deliberately tap at a very perfect … Where it’s like as a QA, I’m not going like tap, touch that. Anytime it’s loading, I’m randomly tapping other parts of the screen just to see what happens.

Arik: You’re a chaos monkey.

Allen: Yeah, exactly.

Arik: Which is software from Netflix.

Allen: Which is what the real users do, right? They’re chaos monkeys. They’ll force quit the app at any time.

Arik: At any moment.

Allen: Whenever the most inconvenient time where the app assumes it won’t be force quit, they’ll happen to force quit it then.

Arik: One of my favorite parts of my job/least favorite parts of my job is that a bug doesn’t get to me almost ever if it’s reproducible.

Allen: Yeah.

Arik: If it’s reproducible, someone else has already fixed it. It gets to me when they’re like, “We don’t know but it’s happening often enough that it matters.”

Allen: That it matters. But we’re not sure how the code even gets into the state.

Arik: We don’t have any idea how this could happen. We don’t know what happens. The stack trace is all Apple’s. Whatever it is. The stack trace doesn’t even seem like it has anything to do with our code almost. Go figure it out. And it’s like, okay, well, put my spelunking hat on. And then you get those incredible moments, those truly incredible moments where you figure it out.

Allen: Yes. Mind expanding.

Arik: Yeah. Those are great.

Allen: Fist pumping moments. And that’s a theme that I think we’ll probably continue to see but definitely talking to leaders across all different stripes of product development, whether it’s product or designer or whatever, that as you get more and more into a leadership position, by definition, if your organization is working well, you’re going to be spending more and more time on those gnarlier, gnarlier things where you’re going to have other folks on the team solving the easy things because you’ve done delegation well, and you’ve set up good systems so that they have enough information to solve the easy things. And then your reward/punishment is you just kind of ladder up working on harder and harder problems throughout your career.

Arik: But by the way, even if you’re in an environment where that’s not true, where everyone is super senior, they’re also solving really hard problems. They’re just ones in domains that are not yours.

Allen: Yes. Absolutely.

Arik: So maybe they’re trying to figure out how to work around some crazy bug in collection view or some SwiftUI thing, and they’re trying to figure out, I need this animation to be smooth and it’s just not smooth. And I wouldn’t be the person to work on that. But if it’s like there’s some race condition here that’s causing the app to crash somehow, some of the time. We don’t know.

Allen: That sounds like a lot of fun.

Arik: Yeah. Well, keeps me employed. No, I enjoy it. I really do.

Allen: Thanks, Arik. It’s been great having you on today.

Arik: Thank you.

Allen: Where can people go to learn more about you and your work?

Arik: Oh man. I don’t know if there’s anywhere anymore. No. Danieltiger.com is my personal website where I write about-

Allen: What do you write about?

Arik: I don’t know. Installing water filter systems. I don’t know.

Allen: Yeah, yeah.

Arik: But that’s probably the best landing spot at this point because I’m not really doing a lot of social media these days. But yeah, so danieltiger.com.

Allen: Excellent. We’ll link that up in the show notes. It Shipped That Way is brought to you by Steamclock Software. If you’re a growing business and your customers need a really nice mobile app, get in touch with Steamclock. That’s it for today. You can give us feedback or rate the show by going to itship.fm, or you can check us out on Twitter and Mastodon. Send us a message, tell us what you think. And also you can rate this show on the Apple Podcasts app. That’s helpful for people to be able to find the show. Until next time, keep shipping.
