Catching up on being lean (again)

The most interesting revelations come by chance. During a regular one-on-one meeting with a TL, one thing led to another and he asked me if I had read The Phoenix Project. I vaguely remembered my wife saying that a “Phoenix book” was upsetting her because it was too much like real life. Not needing to be told thrice, I decided to pick up the book pronto Toronto.

The IT novel captured my attention from the very start. It is relatable, funny and focussed on accurately illustrating life in the technology industry. I especially enjoyed the portrayal of day-to-day problems and instinctive solutions. The concepts, although known to me, were so interesting that I decided to pick up the original book that The Phoenix Project is based on: The Goal.

I had my doubts about The Goal. It is based solely on the manufacturing sector. But it turns out that software development “management” can look directly at manufacturing lines to explain things that are very close to home (no surprises for many of you, right?).

FWIW I think the story and the lessons are worth sharing.

Some background

Since Jan 2020, two of my teams have been grappling with (if not choking on) a never-ending stream of tickets from our users / customer support (CS). This started due to a very justifiable change in our department.

As part of a major process change at that time, we had decided to rip off the band-aid and let go of a centralized “Support initiative” (SI) pod. The SI pod was made up of full-stack developers from all of our teams on a rotation basis. Instead of this, we decided to have “in team OPS”. We knew this would solve many problems that I won’t go into. We had hoped that this would increase each team’s domain expertise, allow them to detect and solve root problems faster over time and eventually make for better, healthier ownership. And it did achieve all of those things, for the most part.

From the get-go it became clear that the domains of these two teams were generating the highest number of issues. In terms of effort, the number of developers required to maintain our SLAs was equal to that of the original SI pod model.

Fast forward to summer of 2020. Like many other SaaS companies & startups, we needed to make quick tactical decisions to survive, thrive & inspire. The Phoenix Project talks about the concept of a Value Stream.

From a value stream POV, we increased throughput to breakneck speed in order to meet the new market demands and the changes required. And given our higher education seasonality, almost all of that work had reached production by the time the Fall semester started. We hit many crises, and the team’s (and department’s) dedication managed to get us past the difficulties.

I was determined to understand the root causes and find ways to prevent such a situation in the future. I spent many weeks trying to unpack the many factors and arrive at a conclusion. The Goal and its Five Focusing Steps helped me do just that.

Focusing Step #1: IDENTIFY the system’s constraint

Basically, what is our bottleneck? Wait, could it be the domains of these two teams? Obvious? Sh*t!

The moment that idea came to my mind, a wave of supporting evidence hit me all at once: the backlog of work, the projects that keep running into our “cool downs”, how other teams’ work keeps generating CS issues for these two teams. To be clear, I could be wrong, but the evidence is overwhelming. The more I looked at it, the more data I was able to find. But TBH, just seeing how big the CS backlog of lower-priority tickets is would probably be sufficient.

Armed with this new revelation, I’m at the start of a journey to apply lean in a way I never have before. Yes, Scrum & Kanban have been awesome to “work in” as a dev and “work with” as a manager. But working with stakeholders & execs to adjust our department processes to ensure we work with the bottleneck? Now that seems like a new challenge! So on we go.

Focusing Step #2: EXPLOIT the constraint

As in The Goal, we have recently worked with CS, Product & QA to plot the value stream. We then aligned on what our goal is and easily identified the bottleneck to achieving it. And after some brainstorming, we decided that for incoming tickets we will add the following info:
a. “Is reproducible” by PM/QA/Dev in their sandbox
b. “Confidence” that we have in the reach of the issue
Those two factors, along with “impact” & “reach”, will be taken into account for prioritization, with the hope that they maximize the throughput of “fixed” resolutions and reduce “can’t repro”, “won’t fix” & “duplicate”. The latter resolutions could (should?) be viewed as a waste of developers’ time (even though both we & our users can see the pain of these issues in Fullstory).
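To make the idea concrete, here is a minimal sketch of how the four factors could be folded into a single score. This is not our actual formula. The multiplicative scoring, the 1-5 impact scale, the user-count reach, the 0.5 penalty for unreproduced tickets and every name in the code are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    key: str
    impact: int            # assumed 1-5 scale, 5 = most painful
    reach: int             # assumed: estimated number of affected users
    confidence: float      # 0.0-1.0, how sure we are of the reach estimate
    is_reproducible: bool  # reproduced by PM/QA/Dev in their sandbox


# Assumed penalty for tickets nobody has reproduced yet; purely illustrative.
UNREPRODUCED_WEIGHT = 0.5


def priority_score(ticket: Ticket) -> float:
    """Higher score means the ticket should reach a developer sooner."""
    repro_weight = 1.0 if ticket.is_reproducible else UNREPRODUCED_WEIGHT
    return ticket.impact * ticket.reach * ticket.confidence * repro_weight


def prioritize(tickets: list[Ticket]) -> list[Ticket]:
    """Return tickets ordered from highest to lowest score."""
    return sorted(tickets, key=priority_score, reverse=True)


if __name__ == "__main__":
    backlog = [
        Ticket("CS-2", impact=5, reach=500, confidence=0.2, is_reproducible=False),
        Ticket("CS-1", impact=3, reach=200, confidence=0.9, is_reproducible=True),
    ]
    for ticket in prioritize(backlog):
        print(ticket.key, priority_score(ticket))
```

In this sketch, a scary-sounding ticket with a shaky reach estimate and no reproduction drops below a smaller, well-understood one, which is the behaviour we hope will cut down on “can’t repro” resolutions at the bottleneck.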

What comes next & Focusing Steps #3, #4, #5

We have about 2 months to observe if this works. Then the summer term will begin, and that usually means 10% of the usage we see in our regular Fall & Winter terms. Either way, I’m hopeful that we will learn and iterate on another attempt before summer, or be prepared to try again in the Fall.

There are 3 more steps that are being planned at the moment. If you have read this far and, hopefully, have decent context, I’d be happy to discuss any thoughts or suggestions we could consider for any of the steps.
Until then, staying true to the Stockdale Paradox.

Thanks for reading.

Director of Engineering at Top Hat. Worked at successful startups and large technology companies in various high-responsibility roles.