<rss version="2.0">
  <channel>
    <title>Blog</title>
    <link>http://pilchie.com/</link>
    <description><![CDATA[Musings on software engineering, and managing software engineers.]]></description>
    <item>
      <title>Refactoring with confidence</title>
      <link>http://pilchie.com/blog/refactoring-with-confidence</link>
      <description><![CDATA[<p>I was recently in a conversation about some significant refactorings that had happened in our codebase. There were a couple of opposing viewpoints about the value of those refactorings:</p>
<ol>
<li>When a regression was detected during a service rollout, the large number of changes involved in the refactoring made it very difficult to spot the issue.</li>
<li>On the other hand, refactoring existing complex code is a good way to learn how to understand it. Successfully completing a large refactoring of existing code can really bring a sense of ownership. It means that the engineers on your team are more comfortable making future changes, and they feel more pride in the codebase because it’s theirs.</li>
</ol>
<p>So, where do I stand on the issue? I agree with both points. I think it’s important to refactor code for the reasons above, but also to simplify it generally over time. The key, for me, is how you approach refactoring a codebase.</p>
<p>There are a few things to consider before, during, and after a refactoring.</p>
<h1>Before refactoring – pin things down</h1>
<p>In the excellent book “Working Effectively with Legacy Code,” Michael Feathers defines legacy code as code without tests. He goes on to describe how the secret to working on legacy codebases is to bring them under test. This is the same thing you need to think about when planning a large refactoring. Before embarking on it, understand what tests already exist. Do they cover all of the major scenarios? If not, to refactor responsibly, you need to add tests. A common concern with writing tests before refactoring is that the refactoring will just break the tests anyway. My opinion is that if that is true, the tests weren’t very high quality to start with. Ideally, tests should characterize the customer-observable behavior of the system without being tightly coupled to the implementation details of the code. Building mocks that ensure an exact sequence of calls is typically less useful than a test that operates a larger chunk of your code with only its external dependencies abstracted.</p>
<p>Consider whether you can write tests that act as an oracle or provide a baseline of expected customer results. This can allow you to significantly change the implementation while still ensuring that customer behavior is preserved.</p>
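<p>To make that concrete, here is a minimal sketch of such a characterization test in Python. Everything in it is invented for illustration (the post names no specific code): a small pricing function whose only external dependency, a tax lookup, is stubbed, so the test pins the customer-observable totals rather than any internal call sequence.</p>

```python
import unittest

# Hypothetical code under refactoring; the function, its signature, and
# the tax-lookup dependency are all invented for illustration.
def price_order(items, tax_lookup):
    """Return the order total; tax_lookup is the one external dependency."""
    subtotal = sum(qty * unit_price for qty, unit_price in items)
    return round(subtotal * (1 + tax_lookup()), 2)

class PriceOrderCharacterization(unittest.TestCase):
    """Pin customer-observable totals, not the implementation.

    The internals of price_order can be restructured freely; these tests
    only break if the totals a customer would see change.
    """

    def test_baseline_totals(self):
        flat_tax = lambda: 0.10  # stub the single external dependency
        self.assertEqual(price_order([(2, 3.00), (1, 4.00)], flat_tax), 11.00)
        self.assertEqual(price_order([], flat_tax), 0.00)
```

<p>Run something like this (e.g. via <code>python -m unittest</code>) before starting the refactoring to establish the baseline; if the totals change afterwards, customer-visible behavior changed.</p>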
<p>Code coverage can be a tool to help identify scenarios that you haven’t considered. I’m not a big fan of trying to hit specific code-coverage percentage targets, because I think that incentivizes the wrong things (a topic for another day). However, if you can determine which parts of your code your existing tests cover with a reasonable amount of work, reviewing those results to find the areas that aren’t covered is a good way to identify where to focus your effort on increasing test coverage.</p>
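<p>As a toy sketch of the idea, Python’s standard-library <code>trace</code> module can record which lines a test run actually executes (the function here is made up; real projects would typically reach for a dedicated tool such as coverage.py for friendlier reports):</p>

```python
import trace

def classify(n):
    # Toy function (invented for illustration) with a branch that the
    # call below never exercises.
    if n < 0:
        return "negative"
    return "non-negative"

# Count which lines execute while running only the happy path.
tracer = trace.Trace(count=True, trace=False)
result = tracer.runfunc(classify, 5)

# Lines of classify that never appear in the counts (here, the
# "negative" branch) are candidates for new tests before refactoring.
executed = {lineno for (_file, lineno) in tracer.results().counts}
```

<p>The point isn’t the percentage; it’s the list of untested branches it hands you.</p>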
<p>Once you have the tests in place, you can much more confidently proceed with your refactoring. Of course, it’s unlikely that you’ll have a test for every possible case, so diligence is still warranted.</p>
<h1>During refactoring – lean on your tools</h1>
<p>During refactoring, consider separating completely mechanical, automated refactoring operations into discrete units of work (a separate PR if you squash, or a separate commit if you don’t). Many of today’s developer tools have sophisticated and robust refactoring operations for things like Rename and Extract Method – I even worked on the ones for C# back in the day. Most of the time, if your tooling performs an operation for you, you can trust that it did a good job, so separating it out makes it easier to skip over when hunting for the culprit after a regression is found in production.</p>
<h1>After refactoring – you’re not done</h1>
<p>So, you’ve implemented tests, refactored the code, created a pull request, gotten it reviewed and merged – time to celebrate, right? Well, slow down there a bit. Now that you’ve refactored this code, you should feel increased ownership, and it’s important to remember that this ownership lasts through the whole life cycle. Once you’ve merged your refactoring, make sure you understand how and when it gets deployed. Follow it through whatever systems your team has until it's live and being used by customers. Watch for indications that something went wrong, and proactively diagnose and repair any issues. Ideally, your on-call team isn’t surprised by an issue because you’ve already noticed the signal and addressed it. If they are, make sure you are available to help troubleshoot. Be accountable for the quality of your code right through the deployment process, until it’s vetted by customers.</p>
<h1>Conclusion</h1>
<p>As I said above, refactoring is an important way of improving the quality of a codebase, helping people new to the codebase learn it, and engendering a sense of ownership. However, it’s not without risks, and shouldn’t be taken lightly. Consider what you can do to make refactoring as safe as possible. Your team and your customers will appreciate it!</p>
]]></description>
      <pubDate>Sat, 18 May 2024 00:03:15 GMT</pubDate>
      <guid isPermaLink="true">http://pilchie.com/blog/refactoring-with-confidence</guid>
    </item>
    <item>
      <title>How much should you spend on infrastructure?</title>
      <link>http://pilchie.com/blog/how-much-should-you-spend-on-infrastructure</link>
      <description><![CDATA[<p>Recently, a colleague posed a thought-provoking question: "How much should you spend on infrastructure?" This resonated with me, especially considering the constant challenge of improving build and test infrastructure while maintaining a reasonable level of investment.</p>
<p>Reflecting on past experiences, I found a couple of guidelines worth sharing:</p>
<ul>
<li>A division head once advocated allocating 25% of engineering resources to enhance and maintain infrastructure.</li>
<li>In my previous organization, we dedicated 2 full-time individuals and a rotating shift to infrastructure support for a team of 25, alongside central engineering support.</li>
</ul>
<p>Initially, these seemed like sensible benchmarks, but the more I pondered, the more I realized a universal truth across diverse teams and technologies: slow builds, flaky tests, and information gaps are persistent challenges in software engineering.</p>
<p>This realization led me to a compelling conclusion: when you have someone passionate about improving infrastructure, don't constrain their investment. Finding dedicated individuals for infrastructure work is often a challenge, and imposing rotating shifts leads to minimal investment and, in the long term, chronic under-support.</p>
<p>If you have a team member with innovative ideas and genuine motivation to enhance infrastructure, seize that opportunity! Instead of setting limits, nurture their enthusiasm. Acknowledge the positive impact on the entire team and ensure that investing in infrastructure doesn't become a career dead end. Of course, a well-thought-out plan, progress tracking, and continuous communication are essential.</p>
<p>I'm curious to hear from fellow managers: how do you approach this question? What experiences have you had with build and test infrastructure improvements? Let's share insights and elevate the standards of our software engineering practices! 💡</p>
]]></description>
      <pubDate>Sat, 24 Feb 2024 04:07:33 GMT</pubDate>
      <guid isPermaLink="true">http://pilchie.com/blog/how-much-should-you-spend-on-infrastructure</guid>
    </item>
    <item>
      <title>Productive code reviews</title>
      <link>http://pilchie.com/blog/productive-code-reviews</link>
      <description><![CDATA[<h1>You’re not done when you publish a PR</h1>
<p>Your accountability is to ship code to customers. You aren’t finished with that bug fix as soon as you publish your PR. You are the one accountable for ensuring the code gets merged, that it passes any deployment gates, and every other step that might exist until the customer is using it.</p>
<p>It may be that you have a great team culture where all PRs get reviewed and merged promptly. If that’s true, I’d love to talk and learn from you. In my experience, it often takes work to make sure someone reviews your PR. This might mean reminding people, up to and including booking time on their calendar.</p>
<p>You’ve got to be tenacious and realize that a PR waiting for approval in GitHub or AzDO isn’t delivering any value to anyone.</p>
<h1>Synchronous code reviews can save time</h1>
<p>I said above that getting your PR reviewed might require booking time on someone’s calendar. If this happens to you, embrace it! Since asynchronous code reviews became popular, I’ve noticed that reviews often fall into one of two categories: they are either fairly trivial, done with only the context shown by the tool, or they are extremely expensive as the reviewer spends time pulling down the branch, building it, exploring dependencies, possibly debugging the code, and so on. But I’ve got a secret – the person who submitted the PR probably already has the code locally. They can probably debug in seconds instead of minutes. Also, you can very quickly ask questions like “Hey, did you consider this case, what happens?” instead of either trying that case yourself, or posting the question and then not being able to finish the review until you get an answer.</p>
<p>Don’t underestimate how much the preparation of the PR author, and the high bandwidth communication available can reduce the burden of performing reviews.</p>
<h1>Reviewing PRs contributes to others’ success</h1>
<p>I don’t know about other companies, but at Microsoft part of our performance management system asks people to consider how they contribute to others’ success. Often we feel like we’re too busy with our own work to do time-consuming reviews, but I’ve got a secret for you: reviewing PRs is a very real way that you contribute to other people’s success. They literally can’t ship their code without a review, and if you are an expert you can also use the time to help them learn about the codebase. Even if you’re not an expert in that part of the codebase, there may be things that you have been burned by that you watch out for in code reviews, and discussing them can help others avoid repeating your mistakes.</p>
<h1>You don’t have to be the expert</h1>
<p>Finally, one issue that I’ve seen on teams is that reviews are always blocked on “the experts”. Sometimes that’s the right thing, but sometimes it just slows things down and acts as a gatekeeping function. A principle that I like to apply is to think about consequences and reversibility. Many non-experts don’t feel comfortable signing off on PRs because they worry they’ll miss a bug. I’ve got a secret for them: I’ve been ‘the expert’ on a lot of teams, and still missed plenty of bugs during code reviews. However, we tend to have multiple safeguards in place, like additional pre-deployment gates, safe deployment to smaller regions, previews, A/B testing, etc. Code reviews are just one part of a defense-in-depth strategy around bug prevention and detection, and it’s expected that some bugs will get through. That’s usually ok, because they will be caught by a later stage. Even in the worst case, where one isn’t, it’s typically not difficult to reverse; git has a handy “revert” command for just such occasions.</p>
<p>Now, there are exceptions to this, where you really do want to have experts sign off on a review because the consequences could be severe.  Security fixes that must be rapidly deployed are one example here.  I’m sure you can think of others.  So, I’m not saying anyone should be able to sign off on any PR any time, but I do think in many teams there is an overabundance of caution that ultimately ends up slowing down the team’s productivity. Oh, and on this topic – consider doing synchronous group reviews for cases like this. That way others can learn what the expert is looking for and can get better at doing their own reviews and one day become the expert.</p>
]]></description>
      <pubDate>Sat, 24 Feb 2024 04:02:54 GMT</pubDate>
      <guid isPermaLink="true">http://pilchie.com/blog/productive-code-reviews</guid>
    </item>
    <item>
      <title>Context switching and productivity for engineers</title>
      <link>http://pilchie.com/blog/context-switching-and-productivity-for-engineers</link>
      <description><![CDATA[<p>Tl/dr: While it’s important to prioritize tasks and work on the highest priority task, you can be more productive by maintaining a list of alternative tasks, and using time that you are blocked to unblock others, and to make progress on the next highest priority task that you aren’t blocked on.</p>
<hr>
<p>It's Microsoft's twice annual performance management season (what we call "Connects"). As part of that, I've been reflecting on what differentiates some of the most productive engineers I've ever worked with. I've had discussions with a number of folks on the topic and decided to try to write some of them down, and share in case they help others. So, here goes...</p>
<p>Context switching is hard, but it’s also important.</p>
<p>Often, your highest-impact, highest-priority work will involve large chunks of asynchronous work. It can be useful to also keep a queue of tasks that you can complete without blocking, and switch to it when your highest-priority task is blocked.</p>
<p>Given that list, when you become blocked on your primary task, pick other work from your queue based on the following guidelines:</p>
<ol>
<li>First, focus on unblocking others during small breaks. Answer coworker questions, review pull-requests, etc. Unblocking others has a multiplicative effect because it means that you and they can both start working in parallel.</li>
<li>Next, consider an estimate of how long you’ll be blocked and pick work accordingly. If you expect to have a relatively long block of time, pick a high priority task that requires a lot of context and let yourself get into flow state. Alternatively, if you only expect your primary work to be blocked for a short period your next highest priority task may not be the best thing to focus on now. Instead, consider whether there are smaller items that you can do that aren’t blocked to fill up that time.</li>
</ol>
<p>Dan Moseley suggested that using the Pomodoro technique may help with this - it gives you a regular time to check if something more important is unblocked, while allowing you to focus in the meantime. It also gives you a concrete unit to estimate tasks in to decide which ones are worth starting now based on how long you expect to be blocked.</p>
<p>This relates to another interesting thing to think about around tasks at work. Ideally you have a balance of ABC tasks*. That is, tasks that are:</p>
<ul>
<li>Above current ability/level</li>
<li>Below current ability/level</li>
<li>At current ability/level</li>
</ul>
<p>Ideally, most of your tasks will be at your current level, some are above so that you have the opportunity to learn and grow, and all jobs end up with things below your abilities, because they are required to keep things moving. There can be an interesting correlation, where you’re frequently blocked on tasks above your level as you wait for guidance, etc. Most things below your level are probably easy to do without being blocked. Things at your current level are likely a mix of collaborative tasks that will be at least partially asynchronous, and independent work.</p>
<p>Context switching is expensive, so do it deliberately. However, we should recognize that we participate in many asynchronous workflows, and if we never context switch, we can end up wasting a lot of time. The best engineers context switch only when necessary, but actively work on being able to context switch effectively. There is a tension between context switching and distractibility. For example, leaving email notifications enabled and switching to Outlook every time a new email arrives will incur a lot of context-switching overhead, and likely prevent you from entering a highly productive flow state. However, intentionally switching to a recently unblocked, higher-priority task within a reasonable time is imperative.</p>
<p>* Credit to Scott Wadsworth for this way of categorizing work.</p>
<p>Also thanks to Dan Moseley, Matthew Gertz, Eilon Lipton, Steve Carroll, Jeff Schwartz, Jared Parsons and others for discussing some of these ideas with me.</p>
]]></description>
      <pubDate>Sat, 24 Feb 2024 03:58:24 GMT</pubDate>
      <guid isPermaLink="true">http://pilchie.com/blog/context-switching-and-productivity-for-engineers</guid>
    </item>
  </channel>
</rss>