An endless journey to app stability

Three or four months ago, a friend of mine, the CTO of a commerce start-up, asked for my advice. A similar bug kept recurring in their production app: an important feature would suddenly stop working, again and again, and he wanted to know how to fix this at the root. As it happens, one of our team's major achievements this year was improving app stability. After sustained effort on our code, infrastructure, and development process, stability has been steadily improving. After listening to his team's situation, I suggested several measures to put in place over the long term.

App stability means different things to different people and teams, so there are various ways to measure it. From the user's perspective, you can track the percentage of crash-free sessions and crash-free users. These values directly reflect what users experience, but they describe what happens after a release. Stability signals that can be measured before a release are just as important, and deeply related. Generally, code reaches end users after feature development, feature testing, and regression testing. Regression bugs, whether found during regression testing or after deployment, are especially critical for app stability.

An unexpected bug in a part of the code we didn't think we had touched primarily means there was a problem with the code itself. But when it is found only after code review, automated testing, and QA, it also means there are holes in our team's process through which bugs can escape, and we need to find and plug them at each step. Since each stage has a different purpose and method of improvement, we need to adopt measures suited to each one and maintain them over the long term.

1. Test and Improve Code Quality

First of all, the most reliable and sustainable long-term solution is to improve the quality of the code where bugs originate and to write tests. Uncle Bob put it roughly like this: mathematics proves things right, while science proves things wrong; we trust science because its test coverage is high, and since software is closer to science, tests are how we prove our program wrong, so software with low coverage cannot be trusted (paraphrased). The way to check whether the software I wrote works as I intended is to write tests. In app development, testing and architecture are closely related. MVC, which could be called the default architecture for iOS apps, is not well suited to testing, and various architectures have been created to solve this problem. If MVC is your main architecture today, set up a mid-to-long-term plan and refactor toward another architecture. Good architecture reduces developer mistakes and makes it easier to write code in the right direction.

Tests are categorized by speed and degree of integration, so introduce them according to your situation, with priorities. iOS's XCTest framework supports both unit tests and UI tests. Unit tests generally run very fast because they can be run per module. Unit tests are further divided into application tests and library tests, so you can choose whichever is technically appropriate. Also, since Xcode reports code coverage as part of the unit test results, you can easily collect per-module coverage by reading the xcresult file created in DerivedData. UI tests, on the other hand, typically take much longer because they need to build the whole app and run it in a simulator, but they exercise entire features, not just a piece of code, as if a person were using the app, so they have the advantage of verifying that the important features actually work.
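
To make this concrete, here is a minimal sketch of an XCTest unit test. `CartCalculator` is a hypothetical type defined inline for illustration, not code from any real app.

```swift
import XCTest

// Hypothetical production type, defined inline so the example is self-contained.
struct CartCalculator {
    let discountRate: Double
    func total(prices: [Double]) -> Double {
        prices.reduce(0, +) * (1 - discountRate)
    }
}

final class CartCalculatorTests: XCTestCase {
    // A unit test like this runs in milliseconds and checks one behavior in isolation.
    func testTotalAppliesDiscountToSum() {
        let calculator = CartCalculator(discountRate: 0.1)
        XCTAssertEqual(calculator.total(prices: [100, 50]), 135, accuracy: 0.001)
    }
}
```

A UI test, by contrast, would launch `XCUIApplication()` and drive the real screens, which is why it costs minutes rather than milliseconds.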

Code review is also one of the ways to improve code quality. Google's internal guide is very helpful for doing good code reviews.

2. Building Infrastructure

Automated Infrastructure

To develop, deploy, and operate continuously, you need automated infrastructure, namely CI/CD. At its most basic, it automates deployment and testing. Deployment can happen on every commit, at a fixed time every day, or whenever a developer wants. You can even ask Siri to do it for you. And if your team has a pull-request-based development process, you can automatically run tests for each pull request and verify that existing features are not broken before new code is merged into the main branch. Automation becomes the foundation of development and all related processes. The more automation you apply, the more time and effort developers save on repetitive tasks, and since automated testing is the first line of defense against regression bugs, make sure to build this infrastructure.

Real-time Quality Monitoring

To respond quickly and accurately to problems while operating a service after shipping a new feature to users, a monitoring foundation must be in place. Metrics that represent the service quality as felt by end users are called Quality of Experience metrics (QEM). These are different from usability analytics tools like Google Analytics: if usability analytics are for improving UI/UX, QEM is for monitoring service quality in real time and helping solve problems when they occur. Typical QEM values are things like the time required for X or the success/failure rate of Y. For example, you can measure the actual speed users experience by recording app loading time, then take action based on those numbers or verify improvements. You can also record the success or failure of important tasks such as core server API calls and monitor them in real time to detect anomalies quickly.
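
A minimal sketch of what recording such values might look like; `MetricsClient`, the event names, and the tags are assumptions for illustration, not a real API.

```swift
import Foundation

// Hypothetical metrics client; in practice this would forward events
// to whatever QEM backend the team uses.
protocol MetricsClient {
    func record(name: String, value: Double, tags: [String: String])
}

struct AppLaunchTimer {
    let metrics: MetricsClient
    let start = Date()  // captured when the timer is created, early in launch

    // Call once the first meaningful screen is visible.
    func markLaunchFinished() {
        let elapsed = Date().timeIntervalSince(start)
        metrics.record(name: "app.launch.duration",
                       value: elapsed,
                       tags: ["os": "iOS"])
    }
}

// Recording success/failure of an important task, e.g. a core server API call.
func reportAPIResult(_ success: Bool, endpoint: String, metrics: MetricsClient) {
    metrics.record(name: "api.request.result",
                   value: success ? 1 : 0,
                   tags: ["endpoint": endpoint])
}
```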

Always check closely that the resulting figures match the phenomenon you intend to observe. Values should be charted and analyzed with different variables in mind, because the action to take differs depending on whether a rising failure count for a specific network request is caused by user growth or by a problem in the system. Look at performance numbers just as carefully: users may experience something completely different from what we expect depending on their phone's performance and the network quality in their region, so it may be better to segment the population by meaningful characteristics. Also, look beyond simple averages at statistically meaningful values such as P50, P90, and P95 to understand the phenomenon more fully and act accordingly.
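
For instance, here is a simple nearest-rank percentile in Swift (the sample numbers are made up) showing why P50/P90/P95 tell a different story than the mean:

```swift
// Nearest-rank percentile: enough to see how the tail differs from the average.
func percentile(_ p: Double, of samples: [Double]) -> Double? {
    guard !samples.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = samples.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank - 1, 0)]
}

// Hypothetical app-loading times in seconds.
let loadTimes = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 6.0, 7.5, 9.0]
// The mean is about 3.0s, but P50 is 1.2s: most users are fine,
// while the slowest tail (P90/P95) is several times worse.
print(percentile(50, of: loadTimes) ?? 0)  // 1.2
print(percentile(90, of: loadTimes) ?? 0)  // 7.5
print(percentile(95, of: loadTimes) ?? 0)  // 9.0
```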

Various Status Boards

Beyond real-time monitoring like QEM, you can build other status boards that are useful to the organization. Status boards had a positive effect on personal motivation and sense of accomplishment, beyond their original purpose. On our board showing code coverage per module, red and yellow meant coverage was too low, and green meant the target had been reached. The board was red at first, and it gradually turned green. Unlike developing a new feature, the results of writing tests are not normally something you can see, but a status board like this makes the results of writing test code visible. As a client developer, I gained extra enthusiasm from those visual results.

However, you shouldn't build a status board just to glance at it from time to time. Mark Porter, Grab's former CTO, said, "Do not stare at the status board. It is the alert's job to tell you when to look at it." Don't stop at making the board look good; make it send you a notification when an anomaly occurs, so you don't miss an important event or respond to it late.

3. Improve Work Culture

Since every organization has a different process, nothing can be applied uniformly to all of them. So here are two practices from what our team did this year that I found effective.

Close Cooperation with the QA Team

Since 20% of our team's time goes into code refactoring and structural improvement, existing features that used to work well, rather than new ones, are sometimes affected. In the first half of the year, no QA engineer was assigned to this refactoring work; it was handled entirely within the development team. Information was not well shared, and critical bugs appeared several times because the QA team did not know where to focus during regression tests. After that, the development team and the QA team began jointly evaluating the impact scope of technical tasks and the need for manual testing, and worked to eliminate tasks that affect existing behavior without the QA team knowing. Recently this has been further automated: when a developer marks the impact scope on an MR, the changes are automatically collected when the regression test begins and shown to the QA team.

The Grab app also uses the feature-flag technique, so code can be merged into the master branch before a feature is complete. But some development tasks cannot be managed with feature flags. In those cases development has to happen on a feature branch, and with more than 120 people contributing to a single app, the longer the merge is delayed, the higher the risk of merge conflicts and new bugs. So, in close coordination with the QA team, we are putting a process in place so that such features can go through manual testing and be merged as quickly as possible once development is complete.
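
For readers unfamiliar with the technique, here is a generic sketch of a feature flag, not Grab's actual implementation: as long as the flag stays off, the unfinished code path can sit on the main branch without affecting users.

```swift
// Generic feature-flag sketch; names and storage are illustrative only.
protocol FeatureFlags {
    func isEnabled(_ key: String) -> Bool
}

// In practice this would be backed by remote config; here, an in-memory stub.
struct LocalFlags: FeatureFlags {
    let enabled: Set<String>
    func isEnabled(_ key: String) -> Bool { enabled.contains(key) }
}

func makeHomeScreenTitle(flags: FeatureFlags) -> String {
    if flags.isEnabled("new_home_redesign") {
        return "Home (redesigned)"   // new, still-in-progress path
    } else {
        return "Home"                // stable path shipped to users
    }
}
```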

Weekly Bug Retrospective Meeting

Our team spends an hour each week looking back at the bugs that occurred that week, discussing their causes and what to do differently so the same problem never recurs. We do not try to find the person who caused a bug or hold them accountable; the premise is that anyone can make a mistake, but the same mistake happening twice is the team's problem. If testing was insufficient, we improve the tests. If several developers tend to make the same mistake, we write a document to share the knowledge or change the architecture. If a new technique or process should be introduced, someone investigates further and follows up. In my experience, this retrospective produces lively discussion and an exchange of knowledge among teammates, so the whole team grows together. When the number of bugs drops and the retrospective ends early, I feel a real sense of satisfaction and accomplishment.

Multiple Lines of Defense and Automation as Priority

The various measures complement each other. Just as fine flour is obtained by sifting coarse flour several times, no single method can catch every bug and problem, so several measures need to be layered. It's important to understand the strengths and limitations of each method and to plug the gaps between them. For example, no matter how many tests you write, they cannot be perfect: the test cases may be insufficient, the logic itself may be written incorrectly, or a developer who found a failing test may have casually changed the code just to make it pass. So automated tests should be supplemented in other ways. Code review catches mistakes by having several developers check each change. Likewise, code that passes an isolated unit test may still misbehave or crash in the real service, which more integrated UI tests can catch. Beyond that, keep adding defense lines, such as the QA team, to find problems before new code is deployed.

When introducing a new process or policy, automation should be the first priority. The more non-automated processes there are, the more the team is shackled by them. Some of you may have seen a simple repetitive task that developers performed manually quietly disappear one day; in the worst case, a bug or problem that had occurred before then happens again. Remember that a process must be automated to be sustainable in the long run. Adding a work process or an app feature is easy; removing one is much harder, and no one wants to do it. So look for ways to automate any new process as much as possible, and be reluctant to introduce tasks that can only be done manually.

In Conclusion

To increase app stability and keep it high, every area needs to improve: code quality, development process, testing, release, and operations. It begins with teammates sharing an understanding of the problem. Since everything can't be done at once, divide the work into short-, medium-, and long-term according to your team's situation, capability, and priorities, and refine your plans as you learn by continuously trying new things. Increasing app stability ultimately not only increases user satisfaction but also makes developers confident and happy.

Tags: tests, app stability, regression  

The benefits of writing test code

Once, after modifying some code and opening a pull request, I saw a test fail, tracked down the part I had missed, and fixed it; that experience made me feel the value of tests firsthand. Our team sets coverage as a KPI, and we allocate 20% of our working hours to technical tasks such as writing tests, improving build times, and modularization. Here is what I have come to feel through the experience of writing and maintaining test code.

It pushes you toward better architecture

Code that is hard to test is very likely to have a poor structure. Testable code embodies careful thinking about dependencies. In particular, source-code dependencies must be organized and managed while strictly following the dependency inversion principle, which Uncle Bob explains well through his dependency rule.
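
A minimal Swift sketch of what this looks like in practice (the types are hypothetical): the high-level view model owns the abstraction, and the concrete implementation, real or stubbed, depends on it rather than the other way around.

```swift
// High-level policy depends on an abstraction, not on a concrete network client.
protocol OrderGateway {
    func fetchOrderCount(completion: @escaping (Int) -> Void)
}

final class OrderBadgeViewModel {
    private let gateway: OrderGateway
    init(gateway: OrderGateway) { self.gateway = gateway }

    func refresh(onUpdate: @escaping (String) -> Void) {
        gateway.fetchOrderCount { count in
            onUpdate(count > 0 ? "\(count)" : "")
        }
    }
}

// In tests, a stub replaces the real HTTP implementation entirely.
struct StubGateway: OrderGateway {
    let count: Int
    func fetchOrderCount(completion: @escaping (Int) -> Void) {
        completion(count)
    }
}
```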

Dogfooding my own code

When you write tests, you get to experience your own code directly as a user. It's like building a house, decorating the interior, then stepping outside to see how it looks from the outside: what the windows look like, how the door opens, where the entrance is. The windows and the entrance are like the public API; the interior of the house is the private parts. When a test verifies an object's behavior, it uses the public API: you create the object yourself, configure it, call its public methods, and access its properties. In doing so, you get to check the usability of the code you wrote. If the API is clumsy, writing tests is awkward and painful. You run into situations like "Do I really have to test this?" or "Why is this case so hard to test?" So you can't help paying attention to the usability of the API, and as you do, you end up building simpler and more intuitive APIs. "Simpler" can mean many things, but in my opinion keeping the number of public methods and properties small is especially important.
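
A small sketch of the idea, with a hypothetical `SessionStore` standing in for the house: the test consumes the object exactly as a client would, through its public surface only.

```swift
import XCTest

// Hypothetical type: the "house". Only the public surface is visible to clients.
public final class SessionStore {
    private var tokens: [String] = []        // interior: a private detail

    public init() {}
    public func save(token: String) { tokens.append(token) }   // the door
    public var latestToken: String? { tokens.last }            // the window
}

final class SessionStoreTests: XCTestCase {
    // Create, configure, call public methods, read public properties:
    // the same path any client of this API would take.
    func testLatestTokenReturnsMostRecentlySaved() {
        let store = SessionStore()
        store.save(token: "a")
        store.save(token: "b")
        XCTAssertEqual(store.latestToken, "b")
    }
}
```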

Simple and intuitive API design

As you repeat the cycle of developing a feature, writing tests, modifying the feature, and updating the tests, you seem to build the muscle to sketch the big picture at the design stage of an API. For example, sometimes a small change to a feature forces you to throw away all the existing tests and write them anew; when that happens, the design was probably not good from the start. Things that are policy-level and unlikely to change should be exposed as public, while implementation details should be hidden or moved elsewhere. If the design is wrong, writing tests is impossible, or even if you somehow manage it, maintaining them is painful. When this happens repeatedly, you can't help considering the structure and roles of objects and their testability: what should go inside this object, what should be taken out, and how the dependencies should be arranged. If you're at a loss about how to design some code, or can't tell which pattern is better, using test code as the yardstick seems like a good choice.

Growing the team's ability to create and maintain meaningful tests

Unfortunately, you can't just decide one day that you want to write test code and immediately be able to write it. The code structure and environment must allow tests to be written, so you need to adopt and maintain an architecture that makes testing possible. Apple's MVC is very hard to test: you can't trigger user action events without building the UI, the view controller's lifecycle methods are numerous and complicated, which makes testing tricky, and even injecting dependencies is not simple. That's why architectures like MVVM, VIPER, and RIBs emerged. But adopting an architecture is not the end. Architecture is no silver bullet; even after adoption, it keeps being reshaped by the situation the team faces and the problems it has to solve. So technical leadership matters: keeping a structure that stays testable over time, where writing tests is not a chore but actually fun. Teamwork also matters: building an environment where writing tests doesn't hurt productivity, and adopting or building the right tools.
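
As a rough illustration of the difference (a deliberately tiny MVVM sketch, not a recommendation of any particular flavor): user intent becomes a plain method call on the view model, so no view or view-controller lifecycle is needed to test it.

```swift
import XCTest

// With MVVM, user intent is a plain method call on the view model,
// so the test needs no UIKit view or view-controller lifecycle.
final class LoginViewModel {
    private(set) var isSubmitEnabled = false

    func emailChanged(_ text: String) {
        isSubmitEnabled = text.contains("@")
    }
}

final class LoginViewModelTests: XCTestCase {
    func testSubmitEnablesOnlyForPlausibleEmail() {
        let viewModel = LoginViewModel()

        viewModel.emailChanged("not-an-email")
        XCTAssertFalse(viewModel.isSubmitEnabled)

        viewModel.emailChanged("dev@example.com")
        XCTAssertTrue(viewModel.isSubmitEnabled)
    }
}
```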

Tags: tests