Skip to main content

Detect errors on time to gauge the success of your launch — Simple concepts, huge benefits!

When you develop your applications, do you put thought into a proper logging mechanism? In a perfect world, there should be no errors in the logs, but we all know that’s not the case in reality. Then how do you know that the project you just launched to a production environment is not introducing new errors or new types of errors on top of already existing errors that your triage team has been investigating?
Maybe on day 1 of your launch, you don’t see customer complaints, but in reality your systems are hurting or bleeding slowly. Yes, a network operations team could be checking the high-level health of the systems through a list of dashboards, but what are we software developers going to do to introduce a new level of detection? This all starts from the enterprise architecture and frameworks you build. Let’s assume you have a very robust logging mechanism and that this mechanism allows you to log the happy path. Let’s also assume that you have very clean guidelines for error and exception handling and utilizing the logging framework where necessary.
Now that you have all of the above in place, at the beginning of your project that is implementing business requirements within the existing framework, you have ability to cleanly define the top 20 cases to measure the success of the new code/features. Each developer can use this top-20 list as a guideline while developing the code and logging happy/negative cases. Let’s say your code is now in production and you are scanning through the logs manually and detecting the top-20 cases. Is this efficient? Are you supposed to do this on daily basis manually?
My recommendation is that you develop a lightweight solution that will be able to automatically do the following for you:
  • * Scan the logs on daily/hourly basis and produce the count of the top-20 scenarios and display the results in a table on some internal dashboard website
  • * Have ability to detect if the number of errors in each category * increases by more than X% (daily comparison of errors per Y units of work and units of work could be somehow defined and tied to the traffic on your website).
  • * Have ability to detect if new types of errors and exceptions that start happening so that the team can manually assess the situation and then add each new type of error to the top-20 list and start tracking it on daily basis.
If you have all of this automated, then there is no manual work needed when you launch something to production. You will be able to tell if your new code is hurting the numbers on existing top-20 categories and you will also be able to tell if you started introducing new types of errors that hurt the revenue of your company. Let’s assume that your production deployment involves deploying to a smaller/secondary data center first and then later to the rest of your data centers. Then this type of mechanisms can help you decide whether you continue deploying to the rest of data centers after deploying to that smaller data center.
These are all simple concepts. You can spend minimal efforts in building it yourself or maybe decide to buy a solution. The importang thing is to always take the “keep it simple” approach in decision making.

Conclusion:

Start by tracking top-20 errors on daily/hourly basis and use the percent of change as the gauge for the success of your code being pushed to production environments. Detect the newly introduced low-level engineering errors in production on time to gauge the success of your launch. Don’t over-design this! Keep it simple!
Almir Mustafic

Comments

Popular posts from this blog

Brand New programming language and one solution OR …

Brand New programming language and one solution OR Two existing programming languages, one solution for EACH? I understand that there is no right or wrong. It all depends on your software architecture, team structure, team skills and other factors, but I still want to explain the scenario as it may look familiar to some. Let me explain. Let’s assume that you have microservices and common libraries in two major programming languages. You have some teams who are experts in one and some teams experts in the other programming language. Now you need to come up with a solution for a scenario that all teams will need to leverage. Let’s assume that your cloud platform has an off-the-shelf approach for this but it is supported by a 3rd programming language that your teams do not have much experience in. What is the right thing for your organization and not just from the technical point of view? A) Do you embrace what your cloud platform gives you off the shelf and implement thi...

Programming / Software Engineering  — Think Paper, Paper, then Code

Most of the software engineering problems are solved in what I call the high-level brainstorming sessions. We basically walk into a meeting room and white-board our thoughts and come up with solutions. When things start falling apart, you better believe this happens in the last stretch of projects and it does work.  Now the issue is that we as programmers do NOT do the similar type of exercise before a line of code is written ? I typically see developers get requirements in the form of a document or a user story or in the form of walk-by requirements. The next thing I see on developers’ screens is code editors or IDEs. Is that the right thing to do? You may say that you are advanced enough and that you like to dive into coding right away, but this happens even to the best of us. We fall into this trap and rarely step back and review our habits. We have to go back to fundamentals. What did we do in school?  Professors taught us to write down our thoughts and to show what...

Owning and improving what you have is a sign of maturity in software engineering — Is it?

Re-factoring vs. Re-Writing? A lot of times we heard the talks about “Never be satisfied” in the context of innovation and driving your teams forward. In that context, this is totally fine, but when this mindset gets blindly used in the software engineering low level details, then you could be constantly  re-writing  code without taking the effort to truly understand it. Is this the right thing for business? Is it costly? Re-factoring  means that you took time to understand what you have, and you are improving it. On the other hand,  re-writing  does not necessarily mean that you took the time to understand the low-level details; it may mean that you went back to requirements and decided to re-write it without trying to understand the low-level details. There could be something in those details that you need to know so you don’t make the same mistake again in the process of re-writing it. At the end of the day, if you are constantly re-writing code, t...