Good, descriptive logs are an essential part of any code that makes it to production. But once the deliverable leaves your laptop, how much do you really look at them?
Sure, when catastrophe hits, they provide much of the context needed to understand the problem, but if everything just works (or so you think), do you look at them? Monitoring tools do (hopefully), but even they are configured to look only for specific signs of disaster, not your everyday anomalies. And when will those checks be added? Yup, soon after a failure: as we all know, no root cause analysis is complete without a list of additional monitoring tasks.
One of our security researchers developed a solution. Here’s what he had to say:
What I’ve implemented is a touch-free and personalized notification system that takes you and your fellow developers a couple of steps closer to the production runtime. Those warning and error logs? Delivered to you in near real time, or as a (daily) digest shedding light on what really goes on in that ant farm you’ve just built. Moreover, by using simple code annotations, log messages can be sent to a Slack channel, enabling group notifications and collaboration. Your production environment starts talking to ya.
The system gives developers visibility into the production runtime, resulting in quicker bug resolution, finer tuning of runtime behavior, and a better understanding of how the service behaves.
Oh, and I named it Dice – Dice Is Cato’s Envoy. It was a fun project to code and is a valuable tool we use.
How does it work then?
The first step is building a list of log messages extracted from the source code and a matching list of interested parties. These can be explicitly stated in a comment following the log line in the code, or automatically deduced by looking in the source control history for the last author of the line (i.e. git blame). Yes, I can hear you shouting that the last one on the blame list isn’t necessarily the right developer, and you’d be right. However, in practice this isn’t a major problem, and it can be addressed by explicit code annotations.
Equipped with this list of messages and authors, the system now scans the logs, looking for those messages. We decided to focus on Warning and Error messages, as they are usually used to signal anomalies or plain faults. However, when an explicit annotation is present in the code, we process the message regardless of its log level.
| Code line | Alerting effect |
|---|---|
| INFO_LOG("hello cruel world"); // #hello-worlders | Channel to which messages should be sent |
| WARN_LOG("the sky is crying"); // @firstname.lastname@example.org | Explicit mentioning of the developer (Elmore) |
| ERROR_LOG("it hurts me too"); | No annotation here, so blame information will be used (e.g. email@example.com) |
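The annotation rules in the table can be sketched as a small parser. This is a minimal illustration in Python, not Dice's actual code: the regular expression and the names `LogSite`, `parse_log_site` and `should_process` are hypothetical, and the sketch assumes one log call per source line, a plain string literal as the first argument, and no nested parentheses in the call.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern: LEVEL_LOG("message"); optionally followed by
# a trailing comment that names a #channel or an @user.
LOG_CALL = re.compile(
    r'(?P<level>INFO|WARN|ERROR)_LOG\(\s*"(?P<message>[^"]*)"[^)]*\)\s*;'
    r'(?:\s*//\s*(?P<target>[#@]\S+))?'
)


@dataclass
class LogSite:
    level: str
    message: str
    target: Optional[str]  # "#channel", "@user", or None (fall back to blame)


def parse_log_site(line: str) -> Optional[LogSite]:
    """Return the log call found on a source line, or None if there is none."""
    m = LOG_CALL.search(line)
    if m is None:
        return None
    return LogSite(m.group("level"), m.group("message"), m.group("target"))


def should_process(site: LogSite) -> bool:
    """Annotated messages are always processed; others only at WARN or ERROR."""
    return site.target is not None or site.level in ("WARN", "ERROR")
```

When `target` is `None`, the recipient would come from the blame information instead; that lookup is left out of the sketch.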
Real time messages
Channel messages (as in the example above) are delivered as soon as they are detected, which we used to communicate issues in real time to developers and support engineers. This proved to be very valuable as it enabled us to do a system inspection during runtime, while the investigated issue was still occurring, dramatically lowering the time to resolution.
For example, we used channel messages to debug a particularly nasty IPsec configuration mismatch. The IPsec connection configuration is controlled by our client, and hence we could not debug issues in a sterile environment where we have full control over both ends of the configuration. With the immediate notifications, we were able to get the relevant information out of the running system.
Digests are also of great value, informing a developer of unexpected or even erroneous behavior. My code (and I guess yours also) has these “this can’t really happen” branches, where you just log the occurrence and get the hell out of the function. With Dice’s messages, I was able to know that these unimaginable acts of the Internet are actually more frequent than I imagined and should get special treatment rather than being disregarded as anomalies. Alerts are usually sent to users as a daily digest that groups identical messages together and reports, for each one, the number of occurrences, the servers on which it appeared, and the overall time frame.
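The grouping a digest performs can be sketched as follows. This is an illustrative Python sketch under assumptions, not the real implementation: the `(timestamp, server, message)` event shape and the names `DigestEntry` and `build_digest` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass
class DigestEntry:
    count: int = 0
    servers: Set[str] = field(default_factory=set)
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None


def build_digest(
    events: Iterable[Tuple[datetime, str, str]]
) -> Dict[str, DigestEntry]:
    """Group (timestamp, server, message) events by identical message text."""
    digest: Dict[str, DigestEntry] = defaultdict(DigestEntry)
    for ts, server, message in events:
        entry = digest[message]
        entry.count += 1
        entry.servers.add(server)
        # Track the overall time frame in which the message was seen.
        entry.first_seen = ts if entry.first_seen is None else min(entry.first_seen, ts)
        entry.last_seen = ts if entry.last_seen is None else max(entry.last_seen, ts)
    return dict(digest)
```

Each entry then carries what a digest line needs: how often the message occurred, on which servers, and between which two timestamps.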
Using Slack as the communication platform enables the system to exercise some judgment about notification delivery – developers asked for digests to be sent only when they are online and, in any case, not during the weekend, which is easy to accommodate. Furthermore, the ability to add interactive components into the messages opens the door for future enhancements described below.
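The delivery rule developers asked for is simple enough to express in a few lines. The sketch below is hypothetical: `should_send_digest` is an invented name, the `user_online` flag is assumed to come from whatever presence information the Slack integration exposes, and treating Saturday and Sunday as the weekend is an assumption that the `weekend_days` parameter lets you change.

```python
from datetime import datetime
from typing import FrozenSet

# datetime.weekday() numbers Monday as 0, so 5 and 6 are Saturday and Sunday.
DEFAULT_WEEKEND: FrozenSet[int] = frozenset({5, 6})


def should_send_digest(
    now: datetime,
    user_online: bool,
    weekend_days: FrozenSet[int] = DEFAULT_WEEKEND,
) -> bool:
    """Send a digest only to online users, and never on a weekend day."""
    return user_online and now.weekday() not in weekend_days
```

A team whose weekend falls on Friday and Saturday would pass `weekend_days=frozenset({4, 5})` instead.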
Useful as Dice is, it can be made even greater. Interactivity should be improved – many times notifications should be snoozed temporarily, till they are addressed in the code, or indefinitely, as they are just redundant. The right (or some definition of right) solution is usually to change the log level or remove the message entirely. However, the turnaround for this can be weeks, since we deploy new versions every two weeks, which makes it too cumbersome. A better way is to allow snoozing/disabling a particular message directly in Slack, via actions.
“It wasn’t me” claim many Sing Sing inmates and blamed developers – the automatically generated blame database may point to the wrong author, and the system should allow for an easy, interactive way of directing a particular message to its actual author. It can be achieved via code annotations, but again this is too slow. Slack actions and a list of blame overrides are a better approach.
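One way such snoozing could be tracked on the notification side is a small mute list keyed by message text. This is only a sketch of the idea: `SnoozeList` and its methods are hypothetical names, and wiring them to Slack actions is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


class SnoozeList:
    """Tracks messages muted temporarily (snoozed) or indefinitely (disabled)."""

    def __init__(self) -> None:
        # Maps message text to the time the snooze ends; None means "forever".
        self._until: Dict[str, Optional[datetime]] = {}

    def snooze(self, message: str, now: datetime, duration: timedelta) -> None:
        self._until[message] = now + duration

    def disable(self, message: str) -> None:
        self._until[message] = None

    def is_muted(self, message: str, now: datetime) -> bool:
        if message not in self._until:
            return False
        until = self._until[message]
        return until is None or now < until
```

Because the list lives outside the code base, a change takes effect immediately rather than waiting for the next deployment.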
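A blame override list could be consulted before the other sources when picking a recipient. The sketch below is an assumption about how that might look, not a description of Dice: `resolve_recipient` is a hypothetical function, and the precedence it applies (override, then code annotation, then blame) is one plausible choice.

```python
from typing import Dict, Optional


def resolve_recipient(
    message: str,
    annotation: Optional[str],
    overrides: Dict[str, str],
    blame: Dict[str, str],
) -> Optional[str]:
    """Pick who receives a message: override first, then annotation, then blame."""
    if message in overrides:
        return overrides[message]
    if annotation is not None:
        return annotation
    # May be None when the message is unknown to the blame database.
    return blame.get(message)
```

Putting overrides first means a correction made interactively wins over stale information in the source tree until the code itself is updated.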
Logs are essentially a read-only API of a system, yet they are mostly written in free form with no structural or longevity guarantees. At any point a developer can change the text and add or remove variable outputs from the messages. It is therefore hard to build robust systems that rely on message analysis. Dice, elegantly if I may say, avoids this induced complexity by shifting the attention to personalized and prompt delivery of messages directly to relevant parties, rather than feeding them into a database of some sort and relying on the monitoring team to notify developers of issues.