The Art of logging

When you have issues in production environment, application logs are the most useful assets to perform investigations and root cause analysis. In order to get the most value from your logs, it's important to follow some best practices when producing logs.
Learn how to write meaningful log messages that will help you to debug and solve issues in production
Log events using key/pair values
You should consider your application logs as a database of events that you'll query for investigating production issues. So instead of recording an event like this:
2020-11-11 15:00:00 Sending email to customers. 34 email sent, 3 failed
record it using the following format:
Date='2020-11-11 15:00:00' Action=Email SentEmails=4 FailedEmails=3
This format is more suitable for searches, you can easily filter events by Action
, SendEmails
or FailedEmails
.
What to record in your logs?
Here is an example to HTTP calls to a web-service that recommends Stories to readers.
GET /v1/story/recommend?StoryId=277&limit=9
Here is the associate log record:
2020-11-11 13:52:12 Type=INFO HttpRequestMethod=GET HttpRequestType=application/json HttpRequestURI=/v1/story/recommend HttpResponseStatus=200 Latency=104 Limit=9 Recommendation="284 232 117 113 386 360 380 325 382 370 376 369 379" RecommendationAlgorithm=CategoryRecommendation StoryId=277 Success=true X-Message-ID=d22b1310-d87b-48e8-b8a3c2f9ab659108 X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Client-ID=myapp-android
This log record helps us to answer the following questions
When did the event occured?
Each log record should have a timestamp, normalized to UTC timezone. It helps understanding the sequence of events.
Who initiates the call?
These information help understanding the origin of the request. This includes:
- The HTTP Referer.
- The client's device information. Ex: User-Agent or phone's device ID
- The 3rd party caller (application or service). Ex: See
X-Client-ID
in our example above, than indicate the call was initiated by the Android app.
What are the input?
These information help understanding the data sent to initiate the request. The input recorded should include:
- The URL called. Ex:
HttpRequestURI
- The parameters. Ex:
StoryId
,Limit
- For
POST
methods, include the request body data serialized to key/pair values
What's the output?
There information help understanding what was returned to the caller. The output recorded should includes:
- HTTP status code to help understanding if the request was successful or not. Ex:
HttpResponseStatus
- The return value(s) of the call. Ex:
Recommendation
containing the IDs of the stories recommended
Why that decision was made?
Add any intermediary information to help understanding how the system behaved, this is very useful for finding bugs. In our exemple, RecomendationAlgorithm
indicates which algorithm was used to perform the recommendation
Was the it successful?
Having a boolean field like Success
help us to easily filter successful or unsuccessful events.
For unsuccessful events, you should also record the reason why the failure occured:
- The error code
- The description of the failure
- The exception class name exception
- The exception stack trace so that we can identify to source of the error in the source code.
Here an example of log record on failure:
2020-11-11 13:52:12 Type=INFO ErrorCode=story_not_found ErrorMessage="The Story 315 was not wound" ExceptionClass=com.x.y.NotFoundException HttpRequestMethod=GET HttpRequestURI=/v1/story/recommend HttpResponseStatus=404 Latency=5 Limit=9 StoryId=277 Success=false X-Message-ID=d22b1310-d87b-48e8-b8a3c2f9ab659108 X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Client-ID=myapp-android Unexpected error 2020-11-11 13:52:12 com.x.y.NotFoundException: story_not_found 2020-11-11 13:52:12 at com.x.y.story.service.StoryService.findById(StoryService.kt:82) 2020-11-11 13:52:12 at at com.x.y.story.controller.StoryController.find()) ....
How long did it take?
Always add in your trace the latency of each operation. This will help you to identify bottleneck on your system. I usually record the latencies in milliseconds. Ex: See Latency
field
Other Considerations for Logging
Use English
Use english language, because some tools or console have trouble printing certain unicode characters (Ex: accented characters)
Naming Convention
As you record your log events using key/pair values, make sure to adopt the same naming convention for your fields: use either camel case (Ex: UserId
), snake case (Ex: user_id
) or kebab case (Ex: user-id
)
Centralize your logs
Consider a centralized server where to aggregate all the the logs your applications. Using log aggregator like Splunk, Papertrail or Logstash will provide you additional query and monitoring capabilities.
Correlate log events
In the following exemple, a call from the mobile phone to the ServiceA
will initiate calls to ServiceB
and ServiceC
.

This will generate the following logs
service-a.log Date="Feb 05 12:33:16" ... X-Client-ID=myapp-android X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Message-ID=4a069cd9-7442-460f-bfed-11111111
service-b.log Date="Feb 05 12:34:00" ... X-Client-ID=service-a X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Message-ID=4a069cd9-7442-460f-bfed-22222222
service-c.log Date="Feb 05 12:34:10" ... X-Client-ID=service-a X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Message-ID=4a069cd9-7442-460f-bfed-22222222
- The field
X-Correlation-ID
is used to group all the log events related to the same original call. - The field
X-Message-ID
is a globally unique identifier assigned to each log record
As always, it’s easier said than done. Logging requires some practice to master this, but once you have, you will know instinctively when and how you should log, and when you shouldn’t. As you perform investigations in production, refine and improve your logs to help your team in the future.