The Art of logging

Pragmatic Nerdz
Feb 11, 2021 - 4 Minutes

When you have issues in production environment, application logs are the most useful assets to perform investigations and root cause analysis. In order to get the most value from your logs, it's important to follow some best practices when producing logs.

Learn how to write meaningful log messages that will help you to debug and solve issues in production

Log events using key/pair values

You should consider your application logs as a database of events that you'll query for investigating production issues. So instead of recording an event like this:

2020-11-11 15:00:00 Sending email to customers. 34 email sent, 3 failed

record it using the following format:

Date='2020-11-11 15:00:00' Action=Email SentEmails=4 FailedEmails=3

This format is more suitable for searches, you can easily filter events by Action, SendEmails or FailedEmails.

What to record in your logs?

Here is an example to HTTP calls to a web-service that recommends Stories to readers.

GET /v1/story/recommend?StoryId=277&limit=9

Here is the associate log record:

2020-11-11 13:52:12 Type=INFO HttpRequestMethod=GET
HttpRequestType=application/json HttpRequestURI=/v1/story/recommend
HttpResponseStatus=200 Latency=104 Limit=9
Recommendation="284 232 117 113 386 360 380 325 382 370 376 369 379"
RecommendationAlgorithm=CategoryRecommendation StoryId=277 Success=true

This log record helps us to answer the following questions

When did the event occured?

Each log record should have a timestamp, normalized to UTC timezone. It helps understanding the sequence of events.

Who initiates the call?

These information help understanding the origin of the request. This includes:

  • The HTTP Referer.
  • The client's device information. Ex: User-Agent or phone's device ID
  •  The 3rd party caller (application or service). Ex: See X-Client-ID in our example above, than indicate the call was initiated by the Android app.

What are the input?

These information help understanding the data sent to initiate the request. The input recorded should include:

  • The URL called. Ex:HttpRequestURI
  • The parameters. Ex: StoryId, Limit
  • For POST methods, include the request body data serialized to key/pair values

What's the output?

There information help understanding what was returned to the caller. The output recorded should includes:

  • HTTP status code to help understanding if the request was successful or not. Ex: HttpResponseStatus
  • The return value(s) of the call. Ex: Recommendation containing the IDs of the stories recommended
Receive my Stories your e-mail inbox as soon as I publish them.
Subscribe to my Blog

Why that decision was made?

Add any intermediary information to help understanding how the system behaved, this is very useful for finding bugs. In our exemple, RecomendationAlgorithm indicates which algorithm was used to perform the recommendation

Was the it successful?

Having a boolean field like Success help us  to easily filter successful or unsuccessful events.

For unsuccessful events, you should also record the reason why the failure occured:

  • The error code
  • The description of the failure
  • The exception class name exception
  • The exception stack trace so that we can identify to source of the error in the source code.

Here an example of log record on failure:

2020-11-11 13:52:12 Type=INFO ErrorCode=story_not_found 
ErrorMessage="The Story 315 was not wound" ExceptionClass=com.x.y.NotFoundException
HttpRequestMethod=GET HttpRequestURI=/v1/story/recommend
HttpResponseStatus=404 Latency=5 Limit=9 StoryId=277 Success=false

Unexpected error
2020-11-11 13:52:12 com.x.y.NotFoundException: story_not_found
2020-11-11 13:52:12 at com.x.y.story.service.StoryService.findById(StoryService.kt:82)
2020-11-11 13:52:12 at at com.x.y.story.controller.StoryController.find())

How long did it take?

Always add in your trace the latency of each operation. This will help you to identify bottleneck on your system. I usually record the latencies in milliseconds. Ex: See Latency field

Other Considerations for Logging

Use English

Use english language, because some tools or console have trouble printing certain unicode characters (Ex: accented characters)

Naming Convention

As you record your log events using key/pair values, make sure to adopt the same naming convention for your fields: use either camel case (Ex: UserId), snake case (Ex: user_id) or kebab case (Ex: user-id)

Centralize your logs

Consider a centralized server where to aggregate all the the logs your applications. Using log aggregator like Splunk, Papertrail or Logstash will provide you additional query and monitoring capabilities.

Correlate log events

 In the following exemple, a call from the mobile phone to the ServiceA will initiate calls to ServiceB and ServiceC.

Call orchestration between multiple services
Call orchestration between multiple services

This will generate the following logs


Date="Feb 05 12:33:16" ... X-Client-ID=myapp-android X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Message-ID=4a069cd9-7442-460f-bfed-11111111

Date="Feb 05 12:34:00" ... X-Client-ID=service-a X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Message-ID=4a069cd9-7442-460f-bfed-22222222

Date="Feb 05 12:34:10" ... X-Client-ID=service-a X-Correlation-ID=4a069cd9-7442-460f-bfed-b9e24efaa1d8 X-Message-ID=4a069cd9-7442-460f-bfed-22222222
  • The field X-Correlation-ID  is used to group all the log events related to the same original call. 
  • The field X-Message-ID is a globally unique identifier assigned to each log record

As always, it’s easier said than done. Logging requires some practice to master this, but once you have, you will know instinctively when and how you should log, and when you shouldn’t. As you perform investigations in production, refine and improve your logs to help your team in the future.

Production Issue