Log Rotation Nightmares: When /var/log/ Explodes at 3 AM

There are few experiences more visceral for a Linux system administrator than waking up to a notification that the root partition is at 100% capacity. It is almost always 3:00 AM, and it is almost always a runaway application log file that has consumed every remaining byte of storage in what feels like seconds.

The Anatomy of a Log Catastrophe

In a healthy system, log rotation is the silent guardian of disk space. Tools like logrotate ensure that files are compressed, purged, or moved to prevent them from hitting disk limits. However, when the configuration fails or an application misbehaves, the filesystem becomes a ticking time bomb. Common triggers for this nightmare include:

  • Misconfigured logrotate cron jobs failing to execute.
  • Applications logging at “DEBUG” or “TRACE” level in a production environment.
  • External log handlers or sidecars crashing and leaving behind massive, unmanaged text files.
  • Disk space being held by deleted files that are still open by a zombie process.
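
One way to tell which trigger you are facing is to catch the runaway writer in the act: sample file sizes twice, a few seconds apart, and diff the results. A minimal bash sketch (the directory and interval defaults are assumptions; point them at your own system):

```shell
#!/usr/bin/env bash
# Catch a runaway writer: sample log sizes twice and report files that grew.
# LOGDIR and INTERVAL are illustrative defaults, not prescriptions.
LOGDIR="${1:-/var/log}"
INTERVAL="${2:-5}"

snapshot() {
    # one "size-in-bytes path" line per regular file under $LOGDIR
    find "$LOGDIR" -type f -exec stat -c '%s %n' {} + 2>/dev/null | sort -k2
}

before=$(snapshot)
sleep "$INTERVAL"
after=$(snapshot)

# lines prefixed ">" are files whose size changed during the interval
diff <(echo "$before") <(echo "$after") | grep '^>' || echo "No growth in ${INTERVAL}s"
```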

Immediate Triage: The First Responders

When the disk is full, the system may refuse to spawn new processes, making even basic troubleshooting difficult. Your first goal is to free up space instantly. Start by identifying the largest files in your log directories.


# Find the largest files in /var/log
du -ah /var/log | sort -rh | head -n 10

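When the root filesystem is pinned at 100%, it also helps to confirm which mount is actually full, and a per-directory summary is cheaper than listing every file on a large tree. A quick sketch (the paths are the usual defaults):

```shell
# Which filesystem is actually full?
df -h /var/log

# Per-directory totals first -- faster than a full file listing on a big tree
du -sh /var/log/*/ 2>/dev/null | sort -rh | head -n 5
```
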
If you identify a massive file, do not simply run rm on it while the application is still actively writing to it. The process holds an open file descriptor, so the kernel will keep the data on disk, and the space will not be reclaimed until the process closes the file or is restarted. Instead, truncate the file to zero length:


# Truncate the file without deleting the handle
> /var/log/application/massive_log_file.log

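The same zeroing can be done with truncate(1), which is easier to use from scripts and under sudo (where `sudo > file` does not redirect as root). A minimal demo on a scratch file standing in for the runaway log:

```shell
# Demo on a scratch file (stand-in for the real log path above)
log=$(mktemp)
head -c 1048576 /dev/zero > "$log"    # simulate 1 MiB of log noise
truncate -s 0 "$log"                  # zero it without removing the inode
stat -c '%s' "$log"                   # size is now 0; a writer's fd stays valid
```

One caveat: a writer that did not open the file with O_APPEND keeps writing at its old offset, so the file can reappear with a large apparent size as a sparse file; the actual disk blocks are still reclaimed.
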
Finding the “Ghost” Files

Sometimes the disk usage doesn’t add up: you’ve deleted files, but the free space hasn’t returned. This is usually because a process still holds an open file descriptor to a deleted file. You can track these down with lsof (+L1 lists files with a link count below one, i.e. deleted but still open):


# Look for deleted files that are still consuming space
lsof +L1

Identify the Process ID (PID) from the output and restart the service. Once the process is terminated and restarted, the kernel will release the blocks, and your free space will return to normal.
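
The deleted-but-open behavior is easy to reproduce on a scratch file, and if lsof is not installed, /proc exposes the same information:

```shell
scratch=$(mktemp)
tail -f "$scratch" &        # stand-in for the app holding the log open
holder=$!
sleep 1                     # give tail time to open the file
rm "$scratch"               # gone from the directory tree...

# ...but the descriptor still pins the inode; /proc shows it marked "(deleted)"
# (the same information lsof +L1 reports, with no extra tools)
ls -l "/proc/$holder/fd" | grep deleted

kill "$holder"              # closing the last fd is what frees the blocks
```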

Preventative Measures

To avoid a repeat performance, you must move from reactive firefighting to proactive configuration management:

  • Use copytruncate: If an application cannot close and reopen its log file during rotation, use the copytruncate directive in your logrotate config. Be aware that lines written between the copy and the truncate can be lost, so prefer signal-based reopening (a postrotate script) when the application supports it.
  • Implement Log Limits: Configure your application logging frameworks (like Log4j or Python’s logging module) to enforce their own file size limits as a secondary fail-safe.
  • Monitoring: Set up an external monitoring agent (like Prometheus/Node Exporter) to alert you when your disk usage hits 80%, long before the 100% threshold is reached.
  • Dedicated Log Partition: If possible, mount /var/log on its own partition. This ensures that even if the logs go rogue, the rest of the OS, including the boot sequence and critical binaries, remains functional.
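
Putting the first two bullets together, a logrotate stanza for a hypothetical application might look like this (the path, rotation count, and size cap are illustrative, not drop-in values):

```
# /etc/logrotate.d/myapp  -- hypothetical app; adjust path and limits
/var/log/application/*.log {
    daily
    rotate 7              # keep a week of history
    maxsize 100M          # rotate early if a file blows past 100 MB
    compress
    delaycompress
    missingok
    notifempty
    copytruncate          # only for apps that can't reopen their log file
}
```

A dry run with logrotate -d /etc/logrotate.d/myapp prints what would be rotated without touching anything, which is a cheap way to validate the stanza before the next 3 AM.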

Log management is rarely the most glamorous part of the job, but it is one of the most critical. A well-tuned logrotate configuration is the difference between a restful night’s sleep and a frantic scramble to keep your production servers alive.
