LinuxCon: How Facebook Monitors Hundreds of Thousands of Servers with Netconsole

Facebook is among the largest users of Linux and open source in the world, as the company relies on them to run and maintain its operations.

At the LinuxCon conference here, Calvin Owens, production engineer at Facebook, detailed how the social networking giant makes use of the open-source netconsole tool to identify potentially problematic servers.

Owens said Facebook processes billions of netconsole messages every day. Netconsole has been part of the Linux kernel since at least 2001.

The original kernel documentation for the feature explains that the netconsole module logs kernel printk messages over UDP, allowing debugging of problems where disk logging fails and serial consoles are impractical.
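Because netconsole simply sends printk text as UDP datagrams, the receiving side can be very small. The following is a minimal sketch of a collector, not Facebook's implementation: the function names are hypothetical, and port 6666 is assumed here because it is the default target port in the kernel's netconsole documentation.

```python
import socket

# Assumed default netconsole target port; change it to match the sender's configuration.
NETCONSOLE_PORT = 6666


def open_listener(host="0.0.0.0", port=NETCONSOLE_PORT):
    """Bind a UDP socket that netconsole senders can log to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_message(sock, bufsize=65535):
    """Block until one datagram arrives; return (sender_ip, text)."""
    data, (sender_ip, _sender_port) = sock.recvfrom(bufsize)
    # Kernel messages are not guaranteed to be valid UTF-8, so replace bad bytes.
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return sender_ip, text


def run():
    """Print every message received, prefixed with the sending box's address."""
    listener = open_listener()
    while True:
        sender_ip, text = read_message(listener)
        print(f"{sender_ip}: {text}")
```

Calling `run()` on a logging host would print each kernel message alongside the address of the box that sent it. UDP gives no delivery guarantee, so a production collector would need to account for lost datagrams.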

Many organizations choose syslog as a way to track potential server errors, but Owens said that because kernel bugs can crash a machine, syslog doesn’t help nearly as much as netconsole.

He added that Facebook had a system in the past for monitoring that used syslog-ng, but it was less than 60 percent reliable. In contrast, Owens stated netconsole is highly scalable and can handle enormous log volume with greater than 99.99 percent reliability.

“Netconsole is fanatically easy to deploy,” Owens said. “Configuration is independent of the hardware and by definition you already have a network.”

What Facebook Looks for in Server Error Messages

Facebook looks for several kinds of error messages that could indicate a broader server issue. Among them is what is known as a “soft lockup,” an error message triggered when a work queue locks up a CPU for 20 seconds or more.

“A soft lockup is always a bug and something that should be fixed,” Owens said.

Facebook also looks for page allocation failures and hung tasks, which can be triggered on severely overloaded boxes. Additionally, Owens said Facebook looks for filesystem errors to help find issues in storage hardware.
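As a rough illustration of how such messages could be sorted into categories, the sketch below matches log lines against a few regular expressions. The category names and patterns are assumptions chosen for this example, based on commonly seen kernel message wording; they are not taken from Owens' talk and would need tuning for a given kernel version and filesystem.

```python
import re

# Illustrative patterns only; real kernel message wording varies by version.
PATTERNS = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s"),
    "page_alloc_failure": re.compile(r"page allocation failure"),
    "hung_task": re.compile(r"INFO: task .+ blocked for more than \d+ seconds"),
    "filesystem_error": re.compile(r"EXT4-fs error|BTRFS error|I/O error"),
}


def classify(line):
    """Return the first matching category name, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None
```

For example, `classify("INFO: task jbd2/sda1-8:345 blocked for more than 120 seconds.")` returns `"hung_task"`, while an ordinary informational line returns `None`.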

Netconsole has also been helpful in enabling Facebook to find what Owens referred to as “crazy hardware.”

“When you have enough hardware, you get some bad eggs,” Owens said.

He added that often the majority of kernel error messages (known as oopses) will come from one box.

So rather than looking only at the overall volume of logs, Facebook uses netconsole to monitor the number of boxes that emitted a given message per minute, not the raw count of error messages. As such, if a large number of boxes have errors, Facebook can catch that quickly.
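The difference between counting messages and counting boxes can be shown with a short sketch. It assumes each event has already been reduced to a hypothetical `(unix_timestamp, host, message_key)` tuple; the function names and the threshold are illustrative and do not describe Facebook's actual pipeline.

```python
from collections import defaultdict


def hosts_per_message(events):
    """Count distinct hosts per (minute, message_key) bucket.

    events: iterable of (unix_timestamp, host, message_key) tuples.
    """
    seen = defaultdict(set)
    for timestamp, host, key in events:
        minute = int(timestamp // 60)
        # A set ignores repeats, so one noisy box counts only once per bucket.
        seen[(minute, key)].add(host)
    return {bucket: len(hosts) for bucket, hosts in seen.items()}


def widespread(counts, min_hosts):
    """Return the buckets in which at least min_hosts distinct boxes reported."""
    return sorted(bucket for bucket, n in counts.items() if n >= min_hosts)
```

With this approach, a single faulty box that emits a thousand oopses in a minute contributes a count of one, while the same message appearing once on each of three boxes contributes a count of three and is the more likely sign of a fleet-wide problem.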

To help more server and data center administrators benefit from netconsole, Owens has publicly posted the information on how to set up a netconsole-based monitoring environment.

“We would love to see more people using netconsole; it has been fantastic and useful for us, and we’d like to build a community around it,” Owens said.

[Source: Serverwatch]