Notes from Explaining DTrace

 

These are my notes from Paolo Fragomeni's DTrace talk, given July 3, 2012 at NodeConf. Paolo is the CTO of Nodejitsu. You can follow him at github.com/hij1nx or twitter.com/hij1nx.

Everything is flawed, but everything is going to be OK if we are prepared to anticipate the problems and admit that we will never have perfection.

Horology is the art and science of measuring time. Let's use it to take our ideas out of the software context for a moment. There are people who spend hundreds of thousands of dollars on a wristwatch. In watchmaking, the mechanism inside a timepiece is called a movement. It tells hours, minutes and seconds, and that's it. Any function beyond that is called a complication. A complicated movement can include several hundred to a thousand small pieces that move and interact with each other. Some of the screws inside these movements are the size of a grain of sand. They're also very expensive. One Polish watchmaker's company recently manufactured a watch which sold for $5.5 million. People are willing to pay this money because the items are all mechanical and artisanal; it's a craft. They're also very old.

There's a lot of risk in taking apart complex systems. Once you build a complex system, deploy it to a server, and it's running with real-life data and interactions, there's no way you can take it apart. It's very difficult to interact with living systems without incurring some side effect. The solution is to build permanent utility into the intersections where interesting data points occur. In the context of code, this means permanent code. But when we think of doing this, we think of instrumentation. Generally the way that's done is to put a printf or console.log() in your code.

But there's a problem with that. Instrumenting your system using something that's I/O bound is going to incur an extremely high cost, especially in a very hot path. If you tried to put a log statement inside the Node event emitter, for example, you wouldn't be able to run your program at scale. There are lots of programs that try to give you visibility into your running programs, e.g. the resources your programs are using. For example, you can use top, but then you have an arbitrary process watching all the other processes. You get pollution. Tools that take the approach of "one arbitrary process watching another one" don't really work. We need a way to holistically instrument the system.
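
To make the hot-path cost concrete, here's a rough sketch of my own (not from the talk). It emits the same events through a Node EventEmitter twice: once with a bare listener, and once with a listener that does a synchronous write per event, standing in for console.log(). The event count, the runHotPath helper, and writing to /dev/null (so it assumes a Unix-like system) are all my assumptions, and the timings will vary by machine, so treat the numbers as illustrative only.

```javascript
const { EventEmitter } = require("events");
const fs = require("fs");

const EVENTS = 100000; // hypothetical workload size, chosen arbitrarily

// Emit EVENTS "tick" events and time how long the loop takes.
// `instrument` is an optional function called inside the listener (the hot path).
function runHotPath(instrument) {
  const emitter = new EventEmitter();
  let handled = 0;
  emitter.on("tick", (n) => {
    if (instrument) instrument(n);
    handled += 1;
  });
  const start = process.hrtime.bigint();
  for (let i = 0; i < EVENTS; i++) emitter.emit("tick", i);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { handled, elapsedMs };
}

// Baseline: no instrumentation in the listener.
const bare = runHotPath(null);

// Instrumented: one synchronous write (a system call) per event.
// /dev/null is an assumption to keep the output quiet; a real log file or
// terminal would typically cost more.
const fd = fs.openSync("/dev/null", "w");
const logged = runHotPath((n) => fs.writeSync(fd, `tick ${n}\n`));
fs.closeSync(fd);

console.log(`bare listener:   ${bare.elapsedMs.toFixed(1)} ms`);
console.log(`logged listener: ${logged.elapsedMs.toFixed(1)} ms`);
```

Both runs handle the same number of events; the only difference is the per-event write sitting in the hot path.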

When an operating system is instrumented with DTrace, there are thousands of probes throughout the OS, including at the kernel level. When I create a D script, dtrace turns it into bytecode and sends it into the kernel. From kernel space I'm able to observe any I/O calls or system calls and get a vertical slice from the kernel to userland, without incurring the cost of I/O operations at arbitrary locations throughout my code.

To be less hand-wavy about how this works, libdtrace takes the bytecode and makes it available to the kernel. Let's take a look at the D scripts that are on your OS right now. Run ls -l /usr/bin/*.d to list them. Also, take a look at one using vim /usr/bin/iosnoop. D, the DTrace scripting language, is not Turing-complete, so if building a script and putting it in the kernel worries you, the architects of DTrace have been very careful to ensure that you can't do any damage. Let's run iosnoop!

A DTrace script is composed of a few different parts. The probe says what you're interested in looking for. You can type dtrace -l on the command line to list the huge number of probes there are on your system. Optionally, there's a predicate, which is really your only form of flow control. (For obvious reasons, you don't want to put flow control inside the kernel.)
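
To show how a probe and a predicate fit together, here's a toy model of my own in JavaScript (not from the talk). It is not how DTrace is implemented: real clauses are compiled to bytecode and run in the kernel. The createTracer helper, the context object, and the simulated firings are all hypothetical. The comment at the top shows the kind of D clause the model imitates.

```javascript
// Toy model only. It imitates the shape of a D clause such as:
//   syscall::read:entry /execname == "node"/ { @reads = count(); }
// which is: a probe, an optional predicate between slashes, then an action.

function createTracer() {
  const clauses = [];
  return {
    // Register a clause; pass null as the predicate to match every firing.
    enable(probe, predicate, action) {
      clauses.push({ probe, predicate, action });
    },
    // Called when a probe fires; runs the action of each matching clause.
    fire(probe, context) {
      for (const clause of clauses) {
        if (clause.probe !== probe) continue;
        // The predicate is the only "flow control": it gates the action.
        if (clause.predicate && !clause.predicate(context)) continue;
        clause.action(context);
      }
    },
  };
}

const tracer = createTracer();
let reads = 0;

tracer.enable(
  "syscall::read:entry",            // probe: what we're interested in
  (ctx) => ctx.execname === "node", // predicate: only node processes
  () => { reads += 1; }             // action: count the firing
);

// Simulated probe firings (made-up data, purely for illustration).
tracer.fire("syscall::read:entry", { execname: "node" });
tracer.fire("syscall::read:entry", { execname: "vim" });
tracer.fire("syscall::write:entry", { execname: "node" });
tracer.fire("syscall::read:entry", { execname: "node" });

console.log(`reads by node: ${reads}`); // prints "reads by node: 2"
```

Only the two read firings from node pass both the probe match and the predicate; the vim read and the node write are filtered out before the action runs.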
