All Your Base Are Belong To Us

Sasha >>= Sela >>= Google

Sasha Goldshtein — Thu, 03 May 2018 13:13:44 +0000

Looking back to 2007, I don’t think I imagined this blog would survive this far. I had an absolute blast writing here and sharing my thoughts and rants with you. Just to be clear, my blog is not going anywhere — it’s just a moment of reminiscence for me as I share a bit of personal news.

Today was my last day at Sela after 11.5 years. If you follow my blog, or if you just check out the archives section, you’ll see that I did a lot of different things over the years. From the deep internals of the .NET runtime on Windows, through cloud services like Azure Media Services, all the way to Linux containers and Java performance optimization, I kept changing focus every year or two, and that’s part of what kept me going for so long: my managers and the entire organization had my back for whatever experiments, ideas, or directions I wanted to pursue at the moment.

Sela was an amazing place to work at, and I was constantly surrounded by top-notch technology experts, good friends and colleagues, and incredibly outgoing and welcoming people, with a very big overlap between these groups.

My next position is a software engineering role at Google Research in Tel-Aviv, where I will be working on machine learning solutions. I don’t know when and where I’ll be able to write more, but I definitely look forward to sharing some of it with you in the future if I can.

Thanks for following me so far, and please feel free to stay in touch — I know some of you are decade-long readers of this blog. My DMs on Twitter are open (@goldshtn), and you can also shoot me an email (same handle, @gmail).

Dynamic Tracing of .NET Core Methods

Sasha Goldshtein — Thu, 08 Feb 2018 07:32:37 +0000

tl;dr I wrote a simple proof-of-concept tool called place-probe.py which helps place dynamic tracepoints on .NET methods. For example: place-probe.py $PID 'System.Threading.Thread::Sleep'.

Dynamic tracing is one of the Linux diagnostics superpowers. By adding dynamic tracepoints on arbitrary functions across the system, you can diagnose a variety of “impossible” bugs and performance problems on a live production application without having to add instrumentation, rebuild, and restart. The underlying kernel mechanism that makes dynamic tracing possible is called uprobes (for userspace) and kprobes (for kernel functions).

Unfortunately, uprobes can only be placed on code that is backed by an on-disk image, which rules out code that was generated at runtime. This precludes runtimes like the JVM or CLR from using uprobes directly, because Java bytecode and CLR intermediate language instructions are compiled to machine code on the fly, and the resulting code is not backed by a disk image.

But the CLR has a trick up its sleeve: ahead-of-time compilation. On Windows, this is known as NGen, and the .NET Core cross-platform mechanism is called CrossGen. This is a tool that invokes the JIT compiler (libclrjit.so) on an assembly and stores the compilation results in a native image, which contains machine code instructions. These native images are then loaded into memory and executed directly, and because they are backed by a disk image, they can be traced with dynamic tracepoints!

The actual work of placing a dynamic probe on a CrossGen-compiled image is the following. You need the method’s offset from the image base, and then you place the probe with something like:

perf probe -x /path/to/MyImage.dll --add 0xbadcafe

The only problem is finding the offset that corresponds to a given managed method. The general approach is as follows:

Use the crossgen command-line tool to generate debug information for all the CrossGen-compiled assemblies. This produces .map files in a simple format that contains the method offset and name.
Find the desired managed method in the .map files. The map entry will look like the following, where the offset (in the first column) is the offset from the base address where the native image is loaded (let’s call it $METHODOFFSET):

0000000000020D70 36 instance void [app] app.Employee::Work()

Find the native assembly’s load address and first executable section in /proc/$PID/maps. We need the offset of the executable section from the assembly’s load address (let’s call it $EXEOFFSET), and the offset within the on-disk image ($DISKOFFSET). Here’s an example for System.Console.dll – the executable section starts at 7f7e038a1000, while the first section is at 7f7e03880000, so the difference is 0x21000; and the on-disk offset for the executable section is the third column, which is 0x1000.

7f7e03880000-7f7e03881000 r--p 00000000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e03890000-7f7e03892000 rw-p 00000000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e038a1000-7f7e038cd000 r-xp 00001000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e038dc000-7f7e038dd000 r--p 0002c000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll

Now, compute $PROBEOFFSET = $METHODOFFSET - $EXEOFFSET + $DISKOFFSET. This is the offset that we need to place the dynamic probe on in order to trace the managed method.
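
The steps above can be sketched in Python. This only illustrates the arithmetic and is not the code of place-probe.py: the helper names (parse_map_entry, parse_maps, probe_offset) are made up for this example, the method offset 0x25D70 is hypothetical, and treating the second .map column as something we can ignore is an assumption.

```python
from typing import List, NamedTuple, Tuple

# The System.Console.dll mappings from the /proc/$PID/maps example above.
SAMPLE_MAPS = """\
7f7e03880000-7f7e03881000 r--p 00000000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e03890000-7f7e03892000 rw-p 00000000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e038a1000-7f7e038cd000 r-xp 00001000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e038dc000-7f7e038dd000 r--p 0002c000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
"""


class Mapping(NamedTuple):
    start: int        # virtual address where the mapping begins
    perms: str        # e.g. "r-xp"
    file_offset: int  # offset of the mapping within the on-disk image


def parse_map_entry(line: str) -> Tuple[int, str]:
    """Split a crossgen .map line into (method offset, method signature)."""
    offset, _second_column, signature = line.split(None, 2)
    return int(offset, 16), signature.strip()


def parse_maps(maps_text: str, image_path: str) -> List[Mapping]:
    """Collect the mappings of one image from /proc/$PID/maps text."""
    mappings = []
    for line in maps_text.splitlines():
        fields = line.split()
        # Assumption: the image path contains no spaces.
        if len(fields) == 6 and fields[5] == image_path:
            start = int(fields[0].split("-")[0], 16)
            mappings.append(Mapping(start, fields[1], int(fields[2], 16)))
    return mappings


def probe_offset(method_offset: int, mappings: List[Mapping]) -> int:
    """$PROBEOFFSET = $METHODOFFSET - $EXEOFFSET + $DISKOFFSET."""
    load_address = min(m.start for m in mappings)
    exe = min((m for m in mappings if "x" in m.perms), key=lambda m: m.start)
    exe_offset = exe.start - load_address
    return method_offset - exe_offset + exe.file_offset
```

For the mappings shown earlier, a hypothetical method at offset 0x25D70 works out to 0x25D70 - 0x21000 + 0x1000 = 0x5D70, which is the value you would pass to perf probe --add.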

The process above is encapsulated in a POC tool I wrote called place-probe.py, which performs these computations and places the probe for you, or prints the required command if given the --dry-run switch. Here’s a simple example:

$ ./place-probe.py $(pidof app) 'System.Threading.Thread::Sleep(int32)'
Added new event:
  probe_System:abs_4d6610 (on 0x4d6610 in /home/ubuntu/dotnet/out/System.Private.CoreLib.dll)

You can now use it in all perf tools, such as:

    perf record -e probe_System:abs_4d6610 -aR sleep 1

Added new event:
  probe_System:abs_5920 (on 0x5920 in /home/ubuntu/dotnet/out/System.Threading.Thread.dll)

You can now use it in all perf tools, such as:

    perf record -e probe_System:abs_5920 -aR sleep 1

$ sudo perf record -e probe_System:* -ag -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.136 MB perf.data (20 samples) ]

$ sudo chown $USER perf.data
$ perf script | head
Failed to open /home/ubuntu/dotnet/out/System.Threading.Thread.dll, continuing without symbols
Failed to open [kernel.kallsyms], continuing without symbols
Failed to open /home/ubuntu/dotnet/out/System.Private.CoreLib.dll, continuing without symbols
app 29891 [001] 154218.288270: probe_System:abs_5920: (7f7e03855920)
                5920 [unknown] (/home/ubuntu/dotnet/out/System.Threading.Thread.dll)
              256f07 CallDescrWorkerInternal (/home/ubuntu/dotnet/out/libcoreclr.so)
              167ce0 MethodDescCallSite::CallTargetWorker (/home/ubuntu/dotnet/out/libcoreclr.so)
              278c03 RunMain (/home/ubuntu/dotnet/out/libcoreclr.so)
              278ea3 Assembly::ExecuteMainMethod (/home/ubuntu/dotnet/out/libcoreclr.so)
               aa3fb CorHost2::ExecuteAssembly (/home/ubuntu/dotnet/out/libcoreclr.so)
               84dd6 coreclr_execute_assembly (/home/ubuntu/dotnet/out/libcoreclr.so)
               8a433 coreclr::execute_assembly (/home/ubuntu/dotnet/out/libhostpolicy.so)
               7f0d8 run (/home/ubuntu/dotnet/out/libhostpolicy.so)

$ sudo perf probe --del=*

To use this with your own application binaries (and not just CrossGen-compiled .NET Core assemblies), run CrossGen on them. Here’s an example that assumes you’ve used dotnet publish --self-contained such that all .NET dependencies are in the out directory:

crossgen /Platform_Assemblies_Paths out out/app.dll

After doing this, you can replace the original out/app.dll with the generated out/app.ni.dll (or out/app.ni.exe for the main executable) and use place-probe.py on that binary.

Oh, and where does CrossGen come from? You can either build it from source, or download it from the .NET Core NuGet packages. My dotnet-mapgen-v2.py script can help, among other things, with downloading CrossGen automatically and generating the required map files.

Getting Stacks for LTTng Events with .NET Core on Linux

Sasha Goldshtein — Tue, 06 Feb 2018 08:08:20 +0000

On Windows, .NET contains numerous very useful ETW events, which can be used for tracing garbage collections, assembly loading, exceptions thrown, object allocations, and other interesting scenarios. All events can come with a stack trace, which helps understand where they’re coming from. In fact, I’d say for some events, not getting the stack trace means the event is completely useless — e.g. what good is the ExceptionThrown event if you don’t have the exception stack trace?

On Linux, .NET Core doesn’t use ETW (Event Tracing for Windows, ya know). It uses LTTng instead, which is an awesome tracing framework but doesn’t have stack trace support for userspace events. But I think we can hack around it. Specifically, all LTTng events used by .NET Core are fired through a set of auto-generated functions, whose names start with FireEtXplat and EventXplatEnabled followed by the event name (e.g. EventXplatEnabledGCAllocationTick). If we trace these functions using standard dynamic tracing (uprobes) with ftrace, perf, or BPF, we don’t get the event payload (which can also be quite important), but we do get the stack trace. If we only need the event count (a rough replacement for Windows performance counters) or the stack traces, we don’t have to create an LTTng session and record the events, which can also help lower the overhead. The downside is relying on hacky internal details, which can change at any moment, but that’s the nature of dynamic tracing.

Here’s a simple demo. Suppose you know from looking at the heap statistics or the GC LTTng events that you have lots of garbage collections, and would like to reduce the object allocations in your app. To do so, you have to figure out where the allocations are coming from. The GCAllocationTick event can tell you roughly which objects you’re allocating by using a low-overhead sampling approach, but it doesn’t tell you where the allocations are coming from, which is quite important. What we’re going to do, then, is trace the EventXplatEnabledGCAllocationTick function in libcoreclr.so, and gather its stack traces. Then, we’ll generate a flame graph that points to the heavy allocation sites in the app. I’ll demonstrate two ways — with perf, and with the stackcount tool from BCC (which is based on eBPF).

The perf way:

perf probe -x $APPDIR/libcoreclr.so -a 'EventXplatEnabledGCAllocationTick*'
perf record -p $PID -e 'probe_libcoreclr:*' -g -o allocs.data
perf script -i allocs.data | $FLAMEGRAPH/stackcollapse-perf.pl | $FLAMEGRAPH/flamegraph.pl > allocs.svg

The stackcount way:

$BCC/stackcount $APPDIR/libcoreclr.so:EventXplatEnabledGCAllocationTick* -f > allocs.stacks
$FLAMEGRAPH/flamegraph.pl < allocs.stacks > allocs.svg
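
Both pipelines end up feeding flamegraph.pl with "folded" stacks: one line per unique stack, frames joined by semicolons from the root to the leaf, followed by a count. Here is a minimal Python sketch of that aggregation step, just to show what the intermediate data looks like. The function name fold_stacks and the sample frames are made up for this illustration; the real work is done by stackcollapse-perf.pl and stackcount.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def fold_stacks(stacks: Iterable[Sequence[str]]) -> List[str]:
    """Aggregate call stacks (root frame first) into folded lines.

    Each output line is "frame1;frame2;...;frameN count".
    """
    counts = Counter(";".join(frames) for frames in stacks)
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


# Hypothetical samples, one list of frames per probe hit.
SAMPLES = [
    ["Main", "StatsController.Get", "EventXplatEnabledGCAllocationTick"],
    ["Main", "StatsController.Get", "EventXplatEnabledGCAllocationTick"],
    ["Main", "Startup.Configure", "EventXplatEnabledGCAllocationTick"],
]
```

Calling fold_stacks(SAMPLES) yields two lines, one per unique stack, and the width of each frame in the flame graph is proportional to the counts on those lines.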

Just for fun, let me show you the flame graph, pointing to the StatsController.Get method as the primary source of allocations:

To quantify the overhead, I tested this approach on an ASP.NET Core app with a trivial endpoint that performs hundreds of thousands of allocations per second. I also tested LTTng event collection, where I created an LTTng session and enabled only the GCAllocationTick event. I ran the benchmark for 20 seconds with 10 concurrent clients in each mode. The results were as follows:

No tracing: 12.32 ms/request
Dynamic tracing with perf: 12.51 ms/request (total of 58,422 events recorded; almost 3,000 events per second)
Dynamic tracing with stackcount: 12.59 ms/request
LTTng recording: 12.72 ms/request (total of 56,931 events recorded)
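
To put these numbers in perspective, here is the relative overhead of each mode compared to the untraced baseline, computed in a few lines of Python. The figures are the ones listed above; the variable and function names are just for this sketch.

```python
BASELINE_MS = 12.32  # no tracing

MEASURED_MS = {
    "perf": 12.51,
    "stackcount": 12.59,
    "lttng": 12.72,
}


def overhead_percent(measured_ms: float, baseline_ms: float = BASELINE_MS) -> float:
    """Extra latency per request, as a percentage of the baseline."""
    return (measured_ms - baseline_ms) / baseline_ms * 100.0


OVERHEAD = {mode: round(overhead_percent(ms), 1) for mode, ms in MEASURED_MS.items()}

# The benchmark ran for 20 seconds and perf recorded 58,422 events.
PERF_EVENTS_PER_SECOND = 58_422 / 20
```

In this benchmark, that comes to roughly 1.5% for perf, 2.2% for stackcount, and 3.2% for LTTng recording.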

Just to clarify, any of the approaches above still require launching the application with the COMPlus_EnableEventLog=1 environment variable. If it is off, the entire eventing infrastructure is not invoked at all. Incidentally, for this application, which has the potential of generating hundreds of thousands of events per second, turning off this environment variable produces a huge speedup: 7.39 ms/request. For an app with more reasonable event rates, it will probably make sense to keep the environment variable on, because turning it off means you can’t do any meaningful event collection without restarting the process.

In conclusion, it seems that using dynamic tracing to probe the CoreCLR methods directly is a feasible approach for collecting stack traces of interesting CLR events. You don’t get the event payload (although in some cases it can be collected as well from the function’s arguments), but you do get the code location, which is often enough.

Wrapping Up Sela’s Hackathon With Four New Diagnostic Projects

Sasha Goldshtein — Mon, 25 Dec 2017 14:12:58 +0000

In the beginning of December, the consultants team at Sela had a day off-site for our annual hackathon to work on a variety of projects. This day was a blast, and there was a bunch of great energy and interesting work being done all around, but my team (Avi Avni and I) focused on diagnostics tools — my favorite — and here are some preliminary results.

Real-time Win32 memory leak diagnoser

This is a project I’ve had on my todo list for a couple of years now. In a nutshell, Win32 memory leak analysis in production is quite painful because of the sheer amount of data that has to be collected. Traditional approaches, which I’ve used quite successfully in the past, require recording every single allocation and deallocation, and then cross-correlating them to find allocations that weren’t freed (e.g., in this post using xperf and WPA). While this generally works, for an application with high-frequency allocations that leaks at a slow rate, collecting data over an hour or day or week is simply impractical due to the sheer sizes of the data files.

A couple of years ago, I wrote a BPF-based tool called memleak, which uses Linux uprobes to record allocation and deallocation stacks in a runtime data structure, without emitting data to files. I’ve already used this tool a couple of times to diagnose production issues.

The NativeLeakDetector project that Avi Avni built in just a few hours during the hackathon does the very same thing — for Windows, using ETW events. It’s still a bit shy on documentation, but is quite simple in principle. It uses the TraceEvent library to record heap allocation and deallocation events in a given process, and keeps track of all allocations with their call stacks in a runtime map. When instructed to, the tool prints all the allocations that were not freed and the call stacks leading to these allocations. There’s a bit of work remaining to make this tool production-ready, but the general skeleton is there and working quite fine.
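
The bookkeeping behind this kind of tool is simple enough to sketch in Python. To be clear, this is not NativeLeakDetector’s code (which is built on TraceEvent and ETW); the LeakTracker class and its method names are hypothetical, and the events are assumed to arrive as plain (address, size, stack) values.

```python
from collections import Counter
from typing import Dict, List, Tuple

Stack = Tuple[str, ...]


class LeakTracker:
    """Keeps live allocations in memory instead of writing events to a file."""

    def __init__(self) -> None:
        # address -> (size, call stack that allocated it)
        self._live: Dict[int, Tuple[int, Stack]] = {}

    def on_alloc(self, address: int, size: int, stack: Stack) -> None:
        self._live[address] = (size, stack)

    def on_free(self, address: int) -> None:
        # Frees of addresses we never saw (allocated before tracing
        # started) are simply ignored.
        self._live.pop(address, None)

    def outstanding(self) -> List[Tuple[Stack, int, int]]:
        """Return (stack, allocation count, total bytes), largest total first."""
        bytes_by_stack: Counter = Counter()
        count_by_stack: Counter = Counter()
        for size, stack in self._live.values():
            bytes_by_stack[stack] += size
            count_by_stack[stack] += 1
        return [
            (stack, count_by_stack[stack], total)
            for stack, total in bytes_by_stack.most_common()
        ]
```

The memory cost is proportional to the number of live allocations rather than to the total number of events, which is what makes long collection periods practical.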

Process snapshotting support in CLRMD

Our second project, also contributed by Avi Avni, was to add process snapshotting support to the popular CLRMD debugging library. If you haven’t seen it yet, CLRMD provides a convenient C# API for attaching to a live process or opening a dump file and analyzing its contents. You can walk threads and call stacks, locate specific objects in memory, investigate the heap size and GC state, and numerous other scenarios. The only catch is that to use CLRMD, you have to opt in for one of the following modes:

Create a dump file of the process and open the dump file. This allows you to capture an accurate snapshot of the process’ state, but creating the dump file can take a long time and take a lot of disk space.
Attach to the process invasively, like a debugger. Again, this lets you inspect the process’s state, but if the process is a production service, you just paused it completely.
Inspect the process’ memory without suspending it. The process keeps running, which is great for production services, but it means you’re not seeing a consistent snapshot. For example, while you’re enumerating heap objects, a GC can occur and completely mess everything up.

Avi’s pull request adds another option: create a virtual address clone of the process using the Process Snapshotting API (essentially POSIX fork(), but without actually executing code in the child process), and then attach CLRMD to the clone. The original process can keep running, but we have an accurate snapshot of its state to analyze — and then throw away. What’s best, the snapshotting API uses copy-on-write, so only pages modified by the original process are actually cloned (on demand) in physical memory.
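
To make the fork() analogy concrete, here is a small POSIX-only Python sketch. It has nothing to do with the Windows Process Snapshotting API or the CLRMD pull request itself; it only shows that a forked child keeps seeing the state as it was at the moment of the fork, even though the parent keeps running and changing things. The function name read_from_snapshot is made up for this example.

```python
import os


def read_from_snapshot(state: dict, key: str) -> str:
    """Fork, mutate state[key] in the parent, and return what the child saw."""
    go_r, go_w = os.pipe()    # parent -> child: "you may read now"
    out_r, out_w = os.pipe()  # child -> parent: the value seen in the snapshot

    pid = os.fork()
    if pid == 0:
        # Child: a copy-on-write clone of the parent's address space.
        try:
            os.close(go_w)
            os.close(out_r)
            os.read(go_r, 1)  # wait until the parent has changed its copy
            os.write(out_w, str(state[key]).encode())
        finally:
            os._exit(0)  # never fall through into the parent's code

    # Parent: keeps running and modifies its own copy of the state.
    os.close(go_r)
    os.close(out_w)
    state[key] = "changed after snapshot"
    os.write(go_w, b"x")
    os.close(go_w)

    seen = os.read(out_r, 4096).decode()
    os.close(out_r)
    os.waitpid(pid, 0)
    return seen
```

Calling read_from_snapshot({"heap": "original"}, "heap") returns the original value, because the child’s pages were only copied when the parent wrote to them.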

.NET Core real-time event tracer for Linux

Earlier this year, I wrote a couple of blog posts on tracing .NET Core runtime events on Linux, such as garbage collections, allocations, exceptions, and others. The tracing approach I’ve shown is based on recording LTTng events to a trace file, and analyzing the trace file later. While this has its merits, it’s not really suitable for real-time, continuous monitoring. So I set out to build a proof-of-concept script that captures a real-time trace of .NET Core events, aggregates them in real-time, and produces interesting statistics.

The result is dntrace, a two-part tool: dntrace.sh, which turns on the LTTng events and records them, and dntrace.py, which parses them in real-time and displays statistics. Currently, the Python part uses an extremely fragile approach, where the trace data is passed through Babeltrace and then parsed from strings back into structured events. Babeltrace 2.0 will introduce API support for parsing events from real-time sessions, which is when the dntrace.py script can be revisited and implemented in a less hacky way.
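
Here is roughly what the string-parsing approach amounts to, as a Python sketch. This is not the actual dntrace.py code: the sample line is a hypothetical one written in the general style of Babeltrace’s text output, the exact layout may differ between versions and options, and the names parse_event and count_events are made up.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple, Union

Value = Union[int, str]

# Assumed layout: [timestamp] (+delta) hostname provider:event: { ... }, { ... }
EVENT_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+\(\+[^)]*\)\s+\S+\s+(?P<name>[\w:]+):\s+(?P<rest>.*)$"
)
FIELD_RE = re.compile(r'(\w+) = ("[^"]*"|[^,}]+)')

# Hypothetical sample line, not captured from a real session.
SAMPLE_LINE = (
    "[13:05:12.123456789] (+0.000004521) host DotNETRuntime:GCStart_V2: "
    "{ cpu_id = 1 }, { Count = 7, Depth = 0, Reason = 0, Type = 0, "
    "ClientSequenceNumber = 0 }"
)


def parse_event(line: str) -> Optional[Tuple[str, Dict[str, Value]]]:
    """Turn one text line back into (event name, fields); None if it doesn't match."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    fields: Dict[str, Value] = {}
    for key, raw in FIELD_RE.findall(match.group("rest")):
        raw = raw.strip().strip('"')
        fields[key] = int(raw) if raw.lstrip("-").isdigit() else raw
    return match.group("name"), fields


def count_events(lines: Iterable[str]) -> Counter:
    """Aggregate event counts by name, skipping lines that don't parse."""
    parsed = (parse_event(line) for line in lines)
    return Counter(event[0] for event in parsed if event is not None)
```

The fragility is easy to see: any change in the text layout silently turns events into unparsed lines, which is why a proper event-parsing API would be preferable.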

It’s still not bad, though — you can get real-time GC information, including GC durations and generation sizes; printouts on any exceptions thrown; live allocation data; and other statistics. See the project repository for an example.

Bonus project: run a process in a Windows job object

During the day, I started working on another little tool, which I was only able to finish a few days later: jobrun. This tool runs a process inside a Windows job object, and lets you apply various limits to its behavior. You can restrict the process’ memory usage, CPU time, CPU affinity, scheduling priority, scheduling weight, and apply additional quotas — all supported by the Windows job object API.

For me, this was a useful tool for testing how a process deals with scarce resources. What happens when I can’t commit more than 300 MB of memory? How long does it take for the application to start up when I only get 3% of CPU time per scheduling interval? Can a single batch job complete within a hard limit of 30 CPU seconds? Perhaps you’ll find some other uses for this tool, too.

You can also follow me on Twitter, where I put stuff that doesn’t necessarily deserve a full-blown blog post.

Lightweight JVM Diagnostics Tools and Containers

Sasha Goldshtein — Wed, 27 Sep 2017 10:08:23 +0000

If you’re reading this, I hope you’re curious what your options are when it comes to running JVM diagnostic tools on containerized applications. Generally when it comes to containers, you can either shove all your diagnostic tools into the container image, or you can try running them from the host — this short post tries to explain what works, what doesn’t, and what can be done about it. Although it is focused on JVM tools (and HotSpot specifically), a lot of the same obstacles will apply to other runtimes and languages.

Container isolation

As a very quick reminder, container isolation on Linux works by using namespaces. Containerized processes are placed in a PID namespace that gives them private process ids that aren’t shared with the host (although they also have process ids on the host); in a mount namespace that gives them their own view of mount points, and hence their own view of the filesystem; in a network namespace that gives them their own network interfaces; and so on. A lot of diagnostic tools aren’t namespace-aware, and will happily try to open files on the host using container paths, or try to attach to a process by using the container’s PID namespace, or exhibit any number of other failures.
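
To illustrate the two views of a process ID, here is a small Python sketch that reads the NSpid field from /proc/$PID/status, which on reasonably recent kernels lists the PID in each nested PID namespace, outermost first. The function name and the sample values are made up for this example.

```python
from typing import List, Optional


def namespace_pids(status_text: str) -> Optional[List[int]]:
    """Return the PIDs from the NSpid line, outermost namespace first.

    Returns None if the kernel doesn't report NSpid.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(pid) for pid in line.split()[1:]]
    return None


# Hypothetical contents of /proc/29891/status as read on the host
# (most fields omitted).
SAMPLE_STATUS = "Name:\tjava\nPid:\t29891\nNSpid:\t29891\t1\n"
```

With the sample above, the host sees the process as PID 29891, while inside the container the same process is PID 1. A tool that isn’t namespace-aware will use the wrong one of the two.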

Additionally, container resources are often limited by using control groups. This is not so much an isolation mechanism as it is a quota mechanism: the cpu control group restricts container CPU usage shares; the memory control group restricts user and kernel memory usage; the blkio control group restricts I/O throughput and operation count; and so on.

Finally, a lot of container runtimes (including Docker) use seccomp to restrict the set of syscalls containerized processes can make, to further isolate them from the host and avoid nasty surprises. Turns out, though, that some of these restricted syscalls are actually essential for diagnostic tools to work properly.

JVM diagnostic mechanisms

This is by no means a complete survey, but it’s worth just listing quickly the main JVM diagnostic mechanisms and how they work, before we can consider what happens in a containerized environment. (For more on this, including source links, check out Serviceability in Hotspot from the OpenJDK documentation.)

JVM performance data: by default, the JVM emits binary data into a file in the temp directory called hsperfdata_$USER/$PID. This file contains statistics on garbage collection, class loading, JIT compilation, and other events. It is the data source for jstat, and is also how jps and jinfo discover information about running JVM processes.
JVM attach interface: by default, the JVM will react to a QUIT signal by looking for a file in the working directory called .attach_pid$PID. If the file exists, it will create a UNIX domain socket in the temp directory called .java_pid$PID, and create a thread that will listen for commands on that socket. jmap, jstack, jcmd are some of the tools that rely on the attach interface for heap dumps, thread dumps, obtaining VM information, and other facilities.
Serviceability Agent: a component that runs in an external process and reads JVM data structures from the target by using ptrace (for a live process) or ELF parsing (for a core dump). This allows live diagnostics and core dump analysis to see thread states, heap objects, call stacks, and so on. HSDB, SOSQL, and other tools rely on the Serviceability Agent API. Notably, the JDK version has to match exactly between the original JVM and the one used to analyze the core dump or live process.
JVMTI: this tool interface allows an external agent library (.so) to be loaded with or attached to a JVM process and register for various interesting events, including class loading, thread start, garbage collection, monitor contention, and others. To load an agent with your process you use the -agentpath command-line argument; to attach an agent to a live process you use the JVM attach interface.
JMX: the JDK runtime provides a basic set of managed beans for inspecting the GC heap, threads, and other components. Many additional managed beans exist in various application containers like Tomcat.

Another important concept to consider is perf maps, used by the Linux perf tool to map JIT-compiled code addresses to Java methods. A common way of creating these is by using a JVMTI agent (e.g. perf-map-agent), which writes a perf map out to the default location in /tmp/perf-$PID.map. These are crucial for a lot of native Linux performance tools if you plan to use them with JVM processes.

Running diagnostic tools from the host

If you look at the way some of the JVM tools are implemented, it is clear that running them from the host will present a set of interesting challenges. Here’s how to address these challenges in some cases:

The JVM performance data store will usually not be accessible from the host. However, you can bind-mount the temp directory to make it visible from the host, which makes tools like jstat happy. (With Docker, this would be something like -v /tmp:/tmp).
The JVM attach interface has multiple points of failure: the containerized JVM thinks its process ID is X, while the host tool thinks it’s Y; and of course the attach file and the UNIX domain socket will be in the wrong mount namespaces. I just recently added a namespace-awareness patch to Andrei Pangin’s jattach tool, which covers the functionality of jmap, jstack, jcmd, and jinfo in a single package — so you can now use jattach from the host with no additional flags.
The Serviceability Agent API requires the full JDK to be available on the host, and requires a perfect match between the host and container JDK. This is not a likely scenario.
Attaching a JVMTI agent to a containerized process can be done with jattach, provided that the agent library is accessible in the container. This can be done with bind-mounts.
JMX beans can be accessed from the host by making the container expose them remotely using RMI. This StackOverflow question and answer thread covers it well.
If you plan on using perf maps, you need to generate them inside the container (by attaching a JVMTI agent) and then make them accessible to the host tool with the right PID. This is taken care of automatically by some tools, and was recently added to perf as well.
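
One way a host tool can reach files that live in the container’s mount namespace is through /proc/$PID/root, which exposes the filesystem as the target process sees it (given sufficient privileges). Here is a Python sketch of the path arithmetic involved. It is only an illustration and not necessarily how jattach or perf do it; the helper names are made up, and the PIDs are hypothetical.

```python
import os


def host_view(host_pid: int, container_path: str) -> str:
    """Path, as seen from the host, of a file inside the container."""
    return os.path.join("/proc", str(host_pid), "root", container_path.lstrip("/"))


def perf_map_path(container_pid: int) -> str:
    """Default perf map location, named after the PID the JVM sees."""
    return f"/tmp/perf-{container_pid}.map"


def attach_socket_path(container_pid: int) -> str:
    """UNIX domain socket created by the JVM attach interface."""
    return f"/tmp/.java_pid{container_pid}"
```

For a JVM that is PID 1 in its container and PID 29891 on the host, host_view(29891, perf_map_path(1)) gives /proc/29891/root/tmp/perf-1.map. Note that both PIDs are needed: the host PID to find the container’s filesystem, and the container PID to find the file name.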

Running diagnostic tools from the container

Although I don’t particularly like the idea of bloating your container image with diagnostic tools, suppose you’ve done it anyway. Here are some of the likely problems:

The Serviceability Agent API uses the ptrace syscall, which is disabled in Docker’s seccomp profile (and I imagine it would be disabled by other sensible container runtimes as well). You can use a custom seccomp profile, of course, if you understand the security consequences for your host.
Using perf and perf-based tools inside the container requires the perf_event_open syscall, which is again blocked by Docker’s default seccomp profile.

Summary

Most diagnostic tools at our disposal today were not designed with containers in mind. You could say they are not container-aware — but they’re not aware of a bazillion other things which still don’t break their behavior. Unfortunately, most tools will not work out-of-the-box for containerized JVM processes, but there are ways to make them work with a fairly minimal effort.

Profiling the JVM on Linux: A Hybrid Approach

Sasha Goldshtein — Fri, 07 Jul 2017 19:27:08 +0000

I hope you’re outraged that your performance tools are lying to you. For quite a while, many Java sampling profilers have been known to blatantly misrepresent reality. In a nutshell, stack sampling using the documented JVMTI GetStackTrace method produces results that are biased towards safepoints, and not representative of the real CPU processing performed by your program.

Over the years, alternative profilers popped up, trying to fix this problem by using AsyncGetCallTrace, a less-documented API that doesn’t wait for a safepoint, and can produce more accurate results. Simply calling AGCT from a timer signal handler gives you a fairly reliable way to do stack sampling of JVM processes. Unfortunately, even AGCT can sometimes fail, and in any case, it doesn’t help with profiling the non-Java parts of your process: JVM code, GC, JIT, syscalls, kernel work performed on your behalf, and really anything else that’s not pure JVM bytecode.

Another popular alternative is using Linux perf, which doesn’t directly support Java but has great support for profiling native code, and doesn’t have any trouble looking at kernel stacks as well. For JVM support, you need two pieces:

A perf map that maps JIT-compiled addresses to function names (as a corollary, only compiled frames are supported; interpreter frames are invisible)
A JIT switch -XX:+PreserveFramePointer that makes sure perf can walk the Java stack, added in OpenJDK 1.8u60

When using this method:

You end up losing interpreter frames
You can’t profile an older JVM that doesn’t have the PreserveFramePointer flag
You risk having stale entries in your perf map because the JIT can throw away and recompile code
You risk not having certain functions in your perf map because the JIT threw the code away

At JPoint 2017, Andrei Pangin and Vadim Tsesko from Odnoklassniki introduced a new approach for JVM profiling on Linux, which brings together the best from both worlds: perf for native code and kernel frames, and AGCT for Java frames. Thus, async-profiler was born.

Async-profiler’s method of operation is fairly simple. It uses the perf_events API to configure CPU sampling into a memory buffer, and asks for a signal to be delivered when a sample occurs. The signal handler then calls AsyncGetCallTrace, and merges the two stacks together: the Java stack, captured by AsyncGetCallTrace, and the native + kernel stack, captured by perf_events. For non-Java threads, only the perf_events stack is retained.

Async-profiler’s approach for constructing a merged call stack, from Andrei Pangin’s and Vadim Tsesko’s presentation at JPoint 2017

This approach has its limitations, but it also offers a lot of appeal. You don’t need a special switch to preserve frame pointers. You get full-fidelity data about interpreter frames. The agent supports older JVMs. The stack aggregation happens in the agent, so there are no expensive perf.data files to store and parse.

A flame graph generated by using async-profiler

To try async-profiler, you can build from source (it’s very simple) and then use the helper profiler.sh script, which I contributed:

./profiler.sh start $(pidof java)
./profiler.sh stop -o flamegraph -f /tmp/java.stacks

Full instructions are in the README — any feedback, contributions, or suggestions are very welcome. Odnoklassniki are using this in production, but I’m sure they’ll be delighted to know that you found it useful, too!

Tracing .NET Core on Linux with USDT and BCC

Sasha Goldshtein — Sun, 02 Apr 2017 19:12:49 +0000

In my last post, I lamented the lack of call stack support for LTTng events in .NET Core. Fortunately, being open source, this is somewhat correctable — so I set out to produce a quick-and-dirty patch that adds USDT support for CoreCLR’s tracing events. This post explores some of the things that then become possible, and will hopefully become available in one form or another in CoreCLR in the future.

Very Brief USDT Primer

USDT (User Statically Defined Tracing) is a lightweight approach for embedding static trace markers into user-space libraries and applications. I took a closer look at it a year ago when discussing USDT support in BCC, so you might want to take a look at that post as a refresher.

In a very small nutshell, to embed USDT probes into your library, you use a special set of macros, which then produce ELF NT_STAPSDT notes with information about the probe’s location (instruction offset), its name, its arguments, and a global variable that can be poked at runtime to turn the probe on and off (this is called the probe’s semaphore).

When tracing is disabled, i.e. the semaphore is off, USDT probes have a near-zero cost, essentially a single NOP instruction. If the argument preparation for the probe is prohibitively expensive, your code can protect relevant sections with another macro that checks if the probe is enabled before preparing and submitting its arguments. Here’s what the whole thing might look like:

// Declaring the trace semaphore and the trace macro:
#define _SDT_HAS_SEMAPHORES 1
#include <sys/sdt.h>

#define MYAPP_REQUEST_START_ENABLED() __builtin_expect (myapp_request_start_semaphore, 0)
__extension__ unsigned short myapp_request_start_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define MYAPP_REQUEST_START(url, client_port) DTRACE_PROBE2(myapp, request_start, url, client_port)

// The actual tracing code; the arguments are only prepared when the probe is enabled:
if (MYAPP_REQUEST_START_ENABLED()) {
  char const *url = curr_request->uri().canonicalize();
  unsigned short port = curr_request->connection()->client_port;
  MYAPP_REQUEST_START(url, port);
}

Okay, so why was I so eager to get these probes into CoreCLR? Because there are existing, lightweight tools for tracing USDT probes. One is SystemTap, which is great but requires a kernel module, and the other is the BCC toolkit, which I described extensively in previous posts. Also, because USDT probes can be mapped to specific program locations, the existing Linux uprobes mechanism can be used to trace them and obtain stack traces, with perf or ftrace-based machinery. Without subtracting from the value of LTTng traces, I really wanted to get the BCC tools working with CoreCLR, and that requires USDT.

Putting USDT Probes in CoreCLR

At this point you might be thinking of some monstrous patch that modifies thousands of trace locations in CoreCLR to support USDT, somehow. Fortunately, there is a Python script in the CoreCLR source called genXplatLttng.py, which is responsible for generating function stubs for each CLR event. All I had to do was patch it ever so slightly (31 changed lines), and the resulting CoreCLR binary (libcoreclr.so) now has USDT probes!

# readelf -n .../libcoreclr.so

Displaying notes found at file offset 0x00000200 with length 0x00000024:
  Owner                 Data size       Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: a93f07f0d169d6dd53fb8a09e3fe793cda56072d

Displaying notes found at file offset 0x0079cf90 with length 0x0000c25c:
  Owner                 Data size       Description
  stapsdt              0x00000046       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCStart
    Location: 0x000000000051a296, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc18
    Arguments: 4@-28(%rbp) 4@-32(%rbp)
  stapsdt              0x0000006d       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCStart_V1
    Location: 0x000000000051a39e, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1a
    Arguments: 4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp)
  stapsdt              0x00000079       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCStart_V2
    Location: 0x000000000051a4b9, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1c
    Arguments: 4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp) 8@-72(%rbp)
  stapsdt              0x00000044       NT_STAPSDT (SystemTap probe descriptors)
    Provider: DotNETRuntime
    Name: GCEnd
    Location: 0x000000000051a597, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1e
    Arguments: 4@-28(%rbp) 2@-30(%rbp)
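
The Arguments lines in these notes follow a compact size@operand convention, where the size is in bytes and a negative size marks a signed value. Here is a minimal Python sketch for pulling them apart; parse_usdt_args is a hypothetical helper written for this illustration, not something that ships with readelf or BCC.

```python
import re

# One descriptor looks like "4@-28(%rbp)": size in bytes, "@", then an operand.
# A negative size marks a signed value.
ARG_RE = re.compile(r"(-?\d+)@(\S+)")

def parse_usdt_args(arguments):
    """Split a stapsdt Arguments string into (size, signed, operand) tuples."""
    parsed = []
    for size_text, operand in ARG_RE.findall(arguments):
        size = int(size_text)
        parsed.append((abs(size), size < 0, operand))
    return parsed

# The Arguments line of the GCStart_V2 probe, copied from the readelf output.
gc_start_v2 = "4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp) 8@-72(%rbp)"
for index, (size, signed, operand) in enumerate(parse_usdt_args(gc_start_v2), start=1):
    kind = "signed" if signed else "unsigned"
    print(f"arg{index}: {size} bytes, {kind}, at {operand}")
```

The six descriptors line up with the six values passed to the GCStart_V2 event, which is what lets tools refer to them as arg1 through arg6.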

Many more notes were omitted for brevity; there is a total of 394 events on the build I used. Now, it’s important to clarify that this patch doesn’t deliver full-fidelity events. LTTng events have a richer payload than what USDT probes support, including complex structures, sequences, and more. However, in many tracing scenarios, very basic information such as strings and numbers is sufficient. And, of course, the call stack. So let’s see what we can do now.

Tracing .NET Core Garbage Collections

OK, so what can we do with these newly-obtained superpowers? To begin with, we can trace USDT probes using the generic trace and argdist tools from BCC. For example, let’s get some statistics about garbage collections — how many collections do we have in each generation?

# argdist -p $(pidof helloworld) -C 'u::GCStart_V2():int:arg2#collections by generation' -c
[03:04:29]
collections by generation
        COUNT      EVENT
        4          arg2 = 2
        8          arg2 = 1
        13         arg2 = 0
[03:04:30]
collections by generation
        COUNT      EVENT
        5          arg2 = 2
        20         arg2 = 1
        25         arg2 = 0
[03:04:31]
collections by generation
        COUNT      EVENT
        5          arg2 = 2
        22         arg2 = 1
        28         arg2 = 0
[03:04:32]
collections by generation
        COUNT      EVENT
        6          arg2 = 2
        30         arg2 = 1
        36         arg2 = 0
[03:04:33]
collections by generation
        COUNT      EVENT
        9          arg2 = 2
        40         arg2 = 1
        49         arg2 = 0

arg2 in the above output is the collection “depth”, which is the collected generation. As you can see, we have quite a few gen0 and gen1 collections every second, and a handful of gen2 collections as well. (By the way, BCC has a tool called ugc for exploring GC latencies specifically, but it doesn’t currently support .NET Core.)
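
One thing to keep in mind when reading the output: the -c switch makes argdist keep cumulative counts instead of resetting them at each interval. To get the number of collections per second, subtract consecutive snapshots. A small Python sketch, with the numbers copied from the output above (per_interval is a made-up helper name):

```python
# Cumulative counts copied from the argdist output above, keyed by generation.
snapshots = [
    ("03:04:29", {0: 13, 1: 8, 2: 4}),
    ("03:04:30", {0: 25, 1: 20, 2: 5}),
    ("03:04:31", {0: 28, 1: 22, 2: 5}),
    ("03:04:32", {0: 36, 1: 30, 2: 6}),
    ("03:04:33", {0: 49, 1: 40, 2: 9}),
]

def per_interval(snapshots):
    """Turn cumulative counts into the number of new events per interval."""
    deltas = []
    for (_, before), (stamp, after) in zip(snapshots, snapshots[1:]):
        delta = {gen: after[gen] - before.get(gen, 0) for gen in after}
        deltas.append((stamp, delta))
    return deltas

for stamp, delta in per_interval(snapshots):
    print(stamp, " ".join(f"gen{gen}=+{count}" for gen, count in sorted(delta.items())))
```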

How did I know that arg2 is the collection depth, and how did I know that the collection “depth” is the generation to be collected? There are many more examples in this post that look a bit magical with various arg1, arg2, …, arg6 incantations. Right now, the answer is by inspecting the CLR source code to see where the probes are emitted, and what the values passed to them mean. In this particular case:

~/coreclr$ ack GCStart_V2 src/
src/vm/eventtrace.cpp
901: FireEtwGCStart_V2(pGcInfo->GCStart.Count, pGcInfo->GCStart.Depth, pGcInfo->GCStart.Reason, pGcInfo->GCStart.Type, GetClrInstanceId(), l64ClientSequenceNumberToLog);
...
src/gc/env/etmdummy.h
7:#define FireEtwGCStart_V2(Count, Depth, Reason, Type, ClrInstanceID, ClientSequenceNumber) 0

~/coreclr$ ack GCStart.*Depth src/
src/vm/eventtrace.cpp
895: (pGcInfo->GCStart.Depth == GCHeapUtilities::GetGCHeap()->GetMaxGeneration()) &&
901: FireEtwGCStart_V2(pGcInfo->GCStart.Count, pGcInfo->GCStart.Depth, pGcInfo->GCStart.Reason, pGcInfo->GCStart.Type, GetClrInstanceId(), l64ClientSequenceNumberToLog);
...
src/gc/gcee.cpp
91: Info.GCStart.Depth = (uint32_t)pSettings->condemned_generation;
100: else if (Info.GCStart.Depth < max_generation)

The argument order in the FireEtwGCStart_V2 function makes it clear that arg2 is going to be the collection depth. Then, the assignment statement in gcee.cpp hopefully makes it clear: the GC depth is the “condemned generation”, which is the generation to be collected.
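
The same source-reading trick can be mechanized a little. The sketch below takes a macro definition line like the one ack found in etmdummy.h and numbers its parameters. It assumes the probe arguments follow the macro’s parameter order, which is what we just saw for GCStart_V2 but may not hold for events with more complex payloads; probe_arg_names is a hypothetical helper.

```python
import re

def probe_arg_names(define_line):
    """Map arg1..argN to the parameter names of a FireEtw* macro definition."""
    match = re.search(r"#define\s+(\w+)\(([^)]*)\)", define_line)
    if match is None:
        raise ValueError("not a function-like macro definition")
    names = [name.strip() for name in match.group(2).split(",") if name.strip()]
    mapping = {f"arg{i}": name for i, name in enumerate(names, start=1)}
    return match.group(1), mapping

# The definition line as printed by ack, including its line-number prefix.
line = "7:#define FireEtwGCStart_V2(Count, Depth, Reason, Type, ClrInstanceID, ClientSequenceNumber) 0"
macro, mapping = probe_arg_names(line)
print(macro)
for arg, name in mapping.items():
    print(f"  {arg} = {name}")
```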

Now, where are these pesky collections coming from? The stackcount tool summarizes call stacks in-kernel:

# stackcount u:.../libcoreclr.so:GCStart_V2 -p $(pidof helloworld)
^C
  FireEtXplatGCStart_V2
  ETW::GCLog::FireGcStartAndGenerationRanges(ETW::GCLog::st_GCEventInfo*)
  WKS::GCHeap::UpdatePreGCCounters()
  WKS::gc_heap::do_pre_gc()
  WKS::gc_heap::garbage_collect(int)
  WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason)
  WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)
  WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)
  SlowAllocateString(unsigned int)
  StringObject::NewString(char16_t const*, int)
  Int32ToDecStr(int, int, StringObject*)
  COMNumber::FormatInt32(int, StringObject*, NumberFormatInfo*)
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    58

OK, so this looks like a fairly obvious path: there is a string allocation in DoSomeWork caused by converting an int32 to a string, and that triggers a GC repeatedly. Apparently, some of these GCs are gen0/gen1 but some of them actually require gen2 to clean up. Note that we get a full-fidelity call stack, including managed code (thanks to the COMPlus_PerfMapEnabled switch we saw in an earlier post).

If necessary, stack traces like these can also be visualized as flame graphs. Here’s an example flame graph from perf, of the above application while it was churning through a lot of memory allocations. The GC paths are clearly visible — in the foreground (allocating) thread, and in a background thread.
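
For reference, the flame graph scripts consume “folded” stacks: one line per unique stack, frames ordered root first and separated by semicolons, followed by a count. Here is a rough Python sketch of that conversion for a stack printed leaf first, as in the stackcount output above. The fold function is hypothetical, and the frame list is shortened for readability.

```python
def fold(frames_leaf_first, count):
    """Build one folded-stack line: frames root first, joined by ';', then the count."""
    # Semicolons separate frames in the folded format, so replace any inside a frame name.
    cleaned = [frame.replace(";", ":") for frame in reversed(frames_leaf_first)]
    return f"{';'.join(cleaned)} {count}"

# A shortened version of the GCStart_V2 stack above, leaf first, seen 58 times.
stack = [
    "FireEtXplatGCStart_V2",
    "WKS::gc_heap::garbage_collect(int)",
    "SlowAllocateString(unsigned int)",
    "void [helloworld] helloworld.Program::DoSomeWork()",
    "void [helloworld] helloworld.Program::Main(string[])",
]
print(fold(stack, 58))
```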

Another interesting thing to trace about the GC comes from the GCHeapStats_V1 event. This is an event that gets fired with every collection, and provides information about individual generation sizes, the amount of promoted and finalized memory, and a bunch of other interesting stuff. Here’s an example of tracing generation 2 size over time, visualized as a histogram every 15 seconds:

# argdist -p $(pidof helloworld) -H 'u::GCHeapStats_V1():u64:arg5/1048576#gen2 size (MB)' -i 15 -c
[15:10:51]
     gen2 size (MB)      : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 23       |****                                    |
       128 -> 255        : 63       |************                            |
       256 -> 511        : 196      |****************************************|
[15:11:06]
     gen2 size (MB)      : count     distribution
         0 -> 1          : 6        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 3        |                                        |
        16 -> 31         : 6        |                                        |
        32 -> 63         : 10       |                                        |
        64 -> 127        : 49       |****                                    |
       128 -> 255        : 107      |**********                              |
       256 -> 511        : 404      |****************************************|

From the histogram, we can see that the gen 2 size is usually between 256MB and 512MB, but there are occasional GCs that bring it down, even as low as the 0-1MB bucket.
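
The bucket boundaries are powers of two, which is how this kind of log2 histogram keeps the bookkeeping cheap. As an illustration only (this is a Python model of the bucketing, not the code argdist generates), here is how a byte count ends up in one of the rows above:

```python
def log2_bucket(value):
    """Return the (low, high) power-of-two bucket that holds a non-negative integer."""
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, 2 * low - 1)

MB = 1048576  # same divisor as in the argdist expression above

# Hypothetical gen2 sizes in bytes, picked to land in different rows.
for size_bytes in (300 * MB, 90 * MB, 512 * 1024):
    low, high = log2_bucket(size_bytes // MB)
    print(f"{size_bytes} bytes falls in the {low} -> {high} MB bucket")
```

Note that the integer division truncates, so anything under 1MB lands in the 0 -> 1 row.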

Tracing Object Allocations

Very similarly to the approach above, we could trace object allocations. The CLR includes a lightweight allocation tick event (GCAllocationTick_V3), which fires roughly every 100KB of object allocations. It includes the most recently allocated type name, and the amount of memory allocated since the last tick — allowing for low-overhead object allocation sampling, without tracing each individual allocation, which would be extremely expensive.

Unfortunately, the current trace and argdist tools don’t support Unicode strings, which is how the type name is provided to these events, so the output is slightly less useful — but we can still get histograms for the allocated amount at each tick, or a summary of type ids. First, let’s try arg6, which is the type name as a string:

# argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():char*:arg6' -z 32
[03:25:06]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        1          arg6 = S
        59         arg6 = S
        1254       arg6 = S
[03:25:07]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        1          arg6 = S
        383        arg6 = S
[03:25:08]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        2          arg6 = S
        11         arg6 = S
        1053       arg6 = S

That’s not very nice because when we treat the Unicode string as char*, only the first character gets displayed. This is fixable by modifying the tools, or writing a dedicated tool that would display these strings correctly. For example, here’s output from a patched argdist that appropriately decodes the strings instead of treating them like ASCII:

# argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():char*:arg6' -z 64
[03:51:58]
u::GCAllocationTick_V3():char*:arg6
        COUNT      EVENT
        1          arg6 = System.Char[]
        59         arg6 = System.String[]
        260        arg6 = System.String
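
To see why the unpatched output shows just an S, consider what a UTF-16 string looks like to code that expects a char*. The Python sketch below assumes the type name sits in memory as little-endian UTF-16 with a two-byte terminator; the helper names are made up for the illustration.

```python
def as_c_string(raw):
    """Read bytes the way a char* consumer would: stop at the first NUL byte."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")

def as_utf16_string(raw):
    """Read bytes as NUL-terminated little-endian UTF-16."""
    text = raw.decode("utf-16-le")
    return text.split("\x00", 1)[0]

# Assumed in-memory layout of the type name: UTF-16LE plus a two-byte terminator.
raw = "System.String".encode("utf-16-le") + b"\x00\x00"

print(as_c_string(raw))      # S
print(as_utf16_string(raw))  # System.String
```

Every ASCII character in UTF-16LE is followed by a zero byte, so a char* reader sees a one-character string no matter which type was allocated.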

We can also get good statistics by looking at type ids (method tables, actually) — which would have to be translated to type names separately, e.g. using SOS:

# argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():u64:arg5#type id'
[03:31:07]
type id
        COUNT      EVENT
        10         arg5 = 139987795580656
        746        arg5 = 139987795692592
[03:31:08]
type id
        COUNT      EVENT
        2          arg5 = 139987795580656
        1396       arg5 = 139987795692592
[03:31:09]
type id
        COUNT      EVENT
        1          arg5 = 139987795580656
        1064       arg5 = 139987795692592

The call stacks work great, though:

# stackcount -p $(pidof helloworld) u:.../libcoreclr.so:GCAllocationTick_V3
^C
  FireEtXplatGCAllocationTick_V3
  WKS::gc_heap::fire_etw_allocation_event(unsigned long, int, unsigned char*)
  WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)
  WKS::gc_heap::allocate_large_object(unsigned long, long&)
  WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)
  FastAllocatePrimitiveArray(MethodTable*, unsigned int, int)
  JIT_NewArr1(CORINFO_CLASS_STRUCT_*, long)
  [unknown]
  instance class [System.Collections]System.Collections.Generic.List`1 [System.Linq] System.Linq.Enumerable+SelectListIterator`2[System.__Canon,System.Char]::ToList()
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    131

  FireEtXplatGCAllocationTick_V3
  WKS::gc_heap::fire_etw_allocation_event(unsigned long, int, unsigned char*)
  WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)
  WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)
  SlowAllocateString(unsigned int)
  StringObject::NewString(char16_t const*, int)
  Int32ToDecStr(int, int, StringObject*)
  COMNumber::FormatInt32(int, StringObject*, NumberFormatInfo*)
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    2496

This shows two major stack traces allocating objects: one allocating an array inside a LINQ ToList() call, and another one that we’ve already seen, formatting an int32 as a string.
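
Because the tick is a sampling event, the counts can be turned into a rough estimate of allocated memory per call path. The sketch below assumes each tick stands for about 100KB, as described earlier; the path labels are my own shorthand and the result is only a ballpark figure.

```python
TICK_BYTES = 100 * 1024  # assumption: one tick per roughly 100KB allocated

# Tick counts per call path, copied from the stackcount output above.
ticks_by_path = {
    "LINQ ToList() array allocation": 131,
    "Int32 to string formatting": 2496,
}

def estimate_allocations(ticks_by_path, tick_bytes=TICK_BYTES):
    """Estimate bytes allocated per path and each path's share of all ticks."""
    total = sum(ticks_by_path.values())
    return {
        path: (ticks * tick_bytes, ticks / total)
        for path, ticks in ticks_by_path.items()
    }

for path, (approx_bytes, share) in estimate_allocations(ticks_by_path).items():
    print(f"{path}: ~{approx_bytes / 1048576:.0f} MB, {share:.1%} of ticks")
```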

Tracing Exception Events

Let’s take a look at another example. Suppose your application is suddenly hitting lots of internal exceptions, which are handled and processed, but still producing some bad results. We will trace the exceptions as they occur, and get the call stacks where they are thrown. First, how many exceptions are we seeing? This is a question for the funccount tool:

# funccount -p $(pidof helloworld) u:.../libcoreclr.so:ExceptionThrown_V1
Tracing 1 functions for "u:/home/vagrant/helloworld/bin/Debug/netcoreapp2.0/ubuntu.16.10-x64/publish/libcoreclr.so:ExceptionThrown_V1"... Hit Ctrl-C to end.
^C
FUNC                                    COUNT
ExceptionThrown_V1                        100
Detaching...

All right, we have a fairly high rate of exceptions. What types? (This requires the same patched argdist from the allocation tracing example.)

# argdist -p $(pidof helloworld) -C 'u::ExceptionThrown_V1():char*:arg1#exception type' -i 5
[04:00:01]
exception type
        COUNT      EVENT
        100        arg1 = System.IndexOutOfRangeException

# argdist -p $(pidof helloworld) -C 'u::ExceptionThrown_V1():char*:arg2#exception message' -i 5 -z 128
[04:00:29]
exception message
        COUNT      EVENT
        200        arg2 = Index was outside the bounds of the array.

That’s pretty impressive — just like that, we can trace exception types and messages happening inside our application. And of course we can get the call stacks, using our good friend stackcount:

# stackcount -p $(pidof helloworld) u:.../libcoreclr.so:ExceptionThrown_V1
^C
  FireEtXplatExceptionThrown_V1
  ETW::ExceptionLog::ExceptionThrown(CrawlFrame*, int, int)
  ExceptionTracker::ProcessExplicitFrame(CrawlFrame*, StackFrame, int, ExceptionTracker::StackTraceState&)
  ExceptionTracker::ProcessOSExceptionNotification(_EXCEPTION_RECORD*, _CONTEXT*, _DISPATCHER_CONTEXT*, unsigned int, StackFrame, Thread*, ExceptionTracker::StackTraceState)
  ProcessCLRException
  UnwindManagedExceptionPass1(PAL_SEHException&, _CONTEXT*)
  DispatchManagedException(PAL_SEHException&, bool)
  __FCThrow(void*, RuntimeExceptionKind, unsigned int, char16_t const*, char16_t const*, char16_t const*)
  COMString::GetCharAt(StringObject*, int)
  char [helloworld] helloworld.Program::Selector(string)
  instance class [System.Collections]System.Collections.Generic.List`1 [System.Linq] System.Linq.Enumerable+SelectListIterator`2[System.__Canon,System.Char]::ToList()
  void [helloworld] helloworld.Program::DoSomeWork()
  void [helloworld] helloworld.Program::Main(string[])
  CallDescrWorkerInternal
  MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)
  RunMain(MethodDesc*, short, int*, PtrArray**)
  Assembly::ExecuteMainMethod(PtrArray**, int)
  CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)
  coreclr_execute_assembly
  run(arguments_t const&)
  [unknown]
  [unknown]
    200

OK, so in a function called Selector we’re trying to access a character in a string, and hitting an out-of-bounds condition. Perhaps the string is empty, or the index is invalid. All that — without a debugger!

Conclusion

There are plenty of other things that are made possible by collecting call stacks from CoreCLR events — tracing assembly loads, method JIT, object movement, finalization, and many other interesting scenarios. Currently, this is all just wishful thinking: I don’t seriously expect anyone to patch their CoreCLR to emit USDT probes, just for the sake of BCC tools or SystemTap. However, it goes to show what’s possible — and what’s desirable — for the future of .NET Core tracing, debugging, and profiling on Linux.

In the meantime, there seems to be a very recent patchset proposing stack trace collection support for LTTng. If merged, you should be able to attach a stack trace to an event using the context mechanism, similar to how we attached the pid and the process name in the previous post. Although that wouldn’t light up all the BCC tools and SystemTap, it would be a step in the right direction, and would make most of the analyses shown in this post possible.

You can also follow me on Twitter, where I put stuff that doesn’t necessarily deserve a full-blown blog post.

Tracing Runtime Events in .NET Core on Linux

Sasha Goldshtein — Thu, 30 Mar 2017 12:30:03 +0000

After exploring the basic profiling story, let’s turn to ETW events. On Windows, the CLR is instrumented with a myriad of ETW events, which can be used to tackle very hard problems at runtime. Here are some examples of these events:

Garbage collections
Assembly load/unload
Thread start/stop (including thread pool threads)
Object allocations
Exceptions thrown, caught, filtered
Methods compiled (JIT)

By collecting all of, or a subset of, these events, you can get a very nice picture of what your .NET application is doing. By combining these with Windows kernel events for CPU sampling, file accesses, process creations and more — you have a fairly complete tool for performance investigations. You might find my recorded DotNext talk on using PerfView for .NET performance investigations useful — it shows how ETW and pretty much nothing else but ETW can be used to solve a huge variety of performance problems.

Unfortunately, ETW is a Windows-only mechanism. Event Tracing for Windows, you know? It is implemented in the Windows kernel, and that’s partly why it is so efficient and powerful. While looking for an equivalent Linux framework, the CLR team considered multiple alternatives, and decided to go with LTTng. It’s not that Linux doesn’t have enough tracers (quite the opposite); they had to choose which tracer is most appropriate for the CLR’s needs.

Hello, LTTng

LTTng is similar in spirit to ETW. It is a lightweight tracing framework that can process events in real-time or record them to a file for later processing. It supports multiple simultaneous trace sessions, and each session can have multiple providers enabled: a system call provider (which requires a kernel module to be installed) alongside user-space providers, such as the CoreCLR. Traces can be analyzed on the same machine or on another machine, and there are viewers available for more advanced visualization. Custom EventSource-based providers are also supported.

The massive, unfortunate, painful, unforgivable downside of this choice made by the CLR team is that LTTng doesn’t have stack trace support for user-space events. And this hurts more than you can imagine:

You can collect GC events, but you can’t aggregate stack traces to figure out where the GCs are coming from
You can collect exception events, but you don’t have the exception call stack
You can collect assembly load events (and potential failures), but you don’t know what’s triggering that assembly load
You can collect object allocations, but you can’t aggregate statistics to indicate which code paths are causing lots of allocations (and thereby garbage collections)

Effectively, not having stack trace support makes LTTng for CoreCLR a logging framework, which can be used to record and investigate logs, but not a tracing framework, which can help diagnose performance problems and troubleshoot hard issues in the field.

Collecting LTTng Traces

Let’s take a look at collecting LTTng traces from CoreCLR. As in the previous post, you could use the perfcollect tool, but it is typically overkill. First, perfcollect’s current implementation always turns on CPU sampling, which takes a massive amount of space and introduces a certain overhead. Second, perfcollect doesn’t have an event filter; it has exactly three modes for CoreCLR events: everything, GC only, and GC collect only. If you only care about assembly load events, or GC events, or JIT events, you’re out of luck.

Fortunately, it’s very easy to roll your own LTTng collection (you’ll need to install lttng-tools to record, and babeltrace to view):

# lttng create exceptions-trace
# lttng add-context --userspace --type vpid
# lttng add-context --userspace --type vtid
# lttng add-context --userspace --type procname
# lttng enable-event --userspace --tracepoint DotNETRuntime:Exception*
# lttng start

In the preceding commands, lttng create creates a trace session that you then add providers to. The add-context command makes sure each event will have, in addition to the provider data, the PID, TID, and process name. Then, enable-event adds a specific event set from the CoreCLR provider. Note that LTTng doesn’t need any metadata about these events ahead of time. To get a list of all the possible events, one easy way is to just read the perfcollect script, which declares them all, e.g.:

declare -a DotNETRuntime_NoKeyword=(
	DotNETRuntime:ExceptionThrown
	DotNETRuntime:Contention
	DotNETRuntime:RuntimeInformationStart
	DotNETRuntime:EventSource
)

Finally, lttng start starts the trace session with the enabled providers. By default, the traces are written to ~/lttng-traces/session-name-timestamp; in our case, it’s going to be /root/lttng-traces/exceptions-trace-20170330-something.
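
If you find yourself setting up sessions like this often, the sequence is easy to generate. Below is a small Python sketch that only builds the command lines, using the same subcommands and switches shown above. The function name, the gc-trace session name, and the DotNETRuntime:GC* pattern are placeholders for the example; substitute whatever events you care about.

```python
import shlex

def lttng_session_commands(session, event_pattern, contexts=("vpid", "vtid", "procname")):
    """Return the lttng command lines for a user-space session, without running them."""
    commands = [["lttng", "create", session]]
    for context in contexts:
        commands.append(["lttng", "add-context", "--userspace", "--type", context])
    commands.append(["lttng", "enable-event", "--userspace", "--tracepoint", event_pattern])
    commands.append(["lttng", "start"])
    return commands

# Hypothetical session name and event pattern, for illustration only.
for argv in lttng_session_commands("gc-trace", "DotNETRuntime:GC*"):
    print(shlex.join(argv))
```

Each entry is a plain argument list, so it can be handed to subprocess.run(argv, check=True) when you actually want to start the session.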

Now you run your scenario (the target application should have the COMPlus_EnableEventLog environment variable set to 1), and stop the trace when you’re done:

# lttng stop
# lttng destroy
# babeltrace ~/exceptions-trace
[07:31:11.751548909] (+?.?????????) ubuntu-16 DotNETRuntime:ExceptionThrown_V1: { cpu_id = 0 }, { ExceptionType = "System.NotSupportedException", ExceptionMessage = "Sample exception.", ExceptionEIP = 139767278604807, ExceptionHRESULT = 2148734229, ExceptionFlags = 16, ClrInstanceID = 0 }
[07:31:11.751603953] (+0.000055044) ubuntu-16 DotNETRuntime:ExceptionCatchStart: { cpu_id = 0 }, { EntryEIP = 139765244804131, MethodID = 139765233785640, MethodName = "void [Runny] Runny.Program::Main(string[])", ClrInstanceID = 0 }

babeltrace is a simple trace viewer, but there are also UI tools like Trace Compass and other visualization tools that can parse the CTF (Common Trace Format) specification.
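
The text output is also regular enough to post-process with a few lines of script. Here is a Python sketch that parses lines shaped like the two above into a dictionary. The regular expressions are my own approximation based on this sample output, not a complete parser for everything babeltrace can print.

```python
import re

LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+\(\S+\)\s+(?P<host>\S+)\s+"
    r"(?P<provider>\w+):(?P<event>\w+):\s+(?P<rest>.*)$"
)
# A field is "name = value", where the value is a quoted string or a bare token.
FIELD_RE = re.compile(r'(\w+) = ("(?:[^"\\]|\\.)*"|[^,}\s]+)')

def convert(value):
    """Strip quotes from strings and turn numeric tokens into integers."""
    if value.startswith('"'):
        return value[1:-1]
    try:
        return int(value, 0)
    except ValueError:
        return value

def parse_event(line):
    """Extract the timestamp, provider, event name, and fields from one text line."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = {name: convert(value) for name, value in FIELD_RE.findall(match.group("rest"))}
    return {
        "time": match.group("time"),
        "provider": match.group("provider"),
        "event": match.group("event"),
        "fields": fields,
    }

sample = ('[07:31:11.751548909] (+?.?????????) ubuntu-16 DotNETRuntime:ExceptionThrown_V1: '
          '{ cpu_id = 0 }, { ExceptionType = "System.NotSupportedException", '
          'ExceptionMessage = "Sample exception.", ExceptionEIP = 139767278604807, '
          'ExceptionHRESULT = 2148734229, ExceptionFlags = 16, ClrInstanceID = 0 }')
event = parse_event(sample)
print(event["event"], event["fields"]["ExceptionType"])
print(f"HRESULT = {event['fields']['ExceptionHRESULT']:#x}")
```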

What Now?

We have the fundamentals for collecting LTTng traces from CoreCLR and from custom providers, but the lack of stack traces is gnawing at me. In the next post, we will explore a hacky way of getting stack traces from interesting events, and even using BCC/SystemTap for tracing them.


Profiling a .NET Core Application on Linux

Sasha Goldshtein — Mon, 27 Feb 2017 14:29:58 +0000

In the same vein as my previous post on analyzing core dumps of .NET Core applications on Linux, let’s take a look at what it takes to do some basic performance profiling. When starting out, here are a few things I wrote down that would be nice to do:

CPU profiling (sampling) to see where the CPU bottlenecks are
Grabbing stacks for interesting system events (file accesses, network, forks, etc.)
Tracing memory management activity such as GCs and object allocations
Identifying blocked time and the block and wake-up reasons

With this task list in mind, let’s get started!

Collecting Call Stacks of .NET Core Processes

Generally speaking, a .NET Core application runs as a regular Linux process. There’s nothing particularly fancy involved, which means we can use perf and ftrace and even BPF-based tools to monitor performance. There’s just one catch: resolving symbols for call stacks. Here’s what happens when we profile a CPU-intensive application, running with defaults, using perf:

# perf record -F 97 -ag
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.364 MB perf.data (789 samples) ]
# perf report

As you can see, debugging symbols are missing for pretty much everything under the dotnet process, so we only get addresses rather than method names. Fortunately, .NET Core ships with a knob that can be turned in order to get a perf map file generated in /tmp, which perf can then find and use for symbols. To turn on the knob, export COMPlus_PerfMapEnabled=1:

$ export COMPlus_PerfMapEnabled=1
$ dotnet run &
[1] 23503

$ ls /tmp/perf*
/tmp/perf-23503.map  /tmp/perf-23517.map  /tmp/perfinfo-23503.map  /tmp/perfinfo-23517.map

$ head -2 /tmp/perfinfo-23517.map
ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b};
ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{ebb3ede4-dc41-44f4-93d3-152cd0b54ac0};

$ head -2 /tmp/perf-23517.map
00007FABB90D4480 2e instance bool [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()
00007FABB90D44D0 2e instance System.__Canon /* MT: 0x00007FABB8F60318 */ [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()

Equipped with these files, we can repeat the perf recording and then the report looks a bit better, with symbols starting to appear, such as ConsoleApplication.Primes::CountPrimes. Note that because the .NET process wrote the perf map file, you might need to tell perf to ignore the fact that it’s not owned by root by using the -f switch (perf report -f), or simply chown it.
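
The perf map format itself is very simple: each line has a start address and a size, both in hex, followed by the symbol name. To make the lookup concrete, here is a minimal Python sketch of resolving an address against such a file. The helper names are made up, and the sample lines are the two shown above.

```python
import bisect

def load_perf_map(lines):
    """Parse 'START SIZE name' lines (hex start and size) into a sorted list."""
    entries = []
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) == 3:
            start, size, name = parts
            entries.append((int(start, 16), int(size, 16), name))
    entries.sort()
    return entries

def resolve(entries, address):
    """Return the symbol whose [start, start + size) range contains the address."""
    starts = [start for start, _, _ in entries]
    index = bisect.bisect_right(starts, address) - 1
    if index >= 0:
        start, size, name = entries[index]
        if address < start + size:
            return name
    return None

sample = [
    "00007FABB90D4480 2e instance bool [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()",
    "00007FABB90D44D0 2e instance System.__Canon /* MT: 0x00007FABB8F60318 */ [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()",
]
entries = load_perf_map(sample)
print(resolve(entries, 0x7FABB90D4490))  # inside the first stub
print(resolve(entries, 0x7FABB90D44B0))  # in the gap between the two stubs, so None
```

To try it on a live process, pass the lines of /tmp/perf-PID.map for that process instead of the sample list.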

Although, who reads perf reports anyway — let’s generate a flame graph!

Getting a Flame Graph

Well, a flame graph is a flame graph, nothing special about .NET Core here once we have the right data in our perf files. Let’s go:

# git clone --depth=1 https://github.com/BrendanGregg/FlameGraph
...
# perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flame.svg

Here’s a part of the generated flame graph, looking pretty good:

If you look closely, you’ll notice that some symbols are still missing — notably, we don’t have any symbols for libcoreclr.so. And that’s just the way it is:

$ objdump -t $(find /usr/share/dotnet -name libcoreclr.so)
/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/libcoreclr.so:     file format elf64-x86-64

SYMBOL TABLE:
no symbols

If you build .NET Core from source, you can build with debug information, but that’s not what we get by default from the Microsoft package repository.

Stacks For Other Events

Now that we have the necessary building blocks for getting symbols resolved, we can of course move on to other events (and use other tools, too). For example, let’s trace context switches to see where our threads are getting blocked:

# perf record -e sched:sched_switch -ag
...
# perf report -f

(This is a fairly typical stack for where the thread gets preempted to let another thread run, even though it hasn’t called any blocking API.)

Or, let’s try some of my favorite tools from BCC. For example, let’s trace file opens:

# opensnoop
PID    COMM               FD ERR PATH
1      systemd            17   0 /proc/955/cgroup
24675  dotnet              3   0 /etc/ld.so.cache
24675  dotnet              3   0 /lib/x86_64-linux-gnu/libdl.so.2
24675  dotnet              3   0 /lib/x86_64-linux-gnu/libpthread.so.0
24675  dotnet              3   0 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
...
24689  dotnet             47   0 /home/vagrant/Runny/perfcollect
24689  dotnet             47   0 /home/vagrant/Runny/opens.txt
24689  dotnet             47   0 /home/vagrant/Runny/project.lock.json
24689  dotnet             47   0 /home/vagrant/Runny/.Program.cs.swp
24689  dotnet             47   0 /home/vagrant/Runny/Program.cs
24689  dotnet             -1  13 /home/vagrant/Runny/perf.data.old

We can conclude that everything more or less works. I dare say this is even a little easier than the JVM situation, where we need an external agent to generate debugging symbols. On the other hand, you have to run the .NET Core process with the COMPlus_PerfMapEnabled environment variable at initialization time — you can’t generate the debugging information after the process has already started without it.

But then I tried one more thing. Let’s try to aggregate file read stacks by using the stackcount tool from BCC to probe read in libpthread (which is what .NET Core’s syscalls are routed through on my box). The result is not very pretty:

$ stackcount pthread:read -p 29751
Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end.
  read
  [unknown]
  [unknown]
  [unknown]
  [unknown]
  void [Runny] ConsoleApplication.Program::Main(string[])
  [unknown]
  [unknown]
  [unknown]
  [unknown]
  [unknown]
  coreclr_execute_assembly
  coreclr::execute_assembly(void*, unsigned int, int, char const**, char const*, unsigned int*)
  run(arguments_t const&)
  corehost_main
... snipped for brevity ...
    16

The [unknown] frames prior to Main are not very surprising — this is libcoreclr.so, and we already know it doesn’t ship with debuginfo. But the top-most frames are disappointing — this is a managed assembly, with managed frames, and there’s no reason why we shouldn’t be able to trace them.

To figure out where these frames are coming from, I’m going to need addresses. With the -v switch, stackcount prints addresses in addition to symbols:

# stackcount pthread:read -v -p 29751
Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end.
^C
  7f77b10b1680     read
  7f773651f267     [unknown]
  7f773651e8d5     [unknown]
  7f773651e880     [unknown]
  7f773651846a     [unknown]
  7f77364bfb5d     void [Runny] ConsoleApplication.Program::Main(string[])
...

All right, so which module is 7f773651f267 in, for example? Let’s take a look at the loaded modules (I’m keeping only executable regions):

$ cat /proc/29751/maps | grep 'xp '
...
7f77364bf000-7f77364c6000 rwxp 00000000 00:00 0
7f7736502000-7f7736530000 r-xp 00003000 fd:00 787585                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll
7f7736534000-7f7736564000 r-xp 00003000 fd:00 787603                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.FileSystem.dll
7f7736577000-7f7736578000 r-xp 00002000 fd:00 787665                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Threading.Thread.dll
7f773657a000-7f7736587000 r-xp 00002000 fd:00 787606                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.dll
7f773658a000-7f773659a000 r-xp 00002000 fd:00 787668                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Threading.dll
7f773659d000-7f773659e000 r-xp 00002000 fd:00 787658                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Text.Encoding.dll
...
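
Eyeballing address ranges gets old quickly, so here is a minimal Python sketch of the same lookup. It is not part of any tool mentioned in this post; the find_mapping name and the SAMPLE_MAPS excerpt are purely illustrative, and it assumes the usual /proc/PID/maps layout of "start-end perms offset dev inode [path]".

```python
def find_mapping(maps_text, address):
    """Return (start, end, perms, path) for the mapping containing address, or None."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5 or "-" not in fields[0]:
            continue  # skip blank or elided ("...") lines
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            # Anonymous regions (such as JIT-compiled code) have no path field.
            path = fields[5].strip() if len(fields) > 5 else ""
            return start, end, fields[1], path
    return None


# Hypothetical input: a few lines copied from the maps listing above.
SAMPLE_MAPS = """\
7f77364bf000-7f77364c6000 rwxp 00000000 00:00 0
7f7736502000-7f7736530000 r-xp 00003000 fd:00 787585                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll
7f7736534000-7f7736564000 r-xp 00003000 fd:00 787603                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.FileSystem.dll
"""

hit = find_mapping(SAMPLE_MAPS, 0x7f773651f267)
print(hit[3] if hit else "no mapping found")
```

On a live system you would pass the contents of /proc/PID/maps instead of the sample string.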

OK, so we seem to be making progress — the desired address is clearly in the range that belongs to System.Console.dll. But, being a managed assembly, we’re not going to find any debug information in it:

$ file /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll
/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll: PE32+ executable (DLL) (console) Mono/.Net assembly, for MS Windows

$ objdump -tT /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll
objdump: /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll: File format not recognized

Hmpf. So how are we supposed to get symbolic information for these addresses?

If you look online, you’ll find that there’s a tool on the .NET Core repos called perfcollect — essentially a Bash script for collecting performance information from .NET Core processes running on Linux. Let’s take a look.

The perfcollect Tool

The perfcollect tool is fairly self-contained and installs its own dependencies, most notably perf and lttng (.NET Core on Linux uses LTTng to generate various events, including garbage collections, object allocations, thread starts, assembly loads, and many others). Then, perfcollect follows your instructions and runs perf and lttng to collect CPU sampling events, packages them into a big zip file, and hands that to you.

What are you supposed to do with that zip file? Open it on Windows, apparently, using PerfView. Now, I love PerfView, but a face palm is the only reasonable reaction to hearing this. What’s more, perfcollect does a bunch of work that you don’t really need if you plan to analyze the results on the same machine. But there’s one thing it does which sounds very relevant:

WriteStatus "Generating native image symbol files"

# Get the list of loaded images and use the path to libcoreclr.so to find crossgen.
# crossgen is expected to sit next to libcoreclr.so.
local buildidList=`$perfcmd buildid-list | grep libcoreclr.so | cut -d ' ' -f 2`

That definitely sounds good! Turns out that .NET Core writes out an additional map file, named /tmp/perfinfo-$PID.map, which contains a list of image load events for your application’s assemblies. perfcollect then parses that list and invokes the crossgen tool to generate an additional perf map for each assembly, which can be fed into PerfView on the Windows side. Here’s what the perfinfo file looks like:

$ head -4 /tmp/perfinfo-29751.map
ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b};
ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{319d161b-f17e-44f6-a210-f297df920194};
ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Runtime.dll;{819d412e-d773-4dbb-8d01-20d412b6cf09};
ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/mscorlib.dll;{080dac22-6a0e-41ae-85fb-fb79cc07911b};
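
The format is simple enough to parse by hand. The following is a rough Python sketch, assuming every record looks like the ImageLoad;path;{guid}; lines above; parse_perfinfo and SAMPLE_PERFINFO are names I made up for illustration and do not come from perfcollect.

```python
def parse_perfinfo(text):
    """Return a list of (assembly_path, guid) tuples for ImageLoad records."""
    images = []
    for line in text.splitlines():
        parts = line.strip().split(";")
        if len(parts) >= 3 and parts[0] == "ImageLoad":
            # The GUID is wrapped in curly braces in the file; drop them.
            images.append((parts[1], parts[2].strip("{}")))
    return images


# Hypothetical input: two of the records shown above.
SAMPLE_PERFINFO = """\
ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b};
ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{319d161b-f17e-44f6-a210-f297df920194};
"""

for path, guid in parse_perfinfo(SAMPLE_PERFINFO):
    print(guid, path)
```

Each path recovered this way is a candidate input for generating a per-assembly perf map.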

Now, that’s what crossgen is supposed to do. And according to the comment above, crossgen is also supposed to be in the same folder as libcoreclr.so. But it isn’t:

$ find /usr/share/dotnet -name crossgen

That’s right, no results. Looking online, it seems that crossgen is generated as part of a .NET Core build, and part of the CoreCLR runtime NuGet package, but it’s not part of the pre-packaged binaries you get from the Microsoft package repositories. But with a little effort borrowed from the corefx repo, we can fetch our own crossgen:

$ export CoreClrVersion=1.1.0
$ export Rid=$(dotnet --info | sed -n -e 's/^.*RID:[[:space:]]*//p')
$ echo "{\"frameworks\":{\"netcoreapp1.1\":{\"dependencies\":{\"Microsoft.NETCore.Runtime.CoreCLR\":\"$CoreClrVersion\", \"Microsoft.NETCore.Platforms\": \"$CoreClrVersion\"}}},\"runtimes\":{\"$Rid\":{}}}" > project.json
$ dotnet restore ./project.json --packages .
... output omitted for brevity ...
$ ls ./runtime.$Rid.Microsoft.NETCore.Runtime.CoreCLR/$CoreClrVersion/tools
crossgen

All right! So we have crossgen, at which point we can try it out to generate debug information for System.Console.dll, or any other assembly we need, really. Here goes:

$ crossgen /Platform_Assemblies_Paths /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0 \
           /CreatePerfMap . /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll

Microsoft (R) CoreCLR Native Image Generator - Version 4.5.22220.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Successfully generated perfmap for native assembly '/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll'.

What does this perfmap file look like? The same as any other perfmap, except the addresses are not absolute — they are offsets from the load address of that module:

$ head -4 System.Console.ni.\{3b33b403-e8c1-44af-a7fb-369b2603f2a3\}.map
0000000000017590 58 void [System.Console] Interop::ThrowExceptionForIoErrno(valuetype Interop/ErrorInfo,string,bool,class [System.Runtime]System.Func`2)
00000000000175F0 4d void [System.Console] Interop::CheckIo(valuetype Interop/Error,string,bool,class [System.Runtime]System.Func`2)
0000000000017640 82 int64 [System.Console] Interop::CheckIo(int64,string,bool,class [System.Runtime]System.Func`2)
00000000000176D0 17 int32 [System.Console] Interop::CheckIo(int32,string,bool,class [System.Runtime]System.Func`2)

Well, let’s see if we can at least resolve our desired address by using this approach. If you go back above, we were chasing the address 7f773651f267, loaded into System.Console.dll. First, let’s find the base address where System.Console.dll is loaded:

$ cat /proc/29751/maps | grep System.Console.dll | head -1
7f77364ff000-7f7736500000 r--p 00000000 fd:00 787585                     /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll

The offset, then, is:

$ echo 'ibase=16;obase=10;7F773651F267-7F77364FF000' | bc
20267

So now we need to look for this offset in the System.Console map file. The closest match is here:

0000000000020150 286 instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&)
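
The manual lookup above boils down to subtracting the load address and finding the map entry whose range contains the resulting offset. Here is a small Python sketch of that idea, assuming the map lines follow the usual perf map layout of hex start, hex size, and symbol name. The function names and the SAMPLE_MAP excerpt are illustrative only.

```python
import bisect


def parse_perf_map(text):
    """Parse 'START SIZE NAME' lines (hex start and size) into a sorted list."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries


def resolve(entries, load_address, address):
    """Return the symbol covering address, or None if no entry contains it."""
    offset = address - load_address
    starts = [entry[0] for entry in entries]
    i = bisect.bisect_right(starts, offset) - 1  # last entry starting at or before offset
    if i < 0:
        return None
    start, size, name = entries[i]
    return name if offset < start + size else None


# Hypothetical input: map lines copied from the crossgen output above.
SAMPLE_MAP = """\
0000000000017590 58 void [System.Console] Interop::ThrowExceptionForIoErrno(valuetype Interop/ErrorInfo,string,bool,class [System.Runtime]System.Func`2)
00000000000175F0 4d void [System.Console] Interop::CheckIo(valuetype Interop/Error,string,bool,class [System.Runtime]System.Func`2)
0000000000020150 286 instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&)
"""

entries = parse_perf_map(SAMPLE_MAP)
print(resolve(entries, 0x7f77364ff000, 0x7f773651f267))
```

Checking the size as well as the start address matters: without it, an address that falls in a gap between two methods would be silently attributed to the preceding one.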

With that, we have one frame resolved! There are only a few more to go. This process begs to be automated. It would be great to automatically run crossgen, generate the map files with the relative addresses, convert them to absolute addresses, and merge them with the main /tmp/perf-PID.map file that other tools know and love. Read on!

dotnet-mapgen.py

Well, I wrote a small script called dotnet-mapgen.py that automates the above steps and produces a single, unified map file that contains both JIT-compiled addresses and addresses that lie in crossgen’d (AOT-compiled) modules, such as System.Console.dll. The script has two modes:

$ ./dotnet-mapgen.py generate $(pgrep -n dotnet)
couldn't find crossgen, trying to fetch it automatically...
crossgen succesfully downloaded and placed in libcoreclr's dir
crossgen map generation: 15 succeeded, 2 failed

In the “generate” mode, the script first locates crossgen (downloading it if necessary, using the NuGet restore approach shown above), and then runs crossgen on all the managed assemblies loaded into the target process. The 2 failures in the above output are for assemblies that weren’t AOT-compiled. Note that this generation step can be done once, and the map files retained for subsequent runs — unless you change the set of AOT-compiled assemblies loaded into your process.

$ ./dotnet-mapgen.py merge $(pgrep -n dotnet)
perfmap merging: 14 succeeded, 3 failed

In the “merge” mode, the script calculates absolute addresses for all the symbols generated in the previous step, and concatenates this information to the main /tmp/perf-PID.map file for the target process.
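
To make the merge step concrete, here is a simplified Python sketch of the rebasing. It is not the actual dotnet-mapgen.py code: rebase_map is a hypothetical helper, and it assumes the load address is the start of the module’s first mapping in /proc/PID/maps, as in the manual walkthrough earlier.

```python
def rebase_map(relative_map_text, load_address):
    """Convert module-relative perf map lines into absolute-address lines."""
    rebased = []
    for line in relative_map_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start = int(parts[0], 16) + load_address
        # Keep the size and symbol name untouched; only the start moves.
        rebased.append("%x %s %s" % (start, parts[1], parts[2]))
    return rebased


# Hypothetical input: one line from the System.Console map shown earlier.
RELATIVE = "0000000000020150 286 instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&)\n"

for entry in rebase_map(RELATIVE, 0x7f77364ff000):
    print(entry)
```

The resulting lines have the same shape as the JIT entries in /tmp/perf-PID.map, so they can simply be appended to that file.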

There’s just one final problem. Turns out, perf refuses to use the map file for symbols that are in memory regions that belong to a module (in our case above, System.Console.dll). And there’s no way to convince perf that it should try to resolve such addresses using the map file. Fortunately, I have a bit more control over BCC tools, so I proposed a PR for retrying symbol resolution using a map file if the symbol wasn’t resolved using the original module. With this patch, here’s stackcount’s output:

# stackcount ... pthread:read
Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end.
^C
  read
  instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&)
  instance string [System.Console] System.IO.StdInReader::ReadLine(bool)
  instance string [System.Console] System.IO.StdInReader::ReadLine()
  string [System.Console] System.Console::ReadLine()
  void [Runny] ConsoleApplication.Program::Main(string[])
...
  16

Note how all symbols are now resolved to managed frames: the JIT-compiled Program::Main, and the AOT-compiled Console::ReadLine, StdInReader::ReadLine, and everything else.

Once this support lands in BCC, we can also do full-fidelity profiling with the profile tool, stack tracing with trace and stackcount, blocked time analysis using offcputime/offwaketime, and a variety of other tools. For most purposes, the perf-based workflow shown in the beginning of the post is a poorer alternative, if you can run a recent-enough kernel with BPF support.

So Where Are We?

We can use a variety of Linux performance tools to monitor .NET Core processes on Linux, including perf and BCC tools
To resolve stacks and symbols in general, the COMPlus_PerfMapEnabled environment variable needs to be set to 1 prior to running the .NET Core process
Some binaries still ship out of the box with no debug information (notably libcoreclr.so)
Some managed assemblies aren’t included in the dynamic /tmp/perf-PID.map file because they were compiled ahead-of-time (using crossgen), and don’t contain debugging information
For these assemblies, crossgen can generate map files that are sort-of useful, but can’t be used directly with perf
The dotnet-mapgen script can automate the process of generating map files for AOT-compiled assemblies and merging them into the main map file for analysis
BCC tools will be updated to support this scenario and enable full-fidelity tracing

In a subsequent post, I also plan to explore the LTTng traces to see if we can trace garbage collections, object allocations, managed exceptions, and other events of interest.

You can also follow me on Twitter, where I put stuff that doesn’t necessarily deserve a full-blown blog post.

Analyzing a .NET Core Core Dump on Linux

Sasha Goldshtein — Sun, 26 Feb 2017 14:31:29 +0000

Recently, I had to open a core dump of a .NET Core application on Linux. I thought this walkthrough might be useful if you find yourself in the same boat, because, to be quite honest, I didn’t find it trivial.

Configure Linux to Generate Core Dumps

Before you begin, you need to configure your Linux box to generate core dumps in the first place. A lot of distros will have something preconfigured, but the simplest approach is to just put a file name in the /proc/sys/kernel/core_pattern file:

# echo core > /proc/sys/kernel/core_pattern

Additionally, there’s a system limit on the maximum size of the generated core file. Running ulimit -c unlimited in the shell that launches the process removes that limit. Now, whenever your .NET Core process (or any other process) crashes, you’ll get a core file generated in the process’s current working directory. By the way, .NET Core on Linux x86_64 reserves a pretty gigantic address space, so expect your core files to be pretty big. But compression helps: I had a 6.5GB core dump compress into a 59MB gzip file.

Installing LLDB

To open the core dump, you’ll need LLDB built with the same architecture as your CoreCLR. Here’s how I found out what I needed:

$ find /usr/share/dotnet -name libsosplugin.so
/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/libsosplugin.so

$ ldd $(find /usr/share/dotnet -name libsosplugin.so) | grep lldb
liblldb-3.5.so.1 => /usr/lib/x86_64-linux-gnu/liblldb-3.5.so.1 (0x00007f0a6b2d8000)

Seeing that LLDB 3.5 was required, I installed it with sudo apt install lldb-3.5, but YMMV on other distros, of course.

Opening The Core File And Loading SOS

Now you’re ready to open the core file in LLDB. If you’re doing this on a different box, you’ll need the same version of .NET Core installed — that’s where the dotnet binary, SOS itself, and the DAC (debugger data access component) are coming from. You could also copy the /usr/share/dotnet/shared/Microsoft.NETCore.App/nnnn directory over, of course.

$ lldb $(which dotnet) --core ./core

Once inside LLDB, you’ll need to load the SOS plugin. It’s the one we found earlier:

(lldb) plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.1/libsosplugin.so

If everything went well, the plugin is now loaded. SOS also needs the DAC (libmscordaccore.so), so you’ll need to tell it where to look:

(lldb) setclrpath /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.1

With that, SOS should be loaded and ready for use.

Running Analysis

You’d think you can just start running the SOS commands you know and love, but there’s one final hurdle. Here’s what happened when I opened a core file generated from a crash, and tried to get the exception information (note that you should prefix SOS commands with ‘sos’):

(lldb) sos PrintException
The current thread is unmanaged

… which is kind of odd, right? Considering that the process crashed as a result of a managed exception. Looking at the docs, it looks like SOS and LLDB have trouble communicating around the current thread’s identity. So first, let’s find the thread that encountered an exception:

(lldb) sos Threads
ThreadCount:      13
UnstartedThread:  0
BackgroundThread: 11
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                                                                                        Lock
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
XXXX    1 57ff 0000000000C2B380  2020020 Preemptive  (nil):(nil)                       0000000000C195C0 0     Ukn
XXXX    2 5807 0000000000CAAF80    21220 Preemptive  0x7f5ad2fcbc40:0x7f5ad2fcdae0     0000000000C195C0 0     Ukn (Finalizer)
XXXX    4 580a 0000000000DC2730    21220 Preemptive  (nil):(nil)                       0000000000C195C0 0     Ukn
XXXX    6 580d 0000000000EC1D70    21220 Preemptive  0x7f5ad576b4d0:0x7f5ad576cf58     0000000000C195C0 0     Ukn
XXXX    7 5a13 00007F5ABC0292A0  1021220 Preemptive  0x7f5ad5888d30:0x7f5ad5888fd0     0000000000C195C0 0     Ukn (Threadpool Worker)
XXXX    8 5a15 00007F5AC006A3F0    21020 Preemptive  0x7f5ad594dd10:0x7f5ad594ece8     0000000000C195C0 0     Ukn System.IO.FileNotFoundException 00007f5ad593fa80 (nested exceptions)
XXXX    9 5a16 00007F5AC00916A0    21220 Preemptive  (nil):(nil)                       0000000000C195C0 0     Ukn
XXXX   10 5a17 00007F5AC80015D0  1021220 Preemptive  0x7f5ad593a9a0:0x7f5ad593b978     0000000000C195C0 0     Ukn (Threadpool Worker)
XXXX    5 5a18 00007F5AC0814DF0    21220 Preemptive  0x7f5ad50ed1b8:0x7f5ad50eefd0     0000000000C195C0 0     Ukn
XXXX    3 5a19 00007F5C54000A00  1020220 Preemptive  (nil):(nil)                       0000000000C195C0 0     Ukn (Threadpool Worker)
XXXX   11 5a1a 00007F5C50019270  1021220 Preemptive  0x7f5ad58a5710:0x7f5ad58a6fd0     0000000000C195C0 0     Ukn (Threadpool Worker)
XXXX   12 5a1b 00007F5AC0831B80  1021220 Preemptive  0x7f5ad58fcf68:0x7f5ad58fd000     0000000000C195C0 0     Ukn (Threadpool Worker)
XXXX   13 5a1c 0000000000E8F720  1021220 Preemptive  0x7f5ad593bc80:0x7f5ad593d978     0000000000C195C0 0     Ukn (Threadpool Worker)

Thread #8 looks suspicious, what with the System.IO.FileNotFoundException in the Exception column. Now, let’s see all the LLDB threads:

(lldb) thread list
Process 0 stopped
* thread #1: tid = 0, 0x00007f5c5d83b7ef libc.so.6`__GI_raise(sig=2) + 159 at raise.c:58, name = 'dotnet', stop reason = signal SIGABRT
  thread #2: tid = 1, 0x00007f5c5e482510 libpthread.so.0`__pthread_cond_wait + 256, stop reason = signal SIGABRT
  thread #3: tid = 2, 0x00007f5c5d907d29 libc.so.6`syscall + 25, stop reason = signal SIGABRT
  thread #4: tid = 3, 0x00007f5c5d907d29 libc.so.6`syscall + 25, stop reason = signal SIGABRT
... more threads snipped for brevity ...

Here, it looks like thread 1 is the one with the exception being raised. So we have to map the OS thread ID from the first command, to the LLDB thread id from the second command:

(lldb) setsostid 5a15 1
Mapped sos OS tid 0x5a15 to lldb thread index 1

And now, we’re ready to roll:

(lldb) sos PrintException
Exception object: 00007f5ad593fa80
Exception type:   System.IO.FileNotFoundException
Message:          Could not load the specified file.
InnerException:   
StackTrace (generated):
    SP               IP               Function
    00007F5C45D227C0 00007F5BE37412E7 System.Private.CoreLib.ni.dll!System.Runtime.Loader.AssemblyLoadContext.ResolveUsingEvent(System.Reflection.AssemblyName)+0x20ab07
    00007F5C45D227F0 00007F5BE353664F System.Private.CoreLib.ni.dll!System.Runtime.Loader.AssemblyLoadContext.ResolveUsingResolvingEvent(IntPtr, System.Reflection.AssemblyName)+0x4f

StackTraceString: 
HResult: 80070002

Nested exception -------------------------------------------------------------
Exception object: 00007f5ad593dea0
Exception type:   System.InvalidOperationException
Message:          Authorization cannot be requested before logging in.
InnerException:   
StackTrace (generated):
    SP               IP               Function
    00007F5C45D29890 00007F5BE63002FE kitt3ns.dll!WebApplication.Controllers.AuthorizationBackgroundWorker.VerifyAuthorized(System.String)+0xae
    00007F5C45D298D0 00007F5BE630022B kitt3ns.dll!WebApplication.Controllers.AuthorizationBackgroundWorker.RequestAuthorization()+0x2b
    00007F5C45D298E0 00007F5BE55BC31C kitt3ns.dll!WebApplication.Controllers.AuthorizationBackgroundWorker+<>c.b__0_0()+0x4c
    00007F5C45D29910 00007F5BE33BDF11 System.Private.CoreLib.ni.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x111

StackTraceString: 
HResult: 80131509

(lldb) sos ClrStack
OS Thread Id: 0x5a15 (1)
        Child SP               IP Call Site
00007F5C45D272C8 00007f5c5d83b7ef [HelperMethodFrame: 00007f5c45d272c8]
00007F5C45D273E0 00007F5BE33BDF11 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
00007F5C45D29770 00007f5c5cbe9bad [HelperMethodFrame: 00007f5c45d29770]
00007F5C45D29890 00007F5BE63002FE WebApplication.Controllers.AuthorizationBackgroundWorker.VerifyAuthorized(System.String) [/home/vagrant/kitt3ns/Controllers/AccountController.cs @ 37]
00007F5C45D298D0 00007F5BE630022B WebApplication.Controllers.AuthorizationBackgroundWorker.RequestAuthorization() [/home/vagrant/kitt3ns/Controllers/AccountController.cs @ 30]
00007F5C45D298E0 00007F5BE55BC31C WebApplication.Controllers.AuthorizationBackgroundWorker+<>c.b__0_0() [/home/vagrant/kitt3ns/Controllers/AccountController.cs @ 24]
00007F5C45D29910 00007F5BE33BDE71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
00007F5C45D29B50 00007f5c5cbfb207 [GCFrame: 00007f5c45d29b50] 
00007F5C45D29D30 00007f5c5cbfb207 [DebuggerU2MCatchHandlerFrame: 00007f5c45d29d30]

This gives us the exception information and the thread’s current stack, if we want it. We could similarly inspect other threads by mapping the OS thread id to the LLDB thread id, but for a thread that didn’t have an exception, where do you get the clue that connects the OS thread id to the debugger thread id? Well, it seems that GDB uses the same numbering as LLDB, but in GDB you can actually see the LWP id (on Linux, GDB LWP = kernel pid = thread) using ‘info threads’:

$ gdb $(which dotnet) --core ./core
...

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f5c45d2a700 (LWP 23061) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
  2    Thread 0x7f5c5eaab740 (LWP 22527) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219
  3    Thread 0x7f5c5b411700 (LWP 22529) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7f5c5ac10700 (LWP 22530) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  5    Thread 0x7f5c5a40f700 (LWP 22531) 0x00007f5c5d9020bd in poll () at ../sysdeps/unix/syscall-template.S:84
  6    Thread 0x7f5c59c0e700 (LWP 22532) 0x00007f5c5e485d8d in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:84
  7    Thread 0x7f5c5940d700 (LWP 22533) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219
  8    Thread 0x7f5c589b2700 (LWP 22534) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219
  9    Thread 0x7f5c498ae700 (LWP 22535) 0x00007f5c5e4828b9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:258
  10   Thread 0x7f5c454ef700 (LWP 22538) 0x00007f5c5e4856ed in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84
  11   Thread 0x7f5ad2324700 (LWP 22540) 0x00007f5c5e4856ed in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84
  12   Thread 0x7f5ad1b23700 (LWP 22541) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  13   Thread 0x7f5ad2b25700 (LWP 23059) 0x00007f5c5e4828b9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:258
... more output snipped for brevity ...

So, for example, suppose we wanted to know what managed thread #6 (OS thread id 0x580d from the ‘sos Threads’ output above) was doing when the dump file was generated. 0x580d = 22541, which is thread #12 in the output above. Going back to LLDB (note the hex notation for both thread ids):

(lldb) setsostid 580d c
Mapped sos OS tid 0x580d to lldb thread index 12

(lldb) clrstack
OS Thread Id: 0x580d (12)
        Child SP               IP Call Site
00007F5AD1B227F8 00007f5c5d907d29 [InlinedCallFrame: 00007f5ad1b227f8] Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.Libuv+NativeMethods.uv_run(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)
00007F5AD1B227F8 00007f5be45cea3a [InlinedCallFrame: 00007f5ad1b227f8] Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.Libuv+NativeMethods.uv_run(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)
00007F5AD1B227E0 00007F5BE45CEA3A DomainBoundILStubClass.(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)
00007F5AD1B22890 00007F5BE45CE968 Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.Libuv.run(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)
00007F5AD1B228B0 00007F5BE45CBCFF Microsoft.AspNetCore.Server.Kestrel.Internal.KestrelThread.ThreadStart(System.Object)
00007F5AD1B22910 00007F5BE33BDE71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
00007F5AD1B22B50 00007f5c5cbfb207 [GCFrame: 00007f5ad1b22b50]
00007F5AD1B22D30 00007f5c5cbfb207 [DebuggerU2MCatchHandlerFrame: 00007f5ad1b22d30]
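
Converting between hex and decimal thread ids by hand is error-prone, so here is a small Python sketch of the bookkeeping. It assumes, as observed above, that GDB and LLDB number the threads of the core file identically. The helper names and the SAMPLE_INFO_THREADS excerpt are hypothetical and only serve the example.

```python
import re

# Matches lines such as "* 1    Thread 0x7f5c45d2a700 (LWP 23061) ..."
THREAD_LINE = re.compile(r"^\*?\s*(\d+)\s+Thread\s+0x[0-9a-f]+\s+\(LWP\s+(\d+)\)")


def lwp_to_index(info_threads_text):
    """Map each LWP id (decimal) to its GDB thread index."""
    mapping = {}
    for line in info_threads_text.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            mapping[int(match.group(2))] = int(match.group(1))
    return mapping


def setsostid_command(osid_hex, mapping):
    """Build the setsostid command for an OSID taken from 'sos Threads'."""
    lwp = int(osid_hex, 16)  # 'sos Threads' prints the OS thread id in hex
    return "setsostid %s %x" % (osid_hex, mapping[lwp])


# Hypothetical input: a few lines from the 'info threads' output above.
SAMPLE_INFO_THREADS = """\
* 1    Thread 0x7f5c45d2a700 (LWP 23061) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
  2    Thread 0x7f5c5eaab740 (LWP 22527) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219
  12   Thread 0x7f5ad1b23700 (LWP 22541) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
"""

threads = lwp_to_index(SAMPLE_INFO_THREADS)
print(setsostid_command("580d", threads))
print(setsostid_command("5a15", threads))
```

Both arguments of the generated command are in hex, matching what setsostid expects in the examples above.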

Other SOS commands that don’t depend on thread context (e.g. listing assemblies, heap objects, finalization queues and so on) do not require any fiddling with thread ids, and you can just run them directly.

Summary

So, what we had to do in order to open a .NET Core core dump from a Linux system was:

Set up the Linux system to generate core dumps on crash
Copy or install the right version of .NET Core on the analysis machine
Install the version of LLDB matching your .NET Core’s SOS plugin
Load the SOS plugin in LLDB and tell it where to find the DAC
Set the debugger thread id for SOS thread-sensitive commands to work
Run sos PrintException or any other commands to analyze the crash

Fun fun fun.
