<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>All Your Base Are Belong To Us</title>
	<atom:link href="http://blogs.microsoft.co.il/sasha/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.microsoft.co.il/sasha</link>
	<description>Mostly .NET internals and other kinds of gory details</description>
	<lastBuildDate>Thu, 03 May 2018 13:13:44 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.0.2</generator>
	<item>
		<title>Sasha &#062;&#062;= Sela &#062;&#062;= Google</title>
		<link>http://blogs.microsoft.co.il/sasha/2018/05/03/sasha-sela-google/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2018/05/03/sasha-sela-google/#respond</comments>
		<pubDate>Thu, 03 May 2018 13:13:44 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[RandomThoughts]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2819005</guid>
		<description><![CDATA[Looking back to 2007, I don’t think I imagined this blog would survive this far. I had an absolute blast writing here and sharing my thoughts and rants with you. Just to be clear, my blog is not going anywhere &#8212; it’s just a moment of reminiscence for me as I share a bit of [&#8230;]]]></description>
				<content:encoded><![CDATA[<p><span style="font-weight: 400">Looking back to 2007, I don’t think I imagined this blog would survive this far. I had an absolute blast writing here and sharing my thoughts and rants with you. Just to be clear, my blog is not going anywhere &#8212; it’s just a moment of reminiscence for me as I share a bit of personal news.</span></p>
<p><span style="font-weight: 400">Today was my last day at Sela after 11.5 years. If you follow my blog, or if you just check out the archives section, you’ll see that I did a lot of different things over the years. From the deep internals of the .NET runtime on Windows, through cloud services like Azure Media Services, all the way to Linux containers and Java performance optimization, I kept changing focus every year or two and that’s part of what I kept me going for so long: my managers and the entire organization had my back for whatever experiments, ideas, or directions I wanted to pursue at the moment.</span></p>
<p><span style="font-weight: 400">Sela was an amazing place to work at, and I was constantly surrounded by top-notch technology experts, good friends and colleagues, and incredibly outgoing and welcoming people &#8212; with a very big overlap between these groups <img src="https://s.w.org/images/core/emoji/11/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></p>
<p><span style="font-weight: 400">My next position is a software engineering role at Google Research in Tel-Aviv, where I will be working on machine learning solutions. I don’t know when and where I’ll be able to write more, but I definitely look forward to sharing some of it with you in the future if I can.</span></p>
<p><span style="font-weight: 400">Thanks for following me so far, and please feel free to stay in touch &#8212; I know some of you are decade-long readers of this blog. My DMs on Twitter are open (<a href="https://twitter.com/goldshtn">@goldshtn</a>), and you can also shoot me an email (same handle, @gmail).</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2018/05/03/sasha-sela-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dynamic Tracing of .NET Core Methods</title>
		<link>http://blogs.microsoft.co.il/sasha/2018/02/08/dynamic-tracing-of-net-core-methods/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2018/02/08/dynamic-tracing-of-net-core-methods/#respond</comments>
		<pubDate>Thu, 08 Feb 2018 07:32:37 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[.NETInternals]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2819001</guid>
		<description><![CDATA[tl;dr I wrote a simple proof-of-concept tool called place-probe.py which helps place dynamic tracepoints on .NET methods. For example: place-probe.py $PID 'System.Threading.Thread::Sleep'. Dynamic tracing is one of the Linux diagnostics superpowers. By adding dynamic tracepoints on arbitrary functions across the system, you can diagnose a variety of “impossible” bugs and performance problems on a live [&#8230;]]]></description>
				<content:encoded><![CDATA[<blockquote><p><strong>tl;dr</strong> I wrote a simple proof-of-concept tool called <a href="https://gist.github.com/goldshtn/56d3daae842ccad44612cdacd68bac29">place-probe.py</a> which helps place dynamic tracepoints on .NET methods. For example: <code>place-probe.py $PID 'System.Threading.Thread::Sleep'</code>.</p></blockquote>
<p>Dynamic tracing is one of the Linux diagnostics superpowers. By adding dynamic tracepoints on arbitrary functions across the system, you can diagnose a variety of “impossible” bugs and performance problems on a live production application without having to add instrumentation, rebuild, and restart. The underlying kernel mechanism that makes dynamic tracing possible is called uprobes (for userspace) and kprobes (for kernel functions).</p>
<p>Unfortunately, uprobes can only be placed on code that is backed by an on-disk image. In other words, not generated code, which was compiled at runtime. This precludes runtimes like JVM or CLR from using uprobes, because Java bytecode and CLR intermediate language instructions are compiled to machine code on the fly, and not backed by a disk image.</p>
<p>But the CLR has a trick up its sleeve: ahead-of-time compilation. On Windows, this is known as NGen, and the .NET Core cross-platform mechanism is called CrossGen. This is a tool that invokes the JIT compiler (libclrjit.so) on an assembly and stores the compilation results in a <em>native image</em>, which contains machine code instructions. These native images are then loaded into memory and executed directly, and because they are backed by a disk image, they can be traced with dynamic tracepoints!</p>
<p>The actual work of placing a dynamic probe on a CrossGen-compiled image is the following. You need the method’s offset from the image base, and then you place the probe with something like:</p>
<pre><code>perf probe -x /path/to/MyImage.dll --add 0xbadcafe
</code></pre>
<p>The only problem is finding the offset that corresponds to a given managed method. The general approach is as follows:</p>
<ol>
<li>Use the <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/building/crossgen.md">crossgen</a> command-line tool to generate debug information for all the CrossGen-compiled assemblies. This produces .map files in a simple format that contains the method offset and name.</li>
<li>Find the desired managed method in the .map files. The map entry will look like the following, where the offset (in the first column) is the offset from the base address where the native image is loaded (let’s call it $METHODOFFSET):</li>
</ol>
<pre><code>0000000000020D70 36 instance void [app] app.Employee::Work()
</code></pre>
<ol start="3">
<li>Find the native assembly’s load address and first executable section in <strong>/proc/$PID/maps</strong>. We need the offset of the executable section from the assembly’s load address (let’s call it $EXEOFFSET), and the offset within the on-disk image ($DISKOFFSET). Here’s an example for System.Console.dll – the executable section starts at 7f7e038a1000, while the first section is at 7f7e03880000, so the difference is 0x21000; and the on-disk offset for the executable section is the third column, which is 0x1000.</li>
</ol>
<pre><code>7f7e03880000-7f7e03881000 r--p 00000000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e03890000-7f7e03892000 rw-p 00000000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e038a1000-7f7e038cd000 r-xp 00001000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
7f7e038dc000-7f7e038dd000 r--p 0002c000 ca:01 537652                     /home/ubuntu/dotnet/out/System.Console.dll
</code></pre>
<ol start="4">
<li>Now, compute $PROBEOFFSET = $METHODOFFSET &#8211; $EXEOFFSET + $DISKOFFSET. This is the offset that we need to place the dynamic probe on in order to trace the managed method.</li>
</ol>
<p>The above process is encapsulated by a POC tool I wrote called <a href="https://gist.github.com/goldshtn/56d3daae842ccad44612cdacd68bac29">place-probe.py</a>, which performs the above computations and places the probe for you, or prints the required command, if given the <strong>–dry-run</strong> switch. Here’s a simple example:</p>
<pre><code>$ ./place-probe.py $(pidof app) 'System.Threading.Thread::Sleep(int32)'
Added new event:
  probe_System:abs_4d6610 (on 0x4d6610 in /home/ubuntu/dotnet/out/System.Private.CoreLib.dll)

You can now use it in all perf tools, such as:

    perf record -e probe_System:abs_4d6610 -aR sleep 1

Added new event:
  probe_System:abs_5920 (on 0x5920 in /home/ubuntu/dotnet/out/System.Threading.Thread.dll)

You can now use it in all perf tools, such as:

    perf record -e probe_System:abs_5920 -aR sleep 1

$ sudo perf record -e probe_System:* -ag -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.136 MB perf.data (20 samples) ]

$ sudo chown $USER perf.data
$ perf script | head
Failed to open /home/ubuntu/dotnet/out/System.Threading.Thread.dll, continuing without symbols
Failed to open [kernel.kallsyms], continuing without symbols
Failed to open /home/ubuntu/dotnet/out/System.Private.CoreLib.dll, continuing without symbols
app 29891 [001] 154218.288270: probe_System:abs_5920: (7f7e03855920)
                5920 [unknown] (/home/ubuntu/dotnet/out/System.Threading.Thread.dll)
              256f07 CallDescrWorkerInternal (/home/ubuntu/dotnet/out/libcoreclr.so)
              167ce0 MethodDescCallSite::CallTargetWorker (/home/ubuntu/dotnet/out/libcoreclr.so)
              278c03 RunMain (/home/ubuntu/dotnet/out/libcoreclr.so)
              278ea3 Assembly::ExecuteMainMethod (/home/ubuntu/dotnet/out/libcoreclr.so)
               aa3fb CorHost2::ExecuteAssembly (/home/ubuntu/dotnet/out/libcoreclr.so)
               84dd6 coreclr_execute_assembly (/home/ubuntu/dotnet/out/libcoreclr.so)
               8a433 coreclr::execute_assembly (/home/ubuntu/dotnet/out/libhostpolicy.so)
               7f0d8 run (/home/ubuntu/dotnet/out/libhostpolicy.so)

$ sudo perf probe --del=*
</code></pre>
<p>To use this with your own application binaries (and not just CrossGen-compiled .NET Core assemblies), run CrossGen on them. Here’s an example that assumes you’ve used <code>dotnet publish --self-contained</code> such that all .NET dependencies are in the <strong>out</strong> directory:</p>
<pre><code>crossgen /Platform_Assemblies_Paths out out/app.dll
</code></pre>
<p>After doing this, you can replace the original out/app.dll with the generated out/app.ni.dll (or out/app.ni.exe for the main executable) and use <a href="http://place-probe.py">place-probe.py</a> on that binary.</p>
<p>Oh, and where does CrossGen come from? You can either build it from source, or download it from the .NET Core NuGet packages. My <a href="https://gist.github.com/goldshtn/fe3f7c3b10ec7e5511ae755abaf52172">dotnet-mapgen-v2.py</a> script can help, among other things, with downloading CrossGen automatically and generating the required map files.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2018/02/08/dynamic-tracing-of-net-core-methods/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Getting Stacks for LTTng Events with .NET Core on Linux</title>
		<link>http://blogs.microsoft.co.il/sasha/2018/02/06/getting-stacks-for-lttng-events-with-net-core-on-linux/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2018/02/06/getting-stacks-for-lttng-events-with-net-core-on-linux/#respond</comments>
		<pubDate>Tue, 06 Feb 2018 08:08:20 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818999</guid>
		<description><![CDATA[On Windows, .NET contains numerous very useful ETW events, which can be used for tracing garbage collections, assembly loading, exceptions thrown, object allocations, and other interesting scenarios. All events can come with a stack trace, which helps understand where they’re coming from. In fact, I’d say for some events, not getting the stack trace means [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>On Windows, .NET contains numerous <a href="https://docs.microsoft.com/en-us/dotnet/framework/performance/clr-etw-events">very useful ETW events</a>, which can be used for tracing garbage collections, assembly loading, exceptions thrown, object allocations, and other interesting scenarios. All events can come with a stack trace, which helps understand where they’re coming from. In fact, I’d say for some events, not getting the stack trace means the event is completely useless — e.g. what good is the <strong>ExceptionThrown</strong> event if you don’t have the exception stack trace?</p>
<p>On Linux, .NET Core doesn’t use ETW (Event Tracing for <em>Windows</em>, ya know). <a href="http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/">It uses LTTng instead</a>, which is an awesome tracing framework but doesn’t have stack trace support for userspace events. But I think we can hack around it. Specifically, all LTTng events used by .NET Core are fired through a set of auto-generated functions, named <code>FireEtXplat&lt;EventName&gt;</code> and <code>EventXplatEnabled&lt;EventName&gt;</code>. If we trace these functions using standard dynamic tracing (uprobes) with ftrace, perf, or BPF, we don’t get the event payload (which can be also quite important), but we do get the stack trace. If we only need the event count (a rough replacement for Windows performance counters) or the stack traces, we don’t have create an LTTng session and record the events, which can also help lower the overhead. The downside is using hacky internal details, which can change at any moment — but that’s the nature of dynamic tracing.</p>
<p>Here’s a simple demo. Suppose you know from looking at the heap statistics or the GC LTTng events that you have lots of garbage collections, and would like to reduce the object allocations in your app. To do so, you have to figure out where the allocations are coming from. The <strong>GCAllocationTick</strong> event can tell you roughly which objects you’re allocating by using a low-overhead sampling approach, but it doesn’t tell you where the allocations are coming from, which is quite important. What we’re going to do, then, is trace the <code>EventXplatEnabledGCAllocationTick</code> function in libcoreclr.so, and gather its stack traces. Then, we’ll generate a flame graph that points to the heavy allocation sites in the app. I’ll demonstrate two ways — with perf, and with the <a href="https://github.com/iovisor/bcc/blob/master/tools/stackcount.py">stackcount tool from BCC</a> (which is based on eBPF).</p>
<p>The perf way:</p>
<pre>perf probe -x $APPDIR/libcoreclr.so -a  EventXplatGCEnabledAllocationTick*
perf record -p $PID probe_libcoreclr:* -g -o allocs.data
perf script | $FLAMEGRAPH/stackcollapse-perf.pl | $FLAMEGRAPH/flamegraph.pl &gt; allocs.svg
</pre>
<p>The stackcount way:</p>
<pre>$BCC/stackcount $APPDIR/libcoreclr.so:EventXplatEnabledGCAllocationTick* -f &gt; allocs.stacks
$FLAMEGRAPH/flamegraph.pl &lt; allocs.stacks &gt; allocs.svg</pre>
<p>Just for fun, let me show you the flame graph, pointing to the <code>StatsController.Get</code> method as the primary source of allocations:</p>
<p><img src="https://transfer.sh/5txlk/alloc.svg" alt="allocations flame graph" /></p>
<p>To quantify the overhead, I tested this approach on an ASP.NET Core app with a trivial endpoint that performs hundreds of thousands of allocations per second. I also tested LTTng event collection, where I created an LTTng session and enabled only the <strong>GCAllocationTick </strong>event. I ran the benchmark for 20 seconds with 10 concurrent clients in each mode. The results were as follows:</p>
<ul>
<li>No tracing: 12.32 ms/request</li>
<li>Dynamic tracing with perf: 12.51 ms/request (total of 58,422 events recorded; almost 3,000 events per second)</li>
<li>Dynamic tracing with stackcount: 12.59 ms/request</li>
<li>LTTng recording: 12.72 ms/request (total of 56,931 events recorded)</li>
</ul>
<blockquote><p>Just to clarify, any of the approaches above still require launching the application with the <code>COMPlus_EnableEventLog=1</code> environment variable. If it is off, the entire eventing infrastructure is not invoked at all. Incidentally, for this application, which has the potential of generating hundreds of thousands of events per second, turning <em>off</em> this environment variable produces a huge speedup: 7.39 ms/request. For an app with more reasonable event rates, it will probably make sense to keep the environment variable on, because turning it off means you can’t do any meaningful event collection without restarting the process.</p></blockquote>
<p>In conclusion, it seems that using dynamic tracing to probe the CoreCLR methods directly is a feasible approach for collecting stack traces of interesting CLR events. You don’t get the event payload (although in some cases it can be collected as well from the function’s arguments), but you do get the code location, which is often enough.</p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2018/02/06/getting-stacks-for-lttng-events-with-net-core-on-linux/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wrapping Up Sela&#8217;s Hackathon With Four New Diagnostic Projects</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/12/25/wrapping-up-selas-hackathon-with-four-new-diagnostic-projects/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/12/25/wrapping-up-selas-hackathon-with-four-new-diagnostic-projects/#respond</comments>
		<pubDate>Mon, 25 Dec 2017 14:12:58 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[Debugging]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818996</guid>
		<description><![CDATA[In the beginning of December, the consultants team at Sela had a day off-site for our annual hackathon to work on a variety of projects. This day was a blast, and there was a bunch of great energy and interesting work being done all around, but my team (Avi Avni and I) focused on diagnostics [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In the beginning of December, the consultants team at Sela had a day off-site for our annual hackathon to work on a variety of projects. This day was a blast, and there was a bunch of great energy and interesting work being done all around, but my team (<a href="https://github.com/AviAvni">Avi Avni</a> and I) focused on diagnostics tools &#8212; my favorite &#8212; and here are some preliminary results.</p>
<h3>Real-time Win32 memory leak diagnoser</h3>
<p>This is a project I&#8217;ve had on my todo list for a couple of years now. In a nutshell, Win32 memory leak analysis in production is quite painful because of the sheer amount of data that has to be collected. Traditional approaches, which I&#8217;ve used quite successfully in the past, require recording every single allocation and deallocation, and then cross-correlating them to find allocations that weren&#8217;t freed (e.g., in <a href="http://blogs.microsoft.co.il/sasha/2014/12/02/diagnosing-native-memory-leaks-etw-wpa/">this post using xperf and WPA</a>). While this generally works, for an application with high-frequency allocations that leaks at a slow rate, collecting data over an hour or day or week is simply impractical due to the sheer sizes of the data files.</p>
<p>A couple of years ago, I wrote a BPF-based tool called <a href="http://blogs.microsoft.co.il/sasha/2016/02/14/two-new-ebpf-tools-memleak-and-argdist/">memleak</a>, which uses Linux uprobes to record allocations and deallocation stacks in a runtime data structure, without emitting data to files. I&#8217;ve already used this tool a couple of times to diagnose production issues.</p>
<p>The <a href="https://github.com/AviAvni/NativeLeakDetector">NativeLeakDetector project</a> that Avi Avni built in just a few hours during the hackathon does the very same thing &#8212; for Windows, using ETW events. It&#8217;s still a bit shy on documentation, but is quite simple in principle. It uses the <a href="https://www.nuget.org/packages/Microsoft.Diagnostics.Tracing.TraceEvent/">TraceEvent</a> library to record heap allocation and deallocation events in a given process, and keeps track of all allocations with their call stacks in a runtime map. When instructed to, the tool prints all the allocations that were not freed and the call stacks leading to these allocations. There&#8217;s a bit of work remaining to make this tool production-ready, but the general skeleton is there and working quite fine.</p>
<h3>Process snapshotting support in CLRMD</h3>
<p>Our second project, also contributed by Avi Avni, was to add <a href="https://github.com/Microsoft/clrmd/pull/91">process snapshotting support to the popular CLRMD debugging library</a>. If you haven&#8217;t seen it yet, <a href="http://blogs.microsoft.co.il/sasha/2013/05/20/traversing-the-gc-heap-with-clrmd/">CLRMD</a> provides a convenient C# API for attaching to a live process or opening a dump file and analyzing its contents. You can walk threads and call stacks, locate specific objects in memory, investigate the heap size and GC state, and numerous other scenarios. The only catch is that to use CLRMD, you have to opt in for one of the following modes:</p>
<ul>
<li>Create a dump file of the process and open the dump file. This allows you to capture an accurate snapshot of the process&#8217; state, but creating the dump file can take a long time and take a lot of disk space.</li>
<li>Attach to the process invasively, like a debugger. Again, this lets you inspect the process&#8217;s state, but if the process is a production service, you just paused it completely.</li>
<li>Inspect the process&#8217; memory without suspending it. The process keeps running, which is great for production services, but it means you&#8217;re not seeing a consistent snapshot. For example, while you&#8217;re enumerating heap objects, a GC can occur and completely mess everything up.</li>
</ul>
<p>Avi&#8217;s pull request adds another option: create a virtual address clone of the process using the <a href="https://msdn.microsoft.com/en-us/library/dn457825(v=vs.85).aspx">Process Snapshotting API</a> (essentially POSIX <strong>fork()</strong>, but without actually executing code in the child process), and then attach CLRMD to the clone. The original process can keep running, but we have an accurate snapshot of its state to analyze &#8212; and then throw away. What&#8217;s best, the snapshotting API uses copy-on-write, so only pages modified by the original process are actually cloned (on demand) in physical memory.</p>
<h3>.NET Core real-time event tracer for Linux</h3>
<p>Earlier this year, I wrote a couple of blog posts on <a href="http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/">tracing .NET Core runtime events on Linux</a>, such as garbage collections, allocations, exceptions, and others. The tracing approach I&#8217;ve shown is based on recording LTTng events to a trace file, and analyzing the trace file later. While this has its merits, it&#8217;s not really suitable for real-time, continuous monitoring. So I set out to build a proof-of-concept script that captures a real-time trace of .NET Core events, aggregates them in real-time, and produces interesting statistics.</p>
<p>The result is <a href="https://github.com/goldshtn/dntrace">dntrace</a>, a two-part tool: dntrace.sh, which turns on the LTTng events and records them, and dntrace.py, which parses them in real-time and displays statistics. Currently, the Python part uses an extremely fragile approach, where the trace data is passed through Babeltrace and then parsed from strings back into structured events. Babeltrace 2.0 will introduce API support for parsing events from real-time sessions, which is when the dntrace.py script can be revisited and implemented in a less hacky way.</p>
<p>It&#8217;s still not bad, though &#8212; you can get real-time GC information, including GC durations and generation sizes; printouts on any exceptions thrown; live allocation data; and other statistics. See the <a href="https://github.com/goldshtn/dntrace">project repository</a> for an example.</p>
<h3>Bonus project: run a process in a Windows job object</h3>
<p>During the day, I started working on another little tool, which I was only able to finish a few days later: <a href="https://github.com/goldshtn/jobrun">jobrun</a>. This tool runs a process inside a Windows job object, and lets you apply various limits to its behavior. You can restrict the process&#8217; memory usage, CPU time, CPU affinity, scheduling priority, scheduling weight, and apply additional quotas &#8212; all supported by the <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx">Windows job object API</a>.</p>
<p>For me, this was a useful tool for testing how a process deals with scarce resources. What happens when I can&#8217;t commit more than 300 MB of memory? How long does it take for the application to start up when I only get 3% of CPU time per scheduling interval? Can a single batch job complete within a hard limit of 30 CPU seconds? Perhaps you&#8217;ll find some other uses for this tool, too.</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/12/25/wrapping-up-selas-hackathon-with-four-new-diagnostic-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lightweight JVM Diagnostics Tools and Containers</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/09/27/lightweight-jvm-diagnostics-tools-and-containers/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/09/27/lightweight-jvm-diagnostics-tools-and-containers/#comments</comments>
		<pubDate>Wed, 27 Sep 2017 10:08:23 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[DEV]]></category>
		<category><![CDATA[JVM]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818989</guid>
		<description><![CDATA[If you&#8217;re reading this, I hope you&#8217;re curious what your options are when it comes to running JVM diagnostic tools on containerized applications. Generally when it comes to containers, you can either shove all your diagnostic tools into the container image, or you can try running them from the host &#8212; this short post tries [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>If you&#8217;re reading this, I hope you&#8217;re curious what your options are when it comes to running JVM diagnostic tools on containerized applications. Generally when it comes to containers, you can either shove all your diagnostic tools into the container image, or you can try running them from the host &#8212; this short post tries to explain what works, what doesn&#8217;t, and what can be done about it. Although it is focused on JVM tools (and HotSpot specifically), a lot of the same obstacles will apply to other runtimes and languages.</p>
<h3>Container isolation</h3>
<p>As a very quick reminder, container isolation on Linux works by using <a href="http://man7.org/linux/man-pages/man7/namespaces.7.html">namespaces</a>. Containerized processes are placed in a PID namespace that gives them private process ids that aren&#8217;t shared with the host (although they also have process ids on the host); in a mount namespace that gives them their own view of mount points, and hence their own view of the filesystem; in a network namespace that gives them their own network interfaces; and so on. A lot of diagnostic tools aren&#8217;t namespace-aware, and will happily try to open files on the host using container paths, or try to attach to a process by using the container&#8217;s PID namespace, or exhibit any number of other failures.</p>
<p>Additionally, container resources are often limited by using <a href="http://man7.org/linux/man-pages/man7/cgroups.7.html">control groups</a>. This is not so much an isolation mechanism as it is a quota mechanism: the cpu control group restricts container CPU usage shares; the memory control group restricts user and kernel memory usage; the blkio control group restricts I/O throughput and operation count; and so on.</p>
<p>Finally, a lot of container runtimes (including Docker) use <a href="http://man7.org/linux/man-pages/man2/seccomp.2.html">seccomp</a> to restrict the set of syscalls containerized processes can make, to further isolate them from the host and avoid nasty surprises. Turns out, though, that some of these restricted syscalls are actually essential for diagnostic tools to work properly.</p>
<h3>JVM diagnostic mechanisms</h3>
<p>This is by no means a complete survey, but it&#8217;s worth just listing quickly the main JVM diagnostic mechanisms and how they work, before we can consider what happens in a containerized environment. (For more on this, including source links, check out <a href="http://openjdk.java.net/groups/hotspot/docs/Serviceability.html">Serviceability in Hotspot</a> from the OpenJDK documentation.)</p>
<ul>
<li><strong>JVM performance data</strong>: by default, the JVM emits binary data into a file in the temp directory called <strong>hsperfdata_$UID/$PID</strong>. This file contains statistics on garbage collection, class loading, JIT compilation, and other events. It is the data source for <strong>jstat</strong>, and is also how <strong>jps</strong> and <strong>jinfo</strong> discover information about running JVM processes.</li>
<li><strong>JVM attach interface</strong>: by default, the JVM will react to a QUIT signal by looking for a file in the working directory called <strong>.attach_pid$PID</strong>. If the file exists, it will create a UNIX domain socket in the temp directory called <strong>.java_pid$PID</strong>, and create a thread that will listen for commands on that socket. <strong>jmap</strong>, <strong>jstack</strong>, <strong>jcmd</strong> are some of the tools that rely on the attach interface for heap dumps, thread dumps, obtaining VM information, and other facilities.</li>
<li><strong>Serviceability Agent</strong>: a component that runs in an external process and reads JVM data structures from the target by using <strong>ptrace</strong> (for a live process) or ELF parsing (for a core dump). This allows live diagnostics and core dump analysis to see thread states, heap objects, call stacks, and so on. <strong>HSDB</strong>, <strong>SOSQL</strong>, and other tools rely on the Serviceability Agent API. Notably, the JDK version has to match exactly between the original JVM and the one used to analyze the core dump or live process.</li>
<li><strong>JVMTI</strong>: this tool interface allows an external agent library (.so) to be loaded with or attached to a JVM process and register for various interesting events, including class loading, thread start, garbage collection, monitor contention, and others. To load an agent with your process you use the <strong>-agentpath</strong> command-line argument; to attach an agent to a live process you use the JVM attach interface.</li>
<li><strong>JMX</strong>: the JDK runtime provides a basic set of managed beans for inspecting the GC heap, threads, and other components. Many additional managed beans exist in various application containers like Tomcat.</li>
</ul>
<p>Another important concept to consider is perf maps, used by the Linux perf tool to map JIT-compiled code addresses to Java methods. A common way of creating these is by using a JVMTI agent (e.g. <a href="https://github.com/jvm-profiling-tools/perf-map-agent">perf-map-agent</a>), which writes a perf map out to the default location in <strong>/tmp/perf-$PID.map</strong>. These are crucial for a lot of native Linux performance tools if you plan to use them with JVM processes.</p>
<h3>Running diagnostic tools from the host</h3>
<p>If you look at the way some of the JVM tools are implemented, it is clear that running them from the host will present a set of interesting challenges. Here&#8217;s how to address these challenges in some cases:</p>
<ul>
<li>The JVM performance data store will usually not be accessible from the host. However, you can bind-mount the temp directory to make it visible from the host, which makes tools like <strong>jstat</strong> happy. (With Docker, this would be something like <strong>-v /tmp:/tmp</strong>).</li>
<li>The JVM attach interface has multiple points of failure: the containerized JVM thinks its process ID is X, while the host tool thinks it&#8217;s Y; and of course the attach file and the UNIX domain socket will be in the wrong mount namespaces. I just recently added a namespace-awareness patch to Andrei Pangin&#8217;s <a href="https://github.com/apangin/jattach">jattach</a> tool, which covers the functionality of <strong>jmap</strong>, <strong>jstack</strong>,<strong> jcmd</strong>, and <strong>jinfo</strong> in a single package &#8212; so you can now use <strong>jattach</strong> from the host with no additional flags.</li>
<li>The Serviceability Agent API requires the full JDK to be available on the host, and requires a perfect match between the host and container JDK. This is not a likely scenario.</li>
<li>Attaching a JVMTI agent to a containerized process can be done with <strong>jattach</strong>, provided that the agent library is accessible in the container. This can be done with bind-mounts.</li>
<li>JMX beans can be accessed from the host by making the container expose them remotely using RMI. <a href="https://stackoverflow.com/questions/856881/how-to-activate-jmx-on-my-jvm-for-access-with-jconsole">This StackOverflow question</a> and answer thread covers it well.</li>
<li>If you plan on using perf maps, you need to generate them inside the container (by attaching a JVMTI agent) and then make them accessible to the host tool with the right PID. This is taken care of automatically by <a href="https://github.com/iovisor/bcc/pull/1051">some tools</a>, and was <a href="https://lkml.org/lkml/2017/7/5/771">recently added to perf</a> as well.</li>
</ul>
<h3>Running diagnostic tools from the container</h3>
<p>Although I don&#8217;t particularly like the idea of bloating your container image with diagnostic tools, suppose you&#8217;ve done it anyway. Here are some of the likely problems:</p>
<ul>
<li>The Serviceability Agent API uses the <strong>ptrace</strong> syscall, which is disabled in Docker&#8217;s seccomp profile (and I imagine it would be disabled by other sensible container runtimes as well). You can use a <a href="https://docs.docker.com/engine/security/seccomp/">custom seccomp profile</a>, of course, if you understand the security consequences for your host.</li>
<li>Using perf and perf-based tools inside the container requires the <strong>perf_event_open</strong> syscall, which is again blocked by Docker&#8217;s default seccomp profile.</li>
</ul>
<h3>Summary</h3>
<p>Most diagnostic tools at our disposal today were not designed with containers in mind. You could say they are not container-aware &#8212; but they&#8217;re not aware of a bazillion other things which still don&#8217;t break their behavior. Unfortunately, most tools will not work out-of-the-box for containerized JVM processes, but there are ways to make them work with a fairly minimal effort.</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/09/27/lightweight-jvm-diagnostics-tools-and-containers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Profiling the JVM on Linux: A Hybrid Approach</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/07/07/profiling-the-jvm-on-linux-a-hybrid-approach/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/07/07/profiling-the-jvm-on-linux-a-hybrid-approach/#respond</comments>
		<pubDate>Fri, 07 Jul 2017 19:27:08 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[DEV]]></category>
		<category><![CDATA[JVM]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818983</guid>
		<description><![CDATA[I hope you&#8217;re outraged that your performance tools are lying to you. For quite a while, many Java sampling profilers have been known to blatantly misrepresent reality. In a nutshell, stack sampling using the documented JVMTI GetStackTrace method produces results that are biased towards safepoints, and not representative of the real CPU processing performed by your [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I hope you&#8217;re outraged that your performance tools are lying to you. For quite a while, <a href="https://psy-lob-saw.blogspot.co.il/2016/02/why-most-sampling-java-profilers-are.html">many Java sampling profilers have been known to blatantly misrepresent reality</a>. In a nutshell, stack sampling using the documented JVMTI <a href="https://docs.oracle.com/javase/7/docs/platform/jvmti/jvmti.html">GetStackTrace</a> method produces results that are biased towards safepoints, and not representative of the real CPU processing performed by your program.</p>
<p>Over the years, alternative profilers popped up, trying to fix this problem by using <a href="https://psy-lob-saw.blogspot.co.il/2016/06/the-pros-and-cons-of-agct.html">AsyncGetCallTrace</a>, a less-documented API that doesn&#8217;t wait for a safepoint, and can produce more accurate results. Simply calling AGCT from a timer signal handler gives you a fairly reliable way to do stack sampling of JVM processes. Unfortunately, even AGCT can sometimes fail, and in any case, it doesn&#8217;t help with profiling the non-Java parts of your process: JVM code, GC, JIT, syscalls, kernel work performed on your behalf, and really anything else that&#8217;s not pure JVM bytecode.</p>
<p>Another popular alternative is using Linux <strong>perf</strong>, which doesn&#8217;t directly support Java but has great support for profiling native code, and doesn&#8217;t have any trouble looking at kernel stacks as well. <a href="https://medium.com/netflix-techblog/java-in-flames-e763b3d32166">For JVM support, you need two pieces</a>:</p>
<ol>
<li>A perf map that maps JIT-compiled addresses to function names (as a corollary, only compiled frames are supported; interpreter frames are invisible)</li>
<li>A JIT switch <strong>-XX:+PreserveFramePointer</strong> that makes sure perf can walk the Java stack, added in OpenJDK 1.8u60</li>
</ol>
<p>When using this method:</p>
<ol>
<li>You end up losing interpreter frames</li>
<li>You can&#8217;t profile an older JVM that doesn&#8217;t have the PreserveFramePointer flag</li>
<li>You risk having stale entries in your perf map because the JIT can throw away and recompile code</li>
<li>You risk not having certain functions in your perf map because the JIT threw the code away</li>
</ol>
<p>At <a href="https://jpoint.ru/en/">JPoint 2017</a>, Andrei Pangin and Vadim Tsesko from Odnoklassniki <a href="http://assets.contentful.com/oxjq45e8ilak/4mfbX5FJuw0A8M00UK4uKa/ce60f2cab12408e01ce927e90ebb2f7a/Andrey_Pangin__Vadim_Tsesko._The_Art_of_JVM_Profiling.pdf">introduced a new approach for JVM profiling on Linux</a>, which brings together the best from both worlds: perf for native code and kernel frames, and AGCT for Java frames. Thus, <a href="https://github.com/jvm-profiling-tools/async-profiler">async-profiler</a> was born.</p>
<p>Async-profiler&#8217;s method of operation is fairly simple. It uses the perf_events API to configure CPU sampling into a memory buffer, and asks for a signal to be delivered when a sample occurs. The signal handler then calls AsyncGetCallTrace, and merges the two stacks together: the Java stack, captured by AsyncGetCallTrace, and the native + kernel stack, captured by perf_events. For non-Java threads, only the perf_events stack is retained.</p>
<p><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/async-profiler.png"><img class="alignnone size-full wp-image-2818985" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/async-profiler.png" alt="" width="864" height="276" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/async-profiler.png 864w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/async-profiler-300x96.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/async-profiler-768x245.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/async-profiler-624x199.png 624w" sizes="(max-width: 864px) 100vw, 864px" /></a></p>
<p><em>Async-profiler&#8217;s approach for constructing a merged call stack, from Andrei Pangin&#8217;s and Vadim Tsesko&#8217;s presentation at JPoint 2017</em></p>
<p>This approach has its limitations, but it also offers a lot of appeal. You don&#8217;t need a special switch to preserve frame pointers. You get full-fidelity data about interpreter frames. The agent supports older JVMs. The stack aggregation happens in the agent, so there are no expensive perf.data files to store and parse.</p>
<p><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty.png"><img class="alignnone size-large wp-image-2818984" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty-1024x406.png" alt="" width="625" height="248" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty-1024x406.png 1024w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty-300x119.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty-768x304.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty-624x247.png 624w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/07/collecty.png 1194w" sizes="(max-width: 625px) 100vw, 625px" /></a></p>
<p><em>A flame graph generated by using async-profiler</em></p>
<p>To try async-profiler, you can build from source (it&#8217;s very simple) and then use the helper profiler.sh script, which I contributed:</p>
<pre>./profiler.sh start $(pidof java)
./profiler.sh stop -o flamegraph -f /tmp/java.stacks</pre>
<p>Full instructions are in the <a href="https://github.com/jvm-profiling-tools/async-profiler">README</a> &#8212; any feedback, contributions, or suggestions are very welcome. Odnoklassniki are using this in production, but I&#8217;m sure they&#8217;ll be delighted to know that you found it useful, too!</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/07/07/profiling-the-jvm-on-linux-a-hybrid-approach/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tracing .NET Core on Linux with USDT and BCC</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/04/02/tracing-net-core-on-linux-with-usdt-and-bcc/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/04/02/tracing-net-core-on-linux-with-usdt-and-bcc/#comments</comments>
		<pubDate>Sun, 02 Apr 2017 19:12:49 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[.NETInternals]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Tracing]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818976</guid>
		<description><![CDATA[In my last post, I lamented the lack of call stack support for LTTng events in .NET Core. Fortunately, being open source, this is somewhat correctable &#8212; so I set out to produce a quick-and-dirty patch that adds USDT support for CoreCLR&#8217;s tracing events. This post explores some of the things that then become possible, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In my <a href="http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/">last post</a>, I lamented the lack of call stack support for LTTng events in .NET Core. Fortunately, being open source, this is somewhat correctable &#8212; so I set out to produce a quick-and-dirty patch that adds USDT support for CoreCLR&#8217;s tracing events. This post explores some of the things that then become possible, and will hopefully become available in one form or another in CoreCLR in the future.</p>
<h3>Very Brief USDT Primer</h3>
<p>USDT (User Statically Defined Tracing) is a lightweight approach for embedding static trace markers into user-space libraries and applications. I&#8217;ve taken a closer look a year ago when discussing <a href="http://blogs.microsoft.co.il/sasha/2016/03/30/usdt-probe-support-in-bpfbcc/">USDT support in BCC</a>, so you might want take a look as a refresher.</p>
<p>In a very small nutshell, to embed USDT probes into your library, you use a special set of macros, which then produce ELF NT_STAPSDT notes with information about the probe&#8217;s location (instruction offset), its name, its arguments, and a global variable that can be poked at runtime to turn the probe on and off (this is called the probe&#8217;s semaphore).</p>
<p>When tracing is disabled, i.e. the semaphore is off, USDT probes have a near-zero cost, essentially a single NOP instruction. If the argument preparation for the probe is prohibitively expensive, your code can protect relevant sections with another macro that checks if the probe is enabled before preparing and submitting its arguments. Here&#8217;s what the whole thing might look like:</p>
<pre>// Declaring the trace semaphore and the trace macro:
#define _SDT_HAS_SEMAPHORES 1
#include &lt;sys/sdt.h&gt;

#define MYAPP_REQUEST_START_ENABLED() __builtin_expect (myapp_request_start_semaphore, 0)
__extension__ unsigned short myapp_request_start_semaphore __attribute ((unused)) __attribute__ ((section (".probes")));
#define MYAPP_REQUEST_START(url, client_port) DTRACE_PROBE2(myapp, request_start, url, client_port)

// The actual tracing code:
if (MYAPP_REQUEST_START_ENABLED()) {
  char const *url = curr_request-&gt;uri().canonicalize();
  unsigned short port = curr_request-&gt;connection()-&gt;client_port;
  MYAPP_REQUEST_START(url, client_port);
}</pre>
<p>Okay, so why was I so eager to get these probes into CoreCLR? Because there are existing, lightweight tools for tracing USDT probes. One is SystemTap, which is great but requires a kernel module, and the other is the <a href="https://github.com/iovisor/bcc">BCC toolkit</a>, which I described extensively in previous posts. Also, because USDT probes can be mapped to specific program locations, the existing Linux uprobes mechanism can be used to trace them and obtain stack traces, with perf or ftrace-based machinery. Without subtracting from the value of LTTng traces, I really wanted to get the BCC tools working with CoreCLR, and that requires USDT.</p>
<h3>Putting USDT Probes in CoreCLR</h3>
<p>At this point you might be thinking of some monstrous patch that modifies thousands of trace locations in CoreCLR to support USDT, somehow. Fortunately, there is a Python script in the CoreCLR source called <a href="https://github.com/dotnet/coreclr/blob/master/src/scripts/genXplatLttng.py">genXplatLttng.py</a>, which is responsible for generating function stubs for each CLR event. All I had to do is <a href="https://gist.github.com/goldshtn/eae953dfccd2eb4b4d43d4c326b62fb5">patch</a> it ever so slightly (31 changed lines), and the resulting CoreCLR binary (libcoreclr.so) now has USDT probes!</p>
<pre class="p1"><span class="s1"># readelf -n .../libcoreclr.so</span>

<span class="s1">Displaying notes found at file offset 0x00000200 with length 0x00000024:</span>
<span class="s1"><span class="Apple-converted-space">  </span>Owner <span class="Apple-converted-space">                </span>Data size <span class="Apple-converted-space">      </span>Description</span>
<span class="s1"><span class="Apple-converted-space">  </span>GNU<span class="Apple-converted-space">                  </span>0x00000014 <span class="Apple-converted-space">      </span>NT_GNU_BUILD_ID (unique build ID bitstring)</span>
<span class="s1"><span class="Apple-converted-space">    </span>Build ID: a93f07f0d169d6dd53fb8a09e3fe793cda56072d</span>

<span class="s1">Displaying notes found at file offset 0x0079cf90 with length 0x0000c25c:</span>
<span class="s1"><span class="Apple-converted-space">  </span>Owner <span class="Apple-converted-space">                </span>Data size <span class="Apple-converted-space">      </span>Description</span>
<span class="s1"><span class="Apple-converted-space">  </span>stapsdt<span class="Apple-converted-space">              </span>0x00000046 <span class="Apple-converted-space">      </span>NT_STAPSDT (SystemTap probe descriptors)</span>
<span class="s1"><span class="Apple-converted-space">    </span>Provider: DotNETRuntime</span>
<span class="s1"><span class="Apple-converted-space">    </span>Name: GCStart</span>
<span class="s1"><span class="Apple-converted-space">    </span>Location: 0x000000000051a296, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc18</span>
<span class="s1"><span class="Apple-converted-space">    </span>Arguments: 4@-28(%rbp) 4@-32(%rbp)</span>
<span class="s1"><span class="Apple-converted-space">  </span>stapsdt<span class="Apple-converted-space">              </span>0x0000006d <span class="Apple-converted-space">      </span>NT_STAPSDT (SystemTap probe descriptors)</span>
<span class="s1"><span class="Apple-converted-space">    </span>Provider: DotNETRuntime</span>
<span class="s1"><span class="Apple-converted-space">    </span>Name: GCStart_V1</span>
<span class="s1"><span class="Apple-converted-space">    </span>Location: 0x000000000051a39e, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1a</span>
<span class="s1"><span class="Apple-converted-space">    </span>Arguments: 4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp)</span>
<span class="s1"><span class="Apple-converted-space">  </span>stapsdt<span class="Apple-converted-space">              </span>0x00000079 <span class="Apple-converted-space">      </span>NT_STAPSDT (SystemTap probe descriptors)</span>
<span class="s1"><span class="Apple-converted-space">    </span>Provider: DotNETRuntime</span>
<span class="s1"><span class="Apple-converted-space">    </span>Name: GCStart_V2</span>
<span class="s1"><span class="Apple-converted-space">    </span>Location: 0x000000000051a4b9, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1c</span>
<span class="s1"><span class="Apple-converted-space">    </span>Arguments: 4@-44(%rbp) 4@-48(%rbp) 4@-52(%rbp) 4@-56(%rbp) 2@-58(%rbp) 8@-72(%rbp)</span>
<span class="s1"><span class="Apple-converted-space">  </span>stapsdt<span class="Apple-converted-space">              </span>0x00000044 <span class="Apple-converted-space">      </span>NT_STAPSDT (SystemTap probe descriptors)</span>
<span class="s1"><span class="Apple-converted-space">    </span>Provider: DotNETRuntime</span>
<span class="s1"><span class="Apple-converted-space">    </span>Name: GCEnd</span>
<span class="s1"><span class="Apple-converted-space">    </span>Location: 0x000000000051a597, Base: 0x000000000061f0e8, Semaphore: 0x000000000099cc1e</span>
<span class="s1"><span class="Apple-converted-space">    </span>Arguments: 4@-28(%rbp) 2@-30(%rbp)</span></pre>
<p>Many more notes were omitted for brevity &#8212; there is a total of 394 events on the build I used. Now, it&#8217;s important to clarify that this patch doesn&#8217;t get the full fidelity events. LTTng events have a richer payload than what USDT probes support, and support complex structures, sequences, and more. However, in many tracing scenarios, very basic information such as strings and numbers is sufficient. And, of course, the call stack. So let&#8217;s see what we can do now.</p>
<h3>Tracing .NET Core Garbage Collections</h3>
<p>OK, so what can we do with these newly-obtained superpowers? To begin with, we can trace USDT probes using the generic <strong>trace</strong> and <strong>argdist</strong> tools from <a href="https://github.com/iovisor/bcc">BCC</a>. For example, let&#8217;s get some statistics about garbage collections &#8212; how many collections do we have in each generation?</p>
<pre class="p1"><span class="s1"># argdist -p $(pidof helloworld) -C 'u::GCStart_V2():int:arg2#collections by generation' -c</span>
<span class="s1">[03:04:29]</span>
<span class="s1">collections by generation</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>4<span class="Apple-converted-space">          </span>arg2 = 2</span>
<span class="s1"><span class="Apple-converted-space">        </span>8<span class="Apple-converted-space">          </span>arg2 = 1</span>
<span class="s1"><span class="Apple-converted-space">        </span>13 <span class="Apple-converted-space">        </span>arg2 = 0</span>
<span class="s1">[03:04:30]</span>
<span class="s1">collections by generation</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>5<span class="Apple-converted-space">          </span>arg2 = 2</span>
<span class="s1"><span class="Apple-converted-space">        </span>20 <span class="Apple-converted-space">        </span>arg2 = 1</span>
<span class="s1"><span class="Apple-converted-space">        </span>25 <span class="Apple-converted-space">        </span>arg2 = 0</span>
<span class="s1">[03:04:31]</span>
<span class="s1">collections by generation</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>5<span class="Apple-converted-space">          </span>arg2 = 2</span>
<span class="s1"><span class="Apple-converted-space">        </span>22 <span class="Apple-converted-space">        </span>arg2 = 1</span>
<span class="s1"><span class="Apple-converted-space">        </span>28 <span class="Apple-converted-space">        </span>arg2 = 0</span>
<span class="s1">[03:04:32]</span>
<span class="s1">collections by generation</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>6<span class="Apple-converted-space">          </span>arg2 = 2</span>
<span class="s1"><span class="Apple-converted-space">        </span>30 <span class="Apple-converted-space">        </span>arg2 = 1</span>
<span class="s1"><span class="Apple-converted-space">        </span>36 <span class="Apple-converted-space">        </span>arg2 = 0</span>
<span class="s1">[03:04:33]</span>
<span class="s1">collections by generation</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>9<span class="Apple-converted-space">          </span>arg2 = 2</span>
<span class="s1"><span class="Apple-converted-space">        </span>40 <span class="Apple-converted-space">        </span>arg2 = 1</span>
<span class="s1"><span class="Apple-converted-space">        </span>49 <span class="Apple-converted-space">        </span>arg2 = 0</span></pre>
<p>arg2 in the above output is the collection &#8220;depth&#8221;, which is the collected generation. As you can see, we have quite a few gen0 and gen1 collection every second, and a handful of gen2 collections as well. (By the way, BCC has a tool called <strong>ugc</strong> for exploring GC latencies specifically, but it doesn&#8217;t currently support .NET Core.)</p>
<p>How did I know that arg2 is the collection depth, and how did I know that the collection &#8220;depth&#8221; is the generation to be collected? There are many more examples in this post that look a bit magical with various arg1, arg2, &#8230;, arg6 incantations. Right now, the answer is by inspecting the CLR source code to see where the probes are emitted, and what the values passed to them mean. In this particular case:</p>
<pre>~/coreclr$ ack GCStart_V2 src/
src/vm/eventtrace.cpp
901: FireEtwGCStart_V2(pGcInfo-&gt;GCStart.Count, pGcInfo-&gt;GCStart.Depth, pGcInfo-&gt;GCStart.Reason, pGcInfo-&gt;GCStart.Type, GetClrInstanceId(), l64ClientSequenceNumberToLog);
...
src/gc/env/etmdummy.h
7:#define FireEtwGCStart_V2(Count, Depth, Reason, Type, ClrInstanceID, ClientSequenceNumber) 0

~/coreclr$ ack GCStart.*Depth src/
src/vm/eventtrace.cpp
895: (pGcInfo-&gt;GCStart.Depth == GCHeapUtilities::GetGCHeap()-&gt;GetMaxGeneration()) &amp;&amp;
901: FireEtwGCStart_V2(pGcInfo-&gt;GCStart.Count, pGcInfo-&gt;GCStart.Depth, pGcInfo-&gt;GCStart.Reason, pGcInfo-&gt;GCStart.Type, GetClrInstanceId(), l64ClientSequenceNumberToLog);
...
src/gc/gcee.cpp
91: Info.GCStart.Depth = (uint32_t)pSettings-&gt;condemned_generation;
100: else if (Info.GCStart.Depth &lt; max_generation)</pre>
<p>The argument order in the <strong>FireEtwGCStart_V2</strong> function makes it clear that arg2 is going to be the collection depth. Then, the assignment statement in gcee.cpp hopefully makes it clear: the GC depth is the &#8220;condemned generation&#8221;, which is the generation to be collected.</p>
<p>Now, where are these pesky collections coming from? The <strong>stackcount</strong> tool summarizes call stacks in-kernel:</p>
<pre class="p1"><span class="s1"><span class="Apple-converted-space"># stackccount u:.../libcoreclr.so:GCStart_V2 -p $(pidof helloworld)
^C
  </span>FireEtXplatGCStart_V2</span>
<span class="s1"><span class="Apple-converted-space">  </span>ETW::GCLog::FireGcStartAndGenerationRanges(ETW::GCLog::st_GCEventInfo*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::GCHeap::UpdatePreGCCounters()</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::do_pre_gc()</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::garbage_collect(int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>SlowAllocateString(unsigned int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>StringObject::NewString(char16_t const*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>Int32ToDecStr(int, int, StringObject*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>COMNumber::FormatInt32(int, StringObject*, NumberFormatInfo*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::DoSomeWork()</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::Main(string[])</span>
<span class="s1"><span class="Apple-converted-space">  </span>CallDescrWorkerInternal</span>
<span class="s1"><span class="Apple-converted-space">  </span>MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>RunMain(MethodDesc*, short, int*, PtrArray**)</span>
<span class="s1"><span class="Apple-converted-space">  </span>Assembly::ExecuteMainMethod(PtrArray**, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>coreclr_execute_assembly</span>
<span class="s1"><span class="Apple-converted-space">  </span>run(arguments_t const&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">    </span>58</span></pre>
<p>OK, so this looks like a fairly obvious path: there is a string allocation in <strong>DoSomeWork</strong> caused by converting an int32 to a string, and that triggers a GC repeatedly. Apparently, some of these GCs are gen0/gen1 but some of them actually require gen2 to clean up. Note that we get a full-fidelity call stack, including managed code (thanks to <a href="http://blogs.microsoft.co.il/sasha/2017/02/27/profiling-a-net-core-application-on-linux/">the COM_PerfMapEnabled switch we saw in an earlier post</a>).</p>
<p>If necessary, stack traces like these can also be visualized as flame graphs. Here&#8217;s an example flame graph from <strong>perf,</strong> of the above application while it was churning through a lot of memory allocations. The GC paths are clearly visible &#8212; in the foreground (allocating) thread, and in a background thread.</p>
<p><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore.png"><img class="alignnone size-large wp-image-2818980" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore-1024x534.png" alt="" width="625" height="326" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore-1024x534.png 1024w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore-300x157.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore-768x401.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore-624x326.png 624w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/04/netcore.png 1200w" sizes="(max-width: 625px) 100vw, 625px" /></a></p>
<p>Another interesting thing to trace about the GC comes from the <strong>HeapStats_V1</strong> event. This is an event that gets fired with every collection, and provides information about individual generation sizes, amount of promoted and finalized memory, and a bunch of other interesting stuff. Here&#8217;s an example of tracing generation 2 size over time, visualized as a histogram every 15 seconds:</p>
<pre class="p1"><span class="s1"># argdist -p $(pidof helloworld) -H 'u::GCHeapStats_V1():u64:arg5/1048576#gen2 size (MB)' -i 15 -c
[15:10:51]</span>
<span class="s1"><span class="Apple-converted-space">     </span>gen2 size (MB)<span class="Apple-converted-space">      </span>: count <span class="Apple-converted-space">    </span>distribution</span>
<span class="s1"><span class="Apple-converted-space">         </span>0 -&gt; 1<span class="Apple-converted-space">          </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">         </span>2 -&gt; 3<span class="Apple-converted-space">          </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">         </span>4 -&gt; 7<span class="Apple-converted-space">          </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">         </span>8 -&gt; 15 <span class="Apple-converted-space">        </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">        </span>16 -&gt; 31 <span class="Apple-converted-space">        </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">        </span>32 -&gt; 63 <span class="Apple-converted-space">        </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">        </span>64 -&gt; 127<span class="Apple-converted-space">        </span>: 23 <span class="Apple-converted-space">      </span>|****<span class="Apple-converted-space">                                    </span>|</span>
<span class="s1"><span class="Apple-converted-space">       </span>128 -&gt; 255<span class="Apple-converted-space">        </span>: 63 <span class="Apple-converted-space">      </span>|************<span class="Apple-converted-space">                            </span>|</span>
<span class="s1"><span class="Apple-converted-space">       </span>256 -&gt; 511<span class="Apple-converted-space">        </span>: 196<span class="Apple-converted-space">      </span>|****************************************|</span>
<span class="s1">[15:11:06]</span>
<span class="s1"><span class="Apple-converted-space">     </span>gen2 size (MB)<span class="Apple-converted-space">      </span>: count <span class="Apple-converted-space">    </span>distribution</span>
<span class="s1"><span class="Apple-converted-space">         </span>0 -&gt; 1<span class="Apple-converted-space">          </span>: 6<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">         </span>2 -&gt; 3<span class="Apple-converted-space">          </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">         </span>4 -&gt; 7<span class="Apple-converted-space">          </span>: 0<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">         </span>8 -&gt; 15 <span class="Apple-converted-space">        </span>: 3<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">        </span>16 -&gt; 31 <span class="Apple-converted-space">        </span>: 6<span class="Apple-converted-space">        </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">        </span>32 -&gt; 63 <span class="Apple-converted-space">        </span>: 10 <span class="Apple-converted-space">      </span>|<span class="Apple-converted-space">                                        </span>|</span>
<span class="s1"><span class="Apple-converted-space">        </span>64 -&gt; 127<span class="Apple-converted-space">        </span>: 49 <span class="Apple-converted-space">      </span>|****<span class="Apple-converted-space">                                    </span>|</span>
<span class="s1"><span class="Apple-converted-space">       </span>128 -&gt; 255<span class="Apple-converted-space">        </span>: 107<span class="Apple-converted-space">      </span>|**********<span class="Apple-converted-space">                              </span>|</span>
<span class="s1"><span class="Apple-converted-space">       </span>256 -&gt; 511<span class="Apple-converted-space">        </span>: 404<span class="Apple-converted-space">      </span>|****************************************|</span></pre>
<p>From the histogram, we can see that the gen 2 size is usually between 256MB and 512MB, but there are occasional GCs that bring it down, even as low as the 0-1MB bucket.</p>
<h3>Tracing Object Allocations</h3>
<p>Very similarly to the approach above, we could trace object allocations. The CLR includes a lightweight allocation tick event (<strong>GCAllocationTick_V3</strong>), which fires roughly every 100KB of object allocations. It includes the most recently allocated type name, and the amount of memory allocated since the last tick &#8212; allowing for low-overhead object allocation sampling, without tracing each individual allocation, which would be extremely expensive.</p>
<p>Unfortunately, the current trace and argdist tools don&#8217;t support Unicode strings, which is how the type name is provided to these events, so the output is slightly less useful &#8212; but we can still get histograms for the allocated amount at each tick, or a summary of type ids. First, let&#8217;s try arg6, which is the type name as a string:</p>
<pre class="p1"><span class="s1"># argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():char*:arg6' -z 32</span>
<span class="s1">[03:25:06]</span>
<span class="s1">u::GCAllocationTick_V3():char*:arg6</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>1<span class="Apple-converted-space">          </span>arg6 = S</span>
<span class="s1"><span class="Apple-converted-space">        </span>59 <span class="Apple-converted-space">        </span>arg6 = S</span>
<span class="s1"><span class="Apple-converted-space">        </span>1254 <span class="Apple-converted-space">      </span>arg6 = S</span>
<span class="s1">[03:25:07]</span>
<span class="s1">u::GCAllocationTick_V3():char*:arg6</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>1<span class="Apple-converted-space">          </span>arg6 = S</span>
<span class="s1"><span class="Apple-converted-space">        </span>383<span class="Apple-converted-space">        </span>arg6 = S</span>
<span class="s1">[03:25:08]</span>
<span class="s1">u::GCAllocationTick_V3():char*:arg6</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>2<span class="Apple-converted-space">          </span>arg6 = S</span>
<span class="s1"><span class="Apple-converted-space">        </span>11 <span class="Apple-converted-space">        </span>arg6 = S</span>
<span class="s1"><span class="Apple-converted-space">        </span>1053 <span class="Apple-converted-space">      </span>arg6 = S</span></pre>
<p>That&#8217;s not very nice because when we treat the Unicode string as char*, only the first character gets displayed. This is fixable by modifying the tools, or writing a dedicated tool that would display these strings correctly. For example, here&#8217;s output from a <a href="https://gist.github.com/goldshtn/860825d8f0f05636a80e045a55d6369a">patched</a> <strong>argdist </strong>that appropriately decodes the strings instead of treating them like ASCII:</p>
<pre class="p1"><span class="s1"># argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():char*:arg6' -z 64</span>
<span class="s1">[03:51:58]</span>
<span class="s1">u::GCAllocationTick_V3():char*:arg6</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>1<span class="Apple-converted-space">          </span>arg6 = System.Char[]</span>
<span class="s1"><span class="Apple-converted-space">        </span>59 <span class="Apple-converted-space">        </span>arg6 = System.String[]</span>
<span class="s1"><span class="Apple-converted-space">        2</span>60<span class="Apple-converted-space">        </span>arg6 = System.String</span></pre>
<p>We can also get good statistics by looking at type ids (method tables, actually) &#8212; which would have to  be translated to type names separately, e.g. <a href="http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/">using SOS</a>:</p>
<pre class="p1"><span class="s1"># argdist -p $(pidof helloworld) -C 'u::GCAllocationTick_V3():u64:arg5#type id'</span>
<span class="s1">[03:31:07]</span>
<span class="s1">type id</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>10 <span class="Apple-converted-space">        </span>arg5 = 139987795580656</span>
<span class="s1"><span class="Apple-converted-space">        </span>746<span class="Apple-converted-space">        </span>arg5 = 139987795692592</span>
<span class="s1">[03:31:08]</span>
<span class="s1">type id</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>2<span class="Apple-converted-space">          </span>arg5 = 139987795580656</span>
<span class="s1"><span class="Apple-converted-space">        </span>1396 <span class="Apple-converted-space">      </span>arg5 = 139987795692592</span>
<span class="s1">[03:31:09]</span>
<span class="s1">type id</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>1<span class="Apple-converted-space">          </span>arg5 = 139987795580656</span>
<span class="s1"><span class="Apple-converted-space">        </span>1064 <span class="Apple-converted-space">      </span>arg5 = 139987795692592</span></pre>
<p>The call stacks work great, though:</p>
<pre># stackcount -p $(pidof helloworld) u:.../libcoreclr.so:GCAllocationTick_V3
^C
<span class="s1"><span class="Apple-converted-space">  </span>FireEtXplatGCAllocationTick_V3</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::fire_etw_allocation_event(unsigned long, int, unsigned char*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::allocate_large_object(unsigned long, long&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>FastAllocatePrimitiveArray(MethodTable*, unsigned int, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>JIT_NewArr1(CORINFO_CLASS_STRUCT_*, long)</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">  </span>instance class [System.Collections]System.Collections.Generic.List`1&lt;!1&gt; [System.Linq] System.Linq.Enumerable+SelectListItera</span><span class="s1">tor`2[System.__Canon,System.Char]::ToList()</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::DoSomeWork()</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::Main(string[])</span>
<span class="s1"><span class="Apple-converted-space">  </span>CallDescrWorkerInternal</span>
<span class="s1"><span class="Apple-converted-space">  </span>MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>RunMain(MethodDesc*, short, int*, PtrArray**)</span>
<span class="s1"><span class="Apple-converted-space">  </span>Assembly::ExecuteMainMethod(PtrArray**, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>coreclr_execute_assembly</span>
<span class="s1"><span class="Apple-converted-space">  </span>run(arguments_t const&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">    </span>131</span>

<span class="s1"><span class="Apple-converted-space">  </span>FireEtXplatGCAllocationTick_V3</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::fire_etw_allocation_event(unsigned long, int, unsigned char*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned long, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>SlowAllocateString(unsigned int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>StringObject::NewString(char16_t const*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>Int32ToDecStr(int, int, StringObject*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>COMNumber::FormatInt32(int, StringObject*, NumberFormatInfo*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::DoSomeWork()</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::Main(string[])</span>
<span class="s1"><span class="Apple-converted-space">  </span>CallDescrWorkerInternal</span>
<span class="s1"><span class="Apple-converted-space">  </span>MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>RunMain(MethodDesc*, short, int*, PtrArray**)</span>
<span class="s1"><span class="Apple-converted-space">  </span>Assembly::ExecuteMainMethod(PtrArray**, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>coreclr_execute_assembly</span>
<span class="s1"><span class="Apple-converted-space">  </span>run(arguments_t const&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">    </span>2496
</span></pre>
<p>This shows two major stack traces allocating objects: one allocating an array inside a LINQ ToList() call, and another one that we&#8217;ve already seen, formatting an int32 as a string.</p>
<h3>Tracing Exception Events</h3>
<p>Let&#8217;s take a look at another example. Suppose your application is suddenly hitting lots of internal exceptions, which are handled and processed, but still producing some bad results. We will trace the exceptions as they occur, and get the call stacks where they are thrown. First, how many exceptions are we seeing? This is a question for the <strong>funccount</strong> tool:</p>
<pre class="p1"><span class="s1"># funccount -p $(pidof helloworld) u:.../libcoreclr.so:ExceptionThrown_V1</span>
<span class="s1">Tracing 1 functions for "u:/home/vagrant/helloworld/bin/Debug/netcoreapp2.0/ubuntu.16.10-x64/publish/libcoreclr.so:ExceptionThrown_V1"... Hit Ctrl-C to end.</span>
<span class="s1">^C</span>
<span class="s1">FUNC<span class="Apple-converted-space">                                    </span>COUNT</span>
<span class="s1">ExceptionThrown_V1<span class="Apple-converted-space">                        </span>100</span>
<span class="s1">Detaching...</span></pre>
<p>All right, we have a fairly high rate of exceptions. What types? (This requires the same patched <strong>argdist</strong> from the allocation tracing example.)</p>
<pre class="p1"><span class="s1"># argdist -p $(pidof helloworld) -C 'u::ExceptionThrown_V1():char*:arg1#exception type' -i 5</span>
<span class="s1">[04:00:01]</span>
<span class="s1">exception type</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>100<span class="Apple-converted-space">        </span>arg1 = System.IndexOutOfRangeException</span>

<span class="s1"># argdist -p $(pidof helloworld) -C 'u::ExceptionThrown_V1():char*:arg2#exception message' -i 5 -128</span>
<span class="s1">[04:00:29]</span>
<span class="s1">exception message</span>
<span class="s1"><span class="Apple-converted-space">        </span>COUNT<span class="Apple-converted-space">      </span>EVENT</span>
<span class="s1"><span class="Apple-converted-space">        </span>200<span class="Apple-converted-space">        </span>arg2 = Index was outside the bounds of the array.</span></pre>
<p>That&#8217;s pretty impressive &#8212; just like that, we can trace exception types and messages happening inside our application. And of course we can get the call stacks, using our good friend <strong>stackcount</strong>:</p>
<pre class="p1"><span class="s1"># stackcount -p $(pidof helloworld) u:.../libcoreclr.so:ExceptionThrown_V1</span>
<span class="s1">^C</span>
<span class="s1"><span class="Apple-converted-space">  </span>FireEtXplatExceptionThrown_V1</span>
<span class="s1"><span class="Apple-converted-space">  </span>ETW::ExceptionLog::ExceptionThrown(CrawlFrame*, int, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>ExceptionTracker::ProcessExplicitFrame(CrawlFrame*, StackFrame, int, ExceptionTracker::StackTraceState&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>ExceptionTracker::ProcessOSExceptionNotification(_EXCEPTION_RECORD*, _CONTEXT*, _DISPATCHER_CONTEXT*, unsigned int, StackFrame, Thread*, ExceptionTracker::StackTraceState)</span>
<span class="s1"><span class="Apple-converted-space">  </span>ProcessCLRException</span>
<span class="s1"><span class="Apple-converted-space">  </span>UnwindManagedExceptionPass1(PAL_SEHException&amp;, _CONTEXT*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>DispatchManagedException(PAL_SEHException&amp;, bool)</span>
<span class="s1"><span class="Apple-converted-space">  </span>__FCThrow(void*, RuntimeExceptionKind, unsigned int, char16_t const*, char16_t const*, char16_t const*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>COMString::GetCharAt(StringObject*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>char [helloworld] helloworld.Program::Selector(string)</span>
<span class="s1"><span class="Apple-converted-space">  </span>instance class [System.Collections]System.Collections.Generic.List`1&lt;!1&gt; [System.Linq] System.Linq.Enumerable+SelectListIterator`2[System.__Canon,System.Char]::ToList()</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::DoSomeWork()</span>
<span class="s1"><span class="Apple-converted-space">  </span>void [helloworld] helloworld.Program::Main(string[])</span>
<span class="s1"><span class="Apple-converted-space">  </span>CallDescrWorkerInternal</span>
<span class="s1"><span class="Apple-converted-space">  </span>MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>RunMain(MethodDesc*, short, int*, PtrArray**)</span>
<span class="s1"><span class="Apple-converted-space">  </span>Assembly::ExecuteMainMethod(PtrArray**, int)</span>
<span class="s1"><span class="Apple-converted-space">  </span>CorHost2::ExecuteAssembly(unsigned int, char16_t const*, int, char16_t const**, unsigned int*)</span>
<span class="s1"><span class="Apple-converted-space">  </span>coreclr_execute_assembly</span>
<span class="s1"><span class="Apple-converted-space">  </span>run(arguments_t const&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">  </span>[unknown]</span>
<span class="s1"><span class="Apple-converted-space">    </span>200</span></pre>
<p>OK, so in a function called <strong>Selector</strong> we&#8217;re trying to access a character in a string, and hitting an out-of-bounds condition. Perhaps the string is empty, or the index is invalid. All that &#8212; without a debugger!</p>
<h3>Conclusion</h3>
<p>There are plenty of other things that are made possible by collecting call stacks from CoreCLR events &#8212; tracing assembly loads, method JIT, object movement, finalization, and many other interesting scenarios. Currently, this is all just wishful thinking: I don&#8217;t seriously expect anyone to patch their CoreCLR to emit USDT probes, just for the sake of BCC tools or SystemTap. However, it goes to show what&#8217;s possible &#8212; and what&#8217;s desirable &#8212; for the future of .NET Core tracing, debugging, and profiling on Linux.</p>
<p>In the meantime, there seems to be a very recent <a href="https://lists.lttng.org/pipermail/lttng-dev/2017-March/026985.html">patchset proposing stack trace collection support for LTTng</a>. If merged, you should be able to attach a stack trace to an event using the context mechanism, similar to how we attached the pid and the process name in the previous post. Although that wouldn&#8217;t light up all the BCC tools and SystemTap, it would be a step in the right direction, and would make most of the analyses shown in this post possible.</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/04/02/tracing-net-core-on-linux-with-usdt-and-bcc/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Tracing Runtime Events in .NET Core on Linux</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/#comments</comments>
		<pubDate>Thu, 30 Mar 2017 12:30:03 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[.NETInternals]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[LTTNG]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[Tracing]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818973</guid>
		<description><![CDATA[After exploring the basic profiling story, let&#8217;s turn to ETW events. On Windows, the CLR is instrumented with a myriad of ETW events, which can be used to tackle very hard problems at runtime. Here are some examples of these events: Garbage collections Assembly load/unload Thread start/stop (including thread pool threads) Object allocations Exceptions thrown, caught, [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>After exploring the <a href="http://blogs.microsoft.co.il/sasha/2017/02/27/profiling-a-net-core-application-on-linux/">basic profiling story</a>, let&#8217;s turn to ETW events. On Windows, the CLR is instrumented with a myriad of ETW events, which can be used to tackle very hard problems at runtime. Here are some examples of these events:</p>
<ul>
<li>Garbage collections</li>
<li>Assembly load/unload</li>
<li>Thread start/stop (including thread pool threads)</li>
<li>Object allocations</li>
<li>Exceptions thrown, caught, filtered</li>
<li>Methods compiled (JIT)</li>
</ul>
<p>By collecting all of, or a subset of, these events, you can get a very nice picture of what your .NET application is doing. By combining these with Windows kernel events for CPU sampling, file accesses, process creations and more &#8212; you have a fairly complete tool for performance investigations. You might find my recorded <a href="https://www.youtube.com/watch?v=eX644hod65s&amp;list=PLtWrKx3nUGBeYn-pSCjEgBPL90SvpylZL&amp;index=22">DotNext talk on using PerfView for .NET performance investigations</a> useful &#8212; it shows how ETW and pretty much nothing else but ETW can be used to solve a huge variety of performance problems.</p>
<p>Unfortunately, ETW is a Windows-only mechanism. Event Tracing <em>for Windows</em>, you know? It is implemented in the Windows kernel, and that&#8217;s partly why it is so efficient and powerful. While looking for an equivalent Linux framework, the CLR team <a href="https://github.com/dotnet/coreclr/blob/master/Documentation/coding-guidelines/cross-platform-performance-and-eventing.md">considered multiple alternatives</a>, and decided to go with LTTng. it&#8217;s not that Linux doesn&#8217;t have enough tracers &#8212; quite the opposite &#8212; they had to choose which tracer is most appropriate for the CLR&#8217;s needs.</p>
<h3>Hello, LTTNG</h3>
<p><a href="https://lttng.org/">LTTng</a> is similar in spirit to ETW. It is a lightweight tracing framework that can process events in real-time or record them to a file for later processing. It supports multiple simultaneous trace sessions, and each session can have multiple providers enabled &#8212; a system call provider (which requires a kernel module to be installed) alongside with user-space providers, such as the CoreCLR. Traces can be analyzed on the same machine or on another machine, and there are viewers available for more advanced visualization. Custom EventSource-based providers are also supported.</p>
<p><strong>The massive, unfortunate, painful, unforgivable downside of this choice made by the CLR team is that LTTng doesn&#8217;t have stack trace support for user-space events.</strong> And this hurts more than you can imagine:</p>
<ul>
<li>You can collect GC events, but you can&#8217;t aggregate stack traces to figure out where the GCs are coming from</li>
<li>You can collect exception events, but you don&#8217;t have the exception call stack</li>
<li>You can collect assembly load events (and potential failures), but you don&#8217;t know what&#8217;s triggering that assembly load</li>
<li>You can collect object allocations, but you can&#8217;t aggregate statistics to indicate which code paths are causing lots of allocations (and thereby garbage collections)</li>
</ul>
<p>Effectively, not having stack trace support makes LTTng for CoreCLR a <i>logging</i> framework, which can be used to record and investigate logs, but not a <em>tracing</em> framework, which can help diagnose performance problems and troubleshoot hard issues in the field.</p>
<h3>Collecting LTTng Traces</h3>
<p>Let&#8217;s take a look at collecting LTTng traces from CoreCLR. As in the previous post, you could use the <a href="https://raw.githubusercontent.com/dotnet/corefx-tools/master/src/performance/perfcollect/perfcollect">perfcollect</a> tool &#8212; but it is typically overkill. First, perfcollect&#8217;s current implementation <em>always</em> turns on CPU sampling, which takes a massive amount of space and introduces a certain overhead. Second, perfcollect doesn&#8217;t have an event filter &#8212; it has exactly three modes for CoreCLR events: everything, GC only, and GC collect only. If you only care about assembly load events, or GC events, or JIT events, you&#8217;re not in luck.</p>
<p>Fortunately, it&#8217;s very easy to roll your own LTTng collection (you&#8217;ll need to install lttng-tools to record, and babeltrace to view):</p>
<pre># lttng create exceptions-trace
# lttng add-context --userspace --type vpid
# lttng add-context --userspace --type vtid
# lttng add-context --userspace --type procname
# lttng enable-event --userspace --tracepoint DotNETRuntime:Exception*
# lttng start</pre>
<p>In the preceding commands, <strong>lttng create</strong> creates a trace session that you then add providers to. The <strong>add-context</strong> command makes sure each event will have, in addition to the provider data, the PID, TID, and process name. Then, <strong>enable-event</strong> adds a specific event set from the CoreCLR provider &#8212; note that LTTNG doesn&#8217;t need any metadata about these events ahead of time. To get a list of all the possible events, one easy way is to just read the <a href="https://raw.githubusercontent.com/dotnet/corefx-tools/master/src/performance/perfcollect/perfcollect">perfcollect</a> script &#8212; it declares them all, e.g.:</p>
<pre>declare -a DotNETRuntime_NoKeyword=(
	DotNETRuntime:ExceptionThrown
	DotNETRuntime:Contention
	DotNETRuntime:RuntimeInformationStart
	DotNETRuntime:EventSource
)</pre>
<p>Finally, <strong>lttng start</strong> starts the trace session with the enabled providers. By default, the traces are written to <strong>~/session-name-timestamp</strong> &#8212; in our case, it&#8217;s going to be <strong>/root/exceptions-trace-20170330-something</strong>.</p>
<p>Now you run your scenario (the target application should have the <strong>COMPlus_EnableEventLog</strong> environment variable set to 1), and stop the trace when you&#8217;re done:</p>
<pre># lttng stop
# lttng destroy
# babeltrace ~/exceptions-trace
[07:31:11.751548909] (+?.?????????) ubuntu-16 DotNETRuntime:ExceptionThrown_V1: { cpu_id = 0 }, { ExceptionType = "System.NotSupportedException", ExceptionMessage = "Sample exception.", ExceptionEIP = 139767278604807, ExceptionHRESULT = 2148734229, ExceptionFlags = 16, ClrInstanceID = 0 }
[07:31:11.751603953] (+0.000055044) ubuntu-16 DotNETRuntime:ExceptionCatchStart: { cpu_id = 0 }, { EntryEIP = 139765244804131, MethodID = 139765233785640, MethodName = "void [Runny] Runny.Program::Main(string[])", ClrInstanceID = 0 }</pre>
<p><a href="http://diamon.org/babeltrace/">babeltrace</a> is a simple trace viewer, but there are also UI tools like <a href="http://tracecompass.org/">Trace Compass</a> and other visualization tools that can parse the CTF (Common Trace Format) specification.</p>
<h3>What Now?</h3>
<p>We have the fundamentals for collecting LTTng traces from CoreCLR and from custom providers, but the lack of stack traces is gnawing at me. In the next post, we will explore a hacky way of getting stack traces from interesting events, and even using <a href="https://github.com/iovisor/bcc">BCC</a>/SystemTap for tracing them.</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/03/30/tracing-runtime-events-in-net-core-on-linux/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Profiling a .NET Core Application on Linux</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/02/27/profiling-a-net-core-application-on-linux/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/02/27/profiling-a-net-core-application-on-linux/#comments</comments>
		<pubDate>Mon, 27 Feb 2017 14:29:58 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Performance]]></category>
		<category><![CDATA[Tools]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818961</guid>
		<description><![CDATA[In the same vein of my previous post on analyzing core dumps of .NET Core applications on Linux, let&#8217;s take a look at what it takes to do some basic performance profiling. When starting out, here are a few things I wrote down that would be nice to do: CPU profiling (sampling) to see where the CPU [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In the same vein of <a href="http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/">my previous post on analyzing core dumps of .NET Core applications on Linux</a>, let&#8217;s take a look at what it takes to do some basic performance profiling. When starting out, here are a few things I wrote down that would be nice to do:</p>
<ul>
<li>CPU profiling (sampling) to see where the CPU bottlenecks are</li>
<li>Grabbing stacks for interesting system events (file accesses, network, forks, etc.)</li>
<li>Tracing memory management activity such as GCs and object allocations</li>
<li>Identifying blocked time and the block and wake-up reasons</li>
</ul>
<p>With this task list in mind, let&#8217;s get started!</p>
<h3>Collecting Call Stacks of .NET Core Processes</h3>
<p>Generally speaking, a .NET Core application runs as a regular Linux process. There&#8217;s nothing particularly fancy involved, which means we can use <a href="https://perf.wiki.kernel.org/index.php/Main_Page">perf</a> and ftrace and even <a href="https://github.com/iovisor/bcc">BPF-based tools</a> to monitor performance. There&#8217;s just one catch: resolving symbols for call stacks. Here&#8217;s what happens when we profile a CPU-intensive application, running with defaults, using perf:</p>
<pre># perf record -F 97 -ag
<span class="s1">^C[ perf record: Woken up 1 times to write data ]</span>
<span class="s1">[ perf record: Captured and wrote 0.364 MB perf.data (789 samples) ]</span>
# perf report</pre>
<p><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report.png"><img class="alignnone wp-image-2818962" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-300x145.png" width="600" height="290" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-300x145.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-768x371.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-624x302.png 624w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report.png 933w" sizes="(max-width: 600px) 100vw, 600px" /></a></p>
<p>As you can see, debugging symbols are missing for pretty much everything under the dotnet process, so we only get addresses rather than method names. Fortunately, .NET Core ships with a knob that can be turned in order to get a perf map file generated in /tmp, which perf can then find and use for symbols. To turn on the knob, <strong>export COMPlus_PerfMapEnabled=1</strong>:</p>
<pre>$ export COMPlus_PerfMapEnabled=1
$ dotnet run &amp;
[1] 23503
<span class="s1">
$ ls /tmp/perf*</span>
<span class="s1">/tmp/perf-23503.map<span class="Apple-converted-space">  </span>/tmp/perf-23517.map<span class="Apple-converted-space">  </span>/tmp/perfinfo-23503.map<span class="Apple-converted-space">  </span>/tmp/perfinfo-23517.map</span>
<span class="s1">
$ head -2 /tmp/perfinfo-23517.map</span>
<span class="s1">ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b};</span>
<span class="s1">ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{ebb3ede4-dc41-44f4-93d3-152cd0b54ac0};</span>
<span class="s1">
$ head -2 /tmp/perf-23517.map</span>
<span class="s1">00007FABB90D4480 2e instance bool [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()</span>
<span class="s1">00007FABB90D44D0 2e instance System.__Canon /* MT: 0x00007FABB8F60318 */ [System.Private.CoreLib] dynamicClass::IL_STUB_UnboxingStub()</span></pre>
<p class="p2"><span class="s1">Equipped with these files, we can repeat the perf recording and then the report looks a bit better, with symbols starting to appear, such as <strong>ConsoleApplication.Primes::CountPrimes</strong>. Note that because the .NET process wrote the perf map file, you might need to tell perf to ignore the fact that it&#8217;s not owned by root by using the -f switch (<strong>perf report -f</strong>), or simply chown it.</span></p>
<p class="p2"><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-2.png"><img class="alignnone size-large wp-image-2818963" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-2-1024x261.png" alt="" width="625" height="159" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-2-1024x261.png 1024w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-2-300x77.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-2-768x196.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/perf-report-2-624x159.png 624w" sizes="(max-width: 625px) 100vw, 625px" /></a></p>
<p class="p2"><span class="s1">Although, who reads perf reports anyway &#8212; let&#8217;s generate a flame graph!</span></p>
<h3 class="p1">Getting a Flame Graph</h3>
<p>Well, <a href="http://www.brendangregg.com/flamegraphs.html">a flame graph is a flame graph</a>, nothing special about .NET Core here once we have the right data in our perf files. Let&#8217;s go:</p>
<pre># git clone --depth=1 https://github.com/BrendanGregg/FlameGraph
...
# perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl &gt; flame.svg</pre>
<p>Here&#8217;s a part of the generated flame graph, looking pretty good:</p>
<p><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph.png"><img class="alignnone size-large wp-image-2818964" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph-1024x445.png" alt="" width="625" height="272" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph-1024x445.png 1024w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph-300x131.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph-768x334.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph-624x271.png 624w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/flamegraph.png 1200w" sizes="(max-width: 625px) 100vw, 625px" /></a></p>
<p>If you look closely, you&#8217;ll notice that some symbols are still missing &#8212; notably, we don&#8217;t have any symbols for libcoreclr.so. And that&#8217;s just the way it is:</p>
<pre>$ objdump -t $(find /usr/share/dotnet -name libcoreclr.so)
<span class="s1">/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/libcoreclr.so: <span class="Apple-converted-space">    </span>file format elf64-x86-64</span>

<span class="s1">SYMBOL TABLE:</span>
<span class="s1">no symbols
</span></pre>
<p>If you build .NET Core from source, you can build with debug information, but that&#8217;s not what we get by default from the <a href="https://www.microsoft.com/net/core#linuxubuntu">Microsoft package repository</a>.</p>
<h3>Stacks For Other Events</h3>
<p>Now that we have the necessary building blocks for getting symbols resolved, we can of course move on to other events (and use other tools, too).  For example, let&#8217;s trace context switches to see where our threads are getting blocked:</p>
<pre># perf record -e sched:sched_switch -ag
...
# perf report -f</pre>
<p><a href="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch.png"><img class="alignnone size-large wp-image-2818965" src="http://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch-1024x365.png" alt="" width="625" height="223" srcset="https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch-1024x365.png 1024w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch-300x107.png 300w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch-768x273.png 768w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch-624x222.png 624w, https://blogs.microsoft.co.il/sasha/wp-content/uploads/sites/210/2017/02/sched-switch.png 1413w" sizes="(max-width: 625px) 100vw, 625px" /></a></p>
<p>(This is a fairly typical stack for where the thread gets preempted to let another thread run, even though it hasn&#8217;t called any blocking API.)</p>
<p>Or, let&#8217;s try some of my favorite tools from <a href="https://github.com/iovisor/bcc">BCC</a>. For example, let&#8217;s trace file opens:</p>
<pre># opensnoop
<span class="s1">PID<span class="Apple-converted-space">    </span>COMM <span class="Apple-converted-space">              </span>FD ERR PATH</span>
<span class="s1">1<span class="Apple-converted-space">      </span>systemd<span class="Apple-converted-space">            </span>17 <span class="Apple-converted-space">  </span>0 /proc/955/cgroup</span>
<span class="s1">24675<span class="Apple-converted-space">  </span>dotnet<span class="Apple-converted-space">              </span>3 <span class="Apple-converted-space">  </span>0 /etc/ld.so.cache</span>
<span class="s1">24675<span class="Apple-converted-space">  </span>dotnet<span class="Apple-converted-space">              </span>3 <span class="Apple-converted-space">  </span>0 /lib/x86_64-linux-gnu/libdl.so.2</span>
<span class="s1">24675<span class="Apple-converted-space">  </span>dotnet<span class="Apple-converted-space">              </span>3 <span class="Apple-converted-space">  </span>0 /lib/x86_64-linux-gnu/libpthread.so.0</span>
<span class="s1">24675<span class="Apple-converted-space">  </span>dotnet<span class="Apple-converted-space">              </span>3 <span class="Apple-converted-space">  </span>0 /usr/lib/x86_64-linux-gnu/libstdc++.so.6</span>
...
<span class="s1">24689<span class="Apple-converted-space">  </span>dotnet <span class="Apple-converted-space">            </span>47 <span class="Apple-converted-space">  </span>0 /home/vagrant/Runny/perfcollect</span>
<span class="s1">24689<span class="Apple-converted-space">  </span>dotnet <span class="Apple-converted-space">            </span>47 <span class="Apple-converted-space">  </span>0 /home/vagrant/Runny/opens.txt</span>
<span class="s1">24689<span class="Apple-converted-space">  </span>dotnet <span class="Apple-converted-space">            </span>47 <span class="Apple-converted-space">  </span>0 /home/vagrant/Runny/project.lock.json</span>
<span class="s1">24689<span class="Apple-converted-space">  </span>dotnet <span class="Apple-converted-space">            </span>47 <span class="Apple-converted-space">  </span>0 /home/vagrant/Runny/.Program.cs.swp</span>
<span class="s1">24689<span class="Apple-converted-space">  </span>dotnet <span class="Apple-converted-space">            </span>47 <span class="Apple-converted-space">  </span>0 /home/vagrant/Runny/Program.cs</span>
<span class="s1">24689<span class="Apple-converted-space">  </span>dotnet <span class="Apple-converted-space">            </span>-1<span class="Apple-converted-space">  </span>13 /home/vagrant/Runny/perf.data.old
</span></pre>
<p>We can conclude that everything more or less works. I dare say this is even a little easier than the JVM situation, where we need an external agent to generate debugging symbols. On the other hand, you have to run the .NET Core process with the <strong>COMPlus_PerfMapEnabled</strong> environment variable at initialization time &#8212; you can&#8217;t generate the debugging information after the process has already started without it.</p>
<p>But then I tried one more thing. Let&#8217;s try to aggregate file read stacks by using the <strong>stackcount</strong> tool from <a href="https://github.com/iovisor/bcc">BCC</a> to probe <strong>read</strong> in libpthread (which is where .NET Core&#8217;s syscalls are routed through on my box). The result is not very pretty:</p>
<pre>$ stackcount pthread:read -p 29751
<span class="s1">Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end.</span>
<span class="s1">  read
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>void [Runny] ConsoleApplication.Program::Main(string[])
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>coreclr_execute_assembly
</span><span class="s1"><span class="Apple-converted-space">  </span>coreclr::execute_assembly(void*, unsigned int, int, char const**, char const*, unsigned int*)
</span><span class="s1"><span class="Apple-converted-space">  </span>run(arguments_t const&amp;)
</span><span class="s1"><span class="Apple-converted-space">  </span>corehost_main</span><span class="s1">
... snipped for brevity ...
    16</span></pre>
<p>The [unknown] frames prior to Main are not very surprising &#8212; this is libcoreclr.so, and we already know it doesn&#8217;t ship with debuginfo. But the top-most frames are disappointing &#8212; this is a managed assembly, with managed frames, and there&#8217;s no reason why we shouldn&#8217;t be able to trace them.</p>
<p>To figure out where these frames are coming from, I&#8217;m going to need addresses. With the -v switch, <strong>stackcount</strong> prints addresses in addition to symbols:</p>
<pre># stackcount pthread:read -v -p 29751
<span class="s1">Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end.</span>
<span class="s1">^C
</span><span class="s1"><span class="Apple-converted-space">  </span>7f77b10b1680 <span class="Apple-converted-space">    </span>read
</span><span class="s1"><span class="Apple-converted-space">  </span>7f773651f267 <span class="Apple-converted-space">    </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>7f773651e8d5 <span class="Apple-converted-space">    </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>7f773651e880 <span class="Apple-converted-space">    </span>[unknown]
</span><span class="s1"><span class="Apple-converted-space">  </span>7f773651846a <span class="Apple-converted-space">    </span>[unknown]<span class="Apple-converted-space">  </span>7f77364bfb5d <span class="Apple-converted-space">    </span>void [Runny] ConsoleApplication.Program::Main(string[])
</span>...</pre>
<p>All right, so which module is <span class="s1">7f773651f267</span><span class="s1"> in, for example? Let&#8217;s take a look at the loaded modules (I&#8217;m keeping only executable regions):</span></p>
<pre>$ cat /proc/29751/maps | grep 'xp '
...
<span class="s1">7f77364bf000-7f77364c6000 rw</span><span class="s2">xp </span><span class="s1">00000000 00:00 0
</span><span class="s1">7f7736502000-7f7736530000 r-</span><span class="s2">xp </span><span class="s1">00003000 fd:00 787585 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll
</span><span class="s1">7f7736534000-7f7736564000 r-</span><span class="s2">xp </span><span class="s1">00003000 fd:00 787603 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.FileSystem.dll
</span><span class="s1">7f7736577000-7f7736578000 r-</span><span class="s2">xp </span><span class="s1">00002000 fd:00 787665 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Threading.Thread.dll
</span><span class="s1">7f773657a000-7f7736587000 r-</span><span class="s2">xp </span><span class="s1">00002000 fd:00 787606 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.IO.dll
</span><span class="s1">7f773658a000-7f773659a000 r-</span><span class="s2">xp </span><span class="s1">00002000 fd:00 787668 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Threading.dll
</span><span class="s1">7f773659d000-7f773659e000 r-</span><span class="s2">xp </span><span class="s1">00002000 fd:00 787658 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Text.Encoding.dll
</span>...</pre>
<p>OK, so we seem to be making progress &#8212; the desired address is clearly in the range that belongs to System.Console.dll. But, being a managed assembly, we&#8217;re not going to find any debug information in it:</p>
<pre class="p1"><span class="s1">$ file /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll</span>
<span class="s1">/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll: PE32+ executable (DLL) (console) Mono/.Net assembly, for MS Windows</span>

$ <span class="s1">objdump -tT /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll</span>
<span class="s1">objdump: /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll: File format not recognized</span></pre>
<p class="p1">Hmpf. So how are we supposed to get symbolic information for these addresses?</p>
<p>If you look online, you&#8217;ll find that there&#8217;s a tool on the .NET Core repos called perfcollect &#8212; essentially a Bash script for collecting performance information from .NET Core processes running on Linux. Let&#8217;s take a look.</p>
<h3>The perfcollect Tool</h3>
<p>The <a href="https://github.com/dotnet/corefx-tools/blob/master/src/performance/perfcollect/perfcollect">perfcollect</a> tool is fairly self-contained, and installs its own dependencies, most notably perf and <a href="https://lttng.org/">lttng</a> &#8212; .NET Core on Linux uses LTTng to generate various events, including garbage collections, object allocations, thread starts, assembly loads, and many others. Then, perfcollect follows your instructions and runs perf and lttng to collect CPU sampling events, package them up to a big zip file, and hand that to you.</p>
<p>What are you supposed to do with that zip file? <a href="https://blogs.msdn.microsoft.com/vancem/2016/02/20/analyzing-cpu-traces-from-linux-with-perfview/">Open it on Windows, apparently, using PerfView</a>. Now, <a href="https://www.youtube.com/watch?v=eX644hod65s">I love PerfView</a>, but a face palm is the only reasonable reaction to hearing this. What&#8217;s more, perfcollect does a bunch of work that you don&#8217;t really need if you plan to analyze the results on the same machine. But there&#8217;s <a href="https://github.com/dotnet/corefx-tools/blob/master/src/performance/perfcollect/perfcollect#L1344">one thing it does which sounds very relevant</a>:</p>
<pre>WriteStatus "Generating native image symbol files"

# Get the list of loaded images and use the path to libcoreclr.so to find crossgen.
# crossgen is expected to sit next to libcoreclr.so.
local buildidList=`$perfcmd buildid-list | grep libcoreclr.so | cut -d ' ' -f 2`</pre>
<p>That definitely sounds good! Turns out that .NET Core writes out an additional map file, named /tmp/perfinfo-$PID.map, which contains a list of image load events for your application&#8217;s assemblies. perfcollect then parses that list and invokes the crossgen tool to generate an additional perf map for each assembly, which can be fed into PerfView on the Windows side. Here&#8217;s what the perfinfo file looks like:</p>
<pre class="p1"><span class="s1">$ head -4 /tmp/perfinfo-29751.map</span>
<span class="s1">ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Private.CoreLib.ni.dll;{14b5688c-fe9a-4a0d-a0d1-b3af5439e23b};</span>
<span class="s1">ImageLoad;/home/vagrant/Runny/bin/Debug/netcoreapp1.1/Runny.dll;{319d161b-f17e-44f6-a210-f297df920194};</span>
<span class="s1">ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Runtime.dll;{819d412e-d773-4dbb-8d01-20d412b6cf09};</span>
<span class="s1">ImageLoad;/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/mscorlib.dll;{080dac22-6a0e-41ae-85fb-fb79cc07911b};</span></pre>
<p>Now, that&#8217;s what crossgen is <em>supposed</em> to do. And according to the comment above, crossgen is also <em>supposed</em> to be in the same folder as libcoreclr.so. But it isn&#8217;t:</p>
<pre>$ find /usr/share/dotnet -name crossgen</pre>
<p>That&#8217;s right, no results. Looking online, it seems that crossgen is generated as part of a .NET Core build, and part of the CoreCLR runtime NuGet package, but it&#8217;s not part of the pre-packaged binaries you get from the Microsoft package repositories. But with a little effort <a href="https://github.com/dotnet/corefx/blob/master/Tools-Override/crossgen.sh">borrowed from the corefx repo</a>, we can fetch our own crossgen:</p>
<pre>$ <span class="s1">export CoreClrVersion=1.1.0
$ export Rid=$(dotnet --info | sed -n -e </span><span class="s1">'s/^.*RID:[[:space:]]*//p')
$ </span><span class="s1">echo "{\"frameworks\":{\"netcoreapp1.1\":{\"dependencies\":{\"Microsoft.NETCore.Runtime.CoreCLR\":\"$CoreClrVersion\", \"Microsoft.NETCore.Platforms\": \"$CoreClrVersion\"}}},\"runtimes\":{\"$Rid\":{}}}" &gt; project.json
$ dotnet restore ./project.json --packages .
... output omitted for brevity ...
$ ls ./runtime.$Rid.Microsoft.NETCore.Runtime.CoreCLR/$CoreClrVersion/tools
crossgen</span></pre>
<p>All right! So we have crossgen, at which point we can try it out to generate debug information for System.Console.dll, or any other assembly we need, really. Here goes:</p>
<pre>$ crossgen /Platform_Assemblies_Paths /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0 \
           /CreatePerfMap . /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll

<span class="s1">Microsoft (R) CoreCLR Native Image Generator - Version 4.5.22220.0</span>
<span class="s1">Copyright (c) Microsoft Corporation.<span class="Apple-converted-space">  </span>All rights reserved.</span>

<span class="s1">Successfully generated perfmap for native assembly '/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll'.</span></pre>
<p>What does this perfmap file look like? The same as any other perfmap, except the addresses are not absolute &#8212; they are offsets from the load address of that module:</p>
<pre class="p1"><span class="s1">$ head -4 System.Console.ni.\{3b33b403-e8c1-44af-a7fb-369b2603f2a3\}.map</span>
<span class="s1">0000000000017590 58 void [System.Console] Interop::ThrowExceptionForIoErrno(valuetype Interop/ErrorInfo,string,bool,class [System.Runtime]System.Func`2&lt;valuetype Interop/ErrorInfo,valuetype Interop/ErrorInfo&gt;)</span>
<span class="s1">00000000000175F0 4d void [System.Console] Interop::CheckIo(valuetype Interop/Error,string,bool,class [System.Runtime]System.Func`2&lt;valuetype Interop/ErrorInfo,valuetype Interop/ErrorInfo&gt;)</span>
<span class="s1">0000000000017640 82 int64 [System.Console] Interop::CheckIo(int64,string,bool,class [System.Runtime]System.Func`2&lt;valuetype Interop/ErrorInfo,valuetype Interop/ErrorInfo&gt;)</span>
<span class="s1">00000000000176D0 17 int32 [System.Console] Interop::CheckIo(int32,string,bool,class [System.Runtime]System.Func`2&lt;valuetype Interop/ErrorInfo,valuetype Interop/ErrorInfo&gt;)</span></pre>
<p>Well, let&#8217;s see if we can at least resolve our desired address by using this approach. If you go back above, we were chasing the address 7f773651f267, loaded into System.Console.dll. First, let&#8217;s find the base address where System.Console.dll is loaded:</p>
<pre class="p1"><span class="s1">$ cat /proc/29751/maps | grep System.Console.dll | head -1
</span><span class="s1">7f77364ff000-7f7736500000 r--p 00000000 fd:00 787585 <span class="Apple-converted-space">                    </span>/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/System.Console.dll</span></pre>
<p><span class="s1">The offset, then, is:</span></p>
<pre class="p1"><span class="s1">$ echo 'ibase=16;obase=10;7F773651F267-</span><span class="s1">7F77364FF000</span><span class="s1">' | bc
</span><span class="s1">20267</span></pre>
<p>So now we need to look for this offset in the System.Console map file. The closest match is here:</p>
<pre class="p1"><span class="s1">0000000000</span><span class="s2">020</span><span class="s1">150 286 instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&amp;)</span></pre>
<p class="p1">With that, we have one frame resolved! There are only a few more <img src="https://s.w.org/images/core/emoji/11/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> This process begs to be automated. It would be great to automatically run crossgen, generate the map files with the relative addresses, convert them to absolute addresses, and merge them with the main /tmp/perf-PID.map file that other tools know and love. Read on!</p>
<h3 class="p1">dotnet-mapgen.py</h3>
<p>Well, I wrote <a href="https://gist.github.com/goldshtn/fe3f7c3b10ec7e5511ae755abaf52172">a small script called dotnet-mapgen.py</a> that automates the above steps and produces a single, unified map file that contains both JIT-compiled addresses and addresses that lie in crossgen&#8217;d (AOT-compiled) modules, such as System.Console.dll. The script has two modes:</p>
<pre>$ ./dotnet-mapgen.py generate $(pgrep -n dotnet)
<span class="s1">couldn't find crossgen, trying to fetch it automatically...</span>
<span class="s1">crossgen succesfully downloaded and placed in libcoreclr's dir</span>
<span class="s1">crossgen map generation: 15 succeeded, 2 failed</span></pre>
<p class="p1">In the &#8220;generate&#8221; mode, the script first locates crossgen (downloading it if necessary, using the NuGet restore approach shown above), and then runs crossgen on all the managed assemblies loaded into the target process. The 2 failures in the above output are for assemblies that weren&#8217;t AOT-compiled. Note that this generation step can be done once, and the map files retained for subsequent runs &#8212; unless you change the set of AOT-compiled assemblies loaded into your process.</p>
<pre class="p1">$ ./dotnet-mapgen.py merge $(pgrep -n dotnet)
<span class="s1">perfmap merging: 14 succeeded, 3 failed</span></pre>
<p class="p1">In the &#8220;merge&#8221; mode, the script calculates absolute addresses for all the symbols generated in the previous step, and concatenates this information to the main /tmp/perf-PID.map file for the target process.</p>
<p class="p1">There&#8217;s just one final problem. Turns out, perf refuses to use the map file for symbols that are in memory regions that belong to a module (in our case above, System.Console.dll). And there&#8217;s no way to convince perf that it should try to resolve such addresses using the map file. Fortunately, I have a bit more control over BCC tools, so I proposed a <a href="https://github.com/iovisor/bcc/pull/1015">PR for retrying symbol resolution using a map file</a> if the symbol wasn&#8217;t resolved using the original module. With this patch, here&#8217;s <strong>stackcount</strong>&#8216;s output:</p>
<pre class="p1"># stackcount ... pthread:read
Tracing 1 functions for "pthread:read"... Hit Ctrl-C to end.
^C
<span class="s1">  read</span>
<span class="s1"><span class="Apple-converted-space">  </span>instance valuetype System.ConsoleKeyInfo [System.Console] System.IO.StdInReader::ReadKey(bool&amp;)</span>
<span class="s1"><span class="Apple-converted-space">  </span>instance string [System.Console] System.IO.StdInReader::ReadLine(bool)</span>
<span class="s1"><span class="Apple-converted-space">  </span>instance string [System.Console] System.IO.StdInReader::ReadLine()</span>
<span class="s1"><span class="Apple-converted-space">  </span>string [System.Console] System.Console::ReadLine()
<span class="Apple-converted-space">  </span>void [Runny] ConsoleApplication.Program::Main(string[])</span>
...
  16</pre>
<p>Note how all symbols are now resolved to managed frames: the JIT-compiled Program::Main, and the AOT-compiled Console::ReadLine, StdInReader::ReadLine, and everything else.</p>
<p>Once this support lands in BCC, we can also do full-fidelity profiling with the <strong>profile</strong> tool, stack tracing with <strong>trace</strong> and <strong>stackcount,</strong> blocked time analysis using <strong>offcputime</strong>/<strong>offwaketime,</strong> and a variety of other tools. For most purposes, the perf-based workflow shown in the beginning of the post is a poorer alternative, if you can run a recent-enough kernel with BPF support.</p>
<h3 class="p1">So Where Are We?</h3>
<ul>
<li>We can use a variety of Linux performance tools to monitor .NET Core processes on Linux, including perf and BCC tools</li>
<li>To resolve stacks and symbols in general, the COMPlus_PerfMapEnabled environment variable needs to be set to 1 prior to running the .NET Core process</li>
<li>Some binaries still ship out of the box with no debug information (notably libcoreclr.so)</li>
<li>Some managed assemblies aren&#8217;t included in the dynamic /tmp/perf-PID.map file because they were compiled ahead-of-time (using crossgen), and don&#8217;t contain debugging information</li>
<li>For these assemblies, crossgen can generate map files that are sort-of useful, but can&#8217;t be used directly with perf</li>
<li>The dotnet-mapgen script can automate the process of generating map files for AOT-compiled assemblies and merging them into the main map file for analysis</li>
<li>BCC tools will be updated to support this scenario and enable full-fidelity tracing</li>
</ul>
<p>In a subsequent post, I also plan to explore the LTTng traces to see if we can trace garbage collections, object allocations, managed exceptions, and other events of interest.</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/02/27/profiling-a-net-core-application-on-linux/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Analyzing a .NET Core Core Dump on Linux</title>
		<link>http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/</link>
		<comments>http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/#comments</comments>
		<pubDate>Sun, 26 Feb 2017 14:31:29 +0000</pubDate>
		<dc:creator><![CDATA[Sasha Goldshtein]]></dc:creator>
				<category><![CDATA[.NETCore]]></category>
		<category><![CDATA[Debugging]]></category>
		<category><![CDATA[DEV]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://blogs.microsoft.co.il/sasha/?p=2818958</guid>
		<description><![CDATA[Recently, I had to open a core dump of a .NET Core application on Linux. I thought this walkthrough might be useful if you find yourself in the same boat, because, to be quite honest, I didn&#8217;t find it trivial. Configure Linux to Generate Core Dumps Before you begin, you need to configure your Linux box [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Recently, I had to open a core dump of a .NET Core application on Linux. I thought this walkthrough might be useful if you find yourself in the same boat, because, to be quite honest, I didn&#8217;t find it trivial.</p>
<h3>Configure Linux to Generate Core Dumps</h3>
<p>Before you begin, you need to configure your Linux box to generate core dumps in the first place. A lot of distros will have something preconfigured, but the simplest approach is to just put a file name in the /proc/sys/kernel/core_pattern file:</p>
<pre># echo core &gt; /proc/sys/kernel/core_pattern</pre>
<p>Additionally, there&#8217;s a system limit maximum size for the generated core file. <strong>ulimit -c unlimited</strong> removes that limit. Now, whenever your .NET Core process (or any other process) crashes, you&#8217;ll get a core file generated in the same directory. By the way, .NET Core on Linux x86_64 reserves a pretty gigantic address space, so expect your core files to be pretty big. But compression helps &#8212; I had a 6.5GB core dump compress into a 59MB gzip file.</p>
<h3>Installing LLDB</h3>
<p>To open the core dump, you&#8217;ll need LLDB built with the same architecture as your CoreCLR. Here&#8217;s how I found out what I needed:</p>
<pre class="p1"><span class="s1">$ find /usr/share/dotnet -name libsosplugin.so
</span><span class="s1">/usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.0/libsosplugin.so

$ </span><span class="s1">ldd $(find /usr/share/dotnet -name libsosplugin.so) | grep lldb
</span><span class="s1">lib</span><span class="s2"><b>lldb</b></span><span class="s1">-3.5.so.1 =&gt; /usr/lib/x86_64-linux-gnu/lib</span><span class="s2"><b>lldb</b></span><span class="s1">-3.5.so.1 (0x00007f0a6b2d8000)</span></pre>
<p class="p1">Seeing that LLDB 3.5 was required, I installed it with <strong>sudo apt install lldb-3.5</strong>, but YMMV on other distros, of course.</p>
<h3 class="p1">Opening The Core File And Loading SOS</h3>
<p>Now you&#8217;re ready to open the core file in LLDB. If you&#8217;re doing this on a different box, you&#8217;ll need the same version of .NET Core installed &#8212; that&#8217;s where the dotnet binary, SOS itself, and the DAC (debugger data access component) are coming from. You could also copy the /usr/share/dotnet/shared/Microsoft.NETCore.App/nnnn directory over, of course.</p>
<pre>$ lldb $(which dotnet) --core ./core</pre>
<p>Once inside LLDB, you&#8217;ll need to load the SOS plugin. It&#8217;s the one we found earlier:</p>
<pre>(lldb) plugin load /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.1/libsosplugin.so</pre>
<p>Now, if everything went well, the SOS plugin needs the DAC (libmscordaccore.so), so you&#8217;ll need to tell it where to look:</p>
<pre>(lldb) setclrpath /usr/share/dotnet/shared/Microsoft.NETCore.App/1.1.1</pre>
<p>With that, SOS should be loaded and ready for use.</p>
<h3 class="p1">Running Analysis</h3>
<p>You&#8217;d think you can just start running the SOS commands you know and love, but there&#8217;s one final hurdle. Here&#8217;s what happened when I opened a core file generated from a crash, and tried to get the exception information (note that you should prefix SOS commands with &#8216;sos&#8217;):</p>
<pre>(lldb) sos PrintException
The current thread is unmanaged</pre>
<p>&#8230; which is kind of odd, right? Considering that the process crashed as a result of a managed exception. Looking at the docs, it looks like SOS and LLDB have trouble communicating around the current thread&#8217;s identity. So first, let&#8217;s find the thread that encountered an exception:</p>
<pre>(lldb) sos Threads
<span class="s1">ThreadCount:<span class="Apple-converted-space">      </span>13
</span><span class="s1">UnstartedThread:<span class="Apple-converted-space">  </span>0
</span><span class="s1">BackgroundThread: 11
</span><span class="s1">PendingThread:<span class="Apple-converted-space">    </span>0
</span><span class="s1">DeadThread: <span class="Apple-converted-space">      </span>0
</span><span class="s1">Hosted Runtime: <span class="Apple-converted-space">  </span>no
</span><span class="s1"><span class="Apple-converted-space">                                                                                                        </span>Lock
</span><span class="s1"><span class="Apple-converted-space">       </span>ID OSID ThreadOBJ <span class="Apple-converted-space">          </span>State GC Mode <span class="Apple-converted-space">    </span>GC Alloc Context<span class="Apple-converted-space">                  </span>Domain <span class="Apple-converted-space">          </span>Count Apt Exception
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>1 57ff 0000000000C2B380<span class="Apple-converted-space">  </span>2020020 Preemptive<span class="Apple-converted-space">  </span>(nil):(nil) <span class="Apple-converted-space">                      </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>2 5807 0000000000CAAF80<span class="Apple-converted-space">    </span>21220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad2fcbc40:0x7f5ad2fcdae0 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Finalizer)
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>4 580a 0000000000DC2730<span class="Apple-converted-space">    </span>21220 Preemptive<span class="Apple-converted-space">  </span>(nil):(nil) <span class="Apple-converted-space">                      </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>6 580d 0000000000EC1D70<span class="Apple-converted-space">    </span>21220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad576b4d0:0x7f5ad576cf58 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>7 5a13 00007F5ABC0292A0<span class="Apple-converted-space">  </span>1021220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad5888d30:0x7f5ad5888fd0 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Threadpool Worker)
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>8 5a15 00007F5AC006A3F0<span class="Apple-converted-space">    </span>21020 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad594dd10:0x7f5ad594ece8 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn System.IO.FileNotFoundException 00007f5ad593fa80 (nested exceptions)
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>9 5a16 00007F5AC00916A0<span class="Apple-converted-space">    </span>21220 Preemptive<span class="Apple-converted-space">  </span>(nil):(nil) <span class="Apple-converted-space">                      </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn
</span><span class="s1">XXXX <span class="Apple-converted-space">  </span>10 5a17 00007F5AC80015D0<span class="Apple-converted-space">  </span>1021220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad593a9a0:0x7f5ad593b978 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Threadpool Worker)
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>5 5a18 00007F5AC0814DF0<span class="Apple-converted-space">    </span>21220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad50ed1b8:0x7f5ad50eefd0 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn
</span><span class="s1">XXXX<span class="Apple-converted-space">    </span>3 5a19 00007F5C54000A00<span class="Apple-converted-space">  </span>1020220 Preemptive<span class="Apple-converted-space">  </span>(nil):(nil) <span class="Apple-converted-space">                      </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Threadpool Worker)
</span><span class="s1">XXXX <span class="Apple-converted-space">  </span>11 5a1a 00007F5C50019270<span class="Apple-converted-space">  </span>1021220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad58a5710:0x7f5ad58a6fd0 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Threadpool Worker)
</span><span class="s1">XXXX <span class="Apple-converted-space">  </span>12 5a1b 00007F5AC0831B80<span class="Apple-converted-space">  </span>1021220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad58fcf68:0x7f5ad58fd000 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Threadpool Worker)
</span><span class="s1">XXXX <span class="Apple-converted-space">  </span>13 5a1c 0000000000E8F720<span class="Apple-converted-space">  </span>1021220 Preemptive<span class="Apple-converted-space">  </span>0x7f5ad593bc80:0x7f5ad593d978 <span class="Apple-converted-space">    </span>0000000000C195C0 0 <span class="Apple-converted-space">    </span>Ukn (Threadpool Worker)</span></pre>
<p>Thread #8 looks suspicious, what with the System.IO.FileNotFoundException in the Exception column. Now, let&#8217;s see all the LLDB threads:</p>
<pre>(lldb) thread list
<span class="s1">Process 0 stopped
</span><span class="s1">* thread #1: tid = 0, 0x00007f5c5d83b7ef libc.so.6`__GI_raise(sig=2) + 159 at raise.c:58, name = 'dotnet', stop reason = signal SIGABRT
</span><span class="s1"><span class="Apple-converted-space">  </span>thread #2: tid = 1, 0x00007f5c5e482510 libpthread.so.0`__pthread_cond_wait + 256, stop reason = signal SIGABRT
</span><span class="s1"><span class="Apple-converted-space">  </span>thread #3: tid = 2, 0x00007f5c5d907d29 libc.so.6`syscall + 25, stop reason = signal SIGABRT
</span><span class="s1"><span class="Apple-converted-space">  </span>thread #4: tid = 3, 0x00007f5c5d907d29 libc.so.6`syscall + 25, stop reason = signal SIGABRT
... more threads snipped for brevity ...</span></pre>
<p>Here, it looks like thread 1 is the one with the exception being raised. So we have to map the OS thread ID from the first command, to the LLDB thread id from the second command:</p>
<pre>(lldb) <span class="s1">setsostid 5a15 1
</span><span class="s1">Mapped sos OS tid 0x5a15 to lldb thread index 1</span></pre>
<p>And now, we&#8217;re ready to roll:</p>
<pre>(lldb) sos PrintException
<span class="s1">Exception object: 00007f5ad593fa80</span>
<span class="s1">Exception type: <span class="Apple-converted-space">  </span>System.IO.FileNotFoundException</span>
<span class="s1">Message:<span class="Apple-converted-space">          </span>Could not load the specified file.</span>
<span class="s1">InnerException: <span class="Apple-converted-space">  </span>&lt;none&gt;</span>
<span class="s1">StackTrace (generated):<span class="Apple-converted-space">    </span>SP <span class="Apple-converted-space">              </span>IP <span class="Apple-converted-space">              </span>Function</span>
<span class="s1"><span class="Apple-converted-space">    </span>00007F5C45D227C0 00007F5BE37412E7 System.Private.CoreLib.ni.dll!System.Runtime.Loader.AssemblyLoadContext.ResolveUsingEvent(System.Reflection.AssemblyName)+0x20ab07</span>
<span class="s1"><span class="Apple-converted-space">    </span>00007F5C45D227F0 00007F5BE353664F System.Private.CoreLib.ni.dll!System.Runtime.Loader.AssemblyLoadContext.ResolveUsingResolvingEvent(IntPtr, System.Reflection.AssemblyName)+0x4f</span>
<span class="s1">
StackTraceString: &lt;none&gt;</span>
<span class="s1">HResult: 80070002</span>

<span class="s1">Nested exception -------------------------------------------------------------</span>
<span class="s1">Exception object: 00007f5ad593dea0</span>
<span class="s1">Exception type: <span class="Apple-converted-space">  </span>System.InvalidOperationException</span>
<span class="s1">Message:<span class="Apple-converted-space">          </span>Authorization cannot be requested before logging in.</span>
<span class="s1">InnerException: <span class="Apple-converted-space">  </span>&lt;none&gt;</span>
<span class="s1">StackTrace (generated):</span>
<span class="s1"><span class="Apple-converted-space">    </span>SP <span class="Apple-converted-space">              </span>IP <span class="Apple-converted-space">              </span>Function</span>
<span class="s1"><span class="Apple-converted-space">    </span>00007F5C45D29890 00007F5BE63002FE kitt3ns.dll!WebApplication.Controllers.AuthorizationBackgroundWorker.VerifyAuthorized(System.String)+0xae</span>
<span class="s1"><span class="Apple-converted-space">    </span>00007F5C45D298D0 00007F5BE630022B kitt3ns.dll!WebApplication.Controllers.AuthorizationBackgroundWorker.RequestAuthorization()+0x2b</span>
<span class="s1"><span class="Apple-converted-space">    </span>00007F5C45D298E0 00007F5BE55BC31C kitt3ns.dll!WebApplication.Controllers.AuthorizationBackgroundWorker+&lt;&gt;c.&lt;Authorize&gt;b__0_0()+0x4c</span>
<span class="s1"><span class="Apple-converted-space">    </span>00007F5C45D29910 00007F5BE33BDF11 System.Private.CoreLib.ni.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x111</span>

<span class="s1">StackTraceString: &lt;none&gt;</span>
<span class="s1">HResult: 80131509

</span>(lldb) sos ClrStack
<span class="s1">OS Thread Id: 0x5a15 (1)</span>
<span class="s1"><span class="Apple-converted-space">        </span>Child SP <span class="Apple-converted-space">              </span>IP Call Site</span>
<span class="s1">00007F5C45D272C8 00007f5c5d83b7ef [HelperMethodFrame: 00007f5c45d272c8]</span>
<span class="s1">00007F5C45D273E0 00007F5BE33BDF11 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)</span>
<span class="s1">00007F5C45D29770 00007f5c5cbe9bad [HelperMethodFrame: 00007f5c45d29770]</span>
<span class="s1">00007F5C45D29890 00007F5BE63002FE WebApplication.Controllers.AuthorizationBackgroundWorker.VerifyAuthorized(System.String) [/home/vagrant/kitt3ns/Controllers/AccountController.cs @ 37]</span>
<span class="s1">00007F5C45D298D0 00007F5BE630022B WebApplication.Controllers.AuthorizationBackgroundWorker.RequestAuthorization() [/home/vagrant/kitt3ns/Controllers/AccountController.cs @ 30]</span>
<span class="s1">00007F5C45D298E0 00007F5BE55BC31C WebApplication.Controllers.AuthorizationBackgroundWorker+&lt;&gt;c.&lt;Authorize&gt;b__0_0() [/home/vagrant/kitt3ns/Controllers/AccountController.cs @ 24]</span>
<span class="s1">00007F5C45D29910 00007F5BE33BDE71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)</span>
<span class="s1">00007F5C45D29B50 00007f5c5cbfb207 [GCFrame: 00007f5c45d29b50] </span>
<span class="s1">00007F5C45D29D30 00007f5c5cbfb207 [DebuggerU2MCatchHandlerFrame: 00007f5c45d29d30] </span></pre>
<p>This gives us the exception information and the thread&#8217;s current stack, if we want it. We could similarly inspect other threads by mapping the OS thread id to the LLDB thread id, but for a thread that didn&#8217;t have an exception, where do you get that clue that connects the OS thread id to the debugger thread ID? Well, it seems that GDB is using the same numbering as LLDB, but in GDB you can actually see the LWP id (on Linux, GDB LWP = kernel pid = thread) using &#8216;info threads&#8217;:</p>
<pre>$ gdb $(which dotnet) --core ./core
...

(gdb) info threads
<span class="s1"><span class="Apple-converted-space">  </span>Id <span class="Apple-converted-space">  </span>Target Id <span class="Apple-converted-space">        </span>Frame</span>
<span class="s1">* 1<span class="Apple-converted-space">    </span>Thread 0x7f5c45d2a700 (LWP 23061) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58</span>
<span class="s1"><span class="Apple-converted-space">  </span>2<span class="Apple-converted-space">    </span>Thread 0x7f5c5eaab740 (LWP 22527) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()</span>
<span class="s1"><span class="Apple-converted-space">    </span>at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219</span>
<span class="s1"><span class="Apple-converted-space">  </span>3<span class="Apple-converted-space">    </span>Thread 0x7f5c5b411700 (LWP 22529) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38</span>
<span class="s1"><span class="Apple-converted-space">  </span>4<span class="Apple-converted-space">    </span>Thread 0x7f5c5ac10700 (LWP 22530) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38</span>
<span class="s1"><span class="Apple-converted-space">  </span>5<span class="Apple-converted-space">    </span>Thread 0x7f5c5a40f700 (LWP 22531) 0x00007f5c5d9020bd in poll () at ../sysdeps/unix/syscall-template.S:84</span>
<span class="s1"><span class="Apple-converted-space">  </span>6<span class="Apple-converted-space">    </span>Thread 0x7f5c59c0e700 (LWP 22532) 0x00007f5c5e485d8d in __pause_nocancel () at ../sysdeps/unix/syscall-template.S:84</span>
<span class="s1"><span class="Apple-converted-space">  </span>7<span class="Apple-converted-space">    </span>Thread 0x7f5c5940d700 (LWP 22533) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()</span>
<span class="s1"><span class="Apple-converted-space">    </span>at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219</span>
<span class="s1"><span class="Apple-converted-space">  </span>8<span class="Apple-converted-space">    </span>Thread 0x7f5c589b2700 (LWP 22534) 0x00007f5c5e482510 in pthread_cond_wait@@GLIBC_2.3.2 ()</span>
<span class="s1"><span class="Apple-converted-space">    </span>at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219
</span><span class="s1"><span class="Apple-converted-space">  </span>9<span class="Apple-converted-space">    </span>Thread 0x7f5c498ae700 (LWP 22535) 0x00007f5c5e4828b9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()</span>
<span class="s1"><span class="Apple-converted-space">    </span>at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:258</span>
<span class="s1"><span class="Apple-converted-space">  </span>10 <span class="Apple-converted-space">  </span>Thread 0x7f5c454ef700 (LWP 22538) 0x00007f5c5e4856ed in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84</span>
<span class="s1"><span class="Apple-converted-space">  </span>11 <span class="Apple-converted-space">  </span>Thread 0x7f5ad2324700 (LWP 22540) 0x00007f5c5e4856ed in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84</span>
<span class="s1"><span class="Apple-converted-space">  </span>12 <span class="Apple-converted-space">  </span>Thread 0x7f5ad1b23700 (LWP 22541) syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38</span>
<span class="s1"><span class="Apple-converted-space">  </span>13 <span class="Apple-converted-space">  </span>Thread 0x7f5ad2b25700 (LWP 23059) 0x00007f5c5e4828b9 in pthread_cond_timedwait@@GLIBC_2.3.2 ()</span>
<span class="s1"><span class="Apple-converted-space">    </span>at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:258</span>
... more output snipped for brevity ...</pre>
<p>So, for example, suppose we wanted to know what managed thread #6 (OS thread id 0x580d<span class="s1"> from the &#8216;sos Threads&#8217; output above) was doing when the dump file was generated. 0x580d = 22541, which is thread #12 in the output above. Going back to LLDB (note the hex notation for both thread ids):</span></p>
<pre>(lldb) setsostid 580d c
<span class="s1">Mapped sos OS tid 0x580d to lldb thread index 12</span>

<span class="s1">(lldb) clrstack</span>
<span class="s1">OS Thread Id: 0x580d (12)</span>
<span class="s1"><span class="Apple-converted-space">        </span>Child SP <span class="Apple-converted-space">              </span>IP Call Site</span>
<span class="s1">00007F5AD1B227F8 00007f5c5d907d29 [InlinedCallFrame: 00007f5ad1b227f8] Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.Libuv+NativeMethods.uv_run(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)</span>
<span class="s1">00007F5AD1B227F8 00007f5be45cea3a [InlinedCallFrame: 00007f5ad1b227f8] Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.Libuv+NativeMethods.uv_run(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)</span>
<span class="s1">00007F5AD1B227E0 00007F5BE45CEA3A DomainBoundILStubClass.(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)</span>
<span class="s1">00007F5AD1B22890 00007F5BE45CE968 Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.Libuv.run(Microsoft.AspNetCore.Server.Kestrel.Internal.Networking.UvLoopHandle, Int32)</span>
<span class="s1">00007F5AD1B228B0 00007F5BE45CBCFF Microsoft.AspNetCore.Server.Kestrel.Internal.KestrelThread.ThreadStart(System.Object)</span>
<span class="s1">00007F5AD1B22910 00007F5BE33BDE71 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)</span>
<span class="s1">00007F5AD1B22B50 00007f5c5cbfb207 [GCFrame: 00007f5ad1b22b50]</span>
<span class="s1">00007F5AD1B22D30 00007f5c5cbfb207 [DebuggerU2MCatchHandlerFrame: 00007f5ad1b22d30]</span></pre>
<p>Other SOS commands that don&#8217;t depend on thread context (e.g. listing assemblies, heap objects, finalization queues and so on) do not require any fiddling with thread ids, and you can just run them directly.</p>
<h3>Summary</h3>
<p>So, what we had to do in order to open a .NET Core core dump from a Linux system was:</p>
<ul>
<li>Set up the Linux system to generate core dumps on crash</li>
<li>Copy or install the right version of .NET Core on the analysis machine</li>
<li>Install the version of LLDB matching your .NET Core&#8217;s SOS plugin</li>
<li>Load the SOS plugin in LLDB and tell it where to find the DAC</li>
<li>Set the debugger thread id for SOS thread-sensitive commands to work</li>
<li>Run <strong>sos PrintException</strong> or any other commands to analyze the crash</li>
</ul>
<p>Fun fun fun.</p>
<hr />
<p><em>You can also <a href="https://twitter.com/goldshtn">follow me on Twitter</a>, where I put stuff that doesn’t necessarily deserve a full-blown blog post.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blogs.microsoft.co.il/sasha/2017/02/26/analyzing-a-net-core-core-dump-on-linux/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>
