<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" gd:etag="W/&quot;CEACQHozeyp7ImA9WxJUF04.&quot;"><id>tag:blogger.com,1999:blog-6800934446457898793</id><updated>2009-07-16T04:26:01.483-04:00</updated><title>Moserware</title><subtitle type="html">Jeff Moser's software development adventures.</subtitle><link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="http://www.moserware.com/feeds/posts/default" /><link rel="alternate" type="text/html" href="http://www.moserware.com/" /><link rel="next" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default?start-index=4&amp;max-results=3&amp;redirect=false&amp;v=2" /><author><name>Jeff Moser</name><uri>http://www.blogger.com/profile/16074905903060665396</uri><email>noreply@blogger.com</email></author><generator version="7.00" uri="http://www.blogger.com">Blogger</generator><openSearch:totalResults>40</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>3</openSearch:itemsPerPage><geo:lat>39.95645</geo:lat><geo:long>-86.008729</geo:long><link rel="self" href="http://feeds.feedburner.com/Moserware" type="application/atom+xml" /><feedburner:emailServiceId>Moserware</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><entry gd:etag="W/&quot;CEYBSXs8fSp7ImA9WxJXFkQ.&quot;"><id>tag:blogger.com,1999:blog-6800934446457898793.post-680453867994856278</id><published>2009-06-10T08:57:00.010-04:00</published><updated>2009-06-10T23:49:18.575-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-06-10T23:49:18.575-04:00</app:edited><title>The First Few Milliseconds of an HTTPS Connection</title><content type="html">&lt;p&gt;Convinced from spending hours reading &lt;a href="http://www.amazon.com/Tuscan-Whole-Milk-Gallon-128/product-reviews/B00032G1S0/ref=dp_top_cm_cr_acr_txt?ie=UTF8&amp;amp;showViewpoints=1"&gt;rave reviews&lt;/a&gt;, Bob eagerly clicked "Proceed to Checkout" for his gallon of &lt;a href="http://www.amazon.com/gp/product/B00032G1S0?ie=UTF8&amp;amp;tag=moserware-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=B00032G1S0"&gt;Tuscan Whole Milk&lt;/a&gt; and...&lt;/p&gt;&lt;p&gt;Whoa! What just happened?&lt;/p&gt;&lt;a href="http://4.bp.blogspot.com/_Zfbv3mHcYrc/ShgnOU1MihI/AAAAAAAABNI/BAF-YQdhkJU/s1600-h/securitysymbols.png"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/ShgnOU1MihI/AAAAAAAABNI/BAF-YQdhkJU/s400/securitysymbols.png" /&gt;&lt;/a&gt; &lt;p&gt;In the 220 milliseconds that flew by, a lot of interesting stuff happened to make Firefox change the address bar color and put a lock in the lower right corner. With the help of &lt;a href="http://www.wireshark.org/"&gt;Wireshark&lt;/a&gt;, my favorite network tool, and a slightly modified debug build of Firefox, we can see &lt;i&gt;exactly&lt;/i&gt; what's going on.&lt;/p&gt;&lt;p&gt;By agreement of &lt;a href="http://tools.ietf.org/html/rfc2818"&gt;RFC 2818&lt;/a&gt;, Firefox knew that "https" meant it should connect to &lt;a href="http://tools.ietf.org/html/rfc2818#section-2.3"&gt;port 443&lt;/a&gt; at Amazon.com:&lt;/p&gt;&lt;a href="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si8K9mA5QcI/AAAAAAAABOI/1agNWSS6NBE/s1600-h/httpsport.png"&gt;&lt;img src="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si8K9mA5QcI/AAAAAAAABOI/1agNWSS6NBE/s400/httpsport.png" /&gt;&lt;/a&gt; &lt;p&gt;Most people associate HTTPS with &lt;a id="fyrr" title="SSL" href="http://en.wikipedia.org/wiki/Secure_Sockets_Layer"&gt;SSL&lt;/a&gt; (Secure Sockets Layer) which was &lt;a id="yq6y" title="created by Netscape" href="http://www.mozilla.org/projects/security/pki/nss/history.html"&gt;created by Netscape in the mid 90's&lt;/a&gt;. This is becoming less true over time. As Netscape lost market share, SSL's maintenance moved to the Internet Engineering Task Force (&lt;a href="http://en.wikipedia.org/wiki/IETF"&gt;IETF&lt;/a&gt;). The first post-Netscape version was re-branded as Transport Layer Security (&lt;a href="http://en.wikipedia.org/wiki/Secure_Sockets_Layer"&gt;TLS&lt;/a&gt;) 1.0 which &lt;a href="http://tools.ietf.org/html/rfc2246"&gt;was released&lt;/a&gt; in January 1999. It's rare to see true "SSL" traffic given that TLS has been around for 10 years. &lt;/p&gt;&lt;h4&gt;Client Hello&lt;/h4&gt;&lt;p&gt;TLS wraps all traffic in "records" of different types. We see that the first byte out of our browser is the hex byte 0x16 = 22 which &lt;a id="ihzy" title="means" href="http://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml"&gt;means&lt;/a&gt; that this is a "handshake" record:&lt;/p&gt;&lt;a href="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si-df4pU51I/AAAAAAAABQg/A1flWimwg9M/s1600-h/clienthellowithannotations.png"&gt;&lt;img src="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si-df4pU51I/AAAAAAAABQg/A1flWimwg9M/s400/clienthellowithannotations.png" /&gt;&lt;/a&gt; &lt;p&gt;The next two bytes are 0x0301 which indicate that this is a version 3.1 record which shows that TLS 1.0 is essentially SSL 3.1. &lt;/p&gt;&lt;p&gt;The handshake record is broken out into several messages. The first is our "Client Hello" message (0x01). There are a few important things here:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Random:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8g9EXh8uI/AAAAAAAABOY/oBt1zr_n1XE/s1600-h/randomclientbytes.png"&gt;&lt;img src="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8g9EXh8uI/AAAAAAAABOY/oBt1zr_n1XE/s400/randomclientbytes.png" /&gt;&lt;/a&gt;&lt;br /&gt;There are four bytes representing the current Coordinated Universal Time (&lt;a href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time"&gt;UTC&lt;/a&gt;) in the Unix epoch format, which is the number of seconds since January 1, 1970. In this case, 0x4a2f07ca. It's followed by 28 random bytes. This will be used later on. &lt;/li&gt;&lt;li&gt;Session ID:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Si8iGgIk_gI/AAAAAAAABOg/NsPg9pMMpCw/s1600-h/sessionid.png"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Si8iGgIk_gI/AAAAAAAABOg/NsPg9pMMpCw/s400/sessionid.png" /&gt;&lt;/a&gt;&lt;br /&gt;Here it's empty/null. If we had previously connected to Amazon.com a few seconds ago, we could potentially resume a session and avoid a full handshake. &lt;/li&gt;&lt;li&gt;Cipher Suites:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8i6DwZDiI/AAAAAAAABOo/_Pv_1d-PbgU/s1600-h/ciphersuites.png"&gt;&lt;img src="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8i6DwZDiI/AAAAAAAABOo/_Pv_1d-PbgU/s400/ciphersuites.png" /&gt;&lt;/a&gt;&lt;br /&gt;This is a list of all of the encryption algorithms that the browser is willing to support. Its top pick is a very strong choice of "&lt;a href="http://en.wikipedia.org/wiki/Secure_Sockets_Layer"&gt;TLS&lt;/a&gt;_&lt;a href="http://en.wikipedia.org/wiki/Elliptic_Curve_Diffie-Hellman"&gt;ECDHE&lt;/a&gt;_&lt;a href="http://en.wikipedia.org/wiki/Elliptic_Curve_DSA"&gt;ECDSA&lt;/a&gt;_WITH_&lt;a href="http://en.wikipedia.org/wiki/Advanced_Encryption_Standard"&gt;AES&lt;/a&gt;_256_&lt;a href="http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation#Cipher-block_chaining_.28CBC.29"&gt;CBC&lt;/a&gt;_&lt;a href="http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-0_and_SHA-1"&gt;SHA&lt;/a&gt;" followed by 33 others that it's willing to accept. Don't worry if none of that makes sense. We'll find out later that Amazon doesn't pick our first choice anyway. &lt;/li&gt;&lt;li&gt;&lt;a id="z8g0" title="server_name extension" href="http://tools.ietf.org/html/rfc4366#section-3.1"&gt;server_name extension&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si8jtnutNFI/AAAAAAAABOw/Czowyq3F-6Y/s1600-h/server_name.png"&gt;&lt;img src="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si8jtnutNFI/AAAAAAAABOw/Czowyq3F-6Y/s400/server_name.png" /&gt;&lt;/a&gt;&lt;br /&gt;This is a way to tell Amazon.com that our browser is trying to reach &lt;a href="https://www.amazon.com/"&gt;https://www.amazon.com/&lt;/a&gt;. This is really convenient because our TLS handshake occurs long before any HTTP traffic. HTTP has a &lt;a id="v56x" title="" href="http://tools.ietf.org/html/rfc2616#section-14.23"&gt;"Host" header&lt;/a&gt; which allows a cost-cutting Internet hosting companies to pile hundreds of websites onto a single IP address. SSL has traditionally required a different IP for each site, but this extension allows the server to respond with the appropriate certificate that the browser is looking for. If nothing else, this extension should allow an extra week or so of IPv4 addresses. &lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Server Hello&lt;/h4&gt;&lt;p&gt;Amazon.com replies with a handshake record that's a massive two packets in size (2,551 bytes). The record has version bytes of 0x0301 meaning that Amazon agreed to our request to use TLS 1.0. This record has three sub-messages with some interesting data:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;"Server Hello" Message (2):&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si-euEnAA6I/AAAAAAAABQo/l4-KRrTyNWY/s1600-h/serverhello.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si-euEnAA6I/AAAAAAAABQo/l4-KRrTyNWY/s400/serverhello.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We get the server's four byte time Unix epoch time representation and its 28 random bytes that will be used later.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A 32 byte session ID in case we want to reconnect without a big handshake.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Of the 34 cipher suites we offered, Amazon picked "TLS_RSA_WITH_RC4_128_MD5" (0x0004). This means that it will use the "&lt;a href="http://en.wikipedia.org/wiki/RSA"&gt;RSA&lt;/a&gt;" &lt;a href="http://en.wikipedia.org/wiki/Public-key_cryptography"&gt;public key&lt;/a&gt; algorithm to verify certificate signatures and exchange keys, the &lt;a href="http://en.wikipedia.org/wiki/RC4"&gt;RC4&lt;/a&gt; encryption algorithm to encrypt data, and the &lt;a href="http://en.wikipedia.org/wiki/MD5"&gt;MD5&lt;/a&gt; hash function to verify the contents of messages. We'll cover these in depth later on. I personally think Amazon had selfish reasons for choosing this cipher suite. Of the ones on the list, it was the one that was least CPU intensive to use so that Amazon could crowd more connections onto each of their servers. A much less likely &lt;span style="font-size:85%;"&gt;possibility &lt;/span&gt;is that they wanted to pay special tribute to &lt;a href="http://en.wikipedia.org/wiki/Ronald_L._Rivest"&gt;Ron Rivest&lt;/a&gt;, who created all three of these algorithms.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Certificate Message (11):&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Si8lTxR3m0I/AAAAAAAABPA/I-le95y0ldw/s1600-h/certificatemessage.png"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Si8lTxR3m0I/AAAAAAAABPA/I-le95y0ldw/s400/certificatemessage.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;This message takes a whopping 2,464 bytes and is the certificate that the client can use to validate Amazon's. It isn't anything fancy. You can view most of its contents in your browser:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Sirx5XZBa5I/AAAAAAAABNo/Z-R75rsjCL8/s1600-h/AmazonBasicCertInfo.png"&gt;&lt;img src="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Sirx5XZBa5I/AAAAAAAABNo/Z-R75rsjCL8/s400/AmazonBasicCertInfo.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;"Server Hello Done" Message (14):&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8l2fMvVlI/AAAAAAAABPI/QrxJ3S9ezOo/s1600-h/serverhellodone.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8l2fMvVlI/AAAAAAAABPI/QrxJ3S9ezOo/s400/serverhellodone.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;This is a zero byte message that tells the client that it's done with the "Hello" process and indicate that the server won't be asking the client for a certificate.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ol&gt;&lt;h4&gt;Checking out the Certificate&lt;/h4&gt;&lt;p&gt;The browser has to &lt;a href="http://www.koders.com/c/fid340AB659241B7C717B5B3E0095BBA4245FCE34FD.aspx#L862"&gt;figure out&lt;/a&gt; if it should trust Amazon.com. In this case, it's using certificates. It looks at Amazon's certificate and &lt;a href="http://www.koders.com/c/fid9207CD3EB61F5F08E38858D14997264BEDB5B62C.aspx#L1091"&gt;sees&lt;/a&gt; that the current time is between the "not before" time of August 26th, 2008 and before the "not after" time of August 27, 2009. It also &lt;a href="http://www.koders.com/c/fid9207CD3EB61F5F08E38858D14997264BEDB5B62C.aspx?s=CERT_CheckCertValidTimes#L1211"&gt;checks&lt;/a&gt; to make sure that the certificate's public key is authorized for exchanging secret keys. &lt;/p&gt;&lt;p&gt;Why should we trust this certificate? &lt;/p&gt;&lt;p&gt;Attached to the certificate is a "signature" that is just a really long number in &lt;a href="http://en.wikipedia.org/wiki/Endianness#Big-endian"&gt;big-endian&lt;/a&gt; format:&lt;/p&gt;&lt;a href="http://4.bp.blogspot.com/_Zfbv3mHcYrc/SihpJBoNeDI/AAAAAAAABNg/DgLY221ncEo/s1600-h/AmazonCertSigned.png"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/SihpJBoNeDI/AAAAAAAABNg/DgLY221ncEo/s400/AmazonCertSigned.png" /&gt;&lt;/a&gt; &lt;p&gt;Anyone could have sent us these bytes. Why should we trust this signature? To answer that question, need to make a speedy detour into &lt;a id="tlez" title="mathemagic land" href="http://en.wikipedia.org/wiki/Donald_in_Mathmagic_Land"&gt;mathemagic land&lt;/a&gt;:&lt;/p&gt;&lt;h4&gt;Interlude: A Short, Not &lt;i&gt;Too&lt;/i&gt; Scary, Guide to RSA&lt;/h4&gt;&lt;p&gt;People &lt;a href="http://stackoverflow.com/questions/575561/do-programmers-have-to-be-good-in-mathematics-closed"&gt;sometimes wonder&lt;/a&gt; if math has any relevance to programming. Certificates give a very practical example of applied math. Amazon's certificate tells us that we should use the RSA algorithm to check the signature. &lt;a id="e:y3" title="RSA" href="http://en.wikipedia.org/wiki/RSA"&gt;RSA&lt;/a&gt; was created in the 1970's by MIT professors &lt;a id="g825" title="Ron Rivest" href="http://people.csail.mit.edu/rivest/"&gt;Ron *R*ivest&lt;/a&gt;, &lt;a id="h87n" title="Adi  Shamir" href="http://en.wikipedia.org/wiki/Adi_Shamir"&gt;Adi *S*hamir&lt;/a&gt;, and &lt;a id="m0.d" title="Len  Adleman" href="http://en.wikipedia.org/wiki/Leonard_Adleman"&gt;Len *A*dleman&lt;/a&gt; who found a &lt;a id="vw55" title="tied together" href="http://people.csail.mit.edu/rivest/Rsapaper.pdf"&gt;clever way&lt;/a&gt; to combine ideas spanning &lt;a id="w5fp" title="Greek mathematician from 300 BC" href="http://en.wikipedia.org/wiki/Extended_Euclidean_algorithm"&gt;2000&lt;/a&gt; &lt;a id="i-7k" title="a third century AD Chinese mathematician" href="http://en.wikipedia.org/wiki/Chinese_remainder_theorem"&gt;years&lt;/a&gt; &lt;a id="ks_n" title="a 17th century French judge" href="http://en.wikipedia.org/wiki/Fermat%27s_little_theorem"&gt;of&lt;/a&gt; &lt;a id="o:ge" title="18th century math wizard" href="http://en.wikipedia.org/wiki/Euler_totient_function"&gt;math&lt;/a&gt; development to come up with a &lt;a href="http://mathworld.wolfram.com/RSAEncryption.html"&gt;beautifully simple algorithm&lt;/a&gt;:&lt;/p&gt;&lt;p&gt;You &lt;a id="si95" title="pick" href="http://en.wikipedia.org/wiki/Primality_test"&gt;pick&lt;/a&gt; two huge prime numbers "p" and "q." Multiply them to get "n = p*q." Next, you pick a small public &lt;a href="http://en.wikipedia.org/wiki/Exponentiation"&gt;exponent&lt;/a&gt; "e" which is the "encryption exponent" and &lt;a id="p_jx" title="a specially crafted inverse" href="http://en.wikipedia.org/wiki/Modular_multiplicative_inverse"&gt;a specially crafted inverse&lt;/a&gt; of "e" called "d" as the "decryption exponent." You then &lt;b&gt;make "n" and "e" public and keep "d" as secret as you possibly can&lt;/b&gt; and then throw away "p" and "q" (or keep them as secret as "d"). It's really important to remember that "e" and "d" are inverses of each other. &lt;/p&gt;&lt;p&gt;Now, if you have some message, you just need to interpret its bytes as a number "M." If you want to "encrypt" a message to create a "ciphertext", you'd calculate:&lt;/p&gt;&lt;p&gt;C ≡ M&lt;sup&gt;e&lt;/sup&gt; (mod n)&lt;/p&gt;&lt;p&gt;This means that you multiply "M" by itself "e" times. The "mod n" means that we only take the remainder (e.g. "&lt;a href="http://en.wikipedia.org/wiki/Modular_arithmetic"&gt;modulus&lt;/a&gt;") when dividing by "n." For example, 11 AM + 3 hours ≡ 2 (PM) (mod 12 hours). The other recipient knows "d" which allows them to invert the message to recover the original message:&lt;/p&gt;&lt;p&gt;C&lt;sup&gt;d&lt;/sup&gt; ≡ (M&lt;sup&gt;e&lt;/sup&gt;)&lt;sup&gt;d&lt;/sup&gt; ≡ M&lt;sup&gt;e*d&lt;/sup&gt; ≡ M&lt;sup&gt;1&lt;/sup&gt; ≡ M (mod n)&lt;/p&gt;&lt;p&gt;Just as interesting is that the person with "d" can "sign" a document by raising a message "M" to the "d" exponent:&lt;/p&gt;&lt;p&gt;M&lt;sup&gt;d&lt;/sup&gt; ≡ S (mod n)&lt;/p&gt;&lt;p&gt;This works because "signer" makes public "S", "M", "e", and "n." Anyone can verify the signature "S" with a simple calculation:&lt;/p&gt;&lt;p&gt;S&lt;sup&gt;e&lt;/sup&gt; ≡ (M&lt;sup&gt;d&lt;/sup&gt;)&lt;sup&gt;e&lt;/sup&gt; ≡ M&lt;sup&gt;d*e&lt;/sup&gt; ≡ M&lt;sup&gt;e*d&lt;/sup&gt; ≡ M&lt;sup&gt;1&lt;/sup&gt; ≡ M (mod n)&lt;/p&gt;&lt;p&gt;Public key cryptography algorithms like RSA are often called "asymmetric" algorithms because the encryption key (in our case, "e") is not equal to (e.g. "symmetric" with) the decryption key "d". Reducing everything "mod n" makes it impossible to use the easy techniques that we're used to such as normal &lt;a href="http://en.wikipedia.org/wiki/Logarithm"&gt;logarithms&lt;/a&gt;. The magic of RSA works because you can calculate/encrypt C ≡ M&lt;sup&gt;e&lt;/sup&gt; (mod n) &lt;a href="http://en.wikipedia.org/wiki/Modular_exponentiation"&gt;very quickly&lt;/a&gt;, but it is &lt;i&gt;really hard&lt;/i&gt; to calculate/decrypt C&lt;sup&gt;d&lt;/sup&gt; ≡ M (mod n) without knowing "d." As we saw earlier, "d" is derived from &lt;a href="http://en.wikipedia.org/wiki/Integer_factorization"&gt;factoring&lt;/a&gt; "n" back to its "p" and "q", which is a &lt;a href="http://en.wikipedia.org/wiki/NP_%28complexity%29"&gt;tough problem&lt;/a&gt;.&lt;/p&gt;&lt;h4&gt;Verifying Signatures&lt;/h4&gt;&lt;p&gt;The big thing to keep in mind with RSA in the real world is that all of the numbers involved have to be &lt;i&gt;big&lt;/i&gt; to make things really hard to break using the &lt;a href="http://en.wikipedia.org/wiki/General_number_field_sieve"&gt;best algorithms that we have&lt;/a&gt;. How big? Amazon.com's certificate was "signed" by "VeriSign Class 3 Secure Server CA." From the certificate, we see that this VeriSign modulus "n" is 2048 bits long which has this 617 digit base-10 representation:&lt;/p&gt;&lt;blockquote&gt;1890572922 9464742433 9498401781 6528521078 8629616064 3051642608 4317020197 7241822595 6075980039 8371048211 4887504542 4200635317 0422636532 2091550579 0341204005 1169453804 7325464426 0479594122 4167270607 6731441028 3698615569 9947933786 3789783838 5829991518 1037601365 0218058341 7944190228 0926880299 3425241541 4300090021 1055372661 2125414429 9349272172 5333752665 6605550620 5558450610 3253786958 8361121949 2417723618 5199653627 5260212221 0847786057 9342235500 9443918198 9038906234 1550747726 8041766919 1500918876 1961879460 3091993360 6376719337 6644159792 1249204891 7079005527 7689341573 9395596650 5484628101 0469658502 1566385762 0175231997 6268718746 7514321&lt;/blockquote&gt;&lt;p&gt;(Good luck trying to find "p" and "q" from this "n" - if you could, you could generate real-looking VeriSign certificates.)&lt;/p&gt;&lt;p&gt;VeriSign's "e" is 2^16 + 1 = 65537. Of course, they keep their "d" value secret, probably on a safe hardware device protected by retinal scanners and armed guards. Before signing, VeriSign checked the validity of the contents that Amazon.com claimed on its certificate using a real-world "handshake" that involved &lt;a href="http://www.verisign.com/ssl/ssl-information-center/ssl-basics/index.html#a7"&gt;looking at several of their business documents&lt;/a&gt;. Once VeriSign was satisfied with the documents, they used the &lt;a href="http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-0_and_SHA-1"&gt;SHA-1&lt;/a&gt; hash algorithm to get a hash value of the certificate that had all the claims. In Wireshark, the full certificate shows up as the "signedCertificate" part:&lt;/p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8nWe2TGdI/AAAAAAAABPY/E9qxHxjy0xA/s1600-h/certsignature.png"&gt;&lt;img src="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8m5BF5D2I/AAAAAAAABPQ/Ljv8Jd0uBEE/s400/signedcertificate.png" /&gt;&lt;/a&gt; &lt;p&gt;It's sort of a misnomer since it actually means that those are the bytes that the signer is &lt;i&gt;going to sign &lt;/i&gt;and not the bytes that already include a signature.&lt;/p&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8nWe2TGdI/AAAAAAAABPY/E9qxHxjy0xA/s400/certsignature.png" /&gt; &lt;p&gt;The actual signature, "S", is simply called "encrypted" in Wireshark. If we raise "S" to VeriSign's public "e" exponent of 65537 and then take the remainder when divided by the modulus "n", we get this "decrypted" signature hex value:&lt;/p&gt;&lt;blockquote&gt;0001FFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFF00302130 0906052B0E03021A 05000414C19F8786 871775C60EFE0542 E4C2167C830539DB&lt;/blockquote&gt;&lt;p&gt;&lt;a href="http://tools.ietf.org/html/rfc2313#page-9"&gt;Per the PKCS #1 v1.5 standard&lt;/a&gt;, the first byte is "00" and it "ensures that the encryption block, [when] converted to an integer, is less than the modulus." The second byte of "01" indicates that this is a private key operation (e.g. it's a signature). This is followed by a lot of "FF" bytes that are used to pad the result to make sure that it's big enough. The padding is terminated by a "00" byte. It's followed by "30 21 30 09 06 05 2B 0E 03 02 1A 05 00 04 14" which is the &lt;a id="at3m" title="PKCS #1 v2.1 way" href="http://tools.ietf.org/html/rfc3447#page-43"&gt;PKCS #1 v2.1 way&lt;/a&gt; of specifying the &lt;a href="http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-0_and_SHA-1"&gt;SHA-1&lt;/a&gt; hash algorithm. The last 20 bytes are SHA-1 hash digest of the bytes in "signedCertificate."&lt;/p&gt;&lt;p&gt;Since the decrypted value &lt;a href="http://www.matasano.com/log/558/public-key-signature-forgery-collected/"&gt;is properly formatted&lt;/a&gt; and the last bytes are the same hash value that we can calculate independently, we can assume that whoever knew "VeriSign Class 3 Secure Server CA"'s private key "signed" it. We implicitly trust that only VeriSign knows the private key "d."&lt;/p&gt;&lt;p&gt;We can repeat the process to verify that "VeriSign Class 3 Secure Server CA"'s certificate was signed by VeriSign's "Class 3 Public Primary Certification Authority."&lt;/p&gt;&lt;p&gt;But why should we trust &lt;i&gt;that&lt;/i&gt;? There are no more levels on the trust chain. &lt;/p&gt;&lt;a href="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Sihn_4zmOYI/AAAAAAAABNY/di1a-vsPbYA/s1600-h/BuiltInCertificateHierarchy.png"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Sihn_4zmOYI/AAAAAAAABNY/di1a-vsPbYA/s400/BuiltInCertificateHierarchy.png" /&gt;&lt;/a&gt; &lt;p&gt;The top "VeriSign Class 3 Public Primary Certification Authority" was signed by &lt;i&gt;itself&lt;/i&gt;. This certificate has been built into Mozilla products as an implicitly trusted good certificate since version &lt;a href="http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/security/nss/lib/ckfw/builtins/certdata.txt&amp;amp;rev=NSS_3_12_2_WITH_CKBI_1_73_RTM&amp;amp;mark=1.51"&gt;1.4 of certdata.txt&lt;/a&gt; in the Network Security Services (&lt;a href="http://www.mozilla.org/projects/security/pki/nss/"&gt;NSS&lt;/a&gt;) library. It was checked-in on September 6, 2000 by Netscape's Robert Relyea with the following comment:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;"Make the framework compile with the rest of NSS. Include a 'live' certdata.txt with those certs we have permission to push to open source (additional certs will be added as we get permission from the owners)." &lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This decision has had a relatively long impact since the certificate has a validity range of January 28, 1996 - August 1, 2028.&lt;/p&gt;&lt;p&gt;As Ken Thompson explained so well in his "&lt;a href="http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf"&gt;Reflections on Trusting Trust&lt;/a&gt;", you ultimately have to implicitly trust somebody. There is no way around this problem. In this case, we're implicitly trusting that Robert Relyea made a good choice. We also hope that &lt;a href="http://www.mozilla.org/projects/security/certs/policy/"&gt;Mozilla's built-in certificate policy&lt;/a&gt; is reasonable for the other built-in certificates.&lt;/p&gt;&lt;p&gt;One thing to keep in mind here is that all these certificates and signatures were simply used to form a trust chain. On the public Internet, VeriSign's root certificate is implicitly trusted by Firefox long before you go to any website. In a company, you can create your own root certificate authority (CA) that you can install on everyone's machine. &lt;/p&gt;&lt;p&gt;Alternatively, you can get around having to pay companies like VeriSign and avoid certificate trust chains altogether. Certificates are used to establish trust by using a trusted third-party (in this case, VeriSign). If you have a secure means of sharing a secret "key", such as whispering a long password into someone's ear, then you can use that pre-shared key (PSK) to establish trust. There are extensions to TLS to allow this, such as &lt;a href="http://tools.ietf.org/html/rfc4279"&gt;TLS-PSK&lt;/a&gt;, and my personal favorite, &lt;a href="http://tools.ietf.org/html/rfc5054"&gt;TLS with Secure Remote Password (SRP) extensions&lt;/a&gt;. Unfortunately, these extensions aren't nearly as widely deployed and supported, so they're usually not practical. Additionally, these alternatives impose a burden that we have to have some other secure means of communicating the secret that's more cumbersome than what we're trying to establish with TLS (otherwise, why wouldn't we use &lt;i&gt;that&lt;/i&gt; for everything?).&lt;/p&gt;&lt;p&gt;One final check that we need to do is to verify that the host name on the certificate is what we expected. &lt;a href="http://www.linkedin.com/in/nelsonbolyard"&gt;Nelson Bolyard&lt;/a&gt;'s comment in the &lt;a href="http://www.koders.com/c/fid1C807D78F4E4CA73466FEEAA78EA9F0B2D618199.aspx#L260"&gt;SSL_AuthCertificate function&lt;/a&gt; explains why: &lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="COLOR: rgb(0,128,0)"&gt;/* cert is OK. This is the client side of an SSL connection.&lt;br /&gt; * Now check the name field in the cert against the desired hostname.&lt;br /&gt; * NB: This is our only defense against Man-In-The-Middle (MITM) attacks! */&lt;/span&gt;&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;This check helps prevent against a &lt;a href="http://en.wikipedia.org/wiki/Man-in-the-middle_attack"&gt;man-in-the-middle&lt;/a&gt; attack because we are implicitly trusting that the people on the certificate trust chain wouldn't do something bad, like sign a certificate claiming to be from Amazon.com unless it actually was Amazon.com. If an attacker is able to modify your DNS server by using a technique like &lt;a href="http://en.wikipedia.org/wiki/DNS_cache_poisoning"&gt;DNS cache poisoning&lt;/a&gt;, you might be fooled into thinking you're at a trusted site (like Amazon.com) because the address bar will look normal. This last check implicitly trusts certificate authorities to stop these bad things from happening.&lt;/p&gt;&lt;h4&gt;Pre-Master Secret&lt;/h4&gt;&lt;p&gt;We've verified some claims about Amazon.com and know its public encryption exponent "e" and modulus "n." Anyone listening in on the traffic can know this as well (as evidenced because we are using Wireshark captures). Now we need to create a random secret key that an eavesdropper/attacker can't figure out. This isn't as easy as it sounds. In 1996, researchers figured out that &lt;a href="http://en.wikipedia.org/wiki/Netscape_Navigator"&gt;Netscape Navigator&lt;/a&gt; 1.1 was &lt;a id="cjg_" title="using only three sources" href="http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html"&gt;using only three sources&lt;/a&gt; to seed their pseudo-random number generator (&lt;a href="http://en.wikipedia.org/wiki/Pseudorandom_number_generator"&gt;PRNG&lt;/a&gt;). The sources were: the time of day, the process id, and the parent process id. As the researchers showed, these "random" sources aren't that random and were relatively easy to figure out.&lt;/p&gt;&lt;p&gt;Since everything else was derived from these three "random" sources, it was possible to "break" the SSL "security" in 25 seconds on a 1996 era machine. If you still don't believe that finding randomness is hard, just &lt;a id="dfhi" title="ask the Debian OpenSSL maintainers" href="http://www.schneier.com/blog/archives/2008/05/random_number_b.html"&gt;ask the Debian OpenSSL maintainers&lt;/a&gt;. If you mess it up, all the security built on top of it is suspect.&lt;/p&gt;&lt;p&gt;On Windows, random numbers used for cryptographic purposes are generated by calling the &lt;a id="f9ln" title="CryptGenRandom function" href="http://msdn.microsoft.com/en-us/library/aa379942%28VS.85%29.aspx"&gt;CryptGenRandom function&lt;/a&gt; that hashes bits &lt;a id="qai9" title="sampled from over 125 sources" href="http://blogs.msdn.com/michael_howard/archive/2005/01/14/353379.aspx#353493"&gt;sampled from over 125 sources&lt;/a&gt;. Firefox uses this function along with some bits derived from &lt;a id="a5vi" title="its own  function" href="http://www.koders.com/c/fidBC778BD3666AA64522D1FD4F4EC3331E44B4D204.aspx?s=RNG_GetNoise"&gt;its own function&lt;/a&gt; to seed its &lt;a id="k982" title="pseudo-random number generator" href="http://www.koders.com/c/fidD184CA9064625C0ADF48025F3FA0588FCD664057.aspx"&gt;pseudo-random number generator&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;The 48 byte "pre-master secret" random value that's generated isn't used directly, but it's very important to keep it secret since a lot of things are derived from it. Not surprisingly, Firefox makes it hard to find out this value. I had to compile a debug version and set the &lt;a id="is1y" title="SSLDEBUGFILE" href="http://www.koders.com/c/fidCFCD763A9E0B2BEF3FB9D4D6C17B4094CBF21548.aspx#L2092"&gt;SSLDEBUGFILE&lt;/a&gt; and &lt;a id="sstz" title="SSLTRACE" href="http://www.koders.com/c/fidCFCD763A9E0B2BEF3FB9D4D6C17B4094CBF21548.aspx#L2101"&gt;SSLTRACE&lt;/a&gt; environment variables to see it. &lt;/p&gt;&lt;p&gt;In this particular session, the pre-master secret showed up in the SSLDEBUGFILE as: &lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;4456: SSL[131491792]: Pre-Master Secret [Len: 48]&lt;br /&gt;03 01 bb 7b 08 98 a7 49 de e8 e9 b8 91 52 ec 81 ...{...I.....R..&lt;br /&gt;4c c2 39 7b f6 ba 1c 0a b1 95 50 29 be 02 ad e6 L.9{......P)....&lt;br /&gt;ad 6e 11 3f 20 c4 66 f0 64 22 57 7e e1 06 7a 3b .n.? .f.d"W~..z; &lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Note that it's not completely random. The first two bytes are, &lt;a id="ai16" title="by convention" href="http://tools.ietf.org/html/rfc2246#page-44"&gt;by convention&lt;/a&gt;, the TLS version (03 01).&lt;/p&gt;&lt;h4&gt;Trading Secrets&lt;/h4&gt;&lt;p&gt;We now need to get this secret value over to Amazon.com. By Amazon's wishes of "TLS_RSA_WITH_RC4_128_MD5", we will use RSA to do this. You &lt;em&gt;could&lt;/em&gt; make your input message equal to just the 48 byte pre-master secret, but the Public Key Cryptography Standard (PKCS) #1, version 1.5 RFC &lt;a id="fw3:" title="states" href="http://tools.ietf.org/html/rfc2313#page-8"&gt;tells us&lt;/a&gt; that we should pad these bytes with &lt;i&gt;random&lt;/i&gt; data to make the input equal to exactly the size of the modulus (1024 bits/128 bytes). This makes it harder for an attacker to determine our pre-master secret. It also gives us one last chance to protect ourselves in case we did something really bone-headed, like reusing the same secret. If we reused the key, the eavesdropper would likely see a different value placed on the network due to the random padding.&lt;/p&gt;&lt;p&gt;Again, Firefox makes it hard to see these random values. I had to insert debugging statements into &lt;a id="gyq6" title="the padding function" href="http://www.koders.com/c/fid1EB31A222A560045DBF9EC54457A1E0339825D58.aspx#L190"&gt;the padding function&lt;/a&gt; to see what was going on:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;wrapperHandle = fopen(&lt;span style="COLOR: rgb(163,21,21)"&gt;"plaintextpadding.txt"&lt;/span&gt;, &lt;span style="COLOR: rgb(163,21,21)"&gt;"a"&lt;/span&gt;);&lt;br /&gt;fprintf(wrapperHandle, &lt;span style="COLOR: rgb(163,21,21)"&gt;"PLAINTEXT = "&lt;/span&gt;);&lt;br /&gt;&lt;span style="COLOR: rgb(0,0,255)"&gt;for&lt;/span&gt;(i = 0; i &amp;lt; modulusLen; i++)&lt;br /&gt;{&lt;br /&gt;    fprintf(wrapperHandle, &lt;span style="COLOR: rgb(163,21,21)"&gt;"%02X "&lt;/span&gt;, block[i]);&lt;br /&gt;}&lt;br /&gt;fprintf(wrapperHandle, &lt;span style="COLOR: rgb(163,21,21)"&gt;"\r\n"&lt;/span&gt;);&lt;br /&gt;fclose(wrapperHandle);&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;In this session, the full padded value was:&lt;/p&gt;&lt;blockquote&gt;00 02 12 A3 EA B1 65 D6 81 6C 13 14 13 62 10 53 23 B3 96 85 FF 24 FA CC 46 11 21 24 A4 81 EA 30 63 95 D4 DC BF 9C CC D0 2E DD 5A A6 41 6A 4E 82 65 7D 70 7D 50 09 17 CD 10 55 97 B9 C1 A1 84 F2 A9 AB EA 7D F4 CC 54 E4 64 6E 3A E5 91 A0 06 00 03 01 BB 7B 08 98 A7 49 DE E8 E9 B8 91 52 EC 81 4C C2 39 7B F6 BA 1C 0A B1 95 50 29 BE 02 AD E6 AD 6E 11 3F 20 C4 66 F0 64 22 57 7E E1 06 7A 3B&lt;br /&gt;&lt;/blockquote&gt;&lt;p&gt;Firefox took this value and &lt;a href="http://www.koders.com/c/fid1B0E0F62F1B3DB6D7272F0BD781A1609D76FE6FE.aspx#L312"&gt;calculated&lt;/a&gt; "C ≡ M&lt;sup&gt;e&lt;/sup&gt; (mod n)" to get the value we see in the "&lt;a id="whfx" title="Client Key Exchange" href="http://tools.ietf.org/html/rfc2246#page-43"&gt;Client Key Exchange&lt;/a&gt;" record:&lt;/p&gt;&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8oafu_5aI/AAAAAAAABPg/r41rp34D1pw/s1600-h/clientkeyexchange.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8oafu_5aI/AAAAAAAABPg/r41rp34D1pw/s400/clientkeyexchange.png" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Finally, Firefox sent out one last unencrypted message, a "&lt;a id="f_on" title="Change Cipher  Spec" href="http://tools.ietf.org/html/rfc2246#page-24"&gt;Change Cipher Spec&lt;/a&gt;" record: &lt;/p&gt;&lt;p&gt;&lt;a href="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8o_0Qi0HI/AAAAAAAABPo/M1cyaQ6le5A/s1600-h/clientchangecipherspec.png"&gt;&lt;img src="http://3.bp.blogspot.com/_Zfbv3mHcYrc/Si8o_0Qi0HI/AAAAAAAABPo/M1cyaQ6le5A/s400/clientchangecipherspec.png" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;This is Firefox's way of telling Amazon that it's going to start using the agreed upon secret to encrypt its next message.&lt;/p&gt;&lt;h4&gt;Deriving the Master Secret&lt;/h4&gt;&lt;p&gt;If we've done everything correctly, both sides (and only those sides) now know the 48 byte (256 bit) pre-master secret. There's a slight trust issue here from Amazon's perspective: the pre-master secret just has bits that were generated by the client, they don't take anything into account from the server or anything we said earlier. We'll fix that be computing the "master secret." &lt;a id="lf0j" title="Per the spec" href="http://tools.ietf.org/html/rfc2246#page-47"&gt;Per the spec&lt;/a&gt;, this is done by calculating:&lt;/p&gt;&lt;blockquote&gt;master_secret = PRF(pre_master_secret, "master secret", ClientHello.random + ServerHello.random)&lt;br /&gt;&lt;/blockquote&gt;&lt;p&gt;The "pre_master_secret" is the secret value we sent earlier. The "master secret" is simply a string whose &lt;a href="http://en.wikipedia.org/wiki/ASCII"&gt;ASCII&lt;/a&gt; bytes (e.g. "6d 61 73 74 65 72 ...") are used. We then concatenate the random values that were sent in the ClientHello and ServerHello (from Amazon) messages that we saw at the beginning.&lt;/p&gt;&lt;p&gt;The PRF is the "Pseudo-Random Function" that's also &lt;a id="vku-" title="defined in the  spec" href="http://tools.ietf.org/html/rfc2246#page-11"&gt;defined in the spec&lt;/a&gt; and is quite clever. It combines the secret, the ASCII label, and the seed data we give it by using the keyed-Hash Message Authentication Code (&lt;a href="http://en.wikipedia.org/wiki/HMAC"&gt;HMAC&lt;/a&gt;) versions of both &lt;a id="remv" title="MD5" href="http://en.wikipedia.org/wiki/MD5"&gt;MD5&lt;/a&gt; and &lt;a id="syuh" title="SHA-1" href="http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-0_and_SHA-1"&gt;SHA-1&lt;/a&gt; hash functions. Half of the input is sent to each hash function. It's clever because it is quite resistant to attack, even in the face of &lt;a id="kir7" title="MD5" href="http://www.win.tue.nl/hashclash/rogue-ca/"&gt;weaknesses in MD5&lt;/a&gt; &lt;a id="x8dy" title="and SHA-1" href="http://www.schneier.com/blog/archives/2005/02/sha1_broken.html"&gt;and SHA-1&lt;/a&gt;. This process can feedback on itself and iterate forever to generate as many bytes as we need. &lt;/p&gt;&lt;p&gt;Following this procedure, we obtain a 48 byte "master secret" of &lt;/p&gt;&lt;blockquote&gt;4C AF 20 30 8F 4C AA C5 66 4A 02 90 F2 AC 10 00 39 DB 1D E0 1F CB E0 E0 9D D7 E6 BE 62 A4 6C 18 06 AD 79 21 DB 82 1D 53 84 DB 35 A7 1F C1 01 19&lt;br /&gt;&lt;/blockquote&gt;&lt;h4&gt;Generating Lots of Keys&lt;/h4&gt;&lt;p&gt;Now that both sides have a "master secrets", the spec &lt;a id="qji3" title="shows us" href="http://tools.ietf.org/html/rfc2246#page-21"&gt;shows us&lt;/a&gt; how we can derive all the needed session keys we need using the PRF to create a "key block" where we will pull data from:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;key_block = PRF(SecurityParameters.master_secret, "key expansion", SecurityParameters.server_random + SecurityParameters.client_random);&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The bytes from "key_block" are used to populate the following:&lt;/p&gt;&lt;blockquote&gt;client_write_MAC_secret[SecurityParameters.hash_size]&lt;br /&gt;server_write_MAC_secret[SecurityParameters.hash_size]&lt;br /&gt;client_write_key[SecurityParameters.key_material_length]&lt;br /&gt;server_write_key[SecurityParameters.key_material_length]&lt;br /&gt;client_write_IV[SecurityParameters.IV_size]&lt;br /&gt;server_write_IV[SecurityParameters.IV_size]&lt;br /&gt;&lt;/blockquote&gt;&lt;p&gt;Since we're using a &lt;a id="fgt_" title="stream cipher" href="http://en.wikipedia.org/wiki/Stream_cipher"&gt;stream cipher&lt;/a&gt; instead of a &lt;a id="xz.d" title="block cipher" href="http://en.wikipedia.org/wiki/Block_cipher"&gt;block cipher&lt;/a&gt; like the Advanced Encryption Standard (&lt;a href="http://en.wikipedia.org/wiki/Advanced_Encryption_Standard"&gt;AES&lt;/a&gt;), we don't need the Initialization Vectors (&lt;a id="aj5k" title="Initialization Vector" href="http://en.wikipedia.org/wiki/Initialization_vector"&gt;IV&lt;/a&gt;s). Therefore, we just need two Message Authentication Code (&lt;a id="eo8s" title="Message Authentication Code" href="http://en.wikipedia.org/wiki/Message_authentication_code"&gt;MAC&lt;/a&gt;) keys for each side that are 16 bytes (128 bits) each since the specified MD5 hash digest size is 16 bytes. In addition, the RC4 cipher uses a 16 byte (128 bit) key that both sides will need as well. All told, we need 2*16 + 2*16 = 64 bytes from the key block.&lt;/p&gt;&lt;p&gt;Running the PRF, we get these values: &lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;client_write_MAC_secret = 80 B8 F6 09 51 74 EA DB 29 28 EF 6F 9A B8 81 B0&lt;br /&gt;server_write_MAC_secret = 67 7C 96 7B 70 C5 BC 62 9D 1D 1F 4A A6 79 81 61&lt;br /&gt;client_write_key = 32 13 2C DD 1B 39 36 40 84 4A DE E5 6C 52 46 72&lt;br /&gt;server_write_key = 58 36 C4 0D 8C 7C 74 DA 6D B7 34 0A 91 B6 8F A7&lt;/p&gt;&lt;/blockquote&gt;&lt;h4&gt;Prepare to be Encrypted!&lt;/h4&gt;&lt;p&gt;The last handshake message the client sends out is the "&lt;a id="n8-5" title="Finished messages" href="http://tools.ietf.org/html/rfc2246#page-46"&gt;Finished message&lt;/a&gt;." This is a clever message that proves that no one tampered with the handshake and it proves that we know the key. The client takes all bytes from all handshake messages and puts them into a "handshake_messages" buffer. We then calculate 12 bytes of "verify_data" using the pseudo-random function (PRF) with our master key, the label "client finished", and an MD5 and SHA-1 hash of "handshake_messages": &lt;/p&gt;&lt;blockquote&gt;verify_data = PRF(master_secret, "client finished", MD5(handshake_messages) + SHA-1(handshake_messages)) [12]&lt;br /&gt;&lt;/blockquote&gt;&lt;p&gt;We take the result and add a record header byte "0x14" to indicate "finished" and length bytes "00 00 0c" to indicate that we're sending 12 bytes of verify data. Then, like all future encrypted messages, we need to make sure the decrypted contents haven't been tampered with. Since our cipher suite in use is TLS_RSA_WITH_RC4_128_MD5, this means we use the MD5 hash function.&lt;/p&gt;&lt;p&gt;Some people get paranoid when they hear MD5 because it has some weaknesses. I certainly don't advocate using it as-is. However, TLS is smart in that it doesn't use MD5 directly, but rather the &lt;a href="http://en.wikipedia.org/wiki/HMAC"&gt;HMAC&lt;/a&gt; version of it. This means that instead of using MD5(m) directly, we calculate:&lt;/p&gt;&lt;blockquote&gt;HMAC_MD5(Key, m) = MD5((Key ⊕ opad) ++ MD5((Key ⊕ ipad) ++ m)&lt;/blockquote&gt;&lt;p&gt;(The ⊕ means &lt;a href="http://en.wikipedia.org/wiki/Exclusive_or"&gt;XOR&lt;/a&gt;, ++ means concatenate, "opad" is the bytes "5c 5c ... 5c", and "ipad" is the bytes "36 36 ... 36").&lt;/p&gt;&lt;p&gt;In particular, we calculate:&lt;/p&gt;&lt;blockquote&gt;HMAC_MD5(client_write_MAC_secret, seq_num + TLSCompressed.type + TLSCompressed.version + TLSCompressed.length + TLSCompressed.fragment));&lt;br /&gt;&lt;/blockquote&gt;&lt;p&gt;As you can see, we include a sequence number ("seq_num") along with attributes of the plaintext message (here it's called "TLSCompressed"). The sequence number foils attackers who might try to take a previously encrypted message and insert it midstream. If this occurred, the sequence numbers would definitely be different than what we expected. This also protects us from an attacker dropping a message. &lt;/p&gt;&lt;p&gt;All that's left is to encrypt these bytes.&lt;/p&gt;&lt;h4&gt;RC4 Encryption&lt;/h4&gt;&lt;p&gt;Our negotiated cipher suite was TLS_RSA_WITH_RC4_128_MD5. This tells us that we need to use &lt;a href="http://people.csail.mit.edu/rivest/faq.html"&gt;&lt;span style="COLOR: rgb(128,0,128)"&gt;Ron's Code&lt;/span&gt;&lt;/a&gt; #4 (&lt;a id="guhl" title="RC4" href="http://en.wikipedia.org/wiki/RC4"&gt;RC4&lt;/a&gt;) to encrypt the traffic. &lt;a href="http://en.wikipedia.org/wiki/Ron_Rivest"&gt;Ron Rivest&lt;/a&gt; developed the RC4 algorithm to generate random bytes based on a 256 byte key. The algorithm is so simple you can actually memorize it in a few minutes. &lt;/p&gt;&lt;p&gt;RC4 begins by creating a 256-byte "S" byte array and populating it with 0 to 255. You then iterate over the array by mixing in bytes from the key. You do this to create a state machine that is used to generate "random" bytes. To generate a random byte, we shuffle around the "S" array.&lt;/p&gt;&lt;p&gt;Put graphically, it looks like this:&lt;/p&gt;&lt;a href="http://en.wikipedia.org/wiki/RC4"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/SihmoVyIg1I/AAAAAAAABNQ/Jl6kMd1z2hE/s320/RC4.png" /&gt;&lt;/a&gt; &lt;p&gt;To encrypt a byte, we &lt;a href="http://en.wikipedia.org/wiki/Exclusive_or"&gt;xor&lt;/a&gt; this pseudo-random byte with the byte we want to encrypt. Remember that xor'ing a bit with 1 causes it to flip. Since we're generating random numbers, on average the xor will flip half of the bits. This random bit flipping is effectively how we encrypt data. As you can see, it's not very complicated and thus it runs quickly. I think that's why Amazon chose it. &lt;/p&gt;&lt;p&gt;Recall that we have a "client_write_key" and a "server_write_key." The means we need to create two RC4 instances: one to encrypt what our browser sends and the other to decrypt what the server sent us.&lt;/p&gt;&lt;p&gt;The first few random bytes out of the "client_write" RC4 instance are "7E 20 7A 4D FE FB 78 A7 33 ..." If we xor these bytes with the unencrypted header and verify message bytes of "14 00 00 0C 98 F0 AE CB C4 ...", we'll get what appears in the encrypted portion that we can see in Wireshark:&lt;/p&gt;&lt;a href="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Si8ryrIoGrI/AAAAAAAABP4/1bSIRcRkERw/s1600-h/clientencryptedkeyexchange.png"&gt;&lt;img src="http://4.bp.blogspot.com/_Zfbv3mHcYrc/Si8ryrIoGrI/AAAAAAAABP4/1bSIRcRkERw/s400/clientencryptedkeyexchange.png" /&gt;&lt;/a&gt; &lt;p&gt;The server does almost the same thing. It sends out a "Change Cipher Spec" and then a "Finished Message" that includes all handshake messages, including the &lt;em&gt;decrypted&lt;/em&gt; version of the client's "Finished Message." Consequently, this proves to the client that the server was able to successfully decrypt our message. &lt;/p&gt;&lt;h4&gt;Welcome to the Application Layer!&lt;/h4&gt;&lt;p&gt;Now, 220 milliseconds after we started, we're finally ready for the application layer. We can now send normal HTTP traffic that'll be encrypted by the TLS layer with the RC4 write instance and decrypt traffic with the server RC4 write instance. In addition, the TLS layer will check each record for tampering by computing the HMAC_MD5 hash of the contents.&lt;/p&gt;&lt;p&gt;At this point, the handshake is over. Our TLS record's content type is now 23 (0x17). Encrypted traffic begins with "17 03 01" which indicate the record type and TLS version. These bytes are followed by our encrypted size, which includes the HMAC hash. &lt;/p&gt;&lt;p&gt;Encrypting the plaintext of:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;GET /gp/cart/view.html/ref=pd_luc_mri HTTP/1.1&lt;br /&gt;Host: www.amazon.com&lt;br /&gt;User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.10) Gecko/2009060911 Minefield/3.0.10 (.NET CLR 3.5.30729)&lt;br /&gt;Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8&lt;br /&gt;Accept-Language: en-us,en;q=0.5&lt;br /&gt;Accept-Encoding: gzip,deflate&lt;br /&gt;Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7&lt;br /&gt;Keep-Alive: 300&lt;br /&gt;Connection: keep-alive&lt;br /&gt;...&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;will give us the bytes we see on the wire:&lt;/p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8s4nuKH8I/AAAAAAAABQA/feL3OBZV83s/s1600-h/firstclientappdataencrypted.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8s4nuKH8I/AAAAAAAABQA/feL3OBZV83s/s400/firstclientappdataencrypted.png" /&gt;&lt;/a&gt; &lt;p&gt;The only other interesting fact is that the sequence number increases on each record, it's now 1 (and the next record will be 2, etc).&lt;/p&gt;&lt;p&gt;The server does the same type of thing on its side using the server_write_key. We see its response, including the tell-tale application data header:&lt;/p&gt;&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8tz1Y9BqI/AAAAAAAABQQ/B6_UN0lX7fM/s1600-h/firstserverappdata.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Si8tz1Y9BqI/AAAAAAAABQQ/B6_UN0lX7fM/s400/firstserverappdata.png" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Decrypting this gives us: &lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;HTTP/1.1 200 OK&lt;br /&gt;Date: Wed, 10 Jun 2009 01:09:30 GMT&lt;br /&gt;Server: Server&lt;br /&gt;...&lt;br /&gt;Cneonction: close&lt;br /&gt;Transfer-Encoding: chunked&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;which is a normal HTTP reply that includes a non-descriptive "Server: Server" header and a misspelled "&lt;a id="t7o1" title="Cneonction: close" href="http://www.nextthing.org/archives/2005/08/07/fun-with-http-headers"&gt;Cneonction: close&lt;/a&gt;" header coming from Amazon's load balancers.&lt;/p&gt;&lt;p&gt;TLS is just below the application layer. The HTTP server software can act as if it's sending unencrypted traffic. The only change is that it writes to a library that does all the encryption. &lt;a href="http://www.openssl.org/"&gt;OpenSSL&lt;/a&gt; is a popular open-source library for TLS.&lt;/p&gt;&lt;p&gt;The connection will stay open while both sides send and receive encrypted data until either side sends out a "&lt;a href="http://tools.ietf.org/html/rfc2246#page-25"&gt;closure alert&lt;/a&gt;" message and then closes the connection. If we reconnect shortly after disconnecting, we can re-use the negotiated keys (if the server still has them cached) without using public key operations, otherwise we do a completely new full handshake.&lt;/p&gt;&lt;p&gt;It's important to realize that application data records can be &lt;i&gt;anything&lt;/i&gt;. The only reason "HTTPS" is special is because the web is so popular. There are lots of other TCP/IP based protocols that ride on top of TLS. For example, TLS is used by &lt;a id="a7fg" title="SFTP" href="http://tools.ietf.org/html/rfc4217"&gt;FTPS&lt;/a&gt; and &lt;a id="y0ps" title="extensions to SMTP" href="http://tools.ietf.org/html/rfc3207"&gt;secure extensions to SMTP&lt;/a&gt;. It's certainly better to use TLS than inventing your own solution. Additionally, you'll benefit from a protocol that has withstood careful &lt;a href="http://tools.ietf.org/html/rfc5246#appendix-F"&gt;security analysis&lt;/a&gt;. &lt;/p&gt;&lt;h4&gt;... And We're Done!&lt;/h4&gt;&lt;p&gt;The very readable &lt;a id="u9ya" title="TLS RFC" href="http://tools.ietf.org/html/rfc5246"&gt;TLS RFC&lt;/a&gt; covers many more details that were missed here. We covered just one single path in our observation of the 220 millisecond dance between Firefox and Amazon's server. Quite a bit of the process was affected by the TLS_RSA_WITH_RC4_128_MD5 Cipher Suite selection that Amazon made with its ServerHello message. It's a reasonable choice that slightly favors speed over security. &lt;/p&gt;&lt;p&gt;As we saw, if someone could secretly factor Amazon's "n" modulus into its respective "p" and "q", they could effectively decrypt all "secure" traffic until Amazon changes their certificate. Amazon counter-balances this concern this with a short one year duration certificate:&lt;/p&gt;&lt;a href="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si8u2evehpI/AAAAAAAABQY/pFMUJ5-vmOU/s1600-h/amazoncertvalidity.png"&gt;&lt;img src="http://1.bp.blogspot.com/_Zfbv3mHcYrc/Si8u2evehpI/AAAAAAAABQY/pFMUJ5-vmOU/s400/amazoncertvalidity.png" /&gt;&lt;/a&gt; &lt;p&gt;One of the cipher suites that was offered was "TLS_DHE_RSA_WITH_AES_256_CBC_SHA" which uses the &lt;a id="y2_." title="Diffie-Hellman key exchange" href="http://en.wikipedia.org/wiki/Diffie-Hellman_key_exchange"&gt;Diffie-Hellman key exchange&lt;/a&gt; that has a nice property of "&lt;a id="jtxx" title="forward secrecy" href="http://en.wikipedia.org/wiki/Perfect_forward_secrecy"&gt;forward secrecy&lt;/a&gt;." This means that if someone cracked the mathematics of the key exchange, they'd be no better off to decrypt another session. One downside to this algorithm is that it requires more math with big numbers, and thus is a little more computationally taxing on a busy server. The "Advanced Encryption Standard" (&lt;a href="http://en.wikipedia.org/wiki/Advanced_Encryption_Standard"&gt;AES&lt;/a&gt;) algorithm was present in many of the suites that we offered. It's different than RC4 in that it works on 16 byte "blocks" at a time rather than a single byte. Since its key can be up to 256 bits, many consider this to be more secure than RC4.&lt;/p&gt;&lt;p&gt;In just 220 milliseconds, two endpoints on the Internet came together, provided enough credentials to trust each other, set up encryption algorithms, and started to send encrypted traffic.&lt;/p&gt;&lt;p&gt;And to think, all of this just so Bob can buy milk.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6800934446457898793-680453867994856278?l=www.moserware.com'/&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=qjQlyUN8Tnk:aE0vXZ6n9s0:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=qjQlyUN8Tnk:aE0vXZ6n9s0:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=qjQlyUN8Tnk:aE0vXZ6n9s0:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=qjQlyUN8Tnk:aE0vXZ6n9s0:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=qjQlyUN8Tnk:aE0vXZ6n9s0:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=qjQlyUN8Tnk:aE0vXZ6n9s0:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=qjQlyUN8Tnk:aE0vXZ6n9s0:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=qjQlyUN8Tnk:aE0vXZ6n9s0:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=qjQlyUN8Tnk:aE0vXZ6n9s0:4cEx4HpKnUU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=qjQlyUN8Tnk:aE0vXZ6n9s0:4cEx4HpKnUU" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Moserware/~4/qjQlyUN8Tnk" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.moserware.com/feeds/680453867994856278/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=6800934446457898793&amp;postID=680453867994856278" title="102 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default/680453867994856278?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default/680453867994856278?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Moserware/~3/qjQlyUN8Tnk/first-few-milliseconds-of-https.html" title="The First Few Milliseconds of an HTTPS Connection" /><author><name>Jeff Moser</name><uri>http://www.blogger.com/profile/16074905903060665396</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="08376966494433799517" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://4.bp.blogspot.com/_Zfbv3mHcYrc/ShgnOU1MihI/AAAAAAAABNI/BAF-YQdhkJU/s72-c/securitysymbols.png" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">102</thr:total><feedburner:origLink>http://www.moserware.com/2009/06/first-few-milliseconds-of-https.html</feedburner:origLink></entry><entry gd:etag="W/&quot;CkMCR3k8cSp7ImA9WxJTFUU.&quot;"><id>tag:blogger.com,1999:blog-6800934446457898793.post-7080898641294242243</id><published>2009-04-24T08:37:00.004-04:00</published><updated>2009-04-24T09:41:06.779-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-04-24T09:41:06.779-04:00</app:edited><title>Using Obscure Windows COM APIs in .NET</title><content type="html">&lt;p&gt;Most native Windows APIs are simple to call from .NET. For example, if you need to do something special when showing a window, you can use the &lt;a href="http://msdn.microsoft.com/en-us/library/ms633548%28VS.85%29.aspx"&gt;ShowWindow&lt;/a&gt; API using &lt;a href="http://en.wikipedia.org/wiki/Platform_Invocation_Services"&gt;Platform Invocation Services&lt;/a&gt; (P/Invoke) like this:&lt;/p&gt;&lt;pre&gt;[&lt;span style="color: rgb(43, 145, 175);"&gt;DllImport&lt;/span&gt;(&lt;span style="color: rgb(163, 21, 21);"&gt;"user32.dll"&lt;/span&gt;)]&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;static&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;extern&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;bool&lt;/span&gt; ShowWindow(&lt;span style="color: rgb(43, 145, 175);"&gt;IntPtr&lt;/span&gt; hWnd, &lt;span style="color: rgb(0, 0, 255);"&gt;int&lt;/span&gt; nCmdShow);&lt;/pre&gt;&lt;p&gt;When you call this function, here's roughly what happens:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The CLR calls &lt;a href="http://msdn.microsoft.com/en-us/library/ms684175%28VS.85%29.aspx"&gt;LoadLibrary&lt;/a&gt; on the file (e.g. "user32.dll") &lt;/li&gt;&lt;li&gt;The CLR then calls &lt;a href="http://msdn.microsoft.com/en-us/library/ms683212%28VS.85%29.aspx"&gt;GetProcAddress&lt;/a&gt; on the function name (e.g. "ShowWindow") to get the address of where the function is located. &lt;/li&gt;&lt;/ol&gt;&lt;p&gt;For the most part, it just magically works. If we had used a function like "MessageBox", the CLR would notice that it doesn't exist and would then pick between the ANSI version (e.g. "MessageBoxA") or the Unicode version (e.g. "MessageBoxW"). &lt;/p&gt;&lt;p&gt;With the address in hand, it's easy to &lt;a href="http://en.wikipedia.org/wiki/Branch_%28computer_science%29"&gt;jump&lt;/a&gt; to it and you're all set. Simple and easy.&lt;/p&gt;&lt;p&gt;I was expecting a simple API like this when I was investigating how to register my program as the default handler for ".wav" files on Vista. In the pre-Vista days, most programs would write directly into a registry key for the file extension (e.g. "&lt;a href="http://msdn.microsoft.com/en-us/library/ms724475%28VS.85%29.aspx"&gt;HKEY_CLASSES_ROOT&lt;/a&gt;\.wav") and move on. Problems come when your program wants to register itself as a handler for a "popular" extension like .MP3 or .HTM. Some programs go into an all out arms race with other programs in a fight of wills to make sure they keep the extension.&lt;/p&gt;&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SfGs68ifNcI/AAAAAAAABNA/tJInxh8ezP0/s1600-h/SetFileAssociations.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SfGs68ifNcI/AAAAAAAABNA/tJInxh8ezP0/s320/SetFileAssociations.png" /&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt;In Windows Vista and later, Microsoft wants us to use the new "&lt;a href="http://msdn.microsoft.com/en-us/library/bb756951.aspx"&gt;Default Programs&lt;/a&gt;" feature. The idea is that you register what file extensions your program supports in the registry and then a nice UI allows people to easily pick which of those extensions they want to associate with your program. Digging around the documentation led me to discover that the bulk of the functionality was exposed via the &lt;a href="http://msdn.microsoft.com/en-us/library/bb776332.aspx"&gt;IApplicationAssociationRegistration&lt;/a&gt; COM interface.&lt;/p&gt;&lt;p&gt;Ah, COM. &lt;/p&gt;&lt;p&gt;Over the years, I've tried to keep my distance from it. This irrational fear came from wizards that "next, next, finish"'d your way into thousands of lines of inscrutable code. It took me years of passing glances to &lt;a href="http://www.moserware.com/2008/01/finally-understanding-com-after.html"&gt;finally understand its basics&lt;/a&gt;. Even then, when I needed to use it from .NET, I'd right click on my project references and click "Add Reference":&lt;/p&gt;&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SevLN4jNxKI/AAAAAAAABMw/14fMxyJI9Dg/s1600-h/AddComReference.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SevLN4jNxKI/AAAAAAAABMw/14fMxyJI9Dg/s400/AddComReference.png" /&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt;I'd pick the library I needed and then somehow I could use the types as if they were .NET objects. I didn't ask further questions and moved on.&lt;/p&gt;&lt;p&gt;Unfortunately, IApplicationAssociationRegistration was nowhere to be found on the "Add Reference" list since it doesn't seem to have a registered type library associated with it. Using my basic COM knowledge, I knew that if I wanted to use it I would need to know the interface identifier (IID) as well as a class identifier (CLSID) that pointed to a concrete implementation.&lt;/p&gt;&lt;p&gt;Following the MSDN documentation, I knew I'd probably find success in shobjidl.idl:&lt;/p&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/bb776332.aspx"&gt;&lt;img src="http://1.bp.blogspot.com/_Zfbv3mHcYrc/SevPpp-xVNI/AAAAAAAABM4/UD8xgiLPBCo/s400/InterfaceInformation.png" /&gt;&lt;/a&gt; &lt;p&gt;Sure enough, shobjidl.idl was sitting in my "C:\Program Files\Microsoft SDKs\Windows\&lt;a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=e6e1c3df-a74f-4207-8586-711ebe331cdc&amp;amp;displaylang=en"&gt;v6.1&lt;/a&gt;\Include" directory and had this interface definition:&lt;/p&gt;&lt;pre&gt;[&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;object&lt;/span&gt;,&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;uuid&lt;/span&gt;(4e530b0a-e611-4c77-a3ac-9031d022281b),&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;pointer_default&lt;/span&gt;(&lt;span style="color: rgb(0, 0, 255);"&gt;unique&lt;/span&gt;),&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;helpstring&lt;/span&gt;(&lt;span style="color: rgb(163, 21, 21);"&gt;"Protocol URL and Extension File Application"&lt;/span&gt;)&lt;br /&gt;]&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;interface&lt;/span&gt; IApplicationAssociationRegistration : IUnknown&lt;br /&gt;{&lt;br /&gt; HRESULT QueryCurrentDefault(&lt;br /&gt;     [&lt;span style="color: rgb(0, 0, 255);"&gt;in&lt;/span&gt;, &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt;] LPCWSTR pszQuery,&lt;br /&gt;     [&lt;span style="color: rgb(0, 0, 255);"&gt;in&lt;/span&gt;] ASSOCIATIONTYPE atQueryType,&lt;br /&gt;     [&lt;span style="color: rgb(0, 0, 255);"&gt;in&lt;/span&gt;] ASSOCIATIONLEVEL alQueryLevel,&lt;br /&gt;     [&lt;span style="color: rgb(0, 0, 255);"&gt;out&lt;/span&gt;, &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt;] LPWSTR* ppszAssociation);&lt;br /&gt;&lt;br /&gt;...&lt;br /&gt;}&lt;/pre&gt;&lt;p&gt;A little further down was the declaration for the concrete class (coclass) and its associated class id (CLSID):&lt;/p&gt;&lt;pre&gt;&lt;span style="color: rgb(0, 128, 0);"&gt;// CLSID_ApplicationAssociationRegistration&lt;/span&gt;&lt;br /&gt;[ &lt;span style="color: rgb(0, 0, 255);"&gt;uuid&lt;/span&gt;(591209c7-767b-42b2-9fba-44ee4615f2c7) ] &lt;span style="color: rgb(0, 0, 255);"&gt;coclass&lt;/span&gt; ApplicationAssociationRegistration&lt;br /&gt;{&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;interface&lt;/span&gt; IApplicationAssociationRegistration;&lt;br /&gt;}&lt;/pre&gt;&lt;p&gt;In the IDL, we also see the definitions for the enums that the functions use:&lt;/p&gt;&lt;pre&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;typedef&lt;/span&gt; [v1_enum] &lt;span style="color: rgb(0, 0, 255);"&gt;enum&lt;/span&gt; tagASSOCIATIONLEVEL&lt;br /&gt;{&lt;br /&gt; AL_MACHINE,&lt;br /&gt; AL_EFFECTIVE,&lt;br /&gt; AL_USER,&lt;br /&gt;} ASSOCIATIONLEVEL;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;typedef&lt;/span&gt; [v1_enum] &lt;span style="color: rgb(0, 0, 255);"&gt;enum&lt;/span&gt; tagASSOCIATIONTYPE&lt;br /&gt;{&lt;br /&gt; AT_FILEEXTENSION,&lt;br /&gt; AT_URLPROTOCOL,&lt;br /&gt; AT_STARTMENUCLIENT,&lt;br /&gt; AT_MIMETYPE,&lt;br /&gt;} ASSOCIATIONTYPE;&lt;/pre&gt;&lt;p&gt;Getting this to work in .NET was surprisingly easy. The basic idea is that the CLR has to have just enough information to find the types:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The "&lt;a href="http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.comimportattribute.aspx"&gt;ComImportAttribute&lt;/a&gt;" is almost as simple to use as &lt;a href="http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.dllimportattribute.aspx"&gt;DllImportAttribute&lt;/a&gt;. In addition, you need to use the &lt;a href="http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.guidattribute.aspx"&gt;GuidAttribute&lt;/a&gt; to specify the gigantic GUIDs. &lt;/li&gt;&lt;li&gt;You use the "&lt;a href="http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.interfacetypeattribute.aspx"&gt;InterfaceTypeAttribute&lt;/a&gt;" to specify the basic interface(s) that the interface you're importing uses. In COM, all interfaces derive from &lt;a href="http://msdn.microsoft.com/en-us/library/ms680509.aspx"&gt;IUnknown&lt;/a&gt;. If the interface supports scripting then it implements &lt;a href="http://msdn.microsoft.com/en-us/library/ms221608.aspx"&gt;IDispatch&lt;/a&gt;. If you provide a speedy C++ way of accessing your interface (e.g. &lt;a href="http://blogs.msdn.com/oldnewthing/archive/2004/02/05/68017.aspx"&gt;vtable definition&lt;/a&gt;) and the scripting IDispatch interface, you've got a "&lt;a href="http://msdn.microsoft.com/en-us/library/aa366807%28VS.85%29.aspx"&gt;dual&lt;/a&gt;" interface. &lt;/li&gt;&lt;li&gt;You need to translate the parameter types to their .NET equivalents. This is an incredibly mechanical process that's straightforward. If there is a chance that the underlying bits are different between COM and .NET (e.g. they're not &lt;a href="http://msdn.microsoft.com/en-us/library/75dwhxf7.aspx"&gt;blittable&lt;/a&gt;) then you need to use the &lt;a href="http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.marshalasattribute.aspx"&gt;MarshalAsAttribute&lt;/a&gt; to tell the CLR how to convert the types as necessary. &lt;/li&gt;&lt;li&gt;You need to remember that COM handles errors by returning HRESULTs instead of natively using exceptions like .NET uses. By default, the CLR will make the last parameter that is an OUT parameter in the IDL to be the return value (it helps if it's marked by "retval"). Therefore, you can act as if the function really returns its last parameter and the CLR will automatically check the HRESULT and throw a corresponding .NET exception as needed.&lt;/li&gt;&lt;li&gt;Optionally, and perhaps most controversially, you're free de-&lt;a href="http://en.wikipedia.org/wiki/Hungarian_notation"&gt;Hungarianize&lt;/a&gt; the parameter names and PascalCase the enum names to make them much more friendly looking to people in .NET. It's optional since it might confuse people that use MSDN documentation and expecting the original names. &lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In a minute or so, I translated the definitions and gladly got rid of the Hungarian prefixes by converting parameter names of "pszQuery" to just "query." I also converted all the enums and removed their unnecessary prefixes. The end result was this:&lt;/p&gt;&lt;pre&gt;[&lt;span style="color: rgb(43, 145, 175);"&gt;ComImport&lt;/span&gt;]&lt;br /&gt;[&lt;span style="color: rgb(43, 145, 175);"&gt;Guid&lt;/span&gt;(&lt;span style="color: rgb(163, 21, 21);"&gt;"4e530b0a-e611-4c77-a3ac-9031d022281b"&lt;/span&gt;)]&lt;br /&gt;[&lt;span style="color: rgb(43, 145, 175);"&gt;InterfaceType&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;ComInterfaceType&lt;/span&gt;.InterfaceIsIUnknown)]&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;internal&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;interface&lt;/span&gt; &lt;span style="color: rgb(43, 145, 175);"&gt;IApplicationAssociationRegistration&lt;/span&gt;&lt;br /&gt;{   &lt;br /&gt; [&lt;span style="color: rgb(0, 0, 255);"&gt;return&lt;/span&gt;: &lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)]&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; QueryCurrentDefault( [&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; query,&lt;br /&gt;                           &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationType&lt;/span&gt; queryType,&lt;br /&gt;                           &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationLevel&lt;/span&gt; queryLevel);&lt;br /&gt; [&lt;span style="color: rgb(0, 0, 255);"&gt;return&lt;/span&gt;: &lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.Bool)]&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;bool&lt;/span&gt; QueryAppIsDefault([&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; query,&lt;br /&gt;                        &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationType&lt;/span&gt; queryType,&lt;br /&gt;                        &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationLevel&lt;/span&gt; queryLevel,&lt;br /&gt;                        [&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; appRegistryName);&lt;br /&gt; [&lt;span style="color: rgb(0, 0, 255);"&gt;return&lt;/span&gt;: &lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.Bool)]&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;bool&lt;/span&gt; QueryAppIsDefaultAll(&lt;span style="color: rgb(43, 145, 175);"&gt;AssociationLevel&lt;/span&gt; queryLevel,&lt;br /&gt;                           [&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; appRegistryName);&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;void&lt;/span&gt; SetAppAsDefault([&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; appRegistryName,&lt;br /&gt;                      [&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; set,&lt;br /&gt;                      &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationType&lt;/span&gt; setType);&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;void&lt;/span&gt; SetAppAsDefaultAll([&lt;span style="color: rgb(43, 145, 175);"&gt;MarshalAs&lt;/span&gt;(&lt;span style="color: rgb(43, 145, 175);"&gt;UnmanagedType&lt;/span&gt;.LPWStr)] &lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; appRegistryName);&lt;br /&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;void&lt;/span&gt; ClearUserAssociations();&lt;br /&gt;}&lt;/pre&gt;&lt;p&gt;Importing the concrete class that implements the interface was just a matter of specifying its CLSID:&lt;/p&gt;&lt;pre&gt;[&lt;span style="color: rgb(43, 145, 175);"&gt;ComImport&lt;/span&gt;]&lt;br /&gt;[&lt;span style="color: rgb(43, 145, 175);"&gt;Guid&lt;/span&gt;(&lt;span style="color: rgb(163, 21, 21);"&gt;"591209c7-767b-42b2-9fba-44ee4615f2c7"&lt;/span&gt;)]&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;internal&lt;/span&gt; &lt;span style="color: rgb(0, 0, 255);"&gt;class&lt;/span&gt; &lt;span style="color: rgb(43, 145, 175);"&gt;ApplicationAssociationRegistration&lt;/span&gt;&lt;br /&gt;{&lt;br /&gt; &lt;span style="color: rgb(0, 128, 0);"&gt;// coclass is implemented by the runtime callable wrapper&lt;/span&gt;&lt;br /&gt;}&lt;/pre&gt;&lt;p&gt;With all of that goo out of the way, you can use the interface like a normal .NET type:&lt;/p&gt;&lt;pre&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;var&lt;/span&gt; aa = &lt;span style="color: rgb(0, 0, 255);"&gt;new&lt;/span&gt; &lt;span style="color: rgb(43, 145, 175);"&gt;ApplicationAssociationRegistration&lt;/span&gt;();&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;var&lt;/span&gt; iaar = (&lt;span style="color: rgb(43, 145, 175);"&gt;IApplicationAssociationRegistration&lt;/span&gt;)aa;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 255);"&gt;string&lt;/span&gt; myCurrentMp3Player = iaar.QueryCurrentDefault(".mp3", &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationType&lt;/span&gt;.FileExtension, &lt;span style="color: rgb(43, 145, 175);"&gt;AssociationLevel&lt;/span&gt;.Effective);&lt;/pre&gt;&lt;p&gt;Behind the scenes, the &lt;a href="http://msdn.microsoft.com/en-us/library/8bwh56xe.aspx"&gt;runtime callable wrapper&lt;/a&gt; has to do something like this:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Load in ole32.dll where COM functions reside.&lt;/li&gt;&lt;li&gt;Call &lt;a href="http://msdn.microsoft.com/en-us/library/ms678543.aspx"&gt;CoInitialize&lt;/a&gt; to initialize COM. &lt;/li&gt;&lt;li&gt;Look up your CLSID and IID in the registry under HKEY_CLASSES_ROOT and find their associated DLL (in our case, "shell32.dll") &lt;/li&gt;&lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/ms684007%28VS.85%29.aspx"&gt;Create a factory&lt;/a&gt; for your class. &lt;/li&gt;&lt;li&gt;Use the factory to &lt;a href="http://msdn.microsoft.com/en-us/library/ms682215%28VS.85%29.aspx"&gt;create an instance&lt;/a&gt;. &lt;/li&gt;&lt;li&gt;Call &lt;a href="http://msdn.microsoft.com/en-us/library/ms682521%28VS.85%29.aspx"&gt;QueryInterface&lt;/a&gt; to get the specific interface we want (e.g. IApplicationAssociationRegistration) &lt;/li&gt;&lt;li&gt;Get a pointer to the function we want using the &lt;a href="http://blogs.msdn.com/oldnewthing/archive/2004/02/05/68017.aspx"&gt;vtable&lt;/a&gt;. &lt;/li&gt;&lt;/ol&gt;&lt;p&gt;After all that, we &lt;em&gt;finally&lt;/em&gt; have a place to jump to like we did with P/Invoke.&lt;/p&gt;&lt;p&gt;Why bother with all of this? One reason is that Microsoft has a huge legacy investment in C and C++ in Windows. There's no compelling reason for them to rewrite things in .NET. A natural consequence is that the C++ code that implements their latest APIs will be exposed using COM for the foreseeable future. Recently, Microsoft has gone ahead and published .NET COM &lt;a href="http://code.msdn.microsoft.com/Windows7Taskbar/Release/ProjectReleases.aspx?ReleaseId=2246"&gt;wrappers&lt;/a&gt; for some of the popular new APIs like the &lt;a href="http://windowsteamblog.com/blogs/developers/archive/2009/04/23/consuming-the-contents-of-windows-7-libraries.aspx"&gt;Libraries feature&lt;/a&gt; in Windows 7. With just a little work, you don't have to wait on Microsoft to do this for you.&lt;/p&gt;&lt;p&gt;Given that .NET was designed as a successor to COM, it's no surprise that Microsoft has made interoperability with it very seamless. The &lt;a href="http://msdn.microsoft.com/en-us/library/8bwh56xe.aspx"&gt;runtime callable wrapper&lt;/a&gt; does a good job of hiding most of the messier details. The garbage collector handles much of the bookkeeping involved with memory management that used to be the bane of COM programming. The runtime is very aware of typical COM semantics of when to allocate and free memory. It's not always perfect. Sometimes you can be pre-emptive and force your COM object to be cleaned up via &lt;a href="http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.marshal.releasecomobject.aspx"&gt;Marshal.ReleaseComObject&lt;/a&gt; so you don't have to wait on the garbage collector, but you should &lt;a href="http://blogs.msdn.com/cbrumme/archive/2003/04/16/51355.aspx"&gt;be careful&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I just presented the basics of what I learned to get my job done. There's a lot more out there for more advanced scenarios. I've found the &lt;a href="http://www.theserverside.net/tt/articles/showarticle.tss?id=ComAndDotNetInterop_Book"&gt;free book&lt;/a&gt; "COM and .NET Interop" by Andrew Troelsen to be helpful.&lt;/p&gt;&lt;p&gt;There's plenty of obscure Windows APIs out there for the taking. Enjoy!&lt;/p&gt;&lt;br /&gt;&lt;a href="http://www.dotnetkicks.com/kick/?url=http%3a%2f%2fwww.moserware.com%2f2009%2f04%2fusing-obscure-windows-com-apis-in-net.html"&gt;&lt;img src="http://www.dotnetkicks.com/Services/Images/KickItImageGenerator.ashx?url=http%3a%2f%2fwww.moserware.com%2f2009%2f04%2fusing-obscure-windows-com-apis-in-net.html" alt="kick it on DotNetKicks.com" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6800934446457898793-7080898641294242243?l=www.moserware.com'/&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=8Yd9XdKD4fg:4lcTGe9gLbE:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=8Yd9XdKD4fg:4lcTGe9gLbE:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=8Yd9XdKD4fg:4lcTGe9gLbE:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=8Yd9XdKD4fg:4lcTGe9gLbE:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=8Yd9XdKD4fg:4lcTGe9gLbE:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=8Yd9XdKD4fg:4lcTGe9gLbE:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=8Yd9XdKD4fg:4lcTGe9gLbE:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=8Yd9XdKD4fg:4lcTGe9gLbE:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=8Yd9XdKD4fg:4lcTGe9gLbE:4cEx4HpKnUU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=8Yd9XdKD4fg:4lcTGe9gLbE:4cEx4HpKnUU" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Moserware/~4/8Yd9XdKD4fg" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.moserware.com/feeds/7080898641294242243/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=6800934446457898793&amp;postID=7080898641294242243" title="4 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default/7080898641294242243?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default/7080898641294242243?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Moserware/~3/8Yd9XdKD4fg/using-obscure-windows-com-apis-in-net.html" title="Using Obscure Windows COM APIs in .NET" /><author><name>Jeff Moser</name><uri>http://www.blogger.com/profile/16074905903060665396</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="08376966494433799517" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SfGs68ifNcI/AAAAAAAABNA/tJInxh8ezP0/s72-c/SetFileAssociations.png" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">4</thr:total><feedburner:origLink>http://www.moserware.com/2009/04/using-obscure-windows-com-apis-in-net.html</feedburner:origLink></entry><entry gd:etag="W/&quot;DUYMQXw6eip7ImA9WxVUEkQ.&quot;"><id>tag:blogger.com,1999:blog-6800934446457898793.post-8345470441664339492</id><published>2009-03-16T07:47:00.006-04:00</published><updated>2009-03-17T09:39:40.212-04:00</updated><app:edited xmlns:app="http://www.w3.org/2007/app">2009-03-17T09:39:40.212-04:00</app:edited><title>How .NET Regular Expressions Really Work</title><content type="html">&lt;p&gt;Remember when you first tried to parse text?&lt;/p&gt;&lt;p&gt;My early BASIC programs were littered with IF statements that dissected strings using LEFT$, RIGHT$, MID$, TRIM$, and UCASE$. It took me hours to write a program that parsed a simple text file. Just trying to support whitespace and mixed casing was enough to drive me crazy.&lt;/p&gt;&lt;p&gt;Years later when I started programming in Java, I discovered the &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html"&gt;StringTokenizer&lt;/a&gt; class. I thought it was a huge leap forward. I no longer had to worry about whitespace. However, I still had to use functions like &amp;quot;substring&amp;quot; and &amp;quot;toUpperCase&amp;quot;, but I thought that was as good as it could get. &lt;/p&gt;&lt;p&gt;And then one day I found &lt;a href="http://www.regular-expressions.info/quickstart.html"&gt;regular&lt;/a&gt;&amp;#160;&lt;a href="http://www.dijksterhuis.org/regular-expressions-in-csharp-the-basics/"&gt;expressions&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;I almost cried when I realized that I could replace parsing code that took me hours to write with a simple regular expression. It still took me several years to become comfortable with &lt;a href="http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/"&gt;the syntax&lt;/a&gt;, but the learning curve was worth the power obtained. &lt;/p&gt;&lt;p&gt;And yet with all of this love, I still had this nagging suspicion that I was doing it wrong. After &lt;a href="http://www.moserware.com/2009/01/wetware-refactorings.html"&gt;reading Pragmatic Thinking and Learning&lt;/a&gt;, I was determined to try to imagine what life was like inside the code I wrote. But I just couldn't connect with a regular expression. &lt;/p&gt;&lt;p&gt;The last straw came recently when I was trying to help a &lt;a href="http://www.aaronlerch.com/blog/"&gt;coworker&lt;/a&gt; craft a regex to properly handle name/value string pairs with escaped strings. In the end, our regex worked, but I felt that it was duct-taped together. I knew there was a better way.&lt;/p&gt;&lt;p&gt;&lt;a href="http://www.amazon.com/gp/product/0596528124?ie=UTF8&amp;amp;tag=moserware-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0596528124"&gt;&lt;img style="margin: 0px 10px 5px 0px" src="http://3.bp.blogspot.com/_Zfbv3mHcYrc/SbxVzXPjfxI/AAAAAAAABLY/o4BBJWjNWUo/s200/masteringregex.jpg" align="left" /&gt;&lt;/a&gt;I picked up a copy of Jeffrey Friedl's book &amp;quot;&lt;a href="http://www.amazon.com/gp/product/0596528124?ie=UTF8&amp;amp;tag=moserware-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0596528124"&gt;Mastering Regular Expressions&lt;/a&gt;&amp;quot; and couldn't put it down. In less than a week, I had flown through 400+ pages and had finally started to feel like I understood how regular expressions worked. I finally had a sense for what backtracking really meant and I had a better idea for how a regex could go &lt;a href="http://www.regular-expressions.info/catastrophic.html"&gt;catastrophically&lt;/a&gt; out of control. &lt;/p&gt;&lt;p&gt;I had extremely high hopes for chapter 9 which covered the .NET regular expression &amp;quot;flavor.&amp;quot; Since I work with .NET every day, I thought this would be the best chapter. I did learn a few things like how to properly use &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx"&gt;RegexOptions.ExplicitCapture&lt;/a&gt;, how to use the &lt;a href="http://msdn.microsoft.com/en-us/library/ewy2t5e0.aspx"&gt;special per-match replacement sequences&lt;/a&gt; that &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace.aspx"&gt;Regex.Replace&lt;/a&gt; offers, how to &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.compiletoassembly.aspx"&gt;save compiled regular expressions to a DLL&lt;/a&gt;, and how to &lt;a href="http://weblogs.asp.net/whaggard/archive/2005/02/20/377025.aspx"&gt;match balanced parentheses&lt;/a&gt; -a feat that's theoretically not possible with a regex. Despite learning all of this in the chapter, I still didn't feel that I could &amp;quot;connect&amp;quot; with the very .NET regular expression engine that I know and love.&lt;/p&gt;&lt;p&gt;To be fair, the vast benefit of the book comes from the first six chapters that deal with how regular expressions work &lt;em&gt;in general&lt;/em&gt; since regex implementations share many ideas. The book laid a solid foundation, but I wanted more. &lt;/p&gt;&lt;p&gt;I wanted to stop all my hand-waving at regular expressions and actually understand how they &lt;i&gt;really&lt;/i&gt; work.&lt;/p&gt;&lt;p&gt;I knew I wanted to drill into the code. Although tools like &lt;a href="http://www.red-gate.com/products/reflector/"&gt;Reflector&lt;/a&gt; are amazing, I knew I wanted to see the actual code. It's fairly easy now to &lt;a href="http://weblogs.asp.net/scottgu/archive/2008/01/16/net-framework-library-source-code-now-available.aspx"&gt;step into the framework source code&lt;/a&gt; in the debugger. Unlike &lt;a href="http://www.moserware.com/2008/09/how-do-locks-lock.html"&gt;understanding the details of locking&lt;/a&gt;, which had me dive into C++ and x86 assembly, it was refreshing to see that the .NET regular expression engine was written entirely in C#.&lt;/p&gt;&lt;p&gt;I decided to use a really simple regular expression and search string and then follow it from cradle to grave. If you'd like to follow along at home, I've &lt;a href="http://www.koders.com/csharp/fid7F5AE3CBB76E9E51E24DEA0DB54B86C173369E88.aspx"&gt;linked&lt;/a&gt; to relevant lines in the .NET regular expression source code.&lt;/p&gt;&lt;p&gt;My very simple regex consisted of looking for a basic URL:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,0,255)"&gt;string&lt;/span&gt; textToSearch = &lt;span style="color: rgb(0,96,128)"&gt;&amp;quot;Welcome to http://www.moserware.com/!&amp;quot;&lt;/span&gt;;&lt;br /&gt;&lt;span style="color: rgb(0,0,255)"&gt;string&lt;/span&gt; regexPattern = &lt;span style="color: rgb(0,96,128)"&gt;@&amp;quot;http://([^\s/]+)/?&amp;quot;&lt;/span&gt;;&lt;br /&gt;&lt;span style="color: rgb(43,145,175)"&gt;Match&lt;/span&gt; m = &lt;span style="color: rgb(43,145,175)"&gt;Regex&lt;/span&gt;.Match(textToSearch, regexPattern); &lt;br /&gt;&lt;span style="color: rgb(43,145,175)"&gt;Console&lt;/span&gt;.WriteLine(&lt;span style="color: rgb(0,96,128)"&gt;&amp;quot;Full uri = '{0}'&amp;quot;&lt;/span&gt;, m.Value);&lt;br /&gt;&lt;span style="color: rgb(43,145,175)"&gt;Console&lt;/span&gt;.WriteLine(&lt;span style="color: rgb(0,96,128)"&gt;&amp;quot;Host ='{0}'&amp;quot;&lt;/span&gt;, m.Groups[1].Value);&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;Our journey begins at &lt;a href="http://msdn.microsoft.com/en-us/library/0z2heewz.aspx"&gt;Regex.Match&lt;/a&gt; where we &lt;a href="http://www.koders.com/csharp/fid7F5AE3CBB76E9E51E24DEA0DB54B86C173369E88.aspx#L113"&gt;checking an internal cache&lt;/a&gt; of the &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.cachesize.aspx"&gt;past 15&lt;/a&gt; regex values to see if there a match for:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&amp;quot;0:ENU:http://([^\\s/]+)/?&amp;quot;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is a compact representation of:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx"&gt;RegexOptions&lt;/a&gt; : &lt;a href="http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo.threeletterwindowslanguagename.aspx"&gt;Culture&lt;/a&gt; : Regex pattern&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The regex doesn't find this in the cache, so it starts &lt;a href="ScanRegex"&gt;scanning the pattern&lt;/a&gt;. Note that &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L21"&gt;out of respect for the authors&lt;/a&gt;, our regex pattern doesn't have any comments or whitespace in it:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,128,0)"&gt;// It would be nice to get rid of the &lt;a href="UseOptionX"&gt;comment modes&lt;/a&gt;, since the &lt;br /&gt;// &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L867"&gt;ScanBlank&lt;/a&gt;() calls are just kind of duct-taped in.&lt;/span&gt;&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;We &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L265"&gt;start&lt;/a&gt; creating an internal tree representation of the regex by &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L1869"&gt;adding&lt;/a&gt; a multi-character (aka &amp;quot;&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L83"&gt;Multi&lt;/a&gt;&amp;quot;) node to contain the &amp;quot;http://&amp;quot; part. Next, we see that the scanner made it to first real capture:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;http://&lt;b&gt;([^\s/]+)&lt;/b&gt;/?&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This capture contains a &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx"&gt;character class&lt;/a&gt; that says that we don't want to match spaces or a forward slash. It is converted into an obscure five character string:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&amp;quot;\x1\x2\x1\x2F\x30\x64&amp;quot;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Later we'll see why it had to all fit in one string, but for now we can use a &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L16"&gt;helpful comment&lt;/a&gt; to decode each character:&lt;/p&gt;&lt;table cellspacing="0" cellpadding="2" width="400" border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="57"&gt;Offset&lt;/td&gt;&lt;td valign="top" width="75"&gt;Hex Value&lt;/td&gt;&lt;td valign="top" width="257"&gt;Meaning&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="62"&gt;0&lt;/td&gt;&lt;td valign="top" width="75"&gt;0x01&lt;/td&gt;&lt;td valign="top" width="257"&gt;The set should be negated&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="66"&gt;1&lt;/td&gt;&lt;td valign="top" width="75"&gt;0x02&lt;/td&gt;&lt;td valign="top" width="257"&gt;There are two characters in the character part of the set&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="70"&gt;2&lt;/td&gt;&lt;td valign="top" width="75"&gt;0x01&lt;/td&gt;&lt;td valign="top" width="257"&gt;There is one Unicode category&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="73"&gt;3&lt;/td&gt;&lt;td valign="top" width="75"&gt;0x2F&lt;/td&gt;&lt;td valign="top" width="257"&gt;Inclusive lower-bound of the character set. It's a '/' in Unicode&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="76"&gt;4&lt;/td&gt;&lt;td valign="top" width="75"&gt;0x30&lt;/td&gt;&lt;td valign="top" width="257"&gt;Exclusive upper-bound of the character set. It's a '0' in Unicode&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="77"&gt;5&lt;/td&gt;&lt;td valign="top" width="75"&gt;0x64&lt;/td&gt;&lt;td valign="top" width="257"&gt;This is a magic number that means the &amp;quot;&lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L63"&gt;Space&lt;/a&gt;&amp;quot; category.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;Before I realized that this string had meaning, I was utterly confused.&lt;/p&gt;&lt;p&gt;As we continue scanning, we &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L373"&gt;find a '+' quantifier&lt;/a&gt;:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;http://([^\\s/]&lt;strong&gt;&lt;span style="font-size:180%;"&gt;+&lt;/span&gt;&lt;/strong&gt;)/?&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is noted as a &lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L71"&gt;Oneloop&lt;/a&gt; node since it's a &amp;quot;loop&amp;quot; of what came before (e.g. the character class set). It has arguments of 1 and &lt;a href="http://msdn.microsoft.com/en-us/library/system.int32.maxvalue.aspx"&gt;Int32.MaxValue&lt;/a&gt; to denote 1 or more matches. We see that the next character isn't a '?', so we can assert this is &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L406"&gt;not a lazy match&lt;/a&gt; which means it's a &lt;a href="http://en.wikipedia.org/wiki/Regular_expression#Lazy_quantification"&gt;greedy&lt;/a&gt; match.&amp;#160;&lt;/p&gt;&lt;p&gt;The first &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L301"&gt;group is recorded&lt;/a&gt; when we hit the ')' character. At the end of the pattern, we note a &lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L79"&gt;One&lt;/a&gt; (character) node for the '/' and we see it's followed by a '&lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L368"&gt;?&lt;/a&gt;'&amp;#160; which is just &lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L551"&gt;another quantifier&lt;/a&gt;, this time with a minimum of 0 and a maximum of 1.&lt;/p&gt;&lt;p&gt;All those nodes come together to give us this &amp;quot;&lt;a href="http://www.koders.com/csharp/fidECC3E02EC33C0F92A3E24574A0673340C8A22B2C.aspx"&gt;RegexTree&lt;/a&gt;:&amp;quot;&lt;/p&gt;&lt;p&gt;&lt;a href="http://1.bp.blogspot.com/_Zfbv3mHcYrc/SbcqJZeaP-I/AAAAAAAABKo/xD5zClg2S5I/s1600-h/RegexParseTree.png"&gt;&lt;img src="http://1.bp.blogspot.com/_Zfbv3mHcYrc/SbcqJZeaP-I/AAAAAAAABKo/xD5zClg2S5I/s400/RegexParseTree.png" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;We still need to &lt;a href="http://www.koders.com/csharp/fid7F5AE3CBB76E9E51E24DEA0DB54B86C173369E88.aspx#L133"&gt;convert the tree to code&lt;/a&gt; that the regular expression &amp;quot;machine&amp;quot; can execute later. The bulk of the work is done by an aptly named &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L223"&gt;RegexCodeFromRegexTree&lt;/a&gt; function that has a decent &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L212"&gt;comment&lt;/a&gt;:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,128,0)"&gt;/*&lt;br /&gt; * The top level RegexCode generator. It does a depth-first walk &lt;br /&gt; * through the tree and calls EmitFragment to emits code before &lt;br /&gt; * and after each child of an interior node, and at each leaf. &lt;br /&gt; * &lt;br /&gt; * It runs two passes, first to count the size of the generated &lt;br /&gt; * code, and second to generate the code. &lt;br /&gt; * &lt;br /&gt; * &amp;lt;CONSIDER&amp;gt;we need to time it against the alternative, which is &lt;br /&gt; * to just generate the code and grow the array as we go.&amp;lt;/CONSIDER&amp;gt; &lt;br /&gt; */&lt;/span&gt;&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;I love the anonymous &amp;quot;CONSIDER&amp;quot; comment and would have had a similar reaction. Instead of using an &lt;a href="http://msdn.microsoft.com/en-us/library/system.collections.arraylist.aspx"&gt;ArrayList&lt;/a&gt; or &lt;a href="http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx"&gt;List&lt;/a&gt;&amp;lt;int&amp;gt; to store the op codes, which can automatically resize as needed, the code diligently goes through the entire RegexTree &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L248"&gt;&lt;i&gt;twice&lt;/i&gt;&lt;/a&gt;. The class is peppered with &amp;quot;&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L126"&gt;if(_counting)&lt;/a&gt;&amp;quot; expressions that just increase a counter by the size they will use in the next pass.&lt;/p&gt;&lt;p&gt;As predicted by the comment, the bulk of the work is done by the 250 line switch statement that makes up the &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L305"&gt;EmitFragment function&lt;/a&gt;. This function breaks up RegexTree &amp;quot;fragments&amp;quot; and converts them to a simpler &lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx"&gt;RegexCode&lt;/a&gt;. The first fragment is:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;EmitFragment(nodetype=&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L450"&gt;RegexNode.Capture | BeforeChild&lt;/a&gt;, node=[&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L113"&gt;RegexNode.Capture&lt;/a&gt;, Group=0, Length=-1], childIndex=0)&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is shorthand for emitting the RegexCode that should come before the children of the top level &amp;quot;&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L113"&gt;RegexNode.Capture&lt;/a&gt;&amp;quot; node that represents group 0 and that goes until the end of the string (e.g. has length -1). The last 0 means that it's the 0th child of the parent node (this is sort of meaningless since it has no parent). The subsequent calls walk the rest of the tree:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L321"&gt;RegexNode.Concatenate | BeforeChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L108"&gt;RegexNode.Concatenate&lt;/a&gt;], childIndex=0)&lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L522"&gt;RegexNode.Multi&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L83"&gt;RegexNode.Multi&lt;/a&gt;, string=&amp;quot;http://&amp;quot;], childIndex=0)&lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L322"&gt;RegexNode.Concatenate | AfterChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L108"&gt;RegexNode.Concatenate&lt;/a&gt;], childIndex=0)&lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L321"&gt;RegexNode.Concatenate | BeforeChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L108"&gt;RegexNode.Concatenate&lt;/a&gt;], childIndex=1)&lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L450"&gt;RegexNode.Capture | BeforeChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L113"&gt;RegexNode.Capture&lt;/a&gt;, Group=1, -1], childIndex=0) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L513"&gt;RegexNode.SetLoop&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L73"&gt;RegexNode.SetLoop&lt;/a&gt;, min=1, max=Int32.MaxValue], childIndex=0) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L454"&gt;RegexNode.Capture | AfterChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L113"&gt;RegexNode.Capture&lt;/a&gt;, Group=1, Length=-1], childIndex=0) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L322"&gt;RegexNode.Concatenate | AfterChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L108"&gt;RegexNode.Concatenate&lt;/a&gt;], childIndex=1) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L321"&gt;RegexNode.Concatenate | BeforeChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L108"&gt;RegexNode.Concatenate&lt;/a&gt;], childIndex=2) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L503"&gt;RegexNode.Oneloop&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L71"&gt;RegexNode.Oneloop&lt;/a&gt;, min=0, max=1, character='/'], childIndex=0) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L322"&gt;RegexNode.Concatenate | AfterChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L108"&gt;RegexNode.Concatenate&lt;/a&gt;], childIndex=2) &lt;br /&gt;EmitFragment(&lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L454"&gt;RegexNode.Capture | AfterChild&lt;/a&gt;, [&lt;a href="http://www.koders.com/csharp/fid0C0231291E4A7914C135C0A730D5A85182F872EB.aspx#L113"&gt;RegexNode.Capture&lt;/a&gt;, Group=0, Length=-1], childIndex=0)&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The reward for all this work is an integer array that describes the RegexCode &amp;quot;op codes&amp;quot; and their arguments. You can see that some instructions like &amp;quot;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L43"&gt;Setrep&lt;/a&gt;&amp;quot; take a string argument. These arguments point to offsets in a string table. This is why it was critical to pack everything about a set into the obscure string we saw earlier. It was the only way to pass that information to the instruction.&lt;/p&gt;&lt;p&gt;Decoding the code array, we see:&lt;/p&gt;&lt;table cellspacing="0" cellpadding="2" width="712" border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="92"&gt;Index&lt;/td&gt;&lt;td valign="top" width="138"&gt;Instruction&lt;/td&gt;&lt;td valign="top" width="148"&gt;Op Code/Argument&lt;/td&gt;&lt;td valign="top" width="158"&gt;String Table Reference&lt;/td&gt;&lt;td valign="top" width="174"&gt;Description&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="93"&gt;0&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L73"&gt;Lazybranch&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="148"&gt;23&lt;/td&gt;&lt;td valign="top" width="158"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="2"&gt;Lazily branch to the &lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L91"&gt;Stop&lt;/a&gt; instruction at offset 21.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="93"&gt;1&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="148"&gt;21&lt;/td&gt;&lt;td valign="top" width="157"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="93"&gt;2&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L81"&gt;Setmark&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;31&lt;/td&gt;&lt;td valign="top" width="156"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174"&gt;Push our current state onto a stack in case we need to backtrack later.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="93"&gt;3&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L57"&gt;Multi&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;12&lt;/td&gt;&lt;td valign="top" width="156"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="2"&gt;Perform a multi-character match of string table item 0 which is &amp;quot;http://&amp;quot;.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="93"&gt;4&lt;/td&gt;&lt;td valign="top" width="139"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;0&lt;/td&gt;&lt;td valign="top" width="156"&gt;&amp;quot;http://&amp;quot;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="93"&gt;5&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L81"&gt;Setmark&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;31&lt;/td&gt;&lt;td valign="top" width="156"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174"&gt;Push our current state onto a stack in case we need to backtrack later.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;6&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L43"&gt;Setrep&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;2&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="3"&gt;Perform a set repetition match of length 1 on the set stored at string table position 1, which represents [^\s/].&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;7&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;quot;\x1\x2\x1\x2F\x30\x64&amp;quot;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;8&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;9&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L47"&gt;Setloop&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;5&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="3"&gt;Match the set [^\s/] in a loop at most Int32.MaxValue times.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;10&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;quot;\x1\x2\x1\x2F\x30\x64&amp;quot;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;11&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;2147483647&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;12&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L82"&gt;Capturemark&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;32&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="3"&gt;Capture into group #1, the string between the mark set by the last Setmark and the current position.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;13&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;14&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;-1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;15&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L45"&gt;Oneloop&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;3&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="3"&gt;Match Unicode character 47 (a '/') in a loop for a maximum of 1 time.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;16&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;47&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;17&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;18&lt;/td&gt;&lt;td valign="top" width="138"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L82"&gt;Capturemark&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="149"&gt;32&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174" rowspan="3"&gt;Capture into group #0, the contents between the first Setmark instruction and the current position.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;19&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;0&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;20&lt;/td&gt;&lt;td valign="top" width="138"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="149"&gt;-1&lt;/td&gt;&lt;td valign="top" width="155"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="94"&gt;21&lt;/td&gt;&lt;td valign="top" width="139"&gt;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx#L91"&gt;Stop&lt;/a&gt;&lt;/td&gt;&lt;td valign="top" width="151"&gt;40&lt;/td&gt;&lt;td valign="top" width="160"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="174"&gt;Stop the regex.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;We can now see that our regex has turned into a simple &amp;quot;program&amp;quot; that will be executed later.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Prefix Optimizations&lt;/b&gt;&lt;/p&gt;&lt;p&gt;We could stop here, but we'd miss the fun &amp;quot;optimizations.&amp;quot; With our pattern and search string, the optimizations will actually slow things down, but the code generator is oblivious to that. The basic idea behind prefix optimizations is to quickly jump to where the match &lt;i&gt;might &lt;/i&gt;start. It does this by &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L289"&gt;using&lt;/a&gt; a &lt;a href="http://www.koders.com/csharp/fid4F4894401F2873F7C00CC3CF96F851ED3ED10D69.aspx"&gt;RegexFCD&lt;/a&gt; class that I'm guessing stands for &amp;quot;Regex First Character Descriptor.&amp;quot; &lt;/p&gt;&lt;p&gt;With our regex, the &lt;a href="http://www.koders.com/csharp/fid4F4894401F2873F7C00CC3CF96F851ED3ED10D69.aspx#L53"&gt;FirstChars&lt;/a&gt; functions &lt;a href="http://www.koders.com/csharp/fid4F4894401F2873F7C00CC3CF96F851ED3ED10D69.aspx#L462"&gt;notices our &amp;quot;http://&amp;quot; 'Multi' node&lt;/a&gt; and &lt;a href="http://www.koders.com/csharp/fid4F4894401F2873F7C00CC3CF96F851ED3ED10D69.aspx#L524"&gt;determines&lt;/a&gt; that any match must start with an 'h'. If we had alternations, the first character of each alternation would be added to make a limited set of potential first characters. With this optimization alone, we can skip all characters in the text that aren't in this approved &amp;quot;white list&amp;quot; of first characters without having to execute any of the above RegexCode.&lt;/p&gt;&lt;p&gt;But wait... there's an even trickier optimization! The optimizer &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L295"&gt;discovers&lt;/a&gt; that the first thing the regex must match is a simple string literal: &lt;a href="http://www.koders.com/csharp/fid4F4894401F2873F7C00CC3CF96F851ED3ED10D69.aspx#L106"&gt;a 'Multi' node&lt;/a&gt;. This means that we can use the &lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx"&gt;RegexBoyerMoore&lt;/a&gt; class which applies the &lt;a href="http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm"&gt;Boyer-Moore&lt;/a&gt; search algorithm. &lt;/p&gt;&lt;p&gt;The key insight is that we don't have to check each character of the text. We only need to look at last character to see if it's even worth checking the rest.&lt;/p&gt;&lt;p&gt;For example, if our sample text is &amp;quot;Welcome to http://www.moserware.com/!&amp;quot; and we're searching for &amp;quot;http://&amp;quot; which is 7 characters, we first look at the 7th character of the text which is 'e'. Since 'e' is not the 7th character of what we're looking for (which is a '/'), we know that there couldn't possibly be a match and so we don't need to bother checking all previous 6 characters because there isn't even an 'e' in what we're looking for. The tricky part is what to do if the what we find &lt;i&gt;is&lt;/i&gt; in the string that we're trying to match, but it isn't the last '/' character. &lt;/p&gt;&lt;p&gt;The specifics are handled in &lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx#L91"&gt;straightforward&lt;/a&gt;&amp;#160;&lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx#L166"&gt;way&lt;/a&gt; with some minor optimizations to reduce memory needs given 65,000+ possible Unicode characters. For each character, the maximum possible skip is calculated.&lt;/p&gt;&lt;p&gt;For &amp;quot;http://&amp;quot;, we come up with this skip table:&lt;/p&gt;&lt;table cellspacing="0" cellpadding="2" width="195" border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="74"&gt;Character&lt;/td&gt;&lt;td valign="top" width="119"&gt;Characters to skip ahead&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="79"&gt;/&lt;/td&gt;&lt;td valign="top" width="119"&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="83"&gt;:&lt;/td&gt;&lt;td valign="top" width="119"&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="86"&gt;h&lt;/td&gt;&lt;td valign="top" width="119"&gt;6&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="88"&gt;p&lt;/td&gt;&lt;td valign="top" width="119"&gt;3&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="89"&gt;t&lt;/td&gt;&lt;td valign="top" width="119"&gt;4&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="90"&gt;all others&lt;/td&gt;&lt;td valign="top" width="119"&gt;7&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;This table tells us that if we find an 'e' then we can skip ahead 7 characters without even checking the previous 6 characters. If we find a 'p', then we can skip ahead at least 3 characters before performing a full check, and if we find a '/' then we could be on the last character and need to check other characters (e.g. skip ahead 0).&lt;/p&gt;&lt;p&gt;There is one more optimization that &lt;a href="http://www.koders.com/csharp/fid4F4894401F2873F7C00CC3CF96F851ED3ED10D69.aspx#L133"&gt;looks for anchors&lt;/a&gt;, but none apply to our regex, so it's ignored.&lt;/p&gt;&lt;p&gt;We're done! We made it to the end of the &lt;a href="http://www.koders.com/csharp/fidC6F7EB8E11A6BF080CFA281BEEE003B5FAAB4AD9.aspx#L302"&gt;RegexWriter phase&lt;/a&gt;. The &amp;quot;&lt;a href="http://www.koders.com/csharp/fidF4B2B64D471D5B7401063DE2054CB33F28BDA026.aspx"&gt;RegexCode&lt;/a&gt;&amp;quot; internal representation consists of these critical parts:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The regex code we created. &lt;/li&gt;&lt;li&gt;The string table derived from the regex that the code uses (e.g. our &amp;quot;Multi&amp;quot; and &amp;quot;Setrep&amp;quot; instructions have string table references). &lt;/li&gt;&lt;li&gt;The maximum size of our backtracking stack. (Ours is 7, this will make more sense later.) &lt;/li&gt;&lt;li&gt;A mapping of named captures to their group numbers. (We don't have any in our regex, so this is empty.) &lt;/li&gt;&lt;li&gt;The total number of captures. (We have 2.) &lt;/li&gt;&lt;li&gt;The RegexBoyerMoore prefix that we calculated. (This applies to us since we have a string literal at the start.) &lt;/li&gt;&lt;li&gt;The possible first characters in our prefix. (In our case, we calculated this to be an 'h'.) &lt;/li&gt;&lt;li&gt;Our anchors. (We don't have any.) &lt;/li&gt;&lt;li&gt;An indicator whether this should be a RightToLeft match. (In our case, we use the default which is false.) &lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Every regex passes through this step. It applies to our measly regex with a code size of 21 as much as it does to a gnarly &lt;a href="http://tools.ietf.org/html/rfc2822#section-3.4.1"&gt;RFC2822&lt;/a&gt; &lt;a href="http://www.regular-expressions.info/email.html"&gt;compliant regex&lt;/a&gt; that has 175. These nine items completely describe &lt;em&gt;everything&lt;/em&gt; that we'll do with our regex and they never change. &lt;/p&gt;&lt;p&gt;&lt;b&gt;In need of an interpreter&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Now that we have the RegexCode, the &lt;a href="http://msdn.microsoft.com/en-us/library/twcw2f1c.aspx"&gt;match&lt;/a&gt; method will &lt;a href="http://www.koders.com/csharp/fid7F5AE3CBB76E9E51E24DEA0DB54B86C173369E88.aspx#L919"&gt;run&lt;/a&gt; and create a &lt;a href="http://www.koders.com/csharp/fidABFA3D15F7A596443DCE29D6AE984F1192048031.aspx"&gt;RegexRunner&lt;/a&gt; which is the &amp;quot;driver&amp;quot; for the regex matching process. Since we didn't specify the &amp;quot;Compiled&amp;quot; flag, we'll use the &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx"&gt;RegexInterpreter&lt;/a&gt; runner.&lt;/p&gt;&lt;p&gt;Before the interpreter starts &lt;a href="http://www.koders.com/csharp/fidABFA3D15F7A596443DCE29D6AE984F1192048031.aspx#L81"&gt;scanning&lt;/a&gt;, it &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L364"&gt;notices&lt;/a&gt; that we have a valid Boyer-Moore prefix optimization and it &lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx#L269"&gt;uses it&lt;/a&gt; to quickly locate the start of the regex:&lt;/p&gt;&lt;table cellspacing="0" cellpadding="2" width="779" border="1"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td valign="top" width="68"&gt;Index&lt;/td&gt;&lt;td valign="top" width="19"&gt;0&lt;/td&gt;&lt;td valign="top" width="15"&gt;1&lt;/td&gt;&lt;td valign="top" width="14"&gt;2&lt;/td&gt;&lt;td valign="top" width="14"&gt;3&lt;/td&gt;&lt;td valign="top" width="15"&gt;4&lt;/td&gt;&lt;td valign="top" width="18"&gt;5&lt;/td&gt;&lt;td valign="top" width="15"&gt;6&lt;/td&gt;&lt;td valign="top" width="14"&gt;7&lt;/td&gt;&lt;td valign="top" width="14"&gt;8&lt;/td&gt;&lt;td valign="top" width="15"&gt;9&lt;/td&gt;&lt;td valign="top" width="21"&gt;10&lt;/td&gt;&lt;td valign="top" width="21"&gt;11&lt;/td&gt;&lt;td valign="top" width="21"&gt;12&lt;/td&gt;&lt;td valign="top" width="21"&gt;13&lt;/td&gt;&lt;td valign="top" width="21"&gt;14&lt;/td&gt;&lt;td valign="top" width="21"&gt;15&lt;/td&gt;&lt;td valign="top" width="21"&gt;16&lt;/td&gt;&lt;td valign="top" width="21"&gt;17&lt;/td&gt;&lt;td valign="top" width="21"&gt;18&lt;/td&gt;&lt;td valign="top" width="21"&gt;19&lt;/td&gt;&lt;td valign="top" width="21"&gt;20&lt;/td&gt;&lt;td valign="top" width="21"&gt;21&lt;/td&gt;&lt;td valign="top" width="21"&gt;22&lt;/td&gt;&lt;td valign="top" width="21"&gt;23&lt;/td&gt;&lt;td valign="top" width="21"&gt;24&lt;/td&gt;&lt;td valign="top" width="21"&gt;25&lt;/td&gt;&lt;td valign="top" width="21"&gt;26&lt;/td&gt;&lt;td valign="top" width="21"&gt;27&lt;/td&gt;&lt;td valign="top" width="21"&gt;28&lt;/td&gt;&lt;td valign="top" width="21"&gt;29&lt;/td&gt;&lt;td valign="top" width="21"&gt;30&lt;/td&gt;&lt;td valign="top" width="21"&gt;31&lt;/td&gt;&lt;td valign="top" width="21"&gt;32&lt;/td&gt;&lt;td valign="top" width="21"&gt;33&lt;/td&gt;&lt;td valign="top" width="21"&gt;34&lt;/td&gt;&lt;td valign="top" width="21"&gt;35&lt;/td&gt;&lt;td valign="top" width="10"&gt;36&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="68"&gt;Character&lt;/td&gt;&lt;td valign="top" width="19"&gt;W&lt;/td&gt;&lt;td valign="top" width="15"&gt;e&lt;/td&gt;&lt;td valign="top" width="14"&gt;l&lt;/td&gt;&lt;td valign="top" width="14"&gt;c&lt;/td&gt;&lt;td valign="top" width="15"&gt;o&lt;/td&gt;&lt;td valign="top" width="18"&gt;m&lt;/td&gt;&lt;td valign="top" width="15"&gt;e&lt;/td&gt;&lt;td valign="top" width="14"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="14"&gt;t&lt;/td&gt;&lt;td valign="top" width="15"&gt;o&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;h&lt;/td&gt;&lt;td valign="top" width="21"&gt;t&lt;/td&gt;&lt;td valign="top" width="21"&gt;t&lt;/td&gt;&lt;td valign="top" width="21"&gt;p&lt;/td&gt;&lt;td valign="top" width="21"&gt;:&lt;/td&gt;&lt;td valign="top" width="21"&gt;/&lt;/td&gt;&lt;td valign="top" width="21"&gt;/&lt;/td&gt;&lt;td valign="top" width="21"&gt;w&lt;/td&gt;&lt;td valign="top" width="21"&gt;w&lt;/td&gt;&lt;td valign="top" width="21"&gt;w&lt;/td&gt;&lt;td valign="top" width="21"&gt;.&lt;/td&gt;&lt;td valign="top" width="21"&gt;m&lt;/td&gt;&lt;td valign="top" width="21"&gt;o&lt;/td&gt;&lt;td valign="top" width="21"&gt;s&lt;/td&gt;&lt;td valign="top" width="21"&gt;e&lt;/td&gt;&lt;td valign="top" width="21"&gt;r&lt;/td&gt;&lt;td valign="top" width="21"&gt;w&lt;/td&gt;&lt;td valign="top" width="21"&gt;a&lt;/td&gt;&lt;td valign="top" width="21"&gt;r&lt;/td&gt;&lt;td valign="top" width="21"&gt;e&lt;/td&gt;&lt;td valign="top" width="21"&gt;.&lt;/td&gt;&lt;td valign="top" width="21"&gt;c&lt;/td&gt;&lt;td valign="top" width="21"&gt;o&lt;/td&gt;&lt;td valign="top" width="21"&gt;m&lt;/td&gt;&lt;td valign="top" width="21"&gt;/&lt;/td&gt;&lt;td valign="top" width="10"&gt;!&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td valign="top" width="68"&gt;Scan Order&lt;/td&gt;&lt;td valign="top" width="19"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="15"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="14"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="14"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="15"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="18"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="15"&gt;1&lt;/td&gt;&lt;td valign="top" width="14"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="14"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="15"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;9&lt;/td&gt;&lt;td valign="top" width="21"&gt;8&lt;/td&gt;&lt;td valign="top" width="21"&gt;2 &amp; 7&lt;/td&gt;&lt;td valign="top" width="21"&gt;6&lt;/td&gt;&lt;td valign="top" width="21"&gt;5&lt;/td&gt;&lt;td valign="top" width="21"&gt;4&lt;/td&gt;&lt;td valign="top" width="21"&gt;3&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="21"&gt;&amp;#160;&lt;/td&gt;&lt;td valign="top" width="10"&gt;&amp;#160;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;p&gt;It first looks at the 7th character and finds an 'e' instead of the '/' that it wanted. The skip table tells it that 'e' isn't in any possible match, so it jumps ahead 7 more characters where it finds a 't'. The skip table &lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx#L318"&gt;tells it&lt;/a&gt; to jump ahead 4 more characters where it &lt;em&gt;finally&lt;/em&gt; finds the '/' it wanted. It then &lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx#L326"&gt;verifies&lt;/a&gt; that this is the last character of our &amp;quot;http://&amp;quot; prefix. With a valid prefix found, we &lt;a href="http://www.koders.com/csharp/fidABFA3D15F7A596443DCE29D6AE984F1192048031.aspx#L187"&gt;prepare for a match&lt;/a&gt; in case we're lucky and the rest of the regex matches. &lt;/p&gt;&lt;p&gt;The bulk of the interpreter is in its &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L403"&gt;Go&lt;/a&gt;&amp;quot; method which is a 700 line switch statement that interprets the RegexCode we created earlier. The only interesting part is that the interpreter keeps two stacks to keep its state in case it needs to backtrack and abandon a path it took. The &amp;quot;run &lt;b&gt;s&lt;/b&gt;tack&amp;quot; records where in the search string an operation begins while the &amp;quot;run &lt;b&gt;t&lt;/b&gt;rack&amp;quot; records the RegexCode instruction that could potentially backtrack. Any time there is a chance that the interpreter could go down a wrong path, it pushes its state onto these stacks so that it can potentially try something else later.&lt;/p&gt;&lt;p&gt;On our string, the following instructions execute:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L430"&gt;Lazybranch&lt;/a&gt; - This is a branch that is &amp;quot;lazy.&amp;quot; It will only occur if we fail and have to backtrack to this instruction. In case there are problems, we push 11 (the string offset to the start of &amp;quot;http://&amp;quot;) onto the &amp;quot;run &lt;b&gt;s&lt;/b&gt;tack&amp;quot; and 0 (the RegexCode offset for this instruction) onto the &amp;quot;run &lt;b&gt;t&lt;/b&gt;rack.&amp;quot; The branch is to code offset 21 which is the &amp;quot;Stop&amp;quot; instruction. &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L441"&gt;Setmark&lt;/a&gt; - We save our position in case we have to backtrack. &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L824"&gt;Multi&lt;/a&gt; - A multi-character match. The string to match is at offset 0 in the string table (which is &amp;quot;http://&amp;quot;). &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L441"&gt;Setmark&lt;/a&gt; - Another position save in case of a backtrack. Since the Multi code succeeded, we push our &amp;quot;run &lt;b&gt;s&lt;/b&gt;tack&amp;quot; offset of 18 (the start of &amp;quot;www.&amp;quot;) and our &amp;quot;run &lt;b&gt;t&lt;/b&gt;rack&lt;b&gt;&amp;quot; &lt;/b&gt;code position of 5 &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L883"&gt;Setrep&lt;/a&gt; - Loads the &amp;quot;\x1\x2\x1\x2F\x30\x64&amp;quot; set representation at offset 1 in the string table that we calculated earlier. It reads an operand from the execution stack that we should verify that the set &lt;b&gt;rep&lt;/b&gt;eats exactly once. It calls &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L815"&gt;CharInClassRecursive&lt;/a&gt; that does the following: &lt;ol&gt;&lt;li&gt;It sees that the first character, 'w', is not in the character range ['/', '0'). This check corresponds to the '/' in the &amp;quot;[^\s/]&amp;quot; part of the regex. &lt;/li&gt;&lt;li&gt;It next tries &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L874"&gt;CharInCategory&lt;/a&gt; which notes that 'w' is part of the &amp;quot;LowercaseLetter&amp;quot; &lt;a href="http://msdn.microsoft.com/en-us/library/system.globalization.unicodecategory.aspx"&gt;UnicodeCategory&lt;/a&gt;. The magic number 0x64 in our set tells us to &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L890"&gt;do &lt;/a&gt;a &lt;a href="http://msdn.microsoft.com/en-us/library/t809ektx.aspx"&gt;Char.IsWhiteSpace&lt;/a&gt; check on it. This too fails. &lt;/li&gt;&lt;li&gt;Although both checks fail, the interpreter sees that it needs to &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L828"&gt;flip the result&lt;/a&gt; since it is a negated (^) set. This makes the character class match succeed. &lt;/li&gt;&lt;/ol&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L948"&gt;Setloop&lt;/a&gt; - A &amp;quot;loop&amp;quot; instruction is like a &amp;quot;rep&amp;quot; one except that it isn't forced to match anything. In our case, we see that we loop for a maximum of Int32.MaxValue times on the same set we saw in &amp;quot;Setrep.&amp;quot; Here you can see that the code generation phase turned the &amp;quot;+&amp;quot; in &amp;quot;[^\s/]+&amp;quot; of the regex into a Setrep of 1 followed by a Setloop. This is equivalent to &amp;quot;[^\s/][^\s/]*&amp;quot;. The loop keeps chomping characters until it finds the '/' which causes it to call &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L321"&gt;BackwardNext()&lt;/a&gt; which sets the current position to just before the final '/'. &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L470"&gt;CaptureMark&lt;/a&gt; - Here we start capturing group 1 by popping the &amp;quot;run &lt;b&gt;s&lt;/b&gt;tack&amp;quot; which gives us 18. Our current offset is 35. We &lt;a href="http://www.koders.com/csharp/fidABFA3D15F7A596443DCE29D6AE984F1192048031.aspx#L354"&gt;capture&lt;/a&gt; the string between these two positions, &amp;quot;www.moserware.com&amp;quot;, and &lt;a href="http://www.koders.com/csharp/fidABFA3D15F7A596443DCE29D6AE984F1192048031.aspx#L330"&gt;keep it&lt;/a&gt; for later use in case the entire regex succeeds. &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L900"&gt;Oneloop&lt;/a&gt; - Here we do a loop at most one time that will check for the '/' character. It succeeds. &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L470"&gt;CaptureMark&lt;/a&gt; - We capture into group 0 the value between the offset on the &amp;quot;run &lt;b&gt;s&lt;/b&gt;tack&amp;quot;, which is 11 (the start of &amp;quot;http://&amp;quot;), and the last character of the string at offset 36. The string between these offsets is &amp;quot;http://www.moserware.com/&amp;quot;. &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L414"&gt;Stop&lt;/a&gt; - We're done executing RegexCode and can stop the interpreter. &lt;/li&gt;&lt;/ol&gt;&lt;p&gt;Since we stopped with successful captures, the Match is declared &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.success.aspx"&gt;a success&lt;/a&gt;. Sure enough, if we look at our console window, we see:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;Full uri = 'http://www.moserware.com/'&lt;br /&gt;Host ='www.moserware.com'&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;b&gt;Backtracking Down Unhappy Paths&lt;/b&gt;&lt;/p&gt;&lt;p&gt;I can hear the cursing shouts of ^#!@.*#!$ from the regex mob coming towards me. They're miffed that I used a toy regular expression with a pathetically easy search text that didn't do anything &amp;quot;interesting.&amp;quot;&lt;/p&gt;&lt;p&gt;The mob really shouldn't be that worried. We already have all the essential tools we need to understand how things work.&lt;/p&gt;&lt;p&gt;One common issue that you have to deal with in a &amp;quot;real&amp;quot; regular expression is backtracking. &lt;/p&gt;&lt;p&gt;Let's say you have a search text and pattern like this:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,0,255)"&gt;string&lt;/span&gt; text = &lt;span style="color: rgb(0,128,0)"&gt;&amp;quot;This text has 1 digit in it&amp;quot;&lt;/span&gt;;&lt;br /&gt;&lt;span style="color: rgb(0,0,255)"&gt;string&lt;/span&gt; pattern = &lt;span style="color: rgb(0,128,0)"&gt;@&amp;quot;.*\d&amp;quot;&lt;/span&gt;; &lt;span style="color: rgb(43,145,175)"&gt;Regex&lt;/span&gt;.Match(text, pattern);&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;You'd recognize the parse tree:&lt;/p&gt;&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SbwnM1rcCXI/AAAAAAAABLA/P6DqBD--AWk/s1600-h/RegexParseTreeDotStarDigit.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/SbwnM1rcCXI/AAAAAAAABLA/P6DqBD--AWk/s400/RegexParseTreeDotStarDigit.png" /&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt;The only thing new about it is that the '.' pattern was translated into a &amp;quot;Notone&amp;quot; node that matches anything except one particular character (in our case, a line feed). We see that the set follows the obscure, but compact representation. The only thing new to report is that '\x09' is the magic number to represent all Unicode digits (which the &lt;a href="http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html"&gt;Turkey Test&lt;/a&gt; showed is more than just [0-9]).&lt;/p&gt;&lt;p&gt;It's painful to watch the regex interpreter work so hard for this match. The &amp;quot;.*&amp;quot; puts it &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L924"&gt;in a Notoneloop&lt;/a&gt; that goes right to the end of the string since it doesn't find a line feed ('\n'). It then looks for &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L817"&gt;the Set&lt;/a&gt; that represents &amp;quot;\d&amp;quot; and it fails. It has no choice but to backtrack by executing the &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L973"&gt;RegexCode.Notoneloop | RegexCode.Back&lt;/a&gt;&amp;quot; composite instruction which backtracks one character by resetting the &amp;quot;run &lt;strong&gt;t&lt;/strong&gt;rack&amp;quot; to be the Set instruction again, but this time it will start one character earlier.&lt;/p&gt;&lt;p&gt;Even in our insanely simple search string, the interpreter has to backtrack by executing &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L973"&gt;RegexCode.Notoneloop | RegexCode.Back&lt;/a&gt;&amp;quot; and retesting &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L817"&gt;the Set&lt;/a&gt; a total of &lt;em&gt;thirteen times&lt;/em&gt;.&lt;/p&gt;&lt;p&gt;An almost identical process occurs if we had used a lazy match regular expression like &amp;quot;.*?\d&amp;quot;. The difference is that it does a &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L1004"&gt;Notonelazy&lt;/a&gt;&amp;quot; instruction and then gets caught up in a &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L1050"&gt;RegexCode.Notonelazy | RegexCode.Back&lt;/a&gt;&amp;quot; backtrack and &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L817"&gt;Set match&lt;/a&gt; attempt that happens &lt;em&gt;fourteen times&lt;/em&gt;. Each iteration of the loop causes the &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L1004"&gt;Notonelazy&lt;/a&gt;&amp;quot; instruction to add one more character instead of removing one like the &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L924"&gt;Notoneloop&lt;/a&gt;&amp;quot; instruction had to. This is typical:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;In situations where the decision is between &amp;quot;make an attempt&amp;quot; and &amp;quot;skip an attempt,&amp;quot; as with items governed by quantifiers, the engine always chooses to first &lt;em&gt;make&lt;/em&gt; the attempt for &lt;em&gt;greedy&lt;/em&gt; quantifiers, and to first &lt;em&gt;skip&lt;/em&gt; the attempt for &lt;em&gt;lazy&lt;/em&gt; (non-greedy) ones.&amp;#160; &lt;em&gt;&lt;a href="http://www.amazon.com/gp/product/0596528124?ie=UTF8&amp;amp;tag=moserware-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=390957&amp;amp;creativeASIN=0596528124"&gt;Mastering Regular Expressions&lt;/a&gt;&lt;/em&gt;, p.159&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;If we had a little more empathy for the regex interpreter, we would have written &amp;quot;[^\d]*\d&amp;quot; and avoided all the backtracking, but it wouldn't have shown this common error.&lt;/p&gt;&lt;p&gt;Alternations such as &amp;quot;hello|world&amp;quot; are handled with backtracking. Before each alternative is attempted, the current position is saved on the &amp;quot;run &lt;strong&gt;t&lt;/strong&gt;rack&amp;quot; and &amp;quot;run &lt;strong&gt;s&lt;/strong&gt;tack.&amp;quot; If the alternate fails, the regex engine resets the position to what it was before the alternate was tried and the next alternate is attempted.&lt;/p&gt;&lt;p&gt;Now, we can even understand how more advanced concepts like &lt;a href="http://www.regular-expressions.info/atomic.html"&gt;atomic grouping&lt;/a&gt; work. If we use a regex like:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;\w+:&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;to match the names of email headers as in:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Subject: Hello World!&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Things will work well. The problem will come when we try to match against&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Subject&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;We already know that there is going to be a backtracking since &amp;quot;\w+&amp;quot; will match the whole string and then backtracking will occur as the interpreter desperately tries to match a ':'. If we used atomic grouping, as in:&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;(?&amp;gt;\w+):&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;We would see that the generated RegexCode has two extra instructions of &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L701"&gt;Setjump&lt;/a&gt; and &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L723"&gt;Forejump&lt;/a&gt; in it. These instructions tell the interpreter to do unconditional jumps after matching the &amp;quot;\w+&amp;quot;. As &lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L723"&gt;the comment&lt;/a&gt; for &amp;quot;Forejump&amp;quot; indicates, these unconditional jumps will &amp;quot;zap backtracking state&amp;quot; and be much more efficient for a failed match since backtracking won't occur.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Loose Ends&lt;/b&gt;&lt;/p&gt;&lt;p&gt;There are some minor details left. The first time you use any regex, a &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L102"&gt;lot&lt;/a&gt; &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L260"&gt;of&lt;/a&gt; &lt;a href="http://www.koders.com/csharp/fid14AB8BA02EE8A6DBA830F1DCC147C2B17F0B3DE0.aspx#L358"&gt;work&lt;/a&gt; goes on initializing all the &lt;a href="http://msdn.microsoft.com/en-us/library/20bw873z.aspx"&gt;character classes&lt;/a&gt; that are stored as static variables. If you just timed a single Regex, your numbers would be highly skewed by this process.&lt;/p&gt;&lt;p&gt;Another common issue is whether you should use the RegexOptions.Compiled flag. Compiling is handled by the &lt;a href="http://www.koders.com/csharp/fid7CC2751EC539A3CCF3A96A3D82E38D6E6D7B79F3.aspx"&gt;RegexCompiler&lt;/a&gt; class. The interesting aspects of the IL code generation is handled exactly like the interpreter, as indicated by &lt;a href="http://www.koders.com/csharp/fid7CC2751EC539A3CCF3A96A3D82E38D6E6D7B79F3.aspx"&gt;this comment&lt;/a&gt;: &lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,128,0)"&gt;/* &lt;br /&gt; * The main translation function. It translates the logic for a single opcode at &lt;br /&gt; * the current position. The structure of this function exactly mirrors &lt;br /&gt; * the structure of the inner loop of RegexInterpreter.Go(). &lt;br /&gt; * &lt;br /&gt; * The C# code from RegexInterpreter.Go() that corresponds to each case is &lt;br /&gt; * included as a comment. &lt;br /&gt; * &lt;br /&gt; * Note that since we're generating code, we can collapse many cases that are &lt;br /&gt; * dealt with one-at-a-time in RegexIntepreter. We can also unroll loops that &lt;br /&gt; * iterate over constant strings or sets. &lt;br /&gt; */&lt;/span&gt;&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;We can see that there &lt;em&gt;is&lt;/em&gt; some optimization in the generated code. The down side is that we have to generate all the code regardless of if we use all of it or not. The interpreter only uses what it needs. Additionally, unless we use &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.compiletoassembly.aspx"&gt;Regex.CompileToAssembly&lt;/a&gt; to save the compiled code to a DLL, we'll end up doing the entire process of creating the parse tree, RegexCode, and code generation at runtime.&lt;/p&gt;&lt;p&gt;Thus, for most cases, it seems that RegexOptions.Compiled isn't worth the effort. But it's good to keep in mind that there are exceptions when performance is critical and your regex can benefit from it (otherwise, why have the option at all?).&lt;/p&gt;&lt;p&gt;Another option is &lt;a href="http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx"&gt;RegexOptions&lt;/a&gt;.IgnoreCase that makes everything case insensitive. The vast majority of the process stays the same. The only difference is that all instructions that compare characters will convert each &lt;a href="http://msdn.microsoft.com/en-us/library/system.char.aspx"&gt;System.Char&lt;/a&gt; to lower case, mostly using the &lt;a href="http://msdn.microsoft.com/en-us/library/xt041c19.aspx"&gt;Char.ToLower&lt;/a&gt; method. This sounds reasonable, but it's not quite perfect. For example, in Koine Greek, the word for &amp;quot;&lt;a href="http://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=G4597&amp;amp;t=NKJV"&gt;moth&lt;/a&gt;&amp;quot; goes from uppercase to lowercase like this: &lt;/p&gt;&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Sbwmu25wcUI/AAAAAAAABK4/Of_LjLHbejs/s1600-h/mothUpperAndLower.png"&gt;&lt;img src="http://2.bp.blogspot.com/_Zfbv3mHcYrc/Sbwmu25wcUI/AAAAAAAABK4/Of_LjLHbejs/s400/mothUpperAndLower.png" /&gt;&lt;/a&gt; &lt;/p&gt;&lt;p&gt;That is, in Greek, when a &amp;quot;sigma&amp;quot; (&amp;#931;) appears in lowercase at the end of a word, it uses a &lt;a href="http://blogs.msdn.com/michkap/archive/2005/05/26/421987.aspx"&gt;different letter&lt;/a&gt; (&amp;#962;) than if it appeared anywhere else (&amp;#963;). RegexOptions.IgnoreCase can't handle cases that need more context than a single &lt;a href="http://msdn.microsoft.com/en-us/library/system.char.aspx"&gt;System.Char&lt;/a&gt; even though the string comparison functions can handle this. Consider this example:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,0,255)"&gt;string&lt;/span&gt; mothLower = &lt;span style="color: rgb(0,96,128)"&gt;&amp;quot;&amp;#963;&amp;#8053;&amp;#962;&amp;quot;&lt;/span&gt;; &lt;br /&gt;&lt;span style="color: rgb(0,0,255)"&gt;string&lt;/span&gt; mothUpper = mothLower.ToUpper(); &lt;span style="color: rgb(0,128,0)"&gt;// &amp;quot;&amp;#931;&amp;#8139;&amp;#931;&amp;quot;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0,0,255)"&gt;bool&lt;/span&gt; stringsAreEqualIgnoreCase = mothUpper.Equals(mothLower, &lt;span style="color: rgb(43,145,175)"&gt;StringComparison&lt;/span&gt;.CurrentCultureIgnoreCase);&amp;#160; &lt;span style="color: rgb(0,128,0)"&gt;// true&lt;/span&gt; &lt;br /&gt;&lt;span style="color: rgb(0,0,255)"&gt;bool&lt;/span&gt; stringsAreEqualRegex = &lt;span style="color: rgb(43,145,175)"&gt;Regex&lt;/span&gt;.IsMatch(mothLower, mothUpper, &lt;span style="color: rgb(43,145,175)"&gt;RegexOptions&lt;/span&gt;.IgnoreCase); &lt;span style="color: rgb(0,128,0)"&gt;// false&lt;/span&gt;&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;This also means that .NET's Regex won't do well with characters outside the &lt;a href="http://en.wikipedia.org/wiki/Basic_Multilingual_Plane"&gt;Basic Multilingual Plane&lt;/a&gt; that need to be represented by more than one &lt;a href="http://msdn.microsoft.com/en-us/library/system.char.aspx"&gt;System.Char&lt;/a&gt; as a &amp;quot;&lt;a href="http://msdn.microsoft.com/en-us/library/8k5611at.aspx"&gt;surrogate pair&lt;/a&gt;.&amp;quot;&lt;/p&gt;&lt;p&gt;I bring all of these &amp;quot;cases&amp;quot; up because it obviously troubled one of the Regex programmers who wrote this &lt;a href="http://www.koders.com/csharp/fidC88A6970F260F6826C679E703634322F3C553827.aspx#L1859"&gt;comment&lt;/a&gt; &lt;em&gt;&lt;a href="http://www.koders.com/csharp/fidA16EF1E737BCF735FD1DE4D39E0E1AD9851FC2A7.aspx#L64"&gt;twice&lt;/a&gt;&lt;/em&gt;:&lt;/p&gt;&lt;blockquote&gt;&lt;pre&gt;&lt;span style="color: rgb(0,128,0)"&gt;// We do the ToLower character by character for consistency.&amp;#160; With surrogate chars, doing &lt;br /&gt;// a ToLower on the entire string could actually change the surrogate pair.&amp;#160; This is more correct &lt;br /&gt;// linguistically, but since Regex doesn't support surrogates, it's more important to be &lt;br /&gt;// consistent.&lt;/span&gt;&lt;/pre&gt;&lt;/blockquote&gt;&lt;p&gt;You can tell the author was fully anticipating the &lt;a href="http://code.logos.com/blog/2008/07/net_regular_expressions_and_unicode.html"&gt;bug reports&lt;/a&gt; that eventually came as a result of this decision. Unfortunately, due to the way the code is structured, changing this behavior would take a hefty overhaul of the engine and would require a massive amount of regression testing. I'm guessing this is the reason why it won't be coming in a service pack anytime soon.&lt;/p&gt;&lt;p&gt;The last interesting option that affects most of the code is RegexOptions.RightToLeft. For the most part, this affects where the searching starts and how a &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L229"&gt;bump&lt;/a&gt;&amp;quot; is applied. When the engine wants to move forward or get the characters to the &amp;quot;right&amp;quot;, it checks this option to see if it should move +1 or -1 character from the current position. It's a simple idea, but its implementation is with many &amp;quot;&lt;a href="http://www.koders.com/csharp/fidE76CE858561A50AF7A1D9030DC8F2F4D6DEF839D.aspx#L247"&gt;if(!runrtl)&lt;/a&gt;&amp;quot; statements spread throughout the code.&lt;/p&gt;&lt;p&gt;Finally, you might be interested in how &lt;a href="http://www.mono-project.com/Main_Page"&gt;Mono&lt;/a&gt;'s regular expression compares with Microsoft's. The good news is that the code &lt;a href="http://anonsvn.mono-project.com/viewvc/trunk/mcs/class/System/System.Text.RegularExpressions/"&gt;is also available&lt;/a&gt; online as well. In general, Mono's implementation is very similar. Here are some of the (minor) differences:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Mono's parse tree has a similar shape, but it uses more strongly typed classes. For example, sets such as [^\s/] are given their own class rather than encoded as a single string. &lt;/li&gt;&lt;li&gt;The Boyer-Moore prefix optimization is done in the &lt;a href="http://anonsvn.mono-project.com/viewvc/trunk/mcs/class/System/System.Text.RegularExpressions/quicksearch.cs?view=log"&gt;QuickSearch&lt;/a&gt; class. It is calculated at run-time and is only used if the search string is longer than 5 characters. &lt;/li&gt;&lt;li&gt;The regex machine doesn't have a separate string table for referencing strings like &amp;quot;http://&amp;quot;. Each character is passed in as an argument to the instruction. &lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Weighing in around 14,000 lines of code, .NET's regular expression engine takes awhile to digest. After getting over the shock of its size, it was relatively straightforward to understand. Seeing the real source code, with its occasional funny comments, provided insight that Reflector simply couldn't offer. In the end, we see that a .NET regular expression pattern is simply a compact representation for its internal RegexCode machine language. &lt;/p&gt;&lt;p&gt;This whole process has allowed me to finally connect with regular expressions and give them a splash of empathy. Seeing the horror of backtracking first hand in the debugger was enough for me to want to do everything in my power to get rid of it. Following the translation process down to the RegexCode level clued me into how my regex pattern will actually execute. Feeling the wind fly by a regex using the Boyer-Moore prefix optimization has encouraged me to do whatever I can to put string literals at the front of a pattern. &lt;/p&gt;&lt;p&gt;It's all these little things that add up to a blazingly fast regular expression.&lt;/p&gt;&lt;br /&gt;&lt;a href="http://www.dotnetkicks.com/kick/?url=http%3a%2f%2fwww.moserware.com%2f2009%2f03%2fhow-net-regular-expressions-really-work.html"&gt;&lt;img src="http://www.dotnetkicks.com/Services/Images/KickItImageGenerator.ashx?url=http%3a%2f%2fwww.moserware.com%2f2009%2f03%2fhow-net-regular-expressions-really-work.html" border="0" alt="kick it on DotNetKicks.com" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6800934446457898793-8345470441664339492?l=www.moserware.com'/&gt;&lt;/div&gt;&lt;div class="feedflare"&gt;
&lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=iJ4VR3E6csQ:2dfg0-iS5no:yIl2AUoC8zA"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?d=yIl2AUoC8zA" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=iJ4VR3E6csQ:2dfg0-iS5no:63t7Ie-LG7Y"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?d=63t7Ie-LG7Y" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=iJ4VR3E6csQ:2dfg0-iS5no:V_sGLiPBpWU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=iJ4VR3E6csQ:2dfg0-iS5no:V_sGLiPBpWU" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=iJ4VR3E6csQ:2dfg0-iS5no:gIN9vFwOqvQ"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=iJ4VR3E6csQ:2dfg0-iS5no:gIN9vFwOqvQ" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=iJ4VR3E6csQ:2dfg0-iS5no:F7zBnMyn0Lo"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=iJ4VR3E6csQ:2dfg0-iS5no:F7zBnMyn0Lo" border="0"&gt;&lt;/img&gt;&lt;/a&gt; &lt;a href="http://feeds.feedburner.com/~ff/Moserware?a=iJ4VR3E6csQ:2dfg0-iS5no:4cEx4HpKnUU"&gt;&lt;img src="http://feeds.feedburner.com/~ff/Moserware?i=iJ4VR3E6csQ:2dfg0-iS5no:4cEx4HpKnUU" border="0"&gt;&lt;/img&gt;&lt;/a&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/Moserware/~4/iJ4VR3E6csQ" height="1" width="1"/&gt;</content><link rel="replies" type="application/atom+xml" href="http://www.moserware.com/feeds/8345470441664339492/comments/default" title="Post Comments" /><link rel="replies" type="text/html" href="https://www.blogger.com/comment.g?blogID=6800934446457898793&amp;postID=8345470441664339492" title="19 Comments" /><link rel="edit" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default/8345470441664339492?v=2" /><link rel="self" type="application/atom+xml" href="http://www.blogger.com/feeds/6800934446457898793/posts/default/8345470441664339492?v=2" /><link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/Moserware/~3/iJ4VR3E6csQ/how-net-regular-expressions-really-work.html" title="How .NET Regular Expressions Really Work" /><author><name>Jeff Moser</name><uri>http://www.blogger.com/profile/16074905903060665396</uri><email>noreply@blogger.com</email><gd:extendedProperty name="OpenSocialUserId" value="08376966494433799517" /></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://3.bp.blogspot.com/_Zfbv3mHcYrc/SbxVzXPjfxI/AAAAAAAABLY/o4BBJWjNWUo/s72-c/masteringregex.jpg" height="72" width="72" /><thr:total xmlns:thr="http://purl.org/syndication/thread/1.0">19</thr:total><feedburner:origLink>http://www.moserware.com/2009/03/how-net-regular-expressions-really-work.html</feedburner:origLink></entry></feed>
