<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:thr="http://purl.org/syndication/thread/1.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-9089443006312604961</atom:id><lastBuildDate>Tue, 22 Jun 2010 12:25:32 +0000</lastBuildDate><title>Machine Learning for Computer Security.</title><description>&amp;gt;&amp;gt; Good and bad times with machine learning and security research.  </description><link>http://blog.mlsec.org/</link><managingEditor>noreply@blogger.com (Konrad Rieck)</managingEditor><generator>Blogger</generator><openSearch:totalResults>33</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/mlsec" /><feedburner:info uri="mlsec" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-3599348974489579426</guid><pubDate>Mon, 14 Jun 2010 06:17:00 +0000</pubDate><atom:updated>2010-06-14T08:31:30.122+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><title>Second Call for Papers: EC2ND 2010.</title><description>&lt;i&gt;Only three weeks to go&lt;/i&gt;: The sixth &lt;a href="http://2010.ec2nd.org"&gt;European Conference on Computer Network Defense&lt;/a&gt; (EC2ND) invites submissions presenting novel ideas in the areas of network defense, intrusion detection and systems security. 
We specifically encourage submissions presenting work at an early stage, as the conference is intended to act as a discussion forum for innovative security research. &lt;br /&gt;&lt;br /&gt;Paper submission deadline: &lt;b&gt;July 2, 2010&lt;/b&gt;&lt;br /&gt;Paper acceptance or rejection: &lt;b&gt;August 6, 2010&lt;/b&gt;&lt;br /&gt;Conference dates: &lt;b&gt;October 28-29, 2010&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;More information on EC2ND 2010 and paper submission is available at the &lt;a href="http://2010.ec2nd.org"&gt;conference website&lt;/a&gt;. You can also follow the conference on Twitter: &lt;a href="http://www.twitter.com/ec2nd"&gt;ec2nd&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-3599348974489579426?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/IU7qgzHFZ20" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/IU7qgzHFZ20/second-call-for-papers-ec2nd-2010.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2010/06/second-call-for-papers-ec2nd-2010.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-3878328979131416588</guid><pubDate>Thu, 22 Apr 2010 17:20:00 +0000</pubDate><atom:updated>2010-04-22T19:15:17.427+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>TokDoc: The Token Doctor.</title><description>Detecting and preventing network intrusions is a basic task in computer security. 
Thus, it is no surprise that several security products exist that are capable of blocking network attacks, though with moderate success and restricted to a database of known attack patterns. For a security researcher, two issues with these systems are unsatisfactory. First, current intrusion prevention systems rely on an up-to-date database of attack signatures. Novel and unknown attacks will likely go undetected. Second, network traffic is either passed or completely blocked, often with fatal consequences such as cutting off benign communication.  &lt;br /&gt;&lt;br /&gt;In recent years, security research has focused on addressing the first issue by extending intrusion detection systems with techniques from statistics and machine learning. The resulting systems prove effective in many scenarios. Still, they build on a binary pass-or-block decision for every analyzed event. What if we question this rigid decision making?  Together with Tammo Krueger, Christian Gehl and Pavel Laskov, we have dared to go one step beyond. In a recent paper, we propose a web-application firewall that is not only capable of detecting network attacks, but also provides means to "heal" abnormal content to some degree. We call it the &lt;i&gt;Token Doctor&lt;/i&gt;. &lt;br /&gt;&lt;br /&gt;The Token Doctor, or TokDoc for short, acts as a reverse proxy and inspects incoming HTTP requests. Each request is parsed into its individual tokens, such as the URL, parameters, headers and so on. Each token is then analyzed individually using anomaly detection techniques. If an anomaly is spotted in a token, a fine-grained process is launched. Some tokens, such as usernames and cookie values, cannot be corrected; hence they are simply dropped from the request, as in usual prevention systems. Other tokens, however, are "healed" by replacing them with benign content. This replacement is automatically selected by determining similar tokens in a pool of benign HTTP requests. 
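&lt;br /&gt;&lt;br /&gt;To make the per-token processing more tangible, here is a minimal Python sketch of the drop-or-heal idea. It is not the TokDoc implementation: the benign pool, the set of drop-only tokens, the string-similarity detector and the threshold of 0.5 are all assumptions made up for illustration, and the real system relies on trained anomaly detectors rather than this toy check.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, parse_qsl

# Hypothetical pool of benign values per token (stands in for training data).
BENIGN_POOL = {
    "path": ["/index.php", "/search.php"],
    "lang": ["en", "de", "fr"],
    "user": ["alice", "bob"],
}
# Tokens assumed to be uncorrectable; they are dropped instead of healed.
DROP_ONLY = {"user"}


def tokenize(url):
    """Split a request URL into named tokens: the path plus each query parameter."""
    parts = urlsplit(url)
    tokens = {"path": parts.path}
    tokens.update(parse_qsl(parts.query, keep_blank_values=True))
    return tokens


def similarity(a, b):
    """String similarity in [0, 1] as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def is_anomalous(name, value, threshold=0.5):
    """Toy detector: anomalous if the value resembles no benign value of its token."""
    pool = BENIGN_POOL.get(name, [])
    return not any(similarity(value, known) >= threshold for known in pool)


def heal(url):
    """Return the request tokens after dropping or replacing anomalous ones."""
    healed = {}
    for name, value in tokenize(url).items():
        if not is_anomalous(name, value):
            healed[name] = value  # benign token passes unchanged
        elif name in DROP_ONLY or name not in BENIGN_POOL:
            continue  # uncorrectable or unknown token: drop it
        else:
            # Heal: substitute the most similar benign value from the pool.
            healed[name] = max(BENIGN_POOL[name], key=lambda known: similarity(value, known))
    return healed
```

In this sketch, a request whose lang parameter carries an injected shell command keeps its path and user tokens, while lang is replaced by the closest benign value from the pool. An anomalous user token is removed entirely.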
&lt;br /&gt;&lt;br /&gt;The "healing" implemented in TokDok is not guaranteed to succeed. Frankly speaking, automatic amending of network data is a controversial idea and one can image a lot of things going wrong. Nevertheless, our experiments demonstrate the opposite. TokDoc provides an excellent detection accuracy while significantly reducing false positives in comparison to state-of-the-art methods. For example, several benign requests, which would been dropped by a regular systems due to minor irregularities, are slightly amended without effects on functionality. Overall, there is room for discussion: Either stick to a rigid decision with painful blocking or choose a fuzzy amending with lots of promises but no guarantees. I don't know.&lt;br /&gt;&lt;br /&gt;A corresponding paper has been published at the 24th ACM Symposium on Applied Computing (SAC) this year.  Its abstract is here: &lt;blockquote&gt; The growing amount of web-based attacks poses a severe threat to the security of web applications. Signature-based detection techniques increasingly fail to cope with the variety and complexity of novel attack instances.  As a remedy, we introduce a protocol-aware reverse HTTP proxy TokDoc (the token doctor), which intercepts requests and decides on a per-token basis whether a token requires automatic "healing".  In particular, we propose an intelligent mangling technique, which, based on the decision of previously trained anomaly detectors, replaces suspicious parts in requests by benign data the system has seen in the past.  Evaluation of our system in terms of accuracy is performed on two real-world data sets and a large variety of recent attacks.  In comparison to state-of-the-art anomaly detectors, TokDoc is not only capable of detecting most attacks, but also significantly outperforms the other methods in terms of false positives.  
Runtime measurements show that our implementation can be deployed as an inline intrusion prevention system.&lt;/blockquote&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2010b-sac.pdf"&gt;TokDoc: A Self-Healing Web Application Firewall&lt;/a&gt;.  Tammo Krueger, Christian Gehl, Konrad Rieck and Pavel Laskov.  &lt;i&gt;Proc.  of 25th ACM Symposium on Applied Computing (SAC)&lt;/i&gt;, 1846-1853, March 2010.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-3878328979131416588?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/0mudF2o0XdM" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/0mudF2o0XdM/tokdoc-token-doctor.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2010/04/tokdoc-token-doctor.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-1140856178647804106</guid><pubDate>Sat, 20 Mar 2010 11:48:00 +0000</pubDate><atom:updated>2010-03-20T13:02:48.219+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><title>Call for Papers: SICHERHEIT 2010</title><description>I am happy to serve on the program committee of  the conference "&lt;a href="http://www.sicherheit2010.de"&gt;Sicherheit, Schutz und Zuverlässigkeit&lt;/a&gt;" (SICHERHEIT) which  will take place in October 2010 in Berlin. SICHERHEIT is the German forum for experts from academia and industry to discuss aspects of &lt;i&gt;safety&lt;/i&gt; (protection from catastrophic events of technical systems) and &lt;i&gt;security&lt;/i&gt; (protection of confidentiality and integrity of information in technical systems). 
&lt;br /&gt;&lt;br /&gt;The organizers seek submissions from the broad range of safety and security, for example on the following topics:&lt;ul&gt;&lt;li&gt; biometrics, privacy, data protection &lt;br /&gt;&lt;li&gt; e-commerce, e-government &lt;br /&gt;&lt;li&gt; certification of safe and secure systems &lt;br /&gt;&lt;li&gt; cryptography, digital signatures, steganography &lt;br /&gt;&lt;li&gt; development and maintenance of safe and secure systems &lt;br /&gt;&lt;li&gt; formal methods for safe and secure systems &lt;br /&gt;&lt;li&gt; management of safe and secure systems &lt;br /&gt;&lt;li&gt; management of information security &lt;br /&gt;&lt;li&gt; network security &lt;br /&gt;&lt;li&gt; reactive security &lt;br /&gt;&lt;li&gt; reliability and availability &lt;br /&gt;&lt;/ul&gt;The conference language is English and SICHERHEIT welcomes participants and submissions from non-German-speaking countries. More information can be found at &lt;a href="http://www.sicherheit2010.de"&gt;www.sicherheit2010.de&lt;/a&gt;. The deadline for paper submissions is &lt;b&gt;March 29th&lt;/b&gt;. 
So hurry up!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1140856178647804106?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/PFYYu4bi4os" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/PFYYu4bi4os/call-for-papers-sicherheit-2010.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2010/03/call-for-papers-sicherheit-2010.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-2022077086703975292</guid><pubDate>Wed, 03 Mar 2010 18:50:00 +0000</pubDate><atom:updated>2010-03-03T19:50:07.369+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">treeology</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Efficient Machine Learning with Trees.</title><description>Trees are a natural and intuitive representation of data in many domains of computer science. Although ubiquitous, analysis of tree data is far from trivial. Most techniques of machine learning are confined to operate in vector spaces and lack the ability to process  trees.  A solution to this problem is provided by &lt;i&gt;tree kernels&lt;/i&gt; which allow for application of kernel-based learning methods to tree data. Unfortunately, the run-time complexity of tree kernels is inherently quadratic in the number of nodes. While small trees can be processed with minor overhead, learning with larger trees gets intractable. For example, the computation of a regular kernel for two parse trees of HTML documents comprising 10,000 nodes each, requires about 1 Gigabyte of memory and takes over 100 seconds on a recent computer system. 
Given that kernel computations are performed millions of times in large-scale learning, it is evident that regular tree kernels are an inappropriate choice in many learning tasks.&lt;br /&gt;&lt;br /&gt;As a result, we have often abstained from using tree data in our research. However, we recently found the time to address this problem and devised &lt;i&gt;approximate tree kernels&lt;/i&gt;. Instead of fiddling with algorithmic issues and implementations, we propose to approximate the computation of kernel functions for trees. To this end, we narrow the kernel computation to a sparse subset of subtrees rooted at relevant symbols and thereby avoid considering all possible subtrees as in regular tree kernels. This sparse subset is automatically selected with respect to a given learning task, such that the expressiveness of the approximate kernel is preserved while its run-time is reduced. Though simple in design, this approximation enables speed-ups of up to &lt;i&gt;three orders of magnitude&lt;/i&gt; and, for the first time, makes it possible to efficiently compare HTML documents for detection of Web spam.  &lt;br /&gt;&lt;br /&gt;The concept of approximate tree kernels as well as several applications are described in a recent article published in the &lt;a href="http://www.jmlr.org"&gt;Journal of Machine Learning Research&lt;/a&gt;. The abstract for the article is given below: &lt;blockquote&gt;Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. 
The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space.&lt;br /&gt;&lt;/blockquote&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2010-jmlr.pdf"&gt;Approximate Tree Kernels&lt;/a&gt;. Konrad Rieck, Tammo Krueger, Ulf Brefeld and Klaus-Robert Müller. &lt;i&gt;Journal of Machine Learning Research (JMLR)&lt;/i&gt;, 11(Feb):555−580, Microtome, 2010.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-2022077086703975292?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/i_6JOKCUKcQ" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/i_6JOKCUKcQ/efficient-machine-learning-with-trees.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2010/03/efficient-machine-learning-with-trees.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-6105030982503682672</guid><pubDate>Fri, 19 Feb 2010 09:08:00 +0000</pubDate><atom:updated>2010-02-19T10:24:31.717+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">malware analysis</category><category domain="http://www.blogger.com/atom/ns#">computer security</category><category 
domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Call for Papers: EC2ND 2010.</title><description>As program chair, I am happy to announce the sixth &lt;a href="http://2010.ec2nd.org"&gt;European Conference on Computer Network Defense&lt;/a&gt; (EC2ND 2010) in Berlin, Germany. The conference brings together researchers from academia and industry within Europe and beyond to present and discuss current topics in applied network and systems security. &lt;br /&gt;&lt;br /&gt;EC2ND 2010 invites submissions presenting novel ideas in the areas of network defense, intrusion detection and systems security. Topics for submission include but are not limited to:&lt;ul&gt;&lt;li&gt;Intrusion Detection&lt;br /&gt;&lt;li&gt;Malicious Software&lt;br /&gt;&lt;li&gt;Web Security&lt;br /&gt;&lt;li&gt;Security Policy&lt;br /&gt;&lt;li&gt;Peer-to-Peer and Grid Security&lt;br /&gt;&lt;li&gt;Wireless and Mobile Security&lt;br /&gt;&lt;li&gt;Network Forensics&lt;br /&gt;&lt;li&gt;Network Discovery and Mapping&lt;br /&gt;&lt;li&gt;Incident Response and Management&lt;br /&gt;&lt;li&gt;Privacy Protection&lt;br /&gt;&lt;li&gt;Cryptography&lt;br /&gt;&lt;li&gt;Legal and Ethical Issues&lt;br /&gt;&lt;/ul&gt;A detailed call for papers is available &lt;a href="http://2010.ec2nd.org/cfp/"&gt;here&lt;/a&gt;. 
Note the following important dates: &lt;ul&gt;&lt;li&gt;Paper submission deadline: July 2, 2010&lt;br /&gt;&lt;li&gt;Paper acceptance or rejection: August 6, 2010&lt;br /&gt;&lt;li&gt;Final paper camera ready copy:  August 13, 2010&lt;br /&gt;&lt;li&gt;Conference dates: October 28–29, 2010&lt;br /&gt;&lt;/ul&gt;I am looking forward to interesting contributions and an awesome conference.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-6105030982503682672?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/HaKRaONoMk4" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/HaKRaONoMk4/call-for-papers-ec2nd-2010.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2010/02/call-for-papers-ec2nd-2010.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-4100074927038398083</guid><pubDate>Wed, 30 Dec 2009 11:39:00 +0000</pubDate><atom:updated>2009-12-30T13:56:08.787+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">malware analysis</category><category domain="http://www.blogger.com/atom/ns#">clustering</category><category domain="http://www.blogger.com/atom/ns#">computer security</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Malheur is out!</title><description>After almost a year of work, I am proud to announce the first public release of &lt;a href="http://www.mlsec.org/malheur"&gt;Malheur&lt;/a&gt;&amp;mdash;a tool for automatic analysis of program behavior recorded from malicious software (&lt;a href="http://www.mlsec.org/malheur"&gt;www.mlsec.org/malheur&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Malicious software (malware) is one of the major threats in the Internet today and millions of hosts are currently 
infected with malware programs, such as computer worms, backdoors and trojan horses. The sheer amount of malware renders manual analysis of malicious files impossible and, even worse, automatic inspection of file content is strongly obstructed by obfuscation techniques. An alternative for efficiently crafting new defenses against malware is &lt;em&gt;behavior-based analysis&lt;/em&gt;: Malware programs are collected in the wild and executed in a sandbox environment, where their behavior is monitored. The execution of each binary results in a report of monitored behavior which can be used to characterize and ultimately defend against malicious software.&lt;br /&gt;&lt;br /&gt;As the first publicly available tool, Malheur analyzes program behavior of malicious software and enables automatic discovery and discrimination of novel variants. This ability is rooted in well-known concepts of machine learning. Discovery of novel malware classes resembles a clustering problem and discrimination between classes matches a classification task. Both techniques are implemented in Malheur with effectiveness and efficiency in mind. Instead of fiddling with esoteric learning concepts, Malheur resorts to basic methods, such as linkage clustering and nearest-prototype classification, which are efficiently implemented by means of parallel programming. Expressive access to recorded behavior is realized by embedding reports in a vector space spanned by short behavioral patterns, similar in spirit to bag-of-words models.&lt;br /&gt;&lt;br /&gt;Malheur is a joint effort of &lt;a href="http://www.ml.tu-berlin.de/"&gt;Berlin Institute of Technology&lt;/a&gt; and &lt;a href="http://pi1.informatik.uni-mannheim.de/"&gt;University of Mannheim&lt;/a&gt;. 
The merits of Malheur along with an empirical evaluation using real malware are detailed in a technical report:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.mlsec.org/malheur/docs/malheur-tr.pdf"&gt;Automatic Analysis of Malware Behavior using Machine Learning&lt;/a&gt;. Konrad Rieck, Philipp Trinius, Carsten Willems and Thorsten Holz. &lt;i&gt;Technical Report 18-209&lt;/i&gt;, Berlin Institute of Technology, 2009.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-4100074927038398083?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/ZYtofYjvG4o" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/ZYtofYjvG4o/malheur-is-out.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>2</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/12/malheur-is-out.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-3435387268158176160</guid><pubDate>Thu, 26 Nov 2009 08:17:00 +0000</pubDate><atom:updated>2009-11-26T09:51:49.337+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">generic research</category><title>My Thesis as a Paperback.</title><description>It may sound a little odd, but when I submitted my thesis to the library of my university, I was missing something. Back then, all I had to do was to send in some PDF file and fill out some form. Had the outcome of several months of hard work been yet another PDF document? 
&lt;br /&gt;&lt;a href="http://www.lulu.com/commerce/index.php?fBuyContent=7817661"&gt;&lt;img width="150" src="http://user.cs.tu-berlin.de/~rieck/img/diss-cover.png" border="0" align="right"&gt;&lt;/a&gt;&lt;br /&gt;Recently, I started to alleviate this strange pain by publishing my thesis as a print-on-demand book at &lt;a href="http://www.lulu.com/commerce/index.php?fBuyContent=7817661"&gt;lulu.com&lt;/a&gt;. The process is pretty straightforward and, given that most of us work with latex, took only a couple of hours to adapt the sources. Needless to say, that I felt much better, once I held the first paperback version of the thesis in my hands. And, if you are truly interested in reading this work, you can also grab a paperback copy for only &lt;a href="http://www.lulu.com/commerce/index.php?fBuyContent=7817661"&gt;9.99€&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Of course, if you aim for true scientific glory, it might be more appropriate to hunt for a renowned publisher, convince the editors, struggle with presentation and content requirements and finally receive a real book at about 99€.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-3435387268158176160?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/nDp4aXriUT8" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/nDp4aXriUT8/my-thesis-as-paperback.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/11/my-thesis-as-paperback.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-7043052223107284556</guid><pubDate>Sun, 15 Nov 2009 12:47:00 +0000</pubDate><atom:updated>2009-11-15T14:41:02.989+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><category 
domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Visualization of Payload-based Anomaly Detection.</title><description>Two paradigms for detection of attacks against computer systems have been widely studied in security research: first, misuse detection, which aims at identifying known patterns of misuse, and, second, anomaly detection, which tries to detect deviations from normal usage. While both concepts have their individual pros and cons, only one of the two paradigms, namely misuse detection, has made its way into regular security products. For example, almost all anti-virus scanners and intrusion detection systems employ a database of known misuse patterns for spotting security problems. Although successful on the market, misuse detection inherently fails to protect from novel threats, such as zero-day exploits. In turn, anomaly detection methods provide means to identify unknown threats with high precision and low run-time overhead (see for instance work by &lt;a href="http://cs.gmu.edu/~astavrou/research/drift_raid_09.pdf"&gt;Cretu&lt;/a&gt;, &lt;a href="http://www.i-pi.com/~ingham/pubs/raid2007-revised.pdf"&gt;Ingham&lt;/a&gt;, &lt;a href="http://roberto.perdisci.com/projects/mcpad"&gt;Perdisci&lt;/a&gt; or &lt;a href="http://user.cs.tu-berlin.de/~rieck/pubs.html"&gt;myself&lt;/a&gt;). So, why is there still a lack of acceptance for anomaly detection in practice?&lt;br /&gt;&lt;br /&gt;One key problem with most anomaly detection approaches is their inability to explain decisions. The detection process resembles a black box and security operators are required to thoroughly analyze context information to assess the actual cause of a reported anomaly. We have addressed this problem in a recent paper published at the &lt;i&gt;European Conference on Computer and Network Defense&lt;/i&gt; (&lt;a href="http://2009.ec2nd.org/"&gt;EC2ND&lt;/a&gt;). 
In particular, we present visualization techniques suitable for explaining the decisions of payload-based anomaly detection systems. Instead of digging into the technical details, I present an example here that explains the detection of a real network attack. An in-depth discussion is provided in &lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-ec2nd.pdf"&gt;our paper&lt;/a&gt;.&lt;center&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/misc/expt6_fdiff_awstats_configdir_0.png"&gt;&lt;br /&gt;&lt;img src="http://user.cs.tu-berlin.de/~rieck/misc/expt6_fdiff_awstats_configdir_0.png" width="450" align="center"&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/center&gt;The figure above shows so-called &lt;i&gt;feature differences&lt;/i&gt; of a command injection attack (awstats configdir exploit), where peaks indicate string features that strongly deviate from a model of normal network traffic. The attack exploits an insecure handling of input parameters to pass shell commands to an HTTP server. The transferred commands are encoded using standard URI percent-encoding, which replaces reserved characters with the symbol “%” followed by a hexadecimal value. For example, “%20” denotes a space symbol, “%3b” a semi-colon, “%26” an ampersand and “%27” an apostrophe. In particular, the semi-colon and ampersand are characteristic of shell commands, as they carry specific meaning in shell syntax. In the presented visualization, exactly these patterns are represented as high peaks. 
While similar string patterns can also be observed in legitimate traffic, the high differences in the figure clearly indicate an anomalous activity involving escaped shell commands and explain the reason for reporting an anomaly.&lt;center&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/misc/expt6_color_awstats_configdir_0.png"&gt;&lt;br /&gt;&lt;img src="http://user.cs.tu-berlin.de/~rieck/misc/expt6_color_awstats_configdir_0.png" width="450" align="center"&gt;&lt;br /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/center&gt;Another visualization technique, &lt;i&gt;feature shading&lt;/i&gt;, is presented in the second figure, where the payload of a reported anomaly is overlaid with color reflecting the individual deviation of each byte from a model of normal network traffic. The URI of the command injection attack is flagged as anomalous by dark shading, thus indicating the presence of abnormal strings. The part following the URI, however, is marked as a normal region, as it mainly contains frequent HTTP patterns, such as “Mozilla” and “Googlebot”. This example demonstrates the ability of feature shading to emphasize anomalous contents in network payloads, while also indicating benign regions and patterns.&lt;br /&gt;&lt;br /&gt;By visualizing a “colorful” network payload, a security operator is able to quickly identify relevant and malicious content in data, eventually enabling effective countermeasures. Consequently, the decisions made by a payload-based detection system, so far opaque to a security operator, can now be visually explained, such that one can benefit from early detection of novel attacks as well as an explainable detection process.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-ec2nd.pdf"&gt;Visualization and Explanation of Payload-Based Anomaly Detection&lt;/a&gt;. Konrad Rieck and Pavel Laskov. 
&lt;i&gt;Proceedings of 5th European Conference on Computer and Network Defense (EC2ND)&lt;/i&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-7043052223107284556?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/BCm1J9s5Cnk" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/BCm1J9s5Cnk/visualization-of-payload-based-anomaly.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/11/visualization-of-payload-based-anomaly.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-9217042298341106943</guid><pubDate>Thu, 22 Oct 2009 09:48:00 +0000</pubDate><atom:updated>2009-10-22T12:40:21.714+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">malware analysis</category><category domain="http://www.blogger.com/atom/ns#">computer security</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Detecting the "Phoning Home" of Malicious Software.</title><description>Malicious software poses a severe threat to the security of computer systems. Whether you download a file, plug in the USB stick of a colleague or simply surf the Web, your computer is always at risk of being compromised by malicious software, such as computer worms, backdoors or Trojan horses. Once the evil has infiltrated your system, it usually initiates a process referred to as "phoning home": The malicious software contacts its author and hands over control of your computer to him, for example for sending spam messages or conducting a distributed flooding attack. 
Unfortunately, regular security tools, such as anti-virus scanners, increasingly fail to cover the many infection vectors of malicious software, and thus users are often left alone with systems "phoning home" to bad people.&lt;br /&gt;&lt;br /&gt;In our latest research (to be published at the &lt;a href="http://www.acm.org/conferences/sac/sac2010/"&gt;25th ACM Symposium on Applied Computing&lt;/a&gt;) we address this problem and introduce &lt;i&gt;Botzilla&lt;/i&gt;, a method for automatically detecting the "phoning home" of malicious software. Botzilla operates by first collecting malicious software in the wild using honeypots. The malicious software is then repeatedly executed in a controlled environment and its communication is recorded, similar to a rat in a lab. Invariant communication patterns, such as byte strings used for handshaking and remote control, are extracted and assembled into detection signatures using a naive Bayes classification scheme. As a result, Botzilla is able to automatically generate signatures for malicious software within minutes, making it possible to counteract the propagation of evil in the first round. An abstract for this work is here: &lt;blockquote&gt; Hosts infected with malicious software, so called malware, are ubiquitous in today's computer networks. The means whereby malware can infiltrate a network are manifold and range from exploiting of software vulnerabilities to tricking a user into executing malicious code. Monitoring and detection of all possible infection vectors is intractable in practice. Hence, we approach the problem of detecting malicious software at a later point when it initiates contact with its maintainer; a process referred to as "phoning home".  In particular, we introduce Botzilla, a method for detection of malware communication, which proceeds by repetitively recording network traffic of malware in a controlled environment and generating network signatures from invariant content patterns. 
Experiments conducted at a large university network demonstrate the ability of Botzilla to accurately identify malware communication in network traffic with very low false-positive rates.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2010a-sac.pdf"&gt;Botzilla: Detecting the "Phoning Home" of Malicious Software.&lt;/a&gt; Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz and Pavel Laskov. &lt;i&gt;Proceedings of 25th ACM Symposium on Applied Computing (SAC)&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;The work on Botzilla is a small yet successful effort of &lt;a href="http://www.tu-berlin.de"&gt;Berlin Institute of Technology&lt;/a&gt;, &lt;a href="http://www.first.fraunhofer.de"&gt;Fraunhofer FIRST&lt;/a&gt;, &lt;a href="http://www.uni-erlangen.de"&gt;University of Erlangen&lt;/a&gt;, &lt;a href="http://www.tu-wien.at"&gt;Technical University Vienna&lt;/a&gt; and &lt;a href="http://www.uni-tuebingen.de"&gt;University of Tübingen&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-9217042298341106943?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/uurfhXkWz4c" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/uurfhXkWz4c/detecting-phoning-home-of-malicious.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/10/detecting-phoning-home-of-malicious.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-8189484488074339260</guid><pubDate>Wed, 26 Aug 2009 13:00:00 +0000</pubDate><atom:updated>2009-08-26T15:43:47.477+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">active learning</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category 
domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Active Learning for Network Intrusion Detection.</title><description>Two paradigms for applying machine learning to security are prevalent: first, &lt;a href="http://en.wikipedia.org/wiki/Supervised_learning"&gt;supervised learning&lt;/a&gt; as employed in spam filtering and, second, &lt;a href="http://en.wikipedia.org/wiki/Unsupervised_learning"&gt;unsupervised learning&lt;/a&gt; as applied to network anomaly detection. Supervised learning methods construct models for data using label information attached to each data object. For example, email messages tagged as spam and non-spam are used to learn a discriminative model for spam filtering. In contrast, unsupervised learning methods operate on the given data only and do not make use of label information. For example, models for detecting anomalous network payloads are usually learned from unlabeled network traffic without the need to label millions of network payloads.&lt;br /&gt;&lt;br /&gt;Besides these two paradigms, however, learning theory also provides the hybrid concept of &lt;a href="http://en.wikipedia.org/wiki/Semi-supervised_learning"&gt;semi-supervised learning&lt;/a&gt;. Although technically more involved, semi-supervised methods combine the best of the classic learning paradigms: The majority of the training data can be unlabeled, whereas only a few instances need to be equipped with label information for learning. Semi-supervised methods can be more accurate than unsupervised methods without requiring large amounts of data to be labeled. Unfortunately, the security community has largely ignored semi-supervised learning and, consequently, often argues for one of the two classic learning paradigms.&lt;br /&gt;&lt;br /&gt;In a first study, we have applied the concept of semi-supervised learning to network security and devised an active learning method for network intrusion detection. 
Our method initially operates on unlabeled data, like most previous learning approaches to intrusion detection. However, the method then specifically requests label information for certain network events to improve the learning model. For example, it requests labels for points that lie close to the decision boundary, which may sharpen the detection accuracy. With only minimal labeling effort, a security operator can tune our method to particular network data and eliminate false-positive alarms that come with the majority of regular detection approaches. A paper on this work has recently been accepted at &lt;a href="http://sites.google.com/a/aisec.info/aisec-2009/"&gt;AISEC 2009&lt;/a&gt;. Here is its abstract:&lt;blockquote&gt;Anomaly detection for network intrusion detection is usually considered an unsupervised task. Prominent techniques, such as one-class support vector machines, learn a hypersphere enclosing network data, mapped to a vector space, such that points outside of the ball are considered anomalous. However, this setup ignores relevant information such as expert and background knowledge. In this paper, we rephrase anomaly detection as an active learning task. We propose an effective active learning strategy to query low-confidence observations and to expand the data basis with minimal labeling effort. Our empirical evaluation on network intrusion detection shows that our approach consistently outperforms existing methods in relevant scenarios.&lt;/blockquote&gt; &lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-aisec.pdf"&gt;Active Learning for Network Intrusion Detection.&lt;/a&gt; Nico Görnitz, Marius Kloft, Konrad Rieck and Ulf Brefeld. 
&lt;i&gt;Proceedings of CCS Workshop on Security and Artificial Intelligence (AISEC)&lt;/i&gt;, October 2009.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-8189484488074339260?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/GU5NA5apWW8" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/GU5NA5apWW8/active-learning-for-network-intrusion.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/08/active-learning-for-network-intrusion.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-7194796241965864323</guid><pubDate>Tue, 18 Aug 2009 07:15:00 +0000</pubDate><atom:updated>2009-08-18T09:27:57.356+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Twimpact: An Impact Factor for Twitter.</title><description>There is one hype that I have not made friends with in the last year: &lt;a href="http://www.twitter.com"&gt;Twitter&lt;/a&gt;. Recently, my colleagues &lt;a href="http://twimpact.com/user/mikiobraun"&gt;Mikio Braun&lt;/a&gt; and &lt;a href="http://twimpact.com/user/thinkberg"&gt;Matthias Jugel&lt;/a&gt; have launched a new project named &lt;a href="http://www.twimpact.com"&gt;Twimpact&lt;/a&gt;, which aims at providing an impact factor for Twitter posts (tweets) &amp;ndash; demonstrating that Twitter might be more than just a bunch of pointless messages. I am not sure, though. The official project website is located at &lt;a href="http://www.twimpact.com"&gt;www.twimpact.com&lt;/a&gt;. 
An introduction to the underlying &lt;i&gt;twimpact factor&lt;/i&gt; is provided &lt;a href="http://twimpact.com/about"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-7194796241965864323?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/7fSMmSDDfEg" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/7fSMmSDDfEg/twimpact-impact-factor-for-twitter.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>1</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/08/twimpact-impact-factor-for-twitter.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-6041600753457204956</guid><pubDate>Tue, 14 Jul 2009 08:16:00 +0000</pubDate><atom:updated>2009-08-16T22:43:46.675+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Machine Learning for Application-Layer Intrusion Detection.</title><description>Banzai! Yesterday, I handed over the last printouts of my Ph.D. thesis to the library of the &lt;a href="http://www.tu-berlin.de"&gt;Berlin Institute of Technology&lt;/a&gt;. My doctoral defense was a thrilling yet enjoyable event, especially due to the challenging questions raised by &lt;a href="http://users.cs.dal.ca/~mchugh/"&gt;John McHugh&lt;/a&gt;, whom I cordially thank for coming to Berlin. After all, it was a great &lt;a href="http://en.wiktionary.org/wiki/summa_cum_laude"&gt;success&lt;/a&gt;. I am now heading for a long holiday to recover and to prepare for more exciting research.&lt;br /&gt; &lt;br /&gt;The official summary of the thesis follows. 
A PDF version is available &lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-diss.pdf"&gt;here&lt;/a&gt;. &lt;blockquote&gt;Misuse detection as employed in current network security products relies on the timely generation and distribution of so called attack signatures. While appropriate signatures are available for the majority of known attacks, misuse detection fails to protect from novel and unknown threats, such as zero-day exploits and worm outbreaks. The increasing diversity and polymorphism of network attacks further obstruct modeling signatures, such that there is a high demand for alternative detection techniques.&lt;br /&gt;&lt;br /&gt;We address this problem by presenting a machine learning framework for automatic detection of unknown attacks in the application layer of network communication. The framework rests on three contributions to learning-based intrusion detection: First, we propose a generic technique for embedding of network payloads in vector spaces such that numerical, sequential and syntactical features extracted from the payloads are accessible to statistical and geometric analysis. Second, we apply the concept of kernel functions to network payload data, which enables efficient learning in high-dimensional vector spaces of structured features, such as tokens, q-grams and parse trees. Third, we devise learning methods for geometric anomaly detection using kernel functions where normality of data is modeled using geometric concepts such as hyperspheres and neighborhoods. As a realization of the framework, we implement a standalone prototype called &lt;i&gt;Sandy&lt;/i&gt; applicable to live network traffic.&lt;br /&gt;&lt;br /&gt;The framework is empirically evaluated using real HTTP and FTP network traffic and over 100 attacks unknown to the applied learning methods. 
Our prototype &lt;i&gt;Sandy&lt;/i&gt; significantly outperforms the misuse detection system &lt;i&gt;Snort&lt;/i&gt; and several state-of-the-art anomaly detection methods by identifying 80-97% unknown attacks with less than 0.002% false positives&amp;mdash;a quality that, to the best of our knowledge, has not been attained in previous work on network intrusion detection. Experiments with evasion attacks and unclean training data demonstrate the robustness of our approach. Moreover, run-time experiments show the advantages of kernel functions. Although operating in a vector space with millions of dimensions, our prototype provides throughput rates between 26-60 Mbit/s on real network traffic. This performance renders our approach readily applicable for protection of medium-scale network services, such as enterprise Web services and applications.&lt;br /&gt;&lt;br /&gt;While the proposed framework does not generally eliminate the threat of network attacks, it considerably raises the bar for adversaries to get their attacks through network defenses. In combination with existing techniques such as signature-based systems, it strongly hardens today's network protection against future threats.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-diss.pdf"&gt;Machine Learning for Application-Layer Intrusion Detection.&lt;/a&gt;&lt;br /&gt;Konrad Rieck. Ph.D. 
thesis, Berlin Institute of Technology (TU Berlin), 2009.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-6041600753457204956?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/IkquaKSo2qA" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/IkquaKSo2qA/machine-learning-for-application-layer.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/07/machine-learning-for-application-layer.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-5441873276930256684</guid><pubDate>Thu, 09 Jul 2009 16:35:00 +0000</pubDate><atom:updated>2009-07-09T18:59:11.584+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Securing IMS against Novel Threats.</title><description>Recently, we published an article in the Bell Labs Technical Journal on detecting unknown threats in IMS and VoIP networks. The article is the outcome of a fruitful cooperation between &lt;a href="http://www.first.fraunhofer.de"&gt;Fraunhofer FIRST&lt;/a&gt; and &lt;a href="http://www.alcatel-lucent.com"&gt;Alcatel-Lucent&lt;/a&gt; in Germany. The article's abstract is here: &lt;blockquote&gt;Fixed mobile convergence (FMC) based on the 3GPP IP Multimedia Subsystem (IMS) is considered one of the most important communication technologies of this decade. Yet this all-IP-based network technology brings about the growing danger of security vulnerabilities in communication and data services. Protecting IMS infrastructure servers against malicious exploits poses a major challenge due to the huge number of systems that may be affected. 
We approach this problem by proposing an architecture for an autonomous and self-sufficient monitoring and protection system for devices and infrastructure inspired by network intrusion detection techniques. The crucial feature of our system is a signature-less detection of abnormal events and zero-day attacks. These attacks may be hidden in a single message or spread across a sequence of messages. Anomalies identified at any of the network domain's ingresses can be further analyzed for discriminative patterns that can be immediately distributed to all edge nodes in the network domain.&lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2009-bltj.pdf"&gt;Securing IMS against Novel Threats.&lt;/a&gt; Stefan Wahl, Konrad Rieck, Pavel Laskov, Peter Domschitz and Klaus-Robert Müller. Bell Labs Technical Journal (BLTJ), 14(1), 243-257, John Wiley &amp; Sons, May 2009.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-5441873276930256684?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/1791_PO1F2g" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/1791_PO1F2g/securing-ims-against-novel-threats.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/07/securing-ims-against-novel-threats.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-8433763727022367960</guid><pubDate>Fri, 29 May 2009 16:05:00 +0000</pubDate><atom:updated>2009-05-29T19:08:02.549+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">generic research</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Some Tuning for Feature Extraction.</title><description>One key to efficient application of learning methods 
in practice is fast extraction of features from raw data. As an example, I am currently working on methods for automatic analysis of malware, where the behavior of a malware binary is represented in a textual report and mapped to a vector space using the frequencies of contained substrings (see &lt;a href="http://user.cs.tu-berlin.de/~rieck/docs/2008-dimva.pdf"&gt;here&lt;/a&gt;). However, thousands of binaries need to be processed and the generated report files are often huge. Consequently, the task of designing analysis methods gets tedious, as one has to wait several minutes just to load data and extract appropriate feature strings.&lt;br /&gt;&lt;br /&gt;In search of a remedy, I have experimented with &lt;a href="http://www.openmp.org"&gt;OpenMP&lt;/a&gt; (Open Multi-Processing) and &lt;a href="http://people.freebsd.org/~kientzle/libarchive/"&gt;libarchive&lt;/a&gt;, where the former is a simple API for multi-processing programming in C and the latter a library for reading and writing file archives, such as zip, tar and so on. OpenMP enables loading of data and extraction of features in parallel, whereas libarchive allows the data to be stored efficiently in compressed archives instead of directories. &lt;br /&gt;&lt;br /&gt;&lt;center&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~rieck/misc/mal_expt04.png"&gt;&lt;img src="http://user.cs.tu-berlin.de/~rieck/misc/mal_expt04.png" width="380"&gt;&lt;/a&gt;&lt;br /&gt;&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;The figure shows some preliminary run-time measurements for the extraction of feature vectors from malware reports. The application of multi-processing clearly accelerates the feature extraction, independent of the applied data format. For example, when reading from a directory, the performance doubles if two threads are used, which enables processing up to 100 files per second (note that this experiment was run on a dual-core machine). 
Surprisingly, the extraction performance is also high when using compressed archives. There is almost no difference between feature extraction from a zip/gz archive and a plain directory. Moreover, the zip/gz archive consumes only 5% of the original space and considerably reduces the amount of required storage. That's impressive. If you are dealing with loading and processing thousands of files, these tuning hacks might be an interesting option.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-8433763727022367960?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/yLneILNNihk" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/yLneILNNihk/some-tuning-for-feature-extraction.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/05/some-tuning-for-feature-extraction.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-667498829049255918</guid><pubDate>Mon, 04 May 2009 17:57:00 +0000</pubDate><atom:updated>2009-05-04T20:39:05.845+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>It's finally done.</title><description>After a long period of hard and boring work, I finally submitted my Ph.D. thesis to the computer science faculty at &lt;a href="http://www.tu-berlin.de"&gt;Berlin Institute of Technology&lt;/a&gt; (TU Berlin). 
The thesis is entitled "&lt;i&gt;Machine Learning for Application-Layer Intrusion Detection&lt;/i&gt;" and is refereed by &lt;a href="http://ml.cs.tu-berlin.de/en/klaus/index.html"&gt;Klaus-Robert Müller&lt;/a&gt;, &lt;a href="http://users.cs.dal.ca/~mchugh/"&gt;John McHugh&lt;/a&gt; and &lt;a href="http://ida.first.fraunhofer.de/~laskov/"&gt;Pavel Laskov&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;In my thesis, I tackle the problem of detecting unknown and novel attacks in the application layer of network communication and present a machine learning framework for intrusion detection. In particular, I propose a generic technique for embedding network payloads in vector spaces such that features extracted from the payloads are accessible to statistical and geometric analysis. Efficient learning in these high-dimensional vector spaces is realized using the concept of &lt;a href="https://ml01.zrz.tu-berlin.de/twiki/pub/Main/MaschinellesLernenS08/structured2.pdf"&gt;kernel functions&lt;/a&gt; defined over network payload data. Based on these functions, I derive methods for anomaly detection suitable for identifying unknown attacks, where normality of network data is modeled using geometric concepts such as hyperspheres and neighborhoods. &lt;br /&gt;&lt;br /&gt;The framework is empirically evaluated using 10 days of HTTP and FTP network traffic and over 100 real attacks unknown to the applied learning methods. A prototype of the framework outperforms related methods from &lt;a href="http://cs.fit.edu/~pkc/id/related/krugel-sac02.ps"&gt;Kruegel et al. (2002)&lt;/a&gt;, &lt;a href="http://packetstorm.ussrback.com/papers/IDS/nids/Anagram.pdf"&gt;Wang et al. (2006)&lt;/a&gt; and &lt;a href="http://www.i-pi.com/~ingham/pubs/raid2007-revised.pdf"&gt;Ingham et al. (2007)&lt;/a&gt;, identifying 80&amp;ndash;97% of unknown attacks with less than 0.002% false positives. 
Moreover, reasonable throughput rates between 20&amp;ndash;60 Mbit/s are attained, though no special hardware acceleration is used yet.&lt;br /&gt;&lt;br /&gt;As the thesis is under review, I will not provide an online version now. However, I am going to present some interesting results on visualization of detected attack payloads in later posts. Stay tuned.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-667498829049255918?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/F9CQcAHB5hM" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/F9CQcAHB5hM/its-finally-done.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/05/its-finally-done.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-7772500805876658566</guid><pubDate>Tue, 07 Apr 2009 18:06:00 +0000</pubDate><atom:updated>2009-04-07T20:55:24.092+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">generic research</category><title>Fun and Pain with ZFS.</title><description>Recently, I found the time to experiment with &lt;a href="http://opensolaris.org/os/community/zfs/"&gt;ZFS&lt;/a&gt; &amp;ndash; Sun's next-generation file system &amp;ndash; using an external USB drive. In particular, I have been playing with ZFS to test whether it eases typical tasks of machine learning research, such as loading directories containing tens of thousands of files or repeating experimental runs with large data chunks. As I am not running a Solaris system, I had to install the userland utilities and kernel modules for OSX (Leopard) available at &lt;a href="http://zfs.macosforge.org/trac/"&gt;Macforge&lt;/a&gt;. 
Note that Leopard natively supports read-only access to ZFS pools but lacks write functionality. &lt;br /&gt;&lt;br /&gt;Apart from great read access times, the first interesting feature I noticed is ZFS's ability to create hierarchical file systems on-the-fly. Instead of dumping all contents into a single volume, the hierarchical layout enables fine-grained control of different experimental data sets and allows quotas and options to be assigned individually per data set. For example, the ability to easily split data comes in handy if one enables the transparent compression in ZFS. This snippet of commands creates two file systems named &lt;code&gt;tank/dataset1&lt;/code&gt; and &lt;code&gt;tank/dataset2&lt;/code&gt;, where only the first uses automatic compression.&lt;pre&gt;% zfs create tank/dataset1&lt;br /&gt;% zfs create tank/dataset2&lt;br /&gt;% zfs set compress=on  tank/dataset1&lt;br /&gt;% zfs set compress=off tank/dataset2&lt;br /&gt;&lt;/pre&gt;Compression might not be desired when running certain experiments. Thus, it can be disabled and enabled per file system, such that archived data is stored efficiently while the current working data remains accessible without compression overhead. A nice feature for working with experimental data. The compression ratio of each file system can be queried using the following command.&lt;pre&gt;% zfs get compressratio tank&lt;br /&gt;NAME            PROPERTY         VALUE         SOURCE&lt;br /&gt;tank/dataset1   compressratio    3.14x         -&lt;br /&gt;tank/dataset2   compressratio    1.00x         -&lt;br /&gt;&lt;/pre&gt;Another interesting feature is ZFS's ability to store snapshots of file systems. Initially, a snapshot does not consume any storage, as only the differences to the original version are stored. 
If one is working with multiple copies of the same data, say one version described in a publication and a refined variant, snapshots are a great tool, as they allow one to quickly jump back and forth between different versions of data. 
One can access the content of each snapshot using the directory &lt;code&gt;.zfs&lt;/code&gt; in the root of the considered file system. Here's an example of how a snapshot is created for a given data set.&lt;pre&gt;% zfs snapshot tank/dataset1@paper&lt;br /&gt;&lt;/pre&gt;As can be seen from a call to &lt;code&gt;zfs list&lt;/code&gt;, no space is associated with the snapshot directly after creation and thus no storage is wasted.&lt;pre&gt;% zfs list&lt;br /&gt;tank/dataset1          2.38G   222G  2.38G  /Volumes/tank/dataset1&lt;br /&gt;tank/dataset1@paper        0      -  2.38G  -&lt;br /&gt;&lt;/pre&gt;This feature is really nifty if one needs to "freeze" the state of experimental data when a paper is accepted for publication while still continuing to work with the data set.&lt;br /&gt;&lt;br /&gt;Unfortunately, all my enthusiasm and great plans to work with ZFS have been thwarted by the instability of the current Leopard driver. The driver does not really handle USB devices. If the device is accidentally removed or the computer falls asleep, the ZFS module simply crashes the kernel. I even managed to corrupt the ZFS partition such that any call to &lt;code&gt;zpool status&lt;/code&gt; triggers a kernel panic. 
Game over.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-7772500805876658566?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/eGw1MrYKX-A" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/eGw1MrYKX-A/fun-and-pain-with-zfs.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/04/fun-and-pain-with-zfs.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-1352214559380938554</guid><pubDate>Sun, 25 Jan 2009 16:02:00 +0000</pubDate><atom:updated>2009-01-25T17:27:57.888+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">malware analysis</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Call for Papers: DIMVA 2009.</title><description>I am a member of the program committee of this year's conference on &lt;a href="http://www.dimva.org"&gt;Detection of Intrusions and Malware &amp; Vulnerability Assessment&lt;/a&gt; (DIMVA) in Milan, Italy. The conference invites paper submissions from the domain of computer security with focus on intrusion detection and malware research. &lt;ul&gt;&lt;li&gt; Deadline for paper submission:           February 6, 2009&lt;br /&gt;&lt;li&gt; Notification of acceptance or rejection: March   30, 2009&lt;br /&gt;&lt;li&gt; Final paper camera ready copy:           April   10, 2009&lt;br /&gt;&lt;li&gt; Conference dates:                        June/July, 2009&lt;br /&gt; &lt;/ul&gt;Contributions can be submitted as full papers (limited to 20 pages) or short papers (limited to 10 pages). A detailed "call for papers" is available &lt;a href="http://www1.gi-ev.de/fachbereiche/sicherheit/fg/sidar/dimva/dimva2009/cfp2009.pdf"&gt;here&lt;/a&gt;. 
I am looking forward to interesting contributions and an awesome DIMVA conference, as in the last five years.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1352214559380938554?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/BPhWY0CO2bE" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/BPhWY0CO2bE/call-for-papers-dimva-2009.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2009/01/call-for-papers-dimva-2009.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-7922906943023350163</guid><pubDate>Thu, 18 Dec 2008 12:55:00 +0000</pubDate><atom:updated>2008-12-30T15:25:42.410+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">clustering</category><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Unpleasant Facts about Data Clustering.</title><description>&lt;a href="http://en.wikipedia.org/wiki/Data_clustering"&gt;Data clustering&lt;/a&gt; is a popular technique of data mining and machine learning. The objective is to automatically partition a set of given objects into "meaningful" groups. Clustering is intuitive and fits a variety of problem settings, for example &lt;a href="http://www.cs.columbia.edu/ids/publications/cluster-thesis00.pdf"&gt;intrusion detection&lt;/a&gt; and &lt;a href="http://www.eecs.umich.edu/~mibailey/publications/raid07_final.pdf"&gt;malware categorization&lt;/a&gt; in computer security. However, applying clustering methods is laborious and error-prone in practice. Here are my unpleasant facts about clustering:&lt;ul&gt;&lt;li&gt; &lt;i&gt;Generic clustering is NP-hard&lt;/i&gt;. 
Formally, clustering aims at partitioning data such that a predefined quality criterion is maximized. Unfortunately, there is no generic way for determining an optimal solution in this setting except for testing all possible combinations. All practical clustering methods, such as k-means and linkage clustering, limit the search space by discarding partitionings and thus provide only an approximation to the best clustering (a local optimum). If you cluster data, you will likely obtain an approximate solution.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; &lt;i&gt;Hierarchy is no clustering&lt;/i&gt;. Linkage clustering is a form of hierarchical clustering, where the data is first mapped to a hierarchical representation and then partitioned into clusters. In practice, people often consider the hierarchical representation of data as the final clustering result. However, a hierarchy encodes an exponential number of possible partitionings and the more involved step is to flatten the hierarchy into a clustering. As a negative example consider the Wikipedia entry on &lt;a href="http://en.wikipedia.org/wiki/Data_clustering"&gt;hierarchical clustering&lt;/a&gt; which almost solely focuses on constructing hierarchies&amp;mdash;omitting details on the subsequent clustering procedure. &lt;br /&gt;&lt;br /&gt;&lt;li&gt; &lt;i&gt;Statistical consistency unclear&lt;/i&gt;. An important term from statistical learning theory is &lt;i&gt;consistency&lt;/i&gt;. In short, a learning algorithm is consistent if it converges to the true solution as more training data is provided. Surprisingly, little is known about the consistency of clustering. For instance, most clustering methods aim at partitioning provided data but not the underlying data distributions. 
That is, if you provide more data to a clustering, there is no guarantee that the solution improves (see the work of &lt;a href="http://www.kyb.mpg.de/~ule"&gt;Luxburg&lt;/a&gt; [&lt;a href="http://www.cse.ohio-state.edu/~mbelkin/papers/SC_AOS_07.pdf"&gt;1&lt;/a&gt;, &lt;a href="http://books.nips.cc/papers/files/nips20/NIPS2007_0423.pdf"&gt;2&lt;/a&gt;]). &lt;br /&gt;&lt;br /&gt;&lt;li&gt; &lt;i&gt;Model selection vs. Clustering&lt;/i&gt;. Clustering is often controlled using a model parameter that determines the granularity of the partitioning. This parameter can be the number of clusters as for k-means but may also take the form of a numeric quantity as in linkage clustering. Clearly, this parameter needs to be adapted prior to application. However, model selection contradicts the purpose of clustering. On the one hand, to validate the parameter on given data a reference partitioning needs to be available, thus no clustering is required in this case. On the other hand, on unlabeled data the parameter can never be validated directly. In practice, one resorts to model selection on reference data and application of the best model to unknown data. Consequently, this setting requires a representative reference partitioning obtained &lt;i&gt;without clustering&lt;/i&gt;. &lt;br /&gt;&lt;/ul&gt;Besides these facts, I like clustering methods and have been successfully working with several of them. 
Nevertheless, special care needs to be taken when running experiments to achieve sound results&amp;mdash;the application of clustering is more involved than it seems at first glance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-7922906943023350163?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/IRP5oIQyRPQ" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/IRP5oIQyRPQ/four-facts-about-data-clustering.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>1</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/12/four-facts-about-data-clustering.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-336590101159984107</guid><pubDate>Wed, 17 Dec 2008 16:33:00 +0000</pubDate><atom:updated>2008-12-18T13:46:36.472+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Incorporation of Application Layer Protocol Syntax into Anomaly Detection.</title><description>The majority of current intrusion detection, anti-virus and anti-malware products build on the concept of &lt;i&gt;misuse detection&lt;/i&gt;, that is, malicious activity is identified using rules and signatures of known misuse patterns. Consequently, much effort has been devoted to defining expressive and discriminative features for construction of misuse rules. In the domain of network security such features are often derived from the syntax of application layer protocols, for example to answer the questions "who did what to which data when?" 
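&lt;br /&gt;&lt;br /&gt;To make the idea of syntax-derived features concrete, here is a minimal sketch that splits an HTTP request line into labeled tokens. The function name request_tokens and the token labels are hypothetical choices for illustration and are not taken from any of the systems mentioned here; real protocol analyzers handle far more of the protocol.

```python
from urllib.parse import urlsplit, parse_qsl


def request_tokens(request_line):
    """Split an HTTP request line into (label, value) tokens."""
    # A request line has the form: METHOD TARGET VERSION
    method, target, version = request_line.split(" ")
    parts = urlsplit(target)
    tokens = [("method", method), ("path", parts.path)]
    # Every query parameter becomes its own attributed token.
    tokens += [("param:" + key, value) for key, value in parse_qsl(parts.query)]
    tokens.append(("version", version))
    return tokens


print(request_tokens("GET /index.php?user=alice HTTP/1.1"))
```

Each token answers part of the question quoted above: the method says what was done, the path and parameters say to which data.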
&lt;br /&gt;&lt;br /&gt;In contrast to misuse detection, a second strain of intrusion detection research investigates methods for anomaly detection, that is, malicious activity is identified in terms of unusual and abnormal events. Surprisingly, the use of features from protocol syntax in this setting has been considered in only a few studies, for example &lt;a href="http://www.cs.ucsb.edu/~vigna/publications/2003_kruegel_vigna_ccs03.pdf"&gt;Kruegel et al.&lt;/a&gt; and &lt;a href="http://www.cs.unm.edu/~forrest/publications/learning-DFA-Representations.pdf"&gt;Ingham et al.&lt;/a&gt; Access to syntactical features is clearly beneficial for anomaly detection, and hence some of my colleagues (Patrick and Christian) have been working on extending previous approaches to incorporate protocol syntax into a generic anomaly detection framework. &lt;br /&gt;&lt;br /&gt;Recent results of this work are presented at the &lt;a href="http://www.seclab.cs.sunysb.edu/iciss08/"&gt;International Conference on Information Systems Security&lt;/a&gt;. In particular, we introduce new features for network intrusion detection, which combine sequential features (such as &lt;a href="http://worminator.cs.columbia.edu/papers/2004/RAID4.PDF"&gt;byte frequencies&lt;/a&gt; and &lt;a href="http://ida.first.fraunhofer.de/~rieck/docs/2006-dimva.pdf"&gt;n-grams&lt;/a&gt;) with syntactical tokens. Here is the abstract of our contribution:&lt;blockquote&gt;The syntax of application layer protocols carries valuable information for network intrusion detection. Hence, the majority of modern IDS perform some form of protocol analysis to refine their signatures with application layer context. Protocol analysis, however, has been mainly used for misuse detection, which limits its application for the detection of unknown and novel attacks. In this contribution we address the issue of incorporating application layer context into anomaly-based intrusion detection. 
We extend a payload-based anomaly detection method by incorporating structural information obtained from a protocol analyzer. The basis for our extension is computation of similarity between attributed tokens derived from a protocol grammar. The enhanced anomaly detection method is evaluated in experiments on detection of web attacks, yielding an improvement of detection accuracy of 49%. While byte-level anomaly detection is sufficient for detection of buffer overflow attacks, identification of recent attacks such as SQL and PHP code injection strongly depends on the availability of application layer context. &lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-iciss.pdf"&gt;Incorporation of Application Layer Protocol Syntax into Anomaly Detection&lt;/a&gt;. Patrick Düssel, Christian Gehl, Pavel Laskov and Konrad Rieck. &lt;i&gt;Proc. of International Conference on Information Systems Security (ICISS)&lt;/i&gt;,  December 2008.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-336590101159984107?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/JlYihQsbWeQ" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/JlYihQsbWeQ/incorporation-of-application-layer.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/12/incorporation-of-application-layer.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-8416945338358920067</guid><pubDate>Wed, 03 Dec 2008 20:16:00 +0000</pubDate><atom:updated>2008-12-03T22:09:02.826+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>An Architecture for Inline Anomaly 
Detection.</title><description>I am currently finishing my doctoral thesis, thus there is almost no time for interesting activities and fun. Fortunately, I am not the only one, see for instance &lt;a href="http://www.honeyblog.org/archives/3-Old-Entries-Honeypot-Presentation.html"&gt;here&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Besides all the work, the good news is that we will present an interesting paper on combining anomaly detection and intrusion prevention at the &lt;a href="http://2008.ec2nd.org"&gt;European Conference on Computer Network Defense&lt;/a&gt; (EC2ND). Here is the abstract from our contribution:&lt;blockquote&gt;In this paper we propose an intrusion prevention system (IPS) which operates inline and is capable to detect unknown attacks using anomaly detection methods. Incorporated in the framework of a packet filter each incoming packet is analyzed and&amp;mdash;according to an internal connection state and a computed anomaly score&amp;mdash;either delivered to the production system, redirected to a special hardened system or logged to a network sink for later analysis. Run-time measurements of an actual implementation prove that the performance overhead of the system is sufficient for inline processing. Accuracy measurements on real network data yield improvements especially in the number of false positives, which are reduced by a factor of five compared to a plain anomaly detector. &lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-ec2nd.pdf"&gt;An Architecture for Inline Anomaly Detection.&lt;/a&gt; Tammo Krueger, Christian Gehl, Konrad Rieck and Pavel Laskov. &lt;i&gt;Proc. 
of European Conference on Computer Network Defense (EC2ND)&lt;/i&gt;, December 2008.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-8416945338358920067?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/wYMyoGjQ7eY" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/wYMyoGjQ7eY/architecture-for-inline-anomaly.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/12/architecture-for-inline-anomaly.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-6634487660793462211</guid><pubDate>Fri, 14 Nov 2008 07:53:00 +0000</pubDate><atom:updated>2008-12-03T22:27:12.381+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">stringology</category><title>Generalized Suffix Trees and Distinct Substrings.</title><description>This is a technical post on strings, substrings and suffix trees. This could get boring. You have been warned.&lt;br /&gt;&lt;br /&gt;Everybody knows the concept of &lt;i&gt;longest common substring&lt;/i&gt;, where a set of strings is given and the task is to determine the longest substring shared by the set. While intuitive and straightforward to implement using a &lt;a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree"&gt;generalized suffix tree&lt;/a&gt;, this setting is not robust against "noise", such as corrupted or truncated strings. 
Moreover, finding the longest common substring is useless if the considered strings derive from different data distributions, thus not necessarily sharing any substring.&lt;br /&gt;&lt;br /&gt;These shortcomings can be addressed by determining a &lt;i&gt;set of shared substrings&lt;/i&gt;, which are shared by at least &lt;i&gt;m&lt;/i&gt; strings but not necessarily all strings in the set. To restrict the amount of small shared substrings, one requires the substrings to have a minimum length. This setting has been applied in the &lt;a href="http://www.cs.berkeley.edu/~dawnsong/papers/polygraph.pdf"&gt;PolyGraph&lt;/a&gt; and &lt;a href="http://www.cs.northwestern.edu/~zli109/publication/Li-Hamsa-ssp06.pdf"&gt;Hamsa&lt;/a&gt; papers for automatic signature generation from network data. Again, an implementation is easily realized using a depth-first traversal of a &lt;a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree"&gt;generalized suffix tree&lt;/a&gt;. Unfortunately, many of the shared substrings will be prefixes and suffixes of each other. &lt;br /&gt;&lt;br /&gt;This issue is resolved by the &lt;i&gt;set of distinct shared substrings&lt;/i&gt;, where a distinct substring &lt;i&gt;x&lt;/i&gt; is not contained in any other distinct substring &lt;i&gt;y&lt;/i&gt;, unless &lt;i&gt;x&lt;/i&gt; appears in at least &lt;i&gt;m&lt;/i&gt; strings independently of &lt;i&gt;y&lt;/i&gt;. Implementing this setting using &lt;a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree"&gt;generalized suffix trees&lt;/a&gt; is a little bit more involved. Together with a colleague, I came up with this approach:  &lt;br /&gt;&lt;blockquote&gt;We start to determine interesting substrings by a depth-first traversal of a &lt;a href="http://en.wikipedia.org/wiki/Generalised_suffix_tree"&gt;generalized suffix tree&lt;/a&gt;. 
&lt;br /&gt;&lt;br /&gt;This time, however, we are looking for distinct substrings and hence cannot afford to output a substring that later turns out to be non-distinct. We thus first visit nodes which give rise to longer substrings, that is, for each node we order the child nodes according to the length of possible substrings. Consequently, we find longer substrings first. &lt;br /&gt;&lt;br /&gt;If we have determined a first substring &lt;i&gt;x&lt;/i&gt; shared by &lt;i&gt;m&lt;/i&gt; strings, we can be sure that it is distinct as no longer substrings exist. Next, we need to mark all non-distinct substrings of &lt;i&gt;x&lt;/i&gt;. We do this by traversing the suffix links and parent nodes of &lt;i&gt;x&lt;/i&gt;, thereby processing all suffixes and prefixes of &lt;i&gt;x&lt;/i&gt;. For each node we maintain a counter indicating the number of strings the corresponding substring is shared by. When processing the prefixes and suffixes of &lt;i&gt;x&lt;/i&gt;, we decrement this counter by the number of occurrences of &lt;i&gt;x&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;Finally, we continue the depth-first traversal. We do not descend to nodes with a counter smaller than &lt;i&gt;m&lt;/i&gt;, as no distinct shared substring can be determined on the corresponding path.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;To be honest, I have not thoroughly thought about the complexity of this algorithm. Yet, I expect it to be "somewhat linear" in the length of the considered strings. The output of the algorithm resembles the concept of distinct substrings proposed in &lt;a href="http://www.cs.berkeley.edu/~dawnsong/papers/polygraph.pdf"&gt;PolyGraph&lt;/a&gt;, where the authors apply a costly post-processing of shared substrings. 
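&lt;br /&gt;&lt;br /&gt;For readers who want to experiment with the definition, the following is a brute-force sketch of distinct shared substrings. It is not the suffix tree algorithm quoted above and is far from linear: it enumerates all substrings, visits longer ones first, and counts a string as supporting a candidate only if the candidate occurs there outside every previously selected substring. The function names and the sample strings are made up for illustration.

```python
def occurrences(text, pattern):
    """Yield every start index of pattern in text, overlaps included."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def distinct_shared_substrings(strings, m, min_len):
    """Brute-force sketch: substrings shared by at least m strings,
    skipping those that only occur inside an already selected one."""
    candidates = set()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                candidates.add(s[i:j])

    # Per string: (start, end) spans of the substrings selected so far.
    covered = [[] for _ in strings]
    result = []
    # Longer candidates first, ties broken alphabetically.
    for x in sorted(candidates, key=lambda c: (-len(c), c)):
        support = 0
        for s, spans in zip(strings, covered):
            # An occurrence is independent if no selected span contains it.
            independent = any(
                not any(p >= a and b >= p + len(x) for a, b in spans)
                for p in occurrences(s, x)
            )
            if independent:
                support += 1
        if support >= m:
            result.append(x)
            for s, spans in zip(strings, covered):
                spans.extend((p, p + len(x)) for p in occurrences(s, x))
    return result


requests = ["GET /index.html", "GET /index.php", "POST /index.php"]
print(distinct_shared_substrings(requests, m=2, min_len=4))
```

With these three sample strings, shorter shared pieces such as the bare path prefix are not reported, because every occurrence lies inside one of the two longer selected substrings.&lt;br /&gt;&lt;br /&gt;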
Note that the suffix tree approach described above can also be implemented using suffix and LCP arrays with the traversal techniques studied by &lt;a href="ftp://ftp.i.kyushu-u.ac.jp/pub/tr/trcs185.ps.gz "&gt;Kasai et al.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Did I mention that I like string algorithms?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-6634487660793462211?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/IZIGEtF8drs" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/IZIGEtF8drs/generalized-suffix-trees-and-distinct.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/11/generalized-suffix-trees-and-distinct.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-2836012684993065675</guid><pubDate>Thu, 16 Oct 2008 06:59:00 +0000</pubDate><atom:updated>2008-12-03T22:18:45.125+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Automatic Feature Selection for Anomaly Detection.</title><description>Network intrusion detection requires discriminative features from network traffic, which for the case of misuse detection allow for the precise formulation of signatures and rules, and which, on the other hand, enable application of learning methods for anomaly detection. However, devising a set of features is not trivial, as attacks are reflected in all sorts of different patterns and numerical measures. A good example is the work of Kruegel et al. 
(see &lt;a href="http://www.cs.ucsb.edu/~chris/research/doc/comnet05_anomaly.pdf"&gt;1&lt;/a&gt;, &lt;a href="http://www.cs.ucsb.edu/~chris/research/doc/ccs03_webanomaly.pdf"&gt;2&lt;/a&gt;, &lt;a href="http://www.cs.ucsb.edu/~chris/research/doc/2002_03.ps"&gt;3&lt;/a&gt;), in which various heterogeneous features are proposed and manually combined into an effective anomaly detection system. &lt;br /&gt;&lt;br /&gt;Recently, colleagues at our lab came up with a method for automatic selection and weighting of such features for intrusion detection. Instead of defining a weighting of different features manually, the method automatically determines the mixture which optimizes the performance of anomaly detection. This optimization is realized by incorporating the process of feature selection directly into an anomaly detection method&amp;mdash;a rather involved mathematical procedure. The paper will be presented at the &lt;a href="http://www.aisec.info/"&gt;AISec Workshop&lt;/a&gt; co-located with CCS. Following is its abstract:&lt;blockquote&gt;A frequent problem in anomaly detection is to decide among different feature sets to be used. For example, various features are known in network intrusion detection based on packet headers, content byte streams or application level protocol parsing. A method for automatic feature selection in anomaly detection is proposed which determines optimal mixture coefficients for various sets of features. The method  generalizes the &lt;a href="http://ict.ewi.tudelft.nl/~davidt/papers/ML_SVDD_04.pdf"&gt;support vector data description (SVDD)&lt;/a&gt; and can be expressed as a semi-infinite linear program that can be solved with standard techniques. The case of a single feature set can be handled as a particular case of the proposed method. 
The experimental evaluation of the new method on unsanitized HTTP data demonstrates that detectors using automatically selected features attain competitive performance, while sparing practitioners from a priori decisions on feature sets to be used. &lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://user.cs.tu-berlin.de/~brefeld/publications/aisec19-kloft.pdf"&gt;Automatic Feature Selection for Anomaly Detection&lt;/a&gt;. M. Kloft, U. Brefeld, P. Düssel, C. Gehl, and P. Laskov. &lt;i&gt;Proceedings of the First ACM Workshop on AISec&lt;/i&gt;, October 2008.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-2836012684993065675?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/yxhPXZJrmAI" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/yxhPXZJrmAI/automatic-feature-selection-for-anomaly.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/10/automatic-feature-selection-for-anomaly.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-4376139487639527430</guid><pubDate>Sun, 28 Sep 2008 06:42:00 +0000</pubDate><atom:updated>2008-12-03T22:18:51.574+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">machine learning</category><title>Approximate Kernels for Trees.</title><description>The majority of machine learning and artificial intelligence methods are designed to work on vectorial data. In practice, however, data does not always come in the form of simple vectors. For instance, analysis of HTML documents requires processing and matching tree structures. The state-of-the-art techniques for learning with trees are convolutional &lt;a href="http://en.wikipedia.org/wiki/Kernel_methods"&gt;kernel functions&lt;/a&gt;. 
These functions are effective but painfully slow on large trees. To speed up such kernels for security applications, we came up with &lt;a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-first.pdf"&gt;approximate tree kernels&lt;/a&gt;. Here is the abstract from the corresponding technical report:&lt;blockquote&gt;Convolution kernels for trees provide effective means for learning with tree-structured data, such as parse trees of natural language sentences. Unfortunately, the computation time of tree kernels is quadratic in the size of the trees as all pairs of nodes need to be compared: large trees render convolution kernels inapplicable. In this paper, we propose a simple but efficient approximation technique for tree kernels. The approximate tree kernel (ATK) accelerates computation by selecting a sparse and discriminative subset of subtrees using a linear program. The kernel allows for incorporating domain knowledge and controlling the overall  computation time through additional constraints. Experiments on applications of  natural language processing and web spam detection demonstrate the efficiency of the approximate kernels. We observe run-time improvements of two orders of magnitude while preserving the discriminative expressiveness and classification rates of regular convolution kernels.&lt;/blockquote&gt;&lt;br /&gt;&lt;a href="http://ida.first.fraunhofer.de/~rieck/docs/2008-first.pdf"&gt;Approximate Kernels for Trees.&lt;/a&gt; Konrad Rieck, Ulf Brefeld and Tammo Krueger. 
&lt;i&gt;Technical Report FIRST 5/2008, Fraunhofer Institute FIRST, September 2008.&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-4376139487639527430?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/G0q0ial6UuQ" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/G0q0ial6UuQ/approximate-kernels-for-trees.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/09/approximate-kernels-for-trees.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-1434215427339216087</guid><pubDate>Sat, 20 Sep 2008 17:19:00 +0000</pubDate><atom:updated>2008-11-14T11:14:29.306+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">intrusion detection</category><title>Low throughput? The imbalance of network traffic.</title><description>I have been recently running experiments with a "fast" learning method for anomaly detection in HTTP and FTP payloads. While the method performed well in terms of attack detection, the realized throughput was frustrating. Incoming traffic was processed at 15 mbit/s (HTTP) and 30 mbit/s (FTP) on a single CPU, including traffic normalization, TCP reassembly and anomaly detection. &lt;br /&gt;&lt;br /&gt;How useful is such a low throughput? &lt;br /&gt;&lt;br /&gt;Well, better than one might think. Depending on the considered application-layer protocol, there is often an imbalance of ingress and egress traffic. For example, the  HTTP traffic at our institute has an ingress-egress ratio of 1/50, while the FTP traffic recorded at LBNL (port 21 only) reaches a ratio of 1/6. Thus, when looking at the total throughput the considered learning method yields around 770 mbit/s (HTTP) and 200 mbit/s (FTP) throughput. 
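&lt;br /&gt;&lt;br /&gt;The arithmetic behind these totals can be sketched as follows, assuming that only ingress traffic is analyzed and that the total is simply ingress plus the corresponding egress volume. The function name is hypothetical, and the results only roughly match the rounded figures quoted above.

```python
def total_throughput(ingress_mbit, egress_per_ingress):
    """Total link throughput covered when only ingress traffic is analyzed.

    egress_per_ingress is the inverse of the ingress-egress ratio,
    for example 50 for a ratio of 1/50.
    """
    return ingress_mbit * (1 + egress_per_ingress)


http_total = total_throughput(15, 50)  # HTTP: ingress-egress ratio 1/50
ftp_total = total_throughput(30, 6)    # FTP: ingress-egress ratio 1/6
print(http_total, ftp_total)  # 765 210
```
&lt;br /&gt;&lt;br /&gt;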
A total throughput of this order is not so frustrating, given that the performance could be easily increased using multi-core systems. &lt;br /&gt;&lt;br /&gt;For other protocols, however, the ingress-egress ratio is not a small fraction and one cannot indirectly benefit from the imbalance of traffic. In conclusion, when analyzing the run-time performance of an intrusion detection method it really matters which protocol and direction is monitored for evil things.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1434215427339216087?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/qgh856vX2PU" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/qgh856vX2PU/low-throughput-imbalance-of-network.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/09/low-throughput-imbalance-of-network.html</feedburner:origLink></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-9089443006312604961.post-1704769672286661767</guid><pubDate>Sat, 30 Aug 2008 11:18:00 +0000</pubDate><atom:updated>2008-11-14T11:16:32.895+01:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">computer security</category><title>Portscanning the Internet.</title><description>I just returned from a long holiday trip. While browsing through the list of blog postings I missed during this period, I noticed an interesting article on large-scale port scanning:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://blog.thc.org/index.php?/archives/2-Port-Scanning-%20%20%20%20the-Internet.html"&gt;Portscanning the Internet&lt;/a&gt; &lt;i&gt;posted at the &lt;a href="http://blog.thc.org"&gt;THC blog&lt;/a&gt;&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;The author reports on his experience with port scanning methods, which are capable of scanning the entire Internet in a reasonable time frame. 
Besides several remarks on how to conduct such scans and how to threshold a scanning system, there is also a funny side note on a &lt;i&gt;false&lt;/i&gt; worm outbreak caused by scanning activities.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9089443006312604961-1704769672286661767?l=blog.mlsec.org' alt='' /&gt;&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/mlsec/~4/XCHWVYdNzP8" height="1" width="1"/&gt;</description><link>http://feedproxy.google.com/~r/mlsec/~3/XCHWVYdNzP8/portscanning-internet.html</link><author>noreply@blogger.com (Konrad Rieck)</author><thr:total>0</thr:total><feedburner:origLink>http://blog.mlsec.org/2008/08/portscanning-internet.html</feedburner:origLink></item></channel></rss>
