Twitter and Information Security awareness

Twitter is giving traditional media a run for its money in many respects, especially when it comes to getting the news out. Over the last few years a common pattern has emerged where news breaks first on Twitter or a comparable social media platform, only to be picked up later by traditional media such as TV/radio/newspapers. In fact, most of the traditional media powerhouses have started incorporating social media into their portfolios, both as a means of reaching a younger, tech-savvy audience and of receiving information about events as soon as it appears on social media. Twitter is by far the most popular choice of social network for breaking news, for subsequent user/community interactions, and for official corporate accounts interacting with the user base at large. With this in mind, we analyze and assess the impact of Twitter when it comes to raising awareness of critical Information Security-related events.

The questions we tried to answer were:

  • How effective is the Twitter platform when it comes to raising awareness about InfoSec in general, and high-profile events in particular?
  • Can InfoSec professionals and organizations, who face a constant uphill battle to keep up with what now seems like an endless barrage of computer/network attacks, use Twitter to their advantage?

On the whole, 2014 has already seen some very high profile vulnerability disclosures as well as data breaches. The intent of this exercise was to do a cross-sectional study of one such high profile vulnerability disclosure, the ‘Shellshock’ vulnerability, which made its appearance on Twitter and mainstream media in late Sept 2014. We use this particular vulnerability disclosure for our study because it was a very high profile disclosure with the potential to impact a lot of easy targets on the web and on intranets (don’t discount the internal threats!). Also, this disclosure came on the heels of another high profile vulnerability disclosure, ‘Heartbleed’.

Data at a glance

For this study we downloaded some 330,000-odd tweets using Twitter’s search API, spanning the several weeks during which this vulnerability disclosure generated the most activity on both social and mainstream media.
Due to the nature of Twitter conversations, where we have tweets, retweets, favorites, replies, and combinations thereof, it was interesting to look at the overall conversation map. This allowed us to gain a high level understanding of the scope of the conversations and the interactions that happened in them.

Figure 1: Conversation Map


Figure 1 shows a succinct view of the data we sampled. The numbers in the boxes show the number of tweets for that particular category (e.g. original tweets are tweets which are neither retweets nor replies). For comparison’s sake we also show what percentage each category takes up with respect to its parent and grand-parent categories. What is clear is that there was not a whole lot of interaction here apart from retweeting. There were few back-and-forth conversations (replies and replies to replies were under 3%), and few tweets were favorited (under 10%). Of the retweets, it was mostly the original tweets that tended to be retweeted rather than a subsequent reply to a tweet. All this pointed to the fact that the InfoSec Twitter crowd is a small, close-knit community, but not a very interactive one, at least on the Twitter platform as far as discussing InfoSec events is concerned.
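As a sketch, the category counts behind Figure 1 can be tallied directly from the raw search-API payloads; the field names below follow Twitter’s tweet object (`retweeted_status`, `in_reply_to_status_id`), but treat the function itself as illustrative:

```python
from collections import Counter

def conversation_map(tweets):
    """Categorize tweets the way Figure 1 does: originals vs. retweets
    vs. replies, returning (count, percentage) per category."""
    counts = Counter()
    for t in tweets:
        if t.get("retweeted_status"):
            counts["retweet"] += 1
        elif t.get("in_reply_to_status_id"):
            counts["reply"] += 1
        else:
            counts["original"] += 1
    total = sum(counts.values())
    return {k: (v, round(100.0 * v / total, 1)) for k, v in counts.items()}
```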

To elaborate on how minuscule this activity is, as far as high-profile Twitter activity goes, we compared it with the July 8th, 2014 Soccer World Cup semi-final match between Germany and Brazil (which Germany won 7–1). That single match alone (lasting a little over 90 minutes) generated more than 35 million tweets, with a peak rate of approximately 580,000 tweets/minute (more than our entire sample size spanning several weeks). Although we don’t have a similar conversation map of that activity, the sheer number of tweets compared to our data gave us a good idea of how insignificant our ‘Shellshock’ activity was compared to a high-profile sports event.

Note: The numbers were calculated from the data we downloaded using the Twitter search API. The search API does not provide all tweets but a sample of them (anywhere from 1% to 40%, without indicating the true volume). The assumption here was that the sampling distribution is representative of the distribution found in the total set of tweets related to ‘shellshock’.

Timeline and Activity Trend

The one argument that Twitter has going for it over traditional media is the speed with which news is received and propagated. There are various reasons behind such a claim, but let’s first see if there is any truth to it in the InfoSec world. Below we see two graphs. Figure 2 shows the per-day tweets and retweets related to ‘shellshock’ around the time when Shellshock dominated the InfoSec news world, and Figure 3 is the Google Trends data for searches for the phrase ‘shellshock’ during that same time period.

Figure 2: Timeline of Tweets Related to shellshock


Figure 3: Google Trends for ‘shellshock’


Immediately what we saw was a remarkable similarity between the two trends. The peak of the activity was on Sept 25th & 26th, gradually declining over the next two weeks. We also saw a repetitive pattern of drops in activity over the weekends (Sat the 27th and Sun the 28th, repeated on Oct 4th & 5th, and again on the 11th & 12th). This was a clear indicator that the InfoSec community, and in general people who are interested in InfoSec, were monitoring/searching for more information about ‘Shellshock’ and at the same time interacting and exchanging information about it on Twitter, but this activity was mostly happening during work-days.
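The per-day bucketing behind Figure 2, and the weekday/weekend comparison, can be sketched in a few lines of Python (a minimal illustration, not the original script):

```python
from collections import Counter

def daily_counts(timestamps):
    """Bucket tweet timestamps (datetime objects) by calendar day."""
    return Counter(ts.date() for ts in timestamps)

def weekend_weekday_means(counts):
    """Compare average daily volume on weekdays vs. weekends
    (Mon-Fri have weekday() < 5)."""
    wk = [n for d, n in counts.items() if d.weekday() < 5]
    we = [n for d, n in counts.items() if d.weekday() >= 5]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(wk), mean(we)
```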

If we look at the timeline of creation dates of the accounts that participated in this conversation in Figure 4, we don’t see a huge spike in new user creation near the timeline of the event, which tells us that quite a lot of seasoned Twitter pros were engaged in the conversation.

Figure 4: User Account Creation Timeline



To analyze the diversity of the Twitter communication, we looked at the Top 10 languages used to tweet in Figure 5 and the Top 10 user timezones in Figure 6.

Figure 5: Top 10 Languages Tweeted in


Figure 6: Top 10 Timezones in User Profiles


Unsurprisingly, English was the dominant language, with Japanese and Spanish coming in a distant 2nd and 3rd. The United States of America (USA) and Europe dominated the top user timezones. This was expected given their high concentration of IT industries and, as a consequence, greater InfoSec awareness in these locations. What was surprising was how little Asia/APAC, South America, and Africa contributed to the conversation. Does this point to relatively lower InfoSec awareness in these regions? Or does it simply point to a lack of popularity of Twitter there? Given that Twitter has a huge worldwide user base, we suspect the former more than the latter.

Twitter does provide a way for its users to individually geo-tag each tweet with a location, but of the 330,000 or so tweets we sampled, fewer than 1,000 had this information available, so we didn’t dive into per-tweet location analysis.


Individual tweets can contain hashtags, external URLs, and mentions of individual users. Analyzing this information gives us valuable insights into the internal and external resources used in these tweets.

Figure 7: Top 10 Hashtags in All Tweets


Figure 7 shows the Top 10 hashtags found in all tweets. The ‘shellshock’ hashtag took a very comfortable 1st spot. What’s more, the previous high profile vulnerability, ‘heartbleed’, also got a place in the Top 10. Because the Top 10 hashtags in just the unique tweets (sans retweets) show a similar distribution, we didn’t replicate the plot here. Figure 8 shows the Top 10 URLs in all the tweets. It was a bit of a surprise to see a CNET URL getting the top spot. This was due to a single tweet that got retweeted 11K times, pushing that CNET URL to the top. That same tweet is mentioned at the start of this article. When we removed the retweets from the equation, the top URL spot went to a very detailed explanation of the vulnerability by Troy Hunt.
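Counting hashtags, URLs, or mentions boils down to tallying each tweet’s `entities` payload as returned by the search API; a minimal sketch (the helper itself is ours, the payload field names are Twitter’s):

```python
from collections import Counter

def top_entities(tweets, kind, n=10):
    """Tally entries from each tweet's 'entities' payload.
    kind is one of 'hashtags', 'urls', 'user_mentions'."""
    c = Counter()
    for t in tweets:
        for e in t.get("entities", {}).get(kind, []):
            if kind == "hashtags":
                c[e["text"].lower()] += 1
            elif kind == "urls":
                c[e.get("expanded_url") or e.get("url")] += 1
            elif kind == "user_mentions":
                c[e["screen_name"].lower()] += 1
    return c.most_common(n)
```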

Figure 8: Top 10 URLs in all Tweets


Figure 9 shows the Top 10 users who were mentioned in all the tweets. The user account ‘whsaito’ took the top spot on account of a single tweet that got retweeted almost 11K times (the same one with the CNET URL, which appears at the top of this article).

Figure 9: Top 10 Users mentioned in all Tweets


Figure 10 shows the Top 10 most active user accounts in terms of the number of tweets sent from those accounts. These accounts can be thought of as being most active in raising awareness about this vulnerability by tweeting about it multiple times.

Figure 10: Top 10 Active Twitter Accounts


To round out this discussion, we also present the Top 10 retweeted, replied-to, and favorited user accounts in terms of number of tweets in Figure 11.

Figure 11: Top 10 retweeted/replied/favorited user accounts



We already looked at a high-level interaction map in Figure 1. Now let’s dive into it a bit. For starters, let’s see how many followers our InfoSec Twitter users tend to have. In Figure 12 below we show a kernel density plot of the follower counts of each unique user in the data sample. What was very apparent is that the InfoSec crowd is not very popular among the Twitter user base: most accounts have fewer than 1,000 followers. There are, however, a few high-profile accounts with followers in the millions (the InfoSec rock stars).
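The “most accounts have fewer than 1,000 followers” observation can be quantified with a one-pass summary over the per-user follower counts; a sketch (the threshold is the one named above):

```python
def follower_summary(follower_counts, threshold=1000):
    """Fraction of accounts below a follower threshold, plus the maximum,
    to summarize the long-tailed distribution behind Figure 12."""
    n = len(follower_counts)
    under = sum(1 for f in follower_counts if f < threshold)
    return under / n, max(follower_counts)
```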

Figure 12: Follower Counts of Users


Figure 13: Retweeted and Favorited Counts of Tweets


In Figure 13 above we show how many times unique tweets tend to be retweeted and favorited, as a way of measuring how popular InfoSec tweets are. Most tweets are not retweeted or favorited more than 100 times, a very small number, especially when compared to some of the high-profile trending activity on Twitter. This is a pity, as it again points to the limited reach of InfoSec-related tweets.

Figure 14: Follower Counts v/s Retweet Counts and Favorite Counts


For an interesting comparison, we looked at the relationship between a user’s number of followers and whether it had any impact on a tweet being retweeted or favorited. Conventional wisdom would suggest that there should indeed be some correlation, i.e. the more followers, the more retweets or favorites, but we found evidence to the contrary. Figure 14 seems to suggest that the fewer followers you have, the more your tweets get retweeted/favorited. One possible explanation for this contradiction is that we didn’t take the PageRank effect into account, i.e. a tweet being retweeted by someone who has many more followers than the original account. This was left as an exercise for the future.
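One way to check such a claimed (anti-)correlation is a Spearman rank correlation over (follower count, retweet count) pairs, which is robust to the heavy tails in both variables. A self-contained sketch (the original analysis was done in R; this is an illustrative Python equivalent):

```python
def rankdata(xs):
    """Rank values 1..n, assigning the average rank to ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A negative coefficient over the real data would support the surprising pattern in Figure 14.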


So what did we learn? Twitter has the potential to reach vast numbers of users/organizations, even outside the traditional InfoSec community, to raise awareness about high profile security incidents or vulnerability disclosures. However, as things stand now, the InfoSec world is very close-knit and largely unknown outside its own sphere. InfoSec tweets about high profile events such as the ‘Shellshock’ vulnerability tend to reach a restricted and niche user base, as opposed to, say, a high profile sports event such as a Soccer World Cup match or a Super Bowl game. This was evident from the fact that most tweets related to Shellshock were seen, retweeted, replied to, or favorited by a very small fraction of the Twitter user base.

If InfoSec organizations and professionals want to use Twitter to their advantage, then they have their work cut out for them. They need to engage more people/organizations from outside the core InfoSec community in the conversation. The more conversation, the more awareness. InfoSec will never be as popular as soccer or the Super Bowl, but if it manages to attract enough attention from the non-InfoSec crowd, that would be a step in the right direction.

Technical Notes

  • For the interested, the data was downloaded using a Python script and Twitter’s search API.
  • It was analyzed and plotted using R & ggplot2.
  • An interesting continuation of this analysis would be to put the data in a Graph Structure to explore the conversations in more detail, and see if we can discover any interesting clusters in the conversations.
  • Another possible research path is performing text analytics on the data for finding clusters based on words occurring in the tweets.
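As a starting point for the graph-structure idea above, a minimal sketch that links users by retweets and finds connected components as a first cut at conversation clusters (function and field names are illustrative; field names mirror Twitter’s tweet object):

```python
from collections import defaultdict, deque

def retweet_graph(tweets):
    """Undirected user graph: an edge whenever one account retweets another."""
    adj = defaultdict(set)
    for t in tweets:
        rt = t.get("retweeted_status")
        if rt:
            a = t["user"]["screen_name"]
            b = rt["user"]["screen_name"]
            adj[a].add(b)
            adj[b].add(a)
    return adj

def components(adj):
    """Connected components via breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, q = set(), deque([start])
        while q:
            u = q.popleft()
            if u in comp:
                continue
            comp.add(u)
            q.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```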

Weekly Intelligence Summary Lead Paragraph: 2014-11-14

The majority of intelligence collected by the VCIC this week could easily be organized into two categories: serious vulnerabilities and noteworthy attacks. Microsoft released its hefty November patch update on Tuesday, but the attention wasn’t on the cumulative Internet Explorer update or the patch for a second Windows OLE vulnerability that’s being exploited in a small number of attacks. The focus was on a remote code execution vulnerability in SChannel, which is Microsoft’s SSL/TLS implementation in Windows. Add it to the long list of crypto bugs we’ve seen this year and be sure to patch it too. Adobe also released a massive update to patch 18 vulnerabilities in Flash Player. Leading the noteworthy attack category is a report from Kaspersky on the Darkhotel APT, which gets its name by using hotel networks to target high-profile executives. Meanwhile two US government agencies, the United States Postal Service (USPS) and the National Oceanic and Atmospheric Administration (NOAA), announced they suffered attacks at the hands of Chinese threat actors. Australian news organizations also reported being breached by Chinese attackers in the buildup to the G20 Summit in Brisbane this weekend. This week’s must-read collections include Cyphort’s report on point of sale malware and volume 17 of Microsoft’s Security Intelligence Report.

Context Graph Based Analysis of Apple Pay Domains – Part 3 of 3

In our previous posts we identified Apple Pay domains created after the Apple Pay announcement here.  We then aggregated them in a context graph and analyzed the features of the graph here.  We then statistically analyzed the individual clusters here.  Companion posts explaining Verum, the context graph system, can be found here and here.  In this post we will manually validate the results of the previous analysis by looking at the individual clusters previously identified through statistical analysis.

Manual Cluster Validation

To this point in the analysis, everything can be automated.  Rather than manually analyzing the clusters that are potentially malicious, we could classify all of the infrastructure as malicious with a confidence derived from the various measures we looked at above.  The infrastructure atomic values (IPs, BGP prefixes, domains, etc) could be added to our IDS and other detection tools.  Once detected, our SIEM could use the confidence to prioritize incidents for investigation and response.

But let’s say we aren’t quite sure we trust our system yet.  Instead, we will now manually analyze the clusters we highlighted to validate that they represent malicious infrastructure.  Let’s first recap the clusters we hoped to look at:

  • Clusters with zero topic nodes (we will take clusters with no nodes at distances zero, one, or two to keep the set to analyze manageable)
  • Clusters with greater than 1.5x IQR (or just over 16%) of nodes directly related to the malice node
  • Clusters above the 95th percentile of nodes two relationships from the malice node
  • Clusters with high aggregate malice scores, starting with the largest score and working our way down
  • Clusters containing highly malicious nodes

To do this, let’s first subset our data to what we want.  We can use the following code to subset the dataset from our previous post.
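A sketch of that subsetting, assuming the per-cluster dataframe uses columns like `topicDist0Pct` and `maliceDist1Pct` (only `aggMaliceScore` is named in the earlier analysis; the other column names are assumptions):

```python
import pandas as pd

def clusters_of_interest(df):
    """Subset the per-cluster stats dataframe to the categories listed above."""
    # No topic nodes within distance two of the cluster's nodes
    no_topic = df[(df.topicDist0Pct == 0) & (df.topicDist1Pct == 0)
                  & (df.topicDist2Pct == 0)]
    # More than ~16% of nodes directly related to the malice node (> 1.5x IQR)
    direct_malice = df[df.maliceDist1Pct > 0.16]
    # Above the 95th percentile of nodes two relationships from malice
    two_from_malice = df[df.maliceDist2Pct
                         > df.maliceDist2Pct.quantile(0.95)]
    # Everything, ranked by aggregate malice score
    by_agg = df.sort_values("aggMaliceScore", ascending=False)
    return no_topic, direct_malice, two_from_malice, by_agg
```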

Clusters Far from the Topic

Figure 12 Legend



Figure 12

We’ll start by looking at cluster 41, the colorful cluster below and to the left of center of the three green nodes at the center of Figure 12. It is an IP address node surrounded by 4 record nodes that indicate it is malicious. It is connected to an ASN, a BGP prefix, and a domain. The domain (the center green node with 2 close green neighbors) is not classified as malicious. From a malice standpoint, the domain (which connects to 21 topic nodes, colored black in Figure 12, through six IP nodes in the 4 other clusters) is just at the edge of the 1.5x IQR. This appears to be a malicious IP address with a tangential connection to multiple other sets of infrastructure through a single domain that has moved between those infrastructures. Cluster 41 is most certainly malicious, but not necessarily related to our topic domains in the other clusters. Cluster 51 represents the same situation: a malicious IP address connected to the greater graph through a single domain. The same goes for clusters 69, 83, and 86. Rather than analyze all of the far-distance clusters, we will assume this pattern applies to the other clusters far from the topic nodes. Next we will analyze clusters with significant percentages of nodes directly related to the malice node.

Clusters Connected to Malice


Figure 20 – Cluster 270


Figure 13 – Cluster 186

We will start with the cluster with the highest percentage of malice nodes, cluster 270. Upon inspection, it is two nodes: one a malicious domain, the other the record of the malicious classification. The same goes for 231, 134, 150, and the rest (some having multiple records). A few are malicious IP addresses which then include ASN and BGP enrichments, slightly diluting the concentration of maliciously classified nodes. These clusters are all malicious, but do not add any value over their original classification. We do have one cluster, 186, shown in Figure 13, which does not fit the previous pattern. It contains two malicious nodes which share a BGP prefix and a few other nodes. This should probably heighten our concern about that subnet, but not necessarily the entire cluster. The two purple nodes represent the malicious IP addresses, with the associated BGP prefix represented by the yellow node in between. Hopefully the aggregated malice score will provide the necessary insight on the BGP prefix.

Clusters Two Relationships from Malice


Figure 14 – Cluster 102

These clusters follow a similar pattern to those directly related to the malice node. Starting with cluster 102, shown in Figure 14, we see it has a single malicious IP, an associated BGP prefix, and 12 records indicating the IP is malicious. The yellow node represents the center IP node, with the red nodes indicating the records in the cluster. The blue node represents the domain that, while not in the cluster, is directly related to the malicious IP. The IP is still connected to a domain, which should be concerning as the domain is potentially malicious; however, the topic node’s malice score should indicate that. There are other clusters similar to cluster 186 (profiled above) in this group as well, but most follow the pattern of clusters directly connected to malice, except with IP addresses rather than domains, whose additional enrichments increase the number of nodes two relationships from malice in the clusters.

Aggregate Malice Score

Aggregate malice score is the average malice score of all nodes within the cluster. The node malice score is a propagation of malice from the malice node through the graph. As such, it captures not just the direct relationship between a node and malice, but also the malice propagated through relationships from neighbors to the node. The most malicious cluster in this context graph has only 2 nodes: a record and a domain. That domain, however, is a malicious name server. This plays out across multiple clusters, where a single, highly malicious name server node carries the malice for a rather small cluster. There are some interesting clusters, however. Cluster 41, which popped up previously, shows up again, as do clusters 107 and 71. These tend to center on a highly malicious IP, which then points to a presumably malicious domain.
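The aggregation itself is a straightforward per-cluster average; a sketch, assuming per-node malice scores and a node-to-cluster assignment have already been computed:

```python
def aggregate_malice(node_scores, cluster_of):
    """Average the per-node malice scores within each cluster
    (the 'aggMaliceScore' from the statistical analysis)."""
    sums, counts = {}, {}
    for node, score in node_scores.items():
        c = cluster_of[node]
        sums[c] = sums.get(c, 0.0) + score
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}
```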

Figure 15 Legend



Figure 15 – Cluster 65

Cluster 65 is our first interesting cluster in this set. It has five IPs classified as malicious and a whole host of related information, as can be seen in Figure 15. The size of a node in Figure 15 corresponds to its malice score. The five blue IP nodes are directly connected to the malice node. In this cluster, the large green node in the lower left, at the center of various other nodes, is a topic Apple Pay domain. This domain is highly likely to be malicious, even though it is not directly classified that way by any of our enrichments or classifications.

This type of pattern continues as the clusters get less malicious. More nodes are added, which decreases the cluster malice score, but the additional nodes reveal additional infrastructure and indicators associated with the known malice. Some contain a topic; others are linked to topics not within the cluster. These larger clusters provide the best means of classifying previously unclassified malice. They have higher internal malice scores, more nodes classified as malicious, and are highly interconnected, highlighting a collection of malicious infrastructure. All of this can be algorithmically identified and the new indicators fed back into our detection tools.


Figure 16 – Cluster 16

We will take a look at the largest cluster, cluster 16. Figure 16 shows this cluster with nodes colored by type, except for maliciously classified nodes, which are colored red, and topic nodes, which are colored black. In the upper right we see multiple malicious IPs in a single BGP prefix/ASN. Connected to that cluster of malicious IPs are various domains. Those IPs and domains connect into a less malicious portion of the cluster in the lower left. Rather than classify the nodes in the lower section as malicious, we may classify them as malicious with a lower confidence than the domains directly connected to the malicious infrastructure in the upper right. This does, however, validate what we suspected about large clusters with significant aggregated malice scores.

Malice Score of Individual Topic Nodes

Figure 18 Legend



Figure 18 – Cluster 0

As outlined above, 81 clusters contain the 1734 most malicious nodes identified as outliers. We will look at a few of them here, focusing on the clusters containing the most malicious nodes. Unsurprisingly, both the most malicious and second most malicious nodes in our graph are within cluster 16; they are the two largest nodes in Figure 16, one red and one brown, in the upper right. The third most malicious node lies within cluster 2. This cluster contains 1390 nodes and, based on an analysis of its centrality, represents a shared infrastructure. In the cluster centrality, a single *aaS provider’s ASN, name servers, and IP space are very central. As such, when threat actors use the shared infrastructure, the malice ‘collects’ in the graph on nodes representing the shared service provider’s services. Cluster 0 represents a similar cluster and can be seen in Figure 18. It is interesting in that it represents two clusters of shared infrastructure (the large clusters left and right of center) with domains associated with both of them (the nodes in the center). The two are actually two registrars, one of which used to be a reseller for the other.
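The sort used in the next paragraph (the snippet itself was not preserved here) would look something like the following pandas expression; the dataframe and column names are assumptions, standing in for a per-node frame of cluster ids and propagated malice scores:

```python
import pandas as pd

# Hypothetical per-node frame: one row per node with its cluster id and
# propagated malice score. Rank clusters by their single worst node.
nodes = pd.DataFrame({
    "cluster": [0, 0, 1, 2, 2],
    "malice":  [0.2, 0.9, 0.4, 0.7, 0.1],
})
by_worst_node = (nodes.groupby("cluster")["malice"]
                      .max()
                      .sort_values(ascending=False))
```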

When sorted by the most malicious individual node using the above line of Python, the first dozen clusters all have more than 100 nodes each, validating what we suspected: that shared infrastructure dilutes the overall malice but concentrates it on individual nodes. Cluster 76 is the first with fewer than 100 nodes. This cluster also showed up in our aggregate malice score analysis.


In this series of blogs we have collected an enriched graph centered on multiple topic domains related to Apple Pay. We have scored individual nodes and clusters, identifying small clusters of malicious nodes, medium-sized clusters of infrastructure, and large clusters of publicly available shared infrastructure. We’ve classified clusters as well as individual nodes as potentially malicious, with an associated confidence. We’ve done this both through automated means and through manual analysis. We have also included contextual data in our analysis so that analysts trying to make an assessment later have the data they need to do so. All of this allows the classification and investigation of malicious infrastructure, even before it acts maliciously or when it only indirectly exhibits its malice, in such a way that the results can be consumed either by human analysts or by automated protection, detection, and response systems.


While the graph database used for this analysis stores the temporal aspects of the nodes and relationships, the functions implementing the algorithms to time-phase the analysis through the graph have yet to be completed. As such, the analysis within this blog post is done as if all classifications and enrichments occurred at a single time. In reality, IP addresses, domains, and other nodes change classifications and relationships over time. In the future we will conduct analysis through the graph that respects the temporal aspect of the data. For now, understand that the lack of temporal phasing can skew the analysis.

Also, many large clusters exist in the graph representing publicly shared infrastructure (such as shared whois-obfuscating services, domain registrars, IP space, and name servers). Cluster 2 is an example of this. The analysis correctly identifies these as non-malicious; however, due to their size they contain subsets of malice. This can be corrected by rerunning modularity on the large cluster by itself (modularity suffers from low resolution). Alternately, future research into subdividing these large infrastructures into smaller clusters which can then be individually classified may prove fruitful.

Weekly Intelligence Summary Lead Paragraph: 2014-11-07

Microsoft announced intentions to release sixteen security bulletins next week.  Sixteen is the most the company has released in one month since June 2011 and one under April 2011’s high water mark.  The VCIC dedicates extra effort to targeted attacks.  Not because they are currently prevalent among our clients, but because the methods that succeed today will almost certainly be used on Verizon Enterprise clients in the future.  This week those attacks include “TooHash” (GData), “Poisoned Handover” (FireEye), “BlackEnergy 2” (and 3 from Kaspersky) and “Rotten Tomato” (Sophos).  Crimeware continues to evolve as evidenced by a Cryptolocker campaign on Dutch users, several Rovnix campaigns, mostly in Western Europe, ROM, a new POS Trojan based on Backoff, and “WireLurker” impacting OS X and iOS systems, mostly in China.

Context Graph Based Analysis of Apple Pay Domains – Part 2 of 3

In our previous post, we looked at the initial creation and enrichment of a Context Graph centered around newly created Apple Pay domains.  We looked at the distribution of the Apple Pay topic throughout the graph.  In this post we will statistically compare and contrast individual clusters.  The companion post Introducing Verum: A Context Graph System – Part 2 of 2 provides additional insight into the Verum context graph system for those interested.

Cluster Analysis

To make the data easier to analyze with traditional means, I’ve provided a dataframe with the statistics for each cluster here. This dataframe was created by aggregating the values stored on the individual nodes in the graph, including the various topic and malice scores and distances. I have normalized the values by dividing by the cluster order. In the case of aggregate scores, the values are normalized to between zero and one.

Figure 8


Cluster Topic Analysis

Our first feature of analysis will be the percentage of nodes in each cluster at varying distances from the topic nodes. This data is plotted as a boxplot in Figure 8. If you are unfamiliar with boxplots, I recommend reading up on them here. They are a very efficient way of visualizing non-normally distributed data. In our case, the whiskers represent 1.5x the inter-quartile range (IQR), which is the default for matplotlib.
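The whisker fences can be computed directly; a sketch using Tukey's rule with linearly interpolated quartiles (matplotlib's exact quartile method may differ slightly):

```python
def iqr_fences(values, k=1.5):
    """Tukey fences used for boxplot whiskers: Q1 - k*IQR, Q3 + k*IQR."""
    xs = sorted(values)
    def quantile(q):
        # Linear interpolation between the two nearest order statistics
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

Points outside these fences are the "outliers" plotted beyond the whiskers.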

Nodes at distance zero are the topic nodes.  We clearly see clusters tend to contain nodes at a distance of two and three relationships from the topics, which is to be expected, as the number of nodes increases exponentially as we get further from the topic.  What is interesting, though, are the tails of the distributions. Clusters with a large number of topic nodes within them are likely small clusters, as there is no way for a large cluster to be predominantly topic nodes without depriving other clusters of topic nodes. We do, however, see a very long tail in clusters with a significant number of nodes one relationship from a topic node. These can be categorized as clusters with a topic node or two and all other nodes in the cluster describing the topic. This is most likely whois data and a few IPs for the topic domains (see Figures 14 and 20 below), as IPs would also bring in BGP prefixes and ASNs that would implicitly be 2 and 3 relationships away from the topic. For these clusters made up predominantly of nodes one relationship from the topic, we can conclude they are not part of a larger infrastructure but a small structure around one or two domains.

Figure 19



The tails of the two- and three-relationship topic distances are interesting as well. We can expect the boxes to be large, as large boxes imply significant interrelated infrastructure. This presents itself in the graph in chains such as “name server X”<-[described by]-“applepay domain”-[described by]->”IP address”<-[described by]-“other domain”-[described by]->”name server X”. The IP address would also have a relationship to a BGP prefix, which would have a relationship to an ASN. In that case the “other domain” and “BGP prefix” would be two relationships away and the ASN, three. Also, anything associated with the “other domain” will be at least three relationships away.  Due to the large boxes, though, the 1.5x IQR whiskers go all the way to 100%.  We can remove the whiskers and analyze the outliers above the box in Figure 19.  We see multiple clusters with 60% to 80% composition of nodes two or three relationships from a topic. Effectively this means infrastructure that is more highly clustered with itself than with the topic domains, as nodes have to have some relationship to topics to even be in the graph.  These are likely the clusters with no topics in them that we visually identified earlier.  As these clusters are of interest, we will manually analyze them in our next blog.

Cluster Malice Analysis

As part of Verizon’s threat analysis, Verizon uses multiple intelligence feeds to enrich its data. As many people in the information security community have pointed out, intelligence feeds in the sense used in information security are little more than classifications of IP addresses and domains. However, we can apply that classification to our data to help add context to the clusters of infrastructure we have identified. We will apply scoring algorithms (which we discuss in the companion blog post) to propagate the malice score from the malice node to the rest of the subgraph and assess the malice of nodes within the graph. The outcome of this analysis is aggregated per cluster in the above data frame as the column “aggMaliceScore”.  We also capture the percentage of nodes in each cluster at each distance from the malice node, similar to our topic distance percentages above.


Figure 9


We will start by analyzing the boxplots in Figure 9.  It immediately sticks out that the third quartile for two relationships from the malice node is 40%. This gives us a good idea of what a normal cluster looks like. The whisker, however, hides clusters above 40% due to the large IQR. We can remove the whiskers as we did earlier and see the actual points in Figure 10.

Figure 10


By computing percentiles of the distribution, we see that the 95th percentile is at roughly 80%. Between Figures 9 and 10, we see multiple clusters of interest. Those that are greater than 80% comprised of nodes at distance two from the malice node are a good focus for investigation.   Also worth investigating are those clusters with a significant number of nodes with a direct relationship to the malice node (i.e. a malice distance of one). We see five clusters above the whisker in this category that we will add to our list for manual analysis.
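For reference, a percentile cutoff like this takes only a couple of lines to compute. This is a sketch only: the array of per-cluster percentages below is made up for illustration, not taken from the real data frame.

```python
import numpy as np

# Hypothetical per-cluster percentages of nodes two relationships
# from the malice node (the real values come from the data frame above).
pct_distance_two = np.array([0.10, 0.15, 0.25, 0.30, 0.35, 0.40, 0.81, 0.90])

cutoff = np.percentile(pct_distance_two, 95)               # 95th percentile
clusters_of_interest = pct_distance_two[pct_distance_two > cutoff]
```

Anything above the cutoff goes on the list for manual analysis.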

Figure 11


What is not captured in Figures 9 and 10, however, is the aggregate score. The aggregate malice score is the malice accumulated through all of the nodes within the cluster. Figure 11 provides that view.

While the data is normalized in the boxplot, it is not a percentage. In this plot, we see a long tail with a single cluster whose aggregate score is nearly twice the next highest cluster. This is likely a small cluster with multiple nodes related to malice.  We will add all of the clusters above the 1.5x IQR whisker for manual investigation as potentially malicious infrastructure.

Figure 17


It may also be helpful to identify nodes of significantly high malice and their associated clusters. This may identify clusters containing a very malicious node alongside a subset of the cluster which is not malicious, thereby decreasing the aggregate malice score of the cluster.  Figure 17 shows the boxplot of normalized individual node malice scores. Using the same centrality and scale estimation algorithms as above, we choose a cutoff of about 0.23, which leaves 1734 nodes in 81 clusters.

Next Post

In the next post, we will take the clusters identified through statistical analysis and manually analyze them to determine if our statistical process has appropriately identified previously unknown malice, providing true threat intelligence.  Stay tuned!

Introducing Verum: A Context Graph System – Part 2 of 2

As a complement to our second Apple Pay analysis post, we continue our discussion of the Verum system.  If you haven’t read the previous post on Verum, please check it out!



Figure 1 – Netflow Record as Graph 1


Figure 2 – Netflow Record as Graph 2

At its most fundamental level, Verum is a graph. In the mathematical sense, a graph is nodes (also called vertices), represented by circles, and edges (also called relationships), represented by arcs or lines. For comparison, if you had a relational database table of something like Netflow with columns <flowID, srcIP, srcPort, destIP, dstPort, proto> and values <1,,, 12345, 80, tcp>, you could create a graph as seen in Figure 1. In this image we have a node for each value in the database row. Additionally, we have something relational databases don’t: a separately describable relationship between each value. In a relational database the relationship is captured primarily by the values being in the same row and by the column names. (Notice that in the database columns it’s “source IP” rather than “IP” as it is in the graph.) In the graph, the relationship can be on the edge, allowing more atomic nodes. We may also choose to structure the graph even more basically, as in Figure 2. In Figure 2, most of the relationship data is stored on the edge, with the nodes being the atomic values connected by the relationship. This leads to another point: in most graphs, both nodes and relationships have properties (essentially key:value pairs). This is called a property graph, and effectively all graph systems support and use such features.  In the case of Figure 2, the relationship has the following key:value pairs: <flowID:1, srcPort:12345, dstPort:80, proto:tcp>.
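As a concrete illustration of the Figure 2 structure, the record can be held as two atomic nodes joined by an edge carrying the remaining properties. The IP values below are documentation placeholders, since the originals did not survive in the post.

```python
# Two atomic nodes: just the values, each reusable by any record that
# references the same IP.
nodes = {
    "n1": {"key": "ip", "value": "192.0.2.1"},     # placeholder srcIP
    "n2": {"key": "ip", "value": "198.51.100.2"},  # placeholder dstIP
}

# One edge holding the rest of the Netflow record as properties.
edge = {
    "source": "n1",
    "destination": "n2",
    "flowID": 1,
    "srcPort": 12345,
    "dstPort": 80,
    "proto": "tcp",
}
```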

There are multiple types of graphs. The most basic graph is an undirected graph, which has relationships that do not imply a direction. If I was a node and my wife was a node in the graph, a relationship between us would be “spouse” but would not have any particular direction. The next step up is a directed graph. In a directed graph, each edge has a source and a destination. The Netflow relationship, where the connection goes from one IP to another, is an example of a directed graph. A multigraph is a graph in which multiple edges can have the same source and destination. If there were multiple flows between the same pair of IPs in our above example: <flowID:1, srcPort:, dstPort:, proto:tcp>, <flowID:2, srcPort:, dstPort:, proto:tcp>, <flowID:3, srcPort:, dstPort:, proto:tcp>, <flowID:4, srcPort:, dstPort:, proto:tcp>, it would be a multigraph. Multigraphs can be either directed or undirected. The primary difference with a multigraph is that the relationship source and destination are not enough to uniquely identify the relationship, and as such, the infrastructure must accommodate some other unique identifier of relationships. There are many other types of graphs, such as simple graphs, acyclic graphs, and hypergraphs, but for the purpose of this blog they are unimportant. Our schema will be applied to a directed multigraph.
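In networkx terms (the library used later in this post), parallel flows become parallel edges in a MultiDiGraph, each distinguished by an extra key; here the flowID serves as that key. The IPs are again placeholders.

```python
import networkx as nx

g = nx.MultiDiGraph()
for flow_id in (1, 2, 3, 4):
    # Same source and destination each time: only the key tells them apart.
    g.add_edge("192.0.2.1", "198.51.100.2", key=flow_id, proto="tcp")

assert g.number_of_edges() == 4  # four distinct parallel edges
```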

Scoring Algorithms

Part 1 described retrieving a sub-graph from the graph database.  Once we have the sub-graph, we can score it to produce our desired output: additional classification of the topic, with confidence, based on its context within the graph.  The first step in scoring is getting the distance each node is from the topic. We did this during sub-graph export, but we may want to do it again for other topics within the sub-graph, so we implement a simple function, get_topic_distance(). This function simply runs the Networkx shortest path length algorithm on each topic node and keeps the lowest score. Because we are scoring how relevant certain nodes are to the topic nodes, the farther a node is from the topic nodes, the less weight we want to give it. As such, we implement multiple decay functions that return a weight for a node given a few constant factors and the node’s distance from the topic: exponential decay, linear decay, and logarithmic decay. (The appropriate decay to use, its constants, and whether to even use a decay is subjective and varies by scoring algorithm.) We also have a function that converts directed multigraphs to undirected graphs, which is necessary for some of the scoring algorithms. The approach is straightforward:
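A minimal sketch of get_topic_distance() and one of the decay functions, as described above. The exact constants and signatures in the real code may differ.

```python
import math
import networkx as nx

def get_topic_distance(g, topics):
    """Minimum shortest-path length from each node to any topic node."""
    distances = {}
    for topic in topics:
        for node, d in nx.shortest_path_length(g, source=topic).items():
            if node not in distances or d < distances[node]:
                distances[node] = d  # keep the lowest score per node
    return distances

def exponential_decay(distance, rate=0.5):
    """Weight a node less the farther it sits from the topic."""
    return math.exp(-rate * distance)
```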

Probably the most important algorithms are the scoring algorithms. We have tested multiple algorithms: bayesian_network_probability(), pagerank(), path_count_probability(), and PageRank with initialization values – pagerank2().


Our first scoring algorithm, bayesian_network_probability(), treats the sub-graph as a Bayesian network. It creates a synthetic conditional probability table where each row is true if any predecessors are true. The confidence of the edges is used to weight probability of the node being true given the connected node being true. The probability of each node given that the topic is true is then calculated. The strength of using a Bayesian network is that we could capture complex relationships between predecessor nodes through the conditional probability tables. For example, if an IP was malicious if either domain A or domain B were malicious, but not both, a conditional probability table could capture that. Because the algorithm is very iterative, it should be possible to partition it into jobs that can be spread across a distributed compute infrastructure.

On the other hand, Bayesian networks have many complexities. First, when calculating the probabilities in the Bayesian network, we do not get the correct probabilities due to not having the entire Markov blanket. This causes nodes to be more likely than they would normally because of the missing relationships. Second, calculating the Bayesian network probability is computationally intensive and will run very slowly over large sub-graphs. Finally, Bayesian networks do not complete on graphs with cycles (nodeA->nodeB->nodeC->nodeA) without introducing additional complexity into the algorithm. This means that calculating the Bayesian probability may never complete. The included algorithm does not properly handle cycles in the sub-graph and instead, simply errors if the calculation loop runs for too long without progress.


The initial pagerank() is a straightforward calculation of the PageRank score in the graph, with nodes receiving a personalized weight relative to their distance to the topic. It also uses the confidence of each relationship for the weight. This approach is computationally very efficient, and PageRank was one of the first algorithms executed in distributed compute environments. It does a good job of capturing the effect of relationships on the score of each node. However, in this form, the node scores are effectively their relative importance in the sub-graph, not their importance relative to the topic.


The path_count_probability() function collects all of the unique simple paths between the topic nodes and all other nodes in the sub-graph. Each path is scored using the confidences on the relationships in the path and the topic distances of the nodes in the path. This produces a per-node score of its relevance to the topic. This algorithm produces a good score which is relative to the topic. It should also be executable in a parallelized compute environment, as finding paths is a node-centric computation and scoring a path is independent of the scoring of all other paths. Unfortunately, finding all paths is extremely computationally expensive.
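A sketch of the idea behind path_count_probability() on a simple directed graph. The real function also folds in the topic distances; the per-hop decay constant here is arbitrary.

```python
import networkx as nx

def path_count_probability(g, topic, decay=0.5):
    """Sum confidence-weighted scores over all simple paths from the topic."""
    scores = {}
    for node in g:
        if node == topic:
            continue
        for path in nx.all_simple_paths(g, topic, node):
            score = 1.0
            for u, v in zip(path, path[1:]):
                # Edge confidence defaults to 1; decay applied per hop.
                score *= g[u][v].get("confidence", 1.0) * decay
            scores[node] = scores.get(node, 0.0) + score
    return scores
```

Enumerating every simple path is what makes this approach blow up on dense sub-graphs.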

pagerank2() – PageRank with Specific Initialization Values

Running PageRank with specific initialization values, currently captured in the pagerank2() function, has proved to be the most effective means for scoring the graph. Pagerank2() runs the same as the first PageRank algorithm, including the use of a personalized weight and the use of confidence for the relationship weight. However, it includes specific starting weights and dangling weights. By default, the initial score is evenly distributed across all nodes. In this instance, we spread the initial weight only across the topic nodes and assign all other nodes zero initial score. The dangling weight is the probability that a node will be jumped to when a jump occurs (either randomly or due to a node having no predecessor nodes). By default, all nodes have an even chance of being jumped to. In this case, topic nodes will all receive an even dangling weight and all other nodes will receive none.

The initialization values chosen in pagerank2() have an interesting effect. PageRank works by ‘visiting’ a node, giving it a ‘point’, and then either visiting one of the node’s predecessors or, with a small probability, jumping to a random node. The score of a node is its points divided by the sum of all points of all nodes in the graph. Effectively this spreads the score around, causing it to ‘collect’ in places that are repeatedly visited (nodes of high degree or nodes pointed to by nodes of high degree). By starting all of the score at the topic and always jumping back to the topic, nodes receive all of their score from the topic, which is analogous to their relative importance to the topic. The topic-distance-based decay for each node, which raises the probability of jumping back to the topic the farther from the topic the node is, reinforces the score being a relative importance to the topic. This application of PageRank inherits the other PageRank benefits, including distributed compute compatibility and efficient calculation, while correcting the primary deficiency. As such, this is the scoring algorithm that we will primarily use.
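With networkx, the pagerank2() behavior described above can be approximated by concentrating the start, jump (personalization), and dangling distributions on the topic nodes. This is a sketch; the production code also applies the per-node distance decay.

```python
import networkx as nx

def pagerank2(g, topics):
    """PageRank whose score flows entirely from the topic nodes."""
    share = 1.0 / len(topics)
    topic_weights = {n: (share if n in topics else 0.0) for n in g}
    return nx.pagerank(
        g,
        nstart=topic_weights,            # all initial score on the topics
        personalization=topic_weights,   # random jumps land on the topics
        dangling=topic_weights,          # dangling mass returns to the topics
        weight="confidence",             # relationship confidence as weight
    )
```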


The previous four scores produce effectively the same measurement: relative importance of a node to the topic (similar to machine learning classification with an associated confidence). There is a completely orthogonal score which we are likely also interested in: cluster (well, technically, modularity class). Modularity is a score that measures a graph’s connectivity. By finding groupings of nodes that maximize modularity, a graph can be clustered into individual communities. This is effectively the same as machine learning clustering. When run against a topic-specific sub-graph, modularity may help us understand what, if any, sub-groupings exist within the data. We simply call the best_partition() function of the community module for Networkx. We do, however, convert the graph from a multidigraph to an undirected graph first, as the community detection algorithm does not handle either multigraphs or directed graphs well.
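The post calls best_partition() from the separate python-louvain package; for a self-contained sketch, networkx's built-in modularity-maximizing clustering behaves similarly (a plainly swapped-in technique, not the one the post uses).

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cluster(g):
    """Assign each node a modularity class, as described above."""
    undirected = nx.Graph(g)  # collapse multi/directed edges first
    communities = greedy_modularity_communities(undirected)
    # Map each node to the index of its community ("modularity class").
    return {node: i for i, members in enumerate(communities)
            for node in members}
```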

We can also apply the scoring algorithms above to nodes with <key:classification>, such as the malice node. In this case, the topic is the malice node rather than our original topic. Since our scoring is based on distance to the topic, we must re-run the topic distance calculation for the malice node. We can then use that, along with a decay algorithm, to execute the pagerank2 scoring algorithm. This is what we do in the case of our Apple Pay analysis. Because the Apple Pay sub-graph has multiple different clusters with multiple different sets of infrastructure, this helps us identify the malicious clusters and the non-malicious clusters. If our sub-graph contains both the benign and malice nodes, we could run the scoring on both and compare the results, finding which nodes were significantly more malicious than benign and vice versa. We can also use it on other classifications such as “C2 node”, “bot”, or “malware host”. By comparing the scores, we can determine what kind of malice a given node is likely to be.

Presentation Algorithms

Presentation algorithms provide the final presentation of the results.  The first step is to normalize the data.  Each of the scores above produces scores in a different range.  As such, to make it practically useful, each needs to be normalized. This can be done easily:
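A minimal min-max normalization over a dict of node scores, as a sketch of this step; the per-cluster averaging described next would follow it.

```python
def normalize(scores):
    """Rescale a dict of node scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {node: (s - lo) / span for node, s in scores.items()}
```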

This will ensure the scores are between zero and one. In our Apple Pay analysis, we did most of our scoring on a per-cluster basis. As such, we need to sum the scores of all of the nodes within a cluster and divide by the cluster size to get a score comparable between clusters.

Once the scores have been normalized, we can take two approaches to finding interesting results. The first approach is what we apply in the Apple Pay analysis: we look for outliers in scores, for which we have three options. In the first option, we calculate a cutoff based on a measure of scale and the centroid of our score distribution. Because our distributions are likely to be long-tailed and skewed, we do not want to rely on the mean and standard deviation to predict outliers. Instead, we can use a more robust measure of scale such as tau (calculated using this algorithm) and the median to estimate centrality. We can then find outlier scores above and below the cutoff and classify the associated nodes. Alternately, we may arbitrarily pick a cutoff by percentile, such as the 95th, 97th, or 99th percentile. We then find the score at that percentile and classify the nodes with larger scores. Finally, we may simply want to start with the highest-scored node and work our way down. In that case we simply sort the nodes by score and process them.
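The tau scale estimator linked above is not reproduced here; as a stand-in, the same robust-cutoff idea using the median and the median absolute deviation (MAD) looks like this:

```python
import statistics

def outlier_cutoff(scores, k=3.0):
    """Upper outlier cutoff from a robust centroid (median) and scale (MAD)."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    # 1.4826 makes MAD consistent with the standard deviation for
    # normally distributed data; k sets how extreme an outlier must be.
    return med + k * 1.4826 * mad
```

Unlike a mean/standard-deviation cutoff, a single huge score barely moves this threshold.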

Our second approach is differential classification. While it is not used in the Apple Pay analysis, if we wish to determine whether something is either A or B (for example, malicious or benign), we can compare its relative score between the two (or the relative score of the classification nodes in the topic-centric sub-graph). This would apply to more than two classifications being compared as well. Additionally, because our scores are not Boolean, we can translate them into a confidence for use by tools downstream.

This output can then be consumed in whatever way we need. For those doing manual event analysis, observables can be enriched, classified, and weighted, with context, in a graph GUI when analysts arrive in the morning, prioritized by potential malice. For more academic analysis, the data can be fed into statistical tools such as R, Python Pandas, or Gephi for manual analysis. For machine-to-machine use, the output can be sent in the format of choice of the receiving system, whether it be a graph, a record, JSON, XML, or another format for observables and their context and classifications.  Because the graph contains all the data, it can easily be formatted into higher-level schemas such as STIX, IOCs, YARA, or IDMEF.


Because the concept behind Verum benefits the more data is in it, the infrastructure is designed to support the largest graph possible. To accomplish that, it must be built on infrastructure where performance scales linearly. To this end, the TitanDB, Cassandra, Elastic Search, and Tinkerpop stack was chosen for storing the graph.  TitanDB is a distributed graph database designed to work with scalable architectures. It has good interoperability with other distributed computing and storage projects, and it is well documented and maintained. TitanDB supports multiple storage backends; we chose Cassandra for storage. Cassandra is a top-level Apache project offering many benefits consistent with Verum’s needs. A more thorough review of Cassandra can be found here. While TitanDB provides its own hash indexing of nodes and edges, a search backend must be used for full search. Elastic Search was chosen due to the team’s internal experience, though TitanDB also supports Lucene.  Elastic Search provides full-text search and order comparison across all TitanDB properties.  A word of caution: TitanDB expects schemas to be defined prior to creation.  Not creating the schema and indexes prior to adding nodes to the graph causes significant headaches down the road.

TitanDB, Cassandra, and Elastic Search together provide a complete storage solution. The primary interface to TitanDB is the Gremlin query language. The same way SQL is designed for querying relational databases, Gremlin is designed for querying graphs. It provides a method for traversing nodes and relationships. Gremlin is part of the Tinkerpop stack. To this point, all the software is Java based. To allow us more options in interfacing with the graph database, we use the Tinkerpop Rexster RESTful API. This allows us to use any language to interface with the graph, as well as providing a basic GUI for the graph. Rexster allows either raw Gremlin queries or abstractions of Gremlin queries (such as retrieving a specific node).

The compute portion of Verum is currently not implemented on a scalable architecture. The algorithms, however, are designed to support a scalable compute architecture once the compute platform is ready. Currently, computation is done within Python using the excellent Networkx package. The bulbflow package is used to communicate with the TitanDB Rexster interface. The algorithms use bulbflow to retrieve the desired sub-graph from the graph database and store it as a Networkx graph. (An alternate package, Mogwai, also exists to interact with TitanDB from Python.) In some instances, this type of structure may be desirable. If multiple users are analyzing sub-graphs of the graph database, having the compute occur on the clients may better distribute the compute load. However, if users do not receive acceptable compute performance, or if only a few users use the system, it may make sense to use a distributed compute solution.

TitanDB provides Titan-Hadoop for distributed graph computations. Titan-Hadoop can translate Gremlin queries, including side effects, into MapReduce jobs. This allows linear scaling of complex Gremlin computations. An alternative is Spark GraphX. Spark is a successor to Hadoop that provides more memory-based computation as well as other performance improvements. GraphX is a graph-processing framework that sits on top of Spark. While Spark and GraphX are probably the way of the future, a few reasons caused us to delay using them as a platform: 1. Spark seems to be primarily focused on the compute piece of the distributed architecture; as our first goal was storage, we wanted to go with a platform that prioritized storage (TitanDB). 2. GraphX bindings are currently all Scala, a language our team is not well versed in. 3. We found the TitanDB/Cassandra/Elastic Search/Tinkerpop documentation to be more robust. Still, we expect that Spark/GraphX will be the go-to platform in the near future.


In addition to the scalability and classification/context benefits already highlighted, other benefits exist. First, this system is distributable among organizations. Due to the URI approach to identifying nodes, a client could query multiple graphs such as their own context graph, freely available context graphs, licensed context graphs, private information sharing exchange context graphs, and government context graphs for the same topic. The various returned sub-graphs can be combined and the analysis executed across the combination of sub-graphs. Also, the system is best effort in nature. This allows it to mature as more data is added. Rather than try and provide the ‘right’ answer, Verum attempts to provide the ‘best’ answer it can with the available data. This means that it can provide answers when other systems cannot and can provide more knowledge as it gains more data.

Verum represents a massive amount of domain knowledge, deep academic knowledge, and very modern technical knowledge in ways never before attempted. I mentioned this blog post in the introduction, where I explained why data was the solution. Verum is the first step in actively using data to win the information security battle.  To that end, experimental code implementing Verum can be found on the VZ-risk Github.   It is published with the hope that it will benefit all in the fight for information security.  While there, also check out the other great VZ-risk products such as the VERIS framework and the VCDB data set: open source breach data in the same format as that used for the Verizon DBIR.

Introducing Verum: A Context Graph System – Part 1 of 2


In this blog post, we will discuss the infrastructure, schema, and algorithms used in the post here to analyze domains created after the Apple Pay announcement. While we took the Apple Pay announcement as an opportunity for a specific application of this system, it is generic enough to be applied to any analysis, including analysis outside of information security. We refer to our system as Verum, after the transcendental: truth.

The concept of Verum is to ensure that any information gained can be used to find the underlying truth behind the data. In this blog post I explained my thoughts on the importance of data to information security; in this post I explain our current problem: we do not effectively exploit the signal threat actors send to us. Verum is designed to remedy that problem.


Verum’s schema is a subset of the schema documented here. The goal of the schema is to ensure that all nodes are atomic key:value pairs and that relationships store the signal we receive in our data. Relationships in CAGS have three types: “described_by”, “leads_to”, and “influence”. “described_by” is used for connecting the ontological structure of the atomic values. “leads_to” describes temporal relationships of actions.  “influence” is used to describe the temporal relationships from context. Verum deals with the ontological structure, and so all edges are “described_by” relationships. Almost all nodes have a class of “attribute”, as they are an attribute of something else. The exception is “actors”. The reason for defining actors separately is that they have free will or, in more agnostic economic terms, they can make choices generally in accordance with the rules of rational actors.


Nodes have only a few properties, because most of the data is not in the nodes but in their relationships with each other. We do record a few things: class, key, value, optional startTime, optional finishTime, optional uri, and an optional comment. The system is not required to expect any other properties, though any that do exist must not interfere with graph manipulation. The “class” property is always “attribute” or “actor” in our graph. The “key” property is the key of the key:value pair, with “value” being the value. While it may seem odd to separate these, it helps significantly when indexing nodes as well as when operating on nodes in tools which treat each property as a column (such as Gephi). startTime and finishTime are helpful in understanding temporal relationships, particularly for things like domains, which have a clear creation date and destruction date. In some tools, it is also helpful to store a Uniform Resource Identifier (URI) on nodes. While all nodes should have a URI, storing it on the node (rather than as a synthetic construction of the <attribute, key, value> tuple) is beneficial in some infrastructures. Additionally, nodes are allowed to have a database-assigned ID, as most storage systems require this. This value is not used algorithmically anywhere.
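Putting that together, a node might carry properties like the following. The values are hypothetical, and the uri shown is just one plausible synthetic form of the <attribute, key, value> tuple, not a format the post specifies.

```python
node = {
    "class": "attribute",
    "key": "domain",
    "value": "example.com",                 # placeholder value
    "startTime": "2014-09-10T00:00:00Z",    # optional
    "uri": "attribute/domain/example.com",  # optional; hypothetical form
}
```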


The relationship is the most important part of the structure. It is what provides for the truth. Edge properties are similar to those of nodes: relationship, source, destination, optional startTime, optional finishTime, confidence (implicitly 1 if not specified), origin, optional uri, and optional comment. Edges may also have sub-relationships. For example, an edge between two domains may have a relationship of “described_by” with a sub-relationship of “name_server”. The properties look something like <“relationship”:”described_by”, “described_by”:”name_server”>. This chain of relationships may continue indefinitely, but in practice it is probably zero to two relationships deep at the most. The confidence is extremely important for edges. In a world of imperfect information and implied relationships, the better we capture the confidence of our data, the better we find truth within it. The confidence is a decimal from 0 to 1 representing our confidence in the data, with 0 being none and 1 being factual truth. This value is used in processing but is assumed to be 1 if not present on a relationship. The origin indicates how the relationship was learned. This is helpful both in making sure models do not use relationships they created to make additional decisions and in later assessing what contribution each data source made to a given classification.
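An edge in this schema might then look like the following. The source/destination identifiers and the origin value are hypothetical illustrations, not taken from the post.

```python
edge = {
    "relationship": "described_by",
    "described_by": "name_server",                     # sub-relationship
    "source": "attribute/domain/example.com",          # hypothetical id
    "destination": "attribute/domain/ns1.example.com", # hypothetical id
    "confidence": 0.9,  # implicitly 1 if omitted
    "origin": "whois",  # how the relationship was learned
}
```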

We have three node keys that are worthy of note: enrichment, classification, and record. Whenever we add data about a node to the graph, we want to capture that by adding a relationship from the node to an enrichment node. The enrichment node may have properties such as <class:attribute, key:enrichment, value:dns>. To follow the example, if we have a node <class:attribute, key:domain,>, we might use DNS to find the IP address. We’d then add a relationship from the node to the enrichment:dns node as well as from the node to the ip: node. A “classification” node is one of the most useful in the graph. The Apple Pay blog deals extensively with the node <class:attribute, key:classification, value:malice>, which indicates that any nodes sharing a relationship with it are malicious. We can also use classification for other things, such as types of malice, for example <class:attribute, key:classification, value:bot>, or for data classifications such as <class:attribute, key:classification, value:company_proprietary>. The importance of the classification nodes will be highlighted in the algorithm section of this post. Finally, records keep a unique identifier for relationships that are gained together. In Netflow, “tcp” is related to neither the source IP nor the destination IP. Instead, it would share a relationship with the Netflow record.  While it may be the atomic values within a record that are truly of interest to us, the record helps us connect values to ensure our algorithms can find the relationships between them.

This schema meets the needs of nodes that represent atomic truths and relationships that represent potentially uncertain signals of relationships between nodes. It ensures that as we learn more about the relationships between our nodes, the nodes that are related will become more densely interconnected. This has a very fundamental impact: when we are looking for the context for a topic, say an IP address, it is a search of the local area of the graph (i.e. the nodes which are within just a few relationships of the topic node). This is very important. Our graph will be extremely large, theoretically designed to encompass all knowledge. As such, we need any computations we do to happen on a subset of the graph. While real-world graphs almost always exhibit the small-world property (think six degrees of Kevin Bacon), by localizing our search and ignoring high-degree nodes, we minimize the computational power needed and the time expended answering questions with our graph.  Also, imperfection in our data gathering can be modeled through confidence, and the temporal nature of relationships through the time stamps.  All of this gives the schema the ability to answer otherwise unanswerable problems.

Ingestion Algorithms

Ingestion algorithms are focused on getting data into the graph.  Ingestion happens in two phases: enrichment and graph merging. A simple example of enrichment is resolving a domain to an IP address. Generally, enrichment begins with an observable. In our example, the observable is a domain name. We use the observable to query some service, in this case DNS, which then provides us a record with one or more IP addresses:

that returns:


Figure 3 – DNS Enrichment

The enrichment algorithm must standardize the data format. It does this by converting the data into a graph. While database records, JSON files, and XML files can all be converted to graphs, the import works best when it is tailored for the data source. This is because most data sources do not carry the full context of the data. In our case, our script will use the current time as the start time. Our domain will be described by each of the IP addresses as well as the enrichment “DNS”, as shown in Figure 3. We leave the confidence blank, as it is 100%, the default value. This example can be expanded out to any type of enrichment: whois data, routing information, passive DNS, malware characteristics, intelligence classifications, STIX observables, log records, Netflow, threat actor information, a sub-graph from another context graph (for true intelligence sharing), etc.; anything that can be imagined.
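The Figure 3 conversion can be sketched with networkx. Node identities here are (key, value) tuples, and the domain and IPs are placeholders; the real enrichment code's structure may differ.

```python
import networkx as nx

def dns_enrichment_to_graph(domain, ips, now):
    """Build the enrichment sub-graph for one DNS answer."""
    g = nx.MultiDiGraph()
    dom = ("domain", domain)
    g.add_node(dom, **{"class": "attribute"})
    # Capture that this data came from DNS enrichment.
    enr = ("enrichment", "dns")
    g.add_node(enr, **{"class": "attribute"})
    g.add_edge(dom, enr, relationship="described_by",
               origin="dns", startTime=now)
    # The domain is described by each resolved IP.
    for ip in ips:
        g.add_node(("ip", ip), **{"class": "attribute"})
        g.add_edge(dom, ("ip", ip), relationship="described_by",
                   origin="dns", startTime=now)
    return g
```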

Once the enrichment has been converted to a graph, it must be merged into the overall context graph. At its core, this is a very simple task performed by the function merge_titandb(). Because nodes are atomic, two steps must be completed: 1. All nodes in the enrichment graph but not in the context graph must be added to the context graph. 2. All edges in the sub-graph must be added to the context graph. Because of our schema, there is no need to translate anything from the sub-graph node to the context graph node (such as ID numbers). We simply search the context graph for the key:value pair associated with the nodes of a relationship and add it. The only complicating factor is the start and finish time. We have the option of simply duplicating relationships and allowing the query function to handle the duplicates. We may, however, choose to attempt to merge relationships. For example, if a relationship that starts at T2 and finishes at T3 between a pair of nodes exists in the context graph and we are adding a relationship between the same two nodes between times T1 and T4 with the same origin, we simply update the existing relationship’s times to T1 and T4 rather than duplicating the edge.

There are a few factors in our choice. The first is a trade-off between storage and speed. It is much faster to simply add the duplicate edge, but doing so increases storage and requires additional logic in our query algorithms. Second, adding duplicate edges removes the ability to refer to relationships by a deterministic URI such as <origin, relationship, source, destination>; instead, the ID number assigned by the database is necessary to uniquely identify an edge. Third, combining edges may hide changes that occurred within the larger time span. For example, if a domain resolved multiple ways over a time period, it may be hard to determine how it resolved at any exact time, as the merged start and finish times may imply relationships to two IPs simultaneously.

Subgraph Query Algorithm

Once the data we desire is in the context graph, we want to query it to find out about a specific observable. In keeping with our approach of using graphs, we will convert the topic observable into a graph. The topic can include multiple atomics, but functions best when using atomics that are expected to be closely related. If we need to know about multiple atomics, it is better to query them separately so that the returned graph represents only a single context. While we could hand-create our topics, it is simpler to define the topic as JSON dictionaries such as:
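For instance, a hypothetical topic pairing two closely related atomics (the values here are placeholders, not data from the post):

```python
# A topic as a JSON-style dictionary: two atomics expected to be
# closely related, each becoming an edgeless key:value node.
topic = {"domain": "example-applepay.com", "ip": "198.51.100.7"}
```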


Figure 4 – Topic Graph

We use a simple function (create_topic()) to convert the JSON dictionary into key:value nodes without edges in a sub-graph, as seen in Figure 4. This sub-graph forms our topic. (Note that nodes in a graph do not all need to be connected to each other and, in fact, do not need to be connected to anything at all.) We then send the topic to the get_titan_subgraph() function. If merge_titandb() is the function that gets data into the graph database, get_titan_subgraph() is its complement to get data out. The function takes a graph database handle, a topic, a maximum depth, and instructions on when not to retrieve neighbors of a node being added to the sub-graph. The following pseudo-code helps clarify how it works:
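In sketch form, the traversal might look like this (a hedged reconstruction using Networkx in place of the Titan back end; the function and parameter names are illustrative):

```python
from collections import deque
import networkx as nx

def get_subgraph(context, topic, max_depth=4,
                 dont_follow=("enrichment", "classification")):
    """Breadth-first walk outward from the topic nodes, stopping at
    max_depth. Out-edges are always followed; in-edges are not followed
    at hub nodes (enrichment, classification), so unrelated nodes that
    share only an enrichment are not pulled in.
    """
    seen = set(topic)
    frontier = deque((n, 0) for n in topic)
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        neighbors = list(context.successors(node))
        if context.nodes[node].get("key") not in dont_follow:
            neighbors += list(context.predecessors(node))
        for neighbor in neighbors:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return context.subgraph(seen)

# Toy context graph: topic domain "d" resolves to IP "a" and was DNS
# enriched; unrelated domains "x1" and "x2" share only the enrichment node.
g = nx.MultiDiGraph()
g.add_node("d", key="domain"); g.add_node("a", key="ip")
g.add_node("DNS", key="enrichment")
g.add_node("x1", key="domain"); g.add_node("x2", key="domain")
g.add_edge("d", "a"); g.add_edge("d", "DNS")
g.add_edge("x1", "DNS"); g.add_edge("x2", "DNS")
sub = get_subgraph(g, ["d"])
```

The traversal reaches the DNS node but does not walk back out of it, so x1 and x2 stay out of the result.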


A key point is that at some nodes we process only incoming edges, only outgoing edges, or no edges at all. A good example of where we would want to do this is the <attribute:class, key:enrichment, value:DNS> node from the earlier example. A large number of nodes are likely to be connected to the DNS node, and most of them will have nothing to do with the topic, sharing only the fact that they were also enriched. To avoid retrieving all of these unnecessary nodes, we will not follow any relationship from destination to source whose destination has a key of “enrichment”. We will do the same for the key “classification”, for the same reason.

We also stop processing at a certain depth. Due to the small-world principles referenced earlier, a max depth of four or five is the most we want to retrieve; otherwise we risk receiving the majority of the context graph in response (though stopping processing at nodes with many relationships, such as enrichment and classification nodes, should buffer us somewhat from this). At the end of this process, we have retrieved a sub-graph from the overall context graph that describes our topic.

To Be Continued…

In our next post we will start with an explanation of the theory behind our context graph.  We will also discuss the scoring and presentation algorithms.  The post will conclude with an explanation of the infrastructure that our context graph resides on.  Stay Tuned!

Context Graph Based Analysis of Apple Pay Domains – Part 1 of 3

Executive Summary

Intelligence is a hot topic in information security. However, it is widely known that most of what is referred to as ‘intelligence’ is really ‘data’: IPs, domains, and executables categorized as malicious.  In the following posts I apply the concept of a context graph to analyze that data and produce true intelligence, which can provide new indicators with associated confidences.  These new indicators may not yet have done anything malicious, but are instead found ‘guilty by association’.  They can be consumed by automated security equipment, by SIEM systems, or by analysts to assist in prioritizing investigations and collecting the context of incidents.

Property graphs are not a new thing.  Graphs in the mathematical sense date back to Euler’s Königsberg bridge problem of 1736.  However, Verum, a new context graph system consisting of a unique schema, novel use of graph algorithms, and cutting-edge infrastructure, enables analysis that was not previously available.  This and subsequent posts document both the application of Verum by example through the analysis of Apple Pay domains and the Verum system itself.  And rather than hide behind the anonymity of ‘trade secrets’, we publish the Verum algorithms and schema for critique, and welcome use by, and feedback from, the infosec community.


As previously summarized in this post, the announcement of Apple Pay has caused a spike in the creation of Apple Pay-related domains. Here we follow up the previous post with an analysis of their relationships, using multifaceted enrichment and graphs to structure the data.  This post focuses on the analysis of the Apple Pay domains using a context graph. For a description of Verum, the schema, algorithms, and infrastructure which powers this analysis, see the companion post: Introducing Verum: A Context Graph System – Part 1 of 2.

Visually, graphs provide a middle ground between charts and raw data. When looking at raw data, you can only take in a small amount at a time. Graphs allow you to visually analyze the relationships in data, both hierarchical and circular, at a larger scale than tables of raw data allow, but they do not scale as well as charts. Charts make one-dimensional features more obvious at the expense of higher-dimensional perspectives.  For example, we can analyze the relationships of tens of thousands of domains, IPs, name servers, etc. simply by looking at a graph.  In this post we will regularly look at graphs and sub-graphs; however, the true strength of graphs lies in the mathematical algorithms behind them.  We will use some of those algorithms, plus standard statistical calculations, to analyze our data.

Figure 1


Figure 2


We’ll start by looking at the overarching graph structure. Figure 1 shows our full graph. It represents roughly 300 clusters, 13,500 nodes, and 45,000 edges. Initially we will use a subset of the data, shown in Figure 2, for our visual examples; the algorithmic analysis, however, will all be run against the full data set. As we move to individual cluster analysis, we will analyze the clusters within the full graph.  All the figures in this blog post were built using Gephi or Python. Gephi is a very feature-complete tool for working with graphs.  While we are primarily using its visualization features (mainly the OpenOrd and ForceAtlas 2 layouts and various partitioning methods for coloring the graph), it also has robust filtering, statistical, and other analytical tools.

The majority of our analysis will be done using Python. Python has an excellent graph package, Networkx, which we will use to store our graph. Networkx provides a host of functionality similar to Gephi’s, but better supports scripting.

A Little Graph Theory

Our graph is based on a schema documented through this blog post and the associated comments.  Each node in the graph represents an atomic tuple such as “ip”:”″ or “classification”:”malice”. The first value is stored as a property named “key” on the nodes with the second stored as “value”. Nodes may have other properties such as the cluster they belong to.  In our case, we will primarily add additional properties to the graph after we’ve retrieved it.  Nodes are connected with relationships (also known as edges).  Our relationships can primarily be read as “described by”.  Therefore a domain “”-[described by]->”ip:″.  Because our relationships have a source node and a destination node, our graph is considered directed.  A node may be the source and destination of any number of relationships.  Because we allow multiple edges to start at the same source node and end at the same destination node, our graph is a multigraph, (technically a multidigraph as it’s directed).
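As a concrete illustration, the schema can be instantiated in Networkx (the domain and IP here are placeholder values, not data from the post):

```python
import networkx as nx

# Atomic key/value nodes joined by directed "described by" relationships.
g = nx.MultiDiGraph()
g.add_node("domain:example.com", key="domain", value="example.com")
g.add_node("ip:198.51.100.7", key="ip", value="198.51.100.7")

# A domain is "described by" its IP. Parallel edges (e.g. from two
# different enrichments) are allowed because the graph is a multigraph.
g.add_edge("domain:example.com", "ip:198.51.100.7",
           relationship="described by", origin="pDNS")
g.add_edge("domain:example.com", "ip:198.51.100.7",
           relationship="described by", origin="DNS")
```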

Our graph is formed by finding nodes related to a topic.  In our case, the topic is the set of domains related to Apple Pay that were identified in the previous post.  We start with that set of domains, ensure they have been enriched with whois, ASN, BGP, passive DNS (pDNS) and intelligence feed data in the context graph database.  We then retrieve the nodes which represent the topic domains and all nodes that are within only a few relationships of the topic nodes.  We discuss the algorithm for retrieving the graph from the context graph database in the companion post.

Figure 2 represents the retrieved graph with a subset of the topics and enrichments.  It also shows the graph colored by clusters. A cluster is a set of nodes that are interrelated.  Clusters are generated by calculating modularity using the communities package associated with Networkx. To calculate the clusters we run:
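A sketch of that computation follows; as a stand-in for the Louvain-style communities package the post references, it uses Networkx’s built-in greedy modularity routine, and the node-filtering and edge-collapsing steps are hypothetical reconstructions:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cluster(g):
    """Drop hub node types, collapse the multidigraph to a weighted
    undirected graph, then compute modularity-based communities."""
    drop = {"enrichment", "classification", "record"}
    h = g.subgraph(n for n, d in g.nodes(data=True) if d.get("key") not in drop)
    # Collapse parallel edges; the duplicate count becomes a confidence weight.
    undirected = nx.Graph()
    undirected.add_nodes_from(h.nodes(data=True))
    for u, v in h.edges():
        if undirected.has_edge(u, v):
            undirected[u][v]["weight"] += 1
        else:
            undirected.add_edge(u, v, weight=1)
    return [set(c) for c in greedy_modularity_communities(undirected)]

# Toy example: two triangles joined by a single bridge edge.
g = nx.MultiDiGraph()
for n in "abcxyz":
    g.add_node(n, key="domain")
for u, v in [("a", "b"), ("b", "c"), ("a", "c"),
             ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]:
    g.add_edge(u, v)
communities = cluster(g)
```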

You will notice we removed three types of nodes: enrichments, classifications, and records. These three types of nodes are necessary for other calculations, but tend to connect clusters that are not truly related. Our algorithm requires an undirected network; however, our graph begins as a multidigraph.  To deal with this, we collapse all edges with the same source and destination into a single edge and store the number of duplicate edges as increased confidence.  We then use the built-in Networkx function to convert to an undirected graph.

In our visualization, we have roughly 40 clusters. In general, these clusters represent separate sets of infrastructure. Our next step is to analyze those clusters. In addition to the key and value, each node has a few pre-computed properties, including its relevance to the topic (the Apple Pay domains), its distance from the topic, its relevance to the malice node, and its distance to the malice node. (The malice node is a classification added by enrichment from intelligence sources. An edge from a node to the malice node effectively indicates the source node has been identified as malicious.) A follow-on blog to the companion post will address how these values were computed.

Size Analysis

First let’s get an idea of the size of each cluster. For our 40-cluster graph, we can plot a histogram of cluster sizes:
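The histogram’s input can be computed directly from the cluster assignment; a minimal sketch (the node-to-cluster assignment here is made up for illustration):

```python
from collections import Counter

# Hypothetical node -> cluster id assignment, as produced by the
# modularity step above.
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 2}

sizes = Counter(clusters.values())    # cluster id -> number of nodes
histogram = Counter(sizes.values())   # cluster size -> number of clusters
```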

Figure 4


As we can see in Figure 4, we have an outlier: a cluster of 350 nodes. We need to remove the outliers, but because the data is not normally distributed, we need a more robust method for finding outliers (estimating scale) than the standard deviation. We will use Maronna and Zamar’s robust tau. I have transcoded an implementation from R, which can be found here. This is the equivalent of using the standard deviation for a normal distribution. We will use the geometric median, the equivalent of the mean for a normal distribution, for estimating the center of the distribution. As our data is one-dimensional, the geometric median is simply the median.
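A self-contained Python sketch of the idea follows. This is an illustration, not the transcoded implementation the post links to; the constants c1 = 4.5 and c2 = 3 are the customary defaults for the tau estimate, and outlier_cutoff() is a hypothetical helper:

```python
import statistics

def tau_scale(xs, c1=4.5, c2=3.0):
    """Robust tau estimate of scale (after Maronna and Zamar).

    Plays the role the standard deviation plays for a normal
    distribution, but is barely influenced by outliers.
    """
    m = statistics.median(xs)
    s0 = statistics.median(abs(x - m) for x in xs)  # the MAD
    if s0 == 0:
        return 0.0
    # Weighted location estimate with a smooth redescending weight.
    def w(u):
        return (1 - (u / c1) ** 2) ** 2 if abs(u) <= c1 else 0.0
    weights = [w((x - m) / s0) for x in xs]
    mu = sum(wi * x for wi, x in zip(weights, xs)) / sum(weights)
    # Capped squared deviations give the tau scale.
    rho = [min(((x - mu) / s0) ** 2, c2 ** 2) for x in xs]
    return s0 * (sum(rho) / len(xs)) ** 0.5

def outlier_cutoff(xs, k=3.0):
    """Center (median) plus k robust-scale units, analogous to mean + k*sd."""
    return statistics.median(xs) + k * tau_scale(xs)
```

With one extreme value in the data, the tau scale stays close to the scale of the bulk of the points while the standard deviation blows up, so the cutoff cleanly separates the outlier.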

Figure 5


In Figure 5, we see a bit more resolution. Clearly there are groupings of smaller clusters at the low end, followed by multiple clusters with a few dozen nodes. These are in addition to the outlier clusters we filtered out.

Figure 6


Figure 7


If we analyze the larger graph of 769 applepay nodes, we see a similar distribution at scale. In Figure 6, we notice the same pattern as before, just with more clusters and a longer tail. If we compute the outlier threshold and graph the clusters below it, we notice in Figure 7 that the cutoff is much lower thanks to a better sample size. Still, the cutoff around 15 nodes persists, similar to our smaller graph.  In the manual analysis section of part 2 of this blog, we will use the data we collect in the following analysis to draw conclusions about clusters of various sizes.

Topic Distribution

Figure 3 Legend


Figure 3


Next let’s look at the topics in each cluster. Figure 3 has nodes colored by their distance from the topic.  The distance stops at four because that is the maximum distance we requested when we extracted the graph from the database. Due to the small-world properties of real-world graphs, retrieving nodes at a distance of more than four or five is unrealistic. We see that most clusters have one or more topics in them. However, there are a few clusters to the lower left of the central cluster that contain zero topics. We will include those in our per-cluster analysis later.

Next Post

In our next post, we will analyze the distribution of the Apple Pay topic nodes within the graph as well as the distribution of malice in the graph.  We will then manually verify the statistical analysis by individually analyzing specific clusters.  Stay Tuned!

Why every board needs an independent InfoSec advisor

In the last year, we have witnessed some very large breaches, some of which were sophisticated. Analysts have learned many lessons and written extensively about these data breaches. One important lesson is that the boards of large organizations may not be fully equipped with information about the threats facing their business. As a result, corporate boards may not be empowered with actionable information that allows them to prioritize and create effective cyber-security strategies, and we continue to see larger and more numerous data breaches.

Some of the major reasons for continued data breaches are as follows:

  1. Very few board members themselves have a practitioner’s background in, and a good grasp of, issues relating to information security.
  2. Although many attacks use the same basic techniques to infiltrate corporate networks, some have become sophisticated, using custom malware, and require in-depth knowledge and skills to understand and rectify.
  3. Boards get advice from their CISOs or CIOs about the current state of information security. Since these internal resources are responsible for running the internal security programs, their advice may not be as objective as that of an independent InfoSec advisor.
  4. Many times the IT leadership team may set its own agenda and preferences, e.g. insourcing or outsourcing initiatives, instead of making rational decisions among a combination of options based upon the strengths and weaknesses of the organization.
  5. In some cases, security is still considered an IT problem instead of a business imperative. This is evident from the fact that in most of the companies that had data breaches, the CISO still reported to the CIO, or the CISO role did not exist in any meaningful way at all.

The impact of mega data breaches is too big to be taken lightly. The questions in front of boards are not easy to answer: What is the current threat landscape? How is it going to impact our business? What is the appropriate level of investment needed to protect the business from cyber threats? How effective is the current security program? How can we minimize the impact of a data breach through early detection? How will we respond to a breach? Who are our adversaries, and why do they want to attack us? The list is long, and many times the answers are not simple.

Based upon these and other considerations, corporate boards have several choices about how to protect their organizations. Among the available options, boards may choose to consider the following two:

  1. Induct new board members who are recognized in the information security industry and have in-depth knowledge of current and emerging threats.
  2. Hire an independent security advisor (with specific industry vertical experience) who can help the board understand the threat landscape and provide an objective analysis of the effectiveness of the current information security program against these threats.

While the first option may seem to be a permanent solution, it is not an easy task to find qualified board members who are well versed in matters of information security. A shorter path could be the second option. Finding a trusted advisor from a partner company with broad industry knowledge and expertise could be a short-term fix. This person can advise the board on a quarterly basis by reviewing the current information security program and strategy, the governance model, and the current threat landscape. This will greatly help boards make timely, objective, and effective decisions to lower business risk.

Clearly, information security is a board level issue and not an IT problem anymore. As we have seen in recent data breaches, boards can’t solely rely on internal resources to make the best possible cyber security decisions. An independent InfoSec advisor can help board members understand the business threats, form responses to those threats, and enable risk-based decision making where the priorities of the business can be weighed against the costs associated with its protection.

Weekly Intelligence Summary Lead Paragraph: 2014-10-31

Halloween is here, and the security space received its fair share of frights this week. Drupal made the ominous announcement that if you didn’t install the latest update to its content management software within seven hours of its release, you should operate under the assumption your website is compromised. There’s a basis for that statement: not long after Drupal patched a critical SQL injection vulnerability, automated attacks began appearing in the wild. Victims with RIG EK in their Drupal include Popular Science and Typepad. Drupal’s announcement includes tips for remediation if your site was compromised. FireEye released its report on APT28, a group that also goes by the name Sednit or Sofacy depending on which vendor’s report you read. It provides more intelligence on the Russia-based group’s attacks on Eastern European countries and military organizations. And speaking of Russian threat actors, some are being blamed for breaching unclassified White House networks. Emphasis on unclassified. Novetta and its partners (FireEye, F-Secure, Microsoft, et al.) released their anticipated report on Operation SMN and the Axiom threat group, as promised. It contains indicators and IDS signatures galore. In addition to FireEye’s analysis of APT28 and Novetta’s report on Axiom, Damballa’s State of Infections Q3 2014 report and Akamai’s State of the Internet Q3 2014 report are both recommended reading.