<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BrightPlanet</title>
	<atom:link href="https://brightplanet.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://brightplanet.com</link>
	<description>Deep Web Intelligence by BrightPlanet</description>
	<lastBuildDate>Sun, 02 Jun 2019 18:32:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>
	<item>
		<title>AMPLYFI- Data and Beyond</title>
		<link>https://brightplanet.com/2018/05/24/amplyfi-data-and-beyond/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Thu, 24 May 2018 18:47:41 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[deep web]]></category>
		<category><![CDATA[Global News Data Feed]]></category>
		<category><![CDATA[open source intelligence tools]]></category>
		<category><![CDATA[OSINT]]></category>
		<category><![CDATA[risk management]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8678</guid>

					<description><![CDATA[Amplyfi is one of BrightPlanet’s Data-as-a-Service partners leveraging large-scale, open-source data from the Surface Web and Deep Web to build business intelligence for its clients. Our business and technology relationship has spanned years, and we are excited to see its market growth as a leader in artificial intelligence. NatWest Business Hub Article On May 10, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Amplyfi is one of BrightPlanet’s Data-as-a-Service partners leveraging large-scale, open-source data from the Surface Web and Deep Web to build business intelligence for its clients. Our business and technology relationship has spanned years, and we are excited to see its market growth as a leader in artificial intelligence.<span id="more-8678"></span></p>
<p><a href="http://www.natwestbusinesshub.com/content/4e12978a-f1b6-b77d-b8bd-3c46f9bafa62"><img fetchpriority="high" decoding="async" class="alignnone wp-image-8682 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/05/2018-05-24_1343.png" alt="Amplyfi headquarters" width="990" height="824" srcset="https://brightplanet.com/wp-content/uploads/2018/05/2018-05-24_1343.png 990w, https://brightplanet.com/wp-content/uploads/2018/05/2018-05-24_1343-300x250.png 300w, https://brightplanet.com/wp-content/uploads/2018/05/2018-05-24_1343-768x639.png 768w" sizes="(max-width: 990px) 100vw, 990px" /></a></p>
<h1>NatWest Business Hub Article</h1>
<p>On May 10, the NatWest Business Hub published a great story about Amplyfi’s growth, market success, dedication to entrepreneurship, and long-term vision. In an interview with Chris Ganje, Amplyfi co-founder and CEO, the piece deep dives into why Amplyfi is growing its business in Cardiff, the challenges of hiring the best talent while maintaining culture, and what Ganje’s vision is for the future.</p>
<p>Amplyfi continues its trajectory as a fast-moving startup, and BrightPlanet continues to expand our Data-as-a-Service partnership with the company. If you have not read the full article, check it out at: <a href="http://www.natwestbusinesshub.com/content/4e12978a-f1b6-b77d-b8bd-3c46f9bafa62">http://www.natwestbusinesshub.com/content/4e12978a-f1b6-b77d-b8bd-3c46f9bafa62</a></p>
<h1>Conclusion</h1>
<p>BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Keeping up with the constantly changing Deep Web, BrightPlanet has developed the solutions that work</title>
		<link>https://brightplanet.com/2018/05/10/keeping-up-with-the-constantly-changing-deep-web-brightplanet-has-developed-the-solutions-that-work/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Thu, 10 May 2018 21:16:50 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[big data from the Deep Web]]></category>
		<category><![CDATA[dark web]]></category>
		<category><![CDATA[Dark Web search]]></category>
		<category><![CDATA[data as a service]]></category>
		<category><![CDATA[deep web]]></category>
		<category><![CDATA[deep web harvest]]></category>
		<category><![CDATA[open source intelligence tools]]></category>
		<category><![CDATA[unstructured data]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8668</guid>

					<description><![CDATA[Website structures are constantly changing. You might be surprised how often websites swap formatting, themes, or their entire layouts. These changes will typically break custom harvest scripts from inexpensive or roll-your-own harvesting solutions. BrightPlanet’s harvest engine and quality assurance solution are designed to be robust and fault tolerant to these types of website changes. Our [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Website structures are constantly changing. You might be surprised how often websites swap formatting, themes, or their entire layouts. These changes will typically break custom harvest scripts from inexpensive or roll-your-own harvesting solutions.</p>
<p>BrightPlanet’s harvest engine and quality assurance solution are designed to be robust and fault tolerant to these types of website changes. Our harvest engine will continue to harvest and provide value without the need for external human intervention, because we typically leverage unstructured keyword rules instead of hardwired rules.<span id="more-8668"></span></p>
<p>In our previous post, named “<a href="http://10.0.0.183:8085/2018/04/all-websites-are-not-created-equal-brightplanet-knows-how-to-harvest-the-exact-data-clients-need-whether-it-is-deep-web-dark-web-or-surface-web-content/">All websites are not created equal</a>”, we talked specifically about our techniques for harvesting the best quality content from websites. Today, we are going to discuss how we ensure that clients’ harvests continue to operate even after the websites change.</p>
<p><img decoding="async" class="alignnone wp-image-8673 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/05/blog.png" alt="" width="1024" height="486" srcset="https://brightplanet.com/wp-content/uploads/2018/05/blog.png 1024w, https://brightplanet.com/wp-content/uploads/2018/05/blog-300x142.png 300w, https://brightplanet.com/wp-content/uploads/2018/05/blog-768x365.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></p>
<h1>Keeping Your Harvests Simple</h1>
<p>If there is one thing we have learned over the <em>last 18 years of web scraping</em>, it is to keep your harvests simple. I don’t mean harvesting simple websites. Instead of worrying about which text nodes to process and which to ignore while harvesting data, let your analytics and unstructured text rules do the heavy lifting.</p>
<p>Sites requiring a large number of custom steps, extractions, and hardwired rules are going to be the ones that break easily. BrightPlanet takes an approach of loosening up harvest rules to ensure we’re pulling content, but then tightening restrictions on the analytics, qualifications, and post-harvest filtering.</p>
<p>For example, we often restrict a harvest based on URL path, using either substring matching or regular expressions, instead of defining a series of rules to determine which links should be followed. It may sound simple, but it is very effective. If you are still picking up false-positive links, use unstructured keyword filters to polish the data set.</p>
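<p>To make the idea concrete, here is a minimal sketch of a URL path filter in Python. It is not BrightPlanet’s harvest engine; the function name, the <em>/world/</em> substring, and the dated-article pattern are all hypothetical choices made for illustration.</p>

```python
import re
from urllib.parse import urlparse

# Hypothetical rules: follow links under a "/world/" section (substring match)
# or links whose path looks like a dated article (regular expression).
SECTION_SUBSTRING = "/world/"
ARTICLE_PATTERN = re.compile(r"^/\d{4}/\d{2}/\d{2}/[\w-]+/?$")


def should_follow(url: str) -> bool:
    """Return True when the URL path passes either loose rule."""
    path = urlparse(url).path
    if SECTION_SUBSTRING in path:
        return True
    return ARTICLE_PATTERN.match(path) is not None


candidate_links = [
    "https://example.com/world/asia/story.html",
    "https://example.com/2018/05/10/some-post/",
    "https://example.com/sports/scores",
]
followed = [link for link in candidate_links if should_follow(link)]
```

<p>Two loose rules like these survive a redesign far better than a chain of page-specific selectors, since they only depend on the URL structure.</p>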
<h1>Leaving Rules As Unstructured</h1>
<p>There is only so much you can do with harvest filters; sometimes you need to jump into the unstructured content to finalize content quality.</p>
<p>In its simplest form, we leverage large keyword lists and require webpages to have one or more of the keywords to keep the content. Our list processing system allows us to easily process up to thousands of keywords.</p>
<p>A great example of this filtering is determining whether a webpage is selling a product, such as a pharmaceutical drug. We can provide a list of tens of thousands of drug names and then require that at least one of them appears on the page; otherwise, the page is rejected.</p>
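<p>A keyword filter of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production list processing system: the helper names and the tiny sample list of drug names are hypothetical, and a single compiled alternation is only one of several ways to match a large list.</p>

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    # Longest keywords first so multi-word names are preferred over their prefixes.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def keep_page(text, pattern, minimum_hits=1):
    """Keep the page when at least `minimum_hits` distinct keywords appear in it."""
    hits = {match.group(0).lower() for match in pattern.finditer(text)}
    return len(hits) >= minimum_hits


# Hypothetical sample list; a real list could hold tens of thousands of names.
drug_names = ["ibuprofen", "amoxicillin", "metformin"]
pattern = build_keyword_pattern(drug_names)

kept = keep_page("Buy Ibuprofen 200mg online today", pattern)
rejected = keep_page("Local weather report for Tuesday", pattern)
```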
<h1>Quality Assurance Checks</h1>
<p>After the harvest events are defined for a project, we integrate ongoing monitoring to ensure harvests are producing an expected number of documents over time. This will vary from project to project, source to source, and day to day. Using a standard deviation calculation and historic trending data, we can easily tell if a harvest needs to be reviewed or is operating within its expected boundaries.</p>
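<p>As a rough sketch of such a check, the Python below flags a harvest when today’s document count falls too far from its historic mean. The function name, the threshold of two standard deviations, and the sample counts are assumptions for illustration only.</p>

```python
from statistics import mean, stdev


def needs_review(history, today, threshold=2.0):
    """Flag a harvest whose count today deviates from the historic mean
    by more than `threshold` sample standard deviations."""
    if len(history) >= 2:
        average = mean(history)
        spread = stdev(history)
        if spread == 0:
            # A perfectly flat history: any change at all is worth a look.
            return today != average
        return abs(today - average) > threshold * spread
    # Not enough history to judge yet.
    return False


# Hypothetical daily document counts for one harvest event.
daily_counts = [100, 110, 95, 105, 98, 102, 107]

normal_day = needs_review(daily_counts, 104)
broken_day = needs_review(daily_counts, 12)
```

<p>Because the boundaries come from each harvest’s own history, the same check works for sources that produce ten documents a day and sources that produce ten thousand.</p>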
<h1>Conclusion</h1>
<p>BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>All websites are not created equal. BrightPlanet knows how to harvest the exact data clients need, whether it is Deep Web, Dark Web or Surface Web content.</title>
		<link>https://brightplanet.com/2018/04/20/all-websites-are-not-created-equal-brightplanet-knows-how-to-harvest-the-exact-data-clients-need-whether-it-is-deep-web-dark-web-or-surface-web-content/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Fri, 20 Apr 2018 13:56:22 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[Financial Industry]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[big data from the Deep Web]]></category>
		<category><![CDATA[dark web]]></category>
		<category><![CDATA[Dark Web search]]></category>
		<category><![CDATA[data as a service]]></category>
		<category><![CDATA[data harvesting]]></category>
		<category><![CDATA[deep web]]></category>
		<category><![CDATA[deep web harvest]]></category>
		<category><![CDATA[deep web harvesting]]></category>
		<category><![CDATA[deep web search]]></category>
		<category><![CDATA[open source intelligence tools]]></category>
		<category><![CDATA[OSINT]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8662</guid>

					<description><![CDATA[BrightPlanet provides terabytes of data for various analytic projects across many industries. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines, data visualizations, or reports. In this blog series, we are going to be diving [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>BrightPlanet provides terabytes of data for various analytic projects across many industries. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines, data visualizations, or reports. In this blog series, we are going to be diving into the techniques and processes that we bring to each project.</p>
<p>In our previous post named “<a href="http://10.0.0.183:8085/2018/03/harvest-lot-websites-clients-know-sites-harvest-first-place/">We harvest a lot of websites for our clients, but how do we know which sites to harvest in the first place?</a>”, we talked specifically about finding valuable websites. Today, we are going to discuss how we decide the best ways to harvest content for our clients.<span id="more-8662"></span></p>
<div id="attachment_8630" style="width: 1040px" class="wp-caption aligncenter"><img decoding="async" aria-describedby="caption-attachment-8630" class="wp-image-8630 size-large" src="http://10.0.0.183:8085/wp-content/uploads/2018/02/2018-02-08_1524-1024x578.png" alt="Data-as-a-Service" width="1030" height="582" srcset="https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524-1024x578.png 1024w, https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524-300x169.png 300w, https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524-768x434.png 768w, https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524.png 1098w" sizes="(max-width: 1030px) 100vw, 1030px" /><p id="caption-attachment-8630" class="wp-caption-text">Data-as-a-Service</p></div>
<h1>Determining What Content to Harvest</h1>
<p>Before we set up any harvests, we always sit down with our clients and perform a client onboarding process, or harvest audit, to determine exactly which data makes the most sense to harvest. It is not sufficient to know only the website domain; we need to know which information to extract from those websites.</p>
<p>Think of a large news website, such as CNN. It is not practical, nor is it useful, to harvest the entire website. For example, if our client is looking for North Korean threats, we will target only the sections within CNN which focus on that topic, such as world news, while excluding sports, weather, lifestyle, etc.</p>
<p>Once we have defined our target content, each site may need to be reviewed and processed individually, depending on how targeted the data must be. Each project is different; sometimes a broad harvest with sufficient filtering is enough.</p>
<p>Each harvest event can be defined with term filters, URL filters, domain filters, depth filters, and more. Filters can be single items, multiple items, regular expressions, or even a massive list of keywords. There are no practical limits to our filtering system, and many filters are applied as the harvest runs, further optimizing harvest efficiency.</p>
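<p>One way to picture a harvest event is as a small bundle of filters that are checked while the harvest runs. The Python sketch below is purely illustrative: the <em>HarvestEvent</em> class, its field names, and the example domain are hypothetical and do not describe our actual configuration format.</p>

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class HarvestEvent:
    """Hypothetical bundle of domain, URL, depth, and term filters."""
    allowed_domains: set = field(default_factory=set)
    url_pattern: str = ""
    terms: set = field(default_factory=set)
    max_depth: int = 2

    def accepts_link(self, url, depth):
        """Checked before fetching, so rejected links cost nothing to harvest."""
        if depth > self.max_depth:
            return False
        parsed = urlparse(url)
        if self.allowed_domains and parsed.hostname not in self.allowed_domains:
            return False
        if self.url_pattern and not re.search(self.url_pattern, parsed.path):
            return False
        return True

    def accepts_text(self, text):
        """Checked after fetching: keep the page if any term appears in it."""
        if not self.terms:
            return True
        lowered = text.lower()
        return any(term.lower() in lowered for term in self.terms)


event = HarvestEvent(
    allowed_domains={"news.example.com"},
    url_pattern=r"^/world/",
    terms={"missile"},
    max_depth=3,
)
```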
<h2>Choosing the Right Harvest Techniques</h2>
<p>BrightPlanet’s harvest solution contains many different techniques and harvesters allowing us to pick and choose the most efficient way to harvest content for each website. We are not limited to a simple web crawl. This allows us to spend less time harvesting content and more time processing the data, <a href="http://10.0.0.183:8085/2017/04/rosoka/">a topic we previously covered here</a>.</p>
<p>Since we utilize multiple harvest engines, we can easily choose the correct harvesting techniques for the website and client’s content needs.</p>
<p>For example, going back to our earlier CNN example, say we are only interested in North Korean missile testing. Instead of harvesting thousands of irrelevant world news documents looking for the needle in a haystack, we can leverage our Deep Web harvest engine to customize a search of the CNN website looking for specific keywords, ordered by publication date, and filtered by sub-sections and keywords. Now we’re harvesting only extremely relevant documents.</p>
<p>Our Data Acquisition Engineers may even leverage different harvest techniques for the same website, if necessary. Perhaps we need to perform an initial harvest of archived content and then perform daily updates of new content. It is not necessary to constantly harvest old data; we would create one harvest to grab all content on the website. A second harvest (typically pointed to an RSS feed or home page) would monitor for new documents.</p>
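<p>The second, monitoring harvest can be sketched as a function that reads an RSS feed and returns only the items it has not seen before. This is a simplified illustration, not our harvester: the function name, the in-memory set of seen identifiers, and the sample feed are all assumptions, and a real monitor would fetch the feed over the network and persist what it has seen.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>https://news.example.com/first</link>
      <guid>https://news.example.com/?p=1</guid>
    </item>
    <item>
      <title>Second story</title>
      <link>https://news.example.com/second</link>
      <guid>https://news.example.com/?p=2</guid>
    </item>
  </channel>
</rss>
"""


def new_items(feed_xml, seen_guids):
    """Return feed items whose guid (or link) has not been seen; record them as seen."""
    root = ET.fromstring(feed_xml.strip())
    fresh = []
    for item in root.iter("item"):
        identifier = item.findtext("guid") or item.findtext("link")
        if identifier and identifier not in seen_guids:
            seen_guids.add(identifier)
            fresh.append({
                "guid": identifier,
                "title": item.findtext("title"),
                "link": item.findtext("link"),
            })
    return fresh


seen = set()
first_pass = new_items(SAMPLE_FEED, seen)   # every item is new on the first run
second_pass = new_items(SAMPLE_FEED, seen)  # nothing new when the feed is unchanged
```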
<h2>Advanced Deep Web and Dark Web Tips</h2>
<p>Targeting makes a huge difference when it comes to Dark Web content. Dark Web websites often cover many topics, many of them being irrelevant to our client’s needs. Having the ability to quickly filter only relevant content channels allows us to be more efficient and also prevents our harvesters from being blocked &#8211; critical when harvesting Dark Web sites.</p>
<p>Another technique we often leverage involves multiple passes over the same website to pre-process content. Once an initial pass is performed, we will process the data that was harvested to build intelligence into how we should monitor the site over a longer period of time. This pre-processed data is typically thrown away since it is not relevant enough to provide value.</p>
<p>Deep Web and Dark Web content may also be scattered with irrelevant content, or content meant to obfuscate valuable data. A broad harvest without filtering is often used to locate relevant content, which may then be re-harvested using additional filters in a new harvest event. This allows us to curate higher quality data with little additional work.</p>
<h1>Conclusion</h1>
<p>BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Interview with Mikhail Shengeliya of Eagle Alpha: OSINT Data Collection Challenges &#038; Solutions</title>
		<link>https://brightplanet.com/2018/04/04/interview-with-mikhail-shengeliya-of-eagle-alpha-osint-data-collection-challenges-solutions/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Wed, 04 Apr 2018 21:03:09 +0000</pubDate>
				<category><![CDATA[White Papers and Publications]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data as a service]]></category>
		<category><![CDATA[data harvesting]]></category>
		<category><![CDATA[Eagle Alpha]]></category>
		<category><![CDATA[open source intelligence tools]]></category>
		<category><![CDATA[OSINT]]></category>
		<category><![CDATA[partnership]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8651</guid>

					<description><![CDATA[Will Bushee, BrightPlanet&#8217;s Vice President of Technology, recently sat down with Mikhail Shengeliya from Eagle Alpha to discuss various topics, including: Challenges which come with harvesting open-source content Solutions for those challenges Specific project use-cases, such as harvesting job postings Eagle Alpha provides a full-service platform enabling asset managers to obtain solutions from alternative data [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Will Bushee, BrightPlanet&#8217;s Vice President of Technology, recently sat down with Mikhail Shengeliya from <a href="https://eaglealpha.com/">Eagle Alpha</a> to discuss various topics, including:</p>
<ol>
<li>Challenges which come with harvesting open-source content</li>
<li>Solutions for those challenges</li>
<li>Specific project use-cases, such as harvesting job postings</li>
</ol>
<p>Eagle Alpha provides a full-service platform enabling asset managers to obtain solutions from alternative data providers, such as BrightPlanet.</p>
<p><img loading="lazy" decoding="async" class=" size-full wp-image-1938 aligncenter" src="http://eaglealpha.com/wp-content/uploads/2017/09/image-for-linkedin-png.png" alt="IMAGE FOR LINKEDIN PNG" width="853" height="436" /></p>
<p>The interview is only available to subscribers of Eagle Alpha&#8217;s newsletter, or for partners within Eagle Alpha&#8217;s internal partnership portal. If you’d like to jump on their mailing list, you can <a href="https://eaglealpha.com/newsletter/">sign up here</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>We harvest a lot of websites for our clients, but how do we know which sites to harvest in the first place?</title>
		<link>https://brightplanet.com/2018/03/19/harvest-lot-websites-clients-know-sites-harvest-first-place/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Mon, 19 Mar 2018 16:29:57 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Dark Web search]]></category>
		<category><![CDATA[data as a service]]></category>
		<category><![CDATA[data harvesting]]></category>
		<category><![CDATA[deep web]]></category>
		<category><![CDATA[deep web search]]></category>
		<category><![CDATA[unstructured data]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8645</guid>

					<description><![CDATA[BrightPlanet has provided terabytes of data for various analytic projects across many industries over the years. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines or final reports. The first phase of all projects &#8211; [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>BrightPlanet has provided terabytes of data for various analytic projects across many industries over the years. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines or final reports.</p>
<p>The first phase of all projects &#8211; <em><strong>we have three phases total</strong></em> &#8211; is harvesting. It is not uncommon for a single project to have 10,000-20,000 websites from which we harvest content. While a large volume of websites is impressive, the more important element is knowing which websites are the most appropriate for the current task. This is where our Data Acquisition Engineers become a critical part of the solution.</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-8647 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/03/blog-source-identification.png" alt="" width="930" height="521" srcset="https://brightplanet.com/wp-content/uploads/2018/03/blog-source-identification.png 930w, https://brightplanet.com/wp-content/uploads/2018/03/blog-source-identification-300x168.png 300w, https://brightplanet.com/wp-content/uploads/2018/03/blog-source-identification-768x430.png 768w" sizes="auto, (max-width: 930px) 100vw, 930px" /></p>
<h2>Techniques for Identifying Websites</h2>
<p>BrightPlanet has developed several techniques for locating websites; some are trade secrets while others are just common sense. We are going to use the term “website” to refer to a single site that is all within the same domain name, like “<em>brightplanet.com</em>” or “<em>cnn.com</em>”. As you can imagine, a single website may have a few pages or several billion pages; it all depends on the site. A useful trick for estimating the number of pages in a website is to do a Google search using only the site operator (ex: <i>site:brightplanet.com</i>).</p>
<p>Occasionally <strong>clients possess a full list of websites they want to harvest.</strong> More typically, clients will have a partial list of websites. However, even these lists will usually require some type of curation or quality check.</p>
<p>Most often, our Data Acquisition Engineers will work with a <strong>client to define what a ‘good’ website looks like</strong>. This will help identify and filter out websites which may be relevant, but not on target with a client’s needs.</p>
<p>As you might have guessed, one way to identify new sites is to <strong>“search” for them using Surface Web sites like Google or DuckDuckGo.</strong> Unlike an analyst who might need to search, iterate, search again, and then review each website by hand, we can do that quickly using our Deep Web harvesting and some filter rules. This is extremely effective because we can iterate so fast with our harvest engine.</p>
<p><strong>Locating online lists or directories of similar sources.</strong> If you have ever tried to find the “best video editing software” online, you know people love to make lists. Using known entities, we can quickly locate other sites referencing related sources. From there, we can easily harvest and qualify the sites listed.</p>
<h2>Advanced Techniques Used For Some Deep Web Harvesting</h2>
<p>Those are pretty straightforward, and probably obvious, ways to locate websites. Some projects need a much greater depth to locate valid websites. Here are a few creative techniques we have used on projects.</p>
<p><strong>Monitoring newly purchased domains.</strong> Each day, a list of the newly purchased (and newly expired) domains is generated. Since we can harvest and validate many sites at a time, we can use these lists in combination with good filtering and validation rules.</p>
<p><strong>Diving deep into topical blogs or message boards is a great way to find hidden gems.</strong> Often, people will post links which get buried deep within these sites. Being able to go deep, and validate them, allows us to locate sites that might otherwise be missed.</p>
<p><strong>Social media contains a wealth of links,</strong> but they typically need to be “exploded” since they will be shortened for tracking purposes. Again, our harvest engine and social connectors make it possible to search, locate, harvest, extract, and validate these links.</p>
<h2>Conclusion</h2>
<p>BrightPlanet is the leader in providing deep Data-as-a-Service to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Visualizing a Terror Group using Named Entity Tagging</title>
		<link>https://brightplanet.com/2018/03/05/visualizing-terror-groups-named-entity-tagging/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Mon, 05 Mar 2018 09:31:33 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[Intelligence Community]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data harvesting]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[Global News Data Feed]]></category>
		<category><![CDATA[OSINT]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8635</guid>

					<description><![CDATA[We recently received a request to analyze news focusing on Lashkar-e-Taiba, an active militant terrorist organization located in South Asia. This is a simple task using BrightPlanet’s REST API to export data from the Global News Data Feed. The data load contains everything you need to perform a thorough analysis, including: URL web page [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>We recently received a request to analyze news focusing on <a href="https://en.wikipedia.org/wiki/Lashkar-e-Taiba">Lashkar-e-Taiba</a>, an active militant terrorist organization located in South Asia. This is a simple task using BrightPlanet’s REST API to export data from the <a href="http://10.0.0.183:8085/global-news-datafeed/">Global News Data Feed</a>. The data load contains everything you need to perform a thorough analysis, including:</p>
<ul>
<li>URL</li>
<li>web page title</li>
<li>document harvest date</li>
<li>full text of news article</li>
<li>extracted named entities: including crime, other threats, events, weapons, people, diseases, companies, countries, places, etc.</li>
<li>document and individual entity-level sentiment</li>
</ul>
<p>The data below was visualized and shared using <a href="http://tableau.com/">Tableau</a>.</p>
<h2>Named Entities Reveal Granular Patterns in Data</h2>
<p>The named entities we&#8217;re focusing on in this dashboard are threats, events, weapons, and people. By default, we see the total count of entities over time. Between June 2015 and January 2018, there were about 3,800 news articles mentioning Lashkar-e-Taiba. On the line chart, March 2016 and November 2017 stand out as months with increased activity. Clicking on any single month filters all the data down to that month. Overall, entities that frequently appear in this dataset include:</p>
<ul>
<li>&#8220;attack&#8221;</li>
<li>&#8220;kill&#8221;</li>
<li>&#8220;terrorism&#8221;</li>
<li>&#8220;arrest&#8221;</li>
<li>&#8220;India&#8221;</li>
<li>&#8220;Pakistan&#8221;</li>
<li>&#8220;Hafiz Saeed&#8221; (co-founder of Lashkar-e-Taiba)</li>
</ul>
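<p>To make the counting concrete, here is a minimal Python sketch of tallying entity mentions per month and filtering to a single month, much like clicking a month on the dashboard. The record layout and the field names <em>harvest_date</em> and <em>entities</em> are assumptions made for illustration; they are not the actual export format of the Global News Data Feed.</p>

```python
from collections import Counter

# Hypothetical records: each dict stands in for one harvested article.
# The field names ("harvest_date", "entities") are assumptions for illustration.
articles = [
    {"harvest_date": "2016-03-04", "entities": ["attack", "India", "kill"]},
    {"harvest_date": "2016-03-19", "entities": ["attack", "Pakistan"]},
    {"harvest_date": "2017-11-22", "entities": ["Hafiz Saeed", "arrest"]},
]


def entity_counts_by_month(records):
    """Return {"YYYY-MM": Counter of entity mentions} for the given records."""
    counts = {}
    for record in records:
        month = record["harvest_date"][:7]  # keep only the "YYYY-MM" prefix
        counts.setdefault(month, Counter()).update(record["entities"])
    return counts


def filter_month(records, month):
    """Mimic clicking a month on the dashboard: keep only that month's articles."""
    return [r for r in records if r["harvest_date"].startswith(month)]


monthly = entity_counts_by_month(articles)
print(monthly["2016-03"].most_common(2))
```

<p>A per-month <em>Counter</em> keeps the sketch short; a real dashboard tool would do the same aggregation internally.</p>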
<div id="attachment_8636" style="width: 1010px" class="wp-caption alignnone"><a href="https://public.tableau.com/views/Lashkar-e-TaibaNews/Lashkar-e-Taiba?:embed=y&amp;:display_count=yes"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-8636" class="wp-image-8636 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/03/Lashkar-e-Taiba.png" alt="Lashkar-e-Taiba news visualization" width="1000" height="1500" srcset="https://brightplanet.com/wp-content/uploads/2018/03/Lashkar-e-Taiba.png 1000w, https://brightplanet.com/wp-content/uploads/2018/03/Lashkar-e-Taiba-200x300.png 200w, https://brightplanet.com/wp-content/uploads/2018/03/Lashkar-e-Taiba-768x1152.png 768w, https://brightplanet.com/wp-content/uploads/2018/03/Lashkar-e-Taiba-683x1024.png 683w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></a><p id="caption-attachment-8636" class="wp-caption-text">Click the image to go to interactive visualization</p></div>
<h2>Matching Crimes to Locations with Entity Relationships</h2>
<p>Knowing which individual entities appear in a document is helpful, but seeing how those entities relate to other nearby entities of interest can provide real insight. BrightPlanet uses Rosoka Series 6 as its entity tagging engine. One convenient feature of the software is that it automatically tags relationships between nearby entities. The yellow dots on the map correspond to news articles containing a Crime-to-Location relationship. Hovering over the southernmost yellow dot on the map, we can see that the crime entity &#8220;theft&#8221; was found near the place entity &#8220;North Paravoor&#8221;. If we jumped to that article from <em>The New Indian Express</em>, we would find the targeted sentence:</p>
<blockquote><p><em>&#8220;Anoop, a native of <strong>North Paravoor</strong>, is the seventh accused in the case related to <strong>theft</strong> of ammonium nitrate, nitrate mixer and electric detonator from Thuruthiyil Traders, a shop functioning in Perumbavoor.&#8221;</em></p></blockquote>
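<p>For readers curious how a proximity-based relationship could work in principle, the following Python sketch pairs a crime term with any place term found within a fixed character distance. This is a deliberately simple stand-in and not how Rosoka tags relationships; the entity lists and the 80-character window are assumptions chosen for the example.</p>

```python
import re

# Hypothetical entity lists; a real tagging engine would supply these.
CRIMES = ["theft"]
PLACES = ["North Paravoor", "Perumbavoor"]


def nearby_pairs(text, crimes, places, max_gap=80):
    """Pair each crime with every place mentioned within max_gap characters."""
    def positions(term):
        return [m.start() for m in re.finditer(re.escape(term), text)]

    pairs = set()
    for crime in crimes:
        for place in places:
            for c_pos in positions(crime):
                for p_pos in positions(place):
                    if abs(c_pos - p_pos) > max_gap:
                        continue  # too far apart to count as "nearby"
                    pairs.add((crime, place))
    return sorted(pairs)


sentence = (
    "Anoop, a native of North Paravoor, is the seventh accused in the case "
    "related to theft of ammonium nitrate, nitrate mixer and electric "
    "detonator from Thuruthiyil Traders, a shop functioning in Perumbavoor."
)
print(nearby_pairs(sentence, CRIMES, PLACES))
```

<p>With this window, &#8220;theft&#8221; pairs with &#8220;North Paravoor&#8221; but not with the more distant &#8220;Perumbavoor&#8221;, which shows why the choice of window matters.</p>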
<p>These are just a few examples of the insights that can be found by adding structure to unstructured web content through named entity tagging. Please <a href="https://public.tableau.com/views/Lashkar-e-TaibaNews/Lashkar-e-Taiba?:embed=y&amp;:display_count=yes">click into the interactive Tableau visualization</a> to explore the insights that a targeted search of news data can surface.</p>
<h2>Develop Business Insight through Unstructured Web Content with BrightPlanet</h2>
<p>If you think your organization would benefit from trends and insights derived from access to a repository of news data with over 15 million articles from thousands of unique sources, check out <a href="http://10.0.0.183:8085/global-news-datafeed/">BrightPlanet’s Global News Data Feed for a 30-day free trial</a>.</p>
<p><!--HubSpot Call-to-Action Code --><span id="hs-cta-wrapper-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-wrapper"><span id="hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-node hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><!-- [if lte IE 8]&gt;--></p>
<div id="hs-cta-ie-element"></div>
<p><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><img decoding="async" id="hs-cta-img-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97.png" alt="SCHEDULE YOUR CONSULTATION" /></a></span></span><!-- end HubSpot Call-to-Action Code --></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>We talk a lot about Data-as-a-Service, but what exactly does that mean?</title>
		<link>https://brightplanet.com/2018/02/13/talk-lot-data-service-exactly-mean/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Tue, 13 Feb 2018 10:07:50 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Big Data case study]]></category>
		<category><![CDATA[dark web]]></category>
		<category><![CDATA[Dark Web search]]></category>
		<category><![CDATA[data as a service]]></category>
		<category><![CDATA[deep web]]></category>
		<category><![CDATA[OSINT]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8629</guid>

					<description><![CDATA[BrightPlanet has provided terabytes of data for various analytic projects across many industries over the years. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines or final reports. These collections often contain data from dozens, [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>BrightPlanet has provided terabytes of data for various analytic projects across many industries over the years. Our role is to locate open-source web data, harvest the relevant information, curate the data into semi-structured content, and provide a stream of data feeding directly into analytic engines or final reports.</p>
<p>These collections often contain data from dozens, hundreds, or even thousands of websites, and we use various techniques to capture the best content from each one. Not all websites are created equal; there is very little chance that the same harvesting technique will work for every website.</p>
<h2><span id="more-8629"></span></h2>
<div id="attachment_8630" style="width: 1108px" class="wp-caption aligncenter"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-8630" class="wp-image-8630 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/02/2018-02-08_1524.png" alt="Data-as-a-Service" width="1098" height="620" srcset="https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524.png 1098w, https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524-300x169.png 300w, https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524-768x434.png 768w, https://brightplanet.com/wp-content/uploads/2018/02/2018-02-08_1524-1024x578.png 1024w" sizes="auto, (max-width: 1098px) 100vw, 1098px" /><p id="caption-attachment-8630" class="wp-caption-text">Data-as-a-Service</p></div>
<h2>Techniques for Harvesting Content</h2>
<p>BrightPlanet has developed several techniques for harvesting data. One of the easiest to understand is a <a href="http://10.0.0.183:8085/2016/02/is-this-the-surface-web-or-deep-web/">Surface Web</a> harvest, or hyperlink crawl. Start with a single URL, then follow the outgoing links from that URL, and repeat. This is a typical process for collecting web data and is used in nearly every project we do.</p>
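<p>The start-follow-repeat loop of a hyperlink crawl can be sketched in a few lines of Python. This is only an illustration of the general technique, not BrightPlanet&#8217;s harvester: the link graph below is a hypothetical in-memory stand-in for real pages, and no network requests are made.</p>

```python
from collections import deque

# Hypothetical link graph standing in for real pages; no network access here.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: visit the seed, then its outgoing links, and repeat."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and max_pages > len(visited):
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("https://example.com/", lambda url: LINKS.get(url, []))
print(order)
```

<p>The <em>seen</em> set keeps the crawl from looping when pages link back to each other, and <em>max_pages</em> caps how far it spreads.</p>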
<p>Other techniques increase the harvest complexity, such as a <a href="http://10.0.0.183:8085/2016/04/video-what-is-the-deep-web/">Deep Web harvest</a>. Instead of starting with a URL and following links, our Deep Web harvester can interact with a website’s internal search database, asking it different questions, each time collecting and harvesting the results as they are returned.</p>
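<p>In contrast to following links, a query-driven harvest asks a site&#8217;s search function a series of questions and collects whatever comes back. The Python sketch below illustrates that idea under stated assumptions: <em>SITE_DB</em> and <em>site_search</em> are hypothetical stand-ins for a website&#8217;s internal search, and a real harvest would submit queries to a live search form instead.</p>

```python
# Hypothetical in-memory "site database"; a real harvest would query a live search form.
SITE_DB = [
    {"url": "/doc/1", "text": "bitcoin price rally"},
    {"url": "/doc/2", "text": "bitcoin fraud case"},
    {"url": "/doc/3", "text": "ransomware fraud alert"},
]


def site_search(query):
    """Stand-in for a website's internal search: return documents matching the query."""
    return [doc for doc in SITE_DB if query in doc["text"]]


def deep_web_harvest(search, queries):
    """Ask the search function each query and collect results, de-duplicated by URL."""
    harvested = {}
    for query in queries:
        for doc in search(query):
            harvested.setdefault(doc["url"], doc)  # keep the first copy of each URL
    return list(harvested.values())


docs = deep_web_harvest(site_search, ["bitcoin", "fraud"])
print([doc["url"] for doc in docs])
```

<p>De-duplicating by URL matters because different queries often return the same document.</p>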
<p>Our extensive harvest engine also supports many other techniques, such as a <a href="http://10.0.0.183:8085/2017/11/video-accessing-the-dark-web-with-tor-search-on-an-ubuntu-virtual-machine/">Dark Web harvester</a>, REST-API harvester, and even a custom scripted engine. Each of these techniques allows many variations for diving deep into websites and extracting only relevant content, avoiding noise in the final dataset.</p>
<h2>Curating Deep Web Content</h2>
<p>As each document is harvested, it is normalized into a consistent stream of data and prepared for curation. It is important that the content is properly prepared to optimize entity relationships. If data gets mangled, it can be mismatched and produce poor results. Garbage in, garbage out.</p>
<p>Each project will have its own curation steps defined because what works for a fraud project will be useless for a reputation management project. Our <a href="http://10.0.0.183:8085/2017/09/open-source-intelligence-new-rosoka-update/">Rosoka integration</a> allows us to be very flexible and extensive when it comes to creating custom named entity recognition solutions.</p>
<p><a href="http://10.0.0.183:8085/2017/04/rosoka/">We covered entity extraction features in a previous blog post</a>.</p>
<h2>Data Delivery through REST APIs</h2>
<p>Lastly, data needs to be delivered to the analytics platform for further processing. At this point, the data is properly prepared for a third-party solution to ingest. The easiest and most common way to transfer data is via our <a href="http://10.0.0.183:8085/data-feed-api-guide/">REST API</a> service.</p>
<p>Typically, our Data-as-a-Service ends with a REST API that our clients or partners can easily integrate into a dashboard, service, analytics engine, <a href="http://10.0.0.183:8085/2018/02/mapping-bitcoin-metrics-across-deep-web-news-sentiment-tableau/">data visualization</a>, reports, or even a simple data alert.</p>
<h2>Conclusion</h2>
<p>BrightPlanet is the leader in providing deep <a href="http://10.0.0.183:8085/data-as-a-service/">Data-as-a-Service</a> to our customers, delivering open-source web content through a simple-to-use service. Our customers do not need to worry about the complexities and details of harvesting, curating, and preparing data for analytics. Instead, they can focus on what they do best: creating intelligence.</p>
<p><!--HubSpot Call-to-Action Code --><span id="hs-cta-wrapper-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-wrapper"><span id="hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-node hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><!-- [if lte IE 8]&gt;--></p>
<div id="hs-cta-ie-element"></div>
<p><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><img decoding="async" id="hs-cta-img-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97.png" alt="SCHEDULE YOUR CONSULTATION" /></a></span></span><!-- end HubSpot Call-to-Action Code --></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Mapping Bitcoin Metrics Across Deep Web News and Sentiment with Tableau</title>
		<link>https://brightplanet.com/2018/02/02/mapping-bitcoin-metrics-across-deep-web-news-sentiment-tableau/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Fri, 02 Feb 2018 16:53:40 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[Financial Industry]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data harvesting]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[Global News Data Feed]]></category>
		<category><![CDATA[OSINT]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8607</guid>

					<description><![CDATA[Bitcoin and other cryptocurrencies have been a hot topic of conversation recently. The volatile digital currency rocketed from $1,000 to nearly $20,000 in 2017 before crashing back down to around $10,000 in January 2018. We thought it would be intriguing to query BrightPlanet’s Global News Data Feed to compare Bitcoin price trends with news mentions [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://en.wikipedia.org/wiki/Bitcoin">Bitcoin</a> and other cryptocurrencies have been a hot topic of conversation recently. The volatile digital currency rocketed from $1,000 to nearly $20,000 in 2017 before crashing back down to around $10,000 in January 2018. We thought it would be intriguing to query BrightPlanet’s <a href="http://10.0.0.183:8085/global-news-datafeed/">Global News Data Feed</a> to compare Bitcoin price trends with news mentions of cryptocurrency and the sentiment associated with those news articles. After joining the news/sentiment data with the Bitcoin pricing data from the same time period, we visualized the data using <a href="http://tableau.com/">Tableau</a>.</p>
<h2>Visualizing Bitcoin News Data</h2>
<p>The first insight to jump out was the high correlation between the Bitcoin price and the number of news mentions, as you can see when those two measures are visualized with a scatterplot. The r-squared value of the linear regression line is 0.95, meaning about 95% of the variation in one measure is explained by a linear relationship with the other. The greater the price of Bitcoin, the more news articles regarding cryptocurrency are produced.</p>
<div id="attachment_8608" style="width: 1009px" class="wp-caption alignnone"><a href="https://public.tableau.com/views/BitcoinTrendsinGlobalNews/BitcoinNews?:embed=y&amp;:display_count=yes&amp;publish=yes&amp;:showVizHome=no"><img loading="lazy" decoding="async" aria-describedby="caption-attachment-8608" class="wp-image-8608 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/02/bitcoin-tableau.png" alt="Bitcoin Tableau Data Visualization" width="999" height="847" srcset="https://brightplanet.com/wp-content/uploads/2018/02/bitcoin-tableau.png 999w, https://brightplanet.com/wp-content/uploads/2018/02/bitcoin-tableau-300x254.png 300w, https://brightplanet.com/wp-content/uploads/2018/02/bitcoin-tableau-768x651.png 768w" sizes="auto, (max-width: 999px) 100vw, 999px" /></a><p id="caption-attachment-8608" class="wp-caption-text">Click the image to view the interactive visualization</p></div>
<p>Using the coloring of the marks on both the scatterplot and treemap charts, which indicate the “Year” attribute of the monthly data, you can see the slow rise in popularity of Bitcoin in 2015 and 2016 before the explosion in 2017. Every month before May 2017 is clumped together in the bottom-left of the scatterplot, indicating a price range around <em><strong>$1,000</strong></em> and a count of <strong><em>100-350</em></strong> news articles per month.</p>
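<p>For anyone who wants to reproduce an r-squared figure on their own data, the calculation for a simple least-squares line is short enough to write out by hand. The Python sketch below is a generic implementation of the coefficient of determination; the monthly numbers in it are made up for illustration and are not the data behind the dashboard.</p>

```python
def r_squared(xs, ys):
    """Coefficient of determination for a least-squares line fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: what the fitted line fails to explain.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: overall variation of ys around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up monthly figures for illustration only, not the data behind the dashboard.
prices = [1000, 2500, 4000, 7000, 15000]
article_counts = [300, 520, 800, 1300, 2600]
print(round(r_squared(prices, article_counts), 3))
```

<p>A value close to 1 means the points sit close to the fitted line; it does not by itself say which measure drives the other.</p>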
<h2>Diving into Deep Web content</h2>
<p>The dual-axis chart at the bottom of the dashboard shows the relationship trends between Bitcoin price and news article sentiment. BrightPlanet uses Rosoka Series 6 for entity tagging and extraction. The Rosoka data point we’re using in this dashboard is total sentiment at the document-level. To smooth out the trendlines and avoid large outliers, both sentiment and price were visualized as a 7-day moving average. The insights aren’t as striking in this chart, but you can generally see increases and decreases in sentiment shortly following corresponding increases and decreases in Bitcoin price.</p>
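<p>A 7-day moving average is straightforward to compute yourself. The Python sketch below uses a trailing window, which is an assumption on our part for the example (a charting tool may center the window or treat the first few days differently), and the daily scores are made up.</p>

```python
def moving_average(values, window=7):
    """Trailing moving average; early points average whatever history exists."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up daily sentiment scores with one large outlier on the fourth day.
daily_sentiment = [0.1, 0.2, 0.1, 5.0, 0.2, 0.1, 0.3, 0.2]
print(moving_average(daily_sentiment))
```

<p>The outlier still shows up in the smoothed series, but it is spread across the window instead of dominating a single day.</p>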
<p>If you hover over any single day on the line charts, you will see a <strong><em>“Top 10 Articles by Sentiment Intensity”</em></strong> chart. This ranking shows the top 10 news articles for that day based on difference from a &#8220;neutral&#8221; sentiment rating.</p>
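<p>Ranking by &#8220;difference from neutral&#8221; amounts to sorting on the absolute distance between each article&#8217;s sentiment score and the neutral value. The Python sketch below assumes a neutral score of 0 and a hypothetical <em>sentiment</em> field; the actual scale used in the dashboard may differ.</p>

```python
def top_by_intensity(articles, neutral=0.0, limit=10):
    """Rank articles by absolute distance of their sentiment from neutral."""
    ranked = sorted(
        articles,
        key=lambda article: abs(article["sentiment"] - neutral),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical records; the "sentiment" scale and neutral value of 0 are assumptions.
day = [
    {"title": "A", "sentiment": 0.4},
    {"title": "B", "sentiment": -2.5},
    {"title": "C", "sentiment": 1.1},
]
print([a["title"] for a in top_by_intensity(day, limit=2)])
```

<p>Because the ranking uses absolute distance, strongly negative and strongly positive articles both rise to the top.</p>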
<p>In addition to the two main measures, <strong>Price</strong> and <a href="http://10.0.0.183:8085/2015/12/how-can-you-analyze-the-relevance-and-sentiment-of-online-data/"><strong>Sentiment</strong></a>, shown by the line charts, we added two more measures to the view by using the “size” attribute of each line to show the volume of both news articles and Bitcoin trading. On the left (older) side of the line chart, the lines are noticeably thin, indicating fewer news articles and a lower volume of Bitcoin trading. Towards the right (more recent) side of the line chart, the lines grow thicker as both news mentions and Bitcoin trading volume increase.</p>
<h2>Develop Business Insight through Unstructured Web Content with BrightPlanet</h2>
<p>If you think your organization would benefit from trends and insights derived from access to a repository of news data with over 15 million articles from thousands of unique sources, check out <a href="http://10.0.0.183:8085/global-news-datafeed/">BrightPlanet’s Global News Data Feed for a 30-day free trial</a>.</p>
<p><!--HubSpot Call-to-Action Code --><span id="hs-cta-wrapper-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-wrapper"><span id="hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-node hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><!-- [if lte IE 8]&gt;--></p>
<div id="hs-cta-ie-element"></div>
<p><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><img decoding="async" id="hs-cta-img-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97.png" alt="SCHEDULE YOUR CONSULTATION" /></a></span></span><!-- end HubSpot Call-to-Action Code --></p>
<p><em><a href="http://maxpixel.freegreatpicture.com/Future-Bitcoin-Cryptocurrency-Currency-Money-Btc-2868703">[Photo credit from Max Pixel]</a></em>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Harvest Web Data in Multiple Languages with Unstructured Data Mining and Deep Web Search</title>
		<link>https://brightplanet.com/2018/01/17/harvest-web-data-multiple-languages-unstructured-data-mining-deep-web-search/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Wed, 17 Jan 2018 20:02:40 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[Intelligence Community]]></category>
		<category><![CDATA[Dark Web search]]></category>
		<category><![CDATA[deep web search]]></category>
		<category><![CDATA[foreign language]]></category>
		<category><![CDATA[language entity extraction]]></category>
		<category><![CDATA[rosoka]]></category>
		<category><![CDATA[unstructured data mining]]></category>
		<category><![CDATA[web harvest]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8597</guid>

					<description><![CDATA[The Internet knows nearly no limit when it comes to languages. The social media platform Facebook, for example, is available in over 100 languages. And that’s just the beginning. Hundreds of languages also make up areas of the Deep Web and Dark Web. If businesses want to gain complete data insight into their organization, it [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">The Internet knows nearly no limit when it comes to languages. The social media platform Facebook, for example, is available in over 100 languages. And that’s just the beginning. </span></p>
<p><span style="font-weight: 400">Hundreds of languages also make up areas of the </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/your-guide-to-the-deep-web"><span style="font-weight: 400">Deep Web</span></a></span><span style="font-weight: 400"> and </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/discover-the-dark-web"><span style="font-weight: 400">Dark Web</span></a></span><span style="font-weight: 400">. If businesses want to gain complete data insight into their organization, it becomes necessary to effectively harvest and interpret data from languages other than English. </span></p>
<p><span style="font-weight: 400">Learn how BrightPlanet partners with </span><span style="text-decoration: underline"><a href="http://www.rosoka.com/" target="_blank" rel="noopener noreferrer"><span style="font-weight: 400">Rosoka</span></a></span> <span style="font-weight: 400">to utilize unstructured data mining and Deep Web search to </span><b>harvest data in nearly any language</b><span style="font-weight: 400">, interpreting its meaning to provide deeper insight into the opportunities and threats that exist for businesses today.</span></p>
<h2><b>Harvesting Foreign Deep Web Data</b></h2>
<p><span style="font-weight: 400">BrightPlanet’s process for harvesting web data in a foreign language is the same <span style="text-decoration: underline"><a href="http://10.0.0.183:8085/data-as-a-service/">process</a></span> it uses to harvest a web page in English. </span></p>
<p><span style="font-weight: 400">For example, an online article about leukemia written in English is harvested the same way as an article about leukemia written in Dutch, Arabic, or Portuguese. BrightPlanet navigates each page and </span><b>stores and archives all text from that page</b><span style="font-weight: 400">. This process works with any language that can be written as text online.</span></p>
<h2><b>Curating Foreign Data to Identify Patterns </b></h2>
<p><span style="font-weight: 400">While harvesting foreign data isn’t too difficult, the challenge comes when analyzing documents in different languages and comparing them for similar and contrasting content.</span></p>
<p><span style="font-weight: 400">BrightPlanet works with text analytics solutions partner, </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/2017/04/rosoka/"><span style="font-weight: 400">Rosoka</span></a></span><span style="font-weight: 400">, to </span><b>enrich unstructured data</b><span style="font-weight: 400"> that has been harvested through Deep Web search or Dark Web search through the process of </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/2017/04/rosoka/"><span style="font-weight: 400">entity extraction</span></a></span><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">Rosoka has the ability to detect key entities in multiple languages, meaning it is able to </span><b>extract main keywords and themes from content in over 200 languages</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">If Rosoka processed three different articles about leukemia written in English, Dutch, and Portuguese, it would be able to recognize the main theme of the disease without needing to perform a full-size machine extraction, saving valuable time. </span></p>
<p><span style="font-weight: 400">Another advantage to using Rosoka and BrightPlanet to harvest data in other languages is our ability to </span><b>normalize extracted tags into one instance</b><span style="font-weight: 400">, regardless of language. For example, even though the three example articles mentioned above are all written in different languages, we are able to create a common link between the three, simply referring to “leukemia” instead of each individual tag. </span></p>
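<p><span style="font-weight: 400">The idea of collapsing language-specific tags into one instance can be illustrated with a simple lookup table. The Python sketch below is a toy example and not how Rosoka performs normalization: the </span><em>CANONICAL</em><span style="font-weight: 400"> mapping is hypothetical and covers only the leukemia example from the text.</span></p>

```python
# Hypothetical lookup from language-specific surface forms to one canonical tag.
CANONICAL = {
    "leukemia": "leukemia",  # English
    "leukemie": "leukemia",  # Dutch
    "leucemia": "leukemia",  # Portuguese
}


def normalize_tags(tags):
    """Map each tag to its canonical form; unknown tags pass through lowercased."""
    return sorted({CANONICAL.get(tag.lower(), tag.lower()) for tag in tags})


print(normalize_tags(["Leukemia", "leukemie", "Leucemia"]))
```

<p><span style="font-weight: 400">Once every article carries the same canonical tag, documents in different languages can be grouped and compared directly.</span></p>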
<p><span style="font-weight: 400">Not only can Rosoka identify common keywords and themes among different languages, it can also </span><b>identify sentiments such as mood and intensity of voice among entities and entire documents</b><span style="font-weight: 400">. This feature allows organizations to dig deeper into their data, </span><b>discovering the passions and decisions that are the driving force behind data points</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">Once you have this harvested data, BrightPlanet works with Rosoka </span><span style="font-weight: 400">to give you the content in the language you primarily use, from English to Russian.</span></p>
<p><span id="hs-cta-wrapper-73981f45-acaa-4b80-9f0b-2e2f59b50016" class="hs-cta-wrapper"><span id="hs-cta-73981f45-acaa-4b80-9f0b-2e2f59b50016" class="hs-cta-node hs-cta-73981f45-acaa-4b80-9f0b-2e2f59b50016"><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/73981f45-acaa-4b80-9f0b-2e2f59b50016"><img decoding="async" id="hs-cta-img-73981f45-acaa-4b80-9f0b-2e2f59b50016" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/73981f45-acaa-4b80-9f0b-2e2f59b50016.png" alt="LEARN ABOUT ROSOKA'S ENTITY EXTRACTION PROCESS" /></a></span></span></p>
<h2><b>Develop Business Insight through Unstructured Data Mining of Foreign Languages</b></h2>
<p><span style="font-weight: 400">Harvesting foreign language data can be an overwhelming topic for businesses to address. Many people don’t realize the </span><b>potential increased business insight that comes with foreign language entity extraction</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">BrightPlanet works to provide the best possible foreign language data extraction </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/services/"><span style="font-weight: 400">services</span></a></span><span style="font-weight: 400"> that allow you to protect your business against possible domestic and foreign dangers such as fraud, while also giving you insight into potential business opportunities.</span></p>
<p><span style="font-weight: 400">No matter what your data harvest needs are, BrightPlanet can help. </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/schedule-a-consultation"><span style="font-weight: 400">Schedule a consultation</span></a></span><span style="font-weight: 400"> with one of our Data Acquisition Engineers and learn how you can increase your business intelligence.</span></p>
<p><!--HubSpot Call-to-Action Code --><span id="hs-cta-wrapper-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-wrapper"><span id="hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-node hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><!-- [if lte IE 8]&gt;--></p>
<div id="hs-cta-ie-element"></div>
<p><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><img decoding="async" id="hs-cta-img-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97.png" alt="SCHEDULE YOUR CONSULTATION" /></a></span></span><!-- end HubSpot Call-to-Action Code --></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Artificial Intelligence and Unstructured Data Mining: Data Trends to Watch in 2018</title>
		<link>https://brightplanet.com/2018/01/10/artificial-intelligence-unstructured-data-mining-data-trends-watch-2018/</link>
		
		<dc:creator><![CDATA[bluemonkeydev]]></dc:creator>
		<pubDate>Wed, 10 Jan 2018 14:51:25 +0000</pubDate>
				<category><![CDATA[Deep Web and Big Data]]></category>
		<category><![CDATA[Financial Industry]]></category>
		<category><![CDATA[Intelligence Community]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[open source content]]></category>
		<category><![CDATA[tableau]]></category>
		<category><![CDATA[unstructured data mining]]></category>
		<guid isPermaLink="false">http://brightplanet.com/?p=8587</guid>

					<description><![CDATA[The world loves to make predictions about the coming year. BrightPlanet is no exception. We recognize the power of data, and work to find new ways to harvest content from the Deep Web and Dark Web that can help businesses with their data goals. Here are four data trends we’ll be watching in 2018. 1. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400">The world loves to make predictions about the coming year. BrightPlanet is no exception.</span></p>
<p><span style="font-weight: 400">We recognize the power of data, and work to find new ways to harvest content from the <span style="text-decoration: underline"><a href="http://10.0.0.183:8085/your-guide-to-the-deep-web">Deep Web</a> </span>and <span style="text-decoration: underline"><a href="http://10.0.0.183:8085/discover-the-dark-web">Dark Web</a></span> that can help businesses with their data goals. Here are four data trends we’ll be watching in 2018.</span></p>
<h2><b>1. Growth in Artificial Intelligence</b></h2>
<p><span style="font-weight: 400">Artificial intelligence (AI) systems have seen a resurgence in growth over the past few years. One use for AI comes in the form of chatbots. These chatbots are those often annoying pop-ups that ask you vague questions. When you reply, their response (if you receive one at all) is usually unhelpful.</span></p>
<p><span style="font-weight: 400">While chatbots don’t have the best reputation, that may change for the better in 2018. No longer do chatbots just scan for keywords and pump out unintelligible responses to complex questions; they really do </span><b>understand what you are asking, and can quickly route you to where you need to go</b><span style="font-weight: 400">.</span></p>
<p><span style="font-weight: 400">Chatbots are interesting, but they aren’t the only thing that’s affected by AI. </span></p>
<p><span style="font-weight: 400">AI is going to make a huge impact in the intelligence community through the building of </span><b>content models that can better detect valuable content versus background or secondary content</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">BrightPlanet has seen these models work very well in a few of our projects, as they </span><b>classify content based on how similar or dissimilar it is to a curated model</b><span style="font-weight: 400">. We believe that AI is ready to analyze all open source data in 2018, giving BrightPlanet clients the ability to see patterns and quality indicators within their data. </span></p>
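<p><span style="font-weight: 400">One common way to score how similar a document is to a set of curated examples is cosine similarity over word counts. The Python sketch below shows that general technique only; it is not a description of the models used in BrightPlanet projects, and the 0.3 threshold and sample texts are arbitrary assumptions.</span></p>

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0 when either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, model_docs, threshold=0.3):
    """Label text 'relevant' when it is similar enough to the curated examples."""
    model = Counter()
    for doc in model_docs:
        model.update(vectorize(doc))  # the "curated model" is just pooled word counts
    score = cosine(vectorize(text), model)
    return "relevant" if score >= threshold else "background"


# Hypothetical curated examples for a fraud-focused project.
curated = ["bank fraud investigation", "wire fraud charges filed"]
print(classify("new fraud charges in bank case", curated))
print(classify("local team wins football match", curated))
```

<p><span style="font-weight: 400">Production systems typically use richer features than raw word counts, but the principle of comparing new content against a curated reference is the same.</span></p>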
<h2><b>2. Expansion of Unstructured Data Mining</b></h2>
<p><a href="http://10.0.0.183:8085/2017/11/uncover-financial-trends-using-consumer-price-index-and-access-economy-data-analysis/"><img loading="lazy" decoding="async" class="aligncenter wp-image-8589 size-full" src="http://10.0.0.183:8085/wp-content/uploads/2018/01/Screen-Shot-2018-01-09-at-3.29.30-PM.png" alt="Predict Trends in the Financial Services Industry" width="738" height="450" srcset="https://brightplanet.com/wp-content/uploads/2018/01/Screen-Shot-2018-01-09-at-3.29.30-PM.png 738w, https://brightplanet.com/wp-content/uploads/2018/01/Screen-Shot-2018-01-09-at-3.29.30-PM-300x183.png 300w" sizes="auto, (max-width: 738px) 100vw, 738px" /></a></p>
<p><span style="font-weight: 400">BrightPlanet has seen major advancements in unstructured data mining in recent years, with more advancements to come within the industry in 2018. </span></p>
<p><span style="font-weight: 400">BrightPlanet works closely with text analytics solutions company, </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/2017/09/open-source-intelligence-new-rosoka-update/"><span style="font-weight: 400">Rosoka</span></a></span><span style="font-weight: 400">, which will see many updates in the coming year, including updates to their entity extraction system. </span></p>
<p><span style="font-weight: 400">Processing quality web content relies entirely on the ability to </span><b>extract and create meaningful metadata through the unstructured data mining process</b><span style="font-weight: 400">. As companies like Amazon Web Services (AWS) and Google invest more resources into artificial intelligence, the ability to process unstructured web content is also improving. </span></p>
<p><span style="font-weight: 400">One specific industry that we see adopting more unstructured content in 2018 is banking and financial services, or </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/2017/11/uncover-financial-trends-using-consumer-price-index-and-access-economy-data-analysis/"><span style="font-weight: 400">Fintech</span></a></span><span style="font-weight: 400"> as it is referred to today. </span></p>
<p><span style="font-weight: 400">Being able to </span><b>predict trends in the economy using unstructured text analytics solutions</b><span style="font-weight: 400"> can help businesses identify potential opportunities and risks within the financial services industry. To learn more about uncovering trends within the financial services industry (as well as the auto industry), watch our </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/watch-our-free-trends-and-influencers-webinar?hsCtaTracking=0624fb0e-b31a-4513-9607-173a41b4e4f0%7C83bed5dc-52da-4c1d-ad01-45b99346f931"><span style="font-weight: 400">trends and influencers webinar</span></a></span><span style="font-weight: 400">, held in November 2017.</span></p>
<p><span id="hs-cta-wrapper-8a74934b-aef7-46f6-9f21-94c305bbd5ad" class="hs-cta-wrapper"><span id="hs-cta-8a74934b-aef7-46f6-9f21-94c305bbd5ad" class="hs-cta-node hs-cta-8a74934b-aef7-46f6-9f21-94c305bbd5ad"><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/8a74934b-aef7-46f6-9f21-94c305bbd5ad"><img decoding="async" id="hs-cta-img-8a74934b-aef7-46f6-9f21-94c305bbd5ad" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/8a74934b-aef7-46f6-9f21-94c305bbd5ad.png" alt="WATCH THE WEBINAR" /></a></span></span></p>
<h2><b>3. Improved Data Visualizations</b></h2>
<p><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/2015/07/more-than-search-a-guide-to-visualizing-data-in-tableau-with-brightplanets-search-dashboard/"><span style="font-weight: 400">Tableau</span></a></span><span style="font-weight: 400">, BrightPlanet’s default data visualization solution for the past three years, is making </span><b>improvements to its data modeling, storage, and flexibility in data viewing</b><span style="font-weight: 400"> in 2018. As these updates are implemented, it will be easier than ever to turn metadata into beautiful data visualizations with </span><b>real meaning and intelligence</b><span style="font-weight: 400">.</span></p>
<p><span style="font-weight: 400">Since unstructured data sets can have billions or tens of billions of data points, </span><b>visualizing the data at scale has never been more important</b><span style="font-weight: 400">. Tableau recognizes the growing importance of accurate data visualizations and is adding an optimized cache built around these visuals. </span></p>
<p><span style="font-weight: 400">Tableau, along with other data visualization options, is leveraging back-end </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/data-feed-api-guide/"><span style="font-weight: 400">REST APIs</span></a></span><span style="font-weight: 400"> to properly prepare and organize data for optimal visualization. These tools will continue to improve and be a huge asset to all open source projects in 2018.</span></p>
<h2><b>4. The Rise of the Chief Data Officer</b></h2>
<p><span style="font-weight: 400">We are optimistic that 2018 will be a strong year for organizations in all industries to seriously </span><b>invest time and resources into their data analytics and data harvest processes</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">With an increased emphasis on data, businesses will need to have people within their organization, such as a Chief Data Officer, whose primary job duty is to monitor and </span><b>improve their data through analytics, collection, visualization, and open source content</b><span style="font-weight: 400">. </span></p>
<p><span style="font-weight: 400">We welcome the well-deserved recognition of future Chief Data Officers everywhere, and can’t wait to talk to them about how to best leverage open source web content for their business intelligence needs.</span></p>
<h2><b>BrightPlanet: Your Answer to Data Harvest Questions in 2018 and Beyond</b></h2>
<p><span style="font-weight: 400">Data trends come and go, but one thing that won’t change in 2018 is </span><b>BrightPlanet’s commitment to helping clients achieve their data harvest goals</b><span style="font-weight: 400">, no matter how simple or complex. </span></p>
<p><span style="font-weight: 400">Looking to improve your data harvest process in 2018? Let BrightPlanet help. </span><span style="text-decoration: underline"><a href="http://10.0.0.183:8085/schedule-a-consultation?hsCtaTracking=8d4a0273-e697-4ea7-9306-c663c8483225%7Cc92a38d6-754a-4b6c-8390-226d2cf6b5f4"><span style="font-weight: 400">Schedule a free consultation</span></a></span><span style="font-weight: 400"> with one of our Data Acquisition Engineers today!</span></p>
<p><!--HubSpot Call-to-Action Code --><span id="hs-cta-wrapper-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-wrapper"><span id="hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-node hs-cta-811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><!-- [if lte IE 8]&gt;--></p>
<div id="hs-cta-ie-element"></div>
<p><a href="https://cta-redirect.hubspot.com/cta/redirect/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97"><img decoding="async" id="hs-cta-img-811862a0-8baf-4d9e-b1ca-3fa809ee8f97" class="hs-cta-img" style="border-width: 0px" src="https://no-cache.hubspot.com/cta/default/179268/811862a0-8baf-4d9e-b1ca-3fa809ee8f97.png" alt="SCHEDULE YOUR CONSULTATION" /></a></span></span><!-- end HubSpot Call-to-Action Code --></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
