<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>
	Comments for Andrei Zmievski	</title>
	<atom:link href="https://zmievski.org/comments/feed/" rel="self" type="application/rss+xml" />
	<link>https://zmievski.org</link>
	<description>Life, technology, and other good things</description>
	<lastBuildDate>Mon, 25 Nov 2019 05:47:54 -0800</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.3</generator>
			<item>
				<title>
				Comment on Bloom Filters Quickie by Mạnh				</title>
				<link>https://zmievski.org/2009/04/bloom-filters-quickie/#comment-906</link>
		<dc:creator><![CDATA[Mạnh]]></dc:creator>
		<pubDate>Sat, 10 May 2014 19:02:27 +0000</pubDate>
		<guid isPermaLink="false">http://gravitonic.com/?p=788#comment-906</guid>
					<description><![CDATA[I’m learning about Bloom filters from “Multi-dimensional Range Query for Data Management using Bloom Filters”.
Do you have documentation about this part?
My English is not good, I’m sorry!]]></description>
		<content:encoded><![CDATA[<p>I’m learning about Bloom filters from “Multi-dimensional Range Query for Data Management using Bloom Filters”.<br />
Do you have documentation about this part?<br />
My English is not good, I’m sorry!</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Andrei				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1311</link>
		<dc:creator><![CDATA[Andrei]]></dc:creator>
		<pubDate>Thu, 11 Jul 2013 18:30:37 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1311</guid>
					<description><![CDATA[@Paweł, in our product we never remove the duplicates from the corpus, because they are places entered by people, and it wouldn&#039;t be nice if we simply removed their data. We used the process described in the article to generate &quot;clusters&quot; of places that could potentially be duplicates of one another and to represent them as such on the map.]]></description>
		<content:encoded><![CDATA[<p>@Paweł, in our product we never remove the duplicates from the corpus, because they are places entered by people, and it wouldn&#8217;t be nice if we simply removed their data. We used the process described in the article to generate &#8220;clusters&#8221; of places that could potentially be duplicates of one another and to represent them as such on the map.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Paweł Rychlik				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1310</link>
		<dc:creator><![CDATA[Paweł Rychlik]]></dc:creator>
		<pubDate>Wed, 10 Jul 2013 11:25:30 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1310</guid>
					<description><![CDATA[Andrei,
I currently work on a quite similar problem with basically the same toolset. I&#039;ve come across a problem of score relativity in ElasticSearch - i.e. the score yielded by ElasticSearch varies depending on e.g. the number of already processed &#038; deduplicated documents (of course, the relation here is much more complex).
Situation A:
You are at the very beginning of your deduping process. You have very few items in your deduplicated dataset. You take a new one (say X), search for its probable duplicates against your dataset. You get some results from ES, with e.g. the highest score being 0.5.
Situation B:
You are far into the deduping process of your data. You have a very large set of deduplicated items already. Again - you take a new one (the same X), search for its probable duplicates against your dataset. You get some results from ES, but the scoring is now on a completely different level - say 2.0.
In your article you mentioned that you have a static score cut-off threshold = 0.5 (which you have refined to get better results).
How can you determine what a score of 0.5 means without knowing the whole context of your data? Shouldn&#039;t the cut-off point be evaluated at runtime based on some statistics or the like? Maybe you would get even better results if you set the score threshold to 0.5 at the very beginning of the dedupe process, but pushed it towards e.g. 1.5 once you have a broader amount of data?]]></description>
		<content:encoded><![CDATA[<p>Andrei,<br />
I currently work on a quite similar problem with basically the same toolset. I&#8217;ve come across a problem of score relativity in ElasticSearch &#8211; i.e. the score yielded by ElasticSearch varies depending on e.g. the number of already processed &amp; deduplicated documents (of course, the relation here is much more complex).<br />
Situation A:<br />
You are at the very beginning of your deduping process. You have very few items in your deduplicated dataset. You take a new one (say X), search for its probable duplicates against your dataset. You get some results from ES, with e.g. the highest score being 0.5.<br />
Situation B:<br />
You are far into the deduping process of your data. You have a very large set of deduplicated items already. Again &#8211; you take a new one (the same X), search for its probable duplicates against your dataset. You get some results from ES, but the scoring is now on a completely different level &#8211; say 2.0.<br />
In your article you mentioned that you have a static score cut-off threshold = 0.5 (which you have refined to get better results).<br />
How can you determine what a score of 0.5 means without knowing the whole context of your data? Shouldn&#8217;t the cut-off point be evaluated at runtime based on some statistics or the like? Maybe you would get even better results if you set the score threshold to 0.5 at the very beginning of the dedupe process, but pushed it towards e.g. 1.5 once you have a broader amount of data?</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Andrei				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1308</link>
		<dc:creator><![CDATA[Andrei]]></dc:creator>
		<pubDate>Wed, 06 Mar 2013 03:52:40 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1308</guid>
					<description><![CDATA[Your project looks quite interesting, and I&#039;m sure it produces better-quality results than my approach. I wish I had run across it at the time. But I wanted to keep the number of pieces of technology as small as possible, and using ElasticSearch was a quick &amp; dirty approach that seemed to yield decent results.]]></description>
		<content:encoded><![CDATA[<p>Your project looks quite interesting, and I&#8217;m sure it produces better-quality results than my approach. I wish I had run across it at the time. But I wanted to keep the number of pieces of technology as small as possible, and using ElasticSearch was a quick &#038; dirty approach that seemed to yield decent results.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Lars Marius Garshol				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1307</link>
		<dc:creator><![CDATA[Lars Marius Garshol]]></dc:creator>
		<pubDate>Sat, 02 Mar 2013 09:19:08 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1307</guid>
					<description><![CDATA[It&#039;s interesting to see that you chose to approach this problem by simply querying the search engine directly. I&#039;m surprised your results seem so good, because this is generally a tricky problem, where information from many fields (name, address, phone number, geoposition, etc.) all needs to be considered and weighed against one another.
I had the same problem and chose to build &lt;a href=&quot;http://code.google.com/p/duke/&quot; rel=&quot;nofollow ugc&quot;&gt;a full record linkage engine&lt;/a&gt; on top of Lucene. It basically uses Lucene to find candidate matches (much like you do), but then does configurable detailed comparison with weighted Levenshtein, q-grams, etc., and combines the results for different properties using Bayes&#039;s Theorem. It also cleans and normalizes data before comparison.
Even that requires a lot of tuning and work to produce good results, so, like I said, I&#039;m surprised your results look so good. But maybe I&#039;m missing something.]]></description>
		<content:encoded><![CDATA[<p>It&#8217;s interesting to see that you chose to approach this problem by simply querying the search engine directly. I&#8217;m surprised your results seem so good, because this is generally a tricky problem, where information from many fields (name, address, phone number, geoposition, etc.) all needs to be considered and weighed against one another.<br />
I had the same problem and chose to build <a href="http://code.google.com/p/duke/" rel="nofollow ugc">a full record linkage engine</a> on top of Lucene. It basically uses Lucene to find candidate matches (much like you do), but then does configurable detailed comparison with weighted Levenshtein, q-grams, etc., and combines the results for different properties using Bayes&#8217;s Theorem. It also cleans and normalizes data before comparison.<br />
Even that requires a lot of tuning and work to produce good results, so, like I said, I&#8217;m surprised your results look so good. But maybe I&#8217;m missing something.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Bloom Filters Quickie by Tech talk: Bloom Filters &#124; onoffswitch.net				</title>
				<link>https://zmievski.org/2009/04/bloom-filters-quickie/#comment-905</link>
		<dc:creator><![CDATA[Tech talk: Bloom Filters &#124; onoffswitch.net]]></dc:creator>
		<pubDate>Thu, 28 Feb 2013 16:27:14 +0000</pubDate>
		<guid isPermaLink="false">http://gravitonic.com/?p=788#comment-905</guid>
					<description><![CDATA[[...] http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ http://zmievski.org/2009/04/bloom-filters-quickie http://www.perl.com/pub/2004/04/08/bloom_filters.html [...]]]></description>
		<content:encoded><![CDATA[<p>[&#8230;] <a href="http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/" rel="nofollow ugc">http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/</a> <a href="http://zmievski.org/2009/04/bloom-filters-quickie" rel="nofollow ugc">http://zmievski.org/2009/04/bloom-filters-quickie</a> <a href="http://www.perl.com/pub/2004/04/08/bloom_filters.html" rel="nofollow ugc">http://www.perl.com/pub/2004/04/08/bloom_filters.html</a> [&#8230;]</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Ideas of March by Devon Sutton				</title>
				<link>https://zmievski.org/2011/03/ideas-of-march/#comment-1296</link>
		<dc:creator><![CDATA[Devon Sutton]]></dc:creator>
		<pubDate>Fri, 08 Feb 2013 03:03:30 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1110#comment-1296</guid>
					<description><![CDATA[They loved jumping on the Ruby hate bandwagon when Twitter was going through its difficulties. Little Bo Peep has been quite silent since.]]></description>
		<content:encoded><![CDATA[<p>They loved jumping on the Ruby hate bandwagon when Twitter was going through its difficulties. Little Bo Peep has been quite silent since.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Andrei				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1306</link>
		<dc:creator><![CDATA[Andrei]]></dc:creator>
		<pubDate>Sun, 30 Sep 2012 16:53:01 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1306</guid>
					<description><![CDATA[The groups are stored in MongoDB in a separate collection. Each entry just lists the IDs of the places in the group. Querying is easy with the $in operator, to find which group a given place belongs to.]]></description>
		<content:encoded><![CDATA[<p>The groups are stored in MongoDB in a separate collection. Each entry just lists the IDs of the places in the group. Querying is easy with the $in operator, to find which group a given place belongs to.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Dennis				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1305</link>
		<dc:creator><![CDATA[Dennis]]></dc:creator>
		<pubDate>Sun, 30 Sep 2012 06:36:46 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1305</guid>
					<description><![CDATA[Hi, once you have the items in groups, how do you go about querying for items and collapsing on the groups you have created?]]></description>
		<content:encoded><![CDATA[<p>Hi, once you have the items in groups, how do you go about querying for items and collapsing on the groups you have created?</p>
]]></content:encoded>
						</item>
			</channel>
</rss>
