<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>
	Comments for Andrei Zmievski	</title>
	<atom:link href="https://zmievski.org/comments/feed/" rel="self" type="application/rss+xml" />
	<link>https://zmievski.org</link>
	<description>Life, technology, and other good things</description>
	<lastBuildDate>Mon, 25 Nov 2019 05:47:54 -0800</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.3</generator>
			<item>
				<title>
				Comment on Bloom Filters Quickie by Mạnh				</title>
				<link>https://zmievski.org/2009/04/bloom-filters-quickie/#comment-906</link>
		<dc:creator><![CDATA[Mạnh]]></dc:creator>
		<pubDate>Sat, 10 May 2014 19:02:27 +0000</pubDate>
		<guid isPermaLink="false">http://gravitonic.com/?p=788#comment-906</guid>
					<description><![CDATA[I’m learning about Bloom filters from “Multi-dimensional Range Query for Data Management using Bloom Filters”.
Do you have documentation about this part?
My English is not good, I’m sorry!]]></description>
		<content:encoded><![CDATA[<p>I’m learning about Bloom filters from “Multi-dimensional Range Query for Data Management using Bloom Filters”.<br />
Do you have documentation about this part?<br />
My English is not good, I’m sorry!</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Andrei				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1311</link>
		<dc:creator><![CDATA[Andrei]]></dc:creator>
		<pubDate>Thu, 11 Jul 2013 18:30:37 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1311</guid>
					<description><![CDATA[@Paweł, in our product we never remove the duplicates from the corpus, because they are places entered by people, and it wouldn&#039;t be nice if we simply removed their data. We used the process described in the article to generate &quot;clusters&quot; of places that could potentially be duplicates of one another and to represent them as such on the map.]]></description>
		<content:encoded><![CDATA[<p>@Paweł, in our product we never remove the duplicates from the corpus, because they are places entered by people, and it wouldn&#8217;t be nice if we simply removed their data. We used the process described in the article to generate &#8220;clusters&#8221; of places that could potentially be duplicates of one another and to represent them as such on the map.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Paweł Rychlik				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1310</link>
		<dc:creator><![CDATA[Paweł Rychlik]]></dc:creator>
		<pubDate>Wed, 10 Jul 2013 11:25:30 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1310</guid>
					<description><![CDATA[Andrei,
I currently work on a quite similar problem with basically the same toolset. I&#039;ve come across a problem of score relativity in ElasticSearch - i.e. the score yielded by ElasticSearch varies depending on e.g. the number of already processed &#038; deduplicated documents (of course, the relation here is much more complex).
Situation A:
You are at the very beginning of your deduping process. You have very few items in your deduplicated dataset. You take a new one (say X), search for its probable duplicates against your dataset. You get some results from ES, with e.g. the highest score being 0.5.
Situation B:
You are far into the deduping process of your data. You have a very large set of deduplicated items already. Again - you take a new one (the same X), search for its probable duplicates against your dataset. You get some results from ES, but the scoring is now on a completely different level - say 2.0.
In your article you mentioned that you have a static score cut-off threshold = 0.5 (which you have refined to get better results).
How can you determine what a score of 0.5 means without knowing the whole context of your data? Shouldn&#039;t the cut-off point be evaluated at runtime based on some statistics or the like? Maybe you would get even better results if you set the score threshold to 0.5 at the very beginning of the dedupe process, but pushed it towards e.g. 1.5 once you have a broader amount of data?]]></description>
		<content:encoded><![CDATA[<p>Andrei,<br />
I currently work on a quite similar problem with basically the same toolset. I&#8217;ve come across a problem of score relativity in ElasticSearch &#8211; i.e. the score yielded by ElasticSearch varies depending on e.g. the number of already processed &amp; deduplicated documents (of course, the relation here is much more complex).<br />
Situation A:<br />
You are at the very beginning of your deduping process. You have very few items in your deduplicated dataset. You take a new one (say X), search for its probable duplicates against your dataset. You get some results from ES, with e.g. the highest score being 0.5.<br />
Situation B:<br />
You are far into the deduping process of your data. You have a very large set of deduplicated items already. Again &#8211; you take a new one (the same X), search for its probable duplicates against your dataset. You get some results from ES, but the scoring is now on a completely different level &#8211; say 2.0.<br />
In your article you mentioned that you have a static score cut-off threshold = 0.5 (which you have refined to get better results).<br />
How can you determine what a score of 0.5 means without knowing the whole context of your data? Shouldn&#8217;t the cut-off point be evaluated at runtime based on some statistics or the like? Maybe you would get even better results if you set the score threshold to 0.5 at the very beginning of the dedupe process, but pushed it towards e.g. 1.5 once you have a broader amount of data?</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Andrei				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1308</link>
		<dc:creator><![CDATA[Andrei]]></dc:creator>
		<pubDate>Wed, 06 Mar 2013 03:52:40 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1308</guid>
					<description><![CDATA[Your project looks quite interesting, and I&#039;m sure it produces better-quality results than my approach. I wish I had run across it at the time. But I wanted to keep the number of pieces of technology as small as possible, and using ElasticSearch was a quick &amp; dirty approach that seemed to yield decent results.]]></description>
		<content:encoded><![CDATA[<p>Your project looks quite interesting, and I&#8217;m sure it produces better-quality results than my approach. I wish I had run across it at the time. But I wanted to keep the number of pieces of technology as small as possible, and using ElasticSearch was a quick &#038; dirty approach that seemed to yield decent results.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Lars Marius Garshol				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1307</link>
		<dc:creator><![CDATA[Lars Marius Garshol]]></dc:creator>
		<pubDate>Sat, 02 Mar 2013 09:19:08 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1307</guid>
					<description><![CDATA[It&#039;s interesting to see that you chose to approach this problem by simply querying the search engine directly. I&#039;m surprised your results seem so good, because this is generally a tricky problem, where information from many fields (name, address, phone number, geoposition, etc.) all needs to be considered and weighed against one another.
I had the same problem and chose to build &lt;a href=&quot;http://code.google.com/p/duke/&quot; rel=&quot;nofollow ugc&quot;&gt;a full record linkage engine&lt;/a&gt; on top of Lucene. It basically uses Lucene to find candidate matches (much like you do), but then does configurable detailed comparison with weighted Levenshtein, q-grams, etc., and combines the results for different properties using Bayes&#039;s Theorem. It also cleans and normalizes data before comparison.
Even that requires a lot of tuning and work to produce good results, so, like I said, I&#039;m surprised your results look so good. But maybe I&#039;m missing something.]]></description>
		<content:encoded><![CDATA[<p>It&#8217;s interesting to see that you chose to approach this problem by simply querying the search engine directly. I&#8217;m surprised your results seem so good, because this is generally a tricky problem, where information from many fields (name, address, phone number, geoposition, etc.) all needs to be considered and weighed against one another.<br />
I had the same problem and chose to build <a href="http://code.google.com/p/duke/" rel="nofollow ugc">a full record linkage engine</a> on top of Lucene. It basically uses Lucene to find candidate matches (much like you do), but then does configurable detailed comparison with weighted Levenshtein, q-grams, etc., and combines the results for different properties using Bayes&#8217;s Theorem. It also cleans and normalizes data before comparison.<br />
Even that requires a lot of tuning and work to produce good results, so, like I said, I&#8217;m surprised your results look so good. But maybe I&#8217;m missing something.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Bloom Filters Quickie by Tech talk: Bloom Filters &#124; onoffswitch.net				</title>
				<link>https://zmievski.org/2009/04/bloom-filters-quickie/#comment-905</link>
		<dc:creator><![CDATA[Tech talk: Bloom Filters &#124; onoffswitch.net]]></dc:creator>
		<pubDate>Thu, 28 Feb 2013 16:27:14 +0000</pubDate>
		<guid isPermaLink="false">http://gravitonic.com/?p=788#comment-905</guid>
					<description><![CDATA[[...] http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ http://zmievski.org/2009/04/bloom-filters-quickie http://www.perl.com/pub/2004/04/08/bloom_filters.html [...]]]></description>
		<content:encoded><![CDATA[<p>[&#8230;] <a href="http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/" rel="nofollow ugc">http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/</a> <a href="http://zmievski.org/2009/04/bloom-filters-quickie" rel="nofollow ugc">http://zmievski.org/2009/04/bloom-filters-quickie</a> <a href="http://www.perl.com/pub/2004/04/08/bloom_filters.html" rel="nofollow ugc">http://www.perl.com/pub/2004/04/08/bloom_filters.html</a> [&#8230;]</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Ideas of March by Devon Sutton				</title>
				<link>https://zmievski.org/2011/03/ideas-of-march/#comment-1296</link>
		<dc:creator><![CDATA[Devon Sutton]]></dc:creator>
		<pubDate>Fri, 08 Feb 2013 03:03:30 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1110#comment-1296</guid>
					<description><![CDATA[They loved jumping on the Ruby hate bandwagon when Twitter was going through its difficulties. Little Bo Peep has been quite silent since.]]></description>
		<content:encoded><![CDATA[<p>They loved jumping on the Ruby hate bandwagon when Twitter was going through its difficulties. Little Bo Peep has been quite silent since.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Andrei				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1306</link>
		<dc:creator><![CDATA[Andrei]]></dc:creator>
		<pubDate>Sun, 30 Sep 2012 16:53:01 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1306</guid>
					<description><![CDATA[The groups are stored in MongoDB in a separate collection. Each entry just lists the IDs of the places in the group. Querying is easy with the $in operator, to find which group a given place belongs to.]]></description>
		<content:encoded><![CDATA[<p>The groups are stored in MongoDB in a separate collection. Each entry just lists the IDs of the places in the group. Querying is easy with the $in operator, to find which group a given place belongs to.</p>
]]></content:encoded>
						</item>
						<item>
				<title>
				Comment on Duplicates Detection with ElasticSearch by Dennis				</title>
				<link>https://zmievski.org/2011/03/duplicates-detection-with-elasticsearch/#comment-1305</link>
		<dc:creator><![CDATA[Dennis]]></dc:creator>
		<pubDate>Sun, 30 Sep 2012 06:36:46 +0000</pubDate>
		<guid isPermaLink="false">http://zmievski.org/?p=1122#comment-1305</guid>
					<description><![CDATA[Hi, once you have the items in groups, how do you go about querying for items and collapsing on the groups you have created?]]></description>
		<content:encoded><![CDATA[<p>Hi, once you have the items in groups, how do you go about querying for items and collapsing on the groups you have created?</p>
]]></content:encoded>
						</item>
			</channel>
</rss>
