semanticvoid

Method and System for Advertising in Real World using a T-shirt with a Printed Advertisement

anand kishore — Thu, 23 Feb 2012 21:32:03 +0000

A method and system for advertising in real world using a t-shirt with a printed advertisement is disclosed. The method and system provides more eyeballs to an advertisement in a physical advertising world.

(http://ip.com/IPCOM/000214395)

WYCIWYS

anand kishore — Wed, 21 Sep 2011 21:12:39 +0000

Many a times I’ve stared at Explored Flickr Photos and tried grokking its artistic nuances. Due to lack of artistic sensibility, at times I fail to understand the techniques or properties that the photographer used or intended to capture. But the Flickr community is brimming with experts who often chime in about what they like/see in comments. My #nlproc hack (for the upcoming Yahoo! Winter Hackday) aims to solve this by summarizing this expert knowledge (wisdom of crowd) for a photograph.

What You Comment Is What You See (WYCIWYS) is a Flickr hack that harnesses the comments of photos to determine the attributes/properties of the photo that people are talking about. It also gives a sentiment score (+ve) for each attribute to help a user gauge what other users find most interesting about a photo. Following are some outputs for WSCIWYS (click to zoom):

what the bleep!

Anand Kishore — Fri, 04 Mar 2011 19:13:15 +0000

Profanity is often prevalent in user generated content (like comments). Websites that do not want to display such profane comments/content currently employ masking as a solution to get rid of profanity. Masking replaces the profanity in the content with characters like ####. The masked content still though conveys the existence of profanity to the user. Humans have built up a great language model to infer missing words. Try it yourself – it should be easy for you to guess a bunch of profanity words for the following sentence:

What the ####!

My hack (Bleep) for the Yahoo! Spring ’11 Hackday is yet another natural language hack that tries to remove the profanity from a comment without altering the semantics of the content. In brief, removing the profanity word from the content makes the parse tree less probable. The algorithm tries to alter this improbable parse tree to find the best local parse tree.

Following are some corrections suggested by Bleep:

a speed gun for spam

Anand Kishore — Fri, 25 Feb 2011 00:56:09 +0000

Apart from the content there are various features from metadata (like IP etc) which can help tell a spammer and regular user apart. Following are results of some data analysis (done on roughly 8000+ comments) which speak of another feature which proves to be a good discriminator. Hopefully this will aid others fighting spam/abuse (if not already using a similar feature).

The discriminator referred above is typing speed. The graph above plots the content length of a comment posted by a user against the (approximate) time he took to write it. If a user posts more than one comment in window of 5-10 minutes, we can consider those comments as consecutive posts. Any comment falling outside this window is ignored. In the above plot, the time is inferred by [time_of_current_post - time_of_previous_post] for consecutive posts. Thus typing speed is estimated as (content_lenght)/(time_delta_between_two_posts). This is a non-intrusive way of measuring typing speed from the logs. Though for a more accurate number one could always instrument the page with javascript (see example here). The dataset was manually labeled as spam (depicted in red) and ham (depicted in blue).

Wikipedia lists the average words per minute (wpm) for a regular internet user at around 30 wpm. With a conversion factor of 5 to characters per minute (cpm), this amounts to ~2.5 characters per second. The green line in the plot depicts a projection of the content length a user could have typed in the given time with an average typing speed (of ~2.5 chars per sec). We observe that this line clearly separates out most spam from ham. The ham posts that fall above this line are usually trolls (as observed).

This turns out to be a nice feature to tell spammers (bots and non-bots), trolls, and regular users apart. Bots often fail the turing test and don’t try hard enough to be more human like. Non-bot spammers on the other hand have to take the pains to type their spam comment repeatedly and usually end up pasting it.Try out the example here .

So spammers fix yourselves cause we have the speed gun to pull you over.

the evolving spammer

Anand Kishore — Thu, 09 Sep 2010 03:10:31 +0000

Though I’ve only recently started tackling it (spam), what I hear from veterans is that spam is hard problem. It is so not because its difficult to model (unlike some sub-domains in NLP) but because essentially it is a battle of human-vs-human. The opponent is now a constantly evolving machine. They learn and they learn fast. This keeps those fighting spam on their toes and you need to react to new techniques that they learn to get past the filters. Most of the work thus involved is on a reactive basis. Basically you keep iterating the following cycle: deploy -> observe -> learn -> model -> deploy

Now lets consider a sample spam text: “Find sexy girls and guys at xyz.com”. The simplest classifier (lets assume Bayesian text classifier) will start to crumble once the spammer changes the text to “fin d sex y girl s an d guy s a t xyz.co m”. So you will label and retrain your classifier to catch this new trick.

To get out of this vicious reactive cycle, you need to test your model proactively against the possible techniques a spammer could come up with to get away. This is where comes in YODA (acronym for Overly Determined Abuser), a genetic programming based model of a spammer I built (yes we have 20% time as well) to break our spam detection models. As any other genetic algorithm framework, it needs implementations of fitness functions and genome functions. The idea is to model characteristics of a spammer (variables that a spammer can manipulate) as genome functions. The genome functions represent the minimalist change that can be made to the text. For instance, changing the case of characters, modifying sentence delimiters, modifying word delimiters etc. The genome functions need not be just text modification functions but could also represent other attributes of a spammer (like IP etc). The fitness functions represent the criteria the spammer is trying to optimize i.e. to get past the filters with minimal distortions to the spam text. This could be the edit distance combined with the score returned by the model/filter.

Once the fitness function and many such genome functions have been defined, you can set these spam bots free to undergo selection, crossover and mutation. In the end (when you decide to stop the evolution), you will end up with bots that are far more complex than just the basic genome functions defined. The transformations to the original text might be beyond what you could have thought of testing against.

Following are some results of this model on the same spam text using the above mentioned basic genome functions:

- F.ind.s.exy g.irls.a.nd.g.uys.a.t.x.yz.com
- f iñd s exy gi rls ã ñd g úys a t xy z.çom
- FI ND sE Xy Gir Ls anD gu yS AT XyZ. COM
- Find_sexy girls and_guys at_xyz.com

stop words

Anand Kishore — Wed, 25 Aug 2010 04:00:16 +0000

In a recent implementation for a near duplicate detection task I relied on stop words as key features in extracting signatures from text. The results turned out to be good but that’s not what I’m focusing on here. This was quite contrary to the mindset in the IR/NLP domain we have been accustomed to, where these words are considered meaningless and need to be got rid of before building any model/index. These word on the other hand encode a plethora of information like tense, plurality, (un)certainty, subjectivity and more. They bind the semantics of a sentence together and give them context. Yet (atleast in the IR sense) we give them a negative connotation (STOP/NN -0.140192 sentiment). I would go a step ahead by saying that we should stop calling them *stop* words and instead accept the inability of some IR systems of making correct use of them. How about *glue* words for a change? Or maybe not.

PS: Incase you are looking for a list of stop words for different languages here is a good list – http://members.unine.ch/jacques.savoy/clef/

Twitter Plots: the case of the 2000 ‘following’

Anand Kishore — Fri, 28 May 2010 03:27:25 +0000

I came across this interesting pattern while trying to visualize some of the Twitter streaming data. The following charts plot the ‘following’ counts vs the ‘followers’ counts (for ~200K user accounts). The data represents one hours worth of data obtained via the streaming API. User accounts falling around the line y ~= 0 tend to generally be celebrities (musicians, sportsmen etc), companies, news and info bots (like the WSJ, CNN etc). The general population usually falls around the line y = x (the ‘I follow you, You follow me’ kind). But thats not whats interesting here (we all knew that). Looking at the zoomed in plots (figure 2 and figure 5), we see a distinct square formed by at (0,0) (2000,2000). This is also observed in another days data (figure 5) so its not just an anomaly. The plateau formed at y=2000 is a bit perplexing. I can’t seem to get my head around that. Figure (3) tries to look at the user accounts with ~2000 ‘following’ – a large number of these users turn out to be spam bots. I suspect most spam account (bots) are concentrated around this region. Its as if the spam bots tend to follow around 2000 users at max so as to not alert the spam controls by mass following users.

Any hypothesis that comes to your mind?

Figure 1: plot for day 1

Figure 2: plot for day 1 (zoomed)

Figure 3: plot for day 1 with y ~ 2000

Figure 4: plot for day 2

Figure 5: plot for day 2

Reading Less Is Reading More

Anand Kishore — Wed, 07 Oct 2009 08:19:27 +0000

If information is what drives you to the internet, like me, you might be spending roughly 60-70% of your time online reading blogs, news and feeds (not to forget twitter). For me at least, reading online has superseded email (and updating social networks) as the most time consuming activity. And yet everyone is busy generating more content rather than finding a solution to consume all this information. We are trying to tackle this problem precisely with Dygest. At its core Dygest is a summarization engine that tries to sift through all the noise and present only the *real* content/news contained in any (news) article/text. Recently, we released an experimental version of a feed summarizer that uses the Dygest engine to summarize blogposts/news for any RSS/ATOM feed. This summarized feed can be subscribed in any feed reader like Bloglines, Google Reader etc.

NOTE: A feed that has not been encountered by our system ever before should be summarized in a couple of minutes.

On the whole with Dygest, reading blogs has now become much faster, much more concise and consuming information has become a great deal easier. Imagine the time saved reading the summarized version as compared to the original post (also you are not overwhelmed with useless information). See for yourself below:

Original Post

Summarized Post

While you might have the urge to head over to Dygest and summarize your entire subscription list on Google Reader, I would recommend reading this post a bit further for some real cool stuff we have in store. If you must though – click here to Dygest.

Summarizing Your Twitter Links

Readtwit is a really cool service launched recently, which extracts links from your twitter feed and packages them in a clean RSS format. The awesome combination of Readtwit along with Dygest yields a summarized twitter feed delivered to your favorite feed reader.

Steps to get a summarized twitter feed:

(1) Sign into Readtwit.
(2) Copy the link on the ‘Get me the feed’ button:

(3) Paste this link into the Dygest interface and subscribe to the summarized feed returned in your favorite feed reader.

More To Come

This is just an experimental release of Dygest and so do send in your feedback on the summaries and help us improve. In the coming months we are working on improving the algorithms and churning out other great applications of Dygest (there is something really cool in the works). So while we are busy teaching computers to read, Dygest your feeds – because reading less is reading more.

Web Content Extraction Dataset

Anand Kishore — Sun, 23 Aug 2009 05:42:10 +0000

For a recent project, we (sudheer_624 and I) have had to deal with developing algorithms to extract the true content from any given web page. By true content I mean the text excluding the ads, navigational links/text, etc even excluding comments (if any). Thus, given a blog post we are interested in extracting just the content of the post and not the comments and other surrounding text. We did not come across any dataset for the given task that would let us evaluate our algorithms. We recently generated our own dataset for this purpose and would like to share it with anyone tackling a similar problem.

The dataset contains the html source and text content (true content) for around ~4000 webpages. One metric to measure your algorithm against this dataset could be the edit distance. If you do use this dataset, it would be great if you could share the results of your algorithms for benchmarks to compare against. I’ll be updating this post with the accuracy of our algorithm soon enough.

Download the dataset here (gzipped)

`Fact`orize Your Search

Anand Kishore — Fri, 14 Aug 2009 07:37:08 +0000

Dygest and a hackday later, @sudheer_624 and I (@semanticvoid) are back with ‘dfacto’, codename for our latest search hack for Yahoo! Hackday Summer 2009.

I think that search is undergoing a paradigm shift – its no longer about who presents the best ten blue links but now more about presenting the answers upfront. Dfacto (pronounced as ‘de facto‘, Latin for ‘by [the] fact‘) is aimed at addressing this issue. A large percentage (nearly 68%) of queries are informational queries – one where the searcher knows what she’d like to do or find but does not know how this can be achieved. Dfacto is aimed primarily at addressing this class of queries by presenting a set of facts associated with the query/topic to the searcher. It uses natural language algorithms to get facts that are most “semantically” related to the query. In lay terms, it literally tries to understand your query and the results. I’ll save the algorithmic details for another post. The few examples below show how it works:

Disclaimer: This is a work in progress, so you might notice a few ‘facts’ that are irrelevant to the query.

Lets say the searcher is (losing hair and) looking for causes of hair loss. Normally he/she would need to click through a bunch of links to get an overview on the causes. This hack on the other hand makes life a bit easier by presenting the causes upfront (click to enlarge):

click to enlarge

Along with the facts, we also list the source from where it was extracted. Alternatively, the searcher can also select a bunch of facts he/she thinks are relevant and refine the search. This in turn would yield a new set of ‘web results’ along with new refined and related ‘facts’.

Another example (one which I particularly like) is a query about ‘table manners’. This precisely lists a set of etiquette’s to follow at the table (click to enlarge).

click to enlarge

Alternatively, Dfacto also serves well as a product research tool. A query for ‘iphone 3gs’ yeilds (click to enlarge):

click to enlarge

On another note, if you have a date in the coming weeks you might be interested in reading the list below (:

click to enlarge

Happy hacking!