ezkl

Twitter Training Part Deux

admin — Tue, 27 Apr 2010 17:02:31 +0000

Due to circumstances that are, for the most part, entirely within my control (namely laziness), I haven’t updated this blog in a while. My last post focused on the idea of training text classification systems using Twitter as a corpus and some of the issues associated with that process. After spending a good deal more time thinking about my initial approach, I recently realized that I may have bitten off more than I could chew.

In essence, I underestimated the complexity of emotional classification. That probably seems obvious to most, but I think the complexity arises from the fact that emotionality, especially as it is conveyed through language, isn’t black-and-white enough for conventional classification systems to parse effectively. It also seems likely that Twitter, while a vast source of linguistic information, does not allow enough characters for anyone to convey any real sense of emotion.

Rather than giving up on the idea completely, I’ve begun to reformulate my research method to focus on something a bit more cut-and-dried: political opinion. While I personally do not believe that political discourse is simple, the current public political dialogue in the United States is so marginalized and segmented that I believe classification will be far simpler.

So, again, I pose a question: what would you like to know about the political leanings of different demographics in the US?

Training Classification Systems With Twitter

admin — Sat, 20 Feb 2010 19:25:38 +0000

I’ve been using Twitter as a data source for content generation and language classification because of the ubiquity of hashtags. For those of you not in the know, hashtags are the weird #something bits you see all over Twitter. In some ways, hashtags exist for the exact purpose of classifying language: they provide an extra level of context to short-form writing. Twitter makes it very easy to search and mine various hashtags with its occasionally moody, but generally effective, search API.

There are, as with any mining method, a few major caveats. In my opinion, the first and foremost is one that impacts all language processing and mining projects: colloquialisms and regional variations in the meaning of words. This issue can be seen when doing a search for something like #sad. A tweet like “My girl left me #sad” would fit most people’s classification of the word sad when it is intended to mean unhappy or sorrowful. However, “That cab driver just picked his nose and wiped it on his dashboard! #sad” wouldn’t fall into the same classification as the previous tweet, even though it still draws on a commonly accepted meaning of the word.

One way to get around this issue is to first train your classification system with a little hand-fed data and then use that initial training to help bootstrap the system. While you can paint yourself into a bit of a self-referential corner with this sort of method, building it in from the beginning can play an important role in making your system scalable and letting it “grow more intelligent” as you add more training data.
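Here is a rough sketch of what that kind of bootstrapping loop could look like. It is not the code from my system: the tiny naive Bayes classifier, the "sorrow" and "disgust" labels, and the 0.8 confidence threshold are all assumptions made up for illustration. The idea is simply to seed the classifier by hand, let it label new text, and only keep the guesses it is confident about.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


class TinyNaiveBayes:
    """Hypothetical stand-in for a real classifier; word counts only."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def probabilities(self, text):
        """Return a normalized probability for each known label."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                # Laplace smoothing so unseen words do not zero out a label.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            log_scores[label] = score
        # Shift by the largest score before exponentiating to avoid underflow.
        peak = max(log_scores.values())
        scaled = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(scaled.values())
        return {label: value / norm for label, value in scaled.items()}


def bootstrap(classifier, unlabeled, threshold=0.8):
    """Feed confident predictions back in as new training data."""
    accepted = []
    for text in unlabeled:
        probs = classifier.probabilities(text)
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:  # threshold is an assumed cutoff
            classifier.train(text, label)
            accepted.append((text, label))
    return accepted


# Hand-fed seed data (made-up examples).
clf = TinyNaiveBayes()
clf.train("my girl left me", "sorrow")
clf.train("i miss her so much", "sorrow")
clf.train("that cab driver is gross", "disgust")
clf.train("he wiped it on his dashboard gross", "disgust")

accepted = bootstrap(clf, ["she left me and i miss her", "hello world"])
```

The threshold is where the self-referential corner shows up: set it too low and the system happily reinforces its own mistakes, set it too high and it never learns anything new.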

The second issue is ensuring that you are only mining the written languages your classification system is being trained to classify. Because Twitter is a global community, it is not uncommon to have many languages appearing in search results. This issue is easier to overcome thanks to API-based language detection systems like the one provided by Google. You will occasionally get errors due to the non-traditional grammar frequently used on Twitter. It is also wise to strip out any @ and # tags before sending a tweet through the language detection system.
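As a minimal sketch of that last step, assuming a simple regular expression is good enough, something like the following would remove the tags before the text goes anywhere near a language detector. The function name and the pattern are my own illustration; the call to the detection service itself is left out.

```python
import re

# Assumed pattern: an @ or # followed by letters, digits, or underscores.
TAG_PATTERN = re.compile(r"[@#]\w+")


def strip_tags(tweet):
    """Remove @mentions and #hashtags, then collapse leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", tweet)
    return " ".join(without_tags.split())


cleaned = strip_tags("My girl left me #sad")
# cleaned is now "My girl left me"
```

Note that this drops the hashtag word entirely, so keep the original tweet around if the tag is also your training label.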

After overcoming some of these initial issues, the benefit of having access to a constantly updated, semi-classified data set becomes evident. I am currently training a fork of my system using the method I’ve described with five separate hashtags to help assess the efficacy of the method. A cron job runs every 15 minutes; I parse the results, make sure I’m not processing the same tweet twice, and, if a tweet passes the initial tests, add it to the system. While some queries produce considerably more tweets than others, my hope is that at least some of my training data will be complete enough to expand into other, less pre-qualified territory.
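The "not processing the same tweet twice" part could be handled roughly like this. It is a sketch under assumptions, not my actual job: I assume each search result is a dict with a unique "id" field, and that the set of seen IDs is kept in a small JSON file between cron runs. The file layout and function names are hypothetical.

```python
import json
from pathlib import Path


def load_seen(path):
    """Load previously seen tweet IDs, or an empty set on the first run."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()


def save_seen(path, seen):
    """Persist seen IDs so the next cron run can skip them."""
    path.write_text(json.dumps(sorted(seen)))


def filter_new(tweets, seen):
    """Return only tweets whose ID has not been seen; updates `seen` in place."""
    fresh = []
    for tweet in tweets:
        tweet_id = tweet["id"]  # assumed field name
        if tweet_id not in seen:
            seen.add(tweet_id)
            fresh.append(tweet)
    return fresh


# Two overlapping batches, as might come back from consecutive runs.
seen = set()
first = filter_new([{"id": 1, "text": "a"}, {"id": 2, "text": "b"}], seen)
second = filter_new([{"id": 2, "text": "b"}, {"id": 3, "text": "c"}], seen)
```

In a real run, each invocation would call load_seen first, filter the new search results, and call save_seen at the end.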

In a few weeks I plan on doing a bit of a subjective, qualitative test to discern the accuracy of this approach. I’ll post the results here when I’m done.

Help Me Classify Language!

admin — Wed, 17 Feb 2010 01:08:24 +0000

This is an open call to everyone out there in internet land. I’m working on a new project whose purpose is to classify the various emotions that pieces of text evoke in people. To do that, I need lots of chunks of text that make you feel a certain way, regardless of their origin, length, or style. It would be helpful if you believe that the same piece of text might make others feel the same way as you, but beggars can’t be choosers.

I’m going to start off with 4 broad emotions and expand from there:

Anger
Frustration
Happiness
Desire

If you feel so inclined, I would be forever grateful if you would post quotes, links, or references of any kind to written material that makes you feel any of those ways in the comment section below. If I know you in the real world, I will repay you with high-fives, beer, a long cyclical conversation about absolutely any subject you like, or something of lesser-or-equal-value.

I would also be grateful if you would share this post with friends, family, colleagues, etc., as the more information I have, the better the chance the project will be a success.

Note: This information will not be sold, marketed, or profited from in any way. If it leads to anything of merit, I will publish my findings publicly.

Bloggin’

admin — Tue, 16 Feb 2010 02:01:29 +0000

A few days ago I started toying with the idea of starting a blog to collect my thoughts and various web presences into one consolidated space. I’ve also been meaning to make use of the ezkl.org domain for ages. I’ll be playing around with the setup, plugins, theme, etc. quite a bit, so definitely consider this a work in progress.

Most likely, no one but me will ever read this. If you are reading this, then I probably know you and it’d be totally cool if you left a comment below.