<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-3639231664593965268</atom:id><lastBuildDate>Sat, 05 Oct 2024 03:22:03 +0000</lastBuildDate><category>Scraping</category><category>Data transformations</category><category>Selenium</category><category>APIs</category><category>Agents</category><category>Browser automation</category><category>Data visualization</category><category>Dublin Core</category><category>Europeana</category><category>OAI-PMH</category><category>Διαύγεια</category><category>ΥπερΔιαύγεια</category><category>DSpace</category><category>Digital Libraries</category><category>Downloading</category><category>Federated search</category><category>Geo-location</category><category>Google Charts</category><category>Institutional repositories</category><category>JavaScript</category><category>Music Library Lilian Voudouri</category><category>Open Source</category><category>PDF Downloader</category><category>Search engines</category><category>Veria Central Public Library</category><category>Web archiving</category><category>Wrappers</category><category>AJAX</category><category>Archivability</category><category>Athos Memory</category><category>BCI</category><category>CAQDA</category><category>Celery</category><category>Concurrent workers</category><category>Conference</category><category>D3.js</category><category>Data migration</category><category>Digital preservation</category><category>E-Learning</category><category>Ethnography</category><category>FLOSS</category><category>FP7</category><category>Forums</category><category>Geographic data</category><category>Google 
Maps</category><category>Heritrix</category><category>HideMyAss</category><category>Informatics</category><category>Internet Archive</category><category>JSON</category><category>Job queues</category><category>Linked Data</category><category>Michelin Maps</category><category>Mobile apps</category><category>Mobile devices</category><category>Netnography</category><category>OCR</category><category>Open Archives</category><category>Open data</category><category>PDF</category><category>Parliament</category><category>PhantomJS</category><category>Podcasts</category><category>Price monitoring</category><category>Proxies</category><category>Proxify</category><category>Publications</category><category>Qualitative Analysis</category><category>Robots.txt</category><category>Sauce Labs</category><category>ScraperWiki</category><category>Social sites</category><category>SwitchProxy</category><category>TEL-MAP</category><category>Tech Box</category><category>Tesseract</category><category>Testing</category><category>VPN</category><category>Web services</category><category>Wikipedia</category><category>XML</category><category>XPath</category><category>XSLT</category><category>Yahoo PlaceFinder</category><category>Z39.50</category><category>dbWiz</category><category>e-commerce</category><category>e-procurement</category><category>iPhone simulator</category><category>myVisitPlanner</category><category>openarchives.gr</category><category>spynner</category><category>wget</category><title>deixto.com/blog</title><description></description><link>http://deixto.blogspot.com/</link><managingEditor>noreply@blogger.com (kntonas)</managingEditor><generator>Blogger</generator><openSearch:totalResults>39</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-7779916029708740510</guid><pubDate>Tue, 21 Jan 2014 07:10:00 
+0000</pubDate><atom:updated>2014-01-23T11:53:18.644+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Celery</category><category domain="http://www.blogger.com/atom/ns#">Concurrent workers</category><category domain="http://www.blogger.com/atom/ns#">Job queues</category><title>Celery task/ job queue</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Queue_(abstract_data_type)&quot; target=&quot;_blank&quot;&gt;Queues&lt;/a&gt; are very common in computer science and in real-world programs. A queue is a &lt;a href=&quot;http://en.wikipedia.org/wiki/FIFO&quot; target=&quot;_blank&quot;&gt;FIFO&lt;/a&gt; data structure: new elements are added at the rear, and the first items inserted are the first to be removed and served. A nice, thorough collection of queueing systems can be found on &lt;a href=&quot;http://queues.io/&quot;&gt;queues.io&lt;/a&gt;. Task/job queues in particular are used by numerous systems and services across a wide range of applications. They can reduce the complexity of system design and implementation and boost scalability. Some of the most popular ones are &lt;a href=&quot;http://www.celeryproject.org/&quot; target=&quot;_blank&quot;&gt;Celery&lt;/a&gt;, &lt;a href=&quot;http://python-rq.org/&quot; target=&quot;_blank&quot;&gt;RQ (Redis Queue)&lt;/a&gt; and &lt;a href=&quot;http://gearman.org/&quot; target=&quot;_blank&quot;&gt;Gearman&lt;/a&gt;.&lt;br /&gt;
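The FIFO behaviour described above is easy to see with Python's built-in deque, which is a common way to get an efficient queue:

```python
from collections import deque

# A FIFO queue: enqueue at the rear, dequeue from the front.
queue = deque()
queue.append("job-1")   # first in
queue.append("job-2")
queue.append("job-3")

first_served = queue.popleft()  # first out
print(first_served)  # → job-1
```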
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The one we recently stumbled upon and immediately took advantage of was &lt;a href=&quot;http://www.celeryproject.org/&quot; target=&quot;_blank&quot;&gt;Celery&lt;/a&gt;, which is written in Python and based on distributed message passing. It focuses on real-time operation but supports scheduling as well. The units of work to be performed, called tasks, are executed concurrently on one or more worker servers using multiprocessing. Workers run in the background waiting for new jobs, and when a task arrives (and its turn comes) a worker processes it. Some of Celery&#39;s uses are handling long-running jobs, asynchronous task processing, offloading heavy tasks, job routing, &lt;a href=&quot;https://github.com/NetAngels/celery-tasktree&quot; target=&quot;_blank&quot;&gt;task trees&lt;/a&gt;, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.celeryproject.org/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipxWL7nPGFuLknq1kVFoHpCcsr35JVi_Jh0tdFpZ2laHLZLgbwwMomkqfOYPxSTnTvZ3PkRI3zwGncAgHxyxLZh43WO3WBn9PntaE5vFhVdr09nSX9i0g8Is9olWaaYuN3GTqS6i5HETR1/s1600/celery.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&amp;nbsp; &amp;nbsp; Now let&#39;s see a use case where Celery untied our hands. As you might already know, for quite some time we have been developing Python, &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;-based scripts for web scraping and browser automation. While executing a script, we occasionally came across data records that had to be dealt with separately by another process or script. For instance, in the context of the recent &lt;a href=&quot;http://deixto.com/deixto-blog/rss_feed_for_the_greek_e-procurement/&quot; target=&quot;_blank&quot;&gt;e-procurement project&lt;/a&gt;, when scraping through &lt;a href=&quot;http://deixto.blogspot.gr/2012/02/deixto-components-clarified.html&quot; target=&quot;_blank&quot;&gt;DEiXToBot&lt;/a&gt; the detail page of a payment (published on the &lt;a href=&quot;http://www.eprocurement.gov.gr/&quot; target=&quot;_blank&quot;&gt;Greek e-procurement platform&lt;/a&gt;), you could find a reference to a relevant contract which you would also like to download and scrape. This contract could itself link to a tender notice, and the latter might be a corrected version of an existing tender or connect in turn with yet another decision or document.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Thus, we thought it would be more convenient to add the unique identifiers of these extra documents to a queue and let a background worker get the job done asynchronously. It should be noted that the lack of persistent links on the e-procurement website made it harder to download a detail page programmatically at a later stage: you could access it only by performing a new search with its ID and automating a series of Selenium steps that depend on the type of the target document.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, it was not long before we installed Celery on our Linux server and started experimenting with it. We were amazed by its simplicity and efficiency, and we quickly wrote a script that fitted the bill for the e-procurement scenario described above. The code provided an elegant, practical solution to the problem at hand and looked something like this (note the recursion!):&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;from celery import Celery&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;app = Celery(&#39;tasks&#39;, backend=&#39;amqp&#39;, broker=&#39;amqp://&#39;)&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;@app.task&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;def download(id):&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;nbsp; ... selenium_stuff ...&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;nbsp; if reference_found:&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; download.delay(new_id) # delay sends a task message&lt;/span&gt;&lt;/div&gt;
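The recursive fan-out pattern is easier to see without the Celery machinery. Here is a minimal stand-in: a hypothetical document graph (the IDs and references below are made up for illustration) and a plain in-process queue playing the role of the broker and worker:

```python
from collections import deque

# Hypothetical document graph: a payment references a contract,
# which references a tender notice. These IDs are illustrative only.
REFERENCES = {"payment-1": "contract-7", "contract-7": "tender-3"}

def process_all(start_id):
    """Simulate the worker loop: each task may enqueue a follow-up task,
    just as download.delay(new_id) does in the Celery version."""
    queue = deque([start_id])
    processed = []
    while queue:
        doc_id = queue.popleft()
        processed.append(doc_id)      # stands in for the Selenium scraping
        new_id = REFERENCES.get(doc_id)
        if new_id:                    # reference_found
            queue.append(new_id)      # stands in for download.delay(new_id)
    return processed

print(process_all("payment-1"))  # → ['payment-1', 'contract-7', 'tender-3']
```

With real Celery the queue lives in the broker and the loop runs inside worker processes, so the chain of linked documents is crawled asynchronously, outside the main scraping script.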
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; In conclusion, we are happy to have found Celery; it&#39;s really promising and we thought it would be nice to share it with you. We look forward to using Celery further for our heavy scraping needs and are glad to have added it to our arsenal.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2014/01/celery-task-job-queue.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipxWL7nPGFuLknq1kVFoHpCcsr35JVi_Jh0tdFpZ2laHLZLgbwwMomkqfOYPxSTnTvZ3PkRI3zwGncAgHxyxLZh43WO3WBn9PntaE5vFhVdr09nSX9i0g8Is9olWaaYuN3GTqS6i5HETR1/s72-c/celery.jpg" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-7336264484231739650</guid><pubDate>Fri, 17 Jan 2014 08:17:00 +0000</pubDate><atom:updated>2014-01-17T23:40:33.633+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">HideMyAss</category><category domain="http://www.blogger.com/atom/ns#">Proxies</category><category domain="http://www.blogger.com/atom/ns#">Proxify</category><category domain="http://www.blogger.com/atom/ns#">SwitchProxy</category><category domain="http://www.blogger.com/atom/ns#">VPN</category><title>About web proxies</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Rightly or wrongly, there are times when one would like to conceal one&#39;s IP address, especially while scraping a target website. Perhaps the most popular way to do that is by using web proxy servers. A &lt;a href=&quot;http://en.wikipedia.org/wiki/Proxy_server&quot; target=&quot;_blank&quot;&gt;proxy server&lt;/a&gt; is a computer system or an application that acts as an intermediary for requests from clients seeking resources from other servers. Thus, web proxies allow users to mask their true IP and surf anonymously online. We, however, are mostly interested in their use for web data extraction and automated systems in general. So, we did some Google searching to locate notable proxy service providers, but surprisingly most of the results were dubious websites of low trustworthiness and low Google &lt;a href=&quot;http://en.wikipedia.org/wiki/PageRank&quot; target=&quot;_blank&quot;&gt;PageRank&lt;/a&gt; scores. However, a few stood out from the crowd. We will name two: a) &lt;a href=&quot;http://www.hidemyass.com/&quot; target=&quot;_blank&quot;&gt;HideMyAss&lt;/a&gt; (or HMA for short) and b) &lt;a href=&quot;https://proxify.com/&quot; target=&quot;_blank&quot;&gt;Proxify&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Proxy_server&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfyCaVY2cNZhMzBfa4Ac29BSubYNU8vVDX0tiOfJJvMIkckkdxFI6YHcbXKk0XFbPhKIheuU60xH2H-arfbla2kHraKPiBhWksjK-ihXuPkGfJEEEpolcjGkXvbZ_Wm5NyilJjVQBgRCi_/s1600/280px-Open_proxy_h2g2bob.svg.png&quot; height=&quot;120&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; HMA provides (among others) a large real-time database of &lt;a href=&quot;https://hidemyass.com/proxy-list/&quot; target=&quot;_blank&quot;&gt;free working public proxies&lt;/a&gt;. These proxies are open to everyone and vary in speed and anonymity level. Nevertheless, free shared proxies have certain disadvantages, mostly in terms of security and privacy; they are third-party proxies and HMA cannot vouch for their reliability. On the other hand, HMA offers a powerful &lt;a href=&quot;http://hidemyass.com/vpn/&quot; target=&quot;_blank&quot;&gt;Pro VPN &lt;/a&gt;service which encrypts your entire internet activity and, unlike a web proxy, automatically works with &lt;u&gt;all&lt;/u&gt; applications on your computer (whereas web proxies typically work with web browsers like Firefox or Chrome and utilities like &lt;a href=&quot;http://curl.haxx.se/&quot; target=&quot;_blank&quot;&gt;cURL&lt;/a&gt; or &lt;a href=&quot;http://www.gnu.org/software/wget/&quot; target=&quot;_blank&quot;&gt;GNU Wget&lt;/a&gt;). However, the company&#39;s policy and Pro VPN&#39;s terms of use are not robot-friendly, so using Pro VPN for web scraping may trigger an abuse warning and result in account suspension.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://hidemyass.com/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwrqZg_7-FUH6haodlY5aomQYitydqsmk42Jr1gHlHA82cY4bn1Y9FnjIgT7vdiNpP1yiWR9RrO7g1mqWuxLXISYWCjlh2FQRzO2eWAFO7fAleUvLtsKB2vU2Jdfm7iNmFpwbJVJzHYkJE/s1600/HMA.png&quot; height=&quot;148&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The second high-quality proxy service we found was &lt;a href=&quot;https://proxify.com/&quot; target=&quot;_blank&quot;&gt;Proxify&lt;/a&gt;. They offer three packages: Basic, Pro and &lt;a href=&quot;https://proxify.com/switchproxy.shtml?&quot; target=&quot;_blank&quot;&gt;SwitchProxy&lt;/a&gt;. The latter is very fast and intended for web crawling and automated systems of any scale. Since we are mostly interested in web scraping, SwitchProxy is the tool that suits us best. It provides a rich set of features and gives access to 1296 &quot;satellites&quot; in 279 cities in 74 countries worldwide. They also offer an automatic IP change mechanism that runs either after each request (assigning a random IP address each time) or once every 10 minutes (scheduled rotation). Therefore, it seems a great option for scraping purposes, maybe the best out there. However, it&#39;s quite expensive, with plans starting at $100 per month. Additionally, Proxify provides some nice &lt;a href=&quot;https://proxify.com/switchproxy_guide.shtml&quot; target=&quot;_blank&quot;&gt;code examples&lt;/a&gt; showing how to integrate SwitchProxy into a program or web robot. As far as &lt;a href=&quot;http://search.cpan.org/~ether/WWW-Mechanize/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt; and &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt; are concerned (these two are our favorite web browsing tools), it is straightforward to combine them with SwitchProxy.&lt;/div&gt;
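In the same spirit, routing a plain Python HTTP client through a proxy takes only a few lines of standard library code; the proxy address below is a placeholder, not a real SwitchProxy endpoint:

```python
import urllib.request

def make_proxy_opener(proxy_url):
    """Build an opener whose HTTP/HTTPS requests go through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# Placeholder address - substitute the host/port your proxy provider gives you.
opener = make_proxy_opener("http://203.0.113.7:8080")
# opener.open("http://example.com/") would now be routed via the proxy.
```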
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;a href=&quot;https://proxify.com/switchproxy.shtml&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdBKE4cnGjoS0XsE69eUtF8pFmJaZcYjguzmBwW2U0W_Mj3LItSmjfX4nHKeiYBAYEZWJZvBAwwCZMHrtkeC1YaLy8zdEZxVK1V7Od4HxshzzS4J86sdilTmfvrrhLNB9LESxNdKOJOfs_/s1600/switchproxy.png&quot; height=&quot;65&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Finally, we would like to highlight once again the access restrictions and terms of use that many websites impose. Before launching a scraper, make sure you check the site&#39;s &lt;a href=&quot;http://www.robotstxt.org/&quot; target=&quot;_blank&quot;&gt;robots.txt&lt;/a&gt; file as well as its copyright notice. For further information we wrote a &lt;a href=&quot;http://deixto.blogspot.gr/search/label/Robots.txt&quot; target=&quot;_blank&quot;&gt;relevant post&lt;/a&gt; some time ago; perhaps you would like to check it out too.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2014/01/about-web-proxies.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfyCaVY2cNZhMzBfa4Ac29BSubYNU8vVDX0tiOfJJvMIkckkdxFI6YHcbXKk0XFbPhKIheuU60xH2H-arfbla2kHraKPiBhWksjK-ihXuPkGfJEEEpolcjGkXvbZ_Wm5NyilJjVQBgRCi_/s72-c/280px-Open_proxy_h2g2bob.svg.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-928137876577213385</guid><pubDate>Thu, 02 Jan 2014 13:15:00 +0000</pubDate><atom:updated>2014-01-02T20:31:07.494+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">D3.js</category><category domain="http://www.blogger.com/atom/ns#">Data visualization</category><category domain="http://www.blogger.com/atom/ns#">e-procurement</category><category domain="http://www.blogger.com/atom/ns#">ΥπερΔιαύγεια</category><title>Visualizing e-procurement tenders with a bubble chart</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A few weeks ago we started gathering data from the Greek &lt;a href=&quot;http://www.eprocurement.gov.gr/&quot; target=&quot;_blank&quot;&gt;e-procurement platform&lt;/a&gt; through &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; aiming to build an &lt;a href=&quot;http://deixto.gr/eprocurement/eprocurement.rss&quot; target=&quot;_blank&quot;&gt;RSS feed&lt;/a&gt; with the latest tender notices and in order to provide a method to automatically retrieve fresh data from the &lt;a href=&quot;http://www.eprocurement.gov.gr/agora/unprotected/searchNotice.htm&quot; target=&quot;_blank&quot;&gt;Central Electronic Registry&lt;/a&gt; of Public Contracts (CERPC or “Κεντρικό Ηλεκτρονικό Μητρώο Δημοσίων Συμβάσεων” in Greek). &amp;nbsp;For further information you can read &lt;a href=&quot;http://deixto.com/deixto-blog/rss_feed_for_the_greek_e-procurement/&quot; target=&quot;_blank&quot;&gt;this post&lt;/a&gt;. Only a few days later, we were happy to find out that the first power user consuming the feed popped up: &lt;a href=&quot;http://yperdiavgeia.gr/#eprocurementtab&quot; target=&quot;_blank&quot;&gt;yperdiavgeia.gr&lt;/a&gt;, a popular search engine indexing all public documents uploaded to the &lt;a href=&quot;http://diavgeia.gov.gr/en&quot; target=&quot;_blank&quot;&gt;Clarity&lt;/a&gt; website.&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://yperdiavgeia.gr/#eprocurementtab&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;283&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFzr5y39R9OfkH_FCBTVKvxHxLf6a9sgYn6rrpnzaFAoyF8-T5kO028oHgK6xxsgvF9yHNNYMxgFkdl2FDE2bV6b39I6VfuTBMqZ1KTGalnf4CfsljlviOL4ABz6RPU1Ec_TfhckHadvJa/s400/ultraclarity-e-procurement.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So now that we have a good deal of data at hand and systematically ingest public procurement info every single day, we are trying to think of innovative ways to use it creatively. They say a picture is worth a thousand words, so one of the first ideas that occurred to us (inspired by &lt;a href=&quot;http://greekspending.com/&quot; target=&quot;_blank&quot;&gt;greekspending.com&lt;/a&gt;) was to visualize the feed with some beautiful graphics. After a little experimentation with the great &lt;a href=&quot;https://github.com/mbostock/d3/wiki/Gallery&quot; target=&quot;_blank&quot;&gt;D3.js library&lt;/a&gt; and some puttering around with the &lt;a href=&quot;http://search.cpan.org/~makamaka/JSON/lib/JSON.pm&quot; target=&quot;_blank&quot;&gt;JSON Perl module&lt;/a&gt;, we managed to come up with a handy bubble chart which you may check out here: &lt;a href=&quot;http://deixto.gr/eprocurement/visualize&quot; target=&quot;_blank&quot;&gt;http://deixto.gr/eprocurement/visualize&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://deixto.gr/eprocurement/visualize/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;539&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhm-p8PtMXJkRC_X9iJo9AkYXONXOfBXlNEssI6woUyWKoy3wacT7YS9knCPAilfQ9m4y2yFEBZF4eDbbcUD-q_hHqA5vSCU2HaGLhfISA_3EDjE7WGWhgPC4cUErRjHRtoIvHl53axgp6i/s640/Screen+Shot+2014-01-02+at+1.50.42+PM.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&amp;nbsp; &amp;nbsp; Let&#39;s note a couple of things in order to better comprehend the chart.&lt;br /&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;the bigger the budget, the bigger the bubble&lt;/li&gt;
&lt;li&gt;if you click on a bubble then you will be redirected to the full text PDF document&lt;/li&gt;
&lt;li&gt;on mouseover a tooltip appears with some basic data fields&lt;/li&gt;
&lt;/ul&gt;
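The budget-to-bubble-size mapping above rests on feeding D3 a JSON hierarchy whose leaf values drive the bubble radii. A rough Python sketch of that transformation (the record fields and URLs here are illustrative, not the actual CERPC feed schema):

```python
import json

# Illustrative tender records - not the real CERPC feed schema.
tenders = [
    {"title": "Road maintenance", "budget": 250000, "url": "http://example.org/1"},
    {"title": "IT equipment",     "budget": 80000,  "url": "http://example.org/2"},
]

def to_bubble_json(records):
    """Shape records the way D3's pack/bubble layouts expect: a root node
    whose children carry a numeric 'value' (here, the tender budget)."""
    children = [
        {"name": r["title"], "value": r["budget"], "url": r["url"]}
        for r in records
    ]
    return json.dumps({"name": "tenders", "children": children})

print(to_bubble_json(tenders))
```

Keeping the PDF URL on each child node is what makes the click-through to the full-text document possible.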
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; The good news is that this chart will be produced automatically on a daily basis along with the RSS feed.&amp;nbsp;&amp;nbsp;So, one could easily browse through the tenders published on CERPC over the last few days and locate the high-budget ones. Finally, as &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_data&quot; target=&quot;_blank&quot;&gt;open data&lt;/a&gt; supporters we are very glad to see transparency initiatives like Clarity or CERPC and we warmly encourage people and organisations to take advantage of open public data and use it for a good purpose. Any suggestions or comments about further use of the e-procurement data would be very welcome!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2014/01/visualizing-e-procurement.gov.gr-tenders-with-a-bubble-chart.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFzr5y39R9OfkH_FCBTVKvxHxLf6a9sgYn6rrpnzaFAoyF8-T5kO028oHgK6xxsgvF9yHNNYMxgFkdl2FDE2bV6b39I6VfuTBMqZ1KTGalnf4CfsljlviOL4ABz6RPU1Ec_TfhckHadvJa/s72-c/ultraclarity-e-procurement.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-4769496151258196046</guid><pubDate>Tue, 10 Dec 2013 07:47:00 +0000</pubDate><atom:updated>2013-12-10T12:48:47.520+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Agents</category><category domain="http://www.blogger.com/atom/ns#">Heritrix</category><category domain="http://www.blogger.com/atom/ns#">Internet Archive</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><category domain="http://www.blogger.com/atom/ns#">Web archiving</category><title>Web archiving and Heritrix</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A topic that has gained increasing attention lately is web archiving. In an &lt;a href=&quot;http://deixto.blogspot.gr/2013/02/digital-preservation-and-archiveready.html&quot; target=&quot;_blank&quot;&gt;older post&lt;/a&gt; we started talking about it and we cited a remarkable online tool named &lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;ArchiveReady&lt;/a&gt; that checks whether a web page is easily archivable. Perhaps the most well-known web archiving project is currently the &lt;a href=&quot;https://archive.org/&quot; target=&quot;_blank&quot;&gt;Internet Archive&lt;/a&gt; which is a non-profit organization aiming to build a permanently and freely accessible Internet library. Their &lt;a href=&quot;https://archive.org/web/&quot; target=&quot;_blank&quot;&gt;Wayback Machine&lt;/a&gt;, a digital archive of the World Wide Web, is really interesting. It enables users to &quot;travel&quot; across time and visit archived versions of web pages.&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://archive.org/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinKZ4PJK3VGO6yFpQrXgpPuDaHDEuKsvvvPLQFQM1KpNkU-gs8ehGI7UVzIwH-z49oRXnEkbjAxje2ori0OICWE0atgl1Bd2NSxnKqfQProy1dLkCPvYh9LfVfmUWA4ZJ6lnZC8SlKCjsJ/s1600/Internet_Archive_logo.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; As web scraping aficionados we are mostly interested in their crawling toolset. The web crawler used by the Internet Archive is &lt;a href=&quot;https://webarchive.jira.com/wiki/display/Heritrix/Heritrix&quot; target=&quot;_blank&quot;&gt;Heritrix&lt;/a&gt;, a free, powerful Java crawler released under the Apache License. The latest version is 3.1.1 and it was made available back in May 2012. Heritrix creates copies of websites and generates &lt;a href=&quot;http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf&quot; target=&quot;_blank&quot;&gt;WARC&lt;/a&gt; (Web ARChive) files. The WARC format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Heritrix offers a basic web-based user interface (admin console) to manage your crawls, as well as a command line tool that can optionally be used to initiate archiving jobs. We played with it a bit and found it handy for quite a few cases, but overall it left us feeling it was somewhat dated.&lt;/div&gt;
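To make the "headers plus data block" description of WARC concrete, here is a deliberately simplified sketch of a single record; real archives should be produced with Heritrix or a dedicated WARC library rather than hand-rolled, since actual records carry more mandatory fields (record IDs, dates, digests):

```python
def make_warc_record(target_uri, payload):
    """Build a simplified WARC-style record: a version line, a few named
    header fields, a blank line, then the data block. (Illustrative only -
    real WARC records require additional mandatory fields.)"""
    body = payload.encode("utf-8")
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(body)}",
    ])
    # Two CRLFs terminate the record so records can be concatenated into one file.
    return headers.encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"

record = make_warc_record("http://example.com/", "<html>...</html>")
print(record.decode("utf-8").splitlines()[0])  # → WARC/1.0
```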
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; height=&quot;276&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1tqf23E9_R2P6OB3FrZ3Dm83ywt_cUsDAKj5OtQ-2rMzy7OvT8_Weac_DsWC76y9LLA5SiE2HBFZBphnwWSGEQ07WNDqedikCqt1wN_uO1wCY47EqB-sJfKLeFWMwntWeqqZ8CKnDIHkG/s320/Screen+Shot+2013-12-08+at+4.57.40+PM.png&quot; width=&quot;320&quot; /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; In our humble opinion (and someone please correct us if we are wrong), Heritrix has two main drawbacks: a) no support for distributed crawling and b) no support for JavaScript/AJAX. The first means that if you wanted to scan a really big source of data, for example the great &lt;a href=&quot;http://dp.la/&quot; target=&quot;_blank&quot;&gt;Digital Public Library of America&lt;/a&gt; (DPLA) with more than 5 million items/ pages, Heritrix would take a long time since it runs locally on a single machine. Even if multiple Heritrix crawlers were combined and a subset of the target URL space assigned to each of them, it still wouldn&#39;t be an optimal solution. From our point of view it would be much better and faster if several cooperating agents on multiple servers could collaborate to complete the task. Scaling and time issues therefore arise when the number of pages grows very large.&lt;/div&gt;
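The cooperating-agents idea can be sketched in miniature with Python's concurrent.futures: partition the URL space and let workers process it in parallel. The fetch function below is a trivial stand-in for a real crawler node (a production version would download and archive each page, and the workers could be remote machines rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a real fetch-and-archive step on one crawler node."""
    return (url, "archived")

# Hypothetical URL space to be divided among the workers.
urls = [f"http://example.org/item/{i}" for i in range(8)]

# Crawl the URL space in parallel; map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fetch, urls))

print(len(results))  # → 8
```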
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://dp.la/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpU1gUES6YTPebqf-uAJt-ZaJGEL45EswZp4-Qgyko0RAgEJiJQrZRZDjdCGocfjKNhJCcgPxtWcSxVP-fFIzVq_gWotGdqOepGsIKqzNMUFnsmm-vVXgg1eYCgtcsuLYrf952ZDt-nT76/s1600/dpla-logo.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; The second disadvantage, on the other hand, is related to the trend of modern websites towards heavy use of JavaScript and AJAX calls. Heritrix provides only basic browser functionality and does not include a fully-fledged web browser. Therefore, it is not able to efficiently archive pages that use JavaScript/AJAX to populate parts of the page, and consequently it cannot properly capture social media content.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; We think that both of these issues could be overcome using a cloud, &lt;a href=&quot;http://www.seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;-based architecture like &lt;a href=&quot;https://saucelabs.com/&quot; target=&quot;_blank&quot;&gt;Sauce Labs&lt;/a&gt; (although the cost of an Enterprise plan should be taken into account). This choice would allow you a) to run your crawls in the cloud in parallel and b) to use a real web browser with full JavaScript support, such as Firefox, Chrome or Safari. We have already covered Selenium in &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;previous posts&lt;/a&gt; and it is a great browser automation tool. In conclusion, we recommend Selenium and a different, cloud-based approach for implementing large-scale web archiving projects. Heritrix is quite good and has proved a valuable ally, but we think that other, state-of-the-art technologies are nowadays more suitable for the job, especially with the latest &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_2.0&quot; target=&quot;_blank&quot;&gt;Web 2.0&lt;/a&gt; developments. What&#39;s your opinion?&amp;nbsp;&lt;/div&gt;
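The "run crawls in parallel" half of this idea can be sketched in a few lines. In the sketch below, `fetch_rendered` is a hypothetical stand-in for a real Selenium session (e.g. a remote WebDriver pointed at a cloud grid) that would return the fully rendered page source:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_rendered(url: str) -> str:
    # Stub for illustration. In a real setup this would drive a browser,
    # e.g. via selenium.webdriver.Remote, and return driver.page_source
    # after the page's JavaScript/AJAX has finished executing.
    return f"<html>rendered {url}</html>"

def archive_parallel(urls, workers=4):
    # Fetch several pages concurrently, as a cloud grid of browsers would.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch_rendered, urls)))

pages = archive_parallel(["http://example.org/%d" % i for i in range(10)])
```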
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/12/web-archiving-and-heritrix.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinKZ4PJK3VGO6yFpQrXgpPuDaHDEuKsvvvPLQFQM1KpNkU-gs8ehGI7UVzIwH-z49oRXnEkbjAxje2ori0OICWE0atgl1Bd2NSxnKqfQProy1dLkCPvYh9LfVfmUWA4ZJ6lnZC8SlKCjsJ/s72-c/Internet_Archive_logo.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6285472663660242478</guid><pubDate>Mon, 16 Sep 2013 06:57:00 +0000</pubDate><atom:updated>2013-09-19T11:47:05.710+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">BCI</category><category domain="http://www.blogger.com/atom/ns#">Conference</category><category domain="http://www.blogger.com/atom/ns#">Informatics</category><category domain="http://www.blogger.com/atom/ns#">Publications</category><title>DEiXTo at BCI 2013</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
We are pleased to inform you that our short paper titled “DEiXTo: A web data extraction suite” has been accepted for presentation at the 6th Balkan Conference in Informatics (&lt;a href=&quot;http://bci2013.bci-conferences.org/&quot; target=&quot;_blank&quot;&gt;BCI 2013&lt;/a&gt;), to be held in &lt;a href=&quot;http://en.wikipedia.org/wiki/Thessaloniki&quot; target=&quot;_blank&quot;&gt;Thessaloniki&lt;/a&gt; on September 19-21, 2013. The main goal of the BCI series of conferences is to provide a forum for the discussion and dissemination of research accomplishments and to promote interaction and collaboration among scientists from the Balkan countries.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://bci2013.bci-conferences.org/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;42&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0CSSP_VsxvljEer3CnNtUw7fN-j1OI2JR2CvgTQxjUgqO_nhv5MSln9nr5IhXGVNOOHq8ql2LbgRLzh4rZfiWbesN96tNk5gAgb5iplu16tzjZFGvAk00oGfAIBU1qy_yFu5s7gVqVI0u/s400/bci2013.png&quot; width=&quot;400&quot;&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So, if you would like to cite &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; in your thesis, project or scientific work, please use the following reference:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
F. Kokkoras, K. Ntonas, N. Bassiliades. “DEiXTo: A web data extraction suite”, In proc. of the 6th Balkan Conference in Informatics (BCI-2013), September 19-21, 2013, Thessaloniki, Greece&lt;/div&gt;
&lt;div&gt;
&lt;br&gt;&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/09/deixto-at-bci-2013.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0CSSP_VsxvljEer3CnNtUw7fN-j1OI2JR2CvgTQxjUgqO_nhv5MSln9nr5IhXGVNOOHq8ql2LbgRLzh4rZfiWbesN96tNk5gAgb5iplu16tzjZFGvAk00oGfAIBU1qy_yFu5s7gVqVI0u/s72-c/bci2013.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2139928746548570220</guid><pubDate>Mon, 09 Sep 2013 07:20:00 +0000</pubDate><atom:updated>2014-01-13T09:54:45.472+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category domain="http://www.blogger.com/atom/ns#">XPath</category><title>Using XPath for web scraping</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Over the last few years we have worked quite a bit on aggregators that periodically gather information from multiple online sources. We usually write our own custom code and mostly use DOM-based extraction patterns (built with our home-made DEiXTo GUI tool), but we also use other technologies and useful tools, where possible, in order to get the job done and make our scraping tasks easier. One of them is &lt;a href=&quot;http://www.w3schools.com/xpath/&quot; target=&quot;_blank&quot;&gt;XPath&lt;/a&gt;, a query language defined by the W3C for selecting nodes from an XML document. Note that an HTML page (even a malformed one) can be represented as a DOM tree, and thus as an XML document. XPath is quite effective, especially for relatively simple scraping cases.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Suppose, for instance, that we would like to retrieve the content of an article/post/story on a specific website or blog. Of course, this scenario could be extended to several posts from many different sources and scale up. Typically, the body of a post resides in a DIV (or a certain type of) HTML element with a particular attribute value (the same holds for the post title). Therefore, the text content of a post is usually included in something like the following HTML segment (especially if you consider that numerous blogs and websites live on platforms like Blogger or WordPress.com and share a similar layout):&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;lt;div class=&quot;&lt;span style=&quot;color: red;&quot;&gt;post-body entry-content&lt;/span&gt;&quot; ...&amp;gt;&lt;/span&gt;&lt;span style=&quot;color: blue; font-family: Courier New, Courier, monospace;&quot;&gt;the content we want&lt;/span&gt;&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; DEiXTo tree rules are more suitable and efficient when there are multiple structure-rich record occurrences on a page. So, if you need just a specific div element, it&#39;s better to stick with an XPath expression. It&#39;s pretty simple and it works. You could then post-process the scraped data and further utilise it with other techniques, e.g. using regular expressions on the inner text to identify dates, places or other pieces of interest, or parsing the outer HTML code of the selected element with a specialised tool looking for interesting stuff. So, instead of creating a rule with DEiXTo for the case described above, we could just use the XPath selector //div[@class=&quot;post-body entry-content&quot;] to select the proper element and access its contents.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
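To give a quick taste of such a selector in action, here is a self-contained sketch using only Python's standard library (which supports a limited XPath subset; for real, possibly malformed pages you would more likely reach for lxml.html or a similar HTML parser):

```python
import xml.etree.ElementTree as ET

# A toy, well-formed page fragment mimicking the segment shown above.
html = """
<html><body>
  <div class="header">menu</div>
  <div class="post-body entry-content">the content we want</div>
</body></html>
"""

root = ET.fromstring(html.strip())
# '//' (here './/') searches anywhere in the tree; the predicate
# matches the element whose class attribute has the given value.
post = root.find(".//div[@class='post-body entry-content']")
print(post.text)  # the content we want
```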
&lt;div&gt;
&lt;span style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; We have actually used this simple but effective technique repeatedly for &lt;a href=&quot;http://deixto.blogspot.gr/2011/11/myvisitplanner.html&quot; target=&quot;_blank&quot;&gt;myVisitPlanner&lt;/a&gt;, &lt;/span&gt;a project funded by the Greek Ministry of Education that aims at creating a personalised system for planning cultural itineraries. The main content of event pages (related to music, theatre, festivals, exhibitions, etc.) is systematically extracted from a wide variety of local websites (most lacking RSS feeds and APIs) in order to automatically monitor and aggregate event information.&amp;nbsp;&lt;span style=&quot;text-align: justify;&quot;&gt;We could show you some code to demonstrate how to scrape a site with XPath, but instead we would like to cite an amazing blog dedicated to web scraping which gives a nice code example of using XPath in screen scraping: &lt;/span&gt;&lt;a href=&quot;http://scraping.pro/python-lxml-scrape-online-dictionary/&quot; style=&quot;text-align: justify;&quot; target=&quot;_blank&quot;&gt;extract-web-data.com&lt;/a&gt;&lt;span style=&quot;text-align: justify;&quot;&gt;. It provides a lot of information about web data extraction techniques and covers plenty of relevant tools. It&#39;s a nice, thorough and well-written read.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://scraping.pro/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;32&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8uPGHsOmJ1vztAB8ghw7SHOWwtKDJRs2cb4gAGcrkwus2PB6H3DIcfoX3QSvcxwZ0hYPrf4r0EGydRYD5FLhkhnXlnF40Gkv5pLx05pX_xMS23Wr0MUi5l_Z1Jk7T2vNDETEmAeaQiTv9/s320/extract-web-data.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Anyway, if you need some web data, whether for personal use or because your boss asked for it, why not consider using &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; or one of the other remarkable software tools out there? The &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;use case scenarios&lt;/a&gt; are limitless and we are sure you could come up with a useful and interesting one.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/09/using-xpath-for-web-scraping.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8uPGHsOmJ1vztAB8ghw7SHOWwtKDJRs2cb4gAGcrkwus2PB6H3DIcfoX3QSvcxwZ0hYPrf4r0EGydRYD5FLhkhnXlnF40Gkv5pLx05pX_xMS23Wr0MUi5l_Z1Jk7T2vNDETEmAeaQiTv9/s72-c/extract-web-data.png" height="72" width="72"/><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-4695284449217500922</guid><pubDate>Sat, 04 May 2013 10:31:00 +0000</pubDate><atom:updated>2013-05-06T13:28:40.724+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">FLOSS</category><category domain="http://www.blogger.com/atom/ns#">JSON</category><category domain="http://www.blogger.com/atom/ns#">Open Source</category><category domain="http://www.blogger.com/atom/ns#">Podcasts</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>Creating a complete list of FLOSS Weekly podcast episodes</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
It was not until recently that I discovered and started subscribing to &lt;a href=&quot;http://en.wikipedia.org/wiki/Podcast&quot; target=&quot;_blank&quot;&gt;podcasts&lt;/a&gt;. I wish I had done so earlier, but a lack of available time (mostly) kept me away from them, although we should always try to find time to learn and explore new things and technologies. So, I was very excited when I ran across &lt;a href=&quot;http://twit.tv/show/floss-weekly&quot; target=&quot;_blank&quot;&gt;FLOSS Weekly&lt;/a&gt;, a popular Free/Libre Open Source Software (&lt;a href=&quot;http://en.wikipedia.org/wiki/Free_and_open-source_software&quot; target=&quot;_blank&quot;&gt;FLOSS&lt;/a&gt;) themed podcast from the &lt;a href=&quot;http://twit.tv/&quot; target=&quot;_blank&quot;&gt;TWiT&lt;/a&gt; Network. Currently, the lead host is &lt;a href=&quot;http://en.wikipedia.org/wiki/Randal_Schwartz&quot; target=&quot;_blank&quot;&gt;Randal Schwartz&lt;/a&gt;, a renowned Perl hacker and programming consultant. As a Perl developer myself, it goes without saying that I greatly admire and respect him. FLOSS Weekly debuted back in April 2006 and, as of the 4th of May 2013, it features 250 episodes! That&#39;s a lot of episodes and lots of great stuff to explore.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://twit.tv/show/floss-weekly&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;110&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEeJRN4Ro2JIU2TNUigqaKrNgNa7kRoP3ZMlF0RSsEZQYJs9ezNC0H2Pu2OhLmOv6jJXYetP8MocWDZj21dtXvJAnd2LWPNm7x8ZaLmAFtWCYFFOSm8k_CeGXudgvtXUrkWOAA-z-qBOjo/s1600/podcast_5_3.jpg&quot; width=&quot;110&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Inevitably, if you don&#39;t have the time to listen to them all and have to choose only some of them, you would need to browse through all the listing pages (each containing 7 episodes) in order to find those that interest you most. As I write this post, one would have to visit 36 pages (by repeatedly clicking on the NEXT page link) to get a complete picture of all the subjects discussed. Consequently, it&#39;s not that easy to quickly locate the ones you find most interesting and compile a To-Listen (or To-Watch, if you prefer video) list. I am not 100% sure there is no such thing available on the twit.tv website, but I was not able to find a full episode list in a single place. Therefore, I thought that a spreadsheet (or, even better, a &lt;a href=&quot;http://www.json.org/&quot; target=&quot;_blank&quot;&gt;JSON&lt;/a&gt; document) containing the basic info for each episode (title, date, link and description) would come in handy.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://twit.tv/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;125&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHxYmIY0BMTQrRNLlU_iyENtxfEL6kNqvSCS8bHPPHGqGy8h4anwmDhxTMaAvv7y0Bfs8BTJJzPl6w7itdhuyWW-u_umtmT1W5rftOTEqwp4coCy7jfKek2wznkniHqlJs7JpijetBC7cZ/s200/TWiT.png&quot; width=&quot;110&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Hence, I utilised my beloved home-made scraping tool, &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, to extract the episode metadata so that one can have a convenient, compact view of all available topics and decide more easily which ones to pick. It was really simple to build a wrapper for this task and in a few minutes I had the data at hand (in a tab-delimited text file). It was then straightforward to import it into an Excel spreadsheet (you can &lt;a href=&quot;http://deixto.com/wp-content/uploads/floss_weekly_all_episodes_2013-05-04.xls&quot; target=&quot;_blank&quot;&gt;download it here&lt;/a&gt;). Moreover, with a few lines of Perl code the scraped data was transformed into a &lt;a href=&quot;http://deixto.com/wp-content/uploads/floss_weekly_all_episodes_2013-05-04.json&quot; target=&quot;_blank&quot;&gt;JSON file&lt;/a&gt; (with all the advantages this brings) suitable for further use.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Check &lt;a href=&quot;http://twit.tv/show/floss-weekly&quot; target=&quot;_blank&quot;&gt;FLOSS Weekly&lt;/a&gt; out! You might find several great episodes that could enlighten you and bring to your attention amazing tools and technologies. As a free software supporter, I highly recommend it (despite discovering it a few years late; hopefully it&#39;s never too late).&lt;/div&gt;
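The tab-delimited-to-JSON step mentioned above was done in Perl; an equivalent few lines in Python might look like the following (the field names match the title/date/link/description metadata described in the post, and the sample row is made up for illustration):

```python
import csv
import io
import json

# A tiny stand-in for the tab-delimited file produced by the wrapper.
tsv = (
    "title\tdate\tlink\tdescription\n"
    "Episode 250\t2013-05-01\thttp://twit.tv/floss250\tGreat stuff\n"
)

# DictReader turns each tab-delimited line into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
doc = json.dumps(rows, indent=2)
```

In practice you would read from the output file instead of an in-memory string and write `doc` to disk.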
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/05/list-of-floss-weekly-podcast-episodes.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEeJRN4Ro2JIU2TNUigqaKrNgNa7kRoP3ZMlF0RSsEZQYJs9ezNC0H2Pu2OhLmOv6jJXYetP8MocWDZj21dtXvJAnd2LWPNm7x8ZaLmAFtWCYFFOSm8k_CeGXudgvtXUrkWOAA-z-qBOjo/s72-c/podcast_5_3.jpg" height="72" width="72"/><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-8735077798606596858</guid><pubDate>Thu, 02 May 2013 22:37:00 +0000</pubDate><atom:updated>2013-05-08T09:22:20.605+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Open data</category><category domain="http://www.blogger.com/atom/ns#">Parliament</category><category domain="http://www.blogger.com/atom/ns#">ScraperWiki</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>Scraping the members of the Greek Parliament</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;a href=&quot;http://en.wikipedia.org/wiki/Hellenic_Parliament&quot; target=&quot;_blank&quot;&gt;Hellenic Parliament&lt;/a&gt; is the supreme democratic institution that represents Greek citizens through an elected body of Members of Parliament (MPs). It is a legislature of 300 members, elected for a four-year term, that submits bills and amendments.&amp;nbsp;Its website, &lt;a href=&quot;http://www.hellenicparliament.gr/&quot; target=&quot;_blank&quot;&gt;www.hellenicparliament.gr&lt;/a&gt;, has a lot of interesting data on it that could potentially be useful for mere citizens, certain types of professionals like journalists and lawyers, the media as well as businesses.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.hellenicparliament.gr/en/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHc46uuWMFXqREdmrte4eSzD3UTapEjIsbmZiNK2i73o-R1ZQ64M_o3DDV4AYivljnT2Ch2lIQOaft4NcTbe-1PXznz8YsMFzmfSIuSj2guLniKqfsaYlN3kGQJWW5mrrrBD3-FHV88k6Q/s1600/logo_en.gif&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Inspired by existing scrapers for many Parliaments of the world, like these on &lt;a href=&quot;https://scraperwiki.com/tags/parliament&quot; target=&quot;_blank&quot;&gt;ScraperWiki&lt;/a&gt;, an amazing web-based scraping platform, we decided to write a simple, though efficient, DEiXToBot-based script that gathers information (such as the full name, constituency and contact details) from the &lt;a href=&quot;http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia/&quot; target=&quot;_blank&quot;&gt;CVs pages&lt;/a&gt; of Greek MPs and exports it (after some post-processing, e.g. deducing the party to which each MP belongs from the logo in the party column) to a tab-delimited text file that can then easily be imported into an &lt;a href=&quot;http://en.wikipedia.org/wiki/OpenDocument&quot; target=&quot;_blank&quot;&gt;ODF&lt;/a&gt; spreadsheet or a database. The script uses a tree &lt;a href=&quot;http://deixto.com/wp-content/uploads/parliament_CVs.xml&quot; target=&quot;_blank&quot;&gt;pattern&lt;/a&gt;, previously built with the GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; tool, to identify the data of interest and visits all 30 target pages (each containing ten records) by utilizing the &lt;i&gt;pageNo&lt;/i&gt; URL parameter. It should also be noted that we used &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;, our favorite browser automation tool, for this purpose.
The results of the script&#39;s execution can be found in this &lt;a href=&quot;http://deixto.com/wp-content/uploads/Parliament_Members.ods&quot; target=&quot;_blank&quot;&gt;.ods file&lt;/a&gt;.&amp;nbsp;&lt;span style=&quot;background-color: white; color: #222222;&quot;&gt;In case you would like to take a look at the Perl code that got the job done, you can &lt;a href=&quot;http://deixto.com/wp-content/uploads/parliament.pl&quot; target=&quot;_blank&quot;&gt;download it here&lt;/a&gt;.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Open_data&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;100&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie-uL6d72y3Rp2xx7_bIW0epnpqENTTrPwBPZ5rdJ0OFyZXMZUhnYQpXkvBivR0LBYQfQh7_q89XKf4XPVzoDzsZmTkeh6wBT-xW-RZAiEmcrdFP-X6TJRPbryu6o-G4DBaTBS-lc0Wjvd/s200/rdf_open_data.png&quot; width=&quot;91&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
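The pagination step described above (the actual script is in Perl; this is a hedged Python sketch of the same idea, with the URL pattern assumed from the post) simply enumerates the listing pages via the pageNo parameter:

```python
# Hypothetical sketch: build the 30 paginated listing URLs that the
# scraper visits, ten MP records per page, using the pageNo parameter.
BASE = "http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia/"

def listing_urls(pages: int = 30) -> list[str]:
    return [f"{BASE}?pageNo={n}" for n in range(1, pages + 1)]

urls = listing_urls()
```

Each URL would then be fetched (with Selenium, in our case) and the tree pattern applied to the resulting page.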
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_data&quot; target=&quot;_blank&quot;&gt;Open data&lt;/a&gt; — data that is free for use, reuse, and redistribution — is a goldmine that can stimulate innovative ways to discover knowledge and analyze the rich data sets available on the World Wide Web. Scraping is an invaluable tool that can help in this direction and serve transparency and openness. There is currently a wide variety of remarkable web data extraction tools (quite a few of them free). Perhaps you would like to give &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; a try and start building your own web robots to get the data you need and transform it into a suitable format for further use.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; In conclusion, scraping has numerous &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;uses and applications&lt;/a&gt;&amp;nbsp;and&amp;nbsp;there is a high chance you could come up with an interesting and creative use case scenario tailored to your requirements. So, if you need any help with DEiXTo or have any inquiries, please do not hesitate to &lt;a href=&quot;http://deixto.com/contact/&quot; target=&quot;_blank&quot;&gt;contact us&lt;/a&gt;!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/05/scraping-members-of-greek-parliament.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHc46uuWMFXqREdmrte4eSzD3UTapEjIsbmZiNK2i73o-R1ZQ64M_o3DDV4AYivljnT2Ch2lIQOaft4NcTbe-1PXznz8YsMFzmfSIuSj2guLniKqfsaYlN3kGQJWW5mrrrBD3-FHV88k6Q/s72-c/logo_en.gif" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-5626603475610369990</guid><pubDate>Tue, 16 Apr 2013 07:58:00 +0000</pubDate><atom:updated>2013-05-06T17:10:08.222+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data transformations</category><category domain="http://www.blogger.com/atom/ns#">Data visualization</category><category domain="http://www.blogger.com/atom/ns#">Google Charts</category><category domain="http://www.blogger.com/atom/ns#">Διαύγεια</category><category domain="http://www.blogger.com/atom/ns#">ΥπερΔιαύγεια</category><title>Visualizing Clarity document categories in a pie chart</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &quot;&lt;a href=&quot;http://diavgeia.gov.gr/en&quot; target=&quot;_blank&quot;&gt;Cl@rity&lt;/a&gt;&quot; program of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Greece&quot; target=&quot;_blank&quot;&gt;Hellenic Republic&lt;/a&gt; offers a wealth of data about the decisions and expenditure of all Greek ministries and their organizations. It has been operating for more than two years now and is a great source of public data waiting for all of us to explore. However, it has faced a lot of technical problems over the last year because of the large number of documents uploaded daily and the heavy data management cost. Unfortunately, its frontend and &lt;a href=&quot;http://et.diavgeia.gov.gr/f/ihu/search/index.php&quot; target=&quot;_blank&quot;&gt;search functionality&lt;/a&gt; are not working most of the time. Thankfully, a private initiative, &lt;a href=&quot;http://yperdiavgeia.gr/&quot; target=&quot;_blank&quot;&gt;UltraCl@rity&lt;/a&gt;, has emerged in the meantime to offer a great alternative for searching the digitally signed public PDF documents and their metadata, filling the gap left by the Greek government.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://yperdiavgeia.gr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;47&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7fO3357a4D6In_NeaYuUcElrFoNEev0IsoB1ugqY53eEJXOstGFBroFppVkdj6AcOsj8F76E_rgF9lleFywQ5p5Eba4qySEfxeYsMryR58gTKn4l-YxjcAq9eCXfyDJDxL46fjiZiFaZm/s320/ultraclarity-logo380x75v2.png&quot; width=&quot;240&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; As you probably already know, we focus on &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;web scraping&lt;/a&gt; and the utilization of the extracted information. One of the best ways to exploit data you might have gathered with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (or another web data extraction tool) is to present it in a comprehensive chart. Hence, we thought it might be interesting to collect the subject categories of the documents published on Cl@rity by a big educational institution like the &lt;a href=&quot;http://www.aueb.gr/index_en.php&quot; target=&quot;_blank&quot;&gt;Athens University of Economics and Business&lt;/a&gt; (AUEB) and create a handy &lt;a href=&quot;http://en.wikipedia.org/wiki/Pie_chart&quot; target=&quot;_blank&quot;&gt;pie chart&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The page &lt;a href=&quot;http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes&quot; target=&quot;_blank&quot;&gt;http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes&lt;/a&gt; provides a convenient categorization of AUEB&#39;s decisions. Therefore, with &lt;a href=&quot;http://deixto.com/wp-content/uploads/aueb_diavgeia_categories.wpf&quot; target=&quot;_blank&quot;&gt;a simple pattern&lt;/a&gt; (extraction rule), created with GUI DEiXTo, we captured the categories and their document counts. It was then quite easy and straightforward to programmatically transform the output data (as of 16th of April 2013) into an interactive &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/gallery/piechart&quot; target=&quot;_blank&quot;&gt;Google pie chart&lt;/a&gt; of the most popular categories using the amazing &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/index&quot; target=&quot;_blank&quot;&gt;Google Chart Tools&lt;/a&gt;. So, here it is:&lt;/div&gt;
&lt;script src=&quot;https://www.google.com/jsapi&quot; type=&quot;text/javascript&quot;&gt;&lt;/script&gt;
    &lt;script type=&quot;text/javascript&quot;&gt;
      google.load(&quot;visualization&quot;, &quot;1&quot;, {packages:[&quot;corechart&quot;]});
      google.setOnLoadCallback(drawChart);
      function drawChart() {
        var data = google.visualization.arrayToDataTable([
          [&#39;Category&#39;, &#39;Number of decisions&#39;],
          [&#39;Εκπαίδευση και Έρευνα&#39;, 20439],
          [&#39;Προϋπολογισμός&#39;, 2991],
          [&#39;Ταξίδια&#39;, 2366],
          [&#39;Υπηρεσίες&#39;, 2210],
          [&#39;Υποτροφίες&#39;, 1728],
          [&#39;Προμήθειες&#39;, 1471],
          [&#39;Τρόφιμα και ποτά&#39;, 1016],
          [&#39;Μηχανήματα&#39;, 582],
          [&#39;Τηλεπικοινωνιακά τέλη&#39;, 566],
          [&#39;Ταχυδρομικές Υπηρεσίες&#39;, 352],
          [&#39;Other&#39;, 1818]
        ]);
        var options = {
          &#39;width&#39;: 640,
          &#39;height&#39;: 290,
          &#39;chartArea&#39;: {top:5,left: 60,bottom:5,width:&quot;75%&quot;,height:&quot;90%&quot;},
          &#39;title&#39;: &#39;Clarity Document categorization for AUEB - 16-04-2013&#39;
        };
        var chart = new google.visualization.PieChart(document.getElementById(&#39;chart_div&#39;));
        chart.draw(data, options);
      }
    &lt;/script&gt;
&lt;br /&gt;
&lt;div id=&quot;chart_div&quot;&gt;
&lt;/div&gt;
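The step from scraped output to the chart's data rows takes only a few lines of code. Here is a minimal sketch in Python; the input format (tab-separated category/count lines) and the function name are our own assumptions for illustration:

```python
# Turn tab-separated scraper output ("category<TAB>count" lines) into the
# JavaScript rows expected by google.visualization.arrayToDataTable,
# keeping the most popular categories and folding the tail into "Other".
def to_datatable_rows(tsv_text, top_n=10):
    rows = []
    for line in tsv_text.strip().splitlines():
        category, count = line.split("\t")
        rows.append((category, int(count)))
    rows.sort(key=lambda r: r[1], reverse=True)   # most popular first
    top = rows[:top_n]
    other = sum(c for _, c in rows[top_n:])       # sum of the remaining categories
    if other:
        top.append(("Other", other))
    return ",\n".join("['%s', %d]" % r for r in top)

print(to_datatable_rows("A\t30\nB\t20\nC\t5\nD\t3", top_n=2))
```

The returned string can be pasted directly into the `arrayToDataTable` call of the chart script.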
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; By the way,&amp;nbsp;&lt;a href=&quot;http://publicspending.medialab.ntua.gr/en/home&quot; target=&quot;_blank&quot;&gt;publicspending.gr&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;http://greekspending.com/&quot; target=&quot;_blank&quot;&gt;greekspending.com&lt;/a&gt;&amp;nbsp;are truly remarkable research efforts aimed at visualizing public expenditure data from the Cl@rity project in user-friendly diagrams and charts. Of course, the DEiXTo-based scenario described above is just a simple scraping example. What we would like to point out is that this kind of data transformation can have innovative practical applications and power useful web-based services. In conclusion, Cl@rity (or &quot;&lt;a href=&quot;http://diavgeia.gov.gr/&quot; target=&quot;_blank&quot;&gt;Διαύγεια&lt;/a&gt;&quot;, as it is known in Greek) can be a goldmine: it can spark new innovations and allow citizens, and developers in particular, to dig into open data creatively, in favor of the transparency of public life.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/04/visualizing-clarity-document-categories-pie-chart.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7fO3357a4D6In_NeaYuUcElrFoNEev0IsoB1ugqY53eEJXOstGFBroFppVkdj6AcOsj8F76E_rgF9lleFywQ5p5Eba4qySEfxeYsMryR58gTKn4l-YxjcAq9eCXfyDJDxL46fjiZiFaZm/s72-c/ultraclarity-logo380x75v2.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6015457618295198632</guid><pubDate>Sat, 13 Apr 2013 20:05:00 +0000</pubDate><atom:updated>2013-04-16T11:02:46.238+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data transformations</category><category domain="http://www.blogger.com/atom/ns#">Data visualization</category><category domain="http://www.blogger.com/atom/ns#">Google Charts</category><category domain="http://www.blogger.com/atom/ns#">Price monitoring</category><title>Fuel price monitoring &amp; data visualization</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Recently, we stumbled upon a very useful public, web-based service, the &lt;a href=&quot;http://www.fuelprices.gr/&quot; target=&quot;_blank&quot;&gt;Greek Fuel Prices Observatory&lt;/a&gt;&amp;nbsp;(&quot;Παρατηρητήριο Τιμών Υγρών Καυσίμων&quot; in Greek). Its main objective is to allow consumers to find out the prices of liquid fuels per type and &lt;a href=&quot;http://www.fuelprices.gr/GetGeography&quot; target=&quot;_blank&quot;&gt;geographic region&lt;/a&gt;. With a wealth of fuel-related information at one&#39;s disposal, one could build innovative services (e.g. taking advantage of the &lt;a href=&quot;http://deixto.blogspot.gr/2012/01/geo-location-data-yahoo-placefinder.html&quot; target=&quot;_blank&quot;&gt;geo-location&lt;/a&gt; of gas stations), find interesting stats or create meaningful charts.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.fuelprices.gr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;53&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzSB0a2ExzdxVizvCzXFOLn0ADJuhWSWBGzh-IqHvYOKanAr28oB8caMPkgVzMyo9SWUJ6XuQj9ONMCd3npvA-zTpVOA5Z0t2eIktO-KgRBRbpvE89xgKR2XXih3VzJbicXryAtVRkeLBH/s200/sms.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;span id=&quot;goog_1120806205&quot;&gt;&lt;/span&gt;&lt;span id=&quot;goog_1120806206&quot;&gt;&lt;/span&gt;&lt;a href=&quot;http://www.blogger.com/&quot;&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; One of the site&#39;s most interesting pages is the one listing the min, max and mean prices over the last 3 months:&amp;nbsp;&lt;a href=&quot;http://www.fuelprices.gr/price_stats_ng.view?time=1&amp;amp;prodclass=1&quot; target=&quot;_blank&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;http://www.fuelprices.gr/price_stats_ng.view?time=1&amp;amp;prodclass=1&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
However, the hyperlink at the bottom right corner of the page (with the text &quot;Γραφήματος&quot;), which is supposed to display a comprehensive graph, returns an HTTP Status 500 exception message instead (at least as of 13th of April 2013). So, we could not resist scraping the data from the table with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; and, after some post-processing, presenting it nicely with a &lt;a href=&quot;https://google-developers.appspot.com/chart/interactive/docs/gallery/linechart&quot; target=&quot;_blank&quot;&gt;Google line chart&lt;/a&gt;. We used a &lt;a href=&quot;http://en.wikipedia.org/wiki/Regular_expression&quot; target=&quot;_blank&quot;&gt;regular expression&lt;/a&gt; to isolate the date, reversed the order of the records found (so that the list is sorted chronologically, oldest first), replaced the commas in prices with dots (as a decimal mark) and wrote a short script to produce the lines needed for the &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/datatables_dataviews#arraytodatatable&quot; target=&quot;_blank&quot;&gt;arrayToDataTable&lt;/a&gt; method call of the &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/reference&quot; target=&quot;_blank&quot;&gt;Google Visualization API&lt;/a&gt;. It was therefore straightforward to create the following:&lt;/div&gt;
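The post-processing steps described above can be sketched in a few lines of Python. The input layout here is an assumption (one scraped record per line, newest first, with a dd/mm/yyyy date and Greek decimal commas in the prices):

```python
import re

# Post-process scraped price records: isolate the date with a regex,
# swap decimal commas for dots, emit arrayToDataTable-style rows, and
# reverse the list so the oldest record comes first.
def records_to_rows(lines):
    rows = []
    for line in lines:
        date = re.search(r"\d{2}/\d{2}/\d{4}", line).group(0)   # isolate the date
        prices = [p.replace(",", ".") for p in re.findall(r"\d+,\d+", line)]
        rows.append("['%s', %s]" % (date, ", ".join(prices)))
    return list(reversed(rows))  # chronological order for the line chart

print(records_to_rows(["13/04/2013 1,702 1,751 1,802",
                       "12/04/2013 1,698 1,749 1,799"]))
```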
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; height=&quot;226&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi53czubZ9jh6O3y2rgCDWoLqyZA2nglKrZWsBp4OFCr2GfT28TuOJkwgJeuuDUMsUwnB46k008QW4UmpFbwwjgLWFyOJ2ZdQz4K7i-ORsQRnNZ60DQ32e-EZjto_5pZb-NgHahpR4G2n7z/s400/Screen+Shot+2013-04-13+at+10.29.23+PM.png&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Generally, there are various remarkable data visualization tools out there (&lt;a href=&quot;https://developers.google.com/chart/&quot; target=&quot;_blank&quot;&gt;Google Charts&lt;/a&gt; being one of the best, of course) but we will not elaborate further on them now. Nevertheless, we would like to emphasize that once you have rich and useful web data in hand, you can exploit it in a wide variety of ways and come up with smart methods to analyze, use and present it. Your imagination is the only limit (along with the &lt;a href=&quot;http://deixto.blogspot.gr/2011/12/robotstxt-access-restrictions.html&quot; target=&quot;_blank&quot;&gt;copyright restrictions&lt;/a&gt;).&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/04/fuel-price-monitoring-data-visualization.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzSB0a2ExzdxVizvCzXFOLn0ADJuhWSWBGzh-IqHvYOKanAr28oB8caMPkgVzMyo9SWUJ6XuQj9ONMCd3npvA-zTpVOA5Z0t2eIktO-KgRBRbpvE89xgKR2XXih3VzJbicXryAtVRkeLBH/s72-c/sms.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-238510656511322230</guid><pubDate>Fri, 12 Apr 2013 06:34:00 +0000</pubDate><atom:updated>2013-04-12T21:07:29.455+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data migration</category><category domain="http://www.blogger.com/atom/ns#">e-commerce</category><title>Data migration through browser automation</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
As we have already mentioned quite a few times, browser automation can have a lot of practical applications ranging from testing of web applications to web-based administration tasks and web scraping. The latter (scraping) is our field of expertise and &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt; is our tool of choice when it comes to automated browser interaction and dealing with complex, JavaScript-rich pages.&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; A very interesting scenario (among others) for combining our beloved web scraping tool,&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, with Selenium is data migration. Imagine, for example, that you have an &lt;a href=&quot;http://www.oscommerce.com/&quot; target=&quot;_blank&quot;&gt;osCommerce&lt;/a&gt; online store and you would like to migrate it to a &lt;a href=&quot;http://www.joomla.org/&quot; target=&quot;_blank&quot;&gt;Joomla &lt;/a&gt;&lt;a href=&quot;http://virtuemart.net/&quot; target=&quot;_blank&quot;&gt;VirtueMart&lt;/a&gt; e-commerce system. Wouldn&#39;t it be great if you could scrape the product details from the old online catalogue with DEiXTo and then automate the data entry labor via Selenium? Once we have the data at hand in a suitable format, e.g. comma- or tab-delimited or XML, we could write a script that repeatedly visits the data entry form (in the administration environment of the new e-shop), fills in the necessary fields and submits it (once for each product) so as to automatically insert all the products into the new website.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.oscommerce.com/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;35&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibKZx-pkbpwM0F5CJZsMwq2pTf-1bN4J28qdQZF_S8nIxRKvo9BrEIB9A66O-YC6hQJTD-S5fnGAgeeF6exhrubLJBp5PX6o7VH6TBiiYVIN6JkW34F6VpZ7dTme8HfFzmX37nOUKKjIp-/s200/oscommerce.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&amp;nbsp; &amp;nbsp; This way you can save a lot of time and effort and avoid messing with complex data migration tools (useful as they are in many cases). To be clear, we don&#39;t claim that migrating databases through scraping and automated data entry is the best solution; however, it is a nice and quick alternative for many relatively simple cases. The big advantage is that you don&#39;t even need to know the underlying schemas of the two systems involved. The only prerequisite is access to the administration interface of the new system.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; By the way, below you can see a screenshot from &lt;a href=&quot;http://www.altova.com/mapforce.html&quot; target=&quot;_blank&quot;&gt;Altova MapForce&lt;/a&gt;, maybe the best (but not free) data mapping, conversion and integration software tool out there.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
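The data-entry loop sketched above is essentially this, shown here in Python. Note the hedges: `driver` stands for any Selenium-like browser object (e.g. a WebDriver instance), and the form URL and field names (taken from the CSV header, plus a `submit` button) are hypothetical and depend entirely on the target admin interface:

```python
import csv, io

# For each scraped product, open the admin entry form, fill every field
# whose name matches a CSV column, and click the (assumed) submit button.
def migrate_products(driver, csv_text, form_url):
    for product in csv.DictReader(io.StringIO(csv_text)):
        driver.get(form_url)                           # open a fresh entry form
        for field, value in product.items():
            driver.find_element("name", field).send_keys(value)
        driver.find_element("name", "submit").click()  # insert this product
```

With Selenium's Python bindings, `driver` would be e.g. `webdriver.Firefox()`; any object exposing the same `get`/`find_element` methods works for testing.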
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.altova.com/mapforce.html&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;155&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTqBg8bdrWhqyyU-mYyNETAdJt2dCmYGiZ5cSSPtuPebObfmPr99LAUCeMvlIgd2s_fr47q5_O3NV8KF4NYPupt9camqSWG815IR6fYMhAlkYJo8YYBSNLm_bFPzhyD6G_p1cEu6Ab-xWr/s320/mapforce-overview.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Generally speaking, the uses and applications of web data extraction are numerous. You can check out some of them &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. Perhaps you are about to think up the next one, and we would be glad to help you with the technicalities!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/04/data-migration-through-browser.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibKZx-pkbpwM0F5CJZsMwq2pTf-1bN4J28qdQZF_S8nIxRKvo9BrEIB9A66O-YC6hQJTD-S5fnGAgeeF6exhrubLJBp5PX6o7VH6TBiiYVIN6JkW34F6VpZ7dTme8HfFzmX37nOUKKjIp-/s72-c/oscommerce.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6079054790029513248</guid><pubDate>Thu, 28 Mar 2013 08:40:00 +0000</pubDate><atom:updated>2013-04-03T18:18:25.914+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Browser automation</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><title>How to pass Selenium pages to DEiXToBot</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Recently we talked about &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt; and its potential combination with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;. It is a truly remarkable browser automation tool with numerous uses and applications. For those of you wondering how to programmatically pass pages fetched with Selenium to DEiXToBot on the fly, here is a way (provided you are familiar with Perl programming):&lt;/div&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;use IO::File; use Fcntl; use POSIX qw(tmpnam); # modules needed for the temporary file below&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;# suppose that you have already fetched the target page with the WWW::Selenium object ($sel variable)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $content = $sel-&amp;gt;get_html_source(); # get the page source code&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my ($fh,$name); # create a temporary file containing the page&#39;s source code&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;do { $name = tmpnam() } until $fh = IO::File-&amp;gt;new($name, O_RDWR|O_CREAT|O_EXCL);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;print $fh $content;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;close $fh;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;get(&quot;file://$name&quot;); # load the temporary file/page with the DEiXToBot agent using the file:// scheme&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;unlink $name; # delete the temporary file, it is not needed any more&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;if (! $agent-&amp;gt;success) { die &quot;Could not fetch the temp file!&quot;; }&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;load_pattern(&#39;pattern.xml&#39;); # load the pattern built with the GUI tool&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;build_dom(); # build the DOM tree of the page&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;extract_content(); # apply the pattern&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my @records = @{$agent-&amp;gt;records};&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;for my $record (@records) { # loop through the data/ records scraped&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;....&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Therefore, you can create temporary HTML files, in real time, containing the source code of the target pages (after the&amp;nbsp;&lt;a href=&quot;http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm&quot; target=&quot;_blank&quot;&gt;WWW::Selenium&lt;/a&gt;&amp;nbsp;object gets these pages) and pass them to the DEiXToBot agent to do the scraping job. Another interesting scenario is to download the pages locally with Selenium and then read/ scrape them directly from the disk at a later stage. We hope the above snippet helps. Please do not hesitate to contact us for any questions or feedback!&lt;/div&gt;
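The same temp-file handoff can be sketched with Python's standard library, for readers who prefer it to Perl: persist a fetched page's source as a temporary HTML file so another tool can re-read it through the file:// scheme. The function name is our own:

```python
import os, tempfile

# Write a page's HTML source to a temporary file and return a file://
# URL that a scraping agent can load instead of the live page.
def save_page_source(html):
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(html)              # persist the page source to disk
    return "file://" + path         # hand this URL to the scraping agent
```

As in the Perl snippet, the temporary file should be deleted once the agent has consumed it.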
&lt;br /&gt;&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/03/how-to-pass-selenium-pages-to-deixtobot.html</link><author>noreply@blogger.com (kntonas)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-1011149959411420819</guid><pubDate>Sat, 02 Feb 2013 10:19:00 +0000</pubDate><atom:updated>2013-02-06T09:03:34.061+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Archivability</category><category domain="http://www.blogger.com/atom/ns#">Digital preservation</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><category domain="http://www.blogger.com/atom/ns#">Web archiving</category><title>Digital Preservation and ArchiveReady</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Although our blog&#39;s main focus is scraping data from web information sources (especially via &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;), we are also very interested in services and applications that can be built on top of agents and crawlers. Our favorite tools for programmatic web browsing are &lt;a href=&quot;http://search.cpan.org/~jesse/WWW-Mechanize/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt; and &lt;a href=&quot;http://seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;. The first is a handy Perl module (though it lacks JavaScript support), whereas the latter is a great browser automation tool that we have been using more and more lately in a variety of cases. Through them we can simulate whatever a user can do in a browser window and automate the interaction with pages of interest.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp;Traversing a website is one of the most basic and common tasks for a developer of web robots. However, the methodologies and mechanisms deployed can vary a lot. So, we tried to think of a meaningful crawler-based scenario that would blend various &quot;tasty&quot; ingredients into a nice story. Hopefully we did; our post rests on four major pillars, which we highlight and discuss below:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Crawling (through Selenium)&lt;/li&gt;
&lt;li&gt;Sitemaps&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;text-align: left;&quot;&gt;Archivability&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;text-align: left;&quot;&gt;Scraping (in the demo that follows we download reports from a target website)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; An interesting topic we recently stumbled upon is &lt;a href=&quot;http://en.wikipedia.org/wiki/Digital_preservation&quot; target=&quot;_blank&quot;&gt;digital preservation&lt;/a&gt;, which can be viewed as the set of policies and strategies necessary to ensure continued access to digital content over time, despite the challenges of media failure and technological change. In this context, we discovered &lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;ArchiveReady&lt;/a&gt;, a remarkable web application that checks whether a website is easily archivable; that is, it scans a page and checks whether it is suitable for &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_archiving&quot; target=&quot;_blank&quot;&gt;web archiving&lt;/a&gt; projects (such as the &lt;a href=&quot;http://archive.org/&quot; target=&quot;_blank&quot;&gt;Internet Archive&lt;/a&gt; and &lt;a href=&quot;http://blogforever.eu/&quot; target=&quot;_blank&quot;&gt;BlogForever&lt;/a&gt;) to access and preserve. However, you can only pass one web page at a time to its checker (not an entire website), and the check might take a while, depending on the complexity and size of the page. Therefore, for those interested in testing multiple pages, we wrote a small script that parses the XML sitemap of a target site, checks each of the URLs contained in it against the ArchiveReady service and downloads the results.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;58&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL_jL6uOkuiBoGERa-2QTq9K5XnEAXRaYwiwmN2uYif1iBozIYOfZVXB8hvIzeXeNsVOe8tb6wHXobltN7SxJMSGFjeX8bNdGJmnXMCQMCQtUSe_qqU2K5wmsUSHRxPpCMDNc5VN4DOr_K/s200/archiveready_logo.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.sitemaps.org/&quot; target=&quot;_blank&quot;&gt;Sitemaps&lt;/a&gt;, as you probably already know,&amp;nbsp;are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a sitemap is an XML file that lists URLs for a site along with some additional metadata about each URL so that search engines can more intelligently crawl the site. Typically sitemaps are auto-generated by plugins.&lt;/div&gt;
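Reading the URLs out of such a sitemap takes only a few lines; here is a minimal sketch in Python (the Perl script later in this post does the same job with XML::LibXML):

```python
import xml.etree.ElementTree as ET

# Minimal sitemap reader: collect the text of every <loc> element,
# ignoring the sitemap namespace prefix on the element tags.
def sitemap_urls(xml_text):
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://openarchives.gr/</loc></url>
  <url><loc>http://openarchives.gr/about</loc></url>
</urlset>"""
print(sitemap_urls(sample))
```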
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp;OK, enough with the talking. Let&#39;s get to work and write some code! The Perl modules we utilised for our purposes were &lt;a href=&quot;http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm&quot; target=&quot;_blank&quot;&gt;WWW::Selenium&lt;/a&gt; and &lt;a href=&quot;http://search.cpan.org/~shlomif/XML-LibXML/LibXML.pod&quot; target=&quot;_blank&quot;&gt;XML::LibXML&lt;/a&gt;. The activities we had to do were the following:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;launch a Firefox instance&amp;nbsp;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;read the sitemap document of a sample target website (we chose openarchives.gr)&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;pass each of its URLs to the ArchiveReady validation engine and finally&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;download locally the results returned in Evaluation and Report Language (&lt;a href=&quot;http://www.w3.org/TR/EARL10-Schema&quot; target=&quot;_blank&quot;&gt;EARL&lt;/a&gt;) format, since ArchiveReady offers this option&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So, here is the code (just note that we wait till the validation page contains 8 &quot;Checking complete&quot; messages, one for each section, to determine whether the processing has finished):&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;use WWW::Selenium;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;use XML::LibXML;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $parser = XML::LibXML-&amp;gt;new();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $dom = $parser-&amp;gt;parse_file(&#39;http://openarchives.gr/sitemap.xml&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my @loc_elms = $dom-&amp;gt;getElementsByTagName(&#39;loc&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my @urls;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;for my $loc (@&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: xx-small;&quot;&gt;loc_elms&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: xx-small;&quot;&gt;) {&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; push @urls,$loc-&amp;gt;textContent;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;}&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;# launch a Firefox instance&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $sel = WWW::Selenium-&amp;gt;new( host =&amp;gt; &quot;localhost&quot;,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; port =&amp;gt; 4444,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; browser =&amp;gt; &quot;*firefox&quot;,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; browser_url =&amp;gt; &quot;http://archiveready.com/&quot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; );&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$sel-&amp;gt;start;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;# parse through the pages contained in the sitemap&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;for my $u (@urls) {&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $sel-&amp;gt;open(&quot;http://archiveready.com/check?url=$u&quot;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; my $content = $sel-&amp;gt;get_html_source();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; while ( (() = $content =~ /Checking complete/g) != 8) { # check if complete&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sleep(1);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; $content = $sel-&amp;gt;get_html_source();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; }&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $content=~m#href=&quot;/download-results\?test_id=(\d+)&amp;amp;amp;format=earl&quot;#;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; my $id = $1; # capture the identifier of the current validation test&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $sel-&amp;gt;click(&#39;xpath=//a[contains(@href,&quot;earl&quot;)]&#39;); # click on the&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: xx-small;&quot;&gt;EARL link&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $sel-&amp;gt;wait_for_page_to_load(5000);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; eval { $content = $sel-&amp;gt;get_html_source(); };&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; open my $fh, &quot;&amp;gt;:utf8&quot;, &quot;download_results_$id.xml&quot; or die $!; # write the EARL report to a file&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; print $fh $content;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; close $fh;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;}&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$sel-&amp;gt;stop;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
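The polling idiom in the Perl script above (fetch the page, count the &quot;Checking complete&quot; markers, sleep, retry) generalises well beyond ArchiveReady. Here is a minimal sketch of it, written in Python for brevity, with a caller-supplied fetch function and an explicit timeout; all names are ours, not part of ArchiveReady or Selenium:

```python
import time

def wait_until_complete(fetch, marker="Checking complete", expected=8,
                        interval=1.0, timeout=60.0):
    """Re-fetch the page until `marker` occurs `expected` times in it,
    then return the page source; raise TimeoutError if it never does."""
    deadline = time.monotonic() + timeout
    while True:
        content = fetch()
        if content.count(marker) == expected:
            return content
        if time.monotonic() > deadline:
            raise TimeoutError("checks did not complete in time")
        time.sleep(interval)
```

In the Selenium scenario above, fetch would simply be a closure around get_html_source().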
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; We hope you found the above helpful and the scenario described interesting. We tried to take advantage of software agents/ crawlers and use them creatively in combination with&amp;nbsp;&lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;ArchiveReady&lt;/a&gt;,&amp;nbsp;an innovative&amp;nbsp;service that helps you strengthen your website&#39;s archivability. Finally, scraping and automated browsing can have an extremely extensive set of &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;uses and applications&lt;/a&gt;. Please check out &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, our feature-rich web data&amp;nbsp;extraction&amp;nbsp;tool, and don&#39;t&amp;nbsp;hesitate&amp;nbsp;to&amp;nbsp;contact&amp;nbsp;us! Maybe we can help you with your tedious and time-consuming web tasks and data needs!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/02/digital-preservation-and-archiveready.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL_jL6uOkuiBoGERa-2QTq9K5XnEAXRaYwiwmN2uYif1iBozIYOfZVXB8hvIzeXeNsVOe8tb6wHXobltN7SxJMSGFjeX8bNdGJmnXMCQMCQtUSe_qqU2K5wmsUSHRxPpCMDNc5VN4DOr_K/s72-c/archiveready_logo.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2612032599435081465</guid><pubDate>Thu, 24 Jan 2013 06:17:00 +0000</pubDate><atom:updated>2013-01-26T21:42:58.705+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Browser automation</category><category domain="http://www.blogger.com/atom/ns#">Sauce Labs</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><category domain="http://www.blogger.com/atom/ns#">Testing</category><title>Cloudify your browser testing (and scraping) with Sauce!</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
For quite some time now, along with our&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;&amp;nbsp;scraping software, we have been using&amp;nbsp;&lt;a href=&quot;http://seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;, which is perhaps the best web browser automation tool currently available. It&#39;s really great and has helped us a lot in a variety of web data extraction cases (we published&amp;nbsp;&lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;another post&lt;/a&gt;&amp;nbsp;about it recently). We tried it locally as well as on remote&amp;nbsp;&lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt;&amp;nbsp;servers and wrote code for a couple of automated tests and scraping tasks. However, it was not that easy to set everything up and get things running; we came across various difficulties (ranging from installation problems to stability issues, e.g. sporadic timeout errors), although we were finally able to overcome most of them.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Wouldn&#39;t it be great though if there were a robust framework that would provide you with the necessary infrastructure and all possible browser/OS combinations and allow you to run your Selenium tests in the &lt;a href=&quot;http://en.wikipedia.org/wiki/Cloud_computing&quot; target=&quot;_blank&quot;&gt;cloud&lt;/a&gt;? You would not have to worry about setting a bunch of things up, installing updates, machine management, maintenance, etc. Well, there is! And it offers a whole lot more. Its name is &lt;a href=&quot;https://saucelabs.com/home&quot; target=&quot;_blank&quot;&gt;Sauce Labs&lt;/a&gt; and it provides an amazing set of tools and &lt;a href=&quot;https://saucelabs.com/features&quot; target=&quot;_blank&quot;&gt;features&lt;/a&gt;. Admittedly they have done awesome work and they bring great products to software developers. Moreover, their team seems to share some great &lt;a href=&quot;https://saucelabs.com/company/values&quot; target=&quot;_blank&quot;&gt;values&lt;/a&gt;: pursuit of excellence,&amp;nbsp;innovation and &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_source_software&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt;&amp;nbsp;culture&amp;nbsp;(among others).&lt;br /&gt;
&amp;nbsp; &amp;nbsp; They offer a variety of &lt;a href=&quot;https://saucelabs.com/pricing&quot; target=&quot;_blank&quot;&gt;pricing plans&lt;/a&gt; (a bit expensive in my opinion though), while the free account includes 100 automated code minutes for Win, Linux and Android, 40 automated code minutes for Mac and iOS, and 30 minutes of manual testing. And for those contributing to an open source project that needs testing support, &lt;a href=&quot;http://saucelabs.com/opensauce&quot; target=&quot;_blank&quot;&gt;Open Sauce Plan&lt;/a&gt;&amp;nbsp;is just for you (unlimited minutes without any cost!). Please note that the Selenium project is sponsored by Sauce Labs.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://saucelabs.com/home&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOyvWaT_zktMmRmoTDAFCRO6ZrtXQVll0WDQvZep-_248m6SXENNHFOCf1mx07JMd_tKFgExJ6XkvSehfDGFVgOe5w18PmrcCGwvt6gFcSYTiuTnhxf6ABsKxudISUDz4iQDlHwXDxIPEK/s1600/sauce-labs-logo.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp;&amp;nbsp;Being a &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; programmer, I could not resist signing up and writing some Perl code to run a test on the &lt;i&gt;ondemand.saucelabs.com&lt;/i&gt; host! I was already familiar with the&amp;nbsp;&lt;a href=&quot;http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm&quot; target=&quot;_blank&quot;&gt;WWW::Selenium&lt;/a&gt;&amp;nbsp;CPAN module, so it was quite easy and straightforward. It should be noted that they provide useful guidelines and various &lt;a href=&quot;https://saucelabs.com/docs/code-examples&quot; target=&quot;_blank&quot;&gt;examples&lt;/a&gt; online for multiple languages, e.g. Python, Java, PHP and others. Overall my test script worked pretty well but it was a bit slow (compared to running the same code locally). However, one could improve speed by deploying lots of processes in parallel (if the use case scenario is suitable) and by &lt;a href=&quot;http://saucelabs.com/docs/ondemand/additional-config#video&quot; target=&quot;_blank&quot;&gt;disabling video&lt;/a&gt; (the script&#39;s execution and browser activity are recorded for easier debugging). Furthermore, Sauce&#39;s big advantage is that it can operate at large scale, which is especially suited for complex cases with heavy requirements.&lt;/div&gt;
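For the record, Sauce's Selenium RC-era endpoint expected the account credentials and the desired platform packed into a JSON &quot;browser string&quot; passed where a plain browser name would normally go. A rough sketch of assembling that string, in Python for brevity; the field names are as we recall them from Sauce's old RC documentation, so double-check them against the current docs:

```python
import json

def sauce_rc_browser_spec(username, access_key, os_name="Windows 2003",
                          browser="firefox", browser_version="7"):
    """Assemble the JSON 'browser string' that Sauce OnDemand's RC-era
    endpoint took in place of a plain browser name. Field names follow
    Sauce's old Selenium RC docs; verify against your own account."""
    return json.dumps({
        "username": username,
        "access-key": access_key,
        "os": os_name,
        "browser": browser,
        "browser-version": browser_version,
    })
```

With WWW::Selenium you would then point the client at the ondemand.saucelabs.com host and hand it this string as the browser argument.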
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;The bottom line is that the &quot;Selenium - Sauce Labs&quot; pair is remarkable and can be very useful in a wide range of cases and purposes. Sauce in particular offers developers an exciting way to cloudify and manage their automated browser testing (although we personally focus more on the scraping capabilities that these tools provide). Their combination with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; extraction patterns could definitely be fruitful and open up interesting new possibilities. In conclusion, the &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;uses and applications&lt;/a&gt; of web scraping are limitless and Selenium turns out to be a powerful tool in our quiver!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/01/cloudify-your-browser-testing-and.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOyvWaT_zktMmRmoTDAFCRO6ZrtXQVll0WDQvZep-_248m6SXENNHFOCf1mx07JMd_tKFgExJ6XkvSehfDGFVgOe5w18PmrcCGwvt6gFcSYTiuTnhxf6ABsKxudISUDz4iQDlHwXDxIPEK/s72-c/sauce-labs-logo.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6567054700607830256</guid><pubDate>Sun, 13 Jan 2013 05:07:00 +0000</pubDate><atom:updated>2013-01-24T08:15:48.283+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">OCR</category><category domain="http://www.blogger.com/atom/ns#">PDF</category><category domain="http://www.blogger.com/atom/ns#">Tesseract</category><title>Scraping PDF files</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
While puttering around on the Internet, I recently stumbled upon the website of the&amp;nbsp;&lt;a href=&quot;http://www.et.gr/&quot; target=&quot;_blank&quot;&gt;National Printing House&lt;/a&gt;&amp;nbsp;(&quot;Εθνικό Τυπογραφείο&quot; in Greek) which is the public service responsible for the dissemination of Greek law.&amp;nbsp;It publishes and distributes the government gazette and its website provides free access to all series of the Official Journal&amp;nbsp;of the Hellenic Republic&amp;nbsp;(ΦΕΚ).&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.et.gr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy2cBOkMvQYESxEgbvA0pvVutaPGMpvZCllX3KKw3AXuurFl290stug3eeHLuiFhhwOLndmi3WlMr4_muR9_54qj2ez7aoGhg6oCGZbt5amHYmjqEc8-6c3OdfllinQAjGw66NzTgmrZ-W/s1600/logo_etw_el.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, at its &lt;a href=&quot;http://www.et.gr/index.php?option=com_content&amp;amp;view=article&amp;amp;id=166&amp;amp;Itemid=103&amp;amp;lang=el&quot; target=&quot;_blank&quot;&gt;search page&lt;/a&gt; I noticed a &lt;a href=&quot;http://www.et.gr/index.php?option=com_wrapper&amp;amp;view=wrapper&amp;amp;Itemid=148&amp;amp;lang=el&quot; target=&quot;_blank&quot;&gt;section&lt;/a&gt; with&amp;nbsp;the most popular issues. The most-viewed&amp;nbsp;one, as of 13 Jan 2013, with 351.595 views was this:&amp;nbsp;&lt;a href=&quot;http://www.et.gr/idocs-nph/search/pdfViewerForm.html?args=5C7QrtC22wFYAFdDx4L2G3dtvSoClrL84tQ3Uej7Zml5MXD0LzQTLWPU9yLzB8V68knBzLCmTXKaO6fpVZ6Lx9hLslJUqeiQiiD930OBDBHUohi1lAlpD-vAa2f_8ua_g5tppHc83kc.&quot; target=&quot;_blank&quot;&gt;ΦΕΚ A 226 - 27.10.2011&lt;/a&gt;.&amp;nbsp;Out of curiosity I decided to download it in order to take a quick look and see what it is all about.&amp;nbsp;It was available in a PDF format and&amp;nbsp;it turned out to be an issue about the Economic Adjustment Programme&amp;nbsp;for Greece&amp;nbsp;aiming to reduce its macroeconomic and fiscal imbalances. However,&amp;nbsp;I was quite surprised to find that the text was contained in images and you could not perform any keyword search in it nor could you copy-paste its textual content! I guess because the document&#39;s pages were scanned and converted to digital images.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; This&amp;nbsp;instantly&amp;nbsp;brought to my mind once again the difficulties that PDF scraping involves. From our web scraping experience there are many cases where the data is &quot;locked&quot; in PDF files e.g. in a .pdf brochure.&amp;nbsp;Getting the data of interest out is not an easy task but quite a few tools (&lt;a href=&quot;http://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/&quot; target=&quot;_blank&quot;&gt;pdftotext&lt;/a&gt; is one of them)&amp;nbsp;have popped up over the years&amp;nbsp;to ease the pain. One of the best tools I have encountered so far is&amp;nbsp;&lt;a href=&quot;http://code.google.com/p/tesseract-ocr/&quot; target=&quot;_blank&quot;&gt;Tesseract&lt;/a&gt;, a pretty accurate &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_source_software&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/Optical_character_recognition&quot; target=&quot;_blank&quot;&gt;OCR&lt;/a&gt; engine currently maintained by Google.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://code.google.com/p/tesseract-ocr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbZMy_ivq_1OAEKyhq0RXt4e2viAvFZ63WXU8aTuvpyr-REo_IecqrPKggpCtO8a5hIkREQG_uE1Vl5cMbtlgsTyacBPG8kMbISI3EUxImS_WwmmvkVJBLPc3ZnNLaatuUcHPja9S9Ew0D/s1600/tesseract-ocr.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, I thought it would be nice to put Tesseract into action and check its efficiency against the PDF document (that so dramatically&amp;nbsp;affects the lives of all Greeks...). It worked quite well, although not perfectly (probably because of the Greek language), and a few minutes later (and after converting the PDF to &lt;a href=&quot;http://deixto.com/wp-content/uploads/FEK_A_226-27.10.2011.tiff&quot; target=&quot;_blank&quot;&gt;a tiff image&lt;/a&gt; through &lt;a href=&quot;http://linux.about.com/library/cmd/blcmdl1_gs.htm&quot; target=&quot;_blank&quot;&gt;Ghostscript&lt;/a&gt;) I had the full text, or at least most of it, in my hands.&amp;nbsp;The output text file generated can be found&amp;nbsp;&lt;a href=&quot;http://deixto.com/wp-content/uploads/FEK_A_226-27.10.2011.txt&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&amp;nbsp;The truth is that I could not do much with it (and the austerity measures were harsh...) but at least I was happy that I was able to extract the largest part of the text.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Of course this is just an example, there are numerous PDF files out there containing rich, inaccessible data that could potentially be processed and further utilised e.g. in order to create a full text search index.&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, our beloved web data extraction tool, can scrape&amp;nbsp;only&amp;nbsp;HTML pages. It cannot deal with PDF files residing on the Web. However, we do have the tools and the knowledge to parse those as well, find bits of interest and unleash their value!&lt;/div&gt;
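The conversion described above boils down to two commands: Ghostscript rasterises the PDF into a TIFF, and Tesseract OCRs that TIFF into plain text (ell is Tesseract's language code for Greek). A small Python sketch that assembles and runs the pipeline; the file names and the 300 dpi choice are just examples, and both tools must of course be installed:

```python
import subprocess

def ocr_pdf_commands(pdf_path, out_base, dpi=300, lang="ell"):
    """Build the two commands of the pipeline: Ghostscript renders the
    PDF into a bilevel (G4) multi-page TIFF, then Tesseract OCRs that
    TIFF and writes the recognised text to out_base.txt."""
    tiff = out_base + ".tiff"
    gs_cmd = ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
              f"-r{dpi}", f"-sOutputFile={tiff}", pdf_path]
    tess_cmd = ["tesseract", tiff, out_base, "-l", lang]
    return gs_cmd, tess_cmd

def run_ocr(pdf_path, out_base):
    # Actually execute the pipeline; requires gs and tesseract on PATH.
    for cmd in ocr_pdf_commands(pdf_path, out_base):
        subprocess.run(cmd, check=True)
```

tiffg4 produces a compact black-and-white image, which is usually enough for OCR of scanned text; for grayscale scans tiffgray may give Tesseract better input.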
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/01/scraping-pdf-files.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy2cBOkMvQYESxEgbvA0pvVutaPGMpvZCllX3KKw3AXuurFl290stug3eeHLuiFhhwOLndmi3WlMr4_muR9_54qj2ez7aoGhg6oCGZbt5amHYmjqEc8-6c3OdfllinQAjGw66NzTgmrZ-W/s72-c/logo_etw_el.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6271643811269006152</guid><pubDate>Fri, 11 Jan 2013 06:43:00 +0000</pubDate><atom:updated>2013-04-18T09:27:18.334+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">AJAX</category><category domain="http://www.blogger.com/atom/ns#">Browser automation</category><category domain="http://www.blogger.com/atom/ns#">JavaScript</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><title>Selenium: a web browser automation companion for DEiXTo</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp;&lt;a href=&quot;http://seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;&amp;nbsp;is probably the best web browser automation tool we have come across so far. Primarily it is intended for automated testing of web applications but it&#39;s certainly not limited to that; it provides a suite of &lt;a href=&quot;http://en.wikipedia.org/wiki/Free_software&quot; target=&quot;_blank&quot;&gt;free software&lt;/a&gt; tools to automate web browsers across many platforms.&amp;nbsp;The range of its use case scenarios is really wide and its usefulness is just great.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWttiqWoOX7Sa0LDUp9yojRdXvIZCQF8SfeBuq-wd8fwDOmeTtIEhPWo1xCI4qH38DTASQkGygZDTjPiHEb9yAy0Vzqzt0glaJGnkx9mFJvgNut3EfFKtEH590CU78Ia6VSVx5gULvgj8t/s1600/selenium_logo.jpeg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWttiqWoOX7Sa0LDUp9yojRdXvIZCQF8SfeBuq-wd8fwDOmeTtIEhPWo1xCI4qH38DTASQkGygZDTjPiHEb9yAy0Vzqzt0glaJGnkx9mFJvgNut3EfFKtEH590CU78Ia6VSVx5gULvgj8t/s1600/selenium_logo.jpeg&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; However, as scraping experts, we inevitably focus on using Selenium for web data extraction purposes. Its functionality-rich client API can be used to launch browser instances (e.g. Firefox processes)&amp;nbsp;and simulate, through the proper commands, almost everything a user could do on a web site/ page.&amp;nbsp;Thus, it allows you to deploy a fully-fledged web browser and overcome the difficulties that arise from heavy JavaScript/ AJAX use.&amp;nbsp;Moreover, via the virtual framebuffer X server (&lt;a href=&quot;http://www.xfree86.org/4.0.1/Xvfb.1.html&quot; target=&quot;_blank&quot;&gt;Xvfb&lt;/a&gt;), one could&amp;nbsp;automate browsers without the need for an actual display and create scripts/ services running periodically or at will on a headless server, e.g. a remote&amp;nbsp;&lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt;&amp;nbsp;machine. Therefore, Selenium could successfully be used in combination with DEiXToBot, our beloved Mechanize scraping module.&amp;nbsp;&lt;/div&gt;
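A minimal sketch of the Xvfb setup just mentioned, in Python for brevity: start a virtual framebuffer and point DISPLAY at it, so that any browser Selenium launches renders into it instead of a real screen. The display number :99 and the screen geometry are arbitrary choices of ours:

```python
import os
import subprocess

def headless_env(display=":99"):
    """Return a copy of the current environment with DISPLAY pointed at
    the virtual framebuffer (the display number is an arbitrary pick)."""
    env = dict(os.environ)
    env["DISPLAY"] = display
    return env

def start_xvfb(display=":99", geometry="1280x1024x24"):
    """Launch Xvfb and return the server process plus an environment
    that makes X clients (browsers, the Selenium server) render into it."""
    proc = subprocess.Popen(["Xvfb", display, "-screen", "0", geometry])
    return proc, headless_env(display)
```

You would then start the Selenium server (or the browser itself) with the returned environment, and terminate the Xvfb process when the scraping run finishes.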
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; For example, the Selenium-automated browser could fetch a target page after a couple of steps (like clicking a button/ hyperlink, selecting an item from a drop-down list,&amp;nbsp;submitting&amp;nbsp;a form, etc.) and then &lt;a href=&quot;http://deixto.blogspot.gr/2013/03/how-to-pass-selenium-pages-to-deixtobot.html&quot; target=&quot;_blank&quot;&gt;pass it to DEiXToBot&lt;/a&gt; (which&amp;nbsp;lacks JavaScript support) to do the scraping job through DOM-based tree patterns previously generated with the GUI DEiXTo tool. This is particularly useful for complex scraping cases and opens new potential for &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; wrappers.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; The Selenium Server component (formerly the Selenium RC Server) as well as the client drivers that allow you to write scripts that interact with the Selenium Server can be found &lt;a href=&quot;http://seleniumhq.org/download/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. We have used it quite a few times for various cases and the results were great.&amp;nbsp;In conclusion, Selenium is an amazing &quot;weapon&quot; added to our arsenal and we strongly believe that along with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; it boosts our scraping capabilities. If you have an idea/ project that involves web browser automation or/ and web data extraction, we would be more than glad to hear from you!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/01/selenium-browser-automation-companion-for-deixto.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWttiqWoOX7Sa0LDUp9yojRdXvIZCQF8SfeBuq-wd8fwDOmeTtIEhPWo1xCI4qH38DTASQkGygZDTjPiHEb9yAy0Vzqzt0glaJGnkx9mFJvgNut3EfFKtEH590CU78Ia6VSVx5gULvgj8t/s72-c/selenium_logo.jpeg" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2922626744838802426</guid><pubDate>Sat, 18 Aug 2012 09:17:00 +0000</pubDate><atom:updated>2012-08-19T20:54:39.996+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Agents</category><category domain="http://www.blogger.com/atom/ns#">PhantomJS</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>PhantomJS &amp; finding pizza using Yelp and DEiXTo!</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Recently I stumbled upon &lt;a href=&quot;http://code.google.com/p/phantomjs/&quot; target=&quot;_blank&quot;&gt;PhantomJS&lt;/a&gt;, a headless WebKit browser which can serve a wide variety of purposes such as web browser automation, site scraping,&amp;nbsp;website testing,&amp;nbsp;SVG rendering and network monitoring. It&#39;s&amp;nbsp;a very interesting tool and I am sure that it could successfully be used in combination with DEiXToBot, which is our beloved powerful Mechanize scraper. For example, it could fetch a not-easy-to-reach (probably JavaScript-rich) target page (that &lt;a href=&quot;http://search.cpan.org/~jesse/WWW-Mechanize-1.72/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt; could not get due to its lack of&amp;nbsp;JavaScript&amp;nbsp;support) after completing some steps like clicking, selecting, checking, etc., and then pass it to DEiXToBot to do the scraping job. This is particularly useful for complex scraping cases where in my humble opinion PhantomJS DOM manipulation support would just not be enough and DEiXTo extraction capabilities could come into play.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNP6BX_w4fZAYF8aRZ9EzOZRLvhC0k7XHWdZo_alj_pVV5PF_Nv6UVDQxPxWmUy88Q6ishlbmizyRVDn3REAWtsFJZdh0XAoGlcyqHgRdJyAFUpg48sGUhizxaeX9HGblDRScTJGnGYk4Y/s1600/PhantomJS.jpeg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;117&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNP6BX_w4fZAYF8aRZ9EzOZRLvhC0k7XHWdZo_alj_pVV5PF_Nv6UVDQxPxWmUy88Q6ishlbmizyRVDn3REAWtsFJZdh0XAoGlcyqHgRdJyAFUpg48sGUhizxaeX9HGblDRScTJGnGYk4Y/s1600/PhantomJS.jpeg&quot; width=&quot;108&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So, I was taking a look at the PhantomJS examples and I liked (among others) the one about &lt;a href=&quot;https://github.com/ariya/phantomjs/blob/master/examples/pizza.coffee&quot; target=&quot;_blank&quot;&gt;finding pizza&lt;/a&gt; in Mountain View using &lt;a href=&quot;http://www.yelp.com/search?find_desc=pizza&amp;amp;find_loc=94040&amp;amp;find_submit=Search&quot; target=&quot;_blank&quot;&gt;Yelp&lt;/a&gt; (I really like pizza!). So, I thought it would be nice to port the example to DEiXToBot in order to demonstrate the latter&#39;s use and efficiency. Hence, I visually created a pretty simple and easy to build &lt;a href=&quot;http://deixto.com/wp-content/uploads/yelp_pizza.xml&quot; target=&quot;_blank&quot;&gt;XML pattern&lt;/a&gt; with GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; for extracting the address field of each pizzeria returned (essentially equivalent to what PhantomJS does by getting the inner text of span.address items) and wrote a few lines of Perl code to execute the pattern on the target page and print the addresses extracted on the screen (either on a &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt; terminal or a command prompt window on Windows).&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFYbYLNsw7ZhAcDW77rlzBiiGAqve-5w20L3fAMsDhyphenhyphenSUAm20FU6VntoSwJguzq69BCDdIsJU_TVNvAXlCDUNT9ozLE0a-ZPrn1ohkucp7rdH0SByXiWy2jpjrg-2Jyi2iX6uxdSxwA94j/s1600/Screen+Shot+2012-08-18+at+11.51.20+AM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;146&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFYbYLNsw7ZhAcDW77rlzBiiGAqve-5w20L3fAMsDhyphenhyphenSUAm20FU6VntoSwJguzq69BCDdIsJU_TVNvAXlCDUNT9ozLE0a-ZPrn1ohkucp7rdH0SByXiWy2jpjrg-2Jyi2iX6uxdSxwA94j/s400/Screen+Shot+2012-08-18+at+11.51.20+AM.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The resulting script was as simple as this:&lt;/div&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;use DEiXToBot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;my $agent = DEiXToBot-&amp;gt;new();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;get(&#39;http://www.yelp.com/search?find_desc=pizza&amp;amp;find_loc=94040&amp;amp;find_submit=Search&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;die &#39;Unable to access network&#39; unless&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;success;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;load_pattern(&#39;yelp_pizza.xml&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;build_dom();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;extract_content();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;my @addresses;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;for my $record (@{$agent-&amp;gt;records}) {&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;&amp;nbsp; &amp;nbsp; push @&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: x-small;&quot;&gt;addresses&lt;/span&gt;&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;, $$record[0];&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;}&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;print join(&quot;\n&quot;,@&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: x-small;&quot;&gt;addresses&lt;/span&gt;&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;);&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Just note that it scrapes only the first results page (just like in the PhantomJS example). We could easily parse through all the pages by following the &quot;Next&quot; page link but this is out of scope.&lt;/div&gt;
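The &quot;follow the Next link&quot; loop just mentioned looks roughly like this; sketched in Python with pluggable fetch/extract callbacks (all names hypothetical), plus a visited-set and a page cap so the scraper can never loop forever on a misbehaving site:

```python
def scrape_all_pages(first_url, fetch, extract_records, next_url,
                     max_pages=50):
    """Walk a paginated result list, collecting records from each page.
    fetch(url) returns a page; extract_records(page) returns its records;
    next_url(page) returns the next page's URL, or None on the last page."""
    records, url, seen = [], first_url, set()
    while url is not None and url not in seen and len(seen) != max_pages:
        seen.add(url)
        page = fetch(url)
        records.extend(extract_records(page))
        url = next_url(page)
    return records
```

In the DEiXToBot case, extract_records would run the XML pattern on the fetched page and next_url would locate the &quot;Next&quot; hyperlink.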
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
I would like to further look into PhantomJS and check the potential of using it (along with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;) as a pre-scraping step for hard JavaScript-enabled pages. In any case,&amp;nbsp;PhantomJS&amp;nbsp;is a handy tool that can be quite useful for a wide range of use cases. Generally speaking, web scraping can have countless &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;applications and uses&lt;/a&gt; and there are many remarkable tools out there. One of the best we believe is &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, so check it out! DEiXTo has helped quite a few people get their web data extraction tasks done easily and free!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2012/08/phantomjs-finding-pizza-using-yelp-and.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNP6BX_w4fZAYF8aRZ9EzOZRLvhC0k7XHWdZo_alj_pVV5PF_Nv6UVDQxPxWmUy88Q6ishlbmizyRVDn3REAWtsFJZdh0XAoGlcyqHgRdJyAFUpg48sGUhizxaeX9HGblDRScTJGnGYk4Y/s72-c/PhantomJS.jpeg" height="72" width="72"/><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-5942739521550432571</guid><pubDate>Fri, 09 Mar 2012 06:52:00 +0000</pubDate><atom:updated>2012-03-09T09:48:20.186+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">APIs</category><category domain="http://www.blogger.com/atom/ns#">Federated search</category><category domain="http://www.blogger.com/atom/ns#">Open Source</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category domain="http://www.blogger.com/atom/ns#">Search engines</category><title>DEiXTo powers ΟPEN-SME</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;We are happy to announce that &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; is going to power &lt;a href=&quot;http://opensme.eu/&quot; target=&quot;_blank&quot;&gt;ΟPEN-SME&lt;/a&gt;, an exciting EU-funded project that promotes software reuse among small and medium-sized software enterprises (SMEs). 
ΟPEN-SME is coordinated by the &lt;a href=&quot;http://www.computer-engineers.gr/&quot; target=&quot;_blank&quot;&gt;Greek Association of Computer Engineers&lt;/a&gt; and aims to develop a set of methodologies, tools and business models centered on SME Associations, which will enable software SMEs to effectively introduce &lt;a href=&quot;http://www.opensource.org/&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt; software reuse practices in their production processes.&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipeQInu7n2mLWnoTU_xRs4U0ihOZ27utSLPekfA8-YuqjoKMCcEdGiWqM1U-4e-OAWPXV3AuB37_C8sp5P2YnxnmM84FaP5fWG5K39DZ56EW-NSjn_YpgD_JRN2ks7E4NGKVyE4NQlaUNn/s1600/logo_el.gif&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;67&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipeQInu7n2mLWnoTU_xRs4U0ihOZ27utSLPekfA8-YuqjoKMCcEdGiWqM1U-4e-OAWPXV3AuB37_C8sp5P2YnxnmM84FaP5fWG5K39DZ56EW-NSjn_YpgD_JRN2ks7E4NGKVyE4NQlaUNn/s200/logo_el.gif&quot; width=&quot;100&quot; /&gt;&lt;/a&gt;&amp;nbsp; &amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;-based wrappers have been successfully deployed in order to enable the project&#39;s federated search engine, called OCEAN (developed by&amp;nbsp;the&amp;nbsp;&lt;a href=&quot;http://www.csd.auth.gr/en/index.php&quot; target=&quot;_blank&quot;&gt;Department of Informatics&lt;/a&gt;&amp;nbsp;of the Aristotle University of Thessaloniki), to simultaneously search, in real time, existing open source software search engines that do NOT offer&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;API&lt;/a&gt;&amp;nbsp;access (namely&amp;nbsp;&lt;a href=&quot;http://www.koders.com/&quot; target=&quot;_blank&quot;&gt;Koders&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;http://opensearch.krugle.org/home/home_page/&quot; target=&quot;_blank&quot;&gt;Krugle&lt;/a&gt;).&amp;nbsp;To achieve this, custom &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; code was written to submit the&amp;nbsp;user-specified queries to&amp;nbsp;the native websites and&amp;nbsp;scrape the first N results&amp;nbsp;returned into a suitable form.&lt;/div&gt;&lt;/div&gt;&lt;div style=&quot;text-align: 
justify;&quot;&gt;&lt;br /&gt;
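For readers curious what such a wrapper boils down to, the pattern is simply "submit the query, parse the result page, keep the first N hits". The OCEAN connectors were written in Perl; the sketch below is an illustrative Python equivalent using only the standard library, run here against an inline HTML snippet (the real engines' markup, and the `result` class name, are invented for the example).

```python
# Illustrative "first N results" scraper: parse anchors marked as result
# links from a search-results page. Stdlib only; the markup is made up.
from html.parser import HTMLParser

class ResultParser(HTMLParser):
    """Collects (title, url) pairs from anchors with class="result"."""
    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == "result":
            self._href = a.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append(("".join(self._text).strip(), self._href))
            self._href = None

def first_n_results(html, n):
    parser = ResultParser()
    parser.feed(html)
    return parser.results[:n]

page = """
<ul>
  <li><a class="result" href="/code/1">Matrix library</a></li>
  <li><a class="result" href="/code/2">CSV parser</a></li>
  <li><a class="result" href="/code/3">HTTP client</a></li>
</ul>
"""
print(first_n_results(page, 2))
```

A real connector would first fetch the page for the user's query and then feed the response body to the same parser.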
&amp;nbsp; &amp;nbsp; We are really glad that we are participating in this challenging and innovative project and we hope that &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; will help ΟPEN-SME towards implementing its goals. So, if you are looking for a web scraping framework to power your&amp;nbsp;aggregator&amp;nbsp;or search engine, please do not hesitate to &lt;a href=&quot;http://deixto.com/contact.php&quot; target=&quot;_blank&quot;&gt;contact us&lt;/a&gt;!&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/03/deixto-powers-pen-sme.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipeQInu7n2mLWnoTU_xRs4U0ihOZ27utSLPekfA8-YuqjoKMCcEdGiWqM1U-4e-OAWPXV3AuB37_C8sp5P2YnxnmM84FaP5fWG5K39DZ56EW-NSjn_YpgD_JRN2ks7E4NGKVyE4NQlaUNn/s72-c/logo_el.gif" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6577860543994242397</guid><pubDate>Mon, 05 Mar 2012 09:07:00 +0000</pubDate><atom:updated>2012-03-09T09:04:37.042+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>Uses and applications of web scraping</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;Some people wonder what the uses of web scraping might be. Well, your imagination is the only limit (along with the copyright notices perhaps). There is a huge wealth of data out there and many&amp;nbsp;believe&amp;nbsp;that the open Web is a real goldmine. 
So, web data extraction tools and &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; in particular could help you unlock this treasure and give birth to innovations, applications and&amp;nbsp;new&amp;nbsp;ideas.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Public institutions, companies and organizations,&amp;nbsp;entrepreneurs,&amp;nbsp;professionals&amp;nbsp;as well as mere citizens and users generate an&amp;nbsp;enormous&amp;nbsp;amount&amp;nbsp;of information&amp;nbsp;every single day. The question is: how effectively is it being used?&amp;nbsp;To this end, web&amp;nbsp;content extraction&amp;nbsp;can prove a valuable ally. Along with data mining, it has much to offer in every field you can imagine. The following are only some of the uses of web scraping:&lt;/div&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;collect properties from real estate listings&lt;/li&gt;
&lt;li&gt;scrape retailer sites on a daily basis&lt;/li&gt;
&lt;li&gt;extract offers and discounts from deal-of-the-day websites&lt;/li&gt;
&lt;li&gt;gather data for hotels and vacation rentals&lt;/li&gt;
&lt;li&gt;scrape job postings and internships&lt;/li&gt;
&lt;li&gt;crawl forums and social sites so as to enable analysis and post-processing&amp;nbsp;of their rich data&lt;/li&gt;
&lt;li&gt;power aggregators and product search engines&lt;/li&gt;
&lt;li&gt;monitor your online reputation and check what is being said about you or your brand&lt;/li&gt;
&lt;li&gt;quickly populate product catalogues with full specifications&lt;/li&gt;
&lt;li&gt;monitor prices of the competition&lt;/li&gt;
&lt;li&gt;scrape the content of digital libraries in order to transform it into suitable, structured forms&lt;/li&gt;
&lt;li&gt;collect and aggregate government and public data&lt;/li&gt;
&lt;li&gt;search (in real time) bibliographic databases and online sources that don&#39;t offer an API, thus powering &lt;a href=&quot;http://deixto.blogspot.com/2012/01/federated-searching-dbwiz.html&quot; target=&quot;_blank&quot;&gt;federated&amp;nbsp;search engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;gather educational material and information from traditional higher education subjects as well as real-life contexts, in order to help the contemporary learner&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://deixto.blogspot.com/2011/12/can-deixto-power-mobile-apps-yes-it-can.html&quot; target=&quot;_blank&quot;&gt;power mobile applications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;help build geolocation apps (e.g. &lt;a href=&quot;http://deixto.blogspot.com/2012/01/geo-location-data-yahoo-placefinder.html&quot; target=&quot;_blank&quot;&gt;extracting addresses available on web pages and using their coordinates to build meaningful maps with points of interest&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;prepare large, focused datasets for scientific tasks (e.g. data mining)&lt;/li&gt;
&lt;li&gt;extract and summarize large volumes of text (e.g. &lt;a href=&quot;http://deixto.com/newegg.php&quot; target=&quot;_blank&quot;&gt;summarizing product reviews&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&amp;lt;your scraping task goes here!&amp;gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; This list can grow very long. There are countless use cases and potential scenarios, either business-oriented or non-profit. As far as the &lt;a href=&quot;http://deixto.blogspot.com/2011/12/robotstxt-access-restrictions.html&quot; target=&quot;_blank&quot;&gt;access and copyright restrictions&lt;/a&gt; are concerned, it is a really significant issue that has raised a lot of discussion and controversy. However, the opinion that seems to be gaining ground is that (well-intentioned) web scraping is legal since the data is publicly and freely available on the Web. So, let your creativity and imagination loose;&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; can probably help you to achieve your scraping-based project goals. We would be more than happy to &lt;a href=&quot;http://deixto.com/contact.php&quot; target=&quot;_blank&quot;&gt;hear from you&lt;/a&gt;.&lt;/div&gt;&lt;br /&gt;
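To make the list above concrete, here is a minimal sketch of the "records to rows" transformation behind several of these use cases (price monitoring, populating catalogues): repeated HTML fragments become structured, tabular data. The markup, product names and prices are invented for the example; a real job would of course fetch live pages.

```python
# Turn repeated HTML records into structured CSV rows.
# The snippet below stands in for a fetched retailer page.
import csv, io, re

html = """
<div class="product"><span class="name">Mouse</span><span class="price">9.90</span></div>
<div class="product"><span class="name">Keyboard</span><span class="price">24.50</span></div>
"""

# One regex per record: capture the name and the price fields.
pattern = re.compile(r'class="name">([^<]+)</span><span class="price">([\d.]+)')
rows = [(name, float(price)) for name, price in pattern.findall(html)]

# Emit the structured result as CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "price"])
writer.writerows(rows)
print(buf.getvalue())
```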
&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/03/uses-and-applications-of-web-scraping.html</link><author>noreply@blogger.com (kntonas)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-8274736727794590973</guid><pubDate>Sun, 19 Feb 2012 06:08:00 +0000</pubDate><atom:updated>2012-02-22T22:49:23.616+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Digital Libraries</category><category domain="http://www.blogger.com/atom/ns#">Dublin Core</category><category domain="http://www.blogger.com/atom/ns#">Europeana</category><category domain="http://www.blogger.com/atom/ns#">Linked Data</category><category domain="http://www.blogger.com/atom/ns#">OAI-PMH</category><title>Linked Data &amp; DEiXTo</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;As explained in a&amp;nbsp;&lt;a href=&quot;http://deixto.blogspot.com/2012/01/open-archives-digital-libraries.html&quot; target=&quot;_blank&quot;&gt;previous post&lt;/a&gt;,&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;&amp;nbsp;can scrape the content of digital libraries, archives and multimedia collections lacking an &lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;API&lt;/a&gt;&amp;nbsp;and enable their metadata&amp;nbsp;transformation (through post-processing and&amp;nbsp;custom Perl code)&amp;nbsp;to&amp;nbsp;&lt;a href=&quot;http://dublincore.org/&quot; target=&quot;_blank&quot;&gt;Dublin Core&lt;/a&gt;&amp;nbsp;and subsequently in&amp;nbsp;&lt;a href=&quot;http://www.openarchives.org/pmh/&quot; target=&quot;_blank&quot;&gt;OAI-PMH&lt;/a&gt;&amp;nbsp;or another suitable form, e.g.&amp;nbsp;&lt;a href=&quot;http://www.europeana.eu/portal/&quot; 
target=&quot;_blank&quot;&gt;Europeana&lt;/a&gt;&amp;nbsp;Semantic Elements (&lt;a href=&quot;http://www.europeana.eu/schemas/ese/&quot; target=&quot;_blank&quot;&gt;ESE&lt;/a&gt;).&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Meanwhile,&amp;nbsp;the Web has become a dynamic collaboration platform that allows everyone to meet, read and more importantly write. Thus, it steadily approaches the vision of &lt;a href=&quot;http://www.w3.org/People/Berners-Lee/&quot; target=&quot;_blank&quot;&gt;Tim Berners-Lee&lt;/a&gt; (the inventor of the World Wide Web): the &lt;a href=&quot;http://linkeddata.org/&quot; target=&quot;_blank&quot;&gt;Linked Data&lt;/a&gt; Web, a place where related data are linked and information is represented in a more structured and easily machine-processable way.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; target=&quot;_blank&quot;&gt;Linked Data&lt;/a&gt; refers to a set of best practices for publishing and connecting structured data on the Web. Its key technologies are &lt;a href=&quot;http://en.wikipedia.org/wiki/Uniform_resource_identifier&quot; target=&quot;_blank&quot;&gt;URIs&lt;/a&gt; (a generic method to identify resources on the Internet), the&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol&quot; target=&quot;_blank&quot;&gt;Hypertext Transfer Protocol&lt;/a&gt;&amp;nbsp;(HTTP) and &lt;a href=&quot;http://www.w3.org/TR/rdf-primer/&quot; target=&quot;_blank&quot;&gt;RDF&lt;/a&gt; (a data model and a general method for conceptual description of things in the real world). It is an exciting topic of interest and it&#39;s expected to make great progress in the next few years. A video that does a nice job of explaining what Linked Open Data is all about can be found here: &lt;a href=&quot;http://vimeo.com/36752317&quot;&gt;http://vimeo.com/36752317&lt;/a&gt;&lt;br /&gt;
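To make the RDF part of this tangible: publishing a scraped record as Linked Data essentially means giving the item an HTTP URI and expressing its fields as triples. The sketch below emits hand-rolled N-Triples using Dublin Core predicates; the item URI and field values are made up for the example, and a real pipeline would use a proper RDF library rather than string formatting.

```python
# Minimal sketch: serialize one scraped metadata record as N-Triples.
# Subject = the item's HTTP URI; predicates = Dublin Core elements.
DC = "http://purl.org/dc/elements/1.1/"

def to_ntriples(item_uri, record):
    lines = []
    for field, value in record.items():
        # Escape backslashes and quotes per N-Triples literal rules.
        escaped = value.replace('\\', '\\\\').replace('"', '\\"')
        lines.append(f'<{item_uri}> <{DC}{field}> "{escaped}" .')
    return "\n".join(lines)

record = {"title": "Psaltic manuscript No. 124", "creator": "Unknown"}
print(to_ntriples("http://example.org/items/124", record))
```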
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYYm9QHpH6mJUNpo21M9oPz1-6-qpEZPNjfYhHwT0t_q3TjBA5f8fWvmSUsSu8O5lSbCMYdjEQNG3TYl4tv6U_qmgpI6M716vgryDcoEJ9PVnLNmb0dTl7QTgrspuB4QKgMXf7uKyC6V0/s1600/lod-datasets_2009-07-14_cropped.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;297&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYYm9QHpH6mJUNpo21M9oPz1-6-qpEZPNjfYhHwT0t_q3TjBA5f8fWvmSUsSu8O5lSbCMYdjEQNG3TYl4tv6U_qmgpI6M716vgryDcoEJ9PVnLNmb0dTl7QTgrspuB4QKgMXf7uKyC6V0/s400/lod-datasets_2009-07-14_cropped.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;Over the last decade, the&amp;nbsp;Open Archives Initiative Protocol for Metadata Harvesting (&lt;a href=&quot;http://www.openarchives.org/pmh/&quot; target=&quot;_blank&quot;&gt;OAI-PMH&lt;/a&gt;)&amp;nbsp;has become the de facto standard for metadata exchange in digital libraries and it&#39;s playing an increasingly important role.&amp;nbsp;However, it has two major drawbacks: it does not make its resources accessible via dereferencable URIs and it provides only restricted means of selective access to metadata.&amp;nbsp;Therefore, there is a strong need for&amp;nbsp;efficient&amp;nbsp;tools that would allow&amp;nbsp;metadata repositories to expose their content&amp;nbsp;according to the&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;Linked Data&lt;/span&gt;&amp;nbsp;&lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; target=&quot;_blank&quot;&gt;guidelines&lt;/a&gt;. This would make&amp;nbsp;digitized items and media objects accessible via HTTP URIs and&amp;nbsp;query able&amp;nbsp;via the&amp;nbsp;&lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-query/&quot; target=&quot;_blank&quot;&gt;SPARQL&lt;/a&gt;&amp;nbsp;protocol.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.linkedin.com/in/bernhardhaslhofer&quot; target=&quot;_blank&quot;&gt;Dr&amp;nbsp;Haslhofer&lt;/a&gt; has performed significant research and work towards this direction. He has&amp;nbsp;developed (among others) the &lt;a href=&quot;http://www.mediaspaces.info/tools/oai2lod/&quot; target=&quot;_blank&quot;&gt;OAI2LOD Server&lt;/a&gt;&amp;nbsp;based on the &lt;a href=&quot;http://sourceforge.net/projects/d2rq-map/&quot; target=&quot;_blank&quot;&gt;&lt;span class=&quot;s1&quot;&gt;D2R Server&lt;/span&gt;&lt;/a&gt; implementation and wrote the &lt;a href=&quot;https://github.com/behas/ese2edm&quot; target=&quot;_blank&quot;&gt;ESE2EDM&lt;/a&gt;&amp;nbsp;converter, a collection of ruby scripts that can convert given&amp;nbsp;XML-based ESE&amp;nbsp;source files into the RDF-based Europeana Data Model (&lt;a href=&quot;http://pro.europeana.eu/web/guest/edm-documentation&quot; target=&quot;_blank&quot;&gt;EDM&lt;/a&gt;). These remarkable tools could turn out very useful for making large volumes of information Linked-Data ready, with all the advantages this brings.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Linked&amp;nbsp;Open&amp;nbsp;Data can change the computer world as we know it. So, there is a lot of potential in combining &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; with Linked Data technologies. Their blend could eventually produce an innovative and useful outcome. Many already believe&amp;nbsp;that Linked Data is the next big thing. Time will tell. Meanwhile,&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; could definitely help you&amp;nbsp;generate structured data in a variety of formats from unstructured HTML pages, whether&amp;nbsp;your ultimate goal is&amp;nbsp;Linked Data or not.&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/02/linked-data-deixto.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYYm9QHpH6mJUNpo21M9oPz1-6-qpEZPNjfYhHwT0t_q3TjBA5f8fWvmSUsSu8O5lSbCMYdjEQNG3TYl4tv6U_qmgpI6M716vgryDcoEJ9PVnLNmb0dTl7QTgrspuB4QKgMXf7uKyC6V0/s72-c/lod-datasets_2009-07-14_cropped.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2194704879425998273</guid><pubDate>Sat, 11 Feb 2012 07:42:00 +0000</pubDate><atom:updated>2012-02-11T14:19:33.144+02:00</atom:updated><title>DEiXTo components clarified</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;From the emails and feedback received, it seems that many people get a bit confused about the&amp;nbsp;utility&amp;nbsp;and functionality of the &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; GUI tool compared to the &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; command 
line executor (CLE). DEiXToBot is even more confusing for quite a few users. So, let&#39;s clarify things.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; The GUI tool&amp;nbsp;is freeware (available&amp;nbsp;at no cost but without any source code, at least yet) and it allows you to visually build and execute extraction rules for web pages of interest with point and click convenience. It offers you an embedded web browser and a friendly graphical interface so that you can highlight an element/ record instance as the mouse moves over it. The GUI tool is a Windows-only application that harnesses Internet Explorer&#39;s HTML parser and render engine.&amp;nbsp;&amp;nbsp;It is worth noting that it can support simple&amp;nbsp;&lt;a href=&quot;http://deixto.blogspot.com/2011/12/cooperating-deixto-agents.html&quot; target=&quot;_blank&quot;&gt;cooperative extraction scenarios&lt;span id=&quot;goog_1052480954&quot;&gt;&lt;/span&gt;&lt;/a&gt;&amp;nbsp;as well as periodic, scheduled execution through batch files and the Windows Task Scheduler.&amp;nbsp;Perhaps its main drawback is that it can execute just one pattern on a page although for several cases (maybe for the majority) one and only extraction rule is enough to get the job done.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; On the other hand, the command line executor, or CLE for short, is implemented in Perl and it is freely distributed under the &lt;a href=&quot;http://www.gnu.org/licenses/gpl.html&quot; target=&quot;_blank&quot;&gt;GNU General Public License&lt;/a&gt;&amp;nbsp;v3, thus its source code is included. Its purpose is to execute wrapper project files (.wpf) that have previously been created with the GUI tool. It runs on a DOS prompt window or on a Linux/ Mac terminal. 
&amp;nbsp;Besides the code though, we have built two standalone executables so that you can run CLE either on a Windows or a &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt;&amp;nbsp;machine&amp;nbsp;without having Perl or any prerequisite modules&amp;nbsp;installed. CLE is faster, offers more output formats and has some additional features such as an efficient &lt;a href=&quot;http://deixto.wikispaces.com/message/view/home/32988104&quot; target=&quot;_blank&quot;&gt;post-processing mechanism&lt;/a&gt; and database support.&amp;nbsp;However, it shares the same shortcoming as the GUI tool: it&amp;nbsp;supports&amp;nbsp;just one pattern on a page.&amp;nbsp;Finally, it relies on DEiXToBot, a &quot;homemade&quot; package that facilitates the&amp;nbsp;execution of&amp;nbsp;GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; generated wrappers.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;DEiXToBot is&amp;nbsp;the third and probably the most powerful and well-crafted software component of the DEiXTo scraping suite and it is available under the GPL v3 license. It is a Perl module based on &lt;a href=&quot;http://search.cpan.org/~kntonas/WWW-Mechanize-Sleepy-0.7/Sleepy.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize::Sleepy&lt;/a&gt;, a handy web browser Perl object, and several other CPAN modules. It allows extensive customization and tailor-made solutions since it facilitates the combination of &lt;i&gt;multiple&lt;/i&gt; extraction rules/ patterns as well as the post-processing of their results through custom code. Therefore, it can deal with complex cases and cover more advanced web scraping needs. 
But it requires programming skills in order to use it.&amp;nbsp;&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; The bottom line is that DEiXToBot is the essence of our long experience. The GUI tool might be more suitable for most every-day users (due to its visual convenience) but when things get&amp;nbsp;difficult or the situation requires a more&amp;nbsp;advanced&amp;nbsp;solution (e.g. scheduled or on-demand execution and coordination of multiple wrappers on a &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt; server), a customized DEiXToBot-based script is your choice. You can use the GUI tool first to create the necessary patterns and then deploy a Perl script that uses them to extract structured data from the pages of the target website. So, if you are familiar with Perl, you should not find it very hard to write your first &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;deixto&lt;/a&gt;-based spider/ crawler!&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/02/deixto-components-clarified.html</link><author>noreply@blogger.com (kntonas)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-1491882698361958417</guid><pubDate>Sat, 28 Jan 2012 00:03:00 +0000</pubDate><atom:updated>2013-05-05T20:09:21.614+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">APIs</category><category domain="http://www.blogger.com/atom/ns#">dbWiz</category><category domain="http://www.blogger.com/atom/ns#">Federated search</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category domain="http://www.blogger.com/atom/ns#">Search engines</category><category domain="http://www.blogger.com/atom/ns#">Z39.50</category><title>Federated searching &amp; dbWiz</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: 
left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Nowadays, most university and college students, professors&amp;nbsp;as well as&amp;nbsp;researchers&amp;nbsp;are increasingly&amp;nbsp;seeking&amp;nbsp;information&amp;nbsp;and&amp;nbsp;finding&amp;nbsp;answers&amp;nbsp;on the open Web. Google has become the dominant search tool for&amp;nbsp;almost&amp;nbsp;everyone. Its popularity is enormous; no need to wonder or analyze why: it has a simple, effective interface and returns fast, accurate results.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHg9VMKqJlf_0Bxkgom1i7uFgbV2e8Yr-Lh7z_zRonLooCykRdYdFbJDJwfO-f9NTYbIQCh7uNNlG1agi618zXgrvVYVqaLT1vPW9kRKhB0Kv9kf4ORaNKvfLQRh31plCAdtPSH-bmU8VZ/s1600/logo3w.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;68&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHg9VMKqJlf_0Bxkgom1i7uFgbV2e8Yr-Lh7z_zRonLooCykRdYdFbJDJwfO-f9NTYbIQCh7uNNlG1agi618zXgrvVYVqaLT1vPW9kRKhB0Kv9kf4ORaNKvfLQRh31plCAdtPSH-bmU8VZ/s200/logo3w.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; However, libraries, in their effort to win some&amp;nbsp;patrons&amp;nbsp;back, have tried to offer a decent searching alternative by developing a new model: federated search engines. &lt;a href=&quot;http://en.wikipedia.org/wiki/Federated_search&quot; target=&quot;_blank&quot;&gt;Federated searching&lt;/a&gt; (also known as metasearch or cross searching) allows users to simultaneously search multiple web resources and&amp;nbsp;subscription-based bibliographic databases from a single interface. To achieve that, parallel processes are executed in real time and retrieve results from each separate source. Then, the returned results&amp;nbsp;are grouped together and presented&amp;nbsp;to the user&amp;nbsp;in a unified way.&lt;/div&gt;
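The flow just described can be sketched in a few lines: query every source in parallel, then merge the partial answers into one unified list. The two "databases" below are stand-in stub functions (real connectors would speak Z39.50, call an API, or scrape a website); Python is used here purely for illustration.

```python
# Federated search in miniature: fan out the query to every source in
# parallel threads, then merge the partial result lists for display.
from concurrent.futures import ThreadPoolExecutor

def search_catalog(query):      # stub for one bibliographic database
    return [f"catalog: {query} handbook"]

def search_eprints(query):      # stub for an e-print repository
    return [f"eprints: {query} survey"]

def federated_search(query, sources):
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = list(pool.map(lambda s: s(query), sources))
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged)       # one unified, ordered result list

print(federated_search("perl", [search_catalog, search_eprints]))
```

In a production engine each source would also carry a timeout, so one slow database cannot hold up the whole unified response.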
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; There are broadly two mechanisms for pulling data from the target sources: either through an &lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;Application Programming Interface&lt;/a&gt; (API) or via &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_scraping&quot; target=&quot;_blank&quot;&gt;scraping&lt;/a&gt; the native web interface/ site of each database.&amp;nbsp;The first method is undoubtedly better, but very often a search API is not available. In such cases, &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_bot&quot; target=&quot;_blank&quot;&gt;web robots&lt;/a&gt; (or agents) come into play and capture information of interest, typically by simulating a human browsing through the target webpages. Especially in academia, there are numerous online bibliographic databases. Some of them offer &lt;a href=&quot;http://en.wikipedia.org/wiki/Z39.50&quot;&gt;Z39.50&lt;/a&gt;&amp;nbsp;or API access. However, a large number still do not provide protocol-based search functionality. Thus, scraping techniques should be deployed for those (unless the vendor disallows bots).&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VRENT1edzgK8p7nm-RDP41W32dWBTdb6n9YvvaKg8Jui2dlCgIMwQjC0IsTxoUaK1uCp5HrMXOeWecC8xWsafH4XD048lgVw2XI1pLbzejU_wmLFJXfdpZl7MLKH0t3H0C1RSc0ua1N5/s1600/dbwiz.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VRENT1edzgK8p7nm-RDP41W32dWBTdb6n9YvvaKg8Jui2dlCgIMwQjC0IsTxoUaK1uCp5HrMXOeWecC8xWsafH4XD048lgVw2XI1pLbzejU_wmLFJXfdpZl7MLKH0t3H0C1RSc0ua1N5/s1600/dbwiz.png&quot; /&gt;&lt;/a&gt;&amp;nbsp; &amp;nbsp;When starting my programming adventure with &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; back in 2006, in the context of my former full-time job at the &lt;a href=&quot;http://www.lib.uom.gr/index.php?lang=utf-8&quot; target=&quot;_blank&quot;&gt;Library of University of Macedonia&lt;/a&gt;&amp;nbsp;(Thessaloniki,&amp;nbsp;Greece), I had the chance (and luck) to run across &lt;a href=&quot;http://researcher.sfu.ca/dbwiz&quot; target=&quot;_blank&quot;&gt;dbWiz&lt;/a&gt;, a remarkable &lt;a href=&quot;http://www.opensource.org/&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt;, federated search tool developed by the &lt;a href=&quot;http://www.lib.sfu.ca/&quot; target=&quot;_blank&quot;&gt;Simon Fraser University&amp;nbsp;(SFU)&amp;nbsp;Library&lt;/a&gt;&amp;nbsp;in Canada. I was fascinated with Perl as well as dbWiz&#39;s internal design and implementation. So, this is how I met and fell in love with Perl.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; dbWiz offered a friendly and usable admin interface that allowed you to create search categories and select from a global list of resources which databases would be active and searchable. If you had to add a new resource though, you would have to write your own plugin (Perl knowledge and programming skills were required). Some of the dbWiz search plugins were based upon Z39.50 whereas others (the majority) relied on &lt;a href=&quot;http://en.wikipedia.org/wiki/Regular_expression&quot; target=&quot;_blank&quot;&gt;regular expressions&lt;/a&gt; and &lt;a href=&quot;http://search.cpan.org/~jesse/WWW-Mechanize-1.71/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt;&amp;nbsp;(a handy web browser Perl object).&lt;br /&gt;
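The plugin idea can be illustrated with a tiny registry: each resource contributes a search function behind a common interface, and the engine dispatches by source name. dbWiz's actual plugins were Perl classes; this Python analogue (with an invented "demo-opac" source) only shows the shape of the mechanism.

```python
# A miniature plugin registry: sources register a search function under a
# name, and the engine only ever calls the common interface.
PLUGINS = {}

def plugin(name):
    """Decorator that registers a search function for one resource."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register

@plugin("demo-opac")
def demo_opac(query):
    # A real plugin would submit the query to the OPAC and scrape the reply.
    return [f"demo-opac result for {query!r}"]

def search(source, query):
    return PLUGINS[source](query)

print(search("demo-opac", "linked data"))
```

Adding a new resource then means writing one more decorated function, which is exactly the chore the DEiXTo-based module below was meant to ease.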
&amp;nbsp; &amp;nbsp; The federated search engine I developed while working&amp;nbsp;at the University of Macedonia (2006-2008)&amp;nbsp;was named &quot;&lt;a href=&quot;http://pantou.lib.uom.gr/modperl/dbwiz2.pl&quot; target=&quot;_blank&quot;&gt;Pantou&lt;/a&gt;&quot; and became a valuable everyday tool for students and professors of the University. The results of this work &lt;a href=&quot;http://www.lib.uom.gr/images/stories/pdf/dimosieuseis/federated_search.pdf&quot; target=&quot;_blank&quot;&gt;were presented&lt;/a&gt; at the&amp;nbsp;&lt;a href=&quot;http://libconf2007.unipi.gr/index.php?lang=en&quot; target=&quot;_blank&quot;&gt;16th Panhellenic Academic Libraries Conference &lt;/a&gt;(Piraeus, 1-3 October 2007). Unfortunately, its maintenance stopped at the end of 2010 due to the economic crisis and severe cuts in funding. Consequently, a few months later some of its plugins started falling apart.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH15hnfEDbdUdaJRBwm6WgHAmAP-1dNLEo_lfKgPDmkv2q3xrYCRr8m7Za1ONl6qt7ADETC0_WRi4J5ZnBFi5P1Xv81cCaP7LSG1c1BGiI97OYjdyI4lXunTaKsUjlXtxILE_g7LUD8zqE/s1600/ScreenShot_Pantou.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;302&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH15hnfEDbdUdaJRBwm6WgHAmAP-1dNLEo_lfKgPDmkv2q3xrYCRr8m7Za1ONl6qt7ADETC0_WRi4J5ZnBFi5P1Xv81cCaP7LSG1c1BGiI97OYjdyI4lXunTaKsUjlXtxILE_g7LUD8zqE/s400/ScreenShot_Pantou.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Generally, delving into dbWiz taught me a lot of lessons such as web development, Perl programming and &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt; administration. I loved it! Meanwhile, in my effort to improve the relatively hard and tedious procedure of creating new dbWiz plugins, I put into practice an early version of GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (which was my MSc thesis being fulfilled in the same period at the &lt;a href=&quot;http://www.auth.gr/home/index_en.html&quot; target=&quot;_blank&quot;&gt;Aristotle University of Thessaloniki&lt;/a&gt;). The result was a &lt;a href=&quot;http://lib-code.lib.sfu.ca/projects/dbwiz/browser/trunk/DBWIZ_search/lib/DBWIZ/Search/Internet/DEiXTo.pm?rev=691&quot; target=&quot;_blank&quot;&gt;new Perl module&lt;/a&gt; that allowed the execution of &lt;a href=&quot;http://www.w3.org/DOM/&quot; target=&quot;_blank&quot;&gt;W3C DOM&lt;/a&gt;-based, XML patterns (built with the GUI DEiXTo) inside dbWiz and eliminated, at least to a large extent, the need for heavy use of regular expressions. That module, which was the first predecessor of today&#39;s DEiXToBot package,&amp;nbsp;&lt;a href=&quot;http://lib-code.lib.sfu.ca/projects/dbwiz/browser/trunk/DBWIZ_search/lib/DBWIZ/Search/Internet/DEiXTo.pm?rev=691&quot; target=&quot;_blank&quot;&gt;got included in the official dbWiz distribution&lt;/a&gt; after contacting the dbWiz development team in 2007. Unfortunately, SFU Library &lt;a href=&quot;http://lib-forums.lib.sfu.ca/viewtopic.php?f=1&amp;amp;t=329&quot; target=&quot;_blank&quot;&gt;ended the support&lt;/a&gt; and development of dbWiz in 2010.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Looking back, I can now say with quite a bit of certainty that &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (more than ever before) can power federated search tools and help them extend their reach to previously inaccessible resources. As far as the search engine wars are concerned, Google seems to have triumphed, but nobody can say for sure what will happen in the years to come. Time will tell.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2012/01/federated-searching-dbwiz.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHg9VMKqJlf_0Bxkgom1i7uFgbV2e8Yr-Lh7z_zRonLooCykRdYdFbJDJwfO-f9NTYbIQCh7uNNlG1agi618zXgrvVYVqaLT1vPW9kRKhB0Kv9kf4ORaNKvfLQRh31plCAdtPSH-bmU8VZ/s72-c/logo3w.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-7092960453956243431</guid><pubDate>Thu, 19 Jan 2012 22:02:00 +0000</pubDate><atom:updated>2012-01-28T19:06:19.669+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data transformations</category><category domain="http://www.blogger.com/atom/ns#">Digital Libraries</category><category domain="http://www.blogger.com/atom/ns#">DSpace</category><category domain="http://www.blogger.com/atom/ns#">Dublin Core</category><category domain="http://www.blogger.com/atom/ns#">Institutional repositories</category><category domain="http://www.blogger.com/atom/ns#">Music Library Lilian Voudouri</category><category domain="http://www.blogger.com/atom/ns#">OAI-PMH</category><category domain="http://www.blogger.com/atom/ns#">Open Archives</category><category domain="http://www.blogger.com/atom/ns#">openarchives.gr</category><title>Open Archives &amp; Digital Libraries</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxu5vimSNanRWKq2llpJaAsEDq8iaJftws_hUKr48FdWYA8vPs2u-uGfOEf1lcsYZJNkcVA81t9JHqh4XJr0QVE3fLMhIyd4E7rM0mlD4Ts9CupqkGDh2GacRRy7sOTfsMwrpxmmOLhpzD/s1600/OA100.gif&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;br /&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxu5vimSNanRWKq2llpJaAsEDq8iaJftws_hUKr48FdWYA8vPs2u-uGfOEf1lcsYZJNkcVA81t9JHqh4XJr0QVE3fLMhIyd4E7rM0mlD4Ts9CupqkGDh2GacRRy7sOTfsMwrpxmmOLhpzD/s1600/OA100.gif&quot; /&gt;&lt;/a&gt;The &lt;a href=&quot;http://www.openarchives.org/&quot; target=&quot;_blank&quot;&gt;Open Archives Initiative&lt;/a&gt; (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and &lt;a href=&quot;http://en.wikipedia.org/wiki/Institutional_repository&quot; target=&quot;_blank&quot;&gt;institutional repository&lt;/a&gt; movements and its cornerstone is the Protocol for Metadata Harvesting (&lt;a href=&quot;http://www.openarchives.org/OAI/openarchivesprotocol.html&quot; target=&quot;_blank&quot;&gt;OAI-PMH&lt;/a&gt;), which allows data providers/ repositories to expose their content in a structured format. A client can then make OAI-PMH service requests over HTTP to harvest that metadata.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://openarchives.gr/&quot;&gt;openarchives.gr&lt;/a&gt; is a great federated search engine harvesting &lt;i&gt;57&lt;/i&gt; Greek digital libraries and institutional repositories (as of January 2012). It currently provides access to almost half a million(!) documents (mainly undergraduate theses and Master/ PhD dissertations) and its index gets updated on a daily basis. It began its operation back in 2006, designed and implemented by &lt;a href=&quot;http://vbanos.gr/&quot; target=&quot;_blank&quot;&gt;Vangelis Banos&lt;/a&gt;, but since May 2011 it has been hosted, managed and co-developed by the &lt;a href=&quot;http://www.ekt.gr/&quot; target=&quot;_blank&quot;&gt;National Documentation Centre&lt;/a&gt; (EKT).
What makes this amazing search tool even more remarkable is that it is built entirely on &lt;a href=&quot;http://www.opensource.org/&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt;/ &lt;a href=&quot;http://www.gnu.org/philosophy/free-sw.html&quot; target=&quot;_blank&quot;&gt;free software&lt;/a&gt;.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7NkBLew1wMxklRRTAVer_fhN9QkeaCUhcy9tkO48U7D7meGZxbwskACiIijqcBa4imGjeYn6ANCM0_uS-lqC1em3oBi_itWBgNR1L-0qlpX7xc46wfNRZp5poru1CBiCqtzUUbsCSgR4P/s1600/logo_en.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7NkBLew1wMxklRRTAVer_fhN9QkeaCUhcy9tkO48U7D7meGZxbwskACiIijqcBa4imGjeYn6ANCM0_uS-lqC1em3oBi_itWBgNR1L-0qlpX7xc46wfNRZp5poru1CBiCqtzUUbsCSgR4P/s1600/logo_en.png&quot; /&gt;&lt;/a&gt;&amp;nbsp; &amp;nbsp; A tricky point that needs some clarification is that when a user searches &lt;a href=&quot;http://openarchives.gr/&quot;&gt;openarchives.gr&lt;/a&gt;, the search is not submitted in real time to the target sources. Instead, it is performed locally on the openarchives.gr server, where full copies of the repositories/ libraries are stored (and updated at regular time intervals).&lt;br /&gt;
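Mechanically, the periodic retrieval just described boils down to issuing simple HTTP requests against each repository's OAI-PMH endpoint and walking the XML that comes back. Here is a minimal sketch in Python (the production harvester is written in Perl, and the endpoint URL in the test is a placeholder, not a real repository):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses and Dublin Core metadata records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(endpoint, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )
    return endpoint + "?" + query

def titles_from_response(xml_document):
    """Extract the dc:title values from a ListRecords response document."""
    root = ET.fromstring(xml_document)
    return [el.text for el in root.iter("{" + NS["dc"] + "}title")]

def harvest_titles(endpoint):
    """Fetch one page of records and return the titles they contain."""
    with urllib.request.urlopen(list_records_url(endpoint)) as response:
        return titles_from_response(response.read())
```

A real harvester would also follow the resumptionToken element that OAI-PMH uses to paginate large result sets; that loop is omitted here for brevity.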
&amp;nbsp; &amp;nbsp; The majority of the sources searched by openarchives.gr are OAI-PMH-compliant repositories (such as &lt;a href=&quot;http://www.dspace.org/&quot; target=&quot;_blank&quot;&gt;DSpace&lt;/a&gt; or &lt;a href=&quot;http://www.eprints.org/&quot; target=&quot;_blank&quot;&gt;EPrints&lt;/a&gt;). Therefore, their data are periodically retrieved via their OAI-PMH endpoints. However, it is worth mentioning that non-OAI-PMH digital libraries have also been included in its database. This was made possible through scraping their websites with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; and transforming their metadata into &lt;a href=&quot;http://dublincore.org/&quot; target=&quot;_blank&quot;&gt;Dublin Core&lt;/a&gt;. So, more than &lt;i&gt;16,000&lt;/i&gt; records from &lt;i&gt;6&lt;/i&gt; significant online digital libraries (such as the &lt;a href=&quot;http://www.lykeionellinidon.gr/lyceumportal/&quot; target=&quot;_blank&quot;&gt;Lyceum Club of Greek Women&lt;/a&gt; and the &lt;a href=&quot;http://digma.mmb.org.gr/Default.aspx&quot; target=&quot;_blank&quot;&gt;Music Library&lt;/a&gt; of Greece “Lilian Voudouri”) were inserted into openarchives.gr with the use of DEiXTo wrappers and custom Perl code.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Finally, it is known that digital collections have flourished over the last few years and enjoy growing popularity. However, most of them do NOT provide their contents in OAI-PMH or another appropriate metadata format.
Actually, many of them (especially legacy systems) do NOT even offer an &lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;API&lt;/a&gt; or an &lt;a href=&quot;http://en.wikipedia.org/wiki/Search/Retrieve_Web_Service&quot; target=&quot;_blank&quot;&gt;SRW/U&lt;/a&gt; interface. Consequently, we believe that there is much room for &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; to help cultural and educational organizations (e.g., museums, archives, libraries and multimedia collections) to export, present and&amp;nbsp;distribute&amp;nbsp;their&amp;nbsp;digitized&amp;nbsp;items and rich content to the outside world, in an efficient and structured way, through scraping and repurposing their data.&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/01/open-archives-digital-libraries.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxu5vimSNanRWKq2llpJaAsEDq8iaJftws_hUKr48FdWYA8vPs2u-uGfOEf1lcsYZJNkcVA81t9JHqh4XJr0QVE3fLMhIyd4E7rM0mlD4Ts9CupqkGDh2GacRRy7sOTfsMwrpxmmOLhpzD/s72-c/OA100.gif" height="72" width="72"/><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6296717665748183658</guid><pubDate>Tue, 17 Jan 2012 13:05:00 +0000</pubDate><atom:updated>2012-01-28T18:14:53.959+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">CAQDA</category><category domain="http://www.blogger.com/atom/ns#">Ethnography</category><category domain="http://www.blogger.com/atom/ns#">Forums</category><category domain="http://www.blogger.com/atom/ns#">Netnography</category><category domain="http://www.blogger.com/atom/ns#">Qualitative Analysis</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category 
domain="http://www.blogger.com/atom/ns#">Social sites</category><title>Netnography &amp; Scraping</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Netnography&quot; target=&quot;_blank&quot;&gt;Netnography&lt;/a&gt;, or digital ethnography, is (or should be) the correct translation of ethnographic methods to online environments such as bulletin boards and social sites. It is more or less what ethnographers do in physical places like squares, pubs and clubs: observing what people say and do, and trying to participate as much as possible in order to better understand what&#39;s involved in actions and discourses. Ethnography can answer many of the what, when, who and how questions that define everyday problems. However, netnography differs from ethnography in many ways, especially in the fashion in which it is conducted.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Forums and wikis, as well as the blogosphere, are good online equivalents of public squares and pubs. There are no physical identities, but online ones; no faces, but avatars; no gender, age or any reliable information about physical identities, but there are voices discussing and arguing about common topics of interest.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYj4L7s7OYm-1BaqZxvDlB0m8ySOIMcYRNxdm_keTbHyKvC6g6-HPdz_qk_2UPM4eaValW7qS-9w3Yx_DvzcBQ_oA_PW4Vd_3gUZAeZl2GNWPwXWwbGLJWgmClnks_aMjfMHu9GxlfZQID/s1600/soc-icons.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;86&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYj4L7s7OYm-1BaqZxvDlB0m8ySOIMcYRNxdm_keTbHyKvC6g6-HPdz_qk_2UPM4eaValW7qS-9w3Yx_DvzcBQ_oA_PW4Vd_3gUZAeZl2GNWPwXWwbGLJWgmClnks_aMjfMHu9GxlfZQID/s400/soc-icons.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; The more popular a forum is, the more difficult it gets to follow it netnographically. A netnographer has to use a Computer Assisted Qualitative Data Analysis (&lt;a href=&quot;http://en.wikipedia.org/wiki/Computer_assisted_qualitative_data_analysis_software&quot; target=&quot;_blank&quot;&gt;CAQDA&lt;/a&gt;) tool (such as &lt;a href=&quot;http://rqda.r-forge.r-project.org/&quot; target=&quot;_blank&quot;&gt;RQDA&lt;/a&gt;) on certain parts of the texts collected during their research. In a forum use case, these texts would be posts and threads. If the researcher had to browse the forum and manually copy and paste its content, a huge amount of effort would be required.
However, this obstacle can be overcome by scraping the forum with a web data extraction tool such as &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; A scraped forum is a jewel: perfectly ordered textual data corresponding to each thread, ready for further analysis. So, this is where DEiXTo comes into play and may boost the research process significantly. To our knowledge, &lt;a href=&quot;http://www.linkedin.com/in/jlchulilla&quot; target=&quot;_blank&quot;&gt;Dr Juan Luis Chulilla Cano&lt;/a&gt;, CEO of &lt;a href=&quot;http://www.onlineandoffline.net/&quot; target=&quot;_blank&quot;&gt;Online and Offline Ltd.&lt;/a&gt;, has been successfully using scraping techniques to capture the threads of popular Spanish forums (and their metadata) and transform them into a structured format, suitable for post-processing. Typically, such sites have a common presentation style for their threads and offer rich metadata. Thus, they are potential goldmines upon which various methodologies can be tested and applied so as to discover knowledge and trends and draw useful conclusions.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Finally, netnography and anthropology seem to be gaining momentum over the last few years. They are really interesting as well as challenging fields, and scraping could evolve into an important ally. It is worth mentioning that quite a few IT vendors and firms employ ethnographers for R&amp;amp;D and testing of new products. Therefore, there is a lot of potential in using computer-aided techniques in the context of netnography.
So, if you are coming from social sciences&amp;nbsp;and creating wrappers/ extraction rules is not your second nature, why don&#39;t you &lt;a href=&quot;http://deixto.com/contact.php&quot; target=&quot;_blank&quot;&gt;drop us an email&lt;/a&gt;? Perhaps we could help you gather quite a few tons of usable data with DEiXTo! &lt;i&gt;Unless terms of use or copyright restrictions forbid it..&lt;/i&gt;&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/01/netnography-scraping.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYj4L7s7OYm-1BaqZxvDlB0m8ySOIMcYRNxdm_keTbHyKvC6g6-HPdz_qk_2UPM4eaValW7qS-9w3Yx_DvzcBQ_oA_PW4Vd_3gUZAeZl2GNWPwXWwbGLJWgmClnks_aMjfMHu9GxlfZQID/s72-c/soc-icons.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6589159885559155952</guid><pubDate>Thu, 12 Jan 2012 22:03:00 +0000</pubDate><atom:updated>2014-01-18T19:26:17.098+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Geo-location</category><category domain="http://www.blogger.com/atom/ns#">Geographic data</category><category domain="http://www.blogger.com/atom/ns#">Google Maps</category><category domain="http://www.blogger.com/atom/ns#">Web services</category><category domain="http://www.blogger.com/atom/ns#">Yahoo PlaceFinder</category><title>Geo-location data, Yahoo! PlaceFinder &amp; Google Maps API</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div class=&quot;p1&quot; style=&quot;text-align: justify;&quot;&gt;
Location-aware applications have enjoyed huge success over the last few years, and geographic data are used extensively in a wide variety of ways. Meanwhile, there are numerous places of interest out there, such as shopping malls, airports, restaurants, museums and transit stations, and for most of them addresses are publicly available on the Web. Therefore, you could use &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (or a web data extraction tool of your choice) to scrape the desired location information for any points of interest and then post-process it to produce geographic data for further use.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3lyRJGhS9xJziboh8vR9KYW3yl2IoIgOsP044K9yBuxR91DZ2BLMbFKD6nVngb5f1BV3fWc0ui33NxrKqzTk1fAuz-L-a7kFpcPjobZiDC0nS9554gsyAJrXoiQXYVkiA4sonJ3GNaMDF/s1600/yahoo.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3lyRJGhS9xJziboh8vR9KYW3yl2IoIgOsP044K9yBuxR91DZ2BLMbFKD6nVngb5f1BV3fWc0ui33NxrKqzTk1fAuz-L-a7kFpcPjobZiDC0nS9554gsyAJrXoiQXYVkiA4sonJ3GNaMDF/s200/yahoo.png&quot; height=&quot;40&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;a href=&quot;http://developer.yahoo.com/geo/placefinder/&quot; target=&quot;_blank&quot;&gt;Yahoo! PlaceFinder&lt;/a&gt; is a great web service that supports world-wide geocoding of street addresses and place names. It allows developers to convert addresses and places into geographic coordinates (and vice versa). Thus, you can send an HTTP request with a street address to it and get the latitude and longitude back! It&#39;s amazing how well it works. Of course, the more complete and detailed the address, the more precise the coordinates returned.&lt;/div&gt;
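The request/response cycle is straightforward: one GET request per address, one set of coordinates back. The sketch below (Python rather than the Perl we actually used) shows the shape of a PlaceFinder-style call; Yahoo! has since retired the service, so treat the endpoint and the exact response layout as historical and partly assumed:

```python
import json
import urllib.parse

def placefinder_url(address, appid):
    """Build a PlaceFinder-style geocoding request; flags=J asks for JSON output."""
    query = urllib.parse.urlencode({"q": address, "appid": appid, "flags": "J"})
    return "http://where.yahooapis.com/geocode?" + query

def coordinates(response_text):
    """Pull (latitude, longitude) out of a PlaceFinder-style ResultSet."""
    best_match = json.loads(response_text)["ResultSet"]["Results"][0]
    return float(best_match["latitude"]), float(best_match["longitude"])
```

As noted above, the more complete and detailed the address string passed as q, the more precise the coordinates that come back.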
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; In the context of this post, we thought it would be nice, mostly for demonstration purposes, to build a map of&amp;nbsp;&lt;/span&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Thessaloniki&quot; style=&quot;text-align: left;&quot; target=&quot;_blank&quot;&gt;Thessaloniki&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp;museums using the&amp;nbsp;&lt;a href=&quot;http://code.google.com/apis/maps/documentation/javascript/&quot; target=&quot;_blank&quot;&gt;Google Maps API&lt;/a&gt;&amp;nbsp;and geo-location data generated with&amp;nbsp;&lt;/span&gt;Yahoo! PlaceFinder&lt;span style=&quot;text-align: left;&quot;&gt;. The source of data for our demo was&amp;nbsp;&lt;/span&gt;&lt;a href=&quot;http://odysseus.culture.gr/index_en.html&quot; style=&quot;text-align: left;&quot; target=&quot;_blank&quot;&gt;Odysseus&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;, the WWW server of the Hellenic Ministry of Culture that provides a full list of Greek museums, monuments and&amp;nbsp;archaeological&amp;nbsp;sites.&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUIAfAOoUqzzYtr6b0SnK2lPNboJq1o9DR1rt7wVu8ta4x0XIsXaQIV3SkdgHp_6W_EJa1HnFBQyddWhaMRKHB8DEfnaiGlXqi4wb3C6qa2pP3mUn3fkyL3jWhNqFqXTEoRlyZmzGe3I7L/s1600/odysseus.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUIAfAOoUqzzYtr6b0SnK2lPNboJq1o9DR1rt7wVu8ta4x0XIsXaQIV3SkdgHp_6W_EJa1HnFBQyddWhaMRKHB8DEfnaiGlXqi4wb3C6qa2pP3mUn3fkyL3jWhNqFqXTEoRlyZmzGe3I7L/s400/odysseus.png&quot; height=&quot;226&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, we searched for museums located in the city of Thessaloniki (&lt;span style=&quot;text-align: left;&quot;&gt;the second-largest city in Greece and the capital of the region of Central Macedonia)&lt;/span&gt; and extracted through &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; the street addresses of the ten results returned. In the picture below you can see a sample screenshot from the &quot;INFORMATION&quot; section of the Odysseus &lt;a href=&quot;http://odysseus.culture.gr/h/1/eh155.jsp?obj_id=3273&quot; target=&quot;_blank&quot;&gt;detailed webpage&lt;/a&gt; for the &lt;a href=&quot;http://www.lemmth.gr/c/portal_public/layout?p_l_id=1.2&amp;amp;setlanguage=en_US&quot; target=&quot;_blank&quot;&gt;Folk Art and Ethnological Museum of Macedonia and Thrace&lt;/a&gt; (from which the address of this specific museum was scraped):&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49ptyUcq9khW0ryUX-ATRO4vZO4MHcYxM87ueiJDY9BsB5N5NNU5POXJFCvy2RYMD86zp02S_PoFaShFcwXN0FZidZmaosMk7PKUbuEqF_lUR-pWbM3O32e-RNS1avTDqQ4Txw_dj9dGS/s1600/lemm_odysseus.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49ptyUcq9khW0ryUX-ATRO4vZO4MHcYxM87ueiJDY9BsB5N5NNU5POXJFCvy2RYMD86zp02S_PoFaShFcwXN0FZidZmaosMk7PKUbuEqF_lUR-pWbM3O32e-RNS1avTDqQ4Txw_dj9dGS/s400/lemm_odysseus.png&quot; height=&quot;148&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; After capturing the name and location of each museum and exporting them to a simple tab-delimited text file, we wrote a Perl script harnessing the &lt;a href=&quot;http://search.cpan.org/~gray/Geo-Coder-PlaceFinder-0.05/lib/Geo/Coder/PlaceFinder.pm&quot; target=&quot;_blank&quot;&gt;Geo::Coder::PlaceFinder&lt;/a&gt; CPAN module to automatically find their geo-location coordinates and create an XML output file containing all the necessary information (through &lt;a href=&quot;http://search.cpan.org/~josephw/XML-Writer-0.614/Writer.pm&quot; target=&quot;_blank&quot;&gt;XML::Writer&lt;/a&gt;). Part of this XML document is displayed right below:&lt;/span&gt;&lt;br /&gt;
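For readers who do not speak Perl, the output step can be sketched with Python's standard library instead of XML::Writer. The element names in this sketch are guesses for illustration, not the exact schema of our original file:

```python
import xml.etree.ElementTree as ET

def museums_to_xml(museums):
    """Serialize (name, latitude, longitude) tuples into a small XML document,
    mirroring what the Perl XML::Writer script produced (element names assumed)."""
    root = ET.Element("museums")
    for name, lat, lng in museums:
        node = ET.SubElement(root, "museum")
        ET.SubElement(node, "name").text = name
        ET.SubElement(node, "latitude").text = str(lat)
        ET.SubElement(node, "longitude").text = str(lng)
    return ET.tostring(root, encoding="unicode")
```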
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGR5Wq9XnDrMM2AO7dm0hylQVbS86n3RRLY8LpV3RUp2Mxejy83MaVf-rN1sRXYl6oddcIHJleEwTGMIp0S6DT36RpErjU5H94jacVaq9QygiYYtGF7RnYeMRk7ycMIy263bNPI7QQhObw/s1600/xml_museums.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGR5Wq9XnDrMM2AO7dm0hylQVbS86n3RRLY8LpV3RUp2Mxejy83MaVf-rN1sRXYl6oddcIHJleEwTGMIp0S6DT36RpErjU5H94jacVaq9QygiYYtGF7RnYeMRk7ycMIy263bNPI7QQhObw/s400/xml_museums.png&quot; height=&quot;180&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; After having all the metadata we needed in this XML file, we utilized the &lt;a href=&quot;http://code.google.com/apis/maps/documentation/javascript/&quot; target=&quot;_blank&quot;&gt;Google Maps JavaScript API v3&lt;/a&gt; and created a&amp;nbsp;&lt;a href=&quot;http://deixto.com/wp-content/uploads/thessaloniki_museums_map.html&quot; target=&quot;_blank&quot;&gt;map&lt;/a&gt;&amp;nbsp;(centered on Thessaloniki)&amp;nbsp;displaying&amp;nbsp;all city museums! To accomplish that goal, we followed the helpful guidelines given in this &lt;a href=&quot;http://www.svennerberg.com/2009/07/google-maps-api-3-markers/&quot; target=&quot;_blank&quot;&gt;very informative post&lt;/a&gt;&amp;nbsp;about Google Maps markers and wrote a short script that parsed the XML contents (via &lt;a href=&quot;http://search.cpan.org/~shlomif/XML-LibXML-1.90/LibXML.pod&quot; target=&quot;_blank&quot;&gt;XML::LibXML&lt;/a&gt;) and produced a web page with the desired Google Map object embedded (including markers for each museum). Finally, t&lt;/span&gt;&lt;span style=&quot;text-align: left;&quot;&gt;he&amp;nbsp;&lt;a href=&quot;http://deixto.com/thessaloniki_museums_map.html&quot; target=&quot;_blank&quot;&gt;end result&lt;/a&gt;&amp;nbsp;was pretty satisfying (after some extra manual effort to be absolutely honest):&lt;/span&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguqnt0gsAPVpNSesEgsUITmDil-yLtEMw_T3UL4_r8VsSRCsPiijw_4lVntvBJPNXWLFne4tmhnyk3-FY-tplo7yr-yZCO7ylnJqBPQloVPzUEd4vSzIrMm53XeEj5YjGrGtY5lA-VtVBB/s1600/Google_map_thessaloniki_museums.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguqnt0gsAPVpNSesEgsUITmDil-yLtEMw_T3UL4_r8VsSRCsPiijw_4lVntvBJPNXWLFne4tmhnyk3-FY-tplo7yr-yZCO7ylnJqBPQloVPzUEd4vSzIrMm53XeEj5YjGrGtY5lA-VtVBB/s400/Google_map_thessaloniki_museums.png&quot; height=&quot;400&quot; width=&quot;388&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
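The marker-generation step is the only part that touches JavaScript: for each museum in the XML file, the script emits one google.maps.Marker statement into the page. A rough Python equivalent, assuming hypothetical museum/name/latitude/longitude element names (our actual file may differ):

```python
import json
import xml.etree.ElementTree as ET

def marker_statements(xml_document):
    """Emit one Google Maps JavaScript API v3 marker statement per museum element."""
    statements = []
    for node in ET.fromstring(xml_document).iter("museum"):
        title = json.dumps(node.findtext("name"))  # doubles as a JS string literal
        lat = node.findtext("latitude")
        lng = node.findtext("longitude")
        statements.append(
            "new google.maps.Marker({position: new google.maps.LatLng("
            + lat + ", " + lng + "), map: map, title: " + title + "});"
        )
    return statements
```

Each emitted statement assumes the surrounding page has already created a google.maps.Map object named map, as in the marker tutorial mentioned above.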
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; This is kind of cool, isn&#39;t it? Of course, the same procedure could be applied on a larger scale (e.g. for creating a map of Greece with ALL museums and/or monuments available) or expanded to other points of interest (whatever you can imagine, from schools and educational institutions to cinemas, supermarkets, shops or bank ATMs). In conclusion, we think that the combination of &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; with other powerful tools and technologies can sometimes yield an innovative and hopefully useful outcome. Since you have the raw web data at your disposal (captured with DEiXTo), your imagination (and perhaps &lt;a href=&quot;http://deixto.blogspot.com/2011/12/robotstxt-access-restrictions.html&quot; target=&quot;_blank&quot;&gt;copyright restrictions&lt;/a&gt;) is the only limit!&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2012/01/geo-location-data-yahoo-placefinder.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3lyRJGhS9xJziboh8vR9KYW3yl2IoIgOsP044K9yBuxR91DZ2BLMbFKD6nVngb5f1BV3fWc0ui33NxrKqzTk1fAuz-L-a7kFpcPjobZiDC0nS9554gsyAJrXoiQXYVkiA4sonJ3GNaMDF/s72-c/yahoo.png" height="72" width="72"/><thr:total>0</thr:total></item></channel></rss>