<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-3639231664593965268</atom:id><lastBuildDate>Sat, 05 Oct 2024 03:22:03 +0000</lastBuildDate><category>Scraping</category><category>Data transformations</category><category>Selenium</category><category>APIs</category><category>Agents</category><category>Browser automation</category><category>Data visualization</category><category>Dublin Core</category><category>Europeana</category><category>OAI-PMH</category><category>Διαύγεια</category><category>ΥπερΔιαύγεια</category><category>DSpace</category><category>Digital Libraries</category><category>Downloading</category><category>Federated search</category><category>Geo-location</category><category>Google Charts</category><category>Institutional repositories</category><category>JavaScript</category><category>Music Library Lilian Voudouri</category><category>Open Source</category><category>PDF Downloader</category><category>Search engines</category><category>Veria Central Public Library</category><category>Web archiving</category><category>Wrappers</category><category>AJAX</category><category>Archivability</category><category>Athos Memory</category><category>BCI</category><category>CAQDA</category><category>Celery</category><category>Concurrent workers</category><category>Conference</category><category>D3.js</category><category>Data migration</category><category>Digital preservation</category><category>E-Learning</category><category>Ethnography</category><category>FLOSS</category><category>FP7</category><category>Forums</category><category>Geographic data</category><category>Google 
Maps</category><category>Heritrix</category><category>HideMyAss</category><category>Informatics</category><category>Internet Archive</category><category>JSON</category><category>Job queues</category><category>Linked Data</category><category>Michelin Maps</category><category>Mobile apps</category><category>Mobile devices</category><category>Netnography</category><category>OCR</category><category>Open Archives</category><category>Open data</category><category>PDF</category><category>Parliament</category><category>PhantomJS</category><category>Podcasts</category><category>Price monitoring</category><category>Proxies</category><category>Proxify</category><category>Publications</category><category>Qualitative Analysis</category><category>Robots.txt</category><category>Sauce Labs</category><category>ScraperWiki</category><category>Social sites</category><category>SwitchProxy</category><category>TEL-MAP</category><category>Tech Box</category><category>Tesseract</category><category>Testing</category><category>VPN</category><category>Web services</category><category>Wikipedia</category><category>XML</category><category>XPath</category><category>XSLT</category><category>Yahoo PlaceFinder</category><category>Z39.50</category><category>dbWiz</category><category>e-commerce</category><category>e-procurement</category><category>iPhone simulator</category><category>myVisitPlanner</category><category>openarchives.gr</category><category>spynner</category><category>wget</category><title>deixto.com/blog</title><description></description><link>http://deixto.blogspot.com/</link><managingEditor>noreply@blogger.com (kntonas)</managingEditor><generator>Blogger</generator><openSearch:totalResults>39</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-7779916029708740510</guid><pubDate>Tue, 21 Jan 2014 07:10:00 
+0000</pubDate><atom:updated>2014-01-23T11:53:18.644+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Celery</category><category domain="http://www.blogger.com/atom/ns#">Concurrent workers</category><category domain="http://www.blogger.com/atom/ns#">Job queues</category><title>Celery task/ job queue</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Queue_(abstract_data_type)&quot; target=&quot;_blank&quot;&gt;Queues&lt;/a&gt; are very common in computer science and in real-world programs. A queue is a &lt;a href=&quot;http://en.wikipedia.org/wiki/FIFO&quot; target=&quot;_blank&quot;&gt;FIFO&lt;/a&gt; data structure: new elements are added at the rear, and the first items inserted are the first to be removed and served. A nice, thorough collection of queueing systems can be found on &lt;a href=&quot;http://queues.io/&quot;&gt;queues.io&lt;/a&gt;. Task/job queues in particular are used by numerous systems and services across a wide range of applications. They can reduce the complexity of system design and implementation and boost scalability. Some of the most popular ones are &lt;a href=&quot;http://www.celeryproject.org/&quot; target=&quot;_blank&quot;&gt;Celery&lt;/a&gt;, &lt;a href=&quot;http://python-rq.org/&quot; target=&quot;_blank&quot;&gt;RQ (Redis Queue)&lt;/a&gt; and &lt;a href=&quot;http://gearman.org/&quot; target=&quot;_blank&quot;&gt;Gearman&lt;/a&gt;.&lt;br /&gt;
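The FIFO behaviour described above is easy to see with Python's built-in deque, which is a common way to get an efficient queue:

```python
from collections import deque

# A FIFO queue: enqueue at the rear, dequeue from the front.
queue = deque()
queue.append("job-1")   # first in
queue.append("job-2")
queue.append("job-3")

first_served = queue.popleft()  # first out
print(first_served)  # → job-1
```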
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The one we recently stumbled upon and immediately took advantage of was &lt;a href=&quot;http://www.celeryproject.org/&quot; target=&quot;_blank&quot;&gt;Celery&lt;/a&gt;, which is written in Python and based on distributed message passing. It focuses on real-time operation but supports scheduling as well. The units of work to be performed, called tasks, are executed concurrently on one or more worker servers using multiprocessing. Workers run in the background waiting for new jobs, and when a task arrives (and its turn comes) a worker processes it. Some of Celery&#39;s uses are handling long-running jobs, asynchronous task processing, offloading heavy tasks, job routing, &lt;a href=&quot;https://github.com/NetAngels/celery-tasktree&quot; target=&quot;_blank&quot;&gt;task trees&lt;/a&gt;, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.celeryproject.org/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipxWL7nPGFuLknq1kVFoHpCcsr35JVi_Jh0tdFpZ2laHLZLgbwwMomkqfOYPxSTnTvZ3PkRI3zwGncAgHxyxLZh43WO3WBn9PntaE5vFhVdr09nSX9i0g8Is9olWaaYuN3GTqS6i5HETR1/s1600/celery.jpg&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
&amp;nbsp; &amp;nbsp; Now let&#39;s see a use case where Celery untied our hands. As you might already know, for quite some time we have been developing Python, &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;-based scripts for web scraping and browser automation. While executing a script, we occasionally came across data records that had to be dealt with separately by another process or script. For instance, in the context of the recent &lt;a href=&quot;http://deixto.com/deixto-blog/rss_feed_for_the_greek_e-procurement/&quot; target=&quot;_blank&quot;&gt;e-procurement project&lt;/a&gt;, when scraping through &lt;a href=&quot;http://deixto.blogspot.gr/2012/02/deixto-components-clarified.html&quot; target=&quot;_blank&quot;&gt;DEiXToBot&lt;/a&gt; the detail page of a payment (published on the &lt;a href=&quot;http://www.eprocurement.gov.gr/&quot; target=&quot;_blank&quot;&gt;Greek e-procurement platform&lt;/a&gt;), you could find a reference to a relevant contract which you would also like to download and scrape. This contract could itself link to a tender notice, and the latter might be a corrected version of an existing tender or connect in turn with yet another decision or document.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Thus, we thought it would be more convenient to add the unique identifiers of these extra documents to a queue and let a background worker get the job done asynchronously. It should be noted that the lack of persistent links on the e-procurement website made it harder to download a detail page programmatically at a later stage: you could access it only by performing a new search with its ID and automating a series of Selenium steps that depend on the type of the target document.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, it was not long before we installed Celery on our Linux server and started experimenting with it. We were amazed by its simplicity and efficiency, and we quickly wrote a script that fitted the bill for the e-procurement scenario described above. The code provided an elegant, practical solution to the problem at hand and looked something like this (note the recursion!):&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;from celery import Celery&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;app = Celery(&#39;tasks&#39;, backend=&#39;amqp&#39;, broker=&#39;amqp://&#39;)&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;@app.task&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;def download(id):&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;nbsp; ... selenium_stuff ...&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;nbsp; if reference_found:&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; download.delay(new_id) # delay sends a task message&lt;/span&gt;&lt;/div&gt;
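The recursive fan-out pattern is easier to see without the Celery machinery. Here is a minimal stand-in: a hypothetical document graph (the IDs and references below are made up for illustration) and a plain in-process queue playing the role of the broker and worker:

```python
from collections import deque

# Hypothetical document graph: a payment references a contract,
# which references a tender notice. These IDs are illustrative only.
REFERENCES = {"payment-1": "contract-7", "contract-7": "tender-3"}

def process_all(start_id):
    """Simulate the worker loop: each task may enqueue a follow-up task,
    just as download.delay(new_id) does in the Celery version."""
    queue = deque([start_id])
    processed = []
    while queue:
        doc_id = queue.popleft()
        processed.append(doc_id)      # stands in for the Selenium scraping
        new_id = REFERENCES.get(doc_id)
        if new_id:                    # reference_found
            queue.append(new_id)      # stands in for download.delay(new_id)
    return processed

print(process_all("payment-1"))  # → ['payment-1', 'contract-7', 'tender-3']
```

With real Celery the queue lives in the broker and the loop runs inside worker processes, so the chain of linked documents is crawled asynchronously, outside the main scraping script.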
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; In conclusion, we are happy to have found Celery; it&#39;s really promising and we thought it would be nice to share it with you. We look forward to using Celery further for our heavy scraping needs and are glad to have added it to our arsenal.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2014/01/celery-task-job-queue.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipxWL7nPGFuLknq1kVFoHpCcsr35JVi_Jh0tdFpZ2laHLZLgbwwMomkqfOYPxSTnTvZ3PkRI3zwGncAgHxyxLZh43WO3WBn9PntaE5vFhVdr09nSX9i0g8Is9olWaaYuN3GTqS6i5HETR1/s72-c/celery.jpg" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-7336264484231739650</guid><pubDate>Fri, 17 Jan 2014 08:17:00 +0000</pubDate><atom:updated>2014-01-17T23:40:33.633+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">HideMyAss</category><category domain="http://www.blogger.com/atom/ns#">Proxies</category><category domain="http://www.blogger.com/atom/ns#">Proxify</category><category domain="http://www.blogger.com/atom/ns#">SwitchProxy</category><category domain="http://www.blogger.com/atom/ns#">VPN</category><title>About web proxies</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Rightly or wrongly, there are times when one would like to conceal one&#39;s IP address, especially while scraping a target website. Perhaps the most popular way to do that is by using web proxy servers. A &lt;a href=&quot;http://en.wikipedia.org/wiki/Proxy_server&quot; target=&quot;_blank&quot;&gt;proxy server&lt;/a&gt; is a computer system or an application that acts as an intermediary for requests from clients seeking resources from other servers. Thus, web proxies allow users to mask their true IP and surf anonymously online. We, however, are mostly interested in their use for web data extraction and automated systems in general. So, we did some Google searching to locate notable proxy service providers, but surprisingly most of the results were dubious websites of low trustworthiness and low Google &lt;a href=&quot;http://en.wikipedia.org/wiki/PageRank&quot; target=&quot;_blank&quot;&gt;PageRank&lt;/a&gt; scores. However, a few stood out from the crowd. We will name two: a) &lt;a href=&quot;http://www.hidemyass.com/&quot; target=&quot;_blank&quot;&gt;HideMyAss&lt;/a&gt; (or HMA for short) and b) &lt;a href=&quot;https://proxify.com/&quot; target=&quot;_blank&quot;&gt;Proxify&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Proxy_server&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfyCaVY2cNZhMzBfa4Ac29BSubYNU8vVDX0tiOfJJvMIkckkdxFI6YHcbXKk0XFbPhKIheuU60xH2H-arfbla2kHraKPiBhWksjK-ihXuPkGfJEEEpolcjGkXvbZ_Wm5NyilJjVQBgRCi_/s1600/280px-Open_proxy_h2g2bob.svg.png&quot; height=&quot;120&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; HMA provides (among others) a large real-time database of &lt;a href=&quot;https://hidemyass.com/proxy-list/&quot; target=&quot;_blank&quot;&gt;free working public proxies&lt;/a&gt;. These proxies are open to everyone and vary in speed and anonymity level. Nevertheless, free shared proxies have certain disadvantages, mostly in terms of security and privacy; they are third-party proxies and HMA cannot vouch for their reliability. On the other hand, HMA offers a powerful &lt;a href=&quot;http://hidemyass.com/vpn/&quot; target=&quot;_blank&quot;&gt;Pro VPN &lt;/a&gt;service which encrypts your entire internet activity and, unlike a web proxy, automatically works with &lt;u&gt;all&lt;/u&gt; applications on your computer (whereas web proxies typically work with web browsers like Firefox or Chrome and utilities like &lt;a href=&quot;http://curl.haxx.se/&quot; target=&quot;_blank&quot;&gt;cURL&lt;/a&gt; or &lt;a href=&quot;http://www.gnu.org/software/wget/&quot; target=&quot;_blank&quot;&gt;GNU Wget&lt;/a&gt;). However, the company&#39;s policy and Pro VPN&#39;s terms of use are not robot-friendly, so using Pro VPN for web scraping may trigger an abuse warning and result in account suspension.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://hidemyass.com/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwrqZg_7-FUH6haodlY5aomQYitydqsmk42Jr1gHlHA82cY4bn1Y9FnjIgT7vdiNpP1yiWR9RrO7g1mqWuxLXISYWCjlh2FQRzO2eWAFO7fAleUvLtsKB2vU2Jdfm7iNmFpwbJVJzHYkJE/s1600/HMA.png&quot; height=&quot;148&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The second high-quality proxy service we found was &lt;a href=&quot;https://proxify.com/&quot; target=&quot;_blank&quot;&gt;Proxify&lt;/a&gt;. They offer three packages: Basic, Pro and &lt;a href=&quot;https://proxify.com/switchproxy.shtml?&quot; target=&quot;_blank&quot;&gt;SwitchProxy&lt;/a&gt;. The latter is very fast and intended for web crawling and automated systems of any scale. Since we are mostly interested in web scraping, SwitchProxy is the tool that suits us best. It provides a rich set of features and gives access to 1296 &quot;satellites&quot; in 279 cities in 74 countries worldwide. They also offer an automatic IP change mechanism that runs either after each request (assigning a random IP address each time) or once every 10 minutes (scheduled rotation). Therefore, it seems a great option for scraping purposes, maybe the best out there. However, it&#39;s quite expensive, with plans starting at $100 per month. Additionally, Proxify provides some nice &lt;a href=&quot;https://proxify.com/switchproxy_guide.shtml&quot; target=&quot;_blank&quot;&gt;code examples&lt;/a&gt; showing how to integrate SwitchProxy into a program or web robot. As far as &lt;a href=&quot;http://search.cpan.org/~ether/WWW-Mechanize/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt; and &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt; are concerned (these two are our favorite web browsing tools), it is straightforward to combine them with SwitchProxy.&lt;/div&gt;
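In the same spirit, routing a plain Python HTTP client through a proxy takes only a few lines of standard library code; the proxy address below is a placeholder, not a real SwitchProxy endpoint:

```python
import urllib.request

def make_proxy_opener(proxy_url):
    """Build an opener whose HTTP/HTTPS requests go through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# Placeholder address - substitute the host/port your proxy provider gives you.
opener = make_proxy_opener("http://203.0.113.7:8080")
# opener.open("http://example.com/") would now be routed via the proxy.
```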
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;a href=&quot;https://proxify.com/switchproxy.shtml&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdBKE4cnGjoS0XsE69eUtF8pFmJaZcYjguzmBwW2U0W_Mj3LItSmjfX4nHKeiYBAYEZWJZvBAwwCZMHrtkeC1YaLy8zdEZxVK1V7Od4HxshzzS4J86sdilTmfvrrhLNB9LESxNdKOJOfs_/s1600/switchproxy.png&quot; height=&quot;65&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Finally, we would like to highlight once again the access restrictions and terms of use that many websites impose. Before launching a scraper, make sure you check the site&#39;s &lt;a href=&quot;http://www.robotstxt.org/&quot; target=&quot;_blank&quot;&gt;robots.txt&lt;/a&gt; file as well as its copyright notice. For further information we wrote a &lt;a href=&quot;http://deixto.blogspot.gr/search/label/Robots.txt&quot; target=&quot;_blank&quot;&gt;relevant post&lt;/a&gt; some time ago; perhaps you would like to check it out too.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2014/01/about-web-proxies.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfyCaVY2cNZhMzBfa4Ac29BSubYNU8vVDX0tiOfJJvMIkckkdxFI6YHcbXKk0XFbPhKIheuU60xH2H-arfbla2kHraKPiBhWksjK-ihXuPkGfJEEEpolcjGkXvbZ_Wm5NyilJjVQBgRCi_/s72-c/280px-Open_proxy_h2g2bob.svg.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-928137876577213385</guid><pubDate>Thu, 02 Jan 2014 13:15:00 +0000</pubDate><atom:updated>2014-01-02T20:31:07.494+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">D3.js</category><category domain="http://www.blogger.com/atom/ns#">Data visualization</category><category domain="http://www.blogger.com/atom/ns#">e-procurement</category><category domain="http://www.blogger.com/atom/ns#">ΥπερΔιαύγεια</category><title>Visualizing e-procurement tenders with a bubble chart</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A few weeks ago we started gathering data from the Greek &lt;a href=&quot;http://www.eprocurement.gov.gr/&quot; target=&quot;_blank&quot;&gt;e-procurement platform&lt;/a&gt; through &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; aiming to build an &lt;a href=&quot;http://deixto.gr/eprocurement/eprocurement.rss&quot; target=&quot;_blank&quot;&gt;RSS feed&lt;/a&gt; with the latest tender notices and in order to provide a method to automatically retrieve fresh data from the &lt;a href=&quot;http://www.eprocurement.gov.gr/agora/unprotected/searchNotice.htm&quot; target=&quot;_blank&quot;&gt;Central Electronic Registry&lt;/a&gt; of Public Contracts (CERPC or “Κεντρικό Ηλεκτρονικό Μητρώο Δημοσίων Συμβάσεων” in Greek). &amp;nbsp;For further information you can read &lt;a href=&quot;http://deixto.com/deixto-blog/rss_feed_for_the_greek_e-procurement/&quot; target=&quot;_blank&quot;&gt;this post&lt;/a&gt;. Only a few days later, we were happy to find out that the first power user consuming the feed popped up: &lt;a href=&quot;http://yperdiavgeia.gr/#eprocurementtab&quot; target=&quot;_blank&quot;&gt;yperdiavgeia.gr&lt;/a&gt;, a popular search engine indexing all public documents uploaded to the &lt;a href=&quot;http://diavgeia.gov.gr/en&quot; target=&quot;_blank&quot;&gt;Clarity&lt;/a&gt; website.&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://yperdiavgeia.gr/#eprocurementtab&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;283&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFzr5y39R9OfkH_FCBTVKvxHxLf6a9sgYn6rrpnzaFAoyF8-T5kO028oHgK6xxsgvF9yHNNYMxgFkdl2FDE2bV6b39I6VfuTBMqZ1KTGalnf4CfsljlviOL4ABz6RPU1Ec_TfhckHadvJa/s400/ultraclarity-e-procurement.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So now that we have a good deal of data at hand and systematically ingest public procurement info every single day, we are trying to think of innovative ways to use it creatively. They say a picture is worth a thousand words, so one of the first ideas that occurred to us (inspired by &lt;a href=&quot;http://greekspending.com/&quot; target=&quot;_blank&quot;&gt;greekspending.com&lt;/a&gt;) was to visualize the feed with some beautiful graphics. After a little experimentation with the great &lt;a href=&quot;https://github.com/mbostock/d3/wiki/Gallery&quot; target=&quot;_blank&quot;&gt;D3.js library&lt;/a&gt; and some puttering around with the &lt;a href=&quot;http://search.cpan.org/~makamaka/JSON/lib/JSON.pm&quot; target=&quot;_blank&quot;&gt;JSON Perl module&lt;/a&gt;, we managed to come up with a handy bubble chart which you may check out here: &lt;a href=&quot;http://deixto.gr/eprocurement/visualize&quot; target=&quot;_blank&quot;&gt;http://deixto.gr/eprocurement/visualize&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://deixto.gr/eprocurement/visualize/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;539&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhm-p8PtMXJkRC_X9iJo9AkYXONXOfBXlNEssI6woUyWKoy3wacT7YS9knCPAilfQ9m4y2yFEBZF4eDbbcUD-q_hHqA5vSCU2HaGLhfISA_3EDjE7WGWhgPC4cUErRjHRtoIvHl53axgp6i/s640/Screen+Shot+2014-01-02+at+1.50.42+PM.png&quot; width=&quot;640&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&amp;nbsp; &amp;nbsp; Let&#39;s note a couple of things in order to better comprehend the chart.&lt;br /&gt;
&lt;ul style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;the bigger the budget, the bigger the bubble&lt;/li&gt;
&lt;li&gt;if you click on a bubble then you will be redirected to the full text PDF document&lt;/li&gt;
&lt;li&gt;on mouseover a tooltip appears with some basic data fields&lt;/li&gt;
&lt;/ul&gt;
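The budget-to-bubble-size mapping above rests on feeding D3 a JSON hierarchy whose leaf values drive the bubble radii. A rough Python sketch of that transformation (the record fields and URLs here are illustrative, not the actual CERPC feed schema):

```python
import json

# Illustrative tender records - not the real CERPC feed schema.
tenders = [
    {"title": "Road maintenance", "budget": 250000, "url": "http://example.org/1"},
    {"title": "IT equipment",     "budget": 80000,  "url": "http://example.org/2"},
]

def to_bubble_json(records):
    """Shape records the way D3's pack/bubble layouts expect: a root node
    whose children carry a numeric 'value' (here, the tender budget)."""
    children = [
        {"name": r["title"], "value": r["budget"], "url": r["url"]}
        for r in records
    ]
    return json.dumps({"name": "tenders", "children": children})

print(to_bubble_json(tenders))
```

Keeping the PDF URL on each child node is what makes the click-through to the full-text document possible.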
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; The good news is that this chart will be produced automatically on a daily basis along with the RSS feed.&amp;nbsp;&amp;nbsp;So, one could easily browse through the tenders published on CERPC over the last few days and locate the high-budget ones. Finally, as &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_data&quot; target=&quot;_blank&quot;&gt;open data&lt;/a&gt; supporters we are very glad to see transparency initiatives like Clarity or CERPC and we warmly encourage people and organisations to take advantage of open public data and use it for a good purpose. Any suggestions or comments about further use of the e-procurement data would be very welcome!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2014/01/visualizing-e-procurement.gov.gr-tenders-with-a-bubble-chart.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFzr5y39R9OfkH_FCBTVKvxHxLf6a9sgYn6rrpnzaFAoyF8-T5kO028oHgK6xxsgvF9yHNNYMxgFkdl2FDE2bV6b39I6VfuTBMqZ1KTGalnf4CfsljlviOL4ABz6RPU1Ec_TfhckHadvJa/s72-c/ultraclarity-e-procurement.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-4769496151258196046</guid><pubDate>Tue, 10 Dec 2013 07:47:00 +0000</pubDate><atom:updated>2013-12-10T12:48:47.520+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Agents</category><category domain="http://www.blogger.com/atom/ns#">Heritrix</category><category domain="http://www.blogger.com/atom/ns#">Internet Archive</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><category domain="http://www.blogger.com/atom/ns#">Web archiving</category><title>Web archiving and Heritrix</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
A topic that has gained increasing attention lately is web archiving. In an &lt;a href=&quot;http://deixto.blogspot.gr/2013/02/digital-preservation-and-archiveready.html&quot; target=&quot;_blank&quot;&gt;older post&lt;/a&gt; we started talking about it and we cited a remarkable online tool named &lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;ArchiveReady&lt;/a&gt; that checks whether a web page is easily archivable. Perhaps the most well-known web archiving project is currently the &lt;a href=&quot;https://archive.org/&quot; target=&quot;_blank&quot;&gt;Internet Archive&lt;/a&gt; which is a non-profit organization aiming to build a permanently and freely accessible Internet library. Their &lt;a href=&quot;https://archive.org/web/&quot; target=&quot;_blank&quot;&gt;Wayback Machine&lt;/a&gt;, a digital archive of the World Wide Web, is really interesting. It enables users to &quot;travel&quot; across time and visit archived versions of web pages.&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://archive.org/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinKZ4PJK3VGO6yFpQrXgpPuDaHDEuKsvvvPLQFQM1KpNkU-gs8ehGI7UVzIwH-z49oRXnEkbjAxje2ori0OICWE0atgl1Bd2NSxnKqfQProy1dLkCPvYh9LfVfmUWA4ZJ6lnZC8SlKCjsJ/s1600/Internet_Archive_logo.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; As web scraping aficionados we are mostly interested in their crawling toolset. The web crawler used by the Internet Archive is &lt;a href=&quot;https://webarchive.jira.com/wiki/display/Heritrix/Heritrix&quot; target=&quot;_blank&quot;&gt;Heritrix&lt;/a&gt;, a free, powerful Java crawler released under the Apache License. The latest version is 3.1.1 and it was made available back in May 2012. Heritrix creates copies of websites and generates &lt;a href=&quot;http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf&quot; target=&quot;_blank&quot;&gt;WARC&lt;/a&gt; (Web ARChive) files. The WARC format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Heritrix offers a basic web-based user interface (admin console) to manage your crawls, as well as a command line tool that can optionally be used to initiate archiving jobs. We played with it a bit and found it handy for quite a few cases, but overall it left us feeling it was somewhat dated.&lt;/div&gt;
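To make the "headers plus data block" description of WARC concrete, here is a deliberately simplified sketch of a single record; real archives should be produced with Heritrix or a dedicated WARC library rather than hand-rolled, since actual records carry more mandatory fields (record IDs, dates, digests):

```python
def make_warc_record(target_uri, payload):
    """Build a simplified WARC-style record: a version line, a few named
    header fields, a blank line, then the data block. (Illustrative only -
    real WARC records require additional mandatory fields.)"""
    body = payload.encode("utf-8")
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(body)}",
    ])
    # Two CRLFs terminate the record so records can be concatenated into one file.
    return headers.encode("utf-8") + b"\r\n\r\n" + body + b"\r\n\r\n"

record = make_warc_record("http://example.com/", "<html>...</html>")
print(record.decode("utf-8").splitlines()[0])  # → WARC/1.0
```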
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; height=&quot;276&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1tqf23E9_R2P6OB3FrZ3Dm83ywt_cUsDAKj5OtQ-2rMzy7OvT8_Weac_DsWC76y9LLA5SiE2HBFZBphnwWSGEQ07WNDqedikCqt1wN_uO1wCY47EqB-sJfKLeFWMwntWeqqZ8CKnDIHkG/s320/Screen+Shot+2013-12-08+at+4.57.40+PM.png&quot; width=&quot;320&quot; /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; In our humble opinion (and someone please correct us if we are wrong), Heritrix has two main drawbacks: a) no support for distributed crawling and b) no support for JavaScript/AJAX. The first means that if you wanted to scan a really big source of data, for example the great &lt;a href=&quot;http://dp.la/&quot; target=&quot;_blank&quot;&gt;Digital Public Library of America&lt;/a&gt; (DPLA) with more than 5 million items/ pages, Heritrix would take a long time since it runs locally on a single machine. Even if multiple Heritrix crawlers were combined and a subset of the target URL space assigned to each of them, it still wouldn&#39;t be an optimal solution. From our point of view it would be much better and faster if several cooperating agents on multiple servers could collaborate to complete the task. Scaling and time issues therefore arise when the number of pages grows very large.&lt;/div&gt;
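The cooperating-agents idea can be sketched in miniature with Python's concurrent.futures: partition the URL space and let workers process it in parallel. The fetch function below is a trivial stand-in for a real crawler node (a production version would download and archive each page, and the workers could be remote machines rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a real fetch-and-archive step on one crawler node."""
    return (url, "archived")

# Hypothetical URL space to be divided among the workers.
urls = [f"http://example.org/item/{i}" for i in range(8)]

# Crawl the URL space in parallel; map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fetch, urls))

print(len(results))  # → 8
```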
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://dp.la/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpU1gUES6YTPebqf-uAJt-ZaJGEL45EswZp4-Qgyko0RAgEJiJQrZRZDjdCGocfjKNhJCcgPxtWcSxVP-fFIzVq_gWotGdqOepGsIKqzNMUFnsmm-vVXgg1eYCgtcsuLYrf952ZDt-nT76/s1600/dpla-logo.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; The second disadvantage, on the other hand, is related to the trend of modern websites towards heavy use of JavaScript and AJAX calls. Heritrix provides only basic browser functionality and does not include a fully-fledged web browser. Therefore, it is not able to efficiently archive pages that use JavaScript/AJAX to populate parts of the page, and consequently it cannot properly capture social media content.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; We think that both of these issues could be overcome using a cloud, &lt;a href=&quot;http://www.seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;-based architecture like &lt;a href=&quot;https://saucelabs.com/&quot; target=&quot;_blank&quot;&gt;Sauce Labs&lt;/a&gt; (although the cost of an Enterprise plan should be taken into account). This choice would allow you a) to run your crawls in the cloud in parallel and b) to use a real web browser with full JavaScript support, such as Firefox, Chrome or Safari. We have already covered Selenium in &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;previous posts&lt;/a&gt; and it is a great browser automation tool. In conclusion, we recommend Selenium and a different, cloud-based approach for implementing large-scale web archiving projects. Heritrix is quite good and has proved a valuable ally, but we think that other, state-of-the-art technologies are nowadays more suitable for the job, especially with the latest &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_2.0&quot; target=&quot;_blank&quot;&gt;Web 2.0&lt;/a&gt; developments. What&#39;s your opinion?&amp;nbsp;&lt;/div&gt;
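The "run crawls in parallel" half of this idea can be sketched in a few lines. In the sketch below, `fetch_rendered` is a hypothetical stand-in for a real Selenium session (e.g. a remote WebDriver pointed at a cloud grid) that would return the fully rendered page source:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_rendered(url: str) -> str:
    # Stub for illustration. In a real setup this would drive a browser,
    # e.g. via selenium.webdriver.Remote, and return driver.page_source
    # after the page's JavaScript/AJAX has finished executing.
    return f"<html>rendered {url}</html>"

def archive_parallel(urls, workers=4):
    # Fetch several pages concurrently, as a cloud grid of browsers would.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch_rendered, urls)))

pages = archive_parallel(["http://example.org/%d" % i for i in range(10)])
```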
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/12/web-archiving-and-heritrix.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinKZ4PJK3VGO6yFpQrXgpPuDaHDEuKsvvvPLQFQM1KpNkU-gs8ehGI7UVzIwH-z49oRXnEkbjAxje2ori0OICWE0atgl1Bd2NSxnKqfQProy1dLkCPvYh9LfVfmUWA4ZJ6lnZC8SlKCjsJ/s72-c/Internet_Archive_logo.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6285472663660242478</guid><pubDate>Mon, 16 Sep 2013 06:57:00 +0000</pubDate><atom:updated>2013-09-19T11:47:05.710+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">BCI</category><category domain="http://www.blogger.com/atom/ns#">Conference</category><category domain="http://www.blogger.com/atom/ns#">Informatics</category><category domain="http://www.blogger.com/atom/ns#">Publications</category><title>DEiXTo at BCI 2013</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
We are pleased to inform you that our short paper titled “DEiXTo: A web data extraction suite” has been accepted for presentation at the 6th Balkan Conference in Informatics (&lt;a href=&quot;http://bci2013.bci-conferences.org/&quot; target=&quot;_blank&quot;&gt;BCI 2013&lt;/a&gt;), to be held in &lt;a href=&quot;http://en.wikipedia.org/wiki/Thessaloniki&quot; target=&quot;_blank&quot;&gt;Thessaloniki&lt;/a&gt; on September 19-21, 2013. The main goal of the BCI series of conferences is to provide a forum for the discussion and dissemination of research accomplishments and to promote interaction and collaboration among scientists from the Balkan countries.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://bci2013.bci-conferences.org/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;42&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0CSSP_VsxvljEer3CnNtUw7fN-j1OI2JR2CvgTQxjUgqO_nhv5MSln9nr5IhXGVNOOHq8ql2LbgRLzh4rZfiWbesN96tNk5gAgb5iplu16tzjZFGvAk00oGfAIBU1qy_yFu5s7gVqVI0u/s400/bci2013.png&quot; width=&quot;400&quot;&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So, if you would like to cite &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; in your thesis, project or scientific work, please use the following reference:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
F. Kokkoras, K. Ntonas, N. Bassiliades. “DEiXTo: A web data extraction suite”, In proc. of the 6th Balkan Conference in Informatics (BCI-2013), September 19-21, 2013, Thessaloniki, Greece&lt;/div&gt;
&lt;div&gt;
&lt;br&gt;&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/09/deixto-at-bci-2013.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0CSSP_VsxvljEer3CnNtUw7fN-j1OI2JR2CvgTQxjUgqO_nhv5MSln9nr5IhXGVNOOHq8ql2LbgRLzh4rZfiWbesN96tNk5gAgb5iplu16tzjZFGvAk00oGfAIBU1qy_yFu5s7gVqVI0u/s72-c/bci2013.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2139928746548570220</guid><pubDate>Mon, 09 Sep 2013 07:20:00 +0000</pubDate><atom:updated>2014-01-13T09:54:45.472+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category domain="http://www.blogger.com/atom/ns#">XPath</category><title>Using XPath for web scraping</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Over the last few years we have worked quite a bit on aggregators that periodically gather information from multiple online sources. We usually write our own custom code and mostly use DOM-based extraction patterns (built with our home-made DEiXTo GUI tool), but we also use other technologies and useful tools, where possible, in order to get the job done and make our scraping tasks easier. One of them is &lt;a href=&quot;http://www.w3schools.com/xpath/&quot; target=&quot;_blank&quot;&gt;XPath&lt;/a&gt;, a query language defined by the W3C for selecting nodes from an XML document. Note that an HTML page (even a malformed one) can be represented as a DOM tree, and thus as an XML document. XPath is quite effective, especially for relatively simple scraping cases.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Suppose, for instance, that we would like to retrieve the content of an article/post/story on a specific website or blog. Of course, this scenario could be extended to several posts from many different sources and scale up. Typically, the body of a post resides in a DIV (or a certain type of) HTML element with a particular attribute value (the same holds for the post title). Therefore, the text content of a post is usually included in something like the following HTML segment (especially if you consider that numerous blogs and websites live on platforms like Blogger or WordPress.com and share a similar layout):&lt;br /&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;div style=&quot;text-align: left;&quot;&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;lt;div class=&quot;&lt;span style=&quot;color: red;&quot;&gt;post-body entry-content&lt;/span&gt;&quot; ...&amp;gt;&lt;/span&gt;&lt;span style=&quot;color: blue; font-family: Courier New, Courier, monospace;&quot;&gt;the content we want&lt;/span&gt;&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; DEiXTo tree rules are more suitable and efficient when there are multiple structure-rich record occurrences on a page. So, if you need just a specific div element, it&#39;s better to stick with an XPath expression. It&#39;s pretty simple and it works. You could then post-process the scraped data and further utilise it with other techniques, e.g. using regular expressions on the inner text to identify dates, places or other pieces of interest, or parsing the outer HTML code of the selected element with a specialised tool looking for interesting stuff. So, instead of creating a rule with DEiXTo for the case described above, we could just use the XPath selector //div[@class=&quot;post-body entry-content&quot;] to select the proper element and access its contents.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
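To give a quick taste of such a selector in action, here is a self-contained sketch using only Python's standard library (which supports a limited XPath subset; for real, possibly malformed pages you would more likely reach for lxml.html or a similar HTML parser):

```python
import xml.etree.ElementTree as ET

# A toy, well-formed page fragment mimicking the segment shown above.
html = """
<html><body>
  <div class="header">menu</div>
  <div class="post-body entry-content">the content we want</div>
</body></html>
"""

root = ET.fromstring(html.strip())
# '//' (here './/') searches anywhere in the tree; the predicate
# matches the element whose class attribute has the given value.
post = root.find(".//div[@class='post-body entry-content']")
print(post.text)  # the content we want
```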
&lt;div&gt;
&lt;span style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; We have actually used this simple but effective technique repeatedly for &lt;a href=&quot;http://deixto.blogspot.gr/2011/11/myvisitplanner.html&quot; target=&quot;_blank&quot;&gt;myVisitPlanner&lt;/a&gt;, &lt;/span&gt;a project funded by the Greek Ministry of Education that aims at creating a personalised system for planning cultural itineraries. The main content of event pages (related to music, theatre, festivals, exhibitions, etc.) is systematically extracted from a wide variety of local websites (most lacking RSS feeds and APIs) in order to automatically monitor and aggregate event information.&amp;nbsp;&lt;span style=&quot;text-align: justify;&quot;&gt;We could show you some code to demonstrate how to scrape a site with XPath, but instead we would like to cite an amazing blog dedicated to web scraping which gives a nice code example of using XPath in screen scraping: &lt;/span&gt;&lt;a href=&quot;http://scraping.pro/python-lxml-scrape-online-dictionary/&quot; style=&quot;text-align: justify;&quot; target=&quot;_blank&quot;&gt;extract-web-data.com&lt;/a&gt;&lt;span style=&quot;text-align: justify;&quot;&gt;. It provides a lot of information about web data extraction techniques and covers plenty of relevant tools. It&#39;s a nice, thorough and well-written read.&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://scraping.pro/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;32&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8uPGHsOmJ1vztAB8ghw7SHOWwtKDJRs2cb4gAGcrkwus2PB6H3DIcfoX3QSvcxwZ0hYPrf4r0EGydRYD5FLhkhnXlnF40Gkv5pLx05pX_xMS23Wr0MUi5l_Z1Jk7T2vNDETEmAeaQiTv9/s320/extract-web-data.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Anyway, if you need some web data, whether for personal use or because your boss asked for it, why not consider using &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; or one of the other remarkable software tools out there? The &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;use case scenarios&lt;/a&gt; are limitless and we are sure you could come up with a useful and interesting one.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/09/using-xpath-for-web-scraping.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8uPGHsOmJ1vztAB8ghw7SHOWwtKDJRs2cb4gAGcrkwus2PB6H3DIcfoX3QSvcxwZ0hYPrf4r0EGydRYD5FLhkhnXlnF40Gkv5pLx05pX_xMS23Wr0MUi5l_Z1Jk7T2vNDETEmAeaQiTv9/s72-c/extract-web-data.png" height="72" width="72"/><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-4695284449217500922</guid><pubDate>Sat, 04 May 2013 10:31:00 +0000</pubDate><atom:updated>2013-05-06T13:28:40.724+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">FLOSS</category><category domain="http://www.blogger.com/atom/ns#">JSON</category><category domain="http://www.blogger.com/atom/ns#">Open Source</category><category domain="http://www.blogger.com/atom/ns#">Podcasts</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>Creating a complete list of FLOSS Weekly podcast episodes</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
It was not until recently that I discovered and started subscribing to &lt;a href=&quot;http://en.wikipedia.org/wiki/Podcast&quot; target=&quot;_blank&quot;&gt;podcasts&lt;/a&gt;. I wish I had done so earlier, but a lack of available time (mostly) kept me away from them, although we should always try to find time to learn and explore new things and technologies. So, I was very excited when I ran across &lt;a href=&quot;http://twit.tv/show/floss-weekly&quot; target=&quot;_blank&quot;&gt;FLOSS Weekly&lt;/a&gt;, a popular Free/Libre Open Source Software (&lt;a href=&quot;http://en.wikipedia.org/wiki/Free_and_open-source_software&quot; target=&quot;_blank&quot;&gt;FLOSS&lt;/a&gt;) themed podcast from the &lt;a href=&quot;http://twit.tv/&quot; target=&quot;_blank&quot;&gt;TWiT&lt;/a&gt; Network. Currently, the lead host is &lt;a href=&quot;http://en.wikipedia.org/wiki/Randal_Schwartz&quot; target=&quot;_blank&quot;&gt;Randal Schwartz&lt;/a&gt;, a renowned Perl hacker and programming consultant. As a Perl developer myself, it goes without saying that I greatly admire and respect him. FLOSS Weekly debuted back in April 2006 and, as of the 4th of May 2013, it features 250 episodes! That&#39;s a lot of episodes and lots of great stuff to explore.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://twit.tv/show/floss-weekly&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;110&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEeJRN4Ro2JIU2TNUigqaKrNgNa7kRoP3ZMlF0RSsEZQYJs9ezNC0H2Pu2OhLmOv6jJXYetP8MocWDZj21dtXvJAnd2LWPNm7x8ZaLmAFtWCYFFOSm8k_CeGXudgvtXUrkWOAA-z-qBOjo/s1600/podcast_5_3.jpg&quot; width=&quot;110&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Inevitably, if you don&#39;t have the time to listen to them all and have to choose only some of them, you would need to browse through all the listing pages (each containing 7 episodes) in order to find those that interest you most. As I write this post, one would have to visit 36 pages (by repeatedly clicking on the NEXT page link) to get a complete picture of all the subjects discussed. Consequently, it&#39;s not that easy to quickly locate the ones you find most interesting and compile a To-Listen (or To-Watch, if you prefer video) list. I am not 100% sure there is no such thing available on the twit.tv website, but I was not able to find a full episode list in a single place. Therefore, I thought that a spreadsheet (or, even better, a &lt;a href=&quot;http://www.json.org/&quot; target=&quot;_blank&quot;&gt;JSON&lt;/a&gt; document) containing the basic info for each episode (title, date, link and description) would come in handy.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://twit.tv/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;125&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHxYmIY0BMTQrRNLlU_iyENtxfEL6kNqvSCS8bHPPHGqGy8h4anwmDhxTMaAvv7y0Bfs8BTJJzPl6w7itdhuyWW-u_umtmT1W5rftOTEqwp4coCy7jfKek2wznkniHqlJs7JpijetBC7cZ/s200/TWiT.png&quot; width=&quot;110&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Hence, I utilised my beloved home-made scraping tool, &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, to extract the episode metadata so that one can have a convenient, compact view of all available topics and decide more easily which ones to pick. It was really simple to build a wrapper for this task and in a few minutes I had the data at hand (in a tab-delimited text file). It was then straightforward to import it into an Excel spreadsheet (you can &lt;a href=&quot;http://deixto.com/wp-content/uploads/floss_weekly_all_episodes_2013-05-04.xls&quot; target=&quot;_blank&quot;&gt;download it here&lt;/a&gt;). Moreover, with a few lines of Perl code the scraped data was transformed into a &lt;a href=&quot;http://deixto.com/wp-content/uploads/floss_weekly_all_episodes_2013-05-04.json&quot; target=&quot;_blank&quot;&gt;JSON file&lt;/a&gt; (with all the advantages this brings) suitable for further use.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Check &lt;a href=&quot;http://twit.tv/show/floss-weekly&quot; target=&quot;_blank&quot;&gt;FLOSS Weekly&lt;/a&gt; out! You might find several great episodes that could enlighten you and bring to your attention amazing tools and technologies. As a free software supporter, I highly recommend it (despite discovering it a few years late; hopefully it&#39;s never too late).&lt;/div&gt;
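The tab-delimited-to-JSON step mentioned above was done in Perl; an equivalent few lines in Python might look like the following (the field names match the title/date/link/description metadata described in the post, and the sample row is made up for illustration):

```python
import csv
import io
import json

# A tiny stand-in for the tab-delimited file produced by the wrapper.
tsv = (
    "title\tdate\tlink\tdescription\n"
    "Episode 250\t2013-05-01\thttp://twit.tv/floss250\tGreat stuff\n"
)

# DictReader turns each tab-delimited line into a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
doc = json.dumps(rows, indent=2)
```

In practice you would read from the output file instead of an in-memory string and write `doc` to disk.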
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/05/list-of-floss-weekly-podcast-episodes.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEeJRN4Ro2JIU2TNUigqaKrNgNa7kRoP3ZMlF0RSsEZQYJs9ezNC0H2Pu2OhLmOv6jJXYetP8MocWDZj21dtXvJAnd2LWPNm7x8ZaLmAFtWCYFFOSm8k_CeGXudgvtXUrkWOAA-z-qBOjo/s72-c/podcast_5_3.jpg" height="72" width="72"/><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-8735077798606596858</guid><pubDate>Thu, 02 May 2013 22:37:00 +0000</pubDate><atom:updated>2013-05-08T09:22:20.605+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Open data</category><category domain="http://www.blogger.com/atom/ns#">Parliament</category><category domain="http://www.blogger.com/atom/ns#">ScraperWiki</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>Scraping the members of the Greek Parliament</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &lt;a href=&quot;http://en.wikipedia.org/wiki/Hellenic_Parliament&quot; target=&quot;_blank&quot;&gt;Hellenic Parliament&lt;/a&gt; is the supreme democratic institution that represents Greek citizens through an elected body of Members of Parliament (MPs). It is a legislature of 300 members, elected for a four-year term, that submits bills and amendments.&amp;nbsp;Its website, &lt;a href=&quot;http://www.hellenicparliament.gr/&quot; target=&quot;_blank&quot;&gt;www.hellenicparliament.gr&lt;/a&gt;, has a lot of interesting data on it that could potentially be useful for mere citizens, certain types of professionals like journalists and lawyers, the media as well as businesses.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.hellenicparliament.gr/en/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHc46uuWMFXqREdmrte4eSzD3UTapEjIsbmZiNK2i73o-R1ZQ64M_o3DDV4AYivljnT2Ch2lIQOaft4NcTbe-1PXznz8YsMFzmfSIuSj2guLniKqfsaYlN3kGQJWW5mrrrBD3-FHV88k6Q/s1600/logo_en.gif&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Inspired by existing scrapers for many Parliaments of the world, like these on &lt;a href=&quot;https://scraperwiki.com/tags/parliament&quot; target=&quot;_blank&quot;&gt;ScraperWiki&lt;/a&gt;, an amazing web-based scraping platform, we decided to write a simple, though efficient, DEiXToBot-based script that gathers information (such as the full name, constituency and contact details) from the &lt;a href=&quot;http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia/&quot; target=&quot;_blank&quot;&gt;CVs pages&lt;/a&gt; of Greek MPs and exports it (after some post-processing, e.g. deducing the party to which each MP belongs from the logo in the party column) to a tab-delimited text file that can then easily be imported into an &lt;a href=&quot;http://en.wikipedia.org/wiki/OpenDocument&quot; target=&quot;_blank&quot;&gt;ODF&lt;/a&gt; spreadsheet or a database. The script uses a tree &lt;a href=&quot;http://deixto.com/wp-content/uploads/parliament_CVs.xml&quot; target=&quot;_blank&quot;&gt;pattern&lt;/a&gt;, previously built with the GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; tool, to identify the data of interest and visits all 30 target pages (each containing ten records) by utilizing the &lt;i&gt;pageNo&lt;/i&gt; URL parameter. It should also be noted that we used &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;, our favorite browser automation tool, for this purpose.
The results of the script&#39;s execution can be found in this &lt;a href=&quot;http://deixto.com/wp-content/uploads/Parliament_Members.ods&quot; target=&quot;_blank&quot;&gt;.ods file&lt;/a&gt;.&amp;nbsp;&lt;span style=&quot;background-color: white; color: #222222;&quot;&gt;In case you would like to take a look at the Perl code that got the job done, you can &lt;a href=&quot;http://deixto.com/wp-content/uploads/parliament.pl&quot; target=&quot;_blank&quot;&gt;download it here&lt;/a&gt;.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://en.wikipedia.org/wiki/Open_data&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;100&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEie-uL6d72y3Rp2xx7_bIW0epnpqENTTrPwBPZ5rdJ0OFyZXMZUhnYQpXkvBivR0LBYQfQh7_q89XKf4XPVzoDzsZmTkeh6wBT-xW-RZAiEmcrdFP-X6TJRPbryu6o-G4DBaTBS-lc0Wjvd/s200/rdf_open_data.png&quot; width=&quot;91&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
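The pagination step described above (the actual script is in Perl; this is a hedged Python sketch of the same idea, with the URL pattern assumed from the post) simply enumerates the listing pages via the pageNo parameter:

```python
# Hypothetical sketch: build the 30 paginated listing URLs that the
# scraper visits, ten MP records per page, using the pageNo parameter.
BASE = "http://www.hellenicparliament.gr/en/Vouleftes/Viografika-Stoicheia/"

def listing_urls(pages: int = 30) -> list[str]:
    return [f"{BASE}?pageNo={n}" for n in range(1, pages + 1)]

urls = listing_urls()
```

Each URL would then be fetched (with Selenium, in our case) and the tree pattern applied to the resulting page.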
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_data&quot; target=&quot;_blank&quot;&gt;Open data&lt;/a&gt; — data that is free for use, reuse, and redistribution — is a goldmine that can stimulate innovative ways to discover knowledge and analyze the rich data sets available on the World Wide Web. Scraping is an invaluable tool that can help in this direction and serve transparency and openness. There is currently a wide variety of remarkable web data extraction tools (quite a few of them free). Perhaps you would like to give &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; a try and start building your own web robots to get the data you need and transform it into a suitable format for further use.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; In conclusion, scraping has numerous &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;uses and applications&lt;/a&gt;&amp;nbsp;and&amp;nbsp;there is a high chance you could come up with an interesting and creative use case scenario tailored to your requirements. So, if you need any help with DEiXTo or have any inquiries, please do not hesitate to &lt;a href=&quot;http://deixto.com/contact/&quot; target=&quot;_blank&quot;&gt;contact us&lt;/a&gt;!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/05/scraping-members-of-greek-parliament.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHc46uuWMFXqREdmrte4eSzD3UTapEjIsbmZiNK2i73o-R1ZQ64M_o3DDV4AYivljnT2Ch2lIQOaft4NcTbe-1PXznz8YsMFzmfSIuSj2guLniKqfsaYlN3kGQJWW5mrrrBD3-FHV88k6Q/s72-c/logo_en.gif" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-5626603475610369990</guid><pubDate>Tue, 16 Apr 2013 07:58:00 +0000</pubDate><atom:updated>2013-05-06T17:10:08.222+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data transformations</category><category domain="http://www.blogger.com/atom/ns#">Data visualization</category><category domain="http://www.blogger.com/atom/ns#">Google Charts</category><category domain="http://www.blogger.com/atom/ns#">Διαύγεια</category><category domain="http://www.blogger.com/atom/ns#">ΥπερΔιαύγεια</category><title>Visualizing Clarity document categories in a pie chart</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The &quot;&lt;a href=&quot;http://diavgeia.gov.gr/en&quot; target=&quot;_blank&quot;&gt;Cl@rity&lt;/a&gt;&quot; program of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Greece&quot; target=&quot;_blank&quot;&gt;Hellenic Republic&lt;/a&gt; offers a wealth of data about the decisions and expenditure of all Greek ministries and their organizations. It has been operating for more than two years now and is a great source of public data waiting for all of us to explore. However, it has faced a lot of technical problems over the last year because of the large number of documents uploaded daily and the heavy data management cost. Unfortunately, its frontend and &lt;a href=&quot;http://et.diavgeia.gov.gr/f/ihu/search/index.php&quot; target=&quot;_blank&quot;&gt;search functionality&lt;/a&gt; are not working most of the time. Thankfully, a private initiative, &lt;a href=&quot;http://yperdiavgeia.gr/&quot; target=&quot;_blank&quot;&gt;UltraCl@rity&lt;/a&gt;, has emerged in the meantime to offer a great alternative for searching the digitally signed public PDF documents and their metadata, filling the gap left by the Greek government.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://yperdiavgeia.gr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;47&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7fO3357a4D6In_NeaYuUcElrFoNEev0IsoB1ugqY53eEJXOstGFBroFppVkdj6AcOsj8F76E_rgF9lleFywQ5p5Eba4qySEfxeYsMryR58gTKn4l-YxjcAq9eCXfyDJDxL46fjiZiFaZm/s320/ultraclarity-logo380x75v2.png&quot; width=&quot;240&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; As you probably already know, we focus on &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;web scraping&lt;/a&gt; and the utilization of the extracted information. One of the best ways to exploit data you might have gathered with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (or another web data extraction tool) is to present it in a comprehensive chart. Hence, we thought it might be interesting to collect the subject categories of the documents published on Cl@rity by a big educational institution like the &lt;a href=&quot;http://www.aueb.gr/index_en.php&quot; target=&quot;_blank&quot;&gt;Athens University of Economics and Business&lt;/a&gt; (AUEB) and create a handy &lt;a href=&quot;http://en.wikipedia.org/wiki/Pie_chart&quot; target=&quot;_blank&quot;&gt;pie chart&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; The page &lt;a href=&quot;http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes&quot; target=&quot;_blank&quot;&gt;http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes&lt;/a&gt; provides a convenient categorization of AUEB&#39;s decisions. Therefore, with &lt;a href=&quot;http://deixto.com/wp-content/uploads/aueb_diavgeia_categories.wpf&quot; target=&quot;_blank&quot;&gt;a simple pattern&lt;/a&gt; (extraction rule), created with GUI DEiXTo, we captured the categories and their document counts. It was then quite easy and straightforward to programmatically transform the output data (as of 16th of April 2013) into an interactive &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/gallery/piechart&quot; target=&quot;_blank&quot;&gt;Google pie chart&lt;/a&gt; of the most popular categories using the amazing &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/index&quot; target=&quot;_blank&quot;&gt;Google Chart Tools&lt;/a&gt;. So, here it is:&lt;/div&gt;
&lt;script src=&quot;https://www.google.com/jsapi&quot; type=&quot;text/javascript&quot;&gt;&lt;/script&gt;
    &lt;script type=&quot;text/javascript&quot;&gt;
      google.load(&quot;visualization&quot;, &quot;1&quot;, {packages:[&quot;corechart&quot;]});
      google.setOnLoadCallback(drawChart);
      function drawChart() {
        var data = google.visualization.arrayToDataTable([
          [&#39;Category&#39;, &#39;Number of decisions&#39;],
          [&#39;Εκπαίδευση και Έρευνα&#39;, 20439],
          [&#39;Προϋπολογισμός&#39;, 2991],
          [&#39;Ταξίδια&#39;, 2366],
          [&#39;Υπηρεσίες&#39;, 2210],
          [&#39;Υποτροφίες&#39;, 1728],
          [&#39;Προμήθειες&#39;, 1471],
          [&#39;Τρόφιμα και ποτά&#39;, 1016],
          [&#39;Μηχανήματα&#39;, 582],
          [&#39;Τηλεπικοινωνιακά τέλη&#39;, 566],
          [&#39;Ταχυδρομικές Υπηρεσίες&#39;, 352],
          [&#39;Other&#39;, 1818]
        ]);
        var options = {
          &#39;width&#39;: 640,
          &#39;height&#39;: 290,
          &#39;chartArea&#39;: {top:5,left: 60,bottom:5,width:&quot;75%&quot;,height:&quot;90%&quot;},
          &#39;title&#39;: &#39;Clarity Document categorization for AUEB - 16-04-2013&#39;
        };
        var chart = new google.visualization.PieChart(document.getElementById(&#39;chart_div&#39;));
        chart.draw(data, options);
      }
    &lt;/script&gt;
&lt;br /&gt;
&lt;div id=&quot;chart_div&quot;&gt;
&lt;/div&gt;
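The step from scraped output to the chart's data rows takes only a few lines of code. Here is a minimal sketch in Python; the input format (tab-separated category/count lines) and the function name are our own assumptions for illustration:

```python
# Turn tab-separated scraper output ("category<TAB>count" lines) into the
# JavaScript rows expected by google.visualization.arrayToDataTable,
# keeping the most popular categories and folding the tail into "Other".
def to_datatable_rows(tsv_text, top_n=10):
    rows = []
    for line in tsv_text.strip().splitlines():
        category, count = line.split("\t")
        rows.append((category, int(count)))
    rows.sort(key=lambda r: r[1], reverse=True)   # most popular first
    top = rows[:top_n]
    other = sum(c for _, c in rows[top_n:])       # sum of the remaining categories
    if other:
        top.append(("Other", other))
    return ",\n".join("['%s', %d]" % r for r in top)

print(to_datatable_rows("A\t30\nB\t20\nC\t5\nD\t3", top_n=2))
```

The returned string can be pasted directly into the `arrayToDataTable` call of the chart script.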
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; By the way,&amp;nbsp;&lt;a href=&quot;http://publicspending.medialab.ntua.gr/en/home&quot; target=&quot;_blank&quot;&gt;publicspending.gr&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;http://greekspending.com/&quot; target=&quot;_blank&quot;&gt;greekspending.com&lt;/a&gt;&amp;nbsp;are truly remarkable research efforts aimed at visualizing public expenditure data from the Cl@rity project in user-friendly diagrams and charts. Of course, the DEiXTo-based scenario described above is just a simple scraping example. What we would like to point out is that this kind of data transformation can have innovative practical applications and power useful web-based services. In conclusion, Cl@rity (or &quot;&lt;a href=&quot;http://diavgeia.gov.gr/&quot; target=&quot;_blank&quot;&gt;Διαύγεια&lt;/a&gt;&quot;, as it is known in Greek) can be a goldmine: it can spark new innovations and allow citizens, and developers in particular, to dig into open data creatively, in favor of the transparency of public life.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/04/visualizing-clarity-document-categories-pie-chart.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7fO3357a4D6In_NeaYuUcElrFoNEev0IsoB1ugqY53eEJXOstGFBroFppVkdj6AcOsj8F76E_rgF9lleFywQ5p5Eba4qySEfxeYsMryR58gTKn4l-YxjcAq9eCXfyDJDxL46fjiZiFaZm/s72-c/ultraclarity-logo380x75v2.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6015457618295198632</guid><pubDate>Sat, 13 Apr 2013 20:05:00 +0000</pubDate><atom:updated>2013-04-16T11:02:46.238+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data transformations</category><category domain="http://www.blogger.com/atom/ns#">Data visualization</category><category domain="http://www.blogger.com/atom/ns#">Google Charts</category><category domain="http://www.blogger.com/atom/ns#">Price monitoring</category><title>Fuel price monitoring &amp; data visualization</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Recently, we stumbled upon a very useful public, web-based service, the &lt;a href=&quot;http://www.fuelprices.gr/&quot; target=&quot;_blank&quot;&gt;Greek Fuel Prices Observatory&lt;/a&gt;&amp;nbsp;(&quot;Παρατηρητήριο Τιμών Υγρών Καυσίμων&quot; in Greek). Its main objective is to allow consumers to find out the prices of liquid fuels per type and &lt;a href=&quot;http://www.fuelprices.gr/GetGeography&quot; target=&quot;_blank&quot;&gt;geographic region&lt;/a&gt;. With a wealth of fuel-related information at one&#39;s disposal, one could build innovative services (e.g. taking advantage of the &lt;a href=&quot;http://deixto.blogspot.gr/2012/01/geo-location-data-yahoo-placefinder.html&quot; target=&quot;_blank&quot;&gt;geo-location&lt;/a&gt; of gas stations), find interesting stats or create meaningful charts.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.fuelprices.gr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;53&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzSB0a2ExzdxVizvCzXFOLn0ADJuhWSWBGzh-IqHvYOKanAr28oB8caMPkgVzMyo9SWUJ6XuQj9ONMCd3npvA-zTpVOA5Z0t2eIktO-KgRBRbpvE89xgKR2XXih3VzJbicXryAtVRkeLBH/s200/sms.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;span id=&quot;goog_1120806205&quot;&gt;&lt;/span&gt;&lt;span id=&quot;goog_1120806206&quot;&gt;&lt;/span&gt;&lt;a href=&quot;http://www.blogger.com/&quot;&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; One of the site&#39;s most interesting pages is the one listing the min, max and mean prices over the last 3 months:&amp;nbsp;&lt;a href=&quot;http://www.fuelprices.gr/price_stats_ng.view?time=1&amp;amp;prodclass=1&quot; target=&quot;_blank&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;http://www.fuelprices.gr/price_stats_ng.view?time=1&amp;amp;prodclass=1&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
However, the hyperlink at the bottom right corner of the page (with the text &quot;Γραφήματος&quot;), which is supposed to display a comprehensive graph, returns an HTTP Status 500 exception message instead (at least as of 13th of April 2013). So, we could not resist scraping the data from the table with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; and, after some post-processing, presenting it nicely with a &lt;a href=&quot;https://google-developers.appspot.com/chart/interactive/docs/gallery/linechart&quot; target=&quot;_blank&quot;&gt;Google line chart&lt;/a&gt;. We used a &lt;a href=&quot;http://en.wikipedia.org/wiki/Regular_expression&quot; target=&quot;_blank&quot;&gt;regular expression&lt;/a&gt; to isolate the date, reversed the order of the records found (so that the list is sorted chronologically, oldest first), replaced the commas in prices with dots (as a decimal mark) and wrote a short script to produce the lines needed for the &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/datatables_dataviews#arraytodatatable&quot; target=&quot;_blank&quot;&gt;arrayToDataTable&lt;/a&gt; method call of the &lt;a href=&quot;https://developers.google.com/chart/interactive/docs/reference&quot; target=&quot;_blank&quot;&gt;Google Visualization API&lt;/a&gt;. It was therefore straightforward to create the following:&lt;/div&gt;
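The post-processing steps described above can be sketched in a few lines of Python. The input layout here is an assumption (one scraped record per line, newest first, with a dd/mm/yyyy date and Greek decimal commas in the prices):

```python
import re

# Post-process scraped price records: isolate the date with a regex,
# swap decimal commas for dots, emit arrayToDataTable-style rows, and
# reverse the list so the oldest record comes first.
def records_to_rows(lines):
    rows = []
    for line in lines:
        date = re.search(r"\d{2}/\d{2}/\d{4}", line).group(0)   # isolate the date
        prices = [p.replace(",", ".") for p in re.findall(r"\d+,\d+", line)]
        rows.append("['%s', %s]" % (date, ", ".join(prices)))
    return list(reversed(rows))  # chronological order for the line chart

print(records_to_rows(["13/04/2013 1,702 1,751 1,802",
                       "12/04/2013 1,698 1,749 1,799"]))
```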
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; height=&quot;226&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi53czubZ9jh6O3y2rgCDWoLqyZA2nglKrZWsBp4OFCr2GfT28TuOJkwgJeuuDUMsUwnB46k008QW4UmpFbwwjgLWFyOJ2ZdQz4K7i-ORsQRnNZ60DQ32e-EZjto_5pZb-NgHahpR4G2n7z/s400/Screen+Shot+2013-04-13+at+10.29.23+PM.png&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Generally, there are various remarkable data visualization tools out there (&lt;a href=&quot;https://developers.google.com/chart/&quot; target=&quot;_blank&quot;&gt;Google Charts&lt;/a&gt; being one of the best, of course) but we will not elaborate further on them now. Nevertheless, we would like to emphasize that once you have rich and useful web data in hand, you can exploit it in a wide variety of ways and come up with smart methods to analyze, use and present it. Your imagination is the only limit (along with the &lt;a href=&quot;http://deixto.blogspot.gr/2011/12/robotstxt-access-restrictions.html&quot; target=&quot;_blank&quot;&gt;copyright restrictions&lt;/a&gt;).&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/04/fuel-price-monitoring-data-visualization.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzSB0a2ExzdxVizvCzXFOLn0ADJuhWSWBGzh-IqHvYOKanAr28oB8caMPkgVzMyo9SWUJ6XuQj9ONMCd3npvA-zTpVOA5Z0t2eIktO-KgRBRbpvE89xgKR2XXih3VzJbicXryAtVRkeLBH/s72-c/sms.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-238510656511322230</guid><pubDate>Fri, 12 Apr 2013 06:34:00 +0000</pubDate><atom:updated>2013-04-12T21:07:29.455+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data migration</category><category domain="http://www.blogger.com/atom/ns#">e-commerce</category><title>Data migration through browser automation</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
As we have already mentioned quite a few times, browser automation can have a lot of practical applications ranging from testing of web applications to web-based administration tasks and web scraping. The latter (scraping) is our field of expertise and &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt; is our tool of choice when it comes to automated browser interaction and dealing with complex, JavaScript-rich pages.&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; A very interesting scenario (among others) for combining our beloved web scraping tool,&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, with Selenium is data migration. Imagine, for example, that you have an &lt;a href=&quot;http://www.oscommerce.com/&quot; target=&quot;_blank&quot;&gt;osCommerce&lt;/a&gt; online store and you would like to migrate it to a &lt;a href=&quot;http://www.joomla.org/&quot; target=&quot;_blank&quot;&gt;Joomla &lt;/a&gt;&lt;a href=&quot;http://virtuemart.net/&quot; target=&quot;_blank&quot;&gt;VirtueMart&lt;/a&gt; e-commerce system. Wouldn&#39;t it be great if you could scrape the product details from the old online catalogue with DEiXTo and then automate the data entry labor via Selenium? Once we have the data at hand in a suitable format, e.g. comma- or tab-delimited or XML, we could write a script that repeatedly visits the data entry form (in the administration environment of the new e-shop), fills in the necessary fields and submits it (once for each product) so as to automatically insert all the products into the new website.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.oscommerce.com/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;35&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibKZx-pkbpwM0F5CJZsMwq2pTf-1bN4J28qdQZF_S8nIxRKvo9BrEIB9A66O-YC6hQJTD-S5fnGAgeeF6exhrubLJBp5PX6o7VH6TBiiYVIN6JkW34F6VpZ7dTme8HfFzmX37nOUKKjIp-/s200/oscommerce.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&amp;nbsp; &amp;nbsp; This way you can save a lot of time and effort and avoid messing with complex data migration tools (useful as they are in many cases). To be clear, we don&#39;t claim that migrating databases through scraping and automated data entry is the best solution; however, it is a nice and quick alternative for many relatively simple cases. The big advantage is that you don&#39;t even need to know the underlying schemas of the two systems involved. The only prerequisite is access to the administration interface of the new system.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; By the way, below you can see a screenshot from &lt;a href=&quot;http://www.altova.com/mapforce.html&quot; target=&quot;_blank&quot;&gt;Altova MapForce&lt;/a&gt;, maybe the best (but not free) data mapping, conversion and integration software tool out there.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
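The data-entry loop sketched above is essentially this, shown here in Python. Note the hedges: `driver` stands for any Selenium-like browser object (e.g. a WebDriver instance), and the form URL and field names (taken from the CSV header, plus a `submit` button) are hypothetical and depend entirely on the target admin interface:

```python
import csv, io

# For each scraped product, open the admin entry form, fill every field
# whose name matches a CSV column, and click the (assumed) submit button.
def migrate_products(driver, csv_text, form_url):
    for product in csv.DictReader(io.StringIO(csv_text)):
        driver.get(form_url)                           # open a fresh entry form
        for field, value in product.items():
            driver.find_element("name", field).send_keys(value)
        driver.find_element("name", "submit").click()  # insert this product
```

With Selenium's Python bindings, `driver` would be e.g. `webdriver.Firefox()`; any object exposing the same `get`/`find_element` methods works for testing.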
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.altova.com/mapforce.html&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;155&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTqBg8bdrWhqyyU-mYyNETAdJt2dCmYGiZ5cSSPtuPebObfmPr99LAUCeMvlIgd2s_fr47q5_O3NV8KF4NYPupt9camqSWG815IR6fYMhAlkYJo8YYBSNLm_bFPzhyD6G_p1cEu6Ab-xWr/s320/mapforce-overview.png&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Generally speaking, the uses and applications of web data extraction are numerous. You can check out some of them &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. Perhaps you are about to think up the next one, and we would be glad to help you with the technicalities!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/04/data-migration-through-browser.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibKZx-pkbpwM0F5CJZsMwq2pTf-1bN4J28qdQZF_S8nIxRKvo9BrEIB9A66O-YC6hQJTD-S5fnGAgeeF6exhrubLJBp5PX6o7VH6TBiiYVIN6JkW34F6VpZ7dTme8HfFzmX37nOUKKjIp-/s72-c/oscommerce.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6079054790029513248</guid><pubDate>Thu, 28 Mar 2013 08:40:00 +0000</pubDate><atom:updated>2013-04-03T18:18:25.914+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Browser automation</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><title>How to pass Selenium pages to DEiXToBot</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Recently we talked about &lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt; and its potential combination with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;. It is a truly remarkable browser automation tool with numerous uses and applications. For those of you wondering how to programmatically pass pages fetched with Selenium to DEiXToBot on the fly, here is a way (provided you are familiar with Perl programming):&lt;/div&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;use IO::File; use Fcntl; use POSIX qw(tmpnam); # modules needed for the temporary file below&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;# suppose that you have already fetched the target page with the WWW::Selenium object ($sel variable)&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $content = $sel-&amp;gt;get_html_source(); # get the page source code&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my ($fh,$name); # create a temporary file containing the page&#39;s source code&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;do { $name = tmpnam() } until $fh = IO::File-&amp;gt;new($name, O_RDWR|O_CREAT|O_EXCL);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;print $fh $content;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;close $fh;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;get(&quot;file://$name&quot;); # load the temporary file/page with the DEiXToBot agent using the file:// scheme&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;unlink $name; # delete the temporary file, it is not needed any more&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;if (! $agent-&amp;gt;success) { die &quot;Could not fetch the temp file!&quot;; }&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;load_pattern(&#39;pattern.xml&#39;); # load the pattern built with the GUI tool&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;build_dom(); # build the DOM tree of the page&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$agent-&amp;gt;extract_content(); # apply the pattern&amp;nbsp;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my @records = @{$agent-&amp;gt;records};&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;for my $record (@records) { # loop through the data/ records scraped&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;....&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Therefore, you can create temporary HTML files, in real time, containing the source code of the target pages (after the&amp;nbsp;&lt;a href=&quot;http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm&quot; target=&quot;_blank&quot;&gt;WWW::Selenium&lt;/a&gt;&amp;nbsp;object gets these pages) and pass them to the DEiXToBot agent to do the scraping job. Another interesting scenario is to download the pages locally with Selenium and then read/ scrape them directly from the disk at a later stage. We hope the above snippet helps. Please do not hesitate to contact us for any questions or feedback!&lt;/div&gt;
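The same temp-file handoff can be sketched with Python's standard library, for readers who prefer it to Perl: persist a fetched page's source as a temporary HTML file so another tool can re-read it through the file:// scheme. The function name is our own:

```python
import os, tempfile

# Write a page's HTML source to a temporary file and return a file://
# URL that a scraping agent can load instead of the live page.
def save_page_source(html):
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(html)              # persist the page source to disk
    return "file://" + path         # hand this URL to the scraping agent
```

As in the Perl snippet, the temporary file should be deleted once the agent has consumed it.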
&lt;br /&gt;&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/03/how-to-pass-selenium-pages-to-deixtobot.html</link><author>noreply@blogger.com (kntonas)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-1011149959411420819</guid><pubDate>Sat, 02 Feb 2013 10:19:00 +0000</pubDate><atom:updated>2013-02-06T09:03:34.061+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Archivability</category><category domain="http://www.blogger.com/atom/ns#">Digital preservation</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><category domain="http://www.blogger.com/atom/ns#">Web archiving</category><title>Digital Preservation and ArchiveReady</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Although our blog&#39;s main focus is scraping data from web information sources (especially via &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;), we are also very interested in services and applications that can be built on top of agents and crawlers. Our favorite tools for programmatic web browsing are &lt;a href=&quot;http://search.cpan.org/~jesse/WWW-Mechanize/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt; and &lt;a href=&quot;http://seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;. The first is a handy Perl module (though it lacks JavaScript support), whereas the latter is a great browser automation tool that we have been using more and more lately in a variety of cases. Through them we can simulate whatever a user can do in a browser window and automate the interaction with pages of interest.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp;Traversing a website is one of the most basic and common tasks for a developer of web robots. However, the methodologies and mechanisms deployed can vary a lot. So, we tried to think of a meaningful crawler-based scenario that would blend various &quot;tasty&quot; ingredients into a nice story. Hopefully we did; our post rests on four major pillars, which we highlight and discuss below:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Crawling (through Selenium)&lt;/li&gt;
&lt;li&gt;Sitemaps&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;text-align: left;&quot;&gt;Archivability&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;text-align: left;&quot;&gt;Scraping (in the demo that follows we download reports from a target website)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; An interesting topic we recently stumbled upon is &lt;a href=&quot;http://en.wikipedia.org/wiki/Digital_preservation&quot; target=&quot;_blank&quot;&gt;digital preservation&lt;/a&gt;, which can be viewed as the set of policies and strategies necessary to ensure continued access to digital content over time, despite the challenges of media failure and technological change. In this context, we discovered &lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;ArchiveReady&lt;/a&gt;, a remarkable web application that checks whether a website is easily archivable; that is, it scans a page and checks whether it is suitable for &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_archiving&quot; target=&quot;_blank&quot;&gt;web archiving&lt;/a&gt; projects (such as the &lt;a href=&quot;http://archive.org/&quot; target=&quot;_blank&quot;&gt;Internet Archive&lt;/a&gt; and &lt;a href=&quot;http://blogforever.eu/&quot; target=&quot;_blank&quot;&gt;BlogForever&lt;/a&gt;) to access and preserve. However, you can only pass one web page at a time to its checker (not an entire website), and the check might take a while, depending on the complexity and size of the page. Therefore, for those interested in testing multiple pages, we wrote a small script that parses the XML sitemap of a target site, checks each of the URLs contained in it against the ArchiveReady service and downloads the results.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;58&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL_jL6uOkuiBoGERa-2QTq9K5XnEAXRaYwiwmN2uYif1iBozIYOfZVXB8hvIzeXeNsVOe8tb6wHXobltN7SxJMSGFjeX8bNdGJmnXMCQMCQtUSe_qqU2K5wmsUSHRxPpCMDNc5VN4DOr_K/s200/archiveready_logo.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.sitemaps.org/&quot; target=&quot;_blank&quot;&gt;Sitemaps&lt;/a&gt;, as you probably already know,&amp;nbsp;are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a sitemap is an XML file that lists URLs for a site along with some additional metadata about each URL so that search engines can more intelligently crawl the site. Typically sitemaps are auto-generated by plugins.&lt;/div&gt;
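Reading the URLs out of such a sitemap takes only a few lines; here is a minimal sketch in Python (the Perl script later in this post does the same job with XML::LibXML):

```python
import xml.etree.ElementTree as ET

# Minimal sitemap reader: collect the text of every <loc> element,
# ignoring the sitemap namespace prefix on the element tags.
def sitemap_urls(xml_text):
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter() if el.tag.endswith("loc")]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://openarchives.gr/</loc></url>
  <url><loc>http://openarchives.gr/about</loc></url>
</urlset>"""
print(sitemap_urls(sample))
```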
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp;OK, enough with the talking. Let&#39;s get to work and write some code! The Perl modules we utilised for our purposes were &lt;a href=&quot;http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm&quot; target=&quot;_blank&quot;&gt;WWW::Selenium&lt;/a&gt; and &lt;a href=&quot;http://search.cpan.org/~shlomif/XML-LibXML/LibXML.pod&quot; target=&quot;_blank&quot;&gt;XML::LibXML&lt;/a&gt;. The activities we had to do were the following:&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;launch a Firefox instance&amp;nbsp;&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;read the sitemap document of a sample target website (we chose openarchives.gr)&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;pass each of its URLs to the ArchiveReady validation engine and finally&lt;/li&gt;
&lt;li style=&quot;text-align: justify;&quot;&gt;download locally the results returned in Evaluation and Report Language (&lt;a href=&quot;http://www.w3.org/TR/EARL10-Schema&quot; target=&quot;_blank&quot;&gt;EARL&lt;/a&gt;) format, since ArchiveReady offers this option&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So, here is the code (just note that we wait till the validation page contains 8 &quot;Checking complete&quot; messages, one for each section, to determine whether the processing has finished):&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;use WWW::Selenium;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;use XML::LibXML;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $parser = XML::LibXML-&amp;gt;new();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $dom = $parser-&amp;gt;parse_file(&#39;http://openarchives.gr/sitemap.xml&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my @loc_elms = $dom-&amp;gt;getElementsByTagName(&#39;loc&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my @urls;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;for my $loc (@&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: xx-small;&quot;&gt;loc_elms&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: xx-small;&quot;&gt;) {&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; push @urls,$loc-&amp;gt;textContent;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;}&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;# launch a Firefox instance&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;my $sel = WWW::Selenium-&amp;gt;new( host =&amp;gt; &quot;localhost&quot;,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; port =&amp;gt; 4444,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; browser =&amp;gt; &quot;*firefox&quot;,&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; browser_url =&amp;gt; &quot;http://archiveready.com/&quot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; );&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$sel-&amp;gt;start;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;# parse through the pages contained in the sitemap&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;for my $u (@urls) {&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $sel-&amp;gt;open(&quot;http://archiveready.com/check?url=$u&quot;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; my $content = $sel-&amp;gt;get_html_source();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; while ( (() = $content =~ /Checking complete/g) != 8) { # check if complete&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sleep(1);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; $content = $sel-&amp;gt;get_html_source();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; }&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $content=~m#href=&quot;/download-results\?test_id=(\d+)&amp;amp;amp;format=earl&quot;#;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; my $id = $1; # capture the identifier of the current validation test&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $sel-&amp;gt;click(&#39;xpath=//a[contains(@href,&quot;earl&quot;)]&#39;); # click on the&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: xx-small;&quot;&gt;EARL link&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; $sel-&amp;gt;wait_for_page_to_load(5000);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; eval { $content = $sel-&amp;gt;get_html_source(); };&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; open my $fh, &quot;&amp;gt;:utf8&quot;, &quot;download_results_$id.xml&quot; or die $!; # write the EARL report to a file&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; print $fh $content;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&amp;nbsp; &amp;nbsp; close $fh;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;}&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;$sel-&amp;gt;stop;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: xx-small;&quot;&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;
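The polling idiom in the Perl script above (fetch the page, count the &quot;Checking complete&quot; markers, sleep, retry) generalises well beyond ArchiveReady. Here is a minimal sketch of it, written in Python for brevity, with a caller-supplied fetch function and an explicit timeout; all names are ours, not part of ArchiveReady or Selenium:

```python
import time

def wait_until_complete(fetch, marker="Checking complete", expected=8,
                        interval=1.0, timeout=60.0):
    """Re-fetch the page until `marker` occurs `expected` times in it,
    then return the page source; raise TimeoutError if it never does."""
    deadline = time.monotonic() + timeout
    while True:
        content = fetch()
        if content.count(marker) == expected:
            return content
        if time.monotonic() > deadline:
            raise TimeoutError("checks did not complete in time")
        time.sleep(interval)
```

In the Selenium scenario above, fetch would simply be a closure around get_html_source().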
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; We hope you found the above helpful and the scenario described interesting. We tried to take advantage of software agents/ crawlers and use them creatively in combination with&amp;nbsp;&lt;a href=&quot;http://archiveready.com/&quot; target=&quot;_blank&quot;&gt;ArchiveReady&lt;/a&gt;,&amp;nbsp;an innovative&amp;nbsp;service that helps you strengthen your website&#39;s archivability. Finally, scraping and automated browsing can have an extremely extensive set of &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;uses and applications&lt;/a&gt;. Please check out &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, our feature-rich web data&amp;nbsp;extraction&amp;nbsp;tool, and don&#39;t&amp;nbsp;hesitate&amp;nbsp;to&amp;nbsp;contact&amp;nbsp;us! Maybe we can help you with your tedious and time-consuming web tasks and data needs!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/02/digital-preservation-and-archiveready.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL_jL6uOkuiBoGERa-2QTq9K5XnEAXRaYwiwmN2uYif1iBozIYOfZVXB8hvIzeXeNsVOe8tb6wHXobltN7SxJMSGFjeX8bNdGJmnXMCQMCQtUSe_qqU2K5wmsUSHRxPpCMDNc5VN4DOr_K/s72-c/archiveready_logo.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2612032599435081465</guid><pubDate>Thu, 24 Jan 2013 06:17:00 +0000</pubDate><atom:updated>2013-01-26T21:42:58.705+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Browser automation</category><category domain="http://www.blogger.com/atom/ns#">Sauce Labs</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><category domain="http://www.blogger.com/atom/ns#">Testing</category><title>Cloudify your browser testing (and scraping) with Sauce!</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
For quite some time now, along with our&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;&amp;nbsp;scraping software, we have been using&amp;nbsp;&lt;a href=&quot;http://seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;, which is perhaps the best web browser automation tool currently available. It&#39;s really great and has helped us a lot in a variety of web data extraction cases (we published&amp;nbsp;&lt;a href=&quot;http://deixto.blogspot.gr/2013/01/selenium-browser-automation-companion-for-deixto.html&quot; target=&quot;_blank&quot;&gt;another post&lt;/a&gt;&amp;nbsp;about it recently). We tried it locally as well as on remote&amp;nbsp;&lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt;&amp;nbsp;servers and wrote code for a couple of automated tests and scraping tasks. However, it was not that easy to set everything up and get things running; we came across various difficulties (ranging from installation problems to stability issues, e.g. sporadic timeout errors), although we were finally able to overcome most of them.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Wouldn&#39;t it be great though if there were a robust framework that would provide you with the necessary infrastructure and all possible browser/OS combinations and allow you to run your Selenium tests in the &lt;a href=&quot;http://en.wikipedia.org/wiki/Cloud_computing&quot; target=&quot;_blank&quot;&gt;cloud&lt;/a&gt;? You would not have to worry about setting a bunch of things up, installing updates, machine management, maintenance, etc. Well, there is! And it offers a whole lot more. Its name is &lt;a href=&quot;https://saucelabs.com/home&quot; target=&quot;_blank&quot;&gt;Sauce Labs&lt;/a&gt; and it provides an amazing set of tools and &lt;a href=&quot;https://saucelabs.com/features&quot; target=&quot;_blank&quot;&gt;features&lt;/a&gt;. Admittedly they have done awesome work and they bring great products to software developers. Moreover, their team seems to share some great &lt;a href=&quot;https://saucelabs.com/company/values&quot; target=&quot;_blank&quot;&gt;values&lt;/a&gt;: pursuit of excellence,&amp;nbsp;innovation and &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_source_software&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt;&amp;nbsp;culture&amp;nbsp;(among others).&lt;br /&gt;
&amp;nbsp; &amp;nbsp; They offer a variety of &lt;a href=&quot;https://saucelabs.com/pricing&quot; target=&quot;_blank&quot;&gt;pricing plans&lt;/a&gt; (a bit expensive in my opinion though), while the free account includes 100 automated code minutes for Win, Linux and Android, 40 automated code minutes for Mac and iOS, and 30 minutes of manual testing. And for those contributing to an open source project that needs testing support, &lt;a href=&quot;http://saucelabs.com/opensauce&quot; target=&quot;_blank&quot;&gt;Open Sauce Plan&lt;/a&gt;&amp;nbsp;is just for you (unlimited minutes without any cost!). Please note that the Selenium project is sponsored by Sauce Labs.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://saucelabs.com/home&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOyvWaT_zktMmRmoTDAFCRO6ZrtXQVll0WDQvZep-_248m6SXENNHFOCf1mx07JMd_tKFgExJ6XkvSehfDGFVgOe5w18PmrcCGwvt6gFcSYTiuTnhxf6ABsKxudISUDz4iQDlHwXDxIPEK/s1600/sauce-labs-logo.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp;&amp;nbsp;Being a &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; programmer, I could not resist signing up and writing some Perl code to run a test on the &lt;i&gt;ondemand.saucelabs.com&lt;/i&gt; host! I was already familiar with the&amp;nbsp;&lt;a href=&quot;http://search.cpan.org/~mattp/Test-WWW-Selenium/lib/WWW/Selenium.pm&quot; target=&quot;_blank&quot;&gt;WWW::Selenium&lt;/a&gt;&amp;nbsp;CPAN module, so it was quite easy and straightforward. It should be noted that they provide useful guidelines and various &lt;a href=&quot;https://saucelabs.com/docs/code-examples&quot; target=&quot;_blank&quot;&gt;examples&lt;/a&gt; online for multiple languages, e.g. Python, Java, PHP and others. Overall my test script worked pretty well but it was a bit slow (compared to running the same code locally). However, one could improve speed by deploying lots of processes in parallel (if the use case scenario is suitable) and by &lt;a href=&quot;http://saucelabs.com/docs/ondemand/additional-config#video&quot; target=&quot;_blank&quot;&gt;disabling video&lt;/a&gt; (the script&#39;s execution and browser activity are recorded for easier debugging). Furthermore, Sauce&#39;s big advantage is that it can operate at large scale, which is especially suited for complex cases with heavy requirements.&lt;/div&gt;
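For the record, Sauce's Selenium RC-era endpoint expected the account credentials and the desired platform packed into a JSON &quot;browser string&quot; passed where a plain browser name would normally go. A rough sketch of assembling that string, in Python for brevity; the field names are as we recall them from Sauce's old RC documentation, so double-check them against the current docs:

```python
import json

def sauce_rc_browser_spec(username, access_key, os_name="Windows 2003",
                          browser="firefox", browser_version="7"):
    """Assemble the JSON 'browser string' that Sauce OnDemand's RC-era
    endpoint took in place of a plain browser name. Field names follow
    Sauce's old Selenium RC docs; verify against your own account."""
    return json.dumps({
        "username": username,
        "access-key": access_key,
        "os": os_name,
        "browser": browser,
        "browser-version": browser_version,
    })
```

With WWW::Selenium you would then point the client at the ondemand.saucelabs.com host and hand it this string as the browser argument.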
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; &amp;nbsp;The bottom line is that the &quot;Selenium - Sauce Labs&quot; pair is remarkable and can be very useful in a wide range of cases and purposes. Sauce in particular offers developers an exciting way to cloudify and manage their automated browser testing (although we personally focus more on the scraping capabilities that these tools provide). Their combination with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; extraction patterns could definitely be fruitful and open up interesting new possibilities. In conclusion, the &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;uses and applications&lt;/a&gt; of web scraping are limitless and Selenium turns out to be a powerful tool in our quiver!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/01/cloudify-your-browser-testing-and.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOyvWaT_zktMmRmoTDAFCRO6ZrtXQVll0WDQvZep-_248m6SXENNHFOCf1mx07JMd_tKFgExJ6XkvSehfDGFVgOe5w18PmrcCGwvt6gFcSYTiuTnhxf6ABsKxudISUDz4iQDlHwXDxIPEK/s72-c/sauce-labs-logo.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6567054700607830256</guid><pubDate>Sun, 13 Jan 2013 05:07:00 +0000</pubDate><atom:updated>2013-01-24T08:15:48.283+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">OCR</category><category domain="http://www.blogger.com/atom/ns#">PDF</category><category domain="http://www.blogger.com/atom/ns#">Tesseract</category><title>Scraping PDF files</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
While puttering around on the Internet, I recently stumbled upon the website of the&amp;nbsp;&lt;a href=&quot;http://www.et.gr/&quot; target=&quot;_blank&quot;&gt;National Printing House&lt;/a&gt;&amp;nbsp;(&quot;Εθνικό Τυπογραφείο&quot; in Greek) which is the public service responsible for the dissemination of Greek law.&amp;nbsp;It publishes and distributes the government gazette and its website provides free access to all series of the Official Journal&amp;nbsp;of the Hellenic Republic&amp;nbsp;(ΦΕΚ).&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://www.et.gr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy2cBOkMvQYESxEgbvA0pvVutaPGMpvZCllX3KKw3AXuurFl290stug3eeHLuiFhhwOLndmi3WlMr4_muR9_54qj2ez7aoGhg6oCGZbt5amHYmjqEc8-6c3OdfllinQAjGw66NzTgmrZ-W/s1600/logo_etw_el.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, at its &lt;a href=&quot;http://www.et.gr/index.php?option=com_content&amp;amp;view=article&amp;amp;id=166&amp;amp;Itemid=103&amp;amp;lang=el&quot; target=&quot;_blank&quot;&gt;search page&lt;/a&gt; I noticed a &lt;a href=&quot;http://www.et.gr/index.php?option=com_wrapper&amp;amp;view=wrapper&amp;amp;Itemid=148&amp;amp;lang=el&quot; target=&quot;_blank&quot;&gt;section&lt;/a&gt; with&amp;nbsp;the most popular issues. The most-viewed&amp;nbsp;one, as of 13 Jan 2013, with 351.595 views was this:&amp;nbsp;&lt;a href=&quot;http://www.et.gr/idocs-nph/search/pdfViewerForm.html?args=5C7QrtC22wFYAFdDx4L2G3dtvSoClrL84tQ3Uej7Zml5MXD0LzQTLWPU9yLzB8V68knBzLCmTXKaO6fpVZ6Lx9hLslJUqeiQiiD930OBDBHUohi1lAlpD-vAa2f_8ua_g5tppHc83kc.&quot; target=&quot;_blank&quot;&gt;ΦΕΚ A 226 - 27.10.2011&lt;/a&gt;.&amp;nbsp;Out of curiosity I decided to download it in order to take a quick look and see what it is all about.&amp;nbsp;It was available in a PDF format and&amp;nbsp;it turned out to be an issue about the Economic Adjustment Programme&amp;nbsp;for Greece&amp;nbsp;aiming to reduce its macroeconomic and fiscal imbalances. However,&amp;nbsp;I was quite surprised to find that the text was contained in images and you could not perform any keyword search in it nor could you copy-paste its textual content! I guess because the document&#39;s pages were scanned and converted to digital images.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; This&amp;nbsp;instantly&amp;nbsp;brought to my mind once again the difficulties that PDF scraping involves. From our web scraping experience there are many cases where the data is &quot;locked&quot; in PDF files e.g. in a .pdf brochure.&amp;nbsp;Getting the data of interest out is not an easy task but quite a few tools (&lt;a href=&quot;http://www.cyberciti.biz/faq/converter-pdf-files-to-text-format-command/&quot; target=&quot;_blank&quot;&gt;pdftotext&lt;/a&gt; is one of them)&amp;nbsp;have popped up over the years&amp;nbsp;to ease the pain. One of the best tools I have encountered so far is&amp;nbsp;&lt;a href=&quot;http://code.google.com/p/tesseract-ocr/&quot; target=&quot;_blank&quot;&gt;Tesseract&lt;/a&gt;, a pretty accurate &lt;a href=&quot;http://en.wikipedia.org/wiki/Open_source_software&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/Optical_character_recognition&quot; target=&quot;_blank&quot;&gt;OCR&lt;/a&gt; engine currently maintained by Google.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;http://code.google.com/p/tesseract-ocr/&quot; target=&quot;_blank&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbZMy_ivq_1OAEKyhq0RXt4e2viAvFZ63WXU8aTuvpyr-REo_IecqrPKggpCtO8a5hIkREQG_uE1Vl5cMbtlgsTyacBPG8kMbISI3EUxImS_WwmmvkVJBLPc3ZnNLaatuUcHPja9S9Ew0D/s1600/tesseract-ocr.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, I thought it would be nice to put Tesseract into action and check its efficiency against the PDF document (that so dramatically&amp;nbsp;affects the lives of all Greeks...). It worked quite well, although not perfectly (probably because of the Greek language), and a few minutes later (and after converting the PDF to &lt;a href=&quot;http://deixto.com/wp-content/uploads/FEK_A_226-27.10.2011.tiff&quot; target=&quot;_blank&quot;&gt;a tiff image&lt;/a&gt; through &lt;a href=&quot;http://linux.about.com/library/cmd/blcmdl1_gs.htm&quot; target=&quot;_blank&quot;&gt;Ghostscript&lt;/a&gt;) I had the full text, or at least most of it, in my hands.&amp;nbsp;The output text file generated can be found&amp;nbsp;&lt;a href=&quot;http://deixto.com/wp-content/uploads/FEK_A_226-27.10.2011.txt&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&amp;nbsp;The truth is that I could not do much with it (and the austerity measures were harsh...) but at least I was happy that I was able to extract the largest part of the text.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Of course this is just an example, there are numerous PDF files out there containing rich, inaccessible data that could potentially be processed and further utilised e.g. in order to create a full text search index.&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, our beloved web data extraction tool, can scrape&amp;nbsp;only&amp;nbsp;HTML pages. It cannot deal with PDF files residing on the Web. However, we do have the tools and the knowledge to parse those as well, find bits of interest and unleash their value!&lt;/div&gt;
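The conversion described above boils down to two commands: Ghostscript rasterises the PDF into a TIFF, and Tesseract OCRs that TIFF into plain text (ell is Tesseract's language code for Greek). A small Python sketch that assembles and runs the pipeline; the file names and the 300 dpi choice are just examples, and both tools must of course be installed:

```python
import subprocess

def ocr_pdf_commands(pdf_path, out_base, dpi=300, lang="ell"):
    """Build the two commands of the pipeline: Ghostscript renders the
    PDF into a bilevel (G4) multi-page TIFF, then Tesseract OCRs that
    TIFF and writes the recognised text to out_base.txt."""
    tiff = out_base + ".tiff"
    gs_cmd = ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
              f"-r{dpi}", f"-sOutputFile={tiff}", pdf_path]
    tess_cmd = ["tesseract", tiff, out_base, "-l", lang]
    return gs_cmd, tess_cmd

def run_ocr(pdf_path, out_base):
    # Actually execute the pipeline; requires gs and tesseract on PATH.
    for cmd in ocr_pdf_commands(pdf_path, out_base):
        subprocess.run(cmd, check=True)
```

tiffg4 produces a compact black-and-white image, which is usually enough for OCR of scanned text; for grayscale scans tiffgray may give Tesseract better input.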
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/01/scraping-pdf-files.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy2cBOkMvQYESxEgbvA0pvVutaPGMpvZCllX3KKw3AXuurFl290stug3eeHLuiFhhwOLndmi3WlMr4_muR9_54qj2ez7aoGhg6oCGZbt5amHYmjqEc8-6c3OdfllinQAjGw66NzTgmrZ-W/s72-c/logo_etw_el.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6271643811269006152</guid><pubDate>Fri, 11 Jan 2013 06:43:00 +0000</pubDate><atom:updated>2013-04-18T09:27:18.334+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">AJAX</category><category domain="http://www.blogger.com/atom/ns#">Browser automation</category><category domain="http://www.blogger.com/atom/ns#">JavaScript</category><category domain="http://www.blogger.com/atom/ns#">Selenium</category><title>Selenium: a web browser automation companion for DEiXTo</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp;&lt;a href=&quot;http://seleniumhq.org/&quot; target=&quot;_blank&quot;&gt;Selenium&lt;/a&gt;&amp;nbsp;is probably the best web browser automation tool we have come across so far. Primarily it is intended for automated testing of web applications but it&#39;s certainly not limited to that; it provides a suite of &lt;a href=&quot;http://en.wikipedia.org/wiki/Free_software&quot; target=&quot;_blank&quot;&gt;free software&lt;/a&gt; tools to automate web browsers across many platforms.&amp;nbsp;The range of its use case scenarios is really wide and its usefulness is just great.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWttiqWoOX7Sa0LDUp9yojRdXvIZCQF8SfeBuq-wd8fwDOmeTtIEhPWo1xCI4qH38DTASQkGygZDTjPiHEb9yAy0Vzqzt0glaJGnkx9mFJvgNut3EfFKtEH590CU78Ia6VSVx5gULvgj8t/s1600/selenium_logo.jpeg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWttiqWoOX7Sa0LDUp9yojRdXvIZCQF8SfeBuq-wd8fwDOmeTtIEhPWo1xCI4qH38DTASQkGygZDTjPiHEb9yAy0Vzqzt0glaJGnkx9mFJvgNut3EfFKtEH590CU78Ia6VSVx5gULvgj8t/s1600/selenium_logo.jpeg&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; However, as scraping experts, we inevitably focus on using Selenium for web data extraction purposes. Its functionality-rich client API can be used to launch browser instances (e.g. Firefox processes)&amp;nbsp;and simulate, through the proper commands, almost everything a user could do on a web site/ page.&amp;nbsp;Thus, it allows you to deploy a fully-fledged web browser and overcome the difficulties that arise from heavy JavaScript/ AJAX use.&amp;nbsp;Moreover, via the virtual framebuffer X server (&lt;a href=&quot;http://www.xfree86.org/4.0.1/Xvfb.1.html&quot; target=&quot;_blank&quot;&gt;Xvfb&lt;/a&gt;), one could&amp;nbsp;automate browsers without the need for an actual display and create scripts/ services running periodically or at will on a headless server, e.g. a remote&amp;nbsp;&lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt;&amp;nbsp;machine. Therefore, Selenium could successfully be used in combination with DEiXToBot, our beloved Mechanize scraping module.&amp;nbsp;&lt;/div&gt;
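A minimal sketch of the Xvfb setup just mentioned, in Python for brevity: start a virtual framebuffer and point DISPLAY at it, so that any browser Selenium launches renders into it instead of a real screen. The display number :99 and the screen geometry are arbitrary choices of ours:

```python
import os
import subprocess

def headless_env(display=":99"):
    """Return a copy of the current environment with DISPLAY pointed at
    the virtual framebuffer (the display number is an arbitrary pick)."""
    env = dict(os.environ)
    env["DISPLAY"] = display
    return env

def start_xvfb(display=":99", geometry="1280x1024x24"):
    """Launch Xvfb and return the server process plus an environment
    that makes X clients (browsers, the Selenium server) render into it."""
    proc = subprocess.Popen(["Xvfb", display, "-screen", "0", geometry])
    return proc, headless_env(display)
```

You would then start the Selenium server (or the browser itself) with the returned environment, and terminate the Xvfb process when the scraping run finishes.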
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; For example, the Selenium-automated browser could fetch a target page after a couple of steps (like clicking a button/ hyperlink, selecting an item from a drop-down list,&amp;nbsp;submitting&amp;nbsp;a form, etc.) and then &lt;a href=&quot;http://deixto.blogspot.gr/2013/03/how-to-pass-selenium-pages-to-deixtobot.html&quot; target=&quot;_blank&quot;&gt;pass it to DEiXToBot&lt;/a&gt; (which&amp;nbsp;lacks JavaScript support) to do the scraping job through DOM-based tree patterns previously generated with the GUI DEiXTo tool. This is particularly useful for complex scraping cases and opens new potential for &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; wrappers.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; The Selenium Server component (formerly the Selenium RC Server) as well as the client drivers that allow you to write scripts that interact with the Selenium Server can be found &lt;a href=&quot;http://seleniumhq.org/download/&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. We have used it quite a few times for various cases and the results were great.&amp;nbsp;In conclusion, Selenium is an amazing &quot;weapon&quot; added to our arsenal and we strongly believe that along with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; it boosts our scraping capabilities. If you have an idea/ project that involves web browser automation or/ and web data extraction, we would be more than glad to hear from you!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2013/01/selenium-browser-automation-companion-for-deixto.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWttiqWoOX7Sa0LDUp9yojRdXvIZCQF8SfeBuq-wd8fwDOmeTtIEhPWo1xCI4qH38DTASQkGygZDTjPiHEb9yAy0Vzqzt0glaJGnkx9mFJvgNut3EfFKtEH590CU78Ia6VSVx5gULvgj8t/s72-c/selenium_logo.jpeg" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2922626744838802426</guid><pubDate>Sat, 18 Aug 2012 09:17:00 +0000</pubDate><atom:updated>2012-08-19T20:54:39.996+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Agents</category><category domain="http://www.blogger.com/atom/ns#">PhantomJS</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>PhantomJS &amp; finding pizza using Yelp and DEiXTo!</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Recently I stumbled upon &lt;a href=&quot;http://code.google.com/p/phantomjs/&quot; target=&quot;_blank&quot;&gt;PhantomJS&lt;/a&gt;, a headless WebKit browser which can serve a wide variety of purposes such as web browser automation, site scraping,&amp;nbsp;website testing,&amp;nbsp;SVG rendering and network monitoring. It&#39;s&amp;nbsp;a very interesting tool and I am sure that it could successfully be used in combination with DEiXToBot, which is our beloved powerful Mechanize scraper. For example, it could fetch a not-easy-to-reach (probably JavaScript-rich) target page (that &lt;a href=&quot;http://search.cpan.org/~jesse/WWW-Mechanize-1.72/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt; could not get due to its lack of&amp;nbsp;JavaScript&amp;nbsp;support) after completing some steps like clicking, selecting, checking, etc., and then pass it to DEiXToBot to do the scraping job. This is particularly useful for complex scraping cases where in my humble opinion PhantomJS DOM manipulation support would just not be enough and DEiXTo extraction capabilities could come into play.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNP6BX_w4fZAYF8aRZ9EzOZRLvhC0k7XHWdZo_alj_pVV5PF_Nv6UVDQxPxWmUy88Q6ishlbmizyRVDn3REAWtsFJZdh0XAoGlcyqHgRdJyAFUpg48sGUhizxaeX9HGblDRScTJGnGYk4Y/s1600/PhantomJS.jpeg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;117&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNP6BX_w4fZAYF8aRZ9EzOZRLvhC0k7XHWdZo_alj_pVV5PF_Nv6UVDQxPxWmUy88Q6ishlbmizyRVDn3REAWtsFJZdh0XAoGlcyqHgRdJyAFUpg48sGUhizxaeX9HGblDRScTJGnGYk4Y/s1600/PhantomJS.jpeg&quot; width=&quot;108&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
So, I was taking a look at the PhantomJS examples and I liked (among others) the one about &lt;a href=&quot;https://github.com/ariya/phantomjs/blob/master/examples/pizza.coffee&quot; target=&quot;_blank&quot;&gt;finding pizza&lt;/a&gt; in Mountain View using &lt;a href=&quot;http://www.yelp.com/search?find_desc=pizza&amp;amp;find_loc=94040&amp;amp;find_submit=Search&quot; target=&quot;_blank&quot;&gt;Yelp&lt;/a&gt; (I really like pizza!). So, I thought it would be nice to port the example to DEiXToBot in order to demonstrate the latter&#39;s use and efficiency. Hence, I visually created a pretty simple and easy to build &lt;a href=&quot;http://deixto.com/wp-content/uploads/yelp_pizza.xml&quot; target=&quot;_blank&quot;&gt;XML pattern&lt;/a&gt; with GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; for extracting the address field of each pizzeria returned (essentially equivalent to what PhantomJS does by getting the inner text of span.address items) and wrote a few lines of Perl code to execute the pattern on the target page and print the addresses extracted on the screen (either on a &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt; terminal or a command prompt window on Windows).&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFYbYLNsw7ZhAcDW77rlzBiiGAqve-5w20L3fAMsDhyphenhyphenSUAm20FU6VntoSwJguzq69BCDdIsJU_TVNvAXlCDUNT9ozLE0a-ZPrn1ohkucp7rdH0SByXiWy2jpjrg-2Jyi2iX6uxdSxwA94j/s1600/Screen+Shot+2012-08-18+at+11.51.20+AM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;146&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFYbYLNsw7ZhAcDW77rlzBiiGAqve-5w20L3fAMsDhyphenhyphenSUAm20FU6VntoSwJguzq69BCDdIsJU_TVNvAXlCDUNT9ozLE0a-ZPrn1ohkucp7rdH0SByXiWy2jpjrg-2Jyi2iX6uxdSxwA94j/s400/Screen+Shot+2012-08-18+at+11.51.20+AM.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
The resulting script was as simple as this:&lt;/div&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;use DEiXToBot;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;my $agent = DEiXToBot-&amp;gt;new();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;get(&#39;http://www.yelp.com/search?find_desc=pizza&amp;amp;find_loc=94040&amp;amp;find_submit=Search&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;die &#39;Unable to access network&#39; unless&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;success;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;load_pattern(&#39;yelp_pizza.xml&#39;);&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;build_dom();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;$agent-&amp;gt;extract_content();&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;my @addresses;&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;for my $record (@{$agent-&amp;gt;records}) {&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;&amp;nbsp; &amp;nbsp; push @&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: x-small;&quot;&gt;addresses&lt;/span&gt;&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;, $$record[0];&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;}&lt;/span&gt;&lt;br /&gt;
&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;print join(&quot;\n&quot;,@&lt;/span&gt;&lt;span style=&quot;font-family: &#39;Courier New&#39;, Courier, monospace; font-size: x-small;&quot;&gt;addresses&lt;/span&gt;&lt;span style=&quot;font-family: Courier New, Courier, monospace; font-size: x-small;&quot;&gt;);&lt;/span&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Just note that it scrapes only the first results page (just like in the PhantomJS example). We could easily parse through all the pages by following the &quot;Next&quot; page link but this is out of scope.&lt;/div&gt;
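The &quot;follow the Next link&quot; loop just mentioned looks roughly like this; sketched in Python with pluggable fetch/extract callbacks (all names hypothetical), plus a visited-set and a page cap so the scraper can never loop forever on a misbehaving site:

```python
def scrape_all_pages(first_url, fetch, extract_records, next_url,
                     max_pages=50):
    """Walk a paginated result list, collecting records from each page.
    fetch(url) returns a page; extract_records(page) returns its records;
    next_url(page) returns the next page's URL, or None on the last page."""
    records, url, seen = [], first_url, set()
    while url is not None and url not in seen and len(seen) != max_pages:
        seen.add(url)
        page = fetch(url)
        records.extend(extract_records(page))
        url = next_url(page)
    return records
```

In the DEiXToBot case, extract_records would run the XML pattern on the fetched page and next_url would locate the &quot;Next&quot; hyperlink.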
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
I would like to further look into PhantomJS and check the potential of using it (along with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;) as a pre-scraping step for hard JavaScript-enabled pages. In any case,&amp;nbsp;PhantomJS&amp;nbsp;is a handy tool that can be quite useful for a wide range of use cases. Generally speaking, web scraping can have countless &lt;a href=&quot;http://deixto.blogspot.gr/2012/03/uses-and-applications-of-web-scraping.html&quot; target=&quot;_blank&quot;&gt;applications and uses&lt;/a&gt; and there are many remarkable tools out there. One of the best we believe is &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;, so check it out! DEiXTo has helped quite a few people get their web data extraction tasks done easily and free!&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2012/08/phantomjs-finding-pizza-using-yelp-and.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNP6BX_w4fZAYF8aRZ9EzOZRLvhC0k7XHWdZo_alj_pVV5PF_Nv6UVDQxPxWmUy88Q6ishlbmizyRVDn3REAWtsFJZdh0XAoGlcyqHgRdJyAFUpg48sGUhizxaeX9HGblDRScTJGnGYk4Y/s72-c/PhantomJS.jpeg" height="72" width="72"/><thr:total>1</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-5942739521550432571</guid><pubDate>Fri, 09 Mar 2012 06:52:00 +0000</pubDate><atom:updated>2012-03-09T09:48:20.186+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">APIs</category><category domain="http://www.blogger.com/atom/ns#">Federated search</category><category domain="http://www.blogger.com/atom/ns#">Open Source</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category domain="http://www.blogger.com/atom/ns#">Search engines</category><title>DEiXTo powers ΟPEN-SME</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: left;&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;We are happy to announce that &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; is going to power &lt;a href=&quot;http://opensme.eu/&quot; target=&quot;_blank&quot;&gt;ΟPEN-SME&lt;/a&gt;, an exciting EU-funded project that promotes software reuse among small and medium-sized software enterprises (SMEs). 
ΟPEN-SME is coordinated by the &lt;a href=&quot;http://www.computer-engineers.gr/&quot; target=&quot;_blank&quot;&gt;Greek Association of Computer Engineers&lt;/a&gt; and aims to develop a set of methodologies, tools and business models centered on SME Associations, which will enable software SMEs to effectively introduce &lt;a href=&quot;http://www.opensource.org/&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt; software reuse practices in their production processes.&lt;br /&gt;
&lt;br /&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipeQInu7n2mLWnoTU_xRs4U0ihOZ27utSLPekfA8-YuqjoKMCcEdGiWqM1U-4e-OAWPXV3AuB37_C8sp5P2YnxnmM84FaP5fWG5K39DZ56EW-NSjn_YpgD_JRN2ks7E4NGKVyE4NQlaUNn/s1600/logo_el.gif&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;67&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipeQInu7n2mLWnoTU_xRs4U0ihOZ27utSLPekfA8-YuqjoKMCcEdGiWqM1U-4e-OAWPXV3AuB37_C8sp5P2YnxnmM84FaP5fWG5K39DZ56EW-NSjn_YpgD_JRN2ks7E4NGKVyE4NQlaUNn/s200/logo_el.gif&quot; width=&quot;100&quot; /&gt;&lt;/a&gt;&amp;nbsp; &amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;-based wrappers have been successfully deployed in order to enable the project&#39;s federated search engine, called OCEAN (developed by&amp;nbsp;the&amp;nbsp;&lt;a href=&quot;http://www.csd.auth.gr/en/index.php&quot; target=&quot;_blank&quot;&gt;Department of Informatics&lt;/a&gt;&amp;nbsp;of the Aristotle University of Thessaloniki), to simultaneously search, in real time, existing open source software search engines that do NOT offer&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;API&lt;/a&gt;&amp;nbsp;access (namely&amp;nbsp;&lt;a href=&quot;http://www.koders.com/&quot; target=&quot;_blank&quot;&gt;Koders&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href=&quot;http://opensearch.krugle.org/home/home_page/&quot; target=&quot;_blank&quot;&gt;Krugle&lt;/a&gt;).&amp;nbsp;To achieve this, custom &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; code was written to submit the&amp;nbsp;user-specified queries to&amp;nbsp;the native websites and&amp;nbsp;scrape the first N results&amp;nbsp;returned into a suitable form.&lt;/div&gt;&lt;/div&gt;&lt;div style=&quot;text-align: 
justify;&quot;&gt;&lt;br /&gt;
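For readers curious what such a wrapper boils down to, the pattern is simply "submit the query, parse the result page, keep the first N hits". The OCEAN connectors were written in Perl; the sketch below is an illustrative Python equivalent using only the standard library, run here against an inline HTML snippet (the real engines' markup, and the `result` class name, are invented for the example).

```python
# Illustrative "first N results" scraper: parse anchors marked as result
# links from a search-results page. Stdlib only; the markup is made up.
from html.parser import HTMLParser

class ResultParser(HTMLParser):
    """Collects (title, url) pairs from anchors with class="result"."""
    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("class") == "result":
            self._href = a.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append(("".join(self._text).strip(), self._href))
            self._href = None

def first_n_results(html, n):
    parser = ResultParser()
    parser.feed(html)
    return parser.results[:n]

page = """
<ul>
  <li><a class="result" href="/code/1">Matrix library</a></li>
  <li><a class="result" href="/code/2">CSV parser</a></li>
  <li><a class="result" href="/code/3">HTTP client</a></li>
</ul>
"""
print(first_n_results(page, 2))
```

A real connector would first fetch the page for the user's query and then feed the response body to the same parser.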
&amp;nbsp; &amp;nbsp; We are really glad that we are participating in this challenging and innovative project and we hope that &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; will help ΟPEN-SME towards implementing its goals. So, if you are looking for a web scraping framework to power your&amp;nbsp;aggregator&amp;nbsp;or search engine, please do not hesitate to &lt;a href=&quot;http://deixto.com/contact.php&quot; target=&quot;_blank&quot;&gt;contact us&lt;/a&gt;!&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/03/deixto-powers-pen-sme.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipeQInu7n2mLWnoTU_xRs4U0ihOZ27utSLPekfA8-YuqjoKMCcEdGiWqM1U-4e-OAWPXV3AuB37_C8sp5P2YnxnmM84FaP5fWG5K39DZ56EW-NSjn_YpgD_JRN2ks7E4NGKVyE4NQlaUNn/s72-c/logo_el.gif" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6577860543994242397</guid><pubDate>Mon, 05 Mar 2012 09:07:00 +0000</pubDate><atom:updated>2012-03-09T09:04:37.042+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Scraping</category><title>Uses and applications of web scraping</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;Some people wonder what the uses of web scraping might be. Well, your imagination is the only limit (along with the copyright notices perhaps). There is a huge wealth of data out there and many&amp;nbsp;believe&amp;nbsp;that the open Web is a real goldmine. 
So, web data extraction tools and &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; in particular could help you unlock this treasure and give birth to innovations, applications and&amp;nbsp;new&amp;nbsp;ideas.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Public institutions, companies and organizations,&amp;nbsp;entrepreneurs,&amp;nbsp;professionals&amp;nbsp;as well as mere citizens and users generate an&amp;nbsp;enormous&amp;nbsp;amount&amp;nbsp;of information&amp;nbsp;every single day. The question is: how effectively is it being used?&amp;nbsp;To this end, web&amp;nbsp;content extraction&amp;nbsp;can prove a valuable ally. Along with data mining, it has much to offer in every field you can imagine. The following are only some of the uses of web scraping:&lt;/div&gt;&lt;ul style=&quot;text-align: left;&quot;&gt;&lt;li&gt;collect properties from real estate listings&lt;/li&gt;
&lt;li&gt;scrape retailer sites on a daily basis&lt;/li&gt;
&lt;li&gt;extract offers and discounts from deal-of-the-day websites&lt;/li&gt;
&lt;li&gt;gather data for hotels and vacation rentals&lt;/li&gt;
&lt;li&gt;scrape job postings and internships&lt;/li&gt;
&lt;li&gt;crawl forums and social sites so as to enable analysis and post-processing&amp;nbsp;of their rich data&lt;/li&gt;
&lt;li&gt;power aggregators and product search engines&lt;/li&gt;
&lt;li&gt;monitor your online reputation and check what is being said about you or your brand&lt;/li&gt;
&lt;li&gt;quickly populate product catalogues with full specifications&lt;/li&gt;
&lt;li&gt;monitor prices of the competition&lt;/li&gt;
&lt;li&gt;scrape the content of digital libraries in order to transform it into suitable, structured forms&lt;/li&gt;
&lt;li&gt;collect and aggregate government and public data&lt;/li&gt;
&lt;li&gt;search (in real time) bibliographic databases and online sources that don&#39;t offer an API, thus powering &lt;a href=&quot;http://deixto.blogspot.com/2012/01/federated-searching-dbwiz.html&quot; target=&quot;_blank&quot;&gt;federated&amp;nbsp;search engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;gather educational material and information from traditional higher education subjects as well as real-life contexts, in order to help the contemporary learner&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://deixto.blogspot.com/2011/12/can-deixto-power-mobile-apps-yes-it-can.html&quot; target=&quot;_blank&quot;&gt;power mobile applications&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;help build geolocation apps (e.g. &lt;a href=&quot;http://deixto.blogspot.com/2012/01/geo-location-data-yahoo-placefinder.html&quot; target=&quot;_blank&quot;&gt;extracting addresses available on web pages and using their coordinates to build meaningful maps with points of interest&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;prepare large, focused datasets for scientific tasks (e.g. data mining)&lt;/li&gt;
&lt;li&gt;extract and summarize large volumes of text (e.g. &lt;a href=&quot;http://deixto.com/newegg.php&quot; target=&quot;_blank&quot;&gt;summarizing product reviews&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&amp;lt;your scraping task goes here!&amp;gt;&lt;/li&gt;
&lt;/ul&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; This list can grow very long. There are countless use cases and potential scenarios, either business-oriented or non-profit. As far as the &lt;a href=&quot;http://deixto.blogspot.com/2011/12/robotstxt-access-restrictions.html&quot; target=&quot;_blank&quot;&gt;access and copyright restrictions&lt;/a&gt; are concerned, it is a really significant issue that has raised a lot of discussion and controversy. However, the opinion that seems to be gaining ground is that (well-intentioned) web scraping is legal since the data is publicly and freely available on the Web. So, let your creativity and imagination loose;&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; can probably help you to achieve your scraping-based project goals. We would be more than happy to &lt;a href=&quot;http://deixto.com/contact.php&quot; target=&quot;_blank&quot;&gt;hear from you&lt;/a&gt;.&lt;/div&gt;&lt;br /&gt;
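To make the list above concrete, here is a minimal sketch of the "records to rows" transformation behind several of these use cases (price monitoring, populating catalogues): repeated HTML fragments become structured, tabular data. The markup, product names and prices are invented for the example; a real job would of course fetch live pages.

```python
# Turn repeated HTML records into structured CSV rows.
# The snippet below stands in for a fetched retailer page.
import csv, io, re

html = """
<div class="product"><span class="name">Mouse</span><span class="price">9.90</span></div>
<div class="product"><span class="name">Keyboard</span><span class="price">24.50</span></div>
"""

# One regex per record: capture the name and the price fields.
pattern = re.compile(r'class="name">([^<]+)</span><span class="price">([\d.]+)')
rows = [(name, float(price)) for name, price in pattern.findall(html)]

# Emit the structured result as CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "price"])
writer.writerows(rows)
print(buf.getvalue())
```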
&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/03/uses-and-applications-of-web-scraping.html</link><author>noreply@blogger.com (kntonas)</author><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-8274736727794590973</guid><pubDate>Sun, 19 Feb 2012 06:08:00 +0000</pubDate><atom:updated>2012-02-22T22:49:23.616+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Digital Libraries</category><category domain="http://www.blogger.com/atom/ns#">Dublin Core</category><category domain="http://www.blogger.com/atom/ns#">Europeana</category><category domain="http://www.blogger.com/atom/ns#">Linked Data</category><category domain="http://www.blogger.com/atom/ns#">OAI-PMH</category><title>Linked Data &amp; DEiXTo</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;As explained in a&amp;nbsp;&lt;a href=&quot;http://deixto.blogspot.com/2012/01/open-archives-digital-libraries.html&quot; target=&quot;_blank&quot;&gt;previous post&lt;/a&gt;,&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;&amp;nbsp;can scrape the content of digital libraries, archives and multimedia collections lacking an &lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;API&lt;/a&gt;&amp;nbsp;and enable their metadata&amp;nbsp;transformation (through post-processing and&amp;nbsp;custom Perl code)&amp;nbsp;to&amp;nbsp;&lt;a href=&quot;http://dublincore.org/&quot; target=&quot;_blank&quot;&gt;Dublin Core&lt;/a&gt;&amp;nbsp;and subsequently in&amp;nbsp;&lt;a href=&quot;http://www.openarchives.org/pmh/&quot; target=&quot;_blank&quot;&gt;OAI-PMH&lt;/a&gt;&amp;nbsp;or another suitable form, e.g.&amp;nbsp;&lt;a href=&quot;http://www.europeana.eu/portal/&quot; 
target=&quot;_blank&quot;&gt;Europeana&lt;/a&gt;&amp;nbsp;Semantic Elements (&lt;a href=&quot;http://www.europeana.eu/schemas/ese/&quot; target=&quot;_blank&quot;&gt;ESE&lt;/a&gt;).&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Meanwhile,&amp;nbsp;the Web has become a dynamic collaboration platform that allows everyone to meet, read and more importantly write. Thus, it steadily approaches the vision of &lt;a href=&quot;http://www.w3.org/People/Berners-Lee/&quot; target=&quot;_blank&quot;&gt;Tim Berners-Lee&lt;/a&gt; (the inventor of the World Wide Web): the &lt;a href=&quot;http://linkeddata.org/&quot; target=&quot;_blank&quot;&gt;Linked Data&lt;/a&gt; Web, a place where related data are linked and information is represented in a more structured and easily machine-processable way.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; target=&quot;_blank&quot;&gt;Linked Data&lt;/a&gt; refers to a set of best practices for publishing and connecting structured data on the Web. Its key technologies are &lt;a href=&quot;http://en.wikipedia.org/wiki/Uniform_resource_identifier&quot; target=&quot;_blank&quot;&gt;URIs&lt;/a&gt; (a generic method to identify resources on the Internet), the&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol&quot; target=&quot;_blank&quot;&gt;Hypertext Transfer Protocol&lt;/a&gt;&amp;nbsp;(HTTP) and &lt;a href=&quot;http://www.w3.org/TR/rdf-primer/&quot; target=&quot;_blank&quot;&gt;RDF&lt;/a&gt; (a data model and a general method for conceptual description of things in the real world). It is an exciting topic of interest and it&#39;s expected to make great progress in the next few years. A video that does a nice job of explaining what Linked Open Data is all about can be found here: &lt;a href=&quot;http://vimeo.com/36752317&quot;&gt;http://vimeo.com/36752317&lt;/a&gt;&lt;br /&gt;
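To make the RDF part of this tangible: publishing a scraped record as Linked Data essentially means giving the item an HTTP URI and expressing its fields as triples. The sketch below emits hand-rolled N-Triples using Dublin Core predicates; the item URI and field values are made up for the example, and a real pipeline would use a proper RDF library rather than string formatting.

```python
# Minimal sketch: serialize one scraped metadata record as N-Triples.
# Subject = the item's HTTP URI; predicates = Dublin Core elements.
DC = "http://purl.org/dc/elements/1.1/"

def to_ntriples(item_uri, record):
    lines = []
    for field, value in record.items():
        # Escape backslashes and quotes per N-Triples literal rules.
        escaped = value.replace('\\', '\\\\').replace('"', '\\"')
        lines.append(f'<{item_uri}> <{DC}{field}> "{escaped}" .')
    return "\n".join(lines)

record = {"title": "Psaltic manuscript No. 124", "creator": "Unknown"}
print(to_ntriples("http://example.org/items/124", record))
```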
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYYm9QHpH6mJUNpo21M9oPz1-6-qpEZPNjfYhHwT0t_q3TjBA5f8fWvmSUsSu8O5lSbCMYdjEQNG3TYl4tv6U_qmgpI6M716vgryDcoEJ9PVnLNmb0dTl7QTgrspuB4QKgMXf7uKyC6V0/s1600/lod-datasets_2009-07-14_cropped.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;297&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYYm9QHpH6mJUNpo21M9oPz1-6-qpEZPNjfYhHwT0t_q3TjBA5f8fWvmSUsSu8O5lSbCMYdjEQNG3TYl4tv6U_qmgpI6M716vgryDcoEJ9PVnLNmb0dTl7QTgrspuB4QKgMXf7uKyC6V0/s400/lod-datasets_2009-07-14_cropped.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;Over the last decade, the&amp;nbsp;Open Archives Initiative Protocol for Metadata Harvesting (&lt;a href=&quot;http://www.openarchives.org/pmh/&quot; target=&quot;_blank&quot;&gt;OAI-PMH&lt;/a&gt;)&amp;nbsp;has become the de facto standard for metadata exchange in digital libraries and it&#39;s playing an increasingly important role.&amp;nbsp;However, it has two major drawbacks: it does not make its resources accessible via dereferencable URIs and it provides only restricted means of selective access to metadata.&amp;nbsp;Therefore, there is a strong need for&amp;nbsp;efficient&amp;nbsp;tools that would allow&amp;nbsp;metadata repositories to expose their content&amp;nbsp;according to the&amp;nbsp;&lt;span class=&quot;s1&quot;&gt;Linked Data&lt;/span&gt;&amp;nbsp;&lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; target=&quot;_blank&quot;&gt;guidelines&lt;/a&gt;. This would make&amp;nbsp;digitized items and media objects accessible via HTTP URIs and&amp;nbsp;query able&amp;nbsp;via the&amp;nbsp;&lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-query/&quot; target=&quot;_blank&quot;&gt;SPARQL&lt;/a&gt;&amp;nbsp;protocol.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://www.linkedin.com/in/bernhardhaslhofer&quot; target=&quot;_blank&quot;&gt;Dr&amp;nbsp;Haslhofer&lt;/a&gt; has performed significant research and work towards this direction. He has&amp;nbsp;developed (among others) the &lt;a href=&quot;http://www.mediaspaces.info/tools/oai2lod/&quot; target=&quot;_blank&quot;&gt;OAI2LOD Server&lt;/a&gt;&amp;nbsp;based on the &lt;a href=&quot;http://sourceforge.net/projects/d2rq-map/&quot; target=&quot;_blank&quot;&gt;&lt;span class=&quot;s1&quot;&gt;D2R Server&lt;/span&gt;&lt;/a&gt; implementation and wrote the &lt;a href=&quot;https://github.com/behas/ese2edm&quot; target=&quot;_blank&quot;&gt;ESE2EDM&lt;/a&gt;&amp;nbsp;converter, a collection of ruby scripts that can convert given&amp;nbsp;XML-based ESE&amp;nbsp;source files into the RDF-based Europeana Data Model (&lt;a href=&quot;http://pro.europeana.eu/web/guest/edm-documentation&quot; target=&quot;_blank&quot;&gt;EDM&lt;/a&gt;). These remarkable tools could turn out very useful for making large volumes of information Linked-Data ready, with all the advantages this brings.&lt;br /&gt;
&amp;nbsp; &amp;nbsp; Linked&amp;nbsp;Open&amp;nbsp;Data can change the computer world as we know it. So, there is a lot of potential in combining &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; with Linked Data technologies. Their blend could eventually produce an innovative and useful outcome. Many already believe&amp;nbsp;that Linked Data is the next big thing. Time will tell. Meanwhile,&amp;nbsp;&lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; could definitely help you&amp;nbsp;generate structured data in a variety of formats from unstructured HTML pages, whether&amp;nbsp;your ultimate goal is&amp;nbsp;Linked Data or not.&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/02/linked-data-deixto.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnYYm9QHpH6mJUNpo21M9oPz1-6-qpEZPNjfYhHwT0t_q3TjBA5f8fWvmSUsSu8O5lSbCMYdjEQNG3TYl4tv6U_qmgpI6M716vgryDcoEJ9PVnLNmb0dTl7QTgrspuB4QKgMXf7uKyC6V0/s72-c/lod-datasets_2009-07-14_cropped.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-2194704879425998273</guid><pubDate>Sat, 11 Feb 2012 07:42:00 +0000</pubDate><atom:updated>2012-02-11T14:19:33.144+02:00</atom:updated><title>DEiXTo components clarified</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;From the emails and feedback received, it seems that many people get a bit confused about the&amp;nbsp;utility&amp;nbsp;and functionality of the &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; GUI tool compared to the &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; command 
line executor (CLE). DEiXToBot is even more confusing for quite a few users. So, let&#39;s clarify things.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; The GUI tool&amp;nbsp;is freeware (available&amp;nbsp;at no cost but without any source code, at least yet) and it allows you to visually build and execute extraction rules for web pages of interest with point and click convenience. It offers you an embedded web browser and a friendly graphical interface so that you can highlight an element/ record instance as the mouse moves over it. The GUI tool is a Windows-only application that harnesses Internet Explorer&#39;s HTML parser and render engine.&amp;nbsp;&amp;nbsp;It is worth noting that it can support simple&amp;nbsp;&lt;a href=&quot;http://deixto.blogspot.com/2011/12/cooperating-deixto-agents.html&quot; target=&quot;_blank&quot;&gt;cooperative extraction scenarios&lt;span id=&quot;goog_1052480954&quot;&gt;&lt;/span&gt;&lt;/a&gt;&amp;nbsp;as well as periodic, scheduled execution through batch files and the Windows Task Scheduler.&amp;nbsp;Perhaps its main drawback is that it can execute just one pattern on a page although for several cases (maybe for the majority) one and only extraction rule is enough to get the job done.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; On the other hand, the command line executor, or CLE for short, is implemented in Perl and it is freely distributed under the &lt;a href=&quot;http://www.gnu.org/licenses/gpl.html&quot; target=&quot;_blank&quot;&gt;GNU General Public License&lt;/a&gt;&amp;nbsp;v3, thus its source code is included. Its purpose is to execute wrapper project files (.wpf) that have previously been created with the GUI tool. It runs on a DOS prompt window or on a Linux/ Mac terminal. 
&amp;nbsp;Besides the code though, we have built two standalone executables so that you can run CLE either on a Windows or a &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt;&amp;nbsp;machine&amp;nbsp;without having Perl or any prerequisite modules&amp;nbsp;installed. CLE is faster, offers more output formats and has some additional features such as an efficient &lt;a href=&quot;http://deixto.wikispaces.com/message/view/home/32988104&quot; target=&quot;_blank&quot;&gt;post-processing mechanism&lt;/a&gt; and database support.&amp;nbsp;However, it shares the same shortcoming as the GUI tool: it&amp;nbsp;supports&amp;nbsp;just one pattern on a page.&amp;nbsp;Finally, it relies on DEiXToBot, a &quot;homemade&quot; package that facilitates the&amp;nbsp;execution of&amp;nbsp;GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; generated wrappers.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp;&amp;nbsp;DEiXToBot is&amp;nbsp;the third and probably the most powerful and well-crafted software component of the DEiXTo scraping suite and it is available under the GPL v3 license. It is a Perl module based on &lt;a href=&quot;http://search.cpan.org/~kntonas/WWW-Mechanize-Sleepy-0.7/Sleepy.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize::Sleepy&lt;/a&gt;, a handy web browser Perl object, and several other CPAN modules. It allows extensive customization and tailor-made solutions since it facilitates the combination of &lt;i&gt;multiple&lt;/i&gt; extraction rules/ patterns as well as the post-processing of their results through custom code. Therefore, it can deal with complex cases and cover more advanced web scraping needs. 
But it requires programming skills in order to use it.&amp;nbsp;&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; The bottom line is that DEiXToBot is the essence of our long experience. The GUI tool might be more suitable for most every-day users (due to its visual convenience) but when things get&amp;nbsp;difficult or the situation requires a more&amp;nbsp;advanced&amp;nbsp;solution (e.g. scheduled or on-demand execution and coordination of multiple wrappers on a &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt; server), a customized DEiXToBot-based script is your choice. You can use the GUI tool first to create the necessary patterns and then deploy a Perl script that uses them to extract structured data from the pages of the target website. So, if you are familiar with Perl, you should not find it very hard to write your first &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;deixto&lt;/a&gt;-based spider/ crawler!&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/02/deixto-components-clarified.html</link><author>noreply@blogger.com (kntonas)</author><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-1491882698361958417</guid><pubDate>Sat, 28 Jan 2012 00:03:00 +0000</pubDate><atom:updated>2013-05-05T20:09:21.614+03:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">APIs</category><category domain="http://www.blogger.com/atom/ns#">dbWiz</category><category domain="http://www.blogger.com/atom/ns#">Federated search</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category domain="http://www.blogger.com/atom/ns#">Search engines</category><category domain="http://www.blogger.com/atom/ns#">Z39.50</category><title>Federated searching &amp; dbWiz</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: 
left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Nowadays, most university and college students, professors&amp;nbsp;as well as&amp;nbsp;researchers&amp;nbsp;are increasingly&amp;nbsp;seeking&amp;nbsp;information&amp;nbsp;and&amp;nbsp;finding&amp;nbsp;answers&amp;nbsp;on the open Web. Google has become the dominant search tool for&amp;nbsp;almost&amp;nbsp;everyone. Its popularity is enormous; no need to wonder or analyze why: it has a simple, effective interface and returns fast, accurate results.&lt;br /&gt;
&lt;br /&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHg9VMKqJlf_0Bxkgom1i7uFgbV2e8Yr-Lh7z_zRonLooCykRdYdFbJDJwfO-f9NTYbIQCh7uNNlG1agi618zXgrvVYVqaLT1vPW9kRKhB0Kv9kf4ORaNKvfLQRh31plCAdtPSH-bmU8VZ/s1600/logo3w.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;68&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHg9VMKqJlf_0Bxkgom1i7uFgbV2e8Yr-Lh7z_zRonLooCykRdYdFbJDJwfO-f9NTYbIQCh7uNNlG1agi618zXgrvVYVqaLT1vPW9kRKhB0Kv9kf4ORaNKvfLQRh31plCAdtPSH-bmU8VZ/s200/logo3w.png&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;br /&gt;
&amp;nbsp; &amp;nbsp; However, libraries, in their effort to win some&amp;nbsp;patrons&amp;nbsp;back, have tried to offer a decent searching alternative by developing a new model: federated search engines. &lt;a href=&quot;http://en.wikipedia.org/wiki/Federated_search&quot; target=&quot;_blank&quot;&gt;Federated searching&lt;/a&gt; (also known as metasearch or cross searching) allows users to simultaneously search multiple web resources and&amp;nbsp;subscription-based bibliographic databases from a single interface. To achieve that, parallel processes are executed in real time and retrieve results from each separate source. Then, the returned results&amp;nbsp;are grouped together and presented&amp;nbsp;to the user&amp;nbsp;in a unified way.&lt;/div&gt;
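The flow just described can be sketched in a few lines: query every source in parallel, then merge the partial answers into one unified list. The two "databases" below are stand-in stub functions (real connectors would speak Z39.50, call an API, or scrape a website); Python is used here purely for illustration.

```python
# Federated search in miniature: fan out the query to every source in
# parallel threads, then merge the partial result lists for display.
from concurrent.futures import ThreadPoolExecutor

def search_catalog(query):      # stub for one bibliographic database
    return [f"catalog: {query} handbook"]

def search_eprints(query):      # stub for an e-print repository
    return [f"eprints: {query} survey"]

def federated_search(query, sources):
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = list(pool.map(lambda s: s(query), sources))
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged)       # one unified, ordered result list

print(federated_search("perl", [search_catalog, search_eprints]))
```

In a production engine each source would also carry a timeout, so one slow database cannot hold up the whole unified response.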
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; There are broadly two mechanisms for pulling data from the target sources: either through an &lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;Application Programming Interface&lt;/a&gt; (API) or via &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_scraping&quot; target=&quot;_blank&quot;&gt;scraping&lt;/a&gt; the native web interface/ site of each database.&amp;nbsp;The first method is undoubtedly better, but very often a search API is not available. In such cases, &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_bot&quot; target=&quot;_blank&quot;&gt;web robots&lt;/a&gt; (or agents) come into play and capture information of interest, typically by simulating a human browsing through the target webpages. Especially in academia, there are numerous online bibliographic databases. Some of them offer &lt;a href=&quot;http://en.wikipedia.org/wiki/Z39.50&quot;&gt;Z39.50&lt;/a&gt;&amp;nbsp;or API access. However, a large number still do not provide protocol-based search functionality. Thus, scraping techniques should be deployed for those (unless the vendor disallows bots).&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VRENT1edzgK8p7nm-RDP41W32dWBTdb6n9YvvaKg8Jui2dlCgIMwQjC0IsTxoUaK1uCp5HrMXOeWecC8xWsafH4XD048lgVw2XI1pLbzejU_wmLFJXfdpZl7MLKH0t3H0C1RSc0ua1N5/s1600/dbwiz.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VRENT1edzgK8p7nm-RDP41W32dWBTdb6n9YvvaKg8Jui2dlCgIMwQjC0IsTxoUaK1uCp5HrMXOeWecC8xWsafH4XD048lgVw2XI1pLbzejU_wmLFJXfdpZl7MLKH0t3H0C1RSc0ua1N5/s1600/dbwiz.png&quot; /&gt;&lt;/a&gt;&amp;nbsp; &amp;nbsp;When starting my programming adventure with &lt;a href=&quot;http://www.perl.org/&quot; target=&quot;_blank&quot;&gt;Perl&lt;/a&gt; back in 2006, in the context of my former full-time job at the &lt;a href=&quot;http://www.lib.uom.gr/index.php?lang=utf-8&quot; target=&quot;_blank&quot;&gt;Library of University of Macedonia&lt;/a&gt;&amp;nbsp;(Thessaloniki,&amp;nbsp;Greece), I had the chance (and luck) to run across &lt;a href=&quot;http://researcher.sfu.ca/dbwiz&quot; target=&quot;_blank&quot;&gt;dbWiz&lt;/a&gt;, a remarkable &lt;a href=&quot;http://www.opensource.org/&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt;, federated search tool developed by the &lt;a href=&quot;http://www.lib.sfu.ca/&quot; target=&quot;_blank&quot;&gt;Simon Fraser University&amp;nbsp;(SFU)&amp;nbsp;Library&lt;/a&gt;&amp;nbsp;in Canada. I was fascinated with Perl as well as dbWiz&#39;s internal design and implementation. So, this is how I met and fell in love with Perl.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; dbWiz offered a friendly and usable admin interface that allowed you to create search categories and select from a global list of resources which databases would be active and searchable. If you had to add a new resource though, you would have to write your own plugin (Perl knowledge and programming skills were required). Some of the dbWiz search plugins were based upon Z39.50 whereas others (the majority) relied on &lt;a href=&quot;http://en.wikipedia.org/wiki/Regular_expression&quot; target=&quot;_blank&quot;&gt;regular expressions&lt;/a&gt; and &lt;a href=&quot;http://search.cpan.org/~jesse/WWW-Mechanize-1.71/lib/WWW/Mechanize.pm&quot; target=&quot;_blank&quot;&gt;WWW::Mechanize&lt;/a&gt;&amp;nbsp;(a handy web browser Perl object).&lt;br /&gt;
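The plugin idea can be illustrated with a tiny registry: each resource contributes a search function behind a common interface, and the engine dispatches by source name. dbWiz's actual plugins were Perl classes; this Python analogue (with an invented "demo-opac" source) only shows the shape of the mechanism.

```python
# A miniature plugin registry: sources register a search function under a
# name, and the engine only ever calls the common interface.
PLUGINS = {}

def plugin(name):
    """Decorator that registers a search function for one resource."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register

@plugin("demo-opac")
def demo_opac(query):
    # A real plugin would submit the query to the OPAC and scrape the reply.
    return [f"demo-opac result for {query!r}"]

def search(source, query):
    return PLUGINS[source](query)

print(search("demo-opac", "linked data"))
```

Adding a new resource then means writing one more decorated function, which is exactly the chore the DEiXTo-based module below was meant to ease.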
&amp;nbsp; &amp;nbsp; The federated search engine I developed while working&amp;nbsp;at the University of Macedonia (2006-2008)&amp;nbsp;was named &quot;&lt;a href=&quot;http://pantou.lib.uom.gr/modperl/dbwiz2.pl&quot; target=&quot;_blank&quot;&gt;Pantou&lt;/a&gt;&quot; and became a valuable everyday tool for students and professors of the University. The results of this work &lt;a href=&quot;http://www.lib.uom.gr/images/stories/pdf/dimosieuseis/federated_search.pdf&quot; target=&quot;_blank&quot;&gt;were presented&lt;/a&gt; at the&amp;nbsp;&lt;a href=&quot;http://libconf2007.unipi.gr/index.php?lang=en&quot; target=&quot;_blank&quot;&gt;16th Panhellenic Academic Libraries Conference &lt;/a&gt;(Piraeus, 1-3 October 2007). Unfortunately, its maintenance stopped at the end of 2010 due to the economic crisis and severe cuts in funding. Consequently, a few months later some of its plugins started falling apart.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH15hnfEDbdUdaJRBwm6WgHAmAP-1dNLEo_lfKgPDmkv2q3xrYCRr8m7Za1ONl6qt7ADETC0_WRi4J5ZnBFi5P1Xv81cCaP7LSG1c1BGiI97OYjdyI4lXunTaKsUjlXtxILE_g7LUD8zqE/s1600/ScreenShot_Pantou.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;302&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjH15hnfEDbdUdaJRBwm6WgHAmAP-1dNLEo_lfKgPDmkv2q3xrYCRr8m7Za1ONl6qt7ADETC0_WRi4J5ZnBFi5P1Xv81cCaP7LSG1c1BGiI97OYjdyI4lXunTaKsUjlXtxILE_g7LUD8zqE/s400/ScreenShot_Pantou.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Generally, delving into dbWiz taught me a lot of lessons such as web development, Perl programming and &lt;a href=&quot;http://www.gnu.org/gnu/linux-and-gnu.html&quot; target=&quot;_blank&quot;&gt;GNU/Linux&lt;/a&gt; administration. I loved it! Meanwhile, in my effort to improve the relatively hard and tedious procedure of creating new dbWiz plugins, I put into practice an early version of GUI &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (which was my MSc thesis being fulfilled in the same period at the &lt;a href=&quot;http://www.auth.gr/home/index_en.html&quot; target=&quot;_blank&quot;&gt;Aristotle University of Thessaloniki&lt;/a&gt;). The result was a &lt;a href=&quot;http://lib-code.lib.sfu.ca/projects/dbwiz/browser/trunk/DBWIZ_search/lib/DBWIZ/Search/Internet/DEiXTo.pm?rev=691&quot; target=&quot;_blank&quot;&gt;new Perl module&lt;/a&gt; that allowed the execution of &lt;a href=&quot;http://www.w3.org/DOM/&quot; target=&quot;_blank&quot;&gt;W3C DOM&lt;/a&gt;-based, XML patterns (built with the GUI DEiXTo) inside dbWiz and eliminated, at least to a large extent, the need for heavy use of regular expressions. That module, which was the first predecessor of today&#39;s DEiXToBot package,&amp;nbsp;&lt;a href=&quot;http://lib-code.lib.sfu.ca/projects/dbwiz/browser/trunk/DBWIZ_search/lib/DBWIZ/Search/Internet/DEiXTo.pm?rev=691&quot; target=&quot;_blank&quot;&gt;got included in the official dbWiz distribution&lt;/a&gt; after contacting the dbWiz development team in 2007. Unfortunately, SFU Library &lt;a href=&quot;http://lib-forums.lib.sfu.ca/viewtopic.php?f=1&amp;amp;t=329&quot; target=&quot;_blank&quot;&gt;ended the support&lt;/a&gt; and development of dbWiz in 2010.&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; Looking back, I can now say with quite a bit of certainty that &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (more than ever before) can power federated search tools and help them extend their reach to previously inaccessible resources. As far as the search engine wars are concerned, Google seems to have triumphed, but nobody can say for sure what will happen in the years to come. Time will tell.&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2012/01/federated-searching-dbwiz.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHg9VMKqJlf_0Bxkgom1i7uFgbV2e8Yr-Lh7z_zRonLooCykRdYdFbJDJwfO-f9NTYbIQCh7uNNlG1agi618zXgrvVYVqaLT1vPW9kRKhB0Kv9kf4ORaNKvfLQRh31plCAdtPSH-bmU8VZ/s72-c/logo3w.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-7092960453956243431</guid><pubDate>Thu, 19 Jan 2012 22:02:00 +0000</pubDate><atom:updated>2012-01-28T19:06:19.669+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Data transformations</category><category domain="http://www.blogger.com/atom/ns#">Digital Libraries</category><category domain="http://www.blogger.com/atom/ns#">DSpace</category><category domain="http://www.blogger.com/atom/ns#">Dublin Core</category><category domain="http://www.blogger.com/atom/ns#">Institutional repositories</category><category domain="http://www.blogger.com/atom/ns#">Music Library Lilian Voudouri</category><category domain="http://www.blogger.com/atom/ns#">OAI-PMH</category><category domain="http://www.blogger.com/atom/ns#">Open Archives</category><category domain="http://www.blogger.com/atom/ns#">openarchives.gr</category><title>Open Archives &amp; Digital Libraries</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxu5vimSNanRWKq2llpJaAsEDq8iaJftws_hUKr48FdWYA8vPs2u-uGfOEf1lcsYZJNkcVA81t9JHqh4XJr0QVE3fLMhIyd4E7rM0mlD4Ts9CupqkGDh2GacRRy7sOTfsMwrpxmmOLhpzD/s1600/OA100.gif&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;br /&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxu5vimSNanRWKq2llpJaAsEDq8iaJftws_hUKr48FdWYA8vPs2u-uGfOEf1lcsYZJNkcVA81t9JHqh4XJr0QVE3fLMhIyd4E7rM0mlD4Ts9CupqkGDh2GacRRy7sOTfsMwrpxmmOLhpzD/s1600/OA100.gif&quot; /&gt;&lt;/a&gt;The &lt;a href=&quot;http://www.openarchives.org/&quot; target=&quot;_blank&quot;&gt;Open Archives Initiative&lt;/a&gt; (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and &lt;a href=&quot;http://en.wikipedia.org/wiki/Institutional_repository&quot; target=&quot;_blank&quot;&gt;institutional repository&lt;/a&gt; movements and its cornerstone is the Protocol for Metadata Harvesting (&lt;a href=&quot;http://www.openarchives.org/OAI/openarchivesprotocol.html&quot; target=&quot;_blank&quot;&gt;OAI-PMH&lt;/a&gt;), which allows data providers/ repositories to expose their content in a structured format. A client can then make OAI-PMH service requests over HTTP to harvest that metadata.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; &lt;a href=&quot;http://openarchives.gr/&quot;&gt;openarchives.gr&lt;/a&gt; is a great federated search engine harvesting &lt;i&gt;57&lt;/i&gt; Greek digital libraries and institutional repositories (as of January 2012). It currently provides access to almost half a million(!) documents (mainly undergraduate theses and Master/ PhD dissertations) and its index gets updated on a daily basis. It began its operation back in 2006, designed and implemented by &lt;a href=&quot;http://vbanos.gr/&quot; target=&quot;_blank&quot;&gt;Vangelis Banos&lt;/a&gt;, but since May 2011 it has been hosted, managed and co-developed by the &lt;a href=&quot;http://www.ekt.gr/&quot; target=&quot;_blank&quot;&gt;National Documentation Centre&lt;/a&gt; (EKT).
What makes this amazing search tool even more remarkable is that it is built entirely on &lt;a href=&quot;http://www.opensource.org/&quot; target=&quot;_blank&quot;&gt;open source&lt;/a&gt;/ &lt;a href=&quot;http://www.gnu.org/philosophy/free-sw.html&quot; target=&quot;_blank&quot;&gt;free software&lt;/a&gt;.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7NkBLew1wMxklRRTAVer_fhN9QkeaCUhcy9tkO48U7D7meGZxbwskACiIijqcBa4imGjeYn6ANCM0_uS-lqC1em3oBi_itWBgNR1L-0qlpX7xc46wfNRZp5poru1CBiCqtzUUbsCSgR4P/s1600/logo_en.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: left; float: left; margin-bottom: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7NkBLew1wMxklRRTAVer_fhN9QkeaCUhcy9tkO48U7D7meGZxbwskACiIijqcBa4imGjeYn6ANCM0_uS-lqC1em3oBi_itWBgNR1L-0qlpX7xc46wfNRZp5poru1CBiCqtzUUbsCSgR4P/s1600/logo_en.png&quot; /&gt;&lt;/a&gt;&amp;nbsp; &amp;nbsp; A tricky point that needs some clarification is that when a user searches &lt;a href=&quot;http://openarchives.gr/&quot;&gt;openarchives.gr&lt;/a&gt;, the search is not submitted in real time to the target sources. Instead, it is performed locally on the openarchives.gr server, where full copies of the repositories/ libraries are stored (and updated at regular time intervals).&lt;br /&gt;
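Mechanically, the periodic retrieval just described boils down to issuing simple HTTP requests against each repository's OAI-PMH endpoint and walking the XML that comes back. Here is a minimal sketch in Python (the production harvester is written in Perl, and the endpoint URL in the test is a placeholder, not a real repository):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses and Dublin Core metadata records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(endpoint, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )
    return endpoint + "?" + query

def titles_from_response(xml_document):
    """Extract the dc:title values from a ListRecords response document."""
    root = ET.fromstring(xml_document)
    return [el.text for el in root.iter("{" + NS["dc"] + "}title")]

def harvest_titles(endpoint):
    """Fetch one page of records and return the titles they contain."""
    with urllib.request.urlopen(list_records_url(endpoint)) as response:
        return titles_from_response(response.read())
```

A real harvester would also follow the resumptionToken element that OAI-PMH uses to paginate large result sets; that loop is omitted here for brevity.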
&amp;nbsp; &amp;nbsp; The majority of the sources searched by openarchives.gr are OAI-PMH-compliant repositories (such as &lt;a href=&quot;http://www.dspace.org/&quot; target=&quot;_blank&quot;&gt;DSpace&lt;/a&gt; or &lt;a href=&quot;http://www.eprints.org/&quot; target=&quot;_blank&quot;&gt;EPrints&lt;/a&gt;). Therefore, their data are periodically retrieved via their OAI-PMH endpoints. However, it is worth mentioning that non-OAI-PMH digital libraries have also been included in its database. This was made possible through scraping their websites with &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; and transforming their metadata into &lt;a href=&quot;http://dublincore.org/&quot; target=&quot;_blank&quot;&gt;Dublin Core&lt;/a&gt;. So, more than &lt;i&gt;16,000&lt;/i&gt; records from &lt;i&gt;6&lt;/i&gt; significant online digital libraries (such as the &lt;a href=&quot;http://www.lykeionellinidon.gr/lyceumportal/&quot; target=&quot;_blank&quot;&gt;Lyceum Club of Greek Women&lt;/a&gt; and the &lt;a href=&quot;http://digma.mmb.org.gr/Default.aspx&quot; target=&quot;_blank&quot;&gt;Music Library&lt;/a&gt; of Greece “Lilian Voudouri”) were inserted into openarchives.gr with the use of DEiXTo wrappers and custom Perl code.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Finally, it is known that digital collections have flourished over the last few years and enjoy growing popularity. However, most of them do NOT provide their contents in OAI-PMH or another appropriate metadata format.
Actually, many of them (especially legacy systems) do NOT even offer an &lt;a href=&quot;http://en.wikipedia.org/wiki/Application_programming_interface&quot; target=&quot;_blank&quot;&gt;API&lt;/a&gt; or an &lt;a href=&quot;http://en.wikipedia.org/wiki/Search/Retrieve_Web_Service&quot; target=&quot;_blank&quot;&gt;SRW/U&lt;/a&gt; interface. Consequently, we believe that there is much room for &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; to help cultural and educational organizations (e.g., museums, archives, libraries and multimedia collections) to export, present and&amp;nbsp;distribute&amp;nbsp;their&amp;nbsp;digitized&amp;nbsp;items and rich content to the outside world, in an efficient and structured way, through scraping and repurposing their data.&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/01/open-archives-digital-libraries.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxu5vimSNanRWKq2llpJaAsEDq8iaJftws_hUKr48FdWYA8vPs2u-uGfOEf1lcsYZJNkcVA81t9JHqh4XJr0QVE3fLMhIyd4E7rM0mlD4Ts9CupqkGDh2GacRRy7sOTfsMwrpxmmOLhpzD/s72-c/OA100.gif" height="72" width="72"/><thr:total>2</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6296717665748183658</guid><pubDate>Tue, 17 Jan 2012 13:05:00 +0000</pubDate><atom:updated>2012-01-28T18:14:53.959+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">CAQDA</category><category domain="http://www.blogger.com/atom/ns#">Ethnography</category><category domain="http://www.blogger.com/atom/ns#">Forums</category><category domain="http://www.blogger.com/atom/ns#">Netnography</category><category domain="http://www.blogger.com/atom/ns#">Qualitative Analysis</category><category domain="http://www.blogger.com/atom/ns#">Scraping</category><category 
domain="http://www.blogger.com/atom/ns#">Social sites</category><title>Netnography &amp; Scraping</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Netnography&quot; target=&quot;_blank&quot;&gt;Netnography&lt;/a&gt;, or digital ethnography, is (or should be) the correct translation of ethnographic methods to online environments such as bulletin boards and social sites. It is more or less what ethnographers do in physical places like squares, pubs and clubs: observing what people say and do, and trying to participate as much as possible in order to better understand what&#39;s involved in actions and discourses. Ethnography can answer many of the what, when, who and how questions that define everyday problems. However, netnography differs from ethnography in many ways, especially in the fashion in which it is conducted.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Forums and wikis, as well as the blogosphere, are good online equivalents of public squares and pubs. There are no physical identities, but online ones; no faces, but avatars; no gender, age or any reliable information about physical identities, but there are voices discussing and arguing about common topics of interest.&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYj4L7s7OYm-1BaqZxvDlB0m8ySOIMcYRNxdm_keTbHyKvC6g6-HPdz_qk_2UPM4eaValW7qS-9w3Yx_DvzcBQ_oA_PW4Vd_3gUZAeZl2GNWPwXWwbGLJWgmClnks_aMjfMHu9GxlfZQID/s1600/soc-icons.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;86&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYj4L7s7OYm-1BaqZxvDlB0m8ySOIMcYRNxdm_keTbHyKvC6g6-HPdz_qk_2UPM4eaValW7qS-9w3Yx_DvzcBQ_oA_PW4Vd_3gUZAeZl2GNWPwXWwbGLJWgmClnks_aMjfMHu9GxlfZQID/s400/soc-icons.png&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; The more popular a forum is, the more difficult it gets to follow it netnographically. A netnographer has to use a Computer Assisted Qualitative Data Analysis (&lt;a href=&quot;http://en.wikipedia.org/wiki/Computer_assisted_qualitative_data_analysis_software&quot; target=&quot;_blank&quot;&gt;CAQDA&lt;/a&gt;) tool (such as &lt;a href=&quot;http://rqda.r-forge.r-project.org/&quot; target=&quot;_blank&quot;&gt;RQDA&lt;/a&gt;) on certain parts of the texts collected during their research. In a forum use case, these texts would be posts and threads. If the researcher had to browse the forum and manually copy and paste its content, a huge amount of effort would be required.
However, this obstacle can be overcome by scraping the forum with a web data extraction tool such as &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt;.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; A scraped forum is a jewel: perfectly ordered textual data corresponding to each thread, ready for further analysis. So, this is where DEiXTo comes into play and may boost the research process significantly. To our knowledge, &lt;a href=&quot;http://www.linkedin.com/in/jlchulilla&quot; target=&quot;_blank&quot;&gt;Dr Juan Luis Chulilla Cano&lt;/a&gt;, CEO of &lt;a href=&quot;http://www.onlineandoffline.net/&quot; target=&quot;_blank&quot;&gt;Online and Offline Ltd.&lt;/a&gt;, has been successfully using scraping techniques to capture the threads of popular Spanish forums (and their metadata) and transform them into a structured format, suitable for post-processing. Typically, such sites have a common presentation style for their threads and offer rich metadata. Thus, they are potential goldmines upon which various methodologies can be tested and applied so as to discover knowledge and trends and draw useful conclusions.&lt;/div&gt;&lt;div style=&quot;text-align: justify;&quot;&gt;&amp;nbsp; &amp;nbsp; Finally, netnography and anthropology seem to be gaining momentum over the last few years. They are really interesting as well as challenging fields, and scraping could evolve into an important ally. It is worth mentioning that quite a few IT vendors and firms employ ethnographers for R&amp;amp;D and testing of new products. Therefore, there is a lot of potential in using computer-aided techniques in the context of netnography.
So, if you are coming from social sciences&amp;nbsp;and creating wrappers/ extraction rules is not your second nature, why don&#39;t you &lt;a href=&quot;http://deixto.com/contact.php&quot; target=&quot;_blank&quot;&gt;drop us an email&lt;/a&gt;? Perhaps we could help you gather quite a few tons of usable data with DEiXTo! &lt;i&gt;Unless terms of use or copyright restrictions forbid it..&lt;/i&gt;&lt;/div&gt;&lt;/div&gt;</description><link>http://deixto.blogspot.com/2012/01/netnography-scraping.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYj4L7s7OYm-1BaqZxvDlB0m8ySOIMcYRNxdm_keTbHyKvC6g6-HPdz_qk_2UPM4eaValW7qS-9w3Yx_DvzcBQ_oA_PW4Vd_3gUZAeZl2GNWPwXWwbGLJWgmClnks_aMjfMHu9GxlfZQID/s72-c/soc-icons.png" height="72" width="72"/><thr:total>0</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-3639231664593965268.post-6589159885559155952</guid><pubDate>Thu, 12 Jan 2012 22:03:00 +0000</pubDate><atom:updated>2014-01-18T19:26:17.098+02:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">Geo-location</category><category domain="http://www.blogger.com/atom/ns#">Geographic data</category><category domain="http://www.blogger.com/atom/ns#">Google Maps</category><category domain="http://www.blogger.com/atom/ns#">Web services</category><category domain="http://www.blogger.com/atom/ns#">Yahoo PlaceFinder</category><title>Geo-location data, Yahoo! PlaceFinder &amp; Google Maps API</title><description>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div class=&quot;p1&quot; style=&quot;text-align: justify;&quot;&gt;
Location-aware applications have enjoyed huge success over the last few years, and geographic data are used extensively in a wide variety of ways. Meanwhile, there are numerous places of interest out there, such as shopping malls, airports, restaurants, museums and transit stations, and for most of them addresses are publicly available on the Web. Therefore, you could use &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; (or a web data extraction tool of your choice) to scrape the desired location information for any points of interest and then post-process it to produce geographic data for further use.&lt;/div&gt;
&lt;div class=&quot;p1&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3lyRJGhS9xJziboh8vR9KYW3yl2IoIgOsP044K9yBuxR91DZ2BLMbFKD6nVngb5f1BV3fWc0ui33NxrKqzTk1fAuz-L-a7kFpcPjobZiDC0nS9554gsyAJrXoiQXYVkiA4sonJ3GNaMDF/s1600/yahoo.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; float: right; margin-bottom: 1em; margin-left: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3lyRJGhS9xJziboh8vR9KYW3yl2IoIgOsP044K9yBuxR91DZ2BLMbFKD6nVngb5f1BV3fWc0ui33NxrKqzTk1fAuz-L-a7kFpcPjobZiDC0nS9554gsyAJrXoiQXYVkiA4sonJ3GNaMDF/s200/yahoo.png&quot; height=&quot;40&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;a href=&quot;http://developer.yahoo.com/geo/placefinder/&quot; target=&quot;_blank&quot;&gt;Yahoo! PlaceFinder&lt;/a&gt; is a great web service that supports world-wide geocoding of street addresses and place names. It allows developers to convert addresses and places into geographic coordinates (and vice versa). Thus, you can send an HTTP request with a street address to it and get the latitude and longitude back! It&#39;s amazing how well it works. Of course, the more complete and detailed the address, the more precise the coordinates returned.&lt;/div&gt;
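The request/response cycle is straightforward: one GET request per address, one set of coordinates back. The sketch below (Python rather than the Perl we actually used) shows the shape of a PlaceFinder-style call; Yahoo! has since retired the service, so treat the endpoint and the exact response layout as historical and partly assumed:

```python
import json
import urllib.parse

def placefinder_url(address, appid):
    """Build a PlaceFinder-style geocoding request; flags=J asks for JSON output."""
    query = urllib.parse.urlencode({"q": address, "appid": appid, "flags": "J"})
    return "http://where.yahooapis.com/geocode?" + query

def coordinates(response_text):
    """Pull (latitude, longitude) out of a PlaceFinder-style ResultSet."""
    best_match = json.loads(response_text)["ResultSet"]["Results"][0]
    return float(best_match["latitude"]), float(best_match["longitude"])
```

As noted above, the more complete and detailed the address string passed as q, the more precise the coordinates that come back.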
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; In the context of this post, we thought it would be nice, mostly for demonstration purposes, to build a map of&amp;nbsp;&lt;/span&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Thessaloniki&quot; style=&quot;text-align: left;&quot; target=&quot;_blank&quot;&gt;Thessaloniki&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp;museums using the&amp;nbsp;&lt;a href=&quot;http://code.google.com/apis/maps/documentation/javascript/&quot; target=&quot;_blank&quot;&gt;Google Maps API&lt;/a&gt;&amp;nbsp;and geo-location data generated with&amp;nbsp;&lt;/span&gt;Yahoo! PlaceFinder&lt;span style=&quot;text-align: left;&quot;&gt;. The source of data for our demo was&amp;nbsp;&lt;/span&gt;&lt;a href=&quot;http://odysseus.culture.gr/index_en.html&quot; style=&quot;text-align: left;&quot; target=&quot;_blank&quot;&gt;Odysseus&lt;/a&gt;&lt;span style=&quot;text-align: left;&quot;&gt;, the WWW server of the Hellenic Ministry of Culture that provides a full list of Greek museums, monuments and&amp;nbsp;archaeological&amp;nbsp;sites.&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUIAfAOoUqzzYtr6b0SnK2lPNboJq1o9DR1rt7wVu8ta4x0XIsXaQIV3SkdgHp_6W_EJa1HnFBQyddWhaMRKHB8DEfnaiGlXqi4wb3C6qa2pP3mUn3fkyL3jWhNqFqXTEoRlyZmzGe3I7L/s1600/odysseus.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUIAfAOoUqzzYtr6b0SnK2lPNboJq1o9DR1rt7wVu8ta4x0XIsXaQIV3SkdgHp_6W_EJa1HnFBQyddWhaMRKHB8DEfnaiGlXqi4wb3C6qa2pP3mUn3fkyL3jWhNqFqXTEoRlyZmzGe3I7L/s400/odysseus.png&quot; height=&quot;226&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&amp;nbsp; &amp;nbsp; So, we searched for museums located in the city of Thessaloniki (&lt;span style=&quot;text-align: left;&quot;&gt;the second-largest city in Greece and the capital of the region of Central Macedonia)&lt;/span&gt; and extracted through &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; the street addresses of the ten results returned. In the picture below you can see a sample screenshot from the &quot;INFORMATION&quot; section of the Odysseus &lt;a href=&quot;http://odysseus.culture.gr/h/1/eh155.jsp?obj_id=3273&quot; target=&quot;_blank&quot;&gt;detailed webpage&lt;/a&gt; for the &lt;a href=&quot;http://www.lemmth.gr/c/portal_public/layout?p_l_id=1.2&amp;amp;setlanguage=en_US&quot; target=&quot;_blank&quot;&gt;Folk Art and Ethnological Museum of Macedonia and Thrace&lt;/a&gt; (from which the address of this specific museum was scraped):&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49ptyUcq9khW0ryUX-ATRO4vZO4MHcYxM87ueiJDY9BsB5N5NNU5POXJFCvy2RYMD86zp02S_PoFaShFcwXN0FZidZmaosMk7PKUbuEqF_lUR-pWbM3O32e-RNS1avTDqQ4Txw_dj9dGS/s1600/lemm_odysseus.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi49ptyUcq9khW0ryUX-ATRO4vZO4MHcYxM87ueiJDY9BsB5N5NNU5POXJFCvy2RYMD86zp02S_PoFaShFcwXN0FZidZmaosMk7PKUbuEqF_lUR-pWbM3O32e-RNS1avTDqQ4Txw_dj9dGS/s400/lemm_odysseus.png&quot; height=&quot;148&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; After capturing the name and location of each museum and exporting them to a simple tab-delimited text file, we wrote a Perl script harnessing the &lt;a href=&quot;http://search.cpan.org/~gray/Geo-Coder-PlaceFinder-0.05/lib/Geo/Coder/PlaceFinder.pm&quot; target=&quot;_blank&quot;&gt;Geo::Coder::PlaceFinder&lt;/a&gt; CPAN module to automatically find their geo-location coordinates and create an XML output file containing all the necessary information (through &lt;a href=&quot;http://search.cpan.org/~josephw/XML-Writer-0.614/Writer.pm&quot; target=&quot;_blank&quot;&gt;XML::Writer&lt;/a&gt;). Part of this XML document is displayed right below:&lt;/span&gt;&lt;br /&gt;
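For readers who do not speak Perl, the output step can be sketched with Python's standard library instead of XML::Writer. The element names in this sketch are guesses for illustration, not the exact schema of our original file:

```python
import xml.etree.ElementTree as ET

def museums_to_xml(museums):
    """Serialize (name, latitude, longitude) tuples into a small XML document,
    mirroring what the Perl XML::Writer script produced (element names assumed)."""
    root = ET.Element("museums")
    for name, lat, lng in museums:
        node = ET.SubElement(root, "museum")
        ET.SubElement(node, "name").text = name
        ET.SubElement(node, "latitude").text = str(lat)
        ET.SubElement(node, "longitude").text = str(lng)
    return ET.tostring(root, encoding="unicode")
```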
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGR5Wq9XnDrMM2AO7dm0hylQVbS86n3RRLY8LpV3RUp2Mxejy83MaVf-rN1sRXYl6oddcIHJleEwTGMIp0S6DT36RpErjU5H94jacVaq9QygiYYtGF7RnYeMRk7ycMIy263bNPI7QQhObw/s1600/xml_museums.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGR5Wq9XnDrMM2AO7dm0hylQVbS86n3RRLY8LpV3RUp2Mxejy83MaVf-rN1sRXYl6oddcIHJleEwTGMIp0S6DT36RpErjU5H94jacVaq9QygiYYtGF7RnYeMRk7ycMIy263bNPI7QQhObw/s400/xml_museums.png&quot; height=&quot;180&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; After having all the metadata we needed in this XML file, we utilized the &lt;a href=&quot;http://code.google.com/apis/maps/documentation/javascript/&quot; target=&quot;_blank&quot;&gt;Google Maps JavaScript API v3&lt;/a&gt; and created a&amp;nbsp;&lt;a href=&quot;http://deixto.com/wp-content/uploads/thessaloniki_museums_map.html&quot; target=&quot;_blank&quot;&gt;map&lt;/a&gt;&amp;nbsp;(centered on Thessaloniki)&amp;nbsp;displaying&amp;nbsp;all city museums! To accomplish that goal, we followed the helpful guidelines given in this &lt;a href=&quot;http://www.svennerberg.com/2009/07/google-maps-api-3-markers/&quot; target=&quot;_blank&quot;&gt;very informative post&lt;/a&gt;&amp;nbsp;about Google Maps markers and wrote a short script that parsed the XML contents (via &lt;a href=&quot;http://search.cpan.org/~shlomif/XML-LibXML-1.90/LibXML.pod&quot; target=&quot;_blank&quot;&gt;XML::LibXML&lt;/a&gt;) and produced a web page with the desired Google Map object embedded (including markers for each museum). Finally, t&lt;/span&gt;&lt;span style=&quot;text-align: left;&quot;&gt;he&amp;nbsp;&lt;a href=&quot;http://deixto.com/thessaloniki_museums_map.html&quot; target=&quot;_blank&quot;&gt;end result&lt;/a&gt;&amp;nbsp;was pretty satisfying (after some extra manual effort to be absolutely honest):&lt;/span&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguqnt0gsAPVpNSesEgsUITmDil-yLtEMw_T3UL4_r8VsSRCsPiijw_4lVntvBJPNXWLFne4tmhnyk3-FY-tplo7yr-yZCO7ylnJqBPQloVPzUEd4vSzIrMm53XeEj5YjGrGtY5lA-VtVBB/s1600/Google_map_thessaloniki_museums.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguqnt0gsAPVpNSesEgsUITmDil-yLtEMw_T3UL4_r8VsSRCsPiijw_4lVntvBJPNXWLFne4tmhnyk3-FY-tplo7yr-yZCO7ylnJqBPQloVPzUEd4vSzIrMm53XeEj5YjGrGtY5lA-VtVBB/s400/Google_map_thessaloniki_museums.png&quot; height=&quot;400&quot; width=&quot;388&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
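The marker-generation step is the only part that touches JavaScript: for each museum in the XML file, the script emits one google.maps.Marker statement into the page. A rough Python equivalent, assuming hypothetical museum/name/latitude/longitude element names (our actual file may differ):

```python
import json
import xml.etree.ElementTree as ET

def marker_statements(xml_document):
    """Emit one Google Maps JavaScript API v3 marker statement per museum element."""
    statements = []
    for node in ET.fromstring(xml_document).iter("museum"):
        title = json.dumps(node.findtext("name"))  # doubles as a JS string literal
        lat = node.findtext("latitude")
        lng = node.findtext("longitude")
        statements.append(
            "new google.maps.Marker({position: new google.maps.LatLng("
            + lat + ", " + lng + "), map: map, title: " + title + "});"
        )
    return statements
```

Each emitted statement assumes the surrounding page has already created a google.maps.Map object named map, as in the marker tutorial mentioned above.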
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;span style=&quot;text-align: left;&quot;&gt;&amp;nbsp; &amp;nbsp; This is kind of cool, isn&#39;t it? Of course, the same procedure could be applied on a larger scale (e.g. for creating a map of Greece with ALL museums and/or monuments available) or expanded to other points of interest (whatever you can imagine, from schools and educational institutions to cinemas, supermarkets, shops or bank ATMs). In conclusion, we think that the combination of &lt;a href=&quot;http://deixto.com/&quot; target=&quot;_blank&quot;&gt;DEiXTo&lt;/a&gt; with other powerful tools and technologies can sometimes yield an innovative and hopefully useful outcome. Since you have the raw web data at your disposal (captured with DEiXTo), your imagination (and perhaps &lt;a href=&quot;http://deixto.blogspot.com/2011/12/robotstxt-access-restrictions.html&quot; target=&quot;_blank&quot;&gt;copyright restrictions&lt;/a&gt;) is the only limit!&lt;/span&gt;&lt;/div&gt;
&lt;/div&gt;
</description><link>http://deixto.blogspot.com/2012/01/geo-location-data-yahoo-placefinder.html</link><author>noreply@blogger.com (kntonas)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3lyRJGhS9xJziboh8vR9KYW3yl2IoIgOsP044K9yBuxR91DZ2BLMbFKD6nVngb5f1BV3fWc0ui33NxrKqzTk1fAuz-L-a7kFpcPjobZiDC0nS9554gsyAJrXoiQXYVkiA4sonJ3GNaMDF/s72-c/yahoo.png" height="72" width="72"/><thr:total>0</thr:total></item></channel></rss>