#bytescrolls

Run Datastax Graph in a docker container for windows hosts

noreply@blogger.com (Unknown) — Wed, 26 Apr 2017 21:29:00 +0000

#docker #datastax

There are some Docker images available here. If you don't have Docker, download and install Docker Toolbox for your system from here. If you are using a Windows Enterprise edition, Docker for Windows may not be supported on older versions, so you may have to upgrade and use Hyper-V, the native Windows virtualisation. Docker Toolbox, on the other hand, relies on a third-party hypervisor such as VirtualBox. Note that if Hyper-V is enabled, you may have to disable it before Docker Toolbox will install. To disable Hyper-V (only needed for Docker Toolbox), run the following from an elevated command prompt.

dism.exe /Online /Disable-Feature:Microsoft-Hyper-V-All

After installing Docker on your Windows box, run this from the Docker bash shell:

docker run --name my-dse -d -p 9042:9042 luketillman/datastax-enterprise:5.1.0 -g

This will download the image from Docker Hub (thanks to Luke Tillman from DataStax). For your Java/Scala code, client, or Dev Studio to work, make sure that port 9042 is exposed. The -g option starts the node with graph enabled. To open a bash shell in the container, from which you can reach Gremlin, execute the following:

docker exec -it my-dse bash

You are now logged in to the server and can start Gremlin with /usr/bin/dse gremlin-console to create graphs etc. Note that this image supports only a single node. To connect to the DSE node from the Windows host, you need the IP of the Docker machine that hosts the container, so run:

docker-machine env default

The host part of the DOCKER_HOST value can be used by client code as the endpoint for connecting to the Cassandra graph.
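
DOCKER_HOST usually holds a URL for the Docker daemon rather than a bare IP, so the client needs only its host part, paired with the published port 9042. Here is a minimal Python sketch of that step. The function name and the sample address are made up for illustration, and the `tcp://host:port` shape is an assumption about what `docker-machine env` prints on your machine.

```python
from urllib.parse import urlparse


def cassandra_contact_point(docker_host, cql_port=9042):
    """Derive a (host, port) contact point from a DOCKER_HOST value."""
    # DOCKER_HOST points at the Docker daemon, so keep only the host
    # and pair it with the CQL port published with -p 9042:9042.
    host = urlparse(docker_host).hostname
    if host is None:
        raise ValueError(f"no host found in {docker_host!r}")
    return host, cql_port


# Hypothetical value, in the shape printed by `docker-machine env default`.
print(cassandra_contact_point("tcp://192.168.99.100:2376"))
# ('192.168.99.100', 9042)
```

The returned pair is what you would hand to your driver as the contact point.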

A simple recommender system for your e-commerce store using a graph database

noreply@blogger.com (Unknown) — Fri, 15 Apr 2016 10:31:00 +0000

#graph #recommendation #orientdb #e-commerce #etl

In the last post, I introduced a simple ETL use case for a graph database like OrientDB. If you haven't read it, I suggest you read this: Orient DB - A simple ETL use case note.

After loading the data, you might want to play around with the graph structure and its possible traversal logic. As it is easy to represent the semantic relationships between entities, the queries we write are designed around the logic we come up with. In the last post, I provided a query to find the books bought by the people a buyer knows or has befriended. In this post, I will provide some more simple examples of querying such a graph in OrientDB. Here I am using the native SQL-like query language supported by the database.

How do we find out the books bought by a buyer named ‘Hary’?

select @rid, title from (select expand(out('Bought')) from Buyer where name='Hary')

This query returns the Record ID of each book along with its title. In OrientDB each record has its own self-assigned unique ID within the database, called the Record ID or RID, of the form #<cluster-id>:<cluster-position>. cluster-id is the id of the cluster and cluster-position is the position of the record inside the cluster. You can think of a cluster as a table where the records of a class (say, Buyer) are stored. The subquery uses the expand function to expand the collection in the field and use it as the result. It fetches the records linked by the edge 'Bought'.

How do we find out the people ‘Hary’ knows?

select expand(out('Knows')) from  Buyer where name='Hary'

Find out the books bought by friends of Hary?

select title from (
  select expand(out('Bought')) from (
    select expand(out('Knows')) from Buyer where name='Hary'
  )
)

Here we combined the two queries above: because the links between vertices are explicit, the result of one traversal can be fed straight into the next.

Find out books bought by Hary but not by his friends, so that we can recommend some?

select title from (select expand(out('Bought')) from Buyer where name='Hary') 
let $temp = (
  select title from (
    select expand(out('Bought')) from (
      select expand(out('Knows')) from  Buyer where name='Hary'
    )
  )
) where title not in $temp

Here we used LET to assign the results of a subquery. In the subquery, we find the books bought by Hary’s friends. Then we find the books bought by Hary but not by friends.
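
To make the logic of the LET query concrete, here is a small Python sketch of the same set difference over an in-memory graph. The dictionaries and the function name are hypothetical stand-ins for the 'Knows' and 'Bought' edges, and the sample data is made up. It is not how OrientDB evaluates the query.

```python
# Hypothetical sample data: each buyer mapped to the targets of its outgoing edges.
knows = {"Hary": ["Mary"], "Mary": ["Hary"]}
bought = {"Hary": ["Carrie", "The Shining"], "Mary": ["The Shining"]}


def books_only_bought_by(buyer):
    """Books bought by `buyer` that none of the buyer's friends bought."""
    # Plays the role of $temp: titles bought by the people `buyer` knows.
    friends_titles = {
        title
        for friend in knows.get(buyer, [])
        for title in bought.get(friend, [])
    }
    # Plays the role of `where title not in $temp`.
    return [t for t in bought.get(buyer, []) if t not in friends_titles]


print(books_only_bought_by("Hary"))  # ['Carrie']
```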

Find out the books bought by people who also bought a book like The Shining? This is a common use case for recommendation links, where we may want to list similar products bought by people who are about to buy the displayed product.

select expand(inE('Bought').outV().outE('Bought').inV().title)
from Book where title not in ['The Shining']

Orient DB - A simple ETL use case note

noreply@blogger.com (Unknown) — Tue, 12 Apr 2016 17:11:00 +0000

#orientdb #graph #etl #java #database

Someone who is familiar with graph data structures would like to know how to map real-world models to a graph and process them. If you try to build them programmatically and approach them using traversal algorithms, you are going to have a hard time. If your application uses a relational database to store data mapped to these models, it becomes complex as you try to link them with more relationships. How will you design the relationships between domains in a more semantic way? How would you query them with a SQL-like language or a DSL? Graph databases should be the right candidate. Here I am trying out OrientDB.

In relational databases, primary-key and foreign-key columns support joins that are computed at query time, which is memory and compute intensive. We also use junction tables for many-to-many relationships between highly normalized tables, which increases query execution time and complexity. Graph databases are like relational databases, but with first-class support for “relationships”, defined by edges (stored as lists) that connect nodes (vertices/entities). Whenever you run the equivalent of a join, the database just uses this materialized list and has direct access to the connected nodes, eliminating the need for an expensive search/match computation.

Consider the following tables:

Author Table

id	name
1	Stephen King
2	George R. R. Martin
3	John Grisham

Book Table

id	author_id	title
1	1	Carrie
2	1	The Shining

Buyer Table

id	name	knows_id	book_id
1	Hary	2	2
2	Mary	1	2

In a graph database like OrientDB, we can define the relationships in a more semantic way. Graph databases operate on three structures: Vertex (sometimes called Node), Edge (or Arc) and Property (sometimes called Attribute).

Vertex - it's the data: Author, Book etc.
Edge - the physical relation between vertices. Each edge connects two different vertices, no more, no less. Additionally, an edge has a label and a direction, so if you label your edge Bought, you know that Hary bought the book The Shining. The direction of the relationship can be either Out or In.
Property - a value related to a Vertex or an Edge.
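
The three structures can be sketched in a few lines of Python. This is only an illustration of the model, with made-up class and helper names. It is not OrientDB's API or storage format. Note how following an edge needs no join: each vertex already holds the list of its edges.

```python
from dataclasses import dataclass, field


# eq=False keeps identity comparison, since vertices and edges reference each other.
@dataclass(eq=False)
class Vertex:
    label: str                                       # class of the vertex, e.g. "Buyer"
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)    # edges leaving this vertex
    in_edges: list = field(default_factory=list)     # edges arriving at this vertex


@dataclass(eq=False)
class Edge:
    label: str                                       # e.g. "Bought"
    out_v: Vertex                                    # vertex the edge starts from
    in_v: Vertex                                     # vertex the edge points to
    properties: dict = field(default_factory=dict)


def connect(src, label, dst, **props):
    """Create a directed edge and register it on both endpoints."""
    edge = Edge(label, src, dst, props)
    src.out_edges.append(edge)
    dst.in_edges.append(edge)
    return edge


def out(vertex, label):
    """Follow the outgoing edges with the given label."""
    return [e.in_v for e in vertex.out_edges if e.label == label]


hary = Vertex("Buyer", {"name": "Hary"})
shining = Vertex("Book", {"title": "The Shining"})
connect(hary, "Bought", shining)
print([b.properties["title"] for b in out(hary, "Bought")])  # ['The Shining']
```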

OrientDB comes with an ETL tool to import data. Also, you can use the libraries and write your own code to create nodes in the database. A generic framework for graph databases is available. More on Apache TinkerPop later.
You have to define configuration files for loading certain data into the graph store.
In the above sample configuration, you are defining,

"source": { "file": { "path": "csv file location" } } // the source of file input for a model/entity
In the transformer definition:
- vertex is the model or table
- edge defines the edges in and out of the table
In the loader definition we define all the entities and constraints.

More about the transformation definition can be read here
Import the CSV files and configuration from the GitHub repo. Please change the locations of the files and the configuration to match your environment.

Simply execute the oetl.sh tool from $ORIENTDB_HOME as sh oetl.sh 'location of conf file'

You have to execute all the configurations to load all the data.
After loading all the data you can query it and visualize it in OrientDB's web-based console.

Here you can see the links between the entities.

How do you find the books bought by your friends?

select expand( both('Knows').out('Bought')) from Buyer where name = 'Hary'

Analytics by SQL and Spark using Apache Zeppelin

noreply@blogger.com (Unknown) — Fri, 04 Dec 2015 13:25:00 +0000

#spark #hadoop #analytics #apache #zeppelin #scala

I was looking for a cool dashboard-based query interface for analytics, and stumbled upon a cool open source project called Apache Zeppelin:

Zeppelin is a modern web-based tool for the data scientists to collaborate over large-scale data exploration and visualization projects. It is a notebook style interpreter that enable collaborative analysis sessions sharing between users. Zeppelin is independent of the execution framework itself. Current version runs on top of Apache Spark but it has pluggable interpreter APIs to support other data processing systems. More execution frameworks could be added at a later date i.e Apache Flink, Crunch as well as SQL-like backends such as Hive, Tajo, MRQL.

As their Apache proposal mentions, it has good support for pluggable interpreters (a lot of them), i.e. you can query files, databases, Hadoop etc. seamlessly through this interface. The application runs easily on your workstation if you want to try it out. Download it from the project site and follow the installation guide.

Run the zeppelin server daemon, and access the UI at http://localhost:8088

We can use different interpreters in notebooks and display the results in a dashboard. I was interested in a plain simple SQL database, like PostgreSQL.

Create a table named sales and insert some sample data.

create table sales(category varchar, units integer);
insert into sales values('Men-Shirts', 134344);
insert into sales values('Men-Shoes', 56289);
insert into sales values('Men-Wallets', 19377);
insert into sales values('Men-Watches', 345673);
insert into sales values('Women-Shirts', 87477);
insert into sales values('Women-Skirts', 140533);
insert into sales values('Women-Shoes', 77301);
insert into sales values('Electronics-Mobile', 67457);
insert into sales values('Electronics-Tablets', 21983);
insert into sales values('Electronics-Accessories', 865390);

Create a notebook.

Set up the connection properties in the psql interpreter configuration.

Then run it with the %psql interpreter. In the notebook, type in:

%psql select * from sales

You have the dashboard ready. You can share the graph as a link and run the notebook on a schedule.

Then I decided to use Spark code. As Spark supports JDBC sources, use that in the Spark context. In Spark, JdbcRDD can be used to connect to a relational data source. RDDs are the unit of compute and storage in Spark but lack any information about the structure of the data, i.e. the schema. DataFrames combine RDDs with a schema. To support PostgreSQL as a source, the driver must be loaded before queries can be executed or a schema built. Copy the driver to $ZEPPELIN_HOME/interpreter/spark and restart the daemon. If you don't do this, you will not be able to use PostgreSQL as a source and may get JDBC connection errors like "No suitable driver found".

Use the notebook to provide the Spark code:

In the %sql (note, it's not %psql) interpreter, provide:

%sql select * from sales

You only have to schedule the %sql notebook; the dashboard is then updated with newly inserted data whenever the cron job is triggered.

Json parsing, Scala way

noreply@blogger.com (Unknown) — Thu, 10 Sep 2015 12:34:00 +0000

Most Java developers are familiar with JSON parsing and object mapping using the Jackson library's ObjectMapper, which serializes POJOs to JSON strings and back. In Scala, Play's JSON inception mechanism provides a subtle way to serialize JSON. It relies on Scala macros (a macro is a piece of Scala code, executed at compile time, that manipulates and modifies the AST of a Scala program, i.e. compile-time metaprogramming). A macro can introspect code at compile time based on the Scala reflection API, access all imports and implicits in the current compile context, and generate code. This means case classes are serialized to JSON automatically. You can also explicitly provide the path to a JSON key and map its value to an object's field, but for simple case classes that is just more boilerplate. Use it when you need more powerful mapping and logic for the serialized fields.

So how does this mapping work? The macro compiler replaces, say, Json.reads[T] by injecting code into the compiled Scala AST (Abstract Syntax Tree) in the compile chain, and eventually writes out the code that maps JSON fields to the object. Internally, Play's JSON module uses Jackson's object mapper (ref: play.api.libs.json.jackson.JacksonJson).

You can add a dependency in build.sbt of a minimal Scala project, which provides the JSON APIs from the Play framework:

"com.typesafe.play" %% "play-ws" % "2.4.2" withSources()

For example, if we have two classes (case classes, in this case):

case class Region(name: String, state: Option[String])
case class Sales(count: Int, region: Region)

You have to add implicit values for reading and writing between JSON and objects. Values marked implicit are inserted for you by the compiler, and the type is inferred from the context. Compilation fails if no implicit value of the right type is available in scope.

implicit val readRegion = Json.reads[Region]
implicit val readSales = Json.reads[Sales]
implicit val writeRegion = Json.writes[Region]
implicit val writeSales = Json.writes[Sales]

If you interchange the order of readRegion and readSales, you will get a compilation error. The compiler creates a Reads[T] by resolving the case class fields and the required implicits at compile time, so if any implicit is missing, the compiler will fail with the corresponding error.

Error:(12, 38) No implicit format for test.Region available.
implicit val readSales = Json.reads[Sales]

An interesting method to try is validate(), used while converting JSON to an object, which helps to pinpoint the path of an error.

Executing the following program:

Results:

This is testing json..
Test 1
-------
Result:Some(Sales(123,Region(West,None)))
Test 2
-------
Error at JsPath: /region/name
error.path.missing
()
Result:None
Test 3
------
Error at JsPath: /count
error.expected.jsnumber
Error at JsPath: /region/name
error.expected.jsstring
()
Result:None
Test 4
------
Result:{"count":123,"region":{"name":"West","state":"California"}}
Process finished with exit code 0
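
The idea behind the path-based errors in these results can be sketched outside Play as well. The following Python function is only an analogy, not Play's validate() and not the program that produced the output above. It walks a parsed JSON document, collects every problem together with its path instead of stopping at the first one, and reuses the message strings seen in the results. The function name, the sample inputs and the "error.expected.jsobject" code are assumptions.

```python
def validate_sales(doc):
    """Return (sales, errors); errors is a list of (path, message) pairs."""
    errors = []

    def require(obj, key, expected_type, path, code):
        # Record the path of every problem instead of failing fast.
        if not isinstance(obj, dict) or key not in obj:
            errors.append((path, "error.path.missing"))
            return None
        value = obj[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected_type) or isinstance(value, bool):
            errors.append((path, code))
            return None
        return value

    count = require(doc, "count", int, "/count", "error.expected.jsnumber")
    region = require(doc, "region", dict, "/region", "error.expected.jsobject")
    name = None
    if region is not None:
        name = require(region, "name", str, "/region/name", "error.expected.jsstring")
        state = region.get("state")  # optional, like Option[String]
        if state is not None and not isinstance(state, str):
            errors.append(("/region/state", "error.expected.jsstring"))
    if errors:
        return None, errors
    return {"count": count, "region": {"name": name, "state": region.get("state")}}, []


# Hypothetical inputs: a missing name, then two wrongly typed fields.
print(validate_sales({"count": 123, "region": {"state": "California"}}))
print(validate_sales({"count": "123", "region": {"name": 1}}))
```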

Simple metastore creation for Hive in MySQL

noreply@blogger.com (Unknown) — Sun, 05 May 2013 11:32:00 +0000

For Hive, the metastore is like the system catalog: it contains metadata about the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The database is a namespace for tables, where ‘default’ is used for tables with no user-supplied database name. The metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information (which I can detail in future posts). It can also contain any user-supplied key and value data, which can be used for table statistics. Storage information includes the location of the table's data in the underlying file system, data formats and bucketing information. SerDe metadata (the SerDe controls how Hive serializes/deserializes the data in a row) includes the implementation class of the serializer and deserializer methods and any supporting information required by that implementation. Partitions can have their own columns, SerDe and storage information, which can be used in the future to evolve the Hive schema. The metastore uses either a traditional relational database (like MySQL or Oracle) or the file system, and not HDFS: HDFS is optimized for sequential scans only, so HiveQL statements that only access metadata objects would execute slowly on it.

It's simple to install the metastore.

- Install mysql-connector

$ sudo yum install mysql-connector-java

- Create a symbolic link in the Hive directory

$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

- Create the database for the Hive metastore. CDH4 ships with scripts for Derby, MySQL, Oracle and PostgreSQL.

$ mysql -u root -p
mysql> CREATE DATABASE hivemetastoredb;
mysql> USE hivemetastoredb;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.9.0.mysql.sql;

- Create a user for the metastore

mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';

- Grant access for all hosts in the network

mysql> GRANT ALL PRIVILEGES ON hivemetastoredb.* TO hive@'%' WITH GRANT OPTION;
mysql> FLUSH PRIVILEGES;

Add the following entries to the file /etc/hive/conf/hive-site.xml if you are connecting over JDBC:

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hivemetastoredb</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>

Data and Brain

noreply@blogger.com (Unknown) — Tue, 18 Dec 2012 18:32:00 +0000

#bigdata

Came across an interesting presentation on Using Data to Understand Brain.

Using Data to Understand the Brain from jakehofman

Is it possible to read your brain? hmmm

I am a little two-faced with these riddles....

Eventual Consistency

noreply@blogger.com (Unknown) — Mon, 17 Dec 2012 19:19:00 +0000

#distributed #nosql

Unicode features in various languages

noreply@blogger.com (Unknown) — Sun, 16 Dec 2012 20:38:00 +0000

Here’s what each language natively supports in its standard distribution.

Unicode	Javascript	ᴘʜᴘ	Go	Ruby	Python	☕ Java	Perl
Internally	UCS‐2 or UTF‐16	UTF‐8⁻	UTF‐8	varies	UCS‐2 or UCS‐4	UTF‐16	UTF‐8⁺
Identifiers	─	✔	✔	✔	✅^∓	✔	✔
Casefolding	none	simple	simple	full	none	simple	full
Casemapping	simple	simple	simple^∓	full	simple	full	full
Graphemes	─	✅	─	─	─	─	✔
Normalization	─	✔	─⁺	─	✔	✔	✔
UCA Collation	─	─	─	─	─	─	✔⁺
Named Characters	─	─	─	─	✅	─	✔⁺
Properties	─	two	(non‐regex)⁻	three	(non‐regex)⁻	two⁺	every⁺

From Tom Christiansen's Unicode Support Shootout: The Good, the Bad, the Mostly Ugly

Grapheme - A grapheme is the smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages.

Casefolding - Unicode defines case folding through the three case-mapping properties of each character: uppercase, lowercase and titlecase. These properties relate all characters in scripts with differing cases to the other case variants of the character.

Case mapping - is used to handle the mapping of upper-case, lower-case, and title case characters for a given language.

What is the difference between case mapping and case folding? Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
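
A quick way to see the difference is Python's built-in string methods, used here purely as an illustration: upper() and lower() are case mappings, while casefold() produces a key meant only for caseless comparison. The helper name is made up.

```python
word = "Straße"

# Case mapping: convert to a particular form, e.g. for display.
print(word.upper())     # STRASSE
print(word.lower())     # straße

# Case folding: build a key for caseless comparison, not for display.
print(word.casefold())  # strasse


def same_ignoring_case(a, b):
    """Compare two strings caselessly via their case-folded forms."""
    return a.casefold() == b.casefold()


print(same_ignoring_case("Straße", "STRASSE"))  # True
print("Straße".lower() == "STRASSE".lower())    # False
```

Lowercasing alone misses the match because ß has no single-character uppercase form to map back from.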


Normalization - Unicode has encoded many entities that are really variants of existing nominal characters. The visual representations of these characters are typically a subset of the possible visual representations of the nominal character.
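
As an illustration in Python, the standard unicodedata module can normalize both kinds of variants: canonically equivalent sequences (NFC/NFD) and compatibility variants of a nominal character (NFKC). The variable names are just for the example.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two look identical but compare unequal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# A compatibility variant: the "fi" ligature folds to the nominal letters.
ligature = "\ufb01"
print(unicodedata.normalize("NFKC", ligature))               # fi
```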

UCA Collation - Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds. The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.

Named Characters - Unicode characters are assigned a unique Name (na). The name, in English, is composed of A-Z capitals, 0-9 digits, - (hyphen-minus) and space. The Unicode Standard specifies notational conventions for referring to sequences of characters (or code points) treated as a unit, using angle brackets surrounding a comma-delimited list of code points, code points plus character names, and so on. For example, both of the designations in Table 1 refer to a combining character sequence consisting of the letter “a” with a circumflex and an acute accent applied to it.
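
Python is one of the languages that supports named characters, so it makes a handy illustration. The sketch below looks up names in both directions and spells out the combining sequence mentioned above as code points plus names. The variable names are only for the example.

```python
import unicodedata

# From character to name, and from name back to character.
print(unicodedata.name("\u00e2"))    # LATIN SMALL LETTER A WITH CIRCUMFLEX
print(unicodedata.lookup("LATIN SMALL LETTER A WITH CIRCUMFLEX") == "\u00e2")  # True

# The letter "a" with a circumflex and an acute accent, as a combining sequence.
sequence = "a\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING ACUTE ACCENT}"
for ch in sequence:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0061 LATIN SMALL LETTER A
# U+0302 COMBINING CIRCUMFLEX ACCENT
# U+0301 COMBINING ACUTE ACCENT
```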

Properties - Each Unicode character belongs to a certain category. Unicode assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes, like in line-breaking, script direction right-to-left or applying controls.
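
For a feel of what such properties look like, here is a small Python sketch that reads two of them, the general category and the bidirectional class, through the standard unicodedata module. The sample characters are arbitrary.

```python
import unicodedata

# A Latin letter in both cases, a digit, a space and a Hebrew letter.
samples = ["a", "A", "5", " ", "\u05d0"]

for ch in samples:
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),        # general category, e.g. Ll = lowercase letter
          unicodedata.bidirectional(ch))   # bidi class, e.g. R = right-to-left
# U+0061 Ll L
# U+0041 Lu L
# U+0035 Nd EN
# U+0020 Zs WS
# U+05D0 Lo R
```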
Perl looks cool!

Machine generated data

noreply@blogger.com (Unknown) — Tue, 31 Jul 2012 19:11:00 +0000

At first, the term "machine-generated data" can be confusing. One would think that all data (is it "is" or "are"?) generated from one device or another is provided by an innocent mortal in this so-called era of social media and big data. So there should be a clear distinction in such definitions. If a user enters some data in a form, it is not considered machine generated. At the same time, the same application can track the user's location and log it on a remote server; that is machine-generated data.

Wikipedia says,

Machine-generated data (MGD) is the generic term for information which was automatically created from a computer process, application, or other machine without the intervention of a human.

According to Monash Research,

In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.

So what are they? Are they a stream of logs flowing through the information super waterway?

Maybe, until they are churned into some books or toilet rolls!

Application Logs - Logs generated by web or desktop applications. The server-side logs are used for debugging and support tickets!

Call Detail Records - The ones recorded by your telecom company. They contain useful details of each call or service that passed through the switch, like the phone numbers involved, the call duration etc. Needed for billing.

Web logs - used to count visitors and for similar web analytics.

Database Audit Logs - Auditing is enabled to watch for suspicious database activity; it is common that not much information is available up front to target specific users or schema objects.

OS logs - track crashes or errors.

There is plenty of similar data generated by other applications and systems, like RFID readers, sensors etc. These messages can then be mashed up. Machine data has a structure or format, and semantics based on the domain it comes from.

Such data grows fast and continuously. It is a stream of data and, like history, it is not changed. It is a record of events.
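
As a small illustration of working with such logs, the Python sketch below counts hits per visitor from web server lines. It assumes the Common Log Format. The sample lines, the pattern and the function name are made up for the example.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)


def hits_per_visitor(lines):
    """Count requests per client host, skipping lines that do not parse."""
    hits = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            hits[match.group("host")] += 1
    return hits


sample = [
    '10.0.0.1 - - [31/Jul/2012:19:11:00 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [31/Jul/2012:19:11:05 +0000] "GET /about.html HTTP/1.1" 200 1200',
    '10.0.0.1 - - [31/Jul/2012:19:12:10 +0000] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]
hits = hits_per_visitor(sample)
print(len(hits), "visitors;", dict(hits))
```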


Anyone tried iPhonetracker?


Geolocation and LBS do push a load of data. HTML5 has geolocation functionality (although you have the choice not to be tracked). Following is some sample code to test it.

Nodeable - Realtime Insights

noreply@blogger.com (Unknown) — Mon, 23 Jul 2012 14:11:00 +0000

#Nodeable is a good example of generating #insights from #bigdata or real-time trickle feeds. It uses Twitter's Storm for its processing engine, StreamReduce. I signed up for a trial account to play around.

Insights like "Most Active" metrics are generated for Amazon Web Services status. The reports are generated and tagged in real time. The Twitter follow counts are displayed.

It has only a basic set of connectors, but one can create custom connectors using its JSON Schema. The outbound data can be pushed to your own Amazon S3 or Hadoop WebHDFS, which is good for private companies.

The GitHub/RSS stream is shown as an activity stream.

Sharing an interesting presentation on Storm real-time computation.
ETE 2012 - Nathan Marz on Storm from Chariot Solutions on Vimeo.

Hadoop meetup @inmobi Bangalore

noreply@blogger.com (Unknown) — Fri, 20 Apr 2012 18:08:00 +0000

Had a chance to attend the #hadoop #meetup today at #Inmobi Bangalore.

Arun Murthy and Suresh Srinivasan from Hortonworks made presentations on next gen Hadoop and HDFS Namenode High Availability respectively.

From Inmobi, there were presentations on real-time analytics done on HBase and on Ivory, an open source feed processing platform, by Srikanth.

Dream On!

Creating index in Hive

noreply@blogger.com (Unknown) — Thu, 08 Mar 2012 18:40:00 +0000

Simple:

CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;

To make indexing algorithms pluggable, one has to mention the associated class name that handles indexing, e.g. org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
The index handler classes implement HiveIndexHandler.
Full syntax:

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]

WITH DEFERRED REBUILD - the newly created index is initially empty. REBUILD can be used to bring the index up to date.

IDXPROPERTIES/TBLPROPERTIES - declaring keyspace properties

PARTITIONED BY - table columns by which the index gets partitioned; if not specified, the index spans all table partitions

ROW FORMAT - a custom SerDe or the native SerDe (Serializer/Deserializer for Hive read/write). A native SerDe is used if ROW FORMAT is not specified

STORED AS - index table storage format like RCFILE or SEQUENCEFILE. STORED BY - can be HBase (I haven't tried it)

IN TABLE - the user has to specify a unique index table name (tbl_idx above) for a qualified index name across tables; otherwise index tables are named automatically.

The index can be stored in a Hive table, or as an RCFILE in an HDFS path etc. In the latter case, the usesIndexTable() method of the index handler class returns false. When the index is created, generateIndexBuildTaskList(...) in the index handler class generates a plan for building the index.

Consider CompactIndexHandler from the Hive distribution:

It only stores the addresses of the HDFS blocks containing each value. The index is stored in the Hive metastore FieldSchema as _bucketname and _offsets in the index table.

i.e. the index table contains three columns: the indexed columns (the unparsed column names from the field schema), _bucketname (the HDFS file of the table partition that holds the values) and _offsets (the list of block offsets).
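
The shape of such an index can be sketched in Python. This is a toy model under stated assumptions: the file paths, offsets and function names are invented, and it says nothing about how CompactIndexHandler actually builds or reads its index table. It only shows the idea of one row per (value, file) holding the offsets of the blocks where the value occurs.

```python
from collections import defaultdict


def build_compact_index(files):
    """files maps a file name to a list of (block_offset, values_in_block)."""
    index = defaultdict(lambda: defaultdict(list))
    for bucketname, blocks in files.items():
        for offset, values in blocks:
            for value in set(values):        # one entry per value per block
                index[value][bucketname].append(offset)
    # Flatten into rows shaped like (indexed value, _bucketname, _offsets).
    return sorted(
        (value, bucketname, sorted(offsets))
        for value, buckets in index.items()
        for bucketname, offsets in buckets.items()
    )


def lookup(rows, value):
    """Return the (file, offsets) pairs that need to be scanned for a value."""
    return [(bucket, offsets) for v, bucket, offsets in rows if v == value]


# Hypothetical table files, each split into blocks at the given offsets.
files = {
    "/warehouse/tbl/part-00000": [(0, [1, 2]), (4096, [2, 3])],
    "/warehouse/tbl/part-00001": [(0, [3])],
}
rows = build_compact_index(files)
print(lookup(rows, 2))  # [('/warehouse/tbl/part-00000', [0, 4096])]
```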

See the code from CompactIndexHandler,

What's it about Cascading?

noreply@blogger.com (Unknown) — Tue, 06 Mar 2012 11:56:00 +0000

Cascading helps with manipulating data in Hadoop. It is a framework written in Java that abstracts MapReduce and lets you write scripts to read and modify data inside Hadoop. It provides a programming API for defining and executing fault-tolerant data processing workflows, and a query processing API with which developers can work without writing MapReduce. There are quite a number of DSLs built on top of Cascading, most notably Cascalog (written in Clojure) and Scalding (written in Scala). There is also Pig, a data processing API which is similar but SQLy.

Terminology

Taps - streams of source (input) and sink (output)
Tuple - can be considered as a result set. This is a single row with named columns of data being processed. A series of tuples makes a stream. All tuples in a stream have the exact same fields.
Pipes - tie operations together when executed upon a Tap. A pipe assembly is created when pipes are executed successively. Pipe assemblies are Directed Acyclic Graphs.
Flows - reusable combinations of sources, sinks and pipe assemblies.
Cascade - a series of flows

What operations are possible?

Relational - Join, Filter, Aggregate etc
Each - for each row result (tuple)
Group - Groupby
CoGroup - joins for tuples
Every - for every key in group or cogroup, like an aggregate function to all tuples in a group at once
SubAssembly - nesting reusable pipe assemblies into a Pipe

Internally, Cascading employs an intelligent planner to convert the pipe assembly into a graph of dependent MapReduce jobs that can be executed on a Hadoop cluster.

What advantages does Cascading have over a normal MapReduce workflow? (Need to investigate!)
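
To get a feel for the vocabulary, here is a toy Python model of taps, tuples and a pipe assembly. It is not the Cascading API (which is Java) and involves no planner or MapReduce. The function names only mirror the Each, Group and Every operations listed above, and the data is made up.

```python
from collections import defaultdict


def each(stream, function):
    """Each: apply a function to every tuple in the stream."""
    return [function(t) for t in stream]


def group_by(stream, key):
    """Group: collect the tuples that share the same value for `key`."""
    groups = defaultdict(list)
    for t in stream:
        groups[t[key]].append(t)
    return groups


def every(groups, key, out_field, aggregator):
    """Every: run an aggregator once per group, over all of its tuples."""
    return [{key: k, out_field: aggregator(ts)} for k, ts in sorted(groups.items())]


def run_flow(source_tap, sink_tap):
    """A flow: source tap -> pipe assembly -> sink tap."""
    tagged = each(source_tap, lambda t: {**t, "department": t["category"].split("-")[0]})
    grouped = group_by(tagged, "department")
    totals = every(grouped, "department", "units", lambda ts: sum(t["units"] for t in ts))
    sink_tap.extend(totals)
    return sink_tap


# Hypothetical tuples: rows with named fields.
source = [
    {"category": "Men-Shirts", "units": 3},
    {"category": "Men-Shoes", "units": 2},
    {"category": "Women-Shoes", "units": 5},
]
print(run_flow(source, []))
# [{'department': 'Men', 'units': 5}, {'department': 'Women', 'units': 5}]
```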

O Blimey! TED Talk 2023

noreply@blogger.com (Unknown) — Thu, 01 Mar 2012 17:16:00 +0000

Prometheus film, going viral... like a fire that danced at the end of the match.

Aha! cybernetic life-forms...

The only "purpose" (in the biological sense) of this identity is to preserve its own existence in time, that is to survive in current, specific environmental conditions, as well as to produce as many copies of itself as possible. The entire network of negative feedback mechanisms is ultimately directed at the latter task. Within the cybernetic paradigm, however, reproduction is nothing but a positive feedback.

-from Cybernetic Formulation of the Definition of Life

Tinker, Tailor, Soldier, Spy and The Perspicacious "Collusion"

noreply@blogger.com (Unknown) — Tue, 28 Feb 2012 21:23:00 +0000

Collusion!

A secret agreement between two or more parties for a fraudulent, illegal, or deceitful purpose.

In this battleground of privacy wars and illusionary consumer willpower, there comes another wizard to show you the goblins who steal your data.. Collusion from Mozilla.

Collusion is an experimental add-on for Firefox and allows you to see all the third parties that are tracking your movements across the Web. It will show, in real time, how that data creates a spider-web of interaction between companies and other trackers.

Oh yeah, thanks Mozilla for helping us find the hooligans who steal our cookies! Yeah, we can now haplessly stare at the red devils and haloing thieves.

What the heck! We don't have time for tracking everything in our life. Anyway, the stuff looks cool... collusion, interesting word.

The mythical unstructured data!

noreply@blogger.com (Unknown) — Tue, 28 Feb 2012 17:34:00 +0000

As the semantic web and big data integration gain their fus-ro-dah, enterprises are finding ways to harness any available form of information swarming the web and the world.

I came across some interesting articles which give a concise idea of harnessing metadata from unstructured data....

Lee Dallas says

In some respects it is analogous to hieroglyphics where pictographs carry abstract meaning. The data may not be easily interpretable by machines but document recognition and capture technologies improve daily. The fact that an error rate still exists in recognition does not mean that the content lacks structure. Simply that the form it takes is too complex for simple processes to understand.

more here : http://bigmenoncontent.com/2010/09/21/the-myth-of-unstructured-data/

Ram Subramanyam Gopalan says

A lot of data growth is happening around these so-called unstructured data types. Enterprises which manage to automate the collection, organization and analysis of these data types, will derive competitive advantage.
Every data element does mean something, though what it means may not always be relevant for you.

more here : http://bigdataintegration.blogspot.in/2012/02/unstructured-data-is-myth.html

Consistent Hashing

noreply@blogger.com (Unknown) — Mon, 27 Feb 2012 11:56:00 +0000

What is a consistent hash function?

A consistent hash function is one whose mapping changes minimally as the range of the function changes.

What's the advantage of such functions?

This is ideal when the set of buckets changes over time. Two users with inconsistent but overlapping sets of buckets will map items to the same bucket with high probability. So this eliminates the need for "maintaining" a consistent "state" among all nodes in a network. The algorithm can be used to make consistent assignments or relationships between different sets of data in such a way that, if we add or remove items, the assignment can be recalculated on any machine and produce the same results.

Theory

A view V is the set of buckets a user is aware of. A client uses a consistent hash function, f(V, i), to map an object i to one of the buckets in the view. Say we assign each of the hash buckets to a random point on a (virtual) mod 2^n circle, where n is the hash key size. The hash of an object is then the closest bucket in the clockwise direction, so only a small set of buckets lies near any object. In this case, all the buckets get roughly the same number of items. When the kth bucket is added, only a 1/k fraction of the items move. This means that when a new node is added only a minimal reshuffle is needed, which is the advantage of having a view. There can be a hash structure for the key lookup (a balanced tree) which stores the hashes of all nodes in the view. When a new node is added, its hash value is added to this structure.

Suppose there are two nodes, A and B, and three objects, 1–3 (mapped onto the hash function's result range). Objects 3 and 1 are mapped to node A, object 2 to node B. When a node leaves the system, its data gets mapped to the adjacent node (in the clockwise direction), and when a node enters the system it gets hashed onto the ring and takes over the objects that now fall to it.

As an example (refer link1, link2), the circle denotes a range of key values. Say the points on the circle represent 64 bit numbers. Hash the data to get a 64 bit number, which is a point on the circle. Take the IPs of the nodes and hash them into 64 bit numbers, which are also points on the circle. Associate the data with the nearest node in the clockwise direction (which can be retrieved from the hash structure of nodes). When a new node is inserted into the hash tree, data is still always assigned to the closest node only. Everything between the new node's number and the next number on the ring that was previously picked by a different node now belongs to the new node.

The basic idea of a consistent hash function is to hash both objects and buckets using the same function. It's one of the best ways to implement APIs that can dynamically scale out and rebalance. Client applications can calculate which node to contact in order to request or write data, with no metadata server required.
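The ring described above can be sketched in a few lines of Java. This is only an illustration under some assumptions: the class name HashRing, the use of the first 8 bytes of an MD5 digest as the 64 bit point, and the node names are all made up for the example, and there are no virtual nodes or replication. A TreeMap plays the role of the balanced tree that stores the node hashes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal ring: nodes and keys are hashed with the same function.
class HashRing {
    // Sorted map from a point on the ring to the node sitting at that point.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Assumption: first 8 bytes of MD5 as the 64 bit point. Signed ordering
    // of long is used; it is still one fixed circular order.
    static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    void removeNode(String node) {
        ring.remove(hash(node));
    }

    // Walk clockwise: first node at or after the key's point, wrapping around.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

With this sketch, adding a node only moves the keys that now fall to the new node; every other key keeps its old node, and removing the node again restores the previous assignment.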

Used by

memcached cluster.

Typically, multiple memcached daemons are started, on different hosts. The clients are passed a list of memcached addresses (IP address and port) and pick one daemon for a given key. This is done via consistent hashing, which always maps the same key K to the same memcached server S. When a server crashes, or a new server is added, consistent hashing makes sure that the ensuing rehashing is minimal. Which means that most keys still map to the same servers, but keys hashing to a removed server are rehashed to a new server. - from A memcached implementation in JGroups

Amazon's Dynamo uses consistent hashing along with replication as a partitioning scheme.

Data is partitioned and replicated using consistent hashing [10], and consistency is facilitated by object versioning [12]. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. - from Dynamo: Amazon's Highly Available Key-value Store

Data of a Cassandra table gets partitioned and distributed among the nodes by a consistent hashing function.

Cassandra partitions data across the cluster using consistent hashing [11] but uses an order preserving hash function to do so. In consistent hashing the output range of a hash function is treated as a circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. This node is deemed the coordinator for this key. The application specifies this key and the Cassandra uses it to route requests. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected. - from Cassandra - A Decentralized Structured Storage System

Voldemort automatic sharding of data. Nodes can be added or removed from a database cluster, and the system adapts automatically. Voldemort automatically detects and recovers failed nodes. [refer]

References:
http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf
http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html

About Bulk Synchronous Parallel(BSP) model

noreply@blogger.com (Unknown) — Sun, 26 Feb 2012 17:16:00 +0000

As an alternative to the mapreduce paradigm, there is another parallel computing model called Bulk Synchronous Parallel (BSP). A BSP computer is defined as a set of processors with local memory, interconnected by a communication mechanism (e.g., a network or shared memory) capable of point-to-point communication, and a barrier synchronization mechanism. It decouples the use of local memory from that of remote memory. Each processor has its own local memory module; all other memories are non-local and are accessed over the network. A BSP program consists of a set of BSP processes and a sequence of super-steps, which are time intervals bounded by the barrier synchronization. Communication between processors is non-blocking.

The essence of the BSP model is the super-step. At the start of a super-step, computations are done locally. Then, using the messaging system in the network, the processes send each other requests for further computation. Communication and synchronization are decoupled: there is a barrier synchronization at which the processors wait until all communications are completed. When all processes have invoked the sync method and all messages are delivered, the next super-step begins, and the messages sent during the previous super-step can then be accessed by their recipients.

Data locality is an inherent part of this model: communication happens only when peer data is necessary. This is different from mapreduce frameworks, which do not preserve data locality across consecutive operations. Mapreduce processing generally passes the input data through many mapreduce passes or iterations to derive the final results, which adds communication cost on top of the processing cost. So BSP is useful for the many programs that require iteration and recursion.

Apache Hama is one such project enabling hadoop to leverage BSP. Google Pregel uses BSP for large scale mining of graphs.
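To make the super-step idea concrete, here is a toy sketch in plain Java, with threads standing in for BSP processors and a CyclicBarrier standing in for the barrier synchronization. It has nothing to do with the Hama or Pregel APIs; the class name BspRingSum and the ring-sum problem are assumptions picked for the example. Each worker keeps a local sum, sends one message to its right neighbour per super-step, waits at the barrier, and only then reads what was sent to it.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical toy: every worker ends up with the sum of all workers' values.
class BspRingSum {
    static long[] run(long[] values) {
        final int n = values.length;
        final long[] sums = new long[n];
        if (n == 0) {
            return sums;
        }
        // Two inbox buffers, alternated per super-step, so a fast worker
        // never overwrites a message its neighbour has not read yet.
        final long[][] inbox = new long[2][n];
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] workers = new Thread[n];
        for (int p = 0; p < n; p++) {
            final int id = p;
            workers[p] = new Thread(() -> {
                long sum = values[id];    // local memory of this processor
                long token = values[id];  // value to forward next
                try {
                    for (int step = 0; step < n - 1; step++) {
                        long[] box = inbox[step % 2];
                        box[(id + 1) % n] = token; // communication phase
                        barrier.await();           // end of the super-step
                        token = box[id];           // message is now visible
                        sum += token;              // local computation
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                sums[id] = sum;
            });
            workers[p].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sums;
    }
}
```

After n - 1 super-steps every value has travelled once around the ring, so each worker holds the total. The important part is the ordering: a message written during a super-step is read only after the barrier.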

reference:

http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
http://incubator.apache.org/hama/

Boyer and Moore's Linear Time Voting Algorithm

noreply@blogger.com (Unknown) — Fri, 15 Jul 2011 17:42:00 +0000

This is a simple linear time voting algorithm designed by Robert S Boyer and J Strother Moore in 1980, which is discussed in their paper MJRTY - A Fast Majority Vote Algorithm.

This algorithm decides which element of a sequence is in the majority, provided there is such an element.

Suppose there are n characters (objects/candidates). When the ith element is visited, the set can be divided into two groups: a group of k elements in favor of the currently selected candidate and a group of elements that disagree. After processing all elements, the selected candidate is the majority element, if there is one!

When the pointer moves forward over an element e:

If the counter is 0, we set the current candidate to e and we set the counter to 1.
If the counter is not 0, we increment or decrement the counter according to whether e is the current candidate.
When we are done, the current candidate is the majority element, if there is a majority.

I have written a simple java implementation.

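Sometimes ties may occur, or there may be no majority at all. The algorithm by itself does not detect this; it still returns some candidate. For assurance, count the votes for the returned candidate: only if the count is greater than n/2 is the candidate announced as the selected one. This counting needs a second pass over the data, because the counter kept during the first pass is not the candidate's vote count. The algorithm is really effective when the data is read sequentially, for example from a tape, since it needs only constant memory. The returned candidate is guaranteed to be correct only when some element occupies more than half of the positions.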
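The two phases described above can be sketched as follows. This is a minimal illustration, not the implementation linked from the post; the class name MajorityVote is made up, and the elements are assumed to be non-null.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the MJRTY pairing phase plus a verification pass.
class MajorityVote {
    static <T> Optional<T> majority(List<T> votes) {
        T candidate = null;
        int counter = 0;
        // Phase 1: pair off disagreeing votes, keeping one candidate.
        for (T vote : votes) {
            if (counter == 0) {
                candidate = vote;
                counter = 1;
            } else if (vote.equals(candidate)) {
                counter++;
            } else {
                counter--;
            }
        }
        // Phase 2: count the candidate's real votes to rule out ties
        // and inputs without a majority.
        int occurrences = 0;
        for (T vote : votes) {
            if (vote.equals(candidate)) {
                occurrences++;
            }
        }
        return occurrences > votes.size() / 2
                ? Optional.of(candidate)
                : Optional.empty();
    }
}
```

Returning an Optional makes the "if there is a majority" condition explicit: a tie such as [A, B] yields an empty result instead of an arbitrary candidate.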

Reference
http://www.cs.utexas.edu/users/moore/best-ideas/mjrty/example.html

Descending Iterator and Adapter pattern

noreply@blogger.com (Unknown) — Thu, 14 Jul 2011 15:38:00 +0000

There is a descending iterator in the linked list implementation in the Java SDK: a humble private class in LinkedList, and a good example of the adapter pattern.

The public method that hands it out is:

public Iterator<E> descendingIterator() {
    return new DescendingIterator();
}
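The adapter idea can be shown outside the JDK with a small standalone class. This is a sketch in the same spirit, not the JDK source: DescendingAdapter is a made-up name, and it wraps any List through its ListIterator instead of reaching into LinkedList internals. The Iterator interface is the target, and the ListIterator, walked backwards, is the adaptee.

```java
import java.util.Iterator;
import java.util.List;
import java.util.ListIterator;

// Hypothetical adapter: exposes backwards traversal through the plain
// Iterator interface by delegating to a ListIterator.
class DescendingAdapter<E> implements Iterator<E> {
    private final ListIterator<E> itr;

    DescendingAdapter(List<E> list) {
        // Start positioned after the last element.
        this.itr = list.listIterator(list.size());
    }

    @Override
    public boolean hasNext() {
        return itr.hasPrevious();
    }

    @Override
    public E next() {
        return itr.previous();
    }

    @Override
    public void remove() {
        itr.remove();
    }
}
```

Callers only see an Iterator, so code written against forward iteration works unchanged while the elements arrive in reverse order.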

Using Avro to serialize logs in log4j

noreply@blogger.com (Unknown) — Mon, 30 May 2011 12:42:00 +0000

I have written about serialization mechanism of Protocol Buffers previously. Similarly, Apache Avro provides a better serialization framework.

It provides features like:

- Independent Schema - use different schemas for serialization and de-serialization
- Binary serialization - compact data encoding, and faster data processing
- Dynamic typing - serialization and deserialization without code generation

Data can be encoded in two ways when serializing with Avro: binary or JSON. In a binary file, the schema is included at the beginning of the file. In JSON, the type is defined along with the data. Switching from the JSON encoding to the binary format for better performance is pretty straightforward with Avro. Because the schema travels with the data, less type information needs to be sent with each value and any program can de-serialize the encoded data, which makes Avro a good candidate for RPC.

In Avro 1.5 we have to use the encoder factory (this differs from previous versions, which had no factory for encoders):
- org.apache.avro.io.EncoderFactory.binaryEncoder(OutputStream out, BinaryEncoder reuse) for binary
- org.apache.avro.io.EncoderFactory.jsonEncoder(Schema schema, OutputStream out) for JSON

Values (of the Avro supported value types) are put, with the schema field name as the key, into a set of name-value pairs called a GenericData.Record.

Avro supported value types are
Primitive Types - null, boolean, int, long, float, double, bytes, string
Complex Types - Records, Enums, Arrays, Maps, Unions, Fixed

you can read more about them here

A schema definition has to be provided for the record instance. To read/write data, just use the put/get methods.

I have used this serialization mechanism to provide a layout for log4j. The logs will be serialized using Avro.

github project is here - https://github.com/harisgx/avro-log4j

Add the libraries to your project and add new properties to log4j.properties

log4j.appender.logger_name.layout=com.avrolog.log4j.layout.AvroLogLayout
log4j.appender.logger_name.layout.Type=json
log4j.appender.logger_name.layout.MDCKeys=mdcKey

Provide the MDC keys as comma separated values

This is the schema

Bloom Filters

noreply@blogger.com (Unknown) — Sun, 22 May 2011 19:28:00 +0000

A Bloom filter is a probabilistic data structure. It can be used to store a set of data in a space-efficient manner. For example, in Squid's Cache Digests, the nodes of a distributed cache share such summaries with each other to get a global image.

The data structure can answer membership queries, e.g. checkIfDataPresentInStore(). If an element has already been inserted in the filter, the check returns true: there are no false negatives. But there is a chance that an element that was never inserted also returns true (a false positive). In that case the element can be checked in the original store, so the overhead is tied to the rate of false positives. This is different from a dictionary, in which a hit/miss is deterministic.

For a set of n elements, a bloom filter can be a bit vector of size m. Initially, all bits are set to 0. For each element e, k hash functions set k bits in the bit vector to 1; each hash function returns the index of a bit to set. When a membership query is executed, it checks whether the bits at those positions are set. If all of them match, the queried element is possibly present in the store; otherwise, it is surely not present. All n elements share the same m bits, so the filter takes m bits of space in total, however many elements are added. Using different hash functions results in fewer collisions.
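A bit vector with k hash functions fits in a short Java class. The sketch below is illustrative only: SimpleBloomFilter is a made-up name, and instead of k truly independent hash functions it derives the k indexes from two hashes (String.hashCode and a 32 bit FNV-1a), which is a common shortcut but an assumption of this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical minimal Bloom filter over strings: m bits, k probes.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;
    private final int k;

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Second hash for the derivation of the k indexes: 32 bit FNV-1a.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position for an element, always in [0, m).
    private int index(String element, int i) {
        return Math.floorMod(element.hashCode() + i * fnv1a(element), m);
    }

    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    // false: surely absent. true: possibly present (may be a false positive).
    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would treat a false answer as final and fall back to the original store only when mightContain returns true.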

Uses

Design a spell checker.
Database join implementation (Oracle)
Peer to peer (P2P) communication and routing
In HBase, the Bloom filter is stored as meta block in the HFile. When a HFile is opened, the bloom filter is loaded into memory and used to determine if a given key is in that store file. This can avoid the scanning region for the key.
and more

I found a java implementation here
Cassandra's java implementation here

Reference

http://en.wikipedia.org/wiki/Bloom_filter
https://issues.apache.org/jira/browse/HBASE-1200
http://wiki.squid-cache.org/SquidFaq/CacheDigests
http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

Labs

noreply@blogger.com (Unknown) — Mon, 09 May 2011 18:23:00 +0000

avro-log4j - serialization mechanism to provide a layout for log4j

firetester - A simple RESTful services testing tool written in Groovy Griffon framework

gitter - Publishes github activities to Twitter

jfilemagic (jfm) is an utility for identifying files using magic numbers or signatures

cometd-chat - a comet based chatter for fun

Interesting uses of sun.misc.Unsafe

noreply@blogger.com (Unknown) — Wed, 13 Apr 2011 17:54:00 +0000

Inspired by a question I found on stackoverflow, I started looking up the uses. I found some pretty interesting ones...

VM "intrinsification", i.e. CAS (Compare-And-Swap) as used in lock-free hash tables, e.g. sun.misc.Unsafe.compareAndSwapInt. It can make real JNI calls into native code that contains special instructions for CAS.

What is intrinsification?

These are optimizations in which the compiler generates code directly for the called method, or the JVM applies native optimizations. We know that there are VM downcalls from the JDK, like the wait method. It's about low level programming. For example, the Atomic classes for numbers: they are plain numbers represented by objects, but modified atomically, with the operations managed natively.

read more about CAS here http://en.wikipedia.org/wiki/Compare-and-swap
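The usual way to see CAS from application code is through the Atomic classes mentioned above rather than through sun.misc.Unsafe directly. The sketch below is a hypothetical example (CasCounter and incrementCapped are made-up names) of the classic retry loop: read the current value, compute the new one, and publish it only if nobody changed the value in between.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lock-free counter that never goes above a given maximum.
class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    // Returns the counter value after the call.
    int incrementCapped(int max) {
        while (true) {
            int current = value.get();
            if (current >= max) {
                return current; // already at the cap, nothing to do
            }
            int next = current + 1;
            // Succeeds only if the value is still 'current'; otherwise
            // another thread won the race and we retry with a fresh read.
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    int get() {
        return value.get();
    }
}
```

No lock is taken at any point; a thread that loses the race simply loops and tries again with the newer value.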

The sun.misc.Unsafe functionality of the host VM can be used to allocate uninitialized objects and then interpret the constructor invocation as any other method call.

One can track the data from the native address. It is possible to retrieve an object’s memory address using the sun.misc.Unsafe class, and operate on its fields directly via unsafe get/put methods!

Compile time optimizations for JVM. High performance VM using "magic", requiring low-level operations. eg: http://en.wikipedia.org/wiki/Jikes_RVM

Allocating memory, sun.misc.Unsafe.allocateMemory eg:- DirectByteBuffer constructor internally calls it when ByteBuffer.allocateDirect is invoked
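From application code, the supported way to get such off-heap memory is the public ByteBuffer API. The snippet below is a small sketch with a made-up class name, OffHeapDemo; how the buffer is allocated underneath is a JDK implementation detail and may differ between versions.

```java
import java.nio.ByteBuffer;

// Hypothetical demo: write a long into native (off-heap) memory and read it back.
class OffHeapDemo {
    static long roundTrip(long value) {
        // Direct buffers live outside the Java heap.
        ByteBuffer buffer = ByteBuffer.allocateDirect(Long.BYTES);
        buffer.putLong(0, value);  // absolute write at byte offset 0
        return buffer.getLong(0);  // absolute read from the same offset
    }
}
```

The buffer's isDirect() method reports whether it is backed by native memory rather than a Java array.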

Tracing the call stack and replaying with values instantiated by sun.misc.Unsafe, useful for instrumentation

sun.misc.Unsafe.arrayBaseOffset and arrayIndexScale can be used to develop arraylets, a technique for efficiently breaking up large arrays into smaller objects to limit the real-time cost of scan, update or move operations on large objects

References

Design and implementation of a comprehensive real time java virtual machine

Demystifying Magic: High-level Low-level Programming

Implementing Fast JVM Interpreters Using Java Itself

TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones

Efficient Checkpointing of Java Software Using Context-Sensitive Capture and Replay

How To Write Directly to a Memory Locations In Java