For the past few months I have been busy tinkering with a micro-cluster on weekends to get a more in-depth look at the Hadoop 2 ecosystem. It is six Odroid U3s stitched together by an el-cheapo DLink DGS-1008A gigabit switch. This is by far better than spinning up Amazon EC2 instances or the two-node Hadoop 1 cluster I tried a few years back.
One thing that keeps bugging me, however, is the intermittent connectivity I'm having with the cluster. Oddly, I get a far more stable connection doing an ssh from the laptop into the cluster than from one of the nodes in the cluster trying to ping or ssh another node.
This makes the master namenode unable to start properly and causes intermittent timeouts whenever it tries to cycle through the hosts listed in the slaves config at startup. Toying around in ethtool with the half/full duplex settings doesn't help either. Out of despair I even tried 10 Mbps half duplex, but it still didn't work. The Odroids use an smsc95xx USB 2.0 module that is purported to work in 100 Mbps full-duplex mode.
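To put a number on the flakiness instead of eyeballing ping output, something like the following could be run from each node against every other node. It is only a rough sketch: the `probe` function is my own helper, and it assumes sshd is listening on port 22 on the target, so a failed TCP connect counts as a dropped attempt.

```python
import socket
import time


def probe(host, port=22, attempts=20, timeout=1.0, pause=0.0):
    """Open a TCP connection `attempts` times and return (ok, failed) counts."""
    ok = failed = 0
    for _ in range(attempts):
        try:
            # create_connection raises OSError (or a subclass) on refusal,
            # timeout, or a host that cannot be resolved or reached.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            failed += 1
        time.sleep(pause)  # optional gap between attempts
    return ok, failed
```

Calling `probe("odroid2", pause=0.5)` from another node (the hostname is made up) and comparing the counts with the same call from the laptop would at least show whether the drops follow a particular node, cable, or switch port.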
I suspect it's either the cables or the network switch. Also looking at getting a powered USB charging hub to make the setup a little tidier. Alright, back to some other stuff for now.
The rise of Big Data is now prompting engineering teams around the world to grapple with the last mile that comes with huge data sets: how do you turn them into actionable insights? A picture is worth a thousand numbers, so naturally data visualization is the preferred medium.
If you start separating the tiers required for managing tons of data for visualization, the need for a fast, searchable data store with analytical capabilities emerges. This is where Elasticsearch comes into the picture. It's a Swiss Army knife of a stack: use it as a regular Lucene-backed data store, or call in the power of aggregations (the replacement for facets) and you have a decent analytical engine.
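For a taste of the aggregations side, here is a minimal sketch that only builds the JSON body for a terms aggregation and prints it. The helper name and the `status` field are made up for illustration, and actually sending the body to an index's `_search` endpoint is left out.

```python
import json


def terms_agg_body(field, size=10):
    """Build a search body that buckets documents by `field` and returns no hits."""
    return {
        "size": 0,  # skip the hits; only the aggregation buckets are wanted
        "aggs": {
            "by_" + field: {
                "terms": {"field": field, "size": size}
            }
        },
    }


# "status" is a hypothetical field name; use one that exists in your mapping.
body = terms_agg_body("status", size=5)
print(json.dumps(body, indent=2))
```

The buckets that come back are essentially counts per distinct value, which is the kind of output a charting layer can consume directly.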
The question is where Hadoop/Teradata fits into the overall scheme of Big Data. I'd say if your data set is only a few GBs, a clustered Elasticsearch is all you need. If your data runs into terabytes, then it's probably worth having it in a Hadoop cluster and running your jobs there, either feeding the results back into Elasticsearch or using Elasticsearch as one of the inputs to your favorite Hadoop script.
On a side note, I can't believe this prompted me to write a three-paragraph blog post!
Still got a ton of things to do. Liking the steady pace on stuff I had been itching to do but that didn't materialize last year. The family settling into the new place is helping a lot in making time for some things.
Finished my first book for 2014. Still a lot on my reading list, not to mention my impossibly long Pocket list and the dozen or so items on my Safari reading list.
On the personal side of things, I managed to run consistently at a 7 km/h pace for 30 minutes in recent attempts.