Friday, May 11, 2012

OpenWeb 05/12/2012 (a.m.)

  • Excellent Hadoop/Hive explanation. Hat tip to Matt Asay for the link. I left a comment on Matt's blog questioning the consequences of the Oracle vs. Google Android lawsuit, and the possible enforcement of the Java API copyright claim against Hadoop/Hive. Based on this explanation of Hadoop/Hive, I'm wondering whether Oracle is making a move to claim the entire era of Big Data Cloud Computing. To understand why, it's first necessary to read Matt the Hadoople's explanation.

    Kill shot excerpt:

    "You've built your Hadoop job, and have successfully processed the data. You've generated some structured output, and that resides on HDFS. Naturally you want to run some reports, so you load your data into a MySQL or an Oracle database. Problem is, the data is large. In fact it's so large that when you try to run a query against the table you've just created, your database begins to cry. If you listen to its sobs, you'll probably hear 'I was built to process Megabytes, maybe Gigabytes of data. Not Terabytes. Not Petabytes. That's not my job. I was built in the 80's and 90's, back when floppy drives were used. Just leave me alone.'"

    "This is where Hive comes to the rescue. Hive lets you run an SQL statement against structured data stored on HDFS. When you issue an SQL query, it parses it, and translates it into a Java Map/Reduce job, which is then executed on your data. Although Hive does some optimizations, in general it just goes record by record against all your data. This means that it's relatively slow - a typical Hive query takes 5 or 10 minutes to complete, depending on how much data you have. However, that's what makes it effective. Unlike a relational database, you don't waste time on query optimization, adding indexes, etc. Instead, what keeps the processing time down is the fact that the query is run on all machines in your Hadoop cluster, and the scalability is taken care of for you."

    "Hive is extremely useful in data-warehousing kind of scenarios. You would never use Hive as a database for a web application, because the response time is always in minutes, not seconds. However, for generating huge custom reports, running some really expensive query on a year's worth of data, or doing any kind of processing on massive amounts of data, Hive really shines. This is why companies like Oracle and IBM (IBM owns Netezza, a competitor to Oracle) are scared of Hadoop and Hive. Hive makes it possible for companies to easily process massive amounts of data, and processing massive amounts of data is typically how database makers differentiate themselves. And yes, just like the rest of the Hadoop ecosystem, Hive is free and open-source." ...

    Tags: Hadoop_Hive, Java-API, Oracle, Google, Map_Reduce, Cloud, Cloud-Productivity-Platform
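
To make the "SQL becomes a Map/Reduce job" point in the excerpt a bit more concrete, here is a rough single-machine sketch in Python of how a query like SELECT page, COUNT(*) FROM visits GROUP BY page could break down into a map step, a shuffle, and a reduce step. The table name visits, the column page, and all function names are made up for illustration. This is not Hive's actual query planner and not the Java code Hive generates, just the general shape of the idea.

```python
from collections import defaultdict

def map_phase(records):
    """Emit a (key, 1) pair for every record: a full scan, no index."""
    for record in records:
        yield record["page"], 1

def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}

def run_query(records):
    """Hypothetical stand-in for: SELECT page, COUNT(*) FROM visits GROUP BY page."""
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical sample rows standing in for structured data on HDFS.
visits = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]

print(run_query(visits))  # {'/home': 2, '/docs': 1}
```

Note that the map step touches every record, which matches the excerpt's "record by record" description. On a real cluster the map and reduce steps would be spread across many machines, and that is where the scalability comes from.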


Posted from Diigo. The rest of Open Web group favorite links are here.
