Hadoop was designed to solve problems that traditional commercial tools could not
In the first, summary part of this blog series I introduced the background of the big data tools. We discussed how the different components were originally born to solve problems more challenging than traditional enterprise IT architects and operators even wanted to think about. That is why I claim that the Hadoop tools are extremely well designed and have excellent features for almost every company.
Before going into the details, let’s go back to the history of big data. Roughly twenty years ago internet startups discovered low-cost open source software for building their services. At a high level, the main difference from commercial software was the economics, and that pretty much made the internet boom possible. As I mentioned earlier, you suddenly no longer needed significant funding to set up an IT or internet company.
However, these open source tools pretty much copied the functionality of the commercial tools. For example, the Linux operating system running on low-cost commodity Intel hardware took over the server room. We Finns were proud of MySQL as the de facto open source database, alongside PostgreSQL. Similarly, the Apache web server became the standard for serving static HTML and PHP pages. If you needed a full-scale Java application server, JBoss was available for free, and so on. From an architectural and functional point of view, though, there was no major difference between the open source and the commercial tools.
The new era of the internet, and the new infrastructure components running it, did not dramatically change the enterprise software sector or its economics either. The majority of enterprise software vendors started to embed the same open source components into their products, for example by offering MySQL as a free embedded repository database or Apache Tomcat as an embedded application server, instead of building their own from scratch. As 95% of the added value still came from their proprietary code and the professional services tightly bound to it, the business model of enterprise software has stayed pretty much the same to this day.
But change is happening now. In the shadows, these new internet companies grew faster than anything we had seen before. They quickly ran into overwhelming technological challenges that the traditional software tools, commercial or open source, had never needed to tackle. The newcomers had to figure out how to provide an almost 100% automated service 24/7/365. They needed to be able to add hardware, upgrade platform software versions and develop their applications further without any downtime, ever. This means you need 100% fault tolerance, no single points of failure, software capable of rolling upgrades, excellent monitoring, and automated management that scales from a few servers to thousands with minimal labor. The amount of data managed can be 100 or 10,000 times bigger than in a traditional large enterprise. A 24/7 solution has no “night time” for batch jobs, and the software needs to scale linearly as you add hardware. Some of the startups found solutions to these problems, and as a result we have Facebook, Yahoo, LinkedIn, Twitter and so on.
Let’s take a few examples of how certain Hadoop tools differ from the traditional ones. Traditional databases can be clustered, but a cluster of a few dozen servers is already very large. Data is replicated among the servers at the software layer. At the same time, a traditional database locks rows and even whole tables during updates and deletes. As a result, the maximum cluster size is very limited.
To resolve this, a completely new kind of file system, HDFS, was developed, largely at Yahoo and modelled on a design published by Google. It changes the way files are stored on disk: the file system is “distributed”, so the “operating system” of Hadoop itself makes the same data files available on multiple servers. The method is completely license free. Traditionally the same result has required expensive hardware, and even then it does not scale to clusters of thousands of servers the way HDFS does.
Around the same period Google invented a new way of doing efficient parallel processing on a large number of servers: the MapReduce programming model mentioned in the previous blog post. MapReduce can also do more complex processing than SQL, and the processed data can even be unstructured (like text or photos). Together, HDFS and MapReduce provided a way to do advanced querying and processing of extremely massive amounts of data.
While Yahoo and Google were tackling batch processing, LinkedIn was struggling to integrate all of its systems, both in real time and in batches. They ended up moving to a completely real-time architecture and developing their own messaging software for it: Kafka. The main difference is its superior throughput compared to traditional messaging servers. Kafka can also retain messages for an exceptionally long time; the default retention is seven days.
These companies shared their ideas and, in many cases, their code with each other. Others joined in, and the Hadoop project and the ecosystem around it grew up under the Apache Software Foundation.
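To make the MapReduce idea described above a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration, and everything runs in a single process. It only shows the shape of the model: a map step that emits key/value pairs, a grouping step, and a reduce step that combines the values for each key.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or file block) could be mapped on a
    # different server; here everything runs sequentially in one process.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, invented for this sketch.
documents = ["big data tools", "big clusters need big tools"]
counts = word_count(documents)
print(counts)
```

Because each call to `map_phase` only looks at its own document and each call to `reduce_phase` only looks at one key, both steps can in principle be spread over many servers, which is where the scalability comes from.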
Of course the Hadoop tools are not perfect in every respect. As with IKEA, they are not polished luxury. There is no nice graphical UI for everything, and you need to work with configuration files. In general they are optimized for performance and high availability, and as a trade-off traditional software for the same purpose may have more features. But the main question is: would you really use those features, and are they worth the money?
But as with IKEA, not all of your furniture comes from them. Similarly, the Hadoop tools may replace only parts of your existing platforms. More likely you will extend your existing landscape with these tools. There is still room for traditional messaging servers and especially for relational databases.
Tuomas Autio: In love with Hadoop deployment automation
Mikko Mattila: Hadoop – IKEA of the IT ecosystem: Part 1
Mikko Mattila: Hadoop – IKEA of the IT ecosystem: Part 2
Mikko Mattila: Hadoop – IKEA of the IT ecosystem: Part 3