Hadoop and other open-source Big Data projects provide a huge range of IT software for data management and system integration
In the first part of my blog series I claimed that Hadoop offers a superior range of tools for doing amazing things. The summary post emphasized the marriage of analytical capabilities with real-time system integration and messaging. The purpose of this part is to describe what those tools really are and what each one is for.
In terms of traditional data warehousing and analytics tools, the process of data loading and transformation in Hadoop is closer to ELT (database-pushdown centric) than ETL (ETL-server centric transformations). For data extraction from relational databases we have Sqoop. For sourcing continuously generated log files we commonly use Flume. If the source systems provide transfer files, we drop them into the Hadoop file system (HDFS) and use them directly or indirectly as the “data files” of the Hive “database”. The Hive component provides a database query (SQL + ODBC/JDBC) interface to Hadoop. In practice this means that the schema descriptions are defined on top of your data files (for example a CSV file) and you can use SQL to query and insert data. The transformation part of data management is done with the Pig Latin language.
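To make the schema-on-read idea concrete, here is a minimal sketch of defining a table on top of a plain CSV file and querying it with ordinary SQL. I use PySpark's SQL interface as one convenient way to demonstrate it; the file path and column names are invented for the example.

```python
# A minimal schema-on-read sketch with PySpark SQL.
# The HDFS path and column names are made-up examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Define a schema on top of a plain CSV file already sitting in HDFS.
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/raw/orders.csv"))

# Register the file as a table and query it with ordinary SQL.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
""").show()
```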
Once you are done with the things you have done for years with ETL and data warehouses, Hadoop lets you move into the area of real data science. The Pig Latin language is actually a simplified way to write powerful MapReduce programs. MapReduce is Hadoop’s programming model for deep analysis of huge amounts of data. The drawback of MapReduce is the need to write code, often a bit too much even for simple things. If you do not bother to learn the MapReduce syntax, check out Spark. Spark lets you do the same things and even more with one of the languages you probably already know: SQL, Java, Scala, Python or R. Actually, you can even mix them. As the list of languages may indicate, Spark includes statistical libraries for even deeper data science analysis, and graph functionalities for understanding how social media connections, products in market baskets and telecom subscribers form networks. Understanding those networks is important for identifying the real influencers, so you can focus your actions on them. Spark also loads the data being processed into memory and can do the processing up to 100 times faster than MapReduce or Hive.
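As a small taste of that mixing, here is a hedged sketch where the same in-memory dataset is handled first with Python DataFrame calls and then with plain SQL; the data and column names are invented for the example.

```python
# A sketch of Spark's mixed-language style: Python DataFrame calls
# and SQL over the same data. Dataset and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixing-apis").getOrCreate()

clicks = spark.createDataFrame(
    [("alice", 3), ("bob", 7), ("alice", 2)],
    ["user", "click_count"])

# Python side: filter with the DataFrame API and cache the result
# in memory, which is where Spark's speed advantage comes from.
active = clicks.filter(clicks.click_count > 2).cache()

# SQL side: aggregate the very same data with a SQL query.
active.createOrReplaceTempView("active_clicks")
spark.sql("""
    SELECT user, SUM(click_count) AS total_clicks
    FROM active_clicks
    GROUP BY user
""").show()
```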
Spark has even more advantages. It can run in a streaming mode, meaning that it processes, merges and otherwise enhances and transforms your data as it flows in, and pushes it out to a defined storage or to the next real-time process. In traditional terms, it is a component for complex event processing. So we have arrived in the area of traditional Enterprise Application Integration (EAI) and messaging. Another, somewhat more traditional Hadoop component for complex event processing is Storm. In system integration, before complex event processing, you need a way to manage message flows, i.e. you need a message queue system. For that purpose the Hadoop umbrella has Kafka. Kafka is fed by different streaming sources: the Flume component mentioned earlier, more traditional EAI tools, or social media sources like Twitter directly.
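Below is a rough sketch of that streaming mode: consuming messages from a Kafka topic, transforming them on the fly and pushing them onwards. It uses Spark's Structured Streaming API; the broker address and topic name are placeholders, and running it requires Spark's Kafka integration package on the classpath.

```python
# A hedged sketch of Spark in streaming mode: consume a Kafka topic,
# transform records as they flow in, and push the results onwards.
# Broker address and topic name are placeholders; the job needs the
# spark-sql-kafka integration package to be available.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Enhance/transform each message as it arrives (here trivially:
# uppercase the payload).
enriched = events.select(
    upper(col("value").cast("string")).alias("payload"))

# Push the result out continuously; the console sink stands in for
# a real storage or downstream process.
query = (enriched.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```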
How do we then integrate the results of batch analysis or stream processing into other systems, such as online web pages? The Hive and Spark components described earlier offer ODBC/JDBC interfaces for running SQL queries against Hadoop data. However, the truth is that these are not capable of sub-second response times. Their response times are good enough for analytical visualization clients like Tableau, Microsoft Power BI or SAP Lumira, but not for online web shops or many mobile applications. For fast queries serving a massive audience, Hadoop offers the HBase NoSQL database component. Queries against HBase can be issued either through built-in APIs or over SQL/JDBC using the Phoenix extension. Development continues actively; as an example, there is an ongoing project to enable full Spark API access from outside of Hadoop. More about these in coming blogs.
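For illustration, here is a sketch of the kind of single-row key lookup HBase is built for, written in Python with the third-party happybase client, which talks to the HBase Thrift server; the host, table and row key are hypothetical.

```python
# A sketch of a fast key lookup against HBase via the third-party
# happybase client (requires the HBase Thrift server to be running).
# Host, table name and row key are hypothetical.
import happybase

connection = happybase.Connection("hbase-host")  # Thrift server address
table = connection.table("user_profiles")

# Fetch a single row by its key: the millisecond-class access pattern
# that web shops and mobile applications need.
row = table.row(b"user:12345")
print(row)  # e.g. {b'info:name': b'...', b'info:email': b'...'}
```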
Although the Hadoop stack does not have a traditional Enterprise Service Bus (ESB), the critical parts of system integration and advanced analytics have been married together, and only your imagination limits what kind of services you can provide. I would recommend spending a few minutes imagining how these tools run behind LinkedIn, Facebook, Uber, the biggest web shops and so on.
And all this you can have on your laptop, for free and forever, even for production use, if you just have eight gigabytes of memory to run it. All this may sound overwhelming, but you do not need to use and master every component, just as you would never think of buying every item in IKEA’s store.
If you want to hear more about HDP Hadoop, Modern Data Architecture and the Azure Marketplace, you may like these blog posts:
Tuomas Autio: In love with Hadoop deployment automation
Mikko Mattila: Hadoop – IKEA of the IT ecosystem: Part 1
Mikko Mattila: Hadoop – IKEA of the IT ecosystem: Part 2
Mikko Mattila: Hadoop – IKEA of the IT ecosystem: Part 4
