Wednesday, 13 September 2017

A Brief Introduction to the Architecture of Cloudera Impala

In this era of technological advancement, time and velocity largely determine how effective a product is. With Big Data, we deal not only with huge volumes of data but also with the velocity at which it arrives. Much Big Data processing is done on Hadoop, whose ecosystem supports both batch and near-real-time processing.

Apache Hive is a tool that provides a SQL-like interface for querying data stored in HDFS. Though Hive fulfills the requirements of managing large datasets, it comes at the cost of high latency. A query running on Hive triggers a MapReduce job, which takes a lot of time (sometimes from minutes to hours). To overcome this drawback of Hive, Cloudera built a distributed query engine, Cloudera Impala, which runs on top of the Hive warehouse and produces quick results.

Impala is a massively parallel processing (MPP) database engine that is widely used to execute analytic queries on huge data sets. Impala is compatible with most HiveQL syntax, so one can use either Impala or Hive to read, write and manage the same data.

Let us dive into Impala Architecture:

At a high level, Impala consists of the following:
  • The Impala Daemon
  • The Impala Statestore
  • The Impala Catalog Service
The Impala Daemon: The core Impala component is a daemon process named impalad, which runs on each DataNode of the cluster.

Impala daemon is responsible for the following:
  1. Read and write to data files.
  2. Accept the queries from the Impala-shell, JDBC or ODBC.
  3. Parallelize the queries and distribute work across the cluster.
  4. Transmit the intermediate query result back to the central coordinator node.
Note - The central coordinator node for a query is always the daemon instance to which the query was submitted. When the other nodes complete their part of the computation, they send their results back to the central coordinator node.
  • Impala daemons are in constant communication with the statestore to learn which nodes are healthy and able to accept work.
  • Impala daemons also receive updates from the catalogd daemon about metadata changes made through Impala, such as CREATE, ALTER or DROP statements, as well as LOAD DATA and INSERT operations. This minimizes the need for frequent REFRESH or INVALIDATE METADATA statements. This facility was not available prior to Impala 1.2.
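
The division of work described above (a coordinator that fans a query out and merges what comes back) can be illustrated with a small Python sketch. This is only a toy model, not Impala code: the node names, the sample rows, and the functions `run_fragment` and `coordinate` are all hypothetical, and threads stand in for daemons on separate hosts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "daemon" only sees the rows stored on its own node.
NODE_DATA = {
    "node-1": [3, 8, 15],
    "node-2": [4, 16],
    "node-3": [23, 42, 1],
}


def run_fragment(node):
    """Work done by one daemon: scan the local rows and pre-aggregate them."""
    rows = NODE_DATA[node]
    return {"node": node, "count": len(rows), "total": sum(rows)}


def coordinate(nodes):
    """Work done by the coordinator: fan out fragments, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(run_fragment, nodes))
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }


# Roughly what SELECT COUNT(*), SUM(x) would need: partial results per node, merged once.
result = coordinate(list(NODE_DATA))
print(result)
```

The point of the sketch is that only small partial aggregates travel back to the coordinator, not the raw rows.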


The Impala Statestore: The statestore is represented by a daemon process named statestored. It checks the health of all the Impala daemons in the cluster and relays its findings to each of them. If an Impala daemon goes offline due to a hardware failure or another issue, the statestore informs all the other Impala daemons, so that they avoid making requests to the unavailable node.
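
To make the health-check idea concrete, here is a minimal Python sketch of a heartbeat-based membership tracker. It is an illustration under stated assumptions, not the real statestored protocol: the `Statestore` and `Daemon` classes, the timeout value, and the explicit `now` timestamps are all invented for the example.

```python
class Statestore:
    """Toy membership tracker (hypothetical, not Impala's actual protocol)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}    # daemon name -> time of its last heartbeat
        self.subscribers = []  # callbacks that receive the healthy list

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def heartbeat(self, daemon, now):
        self.last_seen[daemon] = now

    def check(self, now):
        """Work out which daemons are still alive and tell every subscriber."""
        healthy = sorted(
            name for name, seen in self.last_seen.items()
            if now - seen <= self.timeout_s
        )
        for callback in self.subscribers:
            callback(healthy)
        return healthy


class Daemon:
    """Toy daemon that remembers which peers it may send work to."""

    def __init__(self, name):
        self.name = name
        self.peers = []

    def on_membership(self, healthy):
        self.peers = [name for name in healthy if name != self.name]


store = Statestore(timeout_s=10)
daemons = {name: Daemon(name) for name in ("impalad-1", "impalad-2", "impalad-3")}
for daemon in daemons.values():
    store.subscribe(daemon.on_membership)
    store.heartbeat(daemon.name, now=0)

# impalad-3 goes silent; the other two keep sending heartbeats.
store.heartbeat("impalad-1", now=12)
store.heartbeat("impalad-2", now=12)
healthy = store.check(now=15)
print(healthy, daemons["impalad-1"].peers)
```

After the check, the surviving daemons no longer list impalad-3 as a peer, which mirrors how Impala daemons avoid sending requests to an unavailable node.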


The Impala Catalog Service: It is represented by a daemon process named catalogd. It relays the metadata changes made by Impala SQL statements to all the Impala daemons in the cluster. Only one such process is needed, on a single host in the cluster.
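
The effect of the catalog service can be sketched in Python as a single source of metadata that pushes each change to per-daemon caches. The classes `CatalogService` and `DaemonCache`, the version counter, and the table names are assumptions made for illustration; the real catalogd is considerably more involved.

```python
class DaemonCache:
    """Toy per-daemon metadata cache (hypothetical)."""

    def __init__(self):
        self.version = -1
        self.tables = {}

    def apply(self, version, tables):
        # Ignore anything older than what this daemon already holds.
        if version > self.version:
            self.version = version
            self.tables = tables


class CatalogService:
    """Toy single-host catalog that pushes every metadata change to all daemons."""

    def __init__(self):
        self.version = 0
        self.tables = {}   # table name -> list of column names
        self.daemons = []

    def register(self, daemon):
        self.daemons.append(daemon)
        daemon.apply(self.version, dict(self.tables))

    def apply_ddl(self, table, columns):
        """Record a change (columns=None stands in for DROP TABLE) and broadcast it."""
        if columns is None:
            self.tables.pop(table, None)
        else:
            self.tables[table] = list(columns)
        self.version += 1
        for daemon in self.daemons:
            daemon.apply(self.version, dict(self.tables))


catalog = CatalogService()
cache_a, cache_b = DaemonCache(), DaemonCache()
catalog.register(cache_a)
catalog.register(cache_b)

catalog.apply_ddl("sales", ["id", "amount"])            # like CREATE TABLE
catalog.apply_ddl("sales", ["id", "amount", "region"])  # like ALTER TABLE
catalog.apply_ddl("tmp", ["x"])
catalog.apply_ddl("tmp", None)                          # like DROP TABLE

print(cache_a.version, cache_a.tables)
```

Because every daemon receives the same versioned snapshot, a query submitted to any of them sees the new table definition without a manual refresh, which is the behaviour the section describes.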

Facts about Catalog and Statestore daemons:
  • They have no special requirement for high availability, because no data is lost when they go offline.
  • If the Catalog and Statestore daemons become unavailable:
    1. Stop the Impala service.
    2. Delete the Impala Statestore and Impala Catalog Server roles.
    3. Add the roles on a different host.
    4. Restart the Impala service.


That is all about the Impala architecture. In the next post, I will share a quick-start guide to Impala covering the basic functions and commands that are used most frequently.

Stay tuned for future posts. :)





5 comments:

  1. If running a MapReduce job is a drawback of Hive, then let me know how Impala performs its tasks. When we execute a query in Hive, it runs a MapReduce job. We can use the Tez execution engine to speed up the process, but how does Impala do this?

    1. Hive on Tez is a good option if you don't want to use Impala. Hive on Tez is basically used when there is a large number of queries to run in parallel.

    2. Impala speeds up the process in multiple ways.
      1. Its 'cold start' time is negligible compared to Hive/Tez.
      2. It sends intermediate results directly to the executors, unlike Hive, which writes them to disk.
      3. Impala processes are multithreaded.
      4. Impala starts the final aggregation as soon as the pre-aggregation fragments start returning results.

