Query Optimization

Join our re-occurring workshop online as well as offline (at Gurugram)

Query Optimization
Query Optimization Workflow

Hive Query Optimizations.

There are various ways to optimize Hive Query. These ways are Bucketing, Hive Data Formats, Partitioning on date keys, Map Side Aggregation, and Map Join. Bucketing improves the join performance if the bucket key and join keys are common. Data format plays an important role in Hive Performance. For example, the most of the tables are using text format so queries for such tables could potentially get a performance boost if change the input format something to ORC. Hive Partitioning is an effective method to improve the query performance on larger tables. Hive does the first-level aggregation directly in the map task when this is set to true and hence be able to handle the map side aggregation. Map joins are really efficient if a table on the other side of a join is small enough to fit in the memory.  

Big Data engines or cluster optimizations for faster query execution.

Big Data engines could be of type Hadoop or Hive or Spark or Airflow or Others. There are different flags to tune these engines by enabling/disabling these flags. But underlying compute infrastructure is also crucial role with right set of hardware. These factors are RAM, Core, and Volume Types (I/O rates). The latest compute infrastructure is always a plus points, such as AWS Graviton or GCP N2D. 

Workflow Optimizations.

There is always advisable to run queries in multiple commands rather run queries in one command. As running in one command locks runtime resources and either halt other processing or make them slow. 

Dynamic Query Routing based on workloads on nodes.

Dynamic Query Routing works on query placements mechanism and development of query routing services. There are challenges while running complex queries on cluster with large number of nodes (500 Nodes +). These challenges are unable to choose consistently the most suitable cluster and users are frustrated by queries that become slow due to cluster resource challenges. Some time duplicate queries are submitted.