By Andrew C. Oliver, Columnist, InfoWorld |

Learn to live with Apache Hive in 12 easy steps

Hive lets you use SQL on Hadoop, but tuning SQL on a distributed system is different. Here are 12 tips to help your effort fly

Hive is not an RDBMS, but it pretends to be one most of the time. It has tables, it runs SQL, and it supports both JDBC and ODBC.

The good and bad news of this revelation: Hive doesn’t run queries the way an RDBMS does. It's a long story, but I spent an 80-plus-hour workweek personally tuning Hive. Needless to say, my head won't quit buzzing. So for your benefit, here are a few suggestions to make your next Hive project go a little bit faster than mine did.

1. Don’t use MapReduce

Whether you believe in Tez, Spark or Impala, don’t believe in MapReduce. It is slow on its own, and it's really slow under Hive. If you’re on Hortonwork’s distribution, you can throw set hive.execution.engine=tez at the top of a script. On Cloudera, use Impala. Hopefully, soon you can set hive.execution.engine=spark when Impala isn’t appropriate.

2. Don’t do string matching in SQL

Ever! Especially in Hive. If you stick a like string match where a clause should be, you'll generate a cross-product warning. If you have a query that runs in seconds, with string matching it will take minutes. Your best alternative is to use one of many tools that allow you to add search to Hadoop. Look at Elasticsearch’s Hive integration or Lucidwork’s integration for Solr. Also, there's Cloudera Search. RDBMSes were never good at this, but Hive is worse.

3. Don't do a join on a subquery

You're better off creating a temporary table, then joining against the temp table instead of asking Hive to be smart about how it handles subqueries. Meaning don't do this:

select a.* from something a inner join (select ... from somethingelse union b select ... from anotherthing c) d on a.key1 = d.key1 and a.key2 = b.key2 where a.condition=1

Instead, do this:

create var_temp as select ... from somethingelse b union select ... from anotherthing c and then select a.* from something a inner join from var_temp b where a.key1=b.key1 and a.key2=b.key2 where a.condition=1

It really shouldn’t be tons faster at this point in Hive’s evolution, but it is, generally.

4. Use Parquet or ORC, but don’t convert to them for sport

That is, use Parquet or ORC as opposed to, say, TEXTFILE. However, if you have text data coming in and are massaging it into something slightly more structured, do the conversion to the target tables. You can’t LOAD DATA from a text file into an ORC, so do the initial load into a text.

When you create other tables against which you’ll ultimately run most of your analysis, do your ORCing there because converting to ORC or Parquet takes time and isn’t worth it as step one in your ETL process. If you have simple flat files coming in and aren’t doing any tweaking, then you’re stuck loading into a temporary table and doing a select create into an ORC or Parquet. I don’t envy you because it is kind of slow.

5. Try turning vectorization on and off

Add set hive.vectorized.execution.enabled = true set hive.vectorized.execution.reduce.enabled = true to the top of your scripts. Try it with them on and off because vectorization seems problematic in recent versions of Hive.

6. Don’t use structs in a join

I have to admit my native-brain SQL syntax is about SQL-92 era, so I don’t tend to use structs anyhow. But if you’re doing something super-repetitive like ON clauses for compound PKs, structs are handy. Unfortunately, Hive chokes on them -- particularly in the ON clause. Of course, it doesn’t do so at smaller data sets and yields no errors much of the time. In Tez, you get a fun vector error. This limitation isn’t documented anywhere that I know of. Consider this a fun way to get to know the innards of your execution engine!

7. Check your container size

You may need to increase your container size for Impala or Tez. Also, the “recommended” sizes may not apply to your system if you have larger node sizes. You might want to make sure your YARN queue and general YARN memory are appropriate. You might also want to peg it to something that isn’t the default queue all the peasants use.

8. Enable statistics

Hive does somewhat boneheaded things with joins unless statistics are enabled. You may also want to use query hints in Impala.

9. Consider MapJoin optimizations

If you do an explain on your query, you may find that recent versions of Hive are smart enough to apply the optimization automatically. But you may need to tweak them.

10. If you can, put the largest table last

Period.

11. Partitions are your friends ... sorta

If you have this one item in many places where clauses like a date (but ideally not a range) or a location repeat, you might have your partition key! Partitions basically mean “split this into its own directory,” which means instead of looking at a big file, Hive looks at the one file because you have it in your join/where clause saying you’re only looking at location=’NC’, which is a small subset of your data. Also, unlike with column values, you can push partitions in your LOAD DATA statements. However, remember that HDFS does not love small files.

12. Use hashes for column comparisons

If you’re comparing the same 10 fields in every query, consider using hash() and comparing the sums. These are sometimes so useful you might shove them in an output table. Note that the hash in Hive 0.12 is a low resolution, but better hashes are available in 0.13.

There’s my dirty dozen. I hope this means my endless late-night sufferings will not have been in vain -- and help keep others from getting stung by Hive’s eccentricities.

Next read this:

Andrew C. Oliver is a columnist and software developer with a long history in open source, databases, and cloud computing. He founded Apache POI and served on the board of the Open Source Initiative. Oliver has helped with marketing in startups including JBoss, Lucidworks, and Couchbase. He advises startups on marketing, growth, and outreach.