Find Set Difference in Hive

February 15, 2014. Category: blog. Tags: hive

Hive doesn’t support NOT IN patterns in sub query. So we can’t run this following query:

select actor_id
from table1
where NOT actor_id IN 
  ( select actor_id
    from table2

To achieve the set difference, we have to get a little more creative:

select t1.actor_id
( select actor_id
  from table1
) t1
left outer join 
( select actor_id
  from table2
) t2
on (t1.actor_id = t2.actor_id)
where t2.actor_id is null;

Rows where successful joins are possible will have a non-null entry and this way they will be filtered out. The left outer join will make sure all entries from table1 are included and rows from table2 are filtered out.

Comments Section

If you have questions or would like to comment on your experience please do so.

blog comments powered by Disqus