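- Impala slides presented at HUG (January 30, 2013 version)
- Inside Cloudera Impala: Runtime Code Generation: a blog post on Impala's runtime code generation. Well worth reading if you are interested in the technical details
- MDX on Cloudera Impala: using Impala from Excel pivot tables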
> Is there any difference in the way Impala reads HBase as compared to
Yes, reading from HBase goes through the HBase client API; reading from
HDFS is done via libhdfs.
> Does the partitioned join that is going to be supported need the joining
> partitions to be co-located on a node?
> Is it possible to implement an equivalent of reduce-side/ repartition join
> in Impala?
That will be supported in GA at the latest.
> Would it not be possible to sort a large table with Impala? I see that every
> ORDER BY clause needs to have a LIMIT specified.
Not at this point.
> How does Impala keep track of data locality? Does it collect the information
> from HDFS namenode and HBase metadata server?
It talks to the HDFS NameNode.
> Does a task always get scheduled on a data local node? If yes, how does
> Impala prevent hotspots?
Impala tries to run scans locally, if at all possible. It makes no
attempt to avoid hot spots at the moment.
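To make the consequence of that answer concrete, here is a minimal sketch of a locality-only placement policy. It is not Impala's scheduler; the function `assign_scans`, the block names, and the host names are all hypothetical, and the "first local replica" rule is an assumption chosen only to show how ignoring load can pile work onto one node.

```python
from collections import Counter


def assign_scans(block_replicas, impalad_hosts):
    """Assign each block to a host that holds a replica (illustrative only).

    block_replicas: dict mapping block id -> list of hosts storing a replica.
    impalad_hosts:  list of hosts that can run scans.
    """
    assignment = {}
    for block, replicas in block_replicas.items():
        local = [host for host in replicas if host in impalad_hosts]
        # Locality-only rule: take the first local replica and ignore how
        # much work that host already has. Fall back to a remote read on
        # the first host when no replica is local.
        assignment[block] = local[0] if local else impalad_hosts[0]
    return assignment


# Hypothetical cluster: node1 happens to hold a replica of most blocks.
blocks = {
    "b1": ["node1", "node2"],
    "b2": ["node1", "node3"],
    "b3": ["node1", "node2"],
    "b4": ["node4"],  # no scan-capable host has this block
}
hosts = ["node1", "node2", "node3"]

plan = assign_scans(blocks, hosts)
load = Counter(plan.values())
print(plan)
print(load)
```

In this toy input every scan lands on `node1`, which is the kind of hot spot a scheduler that only considers locality does nothing to prevent.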
> Since Impala doesn’t materialize results on disk, is there a limit on the
> size of output?
The final output of the query is streamed back to the user, and only
needs to be materialized in memory for blocking operators (GROUP BY,
ORDER BY … LIMIT). In other words, if the query doesn’t contain such
blocking operators, there is no limit on the size of the result set;
otherwise it is limited by the available main memory on the node to
which the query was submitted.
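The difference between streaming and blocking operators can be sketched with Python generators. This is an illustration of the general idea, not Impala's execution engine; the operator names (`scan`, `filter_rows`, `group_count`, `order_by_limit`) and the sample rows are hypothetical.

```python
import heapq
from collections import defaultdict


def scan(rows):
    """Streaming source: yields one row at a time."""
    for row in rows:
        yield row


def filter_rows(rows, predicate):
    """Streaming operator: holds a single row at any moment."""
    for row in rows:
        if predicate(row):
            yield row


def group_count(rows, key):
    """Blocking operator: must consume all input before emitting anything.

    Memory grows with the number of distinct groups.
    """
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    yield from counts.items()


def order_by_limit(rows, key, limit):
    """Blocking operator: sees all input, but keeps at most `limit` rows."""
    return heapq.nsmallest(limit, rows, key=lambda r: r[key])


data = [
    {"id": 3, "city": "Tokyo"},
    {"id": 1, "city": "Osaka"},
    {"id": 2, "city": "Tokyo"},
    {"id": 4, "city": "Kyoto"},
]

# Purely streaming pipeline: rows flow through without being buffered.
streamed = list(filter_rows(scan(data), lambda r: r["city"] == "Tokyo"))

# Blocking pipelines: results exist only after the whole input is read.
grouped = dict(group_count(scan(data), "city"))
top2 = order_by_limit(scan(data), "id", 2)
```

The streaming pipeline never needs more than one row in memory, so its output size is unbounded; the two blocking operators have to hold state (all groups, or the current top rows) until the input is exhausted.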
> What is the scheduling policy to schedule multiple queries in a Impala
Each impalad process can accept client requests, and requests
should be submitted to them in round-robin fashion.
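A client-side round-robin picker could look like the following minimal sketch. The class `RoundRobinPicker` and the host names are hypothetical; actually connecting to an impalad and submitting a query is left out, since that depends on the client library in use.

```python
import itertools


class RoundRobinPicker:
    """Hands out impalad hosts in a fixed rotating order (illustrative only)."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self._cycle = itertools.cycle(list(hosts))

    def next_host(self):
        """Return the host that should receive the next query."""
        return next(self._cycle)


# Hypothetical host names; substitute the impalad nodes of a real cluster.
picker = RoundRobinPicker(["impalad-1", "impalad-2", "impalad-3"])
targets = [picker.next_host() for _ in range(5)]
print(targets)
```

Each query is sent to whichever host `next_host()` returns, so successive requests are spread evenly across the impalad processes.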