Welcome to the Tech blog!


(Continued) Cloudera Impala News (15)

Impala News, 2013/2/26 edition


A discussion of Impala's architecture came up on the impala-user mailing list, so I am quickly pasting it here.

> Is there any difference in the way Impala reads HBase as compared to
> Hive/Hadoop?
Yes, reading from hbase involves the hbase client API; reading from
hdfs is done via libhdfs.
> Does the partitioned join that is going to be supported need the joining
> partitions to be co-located on a node?
> Is it possible to implement an equivalent of a reduce-side/repartition join
> in Impala?
That will be supported in GA at the latest.
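
For readers unfamiliar with the term, the following is a minimal Python sketch of what a repartition (reduce-side) join does in general: both inputs are hash-partitioned on the join key so that matching rows land in the same partition, and each partition is then joined locally. This is an illustration of the general technique only, not Impala's implementation; the function name `repartition_join` and the dict-based rows are assumptions made for this example.

```python
from collections import defaultdict


def repartition_join(left, right, key, num_nodes):
    """Hash-partition both inputs on the join key, then join each partition locally.

    Hypothetical helper for illustration; rows are plain dicts.
    """
    # Shuffle phase: route every row to the partition chosen by hashing its key.
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for row in left:
        left_parts[hash(row[key]) % num_nodes].append(row)
    for row in right:
        right_parts[hash(row[key]) % num_nodes].append(row)

    # Local phase: rows with equal keys now sit in the same partition,
    # so each "node" can build and probe its own hash table independently.
    joined = []
    for node in range(num_nodes):
        table = defaultdict(list)
        for row in left_parts[node]:
            table[row[key]].append(row)
        for row in right_parts[node]:
            for match in table.get(row[key], []):
                joined.append({**match, **row})
    return joined
```

Because rows are moved to wherever their key hashes, the inputs do not have to be co-located beforehand; the cost is the shuffle itself.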
> Would it not be possible to sort a large table with Impala? I see that every
> ORDER BY clause needs to have a LIMIT specified.
Not at this point.
> How does Impala keep track of data locality? Does it collect the information
> from HDFS namenode and HBase metadata server?
It talks to the hdfs namenode.
> Does a task always get scheduled on a data local node? If yes, how does
> Impala prevent hotspots?
Impala tries to run scans locally, if at all possible. It makes no
attempt to avoid hot spots at the moment.
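
To picture why a locality-only policy can produce hot spots, here is a small Python sketch. It assumes a hypothetical `assign_scans` function that places each block on a host holding a replica and always picks the first such host; that tie-breaking rule, the block names, and the host names are assumptions for illustration, not Impala's actual scheduler.

```python
from collections import Counter


def assign_scans(block_locations, impalad_hosts):
    """Assign each block to a host that holds a replica, else to any host.

    Hypothetical sketch: locality is the only criterion, load is ignored.
    """
    assignments = {}
    for block, replica_hosts in block_locations.items():
        local = [h for h in replica_hosts if h in impalad_hosts]
        # Always take the first local candidate: no attempt to balance load.
        # With no local candidate, fall back to a remote read from the first host.
        assignments[block] = local[0] if local else impalad_hosts[0]
    return assignments


# Made-up block-to-replica mapping, as the namenode might report it.
blocks = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node1", "node3"],
    "blk_3": ["node1", "node2"],
    "blk_4": ["node9"],  # no impalad on this host, so the scan is remote
}
plan = assign_scans(blocks, ["node1", "node2", "node3"])
load = Counter(plan.values())  # scans per host
```

In this made-up layout every scan lands on `node1`, even though `node2` and `node3` also hold replicas, which is the kind of skew a hot-spot-aware scheduler would try to spread out.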
> Since Impala doesn’t materialize results on disk, is there a limit on the
> size of output?
The final output of the query is streamed back to the user, and only
needs to be materialized in memory for blocking operators (GROUP BY,
ORDER BY … LIMIT). In other words, if the query doesn’t contain such
blocking operators, there is no limit on the size of the result set,
otherwise it’s limited by the available main memory on the node to
which the query was submitted.
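
The difference between streaming and blocking operators can be sketched with Python generators. This is a toy model under my own assumptions, not Impala's execution engine: the function names are hypothetical, and `heapq.nsmallest` stands in for ORDER BY … LIMIT.

```python
import heapq


def filter_rows(rows, predicate):
    """Streaming operator: emits each matching row as soon as it is read."""
    for row in rows:
        if predicate(row):
            yield row


def order_by_limit(rows, key, limit):
    """Blocking operator: must read all input before returning anything.

    heapq.nsmallest only retains about `limit` rows while it consumes
    the input, so memory use is bounded by the LIMIT, not the input size.
    """
    return heapq.nsmallest(limit, rows, key=key)


def group_by_count(rows, key):
    """Blocking operator: holds one counter per distinct group in memory."""
    counts = {}
    for row in rows:
        group = key(row)
        counts[group] = counts.get(group, 0) + 1
    return counts
```

A pipeline made only of operators like `filter_rows` can return an arbitrarily large result because nothing accumulates. Once `order_by_limit` or `group_by_count` appears, the state they hold has to fit in memory. One plausible reading is that this is also why ORDER BY currently needs a LIMIT.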
> What is the scheduling policy to schedule multiple queries in an Impala
> cluster?
Each impalad process can accept client requests, and requests
should be submitted to them in a round-robin fashion.
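
Since the round-robin distribution is left to the client, a client-side helper might look like the following Python sketch. The class name `RoundRobinSubmitter` and the `send` callable are hypothetical; `send(host, query)` stands for whatever client call actually delivers the query to an impalad.

```python
from itertools import cycle


class RoundRobinSubmitter:
    """Rotate through impalad hosts so each one receives queries in turn."""

    def __init__(self, impalad_hosts):
        if not impalad_hosts:
            raise ValueError("at least one impalad host is required")
        # cycle() repeats the host list endlessly in its original order.
        self._hosts = cycle(list(impalad_hosts))

    def next_host(self):
        return next(self._hosts)

    def submit(self, query, send):
        # `send` is a caller-supplied function (an assumption of this sketch).
        return send(self.next_host(), query)
```

For example, with hosts `["impalad1", "impalad2", "impalad3"]`, four consecutive `submit` calls would go to `impalad1`, `impalad2`, `impalad3`, and then `impalad1` again.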