Hive (IGFS + IgniteMR) vs Hive (Tez)

theena

Hi,
I am doing a POC on an HDP 2.5 cluster with Ignite as the Hadoop Accelerator.

We have a 3-node cluster; each node has 8 cores and 60 GB of RAM.

I ran a Hive-on-Tez query on a sample data set, and it finished in 32
seconds.
The same query took 94 seconds on Hive + IGFS + Ignite-MR.
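
For scale, those two timings work out to a slowdown of roughly 2.9x. A minimal sketch of that arithmetic, assuming a POSIX shell with awk (the helper name `slowdown` is hypothetical, not part of Hive or Ignite):

```shell
# slowdown SLOW_SECONDS FAST_SECONDS
# Prints how many times slower the first run was than the second.
slowdown() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

slowdown 94 32   # Ignite-MR time vs. Tez time, prints 2.94
```

Using the same helper for each later run keeps the comparison consistent while parameters are changed one at a time.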

I followed most of the instructions from this forum and the Ignite website. I
just want to check whether I am missing any important parameters that could
improve performance.

Below are more details:

core-site properties:

<property>
  <name>fs.igfs.impl</name>
  <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.igfs.impl</name>
  <value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>igfs://igfs@</value>
  <final>true</final>
</property>
   
IG_MR_JOB_TRACKER=ip-10-0-0-200:11211

1. I can run IGFS and use HDFS as the secondary FS

hadoop fs -ls  igfs://igfs@//tmp/orders/


2. I can run Ignite-MR


hadoop --config /usr/etc/ignite_conf jar \
  /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-2.*.jar \
  wordcount -Dignite.job.shared.classloader=false \
  -Dmapreduce.jobtracker.address=$IG_MR_JOB_TRACKER \
  -Dmapreduce.framework.name=ignite igfs://igfs@//tmp/orders/ \
  igfs://igfs@//tmp/6

3. I can run a Hive query on IGFS + Ignite-MR and see the console log.

beeline -n hdfs -u "$HIVE_JDBC" --hiveconf fs.defaultFS=igfs://igfs@/ \
  --hiveconf mapreduce.framework.name=ignite \
  --hiveconf mapreduce.jobtracker.address=$IG_MR_JOB_TRACKER

set ignite.job.shared.classloader=false;
set hive.rpc.query.plan = true;
set hive.execution.engine = mr;
-- Added this to avoid a mapRed task failure while running the query.
set hive.auto.convert.join = false;

4. I could not run Hive+Tez+IGFS

beeline -n hdfs -u "$HIVE_JDBC"
set fs.defaultFS=igfs://igfs@/  ;
set hive.execution.engine = tez;
set tez.use.cluster.hadoop-libs = true;  
set ignite.job.shared.classloader=false ;
set hive.rpc.query.plan = true;

INFO  : Tez session hasn't been created yet. Opening session
ERROR : Failed to execute tez graph.
org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1530330378726_0007 failed 2 times due to AM Container for appattempt_1530330378726_0007_000002 exited with  exitCode: -1000
For more detailed output, check the application tracking page:
http://ip-10.ec2.internal:8088/cluster/app/application_1530330378726_0007
Then click on links to logs of each attempt.
Diagnostics: java.io.IOException: Failed to parse endpoint: null
Failing this attempt. Failing the application.
        at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:217)
        at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:279)
        at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:159)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1745)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1491)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1289)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1156)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1151)
        at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:197)
        at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:76)
        at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:253)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1865)
        at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:264)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask (state=08S01,code=1)


I have attached the log files and the default config:
default-config.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t1884/default-config.xml>  
master.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1884/master.log>  
worker.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1884/worker.log>  
compute.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t1884/compute.log>  

SQL:


select
  l_shipmode,
  sum(case
        when o_orderpriority = '1-URGENT'
          or o_orderpriority = '2-HIGH'
        then 1
        else 0
      end) as high_line_count,
  sum(case
        when o_orderpriority <> '1-URGENT'
          and o_orderpriority <> '2-HIGH'
        then 1
        else 0
      end) as low_line_count
from
  orders o
  join lineitem l
    on o.o_orderkey = l.l_orderkey
   and l.l_commitdate < l.l_receiptdate
   and l.l_shipdate < l.l_commitdate
   and l.l_receiptdate >= '1994-01-01'
   and l.l_receiptdate < '1995-01-01'
where
  l.l_shipmode = 'MAIL' or l.l_shipmode = 'SHIP'
group by l_shipmode
order by l_shipmode;




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ezhuravlev

Re: Hive (IGFS + IgniteMR) vs Hive (Tez)

Hi,

1. Have you tried running it without -Dignite.job.shared.classloader=false?
That setting definitely has a performance impact.

2. Are the Ignite nodes running on the same machines as Hadoop? If not, that
will add a lot of network traffic.

3. How much data do you have in HDFS? If it does not fit in IGFS (as far as I
can see, you have something like 90 GB in IGFS), a lot of data will move
between memory and HDFS: whenever there is not enough memory, IGFS evicts old
data each time it reads new data.
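
Point 3 can be checked with a quick size comparison. The sketch below is only an illustration: the helper names `fits_in_igfs` and `gb_to_bytes` are hypothetical, the 90 GB capacity is the rough figure mentioned above, and `<namenode>` is a placeholder for your own HDFS address.

```shell
# fits_in_igfs DATA_BYTES CAPACITY_BYTES
# Prints "fits" when the data set is no larger than the IGFS capacity,
# otherwise "evicts" (data would keep moving between memory and HDFS).
fits_in_igfs() {
  data_bytes=$1
  capacity_bytes=$2
  if [ "$data_bytes" -le "$capacity_bytes" ]; then
    echo "fits"
  else
    echo "evicts"
  fi
}

# gb_to_bytes N: convert whole gigabytes (1024^3 bytes each) to bytes.
gb_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Usage on a live cluster (commented out; the first column printed by
# "hadoop fs -du -s" is the size in bytes):
# data_bytes=$(hadoop fs -du -s "hdfs://<namenode>/tmp/orders" | awk '{print $1}')
# fits_in_igfs "$data_bytes" "$(gb_to_bytes 90)"
```

If the result is "evicts", either reduce the data set for the test or give IGFS more memory before comparing timings.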

Evgenii


