Error pyspark does not support any application options

It seems there are some problems with handling wrong options, as shown below:

*spark-submit script — this one looks fine

spark-submit --aabbcc
Error: Unrecognized option: --aabbcc

Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
...

*spark-sql script — this one looks fine

spark-sql --aabbcc
Unrecognized option: --aabbcc
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
...

*sparkr script — this one might be more serious, because users could mistakenly pass misspelled options and the error message does not indicate that the options are wrong.

sparkr --aabbcc

...

Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  :
  JVM is not ready after 10 seconds
>

*pyspark script — we could make the error message consistent with the others

pyspark --aabbcc
Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
	at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
	at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:290)
	at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:147)
	at org.apache.spark.launcher.Main.main(Main.java:86)

*spark-shell — the error message is not as friendly as the ones from spark-submit or spark-sql.

spark-shell --aabbcc
bad option: '--aabbcc'

Spark Cluster on EC2

How do you set up and run an Apache Spark cluster on EC2? This tutorial walks you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master and one worker node. It includes each step I took, regardless of whether it failed or succeeded. While your experience may not match mine exactly, I hope these steps are helpful as you attempt to run an Apache Spark cluster on Amazon EC2. There are screencasts throughout the steps.

Assumptions

This post assumes you have already signed up for and verified an AWS account. If not, sign up at https://aws.amazon.com/. It also assumes you are familiar with running a Spark standalone cluster and deploying applications to a Spark cluster.

Approach

I’m going to go through it step by step and also show some screenshots and screencasts along the way. For example, there is a screencast that covers steps 1 through 5 below.

Spark Cluster on Amazon EC2 Step by Step

Note: There’s a screencast of steps one through four at the end of step five below.

1) Generate Key/Pair in EC2 section of AWS Console

Click “Key Pairs” in the left nav and then the Create Key Pair button.

Spark Key Pair

Download the resulting key/pair PEM file.

2) Create a new AWS user named courseuser and download the credentials file, which includes the User Name, Access Key Id, and Secret Access Key. We need the Access Key Id and Secret Access Key.

3) Set your environment variables according to the key and id from the previous step.  For me, that meant running the following from the command line:

export AWS_SECRET_ACCESS_KEY=F9mKN6obfusicatedpBrEVvel3PEaRiC

export AWS_ACCESS_KEY_ID=AKIAobfusicatedPOQ7XDXYTA

4) Open a terminal window and go to the root directory of your Spark distribution. Then, copy the PEM file from the first step of this tutorial to the root of the Spark home directory.
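In my case, that copy amounted to the following (assuming the PEM file landed in ~/Downloads; your download location may differ), run from the Spark home directory:

cp ~/Downloads/courseexample.pem .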

5) From Spark home dir, run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example

I received errors about the PEM file permissions, so I changed them as recommended in the error message and re-ran the script.
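The fix is the usual one for key files: restrict the PEM file so only you can read it (the exact permissions the script recommends may differ slightly):

chmod 400 courseexample.pem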

Next, you will likely receive permission errors from Amazon, so update the permissions of courseuser in AWS and try again.

You may receive an error about zone availability such as:

Your requested instance type (m1.large) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1e, us-east-1d, us-east-1a.

If so, just update the script zone argument and re-run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --zone=us-east-1d launch spark-cluster-example

The cluster creation takes approximately 10 minutes and produces all kinds of output, including deprecation warnings and possibly errors starting GANGLIA. The GANGLIA errors are fine if you are just experimenting; otherwise, try a different Spark version or tweak the PHP settings on your cluster.

Here’s a screencast example of me creating an Apache Spark Cluster on EC2

6) After the cluster creation succeeds, you can verify it by going to the master web UI at http://<your-ec2-hostname>.amazonaws.com:8080/

7) You can also verify from the Spark console in Scala or Python.

Scala example:

bin/spark-shell --master spark://ec2-54-145-64-173.compute-1.amazonaws.com:7077

Python example:

IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077

At first, both of these will likely run into issues that eventually lead to an “ERROR OneForOneStrategy: java.lang.NullPointerException”:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/17 07:30:28 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/01/17 07:30:28 ERROR OneForOneStrategy: 
java.lang.NullPointerException

8) This is an Amazon permission issue related to port 7077 not being open.  You need to open up port 7077 via an Inbound Rule.  Here’s a screencast on how to create an Inbound Rule in EC2:
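If you prefer the command line to the console, roughly the same rule can be added with the AWS CLI (the security group name below is a guess; spark-ec2 normally creates groups named after the cluster, such as spark-cluster-example-master, and you may want to restrict the CIDR to your own IP):

aws ec2 authorize-security-group-ingress --group-name spark-cluster-example-master --protocol tcp --port 7077 --cidr 0.0.0.0/0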

After creating this inbound rule, everything will work from both the IPython notebook and the Spark shell.
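A quick sanity check from the pyspark shell is to run a tiny job against the cluster (any small computation will do; the numbers here are arbitrary):

sc.parallelize(range(1000)).count()

If that returns 1000, the driver can reach the master and the workers are accepting tasks.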

Conclusion

Hope this helps you configure a Spark cluster on EC2. Let me know in the page comments if I can help. Once you are finished with your EC2 instances, make sure to destroy them using the following command:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem destroy spark-cluster-example

Resources

For a list of additional resources and tutorials, see the Spark tutorials page.

Spark EC2 Tutorial Featured Image Credit: https://flic.kr/p/g19ivQ

We are running into issues when we launch PySpark (with or without Yarn).

It seems to be looking for the hive-site.xml file, which we have already copied to the Spark configuration path, but I am not sure whether there are any specific parameters that it needs to contain.
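For reference, the copy we did amounts to something like the following (a sketch: /etc/hive/conf is the usual CDH client configuration path and is an assumption on my part; /opt/spark is the SPARK_HOME shown in the log below):

cp /etc/hive/conf/hive-site.xml /opt/spark/conf/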

[apps@devdm003.dev1 ~]$ pyspark --master yarn --verbose
WARNING: User-defined SPARK_HOME (/opt/spark) overrides detected (/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark).
WARNING: Running pyspark from user-defined location.
Python 2.7.8 (default, Oct 22 2016, 09:02:55)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using properties file: /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/conf/spark-defaults.conf
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.yarn.jars=hdfs://devdm001.dev1.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property: spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
Adding default property: spark.yarn.historyServer.address=http://devdm004.dev1.turn.com:18088
Adding default property: spark.dynamicAllocation.schedulerBacklogTimeout=1
Adding default property: spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
Adding default property: spark.yarn.config.gatewayPath=/opt/cloudera/parcels
Adding default property: spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
Adding default property: spark.shuffle.service.port=7337
Adding default property: spark.master=yarn
Adding default property: spark.authenticate=false
Adding default property: spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
Adding default property: spark.eventLog.dir=hdfs://devdm001.dev1.turn.com:8020/user/spark/applicationHistory
Adding default property: spark.dynamicAllocation.enabled=true
Adding default property: spark.dynamicAllocation.minExecutors=0
Adding default property: spark.dynamicAllocation.executorIdleTimeout=60
Parsed arguments:
master yarn
deployMode null
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath null
driverExtraLibraryPath /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
driverExtraJavaOptions null
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass null
primaryResource pyspark-shell
name PySparkShell
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true

Spark properties used, including those specified through
--conf and those from the properties file /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/conf/spark-defaults.conf:
spark.executor.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.yarn.jars -> hdfs://devdm001.dev1.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
spark.driver.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.authenticate -> false
spark.yarn.historyServer.address -> http://devdm004.dev1.turn.com:18088
spark.yarn.am.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.eventLog.enabled -> true
spark.dynamicAllocation.schedulerBacklogTimeout -> 1
spark.yarn.config.gatewayPath -> /opt/cloudera/parcels
spark.serializer -> org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.executorIdleTimeout -> 60
spark.dynamicAllocation.minExecutors -> 0
spark.shuffle.service.enabled -> true
spark.yarn.config.replacementPath -> {{HADOOP_COMMON_HOME}}/../../..
spark.shuffle.service.port -> 7337
spark.eventLog.dir -> hdfs://devdm001.dev1.turn.com:8020/user/spark/applicationHistory
spark.master -> yarn
spark.dynamicAllocation.enabled -> true

Main class:
org.apache.spark.api.python.PythonGatewayServer
Arguments:

System properties:
spark.executor.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.driver.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.yarn.jars -> hdfs://devdm001.dev1.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
spark.authenticate -> false
spark.yarn.historyServer.address -> http://devdm004.dev1.turn.com:18088
spark.yarn.am.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.eventLog.enabled -> true
spark.dynamicAllocation.schedulerBacklogTimeout -> 1
SPARK_SUBMIT -> true
spark.yarn.config.gatewayPath -> /opt/cloudera/parcels
spark.serializer -> org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled -> true
spark.dynamicAllocation.minExecutors -> 0
spark.dynamicAllocation.executorIdleTimeout -> 60
spark.app.name -> PySparkShell
spark.yarn.config.replacementPath -> {{HADOOP_COMMON_HOME}}/../../..
spark.submit.deployMode -> client
spark.shuffle.service.port -> 7337
spark.eventLog.dir -> hdfs://devdm001.dev1.turn.com:8020/user/spark/applicationHistory
spark.master -> yarn
spark.yarn.isPython -> true
spark.dynamicAllocation.enabled -> true
Classpath elements:

log4j:ERROR Could not find value for key log4j.appender.WARN
log4j:ERROR Could not instantiate appender named "WARN".
log4j:ERROR Could not find value for key log4j.appender.DEBUG
log4j:ERROR Could not instantiate appender named "DEBUG".
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/jars/avro-tools-1.7.6-cdh5.5.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/server/turn/deploy/160622/turn/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 43, in <module>
    spark = SparkSession.builder
  File "/opt/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

We installed Spark 2.1 for business reasons and updated the SPARK_HOME variable in the safety valve.

(We ensured SPARK_HOME is set early in spark-env.sh so that the other path variables are set properly.)
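Roughly, the relevant spark-env.sh lines look like this (a sketch, using the SPARK_HOME path from the log above):

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH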

I also learned that there is no hive-site.xml dependency in Spark 2.1, which confuses me even more about why it is looking for it.
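One thing that might narrow it down (my assumption, not something we have verified): launching the shell with the in-memory catalog, which as far as I understand the Spark 2.1 shell avoids the HiveSessionState path entirely:

pyspark --master yarn --conf spark.sql.catalogImplementation=in-memory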

Has anyone faced a similar issue? Any suggestions? This is a Linux environment running CDH 5.5.4.
