Spark Session

The entry point to programming Spark with the Dataset and DataFrame API. To create a Spark session, use the SparkSession.builder attribute. See also SparkSession.

SparkSession.builder.appName(name)

Sets a name for the application, which will be shown in the Spark web UI.

SparkSession.builder.config([key, value, conf])

Sets a config option.

SparkSession.builder.enableHiveSupport()

Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

SparkSession.builder.getOrCreate()

Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.

SparkSession.builder.master(master)

Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.

SparkSession.catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.

SparkSession.conf

Runtime configuration interface for Spark.

SparkSession.createDataFrame(data[, schema, …])

Creates a DataFrame from an RDD, a list or a pandas.DataFrame.

SparkSession.getActiveSession()

Returns the active SparkSession for the current thread, as returned by the builder.

SparkSession.newSession()

Returns a new SparkSession that has a separate SQLConf, registered temporary views, and UDFs, but shares the SparkContext and table cache.

SparkSession.range(start[, end, step, …])

Creates a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.

SparkSession.read

Returns a DataFrameReader that can be used to read data in as a DataFrame.

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.

SparkSession.sparkContext

Returns the underlying SparkContext.

SparkSession.sql(sqlQuery, **kwargs)

Returns a DataFrame representing the result of the given query.

SparkSession.stop()

Stops the underlying SparkContext.

SparkSession.streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.

SparkSession.table(tableName)

Returns the specified table as a DataFrame.

SparkSession.udf

Returns a UDFRegistration for UDF registration.

SparkSession.version

The version of Spark on which this application is running.