DataFrame.agg(*exprs)
DataFrame.agg
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
DataFrame.alias(alias)
DataFrame.alias
Returns a new DataFrame with an alias set.
DataFrame.approxQuantile(col, probabilities, …)
DataFrame.approxQuantile
Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrame.cache()
DataFrame.cache
Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
DataFrame.checkpoint([eager])
DataFrame.checkpoint
Returns a checkpointed version of this DataFrame.
DataFrame.coalesce(numPartitions)
DataFrame.coalesce
Returns a new DataFrame that has at most numPartitions partitions; if a larger number is requested, the current number of partitions is kept.
DataFrame.colRegex(colName)
DataFrame.colRegex
Selects column based on the column name specified as a regex and returns it as Column.
DataFrame.collect()
DataFrame.collect
Returns all the records as a list of Row.
DataFrame.columns
Returns all column names as a list.
DataFrame.corr(col1, col2[, method])
DataFrame.corr
Calculates the correlation of two columns of a DataFrame as a double value.
DataFrame.count()
DataFrame.count
Returns the number of rows in this DataFrame.
DataFrame.cov(col1, col2)
DataFrame.cov
Calculate the sample covariance for the given columns, specified by their names, as a double value.
DataFrame.createGlobalTempView(name)
DataFrame.createGlobalTempView
Creates a global temporary view with this DataFrame.
DataFrame.createOrReplaceGlobalTempView(name)
DataFrame.createOrReplaceGlobalTempView
Creates or replaces a global temporary view using the given name.
DataFrame.createOrReplaceTempView(name)
DataFrame.createOrReplaceTempView
Creates or replaces a local temporary view with this DataFrame.
DataFrame.createTempView(name)
DataFrame.createTempView
Creates a local temporary view with this DataFrame.
DataFrame.crossJoin(other)
DataFrame.crossJoin
Returns the cartesian product with another DataFrame.
DataFrame.crosstab(col1, col2)
DataFrame.crosstab
Computes a pair-wise frequency table of the given columns.
DataFrame.cube(*cols)
DataFrame.cube
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
DataFrame.describe(*cols)
DataFrame.describe
Computes basic statistics for numeric and string columns.
DataFrame.distinct()
DataFrame.distinct
Returns a new DataFrame containing the distinct rows in this DataFrame.
DataFrame.drop(*cols)
DataFrame.drop
Returns a new DataFrame that drops the specified column.
DataFrame.dropDuplicates([subset])
DataFrame.dropDuplicates
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
DataFrame.drop_duplicates([subset])
DataFrame.drop_duplicates
drop_duplicates() is an alias for dropDuplicates().
DataFrame.dropna([how, thresh, subset])
DataFrame.dropna
Returns a new DataFrame omitting rows with null values.
DataFrame.dtypes
Returns all column names and their data types as a list.
DataFrame.exceptAll(other)
DataFrame.exceptAll
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
DataFrame.explain([extended, mode])
DataFrame.explain
Prints the (logical and physical) plans to the console for debugging purposes.
DataFrame.fillna(value[, subset])
DataFrame.fillna
Replace null values, alias for na.fill().
DataFrame.filter(condition)
DataFrame.filter
Filters rows using the given condition.
DataFrame.first()
DataFrame.first
Returns the first row as a Row.
DataFrame.foreach(f)
DataFrame.foreach
Applies the f function to every Row of this DataFrame.
DataFrame.foreachPartition(f)
DataFrame.foreachPartition
Applies the f function to each partition of this DataFrame.
DataFrame.freqItems(cols[, support])
DataFrame.freqItems
Finds frequent items for columns, possibly with false positives.
DataFrame.groupBy(*cols)
DataFrame.groupBy
Groups the DataFrame using the specified columns, so we can run aggregation on them.
DataFrame.head([n])
DataFrame.head
Returns the first n rows.
DataFrame.hint(name, *parameters)
DataFrame.hint
Specifies some hint on the current DataFrame.
DataFrame.inputFiles()
DataFrame.inputFiles
Returns a best-effort snapshot of the files that compose this DataFrame.
DataFrame.intersect(other)
DataFrame.intersect
Return a new DataFrame containing only the rows present in both this DataFrame and another DataFrame.
DataFrame.intersectAll(other)
DataFrame.intersectAll
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
DataFrame.isEmpty()
DataFrame.isEmpty
Returns True if this DataFrame is empty.
DataFrame.isLocal()
DataFrame.isLocal
Returns True if the collect() and take() methods can be run locally (without any Spark executors).
DataFrame.isStreaming
Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
DataFrame.join(other[, on, how])
DataFrame.join
Joins with another DataFrame, using the given join expression.
DataFrame.limit(num)
DataFrame.limit
Limits the result count to the number specified.
DataFrame.localCheckpoint([eager])
DataFrame.localCheckpoint
Returns a locally checkpointed version of this DataFrame.
DataFrame.mapInPandas(func, schema)
DataFrame.mapInPandas
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
DataFrame.mapInArrow(func, schema)
DataFrame.mapInArrow
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow RecordBatch, and returns the result as a DataFrame.
DataFrame.na
Returns a DataFrameNaFunctions for handling missing values.
DataFrame.observe(observation, *exprs)
DataFrame.observe
Observe (named) metrics through an Observation instance.
DataFrame.orderBy(*cols, **kwargs)
DataFrame.orderBy
Returns a new DataFrame sorted by the specified column(s).
DataFrame.persist([storageLevel])
DataFrame.persist
Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
DataFrame.printSchema()
DataFrame.printSchema
Prints out the schema in the tree format.
DataFrame.randomSplit(weights[, seed])
DataFrame.randomSplit
Randomly splits this DataFrame with the provided weights.
DataFrame.rdd
Returns the content as a pyspark.RDD of Row.
DataFrame.registerTempTable(name)
DataFrame.registerTempTable
Registers this DataFrame as a temporary table using the given name.
DataFrame.repartition(numPartitions, *cols)
DataFrame.repartition
Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.repartitionByRange(numPartitions, …)
DataFrame.repartitionByRange
Returns a new DataFrame range-partitioned by the given partitioning expressions.
DataFrame.replace(to_replace[, value, subset])
DataFrame.replace
Returns a new DataFrame replacing a value with another value.
DataFrame.rollup(*cols)
DataFrame.rollup
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
DataFrame.sameSemantics(other)
DataFrame.sameSemantics
Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.
DataFrame.sample([withReplacement, …])
DataFrame.sample
Returns a sampled subset of this DataFrame.
DataFrame.sampleBy(col, fractions[, seed])
DataFrame.sampleBy
Returns a stratified sample without replacement based on the fraction given on each stratum.
DataFrame.schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
DataFrame.select(*cols)
DataFrame.select
Projects a set of expressions and returns a new DataFrame.
DataFrame.selectExpr(*expr)
DataFrame.selectExpr
Projects a set of SQL expressions and returns a new DataFrame.
DataFrame.semanticHash()
DataFrame.semanticHash
Returns a hash code of the logical query plan against this DataFrame.
DataFrame.show([n, truncate, vertical])
DataFrame.show
Prints the first n rows to the console.
DataFrame.sort(*cols, **kwargs)
DataFrame.sort
Returns a new DataFrame sorted by the specified column(s).
DataFrame.sortWithinPartitions(*cols, **kwargs)
DataFrame.sortWithinPartitions
Returns a new DataFrame with each partition sorted by the specified column(s).
DataFrame.sparkSession
Returns the Spark session that created this DataFrame.
DataFrame.stat
Returns a DataFrameStatFunctions for statistic functions.
DataFrame.storageLevel
Get the DataFrame’s current storage level.
DataFrame.subtract(other)
DataFrame.subtract
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.
DataFrame.summary(*statistics)
DataFrame.summary
Computes specified statistics for numeric and string columns.
DataFrame.tail(num)
DataFrame.tail
Returns the last num rows as a list of Row.
DataFrame.take(num)
DataFrame.take
Returns the first num rows as a list of Row.
DataFrame.toDF(*cols)
DataFrame.toDF
Returns a new DataFrame with the newly specified column names.
DataFrame.toJSON([use_unicode])
DataFrame.toJSON
Converts a DataFrame into an RDD of strings.
DataFrame.toLocalIterator([prefetchPartitions])
DataFrame.toLocalIterator
Returns an iterator that contains all of the rows in this DataFrame.
DataFrame.toPandas()
DataFrame.toPandas
Returns the contents of this DataFrame as a pandas.DataFrame.
DataFrame.to_pandas_on_spark([index_col])
DataFrame.to_pandas_on_spark
Converts the existing DataFrame into a pandas-on-Spark DataFrame (see DataFrame.pandas_api).
DataFrame.transform(func, *args, **kwargs)
DataFrame.transform
Returns a new DataFrame by applying func to this DataFrame; a concise syntax for chaining custom transformations.
DataFrame.union(other)
DataFrame.union
Return a new DataFrame containing union of rows in this and another DataFrame.
DataFrame.unionAll(other)
DataFrame.unionAll
Return a new DataFrame containing union of rows in this and another DataFrame.
DataFrame.unionByName(other[, …])
DataFrame.unionByName
Returns a new DataFrame containing union of rows in this and another DataFrame.
DataFrame.unpersist([blocking])
DataFrame.unpersist
Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
DataFrame.where(condition)
DataFrame.where
where() is an alias for filter().
DataFrame.withColumn(colName, col)
DataFrame.withColumn
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
DataFrame.withColumns(*colsMap)
DataFrame.withColumns
Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.
DataFrame.withColumnRenamed(existing, new)
DataFrame.withColumnRenamed
Returns a new DataFrame by renaming an existing column.
DataFrame.withMetadata(columnName, metadata)
DataFrame.withMetadata
Returns a new DataFrame by updating an existing column with metadata.
DataFrame.withWatermark(eventTime, …)
DataFrame.withWatermark
Defines an event time watermark for this DataFrame.
DataFrame.write
Interface for saving the content of the non-streaming DataFrame out into external storage.
DataFrame.writeStream
Interface for saving the content of the streaming DataFrame out into external storage.
DataFrame.writeTo(table)
DataFrame.writeTo
Create a write configuration builder for v2 sources.
DataFrame.pandas_api([index_col])
DataFrame.pandas_api
Converts the existing DataFrame into a pandas-on-Spark DataFrame.
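DataFrameNaFunctions.drop([how, thresh, subset])
DataFrameNaFunctions.drop
Returns a new DataFrame omitting rows with null values.
DataFrameNaFunctions.fill(value[, subset])
DataFrameNaFunctions.fill
Replaces null values.
DataFrameNaFunctions.replace(to_replace[, …])
DataFrameNaFunctions.replace
Returns a new DataFrame replacing a value with another value.
DataFrameStatFunctions.approxQuantile(col, …)
DataFrameStatFunctions.approxQuantile
Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrameStatFunctions.corr(col1, col2[, method])
DataFrameStatFunctions.corr
Calculates the correlation of two columns of a DataFrame as a double value.
DataFrameStatFunctions.cov(col1, col2)
DataFrameStatFunctions.cov
Calculates the sample covariance for the given columns, specified by their names, as a double value.
DataFrameStatFunctions.crosstab(col1, col2)
DataFrameStatFunctions.crosstab
Computes a pair-wise frequency table of the given columns.
DataFrameStatFunctions.freqItems(cols[, support])
DataFrameStatFunctions.freqItems
Finds frequent items for columns, possibly with false positives.
DataFrameStatFunctions.sampleBy(col, fractions)
DataFrameStatFunctions.sampleBy
Returns a stratified sample without replacement based on the fraction given on each stratum.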