pyspark.pandas.read_csv

pyspark.pandas.read_csv(path: Union[str, List[str]], sep: str = ',', header: Union[str, int, None] = 'infer', names: Union[str, List[str], None] = None, index_col: Union[str, List[str], None] = None, usecols: Union[List[int], List[str], Callable[[str], bool], None] = None, squeeze: bool = False, mangle_dupe_cols: bool = True, dtype: Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype, Dict[str, Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype]], None] = None, nrows: Optional[int] = None, parse_dates: bool = False, quotechar: Optional[str] = None, escapechar: Optional[str] = None, comment: Optional[str] = None, encoding: Optional[str] = None, **options: Any) → Union[pyspark.pandas.frame.DataFrame, pyspark.pandas.series.Series]

Read CSV (comma-separated) file into DataFrame or Series.

Parameters

path: str or list: Path(s) of the CSV file(s) to be read.
sep: str, default ‘,’: Delimiter to use. Must be a non-empty string.
header: int or None, default ‘infer’: Row to use for the column names, which also marks the start of the data. By default the column names are inferred: if no names are passed, the behavior is identical to header=0 and the column names are taken from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.
names: str or array-like, optional: List of column names to use. If the file contains no header row, you should explicitly pass header=None. Duplicates in this list cause an error. If a string is given, it should be a DDL-formatted string in Spark SQL, which is preferred because it avoids schema inference and gives better performance.
index_col: str or list of str, optional, default None: Index column(s) of the table in Spark.
usecols: list-like or callable, optional: Return a subset of the columns. If list-like, all elements must be either positional (i.e. integer indices into the document columns) or strings that correspond to column names, whether provided by the user in names or inferred from the document header row(s). If callable, the function is evaluated against the column names, and the names for which it returns True are kept.
squeeze: bool, default False: If the parsed data contains only one column, return a Series.

Deprecated since version 3.4.0.
mangle_dupe_cols: bool, default True: Duplicate columns will be specified as ‘X0’, ‘X1’, … ‘XN’, rather than ‘X’ … ‘X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns. Currently only True is allowed.

Deprecated since version 3.4.0.
dtype: Type name or dict of column -> type, default None: Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}. Use str or object together with suitable na_values settings to preserve and not interpret dtype.
nrows: int, default None: Number of rows to read from the CSV file.
parse_dates: boolean or list of ints or names or list of lists or dict, default False: Currently only False is allowed.
quotechar: str (length 1), optional: The character used to denote the start and end of a quoted item. Quoted items can include the delimiter, which is then ignored.
escapechar: str (length 1), default None: One-character string used to escape other characters.
comment: str, optional: Character indicating that a line should not be parsed.
encoding: str, optional: Encoding to use when reading the file.
options: dict: All other options, passed directly to Spark’s data source.
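
The interaction between header and names described above can be summarized in a few lines of plain Python. This is only an illustrative sketch of the documented rule, not the library's code; the helper name effective_header is hypothetical and is not part of pyspark.

```python
from typing import List, Union


def effective_header(
    header: Union[str, int, None] = "infer",
    names: Union[str, List[str], None] = None,
) -> Union[str, int, None]:
    """Return the header value that 'infer' stands for, per the rule above.

    Hypothetical helper for illustration; read_csv does not expose it.
    """
    if header == "infer":
        # No names given: the first line supplies the column names (header=0).
        # Names given: the file is treated as having no header row (header=None).
        return 0 if names is None else None
    # An explicit value such as 0 or None is used as passed.
    return header


print(effective_header())                        # 0
print(effective_header(names=["id", "name"]))    # None
print(effective_header(header=0, names=["id"]))  # 0
```

The last call corresponds to the "replace existing names" case: the first line is still consumed as a header row, and the passed names take its place.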

Returns

DataFrame or Series
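
Examples

A minimal usage sketch, assuming pyspark is installed and a Spark runtime is available. The file name people.csv, its contents, the column names, and the helper functions write_sample_csv, read_sample and main are all made up for illustration. The file has no header row, so header=None is passed and the schema is given to names as a DDL-formatted string, which the names description above says avoids schema inference.

```python
import os
import tempfile


def write_sample_csv(directory):
    """Write a small header-less CSV file (illustrative data) and return its path."""
    path = os.path.join(directory, "people.csv")
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("1,alice,3.5\n2,bob,4.0\n")
    return path


def read_sample(path):
    """Read the sample file with pandas-on-Spark.

    Requires pyspark and a working Spark runtime; the import is done lazily
    so that write_sample_csv can be used without them.
    """
    import pyspark.pandas as ps

    return ps.read_csv(
        path,
        header=None,                                # the file has no header row
        names="id INT, name STRING, score DOUBLE",  # DDL-formatted schema string
        index_col="id",                             # use the id column as the index
    )


def main():
    # Read inside the with-block so the temporary file still exists
    # when Spark evaluates the result.
    with tempfile.TemporaryDirectory() as tmp:
        psdf = read_sample(write_sample_csv(tmp))
        print(psdf.sort_index())
```

Calling main() should print a two-row DataFrame indexed by id, provided Spark can start in the current environment.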