IndexedRowMatrix

class pyspark.mllib.linalg.distributed.IndexedRowMatrix(rows: pyspark.rdd.RDD[Union[Tuple[int, VectorLike], pyspark.mllib.linalg.distributed.IndexedRow]], numRows: int = 0, numCols: int = 0)

Represents a row-oriented distributed Matrix with indexed rows.

Parameters
rows : pyspark.RDD

An RDD of IndexedRows or (int, vector) tuples, or a DataFrame consisting of an int-typed column of indices and a vector-typed column.

numRows : int, optional

Number of rows in the matrix. A non-positive value means unknown, at which point the number of rows will be determined by the max row index plus one.

numCols : int, optional

Number of columns in the matrix. A non-positive value means unknown, at which point the number of columns will be determined by the size of the first row.
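
The dimension-inference rules described above can be sketched in plain Python. This is only an illustration, not Spark code: `infer_dims` is a hypothetical helper, the rows are assumed to be a non-empty local list of (index, values) pairs, and the "first row" is taken to be the first element of that list.

```python
def infer_dims(rows, num_rows=0, num_cols=0):
    """Mimic the documented rules for unknown (non-positive) dimensions.

    rows: non-empty list of (index, values) pairs (hypothetical local
    stand-in for the RDD of IndexedRows).
    """
    if num_rows <= 0:
        # Unknown row count: use the largest row index plus one.
        num_rows = max(index for index, _ in rows) + 1
    if num_cols <= 0:
        # Unknown column count: use the size of the first row.
        num_cols = len(rows[0][1])
    return num_rows, num_cols


rows = [(0, [1, 2, 3]), (6, [4, 5, 6])]
print(infer_dims(rows))        # (7, 3)
print(infer_dims(rows, 7, 6))  # (7, 6)
```

Note that a sparse set of indices (here 0 and 6) still yields 7 rows, because the count is driven by the largest index rather than by how many rows are stored.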

Methods

columnSimilarities()

Compute all cosine similarities between columns.

computeGramianMatrix()

Computes the Gramian matrix A^T A.

computeSVD(k[, computeU, rCond])

Computes the singular value decomposition of the IndexedRowMatrix.

multiply(matrix)

Multiply this matrix by a local dense matrix on the right.

numCols()

Get or compute the number of cols.

numRows()

Get or compute the number of rows.

toBlockMatrix([rowsPerBlock, colsPerBlock])

Convert this matrix to a BlockMatrix.

toCoordinateMatrix()

Convert this matrix to a CoordinateMatrix.

toRowMatrix()

Convert this matrix to a RowMatrix.

Attributes

rows

Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.

Methods Documentation

columnSimilarities() → pyspark.mllib.linalg.distributed.CoordinateMatrix

Compute all cosine similarities between columns.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> cs = mat.columnSimilarities()
>>> print(cs.numCols())
3
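
To make "cosine similarities between columns" concrete, here is a minimal local sketch in plain Python. It is not the distributed implementation: `column_similarities` is a hypothetical helper, every column is assumed to have a non-zero norm, and reporting only pairs with i < j is an assumption based on the upper-triangular result described in the Scala RowMatrix documentation.

```python
import math


def column_similarities(matrix):
    """Return {(i, j): cosine similarity} for column pairs with i < j.

    matrix: list of equal-length rows; columns must be non-zero.
    """
    n_cols = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_cols)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    sims = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


sims = column_similarities([[1, 2, 3], [4, 5, 6]])
for pair in sorted(sims):
    print(pair, sims[pair])
```

Row indices play no role here: only the column values matter, so the example rows with indices 0 and 6 give the same similarities as rows 0 and 1 would.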

computeGramianMatrix() → pyspark.mllib.linalg.Matrix

Computes the Gramian matrix A^T A.

New in version 2.0.0.

Notes

This cannot be computed on matrices with more than 65535 columns.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)
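
The values in the example output can be checked by hand. The following plain-Python sketch (hypothetical helpers `gramian` and `to_column_major`, not part of PySpark) computes A^T A for the same two rows and flattens it in column-major order, which is the order in which DenseMatrix lists its values.

```python
def gramian(matrix):
    """Compute A^T A for a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [
        [float(sum(row[i] * row[j] for row in matrix)) for j in range(n_cols)]
        for i in range(n_cols)
    ]


def to_column_major(m):
    """Flatten a list-of-rows matrix column by column."""
    return [m[i][j] for j in range(len(m[0])) for i in range(len(m))]


g = gramian([[1, 2, 3], [4, 5, 6]])
print(to_column_major(g))
# [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0]
```

The result is always an n x n symmetric matrix, where n is the number of columns, which is why the column limit in the note above applies to this method.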

computeSVD(k: int, computeU: bool = False, rCond: float = 1e-09) → pyspark.mllib.linalg.distributed.SingularValueDecomposition[pyspark.mllib.linalg.distributed.IndexedRowMatrix, pyspark.mllib.linalg.Matrix]

Computes the singular value decomposition of the IndexedRowMatrix.

The given row matrix A of dimension (m x n) is decomposed into U * s * V^T, where:

  • U: (m x k) (left singular vectors) is an IndexedRowMatrix whose columns are the eigenvectors of (A * A^T).

  • s: DenseVector consisting of the square roots of the eigenvalues (singular values), in descending order.

  • V: (n x k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A^T * A).

For more specific details on the implementation, please refer to the Scala documentation.

New in version 2.2.0.

Parameters
k : int

Number of leading singular values to keep (0 < k <= n). Fewer than k values may be returned if some singular values are numerically zero, or if too few Ritz values converge before the maximum number of Arnoldi update iterations is reached (which can happen when A is ill-conditioned).

computeU : bool, optional

Whether to compute U. If True, U is computed as A * V * s^-1.

rCond : float, optional

Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero, where s[0] is the largest singular value.

Returns
SingularValueDecomposition
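
The effect of rCond can be pictured with a small plain-Python sketch. `truncate_singular_values` is a hypothetical helper that simply applies the documented threshold to a list already sorted in descending order; it is not how Spark computes the decomposition.

```python
def truncate_singular_values(s, r_cond=1e-9):
    """Drop values smaller than r_cond * s[0].

    s: singular values in descending order (s[0] is the largest).
    """
    if not s:
        return []
    threshold = r_cond * s[0]
    return [value for value in s if value >= threshold]


print(truncate_singular_values([10.0, 1.0, 1e-12]))       # [10.0, 1.0]
print(truncate_singular_values([10.0, 1.0, 1e-12], 0.5))  # [10.0]
```

This is one reason the result can contain fewer than k singular values: anything below the relative threshold is treated as zero and discarded.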

Examples

>>> rows = [(0, (3, 1, 1)), (1, (-1, 3, 1))]
>>> irm = IndexedRowMatrix(sc.parallelize(rows))
>>> svd_model = irm.computeSVD(2, True)
>>> svd_model.U.rows.collect() 
[IndexedRow(0, [-0.707106781187,0.707106781187]), IndexedRow(1, [-0.707106781187,-0.707106781187])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
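
The relation U = A * V * s^-1 can be verified on this example without Spark. The sketch below is plain Python with a hypothetical helper `compute_u`; the exact values of s and V are written in closed form (sqrt(12), sqrt(10), and the normalized eigenvectors of A^T A), which round to the numbers printed above.

```python
import math

# The example matrix, its singular values, and V as a list of rows (n x k).
A = [[3.0, 1.0, 1.0], [-1.0, 3.0, 1.0]]
s = [math.sqrt(12.0), math.sqrt(10.0)]
V = [
    [-1 / math.sqrt(6), 2 / math.sqrt(5)],
    [-2 / math.sqrt(6), -1 / math.sqrt(5)],
    [-1 / math.sqrt(6), 0.0],
]


def compute_u(a, v, singular_values):
    """Return A * V * diag(s)^-1 as a list of rows."""
    k = len(singular_values)
    return [
        [
            sum(row[j] * v[j][c] for j in range(len(row))) / singular_values[c]
            for c in range(k)
        ]
        for row in a
    ]


U = compute_u(A, V, s)
for row in U:
    print([round(x, 4) for x in row])
# [-0.7071, 0.7071]
# [-0.7071, -0.7071]
```

Singular vectors are only determined up to sign, so a different implementation could return U and V with flipped signs in matching columns.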

multiply(matrix: pyspark.mllib.linalg.Matrix) → pyspark.mllib.linalg.distributed.IndexedRowMatrix

Multiply this matrix by a local dense matrix on the right.

New in version 2.2.0.

Parameters
matrix : pyspark.mllib.linalg.Matrix

A local dense matrix whose number of rows must match the number of columns of this matrix.

Returns
IndexedRowMatrix

Examples

>>> mat = IndexedRowMatrix(sc.parallelize([(0, (0, 1)), (1, (2, 3))]))
>>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]
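
The example is easier to follow once the column-major layout of DenseMatrix is spelled out: DenseMatrix(2, 2, [0, 2, 1, 3]) is the matrix with rows [0, 1] and [2, 3]. The plain-Python sketch below uses hypothetical helpers `from_column_major` and `multiply_rows` to reproduce the result locally; it is not the distributed implementation.

```python
def from_column_major(num_rows, num_cols, values):
    """Rebuild a list-of-rows matrix from column-major values."""
    return [
        [float(values[j * num_rows + i]) for j in range(num_cols)]
        for i in range(num_rows)
    ]


def multiply_rows(indexed_rows, local):
    """Right-multiply each (index, vector) row by the local matrix."""
    n_out = len(local[0])
    return [
        (index, [sum(x * local[k][j] for k, x in enumerate(vec)) for j in range(n_out)])
        for index, vec in indexed_rows
    ]


local = from_column_major(2, 2, [0, 2, 1, 3])  # [[0.0, 1.0], [2.0, 3.0]]
print(multiply_rows([(0, (0, 1)), (1, (2, 3))], local))
# [(0, [2.0, 3.0]), (1, [6.0, 11.0])]
```

Each row is multiplied independently and keeps its index, which is why the operation suits a row-distributed matrix with a small local right-hand operand.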

numCols() → int

Get or compute the number of cols.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numCols())
3
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6

numRows() → int

Get or compute the number of rows.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numRows())
4
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7

toBlockMatrix(rowsPerBlock: int = 1024, colsPerBlock: int = 1024) → pyspark.mllib.linalg.distributed.BlockMatrix

Convert this matrix to a BlockMatrix.

Parameters
rowsPerBlock : int, optional

Number of rows that make up each block. The blocks forming the final rows are not required to have the given number of rows.

colsPerBlock : int, optional

Number of columns that make up each block. The blocks forming the final columns are not required to have the given number of columns.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toBlockMatrix()
>>> # This IndexedRowMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # BlockMatrix will have 7 rows as well.
>>> print(mat.numRows())
7
>>> print(mat.numCols())
3
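
The remark that the final blocks need not have the full size can be illustrated with a plain-Python sketch of the block grid. `block_grid` is a hypothetical helper that only computes the geometry; whether Spark materializes every block (for example, blocks containing only zeros) is not covered here.

```python
import math


def block_grid(num_rows, num_cols, rows_per_block=1024, cols_per_block=1024):
    """Return {(block_row, block_col): (height, width)} for a block layout."""
    n_row_blocks = math.ceil(num_rows / rows_per_block)
    n_col_blocks = math.ceil(num_cols / cols_per_block)
    shapes = {}
    for bi in range(n_row_blocks):
        for bj in range(n_col_blocks):
            # Trailing blocks shrink to whatever rows or columns remain.
            height = min(rows_per_block, num_rows - bi * rows_per_block)
            width = min(cols_per_block, num_cols - bj * cols_per_block)
            shapes[(bi, bj)] = (height, width)
    return shapes


print(block_grid(7, 3))        # {(0, 0): (7, 3)}
print(block_grid(7, 3, 4, 2))
# {(0, 0): (4, 2), (0, 1): (4, 1), (1, 0): (3, 2), (1, 1): (3, 1)}
```

With the default 1024 x 1024 block size, the 7 x 3 matrix from the example fits in a single block.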

toCoordinateMatrix() → pyspark.mllib.linalg.distributed.CoordinateMatrix

Convert this matrix to a CoordinateMatrix.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 0]),
...                        IndexedRow(6, [0, 5])])
>>> mat = IndexedRowMatrix(rows).toCoordinateMatrix()
>>> mat.entries.take(3)
[MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 0.0), MatrixEntry(6, 0, 0.0)]
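
As the output shows, explicit zeros in a dense row become entries too. A minimal plain-Python sketch of that flattening follows; `to_entries` is a hypothetical helper that works on a local list, whereas the order of entries in a real RDD is not guaranteed in general.

```python
def to_entries(indexed_rows):
    """Flatten (index, values) rows into (row, col, value) triples."""
    return [
        (i, j, float(v))
        for i, vec in indexed_rows
        for j, v in enumerate(vec)
    ]


entries = to_entries([(0, [1, 0]), (6, [0, 5])])
print(entries[:3])
# [(0, 0, 1.0), (0, 1, 0.0), (6, 0, 0.0)]
```

Rows that are absent from the input (indices 1 to 5 here) produce no entries at all.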

toRowMatrix() → pyspark.mllib.linalg.distributed.RowMatrix

Convert this matrix to a RowMatrix.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toRowMatrix()
>>> mat.rows.collect()
[DenseVector([1.0, 2.0, 3.0]), DenseVector([4.0, 5.0, 6.0])]

Attributes Documentation

rows

Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.

Examples

>>> mat = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                                        IndexedRow(1, [4, 5, 6])]))
>>> rows = mat.rows
>>> rows.first()
IndexedRow(0, [1.0,2.0,3.0])