IndexedRowMatrix

class pyspark.mllib.linalg.distributed.IndexedRowMatrix(rows: pyspark.rdd.RDD[Union[Tuple[int, VectorLike], pyspark.mllib.linalg.distributed.IndexedRow]], numRows: int = 0, numCols: int = 0)

Represents a row-oriented distributed Matrix with indexed rows.

Parameters

rows (pyspark.RDD): An RDD of IndexedRows or (int, vector) tuples, or a DataFrame consisting of an int-typed column of indices and a vector-typed column.
numRows (int, optional): Number of rows in the matrix. A non-positive value means unknown, in which case the number of rows is determined by the maximum row index plus one.
numCols (int, optional): Number of columns in the matrix. A non-positive value means unknown, in which case the number of columns is determined by the size of the first row.
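
The fallback rule for unknown dimensions can be sketched in plain Python, without a Spark session. The helper name `infer_dimensions` is hypothetical and is not part of PySpark; it only mirrors the rule described in the parameter list above.

```python
from typing import Sequence, Tuple


def infer_dimensions(
    rows: Sequence[Tuple[int, Sequence[float]]],
    num_rows: int = 0,
    num_cols: int = 0,
) -> Tuple[int, int]:
    """Hypothetical helper: apply the documented defaults for unknown sizes."""
    if num_rows <= 0:
        # Unknown row count: use the largest row index plus one.
        num_rows = max(index for index, _ in rows) + 1
    if num_cols <= 0:
        # Unknown column count: use the size of the first row.
        num_cols = len(rows[0][1])
    return num_rows, num_cols


rows = [(0, [1, 2, 3]), (6, [4, 5, 6])]
inferred = infer_dimensions(rows)        # both dimensions unknown
explicit = infer_dimensions(rows, 7, 6)  # both dimensions given
print(inferred, explicit)
```

With row indices 0 and 6, the inferred row count is 7 even though only two rows are stored.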

Methods

`columnSimilarities`()	Compute all cosine similarities between columns.
`computeGramianMatrix`()	Computes the Gramian matrix A^T A.
`computeSVD`(k[, computeU, rCond])	Computes the singular value decomposition of the IndexedRowMatrix.
`multiply`(matrix)	Multiply this matrix by a local dense matrix on the right.
`numCols`()	Get or compute the number of cols.
`numRows`()	Get or compute the number of rows.
`toBlockMatrix`([rowsPerBlock, colsPerBlock])	Convert this matrix to a BlockMatrix.
`toCoordinateMatrix`()	Convert this matrix to a CoordinateMatrix.
`toRowMatrix`()	Convert this matrix to a RowMatrix.

Attributes

rows

Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.

Methods Documentation

columnSimilarities() → pyspark.mllib.linalg.distributed.CoordinateMatrix

Compute all cosine similarities between columns.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> cs = mat.columnSimilarities()
>>> print(cs.numCols())
3
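
As a rough illustration of what "cosine similarities between columns" means, the following plain-Python sketch computes them for the same two rows. It is not Spark's distributed implementation. The function name is hypothetical, the sketch assumes every column has a non-zero norm, and keeping only pairs with i < j is a choice made here rather than something stated above.

```python
import math
from typing import Dict, List, Tuple


def column_cosine_similarities(
    matrix: List[List[float]],
) -> Dict[Tuple[int, int], float]:
    """Hypothetical helper: cosine similarity for each column pair i < j."""
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    sims: Dict[Tuple[int, int], float] = {}
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            dot = sum(a * b for a, b in zip(columns[i], columns[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


# Rows 1 to 5 of the example matrix are implicitly zero, so they do not
# change any column dot product or norm and can be left out here.
similarities = column_cosine_similarities([[1, 2, 3], [4, 5, 6]])
for pair, value in sorted(similarities.items()):
    print(pair, round(value, 4))
```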

computeGramianMatrix() → pyspark.mllib.linalg.Matrix

Computes the Gramian matrix A^T A.

New in version 2.0.0.

Notes

This cannot be computed on matrices with more than 65535 columns.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)

>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)
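
The values in the output above can be checked by hand. The sketch below computes A^T A for the same 2 x 3 matrix in plain Python and lists the entries column by column, which is the order in which DenseMatrix prints its values. The function name is hypothetical, and this is only a local check, not how Spark computes the result.

```python
from typing import List


def gramian_column_major(matrix: List[List[float]]) -> List[float]:
    """Hypothetical helper: entries of A^T A, listed column by column."""
    n_cols = len(matrix[0])
    values: List[float] = []
    for j in range(n_cols):      # column of the result
        for i in range(n_cols):  # row of the result
            # Entry (i, j) is the dot product of columns i and j of A.
            values.append(float(sum(row[i] * row[j] for row in matrix)))
    return values


gram = gramian_column_major([[1, 2, 3], [4, 5, 6]])
print(gram)
```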

computeSVD(k: int, computeU: bool = False, rCond: float = 1e-09) → pyspark.mllib.linalg.distributed.SingularValueDecomposition[pyspark.mllib.linalg.distributed.IndexedRowMatrix, pyspark.mllib.linalg.Matrix]

Computes the singular value decomposition of the IndexedRowMatrix.

The given row matrix A of dimension (m x n) is decomposed into U * s * V^T, where:

U: (m x k) (left singular vectors) is an IndexedRowMatrix whose columns are the eigenvectors of (A * A^T)
s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
V: (n x k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A^T * A)

For more specific details on the implementation, please refer to the Scala documentation.

New in version 2.2.0.

Parameters

k (int): Number of leading singular values to keep (0 < k <= n). Fewer than k may be returned if there are numerically zero singular values, or if not enough Ritz values converge before the maximum number of Arnoldi update iterations is reached (for example, when matrix A is ill-conditioned).
computeU (bool, optional): Whether or not to compute U. If True, U is computed as A * V * s^-1.
rCond (float, optional): Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero, where s[0] is the largest singular value.

Returns

SingularValueDecomposition

Examples

>>> rows = [(0, (3, 1, 1)), (1, (-1, 3, 1))]
>>> irm = IndexedRowMatrix(sc.parallelize(rows))
>>> svd_model = irm.computeSVD(2, True)
>>> svd_model.U.rows.collect()
[IndexedRow(0, [-0.707106781187,0.707106781187]), IndexedRow(1, [-0.707106781187,-0.707106781187])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
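
The relation U = A * V * s^-1 from the parameter description can be checked against this example in plain Python. The closed forms used for V and s below (for instance -1/sqrt(6) for -0.4082 and sqrt(12) for 3.4641) are an assumption read off the rounded output, and `recover_u` is a hypothetical helper, not a PySpark function.

```python
import math
from typing import List

A = [[3.0, 1.0, 1.0], [-1.0, 3.0, 1.0]]

# Columns of V, written in closed form (assumed from the rounded output).
v_columns = [
    [-1 / math.sqrt(6), -2 / math.sqrt(6), -1 / math.sqrt(6)],
    [2 / math.sqrt(5), -1 / math.sqrt(5), 0.0],
]
s = [math.sqrt(12), math.sqrt(10)]


def recover_u(
    a: List[List[float]], v_cols: List[List[float]], sigma: List[float]
) -> List[List[float]]:
    """Hypothetical helper: compute A * V * s^-1 one row of A at a time."""
    u_rows = []
    for row in a:
        u_rows.append(
            [
                sum(x * v for x, v in zip(row, col)) / sv
                for col, sv in zip(v_cols, sigma)
            ]
        )
    return u_rows


U = recover_u(A, v_columns, s)
for u_row in U:
    print([round(x, 6) for x in u_row])
```

Each printed row should agree with the corresponding IndexedRow in `svd_model.U` up to rounding.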

multiply(matrix: pyspark.mllib.linalg.Matrix) → pyspark.mllib.linalg.distributed.IndexedRowMatrix

Multiply this matrix by a local dense matrix on the right.

New in version 2.2.0.

Parameters

matrix (pyspark.mllib.linalg.Matrix): A local dense matrix whose number of rows must match the number of columns of this matrix.

Returns

IndexedRowMatrix

Examples

>>> mat = IndexedRowMatrix(sc.parallelize([(0, (0, 1)), (1, (2, 3))]))
>>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]
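
The result above depends on DenseMatrix storing its values in column-major order, so `[0, 2, 1, 3]` describes the matrix with rows (0, 1) and (2, 3). The plain-Python sketch below reproduces the product locally. Both helper names are hypothetical, and the code only illustrates the arithmetic, not the distributed computation.

```python
from typing import List, Sequence, Tuple


def to_row_major(
    num_rows: int, num_cols: int, values: Sequence[float]
) -> List[List[float]]:
    """Hypothetical helper: unpack column-major values into a list of rows."""
    return [
        [values[j * num_rows + i] for j in range(num_cols)]
        for i in range(num_rows)
    ]


def right_multiply(
    indexed_rows: Sequence[Tuple[int, Sequence[float]]],
    local: List[List[float]],
) -> List[Tuple[int, List[float]]]:
    """Hypothetical helper: multiply each indexed row by the local matrix."""
    n_out = len(local[0])
    return [
        (
            index,
            [
                float(sum(vec[k] * local[k][j] for k in range(len(vec))))
                for j in range(n_out)
            ],
        )
        for index, vec in indexed_rows
    ]


local = to_row_major(2, 2, [0, 2, 1, 3])
product = right_multiply([(0, (0, 1)), (1, (2, 3))], local)
print(local)
print(product)
```

Row indices pass through unchanged; only the vectors are transformed.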

numCols() → int

Get or compute the number of cols.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])

>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numCols())
3

>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6

numRows() → int

Get or compute the number of rows.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])

>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numRows())
4

>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7

toBlockMatrix(rowsPerBlock: int = 1024, colsPerBlock: int = 1024) → pyspark.mllib.linalg.distributed.BlockMatrix

Convert this matrix to a BlockMatrix.

Parameters

rowsPerBlock (int, optional): Number of rows that make up each block. The blocks forming the final rows are not required to have the given number of rows.
colsPerBlock (int, optional): Number of columns that make up each block. The blocks forming the final columns are not required to have the given number of columns.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toBlockMatrix()

>>> # This IndexedRowMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # BlockMatrix will have 7 rows as well.
>>> print(mat.numRows())
7

>>> print(mat.numCols())
3
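
To see why the final blocks may be smaller than requested, the sketch below splits a dimension into consecutive blocks in plain Python. It assumes a regular grid in which every block except possibly the last has the requested size. That assumption is consistent with the parameter description but is not taken from Spark's source, and `block_extents` is a hypothetical helper.

```python
import math
from typing import List


def block_extents(total: int, per_block: int) -> List[int]:
    """Hypothetical helper: sizes of consecutive blocks along one dimension."""
    n_blocks = math.ceil(total / per_block)
    # Every block is full-sized except possibly the last one.
    return [min(per_block, total - b * per_block) for b in range(n_blocks)]


default_row_blocks = block_extents(7, 1024)  # 7 rows, default block size
small_row_blocks = block_extents(7, 3)       # 7 rows, 3 rows per block
col_blocks = block_extents(3, 1024)          # 3 columns, default block size
print(default_row_blocks, small_row_blocks, col_blocks)
```

With the default block size of 1024, the 7 x 3 example fits in a single block; a smaller `rowsPerBlock` of 3 would leave a final block with one row.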

toCoordinateMatrix() → pyspark.mllib.linalg.distributed.CoordinateMatrix

Convert this matrix to a CoordinateMatrix.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 0]),
...                        IndexedRow(6, [0, 5])])
>>> mat = IndexedRowMatrix(rows).toCoordinateMatrix()
>>> mat.entries.take(3)
[MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 0.0), MatrixEntry(6, 0, 0.0)]
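
The output above shows that explicit zeros in dense rows become entries as well. A plain-Python sketch of that flattening follows. Here `to_entries` is a hypothetical helper, and plain tuples stand in for MatrixEntry objects.

```python
from typing import List, Sequence, Tuple


def to_entries(
    indexed_rows: Sequence[Tuple[int, Sequence[float]]],
) -> List[Tuple[int, int, float]]:
    """Hypothetical helper: one (row, column, value) tuple per stored value."""
    return [
        (i, j, float(value))
        for i, vec in indexed_rows
        for j, value in enumerate(vec)
    ]


entries = to_entries([(0, [1, 0]), (6, [0, 5])])
print(entries[:3])
```

Rows 1 to 5 are absent from the input, so they contribute no entries.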

toRowMatrix() → pyspark.mllib.linalg.distributed.RowMatrix

Convert this matrix to a RowMatrix.

Examples

>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toRowMatrix()
>>> mat.rows.collect()
[DenseVector([1.0, 2.0, 3.0]), DenseVector([4.0, 5.0, 6.0])]

Attributes Documentation

rows

Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.

Examples

>>> mat = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                                        IndexedRow(1, [4, 5, 6])]))
>>> rows = mat.rows
>>> rows.first()
IndexedRow(0, [1.0,2.0,3.0])
