DataFrame.
where
Replace values where the condition is False.
Where cond is True, keep the original value. Where False, replace with corresponding value from other.
Entries where cond is False are replaced with corresponding value from other.
Can only be set to 0 at the moment for compatibility with pandas.
Examples
>>> from pyspark.pandas.config import set_option, reset_option >>> set_option("compute.ops_on_diff_frames", True) >>> df1 = ps.DataFrame({'A': [0, 1, 2, 3, 4], 'B':[100, 200, 300, 400, 500]}) >>> df2 = ps.DataFrame({'A': [0, -1, -2, -3, -4], 'B':[-100, -200, -300, -400, -500]}) >>> df1 A B 0 0 100 1 1 200 2 2 300 3 3 400 4 4 500 >>> df2 A B 0 0 -100 1 -1 -200 2 -2 -300 3 -3 -400 4 -4 -500
>>> df1.where(df1 > 0).sort_index() A B 0 NaN 100.0 1 1.0 200.0 2 2.0 300.0 3 3.0 400.0 4 4.0 500.0
>>> df1.where(df1 > 1, 10).sort_index() A B 0 10 100 1 10 200 2 2 300 3 3 400 4 4 500
>>> df1.where(df1 > 1, df1 + 100).sort_index() A B 0 100 100 1 101 200 2 2 300 3 3 400 4 4 500
>>> df1.where(df1 > 1, df2).sort_index() A B 0 0 100 1 -1 200 2 2 300 3 3 400 4 4 500
When the column name of cond is different from self, it treats all values are False
>>> cond = ps.DataFrame({'C': [0, -1, -2, -3, -4], 'D':[4, 3, 2, 1, 0]}) % 3 == 0 >>> cond C D 0 True False 1 False True 2 False False 3 True False 4 False True
>>> df1.where(cond).sort_index() A B 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN
When the type of cond is Series, it just check boolean regardless of column name
>>> cond = ps.Series([1, 2]) > 1 >>> cond 0 False 1 True dtype: bool
>>> df1.where(cond).sort_index() A B 0 NaN NaN 1 1.0 200.0 2 NaN NaN 3 NaN NaN 4 NaN NaN
>>> reset_option("compute.ops_on_diff_frames")