menu

Questions & Answers

Filter rows from Pandas dataframe based on max value of a column

I have a pandas dataframe with following columns

col1   col2   col3
x      12     abc
x       7     abc
x       5     abc
x       3   
y      10     abc
y       9     abc

I would like to find all rows in a pandas DataFrame which have the max value for col2 column, after grouping by 'col1' columns after filtering the rows where col3 is null?

The expected output is:

col1   col2   col3
x      12     abc
y      10     abc

I have tried the below code so far.

df[df[['col3']].notnull().all(1) & df.sort_values('col2').drop_duplicates(['col1'], keep='last')]

However I am getting following error.

TypeError: unsupported operand type(s) for &: 'bool' and 'float'

Any help is highly appreciated

Answers(2) :

another solution

import pandas as pd
lstColumns=["col1","col2","col3"]
lstValues=[["x",12,"abc"],["x",7,"abc"],["x",5,"abc"],["x",3,"abc"],["y",10,"abc"],["y",9,"abc"]]
df=pd.DataFrame(lstValues,columns=lstColumns)
df=df.sort_values(['col1', 'col2'], ascending=[True, True])
newdf=df.drop_duplicates(subset='col1', keep="last")

  col1  col2 col3
0    x    12  abc
4    y    10  abc

How max method calculate without mentioning the column?

According to the pd.DataFrame.max it returns the maximum values over the selected axis with default being (0, index).

In your example you only have 1 numeric values and all values in col3 are identical. If col3 was also numeric, max method would return the max values of that column as well with the resulting DataFrame may have different rows than the original one.

It is suitable in this case but if you only would like the output DataFrame have the rows as the original one you need to be specific about the column whose maximum you would like to consider.

df.loc[df.notnull().all(axis=1)].groupby('col1').max().reset_index()

  col1  col2 col3
0    x    12  abc
1    y    10  abc

Or you can create a boolean Series first and assign it to a name to improve readability:

m = df.notnull().all(axis=1)
df.loc[m].groupby('col1').max().reset_index()

Let's say now this is your original DataFrame:

  col1  col2  col3
0    x    12   2.0
1    x     7  20.0
2    x     5   1.0
3    x     3   NaN
4    y    10   4.0
5    y     9  11.0

When you apply max on this without specifying the column name it will return the following:

  col1  col2  col3
0    x    12  20.0
1    y    10  11.0
Comments:
2023-01-18 00:30:18
One question around that: How does the max calculated without mentioning the column name.
2023-01-18 00:30:18
I added some notes you can check them. Let me know if you need further explanations.