Channel: Robin on Linux

Some tips about pandas, again

1. pd.merge() may change the names of the original columns:

import pandas as pd

df1 = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame(data={"name": ["lion", "heart"], "age": [50, 60]})

merged = pd.merge(df1, df2, how="outer", on="name")
print(merged)

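The output will not have a column named age. Instead it has two new columns named age_x and age_y, because pd.merge() adds the suffixes _x and _y to any column name that appears in both tables and is not used as a join key. So when you merge two tables with many columns, be aware that the column names may change.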
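As a small sketch of how to control this renaming (the suffix strings "_left" and "_right" are illustrative choices, not from the original post), pd.merge() accepts a suffixes argument, and a column that is part of the join key is not renamed at all:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame({"name": ["lion", "heart"], "age": [50, 60]})

# Default behaviour: the overlapping non-key column gets _x / _y suffixes
default_cols = list(pd.merge(df1, df2, how="outer", on="name").columns)

# Explicit suffixes make it clearer which table each column came from
named = pd.merge(df1, df2, how="outer", on="name", suffixes=("_left", "_right"))
named_cols = list(named.columns)

# Joining on both columns leaves nothing overlapping, so no renaming happens
both_keys = pd.merge(df1, df2, how="outer", on=["name", "age"])
both_cols = list(both_keys.columns)

print(default_cols)
print(named_cols)
print(both_cols)
```

Checking merged.columns right after a merge is a cheap way to catch unexpected renames before later code looks up a column by its old name.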

2. Use iterrows() to traverse the rows of a DataFrame:

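from multiprocessing import Pool

import pandas as pd


def process(row):
    # Each item from iterrows() is an (index, Series) tuple,
    # so row[1] is the row's data as a pd.Series
    print(row[1])


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import
    df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
    with Pool(6) as pool:
        pool.map(process, df.iterrows())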

If we call pool.map(process, df) directly, it will traverse the column names of the DataFrame ("name" and "age") instead of its rows, because iterating over a DataFrame yields its column labels.
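The difference is easy to see without a process pool. This is a minimal sketch using the same DataFrame; the variable names direct and rows are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"name": ["robin", "hood"], "age": [40, 30]})

# Iterating over the DataFrame itself yields column labels, not rows
direct = list(df)

# iterrows() yields (index, Series) pairs, one per row
rows = [(idx, row["name"], row["age"]) for idx, row in df.iterrows()]

print(direct)
print(rows)
```

Each tuple from iterrows() is what process(row) receives in the pool example, which is why that function reads the row data from row[1].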

3. How to append a pd.Series to a pd.DataFrame. According to this article, the easiest way is:

import pandas as pd

df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})

series = pd.Series(["water", 50], index=["name", "age"])

# DataFrame.append() was removed in pandas 2.0, so this line needs pandas < 2.0
print(df.append(series, ignore_index=True))

The result is

    name  age
0  robin   40
1   hood   30
2  water   50

Or, we can give the pd.Series a name and remove ignore_index. The same row is appended, but it is labeled with the name of the Series instead of 2.

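If the pd.Series is created without index=["name", "age"], it gets the default integer index 0, 1, which matches none of the column names, so its values end up in new columns and the result will become: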

    name   age      0     1
0  robin  40.0    NaN   NaN
1   hood  30.0    NaN   NaN
2    NaN   NaN  water  50.0
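On pandas 2.0 and later, where DataFrame.append() is no longer available, the same append can be sketched with pd.concat(). This is one possible equivalent rather than the approach from the original post: the Series is first turned into a one-row DataFrame with to_frame().T so that its index labels line up with the column names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["robin", "hood"], "age": [40, 30]})
series = pd.Series(["water", 50], index=["name", "age"])

# to_frame() makes a single-column DataFrame; .T turns it into a single row
# whose columns are the Series index labels ("name", "age")
new_row = series.to_frame().T

# ignore_index=True renumbers the rows 0, 1, 2 as in the append() example
result = pd.concat([df, new_row], ignore_index=True)
print(result)
```

As with append(), the index labels of the Series have to match the column names; otherwise the values land in new columns.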

The post Some tips about pandas, again first appeared on Robin on Linux.

