Channel: Robin on Linux

Get the schema of a parquet file


Previously, I used this snippet to get all the column names of a Parquet file:

import pandas as pd

df = pd.read_parquet("hello.parquet")
print(list(df.columns))

But pd.read_parquet() loads the whole file into memory. If the Parquet file is large (it does not even need to be very large; 1GB is enough), this causes an OOM on my small VM (about 4GB RAM).

Actually, all I want is the column names, not the data itself. Since Parquet is a well-defined format that stores its schema in the file metadata, there must be some way to read only the schema instead of all the data.

And, here it is:

import pyarrow.parquet as pq

schema = pq.read_schema("hello.parquet", memory_map=True)
print(list(schema.names))

The post Get the schema of a parquet file first appeared on Robin on Linux.

