Questions & Answers

Writing a parquet file from Python that is compatible with SQL/Impala

I am trying to write a pandas DataFrame to a parquet file that is compatible with a table in Impala, but I am struggling to find a solution.

My df has 3 columns

code   int64
number float
name   object

When I write this to a parquet file and load it into Impala, the pandas schema is preserved and the load fails. I would like the parquet file to be saved with the following schema:

code    int
number  decimal(36,18)
name    string

I tried this:

env_schema = """
code    int
number  decimal(36,18)
name    string
"""

df.to_parquet(f'path', index=False, schema=env_schema)

but get the following error:

Argument 'schema' has incorrect type (expected pyarrow.lib.Schema, got str)

Does anyone know how I could achieve this? Thanks

Answers (1):

Create the schema like this:

import pyarrow as pa

env_schema = pa.schema([
  ('code', pa.int32()),
  ('number', pa.decimal128(36, 18)),
  ('name', pa.string())
])
If the columns of the pandas DataFrame do not have dtypes that match the schema, then you will need to create a PyArrow table and cast it to the schema before saving it to Parquet:

import pyarrow.parquet as pq

table = pa.Table.from_pandas(df).cast(env_schema)
pq.write_table(table, f'path')
Comments:

2023-01-20 00:30:14 — I got this error: ArrowTypeError: ('int or Decimal object expected, got float', 'Conversion failed for column number with type object')

2023-01-20 00:30:14 — @geds133 I updated the answer; I think it should work for you now