Questions & Answers

Writing a parquet file from python that is compatible for SQL/Impala

I am trying to write a pandas DataFrame to a Parquet file that is compatible with a table in Impala, but I am struggling to find a solution.

My df has 3 columns:

code   int64
number float
name   object

When I write this to a Parquet file and load it into Impala, the Python schema is preserved and the load fails. I would like the Parquet file to be saved with the following schema:

code    int
number  decimal(36,18)
name    string

I tried this:

env_schema = """
code    int
number  decimal(36,18)
name    string
"""
df.to_parquet(f'path', index=False, schema=env_schema)

but get the following error:

Argument 'schema' has incorrect type (expected pyarrow.lib.Schema, got str)

Does anyone know how I could achieve this? Thanks

Answers (1):

Create the schema like this:

import pyarrow as pa

env_schema = pa.schema([
    ('code', pa.int32()),               # Impala's INT is a 32-bit integer
    ('number', pa.decimal128(36, 18)),
    ('name', pa.string())
])
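
With a real pyarrow.Schema object, your original to_parquet call can work as written, since the pyarrow engine passes the schema through when it builds the Arrow table, provided the column values are convertible to the target types. A minimal sketch, reusing the placeholder path from the question:

df.to_parquet('path', index=False, engine='pyarrow', schema=env_schema)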

If the columns of the pandas DataFrame do not have dtypes that match the schema, you will need to create a PyArrow table and cast it to the schema before writing it to Parquet:

import pyarrow.parquet as pq

# Build the table with pandas-inferred types, then cast to the target schema.
# preserve_index=False drops the pandas index, matching index=False above.
table = pa.Table.from_pandas(df, preserve_index=False).cast(env_schema)
pq.write_table(table, 'path')
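
If the cast itself fails (the ArrowTypeError in the comments, "int or Decimal object expected, got float", indicates pyarrow wanted Decimal objects for the decimal column), one workaround is to convert the column to Python Decimal values before building the table. This is a sketch, not part of the original answer; the column name number comes from the question:

from decimal import Decimal

# Convert the floats to Decimal objects so pyarrow can build the
# decimal128 column directly, without a float-to-decimal cast
df['number'] = df['number'].apply(lambda x: Decimal(str(x)))

table = pa.Table.from_pandas(df, schema=env_schema, preserve_index=False)
pq.write_table(table, 'path')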
Comments:
2023-01-20 00:30:14
ArrowTypeError: ('int or Decimal object expected, got float', 'Conversion failed for column number with type object')
2023-01-20 00:30:14
I got this error
2023-01-20 00:30:14
@geds133 I updated the answer; I think it should work for you now.