Skip to content

pandas.to_sql fails when writing more than 256 cells to a SQL Warehouse with method="multi" #24

@paulstaab

Description

@paulstaab

This is a duplicate of databricks/databricks-sql-python#300 . I am reposting it in this repo, as it still occurs with current versions of databricks-sqlalchemy~=1.0 (1.0.5) and databricks-sql-connector (4.0.2) together with the DBR 15.4 LTS ML versions of pandas (1.5.3) and sqlalchemy (1.4.39).

Problem

When using the multi-row insert mode of sqlalchemy, currently each value gets its own named parameter in the generated query:

INSERT INTO default.test_table (numeric_col, string_col) 
VALUES 
   (:numeric_col_m0, :string_col_m0), 
   (:numeric_col_m1, :string_col_m1), 
   (:numeric_col_m2, :string_col_m2), 
   (:numeric_col_m3, :string_col_m3),
   [...]
   (:numeric_col_m998, :string_col_m998), 
   (:numeric_col_m999, :string_col_m999)

with parameters:

{
'numeric_col_m0': 0, 
'string_col_m0': 'AAA', 
'numeric_col_m1': 1,
'string_col_m1': 'AAA', 
'numeric_col_m2',
[...], 
'numeric_col_m999': 999, 
'string_col_m999': 'AAA'
}

Consequently, even very small tables exceed the limit of 256 named parameters:

sqlalchemy.exc.OperationalError: (databricks.sql.exc.RequestError) Error during request to server: BAD_REQUEST: Parameterized query has too many parameters: 2000 parameters were given but the limit is 256.. BAD_REQUEST: Parameterized query has too many parameters: 2000 parameters were given but the limit is 256.

Example

import pandas as pd 
from sqlalchemy import create_engine

sqlalchemy_connection_string = f"databricks://token:{token}@{host}?http_path={http_path}?catalog={catalog}"
engine = create_engine(sqlalchemy_connection_string)
test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})
test_data.to_sql("test_table", engine, if_exists="replace", index=False, method="multi")

Workarounds

Inline Parameters

I tried to avoid this issue by using legacy inline parameters. However, I get a different error then:

sqlalchemy_connection_string = f"databricks://token:{token}@{host}?http_path={http_path}?catalog={catalog}"
engine = create_engine(sqlalchemy_connection_string, connect_args={"use_inline_params": "silent"})
test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})
test_data.to_sql("test_table", engine, if_exists="replace", index=False, method="multi")
DatabaseError: (databricks.sql.exc.ServerOperationError) [UNBOUND_SQL_PARAMETER] Found the unbound parameter: numeric_col_m0. Please, fix `args` and provide a mapping of the parameter to either a SQL literal or collection constructor functions such as `map()`, `array()`, `struct()`. SQLSTATE: 42P02; line 1 pos 57

Writing row-by-row

The only workaround at the moment seems to be to insert the data row-by-row by not setting method in the to_sql() method.
However, this is prohibitively slow even for medium-sized data frames:

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions