Recently, I wanted to make use of the Postgres COPY methods within Django. Most people recommended django-postgres-copy; however, as of April 2024, that package only supports psycopg2 and Django 4.2. It also makes use of a temporary table, whereas I felt comfortable enough copying the values directly to their destination table.
## Psycopg2
If you’re using the older version of the psycopg Postgres database adapter, psycopg2, then Django’s underlying database connection cursor will have `copy_from`, `copy_to`, and `copy_expert` methods available. You can read their full API documentation in the psycopg2 docs.
I recommend accessing the cursor using a `with` statement:

```python
from django.db import connection

with connection.cursor() as cursor:
    print(hasattr(cursor, "copy_from"))
```
These methods expect a file, or a file-like object with `read()` and `readline()` methods. For instance, you can pass along a file uploaded to Django’s `request.FILES`:

```python
f = self.request.FILES["file"]

with connection.cursor() as cursor:
    cursor.copy_from(f, "db_table_name")
```
By default, the method expects the CSV to be tab-separated and to have all columns present, including the primary key that is assigned by default to Django models. For instance, the following model class:

```python
class Item(models.Model):
    name = models.CharField(max_length=128)
    amount = models.FloatField(null=True)
    modified = models.DateTimeField(null=True)
```

would require a CSV with four columns by default: `id`, `name`, `amount`, and `modified`.
Generally, we want to leave the primary key assignment to the underlying database. Thankfully, we can use the model’s `_meta` attributes to find or dynamically assign these parameters:

```python
columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]

with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, sep=",", columns=columns)
```
We can also specify null values for the various database columns that support them by using the special `\N` string. This null marker can be changed via the `null` keyword argument.
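For instance, a tab-separated stream for the `Item` model above might mark missing values with `\N`. This sketch builds the stream by hand (independent of Django) so you can see the exact text `copy_from` would receive; the row values themselves are made up:

```python
from io import StringIO

# Columns: name, amount, modified; \N marks a SQL NULL in each row.
rows = [
    ("A", "1.5", r"\N"),
    ("B", r"\N", "2024-04-01 12:00:00+00"),
]
f = StringIO("".join("\t".join(row) + "\n" for row in rows))

# f can now be passed to cursor.copy_from(f, ..., columns=columns),
# which interprets \N as NULL by default.
```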
Accepting user uploads directly to the database is risky behavior. We may want to sanitize or alter the uploaded file in some way. We may even have some other sort of non-file data that we want to copy. To do so, we can dynamically create a file-like CSV object:
```python
import csv
from io import StringIO

from django.utils import timezone

f = StringIO()
writer = csv.writer(f)
writer.writerow(["A", 1.001, timezone.now()])
f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start

columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, sep=",", columns=columns)
```
Note that the defaults of the Python standard library `csv` package differ from those of the `copy_from` method. The `csv` package will emit comma-separated values by default and will transform `None` values into empty strings. Sadly, I was unable to find an easy way to make the `csv` writer write `None` as `\N`.
If you need to support null values with `csv`, the easiest ways are to either manually write `\N` values or to change the `copy_from` method to treat empty strings as null:
```python
f = StringIO()
writer = csv.writer(f, dialect=csv.excel_tab)
writer.writerow(["B", 1, None])
writer.writerow(["C", None, timezone.now()])
f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start

columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, null="", columns=columns)
```
But watch out: doing so will prevent you from loading any values that are actually empty strings!
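If you’d rather keep empty strings loadable, a small wrapper can substitute the `\N` marker for `None` before rows reach the `csv` writer. This helper is hypothetical (not part of the `csv` API), but it keeps the default null handling of `copy_from` intact:

```python
import csv
from io import StringIO

def write_rows_with_nulls(f, rows):
    """Write tab-separated rows, rendering None as the \\N null marker."""
    writer = csv.writer(f, dialect=csv.excel_tab)
    for row in rows:
        writer.writerow(r"\N" if value is None else value for value in row)

f = StringIO()
write_rows_with_nulls(f, [["B", 1, None], ["C", "", None]])
f.seek(0)
# The second row keeps its genuine empty string, while None becomes \N.
```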
If you can install the `pandas` package, you can use the above approach to copy `pandas.DataFrame` objects directly to the database, including the expected null format:
```python
from io import StringIO

import pandas as pd

columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
df = pd.DataFrame(columns=columns)
df.loc[0] = ["D", 42.0, None]
df.loc[1] = ["E", -0.1, None]

f = StringIO()
df.to_csv(f, sep="\t", na_rep=r"\N", index=False, header=False)
f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start

with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, columns=columns)
```
## Psycopg Version 3
The latest version of psycopg (which has the package name `psycopg`, but is technically version 3) moves away from the above file-based methods and instead introduces a single `copy` method.
```python
f = self.request.FILES["file"]

with connection.cursor() as cursor:
    with cursor.copy("COPY table_name (name, amount) FROM STDIN DELIMITER ','") as copy:
        copy.write(f.read())
```
This new approach allows rows to be written iteratively:

```python
records = [
    ["D", 42.0, None],
    ["E", -0.1, None],
]

with connection.cursor() as cursor:
    with cursor.copy("COPY table_name (name, amount, modified) FROM STDIN") as copy:
        for record in records:
            copy.write_row(record)
```
I have mixed feelings about having to construct the `COPY` statement yourself: although highly flexible, it does require some extra coding in the simplest cases compared to the `psycopg2` methods like `copy_from`.
A few things to be mindful of regardless of your psycopg version:
- If you are using additional databases or advanced routing, make sure you have the correct connection for your Django model.
- You can specify auto-increment values in your uploads (such as `id` in the example `Item` above), but know that the underlying sequence in Postgres will not be updated. You may need to manually update the sequence before resuming normal inserts on your table.
- Similarly, identical or repeat uploads with manually specified auto-increment primary keys will raise `UniqueViolation` exceptions and cause the copy to be canceled.
- Also regarding auto-increment values: errors during copy will result in the operation being canceled, but any auto-increment values created before the error will be discarded.
- Using copy methods will not trigger model signals, such as `pre_save` and `post_save`.
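If you do copy rows with explicit `id` values, the sequence can be brought back in sync afterwards with Postgres’s `setval` and `pg_get_serial_sequence` functions. The sketch below only builds the SQL string (the table name `app_item` is an assumption; in practice you would pass the result to `cursor.execute`):

```python
def sequence_reset_sql(table, column="id"):
    """Build SQL that advances a serial/identity sequence past the current max value.

    The table and column names are illustrative; adapt them to your schema.
    """
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{column}'), "
        f"COALESCE(MAX({column}), 1)) FROM {table};"
    )

sql = sequence_reset_sql("app_item")
```

Django’s `sqlsequencereset` management command can also generate equivalent statements for a whole app.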
Using copy offers a significant speed improvement over Django’s `bulk_create`, which is itself already an improvement over saving individual models. Here are the results of some rudimentary tests on my M1 MacBook Pro (total elapsed time in seconds):

| rows | save() | save() in transaction | bulk_create() | copy_from() |
|---|---|---|---|---|
| 1,000 | 0.69 | 0.30 | 0.10 | 0.08 |
| 10,000 | 2.26 | 1.25 | 0.36 | 0.10 |
| 100,000 | 22.35 | 9.41 | 2.81 | 0.39 |
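The benchmark code itself is not shown here, but a comparison along these lines can be reproduced with a small timing helper (a sketch; wrap each insertion strategy in the call):

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# e.g. _, elapsed = time_call(Item.objects.bulk_create, items)
```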
Happy hacking!