Loading CSV Files in Django with psycopg 2 and 3

By Aaron O. Ellis

Recently, I wanted to make use of the Postgres COPY methods within Django. Most people recommended django-postgres-copy; however, as of April 2024, the package only supports psycopg2 and Django 4.2. It also makes use of a temporary table, whereas I felt comfortable enough copying the values directly to their destination table.

Psycopg2

If you’re using the older version of the psycopg Postgres database adapter, psycopg2, then Django’s underlying database connection cursor will have copy_from, copy_to, and copy_expert methods available. You can read their full API documentation in the psycopg2 docs.

I recommend accessing the cursor using a with statement:

from django.db import connection

with connection.cursor() as cursor:
    print(hasattr(cursor, "copy_from"))
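
For instance, copy_to goes in the other direction and streams a table’s rows out to a writable file-like object. Here is a minimal sketch, using the same placeholder table name ("db_table_name") as the upload example below:

from io import StringIO

from django.db import connection

out = StringIO()
with connection.cursor() as cursor:
    # Export every row of the table as tab-separated text, one line per row
    cursor.copy_to(out, "db_table_name")

print(out.getvalue())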

These methods expect a file or a file-like object (for copy_from, one with read() and readline() methods). For instance, you can pass a file uploaded to Django’s request.FILES directly:

f = self.request.FILES["file"]
with connection.cursor() as cursor:
    cursor.copy_from(f, "db_table_name")

By default, the method expects the CSV to be tab-separated and have all columns present, including the primary key that is assigned by default to Django models. For instance, the following model class:

class Item(models.Model):
    name = models.CharField(max_length=128)
    amount = models.FloatField(null=True)
    modified = models.DateTimeField(null=True)

Would require a CSV with four columns by default: id, name, amount, modified.
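
For example, a file matching those defaults might look like the following, with the four columns separated by tabs (the values are made up for illustration):

1	Widget	1.5	2024-04-01 12:00:00+00
2	Gadget	2.25	2024-04-02 08:30:00+00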

Generally, we want to leave the primary key assignment to the underlying database. Thankfully, we can use the model’s meta attributes to find or dynamically assign these parameters:

columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, sep=",", columns=columns)

We can also specify null values for the various database columns that support them by using the special \N string. This special string value can also be changed via the null keyword argument.
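
For example, here is a small sketch that loads a tab-separated stream in which missing values are marked with the string NULL instead of \N (the data is made up):

from io import StringIO

from django.db import connection

data = StringIO("F\tNULL\t2024-04-01 12:00:00+00\n")
columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    # Treat the literal string NULL (rather than the default \N) as SQL NULL
    cursor.copy_from(data, Item._meta.db_table, null="NULL", columns=columns)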

Accepting user uploads directly to the database is risky behavior. We may want to sanitize or alter the uploaded file in some way. We may even have some other sort of non-file data that we want to copy. To do so, we can dynamically create a file-like csv object:

import csv
from io import StringIO

from django.utils import timezone

f = StringIO()
writer = csv.writer(f)
writer.writerow(["A", 1.001, timezone.now()])

f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start
columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, sep=",", columns=columns)

Note that the defaults of the Python standard library csv package differ from those of the copy_from method. The csv package emits comma-separated values by default and transforms None values into empty strings. Sadly, I was unable to find an easy way to make the csv writer emit None as \N.

If you need to support null values with csv, the easiest ways are to either manually write \N values or to change the copy_from method to use empty strings as null:

f = StringIO()
writer = csv.writer(f, dialect=csv.excel_tab)
writer.writerow(["B", 1, None])
writer.writerow(["C", None, timezone.now()])

f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start
columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, null="", columns=columns)

But watch out: doing so will prevent you from loading any values that are actually empty strings!
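
If you would rather keep empty strings loadable, the manual approach works too: swap in the \N marker for None yourself before handing each row to the csv writer. A sketch with made-up rows follows; note that the csv writer will not escape embedded tabs or backslashes for you, so this is only safe for simple values:

rows = [
    ["F", 2.5, None],
    ["G", None, timezone.now()],
]

f = StringIO()
writer = csv.writer(f, dialect=csv.excel_tab)
for row in rows:
    # Write the \N marker (which copy_from treats as NULL by default) in place of None
    writer.writerow([r"\N" if value is None else value for value in row])

f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start
columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, columns=columns)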

If you can install the pandas package, you can use the above approach to copy pandas.DataFrame objects directly to the database, including the expected null format:

from io import StringIO
import pandas as pd

columns = [f.name for f in Item._meta.get_fields() if not f.primary_key]
df = pd.DataFrame(columns=columns)
df.loc[0] = ["D", 42.0, None]
df.loc[1] = ["E", -0.1, None]

f = StringIO()
df.to_csv(f, sep="\t", na_rep=r"\N", index=False, header=False)

f.seek(0)  # Make sure the SQL cursor reads the CSV stream from its start
with connection.cursor() as cursor:
    cursor.copy_from(f, Item._meta.db_table, columns=columns)

Psycopg Version 3

The latest version of psycopg (published under the package name psycopg, but technically version 3) moves away from the file-based methods above and instead introduces a single copy method.

f = self.request.FILES["file"]
with connection.cursor() as cursor:
    with cursor.copy("COPY table_name (name, amount) FROM STDIN DELIMITER ','") as copy:
        copy.write(f.read())

This new approach allows rows to be written iteratively:

records = [
    ["D", 42.0, None],
    ["E", -0.1, None]
]
with connection.cursor() as cursor:
    with cursor.copy("COPY table_name (name, amount, modified) FROM STDIN") as copy:
        for record in records:
            copy.write_row(record)

I have mixed feelings about having to construct the COPY statement yourself: although highly flexible, it does require some extra code in the simplest cases compared to psycopg2 methods like copy_from.
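
If you want to avoid writing the statement by hand, one option (a sketch of my own, not something the original post prescribes) is to compose it from the model’s metadata with psycopg’s sql module, mirroring the columns trick used with psycopg2 above:

from django.db import connection
from psycopg import sql

records = [
    ["D", 42.0, None],
    ["E", -0.1, None],
]

# Database column names for every non-primary-key field on the model
columns = [f.column for f in Item._meta.get_fields() if not f.primary_key]
statement = sql.SQL("COPY {table} ({fields}) FROM STDIN").format(
    table=sql.Identifier(Item._meta.db_table),
    fields=sql.SQL(", ").join(sql.Identifier(column) for column in columns),
)

with connection.cursor() as cursor:
    with cursor.copy(statement) as copy:
        for record in records:
            copy.write_row(record)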

A few things to be mindful of regardless of your psycopg version:

  1. If you are using additional databases or advanced routing, make sure you have the correct connection for your Django model.
  2. You can specify auto-increment values in your uploads (such as id in the example Item above), but know that the underlying sequence in Postgres will not be updated. You may need to manually update the sequence before resuming normal inserts on your table (see the sketch after this list).
  3. Similarly, identical or repeat uploads with manually specified auto-increment primary keys will raise UniqueViolation exceptions and cause the copy to be canceled.
  4. Also regarding auto-increment values: errors during copy will result in the operation being canceled, but any auto-increment values generated before the error are still consumed, leaving gaps in the sequence.
  5. Using copy methods will not trigger Model signals, such as pre_save and post_save.
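
Regarding the second point, one way to realign the sequence after a copy that supplied explicit id values is a setval call based on the current maximum. A rough sketch follows; Django’s sqlsequencereset management command will also print suitable statements for you:

from django.db import connection

# The table name comes from the model meta, not user input
table = Item._meta.db_table
with connection.cursor() as cursor:
    # Point the id sequence at the current maximum so the next insert does not collide
    cursor.execute(
        f"SELECT setval(pg_get_serial_sequence('{table}', 'id'), "
        f"(SELECT COALESCE(MAX(id), 1) FROM {table}))"
    )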

Using copy provides a significant speed improvement over Django’s bulk_create, which itself is already an improvement over saving individual models. Here are the results of some rudimentary tests on my M1 MacBook Pro (total elapsed time in seconds):

rows      save()   save() in transaction   bulk_create()   copy_from()
1,000       0.69                    0.30            0.10          0.08
10,000      2.26                    1.25            0.36          0.10
100,000    22.35                    9.41            2.81          0.39

Happy hacking!