
A Quick Start to Connecting to PostgreSQL and Pulling Data into Pandas

What is PostgreSQL?

PostgreSQL is a powerful relational database management system (RDBMS) that many organizations use. Connecting to it is easy, and thanks to the great Python ecosystem, getting your data into a Pandas DataFrame is just as easy. Let's look at a simple example that will help you get started.

Running PostgreSQL Locally

One simple way to run PostgreSQL locally is to use Docker. If you don't have it installed, you can download Docker here. Once you have Docker installed, pulling the image and starting a container takes just two commands.

Head to the Docker Hub and search for postgres. You'll see several different images; we want the official postgres image, which you can pull down with the following command.

docker pull postgres

After that, you can run the following command (per the documentation on Docker Hub) to start a container.

docker run --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=postgrespw -d postgres

You can connect to the DB using any supported SQL tool such as SQL Workbench.

[Screenshot: connecting to the local PostgreSQL instance in SQL Workbench]

CREATE DATABASE MAIN;

CREATE TABLE CONTACTS (
    ID integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    FIRST_NAME varchar(255),
    LAST_NAME varchar(255)
);

INSERT INTO CONTACTS (FIRST_NAME, LAST_NAME)
VALUES ('Brian', 'Roepke');

SELECT * FROM CONTACTS;

Running the above commands will create a database called MAIN and a table called CONTACTS, insert a single row into the table, and then select all the rows back out. Note that after creating MAIN, you need to connect to it (for example with \c main in psql, or by switching databases in your SQL tool) before running the CREATE TABLE statement. That should get you started with a simple PostgreSQL setup locally.
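If you'd rather run the table setup from Python instead of a SQL tool, here is a minimal sketch using psycopg2. It assumes the credentials from the docker run command above (user postgres, password postgrespw, port 5432) and that the MAIN database has already been created.

import psycopg2

# Connect to the MAIN database created above (adjust credentials as needed).
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="main",
    user="postgres", password="postgrespw",
)
# Using the connection as a context manager commits the transaction on success.
with conn, conn.cursor() as cur:
    cur.execute(
        """CREATE TABLE IF NOT EXISTS CONTACTS (
               ID integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
               FIRST_NAME varchar(255),
               LAST_NAME varchar(255)
           )"""
    )
    cur.execute(
        "INSERT INTO CONTACTS (FIRST_NAME, LAST_NAME) VALUES (%s, %s)",
        ("Brian", "Roepke"),
    )
conn.close()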

Imports

The first thing we'll do is import the libraries we need: the psycopg2 driver to connect to PostgreSQL and the pandas library to work with our data. In addition, we'll use os, urllib, and sqlalchemy to help us build the connection to our database.

import pandas as pd
import psycopg2
import os
from sqlalchemy import create_engine
from urllib.parse import quote_plus

Next, we want to ensure we're not hard-coding our sensitive information but rather pulling it from environment variables. If you'd like to learn more about this process, check out this article.

user = os.environ.get("USER")
pw = os.environ.get("PASS")
db = os.environ.get("DB")
host = os.environ.get("HOST")
port = 5432
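As a quick sanity check (a hypothetical helper, not part of the original setup), you can fail fast if any of these variables are missing rather than hitting a confusing connection error later.

# Fail early with a clear message if a required environment variable is unset.
required = {"USER": user, "PASS": pw, "DB": db, "HOST": host}
missing = [name for name, value in required.items() if not value]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")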

Next, we need to ensure any information in our connection string is properly URL encoded, like the username and password, which is easily done with the quote_plus function from urllib.parse. Let's take a quick look at a sample password and how it is encoded.

fake_pw = "p@ssw0rd'9'!"
print(quote_plus(fake_pw))
# p%40ssw0rd%279%27%21

Next, we can create our connection string that contains all the information needed to connect to our database. We will use an f-string (an f before the opening quote) to interpolate our variables, and the quote_plus function to encode the username and password.

In addition to the connection string, we will also create an SQLAlchemy engine.

 uri = f"postgresql+psycopg2://{quote_plus(user)}:\
{quote_plus(pw)}@{host}:{port}/{db}"
alchemyEngine = create_engine(uri)
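If you want to confirm the engine can actually reach the database before going further, here is a small check you can run; it uses SQLAlchemy's text construct, which isn't part of the imports above.

from sqlalchemy import text

# Open a connection, run a trivial query, and close it again.
# This will raise an exception if the database is unreachable.
with alchemyEngine.connect() as conn:
    conn.execute(text("SELECT 1"))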

That's it! Assuming you didn't receive an error, your engine is ready to talk to your PostgreSQL database (SQLAlchemy only opens a connection when you ask for one). Let's look at how we can pull data into a Pandas DataFrame.

First, we'll define a query in the form of a string. A best practice here is to use triple quotes ("""), allowing us to write our query over multiple lines as well as avoid conflicts when our query itself contains quotes, such as WHERE FIRST_NAME = 'Brian'.

q = """SELECT * FROM CONTACTS"""

Next, we will create the connection from the alchemyEngine we created earlier.

dbConnection = alchemyEngine.connect()

We will then use the read_sql function from pandas to pull our data into a DataFrame.

df = pd.read_sql(q, dbConnection)
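If your query needs values from Python, like the WHERE FIRST_NAME = 'Brian' example above, a safer pattern is to bind parameters rather than formatting them into the SQL string yourself. Here is a sketch, reusing the text construct imported for the connectivity check earlier; the column and value are just illustrative.

# Hypothetical parameterized query: the value is bound by the driver
# rather than spliced into the SQL string.
q_by_name = text("SELECT * FROM CONTACTS WHERE FIRST_NAME = :first_name")
df_filtered = pd.read_sql(q_by_name, dbConnection, params={"first_name": "Brian"})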

Now that we have a DataFrame, we can display the first few rows and see our data.

df.head()
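With the single row inserted earlier, you should see one record. Note that the columns come back as id, first_name, and last_name because PostgreSQL folds unquoted identifiers to lowercase.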

A best practice is to close your connection when you're done.

dbConnection.close()
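An equivalent pattern, shown here as a sketch, is to use the connection as a context manager so it is closed for you even if read_sql raises an error.

# The connection is closed automatically when the with block exits.
with alchemyEngine.connect() as conn:
    df = pd.read_sql(q, conn)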

Conclusion

PostgreSQL is a powerful relational database management system (RDBMS) used by many companies. We covered connecting to it and pulling your data into a Pandas DataFrame. We also discussed some best practices, like storing credentials in environment variables and URL encoding your username and password, which often contain characters that aren't valid in a connection URL.