Contoso Postgres
By Dani
I’ve wanted to use Contoso in Postgres for years but never found a project that ported it. I decided to do it myself so I could create a development environment that was portable, open source, and shareable. It turned out to be more of an undertaking than expected, but it was fun. I don’t plan to write a detailed step-by-step of the process, but I have outlined some bullet points below.
Before that, though, if you’re interested in working on the Contoso database in your development environment, you can download my Docker image here.
Using the Contoso Postgres Image
The image is based on the official Postgres image, and you must read its documentation. It applies to this one as well. There is only one requirement when running the container: environment variable POSTGRES_PASSWORD must be specified.
What is different for the Postgres Contoso image is that there’s already a shell script in the init dicrectory which will start a restore of the Contoso data automatically the first time the container runs. As covered in the official image instructions, init scripts are ignored once there is data in the database, so it will only run the first time the container runs.
You can control which database the restore uses in a couple ways. In all cases, Contoso data is restored to the contoso schema. You cannot change that, but you can control the database.
CONTOSO_DB environment variable
If this environment variable is defined, the init shell script will use the value to create a new database and restore the Contoso data to it. Do not use “Postgres” as that is the default db name and will cause unpredictable behavior.
POSTGRES_DB environment varialbe
This environment variable comes from the official image on which the Postgres Contoso image is based. Read the documentation about how it’s used on the official image. If CONTOSO_DB is not specified and POSTGRES_DB is specified, the Contoso init script will not create this db, but it will restore the Contoso data to it.
Otherwise
Contoso data will be restored to the default Postgres database.
Contoso to Postgres Process
DDL
- Export DDL for the tables from SQL Server.
- Run the DDL through a Python script that applied several RegEx in an attempt to Postgres-ify it. This got me about 75% of the way there. The most important things it did were
- Map data types that needed changing
- Modify table and column names from TitleCase to snake_case. (More on that below.)
- I finished up the required DDL changes semi-manually with VS Code’s amazing multi-select editing capabilities and a few more RegEx applied via its Find and Replace feature. Since I only needed to do this once, I found this acceptable.
- Test the DDL in Postgres and tweak until it worked.
- If I started this endeavor again, I would use a Python script from the outset to construct the DDL rather than export SQL Server’s version and modifying that.
Export Data
- Used Microsoft’s command line bcp tool, which is made for bulk export / import.
- Used a Python script to
- Execute one bcp command per table.
- Generate the Postgres COPY command that would later import the data into Postgres.
Container 1
- Based on the official Postgres image.
- Copied the exported data into the container.
- Used Postgres’ native intialization script to copy the data into Postgres.
- This worked great, but the data I had to copy in was 4.8G, and it made the container way too large to distribute.
- Used pg_dump to generate a backup of the imported data with compression. This took the 4.8G of data in 25 text files exported from Contoso and turned it into a neat 523MB script that can be restored back into Postgres.
Container 2
- Based on the official Postgres image.
- Copied the backup from pg_dump into the container.
- Used Postgres’ native initialization script option to restore the backup.
- This is still a hefty image at just under 1G, but it’s a fraction of the size of Container 1.
- This is the image I use to rebuild my dev environment, and it’s what I am making available to others.
- Looks for the environment variable CONTOSO_DB and will create and restore into that if provided. If that is not provided, it will resotre into the db indicated in the POSTGRES_DB environment variable that is a part of the underlying Postgres image. If neither of those is supplied, it will restore in the Postgres database. If you supply Postgres as the value for CONTOSO_DB, expect problems. If that’s the value you want, leave CONTOSO_DB and POSTGRES_DB empty because Postgres is the default database name.
- The default Postgres image requires the environment variable POSTGRES_PASSWORD.
Why snake_case instead of TitleCase?
I have a guilty secret, and that is that I like TitleCase. I find it readable and easier to type than snake_case even though all the cool kids prefer snake_case. Originally, I thought I’d keep the TitleCase; however, my goal with this is to run Contoso on the simplest possible Postgres instance, and Postgres is case sensitive by default. If you encounter Postgres on a job, chances are, it will be case sensitive. I figure people interested in this project want to get familiar with Postgres, and that means getting used to case sensitive queries. Postgres will also smallcase the column names in a query if they aren’t wrapped in double quotes. So using TitleCase in Postgres means either modifying its default case sensitivity or double-quoting all the columns in your queries. Neither was acceptable to me, so I went with snake_case which, if you encounter Postgres in a professional scenario, is what you’re likely to find.
Differences from MS SQL Contoso
- No geography columns.
- No geometry columns.
- snake_case instead of TitleCase
- No views
- No stored procedures
- Dates forwarded from a range of 2009 - 2011 to 2022 - 2024.
- See this article for details on this was accomplished.
- Foreign keys are not defined.
- Primary keys are non-incremementing integers.
- Generally, the Postgres database will not behave the same for practicing ETL since you don’t have the auto-incrementing keys or the cascade relationships.