PostgreSQL cookbook


Document overview

This is a copy of my personal PostgreSQL notes. I write down things I've had to repeatedly look up.

Table of contents

Click to expand table of contents


Last updated: 2021-03-05


Order of execution

Order of execution is different than the written order:

  • Written order: The SELECT statement is written first:
    SELECTFROM + JOINWHEREGROUP BYHAVINGORDER BYLIMIT
  • Execution order: The FROM + JOIN statement is executed first:
    FROM + JOINWHEREGROUP BYHAVINGSELECTORDER BYLIMIT

I conceptually organize these verbs into:

  1. Generate source data

    • FROM + JOIN (required): Create the source data
  2. Filter

    • WHERE (optional): Filter the source data (row-by-row)
    • GROUP BY (optional): Group rows (after filtering if WHERE is used) by one or more variables
    • HAVING (optional): Filter entire groups based on summaries of each GROUP BY group (e.g. COUNT(varname) > 1, MAX(varname) = 10, etc.)
  3. Reduce / Calculate / Select

    • SELECT (required): Calculate and select the final variables. If a GROUP BY has been defined, summary functions are needed for each non-grouped variable of interest to reduce the grouped values to a single summary value. The output of a GROUP BY is a new table, one row per unique combination grouping variables.
  4. Return data

    • ORDER BY (optional): Order the data by one or more variables
    • LIMIT (optional): Return the first X rows

⇧ Back to TOCPostgreSQL: 13.0

Log into a posgres server

Using [psql](https://www.postgresql.org/docs/current/app-psql.html) or [pgcli](https://www.pgcli.com/) (interchangeable in the examples below), the prototype:

pgsql -h [HOST_ADDRESS] -p [PORT] -U [USER_NAME] -d [DATABASE_NAME]

⇧ Back to TOCPostgreSQL: 13.0


Get information on or delete current processes

Get all information about all current processes

SELECT * FROM pg_stat_activity;

Information returned: datid, datname, pid, usesysid, usename, application_name, client_addr, client_hostname, client_port, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event, state, backend_xid, backend_xmin, query, and backend_type.

This command isn't useful at a CLI because the information returned is too large to display in a single screen.

⇧ Back to TOCPostgreSQL: 13.0

Get the most useful information on all current processes

Get the most useful information about all current processes: pid, age, usename, and query.

SELECT pid, age(clock_timestamp(), query_start), usename, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' 
ORDER BY query_start desc;

This is a useful query to issue at a CLI to see current activity.

⇧ Back to TOCPostgreSQL: 13.0


Getting and calculating data

Get unique values of variable

SELECT DISTINCT
  myvar
FROM
  myschema.mytable;

⇧ Back to TOCPostgreSQL: 13.0

Get the first row by datetime per group

Get the first row, by date, for each grouped id_hash. There are two critical components:

  • SELECT DISTINCT ON (group_variable): Groups by group_variable and keeps the first row
  • ORDER BY: Order the elements within the group so that the row of interest occurs first
-- Grouping by id_hash
SELECT DISTINCT ON (id_hash)
  id_hash AS id,
  journal_date
FROM
  myschema.mytable
ORDER BY
  id_hash,
  -- Earliest date should go first in each group
  journal_date DESC,

⇧ Back to TOCPostgreSQL: 13.0

Count the number of NULL entries

COUNT(*) counts all records, COUNT(varname) only counts records with non-null values in varname. The difference is the number of NULL values

SELECT
  count(*) - count(jrn_day) AS numbernull
FROM
  sandbox.amo_tbm_alpha_with_emp_attributes17;

⇧ Back to TOCPostgreSQL: 13.0


Datetime

Create a date sequence by DAY

-- Create a sequence of dates from 2019-01-01 to 2020-06-01
SELECT
  generate_series::date AS cdate
FROM
  generate_series(
      '2019-01-01'::timestamp,
      '2020-06-01'::timestamp,
      interval '1 day')

⇧ Back to TOCPostgreSQL: 13.0

Create a date sequence by WEEK

SELECT
  t.day::date 
FROM
  generate_series(timestamp '2004-03-07',
                  timestamp '2004-08-16',
                  interval  '1 week') AS t(day);

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between two dates or datetimes

PostgreSQL time differences rely on combinations of DATE_PART() functions which extract the timestamp or interval subfield in the unit specified. For example:

  • Timestamp: date_part('hour', timestamp '2021-02-16 20:38:40') \(\rightarrow\) 20.
  • Interval: date_part('hour', timestamp '2021-02-16 20:38:40' - timestamp '2020-12-15 12:35:01') \(\rightarrow\) 63 days 08:03:39 \(\rightarrow\) 8

DATE_PART() only extracts the field in the specific unit position (e.g. year, hour, minute, etc.). Building the total time difference involves combining DATE_PART() components. For example, to find the total number of minutes between two dates.

\[ \Delta\,(\text{Total Minutes}) = (\Delta\, \text{Days} \cdot 24 + \Delta \text{Hours}) \cdot 60 + \Delta \cdot \text{Minutes}\]

And the corresponding SQL code.

SELECT

  (DATE_PART('day', enddatetime::timestamp - startdatetime::timestamp) * 24 + 
      DATE_PART('hour', enddatetime::timestamp - startdatetime::timestamp)
  ) * 60 +

  DATE_PART('minute', enddatetime::timestamp - startdatetime::timestamp)
);

In the samples below startdate and enddate are dates, e.g. 2021-03-01, and starttimestamp and endtimestamp are timestamps, e.g. 2021-03-01 12:56:12.

Find the difference between two dates in YEARS

SELECT
  DATE_PART('year', enddate) - DATE_PART('year', startdate);

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between two dates in MONTHS

diffyears is the result of the in YEARS query above.

SELECT
  (DATE_PART('year', enddate - startdate) - DATE_PART('year', startdate)) * 12 + 
    (DATE_PART('month', enddate) - DATE_PART('month', startdate))

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between to dates in WEEKS

SELECT
  TRUNC(DATE_PART('day', startdate - enddate) / 7);

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between to dates in DAYS

SELECT
  DATE_PART('day', enddate - startdate);

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between two datetimes in HOURS

diffdays is the result of the in days query above.

SELECT 
    DATE_PART('day', enddate - startdate) * 24 + DATE_PART('hour', end - start);

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between two datetimes in MINUTES

SELECT (
  DATE_PART('day', enddatetime::timestamp - startdatetime::timestamp) * 24 + 
    DATE_PART('hour', enddatetime::timestamp - startdatetime::timestamp)
  ) * 60 +

  DATE_PART('minute', enddatetime::timestamp - startdatetime::timestamp)
);

⇧ Back to TOCPostgreSQL: 12.4

Find the difference between two datetimes in SECONDS

SELECT (
  (
    (
    DATE_PART('day', enddatetime::timestamp - startdatetime::timestamp) * 24 + 
    DATE_PART('hour', enddatetime::timestamp - startdatetime::timestamp)
    ) * 60 +

    DATE_PART('minute', enddatetime::timestamp - startdatetime::timestamp)
  ) * 60 +

  DATE_PART('second', enddatetime::timestamp - startdatetime::timestamp)
);

⇧ Back to TOCPostgreSQL: 12.4


Advanced data analysis

Create a histogram

SELECT
  WIDTH_BUCKET(nlabels, 1, 5, 1) AS buckets,
  COUNT(*)
FROM (
  SELECT
    claim_id AS id,
    COUNT(DISTINCT label) AS nlabels
  FROM
    myschema.mytable
  GROUP BY
    claim_id) AS A
GROUP BY
  buckets;

⇧ Back to TOCPostgreSQL: 13.0