General Information

Order of execution

Order of execution is different than the written order:

Written order: The SELECT statement goes first goes first.

          SELECTFROM + JOINWHEREGROUP BYHAVINGORDER BYLIMIT

Execution order: FROM + JOIN goes first.

          FROM + JOINWHEREGROUP BYHAVINGSELECTORDER BYLIMIT

I conceptually organize these verbs into:

STAGE 1: Built parent table

  • FROM + JOIN (required): Create the parent data table

STAGE 2: Filter

  • WHERE (optional): Filter the parent data table rows
  • GROUP BY (optional): Group rows (after filtering if WHERE is used) by some criteria
  • HAVING (optional): Filter entire groups defined by GROUP BY

STAGE 3: Reduce / calculate / select

  • SELECT (optional): Calculate and select the final variables. If a GROUP BY has been defined, each group will be used to create a new row in a new table, one row per group, depending on the methods of calculation and combining defined in SELECT.

STAGE 4: Return data

  • ORDER BY (optional): Order the data by one or more variables
  • LIMIT (optional): Return the first X rows

Log into a posgres server

Using [psql](https://www.postgresql.org/docs/current/app-psql.html) or [pgcli](https://www.pgcli.com/) (interchangeable in the examples below), the prototype:

pgsql -h [HOST_ADDRESS] -p [PORT] -U [USER_NAME] -d [DATABASE_NAME]

Get information on or delete current processes

Get all information about all current processes

SELECT * FROM pg_stat_activity;

Information returned: datid, datname, pid, usesysid, usename, application_name, client_addr, client_hostname, client_port, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event, state, backend_xid, backend_xmin, query, and backend_type.

This command isn't useful at a CLI because the information returned is too large to display in a single screen.

Get the most useful information on all current processes

Get the most useful information about all current processes: pid, age, usename, and query.

SELECT pid, age(clock_timestamp(), query_start), usename, query 
FROM pg_stat_activity 
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' 
ORDER BY query_start desc;

This is a useful query to issue at a CLI to see current activity.


Getting and calculating data

Get unique values of variable

SELECT DISTINCT
  myvar
FROM
  myschema.mytable;

Get the first row by datetime per group

Get the first row, by date, for each grouped id_hash. There are two critical components:

  • SELECT DISTINCT ON (group_variable): Groups by group_variable and keeps the first row
  • ORDER BY: Order the elements within the group so that the row of interest occurs first
-- Grouping by id_hash
SELECT DISTINCT ON (id_hash)
  id_hash AS id,
  journal_date
FROM
  myschema.mytable
ORDER BY
  id_hash,
  -- Earliest date should go first in each group
  journal_date DESC,

Count the number of NULL entries

COUNT(*) counts all records, COUNT(varname) only counts records with non-null values in varname. The difference is the number of NULL values

SELECT
  count(*) - count(jrn_day) AS numbernull
FROM
  sandbox.amo_tbm_alpha_with_emp_attributes17;

Datetime

Create a date sequence by day

I find this useful to create a subquery to LEFT JOIN to

-- Create a sequence of dates from 2019-01-01 to 2020-06-01
SELECT
  generate_series::date AS cdate
FROM
  generate_series(
      '2019-01-01'::timestamp,
      '2020-06-01'::timestamp,
      interval '1 day')

Create a date sequence by week

SELECT
  t.day::date 
FROM
  generate_series(timestamp '2004-03-07',
                  timestamp '2004-08-16',
                  interval  '1 week') AS t(day);

Data analytics

Histograming

SELECT
  WIDTH_BUCKET(nlabels, 1, 5, 1) AS buckets,
  COUNT(*)
FROM (
  SELECT
    claim_id AS id,
    COUNT(DISTINCT label) AS nlabels
  FROM
    myschema.mytable
  GROUP BY
    claim_id) AS A
GROUP BY
  buckets;