Tuesday, October 15, 2013

Use IMMUTABLE on PostgreSQL function to avoid optimization fence

I thought this is the PostgreSQL equivalent of SQL Server's efficient inline table-valued function:
create or replace function get_orgs() returns table
(
ID int,
Favorite int
)
as
$$ 
    select p.person_id, eorg.ounits
    from person p
    join lateral
    (
        select case when p.person_id = 4 then p.person_id / 0 else 7 end
    ) eorg(ounits) on true
$$ language sql;


Using that function on this expression..
select p.*, o.* 
from person p
join get_orgs() o on p.person_id = o.ID
where p.person_id < 4
..produces a divide-by-zero error on PostgreSQL, while SQL Server does not materializes the rows that are greater than or equal to 4, hence this doesn't produce an error on SQL Server:

ERROR:  division by zero
CONTEXT:  SQL function "get_orgs" statement 1

********** Error **********

ERROR: division by zero
SQL state: 22012
Context: SQL function "get_orgs" statement 1


To achieve the same optimized execution plan of SQL Server on PostgreSQL, must add IMMUTABLE on the function:
create or replace function get_orgs() returns table
(
ID int,
Favorite int
)
as
$$ 
    select person_id, eorg.ounits
    from person p
    join lateral
    (
        select case when p.person_id = 4 then p.person_id / 0 else 7 end
    ) eorg(ounits) on true
$$ language sql IMMUTABLE;
When that function is used on the the query above, it doesn't yield divide-by-zero anymore, it's now efficient, it doesn't eager load the function's result, it's like the function is inlined on the query itself, the query that joins to the IMMUTABLE function behaves like a table-deriving query, very efficient, no divide-by-zero will occur, i.e., the RDBMS just expand the function's query to other query, divide-by-zero error won't happen, to wit:
select p.*, o.* 
from person p
join
(
    select p.person_id as ID, eorg.ounits
    from person p
    join lateral
    (
        select case when p.person_id = 4 then p.person_id / 0 else 7 end
    ) eorg(ounits) on true
)
o on p.person_id = o.ID
where p.person_id < 4;
Produces this output, has no divide by zero error:
 person_id | lastname  |  firstname   | nickname | favorite_number | id | favorite 
-----------+-----------+--------------+----------+-----------------+----+----------
         2 | mccartney | james paul   | paul     |                 |  2 |        7
         3 | harrison  | george       |          |                 |  3 |        7
         1 | lemon     | john winston | john     |                 |  1 |        7
(3 rows)
However, PostgreSQL, unlike SQL Server, has an optimization fence on its CTE, i.e., PostgreSQL eagerly loads the result of the CTE. SQL Server's CTE is more efficient than PostgreSQL's CTE on this regard:
with o as
(
    select p.person_id as ID, eorg.ounits
    from person p
    join lateral
    (
        select case when p.person_id = 4 then p.person_id / 0 else 7 end
    ) eorg(ounits) on true
)
select p.*, o.* 
from person p
join o on p.person_id = o.ID
where p.person_id < 4;
That CTE yields the following error on Postgresql, a proof that Postgresql eagerly load the result of the CTE; while on SQL Server it has no error, SQL Server doesn't eagerly load the result of CTE, a proof that SQL Server sees CTE as subject to execution plan rather than an eager-loading mechanism to fetch the results. Eagerly-loaded results has no chance to be optimized, as it looks like a blackbox from the caller, the caller can't optimize what's inside
ERROR:  division by zero

********** Error **********

ERROR: division by zero
SQL state: 22012
If we will follow C language principle design, I wishes PostgreSQL to use an existing keyword to indicate we don't want an optimization fence on a CTE:
with o immutable as
(
    select p.person_id as ID, eorg.ounits
    from person p
    join lateral
    (
        select case when p.person_id = 4 then p.person_id / 0 else 7 end
    ) eorg(ounits) on true
)
select p.*, o.* 
from person p
join o on p.person_id = o.ID
where p.person_id < 4;
If a CTE's query can be re-used, the query is better be made as view instead, a query inside a view don't have optimization fence too, i.e., it also behaves as table-deriving query, view's query just get expanded to other query, this won't produce divide-by-zero error:
create or replace view vw_get_orgs as
    select person_id as ID, eorg.ounits
    from person p
    join lateral
    (
        select case when p.person_id = 4 then person_id / 0 else 7 end
    ) eorg(ounits) on true;


select p.*, o.* 
from person p
join vw_get_orgs o on p.person_id = o.ID
where p.person_id < 4;


Happy Coding!

No comments:

Post a Comment