PostgreSQL is more than just a relational database—it's a sophisticated system with intricate internal mechanisms. This deep dive explores the fundamental architecture of PostgreSQL, from PGDATA and database clusters to the Postmaster process and Multi-Version Concurrency Control (MVCC), revealing how PostgreSQL achieves data consistency and high concurrency.
1. PGDATA: The Database Cluster Root
PGDATA is an environment variable that points to the root directory
of a PostgreSQL database cluster on disk. Technically, this directory contains all the
configuration files, data files, and transaction logs required for the database system to operate.
How PostgreSQL Locates PGDATA
When a PostgreSQL server starts, it needs to know which data and configuration to use. There are two ways to specify this:
- Using the -D parameter: pg_ctl start -D /var/lib/postgresql/data
- Using the PGDATA environment variable: If -D is not specified, the server falls back to the path defined in the PGDATA environment variable.
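From a live session you can also confirm which data directory a running server is using — a minimal query against a standard server setting:

```sql
-- Shows the data directory (PGDATA) of the server you are connected to
SHOW data_directory;
```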
Default Databases in a Fresh Cluster
After running initdb to create a new cluster, PostgreSQL creates three
databases:
template0
Purpose: The pristine root template created by initdb. It
represents the purest state of a PostgreSQL database.
Key Characteristics: When a database is created, the
LC_COLLATE and LC_CTYPE (locale/region) settings from the
template are inherited and cannot be changed later. template0 is also used
when restoring logical backups.
template1
Purpose: The default source for all new user-created databases.
Behavior: When you execute CREATE DATABASE x;, PostgreSQL
creates a copy of template1 and names it x. Any modifications
made to template1 (such as installing extensions) will be inherited by all
subsequently created databases.
postgres
Purpose: The default user database.
Behavior: When you connect to the server (e.g., using psql)
without specifying a database name, the system connects you to the postgres
database by default.
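You can verify these three default databases directly from the cluster-wide pg_database catalog:

```sql
-- A fresh cluster lists exactly template0, template1, and postgres.
-- datistemplate marks template databases; datallowconn shows whether
-- direct connections are permitted (false for template0).
SELECT datname, datistemplate, datallowconn
FROM pg_database
ORDER BY datname;
```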
2. System Catalog: PostgreSQL's Metadata Repository
Metadata about all database objects (tables, indexes, data types, functions, etc.) is stored in the System Catalog. Each database has its own set of catalog tables. However, some catalog tables are cluster-wide and do not belong to any specific database.
Catalog table names are prefixed with pg_, such
as pg_database, pg_class, pg_attribute, etc.
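A quick way to see the catalog in action is to query pg_class, which describes every relation (table, index, sequence, view, ...) in the current database:

```sql
-- relkind encodes the object type: 'r' = table, 'i' = index,
-- 'S' = sequence, 'v' = view, etc.
SELECT relname, relkind
FROM pg_class
WHERE relnamespace = 'pg_catalog'::regnamespace
LIMIT 5;
```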
3. Schemas: Logical Database Partitions
A database is divided into logical partitions called Schemas. You can think of
schemas as a folder structure within a database. Within the same database, you can have two tables
with the same name in different schemas (e.g., schema1.users and
schema2.users), and they will not conflict. Every database object must belong to a
schema.
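The same-name-different-schema behavior from the paragraph above can be demonstrated with a short sketch (schema and table names are the illustrative ones from the text):

```sql
CREATE SCHEMA schema1;
CREATE SCHEMA schema2;

-- Two tables named "users" coexist without conflict,
-- because each lives in its own schema
CREATE TABLE schema1.users (id int PRIMARY KEY);
CREATE TABLE schema2.users (id int PRIMARY KEY);
```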
Default Schemas
public
The default "user" schema. When you create a table without specifying a schema
(CREATE TABLE table_name), the system automatically places it in the
public schema.
pg_catalog
The database's metadata repository. This schema contains all the system catalog tables that store information about database objects.
information_schema
A set of SQL standard-compliant metadata views. This schema provides a standardized way to query database metadata across different SQL database systems.
pg_toast
PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) mechanism. Data pages in PostgreSQL have a fixed size (8 kB by default), and a row must fit on a single page; if a row's values are too large, PostgreSQL compresses them or moves them out-of-line into the associated TOAST table.
4. Tablespaces: Physical Storage Locations
A Tablespace is nothing more than a file system directory at the operating system level. It defines where database objects are physically stored on disk.
Tablespace Relationships
- One Tablespace → Multiple Databases: Multiple databases can share the same tablespace.
- One Database → Multiple Tablespaces: A single database can use multiple tablespaces to store different objects.
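As a sketch of how a tablespace is created and used — the tablespace name and directory path here are hypothetical, and the directory must already exist and be owned by the PostgreSQL OS user:

```sql
-- Map a tablespace onto a directory (e.g., a faster disk)
CREATE TABLESPACE fast_disk LOCATION '/mnt/ssd/pgdata';

-- Place a specific table on that tablespace
CREATE TABLE hot_data (id int) TABLESPACE fast_disk;
```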
5. Postmaster: The PostgreSQL Orchestrator
The Postmaster is the first and primary process that runs when a PostgreSQL database server starts. It is the orchestrator of the system. Technically, PostgreSQL uses a multi-process architecture, and the Postmaster is the parent of all other processes.
Core Responsibilities of Postmaster
1. Initialization and Memory Management
When you execute pg_ctl start, you are actually starting the Postmaster. It
performs the following tasks:
- Validates the database files in PGDATA
- Checks if recovery is needed (e.g., after an unclean shutdown)
- Allocates Shared Memory from the operating system, which serves as a common data pool for all processes
2. Network Listening and Connection Handling
The Postmaster listens on port 5432 by default. When a client sends a connection request,
the Postmaster is the first component to receive it. It initiates the authentication
process using pg_hba.conf.
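On PostgreSQL 10 and later you can inspect the parsed authentication rules without opening the file, via the pg_hba_file_rules system view:

```sql
-- Each row corresponds to one rule line in pg_hba.conf
SELECT line_number, type, database, user_name, address, auth_method
FROM pg_hba_file_rules;
```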
3. Forking Backend Processes
The Postmaster creates a new process for each incoming connection. It never executes SQL queries itself; it only opens the door, assigns a new worker (backend process), and returns to listening for new requests. This ensures that heavy operations by one user cannot lock the main server.
4. Supervision and Recovery
The Postmaster monitors the health of all child processes (backend workers, autovacuum, checkpointer, etc.). If a child process crashes unexpectedly, the Postmaster intervenes and takes corrective action.
Key Background Processes Managed by Postmaster
Startup Process
Runs during pg_ctl start. If the system did not shut down cleanly, it replays
WAL (Write-Ahead Log) files to recover the database to a consistent state.
Autovacuum Launcher/Worker
PostgreSQL's automatic cleanup mechanism. It removes dead tuples (rows) left by deleted or updated records, freeing up disk space and preventing transaction ID wraparound.
WAL Writer
Periodically flushes WAL (Write-Ahead Log) records from the WAL buffers to disk. This spreads out I/O load and, together with the flush performed at commit time, ensures that committed changes survive a crash.
Checkpointer
Creates checkpoints by flushing dirty pages from shared buffers to disk, ensuring data durability and reducing recovery time.
Background Writer
Assists the Checkpointer by gradually writing dirty pages to disk before a checkpoint occurs, reducing I/O spikes.
6. SQL Standard Isolation Levels
Isolation levels determine how much concurrent transactions can affect each other. The fundamental principle is: the higher the isolation level, the greater the data consistency, but the lower the performance. There is an inverse relationship between consistency and concurrency.
Data Inconsistency Anomalies
Before understanding isolation levels, we must first understand the types of anomalies that can occur when multiple transactions run concurrently.
Dirty Read
Scenario: Transaction B updates a row but does not commit. Transaction A reads this uncommitted (dirty) data. Then Transaction B rolls back. Transaction A has read data that never actually existed in the database.
Non-Repeatable Read
Scenario: Transaction A reads a row (SELECT). Then Transaction B updates that row and commits. Transaction A reads the same row again and sees a different value.
Phantom Read
Scenario: Transaction A executes a query with a condition (e.g., WHERE age > 18) and gets five rows. Transaction B inserts a new row that matches this condition and commits. Transaction A re-executes the same query and now sees six rows.
Serialization Anomaly
Scenario: When transactions run in parallel, the final result does not match any possible outcome if those transactions had been executed sequentially.
SQL Standard Isolation Levels
The SQL standard defines four isolation levels based on which anomalies are permitted:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Serialization Anomaly |
|---|---|---|---|---|
| Read Uncommitted | Allowed | Allowed | Allowed | Allowed |
| Read Committed | Prevented | Allowed | Allowed | Allowed |
| Repeatable Read | Prevented | Prevented | Allowed | Allowed |
| Serializable | Prevented | Prevented | Prevented | Prevented |
- Read Uncommitted does not exist in PostgreSQL (it is treated as Read Committed)
- In PostgreSQL, Repeatable Read is stronger than the standard requires: it is implemented with Snapshot Isolation, which also prevents Phantom Reads
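The isolation level is set per transaction. A minimal sketch, assuming a hypothetical accounts table, showing how Repeatable Read in PostgreSQL avoids both non-repeatable and phantom reads:

```sql
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SELECT count(*) FROM accounts WHERE age > 18;
-- Concurrent commits by other transactions (updates or new
-- matching rows) are invisible to this snapshot...
SELECT count(*) FROM accounts WHERE age > 18;  -- same result as before

COMMIT;
```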
7. Multi-Version Concurrency Control (MVCC)
PostgreSQL achieves high concurrency and isolation using Transaction IDs (XID). Every transaction that writes data is assigned a unique, sequentially increasing number. These IDs allow PostgreSQL to create an "identity card" for each row (tuple) in the database.
The xmin and xmax System Columns
Although you don't see them in a regular SELECT * FROM table query, every row on disk
has the following metadata:
- xmin (Creator XID): Which transaction created this row?
- xmax (Deleter XID): Which transaction deleted or updated this row?
How MVCC Works: A Simple Example
Let's assume the current Transaction ID is 100.
Step 1: INSERT (Creation)
Transaction 100 inserts a row.
- The row is written to disk
- xmin: 100 (I created it)
- xmax: 0 (Not yet deleted, still alive)
Step 2: DELETE (Deletion)
Transaction 101 deletes this row.
- PostgreSQL does NOT physically delete the row from disk! (This is crucial!)
- It only writes xmax: 101 on the row
- The row is still on disk, but it is marked as "Transaction 101 killed me"
Step 3: UPDATE (Modification)
Transaction 102 updates the row.
- In PostgreSQL, UPDATE = DELETE + INSERT
- The old row's xmax is set to 102 (I killed it)
- A new row is created with xmin: 102 (I created it)
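You can watch this happen yourself: xmin and xmax are hidden system columns, but they can be selected explicitly (the table here is a throwaway example):

```sql
CREATE TABLE t (v int);

INSERT INTO t VALUES (1);
SELECT xmin, xmax, v FROM t;  -- xmin = inserting XID, xmax = 0

UPDATE t SET v = 2;
SELECT xmin, xmax, v FROM t;  -- a new row version with a newer xmin
```

The actual XID values depend on your session, but the xmin of the updated row will always be higher than that of the original insert.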
Visibility Rules
When a transaction reads data, PostgreSQL determines which row versions are visible based on the
transaction's snapshot and the xmin/xmax values:
- If
xminis committed and less than the transaction's snapshot, the row is visible - If
xmaxis set and committed before the transaction's snapshot, the row is not visible (it was deleted) - Multiple versions of the same row can exist on disk simultaneously, allowing different transactions to see different versions
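PostgreSQL exposes the machinery behind these rules through built-in functions — txid_current() is long-standing, while pg_current_snapshot() requires PostgreSQL 13 or later:

```sql
-- The XID assigned to the current transaction
SELECT txid_current();

-- The current snapshot as xmin:xmax:xip_list --
-- the visibility boundary plus the list of in-progress XIDs
SELECT pg_current_snapshot();
```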
Practical Implications
Best Practices for PostgreSQL
Choose the Right Isolation Level
Use Read Committed for most applications (PostgreSQL's default). Use Repeatable Read when you need consistent reads within a transaction. Use Serializable only when absolute consistency is critical and you can handle serialization failures.
Monitor and Tune Autovacuum
Since PostgreSQL does not physically delete rows immediately, dead tuples accumulate.
Autovacuum is essential for reclaiming space. Monitor pg_stat_user_tables to
ensure autovacuum is running effectively.
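A starting point for that monitoring — this query surfaces the tables with the most dead tuples and when autovacuum last processed them:

```sql
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A large n_dead_tup with a stale (or NULL) last_autovacuum is a sign that autovacuum is falling behind or being blocked.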
Understand Transaction ID Wraparound
PostgreSQL uses 32-bit transaction IDs, which can wrap around after ~2 billion transactions. Autovacuum prevents this by freezing old tuples. Ensure autovacuum is not disabled or blocked.
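Wraparound risk can be checked per database — age() reports how far each database's oldest unfrozen XID lags behind the current one:

```sql
-- Values approaching autovacuum_freeze_max_age (200 million by
-- default) will trigger aggressive anti-wraparound vacuums
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```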
Monitor Bloat
Frequent updates can cause table and index bloat due to dead tuples. Use tools like
pgstattuple to monitor bloat and consider periodic VACUUM FULL or
REINDEX for heavily updated tables (keep in mind that VACUUM FULL takes an exclusive lock on the table for its duration).
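A minimal sketch of using the pgstattuple extension — the accounts table name is a hypothetical placeholder:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- dead_tuple_percent is the share of the table occupied by dead
-- tuples; a persistently high value indicates bloat
SELECT table_len, dead_tuple_count, dead_tuple_percent
FROM pgstattuple('accounts');
```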
Conclusion
Understanding PostgreSQL's internal architecture—from PGDATA and the Postmaster process to MVCC and transaction isolation—is essential for building robust, high-performance database applications. The abstractions provided by SQL are powerful, but knowing what happens under the hood enables you to make informed decisions about schema design, transaction management, and performance tuning.
The next time you execute a query or encounter a performance issue, you'll know exactly where to look: check the isolation level, examine autovacuum activity, review transaction ID usage, and understand the MVCC mechanisms at play.