PGVECTOR IS THE NEW BLACK
In the ever-evolving world of PostgreSQL, pgvector is the latest buzz among my database-enthusiast friends. Although I initially brushed it off, since I tend to stay away from hype-driven development, my perspective shifted dramatically during a recent workshop at the Prague PostgreSQL Developer Day (P2D2). To truly appreciate the significance of pgvector, it is essential to first understand the broader landscape of data science and data engineering.
A few years back, I was working for a major European telecom company with a formidable team of data scientists focused on predictive analytics. This team relied on a data warehouse that aggregated information from 82 different source systems. However, the raw data in this warehouse was often messy and unsuitable for direct analysis. To address this, we built a specialized data mart that served cleaner, more manageable datasets, with flatter tables and historical versions of the data.
The bulk of our time was consumed by cleaning data, rerunning the ETL pipelines that loaded the data mart, fixing broken data streams, and correcting errors in the source data. Only after this exhaustive process could the data scientists run their models and work on improving metrics like precision and recall. Frequently, the data mart schema needed adjustments to support new models, which added further complexity and delays; it often took several months to deliver a definitive answer. While the data scientists had a clear vision of the desired outcome, that vision was not fully shared with the data engineers, leading to numerous iterations before reaching the final state. The GIF below illustrates this phenomenon in simpler terms:
This is where pgvector could become a game-changer.
Record scratch… pump the brakes! I’m not going to dive into the nitty-gritty details of what vectorization of data is, what vectors are, or how to install pgvector. There’s plenty of excellent documentation and blogs out there for that. Instead, let’s take a moment to explore some of the intriguing activities we tackled during the workshop.
One of our challenges was to find images of muffins among a set of chihuahua and muffin images. We started with a straightforward schema, and pgvector was already installed:
CREATE EXTENSION vector;
CREATE SCHEMA p2d2;
CREATE TABLE p2d2.pictures (
id BIGSERIAL PRIMARY KEY,
filename text NOT NULL,
embedding vector(768) NOT NULL
);
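-- Keep the embedding column inline and uncompressed (PLAIN storage);
-- presumably done to avoid TOAST/compression overhead on the vector data.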
ALTER TABLE p2d2.pictures ALTER COLUMN embedding SET STORAGE PLAIN;
We used a script, load_embeddings.py, to vectorize the data (i.e. generate the embeddings) and insert it into the pictures table. After loading the embeddings, we could freely query the table and use the vector operators to find the nearest neighbours, among other things. These queries were handled in challenge.py, and we also stored the results in HTML format. You can find all the scripts on the GitHub of our instructor, Gulcin from EDB.
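To give a flavour of what such a query looks like, here is a minimal sketch of a nearest-neighbour lookup against the schema above. It is not lifted from the workshop scripts: the reference filename and the choice of the cosine-distance operator (<=>) are my own illustration; pgvector also provides <-> for Euclidean distance and <#> for negative inner product.
-- Minimal sketch: find the five stored images whose embeddings are closest
-- (by cosine distance) to the embedding of one reference picture.
-- 'img1c.jpg' is an illustrative filename, assumed to exist in the table.
SELECT p.filename,
       p.embedding <=> q.embedding AS cosine_distance
FROM p2d2.pictures AS p
CROSS JOIN (
    SELECT embedding
    FROM p2d2.pictures
    WHERE filename = 'img1c.jpg'
) AS q
WHERE p.filename <> 'img1c.jpg'
ORDER BY p.embedding <=> q.embedding
LIMIT 5;
On a dataset of this size a sequential scan is perfectly fine; for larger tables you would typically also add an approximate index (ivfflat, or hnsw in newer pgvector releases) on the embedding column.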
#generating vector data (embeddings)
./load_embeddings.py /home/dataset/chihuahua_muffin/muffin /var/run/postgresql/ 5432 p2d2db02 user02
./load_embeddings.py /home/dataset/chihuahua_muffin/chihuahua /var/run/postgresql/ 5432 p2d2db02 user02
#similarity search on the image dataset (finding the images that resemble a muffin among all the images)
.the_challenge/chihuahua_or_muffin.py /home/dataset/chihuahua_muffin/challenge/img1c.jpg /var/run/postgresql 5432 p2d2db02 user02
#To view the HTML results, we ran the following command:
user02@p2d2lab01:~/.the_challenge$ python3 -m http.server 8002
#and opened http://ip-address-of-server:8002/ to see the results
And Postgres said “Let there be light.”
As you can see, a data engineer (or a DBA in my case) can easily tackle “the challenge.” While the choice of embedding models and the overarching vision of the project will always be the domain of data scientists, data engineers can swiftly verify the entire pipeline, making it easier for them to grasp the project’s essence. This significantly reduces the back-and-forth between different teams, resulting in a much more agile and efficient pipeline.
Pgvector is still at an early stage and continually evolving. While there are currently limitations around parallelization and horizontal scaling, I am confident these challenges will be addressed soon, just as similar issues have been tackled in the PostgreSQL database engine itself. The future is promising, and indeed, pgvector is the new black.