Hacking Postgres Internals - Indexing Schemes for Data Recording Systems | Akash Trehan

Hacking Postgres Internals - Indexing Schemes for Data Recording Systems

Databases

Project Report

Team DataAcids  
Harshith Goka  
Akash Trehan  
Abhishek Kumar  
Tarun Verma  

For the backstory on this project read this first.

The code for this project has been open-sourced on Github.

Introduction

Every minute, 600,000 pieces of content are shared on Facebook, and more than 100,000 tweets are sent. And that does not even begin to scratch the surface of data generation, which spans to sensors, medical records, corporate databases, and more. With such a high amount of data being stored, viewed and analysed, a demand for high performance comes as a must. Hence, the need of the hour is that the data should be stored and retrieved quite efficiently without the performance being compromised.

The reference paper for the project can be found here.

Objectives

  1. Design a technique that supports both insertion and queries with reasonable efficiency, and without the delays of periodic batch processing.
  2. Implement this on top of PostgreSQL, one of the most popular open-source DBMS.

Functionalities

  1. Insertion of new tuples into the relation along with updating the corresponding stepped-merge index
  2. Search using the custom index we implement

System Architecture

Front-end

Back-end

  1. Implemented a structure similar to Log Structured Merge trees(Stepped Merge Trees) to organize the incoming data on the basis of clustering by search key.
    • Worked in single-user mode
    • Not handled concurrency control and recovery issues
  2. Implemented in the C language.
  3. Used Eclipse IDE for debugging and building the project.

Engineering details:-

Our goal was to maintain multiple indices(runs) for maintaining the actual index. Only one of the indices(run) would be in the memory at a particular time and would act as an index for the latest incoming data. After this run fills up the memory it is written to the disk using B-tree bottom up build. Both the in memory run and the one just constructed are Level -1 runs.

We have implemented a stepped-merge algorithm as suggested in the paper. There are two parameters to the algorithm K (denoting number of maximum number of trees at any level) and N(Number of levels). When K runs of level i accumulate on disk, we merge them to create a single i+1 level run. When finally a N level run is reached, we write it to the root relation.

    DATA(insert OID = 9399 (  smerge        smergehandler i ));
    DESCR("stepped merge index access method");
    #define SMERGE_AM_OID 9399
    typedef struct SmMetadata {
            int K;
            int N;

            int attnum;
            AttrNumber attrs[INDEX_MAX_KEYS];

            int levels[MAX_N];
            Oid tree[MAX_N][MAX_K];

            int currTuples;

            Oid curr;
            Oid root;

            bool unique;
    } SmMetadata;

Run Through

K = 3, N = 3, max_tuple_per_index = 4

create table foo (uid int, name varchar(20)); # Create a sample table create index sm on foo using smerge (uid); # Creates the smerge index

insert into foo values (1, 'axzagd');
insert into foo values (2, 'axzagd');
insert into foo values (3, 'axzagd');
insert into foo values (4, 'axzagd');
——————– Memory index fills up. A new index1 is created and the filled index goes to level 0
insert into foo values (5, 'axzagd');
insert into foo values (6, 'axzagd');
insert into foo values (7, 'axzagd');
insert into foo values (8, 'axzagd');
——————– Similar index 2 is created
insert into foo values (9, 'axzagd');
insert into foo values (10, 'axzagd');
insert into foo values (11, 'axzagd');
insert into foo values (12, 'axzagd');
——————– Similar index 3 is created. Level 0 fills up. Index 1, 2, 3 are merged to create a level 1 index.
insert into foo values (13, 'axzagd');
insert into foo values (14, 'axzagd');
insert into foo values (15, 'axzagd');
insert into foo values (16, 'axzagd');
——————– New level 0 index is created and so on.
insert into foo values (17, 'axzagd');
insert into foo values (18, 'axzagd');
insert into foo values (19, 'axzagd');
.
.
After N-1th level fills up, it is merged with the single root relation.

Further Work

Resources

https://www.postgresql.org/docs/9.6/static/xindex.html (Prequel for the below) https://www.postgresql.org/docs/9.6/static/indexam.html https://www.postgresql.org/files/developer/internalpics.pdf https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf

*All mentions of B-tree actually refer to B+ trees




Follow @CodeMaxx
Database course project and how I almost ditched it!
Akash Trehan

Akash Trehan

Hacker-Developer-Geek

comments powered by Disqus
rss facebook twitter github youtube mail spotify instagram linkedin google pinterest medium