Akash Trehan: Hacker-Developer-Geek (feed: https://www.akashtrehan.com/feed.xml, generated by Jekyll, updated 2023-03-02T10:45:50+00:00)
SpamSlam - A blockchain solution to less spam (published 2018-06-06T08:20:00+00:00, https://www.akashtrehan.com/spamslam)
<p><img src="/assets/images/SpamSlam.png" alt="SpamSlam" /></p>
<p><strong>Note:</strong> A post elaborating on my experience at this hackathon was published <a href="https://hackinout.co/blog/blockchain-track-winner-spamslam/">here</a></p>
<p>This project was done as part of <a href="https://hackinout.co/">Hack InOut</a>, a 30-hour hackathon held in Bengaluru in Fall 2017. There was a blockchain track, and since we had never explored this trending technology before, we figured we would do something in this space. However, our idea can be implemented without blockchain as well, and perhaps in a better manner. With due gratitude to Gnosis, who sponsored the first prize we won in the Blockchain Track and whose platform we used in the hackathon implementation, I will discuss the idea independent of any implementation.</p>
<p>Spam prevention is costly. According to a 2012 study by the American Economic Association, spam costs the world $50 billion and earns the spammers about $50 million. This is one of the more modest estimates. The study also attributed the problem to how easy it is to send spam, and suggested that a heavier penalty on senders would do the world good. Drawing on one of the many crazy ideas of Vitalik Buterin, the Ethereum co-founder, we wondered: what if sending emails did cost us? And then one wonders, isn’t this the same as putting stamps on letters? And perhaps one could paste a costlier stamp if one wants the message delivered faster?</p>
<p>Consider additionally the culture shift that came with the exponential growth of telecommunications. Calling someone was the way to go, since it was much quicker than sending letters and waiting a long time for a reply. But as it turns out, we prefer communication to be less instant, so that the other person can take their time to reply. Along the lines of a similar cultural oscillation, perhaps we would prefer that sending emails not be free. If we had a system of digital stamps, imagine what more we could achieve. If someone has paid for a costlier stamp, their message will be delivered to you faster. The urgency is reflected in the price the sender pays. For friends and family, it would be easy to have a setting where they don’t have to pay anything to contact you, but for everyone else, you end up with a really effective filter.</p>
<p>There is one major challenge. How do you shift our civilisation from the current model to this one? We have no clear answer for this, and if we did, we would be off implementing it. Apart from this, we spent a long time looking for other loopholes in the plan, and we were able to shoot down every objection we came up with. As part of the hackathon prize, we also won office hours with YCombinator, who did not see any problems with the idea beyond the one I’ve already highlighted. I would be delighted to discuss this further, and you are welcome to write to me.</p>
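<p>The stamp idea can be sketched as a priority queue. This is only an illustration and not the hackathon implementation: the class name, the price floor, and the choice to deliver free mail from friends after paid mail are all assumptions made for the example.</p>

```python
import heapq
from itertools import count


class StampedInbox:
    """Inbox that delivers messages in order of the stamp the sender paid for."""

    def __init__(self, free_senders=(), minimum_stamp=1):
        self.free_senders = set(free_senders)  # friends and family pay nothing
        self.minimum_stamp = minimum_stamp     # hypothetical price floor for strangers
        self._queue = []                       # max-heap via negated stamp value
        self._arrival = count()                # tie-breaker: earlier arrival first

    def receive(self, sender, body, stamp=0):
        """Accept a message, or return False if a stranger underpaid."""
        if sender not in self.free_senders and stamp < self.minimum_stamp:
            return False
        heapq.heappush(self._queue, (-stamp, next(self._arrival), sender, body))
        return True

    def next_message(self):
        """Pop the message with the costliest stamp, or None if the inbox is empty."""
        if not self._queue:
            return None
        _, _, sender, body = heapq.heappop(self._queue)
        return sender, body
```

<p>An unstamped message from a stranger is rejected outright, which is the filter described above, while a costlier stamp moves a message to the front of the queue.</p>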
<p>You can find the full presentation <a href="https://speakerdeck.com/codemaxx/hackinout-blockchain-winner-spamslam?slide=1">here</a>.</p>
<p>Thanks to <a href="http://cheekujodhpur.github.io/">Kumar Ayush</a> and Kumar Ashutosh for working with me on this project. Also thanks to Kumar Ayush for letting me borrow this post from his blog.</p>
HackInOut 4.0 Winners - My First Blockchain Project (published 2018-06-06T08:15:00+00:00, https://www.akashtrehan.com/hackinout)
My Other Computer is Your Computer - Malware Classification (published 2018-05-06T00:00:00+00:00, https://www.akashtrehan.com/my-other-computer-is-your-computer)
<p><img src="/assets/images/malware/malware.jpg" alt="Malware" /></p>
<p>Source: <a href="https://www.cyberpointllc.com/images/svcs/img-mare.jpg">https://www.cyberpointllc.com/images/svcs/img-mare.jpg</a></p>
<h2 id="background">Background</h2>
<h4 id="why-this-project">Why this project?</h4>
<p>This project grew out of a requirement of the Machine Learning course (EE 769) I took this semester. As I have always been interested in computer security, I wanted to combine it with what I had newly learnt in this course. Last year’s attacks by malware like <a href="https://www.wikiwand.com/en/WannaCry_ransomware_attack">WannaCry</a>, <a href="https://www.wikiwand.com/en/Petya_(malware)#/2017_Cyberattack">NotPetya</a> and <a href="https://www.kaspersky.com/blog/bad-rabbit-ransomware/19887/">Bad Rabbit</a> had made me curious about how such attacks could be prevented. Malware classification was the perfect project. For this I teamed up with my good friend Mukesh Pareek, who is also a security enthusiast. This post is written in collaboration with him. Our guide for this project is <a href="https://www.ee.iitb.ac.in/web/people/faculty/home/asethi">Prof. Amit Sethi</a>.</p>
<h4 id="malware">Malware</h4>
<p><img src="/assets/images/malware/classes.jpg" alt="Malware Classes" /></p>
<p>Source: <a href="http://thepcworks.com/malware/">http://thepcworks.com/malware/</a></p>
<p>Wikipedia defines <a href="https://www.wikiwand.com/en/Malware">malware</a> as:</p>
<blockquote>
<p>Malware, short for malicious software, is an umbrella term used to refer to a variety of forms of hostile or intrusive software, including computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware, and other intentionally harmful programs.</p>
</blockquote>
<p>The definition tells us that there are many classes of malware. These categories are made based on how the malware propagates as well as what its intent is.</p>
<h2 id="the-problem">The Problem</h2>
<p>On hearing about malware classification two things come to mind:</p>
<ol>
<li>Separating malware from benign files</li>
<li>Given a malware, identifying which class the malware belongs to</li>
</ol>
<p>We focused on solving the second problem.</p>
<p>But why classify malware into classes? Just as a doctor can treat you better if he/she knows what disease you have, anti-malware software such as antivirus programs can defend better if it knows the class of malware it is dealing with.</p>
<h4 id="challenges">Challenges</h4>
<p>Earlier, malware was detected using signatures. Whenever a new malware sample was found, the companies created its signature, and any file whose signature matched was flagged. Since this method could only find exact matches, it was very restrictive. Later, people shifted to identifying “indicators”, the defining properties of a class of malware. If a file had certain indicators, it could be assigned to the corresponding class. But this again relies only on known indicators. With new obfuscation and polymorphism techniques, the creators of malware were easily able to get past this layer of protection. The problem we face today is that millions of malware samples spread every day. Most of these are duplicates or slight modifications of one another, made to deceive the defense systems. Classifying malware thus becomes a daunting task.</p>
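<p>To see why exact-match signatures are so restrictive, here is a minimal sketch. It assumes, purely for illustration, that a signature is the SHA-256 hash of the whole file. Real antivirus signatures are often byte patterns rather than whole-file hashes, and the sample bytes below are made up.</p>

```python
import hashlib


def signature(data: bytes) -> str:
    """Stand-in signature: the SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, known_signatures: set) -> bool:
    """Exact-match detection: flag the file only if its signature is on the list."""
    return signature(data) in known_signatures


original = b"\x4d\x5a\x90\x00 malicious payload"  # made-up sample bytes
variant = original + b"\x00"                      # the same sample with one byte appended
known = {signature(original)}

print(is_known_malware(original, known))  # the catalogued sample is caught
print(is_known_malware(variant, known))   # the trivially modified copy slips through
```

<p>A single appended byte changes the hash completely, so every slight modification needs its own signature.</p>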
<h4 id="why-machine-learning">Why Machine Learning?</h4>
<p>In today’s data-rich world, machine learning has become ubiquitous. With its ability to extract useful information from data, machine learning has led to great results. The defense against a malware attack depends on the broader category of malware and not necessarily on the specific attack sample. This is why machine learning can be used: it can find hidden relationships among the various features of the samples and then leverage those to classify unknown samples.</p>
<h4 id="more-specifically">More specifically…</h4>
<p>Now we come to exactly what information we have about the malware and which categories we need to classify it into.</p>
<p>For every malware sample, the input we have is:</p>
<ul>
<li><em>.asm file</em> - This contains the assembly code for the malware program and can be used to extract information about instruction calls, segments etc.</li>
</ul>
<p><img src="/assets/images/malware/asm.png" alt="Screenshot of .asm file" /></p>
<figcaption class="caption">Snippet from asm file</figcaption>
<ul>
<li><em>.bytes file</em> - This contains the hexadecimal representation of the file’s binary content. It can be used to extract information about the lower-level functioning of the malware.</li>
</ul>
<p><img src="/assets/images/malware/bytes.png" alt="Screenshot of .bytes file" /></p>
<figcaption class="caption">Snippet from bytes file</figcaption>
<p>So these two files need to be used to classify the malware into the following 9 families:</p>
<ol>
<li>Ramnit</li>
<li>Lollipop</li>
<li>Kelihos_ver3</li>
<li>Vundo</li>
<li>Simda</li>
<li>Tracur</li>
<li>Kelihos_ver1</li>
<li>Obfuscator.ACY</li>
<li>Gatak</li>
</ol>
<h4 id="the-dataset">The dataset</h4>
<p>The specific problem stated above is taken from a <a href="https://www.kaggle.com/c/malware-classification">malware classification challenge</a> organised by Microsoft on Kaggle, and the dataset was taken from there as well. The dataset contained 200 GB of training data and 200 GB of test data. Since we didn’t have the labels for the test data, we divided the training data itself into two parts, one of which we used for testing. The testing part was half the size of the training part, with 7221 training samples and 3648 test samples. This division was done randomly, while ensuring that enough members of each class were part of both the training and test sets.</p>
<p>Each malware sample had a 20-character ID. We had a CSV file containing the ID-to-class mapping of the training samples.</p>
<p><img src="/assets/images/malware/eda.png" alt="Dataset Class Graph" /></p>
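<p>A random split that still keeps every class represented on both sides is usually called a stratified split. The sketch below shows one way to do it with the standard library only. The function name, the fixed seed and the one-third test fraction are assumptions for the example, not our exact procedure.</p>

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Split sample IDs into (train, test) so each class appears on both sides.

    `labels` maps sample ID -> class, like the ID-to-class CSV described above.
    """
    by_class = defaultdict(list)
    for sample_id, cls in labels.items():
        by_class[cls].append(sample_id)

    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(by_class):
        ids = sorted(by_class[cls])
        rng.shuffle(ids)
        # At least one test sample per class, but never the whole class.
        n_test = min(max(round(len(ids) * test_fraction), 1), len(ids) - 1)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

<p>With a test fraction of one third, the test part ends up about half the size of the training part, which matches the proportions quoted above.</p>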
<h2 id="preprocessing-and-feature-extraction-">Preprocessing and Feature Extraction <sup id="fnref:code" role="doc-noteref"><a href="#fn:code" class="footnote" rel="footnote">1</a></sup></h2>
<p>The features we used for classification are as follows:</p>
<ol>
<li>
<p>Instruction n-grams from the <code class="language-plaintext highlighter-rouge">.asm</code> file - We extracted the list of instructions from the .asm file and used the count of each instruction (1-gram) and each instruction-instruction pair (2-gram).</p>
</li>
<li>
<p>Byte n-grams from the <code class="language-plaintext highlighter-rouge">.bytes</code> file - We used the hexadecimal representation to extract the byte sequence of the actual malware. Then we used the 1-gram and 2-gram counts as our features.</p>
</li>
<li>
<p>Segment size - We stored the number of lines in each of the segments: Header, Data, Text etc. This information is extracted from the <code class="language-plaintext highlighter-rouge">.asm</code> files.</p>
</li>
<li>
<p>Pixel intensity of <code class="language-plaintext highlighter-rouge">.asm</code> files - We converted the <code class="language-plaintext highlighter-rouge">.asm</code> file into an image and then extracted the last 1000 pixels of the image as features.</p>
</li>
</ol>
<p>Our intuition behind using instruction n-grams was that samples from the same class of malware should have similar code, and hence similar instruction sequences. n-grams are a way to represent that. The same holds for the byte n-grams. Using segment size is again based on the intuition that the amount of static data and the amount of space required for the code would be similar within a class.</p>
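<p>As a rough illustration of the n-gram features, the sketch below counts 1-grams and 2-grams over a token list. It is not our extraction code (that used <code class="language-plaintext highlighter-rouge">pyparsing</code>). The <code class="language-plaintext highlighter-rouge">.bytes</code> line layout it assumes, an address followed by hex bytes with <code class="language-plaintext highlighter-rouge">??</code> for unknown bytes, is an assumption about the dataset format.</p>

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens; keys are n-tuples."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def parse_bytes_line(line):
    """Parse one .bytes line: drop the leading address, skip unknown '??' bytes."""
    fields = line.split()[1:]
    return [int(tok, 16) for tok in fields if tok != "??"]


instructions = ["push", "mov", "call", "mov", "call"]
print(ngram_counts(instructions, 1))  # instruction 1-grams
print(ngram_counts(instructions, 2))  # instruction-instruction pairs

byte_values = parse_bytes_line("00401000 56 8D ?? 24 08")
print(ngram_counts(byte_values, 2))   # byte 2-grams
```

<p>The same counting function serves both the instruction and the byte features, since an n-gram is just a window over a sequence of tokens.</p>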
<h4 id="implementation-details">Implementation details</h4>
<p>Extracting the above features involves text processing and parsing. For this we used the <code class="language-plaintext highlighter-rouge">pyparsing</code> Python library. The library lets you specify token formats, which makes it easier to identify the required instructions or bytes. To get an image from the <code class="language-plaintext highlighter-rouge">.asm</code> file we used byte arrays.</p>
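<p>The image feature can be pictured as follows: every byte of the file is treated as one grayscale pixel with an intensity from 0 to 255, and the last 1000 of them become the feature vector. This is a minimal sketch under that reading. The zero-padding for files shorter than 1000 bytes is an assumption added for the example.</p>

```python
def last_pixels(data: bytes, n: int = 1000) -> list:
    """Return the last n byte values of `data` as pixel intensities (0-255).

    Files shorter than n bytes are left-padded with zeros (an assumption).
    """
    tail = data[-n:]
    return [0] * (n - len(tail)) + list(tail)


sample = bytes(range(256)) * 10   # 2560 bytes of made-up file content
features = last_pixels(sample)
print(len(features), features[-1])
```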
<p>To speed up feature extraction we used <code class="language-plaintext highlighter-rouge">ProcessPoolExecutor</code> from the <code class="language-plaintext highlighter-rouge">concurrent.futures</code> module, which made sure that all the cores were being used for processing.</p>
<p>After extracting the features we dumped them to a file so that the processing need not be done again.</p>
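<p>The two ideas above, extracting in parallel and dumping the result so it is never recomputed, fit together roughly like this. The extractor here is a toy (byte 1-gram counts), and the JSON dump format, the function names and the <code class="language-plaintext highlighter-rouge">executor_cls</code> parameter are assumptions for the sketch, not what our code does.</p>

```python
import json
import os
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def extract_features(path):
    """Toy per-file extractor: byte 1-gram counts keyed by two-digit hex value."""
    with open(path, "rb") as f:
        counts = Counter(f.read())
    return path, {format(value, "02x"): n for value, n in counts.items()}


def extract_all(paths, cache_path, executor_cls=ProcessPoolExecutor):
    """Extract features for every path in parallel, reusing a dump if one exists."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)            # skip the expensive pass entirely
    with executor_cls() as pool:           # one worker per core by default
        features = dict(pool.map(extract_features, paths))
    with open(cache_path, "w") as f:
        json.dump(features, f)             # dump once, read many times
    return features
```

<p>When running this as a script with the process pool, the call to <code class="language-plaintext highlighter-rouge">extract_all</code> should sit under an <code class="language-plaintext highlighter-rouge">if __name__ == "__main__":</code> guard so that worker processes can import the module safely.</p>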
<h2 id="training">Training</h2>
<p>We used the following models/techniques for learning:</p>
<ol>
<li>Support Vector Classifier</li>
<li>Extreme Gradient Boosting (XGBoost)</li>
<li>Logistic Regression</li>
<li>K Nearest Neighbour Classifier</li>
<li>Random Forest</li>
<li>Neural Network</li>
</ol>
<p>For each of these models we did hyperparameter tuning to find the best model. Grid search was used to try out all combinations of hyperparameter values. We used k-fold cross-validation with k=4 for training. To make efficient use of our CPUs we ran the grid search in parallel, since the training for each hyperparameter combination is independent of the others. We used the <code class="language-plaintext highlighter-rouge">sklearn</code> and <code class="language-plaintext highlighter-rouge">xgboost</code> libraries to help us with training.</p>
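<p>To make the procedure concrete, here is a dependency-free sketch of grid search with k-fold cross-validation. In practice we relied on <code class="language-plaintext highlighter-rouge">sklearn</code> for this (its <code class="language-plaintext highlighter-rouge">GridSearchCV</code> takes <code class="language-plaintext highlighter-rouge">cv</code> and <code class="language-plaintext highlighter-rouge">n_jobs</code> arguments), so the helper names and the <code class="language-plaintext highlighter-rouge">fit</code>/<code class="language-plaintext highlighter-rouge">score</code> callback interface below are assumptions made for illustration.</p>

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


def grid_search(fit, score, X, y, param_grid, k=4):
    """Return (best_params, best_mean_score) over every combination in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train_idx, val_idx in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
            fold_scores.append(
                score(model, [X[i] for i in val_idx], [y[i] for i in val_idx])
            )
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

<p>Each pass of the outer loop touches only its own parameter combination, which is why the combinations can be handed to separate workers. Real data should be shuffled before it is cut into contiguous folds.</p>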
<h2 id="evaluation">Evaluation</h2>
<h4 id="hyperparameter-tuning">Hyperparameter Tuning</h4>
<p>The graphs for hyperparameter tuning are as follows:</p>
<p><img src="/assets/images/malware/svc.png" alt="svc" /></p>
<figcaption class="caption">Support Vector Classifier</figcaption>
<p><img src="/assets/images/malware/knn.png" alt="knn" /></p>
<figcaption class="caption">K Nearest Neighbour Classifier</figcaption>
<p><img src="/assets/images/malware/lr.png" alt="lr" /></p>
<figcaption class="caption">Logistic Regression</figcaption>
<p><img src="/assets/images/malware/xgbc.png" alt="xgbc" /></p>
<figcaption class="caption">XGBoost</figcaption>
<p><img src="/assets/images/malware/rfc.png" alt="rfc" /></p>
<figcaption class="caption">Random Forest Classifier</figcaption>
<h4 id="cross-validation-and-test-set-accuracy">Cross Validation and Test Set Accuracy</h4>
<table>
<thead>
<tr>
<th>Model</th>
<th>4-Fold Cross Validation Accuracy</th>
<th>Test Set Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td>Logistic Regression</td>
<td>0.9745187647140285</td>
<td>0.910562449264865</td>
</tr>
<tr>
<td>Support Vector Classifier</td>
<td>0.9775654341503947</td>
<td>0.869346629</td>
</tr>
<tr>
<td>Neural Network</td>
<td>0.941</td>
<td>0.893875612342112</td>
</tr>
<tr>
<td>K Nearest Neighbour Classifier</td>
<td>0.9641323916355076</td>
<td>0.821231293817848</td>
</tr>
<tr>
<td>XGBoost</td>
<td>0.9945990859991691</td>
<td>0.921231623812763</td>
</tr>
<tr>
<td>Random Forest</td>
<td>0.9609472372247612</td>
<td>0.88658497372</td>
</tr>
</tbody>
</table>
<p>We get very good cross-validation accuracies with all models, but XGBoost works the best.</p>
<p>XGBoost still dominates all the other models on the test set, but logistic regression and the neural network also come quite close.</p>
<h2 id="problems-faced-and-learning">Problems faced and Learning</h2>
<p>Understanding what did not work is as important as understanding what worked. This section talks about the challenges we faced during this project and what we learned from them. The first was the number of features. We wanted to use higher-order n-grams, but the number of combinations was too large, leading to very slow training. To get over this hurdle we decided to use Random Forest feature selection, so that the other models need not train on all the features but only on the most important ones.</p>
<p>We also tried to account for loops in the <code class="language-plaintext highlighter-rouge">.asm</code> files while computing the instruction counts. But since we could only do static analysis of the files, we could only follow unconditional jumps, which would not have been very useful.</p>
<p>Since we were trying out various techniques we hadn’t used before, we decided to apply semi-supervised learning. But later we learnt that it is meant for cases with a small amount of labelled data and a large amount of unlabelled data, where the unlabelled data is also used for learning. Since we didn’t have any shortage of labelled samples, we decided not to do this.</p>
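<p>A quick calculation shows how fast the feature space grows. For byte n-grams there are 256 possible values per position, so the number of possible features is 256 to the power n. The helper name is made up for the example.</p>

```python
def byte_ngram_feature_count(n):
    """Number of distinct byte n-grams: 256 choices for each of the n positions."""
    return 256 ** n


for n in (1, 2, 3, 4):
    print(n, byte_ngram_feature_count(n))
# 1 256
# 2 65536
# 3 16777216
# 4 4294967296
```

<p>Going from 2-grams to 3-grams alone multiplies the number of candidate features by 256, which is what made higher n-grams impractical for us.</p>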
<p>We also wanted to try out Deep Learning but due to the large size of the files (~100 MB for many of the <code class="language-plaintext highlighter-rouge">.asm</code> files) it would have been very slow without extracting features manually first to decrease the size.</p>
<p>A major problem we faced was the huge size of the data. We didn’t have enough space on our computers to store all the training data, so we had to store it on a server and run all our code there. After doing this a few times, we came up with the idea of dumping the features after extracting them the first time, so that we could read directly from the dumps afterwards. This reduced the size from 200 GB to ~1 GB! We thought we were done, but then we ran short of another resource: RAM. The features from all the data did not fit in memory. A better idea at this point would have been to do batch learning, but due to lack of time we ended up just training on a smaller amount of data.</p>
<p>We learnt many practical ML lessons during the project, which improved our understanding significantly.</p>
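<p>The batch learning we did not get to would start from something like the generator below, which reads a feature dump a fixed number of rows at a time so that only one batch sits in RAM. The CSV layout and the function name are assumptions. Each batch would then go to a model that supports incremental updates, for example through <code class="language-plaintext highlighter-rouge">partial_fit</code> on sklearn estimators that provide it.</p>

```python
import csv
from itertools import islice


def iter_batches(csv_path, batch_size):
    """Yield lists of at most batch_size rows from a CSV feature dump."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch
```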
<h2 id="conclusion-and-future-work">Conclusion and Future Work</h2>
<p>We got good enough accuracy with the data and the limited computational resources we had. Thus we can conclude that machine learning can be an effective technique for malware classification. In fact, it is being used extensively in industrial applications these days.</p>
<p>In spite of all the success, machine learning models aren’t foolproof either. The datasets used to train the models are usually biased because there is no common data sink for malware samples. This is caused by the lack of collaboration in the industry.</p>
<p>In future, we would like to try out more models and more combinations of features to find out which ones work best together. We will also build a web front-end for the application where people can upload malware samples, with our models predicting each sample’s class in the backend. This would make the project a complete, ready-to-use package for users.</p>
<h3 id="references">References</h3>
<p>[1] <a href="http://vizsec.org/files/2011/Nataraj.pdf">Malware Images: Visualization and Automatic Classification</a></p>
<p>[2] <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.644&rep=rep1&type=pdf">Code Obfuscation and Malware Detection</a></p>
<p>[3] <a href="https://www.kaggle.com/c/malware-classification/data">Microsoft Malware Classification Challenge 2015</a></p>
<p>[4] <a href="http://www.iis.sinica.edu.tw/page/jise/2015/201505_11.pdf">Feature selection and extraction for Malware Classification</a></p>
<p>[5] <a href="https://github.com/xiaozhouwang/kaggle_Microsoft_Malware/blob/master/Saynotooverfitting.pdf">Kaggle challenge first place team</a></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:code" role="doc-endnote">
<p>Our code is available on GitHub <a href="https://github.com/CodeMaxx/my-other-computer-is-your-computer">here</a> <a href="#fnref:code" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Graphics - Modelling, Rendering and Animation (published 2017-12-31T08:15:00+00:00, https://www.akashtrehan.com/graphics)
The Right way to use Sublime Text (published 2017-12-23T00:00:00+00:00, https://www.akashtrehan.com/right-way-to-use-sublime-text)
<p><img src="/assets/images/sublime.png" alt="Sublime Text" /></p>
<p>After having used Sublime Text the wrong way for about 2 years, I have finally learnt my lesson. I am going around the internet looking for ways to be more productive with Sublime Text.</p>
<p>Sublime Text is a Swiss Army knife with all sorts of tips and tricks up its sleeve. These include, but are not limited to, keyboard shortcuts, projects and code snippets.</p>
<p>I will keep listing interesting things I find, so this post will be updated regularly for a few days. Remember to try out these tricks hands-on as you read through them, or they won’t stick in your mind.</p>
<p>So let’s get started!</p>
<ol>
<li>
<p><strong>Next Occurrence of a word</strong> <code class="language-plaintext highlighter-rouge">cmd + d</code> (Mac) | <code class="language-plaintext highlighter-rouge">ctrl + d</code> (PC)</p>
</li>
<li>
<p><strong>Multi-cursor</strong> <code class="language-plaintext highlighter-rouge">cmd + Left mouse click</code> | <code class="language-plaintext highlighter-rouge">ctrl + Left mouse click</code></p>
</li>
<li>
<p><strong>Column selection</strong> <code class="language-plaintext highlighter-rouge">alt + left click drag</code> | <code class="language-plaintext highlighter-rouge">shift + right click drag</code></p>
</li>
<li>
<p><strong>Split selection into lines</strong> <code class="language-plaintext highlighter-rouge">cmd + shift + L</code> | <code class="language-plaintext highlighter-rouge">ctrl + shift + L</code></p>
</li>
<li>
<p><strong>Move cursor to beginning of line</strong> <code class="language-plaintext highlighter-rouge">cmd + left arrow</code> | <code class="language-plaintext highlighter-rouge">home</code></p>
</li>
<li>
<p><strong>Wrap selection with html tag</strong> <code class="language-plaintext highlighter-rouge">ctrl + shift + w</code> | <code class="language-plaintext highlighter-rouge">alt + shift + w</code></p>
</li>
<li>
<p><strong>Move line vertically</strong> <code class="language-plaintext highlighter-rouge">cmd + ctrl + arrow</code> | <code class="language-plaintext highlighter-rouge">ctrl + shift + arrow</code></p>
</li>
<li>
<p><strong>Duplicate Line</strong> <code class="language-plaintext highlighter-rouge">cmd + shift + D</code> | <code class="language-plaintext highlighter-rouge">ctrl + shift + D</code></p>
</li>
<li>
<p><strong>Delete Line</strong> <code class="language-plaintext highlighter-rouge">ctrl + shift + K</code> | <code class="language-plaintext highlighter-rouge">ctrl + shift + K</code></p>
</li>
<li>
<p><strong>Indent line</strong> <code class="language-plaintext highlighter-rouge">cmd + [ or ]</code> | <code class="language-plaintext highlighter-rouge">ctrl + [ or ]</code> (Also <code class="language-plaintext highlighter-rouge">Edit -> Line -> Reindent</code> for indentation of selection)</p>
</li>
<li>
<p><strong>Paste & Indent</strong> <code class="language-plaintext highlighter-rouge">cmd + shift + V</code> | <code class="language-plaintext highlighter-rouge">ctrl + shift + V</code> (Very very useful!)</p>
</li>
</ol>
<p>Again, do try all this out by yourself!</p>
<p><strong>Cheers!</strong></p>
<p><strong>If you are an Infosec person, don’t forget to check out my <a href="../../writeups">CTF Write-ups</a></strong></p>
<p><a href="../blog">See other Blog posts</a></p>
Hacking Postgres Internals - Indexing Schemes for Data Recording Systems (published 2017-12-13T08:20:00+00:00, https://www.akashtrehan.com/indexing-schemes)
<p><img src="/assets/images/database.jpg" alt="Databases" /></p>
<h1 id="project-report">Project Report</h1>
<table>
<thead>
<tr>
<th>Team DataAcids</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://github.com/tastelessjolt/">Harshith Goka</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://github.com/codemaxx">Akash Trehan</a></td>
<td> </td>
</tr>
<tr>
<td><a href="AbhishekKumar16">Abhishek Kumar</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://github.com/vermatarunv">Tarun Verma</a></td>
<td> </td>
</tr>
</tbody>
</table>
<p>For the backstory on this project read <a href="../project-ditch/">this</a> first.</p>
<p>The <a href="https://github.com/codemaxx/postgres">code for this project</a> has been open-sourced on Github.</p>
<h2 id="introduction">Introduction</h2>
<p>Every minute, 600,000 pieces of content are shared on Facebook, and more than 100,000 tweets are sent. And that does not even begin to scratch the surface of data generation, which spans sensors, medical records, corporate databases, and more. With so much data being stored, viewed and analysed, high performance is a must. Hence, the need of the hour is to store and retrieve data efficiently, without compromising performance.</p>
<p>The reference paper for the project can be found <a href="https://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/indexbuffering-vldb97.pdf">here</a>.</p>
<h2 id="objectives">Objectives</h2>
<ol>
<li>Design a technique that supports both insertion and queries with reasonable efficiency, and without the delays of periodic batch processing.</li>
<li>Implement this on top of PostgreSQL, one of the most popular open-source DBMSs.</li>
</ol>
<h2 id="functionalities">Functionalities</h2>
<ol>
<li>Insertion of new tuples into the relation along with updating the corresponding stepped-merge index</li>
<li>Search using the custom index we implement</li>
</ol>
<h2 id="system-architecture">System Architecture</h2>
<p><strong>Front-end</strong></p>
<ul>
<li>We did not implement a front-end of our own. The only interface is the user interface that PostgreSQL provides.</li>
</ul>
<p><strong>Back-end</strong></p>
<ol>
<li>Implemented a structure similar to Log-Structured Merge trees (stepped-merge trees) to organize the incoming data, clustered by search key.
<ul>
<li>Works in single-user mode</li>
<li>Does not handle concurrency control and recovery issues</li>
</ul>
</li>
<li>Implemented in the C language.</li>
<li>Used Eclipse IDE for debugging and building the project.</li>
</ol>
<h2 id="engineering-details-">Engineering details</h2>
<p>Our goal was to maintain the actual index as a collection of multiple indices (runs). Only one of the runs is in memory at any particular time, and it acts as the index for the latest incoming data. Once this run fills up the memory, it is written to disk using a bottom-up B-tree build. Both the in-memory run and the one just constructed are Level -1 runs.</p>
<p>We have implemented the stepped-merge algorithm as suggested in the paper. The algorithm has two parameters: <code class="language-plaintext highlighter-rouge">K</code> (the maximum number of trees at any level) and <code class="language-plaintext highlighter-rouge">N</code> (the number of levels). When <code class="language-plaintext highlighter-rouge">K</code> runs of level <code class="language-plaintext highlighter-rouge">i</code> accumulate on disk, we merge them to create a single run of level <code class="language-plaintext highlighter-rouge">i+1</code>. When a level-<code class="language-plaintext highlighter-rouge">N</code> run is finally reached, we write it to the root relation.</p>
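<p>The merge cascade is easier to follow in a small in-memory model. The sketch below is written in Python rather than the C of the actual Postgres code, uses sorted lists in place of B-trees, and numbers the on-disk levels from 0. The class and its <code class="language-plaintext highlighter-rouge">run_capacity</code> parameter are assumptions for illustration only.</p>

```python
import bisect
import heapq


class SteppedMerge:
    """Toy stepped-merge index: sorted lists stand in for B-tree runs."""

    def __init__(self, k, n, run_capacity):
        self.k = k                              # max runs per level before merging
        self.n = n                              # number of levels below the root
        self.run_capacity = run_capacity        # keys held in memory before a flush
        self.current = []                       # in-memory run
        self.levels = [[] for _ in range(n)]    # levels[i] = list of sorted runs
        self.root = []                          # final, fully merged relation

    def insert(self, key):
        self.current.append(key)
        if len(self.current) >= self.run_capacity:
            run, self.current = sorted(self.current), []
            self._add_run(0, run)               # flush memory as a new lowest-level run

    def _add_run(self, level, run):
        if level == self.n:                     # a level-N run goes to the root
            self.root = list(heapq.merge(self.root, run))
            return
        self.levels[level].append(run)
        if len(self.levels[level]) == self.k:   # K runs accumulated: merge them upward
            merged = list(heapq.merge(*self.levels[level]))
            self.levels[level] = []
            self._add_run(level + 1, merged)

    def contains(self, key):
        """A search has to consult every run, not only the root."""
        runs = [sorted(self.current), self.root]
        runs += [run for level in self.levels for run in level]
        for run in runs:
            i = bisect.bisect_left(run, key)
            if i < len(run) and run[i] == key:
                return True
        return False
```

<p>Inserts stay cheap because each key is only rewritten once per level, while searches pay for it by having to look in every run that currently exists.</p>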
<ul>
<li>To go about this task, we first have to make Postgres recognise that we have created an index. We need to add an entry to the <code class="language-plaintext highlighter-rouge">pg_am</code> system catalog to identify our <code class="language-plaintext highlighter-rouge">'smerge'</code> index as an access method. This is done by adding an entry to the file <code class="language-plaintext highlighter-rouge">pg_am.h</code>, giving it a unique OID and the name of the handler (which is created next).</li>
</ul>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">DATA</span><span class="p">(</span><span class="n">insert</span> <span class="n">OID</span> <span class="o">=</span> <span class="mi">9399</span> <span class="p">(</span> <span class="n">smerge</span> <span class="n">smergehandler</span> <span class="n">i</span> <span class="p">));</span>
<span class="n">DESCR</span><span class="p">(</span><span class="s">"stepped merge index access method"</span><span class="p">);</span>
<span class="cp">#define SMERGE_AM_OID 9399
</span></code></pre></div></div>
<ul>
<li>
<p>To be useful, an index access method must also have one or more operator families and operator classes defined in <code class="language-plaintext highlighter-rouge">pg_opfamily</code>, <code class="language-plaintext highlighter-rouge">pg_opclass</code>, <code class="language-plaintext highlighter-rouge">pg_amop</code>, and <code class="language-plaintext highlighter-rouge">pg_amproc</code> which allow the planner to determine what kinds of query qualifications can be used with indexes of this access method. Hence the corresponding entries are added in the corresponding files.</p>
</li>
<li>
<p>Next, a new access method directory is created in <code class="language-plaintext highlighter-rouge">src/backend/access</code> (called ‘smerge’ in our case). Inside this directory we create a file called <code class="language-plaintext highlighter-rouge">smerge.c</code> (the corresponding .h file is created in <code class="language-plaintext highlighter-rouge">src/include/access</code>) and define the handler function that returns an <code class="language-plaintext highlighter-rouge">IndexAmRoutine</code> with the access method parameters and callbacks. Various parameters are set in this handler to describe the kind of support our index provides. For example, <code class="language-plaintext highlighter-rouge">amroutine->amcanorder</code> is set to false, indicating that ordering is not yet supported by the index. All the functions from <code class="language-plaintext highlighter-rouge">nbtree.c</code> are retained (renamed for our convenience), and their definitions are changed to meet our requirements. The basic idea was to reuse the functionality of <code class="language-plaintext highlighter-rouge">nbtree</code> as much as possible, either by calling its functions from ours or by borrowing its ideas, since we were merely building multiple B-trees.</p>
</li>
<li>
<p>To build a new smerge index, the <code class="language-plaintext highlighter-rouge">smergebuild()</code> function is used. It constructs a “create B-tree index” statement and executes it, giving a unique OID to that index. The built-in function <code class="language-plaintext highlighter-rouge">DefineIndex()</code> (defined in <code class="language-plaintext highlighter-rouge">indexcmds.c</code>, it creates a new index given the index creation statement and other parameters) is used for this. Also in this function, we need to add the metadata corresponding to each B-tree. The metadata to be included is defined in the struct <code class="language-plaintext highlighter-rouge">SmMetadata</code>, as follows:</p>
</li>
</ul>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">typedef</span> <span class="k">struct</span> <span class="nc">SmMetadata</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">K</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">N</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">attnum</span><span class="p">;</span>
<span class="n">AttrNumber</span> <span class="n">attrs</span><span class="p">[</span><span class="n">INDEX_MAX_KEYS</span><span class="p">];</span>
<span class="kt">int</span> <span class="n">levels</span><span class="p">[</span><span class="n">MAX_N</span><span class="p">];</span>
<span class="n">Oid</span> <span class="n">tree</span><span class="p">[</span><span class="n">MAX_N</span><span class="p">][</span><span class="n">MAX_K</span><span class="p">];</span>
<span class="kt">int</span> <span class="n">currTuples</span><span class="p">;</span>
<span class="n">Oid</span> <span class="n">curr</span><span class="p">;</span>
<span class="n">Oid</span> <span class="n">root</span><span class="p">;</span>
<span class="kt">bool</span> <span class="n">unique</span><span class="p">;</span>
<span class="p">}</span> <span class="n">SmMetadata</span><span class="p">;</span>
</code></pre></div></div>
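<p>As a rough illustration, the standalone sketch below shows how a struct like the one above could be zero-initialised and checked to fit within a single block. The typedefs and macros are stand-ins for PostgreSQL’s own definitions, the values of <code class="language-plaintext highlighter-rouge">MAX_N</code> and <code class="language-plaintext highlighter-rouge">MAX_K</code> are assumptions (this report does not state them), and <code class="language-plaintext highlighter-rouge">sm_init_metadata_sketch()</code> is a hypothetical helper, not the project’s actual function.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Stand-ins for PostgreSQL definitions so the sketch compiles on its own. */
typedef unsigned int Oid;
typedef int16_t AttrNumber;
#define InvalidOid     ((Oid) 0)
#define INDEX_MAX_KEYS 32      /* PostgreSQL default */
#define BLCKSZ         8192    /* PostgreSQL default block size */

/* Assumed limits: the real values are not given in the report. */
#define MAX_N 8
#define MAX_K 8

typedef struct SmMetadata {
    int        K;
    int        N;
    int        attnum;
    AttrNumber attrs[INDEX_MAX_KEYS];
    int        levels[MAX_N];
    Oid        tree[MAX_N][MAX_K];
    int        currTuples;
    Oid        curr;
    Oid        root;
    bool       unique;
} SmMetadata;

/* The metadata is written to one page, so it has to fit in one block. */
_Static_assert(sizeof(SmMetadata) &lt;= BLCKSZ, "SmMetadata must fit in one block");

/* Hypothetical initialiser: zero every field, then set K, N and unique. */
void sm_init_metadata_sketch(SmMetadata *meta, int k, int n, bool unique)
{
    assert(meta != NULL);
    assert(k &gt; 0 &amp;&amp; k &lt;= MAX_K);
    assert(n &gt; 0 &amp;&amp; n &lt;= MAX_N);

    memset(meta, 0, sizeof(*meta));   /* zeroed Oid fields equal InvalidOid */
    meta-&gt;K = k;
    meta-&gt;N = n;
    meta-&gt;unique = unique;
}
</code></pre></div></div>
<p>Calling <code class="language-plaintext highlighter-rouge">sm_init_metadata_sketch(&amp;meta, 3, 3, false)</code> would give the configuration used in the run-through later in this report.</p>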
<ul>
<li>The metadata is stored by first allocating a page the size of one block (<code class="language-plaintext highlighter-rouge">BLCKSZ</code>) and then calling the functions <code class="language-plaintext highlighter-rouge">_sm_init_metadata()</code> and <code class="language-plaintext highlighter-rouge">_sm_writepage()</code>, which are defined in the file <code class="language-plaintext highlighter-rouge">smmeta.c</code>.
<ul>
<li><code class="language-plaintext highlighter-rouge">_sm_init_metadata()</code> initialises the metadata values. We have hard-coded the values of K and N here.</li>
<li><code class="language-plaintext highlighter-rouge">_sm_writepage()</code> uses functions similar to those used by the storage module of postgres, specifically <code class="language-plaintext highlighter-rouge">smgrwrite()</code>, to store the metadata on the first page of the smerge index relation.</li>
</ul>
</li>
<li>
<p>Since the OID of each newly created index is stored in the metadata page using the <code class="language-plaintext highlighter-rouge">smgrwrite()</code> function, we can easily get the btree by calling <code class="language-plaintext highlighter-rouge">index_open()</code> on the stored OID.</p>
</li>
<li>The next part is to insert an index tuple into the current btree.
<ul>
<li>For this we first get the metadata of the relation and then extract the OID of the current in-memory b-tree using the <code class="language-plaintext highlighter-rouge">_get_curr_btree()</code> function, which simply uses the <code class="language-plaintext highlighter-rouge">index_open</code> function to get that b-tree.</li>
<li>Once we have this <code class="language-plaintext highlighter-rouge">btreeRel</code>, we simply call the <code class="language-plaintext highlighter-rouge">bt_insert()</code> function to insert the new tuple, and then close the opened index using the <code class="language-plaintext highlighter-rouge">index_close()</code> function.</li>
<li>Next we need to check whether the current in-memory tree is full. If it is, we create a new in-memory tree using the <code class="language-plaintext highlighter-rouge">_sm_create_curr_btree()</code> function and call <code class="language-plaintext highlighter-rouge">sm_flush()</code> to flush the values into the next level.</li>
<li>Finally, we need to write the changed values back to the metadata page, since a new tuple was added and the count of entries in the in-memory tree has changed. So we again call the function <code class="language-plaintext highlighter-rouge">_sm_write_metadata()</code> to update the metadata.</li>
</ul>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">smsort.c</code> contains the implementation for merging the indices which involves creation of spools for various indices and then merging them. The main function called when it’s time to merge is the <code class="language-plaintext highlighter-rouge">sm_flush()</code> function.</p>
</li>
<li>
<p>For debugging, we also want all search queries to go through the smerge index once it has been created. Hence, as a hack, we changed the <code class="language-plaintext highlighter-rouge">smergecostestimate()</code> function to set the costs very low (close to 0).</p>
</li>
<li>
<p>Once one of the levels is full and it’s time to merge the k runs, <code class="language-plaintext highlighter-rouge">sm_flush()</code> is invoked. It is responsible for merging the <code class="language-plaintext highlighter-rouge">k</code> level <code class="language-plaintext highlighter-rouge">i</code> runs into a single level <code class="language-plaintext highlighter-rouge">i+1</code> run. The function’s implementation is inspired by the function <code class="language-plaintext highlighter-rouge">bt_load()</code> in the file <code class="language-plaintext highlighter-rouge">nbtsort.c</code>, which merges two spools (the second one is for dead tuples).</p>
</li>
<li>To create the spools we need to get the tuples corresponding to each index separately. For this we do an index-only scan to get all tuples for the particular index. We create a ScanKey such that all the tuples are returned. Currently we assume that all entries are greater than a particular number (we can use the smallest integer that fits in an <code class="language-plaintext highlighter-rouge">int</code> for this). After the spools are created, they are passed to the <code class="language-plaintext highlighter-rouge">tuplesort_performsort()</code> function. Although the spools are already sorted, the sort state needs to be set up properly, which is what this function does. Merging of level N-1 into the root is handled separately but uses similar merging logic.</li>
</ul>
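<p>The merge step described above can be illustrated, very loosely, with a k-way merge over plain sorted integer arrays. This is only a sketch of the idea: the real code works on index tuples through spools and the tuplesort machinery, whereas <code class="language-plaintext highlighter-rouge">merge_runs()</code> and its parameters are hypothetical names invented for this example, and a simple linear scan is used to pick the smallest key.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;

/*
 * Merge k sorted runs into out. runs[i] points to len[i] keys in
 * ascending order. pos is caller-provided scratch space with k entries
 * (one read cursor per run). Returns the number of keys written.
 */
size_t merge_runs(const int *const runs[], const size_t len[], size_t k,
                  size_t pos[], int out[])
{
    size_t written = 0;

    for (size_t i = 0; i &lt; k; i++)
        pos[i] = 0;

    for (;;) {
        size_t best = k;              /* k means "no run has keys left" */

        for (size_t i = 0; i &lt; k; i++) {
            if (pos[i] == len[i])
                continue;             /* run i is exhausted */
            if (best == k || runs[i][pos[i]] &lt; runs[best][pos[best]])
                best = i;
        }
        if (best == k)
            break;                    /* every run is exhausted */

        /* Debug check: fails if an input run was not sorted. */
        assert(written == 0 || out[written - 1] &lt;= runs[best][pos[best]]);
        out[written++] = runs[best][pos[best]++];
    }
    return written;
}
</code></pre></div></div>
<p>The caller has to make <code class="language-plaintext highlighter-rouge">out</code> large enough to hold the sum of all run lengths. Each output key costs a scan over k cursors, which is fine for a small k such as 3; a heap would be the usual choice for larger values.</p>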
<h2 id="run-through">Run Through</h2>
<p><strong>K = 3, N = 3, max_tuple_per_index = 4</strong></p>
<p><code class="language-plaintext highlighter-rouge">create table foo (uid int, name varchar(20));</code> # Creates a sample table<br />
<code class="language-plaintext highlighter-rouge">create index sm on foo using smerge (uid);</code> # Creates the smerge index</p>
<p><code class="language-plaintext highlighter-rouge">insert into foo values (1, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (2, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (3, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (4, 'axzagd');</code><br />
——————– Memory index fills up. A new index1 is created and the filled index goes to level 0<br />
<code class="language-plaintext highlighter-rouge">insert into foo values (5, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (6, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (7, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (8, 'axzagd');</code><br />
——————– Similar index 2 is created<br />
<code class="language-plaintext highlighter-rouge">insert into foo values (9, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (10, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (11, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (12, 'axzagd');</code><br />
——————– Similar index 3 is created. Level 0 fills up. Index 1, 2, 3 are merged to create a level 1 index.<br />
<code class="language-plaintext highlighter-rouge">insert into foo values (13, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (14, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (15, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (16, 'axzagd');</code><br />
——————– New level 0 index is created and so on.<br />
<code class="language-plaintext highlighter-rouge">insert into foo values (17, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (18, 'axzagd');</code><br />
<code class="language-plaintext highlighter-rouge">insert into foo values (19, 'axzagd');</code><br />
.<br />
.<br />
After level <code class="language-plaintext highlighter-rouge">N-1</code> fills up, it is merged with the single root relation.</p>
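<p>The bookkeeping in the trace above can be reproduced with a small counter simulation. It is a sketch under two assumptions read off the trace: the levels are numbered 0 to N-1 with the root kept separately, and a level is merged as soon as its K-th run arrives. <code class="language-plaintext highlighter-rouge">SmSim</code>, <code class="language-plaintext highlighter-rouge">sm_sim_init()</code> and <code class="language-plaintext highlighter-rouge">sm_sim_insert()</code> are hypothetical names; the sketch only counts runs and does not touch any real index.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;

#define SKETCH_MAX_N 16           /* assumed upper bound on the number of levels */

typedef struct SmSim {
    int k;                        /* runs a level holds before it is merged */
    int n;                        /* number of levels below the root */
    int max_tuples;               /* capacity of the in-memory tree */
    int curr_tuples;              /* tuples currently in the in-memory tree */
    int runs[SKETCH_MAX_N];       /* runs currently stored at each level */
    int root_merges;              /* times level n-1 was merged into the root */
} SmSim;

void sm_sim_init(SmSim *s, int k, int n, int max_tuples)
{
    assert(k &gt; 0 &amp;&amp; max_tuples &gt; 0);
    assert(n &gt; 0 &amp;&amp; n &lt;= SKETCH_MAX_N);

    s-&gt;k = k;
    s-&gt;n = n;
    s-&gt;max_tuples = max_tuples;
    s-&gt;curr_tuples = 0;
    s-&gt;root_merges = 0;
    for (int i = 0; i &lt; SKETCH_MAX_N; i++)
        s-&gt;runs[i] = 0;
}

void sm_sim_insert(SmSim *s)
{
    s-&gt;curr_tuples++;
    if (s-&gt;curr_tuples &lt; s-&gt;max_tuples)
        return;                   /* the in-memory tree still has room */

    /* The in-memory tree is full: it becomes a level 0 run. */
    s-&gt;curr_tuples = 0;
    s-&gt;runs[0]++;

    /* Cascade: a level holding k runs is merged into one run a level up. */
    for (int level = 0; level &lt; s-&gt;n &amp;&amp; s-&gt;runs[level] == s-&gt;k; level++) {
        s-&gt;runs[level] = 0;
        if (level == s-&gt;n - 1)
            s-&gt;root_merges++;     /* the last level is merged into the root */
        else
            s-&gt;runs[level + 1]++;
    }
}
</code></pre></div></div>
<p>With K = 3, N = 3 and a capacity of 4 tuples, twelve calls to <code class="language-plaintext highlighter-rouge">sm_sim_insert()</code> leave one run on level 1 and none on level 0, as in the trace. In this sketch the first merge into the root happens on the 108th insert (4 × 3 × 3 × 3).</p>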
<h2 id="further-work">Further Work</h2>
<ul>
<li>We hard-coded the parameters <code class="language-plaintext highlighter-rouge">N</code> and <code class="language-plaintext highlighter-rouge">K</code>; they could instead be exposed as user parameters that can be changed later on.</li>
<li>The cost estimation still needs to be implemented properly.</li>
<li>As of now postgres chooses btrees for the default indices (primary key, foreign key etc.). Changes need to be made so that smerge is chosen.</li>
<li>Currently, search queries start from the root relation and move upwards, which does not necessarily produce output in sorted order (which might be desired in certain situations). In short, the ordering property is not supported; the output step could be modified to sort tuples before returning them.</li>
<li>There are memory leaks (specifically relcache leaks) that are not handled in the code. These should be fixed before running performance comparisons against btrees.</li>
<li>Update and Delete operations are not yet supported in our implementation. Once the order by operation is handled, these could be done efficiently. In addition, bloom filters might be needed for performing these.</li>
</ul>
<h2 id="resources">Resources</h2>
<p><a href="https://www.postgresql.org/docs/9.6/static/xindex.html">https://www.postgresql.org/docs/9.6/static/xindex.html</a> (Prequel for the below)<br />
<a href="https://www.postgresql.org/docs/9.6/static/indexam.html">https://www.postgresql.org/docs/9.6/static/indexam.html</a><br />
<a href="https://www.postgresql.org/files/developer/internalpics.pdf">https://www.postgresql.org/files/developer/internalpics.pdf</a><br />
<a href="https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf">https://www.pgcon.org/2016/schedule/attachments/434_Index-internals-PGCon2016.pdf</a></p>
<p>*All mentions of B-tree actually refer to B+ trees</p>
Database course project and how I almost ditched it! | 2017-12-12T08:20:00+00:00 | https://www.akashtrehan.com/project-ditch
<p><img src="/assets/images/database.jpg" alt="Databases" /></p>
<h2 id="background">Background</h2>
<p>This was the best semester ever! All the courses I took were Computer Systems courses (except Psychology, which is another subject I love). I had 3 labs which were great fun, and the cherry on the cake was this databases project I took up with three of my friends.</p>
<p>We were given the freedom to choose whatever project we liked, which more often than not is a responsibility …aagh another responsibility!</p>
<p>We were given some sample projects we could take up, most of which were Android apps. Their main focus was software development and understanding how to design database schemas. Most of the teams came up with great ideas for this type of project but as usual my rebellious self kicked in.</p>
<p><em>“This is the only course project you’re doing this semester! It must be something different, something awesome!”</em></p>
<p>The first step was to convince the team to take on a hard project. Since the project counted for 30% of the course marks, not being able to complete it would be devastating for our grade. The team had some discussions and by the end all of us were pretty excited to take on the challenge. We knew it was a risk but we did it anyways.</p>
<p>We talked with our guide, <a href="https://www.cse.iitb.ac.in/~sudarsha/">Prof. Sudarshan S</a> and decided on the project you’re reading about. “Hacking Postgres Internals” had a nice ring to it I thought. The project actually implements a part of <a href="https://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/indexbuffering-vldb97.pdf">his paper from 1997</a>.</p>
<h2 id="the-team">The Team</h2>
<p><a href="https://github.com/tastelessjolt/">Harshith Goka</a>, <a href="AbhishekKumar16">Abhishek Kumar</a> and <a href="https://github.com/vermatarunv">Tarun Verma</a> were my teammates. I have teamed up with Goka a few times before. He’s very enthusiastic about software development and has always been a great teammate. With Abhishek, I had done the Digital Logic Design project before and we have been good friends since. I had never teamed up with Tarun before but knew he was a sincere guy. We really enjoyed doing the project together!</p>
<h2 id="the-preparation">The Preparation</h2>
<p>When we started, we had little idea about what we’d gotten ourselves into. We didn’t know much about postgres internals, so it was a long ride to successfully adding a new feature to it.</p>
<p>Prof. Sudarshan provided us with a lot of helpful material on the subject and on our request even agreed to take a session explaining the basics. We attended the session, learnt new stuff, sincerely decided to start on it the next day itself and then forgot about it for a few weeks :P</p>
<p>When we finally got to it we had forgotten everything from the session, so we started all over again. We went through the material slowly but steadily. After finishing the reading, it was time to start implementing. We decided to start on it the next day itself. You know what happened after. We didn’t start until after our final exams :P</p>
<p>(The preparation material is mentioned at the end of the <a href="../indexing-schemes/">project report</a>)</p>
<h2 id="to-be-or-not-to-be">To be or not to be</h2>
<p>A problem with adding a feature to an existing project is that you have to spend time understanding the existing code. It is exhausting but there’s no other way. We spent a lot of time on this during our preparation. Like a lot of time. A lot I mean. We ourselves hadn’t added much to the code. It’s not a very good feeling. So much effort but nothing concrete to show. I’ll be honest, I started having doubts if we would be able to complete the project. In fact, I discussed with the team and we decided that we would switch to a simpler project which we were sure to complete :/</p>
<p>We went to Prof. Sudarshan to tell him (read: ask permission from him :P) about our decision. I usually take the lead in such situations and I knew it was going to be awkward (and sad). I started by telling him about our pain of not feeling a sense of progress. He was very positive and told us ways to take the project forward. He was so excited about the project and talked about it so passionately that I just wasn’t able to tell him we were planning to switch.</p>
<p>So no permission, no switch.</p>
<p>We got our heads back into postgres and were determined to complete it.</p>
<p>And we did end up completing the project. It was <a href="https://www.youtube.com/watch?v=VCeblzSL4cE">lengen…wait for it…dary</a>. Legendary!</p>
<h2 id="the-project-report">The Project Report</h2>
<p>All the engineering details are mentioned in the report.</p>
<p>The report is in another post <a href="../indexing-schemes/">here</a>.</p>
<p>The <a href="https://github.com/codemaxx/postgres">code for this project</a> has been open-sourced on GitHub.</p>
<p><br />
Do check out other <a href="../../projects">projects</a>, <a href="../../blog">my blog</a> or <a href="../../writeups">my write-ups</a> for various CTFs.</p>
CSec - Binary Exploitation 2 | 2017-11-08T00:00:00+00:00 | https://www.akashtrehan.com/csec-binary-exploitation-2
<p><img src="/assets/images/csec.png" alt="Binary Exploitation" /></p>
<p>The second episode of my Binary Exploitation series is out!</p>
<p>(The first one can be found <a href="../csec-binary-exploitation-1">here</a>.)</p>
<p>In this one I talk about some more advanced exploitation techniques, mitigation strategies used against buffer overflow attacks and how to bypass them. There’s a lot of stuff this time. In fact, it’s about double the length of the previous video.</p>
<p>Don’t miss the demo at the end!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/hIIHNUiyw4A" frameborder="0" allowfullscreen=""></iframe>
<p>Constructive criticism is much appreciated.</p>
<p><strong>Cheers!</strong></p>
<p><strong>If you are an Infosec person, don’t forget to check out my <a href="../../writeups">CTF Write-ups</a></strong></p>
<p><a href="../blog">See other Blog posts</a></p>
Won Ubisoft GameJam 2017 | 2017-10-18T11:00:00+00:00 | https://www.akashtrehan.com/Ubisoft-GameJam-Winners
<p><img src="/assets/images/ubisoft.jpg" alt="Ubisoft" /></p>
<p>With high hopes and curious eyes we entered the Ubisoft Office. We were in Pune for the final round of Ubisoft’s GameJam.</p>
<p>Let’s rewind back a month. Ubisoft had announced a qualifier round for the GameJam. We were required to form teams and submit game ideas. The top four ideas would go on for the final round. Since I try to participate in every hackathon with my team <a href="https://github.com/Ferozepurwale">Ferozepurwale</a>, this was our next target. After a good amount of brainstorming we came up with about five game ideas. We voted on them and decided to submit a role-play game. Since this post exists, you already know that we made it through.</p>
<p>Coming back to the final round…</p>
<p>The theme given to us for the hackathon was <strong>Flood!</strong></p>
<p>Fortunately one of the ideas we had in our mind touched upon the theme. We decided to go 3D with Unity. I had never used Unity before but my teammates had.</p>
<p>A good game needs a good backstory. After working on the backstory and overall idea of the game, we got started. (BTW do take a look at the official <a href="https://unity3d.com/learn/tutorials">Unity3D tutorials</a> - they’re great!)</p>
<p>We had about 30 hours to complete our game - the graphics, level design, characters and an overall immersive experience.</p>
<p>The philosophy of level design was one of the best things I learnt during the hackathon. How to introduce the features of your game, what upgrades to add in each level and how to increase the difficulty without making it impossible… it was all really cool.</p>
<p>After two days of coding, free food, hot chocolate and some great mentoring from Ubisoft we completed our game and also ended up winning the hackathon!</p>
<p>Here’s an aftermovie of the hackathon by Ubisoft:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/cP-jsgug0FU" frameborder="0" gesture="media" allow="encrypted-media" allowfullscreen=""></iframe>
<p>The code for our game is open source and available <a href="https://github.com/Ferozepurwale/Flood-League">here</a>.</p>
<p>Thank you for reading!</p>
<p><strong>Cheers!</strong></p>
<p><strong>If you are an Infosec person, don’t forget to check out my <a href="../../writeups">CTF Write-ups</a></strong></p>
<p><a href="../blog">See other Blog posts</a></p>
CSec - Binary Exploitation 1 | 2017-08-25T00:00:00+00:00 | https://www.akashtrehan.com/csec-binary-exploitation-1
<p><img src="/assets/images/csec.png" alt="Binary Exploitation" /></p>
<p><a href="https://www.facebook.com/groups/csec.iitb/">CSec</a> is the cybersecurity club of IIT Bombay started by me a few months ago. I have two aims in mind for the club:</p>
<ul>
<li>To spread awareness about various technical/non-technical stuff related to Computer Security</li>
<li>To build some strong teams for Capture the Flag competitions</li>
</ul>
<p>Although the school year remains very busy, I try to give as much time as possible towards this end. Aligned with this goal, I decided to start a vodcast series on Binary Exploitation with help from the legendary Web & Coding Club.</p>
<p>Presenting the first video from the series (This is my debut video; go easy on me 😜) -</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/wOJl6N5oiQQ?ecver=1" frameborder="0" allowfullscreen=""></iframe>
<p>Liked it ??? Didn’t like it ????? Let me know!</p>
<p>Constructive criticism is much appreciated.</p>
<p><strong>Cheers!</strong></p>
<p><strong>If you are an Infosec person, don’t forget to check out my <a href="../../writeups">CTF Write-ups</a></strong></p>
<p><a href="../blog">See other Blog posts</a></p>