August 3, 2022 (updated August 29, 2022) by Nirupam Biswas

RunPage tech overview: AWS Textract integration

RunPage now supports two new APIs:

api.danfo.readTablesFromImageFile(ImageFile)
api.danfo.readTablesFromPdfFile(PdfFile)

As their names suggest, they take an image or PDF file and use AI to extract the tables in it as Danfo DataFrames!

Check out the demo below:

AWS Textract

The intelligence is provided by Amazon Web Services’ Textract service. This service accepts image and PDF binary data and returns a whole host of metadata, such as the individual words and lines. It can also recognise forms and tables, of which RunPage uses only the table recognition capability.

I also checked out a competing product from Google: Document AI. The results from Textract were stellar, but the results from my preliminary tests with Document AI were not that great.

Challenges with authentication

The mantra of RunPage is to not send user data, such as uploaded files, to the server side for processing, so the Textract API needs to be called from the browser itself. That means sharing server-side credentials with the browser. Right now the same AWS credential is used for all users, and if we leak it to the browser then enterprising users can capture it and use it elsewhere too.

RunPage solves this by getting a temporary credential before each run. The JS sandbox asks the server for a short-lived temporary credential. The server gets it from AWS using the AWS STS (Security Token Service) API and passes it on to the browser.

Challenges with handling PDF

RunPage uses the synchronous API of Textract to analyse images. However, that API does not support PDF files. For PDFs the async version of the API is required, and that version comes with loads of challenges.

First, it won’t accept the PDF binary directly. Instead it requires that we upload the PDF to an S3 bucket and provide that location to the API, which then fetches the file directly from the bucket. This in itself is challenging since we now have to integrate with S3 as well. We also need to store the PDF on a remote server, where it stays until we delete it. And until it is deleted it can be accessed by other users, since all of them share the same AWS account. A huge security issue.

The second issue is that when Textract is done processing the PDF, it uses the AWS SNS service to push a notification to the RunPage server. So a deployment profile for SNS is required too, along with a callback listener API on the RunPage server. When that API is invoked it has to notify the client somehow, which requires either WebSockets or polling the server. This also means that I need to store the response from AWS SNS in the DB for some time. That is too much plumbing for just one API, on top of the additional cost of using S3 and SNS.

I thought of a long-polling trick which avoids some of this plumbing on the backend. The client calls the backend server for an update from AWS SNS, and the backend API simply stores the Express response object in a hash map. Since the response object has not been used to send any data yet, the request shows as pending on the browser side. The timeout value can be set by the client. The callback endpoint triggered by AWS SNS can then look up the appropriate response object in that hash map and send the reply directly to the browser. However, this won’t work if more than one node is running on the backend. The backend architecture uses an Nginx Docker container which acts as a reverse proxy for the Node.js Docker container. Currently only one instance of the Node.js container runs, but the server is deliberately stateless so that multiple instances can run behind a load balancer.
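A minimal sketch of the long-polling idea described above. The function names and the jobId key are hypothetical, and res stands in for an Express response object; this is not RunPage’s actual code.

```javascript
// Parked responses, keyed by a (hypothetical) job id. The map lives in one
// process, which is exactly why the trick breaks with multiple instances.
const pendingResponses = new Map();

// Long-poll endpoint: the browser asks for the result of a Textract job.
function handleClientPoll(jobId, res) {
  // Do not reply yet; while `res` is unused the browser request stays pending.
  pendingResponses.set(jobId, res);
}

// Callback endpoint: invoked when the SNS notification arrives.
function handleSnsCallback(jobId, payload) {
  const res = pendingResponses.get(jobId);
  if (!res) {
    // The poll timed out, never happened, or landed on another instance.
    return false;
  }
  pendingResponses.delete(jobId);
  res.json(payload); // completes the parked browser request
  return true;
}
```

If the poll and the SNS callback are routed to different Node.js instances, handleSnsCallback finds nothing in its own map, which is the limitation noted above.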

Solution to the PDF handling challenges

My solution was to locally (that is in-browser) convert the PDF to images and then use the AWS Textract synchronous API for images.

To locally convert PDF to images I used Mozilla’s PDF.js. Its NPM package is called pdfjs-dist. It is a weird name, probably because the pdfjs name was already taken by some other developer.

Using that library turned out to have its own mini challenge. The library requires a CanvasRenderingContext2D, which we get from a <canvas> DOM element. However, the sandboxed JS APIs run inside a Worker, as described in my blog post – https://blog.applegrew.com/2022/07/runpage-tech-overview-js-sandboxing/. On the Web Worker side we do not have DOM access. There is an experimental API called OffscreenCanvas, but browser support for it is not too great yet (in 2022). To handle such scenarios RunPage has some internal plumbing which we make use of here. The PDF File’s ArrayBuffer is sent from the Worker thread to the main thread with instructions to decode the bytes and return images of it page by page. The Worker then simply sends each page’s image to Textract for analysis. This has a good side effect: we can pick and choose the exact pages we want to analyse, and only those pages get sent. Another plus for security.

A side note: While working on this I stumbled upon a browser bug which I have reported here - https://bugs.chromium.org/p/chromium/issues/detail?id=1349207. That took a lot of time to figure out.

Overall it is a cool and super useful feature. Head over to run.applegrew.com to try it yourself.

July 25, 2022 (updated August 29, 2022) by Nirupam Biswas

RunPage tech overview: Danfo.js integration

In my previous post – RunPage tech overview: JS Sandboxing, I discussed how I handled client-side sandboxing of script-block codes. Here I will describe how Danfo.js was integrated into RunPage API.

Why integrate Danfo.js?

Relational DBs are quite popular because they allow us to query for a variety of information using SQL statements. SQL is very versatile and powerful. However, that requires hosting a relational DB on the server side, which means all data needs to be imported into the DB and fetched back to the client side for processing. So there go the speed and data security features of RunPage.

Some modern browsers support Web SQL, which allows storing all the data right in the browser and solves the above two fundamental issues. But then we need to clear and rebuild the DB on every run. Remember, every run of RunPage starts with a clean slate. We would also need to provide an ORM to make working with Web SQL easier, as SQL is inherently quite verbose and the table creation and data insertion process is pretty cumbersome. What we really need is SQL’s query power without its other pains. The answer to that is virtual tables, or DataFrames.

The DataFrame concept was made popular by Python’s Pandas. Danfo.js is meant to provide the power of Pandas in Javascript.

Using Proxy

ES6 has a neat little API called Proxy. I have used it extensively while integrating Danfo.js into RunPage. If you do not know what Proxy is: in short, it allows you to wrap any JS object so that any method call, property access, etc. on the wrapped object can be intercepted by your code, which can then change the outcome completely. It differs from a simple hand-written wrapper in that the instanceof operator still works with the wrapper as it would with the wrapped object, and you need not reimplement all the methods of the wrapped object to intercept them. A single handler in your wrapper can intercept all method and property accesses.
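To make the description above concrete, here is a small standalone sketch. The Account class is made up purely for illustration and has nothing to do with RunPage or Danfo.js.

```javascript
class Account {
  constructor(balance) {
    this.balance = balance;
  }
  deposit(amount) {
    this.balance += amount;
    return this.balance;
  }
}

const log = [];
const account = new Account(100);

// One `get` trap sees every property and method lookup on the wrapper.
const wrapped = new Proxy(account, {
  get(target, prop, receiver) {
    log.push(String(prop));
    const value = Reflect.get(target, prop, receiver);
    // Bind methods to the real object so they run against it directly.
    return typeof value === 'function' ? value.bind(target) : value;
  },
});

console.log(wrapped instanceof Account); // true
console.log(wrapped.deposit(50)); // 150
console.log(log); // [ 'deposit' ]
```

Note that the instanceof check does not go through the get trap at all, and that no method of Account had to be reimplemented on the wrapper.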

The reason I had to wrap Danfo.js’ DataFrame and Series is their plot API – https://danfo.jsdata.org/api-reference/plotting/line-charts. As you can see there, plot is a method which takes as input the id of the DOM element where the graph is to be plotted. However, you do not have DOM access in a script-block, and I do not want you to have that access since RunPage needs to be able to decide where to render your graph. I use Proxy to replace DataFrame’s and Series’ plot with my own version, which eventually returns a JSON that the main thread code can interpret as instructions to render the plot.

Conceptually this was simple, but it was quite challenging to actually implement because any operation or chain of method calls can eventually return a DataFrame or Series object. So RunPage recursively wraps all objects returned by the wrapper object, including Arrays, Functions, etc.
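A rough sketch of such recursive wrapping, under stated assumptions: FakeFrame is a stand-in class and not the real Danfo.js DataFrame, the wrap and unwrap helpers are hypothetical, and the JSON shape only mimics the one shown in the next section. A production version would also have to respect the Proxy invariants for non-writable, non-configurable properties.

```javascript
// Stand-in for a Danfo-like object; NOT the real Danfo.js classes.
class FakeFrame {
  constructor(rows) {
    this.rows = rows;
  }
  head(n) {
    return new FakeFrame(this.rows.slice(0, n));
  }
  plot(domId) {
    throw new Error('needs DOM access');
  }
}

const rawTargets = new WeakMap(); // proxy -> object it wraps
const unwrap = (value) => rawTargets.get(value) ?? value;

function wrap(value) {
  // Primitives pass through untouched.
  if (value === null || (typeof value !== 'object' && typeof value !== 'function')) {
    return value;
  }
  const proxy = new Proxy(value, {
    get(target, prop) {
      if (prop === 'plot' && target instanceof FakeFrame) {
        // Replacement plot: any chart method just describes the plot as JSON.
        return () => new Proxy({}, {
          get: (_, method) => (...args) => ({
            $renderAs: 'plot',
            data: { method: String(method), args, dataframeOrSeriesJ: target.rows },
          }),
        });
      }
      return wrap(Reflect.get(target, prop));
    },
    apply(target, thisArg, args) {
      // Run the real method on the real object, then wrap whatever it returns.
      return wrap(Reflect.apply(target, unwrap(thisArg), args));
    },
  });
  rawTargets.set(proxy, value);
  return proxy;
}

const df = wrap(new FakeFrame([{ a: 1 }, { a: 2 }, { a: 3 }]));
const instruction = df.head(2).plot().pie({ values: 'a' });
console.log(instruction.data.method); // pie
```

The object returned by head() is itself wrapped, so the replacement plot is still in effect after chaining.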

How the main thread renders the graphs

The JSON object returned by the proxy plot is something like below.

{
   "$renderAs":"plot",
   "data":{
      "method":"pie",
      "args":[
            // Arguments passed to plot.pie()
      ],
      "dataframeOrSeriesJ": // DataFrame or Series data as JSON
   }
}

Using these data a Danfo DataFrame or Series is recreated and the actual plot method is invoked. In the above example the invocation code will be something like dfObject.plot('generatedDomId').pie(...args).

The main thread runs a series of output formatters, each of which is meant to render a particular type of JSON. The DOM returned by an output formatter is then used by another sub-routine to finally add it at the appropriate location on the page. So the plot output formatter does not know when its DOM will be added to the page. But the above code can only run to actually render the graph once the DOM has been added. For this another trick is used.

const id = uuidv4(); // unique id for the div which will hold the plot
const {method, args, dataframeOrSeriesJ} = json.data;

const div = document.createElement('div');
div.innerHTML = `<div class="outbox plot" id="${id}"></div>
<img data-id="plotLoader" src="${dummyImg}" style="height:1px;width:1px;" />`;
// The img's load handler does the actual plotting.
div.querySelector('[data-id="plotLoader"]').addEventListener('load', function onPlotLoaderLoad() {
    // Rebuild the DataFrame or Series from the JSON sent by the worker.
    const dataframe = DataFrameOrSeriesJsonToDataFrameOrSeries(dataframeOrSeriesJ);
    const plotter = dataframe.plot(id);
    // Same as calling e.g. plotter.pie(...args).
    plotter[method](...args);
});

return div;

Here we generate a div with a unique id which we later pass to the plot method. The neat trick to note here is the use of the img tag. The src attribute contains the path to an actual one-pixel transparent image. When the image is loaded the browser invokes its load event handler, which in turn invokes the actual plot function. Since the img tag comes after the plot’s div, we can be sure that by the time the img’s load event fires the div with the given id is already available.

Problems with Danfo.js

It claims to be the equivalent of Pandas in Javascript, but in reality it provides only a fraction of the tools Pandas does. It does not even provide a ‘not’ or ‘invert’ operator for querying data.

As of now I have filed three defects, none of which has been assigned to anyone or seen any activity yet. The first of them was filed 20 days back. So it looks like activity on this codebase suddenly died out after April 2022.

Judging by the kind of issues I have found, the quality of this library is very poor. For example, it seems to be meant to process only numbers, strings and boolean data. If you store other JSON objects in a DataFrame, it won’t complain but will silently give you wrong and unexpected results. (ref) There are a lot more fundamental issues which make it unreliable. In data processing the one thing which cannot be compromised on is reliability; what is the point of processing data if you cannot rely on the output? It even has a defect filed which claims that the package for the current latest version, 1.1.1, on NPM contains old code – https://github.com/javascriptdata/danfojs/issues/462; this defect is more than a month old and still has zero activity.

So many issues, and on top of that it has a dependency on @tensorflow/tfjs, which I do not need at all.

Given all these factors I am considering ripping Danfo.js out of RunPage and replacing it with Data-Forge. However, I will first evaluate that extensively so as not to repeat the mistake I made by integrating with Danfo.js.

July 25, 2022 (updated August 29, 2022) by Nirupam Biswas

RunPage tech overview: JS Sandboxing


In this post I will explain how RunPage runs the sandboxed Javascript code in your browser.

How the sandboxing works

It achieves sandboxing by running the provided code inside a dedicated Web Worker. The worker first obtains the constructor for async functions using the following code.

const AsyncFunction = Object.getPrototypeOf(async function(){}).constructor;

This constructor is used to create an async function with the script-block code as the function body, which is then executed as below.

try {
    const f = new AsyncFunction("globalThis", "api", "\"use strict\";\n" + scriptBlockCode);
    result = f(SharedGlobal, Api);
} catch (e) {
    // Report script error
}

SharedGlobal is the globalThis object through which script-blocks on a page can share objects among themselves. Api provides access to all the APIs provided by RunPage.
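The snippet above can be tried outside RunPage with a small standalone sketch. The runScriptBlock helper and the api object with its double method are made up for this illustration; only the AsyncFunction technique itself comes from the post.

```javascript
const AsyncFunction = Object.getPrototypeOf(async function () {}).constructor;

// Hypothetical helper: runs one script-block and resolves with its return value.
async function runScriptBlock(scriptBlockCode, sharedGlobal, api) {
  // A syntax error in the script is thrown by the constructor; a runtime
  // error rejects the promise returned by f(). Both reject this function.
  const f = new AsyncFunction('globalThis', 'api', '"use strict";\n' + scriptBlockCode);
  return await f(sharedGlobal, api);
}

const shared = {};
const api = { double: async (n) => n * 2 }; // stand-in for the real Api object

runScriptBlock('globalThis.answer = await api.double(21); return globalThis.answer;', shared, api)
  .then((result) => console.log(result, shared.answer)); // 42 42
```

Because the first parameter is named globalThis, the script sees the shared object under that name instead of the worker’s real global object, and await works directly inside the script body.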

The worker is instantiated when the page is executed. The same worker instance is used for all script-blocks on the page and is disposed of when the execution is complete. So for every run a new worker instance is created, and it is disposed of as soon as the run ends. This ensures that no memory leak persists from one run to another and that state is properly reset on every run.

The use of a worker also ensures that there is no DOM access; however, other browser APIs like fetch are available.

The main thread, which initiates the worker, passes the code of each script-block to the worker one by one. Only when the code of the first script-block has executed and the main thread has received its output does it send the code of the next script-block for execution. This means that on error the main thread can terminate the process then and there and skip the rest of the script-blocks. It also allows the main thread to set a time limit for each script-block execution. If it does not hear from the worker within the set time it can destroy the worker, effectively killing that run.

Finally, the use of a worker ensures that the UI is not frozen while the script-block code is running.
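The one-by-one execution with a time limit can be sketched roughly as below. All names are hypothetical: executeInWorker stands for whatever posts the code to the worker and resolves with its output, and destroyWorker stands for terminating the worker.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Sends script-blocks one at a time; stops at the first error or timeout.
async function runBlocksInOrder(blocks, executeInWorker, timeoutMs, destroyWorker) {
  const outputs = [];
  for (const code of blocks) {
    try {
      outputs.push(await withTimeout(executeInWorker(code), timeoutMs));
    } catch (err) {
      destroyWorker(); // kills the run; the remaining blocks are skipped
      return { outputs, error: err.message };
    }
  }
  return { outputs, error: null };
}
```

A block that never answers is indistinguishable from a slow one, so the timeout is the only way the main thread can regain control in that case.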

Challenges with the implementation

The biggest challenge is passing data between the main thread and the worker. The browser automatically serializes objects when passing them between these two domains. However, some objects, such as functions, cannot be serialized. So many complex objects are converted into JSON before being sent across the domains.

Some apis provided by RunPage allow access to other blocks on the page, like file selector, input and table blocks. These actually require access to those blocks’ DOMs. The api on the worker side does this by passing instruction messages to corresponding “server” code living on the main thread. The code on the main thread accesses the DOM, gets the appropriate data from it and passes it back to the worker.

There is one more challenge which I have not been able to solve yet: reporting clear, precise errors. Right now the stack trace is captured and presented as output to the page user, but the stack trace includes code lines from the worker and hence can be confusing to the end user. It also does not report clearly which exact line and column of the script-block code ran into the error. Fortunately the code can still be debugged by putting a debugger statement in the script-block code and opening the browser console. The browser will correctly pause at that point and the full browser debugging facility can be used.
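The serialization limit is easy to reproduce. The sketch below assumes a runtime that exposes the global structuredClone function, which applies the same cloning algorithm that postMessage uses; the sample object is made up.

```javascript
// An object carrying a function, as many library objects do.
const value = { name: 'report', rows: [1, 2, 3], format: (n) => n.toFixed(2) };

let cloneErrorName = null;
try {
  structuredClone(value); // what postMessage would attempt internally
} catch (err) {
  cloneErrorName = err.name;
}
console.log(cloneErrorName); // DataCloneError

// Converting to JSON first drops the function and keeps the plain data.
const safe = JSON.parse(JSON.stringify(value));
console.log('format' in safe); // false
console.log(safe.rows.length); // 3
```

The price of the JSON round trip is that anything JSON cannot represent is lost, so the receiving side has to rebuild richer objects from the plain data.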

February 3, 2020 (updated February 7, 2020) by Nirupam Biswas

Deep dive into new tax regime of Budget 2020

Budget 2020 is a mixed bag, and for the first time it provides the option to choose your tax slabs. You have two to choose from: the well understood old one and the new one. The new one offers lower tax slabs but without any deductions (except for a select few like 80CCD(2)).

There is nothing simple about having two slabs as options. Let’s compare the two and try to understand which one is better and under what circumstances.

Circumstance is the key word here; hence there are so many articles and videos which try to explain this using specific examples. I am not going into specific cases. Instead I will use the power of graphs to plot all possible scenarios from income level 0 to 2Cr. This will hopefully provide some more insight into this mess.

In the below interactive chart the blue line is the tax amount (including applicable surcharge and 4% cess) as per the new slab. The orange line is the tax as per the old slab but without claiming any deductions. It is clear that purely slab-wise the new plan is lighter on tax. The jumps at the 50L and 1Cr points are due to surcharges: 10% after 50L and 15% after 1Cr. Irrespective of that, the tax as per the new slab increases linearly, like the old slab, while maintaining almost the same difference.

See the Pen New vs Old Tax Comparision (data only) by Nirupam (@applegrew) on CodePen.

Interactive chart 1

From the graph above it might look like the taxes from both slabs are exactly equidistant, but if we zoom in on the green line at the bottom, we can see that this is not exactly the case.

The difference increases as we move towards higher income. It is fixed after 15L slab. After that it increases in steps at 50L & 1Cr points.


Zooming into the portion before 15L shows a pretty unpredictable “wavy” difference. That means predicting whether you will lose or gain by using the new plan is much harder here. What is clear is that as your income increases you need to claim more deductions to benefit from the old slabs.

The below interactive chart shows the amount of deductions you need to claim in old slab to just match the tax benefits you get from new slab. In the topmost interactive chart this data is shown by red line near the bottom of the chart.

See the Pen Tax deductions comparision only (data only) by Nirupam (@applegrew) on CodePen.

Interactive chart 2

From the 15L point it is pretty much fixed. You need to claim more than 2.5L of deductions to benefit from the old slab. If you cannot, then switch to the new slab. Out of that 2.5L you get the 50k Standard deduction for free, so what is left is a 2L deduction. For that you need to max out your 80C, and NPS or 80D. If you have a home loan then it is easier because you can claim up to 2L per annum of the interest you paid on it. However, loans typically have more interest component towards the start and more principal at the end. To see how much interest you are paying year-wise see – https://blog.applegrew.com/2019/01/calculating-amortisation-schedule-of-your-loans/.

The big dips right after the 50L and 1Cr points are due to surcharges. Even claiming a small deduction can bring your income down to a slab where the surcharge is zero or lower, making your taxes match the gain in the new slab. However, this lasts for a range of approximately 4.5L.

Let’s have a look at the range before 15L point more closely.

From 5L to 7.5L range the required deduction linearly increases from zero to 1.24L. So if you max out your 80C then that is good enough reason for you to keep using old slab.

From 7.5L to 10L range the rate of increase in required deduction amount lessens. At 10L point the required deduction is 1.88L. This plateaus out and continues until 12L point like that. Removing 50k, we are left with 1.38L deductions to fulfil. Here too if you just max out your 80C then old slab is great for you.

12L to 12.5L is one small range, followed by the 12.5L to 15L range. The required deduction for these ranges varies from 1.88L to 2.08L and from 2.08L to 2.5L respectively. Here you need to pretty much max out 80C along with NPS or 80D, or you should have a home loan.
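The break-even deductions quoted above can be reproduced with a short script. This is only a sketch under stated assumptions: the slab tables are the Budget 2020 rates as I understand them, the surcharge is modelled only up to the 2Cr income range discussed here, and the 87A rebate and marginal relief are ignored. The function names are made up for this example.

```javascript
// Slabs as [upper limit in rupees, rate in percent]; assumed Budget 2020 rates.
const OLD_SLABS = [[250000, 0], [500000, 5], [1000000, 20], [Infinity, 30]];
const NEW_SLABS = [[250000, 0], [500000, 5], [750000, 10], [1000000, 15],
                   [1250000, 20], [1500000, 25], [Infinity, 30]];

// Tax from slabs alone, before surcharge and cess.
function slabTax(income, slabs) {
  let tax = 0;
  let lower = 0;
  for (const [upper, ratePercent] of slabs) {
    if (income <= lower) break;
    tax += ((Math.min(income, upper) - lower) * ratePercent) / 100;
    lower = upper;
  }
  return tax;
}

// Adds surcharge (10% above 50L, 15% above 1Cr) and 4% cess.
function totalTax(income, slabs) {
  const surchargePercent = income > 10000000 ? 15 : income > 5000000 ? 10 : 0;
  const withSurcharge = (slabTax(income, slabs) * (100 + surchargePercent)) / 100;
  return (withSurcharge * 104) / 100;
}

// Smallest deduction (in whole rupees) at which the old slab's tax is no
// higher than the new slab's tax. Binary search works because tax never
// decreases as taxable income grows.
function breakEvenDeduction(income) {
  const target = totalTax(income, NEW_SLABS);
  let lo = 0;
  let hi = income;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (totalTax(income - mid, OLD_SLABS) <= target) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}

console.log(breakEvenDeduction(1000000)); // 187500
console.log(breakEvenDeduction(1500000)); // 250000
```

These figures are total deductions, so they include the 50k Standard deduction that the old slab gives for free.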

Finally

HRA is also a significant amount which I have not considered here. All in all, figure out your gross income, then use interactive chart 2 and locate your income level on the x-axis. That gives you the minimum deduction amount you need to claim to benefit from the old slab. Add up all your actual deductions and see if they meet that requirement.

However, if even with the flat 50k deduction you find that you need to switch to the new slab, then you are not saving enough!

Addendum

Update1: I almost forgot about Standard deduction of 50k which you get in old slabs but not in new one. Updated the article accordingly.

January 19, 2019 (updated September 6, 2020) by Nirupam Biswas

Calculating Amortisation Schedule of your loans


Be it a home loan, a car loan or any other loan, they all follow the same style of calculation. These calculations are non-trivial, and they become more complicated if you are going to make prepayments.

To make it all easy I have built an app to do all the hard work for you. You will get insight into how much interest you are really paying on your “low interest” rate loans, and how a small prepayment made early can save you big money. It also gives you visibility into exactly how much interest vs principal you are repaying in each of your EMIs.

If you are new to this then the last statement might not be very clear to you. Each month you pay a fixed amount as EMI (Equated Monthly Instalment). However, each of those EMIs pays off part of the principal (the loan amount you borrowed) and part of the overall interest. The interesting thing is that the share of principal and the share of interest you pay in each EMI are not fixed. Towards the start of the loan period the interest part is larger and the principal part is smaller. As the loan progresses their ratio gradually approaches 1:1, typically somewhere around the middle of a long tenure; then it flips and the principal part becomes larger than the interest part. By the last EMI almost all of the payment is principal, at which point the loan is fully repaid.

That is why making lump sum prepayments towards the end of the loan tenure is not too beneficial, as you are mostly repaying principal which you need to repay anyway. Where you can save is the interest part. Always keep in mind that the amount of interest you pay is directly proportional to the amount you borrowed and the time you take to return that amount.
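The changing split between interest and principal can be seen with a few lines of code. This is a sketch of the standard reducing-balance EMI formula, assuming monthly compounding and a non-zero interest rate; it is not the code of the calculator itself, and the loan figures are made up.

```javascript
// Builds the month-by-month split of each EMI into interest and principal.
function amortisationSchedule(principal, annualRatePercent, months) {
  const r = annualRatePercent / 12 / 100; // monthly interest rate (must be > 0)
  const growth = Math.pow(1 + r, months);
  const emi = (principal * r * growth) / (growth - 1);

  const schedule = [];
  let balance = principal;
  for (let month = 1; month <= months; month++) {
    const interest = balance * r; // interest accrues on what is still owed
    const principalPart = emi - interest; // the rest of the EMI reduces the loan
    balance -= principalPart;
    schedule.push({ month, interest, principalPart, balance });
  }
  return { emi, schedule };
}

// Example: a 50L loan at 8.5% per annum for 20 years.
const { emi, schedule } = amortisationSchedule(5000000, 8.5, 240);
const first = schedule[0];
const last = schedule[schedule.length - 1];

console.log(emi.toFixed(2));
console.log(first.interest > first.principalPart); // true
console.log(last.interest < last.principalPart); // true
```

A prepayment can be simulated by subtracting it from the balance in the month it is made; the earlier that happens, the more future interest it removes.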

Play with different values in the calculator below and notice the pattern in the graph.

April 25, 2016 by Nirupam Biswas

Displaying a DOM while dragging

First check out the demo below.

See the Pen EKeVXZ by Nirupam (@applegrew) on CodePen.

There are no drop targets so you won’t be able to drop it anywhere.

Code for this directive can be seen on https://github.com/applegrew/drag-dom

The basic trick applied by agDragDom is setting a one-pixel transparent gif as the drag image. Since the drag image is set based on how the image “looks” on the webpage, it has to be added to the document and should be visible. However, since it is transparent and only one pixel in size, it remains hidden in plain sight.

The second trick was to clone the DOM being dragged and absolutely position the cloned DOM, such that it is always at a particular location with respect to the drag pointer location. If you want an effect like drag to move, and want that it should feel like the user is moving the actual DOM, then on dragstart, hide the actual DOM after cloning it.

However, as is evident from the directive code, the trick is not straightforward to implement. This code works well on Chrome. On Firefox it has trouble animating the drag over elements which do not bubble up the dragover event. It has not been tested on any other browsers as of now.
