August 3, 2022August 29, 2022 by Nirupam Biswas

RunPage tech overview: AWS Textract integration

RunPage now supports two new APIs

api.danfo.readTablesFromImageFile(ImageFile)
api.danfo.readTablesFromPdfFile(PdfFile)

As their name suggest they can convert any provided image or Pdf files and use AI to extract Danfo DataFrames from them!

Checkout below for a demo of it:

AWS Textract

The AI intelligence is provided by Amazon Web Service’s Textract service. This service accepts image and Pdf binary data and returns a whole host of meta data, like the individual words, lines etc. It can also recognise Forms and Tables, of which only the Tables recognition capability is used.

I also checked out competing product from Google – Document AI. The result from Textract were stellar. But results from preliminary tests using Document AI were not that great.

Challenges with authentication

Since the mantra of RunPage is to not send user data, like uploaded files to server-side for processing so that means the Textract API will need to be called from the browser itself. That means sharing server-side credentials. Right now the same AWS credential is used for all users, but if we leak it to the browser then enterprising users can capture those and use them elsewhere too.

RunPage solves it by getting a temporary credential before each run. The JS sandbox asks the server for a temporary short-lived temporary credential. The server gets them from AWS using AWS STS (Security Token Service) API and passes it on to the browser.

Challenges with handling PDF

RunPage uses synchronous API of Textract to analyse the images. However, that API does not support PDF files. For PDF the async version of the API is required. That version has loads of challenges.

First, it won’t accept PDF binary directly, instead it requires that we upload the PDF to S3 bucket and provide that location to the API. The API will then directly get it from the bucket. This itself is very challenging since now we have to integrate with S3 as well. Also now we need to store the PDF on a remote server, and it will stay there until we delete it from there. And, until the time it is deleted it can be accessed by other users since all share the same AWS account. A huge security issue.

The second issue is when Textract is done processing the PDF it will use AWS SNS service to push notification to RunPage server. So a deployment profile for SNS too is required, and a callback listener API on RunPage server. Also when then API is invoked it has to notify the client somehow, which will requires either Websocket or server polling. This also means that I need to store the response from AWS SNS for sometime in the DB. Too much plumbing just for an API, on top of that the additional cost of using S3 and SNS.

I thought of a long-polling trick which avoids some of these plumbing requirements on backend server side. The client makes a call to backend server for update from AWS SNS, and the backend API simply stores the Express response object in a hash-map. Since the response object has not been used to send any data yet, the request will show as pending on browser side. The timeout value can be set by client. The callback endpoint which will be triggered by AWS SNS can then lookup in that hash-map and pick the appropriate response object and directly send the reply to browser. However, this won’t work if we have more than one nodes running on backend sever side. The backend architecture uses a Nginx server Docker container which acts as reverse proxy for the NodeJs Docker container. Currently only one instance of the NodeJs container runs but the server is specifically stateless so that multiple instances of it can also be run behind load balancer.

Solution to the PDF handling challenges

My solution was to locally (that is in-browser) convert the PDF to images and then use the AWS Textract synchronous API for images.

To locally convert PDF to images I used Mozlilla’s PDF.js. The NPM package of that is called pdfjs-dist. It is a weird name, probably because the pdfjs name is already taken by some other developer.

Using that library also turned out to have its own mini challenge. The library requires that we provide it with a CanvasRenderingContext2D which we can get from <canvas> DOM. However, the sandboxed JS APIs run inside a Worker as described in my blog post – https://blog.applegrew.com/2022/07/runpage-tech-overview-js-sandboxing/. In Web Worker side we do not have DOM access. There is an experimental API called OffscreenCanvas, and browser support for that is not too great yet (in 2022). To handle such scenarios RunPage has some internal plumbing which we make use of here. The PDF File‘s ArrayBuffer is sent from the Worker thread to the main thread with instructions to decode the bytes and return the images of it page by page. The Worker then simply sends each page’s image to Textract for analysis. This has another good side-effect that we can pick and choose the exact pages we would want to analyse and only those pages will get sent. Another ace point for security.

A side note: While working on this I stumbled upon a browser bug which I have reported here - https://bugs.chromium.org/p/chromium/issues/detail?id=1349207. That took a lot of time figure out.

Overall it is cool and super useful feature. Head over to run.applegrew.com to experience yourself.

July 25, 2022August 29, 2022 by Nirupam Biswas

RunPage tech overview: Danfo.js integration

In my previous post – RunPage tech overview: JS Sandboxing, I discussed how I handled client-side sandboxing of script-block codes. Here I will describe how Danfo.js was integrated into RunPage API.

Why integrate Danfo.js?

Relational DB is quite popular because it allows us to query for variety of information by use of SQL statements. It is very versatile and powerful. However, that requires we host a relational DB at the server-side which further means now all data now needs to be imported into the DB and fetched back to client-side for processing. So there goes the speed and security of data features of RunPage.

Our modern browsers now do support Web SQL which allows to store all data right in the browser which solves the above two fundamental issues. But, we now then need to clear and rebuild the DB on every run. Remember every run of RunPage starts with a clean slate. Also we will need to deal provide an ORM to ensure easier working with Web SQL as SQL is inherently quite verbose and the table creation and data insertion process is pretty cumbersome. What we really need is SQL’s query power without its other pains. The answer to that is virtual tables or DataFrames.

DataFrame concept is made popular by Python’s Pandas. Danfo.js is meant to provide the power of Pandas in Javascript.

Using Proxy

ES6 have a neat little api called Proxy. I have used that extensibly while integrating Danfo.js into RunPage. If you do not know what Proxy is then in short it allows you to wrap any JS object where invoking any method, property, etc. on the wrapped object can be intercepted by your code and you have the ability to change the complete outcome. The way it is different from creating a simple wrapper is that instanceof operator will still work with your wrapper object as it would with the wrapped object, and you need not reimplement all methods etc. of the wrapped object to intercept them. A single method in your wrapper can intercept all method and property calls.

The reason I had to wrap Danfo.js’ DataFrame and Series is because of its plot api – https://danfo.jsdata.org/api-reference/plotting/line-charts. If you notice the api there the plot is a method which takes input a DOM’s id where the graph is to be plotted. However, you do not have DOM access in script-block and I do not want you to have that access since RunPage needs to be able to decide where to render your graph. I use Proxy to to replace DataFrame and Series‘ plot with my own version which eventually returns a JSON which the main thread code can interpret as instructions to render the plot.

Conceptually this was simple but quite challenging to actually implement because any kind of operation and series of method calls can eventually return a DataFrame or Series object. So RunPage wraps all objects returned by the wrapper object recursively including Arrays, Functions, etc.

How the main thread renders the graphs

The JSON object returned by the proxy plot is something like below.

{
   "$renderAs":"plot",
   "data":{
      "method":"pie",
      "args":[
            // Arguments passed to plot.pie()
      ],
      "dataframeOrSeriesJ": // DataFrame or Series data as JSON
   }
}

Using these data a Danfo DataFrame or Series is recreated and the actual plot method is invoked. In the above example the invocation code will be something like dfObject.plot('generatedDomId').pie(...args).

The main thread runs a series of output formatters each of which is meant for to render a particular type of JSON. The DOM returned by the output formatter is then used another sub-routine to finally add it into appropriate location on the page. So, the plot output formatter does not know when its given DOM will be add to the page. Only when the DOM is added then the above code needs to be run to actually render the graph. For this another trick is used.

const id = uuidv4();
const {method, args, dataframeOrSeriesJ} = json.data;

const div = document.createElement('div');
div.innerHTML = `<div class="outbox plot" id="${id}"></div>
<img data-id="plotLoader" src="${dummyImg}" style="height:1px;width:1px;" />`;
div.querySelector('[data-id="plotLoader"]').addEventListener('load', function plotter() {
    const dataframe = DataFrameOrSeriesJsonToDataFrameOrSeries(dataframeOrSeriesJ);
    const plotter = dataframe.plot(id);
    const plotM = plotter[method];
    plotM.apply(plotter, args);
});

return div;

Here we generate a div with a unique id which we later pass to the plot method. The neat trick to note here is the use of img tag. The src attribute contains path to an actual one pixel image transparent. When the image is loaded the browser invokes its load event handler which further invokes the actual plot function. Since the img tag is after the plot’s div hence we can be sure that by the time img‘s load is fired the div with the given id is already available.

Problems with Danfo.js

It claims to be Pandas’ equivalent in Javascript but in reality it provides fraction of the tools when compared with Pandas. It does not even provide a ‘not’ or ‘invert’ operator when querying data.

As of now I have filed three defects which are not even assigned to anyone or has any activity yet. The first of which was filed 20 days back. So it looks like after April 2022 the activity on this codebase has suddenly died out.

Looking at the kind of issues I have found the quality of this library is very poor. For example, it seems it is meant to process only numbers, strings and boolean data. If you store other JSON objects in DataFrame then it won’t complain but silently give you wrong and unexpected results. (ref) There lot more fundamental issues which makes it unreliable. In data processing the one thing which cannot be compromised on is reliability else what is the point of processing data if you cannot be sure if you can rely on its output or not! It even has a defect filed which claims that the current latest version 1.1.1’s package on NPM contains old code – https://github.com/javascriptdata/danfojs/issues/462; and this defect is more than month old and still zero activity on it.

So many issues and on top of that it has dependency on @tensorflow/tfjs which I do not need at all.

Given all these factors I am considering ripping out Danfo.js out of RunPage and replacing it with Data-Forge. However, I will first evaluate that extensibly so as not to commit the same mistake I did by integrating with Danfo.js.

July 25, 2022August 29, 2022 by Nirupam Biswas

RunPage tech overview: JS Sandboxing

Featured Title, Javascript, My Work, Regulars, RunPage
2 Comments

In this post I will explain how RunPage runs the sandboxed Javascript code in your browser.

How the sandboxing works

It achieves sandboxing by running the provided code inside a dedicated Web Worker. The worker first instantiates a constructor of Async function using the following code.

const AsyncFunction = Object.getPrototypeOf(async function(){}).constructor;

This constructor is used to a create an async function with the script-block code as the function body, and executed as below.

try {
    const f = new AsyncFunction("globalThis", "api", "\"use strict\";\n" + scriptBlockCode);
    result = f(SharedGlobal, Api);
} catch (e) {
    // Report script error
}

SharedGlobal is the globalThis object using which script-blocks on a page can share objects among themselves. Api provides access to all the apis provided by RunPage.

The worker is instantiated when the page is executed. The same worker instance is used for all script blocks on the page and is disposed when the execution is complete. So for every run a new worker instance is created and disposed-off immediately. This ensures so memory leak persists from one run to another and the states are properly reset on every run.

The use of worker also ensures that there is no DOM access, however other browser apis like fetch etc. are available.

The main thread which initiates the worker, works by passing code of each script-block to the worker one-by-one. When the code of first script-block is executed and the main thread gets the output then only it sends the code of next script-block for execution. This means on error the main thread can terminate the process then and there and skip the rest of the script-blocks. Also this allows the main thread to set a time limit for each script-block execution. If it does not hear from the worker within the set time it can destroy the worker, effectively killing that run.

Finally the use of worker ensures that the UI is not frozen while the script-block codes are running.

Challenges with the implementation

The biggest challenge is passing data between the main thread and worker. The browser auto serializes objects when passing between these two domains. However, few objects cannot be serialized like functions which have captured a scope, etc. So many complex objects are converted into JSON before sending across the domains.

Some apis provided by RunPage allow access to other blocks on the page, like file selector, input and table blocks. These actually require access to those blocks’ DOMs. The api on the worker side does this by passing instruction messages to corresponding “server” code living on the main thread. The code on the main thread access the DOM and gets appropriate data from them and passes them back to the worker.

There is one more challenge which I have not been able to solve yet. It is reporting clear precise error. Right now the stack trace is captured and presented as output to the page user but the stack trace includes code lines from the worker and hence could be confusing to end-user. Also it does not report clearly which exact line and column in the code in the script-block ran into error. Fortunately the code can still be debugged by putting a debugger statement in the script-block code and opening the browser console. The browser will correctly pause at that point and full browser debugging facility can be used.

আপেল-গ্রেয়াস মাইন্ড

আপেলগৃউয়ার মাইন্ড | Tech & random

Featured

Wifi STB Remote+ (JAVA) for laptop and desktops

Lessons learned from PhoneGap (Cordova) and jQueryMobile on Android

The Emperor’s Old Clothes

Django-Select2: Select2 for Django

ADF Super Code Snippets

Digit Math: Introduction

Exceptional short documentary on the city Delhi

NASA to launch inexpensive Android ‘phonesats’ into space

CInk version 2 finally released!

Migrate from Apache to Nginx: The new guide

Category / RunPage