RunPage tech overview: AWS Textract integration

RunPage now supports two new APIs

api.danfo.readTablesFromImageFile(ImageFile)
api.danfo.readTablesFromPdfFile(PdfFile)

As their name suggest they can convert any provided image or Pdf files and use AI to extract Danfo DataFrames from them!

Checkout below for a demo of it:

AWS Textract

The AI intelligence is provided by Amazon Web Service’s Textract service. This service accepts image and Pdf binary data and returns a whole host of meta data, like the individual words, lines etc. It can also recognise Forms and Tables, of which only the Tables recognition capability is used.

I also checked out competing product from Google – Document AI. The result from Textract were stellar. But results from preliminary tests using Document AI were not that great.

Challenges with authentication

Since the mantra of RunPage is to not send user data, like uploaded files to server-side for processing so that means the Textract API will need to be called from the browser itself. That means sharing server-side credentials. Right now the same AWS credential is used for all users, but if we leak it to the browser then enterprising users can capture those and use them elsewhere too.

RunPage solves it by getting a temporary credential before each run. The JS sandbox asks the server for a temporary short-lived temporary credential. The server gets them from AWS using AWS STS (Security Token Service) API and passes it on to the browser.

Challenges with handling PDF

RunPage uses synchronous API of Textract to analyse the images. However, that API does not support PDF files. For PDF the async version of the API is required. That version has loads of challenges.

First, it won’t accept PDF binary directly, instead it requires that we upload the PDF to S3 bucket and provide that location to the API. The API will then directly get it from the bucket. This itself is very challenging since now we have to integrate with S3 as well. Also now we need to store the PDF on a remote server, and it will stay there until we delete it from there. And, until the time it is deleted it can be accessed by other users since all share the same AWS account. A huge security issue.

The second issue is when Textract is done processing the PDF it will use AWS SNS service to push notification to RunPage server. So a deployment profile for SNS too is required, and a callback listener API on RunPage server. Also when then API is invoked it has to notify the client somehow, which will requires either Websocket or server polling. This also means that I need to store the response from AWS SNS for sometime in the DB. Too much plumbing just for an API, on top of that the additional cost of using S3 and SNS.

I thought of a long-polling trick which avoids some of these plumbing requirements on backend server side. The client makes a call to backend server for update from AWS SNS, and the backend API simply stores the Express response object in a hash-map. Since the response object has not been used to send any data yet, the request will show as pending on browser side. The timeout value can be set by client. The callback endpoint which will be triggered by AWS SNS can then lookup in that hash-map and pick the appropriate response object and directly send the reply to browser. However, this won’t work if we have more than one nodes running on backend sever side. The backend architecture uses a Nginx server Docker container which acts as reverse proxy for the NodeJs Docker container. Currently only one instance of the NodeJs container runs but the server is specifically stateless so that multiple instances of it can also be run behind load balancer.

Solution to the PDF handling challenges

My solution was to locally (that is in-browser) convert the PDF to images and then use the AWS Textract synchronous API for images.

To locally convert PDF to images I used Mozlilla’s PDF.js. The NPM package of that is called pdfjs-dist. It is a weird name, probably because the pdfjs name is already taken by some other developer.

Using that library also turned out to have its own mini challenge. The library requires that we provide it with a CanvasRenderingContext2D which we can get from <canvas> DOM. However, the sandboxed JS APIs run inside a Worker as described in my blog post – https://blog.applegrew.com/2022/07/runpage-tech-overview-js-sandboxing/. In Web Worker side we do not have DOM access. There is an experimental API called OffscreenCanvas, and browser support for that is not too great yet (in 2022). To handle such scenarios RunPage has some internal plumbing which we make use of here. The PDF File‘s ArrayBuffer is sent from the Worker thread to the main thread with instructions to decode the bytes and return the images of it page by page. The Worker then simply sends each page’s image to Textract for analysis. This has another good side-effect that we can pick and choose the exact pages we would want to analyse and only those pages will get sent. Another ace point for security.

A side note: While working on this I stumbled upon a browser bug which I have reported here - https://bugs.chromium.org/p/chromium/issues/detail?id=1349207. That took a lot of time figure out.

Overall it is cool and super useful feature. Head over to run.applegrew.com to experience yourself.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.