Web scraping a government portal

INTRODUCTION

Not every portal an organisation works with offers an API for integrations. Government portals in particular are often closed and heavily protected. That does not have to stand in the way of a reliable connection. Numerix was asked to read data from a government portal on an ongoing basis.

The primary objective was to use the data for the monthly preparation of sales invoices, based on the figures shown on the portal. As a by-product, we were able to provide a PowerBI report that is more comprehensive than the original site.


APPROACH

Numerix built the link almost entirely from open-source components. The only exception is the data visualisation, which uses Microsoft's PowerBI and requires a licence. The link consists of three flows:

  1. Every hour, an automated process logs on to the government portal. For this purpose, a Dagster orchestrator is deployed that communicates with a Selenium browser and with a Vault that holds the user credentials. Each execution reads the data that has been loaded on the portal since the previous execution (one hour earlier). The result is a structured database that makes the data easily available.

     
  2. Three times a day, data is uploaded via an on-premise gateway to a Data Lake in Azure. Once the data is in the cloud, end users can consult it in a PowerBI report.

     
  3. Every month-end, the data is interpreted and flat rates are recorded in the invoicing software. The numbers on the sales invoices are thus directly linked to the data read out on the government portal.
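
The hourly flow in step 1 reads only what was loaded since the previous execution. A minimal sketch of that incremental pattern is shown below, using only the Python standard library. It is not Numerix's implementation: the table names, the row layout and the `fetch_rows_since` callable (which stands in for the Selenium session on the portal) are all assumptions made for illustration, and the Dagster scheduling and Vault lookup are left out.

```python
import sqlite3
from typing import Callable, Iterable

# Hypothetical row layout: (record_id, loaded_at as ISO-8601 UTC string, value).
Row = tuple[str, str, float]

EPOCH = "1970-01-01T00:00:00+00:00"


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (assumed) target table and a single-row watermark table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS portal_rows ("
        "record_id TEXT PRIMARY KEY, loaded_at TEXT NOT NULL, value REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermark ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), last_loaded_at TEXT NOT NULL)"
    )


def get_watermark(conn: sqlite3.Connection) -> str:
    """Return the timestamp of the newest row stored so far."""
    row = conn.execute("SELECT last_loaded_at FROM watermark WHERE id = 1").fetchone()
    return row[0] if row else EPOCH


def run_once(
    conn: sqlite3.Connection,
    fetch_rows_since: Callable[[str], Iterable[Row]],
) -> int:
    """One scheduled execution: fetch rows newer than the watermark and store them.

    fetch_rows_since is a placeholder for the scraping step. Timestamps are
    compared as strings, which is only valid if they share one ISO-8601
    format and time zone.
    """
    since = get_watermark(conn)
    rows = list(fetch_rows_since(since))
    with conn:  # rows and watermark are committed together or not at all
        conn.executemany("INSERT OR REPLACE INTO portal_rows VALUES (?, ?, ?)", rows)
        if rows:
            newest = max(r[1] for r in rows)
            conn.execute(
                "INSERT OR REPLACE INTO watermark (id, last_loaded_at) VALUES (1, ?)",
                (newest,),
            )
    return len(rows)
```

Storing the watermark in the same transaction as the rows means a failed run can simply be retried: nothing is skipped, and the primary key on `record_id` prevents duplicates if a row is read twice.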
     

RESULT

By linking to the government portal, sales invoices are prepared automatically, saving manual labour. The client notes that far fewer complaints are received, as invoices are prepared faster and no longer contain errors. If a complaint about a sales invoice does come in, the PowerBI report helps to clarify the figures.
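
To make the automatic preparation of invoices more concrete, here is a minimal sketch of how portal records could be turned into invoice lines at month-end. The actual interpretation rules and the invoicing software are not described here, so everything in the sketch is hypothetical: it assumes each portal record counts as one billable unit, that one flat rate applies per unit, and that records carry a customer name and an ISO-8601 timestamp.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable


@dataclass(frozen=True)
class InvoiceLine:
    """Hypothetical invoice line, one per customer per period."""
    customer: str
    period: str      # "YYYY-MM"
    quantity: int
    amount: Decimal


def build_invoice_lines(
    records: Iterable[tuple[str, str]],  # (customer, loaded_at ISO-8601 string)
    period: str,
    flat_rate: Decimal,
) -> list[InvoiceLine]:
    """Count the records per customer within the period and price them.

    Assumption: one record equals one unit billed at flat_rate. Decimal is
    used instead of float so that amounts are exact.
    """
    counts: dict[str, int] = defaultdict(int)
    for customer, loaded_at in records:
        if loaded_at.startswith(period):
            counts[customer] += 1
    return [
        InvoiceLine(customer, period, n, flat_rate * n)
        for customer, n in sorted(counts.items())
    ]
```

Because every line is derived from the stored portal records, a disputed quantity on an invoice can be checked against the same data that feeds the PowerBI report.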