Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.
Data Engineering is the science of researching and extracting data from various sources and architecting data models and data sets which are stored in data warehouses for use with Machine Learning, Advanced Analytics, other formats of data science endeavors. Data is either extracted in batches or streaming or even on demand or scraped from web pages. Data Engineers design and build the data pipelines that data science runs on.
Regardless what many voices are saying, Data Engineering can be done using almost every programming language if the language has libraries, frameworks or other APIs that allow it to connect to a data source and a data target, including VBA (Visual Basic for Applications).
Most sources are either business systems like SAP or Oracle’s eSuite, or database systems (corporate or external), web services, files. Most often, the target is a data warehouse which can consist of one database or multiple databases called data marts or one large database, depending on your design.
The one tool that is ubiquitous in Data Engineering and Data Sciences is Excel. You can use Excel as a data source (xlsx or csv files), or for data analysis, profiling and cleansing and even, to a certain extent, as a data target. You can include Excel files as part of your data flow DAG (Directed-Acyclic-Graph)
which allows a data engineer (developer) to orchestrate to flow and transformation and storage of the data to be used by data analyst, scientists, Lob people. While Excel has many default data connectors to handle many data connections, there are situations (use cases) that require more granular control over the data and VBA is great solution. Since VBA integrates with Microsoft .NET and many other Microsoft APIs, the possibilities are almost endless for data extraction, wrangling and storage.
Excel offers a lot of great of data connectors natively through the “Get Data” commands, however, there are numerous occasions where you will need more granular control over the data extraction process, like for instance, if you need to extract unstructured data from log files, or you need to scrape multiple parts of a web page. For these situations, you can use VBA (Visual Basic for Applications).
In this article, I will cover two APIs: MSHTML.HTMLDocument, SHDocVw.InternetExplorer API to extract data from various Internet based sources.
HTMLDocument is part of the VSTO (Visual Studio Office Tools) API and provides a handle to a HTML document. You can use the methods and properties to extract certain parts of a web page, for instance a table or a collection of words. This is known as web page scraping.
The HTMLDocument reads the HTML elements on a web page and extracts the content from that element.
For the following example, I will use the country list and associated populations on Wikipedia at: https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations) . Specifically, I want to extract the data from the table. Of course, for this simple example, I could use the web page connector in Excel, but I want to demonstrate how with VBA.
In order to understand and extract the data, you need to analyze and explore the structural source of the document. If you are using Google Chrome, click the F12 key to open the Developer view from the Wikipedia page. See figure 1 below or view the video at https://youtu.be/lqdZImk4P6A .
We will need to extract the table contents, however, from the video you can see that this page has a couple of tables, so we will need to locate the second one and extract the data that we need, which is the country name and population columns only.
Usually HTML elements have classes or Ids (names) that computer programming languages and stylesheets (CSS) can reference to interact with the specific element but in this case, we don’t have that so we will need to get a list of Table elements and loop through the sub-elements and extract the data.
From the VBA Editor in Excel, under the Tools menu you will find the References sub-menu. Scroll through the list and select “Microsoft HTML Object Library” and “Microsoft Internet Controls”. The first library contains all the APIs for working with HTML Document Object Model and the latter is the API Interface objects for Internet Explorer which allows you to use Internet Explorer without opening the application manually.
Next, we will need to define the Internet Explorer and HTMLDocument objects
From the open VBA editor, double-click on Sheet1 object in the “Project – VBA Project” explorer pane on left (From the “View” menu, select “Project Explorer” or ctrl+R) to open the “Sheet1” object in the editor
Define a new macro method as follows (type sub ExtractTableData and press Enter key and the editor will auto complete the code)
Using the macro function
You can attach the macro function to a button or menu item to run on demand as follows:
Option 1: You can run the macro from the Developer menu by clicking on Macros button and selecting the ExtractTableData macro that we just created
Option 2: Add to a button
From the Insert menu item in the Developer section, select the button control and using your mouse cursor “draw it” on a open section of the sheet1 worksheet.
Option 3: Add to the ribbon
From the File menu, select Options, then Quick Access Toolbar. Select Macros from Choose commands from, then select Sheet1.ExtractTableData and click Add button to assign the Macro to the ribbon
You can access the macro from the top Quick Access Toolbar.
In this tutorial I showed you how to scrape data from a website using MSHTML.HTMLDocument API and VBA. For simple tasks, you can extract the data directly using the Web connector or the OData connector as in this article Data Engineering Using Excel.
This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.
© 2020 Kevin Languedoc