Updated date:

Data Engineering using Excel and VBA and MSHTML.HTMLDocument

Kevin is a data engineer and advanced analytics developer. He has over 20 years experience in the field.

Data Engineering is the science of researching and extracting data from various sources and architecting data models and data sets which are stored in data warehouses for use with Machine Learning, Advanced Analytics, other formats of data science endeavors. Data is either extracted in batches or streaming or even on demand or scraped from web pages. Data Engineers design and build the data pipelines that data science runs on.

Basic ETL model

Basic ETL model

Regardless what many voices are saying, Data Engineering can be done using almost every programming language if the language has libraries, frameworks or other APIs that allow it to connect to a data source and a data target, including VBA (Visual Basic for Applications).

Most sources are either business systems like SAP or Oracle’s eSuite, or database systems (corporate or external), web services, files. Most often, the target is a data warehouse which can consist of one database or multiple databases called data marts or one large database, depending on your design.

The one tool that is ubiquitous in Data Engineering and Data Sciences is Excel. You can use Excel as a data source (xlsx or csv files), or for data analysis, profiling and cleansing and even, to a certain extent, as a data target. You can include Excel files as part of your data flow DAG (Directed-Acyclic-Graph)

DAG model

DAG model

which allows a data engineer (developer) to orchestrate to flow and transformation and storage of the data to be used by data analyst, scientists, Lob people. While Excel has many default data connectors to handle many data connections, there are situations (use cases) that require more granular control over the data and VBA is great solution. Since VBA integrates with Microsoft .NET and many other Microsoft APIs, the possibilities are almost endless for data extraction, wrangling and storage.

Excel offers a lot of great of data connectors natively through the “Get Data” commands, however, there are numerous occasions where you will need more granular control over the data extraction process, like for instance, if you need to extract unstructured data from log files, or you need to scrape multiple parts of a web page. For these situations, you can use VBA (Visual Basic for Applications).

In this article, I will cover two APIs: MSHTML.HTMLDocument, SHDocVw.InternetExplorer API to extract data from various Internet based sources.

MSHTML.HTMLDocument

HTMLDocument is part of the VSTO (Visual Studio Office Tools) API and provides a handle to a HTML document. You can use the methods and properties to extract certain parts of a web page, for instance a table or a collection of words. This is known as web page scraping.

The HTMLDocument reads the HTML elements on a web page and extracts the content from that element.

For the following example, I will use the country list and associated populations on Wikipedia at: https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations) . Specifically, I want to extract the data from the table. Of course, for this simple example, I could use the web page connector in Excel, but I want to demonstrate how with VBA.

In order to understand and extract the data, you need to analyze and explore the structural source of the document. If you are using Google Chrome, click the F12 key to open the Developer view from the Wikipedia page. See figure 1 below or view the video at https://youtu.be/lqdZImk4P6A .

Explore HTML Document Structure

Explore HTML Document Structure

We will need to extract the table contents, however, from the video you can see that this page has a couple of tables, so we will need to locate the second one and extract the data that we need, which is the country name and population columns only.

Usually HTML elements have classes or Ids (names) that computer programming languages and stylesheets (CSS) can reference to interact with the specific element but in this case, we don’t have that so we will need to get a list of Table elements and loop through the sub-elements and extract the data.

From the VBA Editor in Excel, under the Tools menu you will find the References sub-menu. Scroll through the list and select “Microsoft HTML Object Library” and “Microsoft Internet Controls”. The first library contains all the APIs for working with HTML Document Object Model and the latter is the API Interface objects for Internet Explorer which allows you to use Internet Explorer without opening the application manually.

Add references

Add references

Next, we will need to define the Internet Explorer and HTMLDocument objects

From the open VBA editor, double-click on Sheet1 object in the “Project – VBA Project” explorer pane on left (From the “View” menu, select “Project Explorer” or ctrl+R) to open the “Sheet1” object in the editor

Define a new macro method as follows (type sub ExtractTableData and press Enter key and the editor will auto complete the code)

Sub ExtractTableData()
    Dim ieObj As InternetExplorer
    Dim htmlDoc As HTMLDocument
    Dim htlmInnerText As String
    Dim htmlElementCol As IHTMLElementCollection
    Dim htmlElement As IHTMLElement
    Dim tableChild As IHTMLElement
     
    Set ieObj = New InternetExplorer
    
    
    ieObj.Visible = False
    
    ' you could reference the url from a cell also or a custom input box. Value hard coded for simplicity
    ieObj.navigate "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
    
    ' it can be several seconds below a page completely loads, so we need to loop until the page is completely loaded into memory
    Do While ieObj.readyState <> READYSTATE_COMPLETE
        Application.StatusBar = "Loading web page..."
        DoEvents
    Loop
    
    ' get the document that is loaded in InternetExplorer
    Set htmlDoc = ieObj.document
     

    Set htmlElementCol = htmlDoc.getElementsByClassName("headerSort")
    
    Dim rowNumber As Integer
    Dim colNumber As Integer
    Dim tableheader As String
   
    rowNumber = 1
    colNumber = 1
    Cells.Clear
    
    ' header row
    For Each tableChild In htmlElementCol
       'dataList = Split(tableChild.innerHTML, " ")
       If rowNumber = 1 Then
        tableheader = tableChild.innerText
        Cells(rowNumber, colNumber).Value2 = tableheader
       End If
       colNumber = colNumber + 1
    Next
    ' finished setting row header
    rowNumber = rowNumber + 1
    colNumber = 1
        
    Set htmlElementCol = htmlDoc.getElementsByTagName("table")(3).getElementsByTagName("tbody")(0).getElementsByTagName("tr")
    For Each Row In htmlElementCol
        For Each Value In Row.Children
            Cells(rowNumber, colNumber).Value2 = Value.innerText
            colNumber = colNumber + 1
        Next
        rowNumber = rowNumber + 1
        colNumber = 1
       
    Next

    Worksheets("Sheet1").Range("A1:F1").Columns.AutoFit
    Set ieObj = Nothing
    Application.StatusBar = ""
    
    
    
End Sub

Using the macro function

You can attach the macro function to a button or menu item to run on demand as follows:

Option 1: You can run the macro from the Developer menu by clicking on Macros button and selecting the ExtractTableData macro that we just created

Run macro from Developer menu

Run macro from Developer menu

Option 2: Add to a button

From the Insert menu item in the Developer section, select the button control and using your mouse cursor “draw it” on a open section of the sheet1 worksheet.

Add button control

Add button control

Assign macro to button

Assign macro to button

Option 3: Add to the ribbon

From the File menu, select Options, then Quick Access Toolbar. Select Macros from Choose commands from, then select Sheet1.ExtractTableData and click Add button to assign the Macro to the ribbon

Add macro to Quick Access Toolbar

Add macro to Quick Access Toolbar

You can access the macro from the top Quick Access Toolbar.

Quick Access Toolbar

Quick Access Toolbar

Conclusion

In this tutorial I showed you how to scrape data from a website using MSHTML.HTMLDocument API and VBA. For simple tasks, you can extract the data directly using the Web connector or the OData connector as in this article Data Engineering Using Excel.

This article is accurate and true to the best of the author’s knowledge. Content is for informational or entertainment purposes only and does not substitute for personal counsel or professional advice in business, financial, legal, or technical matters.

© 2020 Kevin Languedoc