3. Catalog

3.1 What is the catalog?

Databricks Catalog Explorer provides a UI to explore and manage data, schemas (databases), tables, models, functions, and other AI assets.

Catalog Explorer has two primary functions:

Finding data assets.

For example, you can use Catalog Explorer to view schema details, preview sample data, see table and model details, and explore entity relationships.

Managing Unity Catalog and Delta Sharing.

For example, you can use use Catalog Explorer to create catalogs, create shares, manage external locations, view and change object ownership, and grant and revoke permissions on all objects.

The tables that appear in the catalog can also be accessed by an external data analysis and visualization platform, say Apache Superset. This is only so far as the appropriate configurations have been implemented. The data in the catalog can be analysed using SQL queries from the SQL Lab tab of Apache Superset. For more on Apache Superset and how to run queries and create visualizations, see this.

3.2 The medallion architecture

A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). Medallion architectures are sometimes also referred to as “multi-hop” architectures.

Medallion architecture

The terms bronze, silver and gold refer to as follows:

i. Bronze layer: This phase marks the input of raw data, which is stored as it is collected, usually from a variety of sources and in formats such as CSV or JSON. The data is usually raw data and varies in quality and structure.
ii. Silver layer: At this point, the data is processed and transformed to achieve cleaner, more structured data. Tasks such as filtering, validation and normalisation of the data are carried out and stored in efficient formats. This phase may include defined schemas and additional metadata.
iii. Gold layer: This stage contains data already prepared for analysis and business use. In the Gold layer, advanced transformations and aggregations are performed to create rich data sets. The data is structured, optimised for fast queries and can be enriched with additional information or merged with other data sources for deeper insights.

In short, in a Medallion architecture, the quality and structure of data improves as it passes through each layer. The bronze layer contains raw data, the silver layer contains cleansed and enriched data, and the gold layer contains data that is aggregated and ready to be analysed and integrated into business applications.

How is this related to the catalog? This is because in the catalog, you will see specific folders which reflect the stage at which the data is in.

Let’s see this in action.

3.3. Practical

Click on the Catalog button under the Sidebar section.

Catalog button

Depending on your account, you will see a number of folders, but of key interest are the ns-ii-tech-dev-catalog and the ns-ii-tech-stg-catalog folders. This is because these folders contain the bronze, silver and gold subfolders. There is no difference in the (sub)-folders contained within the ns-ii-tech-dev-catalog and the ns-ii-tech-stg-catalog. It’s only that the dev and stage refer to certain development environments.

The catalog

If you click on the bronze folder in, say, the ns-ii-tech-dev-catalog folder, you will see a list of tables and views of different Survey123 (S123) forms that have been ingested in their raw format from AGOL to Azure Data Lake Storage Gen2.

Clicking on a particular table or view in the bronze folder will create a new interface containing Overview, Sample Data, Details, Permissions and other tabs which provide more info on the table or view you have selected.

Below you can see the interface of the dvc_register_sublayer_survey subtable.

Catalog subtable

The Browse DBFS button also shows the files currently inside the Azure Cloud storage.

Browse DBFS

This is a deprecated method of searching for files but is useful in say, copying the path to a particular subtable. This can be achieved by clicking on the dropdown next to the subtable.

Browse DBFS

Apache superset

We have talked about Apache superset at Section 3.1. However, how do we get to it.

At NS, we use two sites for data visualization using Apache Superset. These are the Natural Impact Portal - staging and the Natural Impact Portal - dev. The former is accessed from here and latter from here. Since the former contains sensitive information, we will use the dev portal which is just a replica of that in the staging environment.

As mentioned earlier, any data that appears in the catalog will also appear in Apache Superset’s schemas provided that the configurations are all right.

NIP Dev portal

In here, if you go to the SQL lab, you can write queries that use the tables availabe in our catalog. These tables are found in SEE TABLE SCHEMA drop down.

If you select the Create chart button, a new interface will show up which consists various dropdowns and buttons to create your chart. However, it is necessary to first save your dataset by going to Chart source>Save your dataset. Missing this step will result in an error when trying to create and/or save your charts. The Query preview and View in SQL lab buttons are useful in viewing and/or editing your SQL queries that generated the output table.

Save dataset

Play around with various settings to create your chart. Here is an example of a bar chart.

Example chart

Save your chart by clicking on the Save button at the top right. If there is a dashboard present, you can save it to the dashboard too.

Clicking on the Dashboard menu at the top will take you to the dashboard interface.

Dashboards in Apache superset

If there are preexisting dashboards, they will appear here. Otherwise, click the Dashboard+ button. A new interface will appear from where you can add your created charts to the dashboard as well as design the layout. It is strongly recommended you play along with every tool to have a feel of what it does. The only limit here is imagination.

Dashboards in Apache superset

Once happy with your design, you can click on the Save button on the top right. Saved dashboards will appear under the Dashboard menu like below. The links to the dashboards are shareable too.

Existing dashboards