Learning Singer and Meltano

Introduction

We recently had the pleasure to look into what the lovely people of the Singer and Meltano communities are conceiving.

This article outlines a few struggles we encountered when diving in “bottom-up”, by just going ahead and sitting down to conceive corresponding data source and sink adapters for CrateDB.

It probably was the wrong approach, and we should have dedicated more time reading the documentation in one way or another. Still, we hope you enjoy our reports.

Singer specification

The canonical places to inspect the original Singer specification for Composable Open Source ETL are on the Singer GitHub, and the Meltano documentation pages.

Singer wire and file format

The tap_countries.singer file is perfect for demonstrating the capabilites of the wire format. It is made of SCHEMA, RECORD, and STATE messages, in this case wrapping two streams of data, called continents and countries. In the context of running streams in and out of databases, they correspond to database tables.

{"type": "SCHEMA", "stream": "continents", "schema": {"properties": {"code": {"type": ["null", "string"]}, "name": {"type": ["null", "string"]}}, "type": "object"}, "key_properties": ["code"]}
{"type": "RECORD", "stream": "continents", "record": {"code": "AF", "name": "Africa"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "AN", "name": "Antarctica"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "AS", "name": "Asia"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "EU", "name": "Europe"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "NA", "name": "North America"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "OC", "name": "Oceania"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "SA", "name": "South America"}}
{"type": "STATE", "value": {"bookmarks": {"continents": {}}}}
{"type": "SCHEMA", "stream": "countries", "schema": {"properties": {"code": {"type": ["string", "null"]}, "name": {"type": ["string", "null"]}, "native": {"type": ["string", "null"]}, "phone": {"type": ["string", "null"]}, "capital": {"type": ["string", "null"]}, "currency": {"type": ["string", "null"]}, "emoji": {"type": ["string", "null"]}, "continent": {"properties": {"code": {"type": ["string", "null"]}, "name": {"type": ["string", "null"]}}, "type": ["object", "null"]}, "languages": {"items": {"properties": {"code": {"type": ["string", "null"]}, "name": {"type": ["string", "null"]}}, "type": "object"}, "type": ["array", "null"]}}, "type": "object"}, "key_properties": ["code"]}
{"type": "RECORD", "stream": "countries", "record": {"code": "AD", "name": "Andorra", "native": "Andorra", "phone": "376", "continent": {"code": "EU", "name": "Europe"}, "capital": "Andorra la Vella", "currency": "EUR", "languages": [{"code": "ca", "name": "Catalan"}], "emoji": "\ud83c\udde6\ud83c\udde9"}}
{"type": "RECORD", "stream": "countries", "record": {"code": "AE", "name": "United Arab Emirates", "native": "\u062f\u0648\u0644\u0629 \u0627\u0644\u0625\u0645\u0627\u0631\u0627\u062a \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u0627\u0644\u0645\u062a\u062d\u062f\u0629", "phone": "971", "continent": {"code": "AS", "name": "Asia"}, "capital": "Abu Dhabi", "currency": "AED", "languages": [{"code": "ar", "name": "Arabic"}], "emoji": "\ud83c\udde6\ud83c\uddea"}}
{"type": "RECORD", "stream": "countries", "record": {"code": "AF", "name": "Afghanistan", "native": "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646", "phone": "93", "continent": {"code": "AS", "name": "Asia"}, "capital": "Kabul", "currency": "AFN", "languages": [{"code": "ps", "name": "Pashto"}, {"code": "uz", "name": "Uzbek"}, {"code": "tk", "name": "Turkmen"}], "emoji": "\ud83c\udde6\ud83c\uddeb"}}
{"type": "RECORD", "stream": "countries", "record": {"code": "AG", "name": "Antigua and Barbuda", "native": "Antigua and Barbuda", "phone": "1268", "continent": {"code": "NA", "name": "North America"}, "capital": "Saint John's", "currency": "XCD", "languages": [{"code": "en", "name": "English"}], "emoji": "\ud83c\udde6\ud83c\uddec"}}
{"type": "STATE", "value": {"bookmarks": {"continents": {}, "countries": {}}}}

In a full-qualified notation, table schema name and table name are separated by a dash - character. So, when addressing a database table called sys.summits, the corresponding stream name would be sys-summits.

Singer vs. Meltano

Meltano as a framework fills many gaps and makes Singer convenient to actually use. It is impossible to outline all details and every difference, so we will focus on the “naming things” aspects for now.

Both ecosystems use different names for the same elements. That may be confusing at first, but it is easy to learn: For the notion of data source vs. data sink, common to all pipeline systems in one way or another, Singer uses tap vs. target, while Meltano uses extractor vs. loader. Essentially, they are the same things under different names.

Ecosystem Data source Data sink
Singer Tap Target
Meltano Extractor Loader

In Singer jargon, you tap data from a source, and send it to a target. In Meltano jargon, you extract data from a source, and then load it into the target system.

Creating a database tap

Selecting streams and entities on taps

Well, it is all right there:

However, we don’t know why, but we haven’t been able to connect the dots between those concepts sitting on top of Singer. One reason might have been that we studied the documentation wrongly, not intensively enough, and that we were trying to discover something about “Meltano Services” (apparently the underlying machinery for those plugins/extras), which we had picked up from the code base.

So, we think everything is excellently documented from a user’s perspective, and it’s only us who took the wrong “bottom up” approach. :person_facepalming:

Defining schema overrides on taps

Within Singer, schema definitions are emitted by the data source components, which is natural. Within Meltano, there is a subsystem to define and resolve schema override rules, in order to adjust the data schema according to your downstream needs.

In response to that, we have been successful to switch to tap-spreadsheets-anywhere for reading from a JSONL file, and apply a schema override rule like outlined above. Apparently, it works because this tap provides both the catalog and discover capabilities? Excellent!


  1. Which probably was the wrong choice. ↩︎