Introduction
We recently had the pleasure to look into what the lovely people of the Singer and Meltano communities are conceiving.
This article outlines a few struggles we encountered when diving in “bottom-up”, by just going ahead and sitting down to conceive corresponding data source and sink adapters for CrateDB.
- GitHub - crate-workbench/meltano-tap-cratedb: A Singer tap / Meltano extractor for CrateDB, built with the Meltano SDK, and based on the Meltano PostgreSQL tap.
- GitHub - crate-workbench/meltano-target-cratedb: A Singer target / Meltano loader for CrateDB, built with the Meltano SDK, and based on the Meltano PostgreSQL target.
It probably was the wrong approach, and we should have dedicated more time reading the documentation in one way or another. Still, we hope you enjoy our reports.
Singer specification
The canonical places to inspect the original Singer specification for Composable Open Source ETL are on the Singer GitHub, and the Meltano documentation pages.
- getting-started/docs/SPEC.md at master · singer-io/getting-started · GitHub
- https://hub.meltano.com/singer/spec/
Singer wire and file format
The tap_countries.singer file is perfect for demonstrating the capabilites of the wire format. It is made of SCHEMA
, RECORD
, and STATE
messages, in this case wrapping two streams of data, called continents
and countries
. In the context of running streams in and out of databases, they correspond to database tables.
{"type": "SCHEMA", "stream": "continents", "schema": {"properties": {"code": {"type": ["null", "string"]}, "name": {"type": ["null", "string"]}}, "type": "object"}, "key_properties": ["code"]}
{"type": "RECORD", "stream": "continents", "record": {"code": "AF", "name": "Africa"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "AN", "name": "Antarctica"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "AS", "name": "Asia"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "EU", "name": "Europe"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "NA", "name": "North America"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "OC", "name": "Oceania"}}
{"type": "RECORD", "stream": "continents", "record": {"code": "SA", "name": "South America"}}
{"type": "STATE", "value": {"bookmarks": {"continents": {}}}}
{"type": "SCHEMA", "stream": "countries", "schema": {"properties": {"code": {"type": ["string", "null"]}, "name": {"type": ["string", "null"]}, "native": {"type": ["string", "null"]}, "phone": {"type": ["string", "null"]}, "capital": {"type": ["string", "null"]}, "currency": {"type": ["string", "null"]}, "emoji": {"type": ["string", "null"]}, "continent": {"properties": {"code": {"type": ["string", "null"]}, "name": {"type": ["string", "null"]}}, "type": ["object", "null"]}, "languages": {"items": {"properties": {"code": {"type": ["string", "null"]}, "name": {"type": ["string", "null"]}}, "type": "object"}, "type": ["array", "null"]}}, "type": "object"}, "key_properties": ["code"]}
{"type": "RECORD", "stream": "countries", "record": {"code": "AD", "name": "Andorra", "native": "Andorra", "phone": "376", "continent": {"code": "EU", "name": "Europe"}, "capital": "Andorra la Vella", "currency": "EUR", "languages": [{"code": "ca", "name": "Catalan"}], "emoji": "\ud83c\udde6\ud83c\udde9"}}
{"type": "RECORD", "stream": "countries", "record": {"code": "AE", "name": "United Arab Emirates", "native": "\u062f\u0648\u0644\u0629 \u0627\u0644\u0625\u0645\u0627\u0631\u0627\u062a \u0627\u0644\u0639\u0631\u0628\u064a\u0629 \u0627\u0644\u0645\u062a\u062d\u062f\u0629", "phone": "971", "continent": {"code": "AS", "name": "Asia"}, "capital": "Abu Dhabi", "currency": "AED", "languages": [{"code": "ar", "name": "Arabic"}], "emoji": "\ud83c\udde6\ud83c\uddea"}}
{"type": "RECORD", "stream": "countries", "record": {"code": "AF", "name": "Afghanistan", "native": "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646", "phone": "93", "continent": {"code": "AS", "name": "Asia"}, "capital": "Kabul", "currency": "AFN", "languages": [{"code": "ps", "name": "Pashto"}, {"code": "uz", "name": "Uzbek"}, {"code": "tk", "name": "Turkmen"}], "emoji": "\ud83c\udde6\ud83c\uddeb"}}
{"type": "RECORD", "stream": "countries", "record": {"code": "AG", "name": "Antigua and Barbuda", "native": "Antigua and Barbuda", "phone": "1268", "continent": {"code": "NA", "name": "North America"}, "capital": "Saint John's", "currency": "XCD", "languages": [{"code": "en", "name": "English"}], "emoji": "\ud83c\udde6\ud83c\uddec"}}
{"type": "STATE", "value": {"bookmarks": {"continents": {}, "countries": {}}}}
In a full-qualified notation, table schema name and table name are separated by a dash -
character. So, when addressing a database table called sys.summits
, the corresponding stream name would be sys-summits
.
Singer vs. Meltano
Meltano as a framework fills many gaps and makes Singer convenient to actually use. It is impossible to outline all details and every difference, so we will focus on the “naming things” aspects for now.
Both ecosystems use different names for the same elements. That may be confusing at first, but it is easy to learn: For the notion of data source vs. data sink, common to all pipeline systems in one way or another, Singer uses tap vs. target, while Meltano uses extractor vs. loader. Essentially, they are the same things under different names.
Ecosystem | Data source | Data sink |
---|---|---|
Singer | Tap | Target |
Meltano | Extractor | Loader |
In Singer jargon, you tap data from a source, and send it to a target. In Meltano jargon, you extract data from a source, and then load it into the target system.