Panodata Community

Problem backfilling a large historical dataset into InfluxDB

Help! When pushing data from NSIDC into InfluxDB, the import croaks with

{"error":"engine: error rolling WAL segment: error opening new segment file for wal (2): open /var/lib/influxdb/wal/nsidc/autogen/55885/_00001.wal: too many open files"}

You might want to have a look at this…

If it’s really the client, as also outlined in [1], please try invoking

ulimit -n 65535

before running your import program.

[1] Too many open file on client · Issue #4569 · influxdata/influxdb · GitHub


According to /lib/systemd/system/influxdb.service, the InfluxDB service itself is already running with

LimitNOFILE=65536

Currently, I’m hesitant to increase the overall server limits even further.
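As a sanity check, one can verify which limit the running daemon actually inherited (rather than what the shell reports) by inspecting /proc. A small sketch; it is shown against the current shell, since the influxd PID differs per system:

```shell
# Print the effective open-file limit of a process from /proc.
# $$ is the current shell; substitute the influxd PID, e.g.
# "$(pidof influxd)", to verify what the service really got.
grep "Max open files" /proc/$$/limits
```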

However, it looks like InfluxDB is creating a huge number of shards on the nsidc database, which might not be intended.

root@eltiempo:~# l /var/lib/influxdb/wal/nsidc/autogen/ | wc -l
1479

Is there any way you could share your import program with us? Maybe we can optimize this detail.

So, it makes sense for InfluxDB to operate like that when the time series covers a huge timespan. Is this the case with your specific dataset?

At least I can say: we have a maximum of one record a day. Within the last ~20 years: daily; before that (back until 1978) we have 2–4 days between each … four records.
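That timespan would explain the shard count. A back-of-the-envelope sketch; the ~43-year span and the 7-day default shard group duration (which InfluxDB uses for infinite-retention policies) are assumptions based on this thread and the InfluxDB documentation:

```shell
# Rough shard-group estimate for ~43 years of data (1978 onwards).
days=$((43 * 365))
# Default shard group duration on an infinite retention policy: 7 days.
default_shards=$((days / 7))
# With SHARD DURATION 52w, one shard covers roughly a year.
yearly_shards=$((days / (7 * 52)))
echo "7d shard groups:  ${default_shards}"
echo "52w shard groups: ${yearly_shards}"
```

With the 7-day default, that is on the order of two thousand shard groups (weeks with no data produce no shard, which is why the observed number is lower), versus a few dozen with yearly shards.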


Background on “shard group duration”

So, you might consider creating the database with a specific shard group duration.


Recommendation

According to the recommendation for backfilling data cited above, this might help you along:

CREATE DATABASE <database_name> WITH SHARD DURATION 52w

When the documentation says “we highly recommend temporarily setting a longer shard group duration so fewer shards are created”, how, and to what value, should I revert afterwards?

Just leave it as it is, since it should reasonably match the time resolution of this dataset, right?
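For completeness: if one ever did want to revert an existing database to the default, InfluxDB 1.x offers ALTER RETENTION POLICY. A hedged sketch, using the database and policy names from this thread and assuming the 7-day default that applies to infinite-retention policies:

```sql
-- Revert the shard group duration on the "autogen" policy of the
-- "nsidc" database back to the 7-day default. Note: this only
-- affects newly created shard groups, not existing ones.
ALTER RETENTION POLICY "autogen" ON "nsidc" SHARD DURATION 1w
```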

The current database contains just 45 shards (probably matching the number of years, i.e. blocks of 52 weeks each)

root@eltiempo:~# l /var/lib/influxdb/data/nsidc/autogen | wc -l
45

each containing only a few kB worth of data

root@eltiempo:~# du -sch /var/lib/influxdb/data/nsidc/autogen/*
44K	/var/lib/influxdb/data/nsidc/autogen/61615
44K	/var/lib/influxdb/data/nsidc/autogen/61616
44K	/var/lib/influxdb/data/nsidc/autogen/61617
44K	/var/lib/influxdb/data/nsidc/autogen/61618

So, when querying and processing it, nobody will suffer.


P.S.: Unless there are further experiences to report regarding this…