
Multipart exports and large uploads

1) When large exports are needed and what matters

Scenarios: financial reports, user activity exports, audit/regulator requests, BI extracts, partner directories, backups. Key requirements:
  • Data consistency (snapshot/point in time).
  • Throughput at volume (parallel write/read, streaming serialization).
  • Resumable and partial delivery.
  • Integrity (checksum) and verifiability (manifest).
  • Security/PII (masking, encryption, access control).
  • Cost management (compression, timeouts, CDN, TTL).

2) Data formats: pros/cons

CSV - compact, fast to write/read; cons: escaping pitfalls, type information is lost. Good for tabular reports.
JSON Lines (JSONL) - one object per line, convenient for streaming and partial reads; cons: larger volume.
Parquet/Avro - columnar/schema-based formats with compression and predicate pushdown; ideal for analytics and big data.
Mixed: JSONL for API download → offline conversion to Parquet.

Compression: `gzip` or `zstd` (preferred). For very large volumes, split the output into separately compressed parts (~128-512 MB per part).

3) Consistency: How to get a "snapshot"

DB: REPEATABLE READ/SNAPSHOT transaction isolation; for streams, logical replication slots or a watermark (max `updated_at`/version).
Event sourcing: export by log offset.
Slices: a "full" export plus "deltas" (subsequent exports of changes since the `watermark`).
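The watermark approach can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` as a stand-in database; the `orders` table, its columns, and the `export_delta` helper are assumptions made for the example, not part of any real schema.

```python
import sqlite3


def export_delta(conn, watermark):
    """Return rows changed after `watermark` and the new watermark.

    The upper bound is fixed first, so rows written while the export runs
    fall into the next delta instead of being partially included.
    """
    cur = conn.cursor()
    upper = cur.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    if upper is None or upper <= watermark:
        return [], watermark
    rows = cur.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY updated_at, id",
        (watermark, upper),
    ).fetchall()
    return rows, upper


# Hypothetical demo data; same-format ISO-8601 strings compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2025-10-30T09:00:00Z"),
        (2, 25.5, "2025-10-31T08:00:00Z"),
        (3, 7.25, "2025-10-31T12:00:00Z"),
    ],
)
delta_rows, new_watermark = export_delta(conn, "2025-10-31T00:00:00Z")
```

Storing `new_watermark` next to the exported data lets the next run start exactly where this one stopped.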

4) Multipart/chunking

4.1 Types of "multipart"

Upload (to us): multipart/form-data (small files), S3 Multipart Upload (MPU)/GCS Resumable (large files).
Download (from us): HTTP Range (`bytes=start-end`), `multipart/byteranges` (several ranges in one response), a "zip of parts", directories in object storage.

4.2 Splitting strategies

By size (for example, 256 MB per part).
By key/date (sharding by `tenant_id`, `YYYY/MM/DD`).
By table/entity (individual files per type).

Balance: parts of 64-512 MB download well in parallel and do not exhaust memory.
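Splitting by size can be illustrated with a small generator. This is a sketch under stated assumptions: it uses JSONL with `gzip` because both are available in the Python standard library (swap in `zstd` where a binding is available), the `split_jsonl` name and part naming are hypothetical, and it buffers one part in memory for brevity where a real worker would stream to a file or object store.

```python
import gzip
import json


def split_jsonl(records, max_part_bytes):
    """Yield (part_name, compressed_bytes) with each part compressed on its own.

    The size limit applies to the uncompressed JSONL payload; a part is closed
    as soon as adding the next line would exceed it.
    """
    buffer, buffered, index = [], 0, 0
    for record in records:
        line = (json.dumps(record, separators=(",", ":")) + "\n").encode("utf-8")
        if buffer and buffered + len(line) > max_part_bytes:
            yield f"part-{index:05d}.jsonl.gz", gzip.compress(b"".join(buffer))
            buffer, buffered, index = [], 0, index + 1
        buffer.append(line)
        buffered += len(line)
    if buffer:
        yield f"part-{index:05d}.jsonl.gz", gzip.compress(b"".join(buffer))


# Tiny limit so the demo produces several parts; real parts would be 64-512 MB.
rows = ({"id": i, "tenant_id": "acme"} for i in range(10))
parts = list(split_jsonl(rows, max_part_bytes=100))
```

Because every part is an independent gzip stream, a consumer can decompress and verify any part without touching the others.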

5) Export API architecture (asynchronous model)

Steps:

1. `POST /exports` → job in queue (metadata: format, filters, encryption, lifetime).

2. Workers build snapshots, stream data, and write parts to object storage.

3. Generate a manifest (JSON) with a list of parts, sizes, checksum, schema version.

4. `GET /exports/{id}` returns the status and links to the parts (pre-signed URLs).

5. `GET /exports/{id}/manifest.json` is the source of truth for verification and re-download.

Example manifest:
json
{
  "export_id": "exp_2025_10_31_001",
  "created_at": "2025-10-31T14:23:00Z",
  "schema": "orders_v3",
  "format": "parquet+zstd",
  "parts": [
    {"name":"part-00000.parquet.zst","size":268435456,"sha256":"...","url":"...","range":"bytes=0-268435455"},
    {"name":"part-00001.parquet.zst","size":241172480,"sha256":"...","url":"..."}
  ],
  "total_bytes": 509607936,
  "encryption": {"type":"AES-256-GCM","key_id":"kms/keys/exp"},
  "watermark": {"type":"updated_at","value":"2025-10-31T00:00:00Z"}
}
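A consumer can run a basic consistency check on a manifest before downloading anything. The sketch below assumes the field names from the example manifest (`parts`, `name`, `size`, `sha256`, `total_bytes`); the `validate_manifest` helper and the placeholder digests are hypothetical.

```python
def validate_manifest(manifest):
    """Return a list of problems found in a manifest dict (empty if consistent)."""
    problems = []
    parts = manifest.get("parts", [])
    if not parts:
        problems.append("manifest lists no parts")
    names = [part["name"] for part in parts]
    if len(names) != len(set(names)):
        problems.append("duplicate part names")
    for part in parts:
        if part["size"] <= 0:
            problems.append(f"{part['name']}: non-positive size")
        if not part.get("sha256"):
            problems.append(f"{part['name']}: missing sha256")
    declared = manifest.get("total_bytes")
    actual = sum(part["size"] for part in parts)
    if declared != actual:
        problems.append(f"total_bytes is {declared}, parts sum to {actual}")
    return problems


# Sizes taken from the example manifest; digests are placeholders, not real hashes.
manifest = {
    "export_id": "exp_2025_10_31_001",
    "parts": [
        {"name": "part-00000.parquet.zst", "size": 268435456, "sha256": "placeholder-digest-0"},
        {"name": "part-00001.parquet.zst", "size": 241172480, "sha256": "placeholder-digest-1"},
    ],
    "total_bytes": 509607936,
}
problems = validate_manifest(manifest)
```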

6) Resumable downloads

HTTP Range: the client fetches the remaining "tail" of the file: `Range: bytes=241172480-`.
Multiple ranges: `Range: bytes=0-999,2000-2999` → the response has `Content-Type: multipart/byteranges`.
Client strategy: parallel workers per part, `sha256` verification of each part, retries with exponential backoff.
CDN: Range support must be enabled and buffering of large responses disabled.
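The header arithmetic behind these requests is easy to get wrong by one byte, since HTTP ranges are inclusive on both ends. The following sketch builds the headers only (no network calls); the function names are hypothetical.

```python
def plan_ranges(total_size, chunk_size):
    """Split [0, total_size) into inclusive byte ranges for parallel requests."""
    if total_size <= 0 or chunk_size <= 0:
        raise ValueError("total_size and chunk_size must be positive")
    return [
        f"bytes={start}-{min(start + chunk_size, total_size) - 1}"
        for start in range(0, total_size, chunk_size)
    ]


def resume_header(bytes_on_disk, etag=None):
    """Headers for fetching the tail of a partially downloaded part.

    If-Range makes the server send the whole file again (200) when the ETag
    no longer matches, instead of a tail (206) from a different version.
    """
    headers = {"Range": f"bytes={bytes_on_disk}-"}
    if etag is not None:
        headers["If-Range"] = etag
    return headers


ranges = plan_ranges(total_size=2500, chunk_size=1000)
tail = resume_header(1048576, etag='"abc123"')
```

Here `ranges` covers the 2500-byte object as `bytes=0-999`, `bytes=1000-1999`, and `bytes=2000-2499`.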

7) Large uploads to us (resumable upload)

S3 Multipart Upload: clients upload parts (5 MiB to 5 GiB each, at most 10,000 per upload), then `CompleteMultipartUpload` assembles the object.
GCS Resumable: one session with offsets; the client can resume using `Content-Range`.
tus (protocol): an open, vendor-independent protocol for resumable uploads over HTTP.

B2B pattern: we issue pre-signed URLs so the client uploads parts directly to the object store, and sends only the metadata to our API.
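When planning a multipart upload, the part size has to respect the store's limits. The sketch below uses the documented S3 limits (5 MiB minimum part except the last, 5 GiB maximum part, 10,000 parts); the `choose_part_size` helper and its 256 MiB default are assumptions for illustration, and the limits should be checked against current provider documentation.

```python
MIB = 1024 * 1024
MIN_PART = 5 * MIB          # S3 minimum for every part except the last
MAX_PART = 5 * 1024 * MIB   # S3 maximum part size (5 GiB)
MAX_PARTS = 10_000          # S3 maximum number of parts per upload


def choose_part_size(total_bytes, preferred=256 * MIB):
    """Pick a part size that keeps the upload within the multipart limits."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Grow the part size if the preferred one would need more than MAX_PARTS parts.
    needed = (total_bytes + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(preferred, needed, MIN_PART)
    if part_size > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    part_count = (total_bytes + part_size - 1) // part_size
    return part_size, part_count


size_small, count_small = choose_part_size(1024 * MIB)              # 1 GiB
size_huge, count_huge = choose_part_size(4 * 1024 * 1024 * MIB)     # 4 TiB
```

For the 1 GiB object the preferred 256 MiB size is kept; for the 4 TiB object the part size is raised so the part count stays within the limit.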

8) Compression, encryption, integrity

Compression: `zstd` preferred (better ratio/speed). Compress each part separately (this makes resuming and caching easier).

Encryption:
  • On the wire: TLS 1.2+.
  • At rest: server-side KMS (SSE-KMS) or client-side (AES-256-GCM) with key wrapping.
  • Never put a "raw" key in the manifest.

Integrity: at minimum, SHA-256 per part plus one checksum for the entire set. Verify on the client before acknowledging receipt.
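Client-side verification can hash each part as a stream so that memory use stays flat regardless of part size. In this sketch the helper names are hypothetical, and the way the set-level checksum is composed (SHA-256 over the ordered part digests) is an assumption; the text only requires that some checksum for the whole set exists.

```python
import hashlib
import hmac
import io


def sha256_stream(stream, chunk_size=4 * 1024 * 1024):
    """Hash a file-like object in fixed-size chunks to keep memory flat."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_part(stream, expected_sha256):
    """True only if the downloaded bytes match the manifest digest."""
    return hmac.compare_digest(sha256_stream(stream), expected_sha256)


def set_digest(part_digests):
    """One digest for the whole set: SHA-256 over the ordered part digests.

    Assumed composition for this sketch; part order matters by design.
    """
    combined = hashlib.sha256()
    for hex_digest in part_digests:
        combined.update(bytes.fromhex(hex_digest))
    return combined.hexdigest()


payload = b"order_id,amount\n1,10.00\n"
expected = hashlib.sha256(payload).hexdigest()
ok = verify_part(io.BytesIO(payload), expected)
tampered = verify_part(io.BytesIO(payload + b"x"), expected)
```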

9) Perimeter Integration: NGINX/CDN

NGINX (Range + long timeouts + buffering disabled):
nginx
server {
    listen 443 ssl http2;
    server_name downloads.example.com;

    location /exports/ {
        proxy_buffering off;           # stream responses instead of buffering them
        proxy_request_buffering off;   # pass request bodies through unbuffered
        proxy_read_timeout 3600s;      # allow long-running transfers
        add_header Accept-Ranges bytes;
        proxy_pass http://export-backend;
    }
}
Response headers:
  • `Content-Disposition: attachment; filename="export_2025-10-31_part-00000.parquet.zst"`
  • `ETag`/`If-Range` for correct resumption.
  • `Cache-Control` (for example, `private, max-age=3600`) for personal exports.

10) Security and compliance

Authentication/authorization: exports are issued only to the owner or authorized roles; pre-signed URLs with a short TTL.
PII: masking/pseudonymization; the manifest contains only technical fields.
GDPR/local regulators: deletion of exports by TTL, audit of downloads, no cross-region delivery without justification.
Rate limiting & quotas: limit the number of simultaneous exports and the total volume per day/month (per-tenant).
Anti-scraping: CAPTCHA/bot filters when issuing links, limits on ranges (max parallel parts).
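Per-tenant limits can be sketched with a small in-memory tracker. This is an illustration only: the `TenantQuota` class, its method names, and the rejection messages are hypothetical, and a real service would keep the counters in shared storage and reset the daily volume on a schedule.

```python
from collections import defaultdict


class TenantQuota:
    """In-memory per-tenant limits on active jobs and daily export volume."""

    def __init__(self, max_active_jobs, max_bytes_per_day):
        self.max_active_jobs = max_active_jobs
        self.max_bytes_per_day = max_bytes_per_day
        self.active = defaultdict(int)
        self.bytes_today = defaultdict(int)

    def try_start(self, tenant, estimated_bytes):
        """Reserve a job slot and volume, or return a rejection reason."""
        if self.active[tenant] >= self.max_active_jobs:
            return "too many active exports"
        if self.bytes_today[tenant] + estimated_bytes > self.max_bytes_per_day:
            return "daily volume exceeded"
        self.active[tenant] += 1
        self.bytes_today[tenant] += estimated_bytes
        return None

    def finish(self, tenant):
        """Release the job slot; consumed volume stays counted until the daily reset."""
        self.active[tenant] = max(0, self.active[tenant] - 1)


GIB = 1024 ** 3
quota = TenantQuota(max_active_jobs=2, max_bytes_per_day=10 * GIB)
first = quota.try_start("acme", 4 * GIB)
second = quota.try_start("acme", 4 * GIB)
third = quota.try_start("acme", 1 * GIB)    # rejected: two jobs already active
quota.finish("acme")
fourth = quota.try_start("acme", 4 * GIB)   # rejected: 8 GiB used + 4 GiB > 10 GiB
```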

11) Observability and operation

Metrics:
  • `export_jobs_total{status}` (queued/running/succeeded/failed/expired)
  • `export_bytes_total`, `export_part_duration_ms{p50,p95,p99}`
  • `download_range_requests_total`, `resumes_total`, `checksum_fail_total`
  • `storage_cost_estimate` and `egress_bytes_cdn`
Logs/Audits:
  • Who created the export, filters, watermark, download list (IP/UA/time).
  • Part hashes and client-side reconciliation (acknowledgements).
Tracing:
  • Spans: snapshot → serialize → upload part → finalize.

12) Performance and cost

Parallelism: Generate multiple parts simultaneously (N workers), but limit I/O.
Memory: streaming serialization (iterators, database cursors, chunks of 4-16 MB).
Dedup: for frequently repeated exports, smart cache of parts by filters/hash.
CDN: beneficial for "generic" public sets; use with caution for personal exports (security/PII).
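Streaming serialization from a database cursor might look like the sketch below. It uses `sqlite3` and CSV from the Python standard library as stand-ins; the `stream_csv` helper and the `events` table are hypothetical, and the batch size is kept tiny only to make the chunking visible.

```python
import csv
import io
import sqlite3


def stream_csv(conn, query, batch_rows=1000):
    """Yield CSV text chunks, holding at most `batch_rows` rows in memory."""
    cur = conn.execute(query)
    header = [column[0] for column in cur.description]
    first = True
    while True:
        rows = cur.fetchmany(batch_rows)
        if not rows:
            break
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        if first:
            # The header goes out once, with the first chunk only.
            writer.writerow(header)
            first = False
        writer.writerows(rows)
        yield buffer.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(i, "login") for i in range(5)])
chunks = list(stream_csv(conn, "SELECT id, kind FROM events ORDER BY id", batch_rows=2))
```

Each yielded chunk can be fed straight into a compressor or an upload stream, so the full result set never has to fit in memory.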

13) Examples of interfaces

13.1 Create Export (REST)

http
POST /exports
Content-Type: application/json
Authorization: Bearer <token>

{
  "format": "parquet+zstd",
  "filters": {"date_from":"2025-10-01","date_to":"2025-10-31","tenant":"acme"},
  "split_size_mb": 256,
  "encryption": {"mode":"server-side-kms","key_id":"kms/keys/exp"}
}
Response:
json
{
  "id":"exp_2025_10_31_001",
  "status":"queued",
  "estimated_parts": 12,
  "manifest_url": "/exports/exp_2025_10_31_001/manifest.json"
}

13.2 Fetching a part with Range

http
GET /exports/exp_.../part-00003.parquet.zst
Range: bytes=1048576-

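13.3 Worker pseudocode

pseudo
snapshot = db.begin_snapshot()
for index, shard in enumerate(plan_shards(snapshot)):
    part_name = f"part-{index:05d}.parquet.zst"
    part_stream = encode_stream(shard.rows, format="parquet", compress="zstd")
    # Assumption: upload_stream returns the URL plus the size and SHA-256 computed while streaming.
    url, size, sha256 = object_store.upload_stream(part_stream, part_name, encryption=KMS)
    manifest.add(part_name, size, sha256, url)
write_manifest(manifest)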

14) Delta export patterns

Full export once per period N (day/week) + deltas every hour by `updated_at > watermark`.
On the consumer side, apply the deltas in order, verifying `version`/`seq`.
Keep the last watermark both on the consumer side and in the manifest.
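On the consumer side, ordered application of deltas can be sketched as below. The delta shape (`seq`, `rows`, a `deleted` flag) and the `apply_deltas` helper are assumptions for illustration; the point is that duplicates are skipped and a gap in `seq` stops processing instead of silently corrupting state.

```python
def apply_deltas(state, last_seq, deltas):
    """Apply deltas in `seq` order; skip duplicates and fail on the first gap.

    `state` maps a record id to its latest row. Returns the new last applied seq.
    """
    for delta in sorted(deltas, key=lambda d: d["seq"]):
        if delta["seq"] <= last_seq:
            continue  # already applied (redelivery after a retry)
        if delta["seq"] != last_seq + 1:
            raise ValueError(f"gap: expected seq {last_seq + 1}, got {delta['seq']}")
        for row in delta["rows"]:
            if row.get("deleted"):
                state.pop(row["id"], None)
            else:
                state[row["id"]] = row
        last_seq = delta["seq"]
    return last_seq


# Deltas arrive out of order; seq 1 updates record 1 and adds record 2, seq 2 deletes record 1.
state = {1: {"id": 1, "amount": 10}}
deltas = [
    {"seq": 2, "rows": [{"id": 1, "deleted": True}]},
    {"seq": 1, "rows": [{"id": 1, "amount": 12}, {"id": 2, "amount": 5}]},
]
last_seq = apply_deltas(state, 0, deltas)
```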

15) Anti-patterns

Generating the export inside the request (synchronously): timeouts and OOM.
One giant file without splitting: no way to resume or download in parallel.
No checksum and no manifest: integrity cannot be proven.
Permanent public links to personal data.
SSE/CDN buffering or disabled Range support: breaks resumed downloads.
Exporting dirty data (without snapshot/isolation).

16) Implementation checklist

  • Format and compression: CSV/JSONL/Parquet + `zstd`/`gzip`.
  • Consistency: transactional snapshot or watermark/offset.
  • Partitioning: 64-512 MB parts, parallel generation and download.
  • Manifest: parts list, sizes, SHA-256, schema version, watermark.
  • Resumption: HTTP Range, `multipart/byteranges` support, client retries.
  • Security: pre-signed URLs, TTL, encryption (KMS/AEAD), PII masking.
  • Limits/quotas: per-tenant, daily volumes, number of active jobs.
  • Observability: job/part metrics, download audits, checksum-fail alerts.
  • Cost: CDN for public sets, TTL and auto-cleanup in the store.
  • Runbooks/Game Days: network breaks, store unavailability, database overload, KMS failure.

17) Game Days (playbooks)

Network drop during download: the client must continue with `Range`.
One part failed to load: retry that part without recreating the entire export.
Worker crash: the job resumes from the last unconfirmed chunk.
KMS unavailable: fail safe (pause generation, never release unencrypted data).
Data growth × 2: check generation time and redistribute parallelism without overloading the database.
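Retrying a single failed part can be rehearsed with a small wrapper. This sketch assumes a hypothetical `fetch(part_name)` callable that raises `OSError` (or a subclass such as `ConnectionError`) on network failure; the delays use exponential backoff with full jitter, and `sleep` is injectable so the demo does not block.

```python
import random
import time


def retry_part(fetch, part_name, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fetch(part_name) until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return fetch(part_name)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter spreads out retry storms


# Demo with a fake fetcher that fails twice; sleep is stubbed so nothing blocks.
calls = []
waits = []


def flaky_fetch(name):
    calls.append(name)
    if len(calls) < 3:
        raise ConnectionError("simulated network drop")
    return b"part-bytes"


result = retry_part(flaky_fetch, "part-00003.parquet.zst", sleep=waits.append)
```

Only the failing part is retried; the other parts and the manifest are untouched.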

18) Summary

Reliable large exports combine an asynchronous architecture, partitioning, resumability, and verifiable integrity. Take a snapshot, write parts in parallel, publish a manifest with checksums, and support HTTP Range and short-lived links. Think through security, limits, and observability, and gigabyte exports will stop being a nightmare for teams and users.
