Data Tools
A list of tools that generally make life easier when working with data.
Things I use and recommend
- jq is a lightweight and flexible command-line JSON processor. See Peter Koppstein’s A Stream oriented Introduction to jq for the missing manual.
- csvkit is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats.
- gron transforms JSON into discrete assignments to make it easier to grep for what you want and see the absolute ‘path’ to it. It eases the exploration of APIs that return large blobs of JSON but have terrible documentation. Its primary purpose is to make it easy to find the path to a value in a deeply nested JSON blob when you don’t already know the structure; much of jq’s power is unlocked only once you know that structure.
- yq is a lightweight and portable command-line YAML processor. yq uses jq-like syntax but works with YAML files as well as JSON. I find it useful to work with `terraform show -json ~/tmp/tfplan`.
- q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).
- vscode-edit-csv is an extension for Visual Studio Code that allows you to edit csv files with an Excel-like table UI.
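To make the gron entry above concrete, here is a minimal Python sketch of the idea of turning nested JSON into greppable path assignments. It is not gron's implementation: `flatten` is a hypothetical helper, and it assumes every key is a simple identifier. The real tool handles keys that need quoting, and its output ordering may differ.

```python
import json


def flatten(value, path="json"):
    """Yield gron-style 'path = value;' lines for a parsed JSON value.

    Assumption: object keys are simple identifiers, so dotted paths are safe.
    """
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # Scalars are written back as JSON literals (quoted strings, true, null, ...).
        yield f"{path} = {json.dumps(value)};"


doc = json.loads('{"user": {"name": "ada", "tags": ["x", "y"]}}')
lines = list(flatten(doc))
print("\n".join(lines))
```

Each printed line carries the full path to one value, so a plain text search for `name` finds `json.user.name` without knowing the structure in advance.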
Things I haven’t tried yet
- Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages.
- VisiData is a terminal interface for exploring and arranging tabular data.
- SQLFluff is a dialect-flexible and configurable SQL linter.
- angle-grinder allows you to parse, aggregate, sum, average, min/max, percentile, and sort your data. You can see it, live-updating, in your terminal.
- immudb is a database with built-in cryptographic proof and verification. It can track changes in sensitive data and the integrity of the history will be protected by the clients, without the need to trust the server. It can operate as a key-value store or as a relational database (SQL).
- lux is a Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process. By simply printing out a dataframe in a Jupyter notebook, Lux recommends a set of visualizations highlighting interesting trends and patterns in the dataset. Visualizations are displayed via an interactive widget that enables users to quickly browse through large collections of visualizations and make sense of their data. Blog. Demo
- NocoDB turns any MySQL, PostgreSQL, SQL Server, SQLite & MariaDB into a smart-spreadsheet.
- hadolint is a Dockerfile linter that also uses ShellCheck to parse inline Bash code.
- Nushell is a new shell inspired by PowerShell, functional programming and modern CLI tools.
- jc JSONifies the output of many CLI tools and file-types for easier parsing in scripts. See the Parsers section for supported commands and file-types.
- jtbl accepts piped JSON data from stdin and outputs a text table representation to stdout.
- bpfcc-tools contains various Linux kernel tracing tools. For example, execsnoop can list all executed processes while it runs.
- htmlq is like jq but for html.
- RESTler is a stateful REST API fuzzer.
- Turbolift is a simple tool to help apply changes across many GitHub repositories simultaneously. Perhaps similar to clustergit?
- sqlean provides extra functions for SQLite.
- vscode-csv-markdown is a VSCode extension to convert CSV text to Markdown or JIRA table.
- qsv is a command line program for indexing, slicing, analyzing, splitting, enriching, validating & joining CSV files. Commands are simple, fast and composable. qsv is a fork of the popular xsv utility.
- migra is like diff but for PostgreSQL.
- GitPop2 finds the most popular fork of a project on GitHub. Useful for finding a maintained fork of an abandoned project. I found qsv from xsv this way!
- Active GitHub Forks is another tool that works like GitPop2. It may be dead because it showed a Heroku error page on 2023-10-02.
- gojq is a “pure Go” implementation of jq. Potential advantages are “nice error messages” and “YAML input/output”.
- Reshape is an easy-to-use, zero-downtime schema migration tool for Postgres. It automatically handles complex migrations that would normally require downtime or manual multi-step changes. During a migration, Reshape ensures both the old and new schema are available at the same time, allowing you to gradually roll out your application.
- SQLime is an online SQLite playground for debugging and sharing SQL snippets. Like SQLFiddle from back in the day.
- Results from the survey about LibreOffice Calc in 2021 are most interesting for the alternative toolkit of its surveyed users. I need to try some of these!
  - Office:
    - OnlyOffice
    - WPS Office
    - OpenOffice
    - Collabora Online
    - Zoho Sheet
    - Softmaker
    - IBM Symphony
    - MS Office from 1990
    - Quattro Pro
  - Code:
    - numpy
    - Pandas
    - Matlab
    - GNU Octave
    - command line tools and editors
  - Database:
  - Cleanup:
  - Collaboration:
  - Statistical:
  - Plotting:
- Results from the survey about LibreOffice Impress in 2022 are interesting for the same reason. Avoid PowerPoint. Prefer text-driven tools.
  - Special extensions: LaTeX Beamer
  - Graphical tools: Inkscape with Sozi, GIMP, Figma
  - Office packages: OnlyOffice, Collabora Office, Apple Keynote
  - Web sites or browser based: Impress.js, slides.com, Slidev, Miro
  - Markup language tools: Marp. Wanaprez!
  - Specialized tools: Genially, Propresenter
- OpenRefine
- Calligra Sheets supports sub-cell scrolling, according to a frustrated LibreOffice user. It seems to depend on KDE. I don’t know if that would play nicely with Ubuntu’s Gnome desktop.
- dsutils are Jeroen Janssens’ command line tools for doing data science. They accompany his excellent book Data Science at the Command Line. It includes the `header` command to fix headerless columnar data.
- jmespath is a query language for JSON. It’s not quite as powerful as jq, but it’s convenient for some transformations and it’s built into the AWS CLI.
- Markdown lint tool (mdl) is a tool to check markdown files and flag style issues.
- jq kung fu is a jq playground written in web assembly. It allows you to test jq expressions in the browser.
- jqplay is a playground for jq 1.6. It also allows you to test jq expressions in the browser. It’s not clear whether these playgrounds are actively maintained.
- ripgrep is a line-oriented search tool that recursively searches the current directory for a regex pattern. It is similar to other popular search tools like The Silver Searcher, ack and grep.
- YAML Multiline helps you find the right syntax for your YAML multiline strings. I never, ever managed to remember the right syntax. Maybe allowing unquoted strings was a bad idea?
- jo is a small utility to create JSON objects at the command line using a more bash-friendly key-value syntax.
- Fira Code is a free monospaced font containing ligatures for common programming multi-character combinations. This is just a font rendering feature: underlying code remains ASCII-compatible. This helps to read and understand code faster.
- Gitleaks is a SAST tool for detecting and preventing hardcoded secrets like passwords, API keys, and tokens, past or present, in git repos.
- truffleHog searches through git repositories for high entropy strings and secrets, digging deep into commit history.
- git-secrets prevents you from committing secrets and credentials into git repositories.
- AWS vendor accounts is a list of AWS account IDs compiled by Cloudmapper. These are account IDs published by vendors to create a trust relationship for their products hosted in AWS.
- Gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.
- Grist is a modern relational spreadsheet. It combines the flexibility of a spreadsheet with the robustness of a database to organize your data and make you more productive.
- Fugue provides an easier interface to using distributed compute effectively and accelerates big data projects. It does this by minimizing the amount of code you need to write, in addition to taking care of tricks and optimizations that lead to more efficient execution on distributed compute.
- utt is the universal text transformer. utt is intended for converting between textual data representations. At the time of writing, supported formats are JSON, XML, CSV, YAML, Java Properties, TOML, Base 64, and Plain Text.
- Bash loading animations are ready-to-use loading animations in ASCII and UTF-8 for easy integration into your Bash scripts. Could be useful for improving the user interface of a quick data collection script.
- dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
- partiql provides SQL-compatible access to relational, semi-structured, and nested data.
- AWS Data Wrangler is Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
- Difftastic is an experimental diff tool that compares files based on their syntax.
- Pandera is a data validation library for scientists, engineers, and analysts seeking correctness.
- Great Expectations is a shared, open standard for data quality. It helps data teams eliminate pipeline debt, through data testing, documentation, and profiling.
- is used to provide a simple python interface for interacting with Atlassian products (Server, Data Center and Cloud) and apps from ecosystem (Portfolio, XRay). It is based on the official public Rest API documentation and private methods (+ xml+rpc, raw http request).
- Dagger is a portable devkit for CI/CD pipelines. Build powerful CI/CD pipelines quickly, then run them anywhere. Give your developers parity between dev and CI environments. Test and debug your pipelines locally. Run the same pipeline on any CI environment without re-writes. Developed in the open by the creators of Docker.
- Debezium provides a low-latency data streaming platform for change data capture (CDC).
- Renovate gives automated dependency updates. Multi-platform and multi-language.
- Zed (https://github.com/brimdata/zed) is a lot like jq but is built from the ground up as a search and analytics engine based on the Zed data model. Since Zed data is a proper superset of JSON, zq also works natively with JSON.
- JSONata, yet another JSON query language.
- Markdown Monster is a Markdown editor for Windows.
- easy-rsa is a CLI utility to build and manage a PKI CA. In layman’s terms, this means to create a root certificate authority, and request and sign certificates, including intermediate CAs and certificate revocation lists (CRL).
- Siteleaf is a static site CMS. I could rebuild my site using this.
- deterministic-zip is a simple (almost drop-in) replacement for zip that produces deterministic files. Useful for reproducible builds.
- trivy is a comprehensive security scanner. It is reliable, fast, extremely easy to use, and it works wherever you need it.
- PanWriter is a distraction-free markdown editor with pandoc integration and a preview pane. See Liam Proven’s review on The Register and view the comments for many more editor recommendations than you may ever need.
- Atuin replaces your existing shell history with a SQLite database, and records additional context for your commands. Additionally, it provides optional and fully encrypted synchronisation of your history between machines, via an Atuin server.
- CLI for Microsoft 365 helps you manage your Microsoft 365 tenant and SharePoint Framework projects. Its use may be restricted by policies in the organization.
- Microsoft Graph Explorer is a developer tool that lets you conveniently make Microsoft Graph REST API requests and view corresponding responses. Use Graph Explorer to try APIs on the default sample tenant to explore capabilities, or sign in to your own tenant and use it as a prototyping tool to fulfill your app scenarios. This tool includes helpful features such as code snippets (C#, Java, and JavaScript), Microsoft Graph Toolkit and adaptive cards integration, and more.
- SQL Notebook: Import your data from CSV, Excel, Microsoft SQL Server, PostgreSQL, and MySQL. Use a Jupyter-style notebook interface for exploratory queries, and write stored procedures for reusable logic. SQL Notebook is powered by an extended SQLite engine, supporting both standard SQL queries and SQL Notebook-specific commands and functions.
- Grav
- CI/CD Goat is a deliberately vulnerable CI/CD environment. Learn CI/CD security through multiple challenges.
- Git Credential Manager. Compared to Git’s built-in credential helpers (Windows: wincred, macOS: osxkeychain, Linux: gnome-keyring/libsecret), which provide single-factor authentication support working on any HTTP-enabled Git repository, GCM provides multi-factor authentication support for Azure DevOps, Azure DevOps Server (formerly Team Foundation Server), GitHub, Bitbucket, and GitLab.
- pass is a lightweight directory-based password manager that stores, retrieves, generates, and synchronizes passwords securely using gpg and git. Can be used as a backend for Git Credential Manager. Has an accompanying Android app.
- GitGoat is an open source tool that was built to enable DevOps and Engineering teams to design and implement a sustainable misconfiguration prevention strategy. It can be used to test products with access to GitHub repositories without a risk to your production environment.
- JSON Crack seamlessly visualizes your JSON data as graphs; paste, import or fetch!
- Spruce is a general purpose YAML & JSON merging tool. It is designed to be an intuitive utility for merging YAML/JSON templates together to generate complicated YAML/JSON config files in a repeatable fashion. It can be used to stitch together some generic/top level definitions for the config and pull in overrides for site-specific configurations to DRY your configs up as much as possible. (It is part of the BOSH tool ecosystem. BOSH is an open source tool for release engineering, deployment, lifecycle management, and monitoring of distributed systems.)
- dyff is a diff tool for YAML files, and sometimes JSON. dyff is inspired by the way the old BOSH v1 deployment output reported changes from one version to another by only showing the parts of a YAML file that change.
- Trusted Language Extensions for PostgreSQL (pg_tle) is an open source project that lets developers extend and deploy new PostgreSQL functionality with lower administrative and technical overhead. Developers can use Trusted Language Extensions for PostgreSQL to create and install extensions on restricted filesystems and work with PostgreSQL internals through a SQL API.
- WP2TXT extracts text and category data from Wikipedia dump files (encoded in XML / compressed with Bzip2), removing MediaWiki markup and other metadata.
- Deequ is a library built on top of Apache Spark for defining “unit tests for data”, which measure data quality in large datasets. Most applications that work with data have implicit assumptions about that data, e.g., that attributes have certain types, do not contain NULL values, and so on. If these assumptions are violated, your application might crash or produce wrong outputs. The idea behind deequ is to explicitly state these assumptions in the form of a “unit-test” for data, which can be verified on a piece of data at hand. If the data has errors, we can “quarantine” and fix it, before we feed it to an application.
- Soda Core is a free, open-source, command-line tool that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries. When it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as “bad”.
- Lewis automates chemical data extraction. It generates machine readable molecular structures from images and documents with high accuracy.
- Explaining why that molecule (exmol) is a package to explain black-box predictions of molecules. The package uses model agnostic explanations to help users understand why a molecule is predicted to have a property.
- Browsh is a fully interactive, real-time, and modern text-based browser rendered to TTYs and browsers.
- install-release is a CLI tool to install any tool for your device directly from its GitHub releases and keep it updated. You can consider it a small package manager for GitHub releases.
- transset sets the transparency on a window so that you can work on something such as a terminal or text editor in the foreground while having something running visibly in the background such as a video.
- GlassIt Linux is like transset for Visual Studio Code on Linux, because transset doesn’t affect it.
- Tube Archivist is a self-hosted YouTube media server.
- corruptor is a simple image glitcher suitable for producing nice looking i3lock backgrounds.
- yt-dlp is a youtube-dl fork based on the now inactive youtube-dlc. The main focus of this project is adding new features and patches while also keeping up to date with the original project.
- MITM cheat sheet’s authors tried to put together all known MITM attacks and methods of protection against these attacks. It also contains tools for carrying out MITM attacks, some interesting attack cases and some tricks associated with them.
- Nala is a front-end for libapt-pkg. For newer users it can be hard to understand what apt is trying to do when installing or upgrading. The authors aim to solve this by not showing some redundant messages, formatting the packages better, and using color to show specifically what will happen with a package during install, removal, or an upgrade.
- deb-get provides apt-get functionality for .debs published in 3rd party repositories or via direct download.
- xournal is an application for notetaking, sketching, and keeping a journal using a stylus. It’s also a way to sign PDFs without having to print, use pen on paper, and scan.
- CrypTool 2 is a modern e-learning program for Windows, which visualizes cryptography and cryptanalysis. It includes not only the encryption and cryptanalysis of ciphers, but also their basics and the whole spectrum of modern cryptography. In 2023 it was used to read previously undeciphered letters that Mary, Queen of Scots wrote between 1578 and 1584 to Michel de Castelnau Mauvissière, the French ambassador to England.
- tree is a handy little utility to display a tree view of directories. The `--fromfile` option can also display a tree view of arbitrary prefix hierarchies, such as S3 object paths.
- GitPitch is the perfect slide deck solution for tech conferences, training, developer advocates, and educators.
- Open external links in a container is a Firefox extension that enables support for opening links in specific containers using a custom protocol handler.
- PlantUML generates diagrams from a textual description.
- Diagrams lets you draw the cloud system architecture in Python code.
- Diagram as code survey article: Diagrams, Mermaid, Asciiflow, DOT-to-ASCII, Monodraw, PlantUML, Markmap, Go diagrams.
- handlr manages default applications for opening files in the terminal. Seems to be more flexible than xdg-open.
- JMeter is an open source Java application from The Apache Software Foundation, designed to load test applications and measure their performance. It works with both static and dynamic web applications.
- Image to Data Uri is a Visual Studio Code extension that can convert an image to an html data uri.
- Docusaurus is a project for building, deploying, and maintaining open source project websites easily. Save time and focus on text documents. Simply write docs and blog posts with MDX, and Docusaurus builds them into static HTML files ready to be served. You can even embed React components in your Markdown thanks to MDX.
- Markdown for the component era. MDX allows you to use JSX in your markdown content. You can import components, such as interactive charts or alerts, and embed them within your content. This makes writing long-form content with components a blast. 🚀
- bump is a generic version tracking and update tool. Bump can be used to automate version updates where other version and package management systems do not fit or can’t be used. This can be, for example, when having versions of dependencies in Makefiles, Dockerfiles, scripts or other kinds of texts.
- Pano is a clipboard manager for the Gnome shell with content-aware previews and notifications. (Via OMG Ubuntu).
- Gnome Extensions CLI provides a way to install Gnome extensions using the CLI. The official way is to use a Firefox extension.
- Audit Logs Wall of Shame is a list of vendors that don’t prioritize high-quality, widely-available audit logs for security and operations teams. CloudTrail only gets a C+.
- The SSO Wall of Shame is a list of vendors that treat single sign-on as a luxury feature, not a core security requirement.
- sandbox.bio lets you learn how to use bioinformatics tools right from your browser. Everything runs in a sandbox, so you can experiment all you want. Has playgrounds for awk, jq, grep, and sed, and tutorials for bioinformatics tools such as bedtools, bowtie2, and samtools.
- Commitizen is a Git helper that prompts to fill out any required commit fields at commit time. No more waiting until later for a git commit hook to run and reject your commit (though that can still be helpful). No more digging through CONTRIBUTING.md to find what the preferred format is. Get instant feedback on your commit message formatting and be prompted for required fields.
- Better Commits is a CLI for writing better commits, following the conventional commit guidelines.
- jm (“JSON Machine”) by pkoppstein makes it easy to splat (that is, to stream) JSON arrays or JSON objects losslessly, even if they occur in very large JSON structures. (Losslessly here refers primarily to numeric precision, not the handling of duplicate keys within JSON objects.) See “Performance Comparisons” for why you might want to use it.
- Rewrap is an extension for hard-wrapping Markdown paragraphs, line comments in code, and more. Just press `Alt+Q`. I used to use Reflow Markdown to do this for Markdown, but these days I prefer to use editor soft wraps in prose.
- The DAM (Digital Asset Management) Book 3.0, Peter Krogh’s latest release, provides a holistic approach to the creation, storage and deployment of photographic media. It addresses the entire ecosystem of our visually connected world in a very complete and unified manner.
- Tiling Assistant is a GNOME Shell extension which adds a Windows-like snap assist to the GNOME desktop. It expands GNOME’s 2 column tiling layout and adds many more features. See Window Grab Modes in the wiki for an intro to the basic features.
- Metrics is an infographics generator with 30+ plugins and 300+ options to display stats about your GitHub account and render them as SVG, Markdown, PDF or JSON.
- Jupyter Book is an open-source tool for building publication-quality books and documents from computational material. This may help me to automate my own documentation.
- The Sleuth Kit® (TSK) is a library and collection of command line digital forensics tools that allow you to investigate volume and file system data. The library can be incorporated into larger digital forensics tools and the command line tools can be directly used to find evidence.
- Systemd Configurations Helper for Visual Studio Code supports systemd service configuration in VS Code. Provides syntax highlighting, autocomplete, linting, and documentation.
- Repo for learning k8s is an open source platform for learning Kubernetes and EKS and preparing for the Certified Kubernetes Specialist (CKA, CKS, CKAD) exams.
- GPT4 powered CLI for TDD: You write the test, GPT writes the code to pass it.
- GPT CLI auto-generates impressive commits in 1 second. Looks like it uses the conventional commits standard.
- dict is client/server software, human language dictionary databases, and tools supporting the DICT protocol (RFC 2229). Via Network World.
- Json Incremental Digger (jid) drills down JSON interactively by using filtering queries like jq. Its suggestions and auto-completion make drilling down into JSON very comfortable.
- jiq is jid with jq. You can drill down interactively by using jq filtering queries. jiq uses jq internally, and it requires you to have jq in your PATH. If you prefer, there’s an experimental, standalone, purely client-side web version.
- GNU datamash
- Qwant is a search engine that knows nothing about you. Does it have its own index or does it scrape something else? Via a rant on The Register.
- DeltaChat is a messaging app that works over e-mail. Via a rant on The Register.
- VSCodium provides binary releases of VS Code without MS branding/telemetry/licensing.
- Comcast simulates shitty network connections so you can build better systems. See section “I don’t trust you, this code sucks, I hate Go, etc” for a crash course in how to use the underlying tools.
- csvtk is a cross-platform, efficient and practical CSV/TSV toolkit in Golang. Read the section “versus csvkit”.
- Approach Bash Like A Developer is a programming guide for Bash.
- How can I handle command-line options and arguments in my script easily? is the next page I need to read from Greg Wooledge’s Bash guide.
- Utterances is a lightweight comments widget built on GitHub issues. Use GitHub issues for blog comments, wiki pages and more!
- giscus is a comments system powered by GitHub Discussions. Let visitors leave comments and reactions on your website via GitHub! Heavily inspired by utterances.
- Bash Line Editor is a command line editor written in pure Bash which replaces the default GNU Readline. Supports proper multi-line editing and syntax highlighting.
- Shellharden is a syntax highlighter and a tool to semi-automate the rewriting of scripts to ShellCheck conformance, mainly focused on quoting. The default mode of operation is like cat, but with syntax highlighting in foreground colors and suggestive changes in background colors.
- Cursorless is a spoken language for structural code editing, enabling developers to code by voice at speeds not possible with a keyboard. Cursorless decorates every token on the screen and defines a spoken language for rapid, high-level semantic manipulation of structured text. See the intro video on the homepage.
- Monolith is a CLI tool for saving complete web pages as a single HTML file, embedding CSS, image, and JavaScript assets all at once, producing a single HTML5 document that is a joy to store and share.
- HTTP Toolkit is an open-source tool for debugging, testing and building with HTTP(S) on Windows, Linux & Mac. You can use it to intercept, inspect & rewrite HTTP(S) traffic, from everything to anywhere. Explore Android app traffic, mock requests between your microservices, and x-ray your browser traffic to debug, understand and test anything.
- mitmproxy is a set of tools that provide an interactive, SSL/TLS-capable intercepting proxy for HTTP/1, HTTP/2, and WebSockets.
- jaggr is a command line tool to aggregate in real time a series of JSON logs. The main goal of this tool is to prepare data for plotting with jplot.
- Jplot tracks expvar-like (JSON) metrics and plots their evolution over time right into your iTerm2 terminal (or DRCS Sixel Graphics).
- Vegeta is a versatile HTTP load testing tool built out of a need to drill HTTP services with a constant request rate. It can be used both as a command line utility and a library.
- age is a simple, modern and secure file encryption tool, format, and Go library. It features small explicit keys, no config options, and UNIX-style composability.
- Jesth is a next-level human-readable data serialization format.
- trurl is a command line tool for URL parsing and manipulation.
- what is a Bash function that gets info about a command, like what exactly it is and where. It can help with understanding a command’s behaviour and troubleshooting issues. For example, if you run an executable, delete it, then try running it again, Bash may try to run the file that you just deleted (due to pathname hashing), leading to a confusing error message. what will tell you about that problem. Along with it is `symlink-info`, which details complicated symlinks. `what` uses it on symlinked executable files.
- SOPS: Secrets OPerationS is an editor of encrypted files that supports YAML, JSON, ENV, INI and BINARY formats and encrypts with AWS KMS, GCP KMS, Azure Key Vault, age, and PGP. It encrypts values within files instead of the whole file, so you can still see the structure without decryption.
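The SOPS entry above notes that values are encrypted while the structure of the file stays readable. A rough Python sketch of that shape follows. It is not SOPS and provides no secrecy: base64 stands in for a real cipher purely so the example runs with the standard library, and `mask_values`, `unmask_values`, and the `MASKED[...]` token format are hypothetical names invented for illustration.

```python
import base64
import json

PREFIX = "MASKED["  # hypothetical token wrapper, invented for this sketch


def mask_values(node):
    """Return a copy of node with every leaf value replaced by an opaque token.

    Keys and nesting are left untouched. base64 is only a stand-in for a
    real cipher here and offers NO confidentiality.
    """
    if isinstance(node, dict):
        return {key: mask_values(child) for key, child in node.items()}
    if isinstance(node, list):
        return [mask_values(child) for child in node]
    token = base64.b64encode(json.dumps(node).encode("utf-8")).decode("ascii")
    return f"{PREFIX}{token}]"


def unmask_values(node):
    """Reverse mask_values, restoring the original leaf values and types."""
    if isinstance(node, dict):
        return {key: unmask_values(child) for key, child in node.items()}
    if isinstance(node, list):
        return [unmask_values(child) for child in node]
    token = node[len(PREFIX):-1]
    return json.loads(base64.b64decode(token))


config = {"db": {"user": "app", "password": "hunter2"}, "ports": [5432, 5433]}
masked = mask_values(config)
print(json.dumps(masked, indent=2))
```

Because only the leaves change, a diff of two masked files still shows which keys were added, removed, or modified, which is the property the entry highlights.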