Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)

https://www.hey.earth/posts/duckdb-doom
#ycombinator #duckdb #sql #wasm #doom
Abusing DuckDB-WASM by making SQL draw 3D graphics (Sort Of)
Wow! Let's resurrect the '90s by using #SQL to render 3D graphics
https://www.hey.earth/posts/duckdb-doom #3Dgraphics #90sGaming #HackerNews #ngated
#nextflow DuckdbCache 1.0.0-rc2 released
Nextflow's #DuckDB Cache Plugin enables the storage of task execution metadata in a DuckDB database, ensuring data persistence and scalability. This plugin is particularly useful for workflows that require robust, centralized caching mechanisms across distributed environments.
Read more at
Well this looks interesting!
https://cidree.github.io/duckspatial/
I get PTSD every time I think of building a #CSV parser. Excellent work from #duckdb, as always, showing the versatility of their tool.
It's one of the core reasons I use it for all my file data needs.
https://duckdb.org/2025/04/16/duckdb-csv-pollock-benchmark.html
Drop #640 (2025-04-17): Twisted Topics Thursday
DuckDB MCP; Bad Bots: Whatcha Gonna Do?; Terrifying Telemetry
As this tariFFable week slinks ever further toward a close, we’ll mix it up a bit for the Thursday Drop and cover some AI, the increasingly dead internet, and why you should never open your internet browser ever again.
And, we’ll make today’s a bit beefier to make up for no Wednesday Drop.
TL;DR
(This is an LLM/GPT-generated summary of today’s Drop using Ollama + llama 3.2 and a custom prompt.)
(Linkifying the WHOLE PARAGRAPH was an unexpected ‘hallucination’.)
DuckDB MCP
Photo by Brenadette Grindle Faust on Pexels.com

Our fine, feathered friends over at MotherDuck (the cloudy version of DuckDB) have a spiffy Model Context Protocol server for DuckDB which is super easy to hook up to, say, Claude Desktop to start exploring your data in some fancy and fun new ways.
I’m not a big fan of including all the args in the JSON configs of all these MCP clients (including Claude), so I made an entry for some tag, CVE, and sensor data from my $WORK like this:
"gndb": { "command": "/Users/me/bin/gnddbmcp.sh"}
and that corresponding shell script is just:
#!/usr/bin/env bash
/Users/me/.cargo/bin/uvx mcp-server-motherduck --db-path /Users/me/Data/gn.ddb
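For context, that entry lives under the mcpServers key in Claude Desktop's claude_desktop_config.json; a minimal sketch of the whole file, assuming this is the only server you have configured:

{
  "mcpServers": {
    "gndb": {
      "command": "/Users/me/bin/gnddbmcp.sh"
    }
  }
}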
Claude has an “Attach to MCP” button which you can use to connect to the server and start exploring your data. When I asked it what tables I had, I got something like this back:
> what tables are there?

To see what tables exist in your database, I can help you run a query. Let me do that for you:

{ `query`: `SHOW TABLES;` }

Based on the query results, there are 4 tables in your database:

1. cveActivity
2. cveCpe
3. sensors
4. tagActivity

Would you like to explore the schema of any of these tables to understand their structure?
I asked it a query that I already have scripts for (to see how well it performed):
> What tags have anomalous recent spikes with at least one day having over 200 unique IPs?

To find tags with anomalous recent spikes with at least one day having over 200 unique IPs, I'll need to analyze the tagActivity table. Let me write a query to identify this pattern:

{ `query`: `
-- First, let's examine the schema of the tagActivity table to understand its structure
DESCRIBE tagActivity;
` }

Now I'll query for tags that have had at least one day with over 200 unique IPs, and analyze patterns to identify anomalous spikes:

{ `query`: `
WITH tag_stats AS (
  -- Calculate the average unique IPs per day for each tag
  SELECT
    tag_id,
    name,
    AVG(unique_ips) AS avg_unique_ips,
    MAX(unique_ips) AS max_unique_ips,
    STDDEV(unique_ips) AS stddev_unique_ips
  FROM tagActivity
  GROUP BY tag_id, name
  HAVING MAX(unique_ips) > 200 -- Filter for tags with at least one day over 200 unique IPs
),
tag_anomalies AS (
  -- Join back to get days with anomalous activity,
  -- defined as days where unique_ips exceeds average + 2*stddev
  SELECT
    ta.day,
    ta.tag_id,
    ta.name,
    ta.unique_ips,
    ts.avg_unique_ips,
    ts.stddev_unique_ips,
    (ta.unique_ips - ts.avg_unique_ips) / NULLIF(ts.stddev_unique_ips, 0) AS z_score
  FROM tagActivity ta
  JOIN tag_stats ts ON ta.tag_id = ts.tag_id AND ta.name = ts.name
  WHERE ta.unique_ips > ts.avg_unique_ips + 2 * ts.stddev_unique_ips
    AND ta.unique_ips > 200
  ORDER BY z_score DESC
)
-- Get the final results with the most anomalous tags first
SELECT
  tag_id,
  name,
  MAX(z_score) AS max_z_score,
  MAX(unique_ips) AS max_unique_ips,
  AVG(unique_ips) AS avg_unique_ips_during_anomalies,
  COUNT(*) AS anomalous_days
FROM tag_anomalies
GROUP BY tag_id, name
ORDER BY max_z_score DESC
LIMIT 15;
` }

Let's get a bit more detail on the most significant anomalies by looking at their daily activity patterns:

{ `query`: `
-- Let's look at the daily pattern for the top 5 anomalous tags
WITH top_anomalous_tags AS (
  SELECT tag_id FROM (
    WITH tag_stats AS (
      SELECT
        tag_id,
        name,
        AVG(unique_ips) AS avg_unique_ips,
        MAX(unique_ips) AS max_unique_ips,
        STDDEV(unique_ips) AS stddev_unique_ips
      FROM tagActivity
      GROUP BY tag_id, name
      HAVING MAX(unique_ips) > 200
    ),
    tag_anomalies AS (
      SELECT
        ta.day,
        ta.tag_id,
        ta.name,
        ta.unique_ips,
        ts.avg_unique_ips,
        ts.stddev_unique_ips,
        (ta.unique_ips - ts.avg_unique_ips) / NULLIF(ts.stddev_unique_ips, 0) AS z_score
      FROM tagActivity ta
      JOIN tag_stats ts ON ta.tag_id = ts.tag_id AND ta.name = ts.name
      WHERE ta.unique_ips > ts.avg_unique_ips + 2 * ts.stddev_unique_ips
        AND ta.unique_ips > 200
    )
    SELECT tag_id, name, MAX(z_score) AS max_z_score
    FROM tag_anomalies
    GROUP BY tag_id, name
    ORDER BY max_z_score DESC
    LIMIT 5
  )
)
-- Get daily activity for these tags over the last 30 days
SELECT ta.day, ta.tag_id, ta.name, ta.unique_ips
FROM tagActivity ta
JOIN top_anomalous_tags t ON ta.tag_id = t.tag_id
WHERE ta.day >= (SELECT MAX(day) FROM tagActivity) - INTERVAL 30 DAY
ORDER BY ta.tag_id, ta.day;
` }

Let's now find out when the anomalous spikes occurred (finding days when the unique_ips count exceeded 200 for these tags):

{ `query`: `
-- Find days where the unique_ips count exceeded 200 for these tags
SELECT day, tag_id, name, unique_ips
FROM tagActivity
WHERE unique_ips > 200
ORDER BY day DESC, unique_ips DESC;
` }

…
I’ll spare you the entire session and just drop the result (this is all 100% accurate based on what I already knew):
Most Significant Anomalous Spikes
- AWS Configuration Scanner (bbf098dc-a1e5-4e28-a5c3-cf7d96765151)
  - Had a major spike on April 9, 2025, with 1,742 unique IPs
  - This is an extreme outlier with a z-score of 6.52 (meaning it’s 6.52 standard deviations above normal activity)
  - Normal activity for this tag averages around 130 unique IPs per day
  - Also had another spike on April 16, 2025, with 280 unique IPs
- robots.txt Scanner (215d03ae-6899-49dd-9167-4c22fe1e832d)
  - Experienced a significant spike on April 7, 2025, with 9,478 unique IPs
  - This represents a z-score of 5.75
  - This scanner typically has high activity (average ~5,800 IPs daily), but the April 7 spike was still notably anomalous
- Spring Boot Actuator Crawler (d9f421c7-d877-448f-a822-95a24f2a0ad7)
  - Saw a major spike on April 1, 2025, with 4,810 unique IPs
  - Represents a z-score of 5.63
  - Normal activity averages around 348 unique IPs per day
Other Recent Anomalous Activity in April
- Git Config Crawler (c536e78c-7206-47e9-90f3-6e4302444185)
  - Spike on April 16, 2025, with 1,004 unique IPs (z-score 3.31)
  - Normal activity averages around 347 unique IPs
- MVPower CCTV DVR RCE CVE-2016-20016 Attempt (3b9f5a45-63d0-4b08-a4c7-cd514cdc92f4)
  - Spike on April 16, 2025, with 1,917 unique IPs (z-score 3.24)
  - Normal activity averages around 515 unique IPs
- JAWS Webserver RCE (55ca1afe-c8a3-4d7b-a741-81075b9f8200)
  - Spike on April 16, 2025, with 1,917 unique IPs (z-score 3.24)
  - Normal activity averages around 515 unique IPs
- WordPress Backup Migration RCE Attempt (95ecfd16-5307-47d7-9500-2f64b93adb11)
  - Spike on April 13, 2025, with 388 unique IPs (z-score 3.33)
  - Normal activity averages around 51 unique IPs
- SPIP Scanner (487d4a3f-884c-416e-9e69-239b04baa5c8)
  - Spike on April 12, 2025, with 273 unique IPs (z-score 3.02)
  - Normal activity averages around 63 unique IPs
- Apache Tomcat Crawler (ebd17317-0de5-431b-9aa6-c91407eaf40f)
  - Spike on April 14, 2025, with 547 unique IPs (z-score 3.12)
  - Normal activity averages around 218 unique IPs
Patterns and Observations
- The AWS Configuration Scanner showed the most statistically significant anomaly, with unique IP counts over 13x normal levels
- Several security vulnerability scanners appear in the list, suggesting coordinated scanning activities
- Many of these anomalous spikes occurred on specific dates (April 9, April 16, April 13), which could indicate coordinated scanning campaigns
- There’s a notable pattern of targeting web-related vulnerabilities (AWS configurations, Spring Boot, WordPress, Apache Tomcat)
These spikes represent unusual scanning activity that deviates significantly from normal patterns and could indicate new vulnerability scanning campaigns or increased interest in exploiting specific vulnerabilities.
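If you want to gut-check those z-scores, the math is just (observed − mean) / standard deviation. Taking the AWS Configuration Scanner numbers above (a 1,742-IP spike against a ~130-IP daily average at z ≈ 6.52), the implied standard deviation works out to roughly (1742 − 130) / 6.52 ≈ 247. A quick sanity check in DuckDB itself (the 247 is derived from the reported figures, not pulled from the data):

SELECT (1742 - 130) / 247.0 AS z_score; -- ≈ 6.53, in line with the reported 6.52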
I went on to ask a more gnarly question: “Which sensor + persona combos experienced degraded performance during the timespan in the records?”, and I put the results here.
You’ll see I had to tell it that the last day’s records were incomplete, so it had to rerun the analysis. I double-checked the results with my Observable Framework dashboard, and — sure enough — it got it right.
I know lots of folks are reluctant to have these tools touch real data and draw real conclusions. And, that is 100% fair. Just because you’re reducing the content pool for the probabilistic choices does not mean it won’t make things up, but — if you know your domain — these can be tremendously useful tools to augment your regular data analysis.
Bad Bots: Whatcha Gonna Do?
Photo by Pavel Danilyuk on Pexels.com

Imperva is one of the decent names in both the “cloud” and “security” spaces. And, like all vendors (including my $WORK), they released a new report right before the annual corporate cyber megacon (RSA). This year, it’s on “bad bots”, and it paints a stark picture of an internet where automated bots, empowered by artificial intelligence, now dominate web traffic and pose escalating risks across the globe.
For the first time in a decade, automated traffic exceeded human activity, making up 51% of all web traffic in 2024. This surge is largely attributed to the proliferation of AI tools and large language models, which have dramatically lowered the technical barriers for attackers. As a result, malicious bots—those designed to commit fraud, steal data, or disrupt services—accounted for 37% of all internet traffic, up sharply from 32% the previous year. “Good” bots, such as search engine crawlers, now represent just 14% of traffic.
This rise in bad bot activity is not just a matter of volume but also of sophistication. Attackers increasingly use advanced tactics to mimic human behavior, evade detection, and exploit vulnerabilities in business logic, especially within APIs. In 2024, 44% of advanced bot traffic targeted APIs, compared to just 10% directed at traditional web applications. These attacks are highly effective because APIs are the backbone of digital transformation, powering everything from payments to analytics, and often lack the same level of scrutiny as user-facing web interfaces. Attackers exploit API logic to automate fraud, scrape sensitive data, and bypass security controls, frequently with devastating financial and reputational consequences for organizations.
Industry analysis reveals shifting patterns in bot targeting. The travel sector overtook retail in 2024 as the most attacked industry, accounting for 27% of all bad bot attacks. Travel and retail both face advanced threats: bots disrupt inventory, manipulate pricing models, and hijack customer accounts. In the travel industry, “seat spinning” bots hold tickets without purchasing, while scraping bots distort look-to-book ratios, undermining revenue management and competitive pricing. Retailers face ongoing threats from scalping, gift card fraud, and price scraping, with bot attacks now occurring year-round rather than just during peak shopping seasons.
The report also details the economic and regulatory consequences of unchecked bot activity. Successful attacks can lead to direct financial losses, regulatory penalties under laws like GDPR and CCPA, legal costs, and long-term reputational harm. For instance, a case study highlights how a global talent agency saw 83% of its web traffic generated by bad bots, skewing marketing analytics and draining advertising budgets until advanced bot protection was deployed.
There’s TONS more info in the report, along with recommendations for mitigating bot threats. It’s a great read, with a pretty modern/bold design.
Terrifying Telemetry
Photo by dabatepatfotos on Pexels.com

The 2025 “Web Browser Telemetry” report from sizeof.cat reveals how modern browsers communicate with external servers without our knowledge. The findings suggest that privacy concerns are warranted, as nearly all mainstream browsers engage in background network activities.
Researchers tested various browsers by analyzing network requests during fresh launches and navigation to blank pages. Chrome and Edge proved to be the most communicative, immediately sending data to Google and Microsoft servers for updates, safe browsing features, and browser-wielder metrics collection.
Firefox, though often recommended for privacy, still connects to Mozilla servers at startup for telemetry, updates, and security checks. Even after disabling telemetry in settings, some network requests continue for security updates and features like Pocket.
Modified Firefox versions like LibreWolf and Waterfox present a more nuanced situation. These browsers avoid unsolicited connections to Mozilla or analytics servers at startup, supporting their privacy-focused claims. Yet community reports suggest they may still make minimal connections for extension updates or certificate verification, though far fewer than standard Firefox. For folks prioritizing minimal telemetry, these modified versions currently offer the best balance, despite potentially slower security updates due to smaller development teams.
Chromium-based browsers like Brave and Vivaldi, which emphasize privacy features, also maintain some background connections for their own systems. While they block many external trackers, they still communicate with their respective servers for updates and occasionally for metrics collection, though less aggressively than Chrome or Edge.
Advanced privacy tools like Pi-hole or DNS-based blocking provide only partial protection, as some browsers bypass system DNS settings through direct IP connections to telemetry endpoints. This bypassing is particularly common in Windows and Chrome-based browsers that increasingly use hardcoded DNS-over-HTTPS or direct IP addresses to avoid local network controls.
It’s a pretty bleak and detailed report, but the blows are dampened if you move to a different tab and watch the site cycle through different <title> sequences that make it look like various other online services.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:
@dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy
Drop #640 (2025-04-17): Twisted Topics Thursday
Today's Drop discusses MotherDuck's #DuckDB Model Context Protocol server with a walk through my sensor network data. It further highlights the rise of automated bots, which now constitute 51% of web traffic, and, lastly, emphasizes the privacy concerns surrounding modern web browsers' telemetry practices.
http://dailydrop.hrbrmstr.dev/2025/04/17/drop-640-2025-04-17-twisted-topics-thursday/
AI/LLM/GPT haters gonna hate, but this is fun/helpful. #DuckDB #MCP #Claude #GreyNoise
@ChristosArgyrop While there's truth to this, #DuckDB and #ApacheArrow can still help tons with large datasets before we need actual SQL! Don't know if that counts as giggles, though
Is there some nice web frontend for #duckdb that lets users do faceted searches and supports permalinks?
Bonus Drop #82 (2025-04-12): Quick Hits
QH: Lit Lit; QH: Beyond Quacking; QH: Granary
This hit kind of hard Friday night (the Trump administration deliberately dropped that nugget after the news cycle), so y’all are getting barely edited notes for a bonus Drop this weekend.
I know I sound like some insane person, but if you are a reader who is in the U.S., it would not be unwise to stock up on shelf-stable/non-perishable items before the 20th. If I’m wrong, the worst-case scenario is that you have extra stuff for a while. If I’m right…
TL;DR
(This is an AI-generated summary of today’s Drop using Ollama + llama 3.2 and a custom prompt.)
For the record, I’ve not changed anything, and it just started to do this “summary of” stuff o_O.
QH: Lit Lit
Photo by Lgh_9 on Pexels.com

Longtime readers of the Drop know I ♥ Lit webcomponents. This past week, I came across — quite literally — the absolute best post describing what they are, why they rock, and how best to use them.
In “The Case for Web Components with Lit”, Philipp Kunz covers all these topics concisely, yet thoroughly enough both to make a convincing case for using Lit and to teach an old dog (like me) some new and spiffy tricks for using Lit in the most optimal way possible.
Among the tricks covered: the css tag and the repeat directive.

QH: Beyond Quacking
Photo by Pixabay on Pexels.com

I read “Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB” earlier in April, and it’s a perfect item for this weekend’s “Quick Hits” edition since I don’t have a DuckDB-focused edition on the planning calendar for a bit.
The paper features FlockMTL — a DuckDB extension integrating LLMs and RAG capabilities directly into SQL workflows, targeting knowledge-intensive analytical applications. It introduces novel schema objects (MODEL, PROMPT), semantic functions, and optimizations to simplify complex data pipelines combining structured/unstructured data analysis.
These are the notes on FlockMTL core features ripped from my Joplin page on it:
MODEL & PROMPT are first-class resources letting us modify global LLM/ML configs without modifying our actual queries:

CREATE GLOBAL MODEL('model-relevance-check', 'gpt-4o-mini', 'openai');
CREATE PROMPT('joins-prompt', 'is related to join algos given abstract');
FlockMTL comes with new semantic DDB functions:

- scalar, map-style functions (llm_filter, llm_complete_json)
- aggregate, reduce-style functions (llm_rerank, llm_first/last)

FlockMTL introduces several optimizations to enhance LLM-integrated SQL workflows, focusing on efficiency and usability. Here’s a technical breakdown of the key mechanisms:
“Batching” dynamically groups input tuples into optimal batch sizes based on LLM context windows, achieving 7× speedup for chat completions and 48× for embeddings. The system automatically reduces batch sizes by 10% upon context overflow errors and handles edge cases where individual tuples exceed context limits by returning NULL.
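For the curious, here's a minimal sketch of that backoff behavior (in Python, and decidedly not FlockMTL's actual implementation): batch the tuples, shrink the batch size by 10% whenever the provider signals a context-window overflow, and emit NULL for any lone tuple that is itself too large. The ContextOverflowError type and call_llm callable are stand-ins for whatever the provider surfaces:

class ContextOverflowError(Exception):
    """Stand-in for whatever error the LLM provider raises on context overflow."""

def run_batched(tuples, call_llm, batch_size):
    # Process `tuples` in batches sized to the LLM's context window.
    results, i = [], 0
    while i < len(tuples):
        batch = tuples[i : i + batch_size]
        try:
            results.extend(call_llm(batch))
            i += len(batch)
        except ContextOverflowError:
            if len(batch) == 1:
                # This tuple alone exceeds the context limit, so return NULL for it.
                results.append(None)
                i += 1
            else:
                # Reduce the batch size by 10% and retry the same window.
                batch_size = max(1, int(batch_size * 0.9))
    return results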
“Meta-Prompting” implements structured prompt templates that wrap user instructions with additional task context before they reach the model.
“Execution Plan Controls” extend DuckDB’s query planner to account for these LLM-backed operations.
Example Workflow
WITH relevant_papers AS (
  SELECT *
  FROM research_papers
  WHERE llm_filter(params, prompt, {title, abstract})
)
SELECT
  llm_complete_json(params, prompt, {content}) AS insights,
  llm_rerank(params, prompt, content) AS rankings
FROM relevant_papers
Supports cloud (OpenAI/Azure) and local (Ollama) LLMs.
If you have an Apple Silicon Mac or another system with a compatible GPU, this is a nice way to experiment with RAG in a SQL context.
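If you go that local route, the CREATE MODEL call from earlier presumably just swaps providers. A hypothetical sketch (the 'llama3.2' model name and the 'ollama' provider string are my assumptions here; check the FlockMTL docs for the exact syntax):

-- hypothetical: register a local Ollama model instead of an OpenAI one
CREATE GLOBAL MODEL('local-relevance-check', 'llama3.2', 'ollama');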
QH: Granary
Photo by Juan Martin Lopez on Pexels.com

Every social media platform uses a different “content object” format, despite all of them (pretty much) having the same overall structure.
Granary (GH) is a spiffy library and REST API designed to fetch and convert social data across various platforms and formats. It acts as a (sort of) universal translator for social web data, enabling interoperability between native APIs, scraped HTML, and standardized formats like ActivityStreams v1/v2, microformats2, Atom, RSS, and JSON Feed. The framework emphasizes open protocols while supporting major platforms including Twitter, Mastodon, Bluesky/AT Protocol, GitHub, and Instagram.
The tool provides two primary interfaces:

- a Python library (pip install granary) for programmatic access
- a REST API for fetch-and-convert requests over HTTP

Recent updates (v8.1 as of March 2025) added enhanced Bluesky integration including hashtag facets, improved content truncation for non-space-delimited languages, and NIP-27 support for Nostr. The project emphasizes privacy – no user data is stored, only transformed in transit.
’Tis a super-niche solution, but it’s also a nice codebase to peruse if you want to get familiar with the various socmed platform data formats.
FIN
Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:
@dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy
Bonus Drop #82 (2025-04-12): Quick Hits
The weekend Bonus Drop contains 3 "quick hits": one highlights Lit web components' advantages, another showcases FlockMTL's integration of LLMs into #DuckDB SQL workflows, and a third features Granary's role as a universal translator for social web data.
http://dailydrop.hrbrmstr.dev/2025/04/12/bonus-drop-82-2025-04-12-quick-hits/
Just released duckfetch v0.4.0, which lets you manage DuckDB installations on Linux, Windows, and macOS
https://github.com/PMassicotte/duckfetch/releases/tag/v0.4.0
Connect to DuckDB in #FileMaker
Today we'd like to connect to a #DuckDB database from FileMaker using the SQL functions in the MBS FileMaker Plugin.
Accessing a Delta table created as an external table in S3 with Databricks, from Python in a local environment
https://dev.classmethod.jp/articles/databricks-s3-delta-python/