
Drop #640 (2025-04-17): Twisted Topics Thursday

DuckDB MCP; Bad Bots: Whatcha Gonna Do?; Terrifying Telemetry

As this tariFFable week slinks ever so slowly to a close, we’ll mix it up a bit for the Thursday Drop and cover some AI, the increasingly dead internet, and why you should never open your internet browser ever again.

And, we’ll make today’s a bit beefier to make up for no Wednesday Drop.


TL;DR

(This is an LLM/GPT-generated summary of today’s Drop using Ollama + llama 3.2 and a custom prompt.)

(Linkifying the WHOLE PARAGRAPH was an unexpected ‘hallucination’.)

  • The 2025 Web Browser Telemetry report from sizeof.cat shows that nearly all mainstream browsers, including privacy-focused ones, make unsolicited background connections, with only modified versions like LibreWolf and Waterfox minimizing telemetry at the expense of slower updates.
  • A hands-on walkthrough demonstrates connecting Claude Desktop to a DuckDB Model Context Protocol server, running anomaly detection SQL queries, and interpreting results to identify significant scanning spikes in security data.
  • Imperva’s 2025 Bad Bot Report reveals that automated bots now account for 51% of web traffic, with malicious bots making up 37% and targeting APIs and industries like travel and retail using increasingly sophisticated tactics.

DuckDB MCP


Our fine, feathered friends over at MotherDuck (the cloudy version of DuckDB) have a spiffy Model Context Protocol server for DuckDB which is super easy to hook up to, say, Claude Desktop to start exploring your data in some fancy and fun new ways.

I’m not a big fan of including all the args in the JSON configs of all these MCP clients (including Claude) so I made an entry for some tag, CVE, and sensor data from my $WORK like this:

"gndb": {  "command": "/Users/me/bin/gnddbmcp.sh"}

and that corresponding shell script is just:

#!/usr/bin/env bash
/Users/me/.cargo/bin/uvx mcp-server-motherduck --db-path /Users/me/Data/gn.ddb

Claude has an “Attach to MCP” button which you can use to connect to the server and start exploring your data. When I asked it what tables I had, I got something like this back:

> what tables are there?

To see what tables exist in your database, I can help you run a query. Let me do that for you:

{
  `query`: `SHOW TABLES;`
}

Based on the query results, there are 4 tables in your database:

1. cveActivity
2. cveCpe
3. sensors
4. tagActivity

Would you like to explore the schema of any of these tables to understand their structure?

I then asked it a question I already have scripts for (to see how well it performed):

> What tags have anomalous recent spikes with at least one day having over 200 unique IPs?

To find tags with anomalous recent spikes with at least one day having over 200 unique IPs, I'll need to analyze the tagActivity table. Let me write a query to identify this pattern:

{
  `query`: `-- First, let's examine the schema of the tagActivity table to understand its structure
DESCRIBE tagActivity;`
}

Now I'll query for tags that have had at least one day with over 200 unique IPs, and analyze patterns to identify anomalous spikes:

{
  `query`: `WITH tag_stats AS (
    -- Calculate the average unique IPs per day for each tag
    SELECT
        tag_id,
        name,
        AVG(unique_ips) AS avg_unique_ips,
        MAX(unique_ips) AS max_unique_ips,
        STDDEV(unique_ips) AS stddev_unique_ips
    FROM
        tagActivity
    GROUP BY
        tag_id, name
    HAVING
        MAX(unique_ips) > 200  -- Filter for tags with at least one day over 200 unique IPs
),
tag_anomalies AS (
    -- Join back to get days with anomalous activity
    -- defined as days where unique_ips exceeds average + 2*stddev
    SELECT
        ta.day,
        ta.tag_id,
        ta.name,
        ta.unique_ips,
        ts.avg_unique_ips,
        ts.stddev_unique_ips,
        (ta.unique_ips - ts.avg_unique_ips) / NULLIF(ts.stddev_unique_ips, 0) AS z_score
    FROM
        tagActivity ta
    JOIN
        tag_stats ts ON ta.tag_id = ts.tag_id AND ta.name = ts.name
    WHERE
        ta.unique_ips > ts.avg_unique_ips + 2 * ts.stddev_unique_ips
        AND ta.unique_ips > 200
    ORDER BY
        z_score DESC
)
-- Get the final results with the most anomalous tags first
SELECT
    tag_id,
    name,
    MAX(z_score) AS max_z_score,
    MAX(unique_ips) AS max_unique_ips,
    AVG(unique_ips) AS avg_unique_ips_during_anomalies,
    COUNT(*) AS anomalous_days
FROM
    tag_anomalies
GROUP BY
    tag_id, name
ORDER BY
    max_z_score DESC
LIMIT 15;`
}

Let's get a bit more detail on the most significant anomalies by looking at their daily activity patterns:

{
  `query`: `-- Let's look at the daily pattern for the top 5 anomalous tags
WITH top_anomalous_tags AS (
    SELECT
        tag_id
    FROM (
        WITH tag_stats AS (
            SELECT
                tag_id,
                name,
                AVG(unique_ips) AS avg_unique_ips,
                MAX(unique_ips) AS max_unique_ips,
                STDDEV(unique_ips) AS stddev_unique_ips
            FROM
                tagActivity
            GROUP BY
                tag_id, name
            HAVING
                MAX(unique_ips) > 200
        ),
        tag_anomalies AS (
            SELECT
                ta.day,
                ta.tag_id,
                ta.name,
                ta.unique_ips,
                ts.avg_unique_ips,
                ts.stddev_unique_ips,
                (ta.unique_ips - ts.avg_unique_ips) / NULLIF(ts.stddev_unique_ips, 0) AS z_score
            FROM
                tagActivity ta
            JOIN
                tag_stats ts ON ta.tag_id = ts.tag_id AND ta.name = ts.name
            WHERE
                ta.unique_ips > ts.avg_unique_ips + 2 * ts.stddev_unique_ips
                AND ta.unique_ips > 200
        )
        SELECT
            tag_id,
            name,
            MAX(z_score) AS max_z_score
        FROM
            tag_anomalies
        GROUP BY
            tag_id, name
        ORDER BY
            max_z_score DESC
        LIMIT 5
    )
)
-- Get daily activity for these tags over the last 30 days
SELECT
    ta.day,
    ta.tag_id,
    ta.name,
    ta.unique_ips
FROM
    tagActivity ta
JOIN
    top_anomalous_tags t ON ta.tag_id = t.tag_id
WHERE
    ta.day >= (SELECT MAX(day) FROM tagActivity) - INTERVAL 30 DAY
ORDER BY
    ta.tag_id, ta.day;`
}

Let's now find out when the anomalous spikes occurred (finding days when the unique_ips count exceeded 200 for these tags):

{
  `query`: `-- Find days where the unique_ips count exceeded 200 for these tags
SELECT
    day,
    tag_id,
    name,
    unique_ips
FROM
    tagActivity
WHERE
    unique_ips > 200
ORDER BY
    day DESC, unique_ips DESC;`
}

…

I’ll spare you the entire session and just drop the result (this is all 100% accurate, based on what I already knew):

Most Significant Anomalous Spikes

  1. AWS Configuration Scanner (bbf098dc-a1e5-4e28-a5c3-cf7d96765151)
    • Had a major spike on April 9, 2025, with 1,742 unique IPs
    • This is an extreme outlier with a z-score of 6.52 (meaning it’s 6.52 standard deviations above normal activity)
    • Normal activity for this tag averages around 130 unique IPs per day
    • Also had another spike on April 16, 2025, with 280 unique IPs
  2. robots.txt Scanner (215d03ae-6899-49dd-9167-4c22fe1e832d)
    • Experienced a significant spike on April 7, 2025, with 9,478 unique IPs
    • This represents a z-score of 5.75
    • This scanner typically has high activity (average ~5,800 IPs daily), but the April 7 spike was still notably anomalous
  3. Spring Boot Actuator Crawler (d9f421c7-d877-448f-a822-95a24f2a0ad7)
    • Saw a major spike on April 1, 2025, with 4,810 unique IPs
    • Represents a z-score of 5.63
    • Normal activity averages around 348 unique IPs per day

Other Recent Anomalous Activity in April

  1. Git Config Crawler (c536e78c-7206-47e9-90f3-6e4302444185)
    • Spike on April 16, 2025, with 1,004 unique IPs (z-score 3.31)
    • Normal activity averages around 347 unique IPs
  2. MVPower CCTV DVR RCE CVE-2016-20016 Attempt (3b9f5a45-63d0-4b08-a4c7-cd514cdc92f4)
    • Spike on April 16, 2025, with 1,917 unique IPs (z-score 3.24)
    • Normal activity averages around 515 unique IPs
  3. JAWS Webserver RCE (55ca1afe-c8a3-4d7b-a741-81075b9f8200)
    • Spike on April 16, 2025, with 1,917 unique IPs (z-score 3.24)
    • Normal activity averages around 515 unique IPs
  4. WordPress Backup Migration RCE Attempt (95ecfd16-5307-47d7-9500-2f64b93adb11)
    • Spike on April 13, 2025, with 388 unique IPs (z-score 3.33)
    • Normal activity averages around 51 unique IPs
  5. SPIP Scanner (487d4a3f-884c-416e-9e69-239b04baa5c8)
    • Spike on April 12, 2025, with 273 unique IPs (z-score 3.02)
    • Normal activity averages around 63 unique IPs
  6. Apache Tomcat Crawler (ebd17317-0de5-431b-9aa6-c91407eaf40f)
    • Spike on April 14, 2025, with 547 unique IPs (z-score 3.12)
    • Normal activity averages around 218 unique IPs

Patterns and Observations

  • The AWS Configuration Scanner showed the most statistically significant anomaly, with unique IP counts over 13x normal levels
  • Several security vulnerability scanners appear in the list, suggesting coordinated scanning activities
  • Many of these anomalous spikes occurred on specific dates (April 9, April 16, April 13), which could indicate coordinated scanning campaigns
  • There’s a notable pattern of targeting web-related vulnerabilities (AWS configurations, Spring Boot, WordPress, Apache Tomcat)

These spikes represent unusual scanning activity that deviates significantly from normal patterns and could indicate new vulnerability scanning campaigns or increased interest in exploiting specific vulnerabilities.

I went on to ask a more gnarly question: “Which sensor + persona combos experienced degraded performance during the timespan in the records?”, and I put the results here.

You’ll see I had to tell it that the last day’s records were incomplete, so it had to rerun the analysis. I double-checked the results with my Observable Framework dashboard, and — sure enough — it got it right.
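If you want to spot-check what the model claims without spinning up a dashboard, the same file is queryable straight from Python. Here’s a minimal sketch, assuming the duckdb package and the tagActivity schema shown above (the thresholds are the ones from the session):

# Re-run the z-score anomaly check the model used, straight against gn.ddb.
# Assumes the `duckdb` Python package; table/column names match my schema above.
import duckdb

con = duckdb.connect("/Users/me/Data/gn.ddb", read_only=True)

con.sql("""
    WITH tag_stats AS (
        SELECT tag_id, name,
               AVG(unique_ips)    AS avg_ips,
               STDDEV(unique_ips) AS sd_ips
        FROM tagActivity
        GROUP BY tag_id, name
        HAVING MAX(unique_ips) > 200
    )
    SELECT ta.day, ta.name, ta.unique_ips,
           (ta.unique_ips - ts.avg_ips) / NULLIF(ts.sd_ips, 0) AS z_score
    FROM tagActivity ta
    JOIN tag_stats ts ON ta.tag_id = ts.tag_id AND ta.name = ts.name
    WHERE ta.unique_ips > ts.avg_ips + 2 * ts.sd_ips
    ORDER BY z_score DESC
    LIMIT 10
""").show()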

I know lots of folks are reluctant to have these tools touch real data and draw real conclusions. And, that is 100% fair. Just because you’re reducing the content pool for the probabilistic choices does not mean it won’t make 💩 up, but — if you know your domain — these can be tremendously useful tools to augment your regular data analysis.

Bad Bots: Whatcha Gonna Do?


Imperva is one of the decent names in both the “cloud” and “security” spaces. And, like all vendors (including my $WORK), they released a new report right before the annual corporate cyber megacon (RSA). This year, it’s on “bad bots”, and it paints a stark picture of an internet where automated bots, empowered by artificial intelligence, now dominate web traffic and pose escalating risks across the globe.

For the first time in a decade, automated traffic exceeded human activity, making up 51% of all web traffic in 2024. This surge is largely attributed to the proliferation of AI tools and large language models, which have dramatically lowered the technical barriers for attackers. As a result, malicious bots—those designed to commit fraud, steal data, or disrupt services—accounted for 37% of all internet traffic, up sharply from 32% the previous year. “Good” bots, such as search engine crawlers, now represent just 14% of traffic.

This rise in bad bot activity is not just a matter of volume but also of sophistication. Attackers increasingly use advanced tactics to mimic human behavior, evade detection, and exploit vulnerabilities in business logic, especially within APIs. In 2024, 44% of advanced bot traffic targeted APIs, compared to just 10% directed at traditional web applications. These attacks are highly effective because APIs are the backbone of digital transformation, powering everything from payments to analytics, and often lack the same level of scrutiny as user-facing web interfaces. Attackers exploit API logic to automate fraud, scrape sensitive data, and bypass security controls, frequently with devastating financial and reputational consequences for organizations.

Industry analysis reveals shifting patterns in bot targeting. The travel sector overtook retail in 2024 as the most attacked industry, accounting for 27% of all bad bot attacks. Travel and retail both face advanced threats: bots disrupt inventory, manipulate pricing models, and hijack customer accounts. In the travel industry, “seat spinning” bots hold tickets without purchasing, while scraping bots distort look-to-book ratios, undermining revenue management and competitive pricing. Retailers face ongoing threats from scalping, gift card fraud, and price scraping, with bot attacks now occurring year-round rather than just during peak shopping seasons.

The report also details the economic and regulatory consequences of unchecked bot activity. Successful attacks can lead to direct financial losses, regulatory penalties under laws like GDPR and CCPA, legal costs, and long-term reputational harm. For instance, a case study highlights how a global talent agency saw 83% of its web traffic generated by bad bots, skewing marketing analytics and draining advertising budgets until advanced bot protection was deployed.

There’s TONS more info in the report, along with recommendations for mitigating bot threats. It’s a great read, with a pretty modern/bold design (though I could have done without the 🍩).

Terrifying Telemetry


The 2025 “Web Browser Telemetry” report from sizeof.cat reveals how modern browsers communicate with external servers without our knowledge. The findings suggest that privacy concerns are warranted, as nearly all mainstream browsers engage in background network activities.

Researchers tested various browsers by analyzing network requests during fresh launches and navigation to blank pages. Chrome and Edge proved to be the most communicative, immediately sending data to Google and Microsoft servers for updates, safe browsing features, and browser-wielder metrics collection.
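You can replicate a crude version of that methodology at home. A minimal sketch, assuming the third-party psutil package is installed and you’ve just launched a browser to a blank page (“firefox” is only an example process name; adjust for your OS/browser):

# List every remote endpoint a freshly launched browser has dialed.
# A rough approximation of the report's methodology; assumes `pip install psutil`.
import psutil

for proc in psutil.process_iter(["name"]):
    name = (proc.info["name"] or "").lower()
    if "firefox" in name:  # example process name; swap in your browser
        try:
            for conn in proc.connections(kind="inet"):
                if conn.raddr:  # only connections with a remote endpoint
                    print(f"{name} -> {conn.raddr.ip}:{conn.raddr.port}")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            pass  # some platforms need elevated privileges for this

A proper packet capture (the report’s approach) will catch more, since this only shows connections open at the moment you run it.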

Firefox, though often recommended for privacy, still connects to Mozilla servers at startup for telemetry, updates, and security checks. Even after disabling telemetry in settings, some network requests continue for security updates and features like Pocket.

Modified Firefox versions like LibreWolf and Waterfox present a more nuanced situation. These browsers avoid unsolicited connections to Mozilla or analytics servers at startup, supporting their privacy-focused claims. Yet community reports suggest they may still make minimal connections for extension updates or certificate verification, though far fewer than standard Firefox. For folks prioritizing minimal telemetry, these modified versions currently offer the best balance, despite potentially slower security updates due to smaller development teams.

Chromium-based browsers like Brave and Vivaldi, which emphasize privacy features, also maintain some background connections for their own systems. While they block many external trackers, they still communicate with their respective servers for updates and occasionally for metrics collection, though less aggressively than Chrome or Edge.

Advanced privacy tools like Pi-hole or DNS-based blocking provide only partial protection, as some browsers bypass system DNS settings through direct IP connections to telemetry endpoints. This bypassing is particularly common in Windows and Chrome-based browsers that increasingly use hardcoded DNS-over-HTTPS or direct IP addresses to avoid local network controls.

It’s a pretty bleak and detailed report, but the blows are softened if you move to a different tab and watch the site cycle through different <title> strings that make it look like various other online services.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

☮️


Bonus Drop #82 (2025-04-12): Quick Hits

QH: Lit Lit; QH: Beyond Quacking; QH: Granary

This hit kind of hard Friday night (the Trump administration deliberately dropped that nugget after the news cycle), so y’all are getting barely edited notes for a bonus Drop this weekend.

I know I sound like some insane person, but if you are a reader who is in the U.S., it would not be unwise to stock up on shelf-stable/non-perishable items before the 20th. If I’m wrong, the worst-case scenario is that you have extra stuff for a while. If I’m right…

TL;DR

(This is an AI-generated summary of today’s Drop using Ollama + llama 3.2 and a custom prompt.)

For the record, I’ve not changed anything, and it just started to do this “summary of” stuff o_O.

  • Summary of Lit web components (https://lit.dev) – A concise overview of Lit’s core advantages and features, including native browser support, encapsulation via Shadow DOM, and TypeScript-first design.
  • Summary of FlockMTL paper (FlockMTL) – An introduction to FlockMTL, a DuckDB extension integrating LLMs and RAG capabilities into SQL workflows, targeting knowledge-intensive analytical applications.
  • Summary of Granary library (https://granary.io/) – A universal translator for social web data, enabling interoperability between native APIs, scraped HTML, and standardized formats like ActivityStreams v1/v2, microformats2, Atom, RSS, and JSON Feed.

QH: Lit Lit


Longtime readers of the Drop know I 💙 Lit web components. This past week, I came across — quite literally — the absolute best post describing what they are, why they rock, and how best to use them.

In “The Case for Web Components with Lit”, Philipp Kunz covers all these topics concisely, yet thoroughly enough both to make a convincing case for using Lit and to teach an old dog (like me) some new and spiffy tricks for using Lit in the most optimal way possible.

  • Core Web Component advantages: native browser support, encapsulation via Shadow DOM, framework interoperability, scalability
  • Lit library features: reactive properties, declarative templates, performance optimizations, TypeScript-first design
  • TypeScript integration patterns: type-safe properties/events, interface-driven development, enhanced tooling support
  • Component development techniques:
    • Slot-based composition
    • Reactive controllers for state management
    • Lifecycle hooks and DOM querying
    • CSS-in-JS styling with Lit’s css tag
  • Testing strategies using modern web test runners
  • Performance optimization methods:
    • Template caching
    • Efficient update detection
    • Lit’s repeat directive
  • Framework interoperability examples for React/Vue
  • Real-world implementation patterns:
    • Design system components
    • Micro-frontend architectures
  • TypeScript best practices:
    • Strong event/property typing
    • Generic component patterns
    • Custom event definitions
  • Companion tooling showcase (dees-wcctools)
  • Resource references for deeper exploration

QH: Beyond Quacking


I read “Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB” earlier in April, and it’s a perfect item for this weekend’s “Quick Hits” edition since I don’t have a DuckDB-focused edition on the planning calendar for a bit.

The paper features FlockMTL — a DuckDB extension integrating LLMs and RAG capabilities directly into SQL workflows, targeting knowledge-intensive analytical applications. It introduces novel schema objects (MODEL, PROMPT), semantic functions, and optimizations to simplify complex data pipelines combining structured/unstructured data analysis.

These are the notes on FlockMTL core features ripped from my Joplin page on it:

MODEL & PROMPT are first-class resources letting us modify global LLM/ML configs without modifying our actual queries:

CREATE GLOBAL MODEL('model-relevance-check', 'gpt-4o-mini', 'openai');
CREATE PROMPT('joins-prompt', 'is related to join algos given abstract');

FlockMTL comes with new semantic DDB functions:

  • Scalar (llm_filter, llm_complete_json)
  • Aggregate (llm_rerank, llm_first/llm_last)
  • Hybrid search fusion (RRF, CombMNZ, etc.)

FlockMTL introduces several optimizations to enhance LLM-integrated SQL workflows, focusing on efficiency and usability. Here’s a technical breakdown of the key mechanisms:

“Batching” dynamically groups input tuples into optimal batch sizes based on LLM context windows, achieving 7× speedup for chat completions and 48× for embeddings. The system automatically reduces batch sizes by 10% upon context overflow errors and handles edge cases where individual tuples exceed context limits by returning NULL.
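The back-off loop is easy to picture. Here’s a toy sketch of that mechanism as I read it from the paper (my own illustration, not FlockMTL’s actual code; call_llm and ContextOverflowError are stand-ins):

# Toy illustration of the adaptive batching described above -- NOT FlockMTL's
# actual implementation. `call_llm` and ContextOverflowError are stand-ins.
class ContextOverflowError(Exception):
    pass

def run_in_batches(rows, call_llm, batch_size=64):
    i = 0
    while i < len(rows):
        batch = rows[i:i + batch_size]
        try:
            yield from call_llm(batch)
            i += len(batch)
        except ContextOverflowError:
            if len(batch) == 1:
                yield None  # a single row exceeds the context window: emit NULL
                i += 1
            else:
                # back off by 10% and retry, per the paper's described behavior
                batch_size = max(1, int(batch_size * 0.9))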

“Meta-Prompting” implements structured prompt templates that combine user instructions with:

  • Formatting rules
  • Output expectations
  • Serialized tabular inputs

“Execution Plan Controls” extends DuckDB’s query planner with:

  • Batch size visualization/overrides
  • Serialization format selection (XML/JSON/Markdown)
  • Meta-prompt inspection tools

Example Workflow

WITH relevant_papers AS (
  SELECT * FROM research_papers
  WHERE llm_filter(params, prompt, {title, abstract})
)
SELECT
  llm_complete_json(params, prompt, {content}) AS insights,
  llm_rerank(params, prompt, content) AS rankings
FROM relevant_papers

Supports cloud (OpenAI/Azure) and local (Ollama) LLMs.

If you have an Apple Silicon Mac or another system with a compatible GPU, this is a nice way to experiment with RAG in a SQL context.

QH: Granary


Every social media platform uses a different “content object” format, despite all of them (pretty much) having the same overall structure.

Granary (GH) is a spiffy library and REST API designed to fetch and convert social data across various platforms and formats. It acts as a (sort of) universal translator for social web data, enabling interoperability between native APIs, scraped HTML, and standardized formats like ActivityStreams v1/v2, microformats2, Atom, RSS, and JSON Feed. The framework emphasizes open protocols while supporting major platforms including Twitter, Mastodon, Bluesky/AT Protocol, GitHub, and Instagram.

The tool provides two primary interfaces:

  1. A Python library (pip install granary) for programmatic access
  2. A REST API available at granary.io for HTTP-based interactions

Both methods handle OAuth authentication and return data wrapped in ActivityStreams 1.0 JSON with OpenSocial envelope formatting by default. The API supports multiple output formats including ActivityPub-compatible JSON, Atom feeds, and microformats2 HTML.
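For a quick taste of the library side, here’s a minimal sketch, assuming pip install granary. The AS1-to-AS2 conversion below is exactly the kind of translation granary is built for, but double-check the module/function names against the granary docs before relying on them:

# Translate an ActivityStreams 1 object to ActivityStreams 2 with granary.
# A sketch -- verify the API against the granary docs; assumes `pip install granary`.
from granary import as2

as1_note = {
    "objectType": "note",
    "content": "Hello from the social web!",
}

print(as2.from_as1(as1_note))  # ActivityStreams 2 / ActivityPub-flavored JSON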

Recent updates (v8.1 as of March 2025) added enhanced Bluesky integration including hashtag facets, improved content truncation for non-space-delimited languages, and NIP-27 support for Nostr. The project emphasizes privacy – no user data is stored, only transformed in transit.

’Tis a super-niche solution, but it’s also a nice codebase to peruse if you want to get familiar with the various socmed platform data formats.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

☮️
