Stop your Roku from tracking you

My roommate got a Roku TV last year, and since it’s a “smart” TV, I was naturally suspicious. In case you haven’t heard about smart TVs spying on people, there are more than a few examples to choose from. But then I remembered I had a wonderful Synology router, complete with fine-grained segmenting and filtering tools.

That means that not only could I block the TV’s access to certain domains, but I could actually watch which domains it connects to!

Now, I know people have already studied the tracking domains TVs connect to. I know you could block most tracking domains by just dropping a Pi-Hole on your network. In fact, my Synology router already has a preset block list for ads. But I don’t want to bother with a Pi-Hole, and it’s more interesting to investigate this stuff myself! Plus, that way I know more about what’s going on and don’t have to wonder if those block lists are missing anything (as they’re known to do).

Qualitative results

The results were fascinating. First off, I actually didn’t observe as much obvious spying as I expected. None of the domains I observed seemed clearly connected to TCL, the manufacturer. In fact, when you’re just in the Roku menus and not in an app, the only domain it seems to connect to is ravm.tv, which seems to be owned by Roku. Perhaps part of the “Roku TV” agreement they made with TCL involves Roku doing all the data collection too, selling it to TCL on the back end. Luckily, ravm.tv doesn’t seem to be necessary: I’ve blocked it and never noticed anything break except the ads on the Roku home screen. However, roku.com is necessary, so ¯\_(ツ)_/¯

As for the apps themselves, well it’s interesting to see that the tracking domains are often pretty obvious. And plentiful. They usually have things like “ad”, “beacon”, “metrics”, or “pixel” in the name. Many don’t, though. I just used trial and error, blocking different ones and checking if the app still works to see what’s necessary.
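If you want to replicate this kind of blocking without a Synology router, the same effect is one line of DNS configuration on anything that runs dnsmasq underneath (many routers and Pi-Hole setups do). This is an illustrative sketch, not Synology-specific; the file path is an assumption about a typical dnsmasq setup, and ravm.tv is just the example domain from above:

```
# Illustrative dnsmasq rule (the file path /etc/dnsmasq.d/roku-block.conf
# is an assumption; adjust for your own setup).
# Answer ravm.tv and all of its subdomains with 0.0.0.0 so connections fail fast.
address=/ravm.tv/0.0.0.0
```

The `address=/domain/ip` form covers the domain and every subdomain, which is handy since trackers often rotate hostnames under one parent domain.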

The data

After going through this process, I thought it might be useful for others if I posted my findings. I’ll try to update this list as I learn more. These lists are not exhaustive: there may be more domains these apps connect to, necessary or not, that I haven’t discovered or confirmed. But these are the ones I have observed and investigated.

Continue reading

Internet Archive: Please be more careful

The lawsuit the Internet Archive is facing is still pretty concerning for me, and I wanted to put together my thoughts on exactly why.

The lawsuit could have been a death blow

I won’t explain the background here, but you can get the basics from this Vox article. It also explains how people have been saying that this lawsuit threatens the very existence of the Archive, but that’s not exactly true.

However, it seems the only reason it’s not an existential threat to the Archive is restraint on the part of the publishers. I’m not a lawyer, but from the article it seems that if the publishers wanted to, they could push for the full $150,000 in damages for each of the 1.4 million books. Or at least for a few hundred or thousand books, which would be enough to get into the hundreds of millions of dollars in damages. Given that the Archive’s annual operating revenue is only $19 million, it looks like this lawsuit could have bankrupted it.

It’s hard to overstate how important the Internet Archive is

That’s pretty scary for me, because I care about the Internet Archive a lot. The amount of irreplaceable human history it contains is staggering. Its Wayback Machine is its most famous project, and deservedly so. 25% of all webpages are gone within 5 years. If you’ve ever tried to look up a news article from 10 years ago, chances are the site’s response was “Huh? Oh, did you want to look at today’s front page?”. And beyond the web, the Archive is digitizing huge troves of books, records, and all manner of media that are long lost and out of print. And it stores all sorts of special collections, like classic video games and software. As more and more of the world moves to the cloud, we’re facing the prospect of a digital dark age when those sites and servers inevitably go offline.

That’s why I am so thankful that the Archive exists, and that its leadership seems intent on preserving as much human culture as possible, for the long term. There are too many web preservation projects whose structure, funding, or leadership doesn’t give much confidence that they’ll be around for the long haul. WebCite already stopped archiving due to lack of funding, and who knows how long they’ll be able to pay to keep their servers up? But the Internet Archive seems like an organization with the singular focus and guidance to become an institution that preserves history for posterity. That’s why I’ve donated more to it than any other tech organization over the years. It’s why I’ve participated in their hackathons and archiving efforts.

The Archive is on track to become one of the most important repositories of human history. The British Museum contains our ancient history, the Library of Congress has the history of the print age, and the Internet Archive preserves the digital age.*

Please don’t gamble it all

But just because it exists now doesn’t mean that history is safe. It turns out the burning of the Library of Alexandria is a myth, but if something were to happen to the Archive, it would be a reality as terrible as the legend. I want this institution to be around for generations. And for that to happen, we can’t be taking risks like the National Emergency Library.

In all my reading I haven’t come across any legal precedent that makes this move defensible in court. And as far as regular copyright laws go, this seems to be pretty clear infringement. Before this, the Archive seemed to have found a pretty nice compromise where they only lent out a limited number of copies at a time, just like a regular library. No one sued, and it seems to me a pretty legally defensible position. This is more in line with the smart way the Archive usually handles copyright. It’s technically in a legal gray area, having copied tons of other people’s works. But at least in the case of the Wayback Machine they’ve been able to avoid lawsuits by responding to individual takedown requests with a simple “yes”. They bend like a willow in the wind: remove individual items in order to preserve the rest. Normally they’re good at walking the copyright line. Which is why I’m even more baffled by the Emergency Library.

Don’t get me wrong, I’m no fan of current copyright laws. I’m happy for someone to challenge them in novel ways, but please, leave that to others like the EFF. Even if the risk posed by this lawsuit is less than it seems, it’s still too much. If this institution is going to last generations, it can’t start rolling the dice like this every few years. Eventually those odds catch up and you’ll lose. We’ll all lose. An entire period of history. Gone.

* I don’t mean to suggest the British Museum and LoC contain the majority of the history from those respective ages (and I don’t mean to condone how they may have obtained their artifacts). In fact, the Internet Archive is distinguished from the other two in that it probably does contain the majority of the preserved items from its respective age.

Careful when using $RANDOM

I thought I’d put out a PSA about the dangers of using bash’s convenient, built-in source of random numbers: $RANDOM.

No, this isn’t the usual lecture about using a cryptographically secure random number generator. There are lots of situations where you just need a random blob and you’re not worried about malicious attacks. No, this is about why, even in those situations, you need to consider whether $RANDOM is random enough.

For instance, I was using it to generate unique filenames in a bash loop. I just wanted to generate filenames without worrying about collisions.

However, I overestimated the entropy provided by $RANDOM and underestimated the birthday paradox. I know $RANDOM only gives you a number from 0-32767 (15 bits of entropy), and I know about the birthday paradox, but it’s surprising what the combination of those two can result in.

I was only generating 45 filenames, but I actually encountered a collision. Only 45 numbers from 0-32767 and two are the same? How?!

Well, it’s more likely than you think. Specifically, it’s 3% likely.* Still rare, but likely enough to be plausible that I encountered it by chance.
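That 3% follows directly from the birthday-paradox product: multiply together the odds that each successive draw avoids all the earlier ones. A quick sanity check (the 45 and 32768 come from above; the script itself is just my arithmetic, not anything from bash):

```python
from math import prod

def collision_probability(draws: int, space: int = 32768) -> float:
    """Chance that `draws` independent picks from `space` values contain a repeat."""
    # P(all distinct) = (space/space) * ((space-1)/space) * ... for each draw
    p_all_distinct = prod((space - i) / space for i in range(draws))
    return 1 - p_all_distinct

print(f"{collision_probability(45):.1%}")  # → 3.0%
```

So with only 45 draws from a 32768-value space, a collision is already a roughly 1-in-33 event per run.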

Continue reading

A nice way to deal with query strings in Python

TL;DR: I accidentally wrote an argparse for web framework views. You can get it here.

What?

Do you care about your query strings? Do you like them to look nice? Do you find yourself repeatedly writing code to validate parameter values?

Well, I recently got annoyed with repeatedly solving those problems for myself in my Django-based site and wrote a nice solution that I thought others might find useful.

Why?

First, let me lay out the problems I wanted to solve:

  1. I’d like to omit parameters that are already the default.
  2. I’d like to have the parameters in a preferred order.
  3. I kept having to write code to cast GET parameter values into a certain type and check that they’re valid.
  4. My code didn’t make it very clear what the potential parameters were for each view, what their types were, and what their defaults were.

To expand on #1: my webapps often use query strings. But it’s true that query strings are ugly and I’d like beautiful urls. So when possible, I’d like to minimize the query strings by omitting default parameters. Most of the time, my apps need few or no non-default parameters, so this point is significant.
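To make problems #1–#4 concrete, here’s a rough sketch of the kind of helper I mean. The names and API below are hypothetical (they are not the actual library’s interface); the point is just that types, defaults, and preferred order get declared once, argparse-style:

```python
from urllib.parse import urlencode

# Hypothetical sketch: declare each view's parameters once, argparse-style.
# None of these names come from the real library; they only illustrate the idea.
class QueryParams:
    def __init__(self, **params):
        # params maps name -> (type, default), in the preferred URL order
        self.params = params

    def parse(self, get_dict):
        """Cast raw GET values to their declared types, falling back to defaults."""
        values = {}
        for name, (cast, default) in self.params.items():
            raw = get_dict.get(name)
            if raw is None:
                values[name] = default
            else:
                try:
                    values[name] = cast(raw)
                except ValueError:
                    values[name] = default  # or raise a 400, depending on taste
        return values

    def encode(self, **values):
        """Build a query string, omitting defaults and keeping declaration order."""
        pairs = [(name, values[name])
                 for name, (_, default) in self.params.items()
                 if name in values and values[name] != default]
        return urlencode(pairs)

PARAMS = QueryParams(page=(int, 1), sort=(str, "date"))
print(PARAMS.parse({"page": "2"}))         # → {'page': 2, 'sort': 'date'}
print(PARAMS.encode(page=2, sort="date"))  # → page=2
```

With a declaration like `PARAMS`, a view can parse `request.GET` in one call, and every link it builds automatically omits defaults and keeps parameters in a fixed, readable order.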

Continue reading

Save webpages in the Internet Archive

You know the Internet Archive’s Wayback Machine? Well, you should. The Archive has been saving webpages since 1996 (think Google’s cached pages, but better). You can search for any url and, if it’s a hit, see a snapshot of it in the past, often at several different times.

But there are lots of pages that they miss with their automated crawler. They do have manual help from librarians, but they can only do so much. Which is why it’s awesome that now they’ve provided a way for you to manually save pages! Now, if you bookmark a url but you’re worried it’ll be dead when you try to access it in a year or two, you can get peace of mind by using the “Save Page Now” feature on their homepage. Just like that, it’s archived for all posterity. (And really, don’t worry. They’re serious about sticking around. They even built a backup at the Library of Alexandria.)

But it’s a little inconvenient to keep going to their homepage and pasting in urls. Luckily, it’s simple to make a bookmarklet to automate it. Just bookmark this (in the “url” field of the bookmark), and click on it when you’re on a page you want to archive:

javascript:(function(){window.open('https://web.archive.org/web/*/'+document.location.href,'_blank')})()

First, it’ll search for any existing entries, and if you don’t find any, you’ll be able to save it yourself. Now go, and contribute to the preservation of culture!