Stop your Roku from tracking you

My roommate got a Roku TV last year, and since it’s a “smart” TV, I was naturally suspicious. In case you haven’t heard about smart TVs spying on people, there are more than a few examples to choose from. But then I remembered I had a wonderful Synology router, complete with fine-grained segmenting and filtering tools.

That means that not only could I block the TV’s access to certain domains, but I could actually watch which domains it connects to!

Now, I know people have already studied the tracking domains TVs connect to. I know you could block most tracking domains by just dropping a Pi-Hole on your network. In fact, my Synology router already has a preset block list for ads. But I don’t want to bother with a Pi-Hole, and it’s more interesting to investigate this stuff myself! Plus, that way I know more about what’s going on and don’t have to wonder if those block lists are missing anything (as they’re known to do).

Qualitative results

The results were fascinating. First off, I actually didn’t observe as much obvious spying as I expected. None of the domains I observed seemed clearly connected to TCL, the manufacturer. In fact, when you’re just in the Roku menus and not in an app, the only domain it seems to connect to is ravm.tv, which seems to be owned by Roku. Perhaps part of the “Roku TV” agreement they made with TCL involves Roku doing all the data collection too, selling it to TCL on the back end. Luckily, ravm.tv isn’t necessary. I’ve blocked it and never noticed anything broken except the ads on the Roku home screen. However, roku.com is necessary, so ¯_(ツ)_/¯

As for the apps themselves, well it’s interesting to see that the tracking domains are often pretty obvious. And plentiful. They usually have things like “ad”, “beacon”, “metrics”, or “pixel” in the name. Many don’t, though. I just used trial and error, blocking different ones and checking if the app still works to see what’s necessary.
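If you want to pre-sort a pile of observed domains before the trial and error, here’s a minimal sketch of the keyword idea in JavaScript. The keywords come from the names mentioned above (plus the plural “ads”, which is my addition); the function name and the example hostnames are made up for illustration. A match is only a hint that a domain is worth testing, not proof that it’s a tracker.

```javascript
// Keywords that tend to show up in tracking hostnames ("ads" is an added assumption).
const TRACKING_KEYWORDS = ["ad", "ads", "beacon", "metrics", "pixel"];

// Hypothetical helper: true if any dot/hyphen/underscore-separated part of the
// hostname is exactly one of the keywords. Matching whole parts avoids flagging
// names like "download.example.org" just because they contain the letters "ad".
function looksLikeTracker(hostname) {
  const parts = hostname.toLowerCase().split(/[.\-_]/);
  return parts.some((part) => TRACKING_KEYWORDS.includes(part));
}

// Made-up hostnames standing in for what a router log might show.
const observed = [
  "ads.example.com",
  "api.example.com",
  "pixel-events.example.net",
  "download.example.org",
];

const suspects = observed.filter(looksLikeTracker);
console.log(suspects); // logs the "ads." and "pixel-events." hostnames
```

Anything this flags still needs the block-and-check test, and plenty of trackers won’t have a telltale name at all.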

The data

After going through this process, I thought it might be useful for others if I post my findings. I’ll try to update this list as I learn more. These lists are not exhaustive. There could be more domains these apps connect to that are actually necessary or unnecessary that I haven’t discovered or confirmed. But these are ones I have observed and investigated.


Internet Archive: Please be more careful

The lawsuit the Internet Archive is facing is still pretty concerning for me, and I wanted to put together my thoughts on exactly why.

The lawsuit could have been a death blow

I won’t explain the background here, but you can get the basics from this Vox article. It also explains how people have been saying that this lawsuit threatens the very existence of the Archive, but that’s not exactly true.

However, it seems the only reason it’s not an existential threat to the Archive is restraint on the part of the publishers. I’m not a lawyer, but from the article it seems that if the publishers wanted to, they could push for the full $150,000 in damages for each of the 1.4 million books. Or at least a few hundred or thousand books, which would be enough to get into the tens or hundreds of millions of dollars in damages. Given that the Archive’s annual operating revenue is only $19 million, it looks like this lawsuit could have bankrupted it.

It’s hard to overstate how important the Internet Archive is

That’s pretty scary for me, because I care about the Internet Archive a lot. The amount of irreplaceable human history it contains is staggering. Its Wayback Machine is its most famous project, and deservedly so. 25% of all webpages are gone within 5 years. If you’ve ever tried to look up a news article from 10 years ago, chances are the site’s response was “Huh? Oh, did you want to look at today’s front page?”. And beyond the web, the Archive is digitizing huge troves of books, records, and all manner of media that are long lost and out of print. And it stores all sorts of special collections, like classic video games and software. As more and more of the world moves to the cloud, we’re facing the prospect of a digital dark age when those sites and servers inevitably go offline.

That’s why I am so thankful that the Archive exists, and that its leadership seems intent on preserving as much human culture as possible, for the long term. There are too many web preservation projects whose structure, funding, or leadership doesn’t give much confidence that they’ll be around for the long haul. WebCite already stopped archiving due to lack of funding, and who knows how long they’ll be able to pay to keep their servers up? But the Internet Archive seems like an organization with the singular focus and guidance to become an institution that preserves history for posterity. That’s why I’ve donated more to it than any other tech organization over the years. It’s why I’ve participated in their hackathons and archiving efforts.

The Archive is on track to become one of the most important repositories of human history. The British Museum contains our ancient history, the Library of Congress has the history of the print age, and the Internet Archive preserves the digital age.*

Please don’t gamble it all

But just because it exists now doesn’t mean that history is safe. It turns out the burning of the Library of Alexandria is a myth, but if something were to happen to the Archive, it would be a reality as terrible as the legend. I want this institution to be around for generations. And for that to happen, we can’t be taking risks like the National Emergency Library.

In all my reading I haven’t come across any legal precedent that makes this move defensible in court. And as far as regular copyright laws go, this seems to be pretty clear infringement. Before this, the Archive seemed to have found a pretty nice compromise where they only lent out a limited number of copies at a time, just like a regular library. No one sued, and it seems to me a pretty legally defensible position. This is more in line with the smart way the Archive usually handles copyright. It’s technically in a legal gray area, having copied tons of other people’s works. But at least in the case of the Wayback Machine they’ve been able to avoid lawsuits by responding to individual takedown requests with a simple “yes”. They bend like a willow in the wind: remove individual items in order to preserve the rest. Normally they’re good at walking the copyright line. Which is why I’m even more baffled by the Emergency Library.

Don’t get me wrong, I’m no fan of current copyright laws. I’m happy for someone to challenge them in novel ways, but please, leave that to others like the EFF. Even if the risk posed by this lawsuit is less than it seems, it’s still too much. If this institution is going to last generations, it can’t start rolling the dice like this every few years. Eventually those odds catch up and you’ll lose. We’ll all lose. An entire period of history. Gone.

* I don’t mean to suggest the British Museum and LoC contain the majority of the history from those respective ages (and I don’t mean to condone how they may have obtained their artifacts). In fact, the Internet Archive is distinguished from the other two in that it probably does contain the majority of the preserved items from its respective age.

Save webpages in the Internet Archive

You know the Internet Archive’s Wayback Machine? Well you should. The Archive has been saving webpages since 1996 (think Google’s cached pages, but better). You can search for any URL and, if it’s a hit, see a snapshot of it in the past, often at several different times.

But there are lots of pages that they miss with their automated crawler. They do have manual help from librarians, but they can only do so much. Which is why it’s awesome that now they’ve provided a way for you to manually save pages! Now, if you bookmark a url but you’re worried it’ll be dead when you try to access it in a year or two, you can get peace of mind by using the “Save Page Now” feature on their homepage. Just like that, it’s archived for all posterity. (And really, don’t worry. They’re serious about sticking around. They even built a backup at the Library of Alexandria.)

But it’s a little inconvenient to keep going to their homepage and pasting in urls. Luckily, it’s simple to make a bookmarklet to automate it. Just bookmark this (in the “url” field of the bookmark), and click on it when you’re on a page you want to archive:

javascript:(function(){window.open('https://web.archive.org/web/*/'+document.location.href,'_blank')})()

First, it’ll search for any existing entries, and if there aren’t any, you’ll be able to save the page yourself. Now go, and contribute to the preservation of culture!
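If you’d rather skip the search step, here’s a rough sketch of a variant that asks for a capture directly. It assumes the Archive still accepts “Save Page Now” requests at https://web.archive.org/save/ followed by the page URL, which may change; the helper name is made up for illustration.

```javascript
// Hypothetical helper: builds both Wayback Machine URLs for a page.
// "history" is the same search URL the bookmarklet above opens;
// "save" assumes the /save/ prefix triggers a new capture.
function waybackUrls(pageUrl) {
  return {
    history: "https://web.archive.org/web/*/" + pageUrl,
    save: "https://web.archive.org/save/" + pageUrl,
  };
}

// Bookmarklet form of the "save" variant (one line, in the bookmark's URL field):
// javascript:(function(){window.open('https://web.archive.org/save/'+document.location.href,'_blank')})()

console.log(waybackUrls("https://example.com/post").save);
```

The search version is still the safer default, since it shows you whether a snapshot already exists before you ask for another one.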