So, something funny happened the other day. A friend searched my name on Google and, to my surprise, pictures of me popped up. Not just any pictures, but ones I had deleted from LinkedIn ages ago. I thought I had erased those awkward photos for good, but there they were, in all their cringy glory. How? Why? What is this sorcery?
Turns out, Google has this thing called a web crawler. It's like a little internet robot that zooms around the web, checking out new pages and saving copies of them. These copies are called cached versions. So even if you delete something, Google might still have a backup of it. Thanks, Google. Really appreciate it.
Anyway, this got me thinking. If Google can cache pages from LinkedIn, maybe it does the same for other websites. You know, the ones with annoying paywalls? Most websites want to show up in Google search results, so they let the web crawler in. And bingo! The crawler saves a version of the page that isn't paywalled.
Here’s where it gets interesting. I realized I could access these cached versions and read the full articles without paying a dime. Just a little trick I stumbled upon in my quest to find my lost photos.
But there's a catch. As soon as the cached page loads, it quickly updates and throws the paywall back in my face. Talk about frustrating! But I found a way around that too. All I had to do was disable JavaScript. Sounds technical, but it's super easy. Just go to your browser settings, find the JavaScript option, and turn it off.
And voila! The paywalled page stays open, letting me read all the juicy details. It's like finding a hidden treasure chest on the internet.
However, disabling JavaScript in my browser was only a temporary fix. I soon ran into a few more hurdles:
Scripts Re-injecting the Paywall
Even after removing the scripts on the client side, new scripts could be dynamically loaded, re-triggering the paywall. This was the main headache. Imagine thinking you've outsmarted the system, only to have the paywall pop up again. Ugh!
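To see why this happens, here's a minimal sketch of the kind of thing a page can do. The script URL is made up, but the pattern is just standard dynamic script loading:

```javascript
// Sketch: how a page can re-inject a script after the originals are removed.
// The URL is hypothetical; real paywalls use their own endpoints.
const s = document.createElement('script');
s.src = 'https://paywalled-site.example/paywall.js';
document.head.appendChild(s); // the paywall logic runs again as soon as this loads
```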
CORS Issues
When I tried to fetch resources directly in the browser, Cross-Origin Resource Sharing (CORS) policies kicked in, blocking my access. Websites are pretty strict about where their content can be loaded, and this was a major roadblock (one would think they don't want me to access content you have to pay for).
When you try to fetch resources directly in the browser from a different origin (domain), the browser's same-origin policy kicks in. This policy restricts how resources are loaded from different domains to enhance security. If a website doesn't explicitly allow your domain through CORS headers, your browser will block the request.
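Here's roughly what that looks like from the browser console; the article URL is a placeholder, not a real site:

```javascript
// Sketch: a cross-origin fetch from the browser console.
// If the site doesn't send an Access-Control-Allow-Origin header that
// matches our origin, the browser refuses to hand over the response.
fetch('https://paywalled-site.example/article')
  .then((res) => res.text())
  .then((html) => console.log('Got', html.length, 'characters'))
  .catch((err) => console.error('Blocked before we ever saw the body:', err));
```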
So, what did I do? I decided to bring in the big guns: a backend server. (Evil laughs.)
**The code is available on my GitHub here.**
How it works
Fetching Cached Pages
I set up an Express server that fetches the cached version of any URL I pass to it.
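Here's a minimal sketch of that server. I'm assuming Node 18+ (for the built-in fetch); the route name and port are my own choices, and the script stripping gets added in the next step:

```javascript
const express = require('express');
const app = express();

app.get('/read', async (req, res) => {
  const url = req.query.url;
  if (!url) return res.status(400).send('Pass the article as ?url=...');

  // Google serves cached copies from webcache.googleusercontent.com.
  const cacheUrl =
    'https://webcache.googleusercontent.com/search?q=cache:' +
    encodeURIComponent(url);

  try {
    const response = await fetch(cacheUrl);
    const html = await response.text();
    res.send(html); // raw for now; scripts get stripped in the next step
  } catch (err) {
    res.status(502).send('Could not fetch the cached page');
  }
});

app.listen(3000, () => console.log('Listening on http://localhost:3000'));
```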
Stripping Out Scripts
Using the Cheerio library, I remove all the script tags from the cached page. This stops the paywall scripts from re-injecting themselves.
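The stripping step only takes a few lines with Cheerio; a sketch (the function name is my own):

```javascript
const cheerio = require('cheerio');

// Remove every <script> tag so nothing can re-trigger the paywall,
// then return the cleaned-up markup.
function stripScripts(html) {
  const $ = cheerio.load(html);
  $('script').remove();
  return $.html();
}
```

In the server sketch above, the last step would then become `res.send(stripScripts(html))` instead of sending the raw HTML.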
Bypassing CORS
By using a backend server, I make the request to Google's cached page from the server, not the browser. Unlike a browser, the server isn't subject to the same-origin policy, so it can fetch resources from any domain without being blocked. Once the server fetches the page, it processes the content (removing the script tags in this case) and sends the cleaned-up content back to the browser.
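Using it is then just a matter of pointing the browser at the proxy; a top-level navigation like this never runs into CORS at all (the article URL is a placeholder):

```javascript
// Sketch: navigate straight to the proxy, which returns the cleaned page.
const article = 'https://paywalled-site.example/article';
window.location.href =
  'http://localhost:3000/read?url=' + encodeURIComponent(article);
```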
With this setup, I can read those coveted articles without worrying about paywalls or script shenanigans. It’s like having a secret key to unlock all the content I want.
So, if you’re tired of hitting paywalls and have a bit of tech savvy (or a friend who does), give this method a try. It might just open up a whole new world of information for you. Happy reading!