The Wayback Machine

In an ever-evolving digital world, websites come and go, often taking valuable information with them. The Wayback Machine, a digital archive of web pages created by the Internet Archive, seeks to preserve these moments in time so that future generations can get a glimpse of the past.

Regularly used by journalists, researchers, and even web developers (to see the history of pages), the Wayback Machine has become an increasingly valuable tool for time traveling through the internet.

The Wayback Machine works by crawling websites, following links, and saving the content the crawler finds. The site reports having more than 805 billion web pages archived.

The data archived by the Wayback Machine can be very useful for cyber security professionals, black hat hackers, and Open Source Intelligence (OSINT) gatherers. As a developer, you should be aware of several potential dangers of your website being archived:

Comments in the code revealing sensitive data could expose your company to attack
Contact information on the site could be used in phishing attacks
Blog posts or other public content could reveal sensitive information about your organization's policies and procedures
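
To make the first two risks concrete, here is a minimal sketch of how you might audit your own page for HTML comments and email addresses before a crawler archives it. The names ExposureAudit and audit_html are hypothetical, and the email pattern is a deliberately loose assumption rather than a full address validator.

```python
import re
from html.parser import HTMLParser

# Loose, illustrative pattern; not a complete email address validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ExposureAudit(HTMLParser):
    """Collect HTML comments and email addresses found in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.emails = set()

    def handle_comment(self, data):
        # Called once per <!-- ... --> block with the text between the markers.
        self.comments.append(data.strip())

    def handle_data(self, data):
        # Visible text between tags.
        self.emails.update(EMAIL_PATTERN.findall(data))

    def handle_starttag(self, tag, attrs):
        # Attribute values such as href="mailto:..." can also leak addresses.
        for _name, value in attrs:
            if value:
                self.emails.update(EMAIL_PATTERN.findall(value))


def audit_html(html):
    """Return (comments, sorted email addresses) found in an HTML string."""
    parser = ExposureAudit()
    parser.feed(html)
    parser.close()
    return parser.comments, sorted(parser.emails)
```

Calling audit_html on the HTML of a page you own returns everything an archive snapshot would also capture in those two categories, which gives you a short list to review before publishing.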

Even though the Wayback Machine does present a security challenge, the Internet Archive's important mission of preserving the web for future generations benefits everyone. It's important to be aware that anything posted to the internet is probably archived somewhere. Cyber security specialists who understand tools like the Wayback Machine can help educate their communities, employers, and clients about the importance of not sharing potentially compromising information publicly on the web.

How can developers utilize the Wayback Machine?

If you work with clients or are on a dysfunctional team, you may be familiar with stories like this: "Our website is down because our hosting server got deleted and we don't have backups!" This issue could be caused by someone not paying the bill, a malicious threat actor, or an incompetent employee. Either way, it's your responsibility to fix it.

Without a live website to restore or access to the actual data on the servers, your options are limited. You might never have even visited the website you're trying to recover. This is where Wayback comes in. Using the Wayback Machine, you can look back in time and see archived versions of the site, hopefully enabling you to rebuild it. It's not a great resolution, but without Wayback you would have had to start 100% from scratch.
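
As a starting point for that kind of recovery, the sketch below looks up the archived snapshot closest to a given date using the Wayback Machine's availability endpoint at https://archive.org/wayback/available. The function names are hypothetical, and the response fields it reads (archived_snapshots, closest, available, url, timestamp) are assumed to match what that endpoint returns at the time of writing, so check the current API documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site, timestamp=None):
    """Build the query URL for the closest archived snapshot of `site`."""
    params = {"url": site}
    if timestamp:
        # Assumed format: YYYYMMDD, optionally extended to YYYYMMDDhhmmss.
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return (snapshot_url, timestamp) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def fetch_closest_snapshot(site, timestamp=None):
    """Query the endpoint over the network and return the closest snapshot."""
    url = build_availability_url(site, timestamp)
    with urllib.request.urlopen(url, timeout=10) as response:
        return closest_snapshot(json.load(response))
```

For example, fetch_closest_snapshot("example.com", "20200101") would ask for the capture nearest to January 1, 2020, and return None if nothing has been archived. Keeping the URL building and response parsing separate from the network call makes both easy to test offline.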

Another common use case for Wayback is Always Online tools. Cloudflare, in particular, offers an Always Online feature powered by the Wayback Machine that its customers can opt into. When your site's server is offline or otherwise unreachable, an archived version of your site is shown to your users with a small banner letting them know they are viewing an archived version. This feature is invaluable during downtime. This is all thanks to the efforts of the Internet Archive and the contributors to the Wayback Machine.

How can I protect myself from the potential security vulnerabilities associated with the Wayback Machine?

Wayback in and of itself isn't a malicious or otherwise harmful service. In fact, as the previous section discussed, Wayback can be valuable to developers and IT professionals. The issues arise when site owners carelessly expose sensitive information that could be used by malicious threat actors to build out a target profile or as clues to conduct better-informed attacks.

Information security is paramount when it comes to the internet because any number of actors are constantly scraping the internet and saving what they find. Search engines like Google do this to provide better search results, the Wayback Machine does this to compile an internet library, and well-funded malicious actors such as nation-states and organized crime groups may be scraping the internet to gather intelligence and leverage. For these reasons, remember that anything you post on the internet might as well be published on the front page of every newspaper in the world and memorialized in lights on every billboard from Times Square in New York to Moscow, Russia, and everywhere in between.

What policies can my organization implement to minimize the threat posed by scraped archives?

Every organization's information security policies should stress the importance of keeping confidential data confidential. A few helpful ideas include:

Utilize secret scanning tools like GitHub's secret scanning with push protection: https://docs.github.com/en/code-security/secret-scanning/about-secret-scanning
Don't publicly expose internal documentation and resources/tooling.
Require peer code reviews and ensure that secure programming practices are followed.
Avoid disclosure of confidential information or specific technologies in job postings. Rather than saying "we use WordPress version 2.0 and MySQL," say "we use CMSs like WordPress, Sanity, Drupal, and Contentful along with database technologies like MongoDB, MySQL, and PostgreSQL."
Never put personal employee contact details on your website.
Restrict employees from disclosing confidential information about the internals of your systems in interviews, podcasts, engineering articles, etc.
Utilize a tool or build step that removes comments from your code. It's a great practice to write descriptive comments in your code, but it's not a great practice to share those comments with an attacker attempting to reverse engineer your site.
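
As an illustration of the last idea, here is a minimal sketch of a comment-stripping step for HTML output. The function name strip_html_comments is hypothetical, and the regular expression is a simplification: it does not special-case conditional comments or comment-like text inside scripts and styles, so a real project would more likely rely on an established minifier in its build pipeline.

```python
import re

# Matches <!-- ... --> including comments that span multiple lines.
# The non-greedy ".*?" stops at the first closing "-->".
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Return `html` with every <!-- ... --> block removed."""
    return HTML_COMMENT.sub("", html)
```

Running this over generated pages just before deployment keeps the descriptive comments in your source files while leaving them out of what visitors, and crawlers, actually receive.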