Last updated: March 30, 2025
We have successfully restored the primary domain (www.cdc.gov) of the site. With this success, we plan to launch publicly. There are several parts of the site that are still works-in-progress.
- All subdomains (e.g., XYZ.cdc.gov) are under reconstruction. This is taking longer because the archiving mechanism for the primary domain is not available for the subdomains. We are a lot closer on this than ever before. We aim to restore subdomains before the end of April 2025.
- Some links are broken as a result of the archival crawl. We hope to capture and repair these through our bug tracking system. Patching process: COMPLETED (actual patches to known errors will happen soon).
- We have downloaded all of the datasets, but not necessarily their pages. Offering those data for individual download was on our to-do list. Status: COMPLETED
Progress
- March 3, 2025 - the primary domain and pages are launched!
- March 5, 2025 - the data subdomain consisting of datasets is configured to permit file downloads
- March 17, 2025 - we released a comparison tool that shows how content changed between the restored version and the live CDC version of the same page
- March 29, 2025 - we established a way of patching the web crawls without touching the original LevelDB contents; the initial test restored a broken image by adding it to a patch LevelDB container
Our Process
Primary www subdomain
GitHub repository: https://github.com/RestoredCDC/CDC_zim_mirror
- Download the archived zim file from archive.org
- Convert its contents to a LevelDB database (~95GB).
- Set up a Python Flask server with adjustments to add the banner and break logo image links.
Other subdomains
GitHub repository: https://github.com/RestoredCDC/Restore-CDC-WARC
- Identify the Wayback Machine WARC file.
- Extract its contents.
- Process the HTML to add the banner and break logo image links.
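The HTML processing step (adding the banner and breaking logo image links) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from either repository. The banner markup, the function names, and the rule that a logo is any image whose src contains "logo" are all assumptions made for the example.

```python
import re

# Hypothetical banner markup; the real banner text and styling are not given here.
BANNER_HTML = '<div class="restored-banner">This is an archived copy, not an official CDC page.</div>'

# Assumption: a logo image is any <img> whose src attribute contains "logo".
LOGO_IMG = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])([^"\']*logo[^"\']*)\2', re.IGNORECASE)
BODY_OPEN = re.compile(r'<body\b[^>]*>', re.IGNORECASE)


def break_logo_links(html: str) -> str:
    """Blank the src of matching <img> tags so the logo no longer loads."""
    return LOGO_IMG.sub(lambda m: f"{m.group(1)}{m.group(2)}{m.group(2)}", html)


def add_banner(html: str) -> str:
    """Insert the banner right after the opening <body> tag, or prepend it if there is none."""
    match = BODY_OPEN.search(html)
    if match is None:
        return BANNER_HTML + html
    return html[:match.end()] + BANNER_HTML + html[match.end():]


def process_page(html: str) -> str:
    """Apply both adjustments to one page of archived HTML."""
    return add_banner(break_logo_links(html))
```

A regular expression keeps the sketch short; a real pipeline would more likely use an HTML parser, since regex matching can miss unquoted attributes or unusual markup.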
Patching
GitHub repository: https://github.com/RestoredCDC/Patch_LevelDB
Since our underlying infrastructure is LevelDB, we needed a non-destructive way of patching in order to comply with transparency and archival standards. This library establishes a separate patch LevelDB that holds any additions, replacements, or modifications, with logging. When serving content, the server first looks in the patch LevelDB and then falls back to the base LevelDB.
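The lookup order described above can be illustrated with a small overlay sketch. Plain Python dicts stand in for the two LevelDB databases so the example runs without extra dependencies. The class and method names are hypothetical and are not the Patch_LevelDB API.

```python
import logging
from typing import Mapping, MutableMapping, Optional

logger = logging.getLogger("patch_overlay")


class PatchedStore:
    """Read-through overlay: patches shadow the base store, which is never written to."""

    def __init__(self, base: Mapping[bytes, bytes], patch: MutableMapping[bytes, bytes]):
        self.base = base    # stand-in for the original crawl database (read-only here)
        self.patch = patch  # stand-in for the separate patch database

    def put_patch(self, key: bytes, value: bytes, reason: str) -> None:
        """Record an addition or replacement in the patch store only, and log it."""
        action = "replace" if key in self.base or key in self.patch else "add"
        self.patch[key] = value
        logger.info("%s %r: %s", action, key, reason)

    def get(self, key: bytes) -> Optional[bytes]:
        """Serve from the patch store first, then fall back to the base store."""
        if key in self.patch:
            return self.patch[key]
        return self.base.get(key)
```

Because every write goes to the patch store, the original crawl stays byte-for-byte intact, and the log gives a record of what was changed and why.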
Data
- Download the torrent file from archive.org.
- Use a torrent client and, where possible, provide seeding for other downloads.
- Provide the data back to the public in its current format: a directory with subfolders that group related data.