We have successfully restored the primary domain (www.cdc.gov) of the site. With this success, we plan to launch publicly. Several parts of the site are still works in progress:
- All subdomains (e.g., XYZ.cdc.gov) are under reconstruction. This is taking longer because the archiving mechanism for the primary domain is not available for the subdomains.
- Some links are broken as a result of the archival crawl. We hope to capture and repair these through our bug tracking system.
- We have all of the datasets downloaded, but not necessarily their pages. Offering those data for individual download is on our to-do list.
Our Process
Primary www subdomain
The GitHub repository for this code is to be shared by March 10, 2025.
- Download the archived ZIM file from archive.org.
- Convert its contents to a LevelDB database (~95GB).
- Set up a Flask (Python) server, with adjustments to add the banner and break logo image links.
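The serving step can be sketched roughly as follows. This is not our server code: to stay self-contained it uses a plain WSGI callable from the Python standard library in place of Flask, and an ordinary dict in place of the LevelDB database. The key layout (URL path as a bytes key, raw page bytes as the value) and the names `make_app` and `store` are assumptions made for illustration only.

```python
import mimetypes


def make_app(store):
    """Build a WSGI app that serves archived pages from a dict-like store.

    `store` maps URL paths (bytes) to raw response bodies (bytes). It is a
    hypothetical stand-in for the LevelDB database; the real key layout may differ.
    """

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/") or "/"
        # Assumption: directory-style URLs are stored under index.html.
        if path.endswith("/"):
            path += "index.html"

        body = store.get(path.encode("utf-8"))
        if body is None:
            start_response(
                "404 Not Found",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Not found in archive"]

        # Guess the content type from the file extension in the path.
        content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
        start_response(
            "200 OK",
            [("Content-Type", content_type), ("Content-Length", str(len(body)))],
        )
        return [body]

    return app
```

For a quick local check, an app built this way could be passed to `wsgiref.simple_server.make_server`. In a Flask server the same lookup would sit inside a catch-all route instead.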
Other subdomains
- Identify the Wayback Machine WARC file.
- Extract its contents.
- Process the HTML to add the banner and break logo image links.
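The HTML processing step might look something like the sketch below. It is a simplified illustration, not the code we run: the banner wording, the `archive-banner` class name, and the rule that a logo is any `<img>` whose `src` contains "logo" are all assumptions. It uses regular expressions for brevity, so a real HTML parser would be more robust on unusual markup.

```python
import re

# Hypothetical banner markup; the wording used on the restored site may differ.
BANNER_HTML = (
    '<div class="archive-banner">'
    "This is an archived copy, not an official site."
    "</div>"
)

_BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
# Assumption: logo images are identified by "logo" appearing in the src value.
_LOGO_IMG = re.compile(
    r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])([^"\']*logo[^"\']*)\2',
    re.IGNORECASE,
)


def add_banner(html, banner=BANNER_HTML):
    """Insert the banner right after the opening <body> tag.

    If no <body> tag is found, the banner is prepended to the document.
    """
    match = _BODY_OPEN.search(html)
    if match is None:
        return banner + html
    return html[: match.end()] + banner + html[match.end():]


def break_logo_links(html):
    """Blank out the src attribute of <img> tags that point at a logo file."""
    return _LOGO_IMG.sub(lambda m: m.group(1) + m.group(2) + m.group(2), html)


def process_page(html):
    """Apply both adjustments to one extracted HTML page."""
    return break_logo_links(add_banner(html))
```

Calling `process_page` on each extracted page would leave other images untouched and only empty the `src` of those matching the logo rule.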
Data
- Download the torrent file from archive.org.
- Use a torrent client and, where possible, provide seeding for other downloads.
- Provide the data back to the public? <- we are here