A year in review for reddit
Introduction
> 01/18/2020
This is my latest project, and probably the first long-term project I've pursued in a while. This blog post will walk through the high-level design and tech-stack details of how it was built, and serve as my own documentation if I decide to do this again for 2021.
"A year in review for reddit" is exactly what it sounds like. I'll be building a web app that showcases a full year of reddit activity day by day. Starting January 1st 2021 you'll be able to open https://reddityearinreview.com and see the top posts on reddit for any individual day of 2020. By the end of any given year I've usually forgotten everything that happened between January and November. I'm hoping this "year in review" site will help give us a better idea of the events that took place in 2020. Everything from news headlines to the hottest meme templates from 2020 will be found on the site.
To make this project work I need to collect some data. Reddit's API doesn't have any method of querying post or subreddit data for a particular day, so I had to handle this part on my own. I created an AWS Lambda function that reads the top posts and trending subreddits at the time it is invoked. Pairing this with a CloudWatch rule that runs the Lambda function on a schedule, we have ourselves enough data to build this! We just have to wait a full year...
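For the curious, here's a rough sketch of the collector as a Node/TypeScript Lambda. This is illustrative rather than the exact code: the endpoints are reddit's public JSON listings, and the persist() stub stands in for the storage layer, which gets its own discussion below.

```typescript
// Sketch of the collector Lambda (assumes the Node 18+ runtime, where
// fetch is built in). reddit serves these listings as plain JSON.
const USER_AGENT = "year-in-review-collector/0.1"; // reddit wants a custom UA

async function fetchJson(url: string): Promise<unknown> {
  const res = await fetch(url, { headers: { "User-Agent": USER_AGENT } });
  if (!res.ok) throw new Error(`${url} responded with ${res.status}`);
  return res.json();
}

// Stub: the raw payloads initially landed in Dynamo, later S3 (see below).
async function persist(name: string, data: unknown): Promise<void> {
  console.log(`would persist ${name} (${JSON.stringify(data).length} bytes)`);
}

export const handler = async (): Promise<void> => {
  const topPosts = await fetchJson("https://www.reddit.com/top.json?t=day&limit=25");
  const trending = await fetchJson("https://www.reddit.com/api/trending_subreddits.json");
  await persist("top-posts", topPosts);
  await persist("trending-subreddits", trending);
};
```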
When to collect the data from Reddit?
When setting up a daily collection of data it's important to think about WHEN the data should be collected, especially for this project. The "daily top" reddit posts are relative to whatever time zone you're in, so pulling the API data once a day won't cut it. With that in mind, I decided to poll the reddit API every 4 hours and use the EST time zone to determine day cut-offs. I was happy with the collection schedule, and although using EST isn't perfect, it'll work.
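In CloudWatch terms that's just a rule with a `rate(4 hours)` schedule expression. The only code needed on top is a helper that buckets each sample into its EST day. Something like this sketch, using a fixed UTC-5 offset (which ignores DST, but so does "EST"):

```typescript
// Bucket a sample timestamp into its EST calendar day. EST is a fixed
// UTC-5 offset, so shifting the clock back five hours before taking the
// date gives the day cut-off described above.
function estDayKey(when: Date = new Date()): string {
  const shifted = new Date(when.getTime() - 5 * 60 * 60 * 1000);
  return shifted.toISOString().slice(0, 10); // e.g. "2020-01-18"
}
```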
Where to store the data?
Storing the data properly marked the first of what will likely be many mistakes. I had the idea for this project a week before the new year, so I had to put something together quickly. I threw all the data from reddit's API into a Dynamo table without much thought, and later realized it wasn't in a good format for querying multiple daily entries by month. Eventually I decided it would be better for it all to live in S3. It's got "simple" in the name, and all I really needed was a loosely organized JSON file store in the first place. So, I spent a few days writing a Clojure app to migrate my existing Dynamo entries into S3 objects.
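The S3 layout that fixed the querying problem is nothing fancy. The migration essentially mapped each Dynamo item to a key along these lines (prefix names are illustrative), so a month of samples is one prefix listing away:

```typescript
// Illustrative key scheme for the raw samples: group by day so that a
// single prefix (e.g. "raw/2020-01-") lists an entire month of entries.
function rawObjectKey(sampledAt: Date): string {
  const iso = sampledAt.toISOString();
  return `raw/${iso.slice(0, 10)}/${iso.slice(11, 13)}.json`; // raw/2020-01-18/12.json
}
```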
Aggregating daily reddit posts
> 11/25/2020
Once the raw reddit data was collected for a given month, it was time to aggregate it all together and determine the "winning" posts for each day. This process was fairly straightforward, but I ran into a few edge cases along the way. For example, some posts stayed in the top three upvoted posts for multiple days in a row. To avoid showing repeats, I decided to only display a post on the first day it arrived in the top three. The other stumbling point was figuring out which fields were actually required from reddit's post data to make the UI work. The fields returned from reddit's API don't make the post type super obvious (link post, text post, video content, image content, etc.). There are fields like "post_hint" that you can rely on sometimes, but for other posts you have to manually check whether the "media" field or the "self" flag contains anything. Regardless, I'd just have to wait and filter out the unnecessary data after I built the UI.
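Both pieces are short in code. Here's roughly what they look like in TypeScript, with the RawPost shape trimmed down to just the fields mentioned above:

```typescript
interface RawPost {
  id: string;
  post_hint?: string;  // "image", "link", "hosted:video"... when present
  is_self?: boolean;   // the "self" flag: true for text posts
  media?: unknown;     // video/embed payload when present
}

// Keep a post only on the first day it appears in a top-three list.
// Assumes the map's keys are in chronological order.
function dedupeAcrossDays(days: Map<string, RawPost[]>): Map<string, RawPost[]> {
  const seen = new Set<string>();
  const out = new Map<string, RawPost[]>();
  for (const [day, posts] of days) {
    out.set(day, posts.filter((p) => {
      if (seen.has(p.id)) return false;
      seen.add(p.id);
      return true;
    }));
  }
  return out;
}

// Best-effort content type detection when post_hint can't be trusted.
function contentType(p: RawPost): "text" | "image" | "video" | "link" {
  if (p.is_self) return "text";
  if (p.post_hint === "image") return "image";
  if (p.post_hint?.includes("video") || p.media) return "video";
  return "link";
}
```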
Serving aggregated post data
Now that the data is aggregated and filtered, where do we store it, and how should it be organized? My idea for the site was that a user would navigate the posts by month: a single page of posts would only represent a single month of 2020. This lets users do quick, simple searches, and it gave me a good excuse to avoid writing a more complicated filtering system. With that in mind, I decided to store the post preview data in S3 objects by month. By "preview data" I mean the data required to get the unexpanded posts displaying correctly: thumbnails, titles, number of upvotes, content type, etc. I didn't need things like the post body text or the comments mixed in with this data; you can't see that information unless you click into the post anyway. It's also a safe assumption that visitors of the site will only ever click into ~1% of posts, so there's no need to load all that extra data on the previews page. I took a similar strategy with the individual post details data (all the preview data + text body & comment data). Each individual post gets its own S3 object named after its post ID, making for easy key-value lookups.
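Concretely, the split looks something like this (field lists abbreviated, object key names illustrative):

```typescript
// Preview data: just enough to render a collapsed post row.
interface PostPreview {
  id: string;
  title: string;
  subreddit: string;
  upvotes: number;
  thumbnail?: string;
  contentType: "text" | "image" | "video" | "link";
}

// Details = preview plus the heavy fields, fetched only on click.
interface PostComment { author: string; body: string; upvotes: number; }
interface PostDetails extends PostPreview {
  selftext?: string;
  comments: PostComment[];
}

// One object per month of previews, one object per post for details.
const previewKey = (month: string) => `previews/${month}.json`; // month = "2020-01"
const detailsKey = (postId: string) => `posts/${postId}.json`;
```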
So, now we have our data, and we just need a way for the UI to get ahold of it. I immediately started thinking about writing some API Gateway function to fetch the data I wanted, but the way I had stored everything in S3 was incredibly simple, and I didn't need any API logic to access the right data inside the S3 objects. Plus, the data didn't need to be scrubbed of any private information before being returned to a user. What I really needed was something that just fetched raw files from S3, cached them, and, if possible, distributed the files across multiple server locations for quick page loads no matter where you are. Oh wait, that's literally what a CDN does, and AWS has one that can easily forward requests to S3. That was a long-winded way of revealing that I used CloudFront. It does its job. I'm happy with it.
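"Getting ahold of" the data from the UI is then just a GET against the CloudFront distribution. A sketch (the domain name here is illustrative, not the real one):

```typescript
// The UI never talks to S3 directly; CloudFront fronts the bucket and
// caches each JSON object at its edge locations.
const CDN = "https://data.reddityearinreview.com"; // assumed domain

// PostPreview is the interface from the previous sketch.
async function fetchMonthPreviews(month: string): Promise<PostPreview[]> {
  const res = await fetch(`${CDN}/previews/${month}.json`);
  if (!res.ok) throw new Error(`failed to load ${month}: ${res.status}`);
  return res.json();
}
```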
Putting the data on display
The UI for this project was built with TypeScript, React, and NextJS. This was the first time I'd used NextJS, and I would highly recommend trying it if you haven't. I was able to pre-build all the month preview pages for quicker load times, and the out-of-the-box page routing made things super easy.
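The pre-building boils down to NextJS's getStaticPaths/getStaticProps pair. Simplified from the real page (the data URL and file layout are illustrative):

```tsx
// pages/[month].tsx — one statically generated page per month of 2020.
import type { GetStaticPaths, GetStaticProps } from "next";

const MONTHS = Array.from({ length: 12 }, (_, i) =>
  `2020-${String(i + 1).padStart(2, "0")}`
);

export const getStaticPaths: GetStaticPaths = async () => ({
  paths: MONTHS.map((month) => ({ params: { month } })),
  fallback: false, // every valid month is known at build time
});

export const getStaticProps: GetStaticProps = async ({ params }) => {
  // Fetch the month's preview object once, at build time.
  const res = await fetch(
    `https://data.reddityearinreview.com/previews/${params!.month}.json`
  );
  return { props: { posts: await res.json() } };
};

export default function MonthPage({ posts }: { posts: { id: string; title: string }[] }) {
  return (
    <ul>
      {posts.map((p) => (
        <li key={p.id}>{p.title}</li>
      ))}
    </ul>
  );
}
```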
I hit a real standstill while working on this part of the app. Most of the unpredictable pieces of the project had been taken care of, and I had a whole year to get the rest done. Matching all the fonts, colors, and icons of https://old.reddit.com in my UI wasn't super thrilling either. I could just push it all off until November, go through a mild panic that I wouldn't finish in time, and then wrap it up as planned over a couple of weekends. So that's exactly what I did. Having a hard deadline of Dec 31st was great: I got some time to slack off, but it forced me to finish by the end of December, because if I didn't, the project was kinda useless. People aren't going to look at this project in February or March of next year. My only window of success with this website is the end of December and the first few days of January, when everyone is feeling nostalgic.
All in all, the front-end work went pretty smoothly. I like how it looks and hope visitors of the site will appreciate the "old reddit" feel.
Thanks!
If you've read this far and you took the time to check out reddityearinreview.com, thank you. I enjoyed putting this together, and seeing others visit the site motivates me to continue building more projects in the future.