3 Replies Latest reply on Oct 26, 2018 2:45 PM by Chris Adams

    What is the technology behind crowd.loc.gov?

    Kelly Osborn Tracker

      This is such a neat project. What kind of technology do you have behind your crowdsourcing platform, and would it be possible for me to see the code?

        • Re: What is the technology behind crowd.loc.gov?

          Hi Kelly,


          I'm one of the developers who works on crowd.loc.gov. Our project is developed in the open over on https://github.com/LibraryOfCongress/concordia and the code is free to use.


          The application that powers crowd.loc.gov is a Python / Django website with a Postgres database backend. We're running it in Amazon Web Services' Elastic Container Service (ECS) using Fargate tasks.
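          For anyone curious what that stack looks like in code, here's a minimal sketch of a Django settings fragment for a Postgres backend. The database name, credentials, and environment variable names are illustrative assumptions, not Concordia's actual configuration:

```python
# Illustrative Django settings fragment for a Postgres backend.
# The ENGINE string is Django's built-in Postgres backend; the
# database name, credentials, and env var names are assumptions,
# not the project's actual configuration.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("POSTGRES_DB", "concordia"),
        "USER": os.environ.get("POSTGRES_USER", "concordia"),
        "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        "HOST": os.environ.get("POSTGRES_HOST", "localhost"),
        "PORT": os.environ.get("POSTGRES_PORT", "5432"),
    }
}
```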


          I'd be happy to answer any questions you have about the code. Also, feel free to submit an issue or a pull request.



            • Re: What is the technology behind crowd.loc.gov?
              techhistorynerd Adventurer

              Apologies - this is a pretty technical set of questions, but the nerd in me can't resist...


              1.  What would you say are the minimal hardware requirements for running a development Docker container to test this out (e.g., are there minimum RAM/CPU/disk-space requirements to be aware of)?


              2.  What sort of stress testing (if any) have you been able to try to see how the platform performs when under high load (either or both of lots of connections/queries and very large numbers of images/records in the database)?


              3.  Have you had any discussions with the ARC website designers (and/or the folks doing the follow-on work from the earlier NARA survey) about how your infrastructure compares to theirs and if there is potential for collaboration between NARA and LOC in this area?

                • Re: What is the technology behind crowd.loc.gov?

                  Another crowd.loc.gov developer here!


                  The minimal requirements are fairly low: there are a few operations in the admin or importer which use memory in proportion to the number of items being processed, so for local development you might need more than 100MB or so only if you tried to delete a large batch all at once. Beyond that, it's mostly a scaling question driven by the number of simultaneous requests, and with the dev environment that's going to be one.
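                  As a sketch of the chunking pattern that keeps memory proportional to a fixed chunk size rather than the total batch (generic Python, not the project's actual admin/importer code; `process_chunk` is a hypothetical placeholder for e.g. a bulk delete):

```python
# Sketch: process a large batch in fixed-size chunks so peak memory
# tracks the chunk size, not the total number of items.
# process_chunk is a placeholder callable (e.g. a bulk delete).

def chunked(ids, size):
    """Yield successive fixed-size slices from a list of ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def process_in_chunks(ids, process_chunk, size=500):
    """Apply process_chunk to each slice; return the number of chunks run."""
    count = 0
    for chunk in chunked(ids, size):
        process_chunk(chunk)
        count += 1
    return count
```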


                  For performance, we did a couple of things. During development, we used tools like Django Debug Toolbar to make sure the database queries were reasonable for all of the public views. After the deployment situation had stabilized, I ran some tests using tools like siege and wrk to load a large number of site URLs randomly so we could measure min/median/max response times and look for unexpected load. I also used some scripts built on https://github.com/yujiosaka/headless-chrome-crawler to walk the entire site in Google Chrome, record page-load timing data (which is usually more front-end than back-end, but useful to review), and catch any JavaScript errors or 404s.
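                  As a rough sketch of the min/median/max reduction step (generic Python over made-up latency samples, not the actual test scripts):

```python
# Sketch: summarize response-time samples collected from a load test.
# The sample values in the usage example are invented for illustration.
import statistics

def summarize(latencies_ms):
    """Return min/median/max of a list of response times in milliseconds."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }

# Example with made-up samples:
stats = summarize([120, 80, 200, 95, 150])
```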


                  I haven't talked with anyone at ARC, but I agree there are some interesting infrastructure discussions to be had. This was our first big serverless project, and the auto-scaling setup was especially nice for peace of mind.
