My take on Git-Scraping[tm]
A while ago I was inspired by Simon Willison's git scraping technique, which led me to maintain my own public scraper.
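My scraper is a little different, though. The original technique scrapes the data and overwrites the previous snapshot, so the history only lives in the git log. I append each new observation to the data files and commit those incremental changes instead. This keeps the full history in the files themselves, which lets me run a public dashboard straight from the git repository as well!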
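To make the append-only idea concrete, here is a minimal sketch, not the actual script from the repository. The function name `format_record`, the field names, and the placeholder download count are all assumptions for illustration; a real scraper would fetch the count from a statistics service. The point is only that each run emits exactly one timestamped JSON line, so the caller can append it to a file with `>>`.

```python
import json
import sys
from datetime import datetime, timezone


def format_record(project, downloads, now=None):
    """Return a single JSON line describing one observation (hypothetical schema)."""
    now = now or datetime.now(timezone.utc)
    record = {
        "project": project,
        "downloads": downloads,
        "timestamp": now.isoformat(),
    }
    # One line per observation keeps the output safe to append to a .jsonl file.
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values: a real scraper would look up the count over HTTP.
    project = sys.argv[1] if len(sys.argv) > 1 else "example-project"
    print(format_record(project, downloads=0))
```

Because the script only prints to stdout, the decision to overwrite (`>`) or append (`>>`) stays with whoever calls it.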
Here’s a demo workflow that collects download statistics for my Python projects. A command-line script prints the result to stdout, which is then appended to the appropriate file. At the end I concatenate all the results into a single file before committing the changes back to master.
name: Kollekt Pepy

on:
  workflow_dispatch:
  schedule:
    - cron: '0 10 * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Set up Python 3.7
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Fetch latest data
        run: |-
          python download/pepy.py scikit-lego >> data/pepy/scikit-lego.jsonl
          python download/pepy.py human-learn >> data/pepy/human-learn.jsonl
          python download/pepy.py whatlies >> data/pepy/whatlies.jsonl
          python download/pepy.py drawdata >> data/pepy/drawdata.jsonl
          python download/pepy.py tokenwiser >> data/pepy/tokenwiser.jsonl
          python download/pepy.py memo >> data/pepy/memo.jsonl
          python download/pepy.py clumper >> data/pepy/clumper.jsonl
          python download/pepy.py mktestdocs >> data/pepy/mktestdocs.jsonl
      - name: Concatenate it all
        run: |-
          python common/concat.py data/pepy/*.jsonl data/pepy/downloads.csv
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
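The contents of `common/concat.py` are not shown here. As a rough sketch of what such a concatenation step could look like, assuming every `.jsonl` file holds one JSON object per line, something along these lines would work. The function name `concat_jsonl_to_csv` and the extra `source` column (taken from the file name) are assumptions made for this example, not a description of the real script.

```python
import csv
import json
import sys
from pathlib import Path


def concat_jsonl_to_csv(jsonl_paths, csv_path):
    """Merge JSON-lines files into one CSV and return the number of rows written."""
    rows = []
    for path in map(Path, jsonl_paths):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                row = json.loads(line)
                # Assumption: record which file each row came from.
                row["source"] = path.stem
                rows.append(row)

    # Union of all keys in first-seen order, so the header stays stable.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Same call shape as the workflow step: input files first, output path last.
    *inputs, output = sys.argv[1:]
    concat_jsonl_to_csv(inputs, output)
```

Rebuilding the CSV from scratch on every run keeps this step idempotent: the append-only `.jsonl` files remain the source of truth and the CSV is just a derived view for the dashboard.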
You can find the project on GitHub.
It’s a pretty powerful technique that you can easily combine with my justcharts library. I’m using the related GitHub Pages site to host a dashboard here, but the data can also be viewed via flatgithub. The flatgithub project is part of the Flat Data effort on GitHub.