Event

Hyphe – A research-oriented web crawler

Hands-on History talk with Benjamin Ooghe-Tabanou (médialab – Sciences Po)

Developed by Sciences Po médialab as open source software, Hyphe was designed to provide researchers and students with a research-oriented crawler for building and enriching corpora of websites through a qualitative fieldwork methodology. It offers both a method and a tool for building a research corpus from web content (web pages and hyperlinks), with an approach meant to address two of the main problems social scientists face when working with automated web mining: how to build a theme-focused corpus and how to delineate an actor's presence on the web.

A step-by-step iterative process supports Hyphe users in dynamically curating and defining "web entities" in a way that is both granular and flexible: an entity can be a single page, a subdomain, a combination of websites, etc. The pages residing under these entities are then crawled in order to extract their outgoing links and part of their textual content. The most cited "web entities" can then be reviewed manually to expand the corpus, before visualizing it as a network and exporting it for cleaning and analysis in other tools such as Gephi.
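The idea described above can be illustrated with a minimal Python sketch. It is not Hyphe's actual code or API: the entity definitions, the function names (`normalize`, `entity_for`, `entity_links`, `candidates`) and the sample URLs are all hypothetical, and a web entity is simplified here to a name plus a list of URL prefixes. Page-level links are aggregated into entity-level links, and targets that are not yet in the corpus are ranked by how many corpus entities cite them.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical corpus: each web entity is a name plus one or more URL prefixes.
ENTITIES = {
    "lab": ["lab.example.org"],
    "crawler-repo": ["code.example.com/lab/crawler"],
}

# Made-up (source page, target page) links, as a crawl might produce them.
PAGE_LINKS = [
    ("https://lab.example.org/tools/crawler/", "https://code.example.com/lab/crawler"),
    ("https://lab.example.org/tools/", "https://viz.example.net/"),
    ("https://code.example.com/lab/crawler/README", "https://viz.example.net/users/"),
    ("https://lab.example.org/", "https://lab.example.org/tools/"),
]


def normalize(url):
    """Reduce a URL to 'host/path' without scheme, 'www.' or trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def entity_for(url, entities=ENTITIES):
    """Return the entity with the longest matching prefix, else the bare host."""
    key = normalize(url)
    best_name, best_len = None, -1
    for name, prefixes in entities.items():
        for prefix in prefixes:
            matches = key == prefix or key.startswith(prefix + "/")
            if matches and len(prefix) > best_len:
                best_name, best_len = name, len(prefix)
    if best_name is not None:
        return best_name
    # Unknown page: treat its host as a candidate entity (an assumption).
    return key.split("/", 1)[0]


def entity_links(page_links, entities=ENTITIES):
    """Aggregate page-to-page links into weighted entity-to-entity links."""
    links = Counter()
    for source, target in page_links:
        src, tgt = entity_for(source, entities), entity_for(target, entities)
        if src != tgt:  # ignore links internal to one entity
            links[(src, tgt)] += 1
    return links


def candidates(links, entities=ENTITIES):
    """Rank targets outside the corpus by the number of entities citing them."""
    cited = Counter()
    for (_src, tgt) in links:
        if tgt not in entities:
            cited[tgt] += 1
    return cited


links = entity_links(PAGE_LINKS)
print(candidates(links).most_common())
```

In this toy data, `viz.example.net` is cited by both corpus entities, so it would be the first candidate to review manually for inclusion. The resulting entity-level edge list is the kind of network that could then be exported for a tool such as Gephi.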

In partnership with France's official web archiving teams, Hyphe was recently adapted to also crawl web archives from archive.org, as well as those of the French national library (BnF) and the French National Audiovisual Institute (INA). This allows users to build web corpora from the past, or to complete corpora built from the live web with archived copies of websites that have since disappeared.

About the speaker

Benjamin Ooghe-Tabanou

Trained as a multidisciplinary engineer, Benjamin Ooghe-Tabanou specialises in applying computer science to scientific research. After working in astrophysics at Johns Hopkins University in the USA and at the École Normale Supérieure in France, he moved into the social sciences, first as an open data and parliamentary transparency activist, co-founding the NGO Regards Citoyens, then as a research engineer, joining médialab Sciences Po in 2012 to develop open source tools for the social sciences such as the web crawler Hyphe. He has supervised médialab's technical team of research engineers since 2020.