Artist entity extraction for all - opening our “Puddle”
It was 2006, and co-founder Brian and early Nest developer Ryan were trying to figure out how to connect the world of free text on the internet to musical artists. We were already crawling tens of thousands of documents a day (now millions!), but a Google-style index of unstructured text about music was not our goal. We needed a fast way to associate each incoming page with an artist ID, so that we could retrieve all the documents about an artist and run our statistics on the text to find out what people were saying. Brian sketched this classic diagram, soon to be placed in the Echo Nest Museum (next to SoundCloud’s award):
That crudely drawn “blob of intelligence,” which could take unstructured free text and quickly identify artist names, soon became known as “the Puddle,” a term that entered Echo Nest lore alongside “grankle” and “flat.” We use a form of the Puddle to this day. Every piece of text that our crawlers generate goes through a custom entity extraction process: it’s how we know which blogs are writing about which artists, and it’s what powers our artist similarity engine, since we need to figure out what people are saying about which artists as soon as it’s said. It’s a powerful, fast-changing piece of our infrastructure that attacks a hard problem.
Entity extraction is even more useful today. If you wanted to build a Twitter app that figured out which bands a user was talking about, how would you do it? You’d need a huge database of artists (check, we have over 1.6 million), a lot of fast computers (check), and tons of rules learned from our customers over the years about artist resolution: aliases, stopwords, tokenization, merged artists and so on. Given a simple tweet:
Can we figure out which band Brian’s talking about, automatically? Well, now you can. We’ve decided to open up a beta version of our entity extraction toolkit, called artist/extract, to developers. Pass in any text and you’ll get back a list of the artist names that were mentioned in it (in order of appearance by default, but you can sort by any Echo Nest feature). Think of it as a form of artist search that can take anything: Facebook comments, tweets, blog posts, reviews, SMSes.
We support all sorts of fancy things to help you. We know that “Led Zep” is an alias for Led Zeppelin. We try to deal with common-word band names via capitalization rules. You can of course detect multiple artists in the same block of text. You can use personal catalogs and Rosetta Stone to limit results to music that your user owns or that is playable through our partners Rdio or 7digital. And you can add the standard buckets (hotttnesss, familiarity and so on) to get information about the artists all in one call.
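Here is a rough sketch of what a call might look like from Python. It only builds the request URL and pulls artist names out of a JSON reply, so you can plug in whatever HTTP client you like. The base URL, the parameter names (api_key, text, format, results, bucket) and the response layout (response, then artists, then name) are assumptions modeled on our other v4 API methods, and build_extract_url and artist_names are hypothetical helper names, so check the API documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint location; confirm against the official API documentation.
BASE_URL = "http://developer.echonest.com/api/v4/artist/extract"


def build_extract_url(text, api_key, buckets=(), results=15):
    """Build a GET URL for an artist/extract request (parameter names are assumed)."""
    params = [
        ("api_key", api_key),
        ("format", "json"),
        ("text", text),
        ("results", results),
    ]
    # One "bucket" parameter per requested bucket, e.g. hotttnesss, familiarity.
    params += [("bucket", bucket) for bucket in buckets]
    return BASE_URL + "?" + urlencode(params)


def artist_names(response_body):
    """Return the artist names found in a JSON reply (response layout is assumed)."""
    payload = json.loads(response_body)
    artists = payload.get("response", {}).get("artists", [])
    return [artist["name"] for artist in artists]
```

Fetch the URL from build_extract_url with your HTTP client of choice, hand the response body to artist_names, and you should have the list of matched names.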
This is beta and still has some issues. It lags a little behind our internal entity resolving for performance reasons, and things like this can never be perfect. But it’s very helpful. Some ideas we’ve been batting around:
- Suggest bands to Facebook users using our new Facebook Rosetta service and by parsing their comments for band names
- Recommend Twitter followers based on the music they talk about
- Play a radio stream for any blog using our playlist APIs by parsing their posts for artist names
- See how “indie” your friends are by computing average hotttnesss of all the bands they mention in email
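To make the last idea a bit more concrete, here is a minimal sketch of the “indie” score. It assumes you have already extracted the artists from someone’s text with the hotttnesss bucket turned on, and that each artist comes back as a dictionary with a numeric "hotttnesss" field; that field layout and the average_hotttnesss helper are assumptions for illustration only.

```python
def average_hotttnesss(artists):
    """Average the hotttnesss of the given artists, skipping any without a score.

    Returns None when no artist has a score, since there is nothing to average.
    """
    scores = [a["hotttnesss"] for a in artists if a.get("hotttnesss") is not None]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

A lower average would suggest a friend who mostly mentions less popular bands, though where to draw the “indie” line is up to you.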
Enjoy. Let us know if you have any issues!