Data

Content tagged "Data".

The Canonical Path

the most canonical paths across the shared surface of the world’s music, starting at the point of some particular artist and going…outward, every other direction at once.

I’ve enjoyed the playlists generated by this so far.

The Democratization of Data Science

Jonathan Cornelissen:

Relegating all data knowledge to a handful of people within a company is problematic on many levels. Data scientists find it frustrating because it’s hard for them to communicate their findings to colleagues who lack basic data literacy. Business stakeholders are unhappy because data requests take too long to fulfill and often fail to answer the original questions. In some cases, that’s because the questioner failed to explain the question properly to the data scientist.

​…

A data-literate team makes better requests. Even a basic understanding of tools and resources greatly improves the quality of interaction among colleagues. When the “effort level” — the amount of back-and-forth needed to clarify what is wanted — of each request goes down, speed and quality go up.

​…

Shared skills improve workplace culture and results in another way, too: They improve mutual understanding. If you know how hard it will be to get a particular data output, you’ll adjust the way you interact with the people in charge of giving you that output. Such adjustments improve the workplace for everyone.

No Observability Without Theory

Dan Slimmon:

When building a dashboard, start with a set of questions you want to answer about a system’s behavior, and then choose where and how to add instrumentation; not the other way around.

The incident response team’s common ground is their theory of the system’s behavior – in order to make troubleshooting observations meaningful, that theory needs to be kept up to date with the data.

Data Visualisation Mistakes

Sarah Leo:

At The Economist, we take data visualisation seriously. Every week we publish around 40 charts across print, the website and our apps. With every single one, we try our best to visualise the numbers accurately and in a way that best supports the story. But sometimes we get it wrong. We can do better in future if we learn from our mistakes — and other people may be able to learn from them, too.

Machine Learning for Data Structures

The Case for Learned Index Structures:

Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.

Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets.

The shadow org chart

Henry Ward:

I have long felt there is a shadow org chart, much like a shadow economy, where employees trade ideas, give direction, offer help, and spread culture. This shadow org chart is built bottom up by employees and is very different from the top down hierarchical org chart set by me.

I wanted to map this shadow org chart and find employees who have disproportionate levels of influence relative to their hierarchical position. I also wanted to see the influence centers and decision makers, and the directional current between them and the rest of the company.

Some fascinating analysis and data visualisation of a graph of influence.

PumpkinDB

About PumpkinDB:

PumpkinDB is essentially a database programming environment, largely inspired by core ideas behind MUMPS. Instead of M, it has a Forth-inspired stack-based language, PumpkinScript. Instead of hierarchical keys, it has a flat key namespace and doesn’t allow overriding values once they are set.

The core ideas behind PumpkinDB stem from the so called lazy event sourcing approach which is based on storing and indexing events while delaying domain binding for as long as possible. That said, the intention of this database is to be a building block for different kinds of event sourcing systems, ranging from the classic one (using it as an event store) all the way to the lazy one (using indices) and anywhere in between.

Haunted by Data

Maciej Cegłowski:

A more recent and less fictitious example is electronic logging devices on trucks. These are intended to limit the hours people drive, but what do you do if you’re caught ten miles from a motel?

The device logs only once a minute, so if you accelerate to 45 mph, and then make sure to slow down under the 10 mph threshold right at the minute mark, you can go as far as you want.

So we have these tired truckers staring at their phones, bunny-hopping down the freeway late at night.

Of course there’s an obvious technical countermeasure. You can start measuring once a second.

Notice what you’re doing, though. Now you’re in an adversarial arms race with another human being that has nothing to do with measurement. It’s become an issue of control, agency and power.

You thought observing the driver’s behavior would get you closer to reality, but instead you’ve put another layer between you and what’s really going on.

These kinds of arms races are a symptom of data disease. We’ve seen them reach the point of absurdity in the online advertising industry, which unfortunately is also the economic cornerstone of the web. Advertisers have built a huge surveillance apparatus in the dream of perfect knowledge, only to find themselves in a hall of mirrors, where they can’t tell who is real and who is fake.

Music Curation at Spotify

Alex Heath:

During internal testing, his team realized that if you don’t recognize a single artist in a playlist, you might question if it’s actually geared for you. That’s why the playlist is intended to have a mix of mostly new tracks with a few songs you’ve heard before.

“There’s something compelling about this humans versus robots narrative: a lovingly curated playlist versus an algorithm screwing up your sexy time,” says Ogle. “That whole distinction no longer really describes how we work. Discover Weekly is humans all the way down. Every single track that appears in Discover Weekly is because other humans being have said, ‘Hey this is a good song, and here’s why.’”

My Discover Weekly playlist has been hitting the spot and it’s interesting to get an insight into how they are put together.

Although Spotify has had humans making playlists for years, its efforts got a major boost last year with the introduction of Truffle Pig, an internal tool from The Echo Nest that breaks music down into thousands of categories like “wonky,” “chillwave,” “stomp and holler,” or “downtempo."

What you hear from everyone at Spotify is that humans using data insights are key to curating music on a large scale. Naturally, they’re also using data to evaluate how well playlists are working.

An example of data-centric tools aiding human decision makers.

Designing Data-Driven Interfaces

Truth Labs:

“Dashboard”, “Big Data”, “Data visualization”, “Analytics” — there’s been an explosion of people and companies looking to do interesting things with their data. I’ve been lucky to work on dozens of data-heavy interfaces throughout my career and I wanted to share some thoughts on how to arrive at a distinct and meaningful product.

I’ve been doing a lot of data-driven interface work over the last few years and the advice in this article is spot on.

RSS feed for content about Data.