- One developer’s chat data analysis of 1.2 million messages spans 20 years and five social platforms.
- His chat data analysis uncovered friendship half-lives, emotional bandwidth limits, and a vocabulary frozen in his early 20s.
- Filtering noise was the hardest technical challenge: over 40% of messages in his longest thread were noise, and conversational filler alone accounted for 28.4%.
- The project raises bigger questions about whether our digital trails can tell us who we actually are to other people.
When Self-Tracking Gets Uncomfortably Honest
Most people who obsess over personal data stop at step counts and sleep scores. Valentin Drobinin went considerably further. His chat data analysis — spanning two decades, five platforms, and 1.2 million messages — started as a self-improvement project and ended up as something closer to an unsettling psychological audit. The question he set out to answer: Am I actually a good friend? What he found was more complicated than a yes or no.
The impulse, he writes, traces back to a 2014 essay by Tim Urban of WaitButWhy — the now-famous “Your Life in Weeks” piece, where a grid of squares represents every week of a human life, most of them already gone. That image stuck. Drobinin started tracking things because of it, but biometric data felt hollow. Steps and heart rate don’t tell you whether you were kind, present, or worth talking to. He wanted a different kind of record.
Journaling didn’t work either. Paper, then text files, then Obsidian — each method captured what felt important in the moment and missed everything slow-moving and structural. The patterns you can only see in retrospect. So he turned to the one data source that had been recording everything all along, without him having to remember to press save: his chat history. That decision to treat his message archives as the raw material for chat data analysis turned out to be the most revealing choice he made.
Building a Digital Archaeology Project
Drobinin’s online life breaks into three distinct eras. The 2000s were ICQ, IRC, and DC++ — midnight channels, teenage banter, all gone. He’s not mourning those. The 2010s spread across VK (a post-Soviet social network that still holds his archives back to 2008), Twitter, and Facebook. More recently, Instagram DMs and Telegram have taken over, and he notes that more people are quietly swapping WhatsApp for Telegram — a migration that privacy researchers have been tracking for years.
Armed with GDPR data access requests, he pulled archives from all five platforms. Then came the messy reality of actually parsing them. Instagram double-encodes Cyrillic text through latin-1. Telegram assigns different internal message IDs depending on when you export. Facebook’s belated end-to-end encryption rollout means identical messages appear in three different folders. VK just dumps everything with no ceremony. Instagram can’t distinguish a broadcast message from a personal conversation.
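The double-encoding problem is a common one: UTF-8 bytes get read as latin-1 somewhere in the export pipeline. A minimal sketch of the usual repair follows, assuming that is the failure mode in play; `fix_mojibake` is a hypothetical helper name, not something taken from Drobinin's code.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mis-decoded as latin-1 (assumed failure mode)."""
    try:
        # Recover the original bytes, then decode them the way they were meant to be read.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mangled this way (or only partly), so leave it alone.
        return text


# Simulate the damage: "Привет" encoded as UTF-8, then wrongly decoded as latin-1.
mangled = "Привет".encode("utf-8").decode("latin-1")
print(fix_mojibake(mangled))  # Привет
```

The `try`/`except` matters because an archive can mix damaged and clean strings; text that is already correct Cyrillic cannot be encoded as latin-1, so it passes through untouched.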
None of this is surprising to anyone who’s ever tried to work with social media data exports — they’re designed to technically satisfy data portability laws, not to actually be useful. Once Drobinin hammered everything into a uniform tab-separated format, the five sources produced genuinely different kinds of signal. Telegram and VK skew toward direct messages. Instagram adds story interactions and a follower graph. Twitter required the reply and mention graph to extract anything meaningful, since DMs there are half support tickets and half conference logistics. At this stage, the chat data analysis was less about insight and more about survival — getting the data into a shape where any analysis was even possible.
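What a uniform tab-separated format might look like can be sketched in a few lines. The column names, the `Message` record, and the exact-duplicate rule below are all assumptions made for illustration; the article does not describe Drobinin's actual schema.

```python
import csv
import io
from dataclasses import astuple, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Message:
    platform: str
    sent_at: str  # ISO 8601 timestamp in UTC
    sender: str
    text: str


def normalise(platform, unix_ts, sender, text):
    """Convert one platform-specific record into the shared shape."""
    sent_at = datetime.fromtimestamp(unix_ts, tz=timezone.utc).isoformat()
    # Collapse tabs and newlines so each message stays on exactly one TSV row.
    clean = " ".join(text.split())
    return Message(platform, sent_at, sender, clean)


def write_tsv(messages, out):
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["platform", "sent_at", "sender", "text"])
    seen = set()
    for msg in sorted(messages, key=lambda m: m.sent_at):
        if msg in seen:  # drop exact duplicates, e.g. one message exported in several folders
            continue
        seen.add(msg)
        writer.writerow(astuple(msg))


raw = [
    ("facebook", 1577836800, "Alex", "happy new year"),
    ("facebook", 1577836800, "Alex", "happy new year"),  # duplicate export
    ("telegram", 1577840400, "me", "and\tto you"),
]
buf = io.StringIO()
write_tsv([normalise(*r) for r in raw], buf)
print(buf.getvalue())
```

Storing timestamps as UTC ISO strings keeps the file sortable with plain text tools, which is most of the appeal of TSV in the first place.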
Chat Data Analysis Hits a Noise Problem First
Before any interesting insight, there’s a deeply unglamorous problem: most of what people say to each other is noise. In Drobinin’s longest thread — 486,000-plus messages exchanged with his partner over ten years — only 58.7% qualifies as substantive text. The rest breaks down into 28.4% conversational filler, 9.1% media, 2.4% links, and 1.5% emoji-only messages. That’s 41% of a decade-long conversation that tells you essentially nothing about the people having it. Anyone attempting a similar chat data analysis will hit this wall early.
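A breakdown like this implies some rule for putting every message into exactly one bucket. The sketch below shows one way to do it; the category rules are assumptions, and "no letters or digits at all" is only a crude stand-in for "emoji-only".

```python
import re
from collections import Counter

LINK_ONLY = re.compile(r"^https?://\S+$")
WORD_CHAR = re.compile(r"\w")  # any letter, digit or underscore, in any script


def categorise(text, has_attachment=False, filler=frozenset()):
    """Assign one message to a single bucket (rules are illustrative assumptions)."""
    body = text.strip()
    if not body:
        return "media" if has_attachment else "empty"
    if LINK_ONLY.match(body):
        return "link"
    if not WORD_CHAR.search(body):
        return "emoji"  # crude: also catches punctuation-only messages
    if body.casefold() in filler:
        return "filler"
    return "text"


def shares(categories):
    """Percentage of messages per category, rounded to one decimal place."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


sample = [
    ("he died", False),
    ("hahaha", False),
    ("😂😂", False),
    ("https://example.com/a", False),
    ("", True),  # photo with no caption
]
labels = [categorise(text, attached, filler={"hahaha"}) for text, attached in sample]
print(shares(labels))
```

The order of the checks is the design decision here: the cheap, unambiguous categories are peeled off first, so the filler test only ever sees messages that contain actual words.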
Filtering emoji and links is trivial. Filtering filler is genuinely hard. A three-word minimum cuts too aggressively — “he died” and “we lost” are two words each and carry more weight than most paragraphs. A denylist of “hahaha” and “noice” and their variants falls apart immediately across multiple languages. What ultimately worked was a sampling approach: pull from five offset positions across the chat, count token frequency, manually review the top 80, then pair the denylist with a protected set for short messages that mark real life events. It’s inelegant, but it held.
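The sampling approach can be sketched roughly as follows. The window size, the whitespace tokeniser, and the rule that a message counts as filler only when every one of its tokens is denylisted are assumptions; the manual review of the top 80 tokens stays manual.

```python
from collections import Counter


def sample_windows(messages, n_windows=5, window_size=1000):
    """Take evenly spaced slices so early and late eras of the chat are both represented."""
    if n_windows < 2 or len(messages) <= n_windows * window_size:
        return list(messages)
    step = (len(messages) - window_size) // (n_windows - 1)
    sample = []
    for i in range(n_windows):
        start = i * step
        sample.extend(messages[start:start + window_size])
    return sample


def top_tokens(messages, top_n=80, **window_args):
    """Most frequent tokens in the sample: candidates for the denylist, pending human review."""
    counts = Counter(
        token
        for message in sample_windows(messages, **window_args)
        for token in message.casefold().split()
    )
    return [token for token, _ in counts.most_common(top_n)]


def is_filler(message, denylist, protected):
    norm = " ".join(message.casefold().split())
    if norm in protected:  # short messages that mark real life events
        return False
    tokens = norm.split()
    return bool(tokens) and all(token in denylist for token in tokens)


chat = ["haha"] * 30 + ["ok"] * 20 + ["we lost"] + ["see you at the station"] * 5
candidates = top_tokens(chat, top_n=2)        # a human would review these before use
denylist = set(candidates) | {"we", "lost"}   # deliberately over-broad, to show the guard
protected = {"he died", "we lost"}

print(is_filler("haha ok", denylist, protected))   # True
print(is_filler("we lost", denylist, protected))   # False
```

The protected set is checked first on purpose: it lets the denylist stay aggressive without ever swallowing the two-word messages that matter most.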
After cleaning, the corpus contains roughly 52,000 unique lemmas. The more striking finding is what happened to the novelty rate — the share of words Drobinin had never used before in any chat. It’s been declining since 2008 and plateaued at 6% about six years ago. Most of his active vocabulary was effectively locked in by his early 20s. That’s not a unique finding — linguists have observed that adult vocabulary growth slows sharply after formal education ends — but seeing it mapped against your own message history makes it land differently. This is the kind of result that makes chat data analysis feel less like a technical exercise and more like a mirror.
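The novelty rate can be computed in more than one way, and the article does not say which definition Drobinin used. The sketch below assumes one reasonable reading: for each year, the share of distinct lemmas used that year that had not appeared in any earlier year. Lemmatisation is assumed to have happened upstream.

```python
from collections import defaultdict


def novelty_by_year(messages):
    """messages: iterable of (year, list_of_lemmas) pairs, in any order."""
    vocab_by_year = defaultdict(set)
    for year, lemmas in messages:
        vocab_by_year[year].update(lemmas)

    seen = set()
    rates = {}
    for year in sorted(vocab_by_year):
        vocab = vocab_by_year[year]
        # Share of this year's vocabulary that was never used in an earlier year.
        rates[year] = len(vocab - seen) / len(vocab) if vocab else 0.0
        seen |= vocab
    return rates


corpus = [
    (2008, ["cat", "dog"]),
    (2009, ["cat", "fish"]),
    (2010, ["cat", "dog", "fish", "bird"]),
]
print(novelty_by_year(corpus))  # {2008: 1.0, 2009: 0.5, 2010: 0.25}
```

The first year always scores 1.0 under this definition, so any trend only becomes meaningful a few years into the archive.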
The Identity Problem: Which Sasha?
Once the noise is cleared, the next challenge is person resolution: figuring out that the Alex in your Facebook messages, the Xander in Telegram, and the Sasha in a VK group chat are all the same human being. This is harder than it sounds. Diminutives alone create a small identity crisis: Alexander can become Al, Alex, Sandy, Alec, or Sasha depending on context and who’s talking. In Slavic languages, Sasha is also short for both Alexander and Alexandra, so the name alone doesn’t even settle gender.
Morphological analysers handle case inflection reasonably well but won’t touch slang. Named entity recognition models are built for formal text, not casual conversation. Heuristics break down across thousands of first-name-only mentions scattered through group chats. Drobinin acknowledges the only real solution is a classifier trained on message content — essentially teaching a model to infer which person is being referenced from surrounding context. That’s a non-trivial ML problem for what is, at its core, a personal chat data analysis project.
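To see why heuristics run out of road, it helps to sketch the obvious baseline: an alias table, narrowed by who is actually in the chat. Everything below is hypothetical, including the person IDs and alias lists; the point is the `None` case, which is where a content-based classifier would have to take over.

```python
# Hypothetical alias table: person ID -> names that person is known by.
ALIASES = {
    "alexander_p": {"alex", "sasha", "xander", "саша"},
    "alexandra_k": {"sasha", "саша", "alexandra"},
}


def build_index(aliases):
    """Invert the table: each alias maps to every person it could refer to."""
    index = {}
    for person, names in aliases.items():
        for name in names:
            index.setdefault(name.casefold(), set()).add(person)
    return index


def resolve(mention, index, participants=None):
    """Return a person ID only when the mention is unambiguous, else None."""
    candidates = index.get(mention.casefold(), set())
    if participants is not None:
        # Narrow by who is in the conversation, when that is known.
        candidates = candidates & set(participants)
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown or ambiguous


index = build_index(ALIASES)
print(resolve("Xander", index))                                      # alexander_p
print(resolve("Sasha", index))                                       # None
print(resolve("Sasha", index, participants={"alexandra_k", "me"}))   # alexandra_k
```

Participant lists help in one-to-one chats, but a first-name-only mention in a group chat often refers to someone who is not in the room at all, which is exactly the case this baseline cannot handle.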
This identity resolution challenge is, incidentally, the same one that commercial people analytics tools and enterprise CRM systems have been struggling with for years. LinkedIn has dedicated substantial engineering effort to deduplicating professional identities across partial name matches and company changes. The difference is that LinkedIn has billions of data points and Drobinin has his chat history — but the problem structure is identical.
What the Numbers Actually Revealed
The findings Drobinin surfaces from his chat data analysis are personal enough that they resist easy generalisation, but some themes are broadly recognisable. He describes discovering what he calls “friendship half-lives” — the rate at which contact with specific people decays over time — and “endearment cycles,” patterns in how affection and attention ebb and flow within relationships. He also found limits on his own emotional bandwidth: the data made visible just how many relationships he was quietly letting lapse through inattention rather than intention. A more structured chat data analysis framework might have labelled these findings differently, but the emotional weight would be the same.
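The article does not say how the half-lives were calculated. One plausible reading, offered here purely as an assumption, is an exponential decay fitted to message counts per period, with the half-life read off the fitted rate as ln 2 divided by that rate.

```python
import math


def half_life(counts):
    """Fit log(count) = a - rate * t by least squares and return ln(2) / rate.

    counts: messages per period (e.g. per month), oldest first.
    Returns the half-life in periods, or None if contact is not decaying.
    Periods with zero messages are skipped, which biases the estimate upward.
    """
    points = [(t, math.log(c)) for t, c in enumerate(counts) if c > 0]
    if len(points) < 2:
        return None
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in points) / var_t
    if slope >= 0:
        return None  # flat or growing contact has no half-life
    return math.log(2) / -slope


# Contact that halves every period should give a half-life of about 1.
print(half_life([800, 400, 200, 100, 50]))
```

A log-linear fit is the simplest option and needs nothing beyond the standard library, but real friendships rarely decay smoothly, so the number is better treated as a rough ranking signal than a measurement.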
There’s something quietly uncomfortable about that. Most people sense, vaguely, that they’re not as good at maintaining relationships as they’d like to be. Seeing it quantified — as a decay rate, a message frequency graph, a sentiment trend — removes the comfortable ambiguity. Drobinin himself frames it with characteristic directness, noting that cooking a steak from scratch (he once learned to hunt and dress a deer to do exactly that) is considerably easier than human connection. The data just made the gap harder to ignore.
The Bigger Question This Project Is Really Asking
Personal data projects like this one sit at an intersection that’s getting more crowded. The quantified self movement has been building since the late 2000s, and tools like Obsidian, Notion, and a growing ecosystem of personal analytics apps have made self-tracking more accessible. But most of that infrastructure is optimised for productivity and health, not for the texture of relationships and the quality of who you are to other people. Chat data analysis occupies a different category entirely: it’s self-tracking that points outward, at how you show up for others rather than how your body performs.
Drobinin’s approach — building a personal CRM from the record rather than from memory — points at something the mainstream tools aren’t really addressing. Your Fitbit knows when you slept badly. Nothing in your app ecosystem knows that you went two months without reaching out to someone who matters to you, or that your messages to a close friend shifted in tone after a particular event. That kind of signal is sitting in your chat history. Most people just don’t have the engineering inclination, or the nerve, to look. Accessible chat data analysis tools could change that calculus significantly.
As large language models get better at processing personal text data locally (on-device models from Apple and Google are already moving in this direction), the barrier to this kind of analysis is going to drop sharply. The question won’t be whether the technology can do it. It’ll be whether people actually want to know what their messages say about them. Drobinin’s project suggests the answer is yes, and that what they find will be unsettling.




