Most implementations are actually moving in the opposite direction. Previously, there was a tendency to aggregate words into phrases to better capture the "context" of a word. Now, most approaches split words into sub-word parts or even characters. With networks that capture temporal relationships across tokens (as opposed to older "bag of words" models), multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts.
> multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts
Indeed. Do you have an example of a library or snippet that demonstrates this?
My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the 768 (I believe) dimensional space, but don't contain queryable temporal information, no?
I like n-grams as a sort of untagged/unlabelled entity.
When using BERT (and the many things like it, such as the earlier ELMo and ULMFiT, or the later RoBERTa/ERNIE/ALBERT/etc.) as the 'embeddings', you provide as input all the tokens in a sequence. You don't get an "embedding for word foobar in position 123"; you get an embedding for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional "embedding for word foobar in position 123 conditional on all the particular other words that were before and after it". Including very long-distance relations.
One of the simpler ways to try that out in your code seems to be running BERT-as-a-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.
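As a minimal sketch with the huggingface transformers library (the model choice and sentence are just for illustration): you feed in the whole sequence and get back one vector per sub-word token, each conditioned on everything around it.

    import torch
    from transformers import AutoTokenizer, AutoModel

    # bert-base-uncased produces 768-dimensional vectors per token
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The quick brown fox jumps", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # one contextual embedding per sub-word token, shape (1, num_tokens, 768)
    token_embeddings = outputs.last_hidden_state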
It's kind of the other way around compared to word2vec-style systems. Before, you used to have a 'thin' embedding layer that's essentially just a lookup table, followed by a bunch of complex layers of neural networks (e.g., multiple Bi-LSTMs followed by a CRF); in the 'current style' you have "thick embeddings", which is running through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
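In (hypothetical, PyTorch-flavored) code, the current style looks roughly like this; the class and names are made up for illustration, but the shape is the point: a big pretrained encoder plus a single linear layer.

    import torch.nn as nn
    from transformers import AutoModel

    class ThinHeadTagger(nn.Module):
        def __init__(self, num_labels):
            super().__init__()
            # "thick embeddings": the full pretrained transformer stack
            self.encoder = AutoModel.from_pretrained("bert-base-uncased")
            # "thin custom layer": glorified linear regression on top
            self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            return self.head(hidden)  # per-token logits, e.g. for sequence labeling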
> in the 'current style' you have "thick embeddings" which is running through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
Would you say they are still usually called "embeddings" when using this new style? This sounds more like just a pretrained network which includes both some embedding scheme and a lot of learning on top of it, but maybe the word "embedding" stuck anyway?
They do seem to still be called "embeddings", although yes, in some sense that's become a somewhat misleading misnomer.
However, the analogy is still somewhat meaningful: if you want to look at the properties of a particular word or token, it's not just a general pretrained network; it still preserves the one-to-one mapping between each input token and the output vector corresponding to it, which is very important for all kinds of sequence labeling or span/boundary detection tasks. So you can use them just like word2vec embeddings. For example, word similarity or word difference metrics with 'transformer-stack embeddings' would work just as well as with word2vec (though you'd have to get to a word-level measurement instead of wordpiece or BPE sub-word tokens), with the added bonus of having done contextual disambiguation. You probably could build a decent word sense disambiguation system just by directly clustering these embeddings; mouse-as-animal and mouse-as-computer-peripheral should have clearly different embeddings.
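To make the mouse example concrete, here's a rough sketch (huggingface again; the sentences and helper are made up) comparing the contextual vector for "mouse" in two contexts:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def mouse_vector(sentence):
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        # locate the "mouse" token and return its contextual embedding
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index("mouse")]

    animal = mouse_vector("the cat chased the mouse across the floor")
    gadget = mouse_vector("click the left button on the mouse")
    # should be noticeably lower than for two same-sense usages
    print(torch.cosine_similarity(animal, gadget, dim=0))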
> Do you have an example of a library or snippet that demonstrates this?
All NLP neural nets (based on LSTM or Transformer) do this. It's their main function - to create contextual representations of the input tokens.
The word's 'position' in the 768-dimensional space is an embedding, and it can be compared with other words by dot product. There are libraries that can do dot-product ranking fast (such as annoy).
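For instance, a minimal annoy sketch, assuming you've already computed the embeddings somewhere (the embeddings list and query_vec here are hypothetical placeholders):

    from annoy import AnnoyIndex

    dim = 768  # BERT-base hidden size
    index = AnnoyIndex(dim, "dot")  # rank by dot product

    # embeddings: a hypothetical list of (word, vector) pairs computed earlier
    for i, (word, vec) in enumerate(embeddings):
        index.add_item(i, vec)
    index.build(10)  # 10 trees

    # approximate top-10 nearest neighbours of a query vector
    neighbours = index.get_nns_by_vector(query_vec, 10)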
Strong Analytics | Chicago, IL | Full-time Data Scientists, Data Engineers | https://www.strong.io
We help companies integrate state-of-the-art machine learning into their products, internal tools, and infrastructure. We've designed, built, and deployed products in the automotive space, pharma, gaming, retail, tech, and many other verticals.
Requires an advanced degree (M.S./Ph.D.) in a quantitative science and 1+ years applying machine learning to real-world problems.
Survival modeling is exactly what's needed for these situations. It allows you to (a) consider censored data (i.e., active customers whom you know have stayed for at least X months) and (b) use flexible survival distributions beyond the standard exponential distribution assumed in typical monthly churn-rate calculations.
Source: Run a data science company and we work on a lot of customer lifecycle modeling projects with companies much younger than yours.
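For point (a), a minimal sketch with the lifelines library (the column names and numbers are made up); censored customers simply get event_observed=0:

    import pandas as pd
    from lifelines import KaplanMeierFitter

    # hypothetical data: tenure in months; churned=0 means still active (censored)
    df = pd.DataFrame({
        "tenure_months": [3, 12, 7, 24, 5, 18],
        "churned":       [1,  0, 1,  0, 1,  0],
    })

    kmf = KaplanMeierFitter()
    kmf.fit(df["tenure_months"], event_observed=df["churned"])

    # survival curve: P(customer still active at month t)
    print(kmf.survival_function_)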
I've done a bit of survival modeling, but my purpose was to understand retention across cohorts with certain attributes (typically, sign-up date, though occasionally campaign).
I'm interested in how you've used this to model churn. Is there a blog post or resource you recommend to learn more about this?
A quantile-based confidence interval from bootstrapping can yield an interval that does not contain 0, i.e., with 100% of resamples positive (or negative). But that does not (necessarily) mean there is a 100% chance that the new version is better than the old one. Confidence intervals are not Bayesian credible intervals and cannot be treated as such. (That said, certain assumptions about the underlying model can sometimes justify treating nonparametric bootstrap intervals that way.)
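For reference, the quantile-based (percentile) bootstrap interval in question looks like this (numpy, with made-up data):

    import numpy as np

    rng = np.random.default_rng(0)
    diffs = rng.normal(0.3, 1.0, size=200)  # made-up per-user metric differences

    # resample with replacement and record the mean of each resample
    boot_means = np.array([
        rng.choice(diffs, size=diffs.size, replace=True).mean()
        for _ in range(10_000)
    ])

    # 95% percentile interval; even if all resample means were > 0, that
    # would not make P(new version is better) = 100% -- it's a confidence
    # interval, not a credible interval
    lo, hi = np.quantile(boot_means, [0.025, 0.975])
    print(lo, hi)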
Right. The author finds 100% of the time for his current dataset but then makes a statement implying certainty or inference about future cases. It would be like taking 100 men and 100 women, finding that in 100 randomly matched pairs the man was taller than the woman every time, and claiming there is a 100% chance that men are taller than women.
The more I type the more I realize how pedantic this is, but in stats we're taught to pay extra attention to the conclusions we draw from the data we analyze.
Data security is hugely important. Here are a couple things we do to deal with it: (1) We cleanse the data of Personally Identifiable Information (PII) as quickly as possible (i.e., before it touches our servers), and (2) We host our databases behind secure networks and follow best practices with regard to authentication, encryption, etc.
We were all drawn to applying the statistical, experimental, and algorithmic approaches we learned in graduate school (and in our spare time) to a range of problems in industry. Every project has a big learning component that keeps things exciting and fresh.
Our first handful of clients all came from our professional network. I've been a developer and consultant for a long time (shockingly, over half my life!) and so, despite selling my companies and heading to graduate school, I had a bit of a network of other founders who knew me and were supportive of the new venture.
A few lessons learned, in brief: (1) Try really, really hard to be specific about what you offer (even when in reality you offer a lot of different things), (2) Write great proposals -- they become the project bible and really help streamline client conversations, and (3) Understanding a client's data and business always takes longer than you'd think.
Author, here. In this post, I review the various ways we put our email marketing optimization algorithm to the test, moving from simple simulation environments in R to scrappy real-world tests, more complex simulations, and ultimately a private beta with a production app. I hope it helps those thinking of bringing their own algorithms to market, and I would love any feedback!
We only just launched yesterday, so I think we are still working to find that exact product-market fit. At the moment, however, we're targeting medium- to large-sized businesses that use drip email marketing campaigns, such as onboarding campaigns, lead nurturing campaigns, and retention campaigns.
Thanks for the feedback on the copy! If you are feeling extra generous and have a second, shoot me a PM and let me know what parts you found confusing. We want it to be accessible to non-technical marketers who might be using older software like Mailchimp, etc. (After all, a main benefit of using AI here is that you don't need to get into confusing automation-building stuff like multi-branching decision trees.)
Sorry for the late response. I'm a developer and a few sections I had to Google definitions for were:
> Our suite of advanced analytics, including Cohort Analyses, Sequence Analyses, and Algorithm Metrics will unlock new insights into your customers and campaigns.
> We use a combination of deep reinforcement learning and hierarchical Bayesian models to quickly and accurately learn about your customers and optimize your campaigns in real-time.
I think you could simplify those a bit, maybe hitting it from a higher level with a link to a more detailed explanation of how Optimail does its thing. I totally get the value in using these terms, but for the landing page I would imagine simpler language may help non-technical users grok the benefits of your product a bit better.
Thanks for the feedback - it's so hard to hit the right level of description. I'm definitely biased here after just leaving academia since I keep thinking "if we don't mention the hierarchical models they're going to think we're frauds!" - ha.
But you're absolutely right - our goal is to appeal to developers as well as marketing types, so perhaps this level of detail is just confusing and unnecessary on the landing page.
Feel free to shoot me a message if you have any other feedback - I really appreciate it!!
Will do! Btw, I signed up yesterday and really like the look of everything! I'll try and give it a spin in the next month or so whenever I'm ready to set up a campaign.