mathstodon.xyz is one of the many independent Mastodon servers you can use to participate in the fediverse.
A Mastodon instance for maths people. We have LaTeX rendering in the web interface!


This week on Saturday Morning Math Writer's Club, I've got an interview to prepare for with Rob Schapire, inventor of boosting in machine learning. I think this is going to be a great story about a preeminently practical result that came from pure theory. I have a slew of questions to ask Rob, but in the meantime I want to find examples of people using it in production.

Jeremy Kun

Now I know there's a long list of Kaggle competition winners who use XGBoost: github.com/dmlc/xgboost/tree/m

But I don't consider this a production setting.

The same doc has a list of "use cases" and "integrations," but the only two that seem like they count are brief notes from the "Tencent data platform team" and the "autohome.com ad platform team." I will have to dig through the integrated tools to see if they list any compelling users.


A friend also brought up the possibility of writing about cuckoo hashing, but I don't know of anyone who uses this in prod. Plus, it seems like a relatively minor upgrade over something like linear probing, so I'm not sure in what context this would be particularly useful.

Apparently boosted decision trees are popular in particle physics? Like the LHC's identify-the-Higgs-boson kind of problems.

OK, will file away some particle physics stuff for later. In the meantime: anyone know of production systems that use boosting? Like, not just an ML framework that supports it, or a Kaggle competition that was won with it, but an actual production system that uses it?

Got sidetracked by some network science applications. Apparently Twitter runs a community detection routine via matrix factorization to help with tweet recommendations.

Had to take a break to take the kid on the world's slowest bike ride around the block. Writing will continue during his nap.

I emailed an academic who published some papers in particle physics about the use of boosted decision trees. In a 2005 paper he predicted they'd be in widespread use for particle physics data analysis. I asked if he felt that prediction had come true.

He responded with "here's what Chat gpg [sic] replied," followed by the sort of generic LLM text that is completely useless.

What a trashy decision.

He followed up with a quote from a book he wrote that was similarly unhelpful. Better than ChatGPT, but still. The gall.

@j2kun 🤦‍♂️ but also, “in a 2005 paper at journal xyz with doi abc, this douchebag…” well played lol

@j2kun that seems like a new version of an lmgtfy link. Certainly it's rude in response to your specific question. Somehow asking GPT is better than an actual human-written source found by a search engine? I'm amazed at how fast people are adopting this technique as if it's magic and always true.

@j2kun lmgpttfy is the next lmgtfy?
Boo them.

When I was a grad student, I was reading an arXiv preprint from an academic and got confused. I sent him an email (the first cold email I ever sent) about his paper, and he responded with roughly "I don't have time to talk to students. Bye."

That was also off-putting, especially to young me.

@j2kun Yandex uses (or at least used) it a lot: see Matrixnet and Catboost

@j2kun When cuckoo hashing might be better than linear probing: (1) when some kind of real-time response rate is needed and you're willing to pay two cache misses instead of one in order to guarantee that you will find your key in constant time, compared to the logarithmic worst case behavior of linear probing; (2) when you're dealing with adversarial input that will find the worst case keys of your hash table and repeatedly hit on them; (3) if you want all cache operations to take the same time as protection against timing attacks.

But I think in the vast majority of real-world applications none of these apply.
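
For anyone who wants to see point (1) concretely, here is a minimal Python sketch of two-table cuckoo hashing. It is not from the thread or from any production system. The class name CuckooHashTable, the salted use of Python's built-in hash() as a stand-in for two independent hash functions, and the max_kicks and doubling-on-failure choices are all assumptions made for illustration. The thing to notice is that get() inspects at most two slots no matter how full the table is; the cost moves into put(), which may evict and relocate existing keys.

```python
import random


class CuckooHashTable:
    """Illustrative two-table cuckoo hash map (a sketch, not a tuned implementation)."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity      # slots per table (assumed starting size)
        self.max_kicks = max_kicks    # evictions allowed before rebuilding
        self.size = 0
        self._new_tables()

    def _new_tables(self):
        # Fresh random salts stand in for two independent hash functions.
        self.salts = (random.getrandbits(64), random.getrandbits(64))
        self.tables = [[None] * self.capacity, [None] * self.capacity]

    def _slot(self, which, key):
        return hash((self.salts[which], key)) % self.capacity

    def get(self, key, default=None):
        # At most two probes: one candidate slot in each table.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # If the key is already stored, overwrite it in place.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        self.size += 1
        self._insert_new((key, value))

    def _insert_new(self, entry):
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            # Place the entry and pick up whatever was in the slot before.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry tries its slot in the other table
        # Too many evictions (likely a cycle): rebuild with new hash functions.
        self._rehash(entry)

    def _rehash(self, pending):
        old = [e for t in self.tables for e in t if e is not None] + [pending]
        self.capacity *= 2
        self._new_tables()
        for e in old:
            self._insert_new(e)
```

Usage is the same as any map: t = CuckooHashTable(); t.put("a", 1); t.get("a"). Doubling the capacity on every failed insertion is a simplification; a real implementation might rehash at the same size first.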

@11011110 thanks for the tips! I suspect you wrote the Wikipedia page on this, because it reads familiar :) anyway, I could see it being useful in some kind of global-scale something or other that Google or Facebook has, and the math analysis is nice, but unless the story is a bit richer I think it's not the best fit. But it could suffice if I'm struggling to fill out the book.

@j2kun Cuckoo filters in place of Bloom filters might have greater practicality. So that's the other reason for talking about cuckoo hashing: so you can use it to explain cuckoo filters.
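
To make the connection to cuckoo hashing concrete, here is a rough Python sketch of a cuckoo filter. It is not from the thread: the class name CuckooFilter, the 8-bit fingerprints, the use of SHA-256 for hashing, and the default sizes are all assumptions picked for readability. The filter stores only short fingerprints, and each fingerprint has two candidate buckets related by an XOR, so an evicted fingerprint can find its other bucket without the original item. Unlike a plain Bloom filter, it supports deletion.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter over strings (a sketch, not a tuned implementation)."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        # A power-of-two bucket count keeps the XOR trick within range.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _digest(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        # Nonzero 8-bit fingerprint in the range 1..255.
        return self._digest(b"fp:" + item.encode()) % 255 + 1

    def _index(self, item):
        return self._digest(b"ix:" + item.encode()) % self.num_buckets

    def _alt_index(self, index, fp):
        # Applying this twice returns the original index.
        return (index ^ self._digest(fp.to_bytes(1, "big"))) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a random fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Filter is treated as full. In this sketch the last evicted
        # fingerprint is dropped; a real implementation would stash it.
        return False

    def contains(self, item):
        # May return a false positive, but no false negative for stored items.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Only delete items that were actually inserted; deleting something that was never added can remove another item's matching fingerprint.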

@11011110 Oh boy, if I can get an interview with someone from TikTok... that would be wild. Maybe also get me in trouble with various state entities???