#stylometry


Reminder for those who may not realize this: #Stylometry is kind of an insane field of study, and you can be uniquely identified based on your writing style alone.

This has, in the past, been applied to open source developers and programming code too, and it was found that using stylometry techniques you can identify the author of a compiled binary based on their open source code style ~78% of the time.

https://arxiv.org/pdf/1512.08546v1

There are some techniques to avoid this luckily, which involve fairly basic changes to your writing style and structure that can very effectively anonymize things again:

https://en.wikipedia.org/wiki/Adversarial_stylometry

Later today at #CHR2024, we are going to present our work on #Multilingual #Stylometry!

We isolated the influence of #language on #authorship #attribution #accuracy by translating multiple #corpora into each others' languages while keeping #corpus composition stable.

Interactive showcase: showcases.clsinfra.io/stylomet

Full paper: ceur-ws.org/Vol-3834/paper9.pd

This work was developed within the @CLSinfra project in #Trier, #Krakow and #Prague with Artjoms Šeļa, Evgeniia Fileva and Julia Dudar.

Look what landed on my doorstep 😍 The book is also available #OpenAccess online at #heiUP: heiup.uni-heidelberg.de/catalo and I would like to thank the very patient editors who had to deal with switching the publisher and coming up with ways to improve the quality of my illustrations in my article about #stylometry in #French and #Spanish for #Picasso 's writings: @christof @josecalvo @u_henny and Robert Hesselbach, Daniel Schlör

New #paper out: « Code #stylometry vs formatting and minification » peerj.com/articles/cs-2142/ , where we show how much current code stylometry techniques (i.e., how to automatically detect the author of a source code snippet) are resistant to automatic code formatting and minification. (Spoiler: quite a bit, authors can still be identified after those source-to-source transformations.) Available #openaccess on #PeerJ CS.

PeerJ Computer Science
Code stylometry vs formatting and minification
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
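One way to see why a CST might carry more authorship signal than an AST: the AST throws away layout, comments, and redundant parentheses, while a concrete representation keeps them. The sketch below is only an illustration using Python's standard library, not the paper's code2vec pipeline. The two snippets and the helper names `ast_fingerprint` and `token_fingerprint` are made up for this example, and the raw token stream is used as a rough stand-in for the detail a CST preserves.

```python
import ast
import io
import tokenize

# Two hypothetical authors writing the same function in different styles.
SNIPPET_A = "def add(a, b):\n    return a + b\n"
SNIPPET_B = "def add( a,b ):\n        return (a+b)  # sum\n"


def ast_fingerprint(source: str) -> str:
    """Abstract view: node structure only, no layout, comments, or positions."""
    return ast.dump(ast.parse(source))


def token_fingerprint(source: str) -> list[tuple[str, str]]:
    """Concrete view: every token as written, including comments and indentation."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]


if __name__ == "__main__":
    same_ast = ast_fingerprint(SNIPPET_A) == ast_fingerprint(SNIPPET_B)
    same_tokens = token_fingerprint(SNIPPET_A) == token_fingerprint(SNIPPET_B)
    print("identical ASTs:  ", same_ast)
    print("identical tokens:", same_tokens)
```

Both snippets parse to the same AST, yet their token streams differ in indentation width, the extra parentheses, and the comment. A formatter such as Black normalizes much of that concrete layer, which fits the accuracy drop reported in the abstract.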

Now up at #DH2024, Maciej Eder, developer of #stylo and co-organizer of #DH2016 in #Krakow, on various distance measures for #Stylometry: "Manhattan, Euclidean and their Siblings. Exploring Exotic Measures of Text Similarities...".

Key idea: Manhattan distance is L1-norm based, Euclidean is L2. But we can vary this parameter for a wide range of values, from 0.1 to 10. Then evaluate accuracy for authorship attribution.

Result: For longer vectors, it pays off to use a value of less than 1!
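The family of measures described here is commonly written as the Minkowski distance, where the exponent p = 1 gives Manhattan and p = 2 gives Euclidean. The sketch below is a minimal illustration of varying that exponent. It is not the code from the talk or from the stylo package, and the function name `minkowski_distance` and the toy vectors are assumptions made for this example.

```python
def minkowski_distance(x, y, p):
    """Distance between two equal-length vectors with exponent p.

    p = 1 is Manhattan (L1), p = 2 is Euclidean (L2). For p < 1 the
    formula still yields a usable dissimilarity score, although it no
    longer satisfies the triangle inequality.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    # Toy stand-ins for word-frequency vectors of two texts (made-up numbers).
    text_a = [0.0, 0.0]
    text_b = [3.0, 4.0]

    # Sweep the exponent over the kind of range mentioned in the talk.
    for p in (0.1, 0.5, 1, 2, 10):
        print(f"p={p}: {minkowski_distance(text_a, text_b, p):.4f}")
```

In an attribution experiment, each candidate exponent would be plugged into the same nearest-neighbour or clustering setup, and the resulting accuracies compared across p.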


@jcls Another paper we would like to highlight, again for the lovers of #novels

Dorothy Henriette Modrall Sperling, Mike Kestemont & Vincent Neyt (2023), “The Authorship of Stephen King’s Books Written Under the Pseudonym ‘Richard #Bachman’: A Stylometric Analysis”, Journal of Computational Literary Studies 2(1), 1–35. doi: doi.org/10.48694/jcls.3594

Keywords: #Stephen_King, #stylometry, #pop_culture, #authorship verification, contemporary English-language #fiction


This next paper is about #stylometry in a #translation setting involving novels in #Swedish and #Dutch:

Martje Wijers (2023), “Why the Daisy sisters are different. A stylometric study on the oeuvre of Swedish author Henning #Mankell and the Dutch translations of his work”, Journal of Computational Literary Studies 2(1), 1–27. doi: doi.org/10.48694/jcls.3585

Keywords: #stylometry, #cluster analysis, #PCA, #delta, #zeta, #translation